ANNIE - Open Source Information Extraction

ANNIE: An example of named entities automatically annotated by the Information Extraction system.

What's the Problem?

Many knowledge acquisition and retrieval applications must deal with human-written texts (e.g., technical documentation). They therefore need robust and scalable language processing tools to perform common tasks such as tokenisation, part-of-speech tagging, and named entity recognition (automatic annotation of persons, organisations, and other entities in the texts). These tools need to be easy to integrate and customise, and must support different languages and document formats.

Towards a Solution

ANNIE is an open-source, robust Information Extraction (IE) system which relies on finite state algorithms. ANNIE consists of the following main language processing tools: tokeniser, sentence splitter, POS tagger, and named entity recogniser.

The tokeniser splits text into simple tokens, such as numbers, punctuation, symbols, and words of different types (e.g. with an initial capital, all upper case, etc.). The aim is to limit the work of the tokeniser to maximise efficiency, and to enable greater flexibility by placing the burden of analysis on subsequent tools. This means that the tokeniser does not need to be modified for different applications or text types.

The sentence splitter segments the text into sentences. This module is required for the tagger. Both the splitter and the tagger are domain- and application-independent. The tagger is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. Neither the splitter nor the tagger is a mandatory part of the IE system, but the extra linguistic information they produce increases the power and accuracy of the IE tools.

The named entity recogniser consists of pattern-action rules, executed by the finite-state transduction mechanism.
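To make the division of labour between the tokeniser and the pattern-action rules more concrete, here is a minimal sketch in Python. It is not ANNIE's implementation and does not use the GATE API: the names Token, tokenise, RULES and annotate, the token kinds, and the two toy rules (Money and Percent) are all assumptions made for illustration. The rule matcher is a much simplified stand-in for finite-state transduction, trying fixed-length patterns over the token sequence from left to right.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    kind: str   # "number", "word", "symbol" or "punctuation"
    orth: str   # for words: "lowercase", "allCaps", "upperInitial", "mixedCaps"
    start: int  # character offsets into the original text
    end: int


# A number (with optional . or , groups), a run of letters, or any single
# other non-space character. Hypothetical pattern, chosen for this sketch.
_TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|[^\W\d_]+|\S")


def _classify(text):
    """Return (kind, orth) using only surface properties of the token."""
    if text[0].isdigit():
        return "number", ""
    if text.isalpha():
        if text.islower():
            return "word", "lowercase"
        if text.isupper():
            return "word", "allCaps" if len(text) > 1 else "upperInitial"
        if text[0].isupper() and text[1:].islower():
            return "word", "upperInitial"
        return "word", "mixedCaps"
    # Unicode general category S* means symbol; everything else left over
    # is treated as punctuation in this sketch.
    if unicodedata.category(text).startswith("S"):
        return "symbol", ""
    return "punctuation", ""


def tokenise(text):
    """Split text into simple tokens; no linguistic analysis happens here."""
    tokens = []
    for m in _TOKEN_RE.finditer(text):
        kind, orth = _classify(m.group())
        tokens.append(Token(m.group(), kind, orth, m.start(), m.end()))
    return tokens


# Each rule pairs a label (the "action" is to emit that label) with a
# pattern: one predicate per token position. Toy rules, not ANNIE's.
RULES = [
    ("Money", [lambda t: t.text in {"$", "£", "€"},
               lambda t: t.kind == "number"]),
    ("Percent", [lambda t: t.kind == "number",
                 lambda t: t.text == "%"]),
]


def annotate(text):
    """Scan the tokens left to right and apply the first rule that matches."""
    tokens = tokenise(text)
    annotations = []
    i = 0
    while i < len(tokens):
        for label, pattern in RULES:
            window = tokens[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                    pred(tok) for pred, tok in zip(pattern, window)):
                span = text[window[0].start:window[-1].end]
                annotations.append((label, span))
                i += len(pattern)
                break
        else:
            i += 1  # no rule matched at this position
    return annotations
```

In this sketch the tokeniser only records surface properties, so supporting a new entity type means adding an entry to RULES rather than changing tokenise. For example, annotate("Shares in Acme rose 3.5% to $1,200.50 on Monday.") yields one Percent annotation and one Money annotation.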
The recogniser identifies entities such as person names, organisations, locations, money amounts, dates, percentages, and some types of addresses. Note that the open-source named entity recogniser is simpler and less powerful than the one we use in-house; the difference lies in the linguistic resources used. The figure above shows an example of named entities automatically annotated by the system.

The system supports multiple languages through Unicode. So far, ANNIE has been adapted to do IE in Bulgarian, Romanian, Bengali, Greek, Spanish, Swedish, German, Italian, and French, and we are currently porting it to Arabic, Chinese and Russian, as part of the [1]MUSE project.

ANNIE can be used and customised in [2]GATE's graphical development environment and can also be integrated into other applications through its API (Application Programming Interface).

Take a Guided Tour

[3]Introduction to ANNIE (movie)

Try a Demonstration

[4]Online demonstration
[5]Download ANNIE (part of the GATE distribution)
[6]Full online documentation
Technical requirements: Any platform supported by Java 1.4

Example Applications

[7]Semantic indexing of multimedia material
Automatic extraction of health and safety information from company reports (Health and Safety Executive/Sheffield University)
Extraction of commodity events from news (Master Foods NV)
[8]Automatic annotation and ontology population for the Semantic Web (Ontotext, Sirma AI Ltd.)

Further Reading

Key document: H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002. [9]PDF.
Other relevant documents: [10]Tutorial, [11]Publications list

References

1. http://www.dcs.shef.ac.uk/research/groups/nlp/muse/
2. http://gate.ac.uk/
3. http://gate.ac.uk/sale/talks/load-annie.gif
4. http://gate.ac.uk/annie/index.jsp
5. http://gate.ac.uk/download/index.html
6. http://gate.ac.uk/sale/tao/index.html#annie
7. http://www.dcs.shef.ac.uk/nlp/mumis/
8. http://www.sirma.bg/OntoText/KIM/
9. http://gate.ac.uk/sale/acl02/acl-main.pdf
10. http://gate.ac.uk/talks/tutorial3/01.html
11. http://gate.ac.uk/gate/doc/papers.html

Hamish Cunningham, Valentin Tablan, Diana Maynard, Kalina Bontcheva