AKTors.org
ANNIE - Open Source Information Extraction from The University of Sheffield

ANNIE: An example of named entities automatically annotated by the Information Extraction system.


ANNIE - Open Source Information Extraction fact-file

Owner  :  The University of Sheffield
Researchers (listed alphabetically)  :  Dr Kalina Bontcheva, Dr Hamish Cunningham, Dr Diana Maynard, Mr Valentin Tablan
Description  :  http://gate.ac.uk/sale/tao/index.html#annie
Demonstration  :  http://gate.ac.uk/annie/index.jsp
Screencam  :  http://gate.ac.uk/sale/talks/load-annie.gif
Builds on  :  GATE - General Architecture for Text Engineering, eXtensible Markup Language, Java, Hyper Text Markup Language
Addresses challenges  :  Knowledge Retrieval, Knowledge Acquisition

What's the Problem?

Many knowledge acquisition and retrieval applications must process human-written texts (e.g., technical documentation). They therefore need robust and scalable language processing tools for common tasks such as tokenisation, part-of-speech tagging, and named entity recognition (automatic annotation of persons, organisations, and other entities in the text). These tools need to be easy to integrate and customise, and must support different languages and document formats.

Towards a Solution

ANNIE is an open-source, robust Information Extraction (IE) system which relies on finite state algorithms. ANNIE consists of the following main language processing tools: a tokeniser, a sentence splitter, a part-of-speech (POS) tagger, and a named entity recogniser.

The tokeniser splits text into simple tokens, such as numbers, punctuation, symbols, and words of different types (e.g. with an initial capital, all upper case, etc.). The aim is to limit the work of the tokeniser to maximise efficiency, and enable greater flexibility by placing the burden of analysis on subsequent tools. This means that the tokeniser does not need to be modified for different applications or text types.
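
The division of labour described above can be illustrated with a minimal sketch. It is not ANNIE's tokeniser: the regular expression, the `Token` fields, and the kind and orthography labels (`word`, `allCaps`, `upperInitial`, and so on) are assumptions chosen for illustration. The point is that the tokeniser only records surface facts about each token and leaves interpretation to later tools.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # character offset where the token begins
    end: int    # character offset just past the token
    kind: str   # "word", "number", "punctuation" or "symbol"
    orth: str   # orthography label for words, "" for other kinds


# One named alternative per token kind. Any non-whitespace character
# not claimed by an earlier alternative falls through to "symbol".
TOKEN_RE = re.compile(
    r"(?P<word>[^\W\d_]+)"
    r"|(?P<number>\d+)"
    r"|(?P<punctuation>[.,;:!?()'-])"
    r"|(?P<symbol>\S)"
)


def orthography(word: str) -> str:
    """Describe the capitalisation pattern of a word (hypothetical labels)."""
    if len(word) > 1 and word.isupper():
        return "allCaps"
    if word[0].isupper() and (len(word) == 1 or word[1:].islower()):
        return "upperInitial"
    if word.islower():
        return "lowercase"
    return "mixedCaps"  # mixed case, or a script without case


def tokenise(text: str) -> list:
    """Split text into simple tokens without interpreting them."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup  # name of the alternative that matched
        orth = orthography(m.group()) if kind == "word" else ""
        tokens.append(Token(m.group(), m.start(), m.end(), kind, orth))
    return tokens


if __name__ == "__main__":
    for token in tokenise("ANNIE costs $5, Dr Smith said."):
        print(token)
```

Because the sketch never decides that "Dr Smith" is a person or that "$5" is a money amount, the same function could be reused unchanged for other text types, which is the design choice the paragraph describes.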

The sentence splitter segments the text into sentences. This module is required for the tagger. Both the splitter and tagger are domain- and application-independent.
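
As a rough illustration of sentence splitting, the sketch below treats terminal punctuation followed by whitespace as a sentence boundary, unless the full stop closes a known abbreviation. The abbreviation list and the function name `split_sentences` are hypothetical, and this is not the algorithm ANNIE's splitter uses.

```python
import re

# Hypothetical abbreviation list; a real splitter would use a larger resource.
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Prof", "e.g", "i.e", "etc"}

# Candidate boundary: terminal punctuation followed by whitespace or end of text.
BOUNDARY_RE = re.compile(r"[.!?]+(?=\s|$)")


def split_sentences(text: str) -> list:
    """Return (start, end) character offsets, one pair per sentence."""
    spans = []
    start = 0
    for m in BOUNDARY_RE.finditer(text):
        words = text[start:m.start()].split()
        last_word = words[-1] if words else ""
        # A full stop directly after a known abbreviation is not a boundary.
        if m.group() == "." and last_word in ABBREVIATIONS:
            continue
        spans.append((start, m.end()))
        # The next sentence starts after any whitespace that follows.
        start = m.end()
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        spans.append((start, len(text)))  # trailing text without punctuation
    return spans


if __name__ == "__main__":
    sample = "Dr. Smith arrived. He spoke, e.g. about GATE. Done!"
    for begin, end in split_sentences(sample):
        print(repr(sample[begin:end]))
```

Returning offsets rather than strings keeps the sentence boundaries aligned with the original text, so later tools can attach their own information to the same character positions.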

The tagger is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. Neither the splitter nor the tagger is a mandatory part of the IE system, but the extra linguistic information they produce increases the power and accuracy of the IE tools.
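
The general idea behind a Brill-style tagger can be sketched in a few lines: assign each word its most likely tag from a lexicon, then apply an ordered list of transformation rules that correct tags in context. The tiny lexicon, the two rules, and the fallback for unknown words below are invented for illustration (using Penn Treebank style tag names) and are not the resources of ANNIE's modified tagger.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    from_tag: str  # tag to be replaced
    to_tag: str    # replacement tag
    prev_tag: str  # fire only when the preceding word carries this tag


# Hypothetical lexicon: the most likely tag for each known word.
LEXICON = {
    "the": "DT", "can": "MD", "to": "TO", "race": "NN",
    "we": "PRP", "expect": "VBP", "is": "VBZ", "empty": "JJ",
}

# Hypothetical transformation rules, applied in this order.
RULES = [
    Rule("NN", "VB", "TO"),  # "to race": noun becomes verb after "to"
    Rule("MD", "NN", "DT"),  # "the can": modal becomes noun after a determiner
]


def initial_tags(words: list) -> list:
    """Initial state: lexicon lookup, with a guess for unknown words."""
    tags = []
    for word in words:
        tag = LEXICON.get(word.lower())
        if tag is None:
            # Unknown words: capitalised ones are guessed to be proper nouns.
            tag = "NNP" if word[0].isupper() else "NN"
        tags.append(tag)
    return tags


def tag(words: list) -> list:
    """Tag non-empty words: initial tags, then each rule over the sequence."""
    tags = initial_tags(words)
    for rule in RULES:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return list(zip(words, tags))


if __name__ == "__main__":
    print(tag(["We", "expect", "to", "race"]))
    print(tag(["The", "can", "is", "empty"]))
```

The rules refer only to neighbouring tags, which is why a tagger of this kind benefits from sentence boundaries: context should not leak from the end of one sentence into the start of the next.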

The named entity recogniser consists of pattern-action rules, executed by a finite-state transduction mechanism. It recognises entities such as person names, organisations, locations, money amounts, dates, percentages, and some types of addresses. Note that the open-source named entity recogniser is simpler and less powerful than the one we use in-house; the difference is in the linguistic resources used. The figure above shows an example of named entities automatically annotated by the system.
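
To make the notion of pattern-action rules concrete, the sketch below matches simple patterns over a token sequence and, as the action, emits a typed annotation for the matched span. The word lists, predicates, rule format, and function names are all hypothetical; they are neither ANNIE's rule language nor its rule set, and the two rules cover only a small fraction of what the paragraph lists.

```python
import re

# Hypothetical word lists standing in for real linguistic resources.
TITLES = {"Dr", "Mr", "Mrs", "Ms", "Prof"}
CURRENCY = {"$", "£", "€"}


def tokenise(text: str) -> list:
    """Return (text, start, end) triples for words and single symbols."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def is_title(token) -> bool:
    return token[0] in TITLES


def is_capitalised(token) -> bool:
    return token[0][0].isupper() and token[0][1:].islower()


def is_currency(token) -> bool:
    return token[0] in CURRENCY


def is_number(token) -> bool:
    return token[0].isdigit()


# Each rule pairs an annotation type (the action) with a pattern:
# a list of (predicate, repeatable) steps over consecutive tokens.
RULES = [
    ("Person", [(is_title, False), (is_capitalised, True)]),
    ("Money", [(is_currency, False), (is_number, False)]),
]


def match_at(tokens: list, i: int, pattern: list):
    """Return the index just past a match starting at i, or None."""
    j = i
    for predicate, repeatable in pattern:
        if j >= len(tokens) or not predicate(tokens[j]):
            return None
        j += 1
        while repeatable and j < len(tokens) and predicate(tokens[j]):
            j += 1
    return j


def annotate(text: str) -> list:
    """Scan left to right; the first rule that matches produces an annotation."""
    tokens = tokenise(text)
    annotations = []
    i = 0
    while i < len(tokens):
        for ann_type, pattern in RULES:
            end = match_at(tokens, i, pattern)
            if end is not None:
                span = text[tokens[i][1]:tokens[end - 1][2]]
                annotations.append((ann_type, span))
                i = end
                break
        else:
            i += 1  # no rule matched at this token
    return annotations


if __name__ == "__main__":
    print(annotate("Dr Hamish Cunningham paid $250 to Mr Tablan."))
```

In this sketch the rules stay fixed while the word lists can grow, which mirrors the remark that the difference between recogniser versions lies in the linguistic resources used.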

The system supports multiple languages through Unicode. So far, ANNIE has been adapted to do IE in Bulgarian, Romanian, Bengali, Greek, Spanish, Swedish, German, Italian, and French, and we are currently porting it to Arabic, Chinese and Russian, as part of the MUSE project.

ANNIE can be used and customised in GATE's graphical development environment and also integrated in other applications through its API (Application Programming Interface).

Take a Guided Tour

Introduction to ANNIE (movie)

Try a Demonstration

Online demonstration

Download ANNIE (part of the GATE distribution)

Full online documentation

Technical requirements: Any platform supported by Java 1.4

Example Applications

Semantic indexing of multimedia material
Automatic extraction of health and safety information from company reports (Health and Safety Executive/Sheffield University)
Extraction of commodity events from news (Master Foods NV)
Automatic annotation and ontology population for the Semantic Web (Ontotext, Sirma AI Ltd.)

Further Reading

Key document:

H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002. PDF.

Other relevant documents: Tutorial, Publications list

Semantic representation

View in the AKT Triplestore Browser or as RDF.

Also available in DOAP RDF (Description Of A Project)