ILP for Information Extraction from University of Edinburgh
ILP for Information Extraction fact-file
What's the Problem?
Much of the available material on the web, and in corporate archives and intranets, is in textual form. This presents a problem for the automated processing of the knowledge contained in these information sources. Keyword-based analysis techniques are appropriate for tasks such as document clustering; however, they are inadequate for extracting knowledge. Information extraction aims to extract the key units of knowledge from text, for example: events and the actors in those events, dates, and the authors and titles of documents.
In addition, many knowledge management tools require the knowledge content of an information source to be formalised; in the context of the Semantic Web, this becomes a necessity. However, extracting a formal representation, e.g. as RDF triples, from the huge volumes of text and other unstructured sources currently on-line is a major research task.
The specific objective of this work is to extract a characterisation of the meaning of text - expressed in ontological terms - for further processing by other knowledge-based tools.
Information Extraction (IE) tools attempt to characterise given texts in certain, well-defined terms. For instance, an IE tool might extract certain key features from a scientific text, such as the central concept, e.g. CarbonDioxide, and properties it is stated to have, e.g. its atmospheric concentration.
To do this, an IE tool must be supplied with a set of rules describing how these features are to be found in the text (e.g., a simple rule might identify the concentration through the occurrence of a number followed by the unit of measure ppmv). However, these rules are often highly domain-dependent, and configuring them manually can require considerable time and expertise.
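As a rough illustration of the kind of hand-written rule described above, the following Python sketch matches a number followed by the unit ppmv. The function name find_concentrations and the sample sentence are assumptions made for this example; they are not taken from any particular IE tool.

```python
import re

# Hypothetical hand-written rule: a number (optionally with a decimal part)
# followed by the unit of measure "ppmv".
CONCENTRATION_RULE = re.compile(r"(\d+(?:\.\d+)?)\s*ppmv\b")


def find_concentrations(text):
    """Return every number in `text` that is directly followed by 'ppmv'."""
    return [float(value) for value in CONCENTRATION_RULE.findall(text)]


if __name__ == "__main__":
    # Illustrative sentence, invented for this sketch.
    sample = "The concentration of carbon dioxide rose from 280 ppmv to 370 ppmv."
    print(find_concentrations(sample))
```

A rule this specific stops working as soon as the unit or the phrasing changes, which is the domain dependence noted above.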
Towards a Solution
We adopt a Machine Learning approach, known as Inductive Logic Programming (ILP), to the problem of automatically generating these rules. This research explores the complexities of learning from natural language data using ILP techniques. An ontology is used as a background theory in learning. Part of the ontology developed to describe this domain is illustrated in this screenshot of Protege-2000:
Figure: A small ontology
A recent paper presents the details of the ILP-based technique, and of the evaluation performed. The paper shows that the approach is capable of learning Information Extraction rules from a relatively small set of marked-up texts. The ontology is a key resource. It specifies the classes and relations that will characterise the text. In this case we are interested in concepts related to global warming.
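The role of the ontology as background knowledge can be pictured with a much simplified sketch. This is not the ILP algorithm used in the paper: it only shows how a class hierarchy lets a learner generalise a rule while still excluding negative examples. The class names, the IS_A table, and the functions is_subclass, coverage, and best_rule are all hypothetical.

```python
# Hypothetical fragment of an is-a hierarchy (child -> parent). The class
# names are illustrative and are not taken from the actual ontology.
IS_A = {
    "CarbonDioxide": "GreenhouseGas",
    "Methane": "GreenhouseGas",
    "GreenhouseGas": "Gas",
    "Oxygen": "Gas",
}


def is_subclass(cls, ancestor):
    """True if `cls` equals `ancestor` or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = IS_A.get(cls)
    return False


def coverage(rule_class, positives, negatives):
    """Count the positive and negative examples covered by a candidate rule
    of the form 'the subject is an instance of rule_class'."""
    pos = sum(is_subclass(c, rule_class) for c in positives)
    neg = sum(is_subclass(c, rule_class) for c in negatives)
    return pos, neg


def best_rule(candidates, positives, negatives):
    """Among candidates that cover no negative example, return the one that
    covers the most positives; return None if no candidate is consistent."""
    consistent = [
        c for c in candidates if coverage(c, positives, negatives)[1] == 0
    ]
    return max(
        consistent,
        key=lambda c: coverage(c, positives, negatives)[0],
        default=None,
    )


if __name__ == "__main__":
    positives = ["CarbonDioxide", "Methane"]  # classes seen in marked-up text
    negatives = ["Oxygen"]                    # classes the rule must not cover
    print(best_rule(["CarbonDioxide", "GreenhouseGas", "Gas"], positives, negatives))
```

In this toy setting the most specific class covers only one positive example and the most general class also covers the negative one, so the intermediate class is selected.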
To evaluate the effectiveness of the machine learning approach, a small set of texts was marked up with ontology relations, and the rule learning process was repeated with a subset of the texts omitted each time. The accuracy of the rules learned, expressed in terms of Precision, Recall, and F-score for one specific relation over a sequence of experiments, is illustrated. These results are competitive with other generic approaches.
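For reference, the three measures can be computed as in the sketch below. It assumes the balanced F-score (F1), which the text does not specify, and it compares extracted relation instances with hand-marked ones as plain sets. The function name and the example tuples are hypothetical.

```python
def precision_recall_f(extracted, gold):
    """Return (precision, recall, F1) for extracted items against gold items.

    Assumes the balanced F-score; the weighting used in the original
    evaluation is not stated here.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Invented relation instances, purely for illustration.
    gold = {("CarbonDioxide", "concentration", "370"),
            ("Methane", "concentration", "1.7")}
    extracted = {("CarbonDioxide", "concentration", "370"),
                 ("Oxygen", "concentration", "21")}
    print(precision_recall_f(extracted, gold))
```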
Aitken, J.S. Learning Information Extraction Rules: An Inductive Logic Programming approach. Proceedings of the 15th European Conference on Artificial Intelligence, ed. van Harmelen, F., IOS Press, Amsterdam, 2002.
Other relevant documents:
Aitken, S. and Potter, S. Learning from Natural Language Data using ILP: The Role of Background Knowledge and Negative Examples. Technical Report, 2002.