Armadillo

What's the Problem?

* Machine-readable content is needed for the Semantic Web.
* Manual annotation, required to create machine-readable content, is difficult, slow, time-consuming, tedious and costly.
* Information Extraction techniques help to automate the process, but they require previously annotated documents to bootstrap it.
* Even semi-automated techniques such as [1]Melita require considerable, redundant human input.

Towards a Solution

The information overload we experience from the Internet is partly due to vast quantities of redundant information. Redundancy is apparent in the presence of multiple citations of the same facts in superficially different formats. This redundancy can be exploited to bootstrap the annotation process needed for Information Extraction, thus enabling production of machine-readable content for the Semantic Web. For example, once a system knows the name of one author, it can use resources present on the Internet to identify a number of other author names, instead of relying on rule-based or statistical applications or hand-built gazetteers.
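The author-name example can be illustrated with a small sketch. This is not Armadillo's implementation: the `harvest_names` function, the `NAME` pattern, the `min_support` threshold and the toy snippets are all assumptions made for illustration. The idea shown is only that a candidate is accepted when several independent sources list it alongside an already known name.

```python
import re
from collections import defaultdict

# Hypothetical pattern: two or more capitalised words, e.g. "Sam Chapman".
# It deliberately over-generates; redundancy is what filters the noise.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def harvest_names(seeds, sources, min_support=2):
    """Return new names that share a line with a known name
    in at least `min_support` distinct sources."""
    support = defaultdict(set)  # candidate name -> ids of supporting sources
    for source_id, text in sources.items():
        for line in text.splitlines():
            found = set(NAME.findall(line))
            if found & seeds:  # the line mentions a name we already trust
                for candidate in found - seeds:
                    support[candidate].add(source_id)
    return {name for name, ids in support.items() if len(ids) >= min_support}


# Toy snippets standing in for pages from different sources (not real page content).
sources = {
    "publications_page": "Fabio Ciravegna, Sam Chapman, Alexiei Dingli: Harvest Information paper",
    "digital_library": "Sam Chapman and Fabio Ciravegna and Alexiei Dingli",
    "staff_page": "Yorick Wilks, Fabio Ciravegna",
}
seeds = {"Fabio Ciravegna"}

print(sorted(harvest_names(seeds, sources)))
# prints ['Alexiei Dingli', 'Sam Chapman']
```

In this toy run the spurious candidate "Harvest Information" and the genuine name "Yorick Wilks" each appear in only one source, so both are held back until further evidence is found. Lowering `min_support` to 1 would accept both, which shows how the threshold trades recall against accuracy.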
By combining a multiplicity of information sources, internal and external to the system, texts can be annotated with a high degree of accuracy with minimal or no manual intervention.

Armadillo uses multiple strategies (Named Entity Recognition, external databases such as [2]Citeseer, existing gazetteers, and various information extraction engines) to model a domain by connecting different entities and objects. For example, in the Computer Science (CS) domain, the site www.nlp.shef.ac.uk lists a number of personnel who work on different projects and have authored various papers. Using these strategies and databases, Armadillo automatically connects these personnel, projects and papers within the given domain. In so doing, Armadillo models the relevant domain and builds an RDF ontology and a knowledge base.

Further Reading

Key document: [3]Fabio Ciravegna, [4]Sam Chapman, [5]Alexiei Dingli and [6]Yorick Wilks, Learning to Harvest Information for the Semantic Web, in Proceedings of the 1st European Semantic Web Symposium, Heraklion, Greece, May 10-12, 2004. [[7]PDF].

References

1. http://www.aktors.org/technologies/melita/
2. http://citeseer.nj.nec.com/cs
3. mailto:F.Ciravegna@dcs.shef.ac.uk
4. mailto:S.Chapman@dcs.shef.ac.uk
5. mailto:A.Dingli@dcs.shef.ac.uk
6. mailto:Y.Wilks@dcs.shef.ac.uk
7. http://www.dcs.shef.ac.uk/~sam/papers/esws2004.pdf