Advanced Knowledge Technologies

Interdisciplinary Research Collaboration

Mid-term Review September 2003

Scientific report

Nigel Shadbolt[1], Fabio Ciravegna, John Domingue, Wendy Hall, Enrico Motta, Kieron O’Hara, David Robertson, Derek Sleeman, Austin Tate, Yorick Wilks

1. Introduction

In a celebrated essay on the new electronic media, Marshall McLuhan wrote in 1962:

Our private senses are not closed systems but are endlessly translated into each other in that experience which we call consciousness. Our extended senses, tools, technologies, through the ages, have been closed systems incapable of interplay or collective awareness. Now, in the electric age, the very instantaneous nature of co-existence among our technological instruments has created a crisis quite new in human history. Our extended faculties and senses now constitute a single field of experience which demands that they become collectively conscious. Our technologies, like our private senses, now demand an interplay and ratio that makes rational co-existence possible. As long as our technologies were as slow as the wheel or the alphabet or money, the fact that they were separate, closed systems was socially and psychically supportable. This is not true now when sight and sound and movement are simultaneous and global in extent. (McLuhan 1962, p.5, emphasis in original)[2]

Over forty years later, the seamless interplay that McLuhan demanded between our technologies is still barely visible. McLuhan’s predictions of the spread, and increased importance, of electronic media have of course been borne out, and the worlds of business, science and knowledge storage and transfer have been revolutionised. Yet the integration of electronic systems as open systems remains in its infancy.

The Advanced Knowledge Technologies IRC (AKT) aims to address this problem: to create a view of knowledge and its management across its lifecycle, and to research and create the services and technologies that such unification will require. Halfway through its six-year span, the results are beginning to come through, and this paper will explore some of the services, technologies and methodologies that have been developed. We hope to give a sense in this paper of the potential for the next three years, to discuss the insights and lessons learnt in the first phase of the project, and to articulate the challenges and issues that remain.

2. The semantic web and knowledge management

The WWW provided the original context that made the AKT approach to knowledge management (KM) possible. AKT was initially proposed in 1999; it brought together an interdisciplinary consortium with the technological breadth and complementarity to create the conditions for a unified approach to knowledge across its lifecycle (Table 1). The combination of this expertise, and the time and space afforded the consortium by the IRC structure, suggested the opportunity for a concerted effort to develop an approach to advanced knowledge technologies, based on the WWW as a basic infrastructure.

AKT consortium member    Expertise

Aberdeen                 KBSs, databases, V&V
Edinburgh                Knowledge representation, planning, workflow modelling, ontologies
OU                       Knowledge modelling, visualisation, reasoning services
Sheffield                Human language technology
Southampton              Multimedia, dynamic linking, knowledge acquisition, modelling, ontologies

Table 1: Some of the specialisms of the AKT consortium

The technological context of AKT altered for the better in the short period between the development of the proposal and the beginning of the project itself with the development of the semantic web (SW), which foresaw much more intelligent manipulation and querying of knowledge. The opportunities that the SW provided, e.g. for more intelligent retrieval, put AKT at the centre of information technology innovation and knowledge management services; the AKT skill set would clearly be central for the exploitation of those opportunities.

The SW, as an extension of the WWW, provides an interesting set of constraints to the knowledge management services AKT tries to provide. As a medium for the semantically-informed coordination of information, it has suggested a number of ways in which the objectives of AKT can be achieved, most obviously through the provision of knowledge management services delivered over the web as opposed to the creation and provision of technologies to manage knowledge.

AKT is working on the assumption that many web services will be developed and provided for users. The KM problem in the near future will be one of deciding which services are needed and of coordinating them. Many of these services will be largely or entirely legacies of the WWW, and so the capabilities of the services will vary. As well as providing useful KM services in their own right, AKT will be aiming to exploit this opportunity, by reasoning over services, brokering between them, and providing essential meta-services for SW knowledge service management.

Ontologies will be a crucial tool for the SW. The AKT consortium brings a lot of expertise on ontologies together, and ontologies were always going to be a key part of the strategy. All kinds of knowledge sharing and transfer activities will be mediated by ontologies, and ontology management will be an important enabling task. Different applications will need to cope with inconsistent ontologies, or with the problems that will follow the automatic creation of ontologies (e.g. merging of pre-existing ontologies to create a third). Ontology mapping, and the elimination of conflicts of reference, will be important tasks. All of these issues are discussed along with our proposed technologies.

Similarly, specifications of tasks will be used for the deployment of knowledge services over the SW, but in general it cannot be expected that in the medium term there will be standards for task (or service) specifications. The brokering meta-services that are envisaged will have to deal with this heterogeneity.

The emerging picture of the SW is one of great opportunity but it will not be a well-ordered, certain or consistent environment. It will comprise many repositories of legacy data, outdated and inconsistent stores, and requirements for common understandings across divergent formalisms. There is clearly a role for standards to play to bring much of this context together; AKT is playing a significant role in these efforts (section 5.1.6 of Management Report). But standards take time to emerge, they take political power to enforce, and they have been known to stifle innovation (in the short term). AKT is keen to understand the balance between principled inference and statistical processing of web content. Logical inference on the Web is tough. Complex queries using traditional AI inference methods bring most distributed computer systems to their knees. Do we set up semantically well-behaved areas of the Web? Is any part of the Web in which semantic hygiene prevails interesting enough to reason in? These and many other questions need to be addressed if we are to provide effective knowledge technologies for our content on the web.

3. AKT knowledge lifecycle: the challenges

Since AKT is concerned with providing the tools and services for managing knowledge throughout its lifecycle, it is essential that it has a model of that lifecycle. The aim of the AKT knowledge lifecycle is not to provide, as most lifecycle models are intended to do, a template for knowledge management task planning. Rather, the original purpose of conceptualising the AKT knowledge lifecycle was to understand what difficulties and challenges there are in managing knowledge, whether in corporations or within or across repositories.

The AKT conceptualisation of the knowledge lifecycle comprises six challenges, those of acquiring, modelling, reusing, retrieving, publishing and maintaining knowledge (O’Hara 2002, pp.38-43). The six-challenge approach does not come with formal definitions and standards of correct application; rather the aim is to classify the functions of AKT services and technologies in a straightforward manner.

Figure 1: AKT's six knowledge challenges

This paper will examine AKT’s current thinking on these challenges. An orthogonal challenge, when KM is conceived in this way (indeed, whenever KM is conceived as a series of stages) is to integrate the approach within some infrastructure. Therefore the discussion in this paper will consider the challenges in turn (sections 4-9), followed by integration and infrastructure (section 10). We will then see the AKT approach in action, as applications are examined (section 11). Theoretical considerations (section 12) and future work (section 13) conclude the review.

4. Acquisition

Traditionally, in knowledge engineering, knowledge acquisition (KA) has been regarded as a bottleneck (Shadbolt & Burton, 1990). The SW has exacerbated this bottleneck problem; it will depend for its efficacy on the creation of a vast amount of annotation and metadata for documents and content, much of which will have to be created automatically or semi-automatically, and much of which will have to be created for legacy documents by people who are not those documents’ authors.

KA is not only the science of extracting information from the environment, but also that of finding a mapping from the environment to concepts described in the appropriate modelling formalism. The consequence for acquisition is that – in a way that was not true during the development of the field of KA in the 1970s and 80s – KA is now focused strongly around the acquisition of ontologies. This trend is discernible in the evolution of methodologies for knowledge intensive modelling (Schreiber et al, 2000).

Therefore, in the context of the SW, an important aspect of KA is the acquisition of knowledge to build and populate ontologies, and furthermore to maintain and adapt ontologies to allow their reuse, or to extend their useful lives. Particular problems include the development and maintenance of large ontologies, and the creation and maintenance of ontologies by exploiting the most common, but relatively intractable, source: natural language texts. However, the development of ontologies is also something that can inform KA, by providing templates for acquisition.

AKT has a number of approaches to the KA bottleneck, and in a paper of this size it is necessary to be selective (this will be the case for all the challenges). In this section, we will chiefly discuss the harvesting and capture of large scale content from web pages and other resources (section 4.1), the extraction of ontologies from text (section 4.2), and the extraction of knowledge from text (section 4.3). These approaches constitute the AKT response to the new challenges posed by the SW; however, AKT has not neglected other, older KA issues. A more traditional, expert-oriented KA tool approach will be discussed in section 4.4.

4.1. Harvesting

AKT includes in its objectives the investigation of technologies to process a variety of knowledge on a web scale. There are currently insufficient resources marked up with meta-content in machine-readable form. In the short to medium term we cannot see such resources becoming available. One of the important objectives is to have up to date information, and so the ability to regularly harvest, capture and update content is fundamental. There has been a range of activities to support large-scale harvesting of content.

4.1.1 Early harvesting

Scripts were written to “screen scrape” university web sites (the leading CS research departments were chosen), using a new tool, Dome (Leonard & Glaser 2001), which is an output of the research of an EPSRC student.

Dome is a programmable XML/HTML editor. Users load in a page from the target site and record a sequence of editing operations to extract the desired information. This sequence can then be replayed automatically on the rest of the site's pages. If irregularities in the pages are discovered during this process, the program can be paused and amended to cope with the new input.

We see below (Figure 2) the system running, and processing a personal web page, also shown. A Dome program has been recorded which removes all unnecessary elements from the source of this page, leaving just the desired data, and the element names and layout have been changed to the desired output format, RDF.

Figure 2: A Dome Script to produce RDF from a Web Page
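The record-and-replay idea described above can be illustrated with a minimal sketch. This is not Dome's actual interface: the operation names (`drop_elements`, `rename_elements`), the sample page and the output element names are all hypothetical, and the output is simplified XML rather than full RDF. The sketch only shows how a sequence of editing operations recorded on one page might be replayed on another.

```python
import xml.etree.ElementTree as ET

def drop_elements(tag):
    """Return an editing operation that removes every element with the given tag."""
    def op(root):
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag == tag:
                    parent.remove(child)
    return op

def rename_elements(old, new):
    """Return an editing operation that renames elements from `old` to `new`."""
    def op(root):
        for element in list(root.iter(old)):
            element.tag = new
    return op

def replay(operations, page_source):
    """Parse a page and apply the recorded operations in order."""
    root = ET.fromstring(page_source)
    for op in operations:
        op(root)
    return root

# A "recording" made once on a sample page (hypothetical element names).
recorded = [
    drop_elements("img"),
    drop_elements("hr"),
    rename_elements("h1", "name"),
    rename_elements("p", "interest"),
]

page = "<body><h1>A. Researcher</h1><img src='me.png'/><p>Ontologies</p><hr/></body>"
result = replay(recorded, page)
print(ET.tostring(result, encoding="unicode"))
```

The same `recorded` list could then be replayed over the remaining pages of a site; a page that breaks an operation would be the point at which the recording is paused and amended.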

Other scripts have been written using appropriate standard programming tools to harvest data from other sources. These scripts are run on a nightly basis to ensure that the information we glean is as up to date as possible. As the harvesting has progressed, it has also been done by direct access to databases, where possible. In addition, other sites are beginning to provide RDF to us directly, as planned.

The theory behind this process is that of a bootstrap. Initially, AKT harvests from the web without involving the personnel at the sources at all. (This also finesses any problems of Data Protection, since all information is publicly available.) Once the benefits to the sources of having their information harvested become clear, some will contact us to cooperate. The cooperation can take various forms, such as sending us the data or RDF, or making the website more accessible, but the preferred solution is for them to publish the data on their website on a nightly basis in RDF (according to our ontology). These techniques are best suited to data which is well-structured (such as university and agency websites), and especially that which is generated from an underlying database.
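As a rough illustration of what publishing a database record in RDF might involve, the sketch below serialises one record as N-Triples, a simple line-based RDF syntax. The vocabulary URI, the subject URI and the property names are placeholders invented for this example; they are not the AKT ontology.

```python
def to_ntriples(subject_uri, properties, vocab="http://example.org/akt-sketch#"):
    """Serialise one record as N-Triples lines with plain string literals.

    `vocab` is a placeholder namespace, not the AKT reference ontology.
    """
    lines = []
    for prop, value in sorted(properties.items()):
        # Escape the characters that N-Triples string literals do not allow raw.
        escaped = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        lines.append(f'<{subject_uri}> <{vocab}{prop}> "{escaped}" .')
    return "\n".join(lines)

# A made-up record, as it might come out of a departmental database.
record = {"name": "A. Researcher", "department": "Computer Science"}
print(to_ntriples("http://example.org/people/1", record))
```

A nightly job at the source could regenerate such a file from the underlying database, which is the kind of arrangement the bootstrap is intended to encourage.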

As part of the harvesting activity, and as a service to the community, the data was put in almost raw form on a website registered for the purpose: www.hyphen.info. Figure 3 shows a snapshot of the range of data we were able to make available in this form.

 

Figure 3: www.hyphen.info CS UK Page

4.1.2 Late harvesting

The techniques above will continue to be used for suitable data sources. A knowledge mining system to extract information from several sources automatically has also been built (Armadillo – cf section 7.2.2), exploiting the redundancy found on the Internet, apparent in the presence of multiple citations of the same facts in superficially different formats. This redundancy can be exploited to bootstrap the annotation process needed for IE, thus enabling production of machine-readable content for the SW. For example, the fact that a system knows the name of an author can be used to identify a number of other author names using resources present on the Internet, instead of using rule-based or statistical applications, or hand-built gazetteers. By combining a multiplicity of information sources, internal and external to the system, texts can be annotated with a high degree of accuracy with minimal or no manual intervention. Armadillo utilizes multiple strategies (Named Entity Recognition, external databases, existing gazetteers, various information extraction engines such as Amilcare – section 7.1.1 – and Annie) to model a domain by connecting different entities and objects.
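The redundancy-based bootstrapping described above can be sketched in a few lines. This is a deliberately crude stand-in, not Armadillo's method: it assumes that the characters immediately surrounding a known name form a reusable context, and that names match a simple capitalised-words pattern. The sample pages, the `width` parameter and the function names are all assumptions made for illustration.

```python
import re

# Crude pattern for a personal name: two or more capitalised words.
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)+)"

def learn_contexts(pages, seeds, width=4):
    """Collect the (left, right) character contexts in which seed names occur."""
    contexts = set()
    for page in pages:
        for seed in seeds:
            for m in re.finditer(re.escape(seed), page):
                left = page[max(0, m.start() - width):m.start()]
                right = page[m.end():m.end() + width]
                if len(left) == width and len(right) == width:
                    contexts.add((left, right))
    return contexts

def apply_contexts(pages, contexts):
    """Find every name-like string that appears in a learnt context."""
    found = set()
    for page in pages:
        for left, right in contexts:
            pattern = re.escape(left) + NAME + re.escape(right)
            found.update(re.findall(pattern, page))
    return found

# The same fact (an author name) appears in two superficially different formats.
pages = [
    "Paper A, by Ada Lovelace (2001). Paper B, by Alan Turing (2002).",
    "<li>Ada Lovelace</li><li>Grace Hopper</li>",
]
seeds = {"Ada Lovelace"}
new_names = apply_contexts(pages, learn_contexts(pages, seeds)) - seeds
print(sorted(new_names))
```

The new names could then be fed back in as seeds for a further round, which is where the bootstrapping effect comes from.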

4.2. Extracting ontologies from text: Adaptiva

Existing ontology construction methodologies involve high levels of expertise in the domain and the encoding process. While a great deal of effort is going into the planning of how to use ontologies, much less has been achieved with respect to automating their construction. We need a feasible computational process to effect knowledge capture.

The tradition in ontology construction is that it is an entirely manual process. There are large teams of editors or so-called ‘knowledge managers’ who are occupied in editing knowledge bases for eventual use by a wider community in their organisation. The process of knowledge capture or ontology construction involves three major steps: first, the construction of a concept hierarchy; secondly, the labeling of relations between concepts; and thirdly, the association of content with each node in the ontology (Brewster et al 2001a).

In the past a number of researchers have proposed methods for creating conceptual hierarchies or taxonomies of terms by processing texts. The work has sought to apply methods from Information Retrieval (term distribution in documents) and Information Theory (mutual information) (Brewster 2002). It is relatively easy to show that two terms are associated in some manner or to some degree of strength. It is possible also to group terms into hierarchical structures of varying degree of coherence. However, the most significant challenge is to be able to label the nature of the relationship between the terms.
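To make the point about association strength concrete, the following sketch computes pointwise mutual information (PMI), one common association measure from information theory, over document-level co-occurrence. It is an illustration only: the tiny corpus is invented, whitespace tokenisation is a simplification, and nothing here is taken from the cited work.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """Return PMI (in bits) for every pair of terms that co-occur in a document."""
    n = len(documents)
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing both terms of a pair
    for doc in documents:
        terms = set(doc.lower().split())
        term_df.update(terms)
        pair_df.update(combinations(sorted(terms), 2))
    return {
        (a, b): math.log2((count / n) / ((term_df[a] / n) * (term_df[b] / n)))
        for (a, b), count in pair_df.items()
    }

corpus = ["ontology concept", "ontology concept", "wheel money", "ontology wheel"]
scores = pmi_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A positive score says only that two terms occur together more often than chance would predict. It says nothing about whether the relation is, for instance, hyponymy or part-whole, which is exactly the labelling problem noted above.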

This has led to the development of Adaptiva (Brewster et al 2001b), an ontology building environment which implements a user-centred approach to the process of ontology learning. It is based on using multiple strategies to construct an ontology, reducing human effort by using adaptive information extraction. Adaptiva is a Technology Integration Experiment (TIE – section 3.1 of the Management Report).

The ontology learning process starts with the provision of a seed ontology, which is either imported to the system, or provided manually by the user. A seed may consist of just two concepts and one relationship. The terms used to denote concepts in the ontology are used to retrieve the first set of examples in the corpus. The sentences are then presented to the user to decide whether they are positive or negative examples of the ontological relation under consideration.

In Adaptiva, we have integrated Amilcare (discussed in greater detail below in section 7.1.1). Amilcare is a tool for adaptive Information Extraction (IE) from text designed for supporting active annotation of documents for Knowledge Management (KM). It performs IE by enriching texts with XML annotations. The outcome of the validation process is used by Amilcare, functioning as a pattern learner. Once the learning process is completed, the induced patterns are applied to an unseen corpus and new examples are returned for further validation by the user. This iterative process may continue until the user is satisfied that a high proportion of exemplars is correctly classified automatically by the system.

Using Amilcare, positive and negative examples are transformed into a training corpus where XML annotations are used to identify the occurrence of relations in positive examples. The learner is then launched and patterns are induced and generalised. After testing, the best, most generic, patterns are retained and are then applied to the unseen corpus to retrieve other examples. From Amilcare’s point of view the task of ontology learning is transformed into a task of text annotation: the examples are transformed into annotations and annotations are used to learn how to reproduce such annotations.
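One round of the loop just described (seed, retrieval, validation, pattern induction, application to unseen text) might be sketched as follows. The sketch makes strong simplifying assumptions: the literal text between two terms stands in for the patterns that Amilcare induces and generalises, and a small function (`user_accepts`) stands in for the human validator. All names and sentences are invented for illustration.

```python
import re

def candidate_sentences(corpus, term_a, term_b):
    """Retrieve sentences that mention both terms of the seed relation."""
    return [s for s in corpus if term_a in s and term_b in s]

def induce_patterns(positive_sentences, term_a, term_b):
    """Use the literal text between the two terms as a (very naive) pattern."""
    patterns = set()
    for sentence in positive_sentences:
        m = re.search(re.escape(term_a) + r"(.+?)" + re.escape(term_b), sentence)
        if m:
            patterns.add(m.group(1))
    return patterns

def apply_patterns(corpus, patterns):
    """Return term pairs in unseen sentences that are linked by a learnt pattern."""
    pairs = set()
    for sentence in corpus:
        for pattern in patterns:
            m = re.search(r"(\w+)" + re.escape(pattern) + r"(\w+)", sentence)
            if m:
                pairs.add((m.group(1), m.group(2)))
    return pairs

# Seed ontology: two concepts and one relation (dog ISA mammal).
seed = ("dog", "mammal")
seen = ["a dog is a kind of mammal", "the dog chased a mammal"]
unseen = ["a trout is a kind of fish", "a trout swam past"]

def user_accepts(sentence):
    # Stand-in for the user's judgement of whether a sentence expresses ISA.
    return "kind of" in sentence

positives = [s for s in candidate_sentences(seen, *seed) if user_accepts(s)]
patterns = induce_patterns(positives, *seed)
new_pairs = apply_patterns(unseen, patterns)
print(patterns, new_pairs)
```

In the iterative setting, the pairs returned from the unseen corpus would themselves be shown to the user for validation before the next round of learning.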

Experiments are under way to evaluate the effectiveness of this approach. Various factors such as size and composition of the corpus have been considered. Some experiments indicate that, because domain specific corpora take the shared ontology as background knowledge, it is only by going beyond the corpus that adequate explicit information can be identified for the acquisition of the relevant knowledge (Brewster et al. 2003). Using the principles underlying the Armadillo technology (cf. Section 7.2.2), a model has been proposed for a web-service, which will identify relevant knowledge sources outside the specific domain corpus thereby compensating for the lack of explicit specification of the domain knowledge.

4.3. KA from text: Artequakt

Given the amount of content on the web there is every likelihood that in some domains the knowledge that we might want to acquire is out there. Annotations on the SW could facilitate acquiring such knowledge, but annotations are rare and in the near future will probably not be rich or detailed enough to support the capture of extended amounts of integrated content. In the Artequakt work we have developed tools able to search and extract specific knowledge from the Web, guided by an ontology that details what type of knowledge to harvest. Artequakt is an Integrated Feasibility Demonstrator (IFD) that combines expertise and resources from three projects – Artiste, the Equator and AKT IRCs.

Many information extraction (IE) systems rely on predefined templates and pattern-based extraction rules or machine learning techniques in order to identify and extract entities within text documents. Ontologies can provide domain knowledge in the form of concepts and relationships. Linking ontologies to IE systems could provide richer knowledge guidance about what information to extract, the types of relationships to look for, and how to present the extracted information. We discuss IE in more detail in section 7.1.

There exist many IE systems that enable the recognition of entities within documents (e.g. ‘Renoir’ is a ‘Person’, ‘25 Feb 1841’ is a ‘Date’). However, such information is sometimes insufficient without acquiring the relation between these entities (e.g. ‘Renoir’ was born on ‘25 Feb 1841’). Extracting such relations automatically is difficult, but crucial to complete the acquisition of knowledge fragments and ontology population.
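The difference between recognising entities and extracting the relation between them can be shown with a toy example. The regular expressions below are a stand-in chosen for brevity; they are not the ontology-guided, WordNet- and GATE-based extraction that Artequakt uses, and the relation name `date_of_birth` is an assumed label.

```python
import re

# Toy entity patterns: a single capitalised word and a "25 Feb 1841"-style date.
PERSON = r"([A-Z][a-z]+)"
DATE = r"(\d{1,2} [A-Z][a-z]{2} \d{4})"

# The relation is expressed by the words connecting the two entities.
BORN_ON = re.compile(PERSON + r" was born on " + DATE)

def extract_birth_relations(text):
    """Return (person, relation, date) triples found in the text."""
    return [(person, "date_of_birth", date) for person, date in BORN_ON.findall(text)]

text = "Renoir was born on 25 Feb 1841. Monet visited Renoir in 1872."
print(extract_birth_relations(text))
```

Entity recognition alone would report two persons and a date in this text. Only the relation pattern ties the date to Renoir rather than to Monet.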

When analysing documents and extracting information, it is inevitable that duplicated and contradictory information will be extracted. Handling such information is challenging for automatic extraction and ontology population approaches.

Artequakt (Alani et al 2003b, Kim et al 2002) implements a system that searches the Web and extracts knowledge about artists, based on an ontology describing that domain. This knowledge is stored in a knowledge base to be used for automatically producing tailored biographies of artists.

Artequakt's architecture (Figure 4) comprises three key areas. The first concerns the knowledge extraction tools used to extract factual information items from documents and pass them to the ontology server. The second key area is the information management and storage. The information is stored by the ontology server and consolidated into a knowledge base that can be queried via an inference engine. The final area is the narrative generation. The Artequakt server takes requests from a reader via a simple Web interface. The reader request will include an artist and the style of biography to be generated (chronology, summary, fact sheet, etc.). The server uses story templates to render a narrative from the information stored in the knowledge base using a combination of original text fragments and natural language generation.

Figure 4: Artequakt's architecture

The first stage of this project consisted of developing an ontology for the domain of artists and paintings. The main part of this ontology was constructed from selected sections in the CIDOC Conceptual Reference Model ontology. The ontology informs the extraction tool of the type of knowledge to search for and extract. An information extraction tool was developed and applied that automatically populates the ontology with information extracts from online documents. The information extraction tool makes use of an ontology, coupled with a general-purpose lexical database, WordNet and an entity-recogniser, GATE (Cunningham et al 2002 – see section 10.4) as guidance tools for identifying knowledge fragments consisting not just of entities, but also the relationships between them. Automatic term expansion is used to increase the scope of text analysis to cover syntactic patterns that imprecisely match our definitions.

Figure 5: The IE process in Artequakt

The extracted information is stored in a knowledge base and analysed for duplications and inconsistencies. A variety of heuristics and knowledge comparison and term expansion methods were used for this purpose. This included the use of simple geographical relations from WordNet to consolidate any place information; e.g. places of birth or death. Temporal information was also consolidated with respect to precision and consistency.
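The consolidation of temporal information with respect to precision and consistency could work along the following lines. The source does not spell out Artequakt's heuristics, so this is an assumed scheme: each extracted date is a (year, month, day) tuple with `None` for unknown parts, more precise candidates are considered first, compatible candidates are merged, and incompatible ones are flagged.

```python
def consolidate_dates(candidates):
    """Merge compatible partial dates and report the contradictory ones.

    Each candidate is a (year, month, day) tuple; None marks an unknown part.
    """
    merged = [None, None, None]
    conflicts = []
    # Consider the most precise candidates first.
    ordered = sorted(candidates, key=lambda c: sum(p is not None for p in c), reverse=True)
    for cand in ordered:
        clashes = any(m is not None and c is not None and m != c
                      for m, c in zip(merged, cand))
        if clashes:
            conflicts.append(cand)
            continue
        # Fill in any parts that are still unknown.
        merged = [m if m is not None else c for m, c in zip(merged, cand)]
    return tuple(merged), conflicts

extracted = [(1841, None, None), (1841, 2, 25), (1841, 2, None), (1840, None, None)]
best, conflicts = consolidate_dates(extracted)
print(best, conflicts)
```

Here the three compatible extractions collapse into the single most precise date, while the contradictory year is kept aside for further checking rather than silently discarded.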

Narrative construction tools were developed that queried the knowledge base through an ontology server. These queries searched and retrieved relevant facts or textual paragraphs and generated a specific biography. The challenge is to build biographies for artists where there is sparse information available, distributed across the Web. This may mean constructing text from basic factual information gleaned, or combining text from a number of sources with differing interests in the artist. The work also aspires to provide biographies that are tailored to the particular interests and requirements of a given reader. These might range from rough stereotyping such as “A biography suitable for a child” to specific reader interests such as “I'm interested in the artists’ use of colour in their oil paintings” (Figure 6).

Figure 6: The biography generation process in Artequakt

Figure 7: Artequakt-generated biography for Renoir
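Template-based rendering from sparse facts can be sketched very simply. The templates, style names and fact keys below are invented for illustration and are far simpler than Artequakt's story templates; the only idea carried over from the text is that a biography style selects a set of templates, and that sentences whose facts are missing from the knowledge base are left out.

```python
# Hypothetical templates per biography style; placeholders name the facts required.
TEMPLATES = {
    "summary": [
        "{name} was born on {birth_date}.",
        "{name} was born in {birth_place}.",
    ],
    "fact sheet": [
        "Name: {name}",
        "Born: {birth_date}",
        "Birthplace: {birth_place}",
    ],
}

def render_biography(facts, style):
    """Fill each template of the chosen style, skipping those with missing facts."""
    lines = []
    for template in TEMPLATES[style]:
        try:
            lines.append(template.format(**facts))
        except KeyError:
            continue  # a required fact is not in the knowledge base
    return "\n".join(lines)

# Sparse knowledge: the place of birth has not been extracted.
facts = {"name": "Renoir", "birth_date": "25 Feb 1841"}
print(render_biography(facts, "summary"))
print(render_biography(facts, "fact sheet"))
```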

The system is undergoing evaluation and testing at the moment. It has already provided important components for a successful bid (the SCULPTEUR project) into the EU VI Framework.

4.4. Refiner++

Refiner++ (Aiken & Sleeman 2003) is a new implementation of Refiner+ (Winter & Sleeman 1995), an algorithm that detects inconsistencies in a set of examples (cases) and suggests ways in which these inconsistencies might be removed. The domain expert is required to specify which category each case belongs to; Refiner+ then infers a description for each of the categories and reports any inconsistencies that exist in the dataset. An inconsistency is when a case matches a category other than the one in which the expert has classified it. If inconsistencies have been detected in the dataset, the algorithm attempts to suggest appropriate ways of dealing with the inconsistencies by refining the dataset. At the time of writing, the Refiner++ system has been presented to three experts to use on problems in their domains: anaesthetics, educational psychology, and intensive care.
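The kind of inconsistency check described above can be illustrated with a small sketch. It is not the Refiner+ algorithm: the way a category description is inferred here (for each descriptor, the set of values seen in that category's cases) is an assumption made for the example, and the cases are invented toy data.

```python
def infer_descriptions(cases):
    """For each category, record the values observed for every descriptor."""
    descriptions = {}
    for category, descriptors in cases:
        description = descriptions.setdefault(category, {})
        for name, value in descriptors.items():
            description.setdefault(name, set()).add(value)
    return descriptions

def matches(description, descriptors):
    """A case matches a description if all its values were seen in that category."""
    return all(value in description.get(name, set())
               for name, value in descriptors.items())

def find_inconsistencies(cases):
    """Report cases that also match a category other than the expert's choice."""
    descriptions = infer_descriptions(cases)
    report = []
    for index, (category, descriptors) in enumerate(cases):
        for other, description in descriptions.items():
            if other != category and matches(description, descriptors):
                report.append((index, category, other))
    return report

# Toy cases: (category assigned by the expert, descriptors of the case).
cases = [
    ("A", {"size": "small", "colour": "red"}),
    ("A", {"size": "small", "colour": "blue"}),
    ("B", {"size": "large", "colour": "blue"}),
    ("B", {"size": "small", "colour": "blue"}),
]
print(find_inconsistencies(cases))
```

In this toy dataset the second and fourth cases are identical but classified differently, so each matches the other's category. A refinement step would then propose, for example, reclassifying a case or adding a descriptor that distinguishes the two.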

Although the application can be used to import existing datasets and perform analysis on them, its real strength is for an expert who wants to conceptualize a domain where the inherent task is classification. Refiner++ requires the expert to articulate cases, specifying the descriptors they believe to be important in their domain. This causes the expert to conceptualize their domain, bringing out the hidden relationships between descriptors that might otherwise be ignored.

We hope to produce a “refinement workbench” to include Refiner++, ReTax (Alberdi & Sleeman 1997) and ConRef (Winter et al 1998 – and section 9.2).

5. Modelling

As noted in the previous section, ontologies loom large in AKT – as in the SW – for modelling purposes. In particular, we have already seen the importance of ontologies (a) for directing acquisition, and (b) as objects to be acquired or created. The SW, as we have argued, will be a domain in which services will be key. For particular tasks, agents are likely to require combinations of services, either in parallel or sequentially. In either event, ontologies are going to be essential to provide shared understandings of the domain, preconditions and postconditions for their application and optimal combination. However, in the likely absence of much standardisation, ontologies are not going to be completely shared.

Furthermore, it will not be possible to assume unique or optimal solutions to the problem of describing real-world contexts. Ontologies will be aimed at different tasks, or will make inconsistent, but reasonable, assumptions. Given two ontologies precisely describing a real-world domain, it will not in general be possible to guarantee mappings between them without creating new concepts to augment them. As argued in (Kalfoglou & Schorlemmer 2003a), and section 9.3 below, there is a distinct lack of formal underpinnings here. Ontology mapping will be an important aspect of knowledge modelling, and as we have already seen, AKT is examining these issues closely.

Similarly, the production of ontologies will need to be automated, and documents will become a vital source for ontological information. Hence tools such as Adaptiva (section 4.2) will, as we have argued, be essential. However, experimental evidence amassed during AKT shows that in texts, it is often the essential ontological information that is not expressed, since it is taken to be part of the ‘background knowledge’ implicitly shared by reader and author (Brewster et al 2003). Hence the problem of how to augment information sources for documents is being addressed by AKT (Brewster et al 2003).

A third issue is that of the detection of errors in automatically-extracted ontologies, particularly from unstructured material. It is for this reason that we have also made some attempts to extract information from semi-structured sources, i.e. programs and Knowledge Bases (Sleeman et al, 2003).

In all these ways, there are plenty of unresolved research issues with respect to ontologies that AKT will address over the remaining half of its remit. However, modelling is a fundamental requirement in other areas, for instance with respect to the modelling of business processes in order to achieve an understanding of the events that a business must respond to. The AKT consortium has amassed a great deal of experience of modelling processes such as these that describe the context in which organisations operate. Section 5.1 looks at the use of protocols to model service interactions, while in section 5.2