AKTors.org
Dome from The University of Southampton

Dome is a visual programming language for manipulating XML documents. It is being used in the AKT project to trawl through large websites, extracting information from them by transforming the source HTML into RDF.


Dome fact-file

Owner  :  The University of Southampton
Researchers (listed alphabetically)  :  Thomas Leonard
Description  :  http://www.ecs.soton.ac.uk/~tal00r/Dome
Builds on  :  eXtensible Markup Language, Hyper Text Markup Language
Addresses challenges  :  Knowledge Retrieval, Knowledge Acquisition

What's the Problem?

Populating the knowledge base requires collecting data from many web pages. Because of the large number of pages to be examined and the need to regularly update the information, a tool is needed to do this automatically.

Eventually, all pages should provide machine-readable metadata in a standard format, which would make this task very easy. In the meantime, and to bootstrap this process, we need to cope with the current situation, in which each site generates its pages from its own database in its own particular format.

Towards a Solution

Dome is a programmable XML/HTML editor. Users load a page from the target site and record a sequence of editing operations that extracts the desired information. This sequence can then be replayed automatically on the rest of the site's pages. If irregularities in the pages are discovered during this process, the program can be paused and amended to cope with the new input.
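
The record-and-replay idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dome's actual program format or API: the helpers `remove_all`, `rename` and `replay` are hypothetical, and a "recorded program" is modelled simply as a list of functions that edit an XML tree in place.

```python
import xml.etree.ElementTree as ET

# Hypothetical recorded operations (not Dome's real ones): each factory
# returns a function that edits the tree rooted at `root` in place.
def remove_all(tag):
    """Return an operation that deletes every element named `tag`."""
    def op(root):
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag == tag:
                    parent.remove(child)
    return op

def rename(old, new):
    """Return an operation that renames every `old` element to `new`."""
    def op(root):
        for element in root.iter(old):
            element.tag = new
    return op

def replay(program, page_source):
    """Parse one page and apply the recorded operations in order."""
    root = ET.fromstring(page_source)
    for op in program:
        op(root)
    return root

# The "recording", made once against a sample page.
program = [remove_all("script"), rename("b", "name")]

# Replaying it on further pages that share the same layout.
pages = [
    "<html><body><script>x()</script><b>Alice</b></body></html>",
    "<html><body><b>Bob</b><script>y()</script></body></html>",
]
results = [ET.tostring(replay(program, p), encoding="unicode") for p in pages]
```

Keeping the program as data, separate from any one page, is what allows the same sequence to be rerun across a whole site.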

Dome automatically converts the source HTML to XHTML using the W3C's HTML Tidy program, tidying it up in the process. A Dome program is then recorded that removes all unnecessary elements from the page, leaving just the desired data. The element names and layout can then be changed to suit the desired output format, such as RDF.
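
As a rough illustration of the last stage, the following Python sketch takes an already-tidied XHTML page, keeps only two fields and re-expresses them as RDF/XML. The page layout, the `ex` vocabulary at `http://example.org/terms#`, the resource URI and the function `xhtml_to_rdf` are all assumptions made for the example; the HTML Tidy step is taken as already done.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/terms#"       # hypothetical vocabulary
XHTML = "http://www.w3.org/1999/xhtml"

ET.register_namespace("rdf", RDF)
ET.register_namespace("ex", EX)

def xhtml_to_rdf(xhtml_source, about):
    """Keep only the wanted fields of a page and rebuild them as RDF/XML."""
    page = ET.fromstring(xhtml_source)
    ns = {"h": XHTML}
    # Everything not selected here (navigation, layout) is simply dropped.
    name = page.findtext(".//h:h1", namespaces=ns)
    email = page.findtext(".//h:span[@class='email']", namespaces=ns)

    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description",
                         {f"{{{RDF}}}about": about})
    ET.SubElement(desc, f"{{{EX}}}name").text = name
    ET.SubElement(desc, f"{{{EX}}}email").text = email
    return rdf

# An assumed page layout, as it might look after tidying.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="nav">Home</div>'
    '<h1>Alice Example</h1>'
    '<span class="email">alice@example.org</span>'
    '</body></html>'
)
rdf = xhtml_to_rdf(sample, "http://example.org/people/alice")
print(ET.tostring(rdf, encoding="unicode"))
```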

Screenshot of Dome

Dome has a number of simple programming constructs, such as loop-over-sequence (shown above by the blue bars), nesting (the rectangles) and simple exceptions.
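
To make these constructs concrete, here is a minimal Python sketch of a loop-over-sequence block with a nested body and a simple exception handler. It is an analogy only: `OpFailed`, `require_child`, `set_attr` and `foreach` are hypothetical names, and Dome's own blocks are built graphically rather than written as code.

```python
import xml.etree.ElementTree as ET

class OpFailed(Exception):
    """Raised when an operation does not fit the current node."""

def require_child(tag):
    """Operation that fails unless the node has a child named `tag`."""
    def op(node):
        if node.find(tag) is None:
            raise OpFailed(f"<{node.tag}> has no <{tag}> child")
    return op

def set_attr(name, value):
    """Operation that sets one attribute on the node."""
    def op(node):
        node.set(name, value)
    return op

def foreach(path, body, on_error=None):
    """Loop-over-sequence: run the nested block `body` on each match."""
    def op(node):
        for item in node.findall(path):
            try:
                for step in body:
                    step(item)
            except OpFailed:
                if on_error is None:
                    raise
                on_error(item)   # simple exception handler
    return op

program = [
    foreach("row",
            [require_child("name"), set_attr("status", "ok")],
            on_error=set_attr("status", "skipped")),
]

# The second row is irregular: it has no <name> child.
root = ET.fromstring("<table><row><name>A</name></row><row/></table>")
for step in program:
    step(root)
statuses = [row.get("status") for row in root.findall("row")]
```

In this sketch the handler lets the loop carry on past an irregular row rather than aborting the whole run.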

Example walk-through

  1. Load the site's index page and extract the desired links (perhaps by entering an XPath to select them all).

    Step 1

  2. Create a 'foreach' block to process each selected link. Once the block is created, just the first of the selected links will be selected.

    Step 2

  3. Fetch the link. This replaces the anchor element with the HTML of the referenced page.

    Step 3

  4. Edit the page's contents to convert it to RDF.

    Step 4

  5. Now that we are at the end of the first iteration of the foreach block, tell Dome to continue running. This will cause the remaining links to be processed. Then add the RDF header and Seq elements to create a single RDF document.

    Step 5

  6. Save the finished RDF and the Dome program. The program can be run again later, either interactively or using a non-graphical interpreter.

    Step 6
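
The six steps above can be approximated by a short Python script. This is a sketch under stated assumptions rather than what Dome itself executes: the three-page `SITE` dictionary stands in for a real website, `fetch` reads from it instead of the network, and the `ex` vocabulary and `page_to_description` are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/terms#"   # hypothetical vocabulary
ET.register_namespace("rdf", RDF)
ET.register_namespace("ex", EX)

# Hypothetical in-memory "site", so the sketch runs without a network.
SITE = {
    "index.html": '<html><body><a href="a.html">A</a>'
                  '<a href="b.html">B</a></body></html>',
    "a.html": "<html><body><h1>Alice</h1></body></html>",
    "b.html": "<html><body><h1>Bob</h1></body></html>",
}

def fetch(href):
    """Stand-in for fetching and tidying a page."""
    return ET.fromstring(SITE[href])

def page_to_description(href, page):
    """Step 4: reduce one fetched page to an rdf:Description."""
    desc = ET.Element(f"{{{RDF}}}Description", {f"{{{RDF}}}about": href})
    ET.SubElement(desc, f"{{{EX}}}name").text = page.findtext(".//h1")
    return desc

def run(index_href):
    index = fetch(index_href)
    links = index.findall(".//a")            # step 1: select the links
    rdf = ET.Element(f"{{{RDF}}}RDF")        # step 5: RDF header ...
    seq = ET.SubElement(rdf, f"{{{RDF}}}Seq")  # ... and Seq wrapper
    for link in links:                       # step 2: foreach block
        href = link.get("href")
        page = fetch(href)                   # step 3: fetch the link
        item = ET.SubElement(seq, f"{{{RDF}}}li")
        item.append(page_to_description(href, page))
    return rdf

document = run("index.html")
# Step 6: serialise the finished RDF document.
output = ET.tostring(document, encoding="unicode")
```

Unlike Dome, which replaces each anchor in place and edits the result interactively, this sketch builds the output tree separately; the overall flow of select, loop, fetch, convert and wrap is the same.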

Semantic representation

View in the AKT Triplestore Browser or as RDF.

Also available in DOAP RDF (Description Of A Project)