Dome from The University of Southampton
Dome is a visual programming language for manipulating XML documents. It is being used in the AKT project to trawl through large websites, extracting information from them by transforming the source HTML into RDF.
What's the Problem?
Populating the knowledge base requires collecting data from many web pages. Because of the large number of pages to be examined and the need to regularly update the information, a tool is needed to do this automatically.
Eventually, all pages should provide machine-readable metadata in a standard format, which would make this task very easy. In the meantime, and to bootstrap this process, we need to cope with the current situation, in which each site generates its pages from its own database in its own particular format.
Towards a Solution
Dome is a programmable XML/HTML editor. Users load a page from the target site and record a sequence of editing operations that extracts the desired information. This sequence can then be replayed automatically on the rest of the site's pages. If irregularities in the pages are discovered during this process, the program can be paused and amended to cope with the new input.
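The record-and-replay idea can be sketched outside Dome in a few lines of Python. This is only an illustration, not Dome's own representation: the helper names (remove_all, rename, replay) and the sample pages are hypothetical, and where Dome would pause on an irregular page, the sketch simply collects the failures for later inspection.

```python
import xml.etree.ElementTree as ET

def remove_all(tag):
    """Operation: delete every element named `tag` (and its contents)."""
    def op(root):
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag == tag:
                    parent.remove(child)
    return op

def rename(old, new):
    """Operation: rename every `old` element to `new`."""
    def op(root):
        for el in list(root.iter(old)):
            el.tag = new
    return op

def replay(program, pages):
    """Apply the recorded operations, in order, to every page.

    Returns (results, failures): serialized output for the pages that worked
    and an error message for those that did not, so the program can be amended.
    """
    results, failures = {}, {}
    for name, source in pages.items():
        try:
            root = ET.fromstring(source)
            for op in program:
                op(root)
            results[name] = ET.tostring(root, encoding="unicode")
        except ET.ParseError as exc:
            failures[name] = str(exc)
    return results, failures

# "Recorded" once against a sample page, then replayed on the whole site.
program = [remove_all("script"), rename("h1", "name")]
pages = {
    "alice.html": "<html><body><script>x()</script><h1>Alice</h1></body></html>",
    "broken.html": "<html><body><h1>Bob</h1>",  # irregular page: unclosed tags
}
results, failures = replay(program, pages)
```

Here results holds the cleaned-up alice.html, while failures names broken.html as the page that needs attention.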
Dome automatically converts the source HTML to XHTML using the W3C's HTML-Tidy program, which also tidies up the markup. A Dome program is then recorded that removes all unnecessary elements from the page, leaving just the desired data. The element names and layout can then be changed to suit the desired output format, such as RDF.
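As a rough illustration of the XHTML-to-RDF step, the following Python sketch keeps only the data rows of a table and re-expresses them as RDF/XML. It assumes the page has already been through HTML-Tidy and is well-formed XHTML. The sample page, the page_to_rdf function and the http://example.org/terms# vocabulary are all made up for this example and are not part of Dome or the AKT project.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/terms#"  # hypothetical vocabulary, illustration only

ET.register_namespace("rdf", RDF)
ET.register_namespace("ex", EX)

def page_to_rdf(xhtml_source, page_url):
    """Drop everything except two-cell table rows and emit them as RDF/XML."""
    page = ET.fromstring(xhtml_source)
    rdf = ET.Element(f"{{{RDF}}}RDF")
    for row in page.iter(f"{{{XHTML}}}tr"):
        cells = ["".join(td.itertext()).strip()
                 for td in row.findall(f"{{{XHTML}}}td")]
        if len(cells) != 2:
            continue  # header and layout rows carry no data
        about = f"{page_url}#row{len(rdf) + 1}"
        desc = ET.SubElement(rdf, f"{{{RDF}}}Description",
                             {f"{{{RDF}}}about": about})
        ET.SubElement(desc, f"{{{EX}}}name").text = cells[0]
        ET.SubElement(desc, f"{{{EX}}}email").text = cells[1]
    return ET.tostring(rdf, encoding="unicode")

# A made-up page, as it might look after tidying.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Staff</title></head>
  <body>
    <h1>Staff list</h1>
    <table>
      <tr><th>Name</th><th>Email</th></tr>
      <tr><td>Alice</td><td>alice@example.org</td></tr>
      <tr><td><b>Bob</b></td><td>bob@example.org</td></tr>
    </table>
  </body>
</html>"""

rdf_xml = page_to_rdf(SAMPLE, "http://example.org/staff")
```

The heading, the title and the header row are discarded, and each remaining row becomes one rdf:Description element with ex:name and ex:email children.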
Dome has a number of simple programming constructs, such as loop-over-sequence (shown above by the blue bars), nesting (the rectangles) and simple exceptions.
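For readers more used to textual languages, these three constructs can be approximated in Python as functions that build steps out of other steps. This is a loose analogy under stated assumptions, not a description of how Dome stores or runs its programs: every name below (for_each, block, attempt, StepFailed and so on) is invented for the sketch.

```python
import xml.etree.ElementTree as ET

class StepFailed(Exception):
    """Raised when a step cannot be applied to the current node."""

def for_each(tag, body):
    """Loop-over-sequence: run `body` once for every element named `tag`."""
    def step(node):
        for match in list(node.iter(tag)):
            body(match)
    return step

def block(*steps):
    """Nesting: group several steps so that they run in order as one step."""
    def step(node):
        for s in steps:
            s(node)
    return step

def attempt(body, on_failure):
    """Simple exception: run `body`; if it fails, run `on_failure` instead."""
    def step(node):
        try:
            body(node)
        except StepFailed:
            on_failure(node)
    return step

def require_child(tag):
    """Fail unless the current node has a direct child named `tag`."""
    def step(node):
        if node.find(tag) is None:
            raise StepFailed(f"<{node.tag}> has no <{tag}> child")
    return step

def set_attr(name, value):
    """Set an attribute on the current node."""
    def step(node):
        node.set(name, value)
    return step

# For every list item: mark it "linked" if it contains a link, else "plain".
program = for_each("li", attempt(
    block(require_child("a"), set_attr("status", "linked")),
    set_attr("status", "plain"),
))

doc = ET.fromstring('<ul><li><a href="x">One</a></li><li>Two</li></ul>')
program(doc)
statuses = [li.get("status") for li in doc.iter("li")]
```

After the program runs, the first list item is marked "linked" and the second, which has no link, falls through to the failure branch and is marked "plain".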