CIA World Factbook to RDF

Ben Humphreys, Summer 2005

bh704@soton.ac.uk


Brief description

Converts the CIA World Factbook into RDF, which is more machine-readable than the original HTML.

Simple method for creating the RDF and importing it into the triplestore:

  1. Run createrdf.pl
    Uses ciaparser to parse all the .html files in the source/ directory and writes the RDF to the rdf/ directory.
  2. Run importrdf.pl
    Takes the RDF files created in rdf/ and inserts them into the triplestore on benedict under the model name "factbook". The -f flag makes sure that the model is flushed before the new data is inserted:
    ./importrdf.pl -f
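
The first step above is essentially a mapping from files in source/ to files in rdf/. The following is a rough Python sketch of that mapping, not the code of createrdf.pl itself. The function name plan_outputs and the output naming (ja.html becomes ja.rdf) are assumptions made for illustration; check createrdf.pl for the names it really uses.

```python
from pathlib import Path


def plan_outputs(source_dir, rdf_dir):
    """Map each <name>.html in source_dir to <name>.rdf in rdf_dir.

    The ".rdf" output name is an assumption, not taken from createrdf.pl.
    """
    source_dir, rdf_dir = Path(source_dir), Path(rdf_dir)
    return {
        html: rdf_dir / (html.stem + ".rdf")
        for html in sorted(source_dir.glob("*.html"))
    }


if __name__ == "__main__":
    # List what would be produced without running the parser.
    for html, rdf in plan_outputs("source", "rdf").items():
        print(html, "->", rdf)
```

Only files ending in .html are picked up, so anything else left in source/ is ignored.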

File descriptions

ciaparser.pl

Takes a supplied HTML file and returns parsed RDF to the standard output.

Parses files in the directory source/ below it.

Can be given an argument specifying the name of the file to parse, without the .html extension.

For example, to parse Japan's file, ja.html, you would run

./ciaparser.pl ja

When no argument is given, the script parses the UK's file, uk.html.

The script also supports the argument all, which parses all the files in the subdirectory source/
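
The three ways of calling the script described above (no argument, a country code, or all) can be summarised in a few lines. This is an illustrative Python sketch, not the Perl in ciaparser.pl; the names resolve_targets and DEFAULT_COUNTRY are made up for the example.

```python
from pathlib import Path

DEFAULT_COUNTRY = "uk"  # uk.html is parsed when no argument is given


def resolve_targets(args, source_dir="source"):
    """Return the HTML files selected by the command-line arguments."""
    source = Path(source_dir)
    if not args:
        # No argument: fall back to the UK's file.
        return [source / f"{DEFAULT_COUNTRY}.html"]
    if args[0] == "all":
        # "all": every .html file in the source directory.
        return sorted(source.glob("*.html"))
    # Otherwise the argument is a file name without the .html extension.
    return [source / f"{args[0]}.html"]
```

With this sketch, resolve_targets(["ja"]) selects source/ja.html and resolve_targets([]) selects source/uk.html.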

ciaparser.pl is a refactored version of cia.cgi that creates valid RDF with a nested structure. It is the most up-to-date script and the one that should be used to create RDF.
The output was tested using the W3C RDF validator at http://www.w3.org/RDF/Validator/

The script runs through the supplied HTML file, extracting the field name and the data associated with it.
The RDF tag is automatically generated based on that field name.
Various regular expressions are applied to the data to put certain fields into more useful formats. For example, units such as % and kmh are removed from the value and stored in a separate unit information field.
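
To make the two ideas above concrete, here is a small Python sketch of generating a tag from a field name and of separating a unit from a value. It is not the Perl used in ciaparser.pl. The camelCase tag style, the function names field_to_tag and split_unit, and every unit in the pattern other than % and kmh are assumptions made for illustration.

```python
import re


def field_to_tag(field_name):
    """Turn a field label such as 'Population growth rate' into a tag name."""
    words = re.findall(r"[A-Za-z0-9]+", field_name)
    if not words:
        raise ValueError("field name has no usable characters")
    tag = words[0].lower() + "".join(w.capitalize() for w in words[1:])
    # XML names cannot start with a digit, so prefix those with "_".
    return "_" + tag if tag[0].isdigit() else tag


# A number followed by one known unit. Only % and kmh come from the README;
# "km" and "sq km" are assumed extras.
UNIT_PATTERN = re.compile(r"^\s*(-?[\d,]*\.?\d+)\s*(%|kmh|km|sq km)\s*$")


def split_unit(value):
    """Return (number, unit), or (value, None) if no known unit is found."""
    match = UNIT_PATTERN.match(value)
    if match is None:
        return value.strip(), None
    # Drop thousands separators so the number is easier to process later.
    return match.group(1).replace(",", ""), match.group(2)
```

For instance, field_to_tag("Population growth rate") gives "populationGrowthRate", and split_unit("0.05%") gives ("0.05", "%"). Values without a recognised unit are passed through unchanged.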

orgtagsfromcia.txt

A resource file containing all the acronyms used for international organisations, together with their meanings. Used by cia.pl and ciaparser to create URIs for organisations.
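
As an illustration of how such a lookup could work, here is a Python sketch. The layout of orgtagsfromcia.txt is not documented here, so this assumes one tab-separated "ACRONYM, meaning" pair per line. The base URI and the names load_acronyms and org_uri are hypothetical and not taken from the scripts.

```python
from urllib.parse import quote

# Hypothetical base URI; the scripts may use a different one.
ORG_BASE_URI = "http://example.org/factbook/org/"


def load_acronyms(lines):
    """Build {acronym: meaning} from lines of 'ACRONYM<TAB>meaning' (assumed format)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        acronym, meaning = line.split("\t", 1)
        table[acronym.strip()] = meaning.strip()
    return table


def org_uri(acronym, table):
    """Return a URI for a known acronym, or None if it is not in the table."""
    if acronym not in table:
        return None
    return ORG_BASE_URI + quote(acronym, safe="")
```

Returning None for unknown acronyms lets the caller decide whether to fall back to a plain literal instead of a URI.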

old/

Contains versions of files that are no longer used. Kept just in case parts of them become useful again.

misc.zip

A complete mess of small test files and various temporary output files that were used on the way to making something usable. The useful files from there should be in old/. Enter if you dare.

source/

Where all the HTML files for each country are stored. Offline versions of the Factbook were downloaded from: http://www2.cia.gov/print.zip

The print version is used because it has much cleaner HTML, without unnecessary formatting.

rdf/

Directory where the produced RDF will be put. This is intentionally left empty to reduce the file size. The extra RDF tags and other formatting make the directory well over 10 MB.

Just run the createrdf.pl script to fill this directory with RDF.


Extra info

The basics of the script were created by Daniel Smith (das301@zepler.net) in a previous year. The foundations of his script and its methods of splitting up data were used in my script.

At the time of writing, the latest version of the RDF was stored in the triplestore on benedict (benedict.ecs.soton.ac.uk) under the model name "factbook".

Last updated: 2005/09/23

Download file:
cia.zip