Scripts and Services
Script and Service Guide

Data Acquisition

Ulm University

Directory: http://resist.ecs.soton.ac.uk/~cf105/ulm/

This data was acquired from the university web site. The URLs of the staff and department pages can be found in the makefile. For the staff, run ulm-staffparser.php once per staff web page, with the page as the first argument and the output filename as the second argument, then concatenate the output files. For the departments, run ulm-departmentparser.php. Once the HTML has been parsed to CSV format, run makeRDF with the config file ulm-config.xml. Run iso2utf8 on the output, which is then ready to be imported into the triplestore.

Darmstadt University

Directory: http://resist.ecs.soton.ac.uk/~icm/darmstadt/

This data was acquired from the university website. The staff web page is at http://www.informatik.tu-darmstadt.de/web/personen.de.shtml and the research group information is at http://www.informatik.tu-darmstadt.de/web/forschung_und_fachgebiete.en.htm. Run darmstadt-staffparser.php and darmstadt-groupparser.php to parse the HTML to CSV/TD format, then run makeRDF with the config file darmstadt-config.xml. The resulting RDF can then be imported into the triplestore.

The Risks Digest

Directory: http://resist.ecs.soton.ac.uk/~cf105/risk/

The issues were acquired from the Risks website. To retrieve all the issue web pages, run risk-issues-wget.php, which downloads them to the directory html/. Then run icm-parse-risks-html.php with the issue HTML files as arguments, redirecting the output to the file risk-issues.txt. makeRDF can then be run with the config file risk-issues-config.xml. Run iso2utf8 on the RDF output, which is then ready to be imported into the triplestore. Once the issues RDF has been imported into the triplestore, the classification alignment process can be carried out.
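The iso2utf8 step that closes the Ulm and Risks pipelines above is only named in this guide, not described. As a rough illustration of what such a step does, the following Python sketch re-encodes a file as UTF-8. It assumes that "iso" means ISO-8859-1 (Latin-1), which the guide does not confirm, and `iso_to_utf8` and its parameters are hypothetical names rather than the real script's interface.

```python
from pathlib import Path


def iso_to_utf8(src: Path, dst: Path, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode the file at src as UTF-8 and write the result to dst.

    Assumption: the input is ISO-8859-1 unless another encoding is passed in.
    """
    # Decode the raw bytes using the assumed source encoding.
    text = src.read_bytes().decode(source_encoding)
    # Write the same characters back out as UTF-8 bytes.
    dst.write_bytes(text.encode("utf-8"))
```

Because every byte value is valid in ISO-8859-1, this conversion never fails, so running it on a file that is already UTF-8 would silently produce double-encoded text. It should be applied exactly once per file.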
The Neumann classifications are found at http://www.csl.sri.com/users/neumann/illustrative.html, which must be retrieved and stored in a file named illustrative.html. Run icm-pase-neumann, which parses illustrative.html into tab-delimited format. Then run alignRisksNeumann.php, which attempts to align each Neumann classification with the correct Risks article by querying the RKB and using a bag of words. The Neumann classifications that could not be aligned must be removed from the file risk-data-aligned.txt (see makefile). Then run makeRDF with the config file risk-classifications-config.xml. This config also creates three taxonomies of categories; these and the classifications RDF are then ready to be imported into the triplestore.

The European Union Projects Database (CORDIS)

Directory: http://resist.ecs.soton.ac.uk/~cf105/cordis/projects/

The script cordis-queryautomaton.php runs several hundred queries against the CORDIS website in an attempt to get all valid research projects, and stores the results in queries/ as many lists of query results. If a query still returns >=200 results at the maximum query depth, the query criteria that caused it are written to the logfile cordis-projects-logfile.txt. Then run cordis-querydeeper.php, which reads the logfile and partitions those queries further. After this stage there should be only one query that returns >=200 results, which is dealt with by cordis-quick.php. Then run cordis-projects-processSearchResults.php with the results pages as arguments; the script writes a list of project ids to the file cordis-projects-projids.txt. Then run cordis-wgetprojects with the option -f followed by the file of project ids, which retrieves each project page and stores it in html/.
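The query partitioning described above (split a query further whenever it reaches the 200-result limit, and log whatever still reaches it at maximum depth) can be sketched in Python as follows. This is an illustration of the idea only, not the logic of cordis-queryautomaton.php or cordis-querydeeper.php: `partition_queries`, `run_query` and `dimensions` are hypothetical names, and the choice of criteria to split on is an assumption.

```python
from typing import Callable, Dict, List, Sequence, Tuple

RESULT_CAP = 200  # from the guide: a query with >=200 results may be incomplete

# A query is a tuple of (criterion name, value) pairs, e.g. (("year", "2000"),).
Criteria = Tuple[Tuple[str, str], ...]


def partition_queries(
    run_query: Callable[[Criteria], List[str]],
    dimensions: Sequence[Tuple[str, Sequence[str]]],
    criteria: Criteria = (),
    depth: int = 0,
) -> Tuple[Dict[Criteria, List[str]], List[Criteria]]:
    """Return (results per query, queries that still hit the cap at max depth)."""
    results = run_query(criteria)
    if len(results) < RESULT_CAP:
        # Under the cap, so this query is assumed to be complete.
        return {criteria: results}, []
    if depth == len(dimensions):
        # Nothing left to split on: keep what we have and record the query,
        # which plays the role of the logfile in the guide.
        return {criteria: results}, [criteria]
    name, values = dimensions[depth]
    collected: Dict[Criteria, List[str]] = {}
    overflow: List[Criteria] = []
    for value in values:
        # Narrow the query by one more criterion and try again.
        sub_results, sub_overflow = partition_queries(
            run_query, dimensions, criteria + ((name, value),), depth + 1
        )
        collected.update(sub_results)
        overflow.extend(sub_overflow)
    return collected, overflow
```

The second return value corresponds to the entries that a later pass, such as the "query deeper" step, would pick up and split on further criteria.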
IMPORTANT: Sometimes a blank page is retrieved. The best way to retrieve these again is to use find to list the empty files, reduce each filename to its project id number, and write the ids to a file, which can then be passed to cordis-wgetprojects.php with the -f option. You will also need the option -nl (no log) so that the script does not read its log of already retrieved pages.

Then run cordis-projects-html2td.php with the project pages as arguments. In practice find must be used, as there are too many pages to specify on the command line. If the process is being run again, first empty the file cordis-projects.txt, because the script appends to it rather than overwriting it. makeRDF can then be run with the config file cordis-projects-config.xml and the output RDF can be imported into the triplestore. Not a simple process, I know.

To generate the RDF containing the generic areas of interest taxonomy, run makeRDF with the config file cordis-gaoi.xml. This produces cordis-gaoi.rdf, which can be imported into the triplestore.

The National Science Foundation (NSF) Database

Directory: http://resist.ecs.soton.ac.uk/~cf105/nsf/

To get all the project CSV files from the NSF website, use the following process: run nsf-queryautomaton.php; run nsf-joinCSV.php csv/* ; run nsf-querydeeper.php; run nsf-joinCSV.php csv/* . Repeat those last two scripts until nsf-logfile.txt is empty. Then run makeRDF with the config file nsf-config.xml. The generated files nsf-projects-[US State].rdf and nsf-aoi.rdf (Area of Interest taxonomy) can then be imported into the triplestore and asserted.

NOTE: Some of the CSV records are malformed, so do not import any files that don't match nsf-projects-??.rdf, nsf-projects-{7}.rdf or nsf-aoi.rdf.

United Nations Location Codes (UN/LOCODES)

Directory: http://resist.ecs.soton.ac.uk/~icm/scripts/unlocode/

The source is in the CSV files unlocode-countriesandtowns.csv and unlocode-municipalunits.csv.
First, run the script divide.php, which splits unlocode-countriesandtowns.csv into the separate files unlocode-towns.csv and unlocode-countries.csv. Then run lat-long, which attempts to find a latitude and longitude for towns that don't have one (this takes a very long time); the output file is unlocode-towns2.csv. Then run stripEquals.php, which removes some troublesome records from the towns file and creates a new file unlocode-towns3.csv. makeRDF can then be run with the config files unlocode-config.xml (for countries and municipal units) and unlocode-townconfig.xml (for towns; takes a long time). Run iso2utf8 on the RDF, which is then ready to be imported into the triplestore.

Digital Bibliography and Library Project (DBLP)

Directory: http://resist.ecs.soton.ac.uk/~icm/scripts/acquisition/dblp/

The source data is in an XML document dblp.xml. This needs to be run through sed to ensure each element starts on a new line, with the result put in dblp-fixed.xml. Then run dblp-xml2rdf.php, which parses the XML and generates 16 RDF files that are ready to be imported into the triplestore.

Services

The Consistent Referencing Service (CRS)

To group URIs in the CRS, the URIs first need to be imported into the CRS, which can be done in either of two ways. Literal values to search for can be entered into the import string service; any URI with an attribute that matches one of the literals in the list will be imported into the CRS. If you know which URI you want to import, it can be entered into the import reference service. Once the URIs are in the CRS you can group them on a particular predicate or predicates using the grouper. To manually edit bundles, use the manual bundle management service, which lets you search for bundles (that are in the CRS) by keyword and type. You can then combine bundles or remove URIs from existing bundles; removed URIs are put into singleton bundles.
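To make the bundle terminology concrete, here is a small Python sketch of grouping URIs on a predicate and of moving a removed URI into a singleton bundle. It models the concepts described above and says nothing about how the CRS is actually implemented; the function names and the triple representation are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate, value)


def group_on_predicate(triples: Iterable[Triple], predicate: str) -> List[Set[str]]:
    """Bundle together the subject URIs that share a value for `predicate`."""
    by_value = defaultdict(set)
    for subject, pred, value in triples:
        if pred == predicate:
            by_value[value].add(subject)
    return list(by_value.values())


def remove_from_bundle(bundles: List[Set[str]], uri: str) -> List[Set[str]]:
    """Take `uri` out of its bundle and give it a singleton bundle of its own."""
    updated: List[Set[str]] = []
    for bundle in bundles:
        if uri in bundle and len(bundle) > 1:
            updated.append(bundle - {uri})  # the bundle without the URI
            updated.append({uri})           # the new singleton bundle
        else:
            updated.append(bundle)
    return updated
```

For example, two person URIs with the same name value end up in one bundle after grouping on the name predicate, and removing one of them leaves two singleton bundles.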
To search for literals or URIs in the CRS, or to export bundles, use the export service.

SPARQL Query Interface

A fairly simple interface: all that is needed is knowledge of how to use SPARQL, and there is a link on the page to a wiki page that shows how to use it. You can specify the output format and which triplestore to query using drop-down boxes.

ReSIST Knowledge Base (RKB) Browser

Again, a very simple interface: enter either a literal or a URI to search for in the RKB.

Other scripts

Google Maps helper functions

Directory: on the resist server /var/www/html/gmap/

gmap-functions.php contains a number of helper functions which output the JavaScript required to display a Google Map and add markers and information windows; each is fairly self-explanatory. To display a simple Google map you would write echoDisplayMap(); echoSetCentre(0,0); and to display markers and add information you can then add the necessary function calls.

Triplestore helper functions

Directory: on the resist server /var/www/html/gmap/ and various other places

There are several files of triplestore helper functions dotted all over the place, all of which do slightly different things. One such file I have written (in the directory mentioned) is rkb-functions.php, which contains a function to run a given query on the RKB: no prefixes need to be declared, and the output is stripped of column headings, quotes and angle brackets, with escaped quotes replaced by quotes. There are also functions to try to get a "pretty" name for a URI, get a name for a person URI, query the CRS for duplicates, get research projects for a person URI, make an HTML link to a URI's RKB browser page, and more.

Foreign Characters

The character encoding for RDF must be UTF-8, and the number of acceptable entities is limited. There is a script ~cf105/dblp/inc/utf8.php that contains a function that takes a string and converts any entities to their UTF-8 characters.
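The entity conversion described above can be illustrated with a short Python sketch. This is not the function in utf8.php, whose behaviour is not documented here. It uses the standard library's `html.unescape` to turn character references into real characters, then re-escapes `&`, `<` and `>`, on the assumption that the output is destined for RDF/XML, where those must stay escaped. `decode_entities` is a hypothetical name.

```python
import html
from xml.sax.saxutils import escape


def decode_entities(text: str) -> str:
    """Replace HTML character references with the characters they stand for.

    The XML-significant characters &, < and > are escaped again afterwards,
    so the result can still be embedded in an XML document.
    """
    decoded = html.unescape(text)  # e.g. "&uuml;" becomes the character itself
    return escape(decoded)         # restore &amp;, &lt; and &gt; for XML
```

The returned string still has to be written out with UTF-8 encoding, for instance by opening the output file with `encoding="utf-8"`.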
There is also a script iso2utf8, mentioned several times above, that converts a file from ISO to UTF-8.

makeRDF

Directory: http://resist.ecs.soton.ac.uk/~icm/scripts/makeRDF/

This is a script to assist in RDF generation. It takes a config file, which specifies the input files (either tab-delimited .txt or comma-separated .csv), the output files, and the mapping of each input column to an ontology element. There are three mapping types: literal, resource and userdef. The userdef type does not output any RDF for the specified value, so that you can generate your own RDF in a custom mapping (most likely using an external PHP function). If inline definitions are required, userdef must be used.
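The three mapping types can be pictured with the following Python sketch, which turns tab-delimited rows into triples. It is purely illustrative: the real config file is XML and its schema is not shown in this guide, so the `mappings` list, the function names and the example URIs are all hypothetical. The output is written as N-Triples for brevity, which is not necessarily the format makeRDF produces.

```python
import csv
import io
from typing import Iterator, Sequence, Tuple

# Hypothetical stand-in for a config entry: (column index, predicate URI, type).
Mapping = Tuple[int, str, str]


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(
    tsv_text: str,
    subject_column: int,
    subject_base: str,
    mappings: Sequence[Mapping],
) -> Iterator[str]:
    """Yield one N-Triples line per mapped column of each tab-delimited row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        subject = f"<{subject_base}{row[subject_column]}>"
        for column, predicate, kind in mappings:
            value = row[column]
            if kind == "literal":
                yield f'{subject} <{predicate}> "{escape_literal(value)}" .'
            elif kind == "resource":
                yield f"{subject} <{predicate}> <{value}> ."
            elif kind == "userdef":
                # Emits nothing: custom code is expected to handle this column.
                continue
            else:
                raise ValueError(f"unknown mapping type: {kind}")
```

A literal mapping produces a quoted string value, a resource mapping produces a URI reference, and a userdef mapping is skipped so that separate code can generate whatever RDF is needed for that column.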