At IWMW 2010 last week, a lot of discussion centred on how, in an increasingly austere atmosphere, we can make more use of free stuff. One category of free stuff is linked data. In particular, I was intrigued by a presentation from Thom Bunting (UKOLN) about extracting information from Wikipedia. It has inspired me to start experimenting with data about UK universities.
Let’s get some terminology out of the way. DBpedia is a service that extracts machine-readable data from Wikipedia articles. You can look at, for example, everything DBpedia knows about the University of Bristol. SPARQL is an SQL-like language for querying triples: effectively, all the data is in a single table with three columns (subject, predicate and object). SNORQL is a front-end to DBpedia that allows you to enter SPARQL queries directly. It’s possible to ask SNORQL for “All soccer players, who played as goalkeeper for a club that has a stadium with more than 40.000 seats and who are born in a country with more than 10 million inhabitants” and get results in a variety of machine-readable formats.
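To make the "single table with three columns" idea concrete, here is a minimal Python sketch of a toy triple store and a pattern matcher. It is not how DBpedia or SPARQL is implemented; the `ex:` names, the `match` and `query` helpers and all the rows are invented placeholders for illustration.

```python
# Toy triple store: every fact is one (subject, predicate, object) row.
# All names and values below are invented placeholders, not real DBpedia data.
TRIPLES = [
    ("ex:Uni_A", "rdf:type", "ex:University"),
    ("ex:Uni_A", "ex:city", "ex:City_X"),
    ("ex:Uni_B", "rdf:type", "ex:University"),
    ("ex:Uni_B", "ex:city", "ex:City_Y"),
    ("ex:City_X", "rdf:type", "ex:City"),
]


def match(pattern, triples=TRIPLES):
    """Yield variable bindings for one triple pattern.

    Terms starting with "?" are variables; anything else must match exactly.
    """
    for triple in triples:
        bindings = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable used twice in one pattern must bind consistently.
                if bindings.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            yield bindings


def query(patterns, triples=TRIPLES):
    """Join several patterns on shared variables, like a WHERE block."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute already-bound variables before matching.
            bound = tuple(solution.get(term, term) for term in pattern)
            for bindings in match(bound, triples):
                next_solutions.append({**solution, **bindings})
        solutions = next_solutions
    return solutions


rows = query([
    ("?uni", "rdf:type", "ex:University"),
    ("?uni", "ex:city", "?city"),
])
```

Each pattern narrows the set of candidate rows, and a variable shared between two patterns (here `?uni`) acts like a join condition in SQL.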
Sadly, when you look for ways to use DBpedia data, some of the links are broken, which was initially off-putting. SNORQL is great fun, though. SPARQL is something I’m only just learning, but to anyone familiar with SQL and the basics of RDF it’s straightforward.
List the members of the 1994 Group of universities
SELECT ?uni
WHERE {
  ?uni rdf:type <http://dbpedia.org/ontology/University> .
  ?uni skos:subject <http://dbpedia.org/resource/Category:1994_Group>
}
ORDER BY ?uni
Results
Get the latitude and longitude of the University of York
SELECT ?lat ?long
WHERE {
  :University_of_York geo:lat ?lat .
  :University_of_York geo:long ?long
}
Results
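The same query can be sent from a script instead of being typed into SNORQL. The sketch below only builds the HTTP request and does not send it. It assumes the public endpoint lives at `https://dbpedia.org/sparql` and that SNORQL's `:` and `geo:` prefixes expand to the URIs shown; neither is stated in this post, so check both before relying on them. `build_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: location of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# SNORQL predefines its prefixes; outside SNORQL they must be declared.
# Assumption: these are the URIs SNORQL uses for ":" and "geo:".
QUERY = """
PREFIX : <http://dbpedia.org/resource/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?lat ?long
WHERE {
  :University_of_York geo:lat ?lat .
  :University_of_York geo:long ?long
}
"""


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(QUERY)
```

Passing `request` to `urllib.request.urlopen` would run the query; that step is left out here so the sketch works offline.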
List universities in the United Kingdom, with their cities, types, web sites, and numbers of Undergraduate and Postgraduate students
SELECT DISTINCT ?uni ?city ?type ?ug ?pg ?web
WHERE {
  ?uni rdf:type <http://dbpedia.org/ontology/University> .
  ?uni dbpedia2:country ?uk .
  ?uni dbpedia2:city ?city .
  ?uni dbpedia-owl:numberOfPostgraduateStudents ?pg .
  ?uni dbpedia-owl:numberOfUndergraduateStudents ?ug .
  OPTIONAL { ?uni dbpedia2:type ?type } .
  OPTIONAL { ?uni dbpedia2:website ?web }
  FILTER (?uk = :United_Kingdom || ?uk = :England || ?uk = :Wales || ?uk = :Scotland || ?uk = :Northern_Ireland)
}
ORDER BY ?uni
Note that in this implementation, “:” is an abbreviation for “http://dbpedia.org/resource/”, so “:United_Kingdom” is just a shorter way of saying “http://dbpedia.org/resource/United_Kingdom”.
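The abbreviation is plain string substitution, which a few lines of Python can illustrate. The mapping for “:” comes from the note above; the `dbpedia-owl` entry is an assumption inferred from the ontology URIs used in the queries, and `expand` is a hypothetical helper.

```python
PREFIXES = {
    "": "http://dbpedia.org/resource/",  # the ":" prefix described above
    # Assumption: inferred from URIs such as
    # <http://dbpedia.org/ontology/University> used in the queries.
    "dbpedia-owl": "http://dbpedia.org/ontology/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn a prefixed name such as ":United_Kingdom" into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return prefixes[prefix] + local


full_uri = expand(":United_Kingdom")
```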
Results
The data in these examples is sometimes patchy, as you would expect. Glasgow presently appears twice in the list because it is listed as both a “public university” and an “ancient university”, so the last query could do with some tidying up. The HESA data on which the student and staff numbers are based is often a few years old rather than up to date. Web site URLs are formatted in different ways in different infoboxes, leading to a slight inconsistency (which could be fixed by an extra line of code). Then again, given that it’s drawn from Wikipedia, I’m impressed at the completeness (and of course it’s easy to correct or update the figures).
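Both tidy-ups can be done after the results come back. The following is a rough Python sketch, assuming each result row has already been parsed into a dictionary with `uni`, `type` and `web` keys; the row layout, the `example.ac.uk` address and the helper names `normalise_website` and `merge_rows` are all hypothetical.

```python
from urllib.parse import urlsplit


def normalise_website(raw):
    """Give differently formatted web site values one consistent shape."""
    text = raw.strip()
    if "://" not in text:
        text = "http://" + text  # assume plain http when no scheme is given
    parts = urlsplit(text)
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def merge_rows(rows):
    """Collapse rows that differ only in ?type into one row per university."""
    merged = {}
    for row in rows:
        entry = merged.setdefault(row["uni"], {"uni": row["uni"], "types": []})
        if row.get("type") and row["type"] not in entry["types"]:
            entry["types"].append(row["type"])
    return list(merged.values())


# Placeholder rows mirroring the duplicate described above.
rows = [
    {"uni": ":University_of_Glasgow", "type": "public university"},
    {"uni": ":University_of_Glasgow", "type": "ancient university"},
]
merged = merge_rows(rows)
```

Merging in the client keeps the query simple; the alternative is to drop `?type` from the SELECT list so that DISTINCT removes the duplicate, at the cost of losing the type information.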
Chains of doctoral advisors featuring four scientists
SELECT ?a ?a_birth ?b ?b_birth ?c ?c_birth ?d ?d_birth
WHERE {
  ?a rdf:type <http://dbpedia.org/ontology/Scientist> .
  ?b rdf:type <http://dbpedia.org/ontology/Scientist> .
  ?c rdf:type <http://dbpedia.org/ontology/Scientist> .
  ?d rdf:type <http://dbpedia.org/ontology/Scientist> .
  ?a dbpedia-owl:birthDate ?a_birth .
  ?b dbpedia-owl:birthDate ?b_birth .
  ?c dbpedia-owl:birthDate ?c_birth .
  ?d dbpedia-owl:birthDate ?d_birth .
  ?d dbpedia-owl:doctoralAdvisor ?c .
  ?c dbpedia-owl:doctoralAdvisor ?b .
  ?b dbpedia-owl:doctoralAdvisor ?a
}
ORDER BY ?a_birth
Results
Lots of potential here for tracking the impact of individual academics and institutions.
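The four-scientist query hard-codes the chain length by repeating the doctoralAdvisor pattern three times. Once the student-to-advisor pairs have been downloaded, chains of any length can be followed in ordinary code. This is a simplified Python sketch: it assumes each student has at most one recorded advisor (the SPARQL query has no such restriction), and the names in `ADVISOR` are invented placeholders, not DBpedia results.

```python
# student -> doctoral advisor; invented placeholder names.
ADVISOR = {
    "Scientist_D": "Scientist_C",
    "Scientist_C": "Scientist_B",
    "Scientist_B": "Scientist_A",
    "Scientist_E": "Scientist_A",
}


def chains(advisor_of, length):
    """Return advisor-first chains containing exactly `length` people."""
    found = []
    for student in advisor_of:
        chain = [student]
        # Walk up the advisor links until the chain is long enough
        # or the next advisor is unknown.
        while len(chain) < length and chain[-1] in advisor_of:
            chain.append(advisor_of[chain[-1]])
        if len(chain) == length:
            found.append(list(reversed(chain)))
    return found


four_generations = chains(ADVISOR, 4)
```

Because the walk stops as soon as the chain reaches the requested length, a cycle in the data cannot make it loop forever.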
September 14, 2010 at 11:46 am
[…] example of the potential for DBpedia has been described by Martin Poulter in a post on Getting information about UK HE from Wikipedia which explores some of the ideas I discussed on A Challenge To Linked Data Developers. But […]
September 15, 2010 at 7:56 am
[…] filtered datasets retrieved from SPARQL queries on DBpedia (as illustrated by Martin Poulter in his follow-up blog post ‘Getting information about UK HE from Wikipedia‘) […]
September 21, 2010 at 8:24 am
[…] a post entitled “Getting information about UK HE from Wikipedia” published in July on the Ancient Geek’s blog Martin Poulter commented that “At […]