This is a node.js application that aims at extracting the knowledge represented in the Google infoboxes (aka Google Knowlege Graph Panel).
The Algorithm implemented is the following:
- Query DBpedia for all concepts (types) for which there is at least one instance that has a link to a Freebase ID
- For each of these concepts pick (n) instances randomly
- For each instance, issue a Google Search query:
- if an infobox is available -> scrap the infobox to extract the properties
- if no infoxbox is available, check if Google suggests "do you mean ... ?" and if so, traverse the link and look for an infobox
- if no infobox or correction is available, disambiguate the concept (type) used in the search query and check if an infobox is returned
- if Google suggests disambiguation in an infobox parse all the links in it -> it is best to find which suggestion maps to the current data-type we are using -> check the Freebase - DBpedia mappings
- Cluster properties for each concept
- The result of our expirement is in the results folder
results/dbpedia.json - For a more detailed view for each DBpedia class, one can check the files in
results/dbpedia
- Clone the repo to your local machine
- run
npm installon the root of the local project directory
We Will automatically create all the required Cache folders:
-
Main cache folder "cache" in the root folder of the application
- folder called
GKBinside the cache folder: This will hold the aggregated Google Knowledge boxes extracted for a DBpedia concept (type) - folder called
instances_GKBinside the cache folder: This will hold the Google Knowledge box for a single instance - folder called
instancesinside the cache folder: This will hold the DBpedia instances for each concept (type) - folder called
instance_propertiesinside the cache folder: Thiw ill hold the distinct list of properties for all the instances of a certain concept
- folder called
-
run
node KBE.jsin the console
The application is run in the console and the output will be available in cache/result.json
There is a set of options that you can change found in the file options.json
cache_dbpedia_concepts : true,
limit_dbpedia_concepts : true,
limit_dbpedia_instances : true,
limit_dbpedia_concepts_value : 10,
limit_dbpedia_instances_value: 10,
proxy : nullcache_dbpedia_conceptscache the concepts retrieved from DBpedia.limit_dbpedia_conceptslimit the number of concepts retrieved by DBpedia, false will retrieve all the conceptslimit_dbpedia_instanceslimit the number of instances retrieved for each concept, false will retrieve all the instanceslimit_dbpedia_concepts_valuethe number of concepts you wish to retrievelimit_dbpedia_instances_valuethe number of instances you wish to retrieve for each conceptproxythe proxy address string containing ports i.ehttp:\\proxy:8080
For our experiment the parameters are:
cache_dbpedia_concepts : true,
limit_dbpedia_concepts : false,
limit_dbpedia_instances : true,
limit_dbpedia_concepts_value : null,
limit_dbpedia_instances_value: 100,
proxy : nullMoreover, you can always check the corresponding CSS class name selectors for the Google Knowledge Panel and edit them if needed in the same options.json file.
Currently the CSS selectors are:
"knowledgeBox" : "#kno-result",
"knowledgeBox_disambiguate" : ".kp-blk",
"property" : "._Nl",
"property_value" : ".kno-fv",
"label" : ".kno-ecr-pt",
"description" : ".kno-rdesc",
"type" : "._kx",
"images" : ".bicc",
"special_property" : ".kno-sh",
"special_property_value" : "._Zh",
"special_property_value_link" : "a._dt"
- Properties now have the direct links to DBpedia ontology
- Properties scores are normalized
"Band": {
"summary": {
"label": {
"uri": "http://dbpedia.org/property/label",
"count": 100
},
"description": {
"uri": "http://purl.org/dc/elements/1.1/description",
"count": 100
},
"type": {
"uri": "http://dbpedia.org/property/type",
"count": 100
},
"origin": {
"uri": "http://dbpedia.org/property/origin",
"count": 88.17204301075269
},
"members": {
"uri": "http://dbpedia.org/property/members",
"count": 88.17204301075269
},
"albums": {
"uri": "http://dbpedia.org/property/albums",
"count": 87.09677419354838
},
"leadSingers": {
"uri": "http://dbpedia.org/property/leadSingers",
"count": 6.451612903225806
},
"recordLabel": {
"uri": "http://dbpedia.org/property/recordLabel",
"count": 12.903225806451612
},
"awards": {
"uri": "http://dbpedia.org/property/awards",
"count": 13.978494623655912
},
"nominations": {
"uri": "http://dbpedia.org/property/nominations",
"count": 7.526881720430108
},
"born": {
"uri": "http://dbpedia.org/property/born",
"count": 2.1505376344086025
},
"nationality": {
"uri": "http://dbpedia.org/property/nationality",
"count": 2.1505376344086025
},
"height": {
"uri": "http://dbpedia.org/property/height",
"count": 1.0752688172043012
}
},
"infoboxless": [
"!Action Pact!",
"Allele (band)",
"Anti-Pasti",
"Armageddon (A&M band)",
"Banket (band)",
"Battlelore",
"Ben Folds Five"
],
"Unmapped_Properties": {
"leadSinger": 1,
"recordLabels": 1,
"songs": 1,
"upcomingEvents": 1,
"peopleAlsoSearchFor": 1,
"activeFrom": 1,
"filmMusicCredits": 1,
"activeUntil": 1,
"moviesAndTvShows": 1
}
}