Integrating One More Data Source: The Semantic Web

Fri, 09/30/2016 - 10:36 Vincenzo

The Semantic Web

According to the W3C Linked Data page, the Semantic Web refers to a technology stack to support the “Web of data”. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.

  • RDF. Resource Description Framework is a standard data model for representing the metadata of resources on the Web; it can describe all resources, even those that cannot be directly retrieved. RDF especially helps to process, mix, expose, and share such metadata. In relational terms, an RDF statement expresses a relationship between two resources and resembles a three-column relation with subject, predicate, and object.
  • OWL. Web Ontology Language is based on the basic elements of RDF, but uses a wider vocabulary to describe properties and classes.
  • SKOS. Simple Knowledge Organization System is also based on RDF and specifically designed to express hierarchical information. If needed, it is also extendable into OWL.
  • SPARQL. SPARQL Protocol and RDF Query Language is a query language for RDF data, used to retrieve and manipulate public and private metadata stored in RDF format.
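To make the subject-predicate-object idea a bit more tangible, here is a minimal Python sketch that stores a few statements as triples and filters them by pattern, which is roughly what a SPARQL query does on a much larger scale. The `ex:` names and the value 300 are invented for illustration and do not come from any real data set.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical statements, made up for illustration only.
TRIPLES = [
    Triple("ex:CityA", "ex:country", "ex:Italy"),
    Triple("ex:CityA", "ex:julSun", "300"),
    Triple("ex:CityB", "ex:country", "ex:Norway"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every component that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Which resources have Italy as their country?"
italian = match(TRIPLES, predicate="ex:country", obj="ex:Italy")
```

Leaving a component as `None` plays the role of a variable such as `?city` in a SPARQL pattern.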

A commonly used instance of the semantic web is the DBpedia project, which was created to extract structured content from Wikipedia.

Our latest release, KNIME Analytics Platform 3.2, includes a great feature: semantic web integration! A full node category is dedicated to querying and manipulating semantic web resources. The new semantic web nodes treat the web of data exactly like a database, with connector nodes, query nodes, and manipulation nodes. Additional nodes are provided to read and write files in various formats.

Since one image is better than a thousand words and a workflow is better than a thousand images, I am going to guide you through a simple example of querying and manipulating data from the semantic web. And since the month of July this year was quite gray and rainy, we decided to pine for past summers and explore the number of sun hours a normal July should have had, at least in Italy.

The goal of this example is to gather and visualize information from the semantic web about the total number of sun hours in July for a number of Italian cities.

Connecting to DBPedia to extract city name and sun hours

The desired information, i.e. the number of hours of sun per Italian city in the month of July, is likely available on Wikipedia, and therefore on DBPedia.

The first step is to connect to DBpedia using the SPARQL Endpoint node. The SPARQL Endpoint node connects to an endpoint URI (to be defined in the configuration window) via SPARQL and outputs a live connection to that semantic web endpoint. The DBpedia endpoint URL is http://dbpedia.org/sparql.

Once the connection to the DBpedia endpoint has been established, we query the system with a SPARQL Query node for a list of Italian cities and their associated total number of sun hours in July. The SPARQL SELECT query is as follows:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dbp: <http://dbpedia.org/property/>
  SELECT * WHERE {
      ?city dbo:country <http://dbpedia.org/resource/Italy> ;
            dbp:julSun ?sunHours ;
  } order by desc(?sunHours)

The SPARQL Query node takes a semantic web connection as input, runs the SELECT query written in its configuration window, and produces the resulting data at the output port as a KNIME data table. In our case, the result of the above query is a data table with two columns: the URI of the city's DBpedia resource (which mirrors the city's Wikipedia page) and the city’s number of sun hours in July.
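For readers who want to see what the two nodes do outside KNIME, here is a rough Python sketch that sends the query to the DBpedia endpoint over HTTP and flattens the answer into rows. It assumes the endpoint honours the standard `application/sparql-results+json` media type. The function names are ours and not part of any KNIME API, and the unused `rdf:` and `rdfs:` prefixes are dropped for brevity.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"

# The query from the text, without the unused rdf:/rdfs: prefixes.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT * WHERE {
    ?city dbo:country <http://dbpedia.org/resource/Italy> ;
          dbp:julSun ?sunHours ;
} order by desc(?sunHours)
"""


def build_request(endpoint, query):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def bindings_to_rows(doc):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    columns = doc["head"]["vars"]
    return [
        {col: binding[col]["value"] for col in columns if col in binding}
        for binding in doc["results"]["bindings"]
    ]


def run_query(endpoint=ENDPOINT, query=QUERY):
    """Send the query over the network and return one dict per result row."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as response:
        return bindings_to_rows(json.load(response))
```

Calling `run_query()` needs network access and returns whatever DBpedia holds at that moment, so the rows may differ from those in the KNIME workflow.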

Another interesting node in the SPARQL category is the SPARQL List Graph Names node. It would give us the names and URIs of all RDF graphs available at the DBpedia endpoint. This node is shown in the example workflow just for the sake of explanation; we did not really need the list of RDF graphs in our workflow.

Retrieving latitude and longitude from Google Maps Geocoding via REST Service

The whole operation of retrieving information from the semantic web is now concluded. With two nodes and a SPARQL query, we retrieved a list of Italian cities and their number of sun hours in July from the semantic web. Strictly speaking, the goal of this blog post has been reached.

However, for the sake of completeness and a prettier presentation of the results, we will spend a bit more time displaying the cities on a geographical map and coloring them with a gray-to-yellow heat map, where gray is for the hours of cloud and yellow, of course, for the hours of sun.

A few String Manipulation nodes are used to clean up the data table, removing quotation marks and other residual URL characters.
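The exact expressions used in the String Manipulation nodes are not listed here, so the following Python sketch only illustrates the kind of cleanup involved. It assumes the city arrives as a DBpedia resource URI, possibly wrapped in quotes or angle brackets, and `city_name_from_uri` is a hypothetical helper name.

```python
from urllib.parse import unquote


def city_name_from_uri(uri: str) -> str:
    """Turn a DBpedia resource URI into a readable city name."""
    # Drop surrounding whitespace, quotation marks, and angle brackets.
    cleaned = uri.strip().strip('"<>')
    # Keep only the last path segment, e.g. "Reggio_Calabria".
    last_segment = cleaned.rstrip("/").rsplit("/", 1)[-1]
    # Decode percent-escapes and replace underscores with spaces.
    return unquote(last_segment).replace("_", " ")


print(city_name_from_uri('"http://dbpedia.org/resource/Reggio_Calabria"'))
```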

Latitude and longitude for each city are retrieved via REST service from the Google Maps Geocoding API. As for all Google API services, a project needs to be created in the Google API console and enabled for the Google Maps Geocoding service. After that, an API key can be created to access the service.

At this point, allow us a few minutes to digress from the topic of this blog post and show off the new REST category, also available in KNIME Analytics Platform 3.2 and later versions. In particular, the GET Request node and the JSON Path node are enough to access the service and retrieve latitude and longitude values.

This new GET Request node can feed multiple requests to the target REST service. We just supply a list of request URLs, one for each city, and get a data table with latitudes and longitudes in return. The Google API key created when registering for the Google service needs to be provided in the Authentication tab of the configuration window of the GET Request node. The Connection Settings tab of the configuration window requires the data column with the list of request URLs, which were previously prepared in a Table Creator node.
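The request URLs themselves are not spelled out in this post. As a sketch, assuming the publicly documented Geocoding endpoint with its `address` and `key` query parameters, one URL per city could be built like this in Python. Unlike the workflow, which supplies the key in the Authentication tab, this sketch passes it in the URL; `geocode_url` and the `YOUR_API_KEY` placeholder are ours.

```python
from urllib.parse import urlencode

GEOCODE_BASE = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(city: str, api_key: str, country: str = "Italy") -> str:
    """Build one Geocoding request URL for a city name."""
    # urlencode takes care of spaces, commas, and accented characters.
    params = {"address": f"{city}, {country}", "key": api_key}
    return GEOCODE_BASE + "?" + urlencode(params)


# One request URL per city, analogous to the column fed to the GET Request node.
urls = [geocode_url(city, "YOUR_API_KEY") for city in ["Rome", "Reggio Calabria"]]
```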

The JSON Path node retrieves objects in a JSON structure. Each object is identified through a JSON path. The JSON paths can be written manually or, better, built interactively by double-clicking the object of interest in the preview frame in the configuration window. We use a JSON Path node to extract the cities’ latitude and longitude from the JSON structure that was returned by the Google Maps Geocoding REST service.
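In plain Python, the same extraction amounts to walking the nested structure. The sketch below assumes the documented shape of a Geocoding response, where the coordinates sit under `results[0].geometry.location` (as a JSON path, something like `$.results[0].geometry.location.lat`). The sample document is trimmed and its coordinates are illustrative only.

```python
from typing import Optional, Tuple


def extract_lat_lng(doc: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) of the first result, or None if there is none."""
    if doc.get("status") != "OK" or not doc.get("results"):
        return None
    location = doc["results"][0]["geometry"]["location"]
    return float(location["lat"]), float(location["lng"])


# Trimmed, illustrative response; a real one carries many more fields.
sample_response = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 41.9, "lng": 12.5}}}],
}

coordinates = extract_lat_lng(sample_response)
```

Returning `None` for an empty or failed response keeps a single bad city from breaking the rest of the table.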

Visualizing cities as colorful points on map of Italy

After joining all these values together, converting latitudes and longitudes to numbers, and defining the gray-to-yellow heat map, we are ready to display the cities as colored points on a map of Italy, using the OSM Map View node, included in the KNIME Open Street Map (OSM) integration.
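The gray-to-yellow heat map boils down to a linear interpolation between two colors. Here is a small Python sketch of that idea. The RGB values chosen for gray and yellow and the function name `sun_color` are our assumptions, not settings taken from the KNIME color nodes.

```python
GRAY = (128, 128, 128)   # assumed color for the fewest sun hours
YELLOW = (255, 255, 0)   # assumed color for the most sun hours


def sun_color(hours: float, lowest: float, highest: float) -> tuple:
    """Map a sun-hours value to an RGB color between gray and yellow."""
    if highest == lowest:
        fraction = 0.0
    else:
        fraction = (hours - lowest) / (highest - lowest)
    # Clamp so values outside the range still give a valid color.
    fraction = max(0.0, min(1.0, fraction))
    return tuple(
        round(g + fraction * (y - g)) for g, y in zip(GRAY, YELLOW)
    )
```

A city with the minimum number of sun hours comes out plain gray, one with the maximum comes out pure yellow, and everything else falls somewhere in between.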

Et voilà. Figure 1 visualizes the Italian cities on an OSM View and color codes them from gray to yellow on the basis of the corresponding number of sun hours in July.

To summarize: sun hours and Italian cities were retrieved from DBpedia using the new nodes for the semantic web, while latitude and longitude values were retrieved for each city from the Google Maps Geocoding REST service using the GET Request node from the new KNIME REST extension.

Figure 1. Italian cities on an OSM Map View colored from gray to yellow based on the corresponding number of sun hours in July, as retrieved from DBPedia using the KNIME nodes for connection to and data extraction from the semantic web.

 

Conclusions and best wishes for a sunny September wherever you are!

The full workflow is shown in Figure 2, with the semantic web nodes to connect to and query the semantic web, the GET Request node to query the Google Maps Geocoding service, and finally the OSM Map View node for the map visualization in Figure 1.

Remember that this workflow was built with KNIME Analytics Platform 3.2.0, which provides these new nodes for semantic web integration and for multiple queries to REST services.

If you want to try to query the semantic web yourself, the workflow is available on the KNIME EXAMPLES server (you can find it in the top left panel in the KNIME workbench) under /08_Other_Analytics_Types/06_Semantic_Web/11_Semantic_Web_Analysis_Accessing_DBpedia.

Figure 2. This workflow displays Italian cities as points on a map, color-coded from gray to yellow according to their number of sun hours in July (Figure 1). The first step is to query DBpedia for the necessary information using the new semantic web nodes. In parallel, we query the Google Maps Geocoding service for the latitude and longitude of the same Italian cities. After cleaning, joining, converting, and coloring, each city name and its associated color and coordinates are used to populate the map of Italy provided by the OSM Map View node.

 

