Copyright © 2017.... This work is licensed under XXX,
This work is supported by XXX
This document presents a set of best practices for publishing statistics on the Web following the linked data principles. These practices aim to help data publishers model their data and apply common linked data standards. Wide adoption of such practices can increase interoperability among statistical data portals on the Web, and thus facilitate both the integration of relevant datasets and the development of generic software tools that can be reused across different datasets.
The aim of this document is to trigger and contribute to a discussion on the development of a set of best practices for publishing statistics on the Web. It describes the results of research conducted by CERTH in the course of the OpenGovIntelligence project.
The approach that was followed comprises two steps:
Comments are very welcome; please send them to the authors.
International organizations, National Statistical Institutes, and public authorities are increasingly opening up their statistical data through Web portals for others to reuse. Recently, many of these portals (e.g. the Italian and Irish National Statistical Institutes portals, the Flemish Government portal and the Scottish Government portal) have adopted the linked data principles, which facilitate data integration on the Web through the reuse of standard Web technologies such as HTTP, RDF and URIs and standard vocabularies such as the RDF Data Cube (QB), SKOS and XKOS. However, the flexibility of these standards allows statistical data portals to follow different publishing approaches, thus causing interoperability conflicts. The result is the creation of linked statistical data silos that hamper the integration and combination of related statistical data from different portals as well as the development of generic exploitation tools (e.g. visualizations) that can be reused across different portals.
The audience of this document includes linked data experts, statisticians,
Prefix | Namespace IRI | Description |
---|---|---|
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | Resource Description Framework (RDF) |
rdfs | http://www.w3.org/2000/01/rdf-schema# | RDF Schema vocabulary (RDFS) |
qb | http://purl.org/linked-data/cube# | The RDF Data Cube Vocabulary |
skos | http://www.w3.org/2004/02/skos/core# | Simple Knowledge Organization System (SKOS) |
xkos | http://rdf-vocabulary.ddialliance.org/xkos# | Extended Knowledge Organization System |
year | http://reference.data.gov.uk/id/year/ | reference.data.gov.uk time intervals |
qudt | http://qudt.org/vocab/unit# | QUDT units vocabulary |
BP1: Defining the measure
John needs to define a measure, e.g. unemployment. He is wondering what the best way to do this is.
The cube measure represents the phenomenon being observed (e.g. unemployment). In the QB vocabulary, measures are RDF properties of type qb:MeasureProperty. They are defined in the cube structure (qb:DataStructureDefinition) and are used to assign numerical values to the observations (e.g. unemployment = 7.8%). The selection of the measure property may lead to two types of conflicts, homonym and synonym:
Re-use sdmx-measure:obsValue as a measure. In this case homonym conflicts appear, since the same measure property represents different measures. This approach has several disadvantages: i) it requires additional metadata to indicate what is measured, ii) it does not allow multiple measures in one cube and iii) it prevents linking with other cubes. However, sdmx-measure:obsValue facilitates the conversion of existing SDMX data to the QB vocabulary.
Define a new measure property that is also a sub-property of sdmx-measure:obsValue. In this case both homonym and synonym conflicts may appear. Homonym conflicts appear when the new measure property is generic (e.g. count, ratio), while synonym conflicts appear when two different measure properties model the same measure (e.g. unemployment). Moreover, defining the new measure property as a sub-property of sdmx-measure:obsValue is redundant, because it adds no semantics beyond what defining the measure as a qb:MeasureProperty already provides.
BP1.1: Define a new measure that is not a sub-property of sdmx-measure:obsValue. The new measure can be annotated with additional properties (e.g. labels, comments).
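A minimal sketch of BP1.1 in Python, holding triples as plain tuples of prefixed names. The `ex:` prefix, the `ex:unemployment` property, the `ex:dsd` structure and the label and comment texts are hypothetical names chosen for illustration; only the `qb:`, `rdf:` and `rdfs:` terms come from the vocabularies listed above.

```python
# Triples are (subject, predicate, object) tuples of prefixed names.
# "ex:" is a hypothetical publisher namespace, not part of any standard.
MEASURE_TRIPLES = [
    ("ex:unemployment", "rdf:type", "rdf:Property"),
    ("ex:unemployment", "rdf:type", "qb:MeasureProperty"),
    # Annotations that a dedicated measure property makes possible.
    ("ex:unemployment", "rdfs:label", '"Unemployment"@en'),
    ("ex:unemployment", "rdfs:comment", '"Illustrative comment for the measure"@en'),
    # The measure is attached to the cube structure through a component.
    ("ex:dsd", "rdf:type", "qb:DataStructureDefinition"),
    ("ex:dsd", "qb:component", "_:measureSpec"),
    ("_:measureSpec", "qb:measure", "ex:unemployment"),
]


def to_turtle(triples):
    """Render the tuples as one Turtle statement per line (prefix declarations omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def measures_of(dsd, triples):
    """Return the measure properties reachable from a data structure definition."""
    components = {o for s, p, o in triples if s == dsd and p == "qb:component"}
    return sorted(o for s, p, o in triples if s in components and p == "qb:measure")


print(to_turtle(MEASURE_TRIPLES))
print(measures_of("ex:dsd", MEASURE_TRIPLES))
```

Note that no triple links `ex:unemployment` to sdmx-measure:obsValue, in line with BP1.1.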
BP2: Defining the unit
John has already defined unemployment as the measure of his cube. Now he is wondering a) whether or not to include the unit of the measure in the cube, b) what RDF property to use to define the unit, c) where to define the unit and d) what values to assign.
The measure on its own is a plain numerical value. To interpret this value correctly we need to define the unit, i.e. the quantity or increment by which the measure is counted or described. However, it is common practice to omit units of measure (e.g. the portal of the Italian National Institute of Statistics http://datiopen.istat.it/).
In the QB vocabulary, units can be modeled as RDF properties of type qb:AttributeProperty that are defined in the cube structure (qb:DataStructureDefinition). The selection of the unit property may lead to synonym conflicts, since different properties can be used for the same purpose. A common practice to address this challenge is to re-use the sdmx-attribute:unitMeasure property. In practice, however, new properties are defined as sub-properties of sdmx-attribute:unitMeasure, which hinders interoperability. For example, many existing portals follow this approach:
The QB vocabulary allows units to be defined at different levels, i.e. at the qb:DataSet, at the qb:MeasureProperty and at the qb:Observation. By default units are assigned to qb:Observations, but other levels can be selected using the property qb:componentAttachment. In particular the options are:
Define the unit at the qb:DataSet. In this case querying the available units is easy, since units can be identified directly from the structure of the cube. However, if multiple units are defined in the same cube, there is no way to map each unit to the values at the qb:Observation. Another disadvantage of this approach is that qb:Observations cannot easily be re-used in another context, because they do not contain all the relevant information, i.e. the unit is missing.
Define the unit at the qb:MeasureProperty. In this case querying the available units is also easy, since units can be identified directly from the structure. Moreover, multiple units can be assigned to a cube. However, a different qb:MeasureProperty must be defined for each unit although they represent the same measure, which hinders interoperability, e.g. two separate qb:MeasureProperty must be defined for unemployment, one assigned to percentage and another assigned to the absolute number. In this case, too, qb:Observations cannot easily be re-used in another context, because they do not contain all the relevant information, i.e. the unit is missing.
Define the unit at the qb:Observation. In this case multiple units can be assigned to a cube and there is no need to define a different qb:MeasureProperty for each unit. Moreover, qb:Observations can be re-used in another context since they contain all relevant information. However, querying the available units is not efficient, because the query has to iterate over all qb:Observations.
Comparison matrix of the levels to define the unit:
Criterion | qb:DataSet | qb:MeasureProperty | qb:Observation |
---|---|---|---|
Query optimization | Yes | Yes | No |
Multiple units at one cube | No | Yes | Yes |
Use one qb:MeasureProperty for different units of the same measure | Yes | No | Yes |
Re-use qb:Observations at another context | No | No | Yes |
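The "Query optimization" row can be illustrated with a small Python sketch. It assumes a hypothetical cube `ex:cube` whose observations each carry a unit and whose dataset description also lists the units. The `ex-unit:count` URI and the values 8.1 and 412000 are invented for illustration.

```python
UNIT = "sdmx-attribute:unitMeasure"

# Hypothetical cube: the unit is attached to every observation and, as an
# optional shortcut for querying, also listed on the dataset description.
dataset = {"id": "ex:cube", UNIT: ["qudt:Percent", "ex-unit:count"]}

observations = [
    {"id": "ex:obs1", "qb:dataSet": "ex:cube", "ex:unemployment": 7.8, UNIT: "qudt:Percent"},
    {"id": "ex:obs2", "qb:dataSet": "ex:cube", "ex:unemployment": 8.1, UNIT: "qudt:Percent"},
    {"id": "ex:obs3", "qb:dataSet": "ex:cube", "ex:unemployment": 412000, UNIT: "ex-unit:count"},
]


def units_from_dataset(dataset):
    """Single lookup on the dataset description."""
    return sorted(dataset[UNIT])


def units_from_observations(observations):
    """Full scan: every observation has to be visited to collect the units."""
    return sorted({obs[UNIT] for obs in observations})


print(units_from_dataset(dataset))
print(units_from_observations(observations))
```

Both functions return the same set of units, but the second one grows linearly with the number of observations, which is the cost the comparison matrix refers to.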
The values of the unit property (e.g. percentage) can be expressed using predefined URIs (e.g. http://qudt.org/vocab/unit#Percent). However, the use of different URIs to express the same unit creates synonym conflicts e.g. both http://qudt.org/vocab/unit#Percent and http://statistics.gov.scot/def/concept/measure-units/percentage define percentage.
Three approaches are commonly used to express the unit values:
All approaches are equally used by existing official portals. However, DBpedia is only used for currency units (e.g. http://dbpedia.org/resource/Pound_sterling).
As a result the challenges related to units are the following:
BP2.1: Always include units of measures. The measure on its own is a plain numerical value. To correctly interpret this value we need to include the unit.
BP2.2: Re-use sdmx-attribute:unitMeasure. This attribute can be used directly to assign values that are not part of a code list (e.g. QUDT). However, when annotation with additional properties (e.g. labels, code list) is required, new unit properties that are also sub-properties of sdmx-attribute:unitMeasure should be defined.
BP2.3: Define the unit at the qb:Observation. If querying the available units must be efficient, define the units at the qb:DataSet as well.
BP2.4: Use URIs from the QUDT units vocabulary. It is the most complete vocabulary; however, if QUDT is not sufficient, DBpedia or code lists can be used.
Example:
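A possible illustration of BP2.1 to BP2.4, sketched in Python with triples as tuples of prefixed names. The `ex:` resources are hypothetical. The structure declares sdmx-attribute:unitMeasure as a required attribute, the observation carries a QUDT unit, and a small helper flags observations that violate BP2.1.

```python
TRIPLES = [
    # Structure: the unit is declared as a required attribute component.
    ("ex:dsd", "rdf:type", "qb:DataStructureDefinition"),
    ("ex:dsd", "qb:component", "_:unitSpec"),
    ("_:unitSpec", "qb:attribute", "sdmx-attribute:unitMeasure"),
    ("_:unitSpec", "qb:componentRequired", "true"),
    # Data: one observation that states its own unit (BP2.3) with a QUDT URI (BP2.4).
    ("ex:cube", "rdf:type", "qb:DataSet"),
    ("ex:cube", "qb:structure", "ex:dsd"),
    ("ex:obs1", "rdf:type", "qb:Observation"),
    ("ex:obs1", "qb:dataSet", "ex:cube"),
    ("ex:obs1", "ex:unemployment", "7.8"),
    ("ex:obs1", "sdmx-attribute:unitMeasure", "qudt:Percent"),
]


def observations_missing_unit(triples):
    """Return the observations that carry no sdmx-attribute:unitMeasure value."""
    observations = {s for s, p, o in triples if p == "rdf:type" and o == "qb:Observation"}
    with_unit = {s for s, p, o in triples if p == "sdmx-attribute:unitMeasure"}
    return sorted(observations - with_unit)


print(observations_missing_unit(TRIPLES))
```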
BP3: Defining multiple units per measure
John realizes that the data he wants to publish contain unemployment both as a percentage and as an absolute number, so he needs to include both units. Now he is wondering: i) whether to include both units at the same cube or define separate cubes for each unit and ii) where to define the unit e.g. at the structure or at the observation.
It is a common practice to have data with multiple units for the same measure (e.g. percentage and absolute number for unemployment). In this case there are two modeling approaches that can be followed and are valid according to the QB vocabulary:
The selection of the approach to be followed is affected by the level where the unit is defined as described in challenge 2.3. For example, if multiple units for the same measure are defined at the qb:DataSet, then there is no way to map the units with the values at the qb:Observation. Thus, the definition of multiple units at the qb:DataSet should be avoided.
BP3.1: Publish one cube with multiple units and define the unit at each qb:Observation. Conceptually it is preferable to keep all related units of the same measure in the same cube. If querying the available units must be efficient, define the units at the qb:DataSet as well. For example:
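A hedged Python sketch of this pattern follows. The dimension properties `ex:refArea` and `ex:refPeriod`, the `ex:Belgium` resource, the `ex-unit:count` unit and the absolute number 412000 are hypothetical. The sketch keys each value by its dimension values plus the unit, since the dimension values alone no longer identify a single observation in this pattern.

```python
UNIT = "sdmx-attribute:unitMeasure"
DIMENSIONS = ("ex:refArea", "ex:refPeriod")

# Two observations for the same cell of the cube, one per unit.
observations = [
    {"ex:refArea": "ex:Belgium", "ex:refPeriod": "year:2016",
     "ex:unemployment": 7.8, UNIT: "qudt:Percent"},
    {"ex:refArea": "ex:Belgium", "ex:refPeriod": "year:2016",
     "ex:unemployment": 412000, UNIT: "ex-unit:count"},
]


def index_by_cell_and_unit(observations):
    """Map (dimension values, unit) to the measured value, rejecting duplicates."""
    index = {}
    for obs in observations:
        cell = tuple(obs[d] for d in DIMENSIONS)
        key = (cell, obs[UNIT])
        if key in index:
            raise ValueError(f"duplicate observation for {key}")
        index[key] = obs["ex:unemployment"]
    return index


index = index_by_cell_and_unit(observations)
print(index[(("ex:Belgium", "year:2016"), "qudt:Percent")])
```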
See the appendix for other alternatives that address the same challenge.
BP4: Defining multiple measures
John also wants to publish data about poverty in Belgium. He is wondering whether to publish the data about unemployment and poverty in the same cube or in separate cubes. If both measures are included in the same cube, he is also wondering what the best way to do so is, considering that the measures have multiple units (percentage and absolute number).
It is a common practice to have data with multiple measures (e.g. unemployment and poverty). In this case, there are two modeling approaches that can be followed and are valid according to the QB vocabulary:
The first approach is covered by BP2 (measure with one unit) and by BP3 (measure with multiple units).
Considering the second approach (multiple measures at one cube), the QB vocabulary proposes two practices to represent multiple measures:
Multi-measure observations: Define multiple qb:MeasureProperty in the data structure definition and attach all measures to every qb:Observation. By defining all the measures in one observation, this practice reduces the size of the produced cube. However, a limitation of this practice is that it does not allow an attribute property (e.g. unit) to be associated with a single measurement. An attribute property attached to the observation applies to all measurements, thus this practice cannot represent multiple measures with multiple units.
Measure dimension: Define multiple qb:MeasureProperty at the data structure definition, but restrict observations to having a single measure. This is achieved by defining an extra dimension, the qb:measureType, to denote which particular qb:MeasureProperty is included at the qb:Observation. The use of a single measure at observations produces a large number of observations, thus increasing the size of the produced cube. However, this practice enables the definition of multiple units and multiple measures.
Comparison matrix of the practices to represent multiple measures:
Criterion | Multi-measure observations | Measure dimension |
---|---|---|
Size | Small | Large |
Support of multiple units and measures | No | Yes |
Support of multiple measures with the same unit | Yes | Yes |
If the data have multiple measures, it is common to publish cubes with multiple measures only when the measures are closely related to a single observational event. However, the approach to be followed is up to the data cube publisher.
BP4.1: If you model multiple measures as multiple cubes with one measure each, then see i) BP2 if the measure has one unit or ii) BP3 if the measure has multiple units.
BP4.2: If you model multiple measures in one cube, then use the measure dimension approach (i.e. observations with a single measure) and define the unit at each observation (as already explained in BP3).
Example:
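The measure dimension approach of BP4.2 can be sketched in Python as a transformation from one record holding several measures to one observation per measure. The record layout, the `ex:` names, the `ex-unit:count` unit and the poverty figure are assumptions made for illustration; `qb:measureType` is the dimension defined by the QB vocabulary.

```python
UNIT = "sdmx-attribute:unitMeasure"

# Hypothetical input: one record per cell, holding (value, unit) for each measure.
record = {
    "ex:refArea": "ex:Belgium",
    "ex:refPeriod": "year:2016",
    "values": {
        "ex:unemployment": (7.8, "qudt:Percent"),
        "ex:poverty": (1700000, "ex-unit:count"),
    },
}


def to_measure_dimension(record):
    """Emit one single-measure observation per measure, tagged with qb:measureType."""
    observations = []
    for measure, (value, unit) in record["values"].items():
        observations.append({
            "ex:refArea": record["ex:refArea"],
            "ex:refPeriod": record["ex:refPeriod"],
            "qb:measureType": measure,  # says which measure this observation holds
            measure: value,
            UNIT: unit,                 # the unit now applies to this measure only
        })
    return observations


for obs in to_measure_dimension(record):
    print(obs)
```

Because each observation carries exactly one measure, the unit attribute is unambiguous, at the price of one observation per measure and cell.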
BP5: Defining dimension properties
John has already defined the measures and the units of his cube. Now he needs to define the dimensions including time, geography, age and gender. He is wondering what RDF properties to use for the dimensions.
Dimensions (e.g. geography, time) provide contextual information for the measures of the cube. In the QB vocabulary, dimensions are RDF properties of type qb:DimensionProperty and are defined in the cube structure (qb:DataStructureDefinition). The selection of the RDF property to represent a dimension may lead to synonym conflicts, since different properties can be used for the same purpose. A common practice to address this challenge is to re-use dimensions defined by the linked data version of SDMX (e.g. sdmx-dimension:refArea).
The time, geography, age and gender dimensions are very common in statistical data, thus the linked data version of SDMX has already defined qb:DimensionProperty to express them, i.e. sdmx-dimension:refPeriod, sdmx-dimension:refArea, sdmx-dimension:age and sdmx-dimension:sex. These SDMX dimensions are not associated with their valid potential values (i.e. code list or range). A special case is the sdmx-dimension:sex dimension, because a code list has already been defined and associated with it.
In practice, however, new dimensions are defined as sub-properties of the SDMX dimensions, which hinders interoperability. An advantage of this approach is that it enables the addition of extra annotations to the dimensions (e.g. label, code list, range).
BP5.1 (Applies to time, geography and age): Define a new dimension that is also a sub-property of the corresponding SDMX dimension. Define the valid potential values of the new dimension (see BP6 and BP7).
BP5.2 (Applies to gender): Re-use sdmx-dimension:sex when the associated code list covers the modeling needs. If it does not cover the modeling needs, then define a new dimension and its valid potential values (see BP6 and BP7).
BP6: Defining dimension values
John needs to associate the cube dimensions with the valid potential values. He is wondering what is the best way to do so.
Dimension values can be either data types (e.g. xsd:dateTime) or URIs. Currently, many reference datasets have been defined to populate dimensions (e.g. time). However, some of them (e.g. reference.data.gov.uk) are not modeled as skos:ConceptScheme, skos:Collection or qb:HierarchicalCodeList.
The QB vocabulary allows two complementary approaches for defining the valid potential values of dimensions.
Use the property rdfs:range to define the type of valid values of a qb:DimensionProperty in the usual RDF manner. For example, the values of the time dimension can be constrained by setting xsd:dateTime as the rdfs:range.
Use the property qb:codeList to associate a qb:DimensionProperty with a code list (i.e. potential values). In statistical datasets it is common for values to be encoded using a code list and it is useful to easily identify the overall code list with a URI. The values of the qb:codeList can be a skos:ConceptScheme, skos:Collection or qb:HierarchicalCodeList. In such a case the rdfs:range can be a skos:Concept. According to the QB vocabulary a useful design pattern is to define an rdfs:Class whose members are all the skos:Concepts within a particular code list. In that way the rdfs:range can be made more specific.
BP6.1: Always define the rdfs:range of a qb:DimensionProperty.
BP6.2: Define the qb:codeList of a qb:DimensionProperty when the code list is modeled as a skos:ConceptScheme, skos:Collection or qb:HierarchicalCodeList. In this case define the rdfs:range as skos:Concept (for the way to define a new code list see BP9).
The qb:codeList indicates all the potential values of a qb:DimensionProperty. However, a cube commonly uses only some of these values. For example, a code list may contain values for the geography of Europe, while the cube uses only values for Greece. In this case there is no way to retrieve only the used values from the cube structure. They can only be retrieved by iterating over all the cube observations, which is a time-consuming process.
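BP6.1 and BP6.2 can be sketched in Python as follows, with triples as tuples of prefixed names. The dimensions `ex:refArea` and `ex:refPeriod` and the code list `ex:areas` are hypothetical; the helpers only illustrate how the two declarations could be checked.

```python
TRIPLES = [
    # Coded dimension: range skos:Concept plus an explicit code list (BP6.2).
    ("ex:refArea", "rdf:type", "qb:DimensionProperty"),
    ("ex:refArea", "rdfs:subPropertyOf", "sdmx-dimension:refArea"),
    ("ex:refArea", "rdfs:range", "skos:Concept"),
    ("ex:refArea", "qb:codeList", "ex:areas"),
    ("ex:areas", "rdf:type", "skos:ConceptScheme"),
    # Typed dimension: only a datatype range is declared (BP6.1).
    ("ex:refPeriod", "rdf:type", "qb:DimensionProperty"),
    ("ex:refPeriod", "rdfs:subPropertyOf", "sdmx-dimension:refPeriod"),
    ("ex:refPeriod", "rdfs:range", "xsd:date"),
]


def dimensions_without_range(triples):
    """Return dimension properties that violate BP6.1 (no rdfs:range)."""
    dims = {s for s, p, o in triples if p == "rdf:type" and o == "qb:DimensionProperty"}
    ranged = {s for s, p, o in triples if p == "rdfs:range"}
    return sorted(dims - ranged)


def code_list_of(dimension, triples):
    """Return the code list attached to a dimension, or None if there is none."""
    return next((o for s, p, o in triples if s == dimension and p == "qb:codeList"), None)


print(dimensions_without_range(TRIPLES))
print(code_list_of("ex:refArea", TRIPLES))
```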
BP7: Defining values of common dimensions
John now knows how to define the values of his cube’s dimensions, including time, gender, geography and age. However, he is still wondering: i) whether to use data types or URIs and ii) in the case of URIs, what code lists to use.
The values of time dimension can be either periods of time (e.g. 2016) or specific points in time (e.g. 01/01/2016) and thus can be represented either as URIs (e.g. http://reference.data.gov.uk/id/year/2016) or as data-types (e.g. "2016-01-01"^^xsd:date). The selection of the approach to represent the time values may lead to synonym conflicts since different values can be used for the same purpose. Four approaches are commonly used to express the time values:
The values of the geography and age dimensions are usually represented as URIs. The selection of the URI to represent the values may lead to synonym conflicts, since different values can be used for the same purpose. Currently there are no standardized universal code lists that can be used for these dimensions. However, for the geography dimension, official location-specific code lists have been defined for specific countries, regions etc.
The values of the gender dimension are usually represented as URIs. The linked data version of SDMX has already defined a code list for gender including the values sdmx-code:sex-F (female), sdmx-code:sex-M (male), sdmx-code:sex-U (undefined), sdmx-code:sex-N (not applicable) and sdmx-code:sex-T (total). In some cases however, the SDMX code list cannot cover the modeling needs e.g. more notions of sex are needed like hermaphroditism, transgender, asexual.
This challenge is addressed by BP5.2
BP7.1a: In case you need to describe a specific point in time then define a new dimension that is also sub-property of sdmx-dimension:refPeriod and set as rdfs:range xsd:date.
BP7.1b: In case you need to describe a period of time then define a new dimension that is also sub-property of sdmx-dimension:refPeriod and set as rdfs:range the interval:Interval (reference.data.gov.uk uses Intervals to define years). However, if reference.data.gov.uk is not sufficient, then a code list can also be used (see BP9).
BP7.2 (Applies to geography and age): Define a new dimension that is also sub-property of the sdmx-dimension:refArea or sdmx-dimension:age respectively. Define the rdfs:range and/or qb:codeList of this dimension (see BP6). If a code list or reference dataset that covers the modeling needs exists, then it should be re-used. Otherwise a new code list should be created (see BP9).
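For the time values of BP7.1a and BP7.1b, a small Python sketch can show the two representations side by side. The helper names are hypothetical; the URI pattern and the literal form are the ones quoted above.

```python
from datetime import date

YEAR_BASE = "http://reference.data.gov.uk/id/year/"


def year_uri(year):
    """Period of time (BP7.1b): a reference.data.gov.uk year interval URI."""
    return f"<{YEAR_BASE}{year}>"


def date_literal(day):
    """Point in time (BP7.1a): an xsd:date typed literal in Turtle syntax."""
    return f'"{day.isoformat()}"^^xsd:date'


print(year_uri(2016))
print(date_literal(date(2016, 1, 1)))
```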
BP8: Defining single value dimensions
The cube includes data only for one year i.e. 2016. John is wondering: i) whether to include this single value at the dataset or not and ii) in case of including, where to define it.
Some datasets contain data with a single value for a dimension (e.g. census data has a single value for the time dimension). The QB vocabulary enables the definition of the single value at different levels.
Include the single value at the qb:DataSet. In this case the single value is easily identified directly from the structure. However, this does not allow the future addition of observations with a different value for that particular dimension. Another disadvantage of this approach is that qb:Observations cannot easily be re-used in another context, because they do not contain all the relevant information, i.e. the dimension with the single value is missing.
Create a qb:Slice for the single value. In this case the single value is easily identified from the structure. It is also possible to add observations with a different value by creating another qb:Slice for that value. A disadvantage of this approach is that it imposes the extra burden of defining qb:Slices when publishing data cubes. Moreover, qb:Observations cannot easily be re-used in another context, because they do not contain all the relevant information, i.e. the dimension with the single value is missing.
Include the single value at every qb:Observation. The advantage of defining the single value at this level is simplicity and flexibility during future data maintenance. This practice has the cost of an increased number of triples, but it allows more observations with a different value for that particular dimension to be added to the same dataset. Another advantage is the easy re-use of qb:Observations in another context.
BP8.1: Always include the single value of a dimension at each qb:Observation.
Example
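A minimal Python sketch of BP8.1: the single value of the time dimension is copied to every observation. The `ex:` names and the second unemployment value are invented for illustration.

```python
def attach_single_value(observations, dimension, value):
    """Return copies of the observations with the shared dimension value set on each."""
    return [{**obs, dimension: value} for obs in observations]


# Hypothetical observations that so far omit the time dimension.
observations = [
    {"ex:refArea": "ex:Brussels", "ex:unemployment": 7.8},
    {"ex:refArea": "ex:Antwerp", "ex:unemployment": 6.4},
]

complete = attach_single_value(observations, "ex:refPeriod", "year:2016")
for obs in complete:
    print(obs)
```

Each resulting observation is self-contained, so observations for another year can later be added to the same dataset without restructuring it.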
BP9: Defining code lists
John has already defined dimension properties and decided what code lists to use for some of the dimensions (e.g. time, gender). However, there are no appropriate code lists for all the dimensions (e.g. age, geography), thus John has to define them.
In some cases there are no appropriate code lists that cover the modeling needs of the cube. Thus, new code lists should be defined. The QB vocabulary recommends using SKOS to model the code lists. In particular, it recommends to represent the individual code values using skos:Concept and the overall set of values using skos:ConceptScheme or skos:Collection.
It is a common practice to include only related values in a code list. In practice, however, unrelated values are sometimes included in the same code list (e.g. Brussels, Wallonia, 20-30, 30-45).
It is common for code lists to include data with a hierarchical structure (e.g. geographical divisions). Thus, they can be represented using hierarchies that consist of generalization/specialization relations (e.g. Brussels isPartOf Belgium) and levels (e.g. country, region, city). In practice however, sometimes the hierarchical data are represented flat in a code list without generalization/specialization relations.
The selection of the properties to represent generalization/specialization relations and levels may lead to synonym conflicts, since different properties can be used for the same purpose. In linked statistical data the following approaches exist to represent hierarchical code lists:
Use SKOS vocabulary. This approach defines only generalization/specialization relations i.e. skos:broader and skos:narrower between skos:Concepts.
Use XKOS vocabulary. XKOS is an extension of SKOS that enables the modeling of both generalization/specialization relations (i.e. xkos:isPartOf, xkos:hasPart) and hierarchical levels (i.e. xkos:ClassificationLevel) of skos:Concepts.
Use QB vocabulary. In this case a qb:HierarchicalCodeList should be defined to represent the overall set of values (similar to a skos:ConceptScheme). The qb:hierarchyRoot plays the same role as skos:hasTopConcept, and the value of qb:parentChildProperty plays the same role as skos:narrower. This approach is provided for cases where the terms are not available as SKOS but are available in some other RDF representation suitable for reuse.
Sometimes, case-specific properties are exploited to express hierarchies. This approach enables the definition of generalization/specialization relations not covered by SKOS and XKOS (e.g. administeredBy vs within). However, without using any standard vocabulary the interpretation of the data is not always self descriptive.
"Total" values are often used in dimensions (e.g. male, female and total). They facilitate the processing of the data because there is no need to compute them when required. However, "total" values should somehow be differentiated from other values, otherwise there is a risk of misinterpreting the data (e.g. counting the values twice). In practice, however, "total" values are often "incognito" in a code list, e.g. the linked data version of SDMX contains the values sdmx-code:sex-F, sdmx-code:sex-M and sdmx-code:sex-T (total) without any way to distinguish the "total" value.
The following practices have been proposed to differentiate "total" values.
Use a hierarchy and define total values at the top of the hierarchy. The practice used to express the hierarchy depends on challenge 9.2.
Put totals on slices. The total value for a dimension is represented by leaving that dimension out of the slice. The example shows a slice that defines the total unemployment (35.9%) for all people (male and female) aged 15-24 in Brussels in 2016.
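The slice practice might look as follows in a Python sketch. This is only one possible reading: the `ex:` properties and codes are hypothetical, and attaching the measure value directly to the slice is an assumption of the sketch rather than something the QB vocabulary prescribes.

```python
CUBE_DIMENSIONS = ["ex:refArea", "ex:age", "ex:refPeriod", "ex:sex"]

# Hypothetical slice: it fixes area, age and period and carries the total value.
slice_total = {
    "rdf:type": "qb:Slice",
    "ex:refArea": "ex:Brussels",
    "ex:age": "ex:age-15-24",
    "ex:refPeriod": "year:2016",
    "ex:unemployment": 35.9,
    "sdmx-attribute:unitMeasure": "qudt:Percent",
    # "ex:sex" is deliberately absent: the value covers all of its codes.
}


def aggregated_dimensions(slice_, cube_dimensions):
    """Return the cube dimensions the slice leaves out, i.e. the ones it totals over."""
    return [d for d in cube_dimensions if d not in slice_]


print(aggregated_dimensions(slice_total, CUBE_DIMENSIONS))
```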
BP9.1: Use SKOS to model code lists (recommended by QB vocabulary). Specifically, represent individual code values using skos:Concept and the overall set of values using skos:ConceptScheme or skos:Collection. Always define a separate code list for each descriptive category of the data (e.g. age, geography).
Example
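A sketch of BP9.1 in Python, generating the SKOS triples of a flat code list. The scheme URI `ex:ages`, the concept URIs and the helper name are hypothetical; the age ranges are the ones mentioned above.

```python
def code_list_triples(scheme, label, codes):
    """Build SKOS triples for a code list: one concept scheme plus one concept per code."""
    triples = [
        (scheme, "rdf:type", "skos:ConceptScheme"),
        (scheme, "rdfs:label", f'"{label}"@en'),
    ]
    for uri, code_label in codes:
        triples += [
            (uri, "rdf:type", "skos:Concept"),
            (uri, "skos:prefLabel", f'"{code_label}"@en'),
            (uri, "skos:inScheme", scheme),
        ]
    return triples


# A separate code list for one descriptive category (age); geography would get its own.
AGE_CODE_LIST = code_list_triples(
    "ex:ages", "Age groups",
    [("ex:age-20-30", "20-30"), ("ex:age-30-45", "30-45")],
)

for s, p, o in AGE_CODE_LIST:
    print(f"{s} {p} {o} .")
```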
BP9.2: In the case of hierarchical data, always use hierarchies to describe them. SKOS should be preferred when the hierarchies are simple. When levels are fully separated and depth is a meaningful concept, XKOS is appropriate. Finally, when there is a need to express relations not covered by SKOS or XKOS (e.g. administeredBy vs within), the QB vocabulary should be preferred.
BP9.3: Always define total values for the dimensions (where applicable). Define them at the top of a hierarchy (see BP9.2).
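To illustrate why BP9.3 places totals at the top of a hierarchy, the Python sketch below uses a hypothetical code list in which the female and male codes point to a total code through skos:broader. The `ex:` codes and the counts are invented. With the hierarchy available, a consumer can recognise the total and avoid counting it twice.

```python
# child -> parent, i.e. the object of skos:broader for each code (hypothetical codes).
BROADER = {
    "ex:sex-F": "ex:sex-T",
    "ex:sex-M": "ex:sex-T",
}

# Invented counts per code, including the published total.
values = {"ex:sex-F": 120, "ex:sex-M": 130, "ex:sex-T": 250}


def is_total(code):
    """A code is treated as a total when other codes declare it as their broader concept."""
    return code in set(BROADER.values())


def sum_without_double_counting(values):
    """Sum only the leaf codes, skipping the codes identified as totals."""
    return sum(v for code, v in values.items() if not is_total(code))


print(sum(values.values()))                 # naive sum, counts the total twice
print(sum_without_double_counting(values))  # sum over leaf codes only
```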
BP3: Defining multiple units per measure
Alternative 1: Publish multiple cubes with one unit and define the unit at the qb:DataSet. For example:
Alternative 2: Publish one cube with multiple units and define the unit at the qb:MeasureProperty. For example: