GSIM implementation in the Istat Metadata System:

focus on structural metadata and on the joint use of GSIM and SDMX

Mauro Scanu (scanu@istat.it)

ISTAT – Italian National Institute of Statistics

Geneva, UNECE - 5 May 2015

Background

Istat has been disseminating data and metadata through a Single Exit Point since 2009. The SDMX registry was the first centralized structural metadata system in Istat. It consists of structural metadata related to disseminated data.

In 2010 Istat decided to build a centralized system (SUM) that contains structural metadata for all the data produced by Istat, from data collection up to data dissemination. The system aims to ease:

data retrieval

metadata harmonization/integration

data traceability

In the meantime GSIM began to be discussed: GSIM claims to be the first internationally endorsed reference framework for statistical information.

The idea was that SUM should be GSIM compliant. How can the (mandatory) SDMX infrastructure be combined with a GSIM-compliant metadata system?

In 2013 UNECE organized workgroups on GSIM and other standards. One of them compared GSIM and SDMX (as well as DDI).

Mauro Scanu – Geneva, UNECE - 5 May 2015

Focus of this talk

1. Relationship between SDMX and GSIM

2. What information should be available for a complete and correct definition of the meaning of data, and how to structure it in a single, uniform way for every theme

SUM and GSIM

SUM adopted GSIM terminology and definitions.

Most of the concepts came from the GSIM groups Concepts and Structures, and some from the Production group.

Apologies: I do not describe this graph or its differences from GSIM.

SDMX and statistics (and GSIM)

[Diagram, with panels labelled SDMX and GSIM, listing the following items]

SDMX artefacts: Concept schemes; Code lists; Data structure definitions (Primary measure, Dimensions, Attributes, Time dimension, Frequency dimension, Measure dimension (SDMX version 2.1))

Statistical content: Statistical variables; Time-related concepts; Other concepts (operative/transformation); Data content

Lists: Classifications; Time-related code lists (CL_FREQ, …); Lists of transformation methods (CL_ADJUSTMENT, …); List of data contents

COMMENT: GSIM is useful for harmonizing the content of DSDs

Data Content 1

GSIM defines everything except “data content”.

The problem is: what is the complete set of information that defines the meaning of a figure in a table?

In SUM, we introduced the “data content” concept because

it plays the role of the “title” in the old data tables

it is defined according to lines 47-50 of the GSIM specification: each data item is the result of a Process Step through the application of a Process Method to the necessary Inputs. Hence it is modelled by specifying

Statistical Program and Statistical Program Cycle

Process Step (phase)

Process Method

Inputs

Examples:

Monthly average household expenditures

Household budget survey

Time dimension and Freq

Dissemination

Average

Validated sample of the HBS

Population: households

Num. variable: monthly expenditures

Activity rate

Labour Force survey

Time dimension and Freq

Dissemination

Ratio

DC1: Active pop

DC2: Pop

DC1: Active pop

DC2: Pop
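The “data content” model above can be sketched as a small record type. This is only an illustration and not the SUM schema: the class `DataContent`, its field names, and the way the two slide examples are assigned to fields are assumptions made here for readability.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataContent:
    """Hypothetical record for one "data content" item (not the actual SUM schema)."""
    name: str
    statistical_program: str
    program_cycle: str            # how the Statistical Program Cycle is identified
    process_step: str
    process_method: str
    inputs: Tuple[str, ...] = ()  # what the process method is applied to


# Values copied from the slide examples; their assignment to fields is an assumption.
hbs_expenditure = DataContent(
    name="Monthly average household expenditures",
    statistical_program="Household budget survey",
    program_cycle="Time dimension and Freq",
    process_step="Dissemination",
    process_method="Average",
    inputs=(
        "Validated sample of the HBS",
        "Population: households",
        "Num. variable: monthly expenditures",
    ),
)

activity_rate = DataContent(
    name="Activity rate",
    statistical_program="Labour Force survey",
    program_cycle="Time dimension and Freq",
    process_step="Dissemination",
    process_method="Ratio",
    inputs=("DC1: Active pop", "DC2: Pop"),
)


def describe(dc: DataContent) -> str:
    """Render a one-line, title-like description of a data content."""
    return f"{dc.name}: {dc.process_method} ({dc.process_step}), {dc.statistical_program}"
```

Calling `describe(activity_rate)` gives a title-like string built from the same fields for every item, which is the role the slide assigns to a data content.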

Data Content 2

Data Content feeds the GSIM concept Measure in a Data Structure.

We are not aligned with GSIM on this point: “measures correspond to Represented Variables with uncoded Value Domains (Described Value Domains)”

SUM maintains a code list “Data Content” where each item contains all the previous details.

In this way a user can find the meaning of the data in a hypercube in a single place.

Furthermore a data producer has a form to complete for describing a new data content.

Any data producer should describe the “Data Content” according to the same “model”.

If the “Data Content” item has less information than needed, further dimensions or attributes should be included in a data structure in order to be complete.

If the “Data Content” has more information than needed, some dimensions would become useless.

These two deviations from a standard data content correspond to a content “stove pipe” and to the massive use of mappers.
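As a rough sketch of the “same model for every item” idea, a code list can be checked for items that leave some details unspecified. The codes, the dictionary layout, and the names `REQUIRED_DETAILS` and `missing_details` are hypothetical and do not describe the actual SUM implementation.

```python
# Details every "Data Content" item is expected to carry (names are assumptions).
REQUIRED_DETAILS = (
    "statistical_program",
    "program_cycle",
    "process_step",
    "process_method",
    "inputs",
)


def missing_details(item: dict) -> list:
    """Return the required details that an item omits or leaves empty."""
    return [key for key in REQUIRED_DETAILS if not item.get(key)]


# Toy code list with hypothetical codes; the second item is deliberately incomplete.
data_content_code_list = {
    "DC_ACTIVITY_RATE": {
        "statistical_program": "Labour Force survey",
        "program_cycle": "Time dimension and Freq",
        "process_step": "Dissemination",
        "process_method": "Ratio",
        "inputs": ["DC1: Active pop", "DC2: Pop"],
    },
    "DC_INCOMPLETE": {
        "statistical_program": "Labour Force survey",
        "process_method": "Ratio",
    },
}

# Anything reported here would have to be supplied by extra dimensions
# or attributes in the data structure.
gaps = {code: missing_details(item) for code, item in data_content_code_list.items()}
```

An empty list means the item alone defines the data content; a non-empty list names what the data structure would still have to supply.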

Conclusions

SDMX has been a real success for the harmonization of the IT infrastructure for the exchange of data and metadata.

However, SDMX is a real puzzle for those who have (only) a statistical background (in terms of concepts, organization of a data structure, …).

GSIM could greatly help statisticians use SDMX by assigning a concrete statistical role to concepts before their use in a DSD.

The use of GSIM concepts before their use in a DSD helps in harmonizing the description of a data cube.

Among the concepts already available in GSIM, an additional concept (the “Data Content”) could be useful in order to feed, in a standard and complete way, a Measure of a Data Structure (of macrodata).

This is what we have done in Istat. The corporate DWH (I.Stat) has almost 3000 “data contents”. In SUM it is possible to search data through different facets:

Statistical program

Reference population of the data

Numerical variables used for the production of a data content

Categorical variables used to cross cut data contents

Categories of a categorical variable used in data structures
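A faceted search of this kind can be sketched as a filter over records that carry one set of values per facet. The record layout, the facet keys, and the function `search` are assumptions for illustration; the values are taken from the slide examples, and facets not stated there are simply left out.

```python
# Toy catalogue: each record lists the values it exposes for each facet.
RECORDS = [
    {
        "name": "Monthly average household expenditures",
        "facets": {
            "statistical_program": {"Household budget survey"},
            "population": {"households"},
            "numerical_variable": {"monthly expenditures"},
        },
    },
    {
        "name": "Activity rate",
        "facets": {
            "statistical_program": {"Labour Force survey"},
        },
    },
]


def search(records, **criteria):
    """Return the names of records that match every requested facet value."""
    hits = []
    for record in records:
        facets = record["facets"]
        # A record matches only if each requested value appears in that facet.
        if all(value in facets.get(facet, set()) for facet, value in criteria.items()):
            hits.append(record["name"])
    return hits


by_program = search(RECORDS, statistical_program="Labour Force survey")
by_population = search(RECORDS, population="households")
```

Combining several keyword arguments narrows the result, since a record must satisfy all of them.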

Furthermore it is easy to reconstruct the relationships between statistical programs (reuse of data for the computation of other data).
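One way to picture this reconstruction: if every data content records its producing program and its input data contents, the reuse links between programs follow directly. The catalogue below is a toy with made-up names, and `program_reuse` is a hypothetical helper, not a description of how SUM does it.

```python
from collections import defaultdict

# Toy catalogue: data content -> (producing program, input data contents).
# All names are made up for illustration.
CATALOGUE = {
    "dc_a": ("program_1", []),
    "dc_b": ("program_2", []),
    "dc_ratio": ("program_3", ["dc_a", "dc_b"]),
}


def program_reuse(catalogue):
    """Map each program to the set of other programs whose data contents it reuses."""
    reuse = defaultdict(set)
    for program, inputs in catalogue.values():
        for dc in inputs:
            source_program = catalogue[dc][0]
            if source_program != program:
                reuse[program].add(source_program)
    return dict(reuse)


links = program_reuse(CATALOGUE)
```

Here `links` reports that `program_3` reuses data produced by `program_1` and `program_2`.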

This year we are including microdata (for the data collection and validation steps).
