Journal of Library and Information Science (圖書館學與資訊科學) 43(1):7 – 46 (April 2017)
Reuse of Structured Data: Semantics,
Linkage, and Realization
Andrea Wei-Ching Huang
Project Manager (Research)
Institute of Information Science, Academia Sinica, Taiwan
E-mail: andreahg@iis.sinica.edu.tw
Cheng-Jen Lee
Research Assistant
Institute of Information Science, Academia Sinica, Taiwan
E-mail: cjlee@iis.sinica.edu.tw
Tyng-Ruey Chuang
Associate Research Fellow
Institute of Information Science, Academia Sinica, Taiwan
E-mail: trc@iis.sinica.edu.tw
Keywords: CKAN; Data Provenance; Data Quality; Knowledge Base; Linked Open
Data (LOD); Ontology; Semantic Representation
【Abstract】
To increase the reuse value of existing datasets, it is becoming general practice to add
semantic links among the records in a dataset, and to link these records to external resources. The
enriched datasets are published on the Web for both humans and machines to consume and re-purpose.
In this paper, we make use of publicly available structured records from a digital archive catalogue, and
we demonstrate a principled approach to converting the records into semantically rich and interlinked
resources for all to reuse. While exploring the various issues involved in reusing and
re-purposing existing datasets, we review recent progress in the field of Linked Open Data (LOD),
and examine twelve well-known knowledge bases built with a Linked Data approach. We also discuss
the general issues of data quality, metadata vocabularies, and data provenance. The concrete outcomes
of this research are the following: (1) a website, data.odw.tw, that hosts more than 840,000
semantically enriched catalogue records across multiple subject areas, (2) a lightweight ontology,
voc4odw, for describing data reuse and provenance, among others, and (3) a set of open source
software tools available to all for performing the kind of data conversion and enrichment we did in this
research. We have used and extended CKAN (The Comprehensive Knowledge Archive Network) as a
platform to host and publish Linked Data. Our extensions to CKAN are open sourced as well. As the
records we have drawn from the original catalogue are released under Creative Commons licenses, the
semantically enriched resources we now re-publish on the Web are free for all to reuse as well.
DOI: 10.6245/JLIS.2017.431/722
 
【Long Abstract】
Introduction
To enhance the reuse value of existing datasets, it is becoming general practice to add
semantic links among the records in a dataset, and to link these records to external resources. The
enriched datasets are published on the Web for both humans and machines to consume and
re-purpose. In this paper, we make use of publicly available structured records from a digital archive
catalogue, and we demonstrate a principled approach to converting the records into semantically rich and
interlinked resources for all to reuse. While exploring the various issues involved in
reusing and re-purposing existing datasets, we review recent progress in the field of Linked Open
Data (LOD), and examine twelve well-known knowledge bases built with a Linked Data approach. We
also discuss the general issues of data quality, metadata vocabularies, and data provenance.
The concrete outcomes of this research are the following: (1) a website that hosts more than
840,000 semantically enriched catalogue records across multiple subject areas, (2) a lightweight
ontology, voc4odw, for describing data reuse and provenance, among others, and (3) a set of open source
software tools available to all for performing the kind of data conversion and enrichment we did in this
research. We have used and extended CKAN (The Comprehensive Knowledge Archive Network) as a
platform to host and publish Linked Data. Our extensions to CKAN are open sourced as well. As the
records we have drawn from the original catalogue are released under Creative Commons licenses,
the semantically enriched resources we now re-publish on the Web are free for all to reuse as well.
Review of Twelve Knowledge Bases
We begin by examining twelve knowledge bases built with a Linked Data approach. Five of them
are built by domain knowledge experts (OpenCyc, Getty Art & Architecture Thesaurus, Getty Thesaurus
of Geographic Names, and Ordnance Survey), six of them are collaborative databases (Freebase, YAGO,
DBpedia, Wikidata, LinkedGeoData, GeoNames), and the last one is about ecological observations
based on expert and community collaborations (Encyclopedia of Life). We further compare datasets
about geospatial entities with controlled vocabularies: Getty TGN, Open Names (Ordnance Survey),
DBpediaPlace, LinkedGeoData, and GeoNames.
To make good reuse of structured data, one first needs to deal with the problem of data quality.
Currently there exist different evaluation criteria, with various techniques for measuring the quality of
information, data, metadata, and Linked Data. We review four papers on data quality and systematically
compare their evaluation criteria. Moreover, data provenance --- contextual metadata about the source
and use of data --- has proven to be fundamental for assessing authenticity, enabling trust, and allowing
reproducibility. Thus, we examine key mechanisms of data provenance before we move forward to
discussing LOD applications.
Practices
We then make use of structured records from a digital archive catalogue, and convert the records into
semantically rich and interlinked resources on the Web. This is realized as a unified Linked Data
catalogue of several digital archive collections. Our work results in an LOD catalogue available to the public at
the website <http://data.odw.tw>. The following five parts are involved in realizing this website. A catalogue
record about the species Pleione formosana is used throughout the paper as an example to demonstrate
the way we model, convert, and represent the semantics of a structured record.
Part 1: Exploring data reuse relations in a shared context -- We review our previous research on the
Relation for Reuse Ontology (R4R). In particular, we provide mechanisms for reusing articles, data, and
code, with some flexibility in encoding provenance and license information.
Part 2: Comparing two different data conversion approaches to providing LOD for an archive
catalogue -- We show two different scenarios: (1) The LOD catalogue is converted directly from a
relational database, and (2) the LOD catalogue is generated from a series of format conversions --- from
XML to CSV, and then to RDF.
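
The second scenario can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' tool chain: the XML element names, the `example.org` namespaces, and the helper functions are all assumptions made for illustration. Each record is flattened into a CSV row, and each non-empty cell is then emitted as an N-Triples statement.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical catalogue export; element names are illustrative only.
XML_SOURCE = """<records>
  <record id="r1"><title>Pleione formosana</title><place>Taiwan</place></record>
  <record id="r2"><title>Specimen "A"</title><place></place></record>
</records>"""

BASE = "http://example.org/record/"   # placeholder namespace for record URIs
PRED = "http://example.org/vocab/"    # placeholder namespace for properties
FIELDS = ["id", "title", "place"]


def xml_to_csv(xml_text):
    """Flatten each <record> element into one CSV row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in ET.fromstring(xml_text).iter("record"):
        row = {"id": rec.get("id")}
        for name in FIELDS[1:]:
            row[name] = (rec.findtext(name) or "").strip()
        writer.writerow(row)
    return out.getvalue()


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text):
    """Emit one triple per non-empty cell; empty cells produce no triple."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['id']}>"
        for name in FIELDS[1:]:
            if row[name]:
                literal = escape_literal(row[name])
                lines.append(f'{subject} <{PRED}{name}> "{literal}" .')
    return lines


csv_text = xml_to_csv(XML_SOURCE)
triples = csv_to_ntriples(csv_text)
print(csv_text)
print("\n".join(triples))
```

Keeping the CSV stage explicit gives a tabular checkpoint where records can be profiled and cleaned before any RDF is generated. A direct conversion from a relational database skips that intermediate file.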
Part 3: Data profiling, cleaning, and mapping -- We demonstrate format conversion processes, and we
discuss the pros and cons of various ways of handling broken links in source datasets. In addition, we
map and link catalogue records to three external knowledge bases: GeoNames, Wikidata, and
Encyclopedia of Life.
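
The mapping step can be pictured as a lookup from a normalised record label to an external URI. The sketch below is only an assumption about how such a step might look. The lookup table, its placeholder URIs (which are not real GeoNames, Wikidata, or Encyclopedia of Life identifiers), and the `same_as` key are all hypothetical. Records with no match are kept aside rather than dropped, so they can be reviewed later.

```python
# Hypothetical lookup table: normalised label -> external URI.
# The URIs are placeholders, not real external identifiers.
EXTERNAL_LINKS = {
    "taiwan": "http://example.org/geonames/PLACEHOLDER-1",
    "pleione formosana": "http://example.org/eol/PLACEHOLDER-2",
}


def normalise(label):
    """Lower-case and collapse whitespace so trivial variants still match."""
    return " ".join(label.lower().split())


def link_records(records, field):
    """Return (linked, unmatched) after looking up `field` in the table."""
    linked, unmatched = [], []
    for rec in records:
        uri = EXTERNAL_LINKS.get(normalise(rec.get(field, "")))
        if uri:
            # Copy the record and attach the external link.
            linked.append({**rec, "same_as": uri})
        else:
            unmatched.append(rec)
    return linked, unmatched


records = [
    {"id": "r1", "title": "Pleione  Formosana"},
    {"id": "r2", "title": "Unknown orchid"},
]
linked, unmatched = link_records(records, "title")
print(linked)
print(unmatched)
```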
Part 4: Using CKAN (The Comprehensive Knowledge Archive Network) as a Linked Data platform --
We briefly introduce CKAN, an open source web-based data portal software package for curating and
publishing datasets. CKAN provides data preview, search, and discovery, especially with regard to
geospatial datasets. We built several extensions to CKAN in order to deposit, publish, browse, and
search Linked Data. Various Linked Data representations of a catalogue record --- Turtle, RDF/XML,
and JSON-LD --- can all be downloaded and reused.
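
To show how one record can be offered in more than one representation, the sketch below renders a single hypothetical record as JSON-LD and as Turtle using only the Python standard library. The record URI is a placeholder and the choice of `dcterms:title` is an assumption. The actual properties used at data.odw.tw are not given in this abstract.

```python
import json

DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical record; the URI is a placeholder.
record = {"uri": "http://example.org/record/r1", "title": "Pleione formosana"}


def to_jsonld(rec):
    """Build a JSON-LD document with a prefix context for Dublin Core terms."""
    return {
        "@context": {"dcterms": DCTERMS},
        "@id": rec["uri"],
        "dcterms:title": rec["title"],
    }


def to_turtle(rec):
    """Render the same statement as Turtle, escaping quotes in the literal."""
    title = rec["title"].replace("\\", "\\\\").replace('"', '\\"')
    return (f"@prefix dcterms: <{DCTERMS}> .\n\n"
            f"<{rec['uri']}> dcterms:title \"{title}\" .\n")


jsonld_text = json.dumps(to_jsonld(record), indent=2, ensure_ascii=False)
turtle_text = to_turtle(record)
print(jsonld_text)
print(turtle_text)
```

Both outputs carry the same single statement. In practice an RDF library would normally handle serialization, and the hand-written version here is only meant to make the correspondence between the formats visible.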
Part 5: Designing ontologies for data representation and reuse -- We design an ontology, voc4odw,
which includes the following three modules:
(1) The Core Model. It comprises a data model and a conceptual model. The data model represents
key data structures and relations. It is a framework to illustrate data source, derivation, and provenance.
The conceptual model incorporates the Simple Knowledge Organization System (SKOS); it also connects
to key event concepts. The conceptual model allows for data contextualization using common and
domain knowledge vocabularies.
(2) The Curation Model. It is responsible for disclosing the identification, classification, and publication
of structured records at a curation platform, such as the classification of themes, the assignment of
data identifiers, and the publication of datasets.
(3) A vocabulary, voaf:Vocabulary. It is defined as "A vocabulary used in the Linked Data cloud" in
the Vocabulary of a Friend <http://purl.org/vocommons/voaf>. This module relates the Core
Model to external common vocabularies. Some hierarchical relations between different external
vocabularies can be traced with this vocabulary.
【Romanization of Chinese references is offered in the paper.】
