Adding Open Data Value to 'Closed Data' Problems
Dr Simon Price
Research Fellow, University of Bristol
Data Scientist, Capgemini Insights & Data
Who am I?
• 30 years software development and leadership roles
• Moved into Data Science via PhD in Machine Learning (2014)
• Research Fellow in Machine Learning group
  • ~20 Machine Learning researchers
• Led project to establish Bristol’s open research data repository
• One of the organisers of Open Data Institute (ODI) Bristol
• Data Scientist in Big Data Analytics team
  • ~100 Data Scientists, Big Data Engineers and Data Analysts
• Focus on Open Source and Big Data technologies to solve client problems
Outline
1. Case study: open data + ‘closed data’
2. Deriving value from open data
3. Data Science with ‘closed data’
Case study: SubSift
Conferences using SubSift
• ECML-PKDD: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
• KDD: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
• SDM: SIAM International Conference on Data Mining
Journals using SubSift
• Machine Learning
• Data Mining and Knowledge Discovery
https://doi.org/10.1145/2979672
Initial problem addressed by SubSift
Matching submitted conference papers to possible reviewers in Programme Committee
[Diagram labels: confidential ‘closed data’; open data]
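A minimal sketch of how papers might be matched to reviewers, assuming a TF-IDF vector space model with cosine similarity. The reviewer names, profile texts and abstract below are hypothetical, and this is not SubSift's actual code or API.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real system would also stem and drop stop words)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        if not tokens:
            vectors.append({})
            continue
        tf = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_reviewers(paper_abstract, reviewer_profiles):
    """Rank reviewers by similarity between the paper and their profile text."""
    names = list(reviewer_profiles)
    vectors = tfidf_vectors([paper_abstract] + [reviewer_profiles[n] for n in names])
    paper_vector, profile_vectors = vectors[0], vectors[1:]
    scores = [(name, cosine(paper_vector, vec)) for name, vec in zip(names, profile_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: the abstract is the confidential side, the profiles the open side.
profiles = {
    "reviewer_a": "kernel methods support vector machines classification",
    "reviewer_b": "graph mining frequent subgraph discovery",
}
ranking = rank_reviewers("a new kernel for support vector classification", profiles)
for name, score in ranking:
    print(name, round(score, 3))
```

The confidential abstract never has to leave the matching service; only the ranked list of names is shown.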
Initial SubSift workflow
Generic SubSift workflow
Personalised session recommendations
Expert finding
Why did SubSift recommend this person?
Profiling our organisation
Profiling staff at meetings
Open data opportunities?
Open research data
• data.bris.ac.uk
• Research data storage facility
• Each researcher gets 10TB "forever"
Open government data
• 140+ datasets live on opendata.bristol.gov.uk
• Mostly static but some real-time data
• Examples
  • Government: Elections since 2007
  • Community: Quality of Life survey
  • Education: School Results
  • Energy: Installed PV, Energy Use in Council Buildings
  • Environment: Real time & Historic Air Quality, Flood Alerts (EA)
  • Land use: 2013 Planning applications
  • Health: Life expectancy / Mortality, Obesity, NHS Spend
Deriving value from open data
1. Data Science
2. Using open data to enrich and connect ‘closed data’
[Diagram labels: statistics; software engineering; machine learning; data science; application domains; research domains]
Big Data Analytics
Insights & Data
www.capgemini.com/insights-data
Copyright © Capgemini 2017. All Rights Reserved
June 2017
Example Data Science application
Assurance Scoring
http://ow.ly/4nbEUI
Using existing enterprise data plus any useful open data, detect potentially fraudulent transactions
Machine Learning Framework (Python, Scala, Spark)
[Diagram labels: Input Data; Manipulate, Explore Data; Vector Build; Feature Extraction and Selection; Model Building; Transform; Selection; Model; Training; Validation; Test]
• Variety of output files: logs, graphics, saved models, etc.
• Testing: Unit tests, monitoring tests and integration tests
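The Training / Validation / Test labels suggest the data is split three ways before model building. A minimal sketch of such a split, assuming a simple shuffled partition; the function name, fractions and seed are illustrative choices, not taken from the framework on the slide.

```python
import random


def train_validation_test_split(rows, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and cut them into train, validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so that reruns give the same split
    n_test = int(len(rows) * test_fraction)
    n_validation = int(len(rows) * validation_fraction)
    test = rows[:n_test]
    validation = rows[n_test:n_test + n_validation]
    train = rows[n_test + n_validation:]
    return train, validation, test


train, validation, test = train_validation_test_split(range(10))
print(len(train), len(validation), len(test))
```

The validation set is for choosing features and model settings; the test set is touched only once, for the final estimate.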
Graph Links - Matching
Key part of assurance scoring – bringing data together from disparate sources

Attribute        Data Source 1    Data Source 2
Name             Richard Smith    Rich Smith
Phone Number     07123 456 789    07123 456 798
Favourite Sport  Football         Cricket

Probability of Match: 80%
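One way a match score for the two records above could be computed: a minimal sketch assuming per-attribute string similarity combined with weights. The weights are hypothetical and the result is a similarity score, not the calibrated probability on the slide.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(record_1, record_2, weights):
    """Weighted average of per-attribute similarities."""
    total_weight = sum(weights.values())
    weighted = sum(
        weight * similarity(record_1[attribute], record_2[attribute])
        for attribute, weight in weights.items()
    )
    return weighted / total_weight


source_1 = {"name": "Richard Smith", "phone": "07123 456 789", "sport": "Football"}
source_2 = {"name": "Rich Smith", "phone": "07123 456 798", "sport": "Cricket"}

# Hypothetical weights: identifying attributes count for more than preferences.
weights = {"name": 4, "phone": 4, "sport": 1}

score = match_score(source_1, source_2, weights)
print(round(score, 2))
```

A production matcher would typically learn the weights from labelled pairs rather than fix them by hand.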
Advanced matching
Related to:
- record linkage
- duplicate detection
- reference resolution
- object identity
- entity matching
Connect graph descriptions using background knowledge from open data sources, e.g. Linked Open Data
Linked Open Data
Data Science with ‘closed data’

The information contained in this presentation is proprietary.
© 2012 Capgemini. All rights reserved.
www.capgemini.com
About Capgemini
With more than 120,000 people in 40 countries, Capgemini is one
of the world's foremost providers of consulting, technology and
outsourcing services. The Group reported 2011 global revenues
of EUR 9.7 billion.
Together with its clients, Capgemini creates and delivers
business and technology solutions that fit their needs and drive
the results they want. A deeply multicultural organization,
Capgemini has developed its own way of working, the
Collaborative Business Experience™, and draws on Rightshore®,
its worldwide delivery model.
Rightshore® is a trademark belonging to Capgemini
Problems of opening up ‘closed data’
Research data now open by default - including sensitive data
Funders
Journals
data.bris has 3 levels of access:
Data Science with ‘closed data’
• Custom R server running inside secure data repository / warehouse
• Enables non-disclosive, remote analysis of sensitive research data.
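A minimal sketch of what "non-disclosive" can mean in practice, assuming the server refuses to return a summary for groups that are too small. It is written in Python for illustration only; it is not the R server's interface, and the threshold and names are hypothetical.

```python
from statistics import mean

MIN_GROUP_SIZE = 5  # hypothetical disclosure threshold


class DisclosureError(Exception):
    """Raised when a result could reveal information about individuals."""


def safe_mean(values, min_group_size=MIN_GROUP_SIZE):
    """Return the mean only if enough records contribute to it."""
    if len(values) < min_group_size:
        raise DisclosureError(
            f"only {len(values)} records; at least {min_group_size} are required"
        )
    return mean(values)


print(safe_mean([1, 2, 3, 4, 5]))
```

The analyst sees aggregates only; the row-level values stay inside the repository.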
[Figure: non-disclosive vs disclosive visualisation; axes: Number of Letters, Number of Words]
Single-partition DataSHIELD
Multiple-partition DataSHIELD
DataSHIELD partition models
[Diagram labels: horizontal; vertical; ideal]
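A minimal sketch of analysis over a horizontal partition, assuming that reading means each site holds the same variables for different individuals. Each site returns only summary statistics, which are pooled centrally. This is written in Python for illustration and is not the DataSHIELD API; the site data is hypothetical.

```python
def local_summary(values):
    """Run at each site: return aggregates only, never the row-level values."""
    return {"count": len(values), "total": sum(values)}


def pooled_mean(summaries):
    """Run centrally: combine the per-site aggregates into one overall mean."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / count


# Hypothetical measurements held at two separate sites.
site_a = [1, 2, 3]
site_b = [4, 5]

overall = pooled_mean([local_summary(site_a), local_summary(site_b)])
print(overall)
```

The pooled result equals the mean over all five records, although no site ever shares an individual value.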
http://www.simonprice.info
simon.price@capgemini.com
@simonprice_info
