Adding Open Data Value
to 'Closed Data' Problems
Dr Simon Price
Research Fellow, University of Bristol
Data Scientist, Capgemini Insights & Data
Who am I?
• 30 years software development and leadership roles
• Moved into Data Science via PhD in Machine Learning (2014)
• Research Fellow in Machine Learning group
  - ~20 Machine Learning researchers
• Led project to establish Bristol’s open research data repository
• One of the organisers of Open Data Institute (ODI) Bristol
• Data Scientist in Big Data Analytics team
  - ~100 Data Scientists, Big Data Engineers and Data Analysts
• Focus on Open Source and Big Data technologies to solve client problems
Outline
1. Case study: open data + ‘closed data’
2. Deriving value from open data
3. Data Science with ‘closed data’
Case study: SubSift
Conferences using SubSift
• ECML-PKDD: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
• KDD: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
• SDM: SIAM International Conference on Data Mining
Journals using SubSift
• Machine Learning
• Data Mining and Knowledge Discovery
https://doi.org/10.1145/2979672
Initial problem addressed by SubSift
Matching submitted conference papers to possible reviewers in Programme Committee
[Figure labels: confidential ‘closed data’ and open data]
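The matching problem above can be sketched with a vector space model, the approach named in the SubSift paper titles. This is a minimal illustration only, not SubSift's actual code or API: the function names, the tf-idf weighting variant and the example texts are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return {doc_name: {term: tf-idf weight}} for a dict of raw texts."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for name, words in tokens.items():
        tf = Counter(words)
        vectors[name] = {
            term: (count / len(words)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank_reviewers(paper_text, reviewer_texts):
    """Rank reviewers by similarity of their public text to the paper."""
    paper_key = "__paper__"  # hypothetical reserved key for the submission
    vectors = tfidf_vectors({**reviewer_texts, paper_key: paper_text})
    scores = {name: cosine(vectors[paper_key], vectors[name]) for name in reviewer_texts}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented example texts: reviewer profiles would come from open publication
# data, the paper abstract from the confidential submission.
reviewers = {
    "reviewer_a": "kernel methods for graph classification and graph mining",
    "reviewer_b": "topic models for text clustering and document retrieval",
}
ranking = rank_reviewers("scalable graph kernel for molecule classification", reviewers)
print(ranking)
```

Listing the terms that contribute most to the dot product would give a simple answer to "why was this person recommended?".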
Initial SubSift workflow
Generic SubSift workflow
Personalised session recommendations
Expert finding
Why did SubSift recommend this person?
Profiling our organisation
Profiling staff at meetings
Open data opportunities?
Open research data
• data.bris.ac.uk
• Research data storage facility
• Each researcher gets 10TB "forever"
Open government data
• 140+ datasets live on opendata.bristol.gov.uk
• Mostly static but some real-time data
• Examples
  - Government: Elections since 2007
  - Community: Quality of Life survey
  - Education: School Results
  - Energy: Installed PV, Energy Use in Council Buildings
  - Environment: Real time & Historic Air Quality, Flood Alerts (EA)
  - Land use: 2013 Planning applications
  - Health: Life expectancy / Mortality, Obesity, NHS Spend
Deriving value from open data
1. Data Science
2. Using open data to enrich and connect ‘closed data’
[Figure: data science shown in relation to statistics, software engineering and machine learning; a second version adds application domains and research domains]
Big Data Analytics
Insights & Data
www.capgemini.com/insights-data
Copyright © Capgemini 2017. All Rights Reserved
June 2017
Example Data Science application
Assurance Scoring
http://ow.ly/4nbEUI
Using existing enterprise data plus any
useful open data, detect potentially
fraudulent transactions
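As a rough sketch of "enterprise data plus open data", the snippet below joins closed records to an open lookup on a shared key. The field names, records and lookup values are invented for illustration and are not taken from the Assurance Scoring system.

```python
def enrich(records, open_lookup, key):
    """Attach open-data attributes to each closed record via a shared key."""
    return [{**record, **open_lookup.get(record[key], {})} for record in records]


# Hypothetical closed (enterprise) records; values are invented.
transactions = [
    {"id": 1, "postcode_district": "BS1", "amount": 120.0},
    {"id": 2, "postcode_district": "ZZ9", "amount": 9500.0},
]

# Hypothetical open-data lookup keyed by the same field.
open_data = {"BS1": {"region": "Bristol"}}

enriched = enrich(transactions, open_data, "postcode_district")

# Records with no open-data match could be passed on for closer inspection.
unmatched = [r["id"] for r in enriched if "region" not in r]
print(unmatched)
```

Whether a failed lookup actually signals risk depends on the domain; here it only shows how open data can add a feature that the closed data lacks.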
Machine Learning
Machine Learning Framework (Python, Scala, Spark)
• Input Data
• Manipulate, Explore Data
• Vector Build
• Feature Extraction and Selection: Transform, Selection
• Model Building: Model
• Training, Validation, Test
• Variety of output files: logs, graphics, saved models, etc.
• Testing: Unit tests, monitoring tests and integration tests
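The Training / Validation / Test stage can be sketched as a three-way split of the input rows. This is an assumption-laden illustration using only the Python standard library; the function name, split ratios and seed are hypothetical and not taken from the framework on the slide.

```python
import random


def three_way_split(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows reproducibly and cut them into train/validation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],                          # used to fit the model
        shuffled[n_train:n_train + n_validation],    # used to tune and select
        shuffled[n_train + n_validation:],           # held back for final testing
    )


train_rows, validation_rows, test_rows = three_way_split(list(range(100)))
print(len(train_rows), len(validation_rows), len(test_rows))
```

Fixing the seed keeps the split repeatable, which helps when unit tests and monitoring tests compare results between runs.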
Graph Links - Matching
Key part of assurance scoring – bringing data together from disparate sources
Probability of Match: 80%

Attribute        Data Source 1    Data Source 2
Name             Richard Smith    Rich Smith
Phone Number     07123 456 789    07123 456 798
Favourite Sport  Football         Cricket
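One simple way to turn the two records above into a single match score is a weighted average of per-attribute string similarities. This is a sketch only: the weights are invented, `difflib` similarity stands in for whatever matcher the real system uses, and the result is not meant to reproduce the 80% shown on the slide.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec1, rec2, weights):
    """Weighted average of attribute similarities (hypothetical scoring rule)."""
    total = sum(weights.values())
    return sum(w * similarity(rec1[k], rec2[k]) for k, w in weights.items()) / total


source_1 = {"name": "Richard Smith", "phone": "07123 456 789", "sport": "Football"}
source_2 = {"name": "Rich Smith", "phone": "07123 456 798", "sport": "Cricket"}

# Assumed weights: identifying attributes count for more than preferences.
weights = {"name": 0.45, "phone": 0.45, "sport": 0.10}

score = match_score(source_1, source_2, weights)
print(f"match score: {score:.2f}")
```

A production matcher would usually learn such weights from labelled pairs and calibrate the score into a probability.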
Advanced matching
Related to:
- record linkage
- duplicate detection
- reference resolution
- object identity
- entity matching
Connect graph descriptions using background knowledge from open data sources, e.g. Linked Open Data.
Linked Open Data
Data Science with ‘closed data’

The information contained in this presentation is proprietary.
© 2012 Capgemini. All rights reserved.
www.capgemini.com
About Capgemini
With more than 120,000 people in 40 countries, Capgemini is one
of the world's foremost providers of consulting, technology and
outsourcing services. The Group reported 2011 global revenues
of EUR 9.7 billion.
Together with its clients, Capgemini creates and delivers
business and technology solutions that fit their needs and drive
the results they want. A deeply multicultural organization,
Capgemini has developed its own way of working, the
Collaborative Business Experience™, and draws on Rightshore®,
its worldwide delivery model.
Rightshore® is a trademark belonging to Capgemini
Problems of opening up ‘closed data’
Research data now open by default - including sensitive data
Funders
Journals
data.bris has 3 levels of access:
Data science with ‘closed data’
• Custom R server running inside secure data repository / warehouse
• Enables non-disclosive, remote analysis of sensitive research data.
Non-disclosive visualisation
[Figure: Number of Words against Number of Letters, shown in non-disclosive and disclosive forms]
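The idea behind a non-disclosive view can be sketched as releasing binned counts only, with small cells suppressed. The threshold and function name below are assumptions, and this is plain Python for illustration rather than the R-based server described on the slides.

```python
from collections import Counter

MIN_CELL_COUNT = 5  # hypothetical disclosure threshold, not a DataSHIELD setting


def safe_histogram(values, bin_width, min_count=MIN_CELL_COUNT):
    """Return {bin_start: count}, with counts below min_count replaced by None."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {
        bin_start: (count if count >= min_count else None)
        for bin_start, count in sorted(counts.items())
    }


# Invented data: six values fall in the first bin, one lone value in the second.
values = [1, 2, 3, 4, 5, 6, 12]
print(safe_histogram(values, bin_width=10))
```

The lone value in the second bin is hidden because a count of one could point to an individual record.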
Single-partition DataSHIELD
Multiple-partition DataSHIELD
DataSHIELD partition models
[Figure panels: horizontal, vertical, ideal]
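For the horizontal case, where each partition holds different rows with the same variables, a pooled statistic can be built from per-site summaries so that row-level data never leaves a site. The sketch below illustrates that idea with invented numbers; the function names are hypothetical and this is not the DataSHIELD API.

```python
def local_summary(values):
    """Computed inside each site: only aggregate values are returned."""
    return {"n": len(values), "total": sum(values)}


def pooled_mean(summaries):
    """Computed by the analyst from the site summaries alone."""
    n = sum(s["n"] for s in summaries)
    if n == 0:
        raise ValueError("no observations in any partition")
    return sum(s["total"] for s in summaries) / n


# Invented example: two sites holding different individuals.
site_a = [1, 2, 3]
site_b = [4, 5]

summaries = [local_summary(site_a), local_summary(site_b)]
print(pooled_mean(summaries))
```

A real deployment would also apply disclosure checks at each site, for example refusing to return a summary for very small groups.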
http://www.simonprice.info
simon.price@capgemini.com
@simonprice_info

METHODS OF DATA COLLECTION (Research methodology)
anwesha248
 
Hypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdfHypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdf
AbdirahmanAli51
 
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
Taqyea
 
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELSQUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
Ameya Patekar
 
Managed Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud ManManaged Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud Man
Opsio Cloud
 
Data-Driven-Operational--Excellence.pptx
Data-Driven-Operational--Excellence.pptxData-Driven-Operational--Excellence.pptx
Data-Driven-Operational--Excellence.pptx
NiwanthaThilanjanaGa
 
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays
 
MEDIA_LITERACY_INDEX_OF_EDUCATORS_ENG.pdf
MEDIA_LITERACY_INDEX_OF_EDUCATORS_ENG.pdfMEDIA_LITERACY_INDEX_OF_EDUCATORS_ENG.pdf
MEDIA_LITERACY_INDEX_OF_EDUCATORS_ENG.pdf
OlhaTatokhina1
 
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays
 
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdfBODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays
 
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays
 
KLIP2Data voor de herinrichting van R4 West en Oost
KLIP2Data voor de herinrichting van R4 West en OostKLIP2Data voor de herinrichting van R4 West en Oost
KLIP2Data voor de herinrichting van R4 West en Oost
jacoba18
 
Report_Government Authorities_Index_ENG_FIN.pdf
Report_Government Authorities_Index_ENG_FIN.pdfReport_Government Authorities_Index_ENG_FIN.pdf
Report_Government Authorities_Index_ENG_FIN.pdf
OlhaTatokhina1
 
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptxSAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
vemulavenu484
 
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays
 
Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdfAdvanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Media_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdfMedia_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdf
OlhaTatokhina1
 
METHODS OF DATA COLLECTION (Research methodology)
METHODS OF DATA COLLECTION (Research methodology)METHODS OF DATA COLLECTION (Research methodology)
METHODS OF DATA COLLECTION (Research methodology)
anwesha248
 
Hypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdfHypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdf
AbdirahmanAli51
 
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
Taqyea
 
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELSQUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
Ameya Patekar
 

Adding Open Data Value to 'Closed Data' Problems

  • 1. Adding Open Data Value to 'Closed Data' Problems Dr Simon Price Research Fellow, University of Bristol Data Scientist, Capgemini Insights & Data
  • 2. Who am I? • 30 years software development and leadership roles • Moved into Data Science via PhD in Machine Learning (2014) • Research Fellow in Machine Learning group (~20 Machine Learning researchers) • Led project to establish Bristol’s open research data repository • One of the organisers of Open Data Institute (ODI) Bristol • Data Scientist in Big Data Analytics team (~100 Data Scientists, Big Data Engineers and Data Analysts) • Focus on Open Source and Big Data technologies to solve client problems
  • 3. Outline 1. Case study: open data + ‘closed data’ 2. Deriving value from open data 3. Data Science with ‘closed data’
  • 4. Case study: SubSift. Conferences using SubSift • ECML-PKDD: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases • KDD: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining • PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining • SDM: SIAM International Conference on Data Mining. Journals using SubSift • Machine Learning • Data Mining and Knowledge Discovery. https://doi.org/10.1145/2979672
  • 5. Initial problem addressed by SubSift Matching submitted conference papers to possible reviewers in Programme Committee
  • 11. Why did SubSift recommend this person?
  • 13. Profiling staff at meetings
  • 15. Open research data • data.bris.ac.uk • Research data storage facility • Each researcher gets 10TB "forever"
  • 17. Open government data • 140+ datasets live on opendata.bristol.gov.uk • Mostly static but some real-time data • Examples • Government: Elections since 2007 • Community: Quality of Life survey • Education: School Results • Energy: Installed PV, Energy Use in Council Buildings • Environment: Real time & Historic Air Quality, Flood Alerts (EA) • Land use: 2013 Planning applications • Health: Life expectancy / Mortality, Obesity, NHS Spend
  • 19. Deriving value from open data 1. Data Science 2. Using open data to enrich and connect ‘closed data’
  • 23. Big Data Analytics Insights & Data www.capgemini.com/insights-data
  • 24. Example Data Science application: Assurance Scoring. http://ow.ly/4nbEUI Using existing enterprise data plus any useful open data, detect potentially fraudulent transactions. (Copyright © Capgemini 2017. All Rights Reserved. June 2017)
  • 25. Example Data Science application: Assurance Scoring. http://ow.ly/4nbEUI
  • 26. Machine Learning Framework (Python, Scala, Spark). Diagram labels: Input Data; Manipulate, Explore Data; Vector Build; Transform; Selection; Feature Extraction and Selection; Model Training; Validation; Test; Model Building. Variety of output files: logs, graphics, saved models, etc. Testing: unit tests, monitoring tests and integration tests.
  • 27. Graph Links - Matching. Key part of assurance scoring – bringing data together from disparate sources. Probability of Match: 80%
      Attribute       | Data Source 1 | Data Source 2
      Name            | Richard Smith | Rich Smith
      Phone Number    | 07123 456 789 | 07123 456 798
      Favourite Sport | Football      | Cricket
  • 28. Advanced matching. Connect graph descriptions using background knowledge from open data sources, e.g. Linked Open Data. Related to: record linkage, duplicate detection, reference resolution, object identity, entity matching.
  • 29. Linked Open Data
  • 30. Data Science with ‘closed data’. The information contained in this presentation is proprietary. © 2012 Capgemini. All rights reserved. www.capgemini.com About Capgemini: With more than 120,000 people in 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2011 global revenues of EUR 9.7 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model. Rightshore® is a trademark belonging to Capgemini.
  • 31. Problems of opening up ‘closed data’
  • 32. Research data now open by default - including sensitive data Funders Journals data.bris has 3 levels of access:
  • 37. Data Science with ‘closed data’
  • 38. Data science with ‘closed data’ • Custom R server running inside secure data repository / warehouse • Enables non-disclosive, remote analysis of sensitive research data.
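
The matching example on slide 27 (the same person recorded slightly differently in two data sources) can be illustrated with a minimal Python sketch. This is not the method used in the talk, which the slides do not describe. It assumes a simple heuristic: a per-attribute string similarity from the standard library's `difflib`, combined as a weighted average. The attribute weights, the function names `similarity` and `match_score`, and the dictionary keys are all hypothetical, and the resulting score is a heuristic rather than a calibrated probability of match.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights, chosen only for illustration:
# a phone number is treated as stronger evidence than a favourite sport.
WEIGHTS = {"name": 0.4, "phone": 0.5, "favourite_sport": 0.1}


def similarity(a: str, b: str) -> float:
    """Return a 0..1 string similarity, ignoring case and whitespace."""
    norm_a = "".join(a.lower().split())
    norm_b = "".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def match_score(rec1: dict, rec2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-attribute similarities (heuristic score)."""
    total = sum(weights.values())
    weighted = sum(w * similarity(rec1[k], rec2[k]) for k, w in weights.items())
    return weighted / total


# The two records from the slide's table.
source_1 = {"name": "Richard Smith", "phone": "07123 456 789",
            "favourite_sport": "Football"}
source_2 = {"name": "Rich Smith", "phone": "07123 456 798",
            "favourite_sport": "Cricket"}

score = match_score(source_1, source_2)
print(f"match score: {score:.2f}")
```

In practice a pair scoring above some chosen threshold would be linked as the same entity. Background knowledge from open data, as suggested on slide 28, could add further attributes or adjust the weights.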