Creating Open Data with
Open Source
Sammy Fung
sammy.hk
[ITFest.HK] Seminar of Free / Open Source in Hong Kong, April 2013.
Agenda
● What is Open Data ?
● Use of Open Source Software in web crawling.
● Starting new Open Source projects to create
Open Data.
Sammy Fung
● Software Developer using open source.
– Perl → PHP → Python.
– Data Mining / Web Crawling.
– Also deploying OpenStack Cloud and Linux Solutions.
● Open Source Community Leader.
– opensource.hk, HKLUG, GNOME Asia committee, Mozilla
Rep, and program committee member of the largest
Taiwan open source conference - COSCUP.
● Blogger at sammy.hk.
Open Data
Three Laws of Open Government Data by David Eaves.
1. If it can't be spidered or indexed, it doesn't exist.
2. If it isn't available in open and machine readable format, it
can't engage.
3. If a legal framework doesn't allow it to be repurposed, it
doesn't empower.
http://eaves.ca/2009/09/30/three-law-of-open-government-data/
Open Data
● Tim Berners-Lee, the inventor of the Web.
– 5stardata.info
– 5 star deployment scheme of Open Data.
Five Star Open Data
* One Star: make your stuff available on the Web (whatever format) under an
open license.
** Two Star: make it available as structured data (e.g., Excel instead of image
scan of a table).
*** Three Star: use non-proprietary formats (e.g., CSV instead of Excel).
**** Four Star: use URIs to denote things, so that people can point at your stuff.
***** Five Star: link your data to other data to provide context.
5stardata.info by Tim Berners-Lee, the inventor of the Web.
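As a rough illustration of the step from three stars to four, the sketch below reads a small CSV table and gives every row a URI so that other people can point at it. The sample rows and the base URI `http://example.org/station/` are hypothetical placeholders, not something taken from 5stardata.info.

```python
import csv
import io
import json

# Hypothetical three-star data: an open, non-proprietary CSV table.
CSV_TEXT = """station,temperature
hko,17
kingspark,16
"""

# Hypothetical base URI; a real publisher would use its own domain.
BASE_URI = "http://example.org/station/"


def rows_with_uris(csv_text, base_uri=BASE_URI):
    """Return one dict per CSV row, each identified by a URI."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "@id": base_uri + row["station"],
            "station": row["station"],
            "temperature": int(row["temperature"]),
        })
    return records


if __name__ == "__main__":
    print(json.dumps(rows_with_uris(CSV_TEXT), indent=2))
```

Linking each `@id` to related records elsewhere would be the remaining step towards five stars.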
Open Data from HK Government ?
● 2 Use Cases of Data:
– Legco Meeting Minutes and Voting Results.
– Weather at Data.One.
Legco Meeting Minutes
and Voting Results
● All LegCo voting results are scanned and
released as PDF, so voting results can only be
retrieved manually.
● In recent years, minutes scanned from paper
sheets seem to have been replaced by minutes
converted from the original computer document
files.
Improving Legco Vote Result Data ?
● Legcovotes.net was created by Hong Kong
netizens (?).
● Only 20 well-known vote results are included.
● It could let the public enter other vote
results by hand, with submissions verified by
a legcovotes.net authority.
● It could also include other data, e.g. minutes in plain text
or paragraphs related to a councillor.
Weather at Data.One
● My Chinese blog post "The State of Open Data at Hong Kong
Government Agencies" (「香港政府機構開放資料 Open Data 情況」) on 2013/1/17.
● Data.One was launched on 2011/3/31.
● Weather at Data.One provides 7 dataset URLs,
all returning RSS (XML) format (Eng/TChi/SChi).
– One word: Useless.
– The Data.One dataset (RSS) is completely different
from HKO's own paid service (XML).
Weather at Data.One
● Example - Current local weather report:
● Plain text report in RSS.
● Difference in how the report content is wrapped:
– Website: a pair of HTML tags, e.g. <PRE>....</PRE>.
– Data.One: a pair of RSS description tags,
<description>....</description>.
● Other weather data is missing, e.g. regional
temperature updates every 12 mins.
Weather at Data.One
● Weather at Data.One is a 'report', not 'data'.
● The weather RSS was already released by HKO
before the launch of Data.One.
● Technically, JSON/XML formats are easier for
computer programs to read.
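The gap between a 'report' and 'data' can be sketched in a few lines of Python. The RSS item, its wording, and the JSON document below are hypothetical samples written for this illustration, not actual Data.One or HKO output. The first function has to dig a number out of free text with a regular expression, while the second simply reads a field.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical RSS item: a plain-text report wrapped in <description>.
RSS_TEXT = """<rss version="2.0"><channel><item>
<title>Current Weather Report</title>
<description>At 4 p.m. at the Observatory: Air temperature: 17 degrees Celsius. Relative humidity: 80 per cent.</description>
</item></channel></rss>"""

# Hypothetical JSON document carrying the same reading as data.
JSON_TEXT = '{"station": "hko", "temperature": 17, "humidity": 80}'


def temperature_from_report(rss_text):
    """Extract the temperature from free text; breaks if the wording changes."""
    description = ET.fromstring(rss_text).findtext("channel/item/description")
    match = re.search(r"Air temperature\s*:\s*(-?\d+)", description)
    return int(match.group(1)) if match else None


def temperature_from_data(json_text):
    """Read the temperature from structured data; no text parsing needed."""
    return json.loads(json_text)["temperature"]
```

Both functions return the same number for these samples, but only the first one depends on the exact wording of the report.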
Overseas Open Data Project
Examples
● Toronto:
– City Data: http://map.toronto.ca/wellbeing/
– Transportation: http://www.rocketradar.net/
– Pollution: http://www.emitter.ca/
● US & Canada:
– https://www.crimereports.com/
Use of Open Source Software in
Web Crawling
● Use Open Source Tools to collect useful and
meaningful machine-readable data.
● No need to wait for the provider to release data
in a machine-readable format.
Open Source Tools
● Python programming language
● with Regular Expression library
● Scrapy web crawling framework
Why python + scrapy ?
● python: my favourite programming
language for the past few years.
● scrapy: web crawling framework written in
Python.
Scrapy
● scrapy: web crawling framework written in
Python.
● HtmlXPathSelector
● Output: built-in JSON, CSV, XML.
● Python: import re
My Products
● WeatherHK ← ← ←
● TCTrack
WeatherHK
● http://twitter.com/weatherhk
● hourly current weather report
● weather forecast report
● tropical cyclone warning signals
WeatherHK
● Backend: Python + Scrapy + Database +
Twitter + NNTP......
● Frontend: Twitter + Newsgroup
WeatherHK
● http://twitter.com/weatherhk
● Interviewed by MetroPop in 2009.
My Products
● WeatherHK
● TCTrack ← ← ←
TCTrack
● http://sammy.hk/projects/tctrack/tctrack.php
● Plots current and forecast tropical cyclone (TC)
tracks on Google Maps.
● Source:
– JTWC
– HKO
TCTrack
● http://sammy.hk/projects/tctrack/tctrack.php
● Probably the first TC track map in HK using
Google Maps.
● Adoption of Google Maps, in order: TCTrack -> Weather
Underground Hong Kong -> HKO
TCTrack
● http://twitter.com/tctrack
● Tweets JTWC updates for the Northwest Pacific.
Starting new Open Source projects
to create Open Data
● Develop an open source project.
● Release data in standard machine-readable
data format.
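A minimal sketch of releasing the same records in two standard machine-readable formats, using only the Python standard library. The records and the field names are hypothetical examples.

```python
import csv
import io
import json

# Hypothetical records, shaped like scraped station readings.
RECORDS = [
    {"station": "hko", "temperature": 17},
    {"station": "kingspark", "temperature": 16},
]


def to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, sort_keys=True)


def to_csv(records):
    """Serialize the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["station", "temperature"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Scrapy's built-in exporters cover the same ground for crawled items; this sketch only shows that the formats themselves are cheap to produce.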
Open Source Project Examples
● hk0weather
● My weather-related open source project.
hk0weather
● https://github.com/sammyfung/hk0weather
● Open Source Hong Kong Weather Project.
● Converts HKO web pages into JSON data.
● python + scrapy
● 1st version: extracts temperature and humidity
for 20+ weather stations from the current weather
report and exports them in JSON format.
hk0weather
● https://github.com/sammyfung/hk0weather
● $ virtualenv hk0weatherenv
● $ source hk0weatherenv/bin/activate
● $ pip install scrapy
● $ git clone https://github.com/sammyfung/hk0weather.git
● $ cd hk0weather
● $ scrapy crawl currwx -t json -o testresult
hk0weather
[{"humidity": 80, "station": "hko", "temperture": 17, "time": 1360785720},
{"station": "kingspark", "temperture": 16, "time": 1360785720},
{"station": "wongchukhang", "temperture": 17, "time": 1360785720},
{"station": "takwuling", "temperture": 16, "time": 1360785720},
{"station": "laufaushan", "temperture": 15, "time": 1360785720},
{"station": "taipo", "temperture": 16, "time": 1360785720},
{"station": "shatin", "temperture": 17, "time": 1360785720},
{"station": "tuenmun", "temperture": 17, "time": 1360785720},
{"station": "tseungkwano", "temperture": 16, "time": 1360785720},
{"station": "saikung", "temperture": 16, "time": 1360785720},
{"station": "cheungchau", "temperture": 17, "time": 1360785720},
{"station": "cheungchau", "temperture": 17, "time": 1360785720},
{"station": "tsingyi", "temperture": 17, "time": 1360785720},
{"station": "shekkong", "temperture": 15, "time": 1360785720},
{"station": "tsuenwanhokoon", "temperture": 15, "time": 1360785720},
{"station": "tsuenwanshingmunvalley", "temperture": 17, "time": 1360785720},
{"station": "hongkongpark", "temperture": 17, "time": 1360785720},
{"station": "shaukeiwan", "temperture": 16, "time": 1360785720},
{"station": "kowlooncity", "temperture": 16, "time": 1360785720},
{"station": "happyvalley", "temperture": 18, "time": 1360785720},
{"station": "wongtaisin", "temperture": 17, "time": 1360785720},
{"station": "stanley", "temperture": 16, "time": 1360785720},
{"station": "kwuntong", "temperture": 15, "time": 1360785720},
{"station": "shamshuipo", "temperture": 17, "time": 1360785720}]
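Once the readings are JSON, consuming them takes very little code. The sketch below is one possible consumer, not part of hk0weather: it loads three records copied from the output above and reports the warmest station and the mean temperature. The function name `summarize` is made up for this example.

```python
import json
from statistics import mean

# Three records copied from the output above. The key is spelled
# "temperture" because that is the field name in that output.
SAMPLE_OUTPUT = """[
{"humidity": 80, "station": "hko", "temperture": 17, "time": 1360785720},
{"station": "kingspark", "temperture": 16, "time": 1360785720},
{"station": "happyvalley", "temperture": 18, "time": 1360785720}
]"""


def summarize(json_text):
    """Return (warmest station, mean temperature) for a list of readings."""
    readings = json.loads(json_text)
    # Keep one reading per station in case a station appears twice.
    by_station = {r["station"]: r["temperture"] for r in readings}
    warmest = max(by_station, key=by_station.get)
    return warmest, mean(by_station.values())
```

To run it on a real crawl, the text of the `testresult` file written by the `scrapy crawl` command could be passed in place of `SAMPLE_OUTPUT`.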
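Items.py
from scrapy.item import Item, Field

class Hk0WeatherItem(Item):
    # One reading from one weather station.
    time = Field()        # report time (a Unix timestamp in the output)
    station = Field()     # station ID, e.g. 'hko'
    temperture = Field()  # temperature; field name spelled as in the output
    humidity = Field()    # humidity, where the report provides it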
Currwx.py
start_urls = (
    'http://www.weather.gov.hk/wxinfo/currwx/currentc.htm',
)
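Currwx.py
# HtmlXPathSelector is the selector class of the Scrapy releases
# current at the time of this talk (2013).
from scrapy.selector import HtmlXPathSelector

# Method of the spider class (excerpt).
def parse(self, response):
    laststation = ''
    temperture = 0
    stations = []
    # Select the <div id="ming"> element of the page.
    hxs = HtmlXPathSelector(response)
    report = hxs.select('//div[@id="ming"]')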
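For readers without Scrapy installed, the selection step can be approximated with the standard library alone. This is a stand-in for the XPath `//div[@id="ming"]`, not how Scrapy works internally, and the class name `DivTextExtractor`, the helper `div_text`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # set once the target div has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1    # nested div inside the target
        elif dict(attrs).get("id") == self.div_id:
            self.depth = 1     # entered the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, div_id):
    """Return the stripped text content of the div with the given id."""
    parser = DivTextExtractor(div_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical page, loosely shaped like a report inside <div id="ming">.
SAMPLE_HTML = """<html><body>
<div id="nav">menu</div>
<div id="ming"><pre>Observatory 17 degrees
King's Park 16 degrees</pre></div>
<p>footer</p>
</body></html>"""
```

Calling `div_text(SAMPLE_HTML, "ming")` returns only the two report lines, which is the text a parser would then split into per-station readings.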
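libhk0
class hk0:
    # (name as it appears in the Chinese HKO report, station ID)
    stations = [
        (u' 天 文 台 ', 'hko'),           # Hong Kong Observatory
        (u' 京 士 柏 ', 'kingspark'),     # King's Park
        (u' 黃 竹 坑 ', 'wongchukhang'),  # Wong Chuk Hang
        (u' 打 鼓 嶺 ', 'takwuling'),     # Ta Kwu Ling
        (u' 流 浮 山 ', 'laufaushan'),    # Lau Fau Shan
        # (the list continues beyond this slide)
    ]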
libhk0
class hk0:
    def gettime(self, report):
        …
    def hk0current(self, report):
        …
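The method bodies are not shown on the slides, so the following is only a guess at how a table of station names could drive the parsing. It is a hypothetical sketch, not the libhk0 implementation. The report text is invented, and the spacing inside the names is copied as it appears on the slide, so the code ignores spaces when matching.

```python
import re

# First entries of the station table, copied as shown on the slide.
STATIONS = [
    (u' 天 文 台 ', 'hko'),
    (u' 京 士 柏 ', 'kingspark'),
    (u' 黃 竹 坑 ', 'wongchukhang'),
]

# Hypothetical report text: a station name followed by a temperature.
SAMPLE_REPORT = u"""天 文 台   17 度
京 士 柏   16 度
不 明   20 度"""


def parse_report(report, stations=STATIONS):
    """Return one reading per report line that starts with a known station."""
    results = []
    for line in report.splitlines():
        compact = line.replace(" ", "")  # ignore padding inside names
        for name, station_id in stations:
            pattern = re.escape(name.replace(" ", "")) + r"(-?\d+)"
            match = re.match(pattern, compact)
            if match:
                results.append({"station": station_id,
                                "temperture": int(match.group(1))})
    return results
```

Lines naming a station that is not in the table, such as the third sample line, are skipped rather than guessed at.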
hk0weather
● Future Planning:
● Add more weather reports.
● Get ideas from and/or cooperate with 'pro'
weather hobbyists.
● Remarks:
● hk0weather was developed from ZERO; its code
is separate from the code behind my Twitter
account @weatherhk.
Challenge
● Challenge on the first day of the hk0weather release.
● The director of a mobile app development company
left me a comment on Facebook:
– HKO provides data in a pretty XML format with their
annual service plan for commercial companies.
– He thinks that ***MAYBE*** HKO would provide the XML
to me ***without*** any charge if I asked.
● Remark: This is an assumption only, not listed on the HKO
website.
Challenge
● I replied to him as follows after googling for the HKO XML
schema.
– HKO doesn't mention any 'free of charge' XML data feed service on its
website.
– I registered and got authorization from HKO to redistribute their
weather information for non-profit purposes. I have received some
emails from HKO about updates to the website and its HTML structure,
but they never mentioned an XML data feed service.
– The HKO XML data feed still offers less weather data than the
HTML website.
● So, this challenge is FAIL! XD
Open Data Project Examples
● Open Government initiative from HKU JMSC.
● http://opengov.jmsc.hku.hk/
● https://github.com/jmschku
Agenda
● What is Open Data ?
● Use of Open Source Software in web crawling.
● Starting new Open Source projects to create
Open Data.
Thank You!
sammy.hk

