When Big Data Meet Python

                             Jimmy Lai (賴弘哲)
                           jimmy.lai@oi-sys.com
                                2012/08/19
Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python


                          2012
 When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
About Me
• 賴弘哲 (Jimmy Lai)
• Interests: Data mining, Machine Learning,
  Natural Language Processing, Distributed
  Computing, Python
• LinkedIn profile: http://goo.gl/XTEM5
• Currently working at Oxygen Intelligence Taiwan Limited
  (引京聚點知識結構搜索公司) on semantic analysis of big data

Outline
1. Big Data
  a. Concept
  b. Technical issues
2. Big Data + Python
  a. Related open source tools
  b. Example

Benefits of Big Data
1. Creating transparency, e.g. http://www.data.gov/
2. Enabling experimentation to discover needs,
   expose variability, and improve performance
3. Segmenting populations to customize actions
4. Replacing/supporting human decision making
   with automated algorithms
5. Innovating new business models, products and
   services
Challenge: a shortage of deep data analytics talent.
Source: McKinsey Global Institute (May 2011). Big Data: The next frontier for
innovation, competition, and productivity.

Initiative from the White House
• (Mar 2012) Big Data Research and
  Development Initiative, the White House.
• The National Science Foundation encourages
  education on Big Data.
• The government invests in developing state-of-the-art
  technologies, harnessing those technologies, and
  expanding the workforce for Big Data.

Big Data Issues
User Generated Content + Machine Generated Data
  → Collecting → Storage → Computing → Analysis → Visualization

Big Data Techniques (Collecting)
• Crawler
  – Collect raw data
  – E.g. Heritrix, Nutch
• Scraping
  – Parse information from raw data
  – E.g. Yahoo! Pipes, Scrapy

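As a rough illustration of the scraping step (parsing information out of raw data), the sketch below uses only Python's standard-library `html.parser`. It is not Scrapy or Yahoo! Pipes code; the sample HTML, the `TitleScraper` class and the choice of `<h2>` as the interesting tag are assumptions made up for this example.

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Collect the text of every <h2> element (hypothetical page layout)."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Only keep text that appears inside an open <h2> element.
        if self._in_h2:
            self.titles[-1] += data


def scrape_titles(raw_html):
    """Parse raw HTML (as a crawler would fetch it) into a list of titles."""
    scraper = TitleScraper()
    scraper.feed(raw_html)
    scraper.close()
    return [title.strip() for title in scraper.titles]


# Made-up raw data standing in for a crawled page.
RAW_PAGE = """
<html><body>
  <h2>Beef noodle soup</h2><p>Rich broth.</p>
  <h2>Bubble tea</h2><p>Too sweet.</p>
</body></html>
"""

if __name__ == "__main__":
    print(scrape_titles(RAW_PAGE))
```

The crawler's job ends once `RAW_PAGE` has been collected; everything in `scrape_titles` belongs to the scraping step.
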
Big Data Techniques (Storage)
• Big Table
  – Distributed key-value storage
  – E.g. HBase, Cassandra
• NoSQL
  – Does not use SQL for data manipulation
  – Does not use the relational database model
  – E.g. MongoDB, Redis, CouchDB

Big Data Techniques (Computing)
• Batch
  – MapReduce
  – E.g. Hadoop
• Real-time
  – Stream processing
  – E.g. S4, Storm

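The MapReduce model used for batch computing can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample documents are assumptions for illustration, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up input documents.
DOCS = ["big data meet python", "python meet big data", "python"]

if __name__ == "__main__":
    print(word_count(DOCS))
```

Because each `map_phase` call sees only one document and each `reduce_phase` call sees only one key, both phases can in principle run in parallel, which is what makes the model attractive for large data sets.
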
Big Data Techniques (Analysis)
• Data mining
  – Weka
• Machine learning
  – scikit-learn
• Natural language processing
  – NLTK, Stanford NLP
• Statistics
  – R

Big Data Techniques (Visualization)
• Abstract
• Interactive
• E.g. Processing, Gephi, D3.js

Why Python?
• Good code readability for fast development.
• Scripting language: the less code, the more
  productivity.
• Fast growing among open source communities.
  – Commit statistics from ohloh.net

When Big Data meet Python
Infrastructure: Python tools for each stage
• Collecting
  – Scrapy: scraping framework
• Storage
  – PyMongo: Python client for MongoDB
• Computing
  – Hadoop streaming: Linux pipe interface
  – Disco: lightweight MapReduce in Python
• Analysis
  – Pandas: data analysis/manipulation
  – Statsmodels: statistics
  – NLTK: natural language processing
  – Scikit-learn: machine learning
• Visualization
  – Matplotlib: plotting
  – NetworkX: graph visualization

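Hadoop streaming's pipe interface means a mapper and a reducer are ordinary programs that read lines and write tab-separated `key<TAB>value` lines, with the framework sorting by key in between. The sketch below imitates that flow locally; `mapper`, `reducer`, `simulate` and the sample input are names and data made up for this example, and under Hadoop the two functions would be fed from standard input instead of a list.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of each word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate(lines):
    """Local stand-in for the pipe: cat input | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


# Made-up input lines.
SAMPLE = ["Big data", "big Python"]

if __name__ == "__main__":
    print("\n".join(simulate(SAMPLE)))
```

The reducer relies on sorted input so that all lines for one key arrive together, which is why `groupby` is enough and no dictionary of all keys has to be held in memory.
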
When Big Data meet Python: Scrapy
http://scrapy.org/
Web scraping framework
• Simple and extensible
• Components:
   • Scheduler
   • Downloader
   • Spider (scraper)
   • Item pipeline

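To show how the four components listed for Scrapy relate to each other, here is a toy model in plain Python. It does not use Scrapy's API at all: `FAKE_WEB`, `downloader`, `spider`, `item_pipeline` and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so that no network access is needed.

```python
from collections import deque

# Made-up "web": URL -> (page text, outgoing links). No network is used.
FAKE_WEB = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("  Beef Noodles ", ["/b"]),
    "/b": ("Bubble Tea", []),
}


def downloader(url):
    """Downloader: fetch the raw response for a URL (here, a dict lookup)."""
    return FAKE_WEB[url]


def spider(url, response):
    """Spider: turn a response into scraped items and follow-up requests."""
    text, links = response
    items = [] if url == "/" else [{"url": url, "title": text}]
    return items, links


def item_pipeline(item):
    """Item pipeline: clean each scraped item before it is stored."""
    return {"url": item["url"], "title": item["title"].strip().lower()}


def crawl(start_url):
    """Drive the loop: schedule, download, parse, post-process."""
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    seen = {start_url}
    stored = []
    while scheduler:
        url = scheduler.popleft()
        items, links = spider(url, downloader(url))
        stored.extend(item_pipeline(item) for item in items)
        for link in links:
            if link not in seen:  # avoid scheduling the same page twice
                seen.add(link)
                scheduler.append(link)
    return stored


if __name__ == "__main__":
    print(crawl("/"))
```

In a framework the application author typically writes only the spider and the pipeline steps, while scheduling and downloading are handled for them.
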
When Big Data meet Python: MongoDB
http://www.mongodb.org/
NoSQL database
• PyMongo: client for Python
• Document (JSON)-oriented
• No schema
• Scalable
  • Auto-sharding
  • Replica sets
• File storage
• MapReduce aggregation

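The idea of a schema-free, JSON-document store can be illustrated without a database server. The sketch below is plain Python, not PyMongo: the `RESTAURANTS` documents and the `find` helper are made up for this example, although the dictionary-as-query style is meant to resemble how document databases are commonly queried.

```python
import json

# Documents in one collection need not share the same fields (no schema).
RESTAURANTS = [
    {"name": "Noodle House", "city": "Taipei", "tags": ["noodles", "beef"]},
    {"name": "Tea Corner", "city": "Taipei", "rating": 4.5},
    {"name": "Night Market Stall", "city": "Tainan"},
]


def find(collection, query):
    """Return documents whose fields equal every key/value pair in `query`."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in query.items())
    ]


if __name__ == "__main__":
    for doc in find(RESTAURANTS, {"city": "Taipei"}):
        print(json.dumps(doc))
```

Note that the second document has a `rating` field and the others do not; nothing has to be migrated or declared for that to work, which is the practical meaning of "no schema".
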
When Big Data meet Python: Disco
http://discoproject.org/
• Distributed computing:
   – MapReduce
   – Disco distributed file system
• Write code in Python
   – Easy and fast to profile
   – Easy and fast to debug

When Big Data meet Python: pandas
http://pandas.pydata.org/
• Data analysis library
• Data structures for fast data manipulation
   – Slicing
   – Indexing
   – Subsetting
• Handling missing data
• Aggregation
• Time series

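The pandas features listed above can each be shown in one line. The following is a minimal sketch that assumes pandas is installed; the data frame contents (cities and review counts) are invented for illustration.

```python
import pandas as pd

# Made-up daily review counts; one value is missing on purpose.
df = pd.DataFrame(
    {
        "city": ["Taipei", "Taipei", "Tainan", "Tainan"],
        "reviews": [10.0, None, 4.0, 6.0],
    },
    index=pd.date_range("2012-08-01", periods=4, freq="D"),
)

first_two = df.iloc[:2]                                    # slicing by position
taipei = df[df["city"] == "Taipei"]                        # subsetting with a mask
aug_third = df.loc[pd.Timestamp("2012-08-03"), "reviews"]  # label-based indexing
filled = df["reviews"].fillna(0.0)                         # handling missing data
per_city = df.groupby("city")["reviews"].sum()             # aggregation (NaN skipped)
every_two_days = df["reviews"].resample("2D").sum()        # time series resampling

if __name__ == "__main__":
    print(per_city)
    print(every_two_days)
```

The date index is what enables the time-series operation on the last line; the other operations would work the same way with a plain integer index.
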
When Big Data meet Python: Statsmodels
http://statsmodels.sourceforge.net/
• Statistical analysis
  • Statistical models
  • Fit data with model
  • Statistical tests
  • Data exploration
  • Time series analysis

When Big Data meet Python: scikit-learn
http://scikit-learn.org/
• Machine learning algorithms
• Supervised learning
• Unsupervised learning
• Dataset
    • Preprocessing
    • Feature extraction
• Model
    • Selection
    • Pipeline

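Feature extraction, a supervised model and a pipeline fit together as in the short sketch below. It assumes scikit-learn is installed; the four training reviews and their labels are invented, and a bag-of-words naive Bayes classifier is just one possible choice of model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set of restaurant reviews.
texts = [
    "delicious noodles great service",
    "great soup delicious tea",
    "terrible service cold soup",
    "cold noodles terrible tea",
]
labels = ["positive", "positive", "negative", "negative"]

model = Pipeline([
    ("features", CountVectorizer()),  # feature extraction: bag of words
    ("classifier", MultinomialNB()),  # supervised learning algorithm
])
model.fit(texts, labels)

# Words unseen during training (such as "and") are simply ignored.
prediction = model.predict(["delicious and great"])[0]

if __name__ == "__main__":
    print(prediction)
```

Wrapping both steps in a `Pipeline` means the same vocabulary learned during `fit` is applied automatically at prediction time.
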
When Big Data meet Python: NLTK
NLTK: Natural Language Toolkit
http://nltk.org/
• Natural language processing
• Annotated corpora and resources
Information Extraction Work Flow:
Sentence Segmentation → Tokenization → POS tagging
  → Named Entity Recognition → Relation Recognition

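The first two stages of the information extraction work flow, sentence segmentation and tokenization, can be approximated with regular expressions. This sketch uses only the standard library and is a naive stand-in, not NLTK code; NLTK provides trained components for these stages and for the later ones (POS tagging, named entity and relation recognition). The function names and sample text are assumptions.

```python
import re


def segment_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Naive tokenization: words and punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Made-up input text.
TEXT = "The noodles were great. Was the tea too sweet?"

if __name__ == "__main__":
    for sentence in segment_sentences(TEXT):
        print(tokenize(sentence))
```

Rules this simple break on abbreviations such as "Dr." or "e.g.", which is one reason a toolkit with trained models is preferable for real text.
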
When Big Data meet Python: Matplotlib
http://matplotlib.sourceforge.net/
• Plotting
   – Histograms
   – Power spectra
   – Bar charts
   – Error charts
   – Scatter plots
• Full control over plotting details

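A histogram, the first plot type in the list, takes only a few calls. The sketch below assumes Matplotlib is installed and renders off-screen into an in-memory PNG so that it runs without a display; the ratings data and the axis labels are made up.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; no display is needed
import matplotlib.pyplot as plt

# Made-up review ratings on a 1 to 5 scale.
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5]

fig, ax = plt.subplots()
# One bin per rating value: edges at 0.5, 1.5, ..., 5.5.
counts, bin_edges, _ = ax.hist(ratings, bins=5, range=(0.5, 5.5))
ax.set_xlabel("Rating")
ax.set_ylabel("Number of reviews")
ax.set_title("Distribution of review ratings")

# Save the figure to memory instead of a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

Every element touched here (bins, labels, title, output format) is set explicitly, which hints at the level of control over plotting details that the library exposes.
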
When Big Data meet Python: NetworkX
http://networkx.lanl.gov/
• Graph algorithms and visualization
• Draw graph with layout:
    – Circular
    – Random
    – Spectral
    – Spring
    – Shell
    – Graphviz

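The two sides of NetworkX, graph algorithms and layouts for drawing, can be sketched as follows. The example assumes NetworkX (and its NumPy dependency) is installed; the restaurant and dish names are invented for illustration.

```python
import networkx as nx

# Made-up graph: restaurants linked to the dishes reviewers mention.
graph = nx.Graph()
graph.add_edges_from([
    ("Noodle House", "beef noodles"),
    ("Noodle House", "dumplings"),
    ("Tea Corner", "bubble tea"),
    ("Tea Corner", "dumplings"),
])

# A graph algorithm: shortest path between two restaurants.
path = nx.shortest_path(graph, "Noodle House", "Tea Corner")

# Layouts map each node to an (x, y) position that a drawing call can use.
circular = nx.circular_layout(graph)
spring = nx.spring_layout(graph, seed=42)  # seed makes the result repeatable

if __name__ == "__main__":
    print(path)
    print(sorted(spring))
```

A layout is just a dictionary of node positions, so the same graph can be drawn several ways by computing a different layout and passing it to the drawing function.
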
聚寶評 (www.ezpao.com)
Food search engine
Searches food reviews from major blogs

聚寶評 (www.ezpao.com)
Semantic analysis search engine

• Review topic analysis
• Analysis of dishes shared by users
• Positive/negative review analysis

Thank you for your attention.
Q&A
We are hiring!
• Core engine algorithm R&D engineer
• System R&D engineer
• Web application R&D engineer

Oxygen Intelligence Taiwan Limited
引京聚點 知識結構搜索股份有限公司
• About the company: http://www.ezpao.com/about/
• Job openings: http://www.ezpao.com/join/
• Please send your résumé to jimmy.lai@oi-sys.com

When big data meet python @ COSCUP 2012
