Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib
Jimmy Lai
r97922028 [at] ntu.edu.tw
http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
2013/02/17
Critical Technologies for Big Data Analysis
• Data sources: User Generated Content, Machine Generated Data
• Pipeline stages: Collecting, Storage, Computing, Analysis, Visualization
• Languages: C/JAVA (Infrastructure), Python/R (Analysis), Javascript (Visualization)
• Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
Fast prototyping - IPython Notebook
• Write Python code in the browser:
  – Exploit the remote server's resources
  – View graphical results in the web page
  – Sketch pieces of code as blocks
  – Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for a fuller introduction.




                           Text Classification in Python               3
Demo Code
• Demo Code:
  ipython_demo/text_classification_demo.ipynb
  in https://bitbucket.org/noahsark/slideshare
• IPython Notebook:
  – Install
  $ pip install ipython
  – Run (from the ipython_demo directory)
  $ ipython notebook --pylab=inline
  – Open the notebook in a browser, e.g.
     http://127.0.0.1:8888

Machine learning classification
•   $X_i = [x_1, x_2, \ldots, x_n],\ x_n \in \mathbb{R}$
•   $y_i \in \mathbb{N}$
•   $\mathrm{dataset} = (X, Y)$
•   $\mathrm{classifier}\ f: y_i = f(X_i)$




Text classification
(Cycle diagram with four stages)
• Feature Generation
• Feature Selection
• Classification Model Training
• Model Parameter Tuning
Dataset: 20 newsgroups dataset

Structured Data:
From: zyeh@caspian.usc.edu (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu

Text:
In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu
(Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>

I agree with you. Of cause I'll try to be a daemon :-)

Yeh
USC
Dataset in sklearn
• sklearn.datasets
  – Toy datasets
  – Download data from the http://mldata.org repository
• Data format of a classification problem
  – Dataset
     • data: [raw_data or numerical]
     • target: [int]
     • target_names: [str]
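
The data / target / target_names layout can be inspected on one of the bundled toy datasets. A minimal sketch, assuming scikit-learn is installed; `load_iris` is chosen here only because it ships with the library and needs no download, while the slides themselves work with the 20 newsgroups data.

```python
from sklearn.datasets import load_iris

# Load a bundled toy dataset; no network access is needed.
dataset = load_iris()

# data: one row of numerical features per sample
print(dataset.data.shape)
# target: one integer class label per sample
print(dataset.target[:5])
# target_names: the string name behind each integer label
print(list(dataset.target_names))
```

A text dataset follows the same layout, except that `data` holds raw strings instead of numbers.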


Feature extraction from structured data (1/2)
• Count the frequency of each header keyword and select the most frequent keywords as features:
  ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
• E.g.
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Organization: University of Maryland, College Park
Distribution: None
Lines: 15

Keyword        Count
Distribution   2549
Summary        397
Disclaimer     125
File           257
Expires        116
Subject        11612
From           11398
Keywords       943
Originator     291
Organization   10872
Lines          11317
Internet       140
To             106
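
The keyword counting on this slide can be sketched with the standard library alone. This is an illustrative sketch, not the demo notebook's code: the two shortened messages, the regular expression, and the helper name `count_header_keywords` are assumptions, as is the rule that the header block ends at the first blank line.

```python
import re
from collections import Counter

# Two shortened messages in the 20 newsgroups header style (illustrative only).
messages = [
    "From: lerxst@wam.umd.edu (where's my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Organization: University of Maryland, College Park\n"
    "Distribution: None\n"
    "Lines: 15\n"
    "\n"
    "(body text)\n",
    "From: zyeh@caspian.usc.edu (zhenghao yeh)\n"
    "Subject: Re: Newsgroup Split\n"
    "Organization: University of Southern California, Los Angeles, CA\n"
    "Lines: 18\n"
    "\n"
    "(body text)\n",
]

# A header keyword is a capitalised word at the start of a line, followed by ':'.
HEADER_RE = re.compile(r"^([A-Z][\w-]*):", re.MULTILINE)


def count_header_keywords(docs):
    """Count occurrences of each header keyword across all documents."""
    counts = Counter()
    for doc in docs:
        header = doc.split("\n\n", 1)[0]  # assumed: headers end at the first blank line
        counts.update(HEADER_RE.findall(header))
    return counts


keyword_counts = count_header_keywords(messages)
# Keep the most frequent keywords as features.
top_keywords = [keyword for keyword, _ in keyword_counts.most_common(4)]
print(keyword_counts)
print(top_keywords)
```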



Feature extraction from structured data (2/2)
• Separate structured data and text data
   – Text data starts from the “Lines:” header
• Transform the token dictionaries into a numerical matrix with sklearn.feature_extraction.DictVectorizer
• E.g.
  [{'a': 1, 'b': 1}, {'c': 1}] =>
  [[1, 1, 0], [0, 0, 1]]
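
The dictionary-to-matrix example above can be reproduced directly. A minimal sketch, assuming scikit-learn is installed; `sparse=False` is chosen only to make the printed result easy to read.

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per document: feature name -> value.
token_dicts = [{"a": 1, "b": 1}, {"c": 1}]

# sparse=False returns a dense NumPy array instead of a sparse matrix.
vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(token_dicts)

# vocabulary_ maps each feature name to its column index.
print(vectorizer.vocabulary_)
print(matrix)
```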




Text Feature extraction in sklearn
• sklearn.feature_extraction.text
• CountVectorizer
  – Transform articles into a token-count matrix
• TfidfVectorizer
  – Transform articles into a token-TFIDF matrix
• Usage:
  – fit(): build the token dictionary from the dataset
  – transform(): generate the numerical matrix
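
The fit() / transform() usage can be sketched on a toy corpus. The two sentences are illustrative assumptions, not data from the demo notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

articles = ["the cat sat", "the dog sat down"]  # toy corpus, illustrative only

# fit() builds the token dictionary; transform() produces the matrix.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(articles)
counts = count_vectorizer.transform(articles)

print(sorted(count_vectorizer.vocabulary_))  # tokens, in column order
print(counts.toarray())                      # one row per article

# fit_transform() does both steps in one call; values are TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform(articles)
print(tfidf.shape)
```

Both vectorizers return sparse matrices, so `toarray()` is only sensible for small examples like this one.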

Text Feature extraction
• Analyzer
  – Preprocessor: str -> str
     • Default: lowercase
     • Extra: strip_accents – handle unicode chars
  – Tokenizer: str -> [str]
     • Default: re.findall(r"(?u)\b\w\w+\b", string)
  – Analyzer: str -> [str]
     1. Call preprocessor and tokenizer
     2. Filter stopwords
     3. Generate n-gram tokens
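
The three stages can be inspected one at a time through the vectorizer's builder methods. A minimal sketch, assuming scikit-learn is installed; the sample sentence and the parameter choices (English stop words, unigrams plus bigrams) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Lowercasing, accent stripping, English stop words, unigrams and bigrams.
vectorizer = CountVectorizer(
    lowercase=True,
    strip_accents="unicode",
    stop_words="english",
    ngram_range=(1, 2),
)

preprocess = vectorizer.build_preprocessor()  # str -> str
tokenize = vectorizer.build_tokenizer()       # str -> [str]
analyze = vectorizer.build_analyzer()         # str -> [str], all steps combined

text = "The Café Menu"
cleaned = preprocess(text)   # lowercased, accent removed
tokens = tokenize(cleaned)   # split into words of two or more characters
features = analyze(text)     # stop words removed, then n-grams generated

print(cleaned)
print(tokens)
print(features)
```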

Feature Selection
• Decrease the number of features:
  – Reduce resource usage for faster learning
  – Remove the most common and the rarest tokens (words that carry little information):
     • Vectorizer parameters:
        – max_df
        – min_df
        – max_features
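
The effect of these parameters can be seen on a toy corpus. A minimal sketch with made-up documents and thresholds; suitable values for real data would need tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "apple" is in every document, "cherry" and "date" in only one.
articles = ["apple banana", "apple cherry", "apple banana date", "apple"]

# max_df as a float is a proportion of documents; min_df as an int is a document count.
pruned = CountVectorizer(max_df=0.9, min_df=2).fit(articles)
print(sorted(pruned.vocabulary_))

# max_features keeps only the N most frequent tokens across the corpus.
top_one = CountVectorizer(max_features=1).fit(articles)
print(sorted(top_one.vocabulary_))
```

Here `max_df=0.9` drops "apple" (present in all four documents) and `min_df=2` drops "cherry" and "date" (present in one document each).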




Classification Model Training
• Common classifiers in sklearn:
  – sklearn.linear_model
  – sklearn.svm
• Usage:
  – fit(X, Y): train the model
  – predict(X): get predicted Y
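
The fit / predict usage can be sketched end to end on a few sentences. The texts, the labels, and the choice of `LogisticRegression` are assumptions for illustration; any classifier in sklearn.linear_model or sklearn.svm exposes the same two methods.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training set (illustrative only): 0 = graphics, 1 = autos.
train_texts = [
    "rendering polygons with a graphics card",
    "image rendering algorithms and graphics hardware",
    "the car engine needs new oil",
    "which car has the best engine",
]
train_labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

classifier = LogisticRegression()
classifier.fit(X_train, train_labels)      # train the model

test_texts = ["graphics rendering hardware", "car engine oil"]
X_test = vectorizer.transform(test_texts)  # reuse the fitted token dictionary
predicted = classifier.predict(X_test)     # get predicted labels
print(predicted)
```

Note that the test texts go through `transform()` only, so they are encoded with the dictionary learned from the training texts.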




Cross Validation
• When tuning model parameters, use each article alternately as
  training and testing data so that the parameters are not
  fitted to a few specific articles.
  – from sklearn.cross_validation import KFold
  – for train_index, test_index in KFold(10, 2):
     • train_index = [5 6 7 8 9]
     • test_index = [0 1 2 3 4]
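
The import shown above matches scikit-learn at the time of the talk. In later releases `sklearn.cross_validation` was removed and `KFold` lives in `sklearn.model_selection`, where the indices come from a `split()` call. A minimal sketch of the same two-fold split over ten samples, using that newer API; the placeholder articles are an assumption.

```python
from sklearn.model_selection import KFold

articles = ["article %d" % i for i in range(10)]  # placeholder data

# Two folds over ten samples, without shuffling.
folds = list(KFold(n_splits=2).split(articles))
for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

Each sample appears in the test set of exactly one fold and in the training set of the other.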


Performance Evaluation
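  • $\mathrm{precision} = \frac{tp}{tp + fp}$
  • $\mathrm{recall} = \frac{tp}{tp + fn}$
  • $\mathrm{f1score} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
  • sklearn.metrics
    – precision_score
    – recall_score
    – f1_score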




Source: http://en.wikipedia.org/wiki/Precision_and_recall
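
The three formulas can be checked by hand against the sklearn.metrics functions. A minimal sketch on made-up binary labels, where label 1 is treated as the positive class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]  # toy labels, illustrative only
y_pred = [1, 1, 0, 1, 0, 0]

# Count the outcomes for the positive class (label 1) by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The library functions should agree with the manual computation.
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```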
Visualization
1. Matplotlib
2. plot() function of Series, DataFrame
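
The plot() method of a pandas Series can be sketched with the header keyword counts reported earlier in these slides. This assumes pandas and matplotlib are installed; the chart type, title, and axis label are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import pandas as pd

# Counts of the five selected header keywords, taken from the slides.
keyword_counts = pd.Series(
    {
        "Subject": 11612,
        "From": 11398,
        "Lines": 11317,
        "Organization": 10872,
        "Distribution": 2549,
    }
)

# Series.plot() draws with matplotlib and returns the Axes object.
ax = keyword_counts.sort_values().plot(kind="barh", title="Header keyword counts")
ax.set_xlabel("Count")
```

Inside a notebook started with inline plotting, the backend line can be dropped and the figure appears below the cell.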




Experiment Result




• Future work:
  – Feature selection by statistics or dimension reduction
  – Parameter tuning
  – Ensemble models

Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

  • 1. Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib
      Jimmy Lai, r97922028 [at] ntu.edu.tw
      http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
      2013/02/17
  • 2. Critical Technologies for Big Data Analysis
      • Data sources: User Generated Content, Machine Generated Data
      • Stages: Collecting, Storage, Computing, Analysis, Visualization
      • Languages: C/JAVA (Infrastructure), Python/R (Analysis), Javascript (Visualization)
      • Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
  • 3. Fast prototyping - IPython Notebook
      • Write Python code in the browser:
        – Exploit the remote server's resources
        – View graphical results in the web page
        – Sketch code pieces as blocks
        – Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for a fuller introduction.
  • 4. Demo Code
      • Demo code: ipython_demo/text_classification_demo.ipynb in https://bitbucket.org/noahsark/slideshare
      • IPython Notebook:
        – Install
          $ pip install ipython
        – Execution (under the ipython_demo dir)
          $ ipython notebook --pylab=inline
        – Open the notebook in a browser, e.g. http://127.0.0.1:8888
  • 5. Machine learning classification
      • $X_i = [x_1, x_2, \dots, x_n],\ x_n \in \mathbb{R}$
      • $y_i \in \mathbb{N}$
      • $\mathit{dataset} = (X, Y)$
      • $\mathit{classifier}\ f: y_i = f(X_i)$
  • 6. Text classification
      • Feature Generation
      • Feature Selection
      • Classification Model Training
      • Model Parameter Tuning
  • 7. Dataset: 20 newsgroups dataset
      • Structured Data:
          From: [email protected] (zhenghao yeh)
          Subject: Re: Newsgroup Split
          Organization: University of Southern California, Los Angeles, CA
          Lines: 18
          Distribution: world
          NNTP-Posting-Host: caspian.usc.edu
      • Text:
          In article <[email protected]>, [email protected] (Chris Herringshaw) writes:
          |> Concerning the proposed newsgroup split, I personally am not in favor of
          |> doing this. I learn an awful lot about all aspects of graphics by reading
          |> this group, from code to hardware to algorithms. I just think making 5
          |> different groups out of this is a wate, and will only result in a few posts
          |> a week per group. I kind of like the convenience of having one big forum
          |> for discussing all aspects of graphics. Anyone else feel this way?
          |> Just curious.
          |>
          |>
          |> Daemon
          |>
          I agree with you. Of cause I'll try to be a daemon :-)
          Yeh
          USC
  • 8. Dataset in sklearn
      • sklearn.datasets
        – Toy datasets
        – Download data from the http://mldata.org repository
      • Data format of a classification problem
        – Dataset
          • data: [raw_data or numerical]
          • target: [int]
          • target_names: [str]
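The data format above can be inspected on any bundled dataset. The following is a minimal sketch that uses the toy iris dataset, an assumption chosen only because it loads without a network connection; the slides themselves work with 20 newsgroups, where `data` holds raw message strings instead of numbers.

```python
from sklearn.datasets import load_iris

# Toy dataset bundled with scikit-learn (no download needed).
dataset = load_iris()

print(dataset.data[:2])       # numerical feature rows
print(dataset.target[:5])     # integer class ids, one per row of data
print(dataset.target_names)   # class names, indexed by class id

# Map the first sample's integer target back to its class name.
first_label = dataset.target_names[dataset.target[0]]
print(first_label)
```

The same three attributes (`data`, `target`, `target_names`) are what the later feature extraction and training steps consume.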
  • 9. Feature extraction from structured data (1/2)
      • Count the frequency of each keyword and select keywords as features: ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
      • E.g.
          From: [email protected] (where's my thing)
          Subject: WHAT car is this!?
          Organization: University of Maryland, College Park
          Distribution: None
          Lines: 15
      • Keyword counts:
          Keyword         Count
          Distribution     2549
          Summary           397
          Disclaimer        125
          File              257
          Expires           116
          Subject         11612
          From            11398
          Keywords          943
          Originator        291
          Organization    10872
          Lines           11317
          Internet          140
          To                106
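One way to obtain such keyword counts is sketched below. The two messages, the regular expression, and the rule of scanning only the block before the first blank line are all assumptions made for illustration; the slides do not say how the original counts were produced.

```python
import re
from collections import Counter

# Two hypothetical messages in the 20 newsgroups header style.
messages = [
    "From: alice@example.com\nSubject: WHAT car is this!?\n"
    "Organization: Example University\nLines: 15\n\nBody text here.",
    "From: bob@example.com\nSubject: Re: Newsgroup Split\n"
    "Distribution: world\nLines: 18\n\nTo: nobody in particular.",
]

# A header keyword is a capitalised word at the start of a line, followed by ':'.
header_pattern = re.compile(r"^([A-Z][\w-]*):", re.MULTILINE)

keyword_counts = Counter()
for message in messages:
    # Only scan the header block, i.e. everything before the first blank line.
    header_block = message.split("\n\n", 1)[0]
    keyword_counts.update(header_pattern.findall(header_block))

print(keyword_counts.most_common())
```

Keywords that occur in nearly every message (here `From`, `Subject`, `Lines`) are natural candidates for features, which matches the selection shown on the slide.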
  • 10. Feature extraction from structured data (2/2)
      • Separate structured data and text
        – Text data starts from the "Lines:" header
      • Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer
      • E.g. [{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
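The example on this slide can be reproduced directly. A minimal sketch, assuming dense output is wanted for readability (the default is a SciPy sparse matrix):

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per article: feature name -> value.
rows = [{"a": 1, "b": 1}, {"c": 1}]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(rows)

print(vectorizer.feature_names_)  # column order: ['a', 'b', 'c']
print(matrix.tolist())            # [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

String values are also accepted: `DictVectorizer` one-hot encodes them as `name=value` columns, which suits header fields such as `Organization`.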
  • 11. Text Feature extraction in sklearn
      • sklearn.feature_extraction.text
      • CountVectorizer
        – Transform articles into a token-count matrix
      • TfidfVectorizer
        – Transform articles into a token-TFIDF matrix
      • Usage:
        – fit(): construct the token dictionary from a given dataset
        – transform(): generate the numerical matrix
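The fit/transform split can be sketched as follows. The three training sentences and the new sentence are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_docs = ["the cat sat", "the dog sat", "the dog barked"]
new_docs = ["the cat barked loudly"]

count_vec = CountVectorizer()
count_vec.fit(train_docs)                # build the token -> column dictionary
counts = count_vec.transform(new_docs)   # sparse token-count matrix

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(train_docs)  # fit and transform in one call

print(sorted(count_vec.vocabulary_))  # ['barked', 'cat', 'dog', 'sat', 'the']
print(counts.toarray())               # [[1 1 0 0 1]]
print(tfidf.shape)                    # (3, 5)
```

Tokens that were not seen during `fit()` (here "loudly") are ignored by `transform()`, so training and test matrices always share the same columns.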
  • 12. Text Feature extraction
      • Analyzer
        – Preprocessor: str -> str
          • Default: lowercase
          • Extra: strip_accents – handle unicode chars
        – Tokenizer: str -> [str]
          • Default: re.findall(r"(?u)\b\w\w+\b", string)
        – Analyzer: str -> [str]
          1. Call preprocessor and tokenizer
          2. Filter stopwords
          3. Generate n-gram tokens
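The three stages can be examined one at a time through the builder methods of a vectorizer. This is a sketch; the sample sentence and the chosen options (English stop words, unigrams plus bigrams) are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    lowercase=True,           # preprocessor default
    strip_accents="unicode",  # extra preprocessing step
    stop_words="english",     # built-in English stop word list
    ngram_range=(1, 2),       # unigrams and bigrams
)

preprocess = vectorizer.build_preprocessor()  # str -> str
tokenize = vectorizer.build_tokenizer()       # str -> [str]
analyze = vectorizer.build_analyzer()         # str -> [str], all steps combined

text = "The Café serves fresh espresso"
print(preprocess(text))  # lowercased, accents stripped
print(tokenize(text))    # raw tokens of two or more word characters
print(analyze(text))     # preprocessed, stop words removed, n-grams generated
```

Because stop words are filtered before n-grams are generated, a bigram can join two words that were not adjacent in the original text.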
  • 14. Feature Selection
      • Decrease the number of features:
        – Reduce resource usage for faster learning
        – Remove the most common and the rarest tokens (words that carry little information)
      • Parameters for the Vectorizer:
        – max_df
        – min_df
        – max_features
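The effect of these three parameters can be seen on a tiny corpus. A minimal sketch with made-up documents; the thresholds are arbitrary and would need tuning on real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "apple cherry", "apple banana date", "apple"]

# Document frequencies: apple=4, banana=2, cherry=1, date=1.
pruned = CountVectorizer(
    max_df=0.9,  # float: drop tokens found in more than 90% of documents (apple)
    min_df=2,    # int: drop tokens found in fewer than 2 documents (cherry, date)
).fit(docs)
print(sorted(pruned.vocabulary_))  # ['banana']

# max_features keeps the N tokens with the highest total count in the corpus.
top2 = CountVectorizer(max_features=2).fit(docs)
print(sorted(top2.vocabulary_))    # ['apple', 'banana']
```

A float value for `max_df` or `min_df` is read as a proportion of documents, an integer as an absolute document count.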
  • 15. Classification Model Training
      • Common classifiers in sklearn:
        – sklearn.linear_model
        – sklearn.svm
      • Usage:
        – fit(X, Y): train the model
        – predict(X): get the predicted Y
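Both modules expose the same fit/predict interface, so classifiers can be swapped freely. The sketch below trains one classifier from each module on a handful of invented sentences; the choice of `LogisticRegression` and `LinearSVC` is an assumption, since the slide names only the modules.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Hypothetical training data: 0 = graphics, 1 = cars.
train_texts = [
    "rendering polygons with a graphics card",
    "image formats and graphics algorithms",
    "the engine of this car is loud",
    "which car has the best brakes",
]
train_labels = [0, 0, 1, 1]
test_texts = ["a new graphics algorithm", "my car engine stalled"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # reuse the fitted dictionary

predictions = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X_train, train_labels)                  # train the model
    predictions[type(model).__name__] = model.predict(X_test)

print(predictions)
```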
  • 16. Cross Validation
      • When tuning the model's parameters, use each article alternately as training and testing data, so that the parameters are not tailored to some specific articles.
        – from sklearn.cross_validation import KFold
        – for train_index, test_index in KFold(10, 2):
          • train_index = [5 6 7 8 9]
          • test_index = [0 1 2 3 4]
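The `sklearn.cross_validation` module used on this slide was removed in scikit-learn 0.20; in current releases `KFold` lives in `sklearn.model_selection` and takes the data in `split()` rather than a sample count in the constructor. A sketch of the same 10-sample, 2-fold split under that newer API:

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # stand-in for 10 articles

folds = []
for train_index, test_index in KFold(n_splits=2).split(samples):
    folds.append((train_index.tolist(), test_index.tolist()))
    print("train:", train_index, "test:", test_index)
```

Without shuffling, the first fold tests on indices 0 to 4 and trains on 5 to 9, and the second fold swaps the two halves, so every article is used for testing exactly once.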
  • 17. Performance Evaluation
      • sklearn.metrics
        – precision_score
        – recall_score
        – f1_score
      • $\mathit{precision} = \frac{tp}{tp + fp}$
      • $\mathit{recall} = \frac{tp}{tp + fn}$
      • $\mathit{f1score} = 2 \cdot \frac{\mathit{precision} \times \mathit{recall}}{\mathit{precision} + \mathit{recall}}$
      Source: http://en.wikipedia.org/wiki/Precision_and_recall
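The three metrics can be checked against the formulas by hand. In this sketch the label vectors are invented so that tp = 2, fp = 1 and fn = 1, which makes all three scores equal to 2/3.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]  # tp = 2, fn = 1, fp = 1, tn = 2

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2 / 3
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 2 / 3
f1 = f1_score(y_true, y_pred)                # 2 * p * r / (p + r) = 2 / 3

print(precision, recall, f1)
```

For a multi-class problem such as 20 newsgroups, these functions also need an `average` argument (for example `"macro"`) to combine the per-class scores.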
  • 18. Visualization
      1. Matplotlib
      2. plot() method of pandas Series and DataFrame
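As a sketch of the second option, the header keyword counts from slide 9 can be drawn as a bar chart straight from a pandas `Series`. The `Agg` backend and the output file name `keyword_counts.png` are assumptions made so the script runs without a display; inside a notebook the figure would render inline instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display required
import matplotlib.pyplot as plt
import pandas as pd

# Counts of the five selected header keywords, taken from slide 9.
counts = pd.Series({
    "Subject": 11612,
    "From": 11398,
    "Lines": 11317,
    "Organization": 10872,
    "Distribution": 2549,
})

# Series.plot() delegates to matplotlib and returns the Axes it drew on.
ax = counts.plot(kind="bar", title="Header keyword counts")
ax.set_ylabel("count")
plt.tight_layout()
plt.savefig("keyword_counts.png")  # hypothetical output file
```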
  • 19. Experiment Result
      • Future work:
        – Feature selection by statistics or dimensionality reduction
        – Parameter tuning
        – Ensemble models
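Two of the future-work items, statistical feature selection and parameter tuning, can be combined in a single scikit-learn pipeline. The sketch below is not taken from the demo notebook: the chi-squared selector, the logistic regression classifier, the parameter grid, and the eight toy sentences are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical corpus: 0 = graphics, 1 = cars.
texts = [
    "rendering polygons on a graphics card",
    "image formats and compression algorithms",
    "ray tracing shading and lighting",
    "drawing lines with a plotting library",
    "the engine of this car is loud",
    "which tires have the best grip",
    "brakes and suspension need service",
    "fuel economy of a small sedan",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),        # feature generation
    ("select", SelectKBest(chi2, k=5)),  # feature selection by chi-squared statistic
    ("clf", LogisticRegression()),       # classification model
])

# Parameter names are "<step name>__<parameter>".
param_grid = {"select__k": [3, 5], "clf__C": [0.1, 1.0, 10.0]}

# Every parameter combination is scored with 2-fold cross validation.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

Putting the vectorizer and selector inside the pipeline means they are refitted on each training fold, so the test fold never influences the selected features.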