Practical Data Analysis in Python
Hilary Mason
@hmason
www.hilarymason.com
hilary@path101.com
Data is ubiquitous.
The ability and tools to use it are not.
(Focused) Data == Intelligence
Data Analysis on the Web
Data items change rapidly.
Data items are not independent.
There’s a lot of semi-structured data around.
There’s a LOT of data around.
==
Too many problems, few tools, and few experts.
Entity Disambiguation
This is important.
ME
UGLY HAG
Company disambiguation is a very common
problem – Are “Microsoft”, “Microsoft
Corporation”, and “MS” the same company?
This is a hard problem.
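A minimal sketch of the naive approach to the company-name question above, assuming a hand-built alias table and suffix list (both hypothetical, neither from the talk). It handles the three example strings, but it also shows why the problem is hard: "MS" could equally mean Morgan Stanley, and no lookup table resolves that without context.

```python
import re

# Hypothetical alias table; a real system would need a much larger, curated one.
ALIASES = {"ms": "microsoft"}
# Legal suffixes to strip (illustrative, not exhaustive).
SUFFIXES = {"corporation", "corp", "inc", "incorporated", "ltd", "llc", "co"}


def normalize_company(name):
    """Map a raw company string to a crude canonical key."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Drop trailing legal suffixes such as "Corporation" or "Inc".
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    key = " ".join(tokens)
    # Fall back to the key itself when no alias is known.
    return ALIASES.get(key, key)


names = ["Microsoft", "Microsoft Corporation", "MS"]
keys = {normalize_company(n) for n in names}
print(keys)
```

All three names collapse to a single key here only because the alias for "MS" was written by hand.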
SPAM sucks
Classification
Document classification.
Image recognition.
Topic recognition.
Text Parsing
Recommendation Systems
Product recommendations.
Disease predictions.
Behavior analysis.
IEEE Tag Clustering
immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices
Python for Data Analysis
import why_python_is_awesome
Python is readable.
Easy to transition from Matlab or R.
Numerical computing support.
Growing set of machine learning libraries.
Libraries
NLTK (Natural Language Toolkit) – www.nltk.org
mlpy (Machine Learning PY) – mlpy.fbk.eu
numpy & scipy – scipy.org
An EC2 AMI provisioned with all of the toys you
need:
http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/
MachetEC2
Supervised Classification
[Diagram: Training Data -> Feature Extractor -> Trained Classifier; Text -> Feature Extractor -> Trained Classifier -> Spam / Not Spam]
Data: Tweets
Hand-classified. For example, some spam:
| don't disrespect me. I just wanted yall to get a head start so don't feel bad when I have more followers in two days. http://xyyx.eu/a1ha |
| oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb |
| My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht |
| Yes, Twitter is doing some Follower/Following count corrections. Get it back at: http://xyyx.eu/a1h8 |
| man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4 |
Features
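def document_features(self, document):
    # document is a list of tokens; a set gives fast membership tests
    document_words = set(document)
    features = {}
    # one boolean feature per word in the known vocabulary
    for word in self.word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features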
Break tweets into lists of relevant words.
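A standalone sketch of the same feature extractor, runnable outside a class. The tokenizer regex and the three-word vocabulary are assumptions for illustration; the talk does not say how tweets were tokenized or which words were kept.

```python
import re


def tokenize(text):
    """Split a tweet into lowercase words and single punctuation marks (assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def document_features(document, word_features):
    """Return one boolean 'contains(word)' feature per vocabulary word."""
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in word_features}


# Hypothetical vocabulary; in practice it would be built from the training tweets.
word_features = ["followers", "tool", "!"]

tweet = "man if i see one more person cry about losing followers!!!"
features = document_features(tokenize(tweet), word_features)
print(features)
```

Treating punctuation marks as tokens matters for this data set: several of the most informative features in the results are single characters.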
Naïve Bayesian Classifier
P(A|B) = the conditional probability of A given B
http://yudkowsky.net/rational/bayes
http://blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/
classifier = nltk.NaiveBayesClassifier.train(train_set)
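To make the conditional probability concrete, here is a small worked example of Bayes' theorem for a single feature. The numbers are invented for illustration and are not from the talk's data set; a naive Bayes classifier combines many such terms under the assumption that features are independent given the label.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) via Bayes' theorem for a binary event A."""
    # Total probability of observing B, summed over A and not-A.
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence


# Made-up numbers: 20% of tweets are spam; the word "followers"
# appears in 50% of spam tweets and 5% of non-spam tweets.
p_spam_given_word = posterior(prior_a=0.2, p_b_given_a=0.5, p_b_given_not_a=0.05)
print(round(p_spam_given_word, 3))
```

For the NLTK call above, `train_set` is a list of `(feature_dict, label)` pairs, where each feature dictionary is the kind produced by `document_features`.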
Classifier Accuracy
Use a hand-classified test set to see the accuracy
of the classifier:
nltk.classify.accuracy(classifier, test_set)
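Accuracy here is simply the fraction of hand-labeled test examples the classifier gets right. A plain-Python sketch of that calculation follows; the one-rule toy classifier and the four test examples are hypothetical stand-ins, not the talk's model or data.

```python
def accuracy(classify, test_set):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return correct / len(test_set)


def toy_classify(features):
    """Hypothetical one-rule classifier used only to exercise accuracy()."""
    return "spam" if features.get("contains(followers)") else "not_s"


# Made-up hand-labeled test set.
test_set = [
    ({"contains(followers)": True}, "spam"),
    ({"contains(followers)": False}, "not_s"),
    ({"contains(followers)": True}, "not_s"),  # the toy rule gets this one wrong
    ({"contains(followers)": False}, "not_s"),
]

score = accuracy(toy_classify, test_set)
print(score)
```

The test set should be kept separate from the training set; measuring accuracy on tweets the classifier was trained on overstates how well it generalizes.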
Feature Relevance
contains(')        = True   not_s : spam  = 53.6 : 1.4
contains(")        = True   not_s : spam  = 32.2 : 1.1
contains(#)        = True   not_s : spam  = 22.0 : 1.0
contains(!)        = True   not_s : spam  = 10.8 : 1.0
contains(*)        = True   spam : not_s  =  7.4 : 1.0
contains(=)        = True   not_s : spam  =  5.5 : 1.0
contains(i)        = False  spam : not_s  =  5.2 : 1.0
contains(?)        = True   not_s : spam  =  2.4 : 1.0
contains(:)        = True   spam : not_s  =  2.3 : 1.0
contains(&)        = True   not_s : spam  =  1.8 : 1.0
contains(;)        = True   not_s : spam  =  1.6 : 1.0
contains($)        = True   spam : not_s  =  1.5 : 1.0
contains(u)        = True   spam : not_s  =  1.5 : 1.0
contains(2.0)      = False  not_s : spam  =  1.4 : 1.0
contains(saw)      = False  not_s : spam  =  1.4 : 1.0
contains(noble)    = False  not_s : spam  =  1.4 : 1.0
contains(sound)    = False  not_s : spam  =  1.3 : 1.0
contains(approach) = False  not_s : spam  =  1.3 : 1.0
contains(finally)  = False  not_s : spam  =  1.3 : 1.0
contains(more)     = False  spam : not_s  =  1.3 : 1.0
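Each row above can be read as a likelihood ratio: how much more often a feature value shows up under one label than under the other. A rough sketch of that calculation from raw counts is below. The counts are invented, and the add-0.5 smoothing is an assumption made here to avoid division by zero, not something stated in the talk.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Ratio P(feature | label A) / P(feature | label B) for a boolean feature."""
    # Smoothed estimates; the factor 2 reflects the two possible values (True/False).
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b


# Made-up counts: the feature is True in 40 of 100 not-spam tweets
# and in 4 of 100 spam tweets.
ratio = likelihood_ratio(count_a=40, total_a=100, count_b=4, total_b=100)
print("not_s : spam = %.1f : 1.0" % ratio)
```

A ratio near 1.0 means the feature barely separates the two classes, which is why the rows at the bottom of the table contribute little.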
Kitchen Sink
wash, rinse, repeat
Results
90% accuracy on spam tweets – not bad!
Other possibilities:
categorization – what do you tweet about?
human vs bot?
which celebrity tweeter are you?
<3 Data
Thank you!


Editor's Notes

  • #4: 1) Access to the data, and 2) CPU power/algorithms that are robust enough to analyze it
  • #15: NLTK – in development since 2001