CSE509: Introduction to Web Science and Technology
Lecture 5: Social Network Analysis
Arjumand Younus
Web Science Research Group
Institute of Business Administration (IBA)
Last Time…
  • Web Data Explosion
  • Part I
      • MapReduce Basics
      • MapReduce Example and Details
      • MapReduce Case-Study: Web Crawler based on MapReduce Architecture
  • Part II
      • Large-Scale File Systems
      • Google File System Case-Study
August 06, 2011
Today
  • Transition from Web 1.0 to Web 2.0
  • Social Media Characteristics
  • Part I: Theoretical Aspects
      • Social Networks as a Graph
      • Properties of Social Networks
  • Part II: Getting Hands-On Experience with Social Media Analytics
      • Twitter Data Hacks
  • Part III: Example Research
Quick Survey
  • Do you have a Facebook, MySpace, Twitter, or LinkedIn account?
  • Do you own a blog?
  • Do you read blogs?
  • Have you ever searched for something on Wikipedia?
  • Have you ever submitted content to a social network?
Web 1.0 vs. Web 2.0
(Borrowed from the SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal, with permission)
What is so Different about Web 2.0?
  • User-Generated Content
  • Collaborative Environment: Participatory Web, Citizen Journalism
  • User is the Driving Factor
A Paradigm Shift rather than a Technology Shift
Top 20 Most Visited Web Sites
  • Internet traffic report by Alexa on July 29th 2008
(Borrowed from the SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal, with permission)
Various Forms of Social Media
  • Blog: WordPress, Blogspot, LiveJournal
  • Forum: Yahoo! Answers, Epinions
  • Media Sharing: Flickr, YouTube, Scribd
  • Microblogging: Twitter, Foursquare
  • Social Networking: Facebook, LinkedIn, Orkut
  • Social Bookmarking: Del.icio.us, Diigo
  • Wikis: Wikipedia, Scholarpedia, AskDrWiki
Characteristics of Social Media
  • “Consumers” become “Producers”
  • Rich User Interaction
  • User-Generated Content
  • Collaborative Environment
  • Collective Wisdom
  • Long Tail
Broadcast Media: Filter, then Publish
Social Media: Publish, then Filter
PART I: Theoretical Aspects
Networks and Representation
  • Social Network: a social structure made of nodes (individuals or organizations) and edges that connect nodes in various relationships such as friendship or kinship
Figure: Graph Representation
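One way to make the graph and matrix views concrete is to store a small network as an edge list and convert it to an adjacency matrix. The sketch below does that in plain Python. The four-node network and the helper name `adjacency_matrix` are made up for illustration; the network drawn in the slide's figure is not recoverable here.

```python
# Hypothetical 4-node network (nodes labelled 1..4); not the one from the slide figure.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]


def adjacency_matrix(edges, n, directed=False):
    """Return an n x n 0/1 matrix for nodes labelled 1..n."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u - 1][v - 1] = 1
        if not directed:
            # Undirected ties (e.g. mutual friendship) are symmetric.
            matrix[v - 1][u - 1] = 1
    return matrix


undirected = adjacency_matrix(edges, 4)
directed = adjacency_matrix(edges, 4, directed=True)

for row in undirected:
    print(row)
```

An undirected network always gives a symmetric matrix, while a directed one (a follower relationship, for instance) generally does not.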
Figure: Matrix Representation

Properties of Large-Scale Networks
  • Networks in social media are typically huge, involving millions of actors and connections
  • Large-scale real-world networks demonstrate similar patterns
      • Scale-free Distributions
      • Small-world Effect
      • Strong Community Structure
Scale-Free Distributions
  • Degree distribution in large-scale networks often follows a power law
  • A.k.a. long tail distribution, scale-free distribution
[Figure axes: Degrees, Nodes]
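A power law shows up as a straight line when both axes are logarithmic, so a quick check is to fit a line to log(count) against log(degree). The sketch below does this with an ordinary least-squares fit in plain Python. The degree counts are synthetic (generated from count = 1000 * degree^-2), and `loglog_slope` is a hypothetical helper name; real degree data is noisy and is usually fitted with more careful methods.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


# Synthetic data: number of nodes with each degree, following an exact power law.
degrees = [1, 2, 4, 8, 16]
node_counts = [1000.0 * d ** -2 for d in degrees]

slope = loglog_slope(degrees, node_counts)
print("estimated exponent:", -slope)
```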
Small-World Effect
  • “Six Degrees of Separation”
  • A famous experiment conducted by Travers and Milgram (1969)
      • Subjects were asked to send a chain letter to an acquaintance in order to reach a target person
      • The average path length is around 5.5
  • Verified on a planetary-scale IM network of 180 million users (Leskovec and Horvitz 2008)
      • The average path length is 6.6
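The "average path length" quoted above is the mean shortest-path distance over pairs of nodes. A minimal sketch of that computation using breadth-first search follows. The six-node ring is a made-up toy network and the function names are hypothetical; the code assumes a connected, unweighted graph and would be far too slow for networks of millions of users.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / float(pairs)


# Toy network: six people in a ring, each knowing only two neighbours.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
avg = average_path_length(ring)
print(avg)
```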
Small World Facebook Experiment by Yahoo! Labs
  • Anyone in the world can get a message to anyone else in just "six degrees of separation" by passing it from friend to friend.
  • Sociologists have tried to prove (or disprove) this claim for decades, but it is still unresolved.
  • http://smallworld.sandbox.yahoo.com/
Community Structure
  • Community: people in a group interact with each other more frequently than with those outside the group
  • Friends of a friend are likely to be friends as well
  • Measured by the clustering coefficient: density of connections among one’s friends
  • ki = number of edges among node Ni’s neighbors
Clustering Coefficient
  • d6 = 4, N6 = {4, 5, 7, 8}
  • k6 = 4, as e(4,5), e(5,7), e(5,8), e(7,8)
  • C6 = 4/(4*3/2) = 2/3
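The worked example for node 6 can be reproduced in a few lines of Python. The sketch below rebuilds only the neighbourhood of node 6 as stated on the slide (its four neighbours and the four edges among them); the rest of the slide's network is not recoverable, so clustering values for other nodes in this partial graph would not match the original figure. The helper name `local_clustering` and the choice to return 0.0 for nodes with fewer than two neighbours are assumptions of this sketch.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Edges among a node's neighbours divided by the number of possible such edges."""
    neighbors = adj[node]
    d = len(neighbors)
    if d < 2:
        return 0.0  # convention chosen for this sketch
    k = sum(1 for u, v in combinations(sorted(neighbors), 2) if v in adj[u])
    return k / (d * (d - 1) / 2.0)


# Node 6, its neighbours {4, 5, 7, 8}, and the edges among those neighbours.
edges = [(6, 4), (6, 5), (6, 7), (6, 8), (4, 5), (5, 7), (5, 8), (7, 8)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

c6 = local_clustering(adj, 6)
print(c6)
```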
Average clustering coefficient
  • C = (C1 + C2 + … + Cn)/n
  • C = 0.61 for the left network
  • In a random graph, the expected coefficient is 14/(9*8/2) ≈ 0.39.

Challenges
  • Scalability
      • Social networks often have millions of nodes and connections
      • Traditional network analysis often deals with at most hundreds of subjects
  • Heterogeneity
      • Various types of entities and interactions are involved
  • Evolution
      • Timeliness is emphasized in social media
  • Collective Intelligence
      • How to utilize the wisdom of crowds in the form of tags, wikis, reviews
  • Evaluation
      • Lack of ground truth and of complete information due to privacy
Social Computing Tasks
  • Social Computing: a young and vibrant field
  • Conferences: KDD, WSDM, WWW, ICML, AAAI/IJCAI, SocialCom, etc.
  • Tasks
      • Centrality Analysis and Influence Modeling
      • Community Detection
      • Classification and Recommendation
      • Privacy, Spam and Security
Centrality Analysis and Influence Modeling
  • Centrality Analysis: identify the most important actors or edges
      • E.g. PageRank in Google
      • Various other criteria
  • Influence Modeling: How is information diffused? How do people influence each other?
  • Related Problems
      • Viral marketing: word-of-mouth effect
      • Influence maximization
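As an illustration of a centrality measure, here is a minimal power-iteration sketch of PageRank in plain Python. It is a textbook-style simplification, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from nodes without out-links, and the four-node link graph are all assumptions made for this example.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along out-links."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Node with no out-links: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical link graph: "c" is pointed to by everyone else, "d" by nobody.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

In this toy graph the node that collects links from every other node ends up with the highest score, and the node that nobody links to ends up with the lowest.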
Community Detection
  • A community is a set of nodes between which the interactions are (relatively) frequent
      • A.k.a. group, cluster, cohesive subgroup, module
  • Applications: recommendation based on communities, network compression, visualization of a huge network
  • New lines of research in social media
      • Community Detection in Heterogeneous Networks
      • Community Evolution in Dynamic Networks
      • Scalable Community Detection in Large-Scale Networks
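The definition above (interactions inside a community are relatively frequent) can be checked numerically for a proposed grouping by comparing edge density inside a group with edge density between groups. The sketch below does only that check; it is not a community detection algorithm. The two-triangle network and the helper name `edge_density` are made up for illustration.

```python
from itertools import combinations


def edge_density(adj, group_a, group_b=None):
    """Fraction of possible edges present inside group_a, or between group_a and group_b."""
    if group_b is None:
        pairs = list(combinations(sorted(group_a), 2))
    else:
        pairs = [(u, v) for u in group_a for v in group_b]
    present = sum(1 for u, v in pairs if v in adj[u])
    return present / float(len(pairs))


# Toy network: two triangles {1, 2, 3} and {4, 5, 6} joined by the single edge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

inside = edge_density(adj, {1, 2, 3})
across = edge_density(adj, {1, 2, 3}, {4, 5, 6})
print(inside, across)
```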
Classification and Recommendation
  • Common in social media applications
      • Tag suggestion, Product/Friend/Group Recommendation
  • Link prediction
  • Network-Based Classification
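A simple baseline for link prediction (and hence friend recommendation) is to score each unconnected pair by the number of neighbours they share, following the intuition that friends of a friend are likely to become friends. The sketch below is one such baseline in plain Python, not the method used in any particular system; the names and the five-person network are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(adj):
    """Rank unconnected pairs by how many neighbours they share (highest first)."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical friendship network.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "dan"), ("cat", "dan"), ("dan", "eve")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

ranked = common_neighbor_scores(adj)
for pair, score in ranked:
    print(pair, score)
```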
Privacy, Spam and Security
  • Privacy is a big concern in social media
      • Facebook and Google Buzz often appear in debates about privacy
      • Netflix Prize sequel cancelled due to privacy concerns
      • Simple anonymization does not necessarily protect privacy
  • Spam blogs (splogs), spam comments, fake identities, etc. all require new techniques
  • As private information is involved, a secure and trustworthy system is critical
  • Need to achieve a balance between sharing and privacy
PART II: Practical SNA with Twittersphere Mining
Pre-Requisites
  • Python is expected to be installed, and you should have some hands-on experience with it
  • Dependencies
      • easy_install
      • networkx
      • twitter (Twitter API for Python)
  • For Windows users
      • Install ActivePython: comes bundled with easy_install
      • easy_install networkx
      • easy_install twitter
  • For Linux users
      • sh setuptools-0.6c11-py2.6.egg
      • sudo easy_install networkx
      • sudo easy_install twitter
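Getting Tweets from Twitter Search API

    import json
    import twitter

    # Client for the Twitter Search API as it existed at the time of the lecture (2011)
    twitter_search = twitter.Twitter(domain="search.twitter.com")

    # Fetch five pages of up to 100 results each for the query "pakistan"
    search_results = []
    for page in range(1, 6):
        search_results.append(twitter_search.search(q="pakistan", rpp=100, page=page))

    print(json.dumps(search_results, sort_keys=True, indent=1))

    # Flatten the pages into a list of tweet texts
    tweets = [r['text'] for result in search_results for r in result['results']]
    print(tweets)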
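Lexical Diversity for Tweets

    # Split every tweet on whitespace and pool the tokens
    words = []
    for t in tweets:
        words += t.split()

    # Ratio of distinct tokens to total tokens (1.0 forces float division in Python 2)
    lexical_diversity = 1.0 * len(set(words)) / len(words)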
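What People are Tweeting: Frequency Analysis

    import nltk

    freq_dist = nltk.FreqDist(words)
    # In the NLTK 2.x releases of the time, keys() is ordered by decreasing frequency
    freq_dist.keys()[:50]    # 50 most frequent tokens
    freq_dist.keys()[-50:]   # 50 least frequent tokens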
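Because the Search API calls above need network access to an endpoint from 2011, it can help to try the two text measures on a handful of hard-coded strings first. The sketch below computes lexical diversity and token frequencies with the standard library's `collections.Counter` in place of NLTK. The three sample tweets are invented for illustration.

```python
from collections import Counter

# Invented sample tweets standing in for real search results.
sample_tweets = [
    "RT @alice: floods in pakistan need attention",
    "pakistan cricket team wins",
    "floods relief for pakistan via @bob",
]

# Pool whitespace-separated tokens from all tweets.
words = [w for t in sample_tweets for w in t.split()]

# Distinct tokens divided by total tokens.
lexical_diversity = len(set(words)) / float(len(words))

# Token frequencies, most common first.
freq = Counter(words)
print(round(lexical_diversity, 3))
print(freq.most_common(2))
```

A lexical diversity close to 1.0 means almost every token is unique; lower values indicate more repetition across tweets.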
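Extracting Relationships from Tweets (1/3)
Step 1: Extracting Graph Data

    import re
    import networkx as nx
    import twitter

    g = nx.DiGraph()

    twitter_search = twitter.Twitter(domain="search.twitter.com")
    search_results = []
    for page in range(1, 6):
        search_results.append(twitter_search.search(q="pakistan", rpp=100, page=page))
    all_tweets = [tweet for page in search_results for tweet in page["results"]]

    # Matches "RT @user" or "via @user", possibly with several mentions in a row
    rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

    def get_rt_sources(tweet):
        # Drop the "RT"/"via" marker itself; compare in lowercase because the
        # pattern is case-insensitive and would otherwise let "rt" through
        return [source.strip()
                for match in rt_patterns.findall(tweet)
                for source in match
                if source.lower() not in ("rt", "via")]

    # Add an edge from each retweeted source to the user who retweeted it
    for tweet in all_tweets:
        rt_sources = get_rt_sources(tweet["text"])
        if not rt_sources:
            continue
        for rt_source in rt_sources:
            # Keyword form of the edge attribute works across networkx versions
            g.add_edge(rt_source, tweet["from_user"], tweet_id=tweet["id"])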
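The retweet-extraction step can be tried offline on a few hard-coded tweets. The sketch below is a variation on the slide's approach, not a copy of it: it uses only the standard library, splits runs of several mentions into separate users, and strips the "@" and lowercases names so that a user appears under one label. The sample tweets, user names, and the helper name `retweet_sources` are invented.

```python
import re

# "RT" or "via" followed by one or more @mentions (case-insensitive).
RT_PATTERN = re.compile(r"(?:RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)


def retweet_sources(text):
    """Return the users credited by RT/via markers in a tweet, lowercased, without '@'."""
    sources = []
    for mentions in RT_PATTERN.findall(text):
        sources.extend(name.lower() for name in re.findall(r"@(\w+)", mentions))
    return sources


# Invented tweets in the shape used on the slide ("from_user" and "text" fields).
sample = [
    {"from_user": "carol", "text": "RT @alice: floods update"},
    {"from_user": "dave", "text": "relief news via @Bob @alice"},
    {"from_user": "erin", "text": "no retweet here"},
]

# Directed edges: (original author, user who retweeted).
edges = [(src, t["from_user"]) for t in sample for src in retweet_sources(t["text"])]
print(edges)
```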
Extracting Relationships from Tweets (2/3)
Step 2: Generating DOT File

    OUT = "pakistan_search_results.dot"

    # One DOT edge statement per retweet edge (Python 2: encode unicode names as UTF-8)
    dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1.encode('utf-8'), n2.encode('utf-8'), g[n1][n2]['tweet_id'])
           for n1, n2 in g.edges()]

    # Wrap the edge statements in a strict digraph and write them out
    f = open(OUT, 'w')
    f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
    f.close()
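To see what the generated DOT text looks like without running the Twitter and networkx steps, the same string formatting can be applied to a plain list of edges. The sketch below builds the DOT text in memory; the edge list, tweet ids, and the helper name `to_dot` are invented for illustration.

```python
def to_dot(edges):
    """Format (source, target, tweet_id) triples as a strict DOT digraph."""
    lines = ['"%s" -> "%s" [tweet_id=%s]' % (src, dst, tweet_id)
             for src, dst, tweet_id in edges]
    return "strict digraph {\n%s\n}" % ";\n".join(lines)


# Invented retweet edges: (original author, retweeter, tweet id).
sample_edges = [("alice", "carol", 101), ("bob", "dave", 102)]
dot_text = to_dot(sample_edges)
print(dot_text)
```

Writing `dot_text` to a `.dot` file gives the same kind of input that the visualization step expects.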
Extracting Relationships from Tweets (3/3)
Step 3: Visualizing the Retweet Data in Graphical Form
  • For Windows users
  • For Linux users
      • circo -Tpng -Osnl_search_results pakistan_search_results.dot
PART III: Example Research
Million Follower Fallacy (New York Times)
Twitter: More a News Medium than a Social Network (PC World)
Twitter for World Peace (Business Week)
SocialFlow: Social Media Optimization
  • Social Media Optimization Platform
      • Works in Domains of Viral and Word-of-Mouth Marketing
      • Provides Services to Major Media Outlets
  • Recent study
      • How different audiences consumed and rebroadcast messages that news organizations were sending out: Al Jazeera English, BBC News, CNN, The Economist, Fox News and New York Times
Twitter as a Real-Time News Analysis Service
Studying Ins and Outs of News
  • Using Twitter to study hot news items people are heavily tweeting about
Algorithm for Identification of Popular News


Observations (1/3)
  • Percentage of news in tweets per day greater than 50% for all days except one day
Observations (2/3)
  • Highest Number of Recorded Tweets per Day

Editor's Notes

  • #4: The past decade has witnessed the emergence of the participatory Web and social media, bringing people together in many creative ways. Millions of users are playing, tagging, working, and socializing online, demonstrating new forms of collaboration, communication, and intelligence that were hardly imaginable just a short time ago. Social media also helps reshape business models, sway opinions and emotions, and opens up numerous possibilities to study human interaction and collective behavior on an unparalleled scale. This lecture, from a data mining perspective, introduces characteristics of social media, reviews representative tasks of computing with social media, and illustrates associated challenges.
  • #8: In traditional media such as TV, radio, movies, and newspapers, only a small number of “authorities” or “experts” decide which information should be produced and how it is distributed. The majority of users are consumers who are separated from the production process. The communication pattern in traditional media is one-way traffic, from a centralized producer to widespread consumers. Social media, as a new type of mass publication, enables the production of timely news and grassroots information and leads to mountains of user-generated content, forming the wisdom of crowds.
  • #9: Twitter: a directed graph. Facebook: an undirected graph. In Twitter, for example, one user x follows another user y, but user y does not necessarily follow user x. In this case, the follower-followee network is directed and asymmetrical.
  • #10: a linear relationship between the logarithms of the variables
  • #11: the number of connections between one’s friends over the total number of possible connections among them
  • #12: Previously: email communication networks, instant messaging networks, mobile call networks, friendship networks. Other forms of complex networks, like coauthorship or citation networks, biological networks, metabolic pathways, genetic regulatory networks and food webs. These large-scale networks combined with unique characteristics of social media present novel challenges for mining social media.
    In reality, multiple relationships can exist between individuals. Two persons can be friends and colleagues at the same time. Thus, a variety of interactions exist between the same set of actors in a network. Multiple types of entities can also be involved in one network. For many social bookmarking and media sharing sites, users, tags and content are intertwined with each other, leading to heterogeneous entities in one network. Analysis of these heterogeneous networks involving heterogeneous entities or interactions requires new theories and tools.
    Social media emphasizes timeliness. For example, in content sharing sites and the blogosphere, people quickly lose their interest in most shared content and blog posts. This differs from classical web mining. New users join in, new connections are established between existing members, and senior users become dormant or simply leave. How can we capture the dynamics of individuals in networks? Can we find the die-hard members that are the backbone of communities? Can they determine the rise and fall of their communities?
    In social media, people tend to share their connections. The wisdom of crowds, in the form of tags, comments, reviews, and ratings, is often accessible. The meta information, in conjunction with user interactions, might be useful for many applications. It remains a challenge to effectively employ social connectivity information and collective intelligence to build social computing applications.
    A research barrier concerning mining social media is evaluation. In traditional data mining, we are used to the training-testing model of evaluation. It differs in social media. Since many social media sites are required to protect user privacy information, limited benchmark data is available. Another frequently encountered problem is the lack of ground truth for many social computing tasks, which further hinders comparative study of different works. Without ground truth, how can we conduct fair comparison and evaluation?
    Slide 7-11