CSMR: A Scalable Algorithm for 
Text Clustering with Cosine 
Similarity and MapReduce 
Giannakouris – Salalidis Victor - Undergraduate Student 
Plerou Antonia - PhD Candidate 
Sioutas Spyros - Associate Professor
Introduction 
• Big Data: Massive amounts of data, produced at a very high 
rate of growth 
• Big Data must be handled in many domains: Business 
Intelligence, Bioinformatics, Social Media Analytics, etc. 
• Text Mining: Classification/Clustering in digital libraries, 
e-mail, Sentiment Analysis on Social Media 
• CSMR: Computes pairwise text similarity; it represents text 
data in a vector space and measures similarity in parallel 
using MapReduce
Background 
• Vector Space Model: An algebraic model for representing 
text documents as vectors 
• Efficient method for text similarity measurement
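As a minimal illustration of the vector space model, the sketch below turns tokenized documents into term-count vectors over a shared vocabulary. The function name `to_vectors` and the choice of raw counts as vector components are assumptions made for this example; the slides do not prescribe them.

```python
from collections import Counter

def to_vectors(docs):
    """Map each tokenized document to a count vector over a shared vocabulary."""
    # The vocabulary fixes one vector dimension per distinct term.
    vocabulary = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = to_vectors([["big", "data"], ["data", "data", "text"]])
print(vocabulary)  # ['big', 'data', 'text']
print(vectors)     # [[1, 1, 0], [0, 2, 1]]
```

Every document becomes a point in the same space, so two documents can be compared with ordinary vector operations.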
TF-IDF 
• Term Frequency – Inverse Document Frequency 
• A numerical statistic that reflects the significance of a 
term in a corpus of documents 
• Usually used in search engines, text mining, text 
similarity in the vector space 
$$\mathrm{TF} \times \mathrm{IDF} = \frac{n_{i,j}}{|\{t \in d_j\}|} \times \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
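The weighting can be sketched in a few lines of single-machine Python. This is only an illustration under stated assumptions: the function name `tf_idf`, the token-list input, and the natural logarithm are choices made here, with TF taken as the term count divided by the document length and IDF as log(|D| / number of documents containing the term).

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    corpus is a list of token lists (an assumed input format).
    """
    num_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)  # number of terms in the document
        weights.append({
            term: (n / total) * math.log(num_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights

corpus = [["big", "data", "text"], ["text", "mining"], ["big", "text"]]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

In this toy corpus "text" occurs in every document, so its IDF is log(1) = 0 and it receives zero weight everywhere, which is the intended effect of the IDF factor.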
Cosine Similarity 
• Cosine Similarity: A measure of similarity between two 
documents represented as vectors 
• Measures the cosine of the angle between the two vectors 
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \, \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$
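A small sketch of the measure for sparse vectors stored as `{term: weight}` dictionaries follows. The dictionary representation and the decision to return 0.0 when either vector is all zeros are assumptions of this example, not something the slides specify.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity({"big": 1.0, "data": 1.0}, {"data": 2.0}))
```

Because the result depends only on the angle, scaling a vector (for example, doubling every weight of a document) does not change its similarity to other documents.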
Hadoop 
• Framework developed by Apache 
• Large-Scale Data Processing and Analytics 
• Scalable and parallel processing of data on large 
computer clusters using MapReduce 
• Runs on commodity, low-end hardware 
• Main Components: HDFS (Hadoop Distributed File 
System), MapReduce 
• Currently used by: Adobe, Yahoo!, Amazon, eBay, 
Facebook and many other companies
MapReduce 
• Programming Paradigm running on Apache Hadoop 
• The main component of Hadoop 
• Useful for processing of large data-sets 
• Breaks the data into key-value pairs 
• Model derived from map and reduce functions of 
Functional Programming 
• Every MR program consists of Mappers and Reducers
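The map, shuffle, and reduce steps can be imitated in a single Python process to see how key-value pairs flow through a job. This is a teaching sketch only: `run_mapreduce`, `length_mapper`, and `group_reducer` are hypothetical names, and nothing here uses Hadoop or its API.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of one MapReduce job (no Hadoop involved)."""
    groups = defaultdict(list)
    # Map: each input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # Shuffle: collect values by key
    # Reduce: each key is processed together with all of its values.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def length_mapper(word):
    yield len(word), word

def group_reducer(length, words):
    yield length, sorted(words)

print(run_mapreduce(["map", "reduce", "key", "hadoop"], length_mapper, group_reducer))
# [(3, ['key', 'map']), (6, ['hadoop', 'reduce'])]
```

On a real cluster the mapper and reducer calls run on different machines and the shuffle moves data over the network, but the contract between the two functions is the same.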
MapReduce Diagram
CSMR 
• The proposed method, CSMR, combines all of the 
above-mentioned techniques 
• Scalable algorithm for text clustering using the MapReduce model 
• Applies the MR model to TF-IDF and Cosine Similarity 
• 4 Phases: 
1. Word Counting 
2. Text Vectorization using term frequencies 
3. Apply TF-IDF on document vectors 
4. Cosine Similarity Measurement
Phase 1: Word Counting 
Algorithm 1: Word Count 
class Mapper 
    method Map( document ) 
        for each term ∈ document 
            write ( ( term , docId ) , 1 ) 

class Reducer 
    method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] ) 
        o = 0 
        for each one ∈ ones do 
            o = o + 1 
        return ( ( term , docId ) , o ) 

/* { o ∈ N : the number of occurrences of term in document docId } */
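Phase 1 can be tried out locally with the sketch below, which runs the map and reduce steps in one process. The names `word_count_map`, `word_count_reduce`, and `phase1` are hypothetical, and lowercasing plus whitespace splitting is an assumed tokenizer; the slides do not say how terms are extracted.

```python
from collections import defaultdict

def word_count_map(doc_id, text):
    # Assumed tokenizer: lowercase, split on whitespace.
    for term in text.lower().split():
        yield (term, doc_id), 1

def word_count_reduce(key, ones):
    # The number of emitted ones is the number of occurrences.
    return key, sum(ones)

def phase1(documents):
    """documents: {doc_id: text}; returns {(term, doc_id): occurrences}."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, one in word_count_map(doc_id, text):
            grouped[key].append(one)  # stands in for the shuffle step
    return dict(word_count_reduce(key, ones) for key, ones in grouped.items())

print(phase1({"d1": "big data big", "d2": "data"}))
# {('big', 'd1'): 2, ('data', 'd1'): 1, ('data', 'd2'): 1}
```

Keying on the pair (term, docId) rather than on the term alone keeps the counts separate per document, which the later phases rely on.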
Phase 2: Term Frequency 
Algorithm 2: Term Frequency 
class Mapper 
    method Map( ( term , docId ) , o ) 
        for each element ∈ ( term , docId ) 
            write ( docId , ( term , o ) ) 

class Reducer 
    method Reduce( docId , ( term , o ) ) 
        N = 0 
        for each tuple ∈ ( term , o ) do 
            N = N + o 
        for each tuple ∈ ( term , o ) do 
            write ( ( docId , N ) , ( term , o ) ) 

/* { N ∈ N : the total number of term occurrences in document docId } */
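A local sketch of Phase 2 is given below. It assumes the Phase 1 output is available as a `{(term, doc_id): o}` dictionary and that the reducer emits one record per term once the document total N is known; `phase2` is a hypothetical name.

```python
from collections import defaultdict

def phase2(word_counts):
    """word_counts: {(term, doc_id): o}; returns a list of ((doc_id, N), (term, o))."""
    per_doc = defaultdict(list)
    # Map: re-key every count by its document.
    for (term, doc_id), o in word_counts.items():
        per_doc[doc_id].append((term, o))
    # Reduce: N is the total number of term occurrences in the document.
    records = []
    for doc_id, tuples in per_doc.items():
        n_total = sum(o for _, o in tuples)
        records.extend(((doc_id, n_total), (term, o)) for term, o in tuples)
    return records

counts = {("big", "d1"): 2, ("data", "d1"): 1, ("data", "d2"): 1}
for record in phase2(counts):
    print(record)
```

Carrying N along with every (term, o) pair means the next phase can compute o / N without looking the document up again.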
Phase 3: TF-IDF 
Algorithm 3: Tf-Idf 
class Mapper 
    method Map( ( docId , N ) , ( term , o ) ) 
        for each element ∈ ( term , o ) 
            write ( term , ( docId , o , N ) ) 

class Reducer 
    method Reduce( term , ( docId , o , N ) ) 
        n = 0 
        for each element ∈ ( docId , o , N ) do 
            n = n + 1 
        for each element ∈ ( docId , o , N ) do 
            tf = o / N 
            idf = log( |D| / ( 1 + n ) ) 
            write ( docId , ( term , tf × idf ) ) 

/* Where |D| is the number of documents in the corpus and n is the number of documents that contain the term */
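The following sketch runs Phase 3 locally on records shaped like the Phase 2 output. It computes idf as log(|D| / (1 + n)), which is one reading of the Reducer's idf line and should be treated as an assumption; `phase3` and its `num_docs` parameter are hypothetical names.

```python
import math
from collections import defaultdict

def phase3(records, num_docs):
    """records: iterable of ((doc_id, N), (term, o)); returns {doc_id: {term: tf-idf}}."""
    per_term = defaultdict(list)
    # Map: re-key every record by term.
    for (doc_id, n_total), (term, o) in records:
        per_term[term].append((doc_id, o, n_total))
    # Reduce: n is the number of documents that contain the term.
    vectors = defaultdict(dict)
    for term, postings in per_term.items():
        n = len(postings)
        idf = math.log(num_docs / (1 + n))  # assumed smoothed form
        for doc_id, o, n_total in postings:
            vectors[doc_id][term] = (o / n_total) * idf
    return dict(vectors)

records = [(("d1", 3), ("big", 2)), (("d1", 3), ("data", 1)), (("d2", 1), ("data", 1))]
print(phase3(records, num_docs=2))
```

With this smoothing a term that occurs in every document gets a slightly negative weight, since |D| / (1 + |D|) is below 1; an implementation may want to clamp or drop such terms.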
Phase 4: Cosine Similarity 
Algorithm 4: Cosine Similarity 
class Mapper 
    method Map( docs ) 
        n = docs.length 

        for i = 0 to n - 2 
            for j = i + 1 to n - 1 
                write ( ( docs[i].id , docs[j].id ) , ( docs[i].tfidf , docs[j].tfidf ) ) 

class Reducer 
    method Reduce( ( docId_A , docId_B ) , ( docA.tfidf , docB.tfidf ) ) 
        A = docA.tfidf 
        B = docB.tfidf 
        cosine = sum( A × B ) / ( sqrt( sum( A^2 ) ) × sqrt( sum( B^2 ) ) ) 
        return ( ( docId_A , docId_B ) , cosine )
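Phase 4 can be sketched locally as a pair-generating map step followed by a per-pair reduce step. The names `pair_map`, `cosine_reduce`, and `phase4` are hypothetical, the TF-IDF vectors are assumed to be `{term: weight}` dictionaries, and returning 0.0 for an all-zero vector is a choice made for this example.

```python
import math
from itertools import combinations

def pair_map(vectors):
    """Emit every unordered document pair once, keyed by the two ids."""
    for id_a, id_b in combinations(sorted(vectors), 2):
        yield (id_a, id_b), (vectors[id_a], vectors[id_b])

def cosine_reduce(key, value):
    vec_a, vec_b = value
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm = (math.sqrt(sum(w * w for w in vec_a.values()))
            * math.sqrt(sum(w * w for w in vec_b.values())))
    # Assumption: a pair involving an all-zero vector gets similarity 0.0.
    return key, (dot / norm if norm else 0.0)

def phase4(vectors):
    """vectors: {doc_id: {term: tf-idf}}; returns {(id_a, id_b): cosine}."""
    return dict(cosine_reduce(key, value) for key, value in pair_map(vectors))

vectors = {"d1": {"big": 1.0}, "d2": {"big": 2.0}, "d3": {"data": 1.0}}
print(phase4(vectors))
```

The map step emits n(n - 1) / 2 pairs for n documents, so the volume of intermediate data grows quadratically with the corpus size; this is the part of the pipeline that benefits most from being spread across reducers.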
Phase 4: Diagram 
Input → Map → Reduce → Output 

Map (key: document pair, value: the two TF-IDF vectors) 
  Doc1,Doc2    [Doc1 TF-IDF], [Doc2 TF-IDF] 
  Doc1,Doc3    [Doc1 TF-IDF], [Doc3 TF-IDF] 
  Doc1,Doc4    [Doc1 TF-IDF], [Doc4 TF-IDF] 
  Doc4,Doc10   [Doc4 TF-IDF], [Doc10 TF-IDF] 
  DocM,DocN    [DocM TF-IDF], [DocN TF-IDF] 

Reduce (key: document pair, value: cosine similarity) 
  Doc1,Doc2    Cosine(Doc1, Doc2) 
  Doc1,Doc3    Cosine(Doc1, Doc3) 
  Doc1,Doc4    Cosine(Doc1, Doc4) 
  Doc4,Doc10   Cosine(Doc4, Doc10) 
  DocM,DocN    Cosine(DocM, DocN)
Conclusions & Future Work 
• Finalized proposed method 
• Implementation of the method 
• Experimental tests on real data and computer clusters 
• Deployment of an open-source project 
• Additional implementation using more efficient tools such 
as Apache Spark and Scala 
• Publication of test results
