K-Means Algorithm Implementation in Python

User categorization of Stack Overflow data
Using the K-Means clustering algorithm
Project Presentation
Team Members: Afzal Ahmad and Abhishek Barnwal

What is Stack Overflow?
• Stack Overflow is a question and answer site, written in C#,
for professional and enthusiast programmers. It is built and
run by its users as part of the Stack Exchange network of
Q&A sites.

About user accounts on Stack Overflow
• The site is all about getting answers. Good answers are voted up and
rise to the top.
• A user's reputation score goes up when others vote up their questions,
answers and edits.
• Badges are special achievements a user earns for participating on the
site. They come in three levels: bronze, silver, and gold.
• The person who asked a question can mark one answer as "accepted".

Dataset overview
• The dataset is obtained from the Stack Exchange data dump at the
Internet Archive.
• The link to the dataset is as follows.
www.archive.org/details/stackexchange
• Each Stack Exchange site is provided as a separate archive of
XML files compressed with 7-Zip.

Dataset overview
• The Stack Overflow dataset consists of the following files, each treated as a
table in our database design.
1. Posts
2. PostLinks
3. Tags
4. Users
5. Votes
6. Badges
7. Comments
• We are interested only in the Users file, which contains each user's Id and
features such as age, reputation, upvotes, downvotes, etc.

Features of Users Data
1. Age
2. Reputation
3. Upvotes
4. Downvotes
5. Views

Data preprocessing
• Our dataset is in XML format, which our algorithm cannot process
directly, so the data must first be preprocessed into a suitable
form.
• Data preprocessing is a data mining technique that involves
transforming raw data into an understandable format.
• To get the data into the desired format we need to parse it.

Python script to convert XML to CSV
import csv
import xml.etree.ElementTree as ET

# Parse the Users.xml file from the data dump
tree = ET.parse("Users.xml")
root = tree.getroot()

# Open the output file; newline='' stops the csv module from
# writing blank lines between rows on Windows
with open('user_data1.csv', 'w', newline='') as user_data:
    # create the csv writer object
    csvwriter = csv.writer(user_data)
    count = 0  # number of user rows written

    # Header row: one column per feature
    csvwriter.writerow(['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age'])
    for row in root.findall('row'):
        # Age is missing for some users; '0' is written in that case
        data = [row.get('Reputation'), row.get('Views'), row.get('UpVotes'),
                row.get('DownVotes'), row.get('Age') or '0']
        csvwriter.writerow(data)
        count = count + 1

What is clustering?
Clustering is the task of dividing a population of data points
into a number of groups such that data points in the same group
are more similar to each other than to those in other groups.
In simple words, the aim is to find groups with similar traits
and assign them to clusters.

Pictorial representation of Clustering

Types of Clustering
1. Hard Clustering: In hard clustering, each data point either
belongs to a cluster completely or not at all.
2. Soft Clustering: In soft clustering, instead of putting each
data point into a single cluster, a probability or
likelihood of that data point belonging to each cluster is
assigned.

Algorithm Used
• We use the K-means clustering algorithm to categorise users into
different types on the basis of the given features.
• K-means clustering is a data mining/machine learning algorithm used
to cluster observations into groups of related observations without
any prior knowledge of those relationships.
• It is an unsupervised learning algorithm, because it is not
given any cluster labels.
• Using this algorithm we find K different categories, where K is a
value we choose in advance.

Unsupervised Learning
Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses.
The most common unsupervised learning method is cluster analysis, which
is used for exploratory data analysis to find hidden patterns or grouping in
data. The clusters are modeled using a measure of similarity which is
defined upon metrics such as Euclidean or probabilistic distance.
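As a small illustration of the Euclidean distance mentioned above, the sketch below compares two users described by two numeric features. The users and their values are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the dataset.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical users described by (reputation, upvotes); illustrative values only.
user_a = (120.0, 15.0)
user_b = (123.0, 19.0)

distance = euclidean(user_a, user_b)  # sqrt(3**2 + 4**2)
print(distance)  # 5.0
```

The smaller the distance, the more similar the two users are considered to be.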

Working of K-Means Algorithm
1. Specify the desired number of clusters K: Let us choose K = 2 for
these 5 data points in 2-D space.

2. Randomly assign each data point to a cluster: Let's assign three
points to cluster 1, shown in red, and two points to cluster 2,
shown in grey.

3. Compute cluster centroids: The centroid of the data points in the red
cluster is shown as a red cross and that of the grey cluster as a grey
cross.

4. Re-assign each point to the closest cluster centroid.

5. Re-compute cluster centroids: Now re-compute the
centroids of both clusters.

6. Repeat steps 4 and 5 until no improvement is possible.
When no data point switches cluster between two successive
iterations, the algorithm terminates, unless a maximum number of
iterations has been specified explicitly.
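The six steps above can be sketched in NumPy as follows. This is a minimal illustration, not the code used in the project: the function name `kmeans`, its parameters, and the five sample points are assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (n_samples x n_features) into k groups (needs n_samples >= k)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Step 2: random initial assignment. Shuffling a repeating 0..k-1
    # sequence guarantees that every cluster starts with at least one point.
    labels = rng.permutation(np.arange(n) % k)
    centroids = None
    for _ in range(max_iter):
        # Steps 3 and 5: each centroid is the mean of the points in its cluster.
        new_centroids = []
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                new_centroids.append(members.mean(axis=0))
            else:
                # A cluster that lost all its points keeps its previous centroid.
                new_centroids.append(centroids[j])
        centroids = np.array(new_centroids)
        # Step 4: re-assign every point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop when no point switches cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Step 1: K = 2 for five illustrative points in 2-D space.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

With these well-separated points, the first three end up in one cluster and the last two in the other; which cluster receives label 0 depends on the random initial assignment.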

Pictorial representation of K-means Algorithm

Implementation of K-means Algorithm
1. We convert our XML data into CSV.
2. We run the K-Means algorithm on the Stack Overflow data.
3. With K = 4 we get four cluster centres with the values given
below (columns follow the CSV order: Reputation, Views, UpVotes,
DownVotes, Age).
array([[1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01],
       [1.71912000e+04, 7.34000000e+01, 1.92800000e+02, 1.29000000e+01, 3.92000000e+01],
       [3.89650000e+04, 3.47000000e+02, 5.10000000e+02, 8.60000000e+01, 3.00000000e+01],
       [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01]])
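To make the cluster-centre array above easier to read, the sketch below prints it as a labelled table sorted by mean reputation. The column names are an assumption: they are taken from the header row written by the XML-to-CSV script, on the premise that the clustering was run on the CSV columns in that order.

```python
# Assumed column order: the header row written by the XML-to-CSV script.
columns = ["Reputation", "Views", "UpVotes", "DownVotes", "Age"]

# Cluster centres copied from the K = 4 result above.
centres = [
    [1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01],
    [1.71912000e+04, 7.34000000e+01, 1.92800000e+02, 1.29000000e+01, 3.92000000e+01],
    [3.89650000e+04, 3.47000000e+02, 5.10000000e+02, 8.60000000e+01, 3.00000000e+01],
    [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01],
]

# Sort the clusters from lowest to highest mean reputation.
ranked = sorted(centres, key=lambda centre: centre[0])

# Print a fixed-width table: one row per cluster, one column per feature.
print("".join(f"{name:>12}" for name in columns))
for centre in ranked:
    print("".join(f"{value:>12.2f}" for value in centre))
```

Read this way, each row is the average user of one cluster, which is how the insights later in the deck are derived.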

Pictorial form of the data with 4 cluster centres

Important information regarding insights from the data
1. We processed the data of Android users of Stack Overflow.
2. All results and insights here apply only to Android-specific users.
3. We used only the numerical information about users, as the K-Means
algorithm works on Euclidean distance.
4. The user information used here is as follows.
'Age', 'Views', 'Reputation', 'Upvotes', 'Downvotes'

Insights from Stack Overflow data
1. Almost all Android-specific users are above 30 years of age.
2. The users with the highest reputation, views, upvotes and
downvotes are the youngest among all users. This suggests that the
younger community is more involved in Android than the older one.
3. As age increases, users are less interested in downvoting
answers. The younger community is the most involved in both
downvoting and upvoting answers.
4. Profile views are mostly affected by reputation. They increase 3-4
times when the reputation doubles.
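One rough way to check insight 4 against the four reported cluster centres is sketched below. It assumes that the first two columns of each centre are mean reputation and mean profile views, and that views grow as a power of reputation between neighbouring centres; both are assumptions made for this example, not statements from the project.

```python
import math

# (mean reputation, mean profile views) of the four reported cluster
# centres, sorted by reputation. Column meaning is an assumption.
centres = [
    (182.709702, 0.886936593),
    (4180.1875, 13.875),
    (17191.2, 73.4),
    (38965.0, 347.0),
]

factors = []
for (rep_lo, views_lo), (rep_hi, views_hi) in zip(centres, centres[1:]):
    rep_ratio = rep_hi / rep_lo
    views_ratio = views_hi / views_lo
    # If views ~ reputation**p, then p = log(views_ratio) / log(rep_ratio),
    # and doubling the reputation multiplies the views by 2**p.
    exponent = math.log(views_ratio) / math.log(rep_ratio)
    per_doubling = 2 ** exponent
    factors.append(per_doubling)
    print(f"reputation x{rep_ratio:.2f} -> views x{views_ratio:.2f} "
          f"(about x{per_doubling:.2f} per doubling)")
```

On these four centres the per-doubling factor is not uniform: it falls in the 3-4 range only between the two highest-reputation clusters and is lower between the others, so the insight is best read as describing the high-reputation end of the data.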
