International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 07 | July -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 3441
Survey paper on Big data imputation and Privacy algorithms
G.Swetha1, G.Ramya2
1,2 Professor, CSE, CVRCE, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Big data refers to collections of data sets so large that
traditional processing methods are inadequate to deal with them.
The fast growth of such data creates both opportunities and
problems. This paper presents a literature review of issues in
data creation and data protection, and of different algorithms
that deal with these issues.
Key Words: Big Data, Imputation, Nearest Neighbour,
Data Protection, Data Distortion, Data Blocking.
1. INTRODUCTION
The Goods and Services Tax was introduced in India on 1 July
2017. People from all over the nation have given feedback on it,
some positive and some negative. If we could summarize all of
these opinions, including updated ones, the result would be a
good example of big data. Most of the data in the world was
produced within the last few years [2]. Data arrives from various
sources and in various formats. Social networking sites in
particular produce large amounts of data every hour, and
handling data at this scale is very difficult.
Big data challenges [7] include capture, data storage, data
analysis, search, sharing, transfer, visualization, querying,
updating, and information privacy.
The paper is organized as follows. Section 2 introduces data
imputation and algorithms for replacing missing data. Section 3
introduces privacy protection and algorithms for privacy
protection. Section 4 concludes the paper.
2. Data Imputation
When data is preprocessed for data mining, some attribute values
are often found to be missing. Knowledge can be extracted
reliably only from good-quality data, and missing values reduce
that quality. Missing data may occur, for example, because a
student was detained in a class or because respondents did not
answer some questions in a survey. Handling missing data
carefully increases the quality of the extracted knowledge, so
the missing values need to be replaced with reasonable
substitutes. This is known as data imputation.
If we have knowledge of the data, we can predict a missing
value, but doing so is complicated. Data may be missing in
columns, in rows, or in both. Missing data can be replaced before
data mining starts or after it starts. This paper surveys two
methods for handling missing data: Refined Mean Substitution and
K-Nearest Neighbor imputation.
2.1 Data Imputation Algorithms:
The paper [1] proposed an algorithm for missing data in which
missing values are estimated using the Euclidean distance between
the instances or attributes with missing values and the remaining
records. In this method, the distance d is calculated between the
approximately imputed data set and the rows of the data set.
Next, the rows whose distance is greater than the mean of d are
found. Let I denote the index set of these rows, that is, the
elements whose distance is higher than mean(d). Then the mean μj
of the elements Dnew(I, n) is computed, and the missing values in
the row are replaced with μj. Repeating this calculation for
every row and substituting at every missing position finally
produces the imputed data set. The algorithm was evaluated with
five metrics: Rand index, accuracy, specificity, sensitivity, and
mean square error. According to [1], in almost all cases this
algorithm performed better than the MC/mean value substitution
method.
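The steps above can be sketched in a few lines of Python. This is only an illustration of the description as written in this survey, not the reference implementation from [1]. Two points are assumptions: the "approximately imputed" data set is built here by plain column-mean substitution, and the index set I keeps the rows whose distance is greater than mean(d), taken literally from the text. The function name refined_mean_substitution and the toy data are hypothetical.

```python
import math
from statistics import mean


def refined_mean_substitution(data):
    """Impute None entries row by row, following the steps described above.

    Assumption: the "approximately imputed" data set is built by plain
    column-mean substitution; the text does not say how it is obtained.
    """
    n_cols = len(data[0])
    # Column means over the observed values only.
    col_mu = [mean(row[j] for row in data if row[j] is not None)
              for j in range(n_cols)]
    # Approximately imputed data set (Dnew in the text).
    d_new = [[col_mu[j] if v is None else v for j, v in enumerate(row)]
             for row in data]

    result = [row[:] for row in d_new]
    for r, row in enumerate(data):
        missing = [j for j, v in enumerate(row) if v is None]
        others = [i for i in range(len(data)) if i != r]
        if not missing or not others:
            continue
        # Euclidean distance d between this row and every other row.
        d = {i: math.dist(d_new[r], d_new[i]) for i in others}
        d_bar = mean(list(d.values()))
        # Index set I: rows whose distance is greater than mean(d),
        # taken literally from the description above.
        index_set = [i for i in others if d[i] > d_bar]
        for j in missing:
            # mu_j: mean of column j over the rows in I; fall back to the
            # column mean when I is empty (all distances equal).
            result[r][j] = (mean(d_new[i][j] for i in index_set)
                            if index_set else col_mu[j])
    return result


rows = [[1.0, 2.0], [2.0, None], [3.0, 6.0], [10.0, 20.0]]
imputed = refined_mean_substitution(rows)
```

On this toy input only the last row lies farther than mean(d) from the incomplete row, so the missing value is filled from that row alone. Whether the cited method selects the farther or the nearer rows should be checked against [1] before relying on this sketch.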
The second imputation algorithm is K-Nearest Neighbors [8].
Features of k-Nearest Neighbor are:
1). All instances correspond to points in an n-dimensional
Euclidean space.
2). Classification is delayed until a new instance arrives.
3). Classification is done by comparing the feature vectors of
different points.
4). No particular form is assumed for the target function; it may
be discrete or real valued.
5). The Euclidean distance between any two instances is
calculated, and the mean value of the k nearest neighbors is
taken.
According to [4], the classes of missing data randomness are:
(1). Missing completely at random: the probability that a value
is missing depends neither on the observed values nor on the
missing value itself, so any imputation method may be applied.
(2). Missing at random: the probability that a value is missing
depends on the known values but not on the missing value itself.
(3). Not missing at random: the probability that a value is
missing depends on the missing value itself.
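The difference between the three classes is easier to see on simulated data. The sketch below is an illustration only: the (age, income) records, the thresholds 40 and 50000, and the function name make_missing are all hypothetical and do not come from [4]. Only the income attribute is ever removed.

```python
import random


def make_missing(rows, mechanism, rate=0.5, seed=0):
    """Return a copy of (age, income) rows with some incomes set to None."""
    rng = random.Random(seed)
    out = []
    for age, income in rows:
        if mechanism == "MCAR":
            # Depends on nothing: every income has the same chance to vanish.
            p = rate
        elif mechanism == "MAR":
            # Depends only on an observed attribute (age), not on income.
            p = rate if age > 40 else 0.0
        elif mechanism == "NMAR":
            # Depends on the value that goes missing (high incomes vanish).
            p = rate if income > 50000 else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((age, None if rng.random() < p else income))
    return out


people = [(25, 30000), (35, 42000), (45, 61000), (55, 48000), (65, 75000)]
mar = make_missing(people, "MAR")
nmar = make_missing(people, "NMAR")
```

Under MAR the missingness can be explained from the observed age column, whereas under NMAR the cause is hidden in the very values that were lost, which is why that case is the hardest to impute.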
According to [4], methods for handling missing data are:
(1). Delete all instances or attributes that have missing values,
or first check whether a particular attribute or instance has a
high level of missing values and then delete it.
(2). Use algorithms that can estimate parameters in the presence
of missing data.
(3). Replace the missing data with some reasonable value, which
is known as imputation.
Imputation using k-nearest neighbor [3]:
According to [3], the main advantages of this method are:
(1). k-nearest neighbor can predict the missing value either by
taking the most frequent value among the k nearest neighbors
(for qualitative attributes) or by substituting the mean of the
k nearest neighbors (for quantitative attributes).
(2). No model has to be built to guess the value of the missing
attribute. Because no specific model is used, any attribute can
be treated as the class.
The main drawback of this method is that, to find the most
similar instances, the algorithm searches the whole data set.
When the database is very large, this becomes difficult for KDD.
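A minimal sketch of k-nearest neighbor imputation, assuming that neighbors are drawn only from complete rows and that distance is the Euclidean distance over the numeric attributes observed in the incomplete row. The function name knn_impute, the categorical parameter, and the sample data are hypothetical and are not taken from [3].

```python
import math
from collections import Counter


def knn_impute(data, k=3, categorical=()):
    """Fill None entries using the k nearest complete rows.

    Numeric attributes get the mean of the neighbors' values; attributes
    whose index is listed in `categorical` get the most frequent value.
    """
    complete = [row for row in data if None not in row]
    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        # Compare only on numeric attributes that are observed in this row.
        observed = [j for j, v in enumerate(row)
                    if v is not None and j not in categorical]
        query = [row[j] for j in observed]
        # Full scan of the complete rows: this is the drawback noted above.
        neighbours = sorted(
            complete,
            key=lambda c: math.dist(query, [c[j] for j in observed]),
        )[:k]
        filled = list(row)
        for j, v in enumerate(row):
            if v is None:
                values = [n[j] for n in neighbours]
                if j in categorical:
                    filled[j] = Counter(values).most_common(1)[0][0]
                else:
                    filled[j] = sum(values) / len(values)
        result.append(filled)
    return result


table = [
    [1.0, 1.0, "a"],
    [1.2, 0.9, "a"],
    [0.9, 1.1, "b"],
    [8.0, 9.0, "c"],
    [1.1, None, None],
]
filled_table = knn_impute(table, k=3, categorical={2})
```

The sort over all complete rows makes the cost of each imputation grow with the size of the data set, which illustrates why the method becomes expensive on very large databases.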
3. Privacy Protection:
In recent years, privacy and personal data protection have become
an issue, especially in the context of social networking and
online advertisement. Personal data means any kind of data that
identifies an individual person. Examples are a person's name,
address, phone number, identity number, and date of birth. Data
is growing exponentially and will change the world in ways we
can scarcely imagine today. That is why the protection of
personal data is very important. Safeguards are necessary to give
citizens and consumers trust in administration, business, and
other private entities.
Data Privacy Algorithms:
Privacy preservation using association rule hiding:
Association rule hiding algorithms are used to hide sensitive
data. Suppose a database D is given together with minimum support
and minimum confidence thresholds, and a set of rules R is mined
from it. Let Rs be the subset of R that contains the sensitive
association rules. The aim of association rule hiding algorithms
is to change the database in such a way that the sensitive
association rules become difficult to mine while the remaining
rules stay unaffected [5].
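To make the role of the two thresholds concrete, the following sketch computes the support and confidence of a rule A -> B over a small transaction database, using the standard definitions support(A ∪ B) and support(A ∪ B) / support(A). A rule counts as hidden once either value falls below its threshold. The item names, the thresholds, and the helper is_mined are hypothetical examples.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A and B together) divided by support(A); 0.0 if A never occurs."""
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / sup_a


def is_mined(transactions, antecedent, consequent, min_support, min_confidence):
    """True if the rule passes both thresholds and would therefore be mined."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)


D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(D, {"bread", "milk"})
rule_confidence = confidence(D, {"bread"}, {"milk"})
```

Here the rule bread -> milk has support 2/4 and confidence 2/3, so it is mined with thresholds (0.5, 0.6) but not with (0.5, 0.7).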
Classification of privacy preserving association rule hiding
algorithms:
1. Heuristic-based techniques
2. Border approach
3. Exact approach
4. Reconstruction-based association rule
5. Cryptography-based techniques
6. Hybrid technique approach
3.1 Heuristic-based techniques:
Heuristic-based techniques directly modify the data to hide
sensitive information. Based on how the data is modified, this
technique is divided into two groups: the data distortion method
and the data blocking method.
a. Data distortion method:
Data distortion methods work by adding noise or unknown values.
These methods must preserve privacy and at the same time keep
the utility of the data after distortion. The classical data
distortion methods are based on random value perturbation. The
two random value perturbation functions are described below.
i. Uniformly distributed noise:
In this method a noise matrix is added to the original matrix.
The noise matrix [6] is generated with a uniform distribution
function over a given interval of values.
ii. Normally distributed noise:
This method is the same as the previous one, except that the
noise matrix is generated with a normal distribution function
[6], using a mean and a standard deviation.
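Both perturbation functions amount to adding an independently drawn noise value to every cell of the data matrix. The sketch below uses Python's standard random module. The function names, the seed parameter, and the sample matrix are hypothetical, and the choice of interval or standard deviation is left to the user since the text does not fix it.

```python
import random


def add_uniform_noise(matrix, low, high, seed=None):
    """Add noise drawn uniformly from [low, high] to every entry."""
    rng = random.Random(seed)
    return [[x + rng.uniform(low, high) for x in row] for row in matrix]


def add_normal_noise(matrix, mu, sigma, seed=None):
    """Add noise drawn from a normal distribution N(mu, sigma^2) to every entry."""
    rng = random.Random(seed)
    return [[x + rng.gauss(mu, sigma) for x in row] for row in matrix]


original = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]
uniform_version = add_uniform_noise(original, -0.5, 0.5, seed=1)
normal_version = add_normal_noise(original, 0.0, 1.0, seed=1)
```

A wider interval or a larger standard deviation hides the original values better but also distorts the statistics that later mining steps depend on, which is the privacy versus utility trade-off mentioned above.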
b. Data blocking method:
The data blocking method works by reducing the support and
confidence [6] of an association rule. To lower these values,
the method replaces attribute values with values that give a low
support count.
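One way to picture data blocking is sketched below. It assumes a common reading of the technique in which a blocked value is replaced by an unknown marker (None here), so that the transaction no longer counts towards the rule's support. The text above does not specify the replacement value, so this marker, the stopping rule, and the names support and block_rule are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions where every item is known to be present (== 1)."""
    hits = sum(all(t.get(i) == 1 for i in itemset) for t in transactions)
    return hits / len(transactions)


def block_rule(transactions, antecedent, consequent, min_support):
    """Block consequent values until the rule's support is below min_support.

    Works on a copy; a blocked value is stored as None (unknown).
    """
    items = set(antecedent) | set(consequent)
    target = sorted(consequent)[0]  # item whose value gets blocked
    blocked = [dict(t) for t in transactions]
    for t in blocked:
        if support(blocked, items) < min_support:
            break
        if all(t.get(i) == 1 for i in items):
            t[target] = None
    return blocked


D = [
    {"bread": 1, "milk": 1, "butter": 0},
    {"bread": 1, "milk": 0, "butter": 1},
    {"bread": 1, "milk": 1, "butter": 1},
    {"bread": 0, "milk": 1, "butter": 0},
]
hidden = block_rule(D, {"bread"}, {"milk"}, min_support=0.5)
```

In this example a single blocked value is enough to push the support of bread -> milk from 0.5 to 0.25, while the support of bread alone is unchanged.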
4. CONCLUSIONS
Big data is a collection of large amounts of structured and
unstructured data coming from different sources. It has both
advantages and disadvantages.
To address the challenges of big data, many researchers have
proposed different system models and techniques for big data.
In this paper we discussed two issues
related to big data mining: problems that arise while collecting
the data, and data protection. We also discussed algorithms such
as k-nearest neighbor for missing data and association rule
hiding for data privacy protection.
REFERENCES
[1] R. S. Somasundaram and R. Nedunchezhian, "Missing Value
Imputation using Refined Mean Substitution," IJCSI
International Journal of Computer Science Issues, vol. 9,
issue 4, no. 3, July 2012. ISSN (online): 1694-0814.
[2] IBM, "What is Big Data: Bring Big Data to the Enterprise,"
http://www01.ibm.com/software/data/bigdata/, 2012.
[3] G. E. A. P. A. Batista and M. C. Monard, "A Study of
K-Nearest Neighbour as an Imputation Method," University of
São Paulo (USP), Institute of Mathematics and Computer
Science (ICMC), Department of Computer Science and
Statistics (SCE), Laboratory of Computational Intelligence
(LABIC), São Carlos, SP, Brazil.
[4] R. J. Little and D. B. Rubin, Statistical Analysis with
Missing Data. John Wiley and Sons, New York, 1987.
[5] Mohamed Refaat Abdellah, H. Aboelseoud M, Khalid Shafee
Badran, and M. Badr Senousy, "Privacy Preserving Association
Rule Hiding Techniques: Current Research Challenges,"
International Journal of Computer Applications (0975-8887),
vol. 136, no. 6, February 2016.
[6] Jun Zhang and Jie Wang (University of Kentucky, USA) and
Shuting Xu (Virginia State University, USA), "Matrix
Decomposition-Based Data Distortion Techniques for Privacy
Preservation in Data Mining."
[7] Jaseena K. U. and Julie M. David, "Issues, Challenges and
Solutions: Big Data Mining."
[8] G. E. A. P. A. Batista and M. C. Monard, "K-Nearest
Neighbour as Imputation Method: Experimental Results" (in
print). Technical report, ICMC-USP, 2002. ISSN 0103-2569.
