Jan Zizka et al. (Eds) : CCSIT, SIPP, AISC, CMCA, SEAS, CSITEC, DaKM, PDCTA, NeCoM - 2016
pp. 287–296, 2016. © CS & IT-CSCP 2016 DOI : 10.5121/csit.2016.60124
A PREFIXED-ITEMSET-BASED IMPROVEMENT FOR APRIORI ALGORITHM
Yu Shoujian 1 and Zhou Yiyang 2
College of Computer Science and Technology,
Donghua University, Shanghai, 201600, China
1 jackyysj@dhu.edu.cn
2 yiyang0203@foxmail.com
ABSTRACT
Association rule mining is an important part of data mining. It is used to find interesting patterns in transaction databases. The Apriori algorithm is one of the most classical algorithms for mining association rules, but it suffers from an efficiency bottleneck. In this article, we propose a prefixed-itemset-based data structure for candidate itemset generation; with the help of this structure we improve the efficiency of the classical Apriori algorithm.
KEYWORDS
Data mining, association rules, Apriori algorithm, prefixed-itemset, hash map
1. INTRODUCTION
With the rapid development of computer technology in various sectors, the amount of data generated by different industries keeps growing, and how to extract valuable information from such big data has become a new problem. Data mining, that is, knowledge discovery from data, came into being against this backdrop. Data mining aims to extract implicit, previously unknown and interesting knowledge and rules from large amounts of data [1]. Association rule mining is an important part of data mining; it was first put forward by R. Agrawal, mainly to discover associations between sets of items in customer transaction databases [2]. In the following year, R. Agrawal proposed the most classical algorithm for computing association rules, the Apriori algorithm [3], which infers the (k+1)-itemsets from the k-itemsets.
However, due to the computational bottleneck of the Apriori algorithm when generating candidate sets, many improved versions of the traditional Apriori algorithm have been proposed in recent years from different aspects. Chun-Sheng Z proposed an improved Apriori algorithm based on classification [4]. Jia Y improved the algorithm through transaction database partitioning and dynamic itemset planning [5]. Shuangyue L proposed an improved algorithm based on a matrix representation of the database to speed up the computation [6]. Wang P proposed an optimization method that reduces the number of scans of the transaction database to improve efficiency [7]. Vaithiyanathan V improves the efficiency of the algorithm by compressing transactions with similar interests in the database [8]. Lin X implements the Apriori algorithm on MapReduce to improve the efficiency of candidate set generation on large amounts of data [9]. Zhang first analyzes the characteristics of the data, namely medical data, and then combines these characteristics to improve the Apriori algorithm [10]. Wu Huan proposed an improved algorithm, IAA, which adopts a new count-based method to prune candidate itemsets and uses a generation record to reduce the total amount of data scanned [11]. Wang Yuan proposes an improved item-constrained association rule mining algorithm, which improves the traditional algorithm in two aspects: trimming frequent itemsets and calculating candidate itemsets [12]. Lin Ming-Yen proposes three algorithms, named SPC, FPC, and DPC, to investigate effective implementations of the Apriori algorithm in the MapReduce framework [13]. Chai Sheng proposes a novel algorithm called Reduced Apriori Algorithm with Tag (RAAT), which removes one redundant pruning operation over C2 [14].
This article focuses on two concrete steps of the classical Apriori algorithm, namely the connecting step and the pruning step, and uses a new prefixed-itemset-based storage, combined with the fast lookup of hash tables, to improve efficiency. The paper first describes the classical Apriori algorithm and its shortcomings, then describes the improvements in detail, and finally compares the efficiency of the classical Apriori algorithm and the improved Apriori algorithm on specific data sets.
2. APRIORI ALGORITHM
2.1. Apriori algorithm introduction
The Apriori algorithm is a classical algorithm for mining the frequent itemsets used in association rules. The basic idea of the algorithm is to find the frequent itemsets layer by layer in an iterative way: the algorithm first obtains the k-itemsets and then uses them to explore the (k+1)-itemsets. The algorithm relies on the prior knowledge of frequent itemsets, namely that any subset of a frequent itemset is also frequent. Apriori first finds the collection of frequent 1-itemsets, denoted L1, then uses L1 to obtain L2, then L3, and so on, until no frequent k-itemsets can be found. The Apriori algorithm mainly consists of the following three steps:
(1) Connecting step: connect the frequent k-itemsets to generate the (k+1)-candidate set, denoted by Ck+1. The connecting condition is that the two k-itemsets have the same first (k-1) items and different k-th items. Denoting by li[j] the j-th item of li, the condition is:

l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ ... ∧ l1[k-1] = l2[k-1] ∧ l1[k] ≠ l2[k]

in which l1 and l2 are k-itemsets of the collection Lk, and l1[k] ≠ l2[k] ensures that no duplicate (k+1)-itemsets are generated. The itemset generated by connecting l1 and l2 is:

{l1[1], l1[2], l1[3], ..., l1[k], l2[k]}
(2) Pruning step: pick out the true frequent itemsets Lk+1 from the candidate set Ck+1, which is a superset of Lk+1. According to the Apriori property, any subset of a frequent itemset must also be frequent; in particular, any k-item subset of a frequent (k+1)-itemset must be frequent. With this property we check whether every k-item subset of a candidate in Ck+1 is in Lk; if not, the candidate (k+1)-itemset is removed from Ck+1.
(3) Counting step: scan the database and accumulate the number of times each candidate appears. If a candidate appears fewer times than the given minimum support threshold, the candidate itemset is removed.
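To make the connecting and pruning steps concrete, the following is a minimal Java sketch of a classical candidate-generation routine. The class name, method names and the list-of-lists representation are illustrative assumptions rather than code from the paper: it joins every pair of frequent k-itemsets that share their first (k-1) items and keeps a candidate only if all of its k-item subsets are frequent.

import java.util.*;

public class ClassicalAprioriGen {
    // Join Lk with itself, then prune candidates that have an infrequent k-item subset.
    static List<List<String>> aprioriGen(List<List<String>> Lk) {
        Set<List<String>> lkSet = new HashSet<>(Lk);   // for the pruning lookups
        List<List<String>> candidates = new ArrayList<>();
        for (int i = 0; i < Lk.size(); i++) {
            for (int j = i + 1; j < Lk.size(); j++) {
                List<String> l1 = Lk.get(i), l2 = Lk.get(j);
                int k = l1.size();
                // Connecting condition: same first k-1 items, different k-th item.
                if (!l1.subList(0, k - 1).equals(l2.subList(0, k - 1))
                        || l1.get(k - 1).equals(l2.get(k - 1))) continue;
                List<String> c = new ArrayList<>(l1);
                c.add(l2.get(k - 1));
                Collections.sort(c);                    // keep the items in dictionary order
                if (allSubsetsFrequent(c, lkSet)) candidates.add(c);
            }
        }
        return candidates;
    }

    // Pruning: every k-item subset of the (k+1)-candidate must be in Lk.
    static boolean allSubsetsFrequent(List<String> c, Set<List<String>> lkSet) {
        for (int drop = 0; drop < c.size(); drop++) {
            List<String> subset = new ArrayList<>(c);
            subset.remove(drop);
            if (!lkSet.contains(subset)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // L2 from Table 2 of this paper: AB, AC, AE, BC, BD, BE.
        List<List<String>> L2 = Arrays.asList(
                Arrays.asList("A", "B"), Arrays.asList("A", "C"), Arrays.asList("A", "E"),
                Arrays.asList("B", "C"), Arrays.asList("B", "D"), Arrays.asList("B", "E"));
        System.out.println(aprioriGen(L2));   // expected: [[A, B, C], [A, B, E]]
    }
}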
2.2. Shortcomings of the Apriori algorithm

Apriori is one of the most classical algorithms for mining association rules, but it also suffers from low efficiency. The time the Apriori algorithm consumes lies mainly in the following three aspects:

(1) In the connecting step, when joining k-itemsets to generate (k+1)-itemsets, many comparisons are needed to determine whether two itemsets meet the connection condition. When Lk contains m k-itemsets, the time complexity of the connecting step is O(k*m^2).

(2) In the pruning step, when determining whether a k-item subset of a candidate in Ck+1 is in the frequent set Lk, the best case needs only one scan to get the result, while the worst case needs k scans to find that the k-th subset of the candidate is not in Lk. So the average number of scans and comparisons over Lk is |Ck+1| * |Lk| * k / 2.

(3) In the counting step, when accumulating the support counts of the itemsets in Ck+1, the database needs to be scanned |Ck+1| times.

Taking these three time-consuming aspects of the classical Apriori algorithm into account, this article presents an improved Apriori algorithm based on prefixed itemsets.
3. IMPROVED APRIORI ALGORITHM
3.1. Improved Apriori algorithm
In Section 2.2 we analyzed the shortcomings of the classical Apriori algorithm, so the improvements focus on the three steps mentioned there. Since the records are already sorted in dictionary order, the candidate sets generated by the Apriori algorithm are ordered.
(1) Prefixed-itemset-based storage

In the improved algorithm we propose a new method to store the itemsets. For each itemset in Lk, we use a structure similar to Map<key, value>, in which we save the first (k-1) items as the key and the last item as the value. After saving all the itemsets in this format, we group the itemsets with the same key and store the union of their values as the new value.

For example, the database is shown in Table 1 and the minimum support is 2.
Table 1. Database
TID Itemset
T1 A,B,E
T2 B,D
T3 B,C
T4 A,B,D
T5 A,C
T6 B,C
T7 A,C
T8 A,B,C
T9 A,B,C,E
The traditional Apriori algorithm scans the database to obtain the number of times each item appears, forming the 1-itemsets, and then generates the 2-itemsets that meet the minimum support of 2. Table 2 shows the content generated by the classical Apriori algorithm.
Table 2. Classical Apriori algorithm

1-itemset          2-itemset
Item  Count        Item  Count
A     6            AB    4
B     7            AC    4
C     6            AE    2
D     2            BC    4
E     2            BD    2
                   BE    2
Table 3 shows how the same itemsets are stored with the prefixed-itemset-based storage.
Table 3. Prefixed-itemset-based storage

            Prefixed-key   Value
1-itemset   NULL           {A, B, C, D, E}
2-itemset   A              {B, C, E}
            B              {C, D, E}
As shown in Table 3, a 1-itemset has only one item, so the key of a 1-itemset is NULL. Besides, we can infer the length of an itemset from the length of its key, because the key stores all the items of the itemset except the last one.
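As a rough illustration, the prefixed-itemset storage of Table 3 could be built with a hash map from the prefix (the first k-1 items) to the sorted set of last items. The sketch below is an assumed layout, not the paper's implementation; the class and method names are hypothetical.

import java.util.*;

public class PrefixedStorage {
    // Group frequent k-itemsets by their first (k-1) items (the prefix key);
    // the value is the union of the last items of all itemsets sharing that key.
    static Map<List<String>, SortedSet<String>> buildPrefixMap(List<List<String>> Lk) {
        Map<List<String>, SortedSet<String>> map = new LinkedHashMap<>();
        for (List<String> itemset : Lk) {
            List<String> key = itemset.subList(0, itemset.size() - 1);   // empty for 1-itemsets (the "NULL" key)
            String last = itemset.get(itemset.size() - 1);
            map.computeIfAbsent(new ArrayList<>(key), p -> new TreeSet<>()).add(last);
        }
        return map;
    }

    public static void main(String[] args) {
        // L2 from Table 2: AB, AC, AE, BC, BD, BE.
        List<List<String>> L2 = Arrays.asList(
                Arrays.asList("A", "B"), Arrays.asList("A", "C"), Arrays.asList("A", "E"),
                Arrays.asList("B", "C"), Arrays.asList("B", "D"), Arrays.asList("B", "E"));
        // Prints {[A]=[B, C, E], [B]=[C, D, E]}, matching the 2-itemset rows of Table 3.
        System.out.println(buildPrefixMap(L2));
    }
}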
(2) Prefixed-itemset-based connecting step

After establishing the prefixed-itemset-based storage, when we have to generate the (k+1)-itemsets by connecting two k-itemsets, we can simply combine two different items from the value set and append them to the key to form the new itemset. For example, when connecting the 2-itemsets with the prefix key A in Table 3, we can generate the candidate 3-itemsets by combining pairs from the value {B, C, E}, which gives {{B, C}, {B, E}, {C, E}}, i.e. the candidates {A, B, C}, {A, B, E} and {A, C, E}.
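A minimal sketch of this connecting step over the assumed prefix map (the same hypothetical layout as in the previous sketch): for one prefix key, every pair of items in its value set yields a candidate (k+1)-itemset.

import java.util.*;

public class PrefixedConnect {
    // Connecting step on the prefixed storage: for one prefix key, every pair of
    // items in its value set yields a candidate (k+1)-itemset "key + pair".
    static List<List<String>> connect(List<String> key, SortedSet<String> value) {
        List<String> items = new ArrayList<>(value);
        List<List<String>> candidates = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                List<String> c = new ArrayList<>(key);
                c.add(items.get(i));
                c.add(items.get(j));
                candidates.add(c);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Key A with value {B, C, E}, as in Table 3.
        System.out.println(connect(Arrays.asList("A"), new TreeSet<>(Arrays.asList("B", "C", "E"))));
        // Prints [[A, B, C], [A, B, E], [A, C, E]]
    }
}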
(3) Prefixed-itemset-based pruning step

From Section 2.1 we know that the (k+1)-itemsets are generated from two k-itemsets, and if any k-item subset of a (k+1)-itemset does not exist in Lk, then we have to remove the (k+1)-itemset from Ck+1.

Theorem: If a (k+1)-itemset is generated by connecting two k-itemsets l1 and l2, and one of its k-item subsets does not exist in Lk, then that subset must contain both l1[k] and l2[k].
Proof: Assume l1 and l2 are both k-itemsets, and the (k+1)-itemset generated by connecting them is {l1[1], l1[2], l1[3], ..., l1[k], l2[k]}. If a k-item subset does not contain both l1[k] and l2[k], then the only possibilities are {l1[1], l1[2], ..., l1[k]} and {l1[1], l1[2], ..., l1[k-1], l2[k]}, which are exactly l1 and l2, and both l1 and l2 come from Lk. So if a k-item subset does not belong to Lk, it must contain both l1[k] and l2[k].
So in the prefixed-itemset-based pruning step, we only need to consider the subsets of a (k+1)-itemset that contain both of its last two items. With the example from Table 3, we get the result shown in Table 4.
Table 4. Pruning step

Subset of 3-itemset   Belongs to L2?
B,C                   yes
B,E                   yes
C,E                   no
C,D                   no
C,E                   no
D,E                   no
As shown in Table 4, only {B,C} and {B,E} are subsets that belong to L2; adding the corresponding prefix key gives {{A,B,C},{A,B,E}}, namely the candidate set C3. After the pruning step, we scan the database to accumulate the number of times each candidate appears. After counting, we find that both {A,B,C} and {A,B,E} meet the minimum support, so we add them to the prefixed-itemset-based storage as shown in Table 5.
Table 5. 3-itemset storage

            Prefixed-key   Value
3-itemset   A,B            {C, E}
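Under the same assumed map layout as in the earlier sketches, the pruning check only has to look at the subsets of a candidate that contain both of its last two items, as the theorem guarantees. The helper below is an illustrative sketch, not the paper's code: it drops each prefix item in turn and asks the map whether the resulting k-itemset is still frequent.

import java.util.*;

public class PrefixedPrune {
    // Pruning with the theorem: for candidate c = key + {x, y}, the only subsets that
    // can be infrequent are those containing both x and y, i.e. c with one key item removed.
    static boolean survivesPruning(List<String> c, Map<List<String>, SortedSet<String>> prefixMap) {
        int k = c.size() - 1;                       // size of the subsets to check
        for (int drop = 0; drop < k - 1; drop++) {  // drop one of the prefix items
            List<String> subset = new ArrayList<>(c);
            subset.remove(drop);
            // The subset is frequent iff its own prefix is a key whose value contains its last item.
            List<String> subKey = subset.subList(0, subset.size() - 1);
            SortedSet<String> value = prefixMap.get(subKey);
            if (value == null || !value.contains(subset.get(subset.size() - 1))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Prefixed storage of L2 from Table 3: A -> {B, C, E}, B -> {C, D, E}.
        Map<List<String>, SortedSet<String>> L2 = new HashMap<>();
        L2.put(Arrays.asList("A"), new TreeSet<>(Arrays.asList("B", "C", "E")));
        L2.put(Arrays.asList("B"), new TreeSet<>(Arrays.asList("C", "D", "E")));
        System.out.println(survivesPruning(Arrays.asList("A", "B", "C"), L2));  // true: {B,C} is in L2
        System.out.println(survivesPruning(Arrays.asList("A", "C", "E"), L2));  // false: {C,E} is not in L2
    }
}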
3.2. Algorithm

The algorithm is described as follows:

Input: database D, minimum support min_sup
Output: frequent itemsets L

1) L1 = the 1-itemsets of D
2) Map<String[], String[]> map;
3) import L1 into map, setting the key to null and the value to the union of the items of L1
4) for(k = 2; Lk-1 ≠ ∅; k++){
5)   Ck = pre_apriori_gen(map, k-2);
6)   count the number of times every itemset of Ck appears, Lk = {c ∈ Ck | c.count > min_sup}
7) }
8) Return Lk;

procedure pre_apriori_gen(map: Map<String[], String[]>; k: int)
1) for each key in map{
2)   if(key.length() == k){
3)     c := key plus two items from value
4)     if(map.containsKey(c[0:k])){
5)       if(every k-item subset of c containing its last two items belongs to a (key, value) entry of map){
6)         put c into Ck
7)       }
8)     }
9)   } else continue;
10) }
11) Return Ck
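Putting the pieces together, one possible reading of this pseudocode is the self-contained Java sketch below, which runs the improved candidate generation on the example database of Table 1. All identifiers are illustrative assumptions; the paper does not give a concrete implementation, so this is only a sketch of the idea.

import java.util.*;

public class PrefixedApriori {
    // Frequent itemsets of one level are kept as a map: prefix (first k-1 items) -> set of last items.
    static List<List<String>> minePrefixed(List<Set<String>> db, int minSup) {
        List<List<String>> result = new ArrayList<>();
        // L1: frequent single items, stored under the empty ("NULL") prefix.
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> t : db) for (String item : t) counts.merge(item, 1, Integer::sum);
        SortedSet<String> frequentItems = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minSup) frequentItems.add(e.getKey());
        Map<List<String>, SortedSet<String>> level = new LinkedHashMap<>();
        level.put(Collections.emptyList(), frequentItems);
        collect(level, result);

        while (true) {
            // Connecting step: pair up items under each prefix, then prune with the theorem.
            List<List<String>> candidates = new ArrayList<>();
            for (Map.Entry<List<String>, SortedSet<String>> e : level.entrySet()) {
                List<String> items = new ArrayList<>(e.getValue());
                for (int i = 0; i < items.size(); i++)
                    for (int j = i + 1; j < items.size(); j++) {
                        List<String> c = new ArrayList<>(e.getKey());
                        c.add(items.get(i));
                        c.add(items.get(j));
                        if (survives(c, level)) candidates.add(c);
                    }
            }
            // Counting step: scan the database for each surviving candidate and keep those meeting minSup.
            Map<List<String>, SortedSet<String>> next = new LinkedHashMap<>();
            for (List<String> c : candidates) {
                int sup = 0;
                for (Set<String> t : db) if (t.containsAll(c)) sup++;
                if (sup >= minSup)
                    next.computeIfAbsent(new ArrayList<>(c.subList(0, c.size() - 1)), p -> new TreeSet<>())
                        .add(c.get(c.size() - 1));
            }
            if (next.isEmpty()) return result;
            collect(next, result);
            level = next;
        }
    }

    // Pruning: only the subsets containing the candidate's last two items need checking.
    static boolean survives(List<String> c, Map<List<String>, SortedSet<String>> level) {
        for (int drop = 0; drop < c.size() - 2; drop++) {
            List<String> s = new ArrayList<>(c);
            s.remove(drop);
            SortedSet<String> v = level.get(s.subList(0, s.size() - 1));
            if (v == null || !v.contains(s.get(s.size() - 1))) return false;
        }
        return true;
    }

    // Expand one prefix map back into explicit itemsets.
    static void collect(Map<List<String>, SortedSet<String>> level, List<List<String>> out) {
        for (Map.Entry<List<String>, SortedSet<String>> e : level.entrySet())
            for (String last : e.getValue()) {
                List<String> itemset = new ArrayList<>(e.getKey());
                itemset.add(last);
                out.add(itemset);
            }
    }

    public static void main(String[] args) {
        // The database of Table 1, with minimum support 2.
        List<Set<String>> db = new ArrayList<>();
        for (String t : new String[]{"A,B,E", "B,D", "B,C", "A,B,D", "A,C", "B,C", "A,C", "A,B,C", "A,B,C,E"})
            db.add(new TreeSet<>(Arrays.asList(t.split(","))));
        // Prints all frequent itemsets, ending with [A, B, C] and [A, B, E].
        System.out.println(minePrefixed(db, 2));
    }
}

Note that the pseudocode above returns only Lk, while this sketch collects the frequent itemsets of every level; on the Table 1 data it ends with the 3-itemsets {A,B,C} and {A,B,E}, matching Table 5.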
4. EXPERIMENT AND RESULTS
The experimental data is a total of 120,000 clinical prescription records of patients with diabetes in Ruijin Hospital; the data records the prescription drug number per user per visit. The experimental machine is configured with an Intel Core i5 2.7GHz processor and 8GB of 1866MHz LPDDR3 memory.

This experiment compares the classical Apriori algorithm and the prefixed-itemset-based algorithm from two aspects: one is to compare the running efficiency with a fixed number of total tests and a variable minimum support, and the other is to compare the running efficiency with a fixed minimum support and a variable number of total tests.

The result of the first experiment is as follows.

Figure 1. Time consuming with variable min_sup

The figure above shows the time consumed by the two algorithms for a fixed number of total tests and a variable min_sup. We can infer that the smaller min_sup is, the more the improved algorithm improves the running efficiency. When min_sup increases to a certain point, the classical Apriori algorithm and the improved algorithm have the same efficiency.

Table 6. Improvements under variable min_sup (total 120,000 records)

Min_sup   Classical Apriori Time(ms)   Improved Apriori Time(ms)   Improvement (%)
600       210696                       25192                       81.44%
1200      53822                        11648                       68.63%
1800      26317                        7614                        51.88%
2400      19359                        7127                        38.99%
3000      15508                        6753                        56.45%
3600      12393                        5842                        52.86%
4200      9017                         5424                        39.85%
4800      5705                         5175                        9.29%
5400      4868                         5161                        -6.02%

Table 6 shows the specific running time of the classical Apriori algorithm and the improved algorithm, and the comparison between them.

The results of the second experiment are as follows.

Figure 2. Time consuming with variable total tests

The figure above shows the time consumed by the two algorithms for a fixed min_sup and a variable number of total tests. We can tell that as the number of total tests becomes larger, the improvement becomes more obvious.
Table 7. Improvements under variable total tests (min_sup = 2%)

Total tests   Classical Apriori Time(ms)   Improved Apriori Time(ms)   Improvement (%)
10k           1686                         390                         76.87%
20k           2839                         695                         75.52%
30k           4088                         1229                        69.94%
40k           5729                         1846                        67.78%
50k           8141                         2409                        70.41%
60k           9833                         3197                        67.49%
70k           12848                        3630                        71.75%
80k           13004                        4339                        66.63%
90k           16007                        5442                        66.00%
100k          17438                        5588                        67.96%
Table 7 shows the specific running time of the two algorithms, and we can learn from the table that when min_sup is fixed to 2% of the total tests, the improvement rate is about 70%. The experiments show that the prefixed-itemset-based Apriori algorithm is effective and feasible.
5. SUMMARY

In this paper, we described the Apriori algorithm in detail, pointed out some limitations of the classical Apriori algorithm in two of its steps, namely the connecting step and the pruning step, and proposed a prefixed-itemset-based data storage together with the improvements based on it. With the help of the prefixed-itemset-based storage, we finish the connecting step and the pruning step of the Apriori algorithm much faster; besides, we can store the candidate itemsets in less space. Finally, we compared the efficiency of the classical Apriori algorithm and the improved Apriori algorithm with respect to the support threshold and the total number of records, and the experimental results on both aspects prove the feasibility of the prefixed-itemset-based algorithm.
REFERENCES
[1] Han J, Kamber M, Pei J, et al. Data Mining: Concepts and Techniques. 3rd ed[J]. San Francisco, 2001, 29(S1): S103-S109.
[2] Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases[C]//Proceedings of the ACM SIGMOD Conference. 1993: 207-216.
[3] Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th Int. Conf. Very Large Data Bases, VLDB. Vol. 1215. 1994.
[4] Chun-Sheng Z, Yan L. Extension of local association rules mining algorithm based on apriori
algorithm[C]//Software Engineering and Service Science (ICSESS), 2014 5th IEEE International
Conference on. IEEE, 2014: 340-343.
[5] Jia Y, Xia G, Fan H, et al. An Improved Apriori Algorithm Based on Association Analysis[C]//2012
Third International Conference on Networking and Distributed Computing. 2012.
[6] Shuangyue L, Li P. Analysis of Coal Mine Hidden Danger Correlation Based on Improved A Priori
Algorithm[C]//Intelligent Systems Design and Engineering Applications, 2013 Fourth International
Conference on. IEEE, 2013: 112-116.
[7] Wang P, An C, Wang L. An improved algorithm for Mining Association Rule in relational
database[C]//Machine Learning and Cybernetics (ICMLC), 2014 International Conference on. IEEE,
2014, 1: 247-252.
[8] Vaithiyanathan V, Rajeswari K, Phalnikar R, et al. Improved apriori algorithm based on selection
criterion[C]//Computational Intelligence & Computing Research (ICCIC), 2012 IEEE International
Conference on. IEEE, 2012: 1-4.
[9] Lin X. MR-Apriori: Association Rules algorithm based on MapReduce[C]//Software Engineering and
Service Science (ICSESS), 2014 5th IEEE International Conference on. IEEE, 2014: 141-144.
[10] Zhang, Wenjing, Donglai Ma, and Wei Yao. "Medical Diagnosis Data Mining Based on Improved
Apriori Algorithm." Journal of Networks 9.5 (2014): 1339-1345.
[11] Wu, Huan, et al. "An improved apriori-based algorithm for association rules mining." Fuzzy Systems
and Knowledge Discovery, 2009. FSKD'09. Sixth International Conference on. Vol. 2. IEEE, 2009.
[12] Wang, Yuan, and Lan Zheng. "Endocrine Hormones Association Rules Mining Based on Improved
Apriori Algorithm." Journal of Convergence Information Technology 7.7 (2012).
[13] Lin, Ming-Yen, Pei-Yu Lee, and Sue-Chen Hsueh. "Apriori-based frequent itemset mining algorithms
on MapReduce." Proceedings of the 6th international conference on ubiquitous information
management and communication. ACM, 2012.
[14] Chai, Sheng, Jia Yang, and Yang Cheng. "The research of improved Apriori algorithm for mining
association rules." Service Systems and Service Management, 2007 International Conference on.
IEEE, 2007.
[15] Han J, Pei J, Yin Y. Mining Frequent Patterns without Candidate Generation[J]. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1999, 29(2): 1-12.
AUTHORS
Yu Shoujian, associate professor; main research directions: Web services, enterprise application integration, databases and data warehouses.

Zhou Yiyang, master's student; main research directions: data mining, machine learning.