IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 17, Issue 5, Ver. V (Sep. – Oct. 2015), PP 08-14
www.iosrjournals.org
DOI: 10.9790/0661-17550814
Mining High Utility Itemsets from its Concise and Lossless
Representations
Nasreen Ali A.¹, Arunkumar M²
¹ (Computer Science and Engineering, Ilahia College of Engineering and Technology / Mahatma Gandhi University, India)
² (Computer Science and Engineering, Ilahia College of Engineering and Technology / Mahatma Gandhi University, India)
Abstract: Mining high utility itemsets from databases, using the utility of items, is an emerging topic in data mining. Recent
algorithms suffer from poor performance in terms of memory and time. The novel strategy proposed
here is the Miner algorithm. A vertical data structure is used to store the elements together with their utility values. A
matrix representation is generated to identify element co-occurrences and to reduce the number of join operations needed for the
generated patterns. An extensive experimental study on real-life datasets shows that the resulting algorithm
reduces the number of join operations by up to 95% compared with the state-of-the-art UP-Growth algorithm.
Keywords: Utility, utility list, co-occurrences, pruning
I. Introduction
Data mining is concerned with analyzing large volumes of data to automatically discover interesting
regularities or relationships. The primary goal is to discover hidden patterns and unexpected trends in the data. Data
mining activities use a combination of techniques from database technologies, statistics, artificial intelligence
and machine learning. Real-world applications include bioinformatics, genetics, medicine, clinical research,
education, retail and marketing research.
Frequency mining [1] is a popular data mining task with a wide range of applications. Given a
transaction database, it consists of discovering frequent itemsets, i.e. groups of items (itemsets) appearing
frequently in transactions [1]. However, an important limitation of frequency mining is that it assumes that each item
appears at most once in a transaction and that all items have the same importance (weight, unit profit or value).
These assumptions often do not hold in real applications. For
example, consider a database of customer transactions containing the unit profit of each item and different
purchase quantities of each item. Frequency mining algorithms would discard this information and may thus discover
many frequent itemsets generating a low profit and fail to discover less frequent itemsets that generate a high
profit.
Utility mining [5], [6], [7], [8] has emerged as an important topic in data mining. In utility mining, each
item has a weight (e.g. unit profit) and can appear more than once in each transaction (e.g. purchase quantity).
The utility of an itemset represents its importance, which can be measured in terms of weight, profit, cost,
quantity or other information depending on the user preference. Utility is a measure of how useful or profitable
an itemset X is. The utility of items in a transaction database consists of two aspects: (1) the importance of
distinct items, which is called external utility, and (2) the importance of the items in the transaction, which is
called internal utility. The utility of an item in a transaction is defined as its external utility multiplied by its internal utility.
The utility of an itemset X, i.e., u(X), is the sum of the utilities of itemset X in all the transactions containing X.
An itemset X is called a high utility itemset if and only if u(X) ≥ min_util, where min_util is a user-defined
minimum utility threshold. However, mining high utility itemsets from databases is not an easy task, since the
downward closure property used in frequent itemset mining does not hold. In other words, pruning the search space for
high utility itemset mining is difficult because a superset of a low-utility itemset may be a high utility itemset. A
naive method to address this problem is to enumerate all itemsets from the database by the principle of
exhaustion. Obviously, this method suffers from the problem of a large search space, especially when the database
contains many long transactions or a low minimum utility threshold is set. Hence, how to effectively prune the
search space and efficiently capture all high utility itemsets without missing any is a crucial challenge in utility mining.
To identify high utility itemsets, most existing algorithms first generate candidate itemsets by
overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms
incur the problem that a very large number of candidates are generated, but most of the candidates are found
not to be high utility after their exact utilities are computed. In this paper, we propose an algorithm, called
Miner, for high utility itemset mining. Miner uses a novel structure, called utility-list, to store both the utility
information about an itemset and the heuristic information for pruning the search space of Miner. By avoiding
the costly generation and utility computation of numerous candidate itemsets, Miner can efficiently mine high
utility itemsets from the utility-lists constructed from a mined database. To reduce the number of costly joins
that are performed, we propose a novel pruning strategy named EUCP (Estimated Utility Co-occurrence Pruning)
that can prune itemsets without having to perform joins. This strategy is easy to implement and very effective.
We compare the performance of Miner and UP-Growth on real-life datasets. Results show that Miner performs
up to 95% fewer join operations than UP-Growth and is up to six times faster than UP-Growth. Experimental
results also show that Miner outperforms this algorithm in terms of both running time and memory consumption.
II. Review of literature
Fast Algorithms for Mining Association Rules by R. Agrawal and R. Srikant in 1994 proposed the Apriori
algorithm. Apriori is more efficient than its predecessors during the candidate generation process for two reasons: it employs a
different candidate generation method and a new pruning technique. Apriori finds all
the large itemsets in the database in two steps. First the candidate itemsets are generated, and then the
database is scanned to check the actual support count of the corresponding itemsets. During the first scan of
the database the support count of each item is calculated, and the large 1-itemsets are obtained by pruning those
items whose support is below the predefined threshold. In each pass only those candidate itemsets that
include the same specified number of items are generated and checked.
Advantages: 1] Uses the large itemset property. 2] Easily parallelized. 3] Easy to implement. 4] Does not need
to generate conditional pattern bases. Disadvantages: 1] Requires multiple database scans. 2] Assumes the
transaction database is memory resident. 3] Generates a large number of candidate itemsets.
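The level-wise generate-and-count loop described above can be sketched as follows. This is a minimal illustration, not the original authors' implementation: the function name `apriori`, the `min_support` parameter (an absolute count) and the sample transactions (the item sets of the running example used later in this paper, with quantities ignored) are assumptions made for the example.

```python
from itertools import combinations

# Item sets of the five transactions of the running example (quantities ignored).
TRANSACTIONS = [
    {"a", "c", "d"},
    {"a", "c", "e", "g"},
    {"a", "b", "c", "d", "e", "f"},
    {"b", "c", "d", "e"},
    {"b", "c", "e", "g"},
]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset found in >= min_support transactions."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # One database pass: count the support of every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Generate candidates of size k + 1 by joining frequent k-itemsets,
        # keeping only those whose k-subsets are all frequent (downward closure).
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = [
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        ]
    return frequent


frequent_itemsets = apriori(TRANSACTIONS, min_support=3)
```

With a support threshold of 3 transactions, {b, c, e} is frequent while {b, d} is not, because {b, d} appears in only two transactions.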
Mining Frequent Patterns without Candidate Generation by J. Han, J. Pei, and Y. Yin in 2000
proposed a novel frequent pattern tree (FP-tree), which is an extended prefix-tree structure for storing
compressed, crucial information about frequent patterns, and developed an efficient FP-tree-based mining method,
FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is
achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data
structure, which avoids costly, repeated database scans, (2) FP-tree-based mining adopts a pattern fragment
growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based,
divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined
patterns in conditional databases, which dramatically reduces the search space. The performance study shows that
the FP-growth method is efficient and scalable for mining both long and short frequent patterns, is about an
order of magnitude faster than the Apriori algorithm, and is also faster than some more recently reported frequent
pattern mining methods. However, item priority is not taken into consideration. Advantages: 1] Finds
frequent itemsets without generating any candidate itemsets. 2] Scans the database just twice.
Disadvantages: 1] Treats all items with the same importance/weight/price. 2] Consumes
more memory and performs badly on datasets with long patterns.
A Fast High Utility Itemsets Mining Algorithm by Y. Liu, W.-K. Liao and A. Choudhary in 2005
proposed the Two-Phase algorithm. Utility mining focuses on identifying the itemsets with high utilities. As the
“downward closure property” does not apply to utility mining, the generation of candidate itemsets is the most
costly step in terms of time and memory space. The Two-Phase algorithm efficiently prunes
the number of candidates and precisely obtains the complete set of high utility itemsets. In the first
phase, a model is proposed that applies the “transaction-weighted downward closure property” on the search
space to expedite the identification of candidates. In the second phase, one extra database scan is performed to
identify the high utility itemsets. According to its authors, it performs very efficiently in terms of speed and memory cost, and shows
good scalability on multiple processors, even on large databases that are difficult for existing algorithms to
handle. Advantages: 1] Performs very efficiently in terms of speed and memory cost. Disadvantages: 1]
Generates too many candidates to obtain the HTWUIs (high transaction-weighted utilization itemsets). 2] Requires multiple database scans.
UP-Growth: An Efficient Algorithm for High Utility Itemsets Mining by Vincent S. Tseng, Cheng-
Wei Wu, Bai-En Shie, and Philip S. Yu in 2010, proposed an efficient algorithm, namely UP-Growth (Utility
Pattern Growth), for mining high utility itemsets with a set of techniques for pruning candidate itemsets. The
information of high utility itemsets is maintained in a special data structure named UP-Tree (Utility Pattern
Tree) such that the candidate itemsets can be generated efficiently with only two scans of the database. The
experimental results show that UP-Growth not only reduces the number of candidates effectively but also
outperforms other algorithms substantially in terms of execution time, especially when the database contains lots
of long transactions.
Mining High Utility Itemsets without Candidate Generation by Mengchi Liu and Junfeng Qu [10] proposed the
algorithm HUI-Miner. High utility itemsets refer to the sets of items with high utility, such as profit, in a database,
and efficient mining of high utility itemsets plays a crucial role in many real-life applications and is an
important research issue in the data mining area. To identify high utility itemsets, most existing algorithms first
generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of
these candidates. These algorithms incur the problem that a very large number of candidates are generated, but
most of the candidates are found not to be high utility after their exact utilities are computed. HUI-Miner
uses a novel structure, called utility-list, to store both the utility information about an itemset and the heuristic
information for pruning the search space of HUI-Miner. By avoiding the costly generation and utility
computation of numerous candidate itemsets, HUI-Miner can efficiently mine high utility itemsets from the
utility-lists constructed from a mined database. Its authors compared HUI-Miner with the state-of-the-art algorithms on
various databases, and their experimental results show that HUI-Miner outperforms these algorithms in terms of both
running time and memory consumption.
III. Current work
We first introduce important preliminary definitions.
A transaction database is a set of transactions D = {T1, T2, ..., Tn} such that each transaction Tc has a
unique identifier c called its Tid. Each item i ∈ I, where I is the set of items, is associated with a positive number
p(i), called its external importance or utility (e.g. unit profit). For each transaction Tc such that i ∈ Tc, a positive
number q(i, Tc) is called the internal importance or utility of i (e.g. purchase quantity).
Example 1. Consider the database of Table 1 (left), which will be used as our running example. This
database contains five transactions (T1, T2, ..., T5). Transaction T2 indicates that items a, c, e and g appear in this
transaction with an internal utility of respectively 2, 6, 2 and 5. Table 1 (right) indicates that the external utilities of
these items are respectively 5, 1, 3 and 1.
The importance or utility of an item i in a transaction Tc is denoted as u(i, Tc) and defined as p(i) × q(i, Tc).
The importance or utility of an itemset X (a group of items X ⊆ I) in a transaction Tc is denoted as u(X, Tc)
and defined as $u(X, T_c) = \sum_{i \in X} u(i, T_c)$.
Example 2. The utility of item a in T2 is u(a,T2) = 5 × 2 = 10. The utility of the itemset {a, c}
in T2 is u ({a, c}, T2) = u (a, T2) + u(c, T2) = 5 × 2 + 1 × 6 = 16.
The importance or utility of an itemset X is denoted as u(X) and defined as $u(X) = \sum_{T_c \in g(X)} u(X, T_c)$,
where g(X) is the set of transactions containing X.
Example 3. The itemset {a, c} appears in T1, T2 and T3, so its utility is u({a,c}) = u({a,c},T1) + u({a,c},T2) + u({a,c},T3)
= u(a,T1) + u(c,T1) + u(a,T2) + u(c,T2) + u(a,T3) + u(c,T3) = 5 + 1 + 10 + 6 + 5 + 1 = 28.
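The definitions above can be checked mechanically on the running example. The sketch below is only an illustration: the dictionary layout (`PROFIT`, `DB`) and the function names are assumptions made for the example, while the values come from Table 1.

```python
# External utilities (unit profits) and the transaction database of Table 1.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}


def item_utility(item, tid):
    """u(i, Tc) = p(i) * q(i, Tc)."""
    return PROFIT[item] * DB[tid][item]


def itemset_utility_in(itemset, tid):
    """u(X, Tc): sum of the utilities of the items of X in transaction Tc."""
    return sum(item_utility(i, tid) for i in itemset)


def itemset_utility(itemset):
    """u(X): sum of u(X, Tc) over the transactions that contain every item of X."""
    return sum(
        itemset_utility_in(itemset, tid)
        for tid, transaction in DB.items()
        if all(i in transaction for i in itemset)
    )


print(item_utility("a", "T2"))               # Example 2: 10
print(itemset_utility_in({"a", "c"}, "T2"))  # Example 2: 16
print(itemset_utility({"a", "c"}))           # Example 3: 28
```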
The purpose of high-utility itemset mining is to discover all high-utility itemsets. An itemset X is a
high-utility itemset if its utility u(X) is no less than a user-specified minimum utility threshold min_util.
Otherwise, X is a low-utility itemset.
Example 4. If min_util = 30, the high-utility itemsets in the database of our running example are {b,d}, {a,c,e},
{b,c,d}, {b,c,e}, {b,d,e}, {b,c,d,e} and {a,b,c,d,e,f}, with respectively a utility of 30, 31, 34, 31, 36, 40 and 30.
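Because the running example has only seven items, the list in Example 4 can be verified by exhaustive enumeration. The following sketch is the naive method mentioned in the introduction, not the Miner algorithm; the names `utility` and `high_utility_itemsets` are assumptions made for this illustration.

```python
from itertools import combinations

PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}


def utility(itemset):
    """u(X): total utility of X over the transactions containing all of X."""
    total = 0
    for transaction in DB.values():
        if all(i in transaction for i in itemset):
            total += sum(PROFIT[i] * transaction[i] for i in itemset)
    return total


def high_utility_itemsets(min_util):
    """Enumerate every non-empty itemset and keep those with u(X) >= min_util."""
    items = sorted(PROFIT)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            u = utility(combo)
            if u >= min_util:
                result[combo] = u
    return result


for itemset, u in high_utility_itemsets(30).items():
    print(itemset, u)
```

The enumeration visits 2^7 − 1 = 127 itemsets here. The number of itemsets doubles with each additional item, which is why pruning strategies are needed on real databases.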
The utility of a transaction Tc (its transaction utility, TU) is the sum of the utilities of the items of Tc in Tc,
i.e. $TU(T_c) = \sum_{x \in T_c} u(x, T_c)$.
Example 5. Table 2 (left) shows the TU of transactions T1, T2, T3, T4 and T5 from our running example.
The transaction-weighted utility (TWU) of an itemset X is the sum of the transaction utilities of the
transactions containing X, i.e. $TWU(X) = \sum_{T_c \in g(X)} TU(T_c)$.
Example 6. Table 2 (center) shows the TWU of the single items a, b, c, d, e, f and g. Consider item a. TWU(a) =
TU(T1) + TU(T2) + TU(T3) = 8 + 27 + 30 = 65.
The TWU measure has three important properties that are used to prune the search space.
Property 1 (overestimation). The TWU of an itemset X is higher than or equal to its utility, i.e. TWU(X) ≥ u(X)
[8].
Property 2 (anti-monotonicity). The TWU measure is anti-monotonic. Let X and Y be two itemsets. If X
⊂ Y, then TWU(X) ≥ TWU(Y) [8].
Property 3 (pruning). Let X be an itemset. If TWU(X) < min_util, then the itemset X is a low-utility
itemset, as are all its supersets. Proof. This directly follows from Property 1 and Property 2.
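The TU and TWU measures and the three properties can be illustrated on the running example. This is a minimal sketch with assumed function names (`transaction_utility`, `twu`, `utility`); the threshold of 35 used for the pruning example is also an arbitrary choice made for the illustration.

```python
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}


def contains(tid, itemset):
    return all(i in DB[tid] for i in itemset)


def transaction_utility(tid):
    """TU(Tc): sum of the utilities of all the items of Tc."""
    return sum(PROFIT[i] * q for i, q in DB[tid].items())


def twu(itemset):
    """TWU(X): sum of TU(Tc) over the transactions containing X."""
    return sum(transaction_utility(tid) for tid in DB if contains(tid, itemset))


def utility(itemset):
    """u(X): exact utility of X."""
    return sum(
        PROFIT[i] * DB[tid][i] for tid in DB if contains(tid, itemset) for i in itemset
    )


print({tid: transaction_utility(tid) for tid in DB})  # TU of T1..T5
print({i: twu({i}) for i in sorted(PROFIT)})          # TWU of single items

# Property 3 with an assumed threshold of 35: items whose TWU is below the
# threshold cannot appear in any high-utility itemset.
prunable = [i for i in sorted(PROFIT) if twu({i}) < 35]
print(prunable)
```

With this threshold only item f is pruned. Its TWU of 30 bounds the utility of every itemset containing f, so none of them can reach 35.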
The utility-list of an itemset X in a database D is a set of tuples such that there is a
tuple (tid, iutil, rutil) for each transaction Ttid containing X. The iutil element of a tuple is the utility of X in
Ttid, i.e. u(X, Ttid). The rutil element of a tuple is the remaining utility, i.e. the sum of the utilities of the items of Ttid
that follow all the items of X according to the total order ≻ on items: $\sum_{i \in T_{tid} \wedge i \succ x\ \forall x \in X} u(i, T_{tid})$.
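Example 7. Assume, for illustration, that items are ordered alphabetically, so that rutil sums the utilities of the items that
follow X in each transaction. The utility-list of {a} is {(T1, 5, 3), (T2, 10, 17), (T3, 5, 25)}. The utility-list of {e} is
{(T2, 6, 5), (T3, 3, 5), (T4, 3, 0), (T5, 3, 2)}. The utility-list of {a, e} is {(T2, 16, 5), (T3, 8, 5)}.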
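The utility-lists of Example 7 can be reproduced directly from the database. In the sketch below, the function name `utility_list` and its `order` parameter are assumptions made for the illustration; the alphabetical order is used only to match the example, whereas the Miner algorithm itself orders items by ascending TWU.

```python
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}


def utility_list(itemset, order):
    """Build the list of (tid, iutil, rutil) tuples of `itemset` under `order`."""
    rank = {item: position for position, item in enumerate(order)}
    last = max(rank[i] for i in itemset)  # position of the last item of X
    tuples = []
    for tid, transaction in DB.items():
        if all(i in transaction for i in itemset):
            # iutil: utility of X in this transaction.
            iutil = sum(PROFIT[i] * transaction[i] for i in itemset)
            # rutil: utility of the items that come after every item of X.
            rutil = sum(
                PROFIT[i] * q for i, q in transaction.items() if rank[i] > last
            )
            tuples.append((tid, iutil, rutil))
    return tuples


ALPHABETICAL = "abcdefg"
print(utility_list({"a"}, ALPHABETICAL))
print(utility_list({"e"}, ALPHABETICAL))
print(utility_list({"a", "e"}, ALPHABETICAL))
```

Summing the iutil values of the list of {a, e} gives 24, which is its exact utility (Property 4). Adding the rutil values gives 34, the upper bound used by Property 5.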
To obtain the high-utility itemsets, Miner performs a single database scan to create the utility-lists of
patterns containing single items. Then, larger patterns are obtained by performing the join operation on the
utility-lists of smaller patterns. Pruning the search space is done using the two following properties.
Property 4 (sum of iutils). Let X be an itemset. If the sum of the iutil values in the utility-list of X is higher than or
equal to min_util, then X is a high-utility itemset. Otherwise, it is a low-utility itemset [7].
Property 5 (sum of iutils and rutils). Let X be an itemset. Let the extensions of X be the itemsets that can be
obtained by appending an item y to X such that y ≻ i for every item i in X. If the sum of the iutil and rutil values in the
utility-list of X is less than min_util, all extensions of X and their transitive extensions are low-utility itemsets
[7].
In the next section, we introduce our novel algorithm, which improves upon existing algorithms by
being able to eliminate low-utility itemsets without performing join operations.
Table 1. A transaction database (left) and external utility values (right)
Table 2. Transaction utilities (left), TWU values (center) and EUCS (right)
IV. Algorithm
In this section, we present our proposal, the Miner algorithm. The main procedure (Algorithm 1) takes
as input a transaction database with utility values and the min_util threshold. The algorithm first scans the
database to calculate the TWU of each item. Then, the algorithm identifies the set I∗ of all items having a TWU
no less than min_util. The TWU values are then used to establish a total order ≻ on the items, which is the order of
ascending TWU values. During the second database scan, the items in each transaction are reordered according to ≻,
the utility-list of each item i ∈ I∗ is built, and the novel structure named EUCS (Estimated Utility Co-occurrence
Structure) is built. This structure
is defined as a set of triples of the form (a, b, c) ∈ I∗ × I∗ × R. A triple (a, b, c) indicates that TWU({a, b}) = c.
The EUCS can be implemented as a triangular matrix, as shown in Table 2 (right), in which only the triples
of the form (a, b, c) such that c ≠ 0 need to be kept. Storing only the non-zero entries makes the EUCS memory
efficient, because few items co-occur with other items. Building the EUCS is very fast (it is performed with a single database scan) and
it occupies a small amount of memory, bounded by |I∗| × |I∗|, although in practice the size is much smaller because
a limited number of pairs of items co-occur in transactions. After the construction of the EUCS, the depth-first
search exploration of itemsets starts by calling the recursive procedure Search with the empty itemset ∅, the set
of single items I∗, min_util and the EUCS structure.
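The construction of the EUCS can be sketched as below. The paper does not prescribe a concrete data layout, so the dictionary keyed by alphabetically sorted item pairs, the function name `build_eucs`, and the use of the original (unrevised) transaction utilities are all assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations

PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}


def build_eucs(db, profit, min_util):
    """Return {(x, y): TWU({x, y})} for the co-occurring pairs of promising items."""
    tu = {tid: sum(profit[i] * q for i, q in t.items()) for tid, t in db.items()}
    # First scan: TWU of single items, used to find the promising items I*.
    item_twu = defaultdict(int)
    for tid, transaction in db.items():
        for item in transaction:
            item_twu[item] += tu[tid]
    promising = {item for item, value in item_twu.items() if value >= min_util}
    # Second scan: add TU(Tc) to every pair of promising items found in Tc.
    eucs = defaultdict(int)
    for tid, transaction in db.items():
        kept = sorted(item for item in transaction if item in promising)
        for x, y in combinations(kept, 2):
            eucs[(x, y)] += tu[tid]
    return dict(eucs)


eucs = build_eucs(DB, PROFIT, min_util=30)
print(eucs[("a", "b")])     # 30
print(("d", "g") in eucs)   # False: d and g never co-occur, so no entry is stored
```

Pairs that never co-occur are simply absent from the dictionary, which corresponds to keeping only the triples with c ≠ 0.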
The Search procedure (Algorithm 2) takes as input (1) an itemset P, (2) the extensions of P, each having the form Pz,
meaning that Pz was previously obtained by appending an item z to P, (3) min_util and (4) the EUCS. The
Search procedure operates as follows. For each extension Px of P, the sum of the iutil values in the utility-list of Px
is computed; if it is no less than min_util, then Px is a high-utility itemset and it is output. Then, if the sum of the
iutil and rutil values in the utility-list of Px is no less than min_util, the extensions of Px should be explored. This is
performed by merging Px with every extension Py of P such that y ≻ x, to form extensions of the form Pxy
containing |Px| + 1 items. The utility-list of Pxy is then constructed by calling the Construct procedure (cf.
Algorithm 3) to join the utility-lists of P, Px and Py. Then, a recursive call to the Search procedure with Pxy is
done to calculate its utility and explore its extension(s). Since the Search procedure starts from single
items, recursively explores the search space by appending single items, and only prunes the search space based on
Property 5, it can easily be seen from Properties 4 and 5 that this procedure is correct and complete for discovering
all high-utility itemsets.

Transaction database (Table 1, left):
TID   Transaction
T1    (a,1) (c,1) (d,1)
T2    (a,2) (c,6) (e,2) (g,5)
T3    (a,1) (b,2) (c,1) (d,6) (e,1) (f,5)
T4    (b,4) (c,3) (d,3) (e,1)
T5    (b,2) (c,2) (e,1) (g,2)

External utility values (Table 1, right):
Item    a  b  c  d  e  f  g
Profit  5  2  1  2  3  1  1

Transaction utilities (Table 2, left):
TID   TU
T1    8
T2    27
T3    30
T4    20
T5    11

TWU values (Table 2, center):
Item  TWU
a     65
b     61
c     96
d     58
e     88
f     30
g     38

EUCS (Table 2, right):
Item   a   b   c   d   e   f
b     30
c     65  61
d     38  50  58
e     57  61  88  50
f     30  30  30  30  30
g     27  11  38   0  38   0
The Construct procedure considers the revised transactions, in which the items are arranged in
ascending order of TWU. X/T is the set of all the items in T after X. ru(X/T) is the remaining utility of itemset
X in T, which is the sum of the utilities of all the items in X/T. No database scan is needed here. First, we identify the
transactions common to the utility-lists of two single items and combine the matching tuples to form the utility-list of a 2-itemset.
More generally, to construct the utility-list of a k-itemset, we combine the utility-lists of two (k−1)-itemsets that share the
same (k−2)-item prefix.
Co-occurrence based pruning. The main novelty in Miner is a pruning mechanism named EUCP
(Estimated Utility Co-occurrence Pruning), which uses the EUCS. EUCP directly eliminates a low-utility
extension Pxy and all its transitive extensions without constructing their utility-lists. This is done on line 8 of the
Search procedure. The pruning condition is that if there is no tuple (x, y, c) in the EUCS such
that c ≥ min_util, then Pxy and all its supersets are low-utility itemsets and are not explored. This strategy is correct
(it only prunes low-utility itemsets). The proof is that, by Property 3, if an
itemset X contains another itemset Y such that TWU(Y) < min_util, then X and its supersets are low-utility
itemsets. Only the pair {x, y} needs to be checked on line 8, because the Search procedure is recursive and all the other
pairs of items in Pxy have already been checked in the previous recursions of
the Search procedure leading to Pxy. For example, consider an itemset Z = {a1, a2, a3, a4}. To generate this
itemset, the Search procedure had to combine {a1, a2, a3} and {a1, a2, a4}, obtained by combining {a1, a2} and
{a1, a3}, and {a1, a2} and {a1, a4}, themselves obtained by combining single items. It can easily be observed that when
generating Z, all pairs of items in Z have already been checked by EUCP except {a3, a4}.
Algorithm 1: Miner Algorithm
input: D: a transaction database, min_util: a user-specified threshold
output: the set of high-utility itemsets
1 Scan D to calculate the TWU of single items;
2 I∗ ← each item i such that TWU(i) ≥ min_util;
3 Let ≻ be the total order of ascending TWU values on I∗;
4 Scan D to build the utility-list of each item i ∈ I∗ and build the EUCS structure;
5 Search(∅, I∗, min_util, EUCS);
Algorithm 2: Search Algorithm
input: P: an itemset, ExtensionsOfP: a set of extensions of P, the min_util threshold, the EUCS structure
output: the set of high-utility itemsets
1 foreach itemset Px ∈ ExtensionsOfP do
2     if SUM(Px.utilitylist.iutils) ≥ min_util then
3         output Px;
4     end
5     if SUM(Px.utilitylist.iutils) + SUM(Px.utilitylist.rutils) ≥ min_util then
6         ExtensionsOfPx ← ∅;
7         foreach itemset Py ∈ ExtensionsOfP such that y ≻ x do
8             if ∃(x, y, c) ∈ EUCS such that c ≥ min_util then
9                 Pxy ← Px ∪ Py;
10                Pxy.utilitylist ← Construct(P, Px, Py);
11                ExtensionsOfPx ← ExtensionsOfPx ∪ {Pxy};
12            end
13        end
14        Search(Px, ExtensionsOfPx, min_util, EUCS);
15    end
16 end
Algorithm 3: Construct Algorithm
input: P: an itemset, Px: the extension of P with an item x, Py: the extension of P with an item y
output: the utility-list of Pxy
1 UtilityListOfPxy ← ∅;
2 foreach tuple ex ∈ Px.utilitylist do
3     if ∃ey ∈ Py.utilitylist such that ex.tid = ey.tid then
4         if P.utilitylist ≠ ∅ then
5             search element e ∈ P.utilitylist such that e.tid = ex.tid;
6             exy ← (ex.tid, ex.iutil + ey.iutil − e.iutil, ey.rutil);
7         end
8         else
9             exy ← (ex.tid, ex.iutil + ey.iutil, ey.rutil);
10        end
11        UtilityListOfPxy ← UtilityListOfPxy ∪ {exy};
12    end
13 end
14 return UtilityListOfPxy;
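Algorithms 1 to 3 can be put together in a short runnable sketch. It follows the pseudocode above but is not the authors' Java implementation: the function names `mine` and `construct`, the dictionary-based EUCS, the alphabetical tie-breaking of equal TWU values, the use of `None` for the utility-list of the empty itemset and the `stats` counters are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations

PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}


def construct(ul_p, ul_px, ul_py):
    """Algorithm 3: join the utility-lists of Px and Py (ul_p is None when P = ∅)."""
    py = {tid: (iutil, rutil) for tid, iutil, rutil in ul_py}
    p = {tid: iutil for tid, iutil, _ in ul_p} if ul_p is not None else {}
    joined = []
    for tid, iutil_x, _ in ul_px:
        if tid in py:
            iutil_y, rutil_y = py[tid]
            # The utility of P is counted in both Px and Py, so subtract it once.
            joined.append((tid, iutil_x + iutil_y - p.get(tid, 0), rutil_y))
    return joined


def mine(db, profit, min_util):
    """Algorithm 1: return ({itemset: utility}, statistics on joins and pruning)."""
    tu = {tid: sum(profit[i] * q for i, q in t.items()) for tid, t in db.items()}
    twu = defaultdict(int)
    for tid, t in db.items():  # first scan: TWU of single items
        for i in t:
            twu[i] += tu[tid]
    order = sorted((i for i in twu if twu[i] >= min_util), key=lambda i: (twu[i], i))
    rank = {i: r for r, i in enumerate(order)}
    ulists = {i: [] for i in order}
    eucs = defaultdict(int)
    for tid, t in db.items():  # second scan: utility-lists and EUCS
        items = sorted((i for i in t if i in rank), key=rank.get)
        remaining = sum(profit[i] * t[i] for i in items)
        for i in items:
            remaining -= profit[i] * t[i]
            ulists[i].append((tid, profit[i] * t[i], remaining))
        for x, y in combinations(items, 2):
            eucs[(x, y)] += tu[tid]
    found, stats = {}, {"joins": 0, "pruned": 0}

    def search(ul_p, extensions):
        """Algorithm 2: extensions is a list of (itemset tuple, utility-list)."""
        for k, (px, ul_px) in enumerate(extensions):
            iutil = sum(e[1] for e in ul_px)
            rutil = sum(e[2] for e in ul_px)
            if iutil >= min_util:  # Property 4
                found[frozenset(px)] = iutil
            if iutil + rutil >= min_util:  # Property 5
                next_extensions = []
                for py, ul_py in extensions[k + 1:]:
                    if eucs.get((px[-1], py[-1]), 0) < min_util:  # EUCP (line 8)
                        stats["pruned"] += 1
                        continue
                    stats["joins"] += 1
                    pxy = px + (py[-1],)
                    next_extensions.append((pxy, construct(ul_p, ul_px, ul_py)))
                search(ul_px, next_extensions)

    search(None, [((i,), ulists[i]) for i in order])
    return found, stats


found, stats = mine(DB, PROFIT, min_util=30)
for itemset, u in found.items():
    print(sorted(itemset), u)
print(stats)
```

On the running example with min_util = 30, the sketch returns the seven itemsets of Example 4. Every increment of the `pruned` counter corresponds to a join that EUCP avoided.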
V. Experimental study
We performed experiments to assess the performance of the proposed algorithm. Experiments were
performed on a computer with a 64-bit Core i5 processor running Windows 7 and 5 GB of free RAM. We
compared the performance of Miner with the state-of-the-art algorithm UP-Growth for high-utility itemset
mining. All memory measurements were done using the Java API. Experiments were carried out on a real-life dataset.
The dataset contains 1,112,949 transactions with 46,086 distinct items and an
average transaction length of 7.26 items. External utilities for items were generated between 1 and 1,000 using
a log-normal distribution, and quantities of items were generated randomly between 1 and 5, following the settings of [2,
7, 10].
Execution time: We first ran the Miner and UP-Growth algorithms on the dataset while decreasing the
min_util threshold until the algorithms took too long to execute, ran out of memory or a clear winner was
observed. We recorded the execution time, the percentage of candidates pruned by the Miner algorithm and the
total size of the EUCS. The comparison of execution time against the minimum utility threshold is shown in Fig. 1. Miner
was up to 6 times faster than UP-Growth.
Fig. 1. Performance of execution time against minimum utility threshold
Pruning effectiveness: These results show that candidate pruning can be very effective, pruning up to
95% of candidates. As expected, when more pruning was done, the performance gap between Miner and
UP-Growth became larger.
Fig. 2. Performance of candidate items against minimum utility threshold
Memory overhead: We also studied the memory overhead of using the EUCS structure. We found that
the memory footprint of the EUCS was 6 times smaller than that of UP-Growth. We therefore conclude that the cost of using
the EUCP strategy in terms of memory is low.
Fig. 3. Performance of memory complexity against threshold
VI. Conclusion
In this paper, we have presented a novel algorithm for high-utility itemset mining named Miner. This
algorithm integrates a novel strategy named EUCP (Estimated Utility Co-occurrence Pruning) to reduce the
number of join operations when mining high-utility itemsets using the utility-list data structure. We have
performed an extensive experimental study on real-life datasets to compare the performance of Miner with the
state-of-the-art algorithm UP-Growth. Results show that the pruning strategy reduces the search space by up to
95% and that Miner is up to six times faster than UP-Growth.
Acknowledgements
The authors wish to thank the Management, the Principal and Head of the Department (CSE) of ICET
for the support and help in completing the work.
References
[1]. R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proc. Int. Conf. Very Large
Databases, 1994, pp. 487–499.
[2]. V. S. Tseng, C.-W. Wu, B.-E. Shie, and P. S. Yu, “UP-Growth: An efficient algorithm for high utility itemset mining,” in Proc.
ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2010, pp. 253–262.
[3]. Y. Liu, W. Liao, and A. Choudhary, “A fast high utility itemsets mining algorithm,” in Proc. Utility-Based Data Mining
Workshop, 2005, pp. 90–99.
[4]. V. S. Tseng, B.-E. Shie, C.-W. Wu, and P. S. Yu, “Efficient algorithms for mining high utility itemsets from transactional
databases,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 8, Aug. 2013.
[5]. C. Lucchese, S. Orlando, and R. Perego, “Fast and memory efficient mining of frequent closed itemsets,” IEEE Trans. Knowl. Data
Eng., vol. 18, no. 1, pp. 21–36, Jan. 2006.
[6]. K. Chuang, J. Huang, and M. Chen, “Mining top-k frequent patterns in the presence of the memory constraint,” VLDB J., vol. 17,
pp. 1321–1344, 2008.
[7]. R. Chan, Q. Yang, and Y. Shen, “Mining high utility itemsets,” in Proc. IEEE Int. Conf. Data Min., 2003, pp. 19–26.
[8]. A. Erwin, R. P. Gopalan, and N. R. Achuthan, “Efficient mining of high utility itemsets from large datasets,” in Proc.
Pacific-Asia Conf. Knowl. Discovery Data Mining, 2008, pp. 554–561.
[9]. K. Gouda and M. J. Zaki, “Efficiently mining maximal frequent itemsets,” in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 163–170.
[10]. M. Liu and J. Qu, “Mining high utility itemsets without candidate generation,” in Proc. 21st ACM Int. Conf. Information and
Knowledge Management (CIKM '12), 2012, pp. 55–64.
[11]. C. W. Wu, P. Fournier-Viger, P. S. Yu, and V. S. Tseng, “Efficient mining of a concise and lossless representation of high utility
itemsets,” in Proc. IEEE Int. Conf. Data Mining, 2011, pp. 824–833.
[12]. C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, “Efficient tree structures for high utility pattern mining in incremental
databases,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 12, pp. 1708–1721, Dec. 2009.

PDF
Data mining : rule mining algorithms
PDF
Volume 2-issue-6-2081-2084
International Journal of Engineering Research and Development (IJERD)
Comparison Between High Utility Frequent Item sets Mining Techniques
Improved Map reduce Framework using High Utility Transactional Databases
A1030105
A Relative Study on Various Techniques for High Utility Itemset Mining from T...
A FLEXIBLE APPROACH TO MINE HIGH UTILITY ITEMSETS FROM TRANSACTIONAL DATABASE...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Transaction Profitability Using HURI Algorithm [TPHURI]
TRANSACTION PROFITABILITY USING HURI ALGORITHM [TPHURI]
Discovering High Utility Item Sets to Achieve Lossless Mining using Apriori A...
Dm unit ii r16
Apriori and Eclat algorithm in Association Rule Mining
Discovering Frequent Patterns with New Mining Procedure
Efficient Mining of Association Rules in Oscillatory-based Data
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
An improvised tree algorithm for association rule mining using transaction re...
Frequent itemset mining methods
Data mining : rule mining algorithms
Volume 2-issue-6-2081-2084

More from IOSR Journals (20)

PDF
A011140104
PDF
M0111397100
PDF
L011138596
PDF
K011138084
PDF
J011137479
PDF
I011136673
PDF
G011134454
PDF
H011135565
PDF
F011134043
PDF
E011133639
PDF
D011132635
PDF
C011131925
PDF
B011130918
PDF
A011130108
PDF
I011125160
PDF
H011124050
PDF
G011123539
PDF
F011123134
PDF
E011122530
PDF
D011121524
A011140104
M0111397100
L011138596
K011138084
J011137479
I011136673
G011134454
H011135565
F011134043
E011133639
D011132635
C011131925
B011130918
A011130108
I011125160
H011124050
G011123539
F011123134
E011122530
D011121524

Recently uploaded (20)

PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Heart disease approach using modified random forest and particle swarm optimi...
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
Group 1 Presentation -Planning and Decision Making .pptx
PDF
A comparative analysis of optical character recognition models for extracting...
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
Univ-Connecticut-ChatGPT-Presentaion.pdf
Diabetes mellitus diagnosis method based random forest with bat algorithm
Programs and apps: productivity, graphics, security and other tools
Heart disease approach using modified random forest and particle swarm optimi...
SOPHOS-XG Firewall Administrator PPT.pptx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
Per capita expenditure prediction using model stacking based on satellite ima...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Encapsulation_ Review paper, used for researhc scholars
Group 1 Presentation -Planning and Decision Making .pptx
A comparative analysis of optical character recognition models for extracting...
Advanced methodologies resolving dimensionality complications for autism neur...
Mobile App Security Testing_ A Comprehensive Guide.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Building Integrated photovoltaic BIPV_UPV.pdf
Accuracy of neural networks in brain wave diagnosis of schizophrenia

B017550814

However, an important limitation of frequency mining is that it assumes all items have the same importance (weight, unit profit or value). This assumption often does not hold in real applications. For example, consider a database of customer transactions containing the unit profit of each item and the purchased quantity of each item. Frequency mining algorithms discard this information and may thus discover many frequent itemsets that generate a low profit while failing to discover less frequent itemsets that generate a high profit.

Utility mining [5], [6], [7], [8] has emerged as an important topic in data mining. In utility mining, each item has a weight (e.g. unit profit) and can appear more than once in each transaction (e.g. purchase quantity). The utility of an itemset represents its importance, which can be measured in terms of weight, profit, cost, quantity or other information depending on the user preference. Utility is a measure of how useful or profitable an itemset X is. The utility of items in a transaction database consists of two aspects: (1) the importance of distinct items, which is called external utility, and (2) the importance of the items in the transaction, which is called internal utility. The utility of an item in a transaction is defined as its external utility multiplied by its internal utility. The utility of an itemset X, i.e. u(X), is the sum of the utilities of X in all the transactions containing X. An itemset X is called a high utility itemset if and only if u(X) ≥ min_util, where min_util is a user-defined minimum utility threshold.

However, mining high utility itemsets from databases is not an easy task, since the downward closure property used in frequent itemset mining does not hold. In other words, pruning the search space for high utility itemset mining is difficult because a superset of a low-utility itemset may be a high utility itemset. A naive method to address this problem is to enumerate all itemsets from the database by the principle of exhaustion.
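The failure of downward closure can be checked on a small case. The following is a minimal sketch, assuming the running-example database and unit profits that the paper gives later in Table 1; the names `DB`, `PROFIT` and `utility` are illustrative and do not come from the paper.

```python
# Running example (the paper's Table 1): purchase quantity of each item per transaction.
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
# External utility (unit profit) of each item.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def utility(itemset, db=DB, profit=PROFIT):
    """u(X): sum of u(X, Tc) over every transaction Tc that contains all items of X."""
    items = set(itemset)
    return sum(
        sum(profit[i] * t[i] for i in items)
        for t in db.values()
        if items.issubset(t)
    )


MIN_UTIL = 30
# {a} and {a, c} are low-utility, yet their superset {a, c, e} is high-utility.
print(utility("a"), utility("ac"), utility("ace"))  # prints: 20 28 31
```

With a threshold of 30, an algorithm that discarded {a} for being below the threshold would never reach {a, c, e}, which is why a separate upper bound is needed for pruning.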
Obviously, this method suffers from a very large search space, especially when the database contains many long transactions or a low minimum utility threshold is set. Hence, how to effectively prune the search space and efficiently capture all high utility itemsets without missing any is a crucial challenge in utility mining. To identify high utility itemsets, most existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms suffer from the problem that a very large number of candidates are generated, but most of them turn out not to be high utility itemsets once their exact utilities are computed.

In this paper, we propose an algorithm, called Miner, for high utility itemset mining. Miner uses a novel structure, called utility-list, to store both the utility information about an itemset and the heuristic information for pruning the search space of Miner. By avoiding the costly generation and utility computation of numerous candidate itemsets, Miner can efficiently mine high utility itemsets from the utility-lists constructed from a mined database. To reduce the number of costly joins
that are performed, we propose a novel pruning strategy named EUCP (Estimated Utility Co-occurrence Pruning) that can prune itemsets without having to perform joins. This strategy is easy to implement and very effective. We compare the performance of Miner and UP Growth on a real-life dataset. Results show that Miner performs up to 95% fewer join operations than UP Growth and is up to six times faster than UP Growth. Experimental results show that Miner outperforms this algorithm in terms of both running time and memory consumption.

II. Review of literature

Fast Algorithms for Mining Association Rules by R. Agrawal and R. Srikant in 1994 proposed the Apriori algorithm. Apriori is more efficient during the candidate generation process for two reasons: it employs a different candidate generation method and a new pruning technique. Apriori finds all the large itemsets in the database in two steps. First the candidate itemsets are generated, and then the database is scanned to check the actual support count of the corresponding itemsets. During the first scan of the database the support count of each item is calculated, and the large 1-itemsets are generated by pruning those itemsets whose supports are below the predefined threshold. In each pass only those candidate itemsets that include the same specified number of items are generated and checked. Advantages: (1) It uses the large itemset property. (2) It is easily parallelized. (3) It is easy to implement. (4) It does not need to generate conditional pattern bases. Disadvantages: (1) It requires multiple database scans. (2) It assumes the transaction database is memory resident. (3) It must generate candidate itemsets.

Mining Frequent Patterns without Candidate Generation by J. Han, J. Pei, and Y.
Yin in 2000 proposed a novel frequent pattern tree (FP-tree), which is an extended prefix tree structure for storing compressed, crucial information about frequent patterns, and developed an efficient FP-tree based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. The performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported frequent pattern mining methods. Here the item priority is not taken into consideration. Advantages: (1) It finds frequent itemsets without generating any candidate itemsets. (2) It scans the database just twice. Disadvantages: (1) It treats all items with the same importance/weight/price. (2) It consumes more memory and performs badly with long pattern data sets.

A Fast High Utility Itemsets Mining Algorithm by Y. Liu, W.-K. Liao and A. Choudhary in 2005 proposed the Two-Phase algorithm. Utility mining focuses on identifying the itemsets with high utilities. As the "downward closure property" does not apply to utility mining, the generation of candidate itemsets is the most costly step in terms of time and memory space.
In that paper, the Two-Phase algorithm is presented to efficiently prune the number of candidates while still obtaining the complete set of high utility itemsets. In the first phase, a model is proposed that applies the "transaction-weighted downward closure property" on the search space to expedite the identification of candidates. In the second phase, one extra database scan is performed to identify the high utility itemsets. It performs very efficiently in terms of speed and memory cost, and shows good scalability on multiple processors, even on large databases that are difficult for existing algorithms to handle. Advantages: (1) It performs very efficiently in terms of speed and memory cost. Disadvantages: (1) It generates too many candidates to obtain the high transaction-weighted utilization itemsets (HTWUIs) and requires multiple database scans.

UP-Growth: An Efficient Algorithm for High Utility Itemsets Mining by Vincent S. Tseng, Cheng-Wei Wu, Bai-En Shie, and Philip S. Yu in 2010 proposed an efficient algorithm, namely UP-Growth (Utility Pattern Growth), for mining high utility itemsets with a set of techniques for pruning candidate itemsets. The information of high utility itemsets is maintained in a special data structure named UP-Tree (Utility Pattern Tree) such that the candidate itemsets can be generated efficiently with only two scans of the database. The experimental results show that UP-Growth not only reduces the number of candidates effectively but also outperforms other algorithms substantially in terms of execution time, especially when the database contains many long transactions.

Mining High Utility Itemsets without Candidate Generation by Mengchi Liu and Junfeng Qu [10] proposed the algorithm HUI-Miner. High utility itemsets refer to the sets of items with high utility, such as profit, in a database. Efficient mining of high utility itemsets plays a crucial role in many real-life applications and is an important research issue in the data mining area.
To identify high utility itemsets, most existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms suffer from the problem that a very large number of candidates are generated, but most of them turn out not to be high utility itemsets once their exact utilities are computed. HUI-Miner
uses a novel structure, called utility-list, to store both the utility information about an itemset and the heuristic information for pruning the search space of HUI-Miner. By avoiding the costly generation and utility computation of numerous candidate itemsets, HUI-Miner can efficiently mine high utility itemsets from the utility-lists constructed from a mined database. Its authors compared HUI-Miner with the state-of-the-art algorithms on various databases, and the experimental results show that HUI-Miner outperforms these algorithms in terms of both running time and memory consumption.

III. Current work

We first introduce important preliminary definitions. A transaction database is a set of transactions D = {T1, T2, ..., Tn} such that each transaction Tc has a unique identifier c called its Tid. Each item i ∈ I, where I is the set of items, is associated with a positive number p(i), called its external utility or importance (e.g. unit profit). For each transaction Tc such that i ∈ Tc, a positive number q(i, Tc) is called the internal utility or importance of i (e.g. purchase quantity).

Example 1. Consider the database of Table 1 (left), which will be used as our running example. This database contains five transactions (T1, T2, ..., T5). Transaction T2 indicates that items a, c, e and g appear in this transaction with an internal utility of respectively 2, 6, 2 and 5. Table 1 (right) indicates that the external utilities of these items are respectively 5, 1, 3 and 1.

The utility of an item i in a transaction Tc is denoted as u(i, Tc) and defined as p(i) × q(i, Tc). The utility of an itemset X (a group of items X ⊆ I) in a transaction Tc is denoted as u(X, Tc) and defined as $u(X, T_c) = \sum_{i \in X} u(i, T_c)$.

Example 2. The utility of item a in T2 is u(a, T2) = 5 × 2 = 10.
The utility of the itemset {a, c} in T2 is u({a, c}, T2) = u(a, T2) + u(c, T2) = 5 × 2 + 1 × 6 = 16.

The utility of an itemset X is denoted as u(X) and defined as $u(X) = \sum_{T_c \in g(X)} u(X, T_c)$, where g(X) is the set of transactions containing X.

Example 3. The utility of the itemset {a, c} is u({a, c}) = u({a, c}, T1) + u({a, c}, T2) + u({a, c}, T3) = u(a, T1) + u(c, T1) + u(a, T2) + u(c, T2) + u(a, T3) + u(c, T3) = 5 + 1 + 10 + 6 + 5 + 1 = 28.

The purpose of high-utility itemset mining is to discover all high-utility itemsets. An itemset X is a high-utility itemset if its utility u(X) is no less than a user-specified minimum utility threshold min_util. Otherwise, X is a low-utility itemset.

Example 4. If min_util = 30, the high-utility itemsets in the database of our running example are {b,d}, {a,c,e}, {b,c,d}, {b,c,e}, {b,d,e}, {b,c,d,e} and {a,b,c,d,e,f} with respectively a utility of 30, 31, 34, 31, 36, 40 and 30.

The transaction utility (TU) of a transaction Tc is the sum of the utilities of the items of Tc in Tc, i.e. $TU(T_c) = \sum_{x \in T_c} u(x, T_c)$.

Example 5. Table 2 (left) shows the TU of transactions T1, T2, T3, T4 and T5 from our running example.

The transaction-weighted utility (TWU) of an itemset X is the sum of the transaction utilities of the transactions containing X, i.e. $TWU(X) = \sum_{T_c \in g(X)} TU(T_c)$.

Example 6. Table 2 (center) shows the TWU of the single items a, b, c, d, e, f, g. Consider item a: TWU({a}) = TU(T1) + TU(T2) + TU(T3) = 8 + 27 + 30 = 65.

The TWU measure has three important properties that are used to prune the search space.

Property 1 (overestimation). The TWU of an itemset X is higher than or equal to its utility, i.e. TWU(X) ≥ u(X) [8].

Property 2 (antimonotonicity). The TWU measure is anti-monotonic. Let X and Y be two itemsets. If X ⊂ Y, then TWU(X) ≥ TWU(Y) [8].

Property 3 (pruning). Let X be an itemset. If TWU(X) < min_util, then the itemset X is a low-utility itemset, as are all its supersets. Proof. This directly follows from Property 1 and Property 2.
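The definitions of TU and TWU, and the first two properties, can be verified numerically on the running example. This is a minimal sketch, assuming the database and profit values of Table 1; the helper names (`transaction_utility`, `twu`, `utility`) are illustrative only.

```python
from itertools import combinations

# Running example (the paper's Table 1): quantities per transaction and unit profits.
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def transaction_utility(t):
    """TU(Tc): sum of the utilities of all items of Tc."""
    return sum(PROFIT[i] * q for i, q in t.items())


def twu(itemset):
    """TWU(X): sum of TU(Tc) over the transactions containing X."""
    items = set(itemset)
    return sum(transaction_utility(t) for t in DB.values() if items.issubset(t))


def utility(itemset):
    """u(X): sum of u(X, Tc) over the transactions containing X."""
    items = set(itemset)
    return sum(
        sum(PROFIT[i] * t[i] for i in items)
        for t in DB.values()
        if items.issubset(t)
    )


TU = {tid: transaction_utility(t) for tid, t in DB.items()}
ITEMS = sorted(PROFIT)
ALL_ITEMSETS = [x for k in range(1, len(ITEMS) + 1) for x in combinations(ITEMS, k)]

# Property 1: TWU never underestimates the utility.
overestimates = all(twu(x) >= utility(x) for x in ALL_ITEMSETS)
# Property 2: adding an item to X can only lower (or keep) the TWU.
anti_monotonic = all(
    twu(x) >= twu(x + (i,)) for x in ALL_ITEMSETS for i in ITEMS if i not in x
)
print(TU, twu("a"), overestimates, anti_monotonic)
```

Because both checks hold for every itemset, any itemset whose TWU falls below min_util can be discarded together with all of its supersets, which is Property 3.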
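The utility-list of an itemset X in a database D is a set of tuples such that there is a tuple (tid, iutil, rutil) for each transaction Ttid containing X. The iutil element of a tuple is the utility of X in Ttid, i.e. u(X, Ttid). The rutil element of a tuple is the remaining utility, defined as $\sum_{i \in T_{tid} \wedge i \succ x\ \forall x \in X} u(i, T_{tid})$, i.e. the sum of the utilities of the items of Ttid that do not belong to X and follow every item of X in the total order ≻ on items.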
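The definition above can be sketched directly from the database. This is an illustrative sketch, not the authors' implementation: it assumes the running example of Table 1 and, purely for readability, an alphabetical total order on items (the algorithm itself uses ascending TWU order); `utility_list` and its `order` parameter are hypothetical names.

```python
# Running example (the paper's Table 1): quantities per transaction and unit profits.
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def utility_list(itemset, order="abcdefg"):
    """Return the (tid, iutil, rutil) tuples of `itemset` under the given item order."""
    rank = {item: k for k, item in enumerate(order)}
    items = set(itemset)
    last = max(rank[i] for i in items)  # position of the last item of X in the order
    tuples = []
    for tid, t in DB.items():
        if items.issubset(t):
            iutil = sum(PROFIT[i] * t[i] for i in items)
            # Remaining utility: items of the transaction that come after all of X.
            rutil = sum(PROFIT[i] * q for i, q in t.items() if rank[i] > last)
            tuples.append((tid, iutil, rutil))
    return tuples


print(utility_list("a"))   # [('T1', 5, 3), ('T2', 10, 17), ('T3', 5, 25)]
print(utility_list("ae"))  # [('T2', 16, 5), ('T3', 8, 5)]
```

Summing the iutil column gives the exact utility of the itemset, while iutil plus rutil bounds the utility of every extension, which is what makes the list sufficient for both reporting and pruning.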
Example 7. Taking the alphabetical order on items for ≻, the utility-list of {a} is {(T1, 5, 3), (T2, 10, 17), (T3, 5, 25)}. The utility-list of {e} is {(T2, 6, 5), (T3, 3, 5), (T4, 3, 0), (T5, 3, 2)}. The utility-list of {a, e} is {(T2, 16, 5), (T3, 8, 5)}.

To obtain the high-utility itemsets, Miner needs only one database scan, after the initial TWU scan, to create the utility-lists of patterns containing single items. Then, bigger patterns are obtained by performing the join operation on the utility-lists of smaller patterns. Pruning the search space is done using the two following properties.

Property 4 (sum of iutils). Let X be an itemset. If the sum of the iutil values in the utility-list of X is higher than or equal to min_util, then X is a high-utility itemset. Otherwise, it is a low-utility itemset [7].

Property 5 (sum of iutils and rutils). Let X be an itemset. Let the extensions of X be the itemsets that can be obtained by appending an item y to X such that y ≻ i for all items i in X. If the sum of the iutil and rutil values in the utility-list of X is less than min_util, all extensions of X and their transitive extensions are low-utility itemsets [7].

In the next section, we introduce our novel algorithm, which improves upon existing algorithms by being able to eliminate low-utility itemsets without performing join operations.

Table 1. A transaction database (left) and external utility values (right)
Table 2. Transaction utilities (left), TWU values (center) and EUCS (right)

IV. Algorithm

In this section, we present our proposal, the Miner algorithm. The main procedure (Algorithm 1) takes as input a transaction database with utility values and the min_util threshold. The algorithm first scans the database to calculate the TWU of each item. Then, the algorithm identifies the set I∗ of all items having a TWU no less than min_util. The TWU values are used to arrange these items in ascending order, which defines the total order ≻.
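The first scan described above can be sketched as follows. This is a hedged illustration under stated assumptions: the data are those of Table 1, ties in TWU are broken alphabetically (the paper does not say how ties are handled), and the function names `first_scan` and `revise` are hypothetical.

```python
# Running example (the paper's Table 1): quantities per transaction and unit profits.
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def first_scan(db, profit, min_util):
    """Compute the TWU of single items and the total order on the kept items I*."""
    twu = {}
    for t in db.values():
        tu = sum(profit[i] * q for i, q in t.items())  # transaction utility
        for i in t:
            twu[i] = twu.get(i, 0) + tu
    kept = [i for i in twu if twu[i] >= min_util]
    # Ascending TWU; alphabetical tie-break is an assumption made for determinism.
    order = sorted(kept, key=lambda i: (twu[i], i))
    return twu, order


def revise(db, order):
    """Drop items outside I* and sort each transaction by the total order."""
    rank = {item: k for k, item in enumerate(order)}
    return {
        tid: [(i, t[i]) for i in sorted((i for i in t if i in rank), key=rank.get)]
        for tid, t in db.items()
    }


twu, order = first_scan(DB, PROFIT, min_util=30)
print(order)  # ['f', 'g', 'd', 'b', 'a', 'e', 'c']
```

With min_util = 30 every item survives; raising the threshold to 40 would remove f and g, and the revised transactions would then contain only d, b, a, e and c in that order.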
During the second database scan, the items in transactions are reordered according to the total order ≻, the utility-list of each item i ∈ I∗ is built, and the novel structure named EUCS (Estimated Utility Co-occurrence Structure) is built. This structure is defined as a set of triples of the form (a, b, c) ∈ I∗ × I∗ × R. A triple (a, b, c) indicates that TWU({a, b}) = c. The EUCS can be implemented as a triangular matrix, as shown in Table 2 (right), where only the triples (a, b, c) such that c ≠ 0 are kept. Building the EUCS is very fast (it is performed with a single database scan) and it occupies a small amount of memory, bounded by |I∗| × |I∗|, although in practice the size is much smaller because only a limited number of pairs of items co-occur in transactions.

After the construction of the EUCS, the depth-first search exploration of itemsets starts by calling the recursive procedure Search with the empty itemset ∅, the set of single items I∗, min_util and the EUCS structure. The Search procedure (Algorithm 2) takes as input (1) an itemset P, (2) extensions of P having the form Pz, meaning that Pz was previously obtained by appending an item z to P, (3) min_util and (4) the EUCS. The Search procedure operates as follows. For each extension Px of P, if the sum of the iutil values of its utility-list is no less than min_util, then Px is a high-utility itemset and it is output. If the sum of the iutil and rutil values in the utility-list of Px is no less than min_util, then the extensions of Px should be explored. This is performed by merging Px with all extensions Py of P such that y ≻ x to form extensions of the form Pxy containing |Px| + 1 items. The utility-list of Pxy is then constructed by calling the Construct procedure (cf. Algorithm 3) to join the utility-lists of P, Px and Py.
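The steps described so far (two scans, EUCS construction, the recursive search with the co-occurrence check, and the utility-list join) can be put together in a compact sketch. It is an illustration under stated assumptions rather than the authors' code: the data are those of Table 1, ties in TWU are broken alphabetically, the EUCS is stored as a dictionary keyed by ordered item pairs, and all function names are hypothetical.

```python
from collections import defaultdict

DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def construct(p, px, py):
    """Join the utility-lists of Px and Py; p is the list of the prefix P (None if P is empty)."""
    prefix_iutil = {tid: iu for tid, iu, _ in p} if p else {}
    py_by_tid = {e[0]: e for e in py}
    joined = []
    for tid, iu, _ in px:
        ey = py_by_tid.get(tid)
        if ey is not None:  # both Px and Py occur in this transaction
            joined.append((tid, iu + ey[1] - prefix_iutil.get(tid, 0), ey[2]))
    return joined


def search(p, extensions, min_util, eucs, out):
    """Depth-first exploration; extensions is a list of (itemset tuple, utility-list)."""
    for k, (px, ulx) in enumerate(extensions):
        iutil = sum(e[1] for e in ulx)
        if iutil >= min_util:  # Property 4: Px is a high-utility itemset
            out[frozenset(px)] = iutil
        if iutil + sum(e[2] for e in ulx) >= min_util:  # Property 5
            new_ext = []
            for py, uly in extensions[k + 1:]:
                # EUCP: join only if TWU({x, y}) reaches the threshold.
                if eucs.get((px[-1], py[-1]), 0) >= min_util:
                    new_ext.append((px + (py[-1],), construct(p, ulx, uly)))
            search(ulx, new_ext, min_util, eucs, out)


def mine(db, profit, min_util):
    tu = {tid: sum(profit[i] * q for i, q in t.items()) for tid, t in db.items()}
    twu = defaultdict(int)
    for tid, t in db.items():  # scan 1: TWU of single items
        for i in t:
            twu[i] += tu[tid]
    order = sorted((i for i in twu if twu[i] >= min_util), key=lambda i: (twu[i], i))
    rank = {i: k for k, i in enumerate(order)}
    ulists = {i: [] for i in order}
    eucs = defaultdict(int)
    for tid, t in db.items():  # scan 2: utility-lists and EUCS
        items = sorted((i for i in t if i in rank), key=rank.get)
        remaining = sum(profit[i] * t[i] for i in items)
        for k, i in enumerate(items):
            remaining -= profit[i] * t[i]
            ulists[i].append((tid, profit[i] * t[i], remaining))
            for j in items[k + 1:]:
                eucs[(i, j)] += tu[tid]
    out = {}
    search(None, [((i,), ulists[i]) for i in order], min_util, eucs, out)
    return out, eucs


high_utility, eucs = mine(DB, PROFIT, min_util=30)
for itemset, u in sorted(high_utility.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(itemset)), u)
```

Keying the EUCS by the ordered pair (x, y) keeps the lookup in the inner loop to a single dictionary access, and pairs that never co-occur are simply absent, which mirrors the sparse triangular matrix described in the text.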
Table 1 (left), transaction database:

TID   Transaction
T1    (a,1)(c,1)(d,1)
T2    (a,2)(c,6)(e,2)(g,5)
T3    (a,1)(b,2)(c,1)(d,6)(e,1)(f,5)
T4    (b,4)(c,3)(d,3)(e,1)
T5    (b,2)(c,2)(e,1)(g,2)

Table 1 (right), external utility values:

Item     a   b   c   d   e   f   g
Profit   5   2   1   2   3   1   1

Table 2 (left), transaction utilities:

TID   TU
T1    8
T2    27
T3    30
T4    20
T5    11

Table 2 (center), TWU values:

Item   a    b    c    d    e    f    g
TWU    65   61   96   58   88   30   38

Table 2 (right), EUCS:

Item   a    b    c    d    e    f
b      30
c      65   61
d      38   50   58
e      57   61   88   50
f      30   30   30   30   30
g      27   11   38   0    38   0

Then, a recursive call to the Search procedure with Pxy is
done to calculate its utility and explore its extension(s). The Search procedure starts from single items, recursively explores the search space by appending single items, and prunes the search space only on the basis of Property 5. It can easily be seen from Properties 4 and 5 that this procedure is correct and complete for discovering all high-utility itemsets.

The Construct procedure works on the revised transactions, in which the items are arranged in ascending order of TWU. X/T is the set of all the items in T after X, and ru(X, T), the remaining utility of itemset X in T, is the sum of the utilities of all the items in X/T. No database scan is needed here. First, the utility-lists of two single items are combined on their common transactions to form the utility-list of a 2-itemset. In general, the utility-list of a k-itemset is constructed by joining the utility-lists of two (k−1)-itemsets that share the same prefix.

Co-occurrence based pruning. The main novelty in Miner is a pruning mechanism named EUCP (Estimated Utility Co-occurrence Pruning), which uses the EUCS. EUCP avoids constructing the utility-list of a low-utility extension Pxy and of all its transitive extensions. This is done on line 8 of the Search procedure. The pruning condition is that Pxy and its supersets are not explored if there is no tuple (x, y, c) in the EUCS such that c ≥ min_util. This strategy is correct (it only prunes low-utility itemsets). The proof is that, by Property 3, if an itemset X contains another itemset Y such that TWU(Y) < min_util, then X and its supersets are low-utility itemsets. Checking only the pair (x, y) is sufficient because the Search procedure is recursive, so all other pairs of items in Pxy have already been checked in the previous recursions of the Search procedure leading to Pxy. For example, consider an itemset Z = {a1, a2, a3, a4}.
To generate this itemset, the Search procedure had to combine {a1, a2, a3} and {a1, a2, a4}, which were obtained by combining {a1, a2} with {a1, a3} and {a1, a2} with {a1, a4}, which in turn were obtained by combining single items. It can easily be observed that when generating Z, all pairs of items in Z have already been checked by EUCP except {a3, a4}.

Algorithm 1: Miner Algorithm
input: D: a transaction database, min_util: a user-specified threshold
output: the set of high-utility itemsets
1 Scan D to calculate the TWU of single items;
2 I∗ ← each item i such that TWU(i) ≥ min_util;
3 Let ≻ be the total order of TWU ascending values on I∗;
4 Scan D to build the utility-list of each item i ∈ I∗ and build the EUCS structure;
5 Search (∅, I∗, min_util, EUCS);

Algorithm 2: Search Algorithm
input: P: an itemset, ExtensionsOfP: a set of extensions of P, the min_util threshold, the EUCS structure
output: the set of high-utility itemsets
1 foreach itemset Px ∈ ExtensionsOfP do
2   if SUM (Px.utilitylist.iutils) ≥ min_util then
3     output Px;
4   end
5   if SUM (Px.utilitylist.iutils) + SUM (Px.utilitylist.rutils) ≥ min_util then
6     ExtensionsOfPx ← ∅;
7     foreach itemset Py ∈ ExtensionsOfP such that y ≻ x do
8       if ∃(x, y, c) ∈ EUCS such that c ≥ min_util then
9         Pxy ← Px ∪ Py;
10        Pxy.utilitylist ← Construct (P, Px, Py);
11        ExtensionsOfPx ← ExtensionsOfPx ∪ Pxy;
12      end
13    end
14    Search (Px, ExtensionsOfPx, min_util, EUCS);
15  end
16 end

Algorithm 3: Construct Algorithm
input: P: an itemset, Px: the extension of P with an item x, Py: the extension of P with an item y
output: the utility-list of Pxy
1 UtilityListOfPxy ← ∅;
2 foreach tuple ex ∈ Px.utilitylist do
3   if ∃ey ∈ Py.utilitylist and ex.tid = ey.tid then
4     if P.utilitylist ≠ ∅ then
5       Search element e ∈ P.utilitylist such that e.tid = ex.tid;
6       exy ← (ex.tid, ex.iutil + ey.iutil − e.iutil, ey.rutil);
7     end
8     else
9       exy ← (ex.tid, ex.iutil + ey.iutil, ey.rutil);
10    end
11    UtilityListOfPxy ← UtilityListOfPxy ∪ {exy};
12  end
13 end
14 return UtilityListOfPxy;

V. Experimental study

We performed experiments to assess the performance of the proposed algorithm. Experiments were performed on a computer with a 64-bit Core i5 processor running Windows 7 and 5 GB of free RAM. We compared the performance of Miner with the state-of-the-art algorithm UP Growth for high-utility itemset mining. All memory measurements were done using the Java API. Experiments were carried out on a real-life dataset. The dataset contains 1,112,949 transactions with 46,086 distinct items and an average transaction length of 7.26 items. External utilities for items are generated between 1 and 1,000 by using a log-normal distribution, and quantities of items are generated randomly between 1 and 5, following the settings of [2, 7, 10].

Execution time: We first ran the Miner and UP Growth algorithms on the dataset while decreasing the min_util threshold until the algorithms took too long to execute, ran out of memory or a clear winner was observed. We recorded the execution time, the percentage of candidates pruned by the Miner algorithm and the total size of the EUCS. The comparison of execution time against the minimum utility threshold is shown in Fig. 1. Miner was up to six times faster than UP Growth.

Fig. 1. Execution time against minimum utility threshold

Pruning effectiveness: These results show that candidate pruning can be very effective, pruning up to 95% of candidates. As expected, when more pruning was done, the performance gap between Miner and UP Growth became larger.
Fig. 2. Candidate itemsets against minimum utility threshold
Memory overhead: We also studied the memory overhead of using the EUCS structure. We found that the memory footprint of the EUCS was six times smaller than that of UP Growth. We therefore conclude that the cost of using the EUCP strategy in terms of memory is low.

Fig. 3. Memory usage against minimum utility threshold

VI. Conclusion

In this paper, we have presented a novel algorithm for high-utility itemset mining named Miner. This algorithm integrates a novel strategy named EUCP (Estimated Utility Co-occurrence Pruning) to reduce the number of join operations when mining high-utility itemsets using the utility-list data structure. We have performed an experimental study on a real-life dataset to compare the performance of Miner with the state-of-the-art algorithm UP Growth. Results show that the pruning strategy reduces the search space by up to 95% and that Miner is up to six times faster than UP Growth.

Acknowledgements

The authors wish to thank the Management, the Principal and the Head of the Department (CSE) of ICET for their support and help in completing the work.

References

[1]. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. Int. Conf. Very Large Databases, 1994, pp. 487–499.
[2]. V. S. Tseng, C.-W. Wu, B.-E. Shie, and P. S. Yu, "UP-Growth: An efficient algorithm for high utility itemset mining," in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2010, pp. 253–262.
[3]. Y. Liu, W. Liao, and A. Choudhary, "A fast high utility itemsets mining algorithm," in Proc. Utility-Based Data Mining Workshop, 2005, pp. 90–99.
[4]. V. S. Tseng, B.-E. Shie, C.-W. Wu, and P. S. Yu, "Efficient algorithms for mining high utility itemsets from transactional databases," IEEE Trans. Knowl. Data Eng., vol. 25, no. 8, Aug. 2013.
[5]. C. Lucchese, S. Orlando, and R. Perego, "Fast and memory efficient mining of frequent closed itemsets," IEEE Trans. Knowl. Data Eng., vol. 18, no. 1, pp. 21–36, Jan. 2006.
[6]. K. Chuang, J. Huang, and M. Chen, "Mining top-k frequent patterns in the presence of the memory constraint," VLDB J., vol. 17, pp. 1321–1344, 2008.
[7]. R. Chan, Q. Yang, and Y. Shen, "Mining high utility itemsets," in Proc. IEEE Int. Conf. Data Min., 2003, pp. 19–26.
[8]. Erwin, R. P. Gopalan, and N. R. Achuthan, "Efficient mining of high utility itemsets from large datasets," in Proc. Int. Conf. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2008, pp. 554–561.
[9]. K. Gouda and M. J. Zaki, "Efficiently mining maximal frequent itemsets," in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 163–170.
[10]. M. Liu and J. Qu, "Mining high utility itemsets without candidate generation," in Proc. 21st ACM Int. Conf. Information and Knowledge Management (CIKM '12), pp. 55–64.
[11]. C. W. Wu, P. Fournier-Viger, P. S. Yu, and V. S. Tseng, "Efficient mining of a concise and lossless representation of high utility itemsets," in Proc. IEEE Int. Conf. Data Mining, 2011, pp. 824–833.
[12]. C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, "Efficient tree structures for high utility pattern mining in incremental databases," IEEE Trans. Knowl. Data Eng., vol. 21, no. 12, pp. 1708–1721, Dec. 2009.