Mining High Utility Patterns in Large Databases using Mapreduce Framework

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 06 | June -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 2021
Mining High Utility Patterns in Large Databases using MapReduce
Framework
1Ms. Priti Haribhau Deshmukh ,2Assistant Prof. A. S. More
1Computer Engineering Department , Rajarshi Shahu School of Engineering and Research Narhe,
Pune, 411041, India
2Computer Engineering Department Rajarshi Shahu School of Engineering and Research, Narhe,
Pune, 411041, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Recently considering users interest is more
important and utility mining does not consider this
interestingness measure. Utility mining is a new promising
technology of data mining. High utility pattern growth
approach is a look ahead strategy, and a lineardatastructure.
Here linear data structure allows computing atightboundfor
powerful pruning search space and to straightforwardly
identify high utility patterns in an efficient and scalable way.
In this it targets the root cause with prior algorithms.
Recently, in data mining area high utility pattern mining is
one of the most fundamental research issues since itsability to
consider the no binary frequency values of items in
transactions and different profit standards for every item.
When a database is updated, or when the minimum threshold
is changed, incremental and interactive data mining provides
the ability of reducing unnecessary calculations using
previous data structures and mining outcome of. In this
analytical study, novel tree structures are proposed for
efficient presentation of incremental and interactive HUP
mining.
Keyword —data mining, utility mining,highutilitypatterns
, d2HUP, pattern mining.
1. INTRODUCTION
Frequent pattern mining is provided with the explanation
that, candidate set generation and test prototype of prior. It
is having many drawbacks like it requires multiple database
scans and generates many candidate item sets. This query is
solved by growth approach. For this introductionofa prefix-
tree-based algorithm is provided that is without candidate
set generation and testing. Given that frequent pattern
mining plays an important role in data mining applications,
its two limitations are found in this. First, it treats all items
with the same importance/weight/price. Second one is, in
one transaction, each item appears in a binary form, i.e.,
either it is present or absent. Since, in the real world, each
item in the supermarket has a different importance/price
and one customer can buy multiple copies of an item. Items
that have high and low selling frequencies mayhavelowand
high profit values, respectively. Take example as, some
frequently sold items like biscuits, milk, andpencil mayhave
lower profit values compared to the infrequentlysoldhigher
profit value items such as gold ring and gold necklace. So, in
finding only the traditional frequent patterns in a database
cannot satisfy the requirement of finding the most valuable
item sets/customers that contribute to the major part of the
total profits in a retail business. This gives the motivation to
develop a mining model to discover the itemsets/customers
contributing to the majority of the profit. Utility mining
model was defined for discovering more important
knowledge from a database. Here the importance of an item
set by the concept of utility is measured. The datasetwith no
binary frequency standards of each item intransactions, and
also with different profit standards of each item is handled.
Therefore, utility mining represents real world marketdata.
Other application areas, suchasstock tickers,network traffic
measurements, web server logs, data feeds from sensor
networks, and telecom call records can have similar
solutions. It is not suitable for large databases e.g. big data.
Earlier studies deals with only small datasets, here big data
is considered where data is having properties like variety ,
variability, veracity, volume, velocity, etc. Utility mining is
done for big data which improves the efficiency.
2. REVIEW OF LITERATURE
As of now a great amount of study has been done on high
utility pattern mining. In this section, we review the prior
works on pattern mining and utility mining. In [1] Luc De
Raedt implements the data mining using the constrained
programming for item set mining. But this concept arise the
problem of Specialization andoptimization.Analgorithm for
finding patterns was presented by Ramesh C. Agarwal in [2]
that finds long patterns using depth first search. This
requires post-processing of data and time required for that
depends on number of item sets found by tree generator.

In [3] Chowdhury Farhan Ahmed proposes three novel tree
structures to efficiently perform incremental andinteractive
HUP mining. But this is not scalable for handling a large
number of distinct items and transactions.FrancescoBonchi
in [4] proposes a preprocessing approach overcomes the
suspected incompatibility between anti-monotonicity and
monotonicity, instead using the two components’synergyto
reduce data and search-space size.In [5] Claudio Lucchese
finds frequent patterns by using the concept of review and
extends the state-of-the-art of the constraints that can be
pushed in a frequent pattern computation.Butthisapproach
needs to build a system with incremental mining. Francesco
Bonchi integrates the recently proposed ExAnte data
reduction technique within the FP-growth algorithm in [6].
C.H. Cai introduced Mining Association Rules with Weighted
Items in [7]. This approach introduces the notion of
weighted items to represent the importance of individual
items. But this approach considers mining of binary
association rules. Roberto J. Bayardo proposed mining the
most interesting rules in [8], to explain that the best rule
according to any of these metrics must exist in along a
support border. This approach suffers from deficiency of
most optimized rule miners.
In [9] Rakesh Agrawal proposed fast algorithms for mining
association rules. This approach present two, Apriority and
Apriority Tied fordiscoveringall significantassociationrules
between items in a large database of transactions. It didn’t
consider the quantities of items brought in a transaction.
Rakesh Agrawal proposed mining associationrulesbetween
sets of items in large databases in [10]. This approach
presented an efficient algorithmthatgeneratesall significant
association rules between items in the database. But in this
approach some parameters are redundant and are not
generated in the calculations.
3. SYSTEM OVERVIEW
We can overcome the disadvantage of the existing
method. In the existing system a single dataset is used for
utility mining. But this is not suitable for large databases. So
our proposed system works on big data using the parallel
and distributed algorithms. Through which data can be
partitioned and computed across many of hosts, and the
execute application computations in parallel close to their
data. Partitioning is made by use of Map Reduce framework
of Hadoop. Each partition of data is individually minedusing
D2HUP algorithm and finally all results are collected
together and set of high utility patterns are obtained.
A. Architecture
Figure 1. Architecture of proposed system
B. Module Description
1. Map Reduce module
Providing the dataset as the input of system, the system
should partition data using Map-Reduce programming
model. Mapper will partition data. For each partitioned
data, high utility patterns are discovered using d2HUP
algorithm.
2. Tree node edges finding Module
This module is used to generate a reverse set
enumeration tree based on the product ID. This module
will help us to generate the pattern of products.
3. Pattern Mining Module
This algorithm provides the whole implementation of
Direct Discovery of High Utility Pattern. The algorithm
calculates the result of each unit that was partitioned
before individually.
4. Result Analysis
The result of above module is then passed to thismodule.
This module takes the resulted utilization of each
partition and compares it with theuserdefinedthreshold
which will give us the final output and shows usthelistof
combination that has high utilization.
C. Mathematical Model
Let S be the High Utility Pattern system
S={ D, XUT, minU, TS, X, Ω, DFS, u, s, W, uBitem,
uBfpe}
Where,
D-database
XUT- External Utility table
minU- minimum Threshold

TS- Transaction set
X- pattern
Ω- imposed ordering
DFS- Depth First Search
u- utilization value
s- support value
W-relevant items
uBitem( i, X) - For a pattern X and an item (i U X),
sum of utility of the full prefix extension of pattern
with respect to every transaction in TS({i}U X)
uBfpe - Sum of utility of the full prefix extension of
pattern with respect to every transaction in TS(X)
4. ALGORITHM
The proposed system uses following algorithms for finding
the high utility pattern mining.
Algorithm 1: High Utility Pattern Algorithm
Input: minimum threshold and transaction dataset
Output: High Utility Patterns
1. Mapper<T, t1, t2…tn>
T= {t1, t2 …tn}
2. Reducer<t1, t2,...tn}
If (T not Null) then
For each T: t1, t2….tn then
3. Item_info = tn (price, name_of_item)
4. HUP = DFS (tn, threshold, Item_info)
5. End for
6. End If
7. Else insert data first
8. Return HUP
Algorithm 2: Depth First Search Algorithm
DFS (N, TS(pat(N)), minU, Ω)
1. If u(pat(N)) ≥ minU then output pat(N)
2. W ← {i|i < pat(N)∧uBitem(i, pat(N))≥minU}
3. If Closure(pat(N),W, minU) is satisfied
4. then output nonempty subsets of W∪pat(N)
5. Else if Singleton(pat(N),W, minU) is satisfied
6. then output W ∪ pat(N) as an HUP
7. Else for each item i belongs to W in Ω do
8. If uBfpe({i} ∪ pat(N)) ≥ minU
9. then C ← the child node of N for i
10. TS(pat(C)) ← Project(TS(pat(N)), i)
11. DFS(C, TS(pat(C)), minU, Ω)
12. End for each
Algorithm 3: PseudoProject
PseudoProject (TScaul(pat(P)), i)
1. For each relevant item j < i do
2. (s[j], u[j],uBitem[j],uBfpe[j], link[j]) ← 0
3. End for each
4. For each utility list t threaded by link[i] do
5. u(pat(N), t) ← u(pat(P), t) + u(i, t)
6. Σ ← u(pat(N), t)
7. For each relevant item j belongs to t ∧ j < i by Ω do
8. s[j] ← s[j] + 1
9. u[j] ← u[j] + u(j, t) + u(pat(N), t)
10. Σ ← Σ + u(j, t)
11. uBfpe[j] ← uBfpe[j]
+ Σ
12. End for each
13. For each relevant
item j belongs to t ∧ j < i by Ω
do
14. uBitem[j] ← uBitem[j] + Σ
15. thread t into the chain by link[j]
16. End for each
17. End for each
The d2HUP algorithms traverse the reverse set enumerated
tree using depth first search technique. The algorithm uses
closure and singleton method which checks the utilization
value of each node in tree. Using Pseudo projection

algorithm it calculates utility value for each node in tree.
With a parallel, distributed algorithm large data sets are
processed. i.e a complex data structure which avoids costly
repeated database scan. Divide and conquer methodscan be
used to decompose mining tasks into set of smaller tasks for
mining.
5. SOFTWARE REQUIREMENT SPECIFICATION
Hardware Requirements
For this we required Pentium IV or latest system with hard
disk of 40GB and 4GB RAM.
Software Requirements
Windows 8 or latest operating system is required for
designed system. Java is the coding language is used. Back
end is MS SQL Server. Eclipse IDE for Java is used as
development environment. Hadoop 2.7.1 is also used.
6. PERFORMANCE ANALYSIS
The proposed system finds high utility patterns for large
dataset. In this analytical study the items from foodmart
dataset are calculated which are having more utility than
threshold utility given by user. The final result provided by
system is the patterns with high utilizationaccordingtouser
threshold. The proposed system is designed in such a way
that it can provide facility of selecting dataset to users asper
their requirement.
Performance is analyzed on the basis of execution time
required. When we use existing algorithms to find high
utility patterns from large datasets it takes long time. As
compared to existing system this proposed system requires
less time for execution.
In order to make efficient system we used Hadoop
technology which partition the data using Map Reduce
programming module and perform function on each
partition in parallel manner which save system time. The
proposed system will find result more accurately and with
efficiently as compared to other system.
Chart.1 Comparison of execution time required by
approach
7. CONCLUSION
The designed system implements d2HUP algorithm for
utility mining with the item set share framework, which
finds high utility patterns in big data without candidate
generation.
Map Reduce framework of Hadoop is usedforhandlinglarge
databases. This approach is used to enhance the significance
by the look ahead strategy that identifies high utility
patterns.
ACKNOWLEDGEMENT
I express true sense of gratitude towards my project guide
Prof. A. S. More of computer department for her invaluable
co-operation and guidance that she gave me throughout my
research, for inspiring me and providing me all the lab
facilities, which made this research work very convenient
and easy. I would also like to express my appreciation and
thanks to our HOD Prof. R. H. Kulkarni and Director Dr A. B.
Auti and all my friends who knowinglyorunknowinglyhave
assisted me throughout my hard work.
REFERENCES
[1] L. De Raedt, T. Guns, and S. Nijssen, “Constraint
programming for item set mining,” in SIGKDD, pp. 204–212,
2008.
[2] R. Agarwal, C. Aggarwal, and V. Prasad, “Depth first
generation of long patterns,” in SIGKDD, pp. 108–118, 2000.
[3] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee,
“Efficient tree structures for high utility pattern mining in

incremental databases,” IEEE TKDE, vol. 21, no. 12, pp.
1708– 1721, 2009.
[4] F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi,
“Exante: A preprocessing method for frequent-pattern
mining,” IEEE Intelligent Systems, vol. 20, no. 3, pp. 25–31,
2005.
[5] F. Bonchi and C. Lucchese, “Extendingthestate-of-the-art
of constraint-based pattern discovery,” Data andKnowledge
Engineering, vol. 60, no. 2, pp. 377–399, 2007.
[6] F. Bonchi and B. Goethals, “Fp-bonsai: The art of growing
and pruning small fp-trees,” in PAKDD, pp. 155–160, 2004.
[7] C. H. Cai, A. W. C. Fu, C. H. Cheng, and W. W. Kwong,
“Mining association rules with weighted items,” in
International Database Engineering and Applications
Symposium. IEEE, 68-77, 1998.
[8] R. Bayardo and R. Agrawal, “Mining the most interesting
rules,” in SIGKDD. ACM, pp. 145–154, 1999.
[9] R. Agrawal and R. Srikant, “Fast algorithms for mining
association rules,” in VLDB, pp. 487–499, 1994.
[10] R. Agrawal, T. Imielinski, and A. Swami, “Mining
association rules between sets of items in large databases,”
in SIGMOD. ACM, pp. 207–216, 1993.
[11] S. Dawar and V. Goyal, “Up-hist tree: An efficient data
structure for mining high utility patterns from transaction
databases,” in IDEAS. ACM, pp. 56–61, 2015.
[12] T. De Bie, “Maximum entropy models and subjective
interestingness: an application to tiles in binary databases,”
Data Mining and Knowledge Discovery, vol. 23, no. 3, pp.
407–446,2011.
[13] G.-C. Lan, T.-P. Hong, and V. S. Tseng, “An efficient
projectionbased indexing approach for mining high utility
itemsets,” KAIS, vol. 38, no. 1, pp. 85–107, 2014.
[14] Y.-C. Li, J.-S. Yeh, and C.-C. Chang, “Isolated items
discarding strategy for discovering high utility itemsets,”
Data & Knowledge Engineering, vol. 64, no. 1, pp. 198–217,
2008.
[15] A. Erwin, R. P. Gopalan, and N. R. Achuthan, “Efficient
mining of high utility itemsets from large datasets,” in
PAKDD,
pp. 554–561, 2008.
[16] P. Fournier-Viger, C.-W. Wu, S. Zida, and V. S. Tseng,
“Fhm: Faster high-utility itemset mining using estimated
utility cooccurrence pruning,” in ISMIS. Springer, pp. 83–92,
2014.
[17] L. Geng and H. J. Hamilton, “Interestingness measures
for data mining: A survey,” ACM Computing Surveys, vol. 38,
no. 3, p. 9, 2006.
[18] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns
without candidate generation,” in SIGMOD. ACM, pp. 1–12,
2000.
[19] R. J. Hilderman, C. L. Carter, H. J. Hamilton, and N.
Cercone, “Mining market basket data using share measures
and characterized itemsets,” in PAKDD, pp. 72–86, 1998.
[20] R. J. Hilderman and H. J. Hamilton, “Measuring the
interestingness of discovered knowledge: A principled
approach,” Intelligent Data Analysis, vol. 7, no. 4, pp. 347–
382, 2003.

Mining High Utility Patterns in Large Databases using Mapreduce Framework

Recommended

More Related Content

What's hot (19)

Similar to Mining High Utility Patterns in Large Databases using Mapreduce Framework (20)

More from IRJET Journal (20)

Recently uploaded (20)

Mining High Utility Patterns in Large Databases using Mapreduce Framework