G. Vijay Kumar, M. Sreedevi & NVS Pavan Kumar
International Journal of Computer Science and Security (IJCSS), Volume (6) : Issue (2) : 2012 142
Mining Regular Patterns in Data Streams Using Vertical Format
G. Vijay Kumar gvijay_73@yahoo.co.in
School of Computing
K L University
Guntur – 522 502, India
M. Sreedevi msreedevi_27@yahoo.co.in
School of Computing
K L University
Guntur – 522 502, India
NVS Pavan Kumar nvspavankumar@gmail.com
School of Computing
K L University
Guntur – 522 502, India
Abstract
The increasing prominence of data streams has led to the study of online mining in order
to capture interesting trends, patterns and exceptions. Recently, temporal regularity in the occurrence
behavior of a pattern has emerged as an area of interest in several online applications such as network
traffic analysis, sensor networks, e-business and stock market analysis. A pattern is said to be regular
in a data stream if its regularity (the largest interval between its consecutive occurrences) is not more
than the user-given regularity threshold. Although some efforts have been made to find regular
patterns over stream data, no method has yet been developed that uses the vertical data format.
Therefore, in this paper we develop a new method called the VDSRP-method to generate the complete
set of regular patterns over a data stream at a user-given regularity threshold. Our experimental
results show that the method is efficient in terms of execution time and memory consumption.
Keywords: Data Streams, Temporal Regularity, Regular Pattern, Vertical Data.
1. INTRODUCTION
Unlike mining static databases, data stream mining [1, 2, 3] creates many new challenges. It is
unrealistic to keep the entire stream in main memory or even on a secondary storage device
because a data stream is a continuous, massive (e.g., terabytes in volume), unbounded, time-ordered
series of data elements generated at a rapid rate. Huge volumes of stream data are
often generated by communication networks, Internet traffic, real-time surveillance systems, online
transactions in the financial market, remote sensors, scientific and engineering experiments and
other dynamic environments. Discovering knowledge in data streams is an important research
area in data mining and the knowledge discovery process.
Mining frequent patterns [4, 5] has been broadly studied, both in static databases and in stream data
mining [1, 2, 3]. The Apriori algorithm [5] is a classical algorithm proposed by R. Agrawal and R.
Srikant in 1994 for mining frequent itemsets for Boolean association rules. The algorithm uses
prior knowledge and employs an iterative approach known as a level-wise search to generate
frequent itemsets. It first finds the frequent 1-itemsets, uses them to generate the frequent 2-itemsets,
then the frequent 3-itemsets, and continues until all the frequent itemsets are generated. Later, Han et al.
[4] proposed the frequent pattern tree (FP-tree) and the FP-growth algorithm to mine frequent
patterns without candidate generation. Both the Apriori and FP-growth algorithms find the
occurrence frequency of a pattern, i.e., its support. Several algorithms have been proposed so far
to mine frequent patterns in transaction databases as well as in data streams. However, the
significance of a pattern may not always depend upon its occurrence frequency (i.e.,
support). The significance of a pattern may also depend upon other occurrence characteristics
such as its temporal regularity. For example, to improve web site design, a web site
administrator may be interested in regularly visited web page sequences rather than in web pages
that are heavily hit only for a specific period of time. Similarly, in a retail market some products may have
more regular demand than others, and knowing how regularly a product sells can be more useful
than knowing its occurrence frequency. Therefore, finding patterns that occur at regular intervals
also plays an important role in data mining.
Recently, Tanbeer et al. [6] introduced the problem of discovering regular patterns, i.e., patterns that
follow a temporal regularity in their occurrence behavior. A pattern that occurs in a database at
intervals no longer than a user-given maximum regularity measure is called a regular pattern. They also
extended the same problem to data streams, proposing a tree-based data structure called the RPS-tree [7]
that captures the user-given regularity threshold and mines regular patterns in a data stream with the
help of the FP-growth algorithm, conditional pattern bases and the corresponding conditional trees.
In this paper, we propose a new method called the Vertical Data Stream Regular Patterns
method (VDSRP-method for short), which uses the same data stream as in [7] to mine regular
patterns using the vertical data format. By using the vertical data format [8, 9, 10, 11], the method is able to
identify non-regular itemsets before generating candidate itemsets from them. The main idea of our
method is to develop a simple yet powerful procedure that captures the data stream content in a
window by using the sliding-window technique to find regular items. The experimental results show
the effectiveness of the VDSRP-method in finding regular patterns in a data stream.
The rest of the paper is organized as follows. Section 2 summarizes the existing tree structure used to
mine regular patterns. Section 3 introduces the problem definition of regular pattern mining. The
VDSRP-method for finding regular patterns using the vertical data format is given in Section 4. Section
5 presents our experimental results. Finally, we conclude the paper in Section 6.
2. RELATED WORK
In data mining, one of the most important techniques is association rule mining, first
introduced by Agrawal et al. [5]. It extracts frequent patterns, correlations and associations among
sets of items in databases. The main drawback of the classical Apriori algorithm is that it needs
repeated database scans to generate candidate sets. Subsequently, the frequent pattern tree and FP-growth
algorithm [4] were introduced by Han et al. to mine frequent patterns without candidate generation.
Periodic patterns [12], [13] and cyclic patterns [14] are also closely related to regular patterns.
Periodic pattern mining in time-series data focuses on the cyclic behavior of patterns either in
the whole or in some part of the time series. Although periodic pattern mining is closely related to our
work, it cannot be applied directly to mine regular patterns from a data stream because it processes
either time-series or sequential data.
Tanbeer et al. [7] proposed a tree-based data structure, called the RPS-tree, that captures the
user-given regularity threshold and mines regular patterns in a data stream with the help of the FP-growth
[4] algorithm, conditional pattern bases and the corresponding conditional trees. First, they
construct the RPS-tree, which consists of one root node referred to as “null” and a set of item-prefix
sub-trees that are the children of the root. Each node in an RPS-tree represents an itemset in the path from
the root up to that node. The RPS-tree maintains the occurrence information of all transactions in
the current window in the tree structure. The RPS-tree has two types of nodes, called
ordinary nodes and tail nodes. Nodes of both types explicitly maintain parent, children and node
traversal pointers. In addition, each tail node maintains a tid-list and a tail-node pointer. The
tail-node pointer points to either the next tail node in the tree, if any, or “null”. Then they construct an
item header table, called the RPS-table, which holds each distinct item in the current window with its
relative regularity and a pointer to the first node in the RPS-tree that carries the item.
Each entry of the RPS-table consists of three fields: the item name (i), the regularity of i (r), and a
pointer to the RPS-tree for i (p). Similar to FP-growth mining, they mine RPS-trees of
decreasing size to generate regular patterns by creating conditional pattern bases and the
corresponding conditional trees.
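To make the node layout described above concrete, the following is a minimal Python sketch of the two node types and a header-table entry. It only mirrors the fields named in the text; the class and attribute names (Node, TailNode, HeaderEntry, node_link, next_tail) are hypothetical and are not taken from [7].

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """Ordinary node: parent, children and a node traversal pointer."""
    item: Optional[str]                       # None plays the role of the "null" root
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    node_link: Optional["Node"] = None        # next node carrying the same item


@dataclass(eq=False)
class TailNode(Node):
    """Tail node: an ordinary node plus a tid-list and a tail-node pointer."""
    tid_list: List[int] = field(default_factory=list)
    next_tail: Optional["TailNode"] = None    # next tail node in the tree, or None


@dataclass(eq=False)
class HeaderEntry:
    """One row of the header table: item name (i), regularity (r), pointer (p)."""
    item: str
    regularity: int
    first_node: Optional[Node] = None


# Build the single path null -> c -> e for a transaction with tid 4.
root = Node(item=None)
c = Node(item="c", parent=root)
root.children["c"] = c
e = TailNode(item="e", parent=c, tid_list=[4])
c.children["e"] = e
entry = HeaderEntry(item="e", regularity=4, first_node=e)
```

The tid-list lives only on tail nodes, which is why the text distinguishes the two node types.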
3. PROBLEM DEFINITION
Let I = {i1, i2, . . . , in} be the set of items. A set X = {ij, . . . , ik} ⊆ I, where j ≤ k and j, k ∈ [1, n], is
called a pattern (or an itemset). A transaction t = (tid, Y) is a couple where tid is a transaction-id
and Y is a pattern or an itemset. If X ⊆ Y, then t contains X, or X occurs in t. Let size(t)
be the size of t, i.e., the number of items in Y.
3.1 Definition 1 (Data Stream)
A data stream DS can be defined as an infinite sequence of transactions, i.e., $DS = [t_1, t_2, \ldots, t_m, \ldots]$,
$i \in [1, m]$, where $t_i$ is the i-th arrived transaction. A window W is the set of all
transactions between the i-th and j-th arrivals of transactions, where j > i, and the size of W is
$|W| = j - i$, i.e., the number of transactions between the i-th and j-th arrivals. Let each slide of the
window introduce and expire slide_size transactions, $1 \le slide\_size \le |W|$, into and from the
current window. If X occurs in $t_j$, $j \in [1, |W|]$, that transaction-id is denoted as $t_j^X$, $j \in [1, |W|]$.
Therefore $T_W^X = \{t_j^X, \ldots, t_k^X\}$, $j, k \in [1, |W|]$ and $j \le k$, is the set of all transaction-ids where X
occurs in the current window W.
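The window mechanics of Definition 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name sliding_windows is hypothetical, the stream is a finite list (the nine transactions of Figure 1) rather than an unbounded source, and a deque with a maximum length stands in for the expiry of old transactions.

```python
from collections import deque
from itertools import islice


def sliding_windows(stream, window_size, slide_size=1):
    """Yield each window of the stream as a list of transactions."""
    if not 1 <= slide_size <= window_size:
        raise ValueError("need 1 <= slide_size <= window_size")
    it = iter(stream)
    # A deque with maxlen drops the oldest transactions as new ones arrive.
    window = deque(islice(it, window_size), maxlen=window_size)
    if len(window) < window_size:
        return  # not enough transactions to fill the first window
    yield list(window)
    while True:
        batch = list(islice(it, slide_size))
        if len(batch) < slide_size:
            return  # the finite demo stream is exhausted
        window.extend(batch)
        yield list(window)


# The nine transactions of Figure 1, in arrival order.
DS = [
    {"a", "c", "e", "f"}, {"b", "c", "f"}, {"b", "c", "f"},
    {"c", "d", "e"}, {"a", "b", "c", "e"}, {"c", "d", "e"},
    {"a", "c", "d", "e"}, {"c", "d", "e", "f"}, {"a", "c"},
]
windows = list(sliding_windows(DS, window_size=8, slide_size=1))
```

With a window size of 8 and slide_size = 1, the nine transactions produce two windows: the first covers tid 1 to tid 8 and the second covers tid 2 to tid 9.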
3.2 Definition 2 (A period of X in W)
Let $t_{j+1}^X$ and $t_j^X$, $j \in [1, (|W| - 1)]$, be two consecutive transaction-ids in $T_W^X$. The number of
transactions between $t_{j+1}^X$ and $t_j^X$ is defined as a period of X, say $p^X$ (i.e.,
$p^X = t_{j+1}^X - t_j^X$, $j \in [1, (|W| - 1)]$). For simplicity of period computation, a “null” transaction with no item is considered
at the beginning of W, i.e., $t_f = 0$ (null), where $t_f$ represents the tid of the first transaction to be
considered. Similarly, $t_l$, the tid of the last transaction to be considered, is the tid of the |W|-th
transaction in the window, i.e., $t_l = t_{|W|}$. For instance, in the data stream of Figure 1, consider a
window composed of eight transactions (i.e., tid = 1 to tid = 8 make up the first window, say W1).
The set of transactions in W1 where pattern (b, c) appears is {2, 3, 5}. Therefore, the periods for
(b, c) are $(2 - t_f) = 2$, $(3 - 2) = 1$, $(5 - 3) = 2$ and $(t_l - 5) = 3$, where $t_f = 0$ and $t_l = 8$.
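The period computation in this example is a chain of subtractions, which the short Python sketch below reproduces. The function name periods is an assumption made for illustration; the boundary values t_f = 0 and t_l = |W| follow the definition above.

```python
def periods(tids, window_size):
    """Periods of a pattern, given the sorted tids at which it occurs.

    The window is padded with t_f = 0 at the start and t_l = window_size
    at the end, as in Definition 2.
    """
    points = [0] + sorted(tids) + [window_size]
    return [later - earlier for earlier, later in zip(points, points[1:])]


# Pattern (b, c) occurs at tids 2, 3 and 5 in the first window (|W| = 8).
bc_periods = periods([2, 3, 5], 8)
print(bc_periods)  # [2, 1, 2, 3]
```

When a pattern occurs in the last transaction of the window, this sketch also reports a final period of 0, which does not affect the maximum.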
The occurrence periods of X in W defined above give precise information about the
occurrence behaviour of a pattern. A pattern is not regular in W if, at any stage, it reappears only
after a large period. The largest occurrence period of a pattern therefore provides the upper limit of its
periodic occurrence characteristic. Hence, the measure of how regular a pattern is
in W (i.e., the regularity of a pattern in W) can be defined as follows.
3.3 Definition 3 (Regularity of a pattern X in W)
For a given $T_W^X$, let $P_W^X$ be the set of all periods of X in W, i.e., $P_W^X = \{p_1^X, \ldots, p_s^X\}$, where s is
the total number of periods of X in W. Then the regularity of X in W is denoted as
$reg_W(X) = \max(p_1^X, \ldots, p_s^X)$. For example, in the DS of Figure 1, $reg_{W_1}(b, c) = 3$, since
$P_{W_1}^{(b, c)} = \{2, 1, 2, 3\}$ and $\max(2, 1, 2, 3) = 3$. A pattern is called a regular pattern in W if its regularity in W is no more than
a user-given maximum regularity threshold called max_reg, λ, with $1 \le \lambda \le |W|$. The regularity
threshold is given as a percentage of the window size.
Regular patterns in W satisfy the downward closure property [6], i.e., if a pattern is
found to be regular, then all of its non-empty subsets are regular. Accordingly, if a pattern is
not regular, then none of its supersets can be regular. Given DS, W and max_reg, the problem of
mining regular patterns in a data stream is to find the complete set of regular patterns in W, $R_W$,
whose regularity is no greater than the max_reg value.
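As a small check of Definition 3 and the downward closure property, the sketch below computes the regularity of every possible itemset in the first window of Figure 1 by brute force and verifies that every non-empty subset of a regular pattern is itself regular. The brute-force enumeration is used here only for verification on a six-item example and is not the mining method of this paper; the function names are hypothetical.

```python
from itertools import combinations

# First window W1 of Figure 1 (tid 1 to tid 8).
W1 = [
    {"a", "c", "e", "f"}, {"b", "c", "f"}, {"b", "c", "f"},
    {"c", "d", "e"}, {"a", "b", "c", "e"}, {"c", "d", "e"},
    {"a", "c", "d", "e"}, {"c", "d", "e", "f"},
]


def regularity(pattern, window):
    """Largest period of the pattern in the window (Definition 3)."""
    tids = [tid for tid, items in enumerate(window, start=1)
            if set(pattern) <= items]
    points = [0] + tids + [len(window)]
    return max(b - a for a, b in zip(points, points[1:]))


def regular_patterns(window, max_reg):
    """All itemsets whose regularity does not exceed max_reg (brute force)."""
    items = sorted(set().union(*window))
    return {
        pattern
        for k in range(1, len(items) + 1)
        for pattern in combinations(items, k)
        if regularity(pattern, window) <= max_reg
    }


regular = regular_patterns(W1, max_reg=3)
# Downward closure: every non-empty proper subset of a regular pattern is regular.
closed = all(
    subset in regular
    for pattern in regular
    for k in range(1, len(pattern))
    for subset in combinations(pattern, k)
)
```

For λ = 3 the regular patterns of W1 found this way are {b}, {c}, {e}, {b, c} and {c, e}.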
4. MINING REGULAR PATTERNS
In contrast with traditional data sets, stream data flows continuously in and out of a computer
system at varying rates. We therefore use the sliding-window technique and the vertical data
format to mine regular patterns from the data stream.
Let Figure 1 be the data stream, which contains the transaction-id (tid) and the itemset of the
transaction with that tid. Consider a window size of 8, i.e., |W| = 8. The
first window W1 handles the transactions of the data stream from tid 1 to tid 8. We convert W1
into the vertical data format, i.e., (itemset : tid-set), and then find the periods of each itemset
in the window to obtain the regular patterns, i.e., those whose regularity is less than or equal to the
user-given regularity threshold. After finding the regular patterns of W1, we generate the second
window W2 and repeat the same procedure to find the latest regular patterns in the data stream.
tid transaction
1. a, c, e, f
2. b, c, f
3. b, c, f
4. c, d, e
5. a, b, c, e
6. c, d, e
7. a, c, d, e
8. c, d, e, f
9. a, c
Window size | W | = 8
FIGURE 1: A data stream DS
Our proposed method, given below, mines regular patterns from the data stream with the help of the
sliding-window technique and the vertical data format. Both the Apriori algorithm and the FP-growth
algorithm mine frequent patterns in the horizontal data format (i.e., {TID : itemset}), where TID is a
transaction-id and itemset is the set of items in transaction TID. The data can also be represented in the
{item : TID-set} format, where item is an item name and TID-set is the set of transactions containing
the item. This is known as the vertical data format. We mine regular patterns from the
given data stream using the vertical format.
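The conversion from the horizontal to the vertical format can be sketched as follows. This is an illustrative Python fragment, not the authors' VC++ implementation; the function name to_vertical is hypothetical, and tids are assumed to be numbered from 1 within the window.

```python
from collections import defaultdict


def to_vertical(window):
    """Turn a list of transactions (horizontal) into {item: tid-list} (vertical)."""
    vertical = defaultdict(list)
    for tid, items in enumerate(window, start=1):
        for item in sorted(items):
            vertical[item].append(tid)  # tids are appended in increasing order
    return dict(vertical)


# First window of Figure 1 in horizontal format.
W1 = [
    {"a", "c", "e", "f"}, {"b", "c", "f"}, {"b", "c", "f"},
    {"c", "d", "e"}, {"a", "b", "c", "e"}, {"c", "d", "e"},
    {"a", "c", "d", "e"}, {"c", "d", "e", "f"},
]
VT = to_vertical(W1)
print(VT["a"])  # [1, 5, 7]
```

Once the data is in this form, the tid-set of a larger itemset is the intersection of the tid-sets of its items, so no further scan of the transactions is needed.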
VDSRP – Method
Input: DS, λ = 3
Output: Complete set of regular patterns
Procedure:
For each window W of size 8 in DS
{
    Convert Tw into VTw                        // vertical format (item : tid-set)
    For each item i in X where X ⊆ VTw
        If (FindR(i) > λ)                      // FindR returns the regularity (maximum period)
            Delete i
        Else
        {
            Result ← Result ∪ (i)
            For each item inext in X
            {
                If (FindR(i, inext) > λ)       // tid-set of (i, inext) from the “and” operation
                    Delete (i, inext)
                Else
                    Result ← Result ∪ (i, inext)
                Do the “and” operation until all regular itemsets are found
            }
        }
    Update W(i, j)                             // slide the window
}
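One way to read the procedure above is as a level-wise search over tid-lists within a single window. The Python sketch below follows that reading under explicit assumptions: the function names are hypothetical, the "and" operation is taken to be tid-set intersection, and candidates are joined Apriori-style (two k-itemsets sharing their first k-1 items), which the procedure itself does not spell out.

```python
from itertools import combinations


def regularity(tids, window_size):
    """Maximum period of a pattern, given its sorted tid-list."""
    points = [0] + tids + [window_size]
    return max(b - a for a, b in zip(points, points[1:]))


def mine_regular_patterns(window, max_reg):
    """Return {pattern: regularity} for all regular patterns in one window."""
    size = len(window)
    # Vertical format: 1-itemset (as a tuple) -> sorted tid-list.
    vertical = {}
    for tid, items in enumerate(window, start=1):
        for item in items:
            vertical.setdefault((item,), []).append(tid)
    # Level 1: delete items whose regularity exceeds the threshold.
    level = {p: t for p, t in vertical.items() if regularity(t, size) <= max_reg}
    result = {}
    while level:
        result.update({p: regularity(t, size) for p, t in level.items()})
        next_level = {}
        for (p1, t1), (p2, t2) in combinations(sorted(level.items()), 2):
            if p1[:-1] != p2[:-1]:
                continue  # join only itemsets that share the same prefix
            tids = sorted(set(t1) & set(t2))  # the "and" operation
            if regularity(tids, size) <= max_reg:
                next_level[p1 + (p2[-1],)] = tids
        level = next_level
    return result


W1 = [
    {"a", "c", "e", "f"}, {"b", "c", "f"}, {"b", "c", "f"},
    {"c", "d", "e"}, {"a", "b", "c", "e"}, {"c", "d", "e"},
    {"a", "c", "d", "e"}, {"c", "d", "e", "f"},
]
found = mine_regular_patterns(W1, max_reg=3)
```

Because only regular itemsets are carried to the next level, non-regular items such as a, d and f in the first window never take part in candidate generation.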
Using the example data stream in Figure 1, we convert the first window into the vertical data
format and then calculate the periods of each itemset X as explained in the problem definition.
Itemset   Tid-set                  P^X                      R
a         1, 5, 7                  1, 4, 2, 1               4
b         2, 3, 5                  2, 1, 2, 3               3
c         1, 2, 3, 4, 5, 6, 7, 8   1, 1, 1, 1, 1, 1, 1, 1   1
d         4, 6, 7, 8               4, 2, 1, 1               4
e         1, 4, 5, 6, 7, 8         1, 3, 1, 1, 1, 1         3
f         1, 2, 3, 8               1, 1, 1, 5               5
TABLE 1: VDSRP-table with P^X and R
Finding periods is a simple procedure in our VDSRP-method. We create an individual array for every
itemset and find all the periods of each itemset by subtraction, i.e., $p^X = t_{j+1}^X - t_j^X$. Among the
periods obtained from the window, we take the maximum period of each
itemset as its regularity. For example, in Table 1 the periods of {a} are (1, 4, 2, 1). Its regularity is 4
because that is the maximum period of {a}. Comparing this with the user-given regularity threshold,
i.e., λ = 3, {a} is not a regular itemset because its regularity is greater than the threshold.
Itemset     Tid-set            P^X                R
(b, c)      2, 3, 5            2, 1, 2, 3         3
(b, e)      5                  5, 3               5
(c, e)      1, 4, 5, 6, 7, 8   1, 3, 1, 1, 1, 1   3
(b, c, e)   5                  5, 3               5
TABLE 2: VDSRP-table with P^X and R
After obtaining the regular 1-itemsets, we proceed to the 2-itemsets, as shown in Table 2. The regular
patterns in W1 satisfy the downward closure property [5], i.e., if a pattern is found to be regular then
all of its non-empty subsets are regular, and if a pattern is not regular then none of its supersets can be regular.
So we consider only the k-itemsets that are found regular to generate (k+1)-itemsets. From Table 2
we can see that (b, c) and (c, e) are the regular 2-itemsets. We continue mining with the same
procedure until no regular itemset is generated in the window.
Itemset   Tid-set                  P^X                      R
a         4, 6, 8                  4, 2, 2                  4
b         1, 2, 4                  1, 1, 2, 4               4
c         1, 2, 3, 4, 5, 6, 7, 8   1, 1, 1, 1, 1, 1, 1, 1   1
d         3, 5, 6, 7               3, 2, 1, 1, 1            3
e         3, 4, 5, 6, 7            3, 1, 1, 1, 1, 1         3
f         1, 2, 7                  1, 1, 5, 1               5
TABLE 3: VDSRP-table with P^X and R from W2
Generally, the regularity of patterns may change as the window slides, i.e., with the deletion of
old transactions and the insertion of new transactions. For example, the
regular patterns {b} and {b, c} in W1 (Tables 1 and 2) become irregular in W2 because their regularity
becomes greater than max_reg. Conversely, the irregular patterns {d} and {c, d, e} in W1 become regular
in W2 (Tables 3 and 4). Therefore, to reflect the correct regularity of each item in the current window, we perform a
refreshing operation on the VDSRP-table to obtain the new window. The process continues with W2, W3 and
so on to find the latest regular patterns in the data stream.
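The text does not detail how the refreshing operation is carried out, so the following Python sketch shows just one plausible way to do it for slide_size = 1: expire tid 1, renumber the remaining tids so that they again run from 1 to |W|, and append the new transaction at tid |W|. The function names and the renumbering step are assumptions, chosen so that the result matches the tid-sets of Table 3.

```python
def regularity(tids, window_size):
    """Maximum period of a pattern, given its sorted tid-list."""
    points = [0] + tids + [window_size]
    return max(b - a for a, b in zip(points, points[1:]))


def slide(vertical, window_size, new_transaction):
    """Refresh a vertical table after the window slides by one transaction."""
    refreshed = {}
    for item, tids in vertical.items():
        # Drop the expired tid 1 and shift the remaining tids down by one.
        shifted = [tid - 1 for tid in tids if tid > 1]
        if shifted:
            refreshed[item] = shifted
    for item in sorted(new_transaction):
        refreshed.setdefault(item, []).append(window_size)
    return refreshed


# Vertical table of the first window (Table 1).
W1_vertical = {
    "a": [1, 5, 7], "b": [2, 3, 5], "c": [1, 2, 3, 4, 5, 6, 7, 8],
    "d": [4, 6, 7, 8], "e": [1, 4, 5, 6, 7, 8], "f": [1, 2, 3, 8],
}
# The ninth transaction of Figure 1 enters the window.
W2_vertical = slide(W1_vertical, 8, {"a", "c"})
reg_w1 = {item: regularity(tids, 8) for item, tids in W1_vertical.items()}
reg_w2 = {item: regularity(tids, 8) for item, tids in W2_vertical.items()}
```

Comparing reg_w1 with reg_w2 shows the change discussed above: the regularity of {b} rises from 3 to 4, while that of {d} falls from 4 to 3.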
Itemset     Tid-set         P^X                R
(c, d)      3, 5, 6, 7      3, 2, 1, 1, 1      3
(c, e)      3, 4, 5, 6, 7   3, 1, 1, 1, 1, 1   3
(d, e)      3, 5, 6, 7      3, 2, 1, 1, 1      3
(c, d, e)   3, 5, 6, 7      3, 2, 1, 1, 1      3
TABLE 4: VDSRP-table with P^X and R from W2
5. EXPERIMENTAL RESULTS
In this section we present our results. All the programs were written in VC++ 6.0 and
executed on Windows XP on a 2.66 GHz machine with 2 GB of main memory. We ran our
VDSRP-method on several synthetic and real datasets that are frequently used in
frequent pattern mining experiments, and we present our experimental
results in comparison with the existing RPS-tree. The detailed characteristics of the datasets, which
were obtained from [15], are given in Table 5.
Dataset      #Trans    #Items   MaxTL   AvgTL   Type
Kosarak      990,000   41,270   673     8.10    Real
Mushroom     8,124     119      23      23      Real
T10I4D100K   100,759   870      30      10.10   Synthetic
TABLE 5: Database Characteristics (MaxTL and AvgTL: maximum and average transaction length)
Table 5 shows some statistical information about the datasets. We use
slide_size = 1 for all the experiments. We report results on the Kosarak dataset, which contains
990K transactions and 41,270 items, with an average transaction length of 8.10; on the
T10I4D100K dataset, which contains 100,759 transactions and 870 items, with an average transaction
length of 10.10; and on the Mushroom dataset, which contains 8,124 transactions and 119 items, with an
average transaction length of 23.
5.1 Memory Efficiency
The memory requirements of our VDSRP-method on different datasets with different window
sizes are shown in Table 6. For example, on the Kosarak dataset the method requires
3.57 MB on average when the window size is 100K, and
16.83 MB on average when the window size is 500K. Hence, from Table 6 it can be observed that the
VDSRP-method is memory efficient on different real and synthetic datasets.
Kosarak      W1 (100K)   W2 (300K)   W3 (500K)   W4 (700K)   W5 (900K)
             3.57 MB     10.71 MB    16.83 MB    24.96 MB    32.1 MB
Mushroom     W1 (1K)     W2 (3K)     W3 (5K)     W4 (7K)     W5 (8K)
             0.7 MB      0.14 MB     0.31 MB     0.48 MB     0.56 MB
T10I4D100K   W1 (20K)    W2 (40K)    W3 (60K)    W4 (80K)    W5 (100K)
             0.82 MB     1.74 MB     2.57 MB     3.31 MB     4.12 MB
TABLE 6: Memory Requirement for different window sizes
5.2 Runtime Efficiency
From Figures 2(a) and 2(b) we can see that our proposed method runs faster than the RPS-tree under
various regularity thresholds and with different window sizes, respectively. We conducted
experiments on the Kosarak dataset with a window size of 500K at different max_reg (%) values. In Figure
2(a), the x-axis shows the regularity threshold values and the y-axis shows the average total time
taken, which includes both converting the data into the vertical format and mining. Our proposed method
takes 38 seconds on average when max_reg is 0.04%. As the max_reg value increases,
the execution time needed to mine regular patterns from the window also increases.
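Since the regularity threshold is given as a percentage of the window size, it may help to see what the percentages used in the experiments mean in transactions. The small Python sketch below does that arithmetic; the function name absolute_max_reg is hypothetical, and rounding to the nearest integer is an assumption, as the paper does not state a rounding rule.

```python
def absolute_max_reg(percent, window_size):
    """Convert a max_reg percentage into a number of transactions (rounded)."""
    return round(percent * window_size / 100)


print(absolute_max_reg(0.04, 500_000))  # 200
print(absolute_max_reg(0.06, 300_000))  # 180
```

Under this reading, max_reg = 0.04% with |W| = 500K means that a regular pattern must reappear within at most 200 transactions.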
FIGURE 2 (a): On Kosarak (|W| = 500K); x-axis: max_reg (%), y-axis: Time (sec)
In Figure 2(b), the x-axis shows different window sizes and the y-axis shows the average time
taken in seconds to mine regular patterns when max_reg is 0.06%. Our proposed method takes less
time on average than the RPS-tree for the different window sizes. For example, when the window
size is 300K the average time taken is only 45 seconds.
FIGURE 2 (b): On Kosarak (max_reg = 0.06%); x-axis: window size (K), y-axis: Time (sec)
6. CONCLUSION
In this paper we presented the VDSRP-method, which in our experiments performs better than the
existing RPS-tree algorithm because it combines the sliding-window technique with the advantages of
the vertical data format. The method is simple to use, relying on simple operations such as arrays,
unions, intersections and deletions to find regular patterns over data streams. Our experimental
results show good performance in both execution time and memory consumption.
7. REFERENCES
[1] S.K. Tanbeer, C.F. Ahmed, B.-S. Jeong, Y.-K. Lee. “Sliding Window-based Frequent Pattern
Mining over Data Streams.” Information Sciences 179, 2009, pp. 3843-3865.
[2] C.K.-S. Leung, Q.I. Khan. “DSTree: A Tree Structure for the Mining of Frequent Sets from
Data Streams.” In: ICDM, 2006, pp. 928-932.
[3] H.-F. Li, S.-Y. Lee. “Mining Frequent Itemsets over Data Streams Using Efficient Window
Sliding Techniques.” Expert Systems with Applications 36, 2009, pp. 1466-1477.
[4] J. Han, J. Pei, Y. Yin. “Mining Frequent Patterns without Candidate Generation.” In Proc.
ACM SIGMOD International Conference on Management of Data, 2000, pp. 1-12.
[5] R. Agrawal, R. Srikant. “Fast Algorithms for Mining Association Rules in Large
Databases.” In Proc. 1994 Int. Conf. Very Large Databases (VLDB’94), Santiago, Chile,
Sept. 1994, pp. 487-499.
[6] S.K. Tanbeer, C.F. Ahmed, B.-S. Jeong, Y.-K. Lee. “Mining Regular Patterns in
Transactional Databases.” IEICE Trans. on Information Systems, E91-D, 11, 2008, pp. 2568-2577.
[7] S.K. Tanbeer, C.F. Ahmed, B.-S. Jeong. “Mining Regular Patterns in Data Streams.” In:
DASFAA. Volume 5981 of LNCS, Springer, 2010, pp. 399-413.
[8] J. Han, M. Kamber. “Data Mining: Concepts and Techniques”, 2nd ed. Morgan Kaufmann
Publishers, an imprint of Elsevier, 2006, pp. 468-489.
[9] G. Yi-ming, W. Zhi-jun. “A Vertical Format Algorithm for Mining Frequent Item Sets.” IEEE
Transactions, 2010, pp. 11-13.
[10] M.J. Zaki, K. Gouda. “Fast Vertical Mining Using Diffsets.” In: SIGKDD ’03, ACM,
August 24-27, 2003.
[11] G. Vijay Kumar, M. Sreedevi, NVS Pavan Kumar. “Mining Regular Patterns in Transactional
Databases using Vertical Format.” International Journal of Advanced Research in Computer
Science, vol. 2, Sep-Oct 2011, pp. 581-583.
[12] M.G. Elfeky, W.G. Aref, A.K. Elmagarmid. “Periodicity Detection in Time Series Databases.”
IEEE Transactions on Knowledge and Data Engineering 17(7), 2005, pp. 875-887.
[13] G. Lee, W. Yang, J.-M. Lee. “A Parallel Algorithm for Mining Partial Periodic Patterns.”
Information Sciences 176, 2006, pp. 3591-3609.
[14] B. Ozden, S. Ramaswamy, A. Silberschatz. “Cyclic Association Rules.” In: 14th International
Conference on Data Engineering, 1998, pp. 412-421.
[15] Frequent Itemset Mining Dataset Repository (http://fimi.cs.helsinki.fi/data/) and UCI Machine
Learning Repository (University of California).

More Related Content

What's hot (17)

PDF
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
IJCSES Journal
 
PDF
Mining frequent itemsets (mfi) over
IJDKP
 
PDF
50120140503013
IAEME Publication
 
PDF
Frequent Pattern Mining with Serialization and De-Serialization
iosrjce
 
PDF
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...
Waqas Tariq
 
PDF
Sequential Pattern Tree Mining
IOSR Journals
 
PPTX
Data Mining: Mining stream time series and sequence data
DataminingTools Inc
 
PDF
slides
Chetan Tonde
 
PDF
Ontology Based PMSE with Manifold Preference
IJCERT
 
PDF
Big Data with Rough Set Using Map- Reduce
ijircee
 
PDF
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
cscpconf
 
PDF
Combined mining approach to generate patterns for complex data
csandit
 
PDF
Ijariie1129
IJARIIE JOURNAL
 
PDF
REVIEW: Frequent Pattern Mining Techniques
Editor IJMTER
 
PDF
D-5436
Dale Visser
 
PDF
An improved apriori algorithm for association rules
ijnlc
 
PPTX
Data Mining: Mining ,associations, and correlations
DataminingTools Inc
 
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
IJCSES Journal
 
Mining frequent itemsets (mfi) over
IJDKP
 
50120140503013
IAEME Publication
 
Frequent Pattern Mining with Serialization and De-Serialization
iosrjce
 
An Efficient Algorithm for Mining Frequent Itemsets within Large Windows over...
Waqas Tariq
 
Sequential Pattern Tree Mining
IOSR Journals
 
Data Mining: Mining stream time series and sequence data
DataminingTools Inc
 
slides
Chetan Tonde
 
Ontology Based PMSE with Manifold Preference
IJCERT
 
Big Data with Rough Set Using Map- Reduce
ijircee
 
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
cscpconf
 
Combined mining approach to generate patterns for complex data
csandit
 
Ijariie1129
IJARIIE JOURNAL
 
REVIEW: Frequent Pattern Mining Techniques
Editor IJMTER
 
D-5436
Dale Visser
 
An improved apriori algorithm for association rules
ijnlc
 
Data Mining: Mining ,associations, and correlations
DataminingTools Inc
 

Similar to Mining Regular Patterns in Data Streams Using Vertical Format (20)

PDF
Ijsrdv1 i2039
ijsrd.com
 
PDF
Review Over Sequential Rule Mining
ijsrd.com
 
PDF
Fast Sequential Rule Mining
ijsrd.com
 
PDF
A survey paper on sequence pattern mining with incremental
Alexander Decker
 
PDF
A survey paper on sequence pattern mining with incremental
Alexander Decker
 
PDF
H0964752
IOSR Journals
 
PDF
An efficient algorithm for sequence generation in data mining
ijcisjournal
 
PDF
Mining Maximum Frequent Item Sets Over Data Streams Using Transaction Sliding...
ijitcs
 
PDF
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...
AshishDPatel1
 
PDF
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
IJDKP
 
PDF
A Brief Overview On Frequent Pattern Mining Algorithms
Sara Alvarez
 
PDF
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
BRNSSPublicationHubI
 
PDF
Bc26354358
IJERA Editor
 
PDF
Modern association rule mining methods
ijcsity
 
PDF
Drsp dimension reduction for similarity matching and pruning of time series ...
IJDKP
 
PDF
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
BRNSSPublicationHubI
 
PDF
Review on: Techniques for Predicting Frequent Items
vivatechijri
 
PDF
I1802055259
IOSR Journals
 
PDF
Literature Survey of modern frequent item set mining methods
ijsrd.com
 
PDF
An Efficient Compressed Data Structure Based Method for Frequent Item Set Mining
ijsrd.com
 
Ijsrdv1 i2039
ijsrd.com
 
Review Over Sequential Rule Mining
ijsrd.com
 
Fast Sequential Rule Mining
ijsrd.com
 
A survey paper on sequence pattern mining with incremental
Alexander Decker
 
A survey paper on sequence pattern mining with incremental
Alexander Decker
 
H0964752
IOSR Journals
 
An efficient algorithm for sequence generation in data mining
ijcisjournal
 
Mining Maximum Frequent Item Sets Over Data Streams Using Transaction Sliding...
ijitcs
 
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...
AshishDPatel1
 
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
IJDKP
 
A Brief Overview On Frequent Pattern Mining Algorithms
Sara Alvarez
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
BRNSSPublicationHubI
 
Bc26354358
IJERA Editor
 
Modern association rule mining methods
ijcsity
 
Drsp dimension reduction for similarity matching and pruning of time series ...
IJDKP
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
BRNSSPublicationHubI
 
Review on: Techniques for Predicting Frequent Items
vivatechijri
 
I1802055259
IOSR Journals
 
Literature Survey of modern frequent item set mining methods
ijsrd.com
 
An Efficient Compressed Data Structure Based Method for Frequent Item Set Mining
ijsrd.com
 
Ad

Recently uploaded (20)

PDF
WATERSHED MANAGEMENT CASE STUDIES - ULUGURU MOUNTAINS AND ARVARI RIVERpdf
Ar.Asna
 
PDF
The Power of Compound Interest (Stanford Initiative for Financial Decision-Ma...
Stanford IFDM
 
PDF
Nanotechnology and Functional Foods Effective Delivery of Bioactive Ingredien...
rmswlwcxai8321
 
PPTX
How to Configure Taxes in Company Currency in Odoo 18 Accounting
Celine George
 
DOCX
Lesson 1 - Nature and Inquiry of Research
marvinnbustamante1
 
PDF
CAD25 Gbadago and Fafa Presentation Revised-Aston Business School, UK.pdf
Kweku Zurek
 
PDF
Supply Chain Security A Comprehensive Approach 1st Edition Arthur G. Arway
rxgnika452
 
PDF
Lesson 1 - Nature of Inquiry and Research.pdf
marvinnbustamante1
 
PDF
Free eBook ~100 Common English Proverbs (ebook) pdf.pdf
OH TEIK BIN
 
PDF
Cooperative wireless communications 1st Edition Yan Zhang
jsphyftmkb123
 
PPTX
How to Manage Wins & Losses in Odoo 18 CRM
Celine George
 
PPTX
Natural Language processing using nltk.pptx
Ramakrishna Reddy Bijjam
 
PPTX
Marketing Management PPT Unit 1 and Unit 2.pptx
Sri Ramakrishna College of Arts and science
 
PDF
Quiz Night Live May 2025 - Intra Pragya Online General Quiz
Pragya - UEM Kolkata Quiz Club
 
PPTX
Aerobic and Anaerobic respiration and CPR.pptx
Olivier Rochester
 
PDF
DIGESTION OF CARBOHYDRATES ,PROTEINS AND LIPIDS
raviralanaresh2
 
PDF
Our Guide to the July 2025 USPS® Rate Change
Postal Advocate Inc.
 
PDF
Andreas Schleicher_Teaching Compass_Education 2040.pdf
EduSkills OECD
 
PPTX
The Gift of the Magi by O Henry-A Story of True Love, Sacrifice, and Selfless...
Beena E S
 
PPTX
Iván Bornacelly - Presentation of the report - Empowering the workforce in th...
EduSkills OECD
 
WATERSHED MANAGEMENT CASE STUDIES - ULUGURU MOUNTAINS AND ARVARI RIVERpdf
Ar.Asna
 
The Power of Compound Interest (Stanford Initiative for Financial Decision-Ma...
Stanford IFDM
 
Nanotechnology and Functional Foods Effective Delivery of Bioactive Ingredien...
rmswlwcxai8321
 
How to Configure Taxes in Company Currency in Odoo 18 Accounting
Celine George
 
Lesson 1 - Nature and Inquiry of Research
marvinnbustamante1
 
CAD25 Gbadago and Fafa Presentation Revised-Aston Business School, UK.pdf
Kweku Zurek
 
Supply Chain Security A Comprehensive Approach 1st Edition Arthur G. Arway
rxgnika452
 
Lesson 1 - Nature of Inquiry and Research.pdf
marvinnbustamante1
 
Free eBook ~100 Common English Proverbs (ebook) pdf.pdf
OH TEIK BIN
 
Cooperative wireless communications 1st Edition Yan Zhang
jsphyftmkb123
 
How to Manage Wins & Losses in Odoo 18 CRM
Celine George
 
Natural Language processing using nltk.pptx
Ramakrishna Reddy Bijjam
 
Marketing Management PPT Unit 1 and Unit 2.pptx
Sri Ramakrishna College of Arts and science
 
Quiz Night Live May 2025 - Intra Pragya Online General Quiz
Pragya - UEM Kolkata Quiz Club
 
Aerobic and Anaerobic respiration and CPR.pptx
Olivier Rochester
 
DIGESTION OF CARBOHYDRATES ,PROTEINS AND LIPIDS
raviralanaresh2
 
Our Guide to the July 2025 USPS® Rate Change
Postal Advocate Inc.
 
Andreas Schleicher_Teaching Compass_Education 2040.pdf
EduSkills OECD
 
The Gift of the Magi by O Henry-A Story of True Love, Sacrifice, and Selfless...
Beena E S
 
Iván Bornacelly - Presentation of the report - Empowering the workforce in th...
EduSkills OECD
 
Ad

Mining Regular Patterns in Data Streams Using Vertical Format

  • 1. G. Vijay Kumar, M. Sreedevi & NVS Pavan Kumar International Journal of Computer Science and Security (IJCSS), Volume (6) : Issue (2) : 2012 142 Mining Regular Patterns in Data Streams Using Vertical Format G. Vijay Kumar [email protected] School of Computing K L University Guntur – 522 502, India M. Sreedevi [email protected] School of Computing K L University Guntur – 522 502, India NVS Pavan Kumar [email protected] School of Computing K L University Guntur – 522 502, India Abstract The increasing prominence of data streams has been lead to the study of online mining in order to capture interesting trends, patterns and exceptions. Recently, temporal regularity in occurrence behavior of a pattern was treated as an emerging area in several online applications like network traffic, sensor networks, e-business and stock market analysis etc. A pattern is said to be regular in a data stream, if its occurrence behavior is not more than the user given regularity threshold. Although there has been some efforts done in finding regular patterns over stream data, no such method has been developed yet by using vertical data format. Therefore, in this paper we develop a new method called VDSRP-method to generate the complete set of regular patterns over a data stream at a user given regularity threshold. Our experimental results show that highly efficiency in terms of execution and memory consumption. Keywords: Data Streams, Temporal Regularity, Regular Pattern, Vertical Data. 1. INTRODUCTION Unlike mining static databases, data stream mining [1, 2, 3] creates many new challenges. It is unrealistic to keep the entire stream in the main memory or even in a secondary storage device because data stream is a continuous, massive (e.g., terabytes in volume), unbounded, timely ordered series of data elements generates at a rapid rate. 
Incredible volumes of data streams are often generated by communication networks, Internet traffic, real-time surveillance systems, online transactions in the financial market, remote sensors, scientific and engineering experiments and other dynamic environments. Discovering knowledge in data streams is an important research area in data mining and the knowledge discovery process.

Mining frequent patterns [4, 5] has been broadly studied for static databases and, more recently, for data streams [1, 2, 3]. The Apriori algorithm [5] is a classical algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The algorithm uses prior knowledge and employs an iterative approach known as a level-wise search to generate frequent itemsets. It first generates the frequent 1-itemsets, then uses them to generate the frequent 2-itemsets, then the frequent 3-itemsets, and continues until all frequent itemsets are generated. Later, Han et al. [4] proposed the frequent pattern tree (FP-tree) and the FP-growth algorithm to mine frequent patterns without candidate generation. Both the Apriori and FP-growth algorithms find the occurrence frequency of a pattern, i.e., its support. Several algorithms have been proposed so far to mine frequent patterns in transaction databases as well as in data streams. However, the significance of a pattern may not always depend upon its occurrence frequency (i.e., support).

  • 2. The significance of a pattern may also depend upon other occurrence characteristics, such as the temporal regularity of the pattern. For example, to improve web site design, a web site administrator may be interested in regularly visited web page sequences rather than in web pages that are heavily hit only during a specific period of time. Also, in a retail market some products may have more regular demand than others. Knowing how regularly a product has been sold can be more important than knowing its occurrence frequency. Therefore, finding patterns that occur at regular intervals also plays an important role in data mining.

Recently, Tanbeer et al. [6] introduced the new problem of discovering regular patterns, which follow a temporal regularity in their occurrence behavior. A pattern that occurs in a database at intervals no greater than a user-given maximum regularity threshold is called a regular pattern. They also extended the same problem to data streams. They proposed a tree-based data structure, called the RPS-tree [7], that captures the stream content and mines regular patterns in a data stream for a user-given regularity threshold with the help of the FP-growth algorithm, conditional pattern bases and the corresponding conditional trees.

In this paper, we propose a new method called the Vertical Data Stream Regular Patterns method (VDSRP-method in short), which uses the same data stream as in [7] to mine regular patterns using the vertical data format. By using the vertical data format [8, 9, 10, 11], the method is able to identify non-regular itemsets before generating candidate itemsets. The main idea of our method is to develop a simple yet powerful approach that captures the data stream content in a window, using the sliding-window technique, to find regular items.
The experimental results show the effectiveness of the VDSRP-method in finding regular patterns in a data stream.

The rest of the paper is organized as follows. Section 2 summarizes the existing tree structure for mining regular patterns. Section 3 introduces the problem definition of regular pattern mining. The VDSRP-method for finding regular patterns using the vertical data format is given in Section 4. Section 5 presents our experimental results. Finally, we conclude the paper in Section 6.

2. RELATED WORK

In data mining, one of the most important techniques is association rule mining. It was first introduced by Agrawal et al. [5]. It extracts frequent patterns, correlations and associations among sets of items in databases. The main drawback of the classical Apriori algorithm is that it needs repeated database scans to generate candidate sets. Subsequently, the frequent pattern tree and the FP-growth algorithm [4] were introduced by Han et al. to mine frequent patterns without candidate generation.

Periodic patterns [12], [13] and cyclic patterns [14] are also closely related to regular patterns. Periodic pattern mining in time-series data focuses on the cyclic behavior of patterns either in the whole or in some part of the time series. Although periodic pattern mining is closely related to our work, it cannot be applied directly to mine regular patterns from a data stream because it processes either time-series or sequential data.

Tanbeer et al. [7] proposed a tree-based data structure, called the RPS-tree, that captures the stream content and mines regular patterns in a data stream for a user-given regularity threshold with the help of the FP-growth [4] algorithm, conditional pattern bases and the corresponding conditional trees. First, they construct an RPS-tree consisting of one root node referred to as "null" and a set of item-prefix sub-trees that are the children of the root. Each node in an RPS-tree represents an itemset in the path from the root up to that node. The RPS-tree maintains the occurrence information of all transactions in the current window within the tree structure. The RPS-tree also maintains two types of nodes, called ordinary nodes and tail nodes. Nodes of both types explicitly maintain parent, children and node traversal pointers. In addition, each tail node maintains a tid-list and a tail-node pointer. The tail-node pointer points either to the next tail node in the tree, if any, or to "null". They then construct an item header table, called the RPS-table, which holds each distinct item in the current window together with its regularity and a pointer to the first node in the RPS-tree that carries the item. The RPS-table of an RPS-tree consists of three fields: the item name (i), the regularity of i (r), and a pointer to the RPS-tree for i (p). Similar to FP-growth mining, they mine RPS-trees of decreasing size to generate regular patterns by creating conditional pattern bases and the corresponding conditional trees.
  • 3. 3. PROBLEM DEFINITION

Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of items. A set $X = \{i_j, \ldots, i_k\} \subseteq I$, where $j \le k$ and $j, k \in [1, n]$, is called a pattern (or an itemset). A transaction $t = (tid, Y)$ is a couple where $tid$ is a transaction-id and $Y$ is a pattern or an itemset. If $X \subseteq Y$, then $t$ contains $X$, or $X$ occurs in $t$. Let $size(t)$ be the size of $t$, i.e., the number of items in $Y$.

3.1 Definition 1 (Data Stream)
A data stream $DS$ can be defined as an infinite sequence of transactions, i.e., $DS = [t_1, t_2, \ldots, t_m, \ldots]$, where $t_i$, $i \in [1, m]$, is the $i$-th arrived transaction. A window $W$ is the set of all transactions between the $i$-th and $j$-th arrivals of transactions, where $j > i$, and the size of $W$ is $|W| = j - i$, i.e., the number of transactions between the $i$-th and $j$-th arrivals. Let each slide of the window introduce and expire $slide\_size$ transactions, $1 \le slide\_size \le |W|$, into and from the current window. If $X$ occurs in $t_j$, $j \in [1, |W|]$, such a transaction-id is denoted as $t_j^X$. Therefore $T_W^X = \{t_j^X, \ldots, t_k^X\}$, with $j, k \in [1, |W|]$ and $j \le k$, is the set of all transaction-ids where $X$ occurs in the current window $W$.

3.2 Definition 2 (A period of X in W)
Let $t_{j+1}^X$ and $t_j^X$, $j \in [1, |W| - 1]$, be two consecutive transaction-ids in $T_W^X$. The number of transactions between $t_{j+1}^X$ and $t_j^X$ is defined as a period of $X$, say $p^X$ (i.e., $p^X = t_{j+1}^X - t_j^X$). For simplicity of period computation, a "null" transaction with no item is considered at the beginning of $W$, i.e., $t_f = 0$ (null), where $t_f$ represents the tid of the first transaction to be considered. Similarly, $t_l$, the tid of the last transaction to be considered, is the tid of the $|W|$-th transaction in the window, i.e., $t_l = t_{|W|}$.
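Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `periods` is hypothetical, tids are taken to be 1-based positions inside the current window, and the window boundaries $t_f = 0$ and $t_l = |W|$ are added as described above.

```python
def periods(tids, window_size):
    """Return the periods of a pattern from its tid list (Definition 2).

    Assumes tids are 1-based positions inside the current window. A "null"
    transaction at tid 0 opens the window and the last tid (window_size)
    closes it, following the boundary convention in the text.
    """
    boundaries = [0] + sorted(tids) + [window_size]
    # Each period is the difference between two consecutive boundaries.
    return [later - earlier for earlier, later in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    # A pattern occurring at tids 2, 3 and 5 in a window of eight transactions.
    print(periods([2, 3, 5], 8))
```

When a pattern occurs in the last transaction of the window, this sketch reports a trailing period of 0, which has no effect on the largest period.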
For instance, in the stream data of Figure 1, consider a window composed of eight transactions (i.e., tid = 1 to tid = 8 make the first window, say $W_1$). The set of transactions in $W_1$ where pattern (b, c) appears is {2, 3, 5}. Therefore, the periods for (b, c) are $(2 - t_f) = 2$, $(3 - 2) = 1$, $(5 - 3) = 2$ and $(t_l - 5) = 3$, where $t_f = 0$ and $t_l = 8$.

The occurrence periods of $X$ in $W$ defined above give precise information about the occurrence behaviour of a pattern. A pattern is not regular in $W$ if it reappears after a large period at any stage. The largest occurrence period of a pattern therefore provides the upper limit of its periodic occurrence characteristic. Hence, the measure of a pattern being regular in $W$ (i.e., the regularity of a pattern in $W$) can be defined as follows.

3.3 Definition 3 (Regularity of a pattern X in W)
For a given $T_W^X$, let $P_W^X$ be the set of all periods of $X$ in $W$, i.e., $P_W^X = \{p_1^X, \ldots, p_s^X\}$, where $s$ is the total number of periods of $X$ in $W$. Then the regularity of $X$ in $W$ is denoted as $reg_W(X) = \max(p_1^X, \ldots, p_s^X)$. For example, in the DS of Figure 1, $reg_{W_1}(b, c) = 3$, since $P_{W_1}^{(b, c)} = \{2, 1, 2, 3\}$ and $\max(2, 1, 2, 3) = 3$.

A pattern is called a regular pattern in $W$ if its regularity in $W$ is not more than a user-given maximum regularity threshold called max_reg, $\lambda$, with $1 \le \lambda \le |W|$. The regularity threshold is given as a percentage of the window size. Regular patterns in $W$ satisfy the downward closure property [6], i.e., if a pattern is found to be regular, then all of its non-empty subsets are regular. Accordingly, if a pattern is not regular, then none of its supersets can be regular.

Given $DS$, $W$ and max_reg, the problem of mining regular patterns in a data stream is to find the complete set of regular patterns in $W$, $R_W$, whose regularity is not greater than the max_reg value.

4. MINING REGULAR PATTERNS

In contrast with traditional data sets, stream data flows continuously in and out of a computer system at varying rates. We therefore adopt the sliding-window technique and the vertical data format to mine regular patterns from the data stream.
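Before turning to the method itself, the regularity test of Definition 3 can be made concrete with a small Python sketch. The function names are hypothetical, and rounding the percentage threshold down to a whole number of transactions is an assumption made here for illustration; the text does not say how fractional thresholds are handled.

```python
import math


def regularity(tids, window_size):
    """Largest gap between consecutive occurrences, counting both window ends."""
    boundaries = [0] + sorted(tids) + [window_size]
    return max(later - earlier for earlier, later in zip(boundaries, boundaries[1:]))


def max_reg_from_percent(percent, window_size):
    """Convert a percentage of the window size into a transaction count.

    Rounding down (and never going below 1) is an assumption of this sketch.
    """
    return max(1, math.floor(percent * window_size / 100))


def is_regular(tids, window_size, max_reg):
    """A pattern is regular when its regularity does not exceed max_reg."""
    return regularity(tids, window_size) <= max_reg


if __name__ == "__main__":
    lam = max_reg_from_percent(37.5, 8)  # 37.5% of an 8-transaction window
    print(lam, is_regular([2, 3, 5], 8, lam), is_regular([1, 5, 7], 8, lam))
```

A pattern with no occurrence at all gets a regularity equal to the window size in this sketch, so it is only regular when the threshold spans the whole window.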
  • 4. Let Figure 1 be the data stream, which contains a transaction-id (tid) and an itemset, i.e., the transaction with that tid. Consider a window size of 8, i.e., |W| = 8. The first window W1 handles the transactions of the data stream from tid 1 to tid 8. We convert W1 into the vertical data format, i.e., (itemset : tid-set), and then find the periods of each itemset in the window to obtain the regular patterns, whose regularity is less than or equal to the user-given regularity threshold. After finding the regular patterns of W1, we generate the second window W2 and repeat the same procedure to find the latest regular patterns in the data stream.

tid   transaction
1     a, c, e, f
2     b, c, f
3     b, c, f
4     c, d, e
5     a, b, c, e
6     c, d, e
7     a, c, d, e
8     c, d, e, f
9     a, c

Window size |W| = 8 (the stream flows from tid 1 onwards; window1 covers tid 1 to 8 and window2 covers tid 2 to 9)

FIGURE 1: A data stream DS

Our proposed method, given below, mines regular patterns from the data stream with the help of the sliding-window technique and the vertical data format. Both the Apriori algorithm and the FP-growth algorithm mine frequent patterns in the horizontal data format (i.e., {TID : itemset}), where TID is a transaction-id and itemset is the set of items in transaction TID. The data can also be presented in the {item : TID-set} format, where item is an item name and TID-set is the set of transactions containing the item. This is known as the vertical data format. We mine regular patterns from the given data stream using the vertical format.

VDSRP-Method
Input: DS, λ = 3
Output: Complete set of regular patterns
Procedure:
  For each window W of size 8 in DS
    Convert Tw into VTw
    For each item i in X where X ⊆ VTw
      If (FindR(i) > λ)    // FindR returns the regularity of the itemset
        Delete i
      Else
      {
        Result ← Result ∪ (i)
        For each item inext in X
        {
          If (FindR(inext) > λ)
            Delete inext
          Else
            Result ← Result ∪ (i, inext)
          Do "and" operation till all regular itemsets found
        }
        Update W(i, j)
      }

Using the example data stream in Figure 1, we convert the first window into the vertical database format and then calculate the periods of X as explained in the problem definition.
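The conversion step and the filtering of single items can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `to_vertical` and `regular_items` are hypothetical, items are assumed to be single characters, and the window is assumed to be a list of (tid, items) pairs with tids numbered from 1.

```python
from collections import defaultdict

# First window of the example stream in Figure 1 (tid 1 to tid 8).
WINDOW_1 = [
    (1, "acef"), (2, "bcf"), (3, "bcf"), (4, "cde"),
    (5, "abce"), (6, "cde"), (7, "acde"), (8, "cdef"),
]


def to_vertical(window):
    """Map each item to the tids of the transactions that contain it."""
    vertical = defaultdict(list)
    for tid, items in window:
        for item in items:
            vertical[item].append(tid)
    return dict(vertical)


def regularity(tids, window_size):
    """Largest period of a tid list, with boundaries at 0 and window_size."""
    boundaries = [0] + sorted(tids) + [window_size]
    return max(later - earlier for earlier, later in zip(boundaries, boundaries[1:]))


def regular_items(window, max_reg):
    """Keep only the items whose regularity does not exceed max_reg."""
    size = len(window)
    return {item: tids for item, tids in to_vertical(window).items()
            if regularity(tids, size) <= max_reg}


if __name__ == "__main__":
    for item, tids in sorted(regular_items(WINDOW_1, max_reg=3).items()):
        print(item, tids)
```

Items dropped at this stage never need to be combined into larger itemsets, which is where the vertical format saves work.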
  • 5.
Itemset   Tid-set                  P^X                      R
a         1, 5, 7                  1, 4, 2, 1               4
b         2, 3, 5                  2, 1, 2, 3               3
c         1, 2, 3, 4, 5, 6, 7, 8   1, 1, 1, 1, 1, 1, 1, 1   1
d         4, 6, 7, 8               4, 2, 1, 1               4
e         1, 4, 5, 6, 7, 8         1, 3, 1, 1, 1, 1         3
f         1, 2, 3, 8               1, 1, 1, 5               5

TABLE 1: VDSRP-table with P^X and R

Finding periods is a simple procedure in our VDSRP-method. We create an individual array for every itemset and find all the periods of each itemset by subtraction, i.e., $p^X = t_{j+1}^X - t_j^X$. Among the periods obtained from the tid-set of each itemset in the window, we take the maximum. For example, in Table 1 the periods of {a} are (1, 4, 2, 1). Its regularity is 4, the maximum of these periods, which is compared with the user-given regularity threshold λ = 3. Since 4 > 3, {a} is not a regular itemset.

Itemset     Tid-set            P^X                R
(b, c)      2, 3, 5            2, 1, 2, 3         3
(b, e)      5                  5, 3               5
(c, e)      1, 4, 5, 6, 7, 8   1, 3, 1, 1, 1, 1   3
(b, c, e)   5                  5, 3               5

TABLE 2: VDSRP-table with P^X and R

After obtaining the regular 1-itemsets, we proceed to the 2-itemsets, as shown in Table 2. The regular patterns in W1 satisfy the downward closure property [5], i.e., if a pattern is found to be regular then all of its non-empty subsets are regular, and if a pattern is not regular then none of its supersets can be regular. We therefore consider only the itemsets found to be regular when generating (k+1)-itemsets. From Table 2 we can see that (b, c) and (c, e) are the regular 2-itemsets. We continue mining with the same procedure until no further regular itemset is generated in the window.

Itemset   Tid-set                  P^X                      R
a         4, 6, 8                  4, 2, 2                  4
b         1, 2, 4                  1, 1, 2, 4               4
c         1, 2, 3, 4, 5, 6, 7, 8   1, 1, 1, 1, 1, 1, 1, 1   1
d         3, 5, 6, 7               3, 2, 1, 1, 1            3
e         3, 4, 5, 6, 7            3, 1, 1, 1, 1, 1         3
f         1, 2, 7                  1, 1, 5, 1               5

TABLE 3: VDSRP-table with P^X and R from W2

Generally, the regularity of patterns may change as the window slides, i.e., with the deletion of old transactions and the insertion of new ones.
For example, the patterns {b} and {b, c}, which are regular in W1 (Table 1 and Table 2), become irregular in W2 because their regularity is greater than max_reg. Conversely, the irregular patterns {d} and {c, d, e} in W1 become regular in W2. Therefore, to reflect the correct regularity of each item in the current window, we perform a refreshing operation on the VDSRP-table for each new window. The process continues to W2, W3 and so on, to find the latest regular patterns from the data stream.
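The level-wise "and" step and the effect of sliding the window can be sketched end to end in Python. This is an illustrative sketch, not the authors' code: `mine_window` and `mine_stream` are hypothetical names, items are assumed to be single characters, tids are renumbered from 1 inside each window, and every window is re-mined from scratch instead of refreshing an existing VDSRP-table incrementally.

```python
from itertools import combinations

# Example stream of Figure 1; list position + 1 is the tid.
STREAM = ["acef", "bcf", "bcf", "cde", "abce", "cde", "acde", "cdef", "ac"]


def regularity(tids, window_size):
    """Largest period of a tid set, with boundaries at 0 and window_size."""
    boundaries = [0] + sorted(tids) + [window_size]
    return max(later - earlier for earlier, later in zip(boundaries, boundaries[1:]))


def mine_window(transactions, max_reg):
    """Return {pattern: tid set} for every regular pattern in one window."""
    size = len(transactions)
    # Vertical format: one tid set per single item.
    level = {}
    for tid, items in enumerate(transactions, start=1):
        for item in items:
            level.setdefault(frozenset(item), set()).add(tid)
    level = {p: t for p, t in level.items() if regularity(t, size) <= max_reg}
    result = {}
    while level:
        result.update(level)
        next_level = {}
        # Join pairs of regular k-itemsets that differ in exactly one item.
        for (p1, t1), (p2, t2) in combinations(level.items(), 2):
            candidate = p1 | p2
            if len(candidate) != len(p1) + 1 or candidate in next_level:
                continue
            tids = t1 & t2  # the "and" (intersection) of the two tid sets
            if regularity(tids, size) <= max_reg:
                next_level[candidate] = tids
        level = next_level
    return result


def mine_stream(stream, window_size, max_reg, slide_size=1):
    """Yield the regular patterns of each successive window of the stream."""
    for start in range(0, len(stream) - window_size + 1, slide_size):
        yield mine_window(stream[start:start + window_size], max_reg)


if __name__ == "__main__":
    for number, patterns in enumerate(mine_stream(STREAM, 8, 3), start=1):
        names = sorted("".join(sorted(p)) for p in patterns)
        print(f"W{number}:", names)
```

Because only regular k-itemsets are joined, the downward closure property prunes the search: an itemset containing an irregular item is never formed.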
  • 6.
Itemset     Tid-set         P^X                R
(c, d)      3, 5, 6, 7      3, 2, 1, 1, 1      3
(c, e)      3, 4, 5, 6, 7   3, 1, 1, 1, 1, 1   3
(d, e)      3, 5, 6, 7      3, 2, 1, 1, 1      3
(c, d, e)   3, 5, 6, 7      3, 2, 1, 1, 1      3

TABLE 4: VDSRP-table with P^X and R from W2

5. EXPERIMENT RESULTS

In this section we present our results. All programs were written in VC++ 6.0 and executed on Windows XP on a 2.66 GHz machine with 2 GB of main memory. We ran our VDSRP-method over several synthetic and real datasets that are frequently used in frequent pattern mining experiments, and we present our results in comparison with the existing RPS-tree. The detailed characteristics of the datasets, which were obtained from [15], are given in Table 5.

Dataset      #Trans    #Items   MaxTL   AvgTL   Type
Kosarak      990,000   41,270   673     8.10    Real
Mushroom     8,124     119      23      23      Real
T10I4D100K   100,759   870      30      10.10   Synthetic

TABLE 5: Database Characteristics

Table 5 shows some statistical information about the datasets. We use slide_size = 1 for all the experiments. We report results on the Kosarak dataset, which contains 990K transactions and 41,270 items with an average transaction length of 8.10. We also report on the T10I4D100K dataset, which contains 100,759 transactions and 870 items with an average transaction length of 10.10, and on the Mushroom dataset, which contains 8,124 transactions and 119 items with an average transaction length of 23.

5.1 Memory Efficiency
The memory requirements of our VDSRP-method on different datasets with different window sizes are shown in Table 6. For example, on the Kosarak dataset the method requires 3.57 MB on average when the window size is 100K and 16.83 MB on average when the window size is 500K. Hence, from Table 6 it can be observed that the VDSRP-method is memory efficient on different real and synthetic datasets.

Kosarak
W1 (100K)   W2 (300K)   W3 (500K)   W4 (700K)   W5 (900K)
3.57 MB     10.71 MB    16.83 MB    24.96 MB    32.1 MB

Mushroom
W1 (1K)     W2 (3K)     W3 (5K)     W4 (7K)     W5 (8K)
0.7 MB      0.14 MB     0.31 MB     0.48 MB     0.56 MB

T10I4D100K
W1 (20K)    W2 (40K)    W3 (60K)    W4 (80K)    W5 (100K)
0.82 MB     1.74 MB     2.57 MB     3.31 MB     4.12 MB

TABLE 6: Memory Requirement for different window sizes

5.2 Runtime Efficiency
From Figures 2(a) and 2(b) we can see that our proposed method runs faster than the RPS-tree under various regularity thresholds and with different window sizes, respectively. We conducted experiments on the Kosarak dataset with window size 500K and different max_reg (%) values. In Figure 2(a) the x-axis shows the different regularity threshold values and the y-axis shows the average total time, which includes both the time taken to convert the data into vertical format and the mining time. Our proposed method takes 38 seconds on average when max_reg is 0.04%. As the max_reg value increases, the execution time needed to mine regular patterns from the window also increases.
  • 7. FIGURE 2 (a): On Kosarak (|W| = 500K) [plot of Time (sec) against max_reg (%)]

In Figure 2(b), the x-axis shows different window sizes and the y-axis shows the average time in seconds taken to mine regular patterns at max_reg = 0.06%. Our proposed method takes less time on average than the RPS-tree for the different window sizes. For example, when the window size is 300K the average time taken is only 45 seconds.

FIGURE 2 (b): On Kosarak (max_reg = 0.06%) [plot of Time (sec) against window size (K)]

6. CONCLUSION

In this paper we presented the VDSRP-method, which performs better than the existing RPS-tree algorithm because it combines the sliding-window technique with the advantages of the vertical database format. The method is simple to use and relies on simple operations on arrays, such as union, intersection and deletion, to find regular patterns over data streams. In our experiments it outperforms the RPS-tree in execution time and is efficient in memory consumption.

7. REFERENCES

[1] S.K. Tanbeer, C.F. Ahmed, B.-S. Jeong, Y.-K. Lee. "Sliding Window-based Frequent Pattern Mining over Data Streams." Information Sciences, 179, 2009, pp. 3843-3865.
  • 8. [2] C.K.-S. Leung, Q.I. Khan. "DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams." In: ICDM, 2006, pp. 928-932.

[3] H.-F. Li, S.-Y. Lee. "Mining Frequent Itemsets over Data Streams Using Efficient Window Sliding Techniques." Expert Systems with Applications, 36, 2009, pp. 1466-1477.

[4] J. Han, J. Pei, Y. Yin. "Mining Frequent Patterns without Candidate Generation." In Proc. ACM SIGMOD International Conference on Management of Data, 2000, pp. 1-12.

[5] R. Agrawal, R. Srikant. "Fast Algorithms for Mining Association Rules in Large Databases." In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB '94), Santiago, Chile, Sept. 1994, pp. 487-499.

[6] S.K. Tanbeer, C.F. Ahmed, B.S. Jeong, Y.K. Lee. "Mining Regular Patterns in Transactional Databases." IEICE Trans. on Information Systems, E91-D, 11, 2008, pp. 2568-2577.

[7] S.K. Tanbeer, C.F. Ahmed, B.S. Jeong. "Mining Regular Patterns in Data Streams." In: DASFAA, Volume 5981 of LNCS, Springer, 2010, pp. 399-413.

[8] J. Han, M. Kamber. "Data Mining: Concepts and Techniques", 2nd ed. Morgan Kaufmann Publishers, an imprint of Elsevier, 2006, pp. 468-489.

[9] G. Yi-ming, W. Zhi-jun. "A Vertical Format Algorithm for Mining Frequent Item Sets." IEEE Transactions, pp. 11-13, 2010.

[10] M.J. Zaki, K. Gouda. "Fast Vertical Mining Using Diffsets." In: SIGKDD '03, ACM, August 24-27, 2003.

[11] G. Vijay Kumar, M. Sreedevi, NVS. Pavan Kumar. "Mining Regular Patterns in Transactional Databases using Vertical Format." International Journal of Advanced Research in Computer Science, vol. 2, pp. 581-583, Sep-Oct 2011.

[12] M.G. Elfeky, W.G. Aref, A.K. Elmagarmid. "Periodicity Detection in Time Series Databases." IEEE Transactions on Knowledge and Data Engineering, 17(7), 2005, pp. 875-887.

[13] G. Lee, W. Yang, J-M Lee. "A Parallel Algorithm for Mining Partial Periodic Patterns." Information Sciences, 176, 2006, pp. 3591-3609.

[14] B. Ozden, S. Ramaswamy, A. Silberschatz. "Cyclic Association Rules." In: 14th International Conference on Data Engineering, 1998, pp. 412-421.

[15] Frequent Itemset Mining Dataset Repository, http://fimi.cs.helsinki.fi/data/, and UCI Machine Learning Repository (University of California).