Frequent Item Mining
What is data mining?
• Pattern Mining
• What patterns
• Why are they useful
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or
equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
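The definitions above can be sketched in a few lines of Python. This is a minimal illustration on the five-transaction table; the function names `support_count`, `support`, and `is_frequent` are chosen here for illustration and do not come from the slides.

```python
# The five market-basket transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Support count: number of transactions containing every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Support: fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its support is at least `minsup`."""
    return support(itemset, transactions) >= minsup


print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```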
Frequent Itemsets Mining
TID Transactions
100 { A, B, E }
200 { B, D }
300 { A, B, E }
400 { A, C }
500 { B, C }
600 { A, C }
700 { A, B }
800 { A, B, C, E }
900 { A, B, C }
1000 { A, C, E }
• Minimum support level: 50%
– Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
• How to link this to Data
Cube?
Three Different Views of FIM
• Transactional Database
– How do we store a transactional
database?
• Horizontal, Vertical, Transaction-Item
Pair
• Binary Matrix
• Bipartite Graph
• How is FIM formulated in
these different settings?
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
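As a rough sketch of how the same database looks in each view, the helpers below convert the horizontal layout into a vertical layout, a list of transaction-item pairs (which can also be read as the edges of a bipartite graph between transactions and items), and a binary matrix. The helper names are assumptions made for this example.

```python
from collections import defaultdict

# Horizontal layout: TID -> set of items (the table above).
horizontal = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}


def to_vertical(horizontal):
    """Vertical layout: item -> set of TIDs containing it."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def to_pairs(horizontal):
    """Transaction-item pairs; each pair is also an edge of the bipartite graph."""
    return sorted((tid, item) for tid, items in horizontal.items() for item in items)


def to_binary_matrix(horizontal):
    """Binary matrix: one row per transaction, one column per item."""
    items = sorted({item for its in horizontal.values() for item in its})
    rows = [[1 if item in horizontal[tid] else 0 for item in items]
            for tid in sorted(horizontal)]
    return items, rows


vertical = to_vertical(horizontal)
pairs = to_pairs(horizontal)
columns, matrix = to_binary_matrix(horizontal)
```

In the matrix view an itemset is a set of columns and its support count is the number of rows with a 1 in all of them; in the vertical view it is the size of the intersection of the items' TID sets.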
Frequent Itemset Generation
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since M = 2^d !!!
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
[Figure: N transactions are matched against a list of M candidates; w is the maximum transaction width]
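The brute-force approach can be sketched as follows. It enumerates every non-empty itemset of the lattice and scans the database once per candidate, so it is only workable for toy inputs. The function name and the use of an absolute count threshold (`min_count`) are choices made for this sketch.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def brute_force_frequent(transactions, min_count):
    """Count the support of every non-empty itemset; return the frequent ones."""
    items = sorted(set().union(*transactions))
    frequent = {}
    n_candidates = 0
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            n_candidates += 1
            # One pass over the database per candidate: N * M subset tests.
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                frequent[candidate] = count
    return frequent, n_candidates


frequent, n_candidates = brute_force_frequent(transactions, min_count=3)
print(n_candidates)  # 63 = 2^6 - 1 non-empty candidates for d = 6 items
```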
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
• Apriori principle holds due to the following property
of the support measure:
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
$$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \ge s(Y)$$
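The anti-monotone property can be checked empirically on a small database. The sketch below compares every itemset Y with its immediate subsets X (one item removed); by transitivity this covers all subsets. It is an illustration on the market-basket table, not a proof, and the function names are assumptions.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def check_anti_monotone(transactions):
    """Return True if s(X) >= s(Y) for every itemset Y and immediate subset X."""
    items = sorted(set().union(*transactions))
    for k in range(1, len(items) + 1):
        for y in combinations(items, k):
            s_y = support_count(y, transactions)
            for x in combinations(y, k - 1):
                if support_count(x, transactions) < s_y:
                    return False
    return True


holds = check_anti_monotone(transactions)
```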
Illustrating Apriori Principle
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
[Figure: once an itemset is found to be infrequent, all of its supersets in the lattice are pruned]
Illustrating Apriori Principle
Minimum Support = 3
Items (1-itemsets)
Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1
Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Itemset Count
{Bread,Milk} 3
{Bread,Beer} 2
{Bread,Diaper} 3
{Milk,Beer} 2
{Milk,Diaper} 3
{Beer,Diaper} 3
Triplets (3-itemsets)
Itemset Count
{Bread,Milk,Diaper} 2
If every subset is considered,
6C1 + 6C2 + 6C3 = 41 candidates
With support-based pruning,
6 + 6 + 1 = 13 candidates
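The candidate counts above can be reproduced with a small level-wise loop. This is a simplified sketch of the Apriori idea and not the implementation from the cited paper: candidates are generated by taking unions of frequent (k-1)-itemsets and keeping only those whose (k-1)-subsets are all frequent. The function name and the returned `candidates_per_level` list are assumptions for this example.

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(transactions, min_count):
    """Level-wise search; also records how many candidates each level counted."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset([item]) for item in items]
    frequent, candidates_per_level = {}, []
    k = 1
    while candidates:
        candidates_per_level.append(len(candidates))
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop a candidate if any of its (k-1)-subsets is not frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent, candidates_per_level


frequent, candidates_per_level = apriori(transactions, min_count=3)
exhaustive = sum(comb(6, k) for k in (1, 2, 3))
print(candidates_per_level, sum(candidates_per_level), exhaustive)  # [6, 6, 1] 13 41
```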
Apriori
R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. VLDB, 487-499, 1994
Feequent Item Mining - Data Mining - Pattern Mining
How to Generate Candidates?
• Suppose the items in L_{k-1} are listed in an order
• Step 1: self-joining L_{k-1}
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
• Step 2: pruning
    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k
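The two steps above translate fairly directly into Python if each itemset is stored as a sorted tuple. The function name `apriori_gen` and the tuple representation are assumptions for this sketch; the example input at the bottom is made up for illustration.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Generate C_k from L_{k-1}, where each itemset is a sorted tuple of items."""
    prev = sorted(set(prev_frequent))
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Step 1 (self-join): equal on the first k-2 items, p's last item < q's.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2 (prune): every (k-1)-subset of c must be in L_{k-1}.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates


# Hypothetical L_3 used only to exercise both steps.
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
C4 = apriori_gen(L3)
print(C4)  # [('a', 'b', 'c', 'd')]
```

Here the join produces abcd (from abc and abd) and acde (from acd and ace); acde is pruned because its subset ade is not in L_3.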
Challenges of Frequent Itemset Mining
• Challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates
Alternative Methods for Frequent Itemset
Generation
• Representation of Database
– horizontal vs vertical data layout
Horizontal Data Layout
TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B
Vertical Data Layout
Item TIDs
A 1, 4, 5, 6, 7, 8, 9
B 1, 2, 5, 7, 8, 10
C 2, 3, 4, 5, 8, 9
D 2, 4, 5, 9
E 1, 3, 6
ECLAT
• For each item, store a list of transaction ids
(tids)
ECLAT
• Determine support of any k-itemset by intersecting tid-lists of
two of its (k-1) subsets.
• 3 traversal approaches:
– top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for
memory
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
AB = A ∩ B: 1, 5, 7, 8
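A minimal depth-first sketch of the tid-list idea follows. It derives the vertical layout from the horizontal table shown earlier and extends each prefix by intersecting tid-sets. It is an illustration of the intersection step under these assumptions, not Zaki's full ECLAT; the function name and the absolute threshold `min_count` are choices made here.

```python
from collections import defaultdict

# Horizontal layout from the ECLAT example.
horizontal = {
    1: "ABE", 2: "BCD", 3: "CE", 4: "ACD", 5: "ABCD",
    6: "AE", 7: "AB", 8: "ABC", 9: "ACD", 10: "B",
}

# Vertical layout: item -> set of TIDs.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)


def eclat(prefix, items, min_count, out):
    """items: list of (item, tidset) pairs, each tidset already restricted to `prefix`."""
    for i, (item, tids) in enumerate(items):
        if len(tids) < min_count:
            continue
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # the support count is just the tid-set size
        # Extend the itemset: intersect its tid-set with those of the later items.
        suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
        eclat(itemset, suffix, min_count, out)
    return out


frequent = eclat((), sorted(vertical.items()), min_count=3, out={})
print(sorted(vertical["A"] & vertical["B"]))  # [1, 5, 7, 8]
```

No database scan is needed after the vertical layout is built, which is where the fast support counting comes from; the cost is that every recursion level keeps its own intermediate tid-sets in memory.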
FP-growth Algorithm
• Use a compressed representation of the
database using an FP-tree
• Once an FP-tree has been constructed, it uses
a recursive divide-and-conquer approach to
mine the frequent itemsets
FP-tree construction
TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}
After reading TID=1:
null
  A:1
    B:1
After reading TID=2:
null
  A:1
    B:1
  B:1
    C:1
      D:1
FP-Tree Construction
FP-tree after reading all ten transactions (children are indented under their parent):
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1
Header table: one pointer per item (A, B, C, D, E), linking all tree nodes that carry that item
Pointers are used to assist frequent itemset generation
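The construction can be sketched as below. Two simplifications are assumed: items are inserted in the fixed order A, B, C, D, E used on the slides (FP-growth implementations typically sort by decreasing support and drop infrequent items first), and the header table is a plain dictionary of node lists rather than linked pointers. The class and function names are illustrative.

```python
class FPNode:
    """One FP-tree node: an item, a count, a parent link and child links."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, order):
    root = FPNode(None, None)
    header = {item: [] for item in order}  # item -> nodes carrying that item
    rank = {item: i for i, item in enumerate(order)}
    for t in transactions:
        node = root
        # Insert the transaction as a path, sharing existing prefixes.
        for item in sorted(t, key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


transactions = [set("AB"), set("BCD"), set("ACDE"), set("ADE"), set("ABC"),
                set("ABCD"), set("BC"), set("ABC"), set("ABD"), set("BCE")]
root, header = build_fp_tree(transactions, order="ABCDE")
print(root.children["A"].count, root.children["B"].count)  # 7 3
```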
FP-growth
Prefix paths ending in D:
null → A → B → C → D:1
null → A → B → D:1
null → A → C → D:1
null → A → D:1
null → B → C → D:1
Conditional pattern base for D:
P = {(A:1,B:1,C:1),
(A:1,B:1),
(A:1,C:1),
(A:1),
(B:1,C:1)}
Recursively apply FP-growth on P
Frequent itemsets found (with sup > 1):
AD, BD, CD, ABD, ACD, BCD
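The conditional pattern base for D can be reproduced with a shortcut: instead of walking the tree's node links, the sketch below takes, for each transaction containing D, the items that precede D in the order A, B, C, D, E. It then counts sub-patterns by plain enumeration, which stands in for the recursive FP-growth step and is only reasonable for tiny examples. Function names are assumptions for this sketch.

```python
from collections import Counter
from itertools import combinations

transactions = [set("AB"), set("BCD"), set("ACDE"), set("ADE"), set("ABC"),
                set("ABCD"), set("BC"), set("ABC"), set("ABD"), set("BCE")]


def conditional_pattern_base(transactions, order, item):
    """Prefix paths (with counts) that lead to `item` in the FP-tree."""
    rank = {x: i for i, x in enumerate(order)}
    base = Counter()
    for t in transactions:
        if item in t:
            prefix = tuple(sorted((x for x in t if rank[x] < rank[item]),
                                  key=rank.__getitem__))
            base[prefix] += 1
    return base


def frequent_with_suffix(base, item, min_count):
    """Enumerate sub-patterns of the prefix paths (a stand-in for the recursion)."""
    counts = Counter()
    for prefix, n in base.items():
        for k in range(1, len(prefix) + 1):
            for combo in combinations(prefix, k):
                counts[combo + (item,)] += n
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


base = conditional_pattern_base(transactions, "ABCDE", "D")
found = frequent_with_suffix(base, "D", min_count=2)
```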
Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support
as their supersets
• Number of frequent itemsets
• Need a compact representation
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k}$
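The count can be evaluated directly. The sketch assumes a minimum support count of at most 5, so that every non-empty subset of one item group (A1..A10, B1..B10 or C1..C10) is frequent and no itemset mixing groups is.

```python
from math import comb

groups = 3            # item groups A, B and C
items_per_group = 10  # A1..A10, B1..B10, C1..C10

# Every non-empty subset of a single group is frequent (assumed minsup count <= 5).
total = groups * sum(comb(items_per_group, k) for k in range(1, items_per_group + 1))

# Closed form: the binomial coefficients for k = 1..10 sum to 2^10 - 1.
assert total == groups * (2 ** items_per_group - 1)
print(total)  # 3069
```

Only three of these itemsets, the full groups {A1..A10}, {B1..B10} and {C1..C10}, are needed to recover all the others, which is the motivation for a compact representation.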
Maximal Frequent Itemset
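null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
[Figure: a border in the lattice separates the frequent itemsets from the infrequent itemsets; the maximal itemsets are the frequent itemsets lying directly next to the border]
An itemset is maximal frequent if it is frequent and none of its immediate supersets
is frequent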
Closed Itemset
• An itemset is closed if none of its immediate supersets has the
same support as the itemset
TID Items
1 {A,B}
2 {B,C,D}
3 {A,B,C,D}
4 {A,B,D}
5 {A,B,C,D}
Itemset Support
{A} 4
{B} 5
{C} 3
{D} 4
{A,B} 4
{A,C} 2
{A,D} 3
{B,C} 3
{B,D} 4
{C,D} 3
Itemset Support
{A,B,C} 2
{A,B,D} 3
{A,C,D} 2
{B,C,D} 3
{A,B,C,D} 2
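The closed itemsets of this five-transaction example can be found by brute force. The sketch compares each itemset with all of its supersets rather than only the immediate ones; by the anti-monotone property of support the two checks give the same answer. Function names are illustrative.

```python
from itertools import combinations

transactions = [set("AB"), set("BCD"), set("ABCD"), set("ABD"), set("ABCD")]


def all_supports(transactions):
    """Support count of every itemset that occurs in at least one transaction."""
    items = sorted(set().union(*transactions))
    supports = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            n = sum(1 for t in transactions if set(c) <= t)
            if n > 0:
                supports[frozenset(c)] = n
    return supports


def closed_itemsets(supports):
    """Keep an itemset only if no proper superset has the same support."""
    return {s: n for s, n in supports.items()
            if not any(s < other and supports[other] == n for other in supports)}


supports = all_supports(transactions)
closed = closed_itemsets(supports)
print(sorted("".join(sorted(s)) for s in closed))
# ['AB', 'ABCD', 'ABD', 'B', 'BCD', 'BD']
```

For instance {A} is not closed because {A,B} has the same support (4), while {A,B} is closed because its supersets {A,B,C} and {A,B,D} have supports 2 and 3.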
Maximal vs Closed Itemsets
TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE
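Itemset lattice, each itemset followed by the IDs of the transactions that support it:
null
A {1,2,4}  B {1,2,3}  C {1,2,3,4}  D {2,4,5}  E {3,4,5}
AB {1,2}  AC {1,2,4}  AD {2,4}  AE {4}  BC {1,2,3}  BD {2}  BE {3}  CD {2,4}  CE {3,4}  DE {4,5}
ABC {1,2}  ABD {2}  ACD {2,4}  ACE {4}  ADE {4}  BCD {2}  BCE {3}  CDE {4}
ABCD {2}  ACDE {4}
Not supported by any transactions: ABE, BDE, ABCE, ABDE, BCDE, ABCDE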
Maximal vs Closed Frequent Itemsets
Minimum support = 2
# Closed = 9
# Maximal = 4
Closed and maximal: CE, DE, ABC, ACD
Closed but not maximal: C, D, E, AC, BC
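These counts can be verified by brute force on the five transactions ABC, ABCD, BCE, ACDE and DE. The sketch below enumerates the frequent itemsets at a minimum support count of 2 and then filters them; the function names are assumptions made for this example.

```python
from itertools import combinations

transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]


def frequent_itemsets(transactions, min_count):
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            n = sum(1 for t in transactions if set(c) <= t)
            if n >= min_count:
                frequent[frozenset(c)] = n
    return frequent


def closed_and_maximal(frequent):
    # Closed: no proper superset with the same support. Such a superset would
    # itself be frequent, so searching among the frequent itemsets is enough.
    closed = {s for s, n in frequent.items()
              if not any(s < o and frequent[o] == n for o in frequent)}
    # Maximal: no proper superset is frequent at all.
    maximal = {s for s in frequent if not any(s < o for o in frequent)}
    return closed, maximal


frequent = frequent_itemsets(transactions, min_count=2)
closed, maximal = closed_and_maximal(frequent)
print(len(frequent), len(closed), len(maximal))  # 14 9 4
```

Every maximal frequent itemset is also closed, so the maximal set is always contained in the closed set.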
Maximal vs Closed Itemsets
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
Beyond Itemsets
• Sequence Mining
– Finding frequent subsequences from a collection of sequences
• Graph Mining
– Finding frequent (connected) subgraphs from a collection of
graphs
• Tree Mining
– Finding frequent (embedded) subtrees from a set of
trees/graphs
• Geometric Structure Mining
– Finding frequent substructures from 3-D or 2-D geometric
graphs
• Among others…
Frequent Pattern Mining
[Figure: a collection of graphs with node labels A to F]
Why Is Frequent Pattern Mining So
Important?
• Application Domains
– Business, biology, chemistry, WWW, computer/networking security, …
• Summarizing the underlying datasets, providing key insights
• Basic tools for other data mining tasks
– Association rule mining
– Classification
– Clustering
– Change Detection
– etc…
Network motifs: recurring patterns that
occur significantly more than in
randomized nets
• Do motifs have specific roles in the network?
• Many possible distinct subgraphs
The 13 three-node connected
subgraphs
199 4-node directed connected subgraphs
And it grows fast for larger subgraphs : 9364 5-node subgraphs,
1,530,843 6-node…
Finding network motifs –
an overview
• Generation of a suitable random ensemble (reference
networks)
• Network motifs detection process:
– Count how many times each subgraph appears
– Compute the statistical significance of each subgraph: the probability of it
appearing in random networks as often as in the real network
(P-value or Z-score)
Example, measured against an ensemble of random networks:
Real = 5, Rand = 0.5 ± 0.6, Z-score (number of standard deviations) = 7.5
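The Z-score in this example is the distance between the real count and the mean count in the random ensemble, measured in standard deviations of the ensemble. A minimal sketch using only the numbers quoted on the slide (the function name is an assumption):

```python
def z_score(real_count, random_mean, random_std):
    """Number of standard deviations by which the real count exceeds the random mean."""
    return (real_count - random_mean) / random_std


# Real = 5 occurrences; random ensemble: mean 0.5, standard deviation 0.6.
z = z_score(5, 0.5, 0.6)
print(round(z, 2))  # 7.5
```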
References
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD, 207-216, 1993.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994.
• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD, 85-93, 1998.
• Christian Borgelt. Efficient Implementations of Apriori and Eclat. FIMI’03.
• Ferenc Bodon. A fast APRIORI implementation. FIMI’03.
• Ferenc Bodon. A Survey on Frequent Itemset Mining. Technical Report, Budapest University of Technology and Economics, 2006.
Important websites:
• FIMI workshop
– Not only Apriori and FIM
• FP-tree, ECLAT, Closed, Maximal
– http://fimi.cs.helsinki.fi/
• Christian Borgelt’s website
– http://www.borgelt.net/software.html
• Ferenc Bodon’s website
– http://www.cs.bme.hu/~bodon/en/apriori/
