PCY Algorithm in Big Data
Last Updated: 09 Jan, 2024
PCY was developed by Park, Chen, and Yu. It is used for frequent itemset mining when the dataset is very large.
What is the PCY Algorithm?
The PCY algorithm (Park-Chen-Yu algorithm) is a data mining algorithm used to find frequent itemsets in large datasets. It is an improvement over the Apriori algorithm and was first described in 1995 in the paper "An Effective Hash-Based Algorithm for Mining Association Rules" by Jong Soo Park, Ming-Syan Chen, and Philip S. Yu.
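The PCY algorithm uses hashing to reduce the number of candidate itemsets that have to be counted, which lowers the overall memory and computational cost. The basic idea is to hash each pair of items to a bucket during the first pass over the data and to keep one counter per bucket in a hash table. A pair can be frequent only if the count of its bucket reaches the support threshold, so pairs that fall into infrequent buckets are discarded before the second pass.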
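The two passes described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `pcy_frequent_pairs`, its parameters, and the product-based hash are assumptions made for this example, and items are assumed to be integers.

```python
from collections import Counter
from itertools import combinations


def pcy_frequent_pairs(transactions, min_support, num_buckets):
    """Two-pass PCY sketch for frequent pairs (hypothetical helper)."""

    # Assumed hash function for illustration: product of the two items.
    def bucket_of(pair):
        i, j = pair
        return (i * j) % num_buckets

    # Pass 1: count single items and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair)] += 1

    # Between passes: keep frequent items, compress buckets to a bit vector.
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    bitmap = [c >= min_support for c in bucket_counts]

    # Pass 2: count only pairs of frequent items in a frequent bucket.
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t) & frequent_items)
        for pair in combinations(items, 2):
            if bitmap[bucket_of(pair)]:
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

The bit vector only filters candidates. A pair in a frequent bucket may still be infrequent, which is why the second pass counts the surviving pairs exactly.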
Example problem solved using PCY algorithm
Problem:
Apply the PCY algorithm to the following transactions to find the candidate pairs (frequent pairs), using a minimum support threshold of 3 and the hash function (i * j) mod 10.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}
Approach:
Follow these steps to build the candidate table.
Step 1: Find the frequency of each item and remove every item whose frequency is below the threshold.
Step 2: Going through the transactions one by one, list every possible pair of items and write its frequency next to it. Note: Do not repeat pairs; skip any pair that has already been listed for an earlier transaction.
Step 3: List all pairs whose frequency is greater than or equal to the threshold and apply the hash function to each of them. The result is the bucket number, which determines the bucket the pair is placed in.
Step 4: In the last step, create a table with the following columns:
- Bit vector - 1 if the frequency of the candidate pair is greater than or equal to the threshold, otherwise 0. Because only pairs that passed step 3 are listed, it is 1 in most cases.
- Bucket number - the value found in the previous step.
- Highest support count - the frequency of the candidate pair, found in step 2.
- Pairs - the pair that was hashed to this bucket.
- Candidate set - if the bit vector is 1, the pair is written here as a candidate.
Solution:
Step 1: Find the frequency of each item and remove every item whose frequency is below the threshold. All six items reach the threshold of 3, so none are removed.
Items | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Frequency | 4 | 7 | 7 | 9 | 5 | 4 |
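As a quick check of step 1, the item frequencies can be counted directly from the twelve transactions. The variable names below are illustrative only.

```python
from collections import Counter

# The twelve transactions from the problem statement.
transactions = [
    (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6),
    (1, 3, 5), (2, 4, 6), (1, 3, 4), (2, 4, 5),
    (3, 4, 6), (1, 2, 4), (2, 3, 5), (2, 4, 6),
]
MIN_SUPPORT = 3

# Count how many transactions contain each item.
item_counts = Counter(item for t in transactions for item in t)

# Items that meet the threshold survive step 1.
frequent_items = sorted(i for i, c in item_counts.items() if c >= MIN_SUPPORT)

for item in sorted(item_counts):
    print(item, item_counts[item])
print("frequent items:", frequent_items)
```

Since every transaction holds three items, the counts must add up to 36, which is a convenient sanity check on a hand-built frequency table.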
Step 2: Going through the transactions one by one, list every possible pair and write its frequency next to it. A pair is listed only for the first transaction in which it appears.
Transaction | New pairs | Frequencies |
---|---|---|
T1 | {(1, 2), (1, 3), (2, 3)} | 2, 3, 3 |
T2 | {(2, 4), (3, 4)} | 5, 4 |
T3 | {(3, 5), (4, 5)} | 3, 3 |
T4 | {(4, 6), (5, 6)} | 4, 1 |
T5 | {(1, 5)} | 1 |
T6 | {(2, 6)} | 2 |
T7 | {(1, 4)} | 2 |
T8 | {(2, 5)} | 2 |
T9 | {(3, 6)} | 1 |
T10 | - | |
T11 | - | |
T12 | - | |
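The pair frequencies of step 2 can be reproduced with a short script. This is a sketch with illustrative variable names; it relies on `Counter` keeping keys in first-seen order, so the pairs print in the order in which they first appear in the transactions.

```python
from collections import Counter
from itertools import combinations

# The twelve transactions from the problem statement.
transactions = [
    (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6),
    (1, 3, 5), (2, 4, 6), (1, 3, 4), (2, 4, 5),
    (3, 4, 6), (1, 2, 4), (2, 3, 5), (2, 4, 6),
]

# Every transaction of three items contributes three pairs.
pair_counts = Counter()
for t in transactions:
    pair_counts.update(combinations(sorted(t), 2))

for pair, count in pair_counts.items():
    print(pair, count)
```

Each transaction yields three pairs, so the frequencies must sum to 36 across all listed pairs.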
Step 3: List all pairs whose frequency is greater than or equal to the threshold and apply the hash function to each of them. (It gives us the bucket number.)
Hash Function = (i * j) mod 10
(1, 3) = (1 * 3) mod 10 = 3
(2, 3) = (2 * 3) mod 10 = 6
(2, 4) = (2 * 4) mod 10 = 8
(3, 4) = (3 * 4) mod 10 = 2
(3, 5) = (3 * 5) mod 10 = 5
(4, 5) = (4 * 5) mod 10 = 0
(4, 6) = (4 * 6) mod 10 = 4
Bucket No. | Pair |
---|---|
0 | (4, 5) |
2 | (3, 4) |
3 | (1, 3) |
4 | (4, 6) |
5 | (3, 5) |
6 | (2, 3) |
8 | (2, 4) |
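The bucket assignment of step 3 can be sketched in a few lines. The helper name `bucket_of` and its `num_buckets` parameter are assumptions for illustration; the list of pairs is copied from the hash computations above.

```python
# Pairs that reached the threshold in step 2.
frequent_pairs = [(1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)]


def bucket_of(i, j, num_buckets=10):
    """Hash function from the problem statement: (i * j) mod 10."""
    return (i * j) % num_buckets


# Group the pairs by the bucket they hash to.
buckets = {}
for i, j in frequent_pairs:
    buckets.setdefault(bucket_of(i, j), []).append((i, j))

for b in sorted(buckets):
    print(b, buckets[b])
```

With this data each frequent pair lands in its own bucket. In general several pairs can share a bucket, which is why the dictionary stores a list per bucket.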
Step 4: Prepare candidate set
Bit Vector | Bucket No. | Highest Support Count | Pairs | Candidate Set |
---|---|---|---|---|
1 | 0 | 3 | (4, 5) | (4, 5) |
1 | 2 | 4 | (3, 4) | (3, 4) |
1 | 3 | 3 | (1, 3) | (1, 3) |
1 | 4 | 4 | (4, 6) | (4, 6) |
1 | 5 | 3 | (3, 5) | (3, 5) |
1 | 6 | 3 | (2, 3) | (2, 3) |
1 | 8 | 5 | (2, 4) | (2, 4) |
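The candidate table of step 4 can be assembled programmatically as a final check. The sketch below follows the simplified procedure used in this example, where the bit is set from the support of the pair itself; the variable names and the row layout are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# The twelve transactions from the problem statement.
transactions = [
    (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6),
    (1, 3, 5), (2, 4, 6), (1, 3, 4), (2, 4, 5),
    (3, 4, 6), (1, 2, 4), (2, 3, 5), (2, 4, 6),
]
MIN_SUPPORT = 3

# Support count of every pair (step 2).
pair_counts = Counter()
for t in transactions:
    pair_counts.update(combinations(sorted(t), 2))

# One row per pair whose bit is 1: (bit, bucket, support, pair).
rows = []
for (i, j), support in pair_counts.items():
    bit = 1 if support >= MIN_SUPPORT else 0
    if bit:
        rows.append((bit, (i * j) % 10, support, (i, j)))
rows.sort(key=lambda r: r[1])  # order rows by bucket number

for row in rows:
    print(row)
```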