DOI: 10.23883/IJRTER.2017.3404.4SNDK
Examination of Document Similarity Using Rabin-Karp Algorithm
Ranti Eka Putri¹, Andysah Putera Utama Siahaan²
¹Faculty of Computer Science, Universitas Pembangunan Panca Budi, Medan, Indonesia
²Ph.D. Student, School of Computer and Communication Engineering, Universiti Malaysia Perlis, Kangar, Malaysia
Abstract — Documents do not always share the same content, but similarity between documents is common in scientific writing. Some similarities arise by coincidence, while others are deliberate. In documents with little content, such overlap can be checked by eye; in documents with thousands of lines and pages, manual checking is impossible. Anticipating it requires a method that can analyze the plagiarism techniques being used. Many methods can examine the resemblance of documents; one of them is the Rabin-Karp algorithm. The algorithm is well suited to the task because it works on fixed-length syllable cuts (K-Grams) and counts how many hash values the two documents have in common. The plagiarism threshold can also be adjusted, as a percentage, to the needs of the examination. Implementing this algorithm is beneficial for an institution that must filter incoming documents, which is usually done when a scientific paper is received for publication.
Keywords — Text Mining, Plagiarism, Rabin-Karp
I. INTRODUCTION
Information is essential in the world of education, especially scientific information. Because information can be accessed online, it is also easily modified. Files downloaded from the internet can be edited by their users; such a file can then be saved under a new name or even attributed to a new author. This happens quickly and requires no particular technique. The development of this technology has positive value, but along with the advancement of the era, its progress cannot be separated from the negative impact it produces. Modifying information without listing the original source is an action that violates the rules; such modification is plagiarism [1]. Plagiarism is an act of abuse, theft, or robbery of a publication: declaring thoughts, ideas, writings, or creations as one's own property when they are not the author's own.
Committing plagiarism is easy, especially with an internet connection. Plagiarism can kill one's creativity in developing new ideas. It is tempting because it can be done easily and quickly, requiring neither effort nor hard thinking. Plagiarism can be prevented with the help of string matching methods, and such algorithms can be adapted to analyze text, images, and even sound. This study performs document matching using the Rabin-Karp algorithm, which is known to be fast at comparing documents [3][4]. In addition, the parameters of the algorithm can be adjusted to the target to be achieved. The authors hope that by running this system, acts of plagiarism can be avoided.
II. THEORIES
2.1 Plagiarism
Information retrieval is the branch of computer science concerned with finding relevant documents, which are then processed together with other data. It is an information search based on a query that is expected to satisfy a stated goal. However, the documents returned may be the product of plagiarism. Plagiarism is the act of copying or claiming articles, opinions, papers, and so on that are not one's own; it turns another person's property into one's own without naming the source. The person doing it is called a plagiarist. Plagiarism amounts to a criminal act of falsifying the work of others and is also called copyright theft. For any quotation of words or ideas, the author must credit the original owner; likewise, a book may not be reprinted without the permission of its author or publisher [2].
In practice, plagiarism is not always intentional. Some cases arise from a lack of information or references when producing a scientific work. Below are the most common types of plagiarism:
- Accidental
It occurs because of a lack of knowledge about plagiarism and about how references should be written. It usually happens when a scientific paper is written without a literature review.
- Unintentional
Information that has frequently been discussed is rewritten in almost the same words. The same idea can produce different writing if it is expressed carefully, so this form of plagiarism can be avoided.
- Intentional
The act of deliberately quoting a sentence, or the whole, of another person's work without citing it.
- Self-plagiarism
The reuse of one's own previous work in another form without developing the values or variables present in the earlier work.
Plagiarism detectors are divided into two categories: fingerprinting and full-text comparison.
- Fingerprint Comparison
A technique that checks the relationship between documents using all the text they contain. It breaks the words of the paper into syllables or character sequences of a certain length, which are then hashed; this technique is called hashing. The most commonly used algorithm is Rabin-Karp.
- Full-text Comparison
This technique compares the contents of two documents directly, checking the text of each document piece by piece (a minimal sketch of one such comparison is given after this list). The downside is that it takes longer on large documents. However, the results are quite satisfactory because they can be stored in a database and reused. Full-text comparison cannot be applied to documents that are not in the same storage. Algorithms used in this approach include Brute-Force, Boyer-Moore, and Levenshtein Distance.
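As an illustration of the full-text comparison family, the following is a minimal sketch of the Levenshtein edit distance named above. It is a generic textbook implementation, not the paper's code; a real detector would compare entire documents and normalize the distance, which this sketch does not do.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))            # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))       # 3
```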
2.2 Rabin-Karp
The Rabin-Karp algorithm is a search algorithm that looks for a substring pattern in a text using hashing. It is very effective for multi-pattern matching [5][7]. One of the practical applications of the Rabin-Karp algorithm is plagiarism detection. Rabin-Karp relies on a hash function to determine the percentage of plagiarism, and the accuracy level can be adjusted through this function. The hash function assigns a value to each syllable fragment: it converts every string into a number, called a hash value. The Rabin-Karp algorithm gives identical words identical hash values (Figure 1) [6]. There are two obstacles in determining hash values. First, there are many distinct strings in a given text; this is handled by letting several strings share the same hash value. Second, strings with the same hash value do not necessarily match; to resolve this, each candidate string is verified character by character with a brute-force check. Rabin-Karp requires a large prime number as the base to reduce the chance that different words produce the same hash value.
Figure 1. Rabin-Karp hash example
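To make the description above concrete, here is a minimal sketch of Rabin-Karp substring search with a rolling hash. The base, modulus, and function name are illustrative assumptions rather than the paper's implementation; the character-by-character check corresponds to the brute-force verification mentioned above.

```python
def rabin_karp_search(text, pattern, base=256, mod=1_000_003):
    """Return the start indices where pattern occurs in text (illustrative sketch)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    high = pow(base, m - 1, mod)              # weight of the window's leading character
    p_hash = w_hash = 0
    for i in range(m):                        # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod

    matches = []
    for i in range(n - m + 1):
        # Equal hashes only suggest a match; verify character by character.
        if w_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches


print(rabin_karp_search("MEDAN MEDIA MEDAN", "MEDAN"))   # [0, 12]
```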
III. IMPLEMENTATION
3.1 Rabin-Karp Process
This stage performs semantic and syntactic analysis of the text. The purpose of the initial processing is to prepare the text for the steps that follow. The operations at this stage include removing the parts of the text that are not needed for testing, so that only data eligible for execution remains. Filtering is a classification step that determines which words will be used when searching for common words; each sentence is broken into words, and words that carry no useful information are discarded. The document index is the set of terms that indicates the content or topic of a document. It is usually divided according to need, and it distinguishes a document from the other documents in the collection.
The steps in the Rabin-Karp process are as follows (a sketch of the pipeline is given after this list):
- Tokenizing
Converts a document into a collection of words by placing the words in an array and stripping out punctuation and numbers that are not part of the important words. This step also converts the text to lower case.
- Stopword Removal
Removes common words that appear in every document, such as: because, with, and, or, not, and others.
- Stemming
Reduces words that still carry a prefix or suffix to their base form.
- Hashing
Weights each word in a document with a value computed by a predetermined formula.
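A minimal sketch of this preprocessing pipeline is shown below. The stopword list is only an illustrative subset, the stemming step is omitted, and the hash parameters are arbitrary choices; a real system would use a full stopword list and a proper stemmer for the target language.

```python
import re

STOPWORDS = {"because", "with", "and", "or", "not"}   # illustrative subset only


def tokenize(document):
    """Lower-case the text and keep alphabetic words, dropping punctuation and numbers."""
    return re.findall(r"[a-z]+", document.lower())


def remove_stopwords(words):
    return [w for w in words if w not in STOPWORDS]


def hash_word(word, base=7, mod=2**31 - 1):
    """Weight a word as a polynomial in a prime base over its ASCII codes."""
    value = 0
    for ch in word:
        value = (value * base + ord(ch)) % mod
    return value


def fingerprint(document):
    """Tokenizing -> stopword removal -> (stemming omitted) -> hashing."""
    words = remove_stopwords(tokenize(document))
    return {hash_word(w) for w in words}


print(fingerprint("Plagiarism can be detected with the Rabin-Karp algorithm."))
```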
3.2 Rabin-Karp Calculation
Hashing is the most important operation in the Rabin-Karp algorithm. The hash of a k-gram is obtained by multiplying the ASCII value of each letter by a predetermined power of the base, where the base is a prime number. The Rabin-Karp method requires that if two strings are the same, their hash values must be the same as well. Here is an example calculation with the Rabin-Karp algorithm. Assume the text is MEDAN.
K-Gram = 5
Base = 7
A = MEDAN
A(1) = 77
A(2) = 69
A(3) = 68
A(4) = 65
A(5) = 78
Hash = (77 ∗ 7⁴) + (69 ∗ 7³) + (68 ∗ 7³) + (65 ∗ 7²) + (78 ∗ 7¹)
= 235599
(Figure: the preprocessing stages Tokenizing, Stemming, Stopword Removal, and Hashing.)
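The arithmetic in the worked example can be checked with a few lines of code. The snippet below simply reproduces the paper's calculation for the text MEDAN, using the same weights (7⁴, 7³, 7³, 7², 7¹) that appear in the formula; it is not a general-purpose hash routine.

```python
# Reproduce the worked example for the k-gram "MEDAN" with base 7.
codes = [ord(c) for c in "MEDAN"]          # [77, 69, 68, 65, 78]
weights = [7**4, 7**3, 7**3, 7**2, 7**1]   # weights as written in the worked example
hash_value = sum(c * w for c, w in zip(codes, weights))
print(hash_value)                          # 235599
```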
The hash calculation result is 235599. This process is repeated until every word on the list has been hashed. Tables 1 and 2 below give an example comparison of two documents after the hash values have been obtained; the hash values in the first table are compared against the hash values in the second table.
Table 1. Hash value of document one
19875 16830 23124 17433 20546
21489 26753 13498 23846 16528
21848 28447 29994 10301 13009
18832 27217 23157 25854 22492
14952 14337 29348 19978 28809
13485 14188 13131 21215 12053
25669 13809 26508 19455 25356
29964 17723 26633 17445 11803
19477 27142 24814 15155 26266
28432 19007 21896 16625 20681
Table 2. Hash value of document two
28432 26406 28424 13930 19187
18049 10867 18516 26753 19975
10152 13053 24120 21896 18351
12605 25101 21215 20750 15513
22949 26006 25045 25932 10695
13254 21504 20286 22492 10615
25565 29941 17403 23018 22666
19744 19769 19877 29535 13139
25669 16830 14297 20916 24640
16960 20681 13131 13009 18947
The two tables share ten identical hash values. After counting the matching hash values, the next step is to calculate the percentage of similarity between the two documents. The formula used is as follows:
P = (2 ∗ SH) / (THA + THB) ∗ 100%
Where:
P = Plagiarism Rate
SH = Identical Hash
THA = Total Hash in Document A
THB = Total Hash in Document B
In the previous comparison there are ten matching hash values, so the plagiarism level is calculated as follows:
P = (2 ∗ 10) / (50 + 50) ∗ 100% = 20/100 ∗ 100% = 0.2 ∗ 100% = 20%
The percentage of plagiarism between the two documents is therefore 20%.
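The formula can be applied directly in code. The sketch below treats each document's hash list as given and counts the shared values with a set intersection; the two five-value samples are the first row of Table 1 and the ninth row of Table 2, used here purely for illustration.

```python
def plagiarism_rate(hashes_a, hashes_b):
    """P = 2 * SH / (THA + THB) * 100, where SH is the number of shared hash values."""
    shared = len(set(hashes_a) & set(hashes_b))
    return 2 * shared / (len(hashes_a) + len(hashes_b)) * 100


doc_a = [19875, 16830, 23124, 17433, 20546]   # first row of Table 1
doc_b = [25669, 16830, 14297, 20916, 24640]   # ninth row of Table 2
print(plagiarism_rate(doc_a, doc_b))          # 20.0 (one shared value out of ten hashes)
```

With the full tables (50 hashes per document and 10 shared values), the same function reproduces the 20% reported above.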
IV. CONCLUSION
The Rabin-Karp algorithm works very well for calculating the percentage of document similarity. Besides being fast, it has adjustable parameters that tune the accuracy of the assessment. The calculation of the hash value greatly affects the result of the algorithm, so the K-Gram value must still be chosen with care. Each analyst can set a tolerance for each document to decide whether it falls into the category of plagiarism or not. The disadvantage of this algorithm is that the system can never know which document came first; it can only determine that there are similarities between the compared documents.
REFERENCES
[1] S. K. Shivaji and P. S., "Plagiarism Detection by using Karp-Rabin and String Matching Algorithm
Together," International Journal of Computer Applications, vol. 116, no. 23, pp. 37-41, 2015.
[2] A. Parker and J. O. Hamblen, "Computer Algorithm for Plagiarism Detection," IEEE Trans.
Education, vol. 32, no. 2, pp. 94-99, 1989.
[3] Sunita, R. Malik and M. Gulia, "Rabin-Karp Algorithm with Hashing a String Matching Tool,"
International Journal of Advanced Research in Computer Science and Software Engineering, vol.
4, no. 3, pp. 389-392, 2014.
[4] A. P. Gope and R. N. Behera, "A Novel Pattern Matching Algorithm in Genome," International
Journal of Computer Science and Information Technologies, vol. 5, no. 4, pp. 5450-5457, 2014.
[5] A. P. U. Siahaan, Mesran, R. Rahim and D. Siregar, "K-Gram As A Determinant Of Plagiarism
Level In Rabin-Karp Algorithm," International Journal of Scientific & Technology Research, vol.
6, no. 7, pp. 350-353, 2017.
[6] S. Popov, "Algorithm of the Week: Rabin-Karp String Searching," DZone / Java Zone, 3 April
2012. [Online]. Available: https://dzone.com/articles/algorithm-week-rabin-karp. [Accessed 20
August 2017].
[7] J. Sharma and M. Singh, "CUDA based Rabin-Karp Pattern Matching for Deep Packet Inspection
on a Multicore GPU," International Journal of Computer Network and Information Security, vol.
10, no. 8, pp. 70-77, 2015.