International Journal on Recent and Innovation Trends in Computing and Communication, ISSN: 2321-8169, Volume 6, Issue 6, pp. 254–259
IJRITCC | June 2018, Available @ http://www.ijritcc.org
PAS: A Sampling Based Similarity Identification Algorithm for Compression of Unicode Data Content
Siddiqui Sara Begum Anjum Parvez
Department of Computer Science and
Engineering, K.S.I.E.T Hingoli-431513
Maharashtra, INDIA.
E-mail id: siddiqui123sara@gmail.com
Ashru L. Korde
Department of Computer Science and
Engineering, K.S.I.E.T Hingoli-431513
Maharashtra, INDIA.
E-mail id: ashrukorde@gmail.com
Abstract— Users generally perform searches to satisfy their information needs, and nowadays many people rely on search engines for this. Server search is one technique for finding information, and the growth of data brings new challenges to the server. Data is usually expected to be served in a timely fashion; an increase in latency may cause a massive loss to enterprises. Similarity detection therefore plays a very important role. Many algorithms are used for similarity detection, such as Shingle, Simhash, TSA, and the Position-Aware Sampling (PAS) algorithm. Shingle, Simhash, and Traits read entire files to calculate similarity values, which leads to long delays as the data set grows. Instead of reading entire files, PAS samples some data in the form of Unicode to calculate a similarity characteristic value. PAS is an advance over TSA. However, a slight modification of a file will shift the positions of the file content, so similarity identification can fail after such modifications. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the server. EPAS concurrently samples data blocks from the modulated file to avoid the position shift caused by modifications. In addition, a metric is proposed to measure the similarity between different files and bring the detection probability close to the actual probability. This paper also describes the PAS algorithm, which reduces the time overhead of similarity detection; using PAS we can reduce the complexity and time of identifying similarity. Our results demonstrate that EPAS significantly outperforms existing well-known algorithms in terms of time. Therefore, it is an effective approach to similarity identification for the server.
Keywords: Similarity, identification, Unicode data.
__________________________________________________*****_________________________________________________
I. INTRODUCTION
The growth of data management significantly increases, and with it the data risk and the cost of data also increase. To address this kind of problem, many users transfer their data to a server, where it can be accessed via the Internet. This results in a large volume of redundant data on the server; the main reason is that multiple users tend to store similar files there. Unfortunately, the redundant data not only consumes significant resources but also occupies bandwidth, which is why data deduplication is required.
To overcome these problems, we create a sampling-based similarity identification algorithm for compression of Unicode data content on the server. Search techniques perform comparably despite contrary claims in the literature. During our evaluation of search effectiveness, we were surprised by the difficulty we had searching our data sets. In particular, straightforward implementations of many search techniques do not scale to databases with hundreds of thousands of tuples, which forced us to write “lazy” versions of their core algorithms and reduce their memory footprint. Even then, we were surprised by the excessive runtime of many search techniques.
In present data warehousing environments there are many issues in server computing, and some advanced data manipulation schemes are required. Extending the data search paradigm to relational data has been an active area of research within the database and information retrieval communities, and it proves the effectiveness of performance in retrieval tasks and data maintenance procedures. The outcome confirms previous claims regarding the unacceptable performance of these systems and underscores the need for standardization, as exemplified by the IR community, when evaluating these retrieval systems.
Position-aware similarity identification algorithms involve both I/O-bound and CPU-bound tasks. Calculating the Unicode of similar files requires many CPU cycles, and the computation grows with the size of the data set. Position-aware similarity identification algorithms normally require a large amount of time to detect similarity, which results in long delays, and large data sets require even more time. This makes it difficult to apply the algorithms in some applications.
In this paper, we propose a Position-Aware Similarity (PAS) identification algorithm to detect similar files in large data sets. The method is very effective in dealing with file modification when performing similarity detection, and we compare it against the Simhash algorithm in terms of precision and recall. Furthermore, the time overhead, CPU, and memory occupation of PAS are much less than those of Simhash, because the overhead of PAS is relatively stable: it does not increase with the growth of data size.
The remainder of this paper is organized as follows. We present related work in Section 2. In Section 3 we describe some background knowledge. Section 4 introduces the basic idea of the PAS algorithm. Section 5 shows sampling-based similarity identification. Section 6 shows the evaluation results of the PAS algorithm. Section 7 shows how similarity identification techniques work. Section 8 draws conclusions and outlines future work.
II. RELATED WORK
The related work involves server techniques and similarity detection algorithms for avoiding redundant data on the server using Unicode data content. Here we design a sampling-based similarity approach for detecting data similarity and performing various tasks such as uploading and deleting files. In the past decade, a lot of research effort has been invested in identifying data similarity, which we explain below.
The first one is similar web page detection with web search
engine. Detecting and removing similar web pages can save
network bandwidth, reduce storage consumption, and improve
the quality of the web search engine index. Andrei et al. [17], [20] proposed a similar web page detection technique called the Shingle algorithm, which utilizes set operations to detect similarity. Shingle is a typical sampling-based approach employed to identify similar web pages. In order to reduce the size of shingles, Andrei presented the Modm and Mins sampling methods. This algorithm is currently applied in the AltaVista web search engine. Manku et al. [18] applied the Simhash algorithm [21] to detect similarity in web documents belonging to a multi-billion page repository. The Simhash algorithm practically runs in the Google web search engine, combining with the Google file system [22] and MapReduce [23] to achieve batch queries.
Elsayed et al. [24] presented a MapReduce algorithm for
computing pairwise document similarity in large document
collections.
The second one is similar file detection in storage systems. In
storage systems, data similarity detection and encoding play a
crucial role in improving the resource utilization. Forman [25]
presented an approach for finding similar files and applied the
method to document repositories. This approach brings a great reduction in storage space consumption. Ouyang [26] presented a large-scale file compression technique based on clustering, using the Shingle similarity detection technique. Ouyang uses the Min-wise [27] sampling method to reduce the overhead of the Shingle algorithm. Han et al. [28] presented a fuzzy file block matching technique, which was first proposed for opportunistic use of content-addressable storage. Fuzzy file block matching employs Shingle to represent the fuzzy hashing of file blocks for similarity detection, and uses the Mins sampling method to decrease the overhead of the shingling algorithm.
The third one is plagiarism detection. Digital information can be easily copied and retransmitted, which makes it easy for owners’ copyrights to be violated. To protect copyright and other related rights, we need plagiarism detection. Baker [29] described a program called dup which can be used to locate instances of duplication or near-duplication in software. Shivakumar [30] presented data structures for finding overlap between documents and implemented these data structures in SCAM.
The fourth one is remote file backup. Traditional remote file backup approaches take high bandwidth and consume a lot of resources. Applying similarity detection to remote file backup can greatly reduce bandwidth consumption.
Teodosiu et al. [15] proposed a Traits algorithm to find the client files which are similar to a given server file, and implemented this algorithm in DFSR. Experimental results suggest that these optimizations may help reduce the bandwidth required to transfer file updates across a network. Muthitacharoen et al. [5] presented LBFS, which exploits similarity between files or versions of the same file to save bandwidth. Cox et al. [31] presented a similarity-based mechanism for locating a single source file to perform peer-to-peer backup, and implemented a system prototype called Pastiche.
The fifth one is similarity detection for specific domains. Hua et al. [32] explored and exploited data similarity to support efficient data placement for the cloud. They designed a novel multi-core-enabled and locality-sensitive hashing that can accurately capture the differentiated similarity across data. Biswas et al. [33] proposed a cache architecture called Mergeable, which detects data similarities and merges cache blocks so as to decrease cache storage requirements. Experimental evaluation suggested that Mergeable reduces off-chip memory accesses and overall power usage. Vernica et al. [34] proposed a three-stage approach for end-to-end set-similarity joins in parallel using the popular MapReduce framework. Deng et al. [35] proposed a MapReduce-based framework, Massjoin, for scalable string similarity joins. The approach supports both set-based and character-based similarity
functions. Most of the above work focuses on a specific application scenario, and the computational or similarity detection overhead increases with the growth of data volume. In addition, the similarity detection metric may not measure the similarity between two files well. Therefore, this paper proposes an EPAS algorithm and a new similarity detection metric to identify file similarity for the cloud.
According to the analysis (see Sections 4 and 5.3) and the experimental results, the proposed similarity metric captures the similarity between two files more accurately than the traditional metric. Furthermore, the overhead of EPAS is fixed and minimized in contrast to previous work.
III. BACKGROUND
Here we use a hash key based on counting the occurrences of certain Unicode strings within a file. The keys, along with certain intermediate data, are stored in a relational database (Figure 1). A separate program then queries the database for keys with similar values and outputs the results. Our code was written in PHP and developed for the Windows platform; while the code runs equally well on other platforms, we used a Windows machine for primary development and for most data collection.
Figure 1: SimHash produces two levels of file similarity data.
The key function was implemented to account for file extensions.
We cannot say that two files are different merely because they have different extensions. Instead, we assign different values to any two files with different extensions: we compute a hash of the file extension with a value between 0 and 1. If two files have the same extension they receive the same value, so the hash does not affect the relation between files with the same extension.
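A minimal PHP sketch of such an extension hash follows; the use of crc32 to map into [0, 1) is our choice, not necessarily the paper's:

```php
<?php
// Map a file extension to a deterministic value in [0, 1), so that
// files with the same extension always receive the same value and
// files with different extensions almost surely receive different ones.
function extensionHash(string $path): float
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return (crc32($ext) & 0xffffffff) / 4294967296.0; // divide by 2^32
}

var_dump(extensionHash('report.docx') === extensionHash('notes.docx')); // bool(true)
var_dump(extensionHash('report.docx') === extensionHash('image.png'));  // bool(false), barring a crc32 collision
```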
IV. POSITION-AWARE SIMILARITY ALGORITHM
When the server holds a large data set with many similar files and low overhead is required, we can use the PAS similarity detection algorithm. The symbols used in the following are given in Table 1.
A. Traditional sampling algorithm
Suppose we sample N data blocks of file A, where each data block of size Lenc is fed into a hash function. We then obtain N fingerprint values that are collected into a fingerprint set SigA(N, Lenc). In this scenario, the similarity detection problem can be transformed into a set intersection problem. By analogy, we will have a fingerprint set SigB(N, Lenc) of file B.
Table 1: Symbols used in the following.
Figure 2: The sampling positions of TSA and PAS.
The degree of similarity between file A and file B can be described by equation (1), where Sim(A,B) ranges between 0 and 1. If Sim(A,B) approaches 1, the similarity of file A and file B is very high, and vice versa. After selecting a similarity threshold, we can determine that file A is similar to file B when Sim(A,B) satisfies the threshold. TSA is described in pseudo-code in Algorithm 1.
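Equation (1) itself does not survive in this version of the text. A plausible reconstruction, consistent with the set-intersection framing above and with Sim(A,B) ranging between 0 and 1, is the Jaccard ratio of the two fingerprint sets; this exact form is an assumption:

    Sim(A,B) = |SigA(N, Lenc) ∩ SigB(N, Lenc)| / |SigA(N, Lenc) ∪ SigB(N, Lenc)|    (1)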
TSA is simple, but it is very sensitive to file modifications. A small modification causes the sampling positions to shift, resulting in a failure. Suppose we have a file A of size 56KB. We sample 6 data blocks, each of size 1KB. According to Algorithm 1, file A has N = 6, Lenc = 1KB, FileSize = 56KB, and LenR = 10KB. If we add 5KB of data to file A to form file B, file B will have N = 6, Lenc = 1KB, FileSize = 61KB, and LenR = 11KB in terms of the algorithm.
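The sampling arithmetic can be made concrete with a short PHP sketch (PHP being the paper's implementation language). Algorithm 1 is not reproduced in this extract, so the gap formula LenR = (FileSize − N × Lenc) / (N − 1) is inferred from the worked numbers above (56KB gives 10KB, 61KB gives 11KB), and the helper names are ours:

```php
<?php
// Sampling positions of TSA: the i-th block starts at (i-1)*(LenC+LenR).
// LenR = (FileSize - N*LenC) / (N - 1) is inferred, not quoted.
function tsaPositions(int $fileSize, int $n, int $lenC): array
{
    $lenR = intdiv($fileSize - $n * $lenC, $n - 1); // gap between samples
    $positions = [];
    for ($i = 1; $i <= $n; $i++) {
        $positions[] = ($i - 1) * ($lenC + $lenR);
    }
    return $positions;
}

// Hash each sampled block to build the fingerprint set Sig(N, LenC).
function fingerprintSet(string $path, int $n, int $lenCBytes): array
{
    $fh  = fopen($path, 'rb');
    $sig = [];
    foreach (tsaPositions(filesize($path), $n, $lenCBytes) as $pos) {
        fseek($fh, $pos);
        $sig[] = hash('sha1', fread($fh, $lenCBytes));
    }
    fclose($fh);
    return $sig;
}

// Sim(A,B) as the ratio reconstructed in equation (1).
function similarity(array $sigA, array $sigB): float
{
    $inter = count(array_intersect($sigA, $sigB));
    $union = count(array_unique(array_merge($sigA, $sigB)));
    return $union > 0 ? $inter / $union : 0.0;
}

// Working in KB units to mirror the paper's example:
print_r(tsaPositions(56, 6, 1)); // [0, 11, 22, 33, 44, 55]
print_r(tsaPositions(61, 6, 1)); // [0, 12, 24, 36, 48, 60]
```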
International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169
Volume: 6 Issue: 6 254 - 259
______________________________________________________________________________________
257
IJRITCC | June 2018, Available @ http://www.ijritcc.org
_______________________________________________________________________________________
Adding 5KB of data to file A can happen in three ways: at the beginning, the middle, or the end of file A. Files B1, B2, and B3 in Figure 2(a) represent these three situations. We can see that these file modifications cause the sampling positions to shift, resulting in inaccurate similarity detection.
For example, the six sampling positions of file A are 0KB, 11KB, 22KB, 33KB, 44KB, and 55KB, since (i − 1) × (Lenc + LenR) gives (1 − 1) × (1 + 10) = 0KB, (2 − 1) × (1 + 10) = 11KB, (3 − 1) × (1 + 10) = 22KB, (4 − 1) × (1 + 10) = 33KB, (5 − 1) × (1 + 10) = 44KB, and (6 − 1) × (1 + 10) = 55KB, respectively.
However, due to the added 5KB of data, the six sampling positions of files B1, B2, and B3 shift to 0KB, 12KB, 24KB, 36KB, 48KB, and 60KB, since (1 − 1) × (1 + 11) = 0KB, (2 − 1) × (1 + 11) = 12KB, (3 − 1) × (1 + 11) = 24KB, (4 − 1) × (1 + 11) = 36KB, (5 − 1) × (1 + 11) = 48KB, and (6 − 1) × (1 + 11) = 60KB, respectively.
Although Sim(A,B) is far from the actual value when using TSA, the sampling method is very simple and incurs much less overhead than the Shingle and Simhash algorithms.
B. PAS algorithm
FPP [19] prefetches fingerprints belonging to the same file by leveraging file similarity, thus improving the performance of data deduplication systems. The experimental results suggest that FPP increases the cache hit ratio and greatly reduces the number of disk accesses. Using TSA, FPP samples three data blocks, at the beginning, the middle, and the end of a file, to determine whether a forthcoming file is similar to the files stored in the backend storage system. This method is simple and effective. However, as explained in Section 4.1, a single-bit modification would result in a failure. Therefore, PAS is proposed to solve this problem.
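For reference, FPP-style three-block sampling can be sketched as follows; the helper name and the exact offset choices are ours:

```php
<?php
// FPP-style sampling: one block each at the beginning, the middle,
// and the end of a file. Any insertion shifts the middle and end
// offsets, which is why a small modification can defeat the scheme.
function fppPositions(int $fileSize, int $lenC): array
{
    return [
        0,                            // beginning
        intdiv($fileSize - $lenC, 2), // middle
        $fileSize - $lenC,            // end
    ];
}

print_r(fppPositions(56, 1)); // [0, 27, 55] (KB, as in the example above)
print_r(fppPositions(61, 1)); // [0, 30, 60]
```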
V. CONCLUSION AND FUTURE SCOPE
A. Conclusion
In this paper we have studied the existing techniques available. Each system has advantages and disadvantages, and no existing system can fulfill all the requirements of server search: they require more space and time, and some techniques are limited to particular datasets. We proposed an algorithm, PAS, to identify the file similarity of Unicode data in large data sets. Many experiments were performed to select the parameters of PAS. PAS is very effective in detecting file similarity in contrast to the similarity identification algorithm Simhash, and PAS requires less time than Simhash. The proposed technique satisfies a number of requirements of server search using different algorithms. It also shows the ranking of character values and does not require knowledge of database queries. Compared to existing algorithms it is a fast process.
B. Future Scope
As future work we can search for techniques that are useful for all datasets, so that only a single technique needs to be used. Further research is necessary to investigate the experimental design decisions that have a significant impact on the evaluation of server search.
Sampling-based similarity identification opens up several directions for future work. The first is using a content-based chunking algorithm to sample data blocks, since this approach can avoid the content shifting incurred by data modification. The second is employing file metadata to optimize similarity detection, because the file size and type contained in the metadata of similar files are normally very close.
Compared to the presented systems it is a fast process, and the techniques are unlikely to have performance characteristics similar to existing systems, but they must be used if cloud search systems are to scale to large datasets. Memory utilization during a search has not been the focus of any earlier assessment.
REFERENCES
[1] “... storage architecture for big data in cloud environment,” in Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing. IEEE, 2013, pp. 476–480.
[2] J. Gantz and D. Reinsel, “The digital universe decade-
are you ready,” IDC iView, 2010.
[3] H. Biggar, “Experiencing data de-duplication:
Improving efficiency and reducing capacity
requirements,” The Enterprise Strategy Group, 2007.
[4] F. Guo and P. Efstathopoulos, “Building a high
performance deduplication system,” in Proceedings of
the 2011 USENIX conference on USENIX annual
technical conference. USENIX Association, 2011,pp.
25–25.
[5] A. Muthitacharoen, B. Chen, and D. Mazieres, “A
low-bandwidth network file system,” in ACM
SIGOPS Operating Systems Review, vol. 35, no. 5.
ACM, 2001, pp. 174–187.
[6] B. Zhu, K. Li, and R. H. Patterson, “Avoiding the disk
bottleneck in the data domain deduplication file
system.” in Fast, vol. 8, 2008, pp. 1–14.
[7] Y. Deng, “What is the future of disk drives, death or
rebirth?” ACM Computing Surveys (CSUR), vol. 43,
no. 3, p. 23, 2011.
[8] C. Wu, X. LIN, D. Yu, W. Xu, and L. Li, “End-to-end
delay minimization for scientific workflows in clouds
under budget constraint,” IEEE Transaction on Cloud
Computing (TCC), vol. 3, pp. 169–181, 2014.
[9] G. Linden, “Make data useful,” http://home.blarg.net/~glinden/StanfordDataMining.2006-11-29.ppt, 2006.
[10] R. Kohavi, R. M. Henne, and D. Sommerfield,
“Practical guide to controlled experiments on the web:
listen to your customers not to the hippo,” in
Proceedings of the 13th ACM SIGKDD international
conference on Knowledge discovery and data mining.
ACM, 2007, pp. 959–967.
[11] J. Hamilton, “The cost of latency,” Perspectives Blog,
2009.
[12] D. Bhagwat, K. Eshghi, D. D. Long, and M.
Lillibridge, “Extreme binning: Scalable, parallel
deduplication for chunk-based file backup,” in
Modeling, Analysis & Simulation of Computer and
Telecommunication Systems, 2009. MASCOTS’09.
IEEE International Symposium on. IEEE, 2009, pp.
[13] W. Xia, H. Jiang, D. Feng, and Y. Hua, “Silo: a
similarity-locality based near-exact deduplication
scheme with low ram overhead and high throughput,”
in Proceedings of the 2011 USENIX conference on
USENIX annual technical conference. USENIX
Association, 2011, pp. 26–28.
[14] Y. Fu, H. Jiang, and N. Xiao, “A scalable inline cluster
deduplication framework for big data protection,” in
Middleware 2012. Springer, 2012, pp. 354–373.
[15] D. Teodosiu, N. Bjorner, Y. Gurevich, M. Manasse,
and J. Porkka, “Optimizing file replication over
limited bandwidth networks using remote differential
compression,” Microsoft Research TR- 2006-157,
2006.
[16] Y. Zhou, Y. Deng, and J. Xie, “Leverage similarity
and locality to enhance fingerprint perfecting of data
deduplication,” in Proceedings of The 20th IEEE
International Conference on Parallel and Distributed
Systems. Springer, 2014.
[17] A. Z. Broder, “On the resemblance and containment of
documents,” in Compression and Complexity of
Sequences 1997. Proceedings. IEEE, 1997, pp. 21–29.
[18] G. S. Manku, A. Jain, and A. Das Sarma, “Detecting
near duplicates for web crawling,” in Proceedings of
the 16th international conference on World Wide Web.
ACM, 2007, pp. 141–150.
[19] L. Song, Y. Deng, and J. Xie, “Exploiting fingerprint
prefetching to improve the performance of data
deduplication,” in Proceedings of the 15th IEEE
International Conference on High Performance
Computing and Communications. IEEE, 2013.
[20] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G.
Zweig, “Syntactic clustering of the web,” Computer
Networks and ISDN Systems, vol. 29, no. 8, pp. 1157–
1166, 1997.
[21] M. S. Charikar, “Similarity estimation techniques from
rounding algorithms,” in Proceedings of the thiry-
fourth annual ACM symposium on Theory of
computing. ACM, 2002, pp. 380–388.
[22] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The
google file system,” in ACM SIGOPS Operating
Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–
43.
[23] J. Dean and S. Ghemawat, “Mapreduce: simplified
data processing on large clusters,” Communications of
the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[24] T. Elsayed, J. Lin, and D. W. Oard, “Pairwise
document similarity in large collections with map
reduce,” in Proceedings of the 46th Annual Meeting of
the Association for Computational Linguistics on
Human Language Technologies: Short Papers.
Association for Computational Linguistics, 2008, pp.
265–268.
[25] G. Forman, K. Eshghi, and S. Chiocchetti, “Finding
similar files in large document repositories,” in
Proceedings of the eleventh ACM SIGKDD
international conference on Knowledge discovery in
data mining. ACM, 2005, pp. 394–400.
[26] Z. Ouyang, N. Memon, T. Suel, and D. Trendafilov,
“Cluster-based delta compression of a collection of
files,” in Web Information Systems Engineering,
2002. WISE 2002. Proceedings of the Third
International Conference on. IEEE, 2002, pp. 257–
266.
[27] A. Z. Broder, M. Charikar, A. M. Frieze, and M.
Mitzenmacher, “Min-wise independent permutations,”
Journal of Computer and System Sciences, vol. 60, no.
3, pp. 630–659, 2000.
[28] B. Han and P. J. Keleher, “Implementation and
performance evaluation of fuzzy file block matching.”
in USENIX Annual Technical Conference, 2007, pp.
199–204.
[29] B. S. Baker, “On finding duplication and near-
duplication in large software systems,” in Reverse
Engineering, 1995., Proceedings of 2nd Working
Conference on. IEEE, 1995, pp. 86–95.
[30] N. Shivakumar and H. Garcia-Molina, “Building a
scalable and accurate copy detection mechanism,” in
Proceedings of the first ACM international conference
on Digital libraries. ACM, 1996, pp.160–168.

More Related Content

What's hot (19)

PDF
Recent Trends in Incremental Clustering: A Review
IOSRjournaljce
 
PDF
Data characterization towards modeling frequent pattern mining algorithms
csandit
 
PDF
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET Journal
 
PDF
Hierarchal clustering and similarity measures along with multi representation
eSAT Journals
 
PDF
Hierarchal clustering and similarity measures along
eSAT Publishing House
 
PDF
Skip List: Implementation, Optimization and Web Search
CSCJournals
 
PDF
11.challenging issues of spatio temporal data mining
Alexander Decker
 
PPT
Data mining in agriculture
Sibananda Khatai
 
PDF
A Framework for Automated Association Mining Over Multiple Databases
ertekg
 
PDF
V3 i35
silverscouts
 
PDF
Analysis of Crime Big Data using MapReduce
Kaushik Rajan
 
PDF
A Novel Approach for Clustering Big Data based on MapReduce
IJECEIAES
 
PDF
Frequency and similarity aware partitioning for cloud storage based on space ...
redpel dot com
 
PDF
Data Warehousing and Business Intelligence Project on Smart Agriculture and M...
Kaushik Rajan
 
PDF
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Association of Scientists, Developers and Faculties
 
PDF
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
International Educational Applied Scientific Research Journal (IEASRJ)
 
PDF
An Improved Differential Evolution Algorithm for Data Stream Clustering
IJECEIAES
 
PDF
WebSite Visit Forecasting Using Data Mining Techniques
Chandana Napagoda
 
PDF
MOCHA 2018 Challenge @ ESWC2018
Holistic Benchmarking of Big Linked Data
 
Recent Trends in Incremental Clustering: A Review
IOSRjournaljce
 
Data characterization towards modeling frequent pattern mining algorithms
csandit
 
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET Journal
 
Hierarchal clustering and similarity measures along with multi representation
eSAT Journals
 
Hierarchal clustering and similarity measures along
eSAT Publishing House
 
Skip List: Implementation, Optimization and Web Search
CSCJournals
 
11.challenging issues of spatio temporal data mining
Alexander Decker
 
Data mining in agriculture
Sibananda Khatai
 
A Framework for Automated Association Mining Over Multiple Databases
ertekg
 
V3 i35
silverscouts
 
Analysis of Crime Big Data using MapReduce
Kaushik Rajan
 
A Novel Approach for Clustering Big Data based on MapReduce
IJECEIAES
 
Frequency and similarity aware partitioning for cloud storage based on space ...
redpel dot com
 
Data Warehousing and Business Intelligence Project on Smart Agriculture and M...
Kaushik Rajan
 
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Association of Scientists, Developers and Faculties
 
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
International Educational Applied Scientific Research Journal (IEASRJ)
 
An Improved Differential Evolution Algorithm for Data Stream Clustering
IJECEIAES
 
WebSite Visit Forecasting Using Data Mining Techniques
Chandana Napagoda
 
MOCHA 2018 Challenge @ ESWC2018
Holistic Benchmarking of Big Linked Data
 

Similar to PAS: A Sampling Based Similarity Identification Algorithm for compression of Unicode data content (20)

PDF
EPAS: A SAMPLING BASED SIMILARITY IDENTIFICATION ALGORITHM FOR THE CLOUD
Nexgen Technology
 
PDF
Ijetcas14 624
Iasir Journals
 
DOC
Fyp ideas
Mr SMAK
 
PDF
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD Editor
 
PDF
One dimensional vector based pattern
ijcsit
 
PDF
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
IJDKP
 
PDF
F0431025031
ijceronline
 
PDF
The Search of New Issues in the Detection of Near-duplicated Documents
ijceronline
 
PDF
Bl24409420
IJERA Editor
 
PDF
TEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGES
IJCI JOURNAL
 
PDF
Visual Search
Lov Loothra
 
PDF
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
dbpublications
 
PPTX
3 - Finding similar items
Viet-Trung TRAN
 
DOCX
Variable length signature for near-duplicate
jpstudcorner
 
PPT
Finding Similar Files in Large Document Repositories
feiwin
 
PDF
A1804010105
IOSR Journals
 
PDF
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Editor IJCATR
 
PDF
Sparse representation in image and video copy detection
Huan-Cheng Hsu
 
PDF
L0261075078
inventionjournals
 
PDF
L0261075078
inventionjournals
 
EPAS: A SAMPLING BASED SIMILARITY IDENTIFICATION ALGORITHM FOR THE CLOUD
Nexgen Technology
 
Ijetcas14 624
Iasir Journals
 
Fyp ideas
Mr SMAK
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD Editor
 
One dimensional vector based pattern
ijcsit
 
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
IJDKP
 
F0431025031
ijceronline
 
The Search of New Issues in the Detection of Near-duplicated Documents
ijceronline
 
Bl24409420
IJERA Editor
 
TEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGES
IJCI JOURNAL
 
Visual Search
Lov Loothra
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
dbpublications
 
3 - Finding similar items
Viet-Trung TRAN
 
Variable length signature for near-duplicate
jpstudcorner
 
Finding Similar Files in Large Document Repositories
feiwin
 
A1804010105
IOSR Journals
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Editor IJCATR
 
Sparse representation in image and video copy detection
Huan-Cheng Hsu
 
L0261075078
inventionjournals
 
L0261075078
inventionjournals
 
Ad

More from rahulmonikasharma (20)

PDF
Data Mining Concepts - A survey paper
rahulmonikasharma
 
PDF
A Review on Real Time Integrated CCTV System Using Face Detection for Vehicle...
rahulmonikasharma
 
PDF
Considering Two Sides of One Review Using Stanford NLP Framework
rahulmonikasharma
 
PDF
A New Detection and Decoding Technique for (2×N_r ) MIMO Communication Systems
rahulmonikasharma
 
PDF
Broadcasting Scenario under Different Protocols in MANET: A Survey
rahulmonikasharma
 
PDF
Sybil Attack Analysis and Detection Techniques in MANET
rahulmonikasharma
 
PDF
A Landmark Based Shortest Path Detection by Using A* and Haversine Formula
rahulmonikasharma
 
PDF
Processing Over Encrypted Query Data In Internet of Things (IoTs) : CryptDBs,...
rahulmonikasharma
 
PDF
Quality Determination and Grading of Tomatoes using Raspberry Pi
rahulmonikasharma
 
PDF
Comparative of Delay Tolerant Network Routings and Scheduling using Max-Weigh...
rahulmonikasharma
 
PDF
DC Conductivity Study of Cadmium Sulfide Nanoparticles
rahulmonikasharma
 
PDF
A Survey on Peak to Average Power Ratio Reduction Methods for LTE-OFDM
rahulmonikasharma
 
PDF
IOT Based Home Appliance Control System, Location Tracking and Energy Monitoring
rahulmonikasharma
 
PDF
Thermal Radiation and Viscous Dissipation Effects on an Oscillatory Heat and ...
rahulmonikasharma
 
PDF
Advance Approach towards Key Feature Extraction Using Designed Filters on Dif...
rahulmonikasharma
 
PDF
Alamouti-STBC based Channel Estimation Technique over MIMO OFDM System
rahulmonikasharma
 
PDF
Empirical Mode Decomposition Based Signal Analysis of Gear Fault Diagnosis
rahulmonikasharma
 
PDF
Short Term Load Forecasting Using ARIMA Technique
rahulmonikasharma
 
PDF
Impact of Coupling Coefficient on Coupled Line Coupler
rahulmonikasharma
 
PDF
Design Evaluation and Temperature Rise Test of Flameproof Induction Motor
rahulmonikasharma
 
Data Mining Concepts - A survey paper
rahulmonikasharma
 
A Review on Real Time Integrated CCTV System Using Face Detection for Vehicle...
rahulmonikasharma
 
Considering Two Sides of One Review Using Stanford NLP Framework
rahulmonikasharma
 
A New Detection and Decoding Technique for (2×N_r ) MIMO Communication Systems
rahulmonikasharma
 
Broadcasting Scenario under Different Protocols in MANET: A Survey
rahulmonikasharma
 
Sybil Attack Analysis and Detection Techniques in MANET
rahulmonikasharma
 
A Landmark Based Shortest Path Detection by Using A* and Haversine Formula
rahulmonikasharma
 
Processing Over Encrypted Query Data In Internet of Things (IoTs) : CryptDBs,...
rahulmonikasharma
 
Quality Determination and Grading of Tomatoes using Raspberry Pi
rahulmonikasharma
 
Comparative of Delay Tolerant Network Routings and Scheduling using Max-Weigh...
rahulmonikasharma
 
DC Conductivity Study of Cadmium Sulfide Nanoparticles
rahulmonikasharma
 
A Survey on Peak to Average Power Ratio Reduction Methods for LTE-OFDM
rahulmonikasharma
 
IOT Based Home Appliance Control System, Location Tracking and Energy Monitoring
rahulmonikasharma
 
Thermal Radiation and Viscous Dissipation Effects on an Oscillatory Heat and ...
rahulmonikasharma
 
Advance Approach towards Key Feature Extraction Using Designed Filters on Dif...
rahulmonikasharma
 
Alamouti-STBC based Channel Estimation Technique over MIMO OFDM System
rahulmonikasharma
 
Empirical Mode Decomposition Based Signal Analysis of Gear Fault Diagnosis
rahulmonikasharma
 
Short Term Load Forecasting Using ARIMA Technique
rahulmonikasharma
 
Impact of Coupling Coefficient on Coupled Line Coupler
rahulmonikasharma
 
Design Evaluation and Temperature Rise Test of Flameproof Induction Motor
rahulmonikasharma
 
Ad

Recently uploaded (20)

PDF
June 2025 Top 10 Sites -Electrical and Electronics Engineering: An Internatio...
elelijjournal653
 
PPTX
Kel.3_A_Review_on_Internet_of_Things_for_Defense_v3.pptx
Endang Saefullah
 
PDF
Module - 4 Machine Learning -22ISE62.pdf
Dr. Shivashankar
 
PPTX
CST413 KTU S7 CSE Machine Learning Introduction Parameter Estimation MLE MAP ...
resming1
 
PPTX
Precooling and Refrigerated storage.pptx
ThongamSunita
 
PDF
Artificial Neural Network-Types,Perceptron,Problems
Sharmila Chidaravalli
 
PPSX
OOPS Concepts in Python and Exception Handling
Dr. A. B. Shinde
 
PDF
Plant Control_EST_85520-01_en_AllChanges_20220127.pdf
DarshanaChathuranga4
 
PDF
PRIZ Academy - Process functional modelling
PRIZ Guru
 
PDF
June 2025 - Top 10 Read Articles in Network Security and Its Applications
IJNSA Journal
 
PPTX
Bharatiya Antariksh Hackathon 2025 Idea Submission PPT.pptx
AsadShad4
 
PDF
01-introduction to the ProcessDesign.pdf
StiveBrack
 
PDF
How to Buy Verified CashApp Accounts IN 2025
Buy Verified CashApp Accounts
 
PPT
SF 9_Unit 1.ppt software engineering ppt
AmarrKannthh
 
PDF
CLIP_Internals_and_Architecture.pdf sdvsdv sdv
JoseLuisCahuanaRamos3
 
PDF
Python Mini Project: Command-Line Quiz Game for School/College Students
MPREETHI7
 
PPTX
Unit_I Functional Units, Instruction Sets.pptx
logaprakash9
 
PDF
Clustering Algorithms - Kmeans,Min ALgorithm
Sharmila Chidaravalli
 
PDF
Bayesian Learning - Naive Bayes Algorithm
Sharmila Chidaravalli
 
PDF
Module - 5 Machine Learning-22ISE62.pdf
Dr. Shivashankar
 
June 2025 Top 10 Sites -Electrical and Electronics Engineering: An Internatio...
elelijjournal653
 
Kel.3_A_Review_on_Internet_of_Things_for_Defense_v3.pptx
Endang Saefullah
 
Module - 4 Machine Learning -22ISE62.pdf
Dr. Shivashankar
 
CST413 KTU S7 CSE Machine Learning Introduction Parameter Estimation MLE MAP ...
resming1
 
Precooling and Refrigerated storage.pptx
ThongamSunita
 
Artificial Neural Network-Types,Perceptron,Problems
Sharmila Chidaravalli
 
OOPS Concepts in Python and Exception Handling
Dr. A. B. Shinde
 
Plant Control_EST_85520-01_en_AllChanges_20220127.pdf
DarshanaChathuranga4
 
PRIZ Academy - Process functional modelling
PRIZ Guru
 
June 2025 - Top 10 Read Articles in Network Security and Its Applications
IJNSA Journal
 
Bharatiya Antariksh Hackathon 2025 Idea Submission PPT.pptx
AsadShad4
 
01-introduction to the ProcessDesign.pdf
StiveBrack
 
How to Buy Verified CashApp Accounts IN 2025
Buy Verified CashApp Accounts
 
SF 9_Unit 1.ppt software engineering ppt
AmarrKannthh
 
CLIP_Internals_and_Architecture.pdf sdvsdv sdv
JoseLuisCahuanaRamos3
 
Python Mini Project: Command-Line Quiz Game for School/College Students
MPREETHI7
 
Unit_I Functional Units, Instruction Sets.pptx
logaprakash9
 
Clustering Algorithms - Kmeans,Min ALgorithm
Sharmila Chidaravalli
 
Bayesian Learning - Naive Bayes Algorithm
Sharmila Chidaravalli
 
Module - 5 Machine Learning-22ISE62.pdf
Dr. Shivashankar
 

PAS: A Sampling Based Similarity Identification Algorithm for compression of Unicode data content

  • 1. International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169 Volume: 6 Issue: 6 254 - 259 ______________________________________________________________________________________ 254 IJRITCC | June 2018, Available @ http://www.ijritcc.org _______________________________________________________________________________________ PAS: A Sampling Based Similarity Identification Algorithm for compression of Unicode data content Siddiqui Sara Begum Anjum Parvez Department of Computer Science and Engineering, K.S.I.E.T Hingoli-431513 Maharashtra, INDIA. E-mail id: [email protected] Ashru L. Korde (Author) Department of Computer Science and Engineering, K.S.I.E.T Hingoli-431513 Maharashtra, INDIA. Email id:[email protected] Abstract— Generally, Users perform searches to satisfy their information needs. Now a day’s lots of people are using search engine to satisfy information need. Server search is one of the techniques of searching the information. the Growth of data brings new changes in Server. The data usually proposed in timely fashion in server. If there is increase in latency then it may cause a massive loss to the enterprises. The similarity detection plays very important role in data. while there are many algorithms are used for similarity detection such as Shingle, Simhas TSA and Position Aware sampling algorithm. The Shingle Simhash and Traits read entire files to calculate similar values. It requires the long delay in growth of data set value. instead of reading entire Files PAS sample some data in the form of Unicode to calculate similarity characteristic value.PAS is the advance technique of TSA. However slight modification of file will trigger the position of file content .Therefore the failure of similarity identification is there due to some modifications.. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the Server. EPAS concurrently samples data blocks from the modulated file to avoid the position shift by the modifications. While there is an metric is proposed to measure the similarity between different files and make the possible detection probability close to the actual probability. In this paper describes a PAS algorithm to reduce the time overhead of similarity detection. Using PAS algorithm we can reduce the complication and time for identifying the similarity. Our result demonstrate that the EPAS significantly outperforms the existing well known algorithms in terms of time. Therefore, it is an effective approach of similarity identification for the Server. Keywords: Similarity, identification, Unicode data. __________________________________________________*****_________________________________________________ I. INTRODUCTION The growth of the data management significantly increases and the data risk and cost of data also increases . to address this kind of problem many users transfer there data to the server. And we can access that data via internet. This kind of problem result in large volume of redundant data in server. The main reson for this is that multiple users tends to store similar files in the server . Here multiple users store multiple files in the server. Unfortunately the redundant data not only consume significant it esurses but also occupy bandwidth for this puposes the data deduplication is required . 
To overcome from all these problems we create the sampling similarity based identification algorithm for compression of Unicode data content in server. search techniques perform comparably despite contrary claims in the literature. During my evaluation of search effectiveness, I were surprised by the difficulty I had searching my data sets. In particular, straightforward implementations of many search techniques server not scale to databases with hundreds of thousands of tuples, which forced us to write “lazy” versions of their core algorithms and reduce their memory footprint. Even then, I were surprised by the excessive runtime of many search techniques. In the present data warehousing environment schemes there are lots of issues in server computing. Some advanced data manipulating schemes are required to extending the dats search paradigm to relational data has been an active area of research within the database and information retrieval community. It proves that the effectiveness of performance in retrieval tasks and data maintaining procedures. The outcome confirms previous claims regarding the unacceptable performance of these systems and underscores the need for standardization as exemplified by the IR community when evaluating these retrieval systems. Position Aware similarity identification algorithms belong to I/O bound and CPU bound tasks. Calculating the Unicode of similar files requires lots of CPU Corresponding cycles, the computing increases with the growth of data sets Position Aware similarity identify algorithms normally require a large amount of time for detecting the similarity, which results in long delays and if there is large data sets. it require more time This makes it difficult to apply the algorithms to some applications.
  • 2. International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169 Volume: 6 Issue: 6 254 - 259 ______________________________________________________________________________________ 255 IJRITCC | June 2018, Available @ http://www.ijritcc.org _______________________________________________________________________________________ In this paper, we propose a Position-Aware Similarity (PAS) identification algorithm to detect the similar files in large data sets. This method is very effective in dealing with file modification when performing similarity detection. And here we use Simhash algorithm called in terms of precision and recall. Furthermore, the time overhead, CPU and memory occupation of PAS are much less than that of simhash. This is because the overhead of PAS is relatively stable. It is not increases with the growth of data size. The remainder of this paper is organized as follows: we present related work in section 2. In section 3 we describe some background knowledge. Section 4 introduces the basic idea of PAS algorithm.Section 5 shows Sampling Based similarity identification. Section 6 shows the evaluation results of PAS algorithm. Section 7 shows Similarity Identification Techniques Work Section 8 draws conclusions and Future use. II. RELATED WORK In related work it involved the Server technique and similarity detection algorithm to avoid the redundant data in server using Unicode data content. Here we design the sampling based similarity approach for the detection of data similarity and perform the various task like upload file and delete files In the past decade, a lot of research efforts have been invested in identifying data similarity.Which we explain below. The first one is similar web page detection with web search engine. Detecting and removing similar web pages can save network bandwidth, reduce storage consumption, and improve the quality of web search engine index. Andrei et al. [17], [20] proposed a similar web page detection technique called Shingle algorithm which utilizes set operation to detect similarity. Shingle is a typical sampling based approach employed to identify similar web pages. In order to reduce the size of shingle, Andrei presented Modm and Mins sampling methods. This algorithm is applied to AltaVista web search engine at present. Manku et al. [21] applied a Simhash algorithm to detect similarity in web documents belonging to a multi-billion page repository. Simhash algorithm practically runs at Google web search engine combining with Google file system [22] and MapReduce [23] to achieve batch queries. Elsayed et al. [24] presented a MapReduce algorithm for computing pairwise document similarity in large document collections. The second one is similar file detection in storage systems. In storage systems, data similarity detection and encoding play a crucial role in improving the resource utilization. Forman [25] presented an approach for finding similar files and applied the method to document repositories.This approach brings a great reduction in storage space consumption. Ouyang [26] presented a large-scale file compression technique based on cluster by using Shingle similarity detection technique. Ouyang uses Min-wise [27] sampling method to reduce the overhead of Shingle algorithm. Han et al. [28] presented fuzzy file block matching technique, which was first proposed for opportunistic use of content addressable storage. 
Fuzzy file block matching technique employs Shingle to represent the fuzzy hashing of file blocks for similarity detection. It uses Mins sampling method to decrease the overhead of shingling algorithm. The third one is plagiarism detection. Digital information can be easily copied and retransmitted. This feature causes owners copyright be easily violated. In purpose of protecting copyright and other related rights, we need plagiarism detection. Baker [29] described a program called dup which can be used to locate instances of duplication or near duplication in a software. Shivakumar [30] presented data structures for finding overlap between documents and implemented these data structures in SCAM. The forth one is remote file backup. Traditional remote file backup approaches take high bandwidth and consume a lot of resources. Applying similarity detection to remote file backup can greatly reduce bandwidth consumption. Teodosiu et al. [15] proposed a Traits algorithm to find out the client files which are similar to a given server file. Teodosiu implemented this algorithm in DFSR. Experimental results suggest that these optimizations may help reduce the bandwidth required to transfer file updates across a network. Muthitacharoen et al. [5] presented LBFS which exploits similarity between files or versions of the same file to save bandwidth. Cox et al. [31] presented a similaritybased mechanism for locating a single source file to perform peer-to- peer backup. They implemented a system prototype called Pastiche. The fifth one is the similarity detection for specific domains. Hua et al. [32] explored and exploited data similarity which supports efficient data placement for cloud. They designed a novel multi-core-enabled and localitysensitive hashing that can accurately capture the differentiated similarity across data. Biswas et al. [33] proposed a cache architecture called Mergeable. Mergeable detectsdata similarities and merges cache blocks so as to decrease cache storage requirements. Experimental evaluation suggested that Mergeable reduces off-chip memory accesses and overall power usage. Vernica et al. [34] proposed a three-stage approach for end-to-end setsimilarity joins in parallel using the popular MapReduce framework. Deng et al. [35] proposed a MapReducebased framework Massjoin for scalable string similarity joins. The approach achieves both set-based similarity functions and character-based similarity
  • 3. International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169 Volume: 6 Issue: 6 254 - 259 ______________________________________________________________________________________ 256 IJRITCC | June 2018, Available @ http://www.ijritcc.org _______________________________________________________________________________________ functions. Most of the above work focus on a specific application scenario, and the computational or similarity detection overhead are increased with the growth of data volume. In addition, the similarity detection metric may not be able to well measure the similarity between two files. Therefore, this paper proposes an EPAS algorithm and a new similarity detection metric to identify file similarity for the cloud. According to the analysis (see Section 4 and 5.3) and experimental results, it illustrates that the proposed similarity metric catch the similarity between metric catch the similarity between two files more accuratelythat that of traditional metric. Furthermore, the overhead of EPAS is fixed and minimized in contrast to previous work. III. BACKGROUND Here We used the hash key which is based on counting the occurrences of certain Unicode strings within a file. The keys, along with certain intermediate data, are stored in a relational database (Figure 1).A separate program then query the database for keys with similar values, and outputs the results. Our code was written in PHP, and developed simultaneously for the Windows platforms. While the code runs equally well on platforms, we used a Windows machine for primary development and for most data collection. File Unicode data Figure 1: SimHash produces two levels of file similarity data. the key function was implemented to account for file extensions. Here We can not say that two files are different if they contain different extensions. Assign different values to any two files with different extension. To this we compute a hash to file extension with value between 0 and 1.if there is same extension then this will not affect the relation between files with same extension . I. POSITION-AWARE SIMILARITY ALGORITH In server if there is large similar data set with less overhead. Then we can use the similarity detection algorithm that is PAS. Here we use different symbols which are given in the table bellow. A. Traditional sampling algorithm Suppose we sample N data blocks of file A, each data block sizing Lenc is injected to a hash function. We then can obtain N fingerprint values that are collected as a fingerprint set SigA(N; Lenc). In this scenario, similarity detection problem can be transformed into a set intersection problem. By analogy, we will have a fingerprint set SigB(N; Lenc) of file According to equation (1), Table 1: Symbols used in following FIGURE. 2: THE SAMPLING POSITIONS OF TSA AND PAS the degree of similarity between file A and file B can be described as equation (1), where Sim(A;B) ranges s between 0 and 1. If Sim(A;B) is reaching 1, it means the similarity of file A and file B are very high, vice verse. After selecting a threshold of the similarity, we can determine that file A is similar to file B when Sim(A;B) is satisfied. This TSA is described in algorithm 1 by using pseudo-code. TSA is simple, but it is very sensitive to file modifications. A small modification would cause the sampling positions shifted, thus resulting a failure. Suppose we have a file A sizing 56KB. We sample 6 data blocks and each data block sizes 1KB. 
According to Algorithm 1, file A has N= 6; Lenc = 1KB; FileSize = 56KB; LenR = 10KB. If we add 5KB data to file A to form file B, file Bwill have N = 6; Lenc = 1KB; FileSize = 61KB; LenR = 11KB in terms of algorithm.
  • 4. International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169 Volume: 6 Issue: 6 254 - 259 ______________________________________________________________________________________ 257 IJRITCC | June 2018, Available @ http://www.ijritcc.org _______________________________________________________________________________________ Adding 5KB data to file A has three situations including the begging, the middle, and the end of the file A. File B1, B2, and B3 in figure 2(a) represent these three different situations. We can find that the above file modifications cause the sampling position shifted and result in an inaccuracy of similarity detection. For example, the six sampling positions of file A are 0KB, 11KB, 22KB, 33KB, 44KB, and 55KB ((1 �1) _ (1 + 10) = 0KB; (2�1) _ (1 +10) = 11KB; (3 �1) _ (1 + 10) = 22KB; (4 �1) (1 +10) = 33KB; (5 �1) _ (1 + 10) = 44KB; (6 �1) _ (1 + 10) = 55KB), respectively. However, due to the added 5KB data, the six sampling positions of file B1,B2, and B3 are shifted to 0KB; 12KB; 24KB; 36KB; 48KB, and 60KB((1 �1) _ (1 + 11) =0KB; (2 �1) _ (1 + 11) = 12KB; (3 �1) _ (1 + 11) = 24KB; (4 �1) _ (1 + 11) =36KB; (5 �1) _ (1 + 11) = 48KB; (6 �1) _ (1 + 11) = 60KB), respectively. Although the Sim(A;B) is far from actual value when using TSA, the sampling method is very simple and takes much less overhead in contrast to the shingle algorithm and simhash algorithm. B. PAS algorithm FPP[17] exploits prefetching fingerprints belonging to the same file by leveraging file similarity, thus improving the performance of data deduplication systems. The experimental results suggest that FPP increases cache hit ratio and reduces the number of disk accesses greatly. FPP samples three data blocks in the beginning, the middle, and the end of files to determine that a forthcoming file is similar to the files stored in the backed storage system, by using the TSA. This method is sample and effective. However, as explained in section 4.1, a single bit modification would result in a failure. Therefore, PAS is proposed to solve this problem. II. CONCLUSION AND FUTURE SCOPE A. Conclusion In this paper , Overall we will study all the existing techniques which is available in market. Each system has some advantages and some disadvantages. Any existing system cannot fulfill all the requirement of Server search. They require more space and time; also some techniques are limited for particular dataset. We Proposed an algorithm PAS to identify the file similarity of Unicode data in large data Set.Here many experiments are performed to select the parameters of PAS. PAS is very effective in detecting file similarity in contrast in similarity identification algorithm called Simhas. PAS required less time than Simhash. The Proposed technique is satisfying number of requirement of server search using different algorithms. It also shows the ranking of character value and not requires the knowledge of database queries. Compare to existing algorithm it is a fast process. B. Future Scope As a future work we can search the techniques which are useful for all the datasets, means only single technique can be use. Further research is necessary to investigate the experimental design decisions that have a significant impact on the evaluation of server search. Sampling based similarity identification opens up several directions for future work. 
The first one is using content based chunk algorithm to sample data blocks, since this approach can avoid content shifting incurred by data modification. The second one is employing file metadata to optimize the similarity detection. This is because the file size and type which are contained in the metadata of similar file are normally very close. Evaluate to presented systems it is a fast process and the Techniques are implausible to have performance characteristics that are similar to existing systems but be required to be used if cloud search systems are to scale to great datasets. The memory exploitation during a search has not been the focus of any earlier assessment ACKNOWLEDGMENT The preferred spelling of the word “acknowledgment” in America is without an “e” after the “g”. Avoid the stilted expression, “One of us (R.B.G.) thanks . . .” Instead, try
REFERENCES
[1] "storage architecture for big data in cloud environment," in Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing. IEEE, 2013, pp. 476–480.
[2] J. Gantz and D. Reinsel, "The digital universe decade - are you ready," IDC iView, 2010.
[3] H. Biggar, "Experiencing data de-duplication: Improving efficiency and reducing capacity requirements," The Enterprise Strategy Group, 2007.
[4] F. Guo and P. Efstathopoulos, "Building a high performance deduplication system," in Proceedings of the 2011 USENIX Annual Technical Conference. USENIX Association, 2011, pp. 25–25.
[5] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in ACM SIGOPS Operating Systems Review, vol. 35, no. 5. ACM, 2001, pp. 174–187.
[6] B. Zhu, K. Li, and R. H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system," in FAST, vol. 8, 2008, pp. 1–14.
[7] Y. Deng, "What is the future of disk drives, death or rebirth?" ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 23, 2011.
[8] C. Wu, X. Lin, D. Yu, W. Xu, and L. Li, "End-to-end delay minimization for scientific workflows in clouds under budget constraint," IEEE Transactions on Cloud Computing (TCC), vol. 3, pp. 169–181, 2014.
[9] G. Linden, "Make data useful," http://home.blarg.net/_glinden/StanfordDataMining.2006-11-29.ppt, 2006.
[10] R. Kohavi, R. M. Henne, and D. Sommerfield, "Practical guide to controlled experiments on the web: listen to your customers not to the hippo," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007, pp. 959–967.
[11] J. Hamilton, "The cost of latency," Perspectives Blog, 2009.
[12] D. Bhagwat, K. Eshghi, D. D. Long, and M. Lillibridge, "Extreme binning: Scalable, parallel deduplication for chunk-based file backup," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS'09), IEEE International Symposium on. IEEE, 2009, pp.
[13] W. Xia, H. Jiang, D. Feng, and Y. Hua, "Silo: a similarity-locality based near-exact deduplication scheme with low ram overhead and high throughput," in Proceedings of the 2011 USENIX Annual Technical Conference. USENIX Association, 2011, pp. 26–28.
[14] Y. Fu, H. Jiang, and N. Xiao, "A scalable inline cluster deduplication framework for big data protection," in Middleware 2012. Springer, 2012, pp. 354–373.
[15] D. Teodosiu, N. Bjorner, Y. Gurevich, M. Manasse, and J. Porkka, "Optimizing file replication over limited bandwidth networks using remote differential compression," Microsoft Research TR-2006-157, 2006.
[16] Y. Zhou, Y. Deng, and J. Xie, "Leverage similarity and locality to enhance fingerprint prefetching of data deduplication," in Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems. Springer, 2014.
[17] A. Z. Broder, "On the resemblance and containment of documents," in Compression and Complexity of Sequences 1997, Proceedings. IEEE, 1997, pp. 21–29.
[18] G. S. Manku, A. Jain, and A. Das Sarma, "Detecting near-duplicates for web crawling," in Proceedings of the 16th International Conference on World Wide Web. ACM, 2007, pp. 141–150.
[19] L. Song, Y. Deng, and J. Xie, "Exploiting fingerprint prefetching to improve the performance of data deduplication," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications. IEEE, 2013.
[20] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the web," Computer Networks and ISDN Systems, vol. 29, no. 8, pp. 1157–1166, 1997.
[21] M. S. Charikar, "Similarity estimation techniques from rounding algorithms," in Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing. ACM, 2002, pp. 380–388.
[22] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
[23] J. Dean and S. Ghemawat, "Mapreduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[24] T. Elsayed, J. Lin, and D. W. Oard, "Pairwise document similarity in large collections with MapReduce," in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. Association for Computational Linguistics, 2008, pp. 265–268.
[25] G. Forman, K. Eshghi, and S. Chiocchetti, "Finding similar files in large document repositories," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 2005, pp. 394–400.
[26] Z. Ouyang, N. Memon, T. Suel, and D. Trendafilov, "Cluster-based delta compression of a collection of files," in Web Information Systems Engineering (WISE 2002), Proceedings of the Third International Conference on. IEEE, 2002, pp. 257–266.
[27] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, "Min-wise independent permutations," Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630–659, 2000.
[28] B. Han and P. J. Keleher, "Implementation and performance evaluation of fuzzy file block matching," in USENIX Annual Technical Conference, 2007, pp. 199–204.
[29] B. S. Baker, "On finding duplication and near-duplication in large software systems," in Reverse Engineering, Proceedings of the 2nd Working Conference on. IEEE, 1995, pp. 86–95.
[30] N. Shivakumar and H. Garcia-Molina, "Building a scalable and accurate copy detection mechanism," in Proceedings of the First ACM International Conference on Digital Libraries. ACM, 1996, pp. 160–168.