IDL - International Digital Library Of
Technology & Research
Volume 1, Issue 5, May 2017 Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Secure and Efficient Client and Server Side
Data Deduplication to Reduce Storage in
Remote Cloud Computing Systems
Hemanth Chandra N1, Sahana D. Gowda2
Dept. of Computer Science
1 B.N.M Institute of Technology, Bangalore, India
2 B.N.M Institute of Technology, Bangalore, India
Abstract: Duplication of data in storage systems is
becoming an increasingly common problem. The system
introduces I/O Deduplication, a storage optimization
that utilizes content similarity to improve I/O
performance by eliminating I/O operations, reducing
the mechanical delays during I/O operations, and
sharing data with existing users when duplicates are
found on the client or server side. I/O Deduplication
consists of three main techniques: content-based
caching, dynamic replica retrieval and selective
duplication. Each of these techniques is motivated by
our observations with I/O workload traces obtained
from actively-used production storage systems, all of
which revealed surprisingly high levels of content
similarity for both stored and accessed data.
Keywords: Deduplication, POD, Data Redundancy,
Storage Optimization, iDedup, I/O Deduplication,
I/O Performance.
1. INTRODUCTION
Duplication of data in primary storage systems is quite
common due to the technological trends that have been
driving storage capacity consolidation. The elimination
of duplicate content at both the file and block levels
for improving storage space utilization is an active
area of research. Indeed, eliminating most duplicate
content is inevitable in capacity-sensitive applications
such as archival storage for cost effectiveness. On the
other hand, there exist systems with a moderate degree
of content similarity in their primary storage such as
email servers, virtualized servers and NAS devices
running file and version control servers. In the case of
email servers, mailing lists, circulated attachments and
SPAM can lead to duplication. Virtual machines may
run similar software and thus create collocated
duplicate content across their virtual disks. Finally, file
and version control servers of collaborative
groups often store copies of the same documents,
sources and executables. In such systems, if the degree
of content similarity is not overwhelming, eliminating
duplicate data may not be a primary concern.
Gray and Shenoy [1] have pointed out that given the
technology trends for price-capacity and price-
performance of memory/disk sizes and disk accesses
respectively, disk data must “cool” at the rate of 10X
per decade. They suggest data replication as a means
to this end. An instantiation of this suggestion is
intrinsic replication of data created due to
consolidation as seen now in many storage systems,
including the ones illustrated earlier. Here, the term
refers to intrinsic (or application/user generated)
data replication, as opposed to forced (system
generated) redundancy such as in a RAID-1 storage
system. In such systems, capacity constraints are
invariably secondary to I/O performance.
On-disk duplication of content and I/O traces obtained
from three varied production systems at Florida
International University (FIU), which included a
virtualized host running two department web-servers,
the department email server, and a file server for our
research group, have been analyzed. Three observations
have been made from the analysis of these traces.
First, our analysis revealed significant levels of both
disk static similarity and workload static similarity
within each of these systems. Disk static similarity is
an indicator of the amount of duplicate content in the
storage medium, while workload static similarity
indicates the degree of on-disk duplicate content
accessed by the I/O workload. These similarity
measures have been defined formally in § 2. Second, a
consistent and marked discrepancy was discovered
between reuse distances for sector and content in the
I/O accesses on these systems, indicating that content
is reused more frequently than sectors. Third, there is
significant overlap in content accessed over successive
intervals of longer time-frames such as days or weeks.
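As a rough illustration (an assumption about methodology, not the authors' actual measurement tooling), disk static similarity can be approximated by hashing fixed-size blocks and computing the fraction of blocks whose content appears more than once, while workload static similarity restricts that fraction to the blocks touched by a trace. The block size, hash function, and trace representation below are illustrative.

import hashlib
from collections import Counter

BLOCK_SIZE = 4096  # illustrative block size

def block_hashes(device_path):
    """Yield a SHA-1 digest for every fixed-size block on the device/image."""
    with open(device_path, "rb") as dev:
        while True:
            block = dev.read(BLOCK_SIZE)
            if len(block) < BLOCK_SIZE:
                break
            yield hashlib.sha1(block).hexdigest()

def disk_static_similarity(device_path):
    """Fraction of on-disk blocks whose content occurs more than once."""
    counts = Counter(block_hashes(device_path))
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total if total else 0.0

def workload_static_similarity(device_path, accessed_blocks):
    """Same fraction, restricted to block numbers touched by an I/O trace."""
    digests = list(block_hashes(device_path))
    counts = Counter(digests)
    touched = [digests[b] for b in accessed_blocks if b < len(digests)]
    if not touched:
        return 0.0
    duplicated = sum(1 for d in touched if counts[d] > 1)
    return duplicated / len(touched)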
Based on these observations, the premise explored is
that intrinsic content similarity in storage systems
and access to replicated content within I/O workloads
can both be utilized to improve I/O performance. In
doing so, a storage optimization that utilizes content
similarity to either eliminate I/O operations altogether
or optimize the resulting disk head movement within
the storage system is designed and evaluated as I/O
Deduplication. I/O Deduplication comprises three key
techniques: (i) content-based caching, which uses the
popularity of “data content” rather than “data location”
of I/O accesses in making caching decisions;
(ii) dynamic replica retrieval, which, upon a cache miss
for a read operation, dynamically chooses to retrieve
the content replica that minimizes disk head movement;
and (iii) selective duplication, which dynamically
replicates frequently accessed content in scratch space
distributed over the entire storage medium to
increase the effectiveness of dynamic replica retrieval.
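A minimal sketch of the content-based caching idea is given below; it is an illustration under assumed data structures, not the authors' implementation. Blocks are cached by a digest of their contents, with a separate sector-to-digest map, so two sectors holding identical data consume a single cache entry and a read of either can be served from it.

import hashlib
from collections import OrderedDict

class ContentCache:
    """Cache block data by content digest instead of by sector (simplified LRU)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.by_digest = OrderedDict()   # digest -> block data, in LRU order
        self.sector_to_digest = {}       # sector -> digest of its last known content

    def lookup(self, sector):
        """Return cached data for a sector if its content is cached, else None."""
        digest = self.sector_to_digest.get(sector)
        if digest is None or digest not in self.by_digest:
            return None
        self.by_digest.move_to_end(digest)        # refresh LRU position
        return self.by_digest[digest]

    def insert(self, sector, data):
        """Record sector contents (on read completion or write); identical
        contents share a single cache entry."""
        digest = hashlib.sha1(data).hexdigest()
        self.sector_to_digest[sector] = digest
        if digest in self.by_digest:
            self.by_digest.move_to_end(digest)
            return
        self.by_digest[digest] = data
        if len(self.by_digest) > self.capacity:
            self.by_digest.popitem(last=False)    # evict least recently used content

In this form, cache hit rates track the popularity of content rather than of sectors, which is the property the content-based caching technique exploits.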
2. RELATED WORK
This paper is related to works on Deduplication
concepts in primary memory. Nimrod Megiddo and
Dharmendra S. Modha [20] discussed the
problem of cache management in a demand paging
scenario with uniform page sizes. They proposed a new
cache management policy, Adaptive Replacement Cache
(ARC), which has several
advantages. In response to evolving and changing
access patterns, ARC dynamically, adaptively and
continually balances between the recency and
frequency components in an online and self tuning
fashion. The policy ARC uses a learning rule to
adaptively and continually revise its assumptions
about the workload. The policy ARC is empirically
universal, that is, it empirically performs as well as a
certain fixed replacement policy, even when the latter
uses the best workload-specific tuning parameter that
was selected in an offline fashion. Consequently, ARC
works uniformly well across varied workloads and
cache sizes without any need for workload specific a
priori knowledge or tuning. Various policies such as
LRU-2, 2Q, LRFU and LIRS require user-defined
parameters, and unfortunately, no single choice works
uniformly well across different workloads and cache
sizes. The policy ARC is simple-to-implement and
like LRU, has constant complexity per request. In
comparison, policies LRU-2 and LRFU both require
logarithmic time complexity in the cache size. The
policy ARC is scan-resistant: it allows one-time
sequential requests to pass through without polluting
the cache. On real-life traces drawn from numerous
domains, ARC leads to substantial performance gains
over LRU for a wide range of cache sizes.
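The sketch below only illustrates the recency/frequency split that ARC balances; the published ARC algorithm additionally maintains ghost lists of recently evicted keys and a self-tuning target size, which are omitted here, so this is a simplified segmented cache rather than ARC itself.

from collections import OrderedDict

class TwoSegmentCache:
    """Simplified recency/frequency cache (not the full ARC algorithm)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.recent = OrderedDict()    # keys seen once recently (recency segment)
        self.frequent = OrderedDict()  # keys seen at least twice (frequency segment)

    def get(self, key):
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return self.frequent[key]
        if key in self.recent:
            # Second access: promote from the recency to the frequency segment.
            value = self.recent.pop(key)
            self.frequent[key] = value
            return value
        return None

    def put(self, key, value):
        if key in self.frequent or key in self.recent:
            self.get(key)              # refresh position / promote on re-access
            (self.frequent if key in self.frequent else self.recent)[key] = value
            return
        self.recent[key] = value
        # Evict from the recency segment first, so one-time sequential scans
        # pass through without flushing frequently reused items.
        while len(self.recent) + len(self.frequent) > self.capacity:
            victim = self.recent if self.recent else self.frequent
            victim.popitem(last=False)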
Aayush Gupta et al. [22] have designed CA-SSD, which
employs content-addressable storage (CAS) to exploit
such locality. Our CA-SSD design employs
enhancements primarily in the flash translation layer
(FTL) with minimal additional hardware, suggesting
its feasibility. Using three real-world workloads with
content information, statistical characterizations of two
aspects of value locality - value popularity and
temporal value locality - that form the foundation of
CA-SSD are devised. CA-SSD is able to reduce average
response times by about 59-84% compared to
traditional SSDs. Even for workloads with little or no
value locality, CA-SSD continues to offer comparable
performance to a traditional SSD. Our findings
advocate adoption of CAS in SSDs, paving the way
for a new generation of these devices.
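A hedged sketch of the content-addressable mapping that CA-SSD-style designs rely on is shown below; the structure and names are assumptions for illustration, not the CA-SSD FTL. Logical pages map to content digests and digests map to physical pages with reference counts, so a write of an already-stored value updates only the mapping.

import hashlib

class ContentAddressableFTL:
    """Toy content-addressable mapping: duplicate writes reuse a physical page."""

    def __init__(self):
        self.logical_to_digest = {}   # logical page -> content digest
        self.digest_to_phys = {}      # digest -> physical page number
        self.refcount = {}            # digest -> number of logical pages referencing it
        self.next_phys = 0

    def write(self, logical_page, data):
        digest = hashlib.sha1(data).hexdigest()
        old = self.logical_to_digest.get(logical_page)
        if old == digest:
            return self.digest_to_phys[digest]     # rewriting the same value: nothing to do
        if old is not None:
            self.refcount[old] -= 1                # drop the reference to the old value
        self.logical_to_digest[logical_page] = digest
        if digest in self.digest_to_phys:
            self.refcount[digest] += 1
            return self.digest_to_phys[digest]     # value already on flash: no new page
        phys = self.next_phys                      # new value: allocate a physical page
        self.next_phys += 1
        self.digest_to_phys[digest] = phys
        self.refcount[digest] = 1
        return phys
        # Pages whose refcount drops to zero could be garbage collected; omitted here.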
Chi Zhang, Xiang Yu, and Y. Wang [23] have answered
two key questions. First, since both eager-writing and
mirroring rely on extra capacity to deliver
performance improvements, how to satisfy competing
resource demands given a fixed amount of total disk
space? Second, since eager-writing allows data to be
dynamically located, how to exploit this high degree
of location independence in an intelligent disk
scheduler? In their work, the two key questions were
addressed and the performance of the resulting
EW-Array prototype was compared against that of
conventional approaches. The experimental results demonstrate that
the eager writing disk array is an effective approach to
providing scalable performance for an important class
of transaction processing applications.
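The placement freedom that eager writing exploits can be illustrated with a toy decision rule (assuming a one-dimensional seek model; real schedulers also consider rotational position and mirror copies): write the block to the free slot closest to the current head position.

def eager_write_target(head_position, free_slots):
    """Pick the free slot closest to the current head position (1-D seek model)."""
    if not free_slots:
        raise ValueError("no free slots available for eager writing")
    return min(free_slots, key=lambda slot: abs(slot - head_position))

# Example: with the head at track 120, prefer slot 118 over 40 or 300.
assert eager_write_target(120, [40, 118, 300]) == 118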
Kiran Srinivasan et al. [24] have proposed an inline
Deduplication solution, iDedup, for primary
workloads, while minimizing extra IOs and seeks. Our
algorithm is based on two key insights from real world
workloads: i) spatial locality exists in duplicated
primary data and ii) temporal locality exists in the
access patterns of duplicated data. Using the first
insight, only sequences of disk blocks were selectively
deduplicated. This reduces fragmentation and
amortizes the seeks caused by Deduplication. The
second insight allows us to replace the expensive, on-
disk, Deduplication metadata with a smaller, in-
memory cache. These techniques enable us to trade off
capacity savings for performance, as demonstrated in
our evaluation with real-world workloads. Our
evaluation shows that iDedup achieves 60-70% of the
maximum Deduplication with less than a 5% CPU
overhead and a 2-4% latency impact.
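The spatial-locality insight can be illustrated by a selection rule that deduplicates only runs of consecutive already-stored blocks at least a threshold long, so reads of deduplicated data remain largely sequential. The sketch below captures this selection rule only and is not the iDedup implementation.

def dedupable_runs(block_digests, index, threshold):
    """Return (start, length) of runs of consecutive already-stored blocks
    that are long enough to deduplicate without fragmenting reads."""
    runs, start = [], None
    for i, digest in enumerate(block_digests):
        if digest in index:                # block content is already stored somewhere
            if start is None:
                start = i
        else:
            if start is not None and i - start >= threshold:
                runs.append((start, i - start))
            start = None
    if start is not None and len(block_digests) - start >= threshold:
        runs.append((start, len(block_digests) - start))
    return runs

Runs shorter than the threshold are simply written as new data, trading some capacity savings for fewer seeks on later reads.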
Jiri Schindler et al. [26] introduce proximal I/O, a new
technique for improving random disk I/O performance
in file systems. The key enabling technology for
proximal I/O is the ability of disk drives to retire
multiple I/O’s, spread across dozens of tracks, in a
single revolution. Compared to traditional update-in-
place or write-anywhere file systems, this technique
can provide a nearly seven-fold improvement in
random I/O performance while maintaining (near)
sequential on-disk layout. This paper quantifies
proximal I/O performance and proposes a simple data
layout engine that uses a flash memory-based write
cache to aggregate random updates until they have
sufficient density to exploit proximal I/O. The results
show that with a cache of just 1% of the overall disk-
based storage capacity, it is possible to service 5.3 user
I/O requests per revolution for a random-update
workload. On an aged file system, the layout can
sustain serial read bandwidth within 3% of the best
case. Despite using flash memory, the overall system
cost is just one third of that of a system with the
requisite number of spindles to achieve the equivalent
number of random I/O operations.
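A rough sketch of the aggregation policy such a layout engine needs is given below; the region size and density threshold are assumptions for illustration, not the paper's design. Random updates accumulate in a flash-backed write cache grouped by disk neighborhood, and a neighborhood is flushed only once it holds enough blocks to retire several of them in one revolution.

from collections import defaultdict

class ProximalWriteCache:
    """Buffer random updates and flush a disk neighborhood only when dense enough."""

    def __init__(self, region_size, density_threshold, flush_fn):
        self.region_size = region_size           # sectors per neighborhood
        self.density_threshold = density_threshold
        self.flush_fn = flush_fn                 # callback: flush_fn(region, {sector: data})
        self.pending = defaultdict(dict)         # region id -> {sector: data}

    def write(self, sector, data):
        region = sector // self.region_size
        self.pending[region][sector] = data
        if len(self.pending[region]) >= self.density_threshold:
            # Enough co-located updates: one proximal flush retires them together.
            self.flush_fn(region, self.pending.pop(region))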
2.1 EXISTING SYSTEM
Existing data Deduplication schemes for
primary storage, such as iDedup and Offline-Dedupe,
are capacity oriented in that they focus on storage
capacity savings and only select the large requests to
deduplicate, bypassing all the small requests (e.g., 4
KB, 8 KB or less). The rationale is that the small I/O
requests only account for a tiny fraction of the storage
capacity requirement, making Deduplication on them
unprofitable and potentially counterproductive
considering the substantial Deduplication overhead
involved. However, previous workload studies have
revealed that small files dominate in primary storage
systems (more than 50 percent) and are at the root of the
system performance bottleneck. Furthermore, due to the
buffer effect, primary storage workloads exhibit obvious
I/O burstiness. To address the important performance
issue of primary storage in the Cloud and the above
Deduplication-induced problems, a Performance-
Oriented data Deduplication scheme, called POD, rather
than a capacity-oriented one (e.g., iDedup), is proposed
to improve the I/O performance of primary storage
systems in the Cloud. Figure 1 represents the architecture
of POD. By considering the workload characteristics,
POD takes a two-pronged approach to improving the
performance of primary storage systems and minimizing
performance overhead of Deduplication, namely, a
request-based selective Deduplication technique, called
Select-Dedupe, to alleviate data fragmentation, and an
adaptive memory management scheme, called iCache,
to ease the memory contention between the bursty read
traffic and the bursty write traffic.
2.2 PROPOSED SYSTEM (POD vs iDedup)
A possible future direction is to optionally coalesce or
even eliminate altogether write I/O operations for
content that is already duplicated elsewhere on the
disk, or alternatively direct such writes to alternate
locations in the scratch space.
Figure 1. Architecture diagram of POD.
While the first option might seem similar to data
Deduplication at a high-level, a primary focus on the
performance implications of such optimizations rather
than capacity improvements has been suggested. Any
optimization for writes affects the read-side
optimizations of I/O Deduplication, so a careful
analysis and evaluation of the trade-off points in this
design space is important. The proposed system shares data
with existing users when duplicates are found on the client or server side.
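For the client-side sharing mentioned above, a common pattern, sketched here with hypothetical names rather than this system's actual interface, is for the client to send only a content hash first; the server links the existing copy to the new owner on a hit and requests the full upload only on a miss.

import hashlib

class DedupServer:
    """Toy server-side store: identical content is kept once and shared by owners."""

    def __init__(self):
        self.blobs = {}    # digest -> data
        self.owners = {}   # digest -> set of user ids sharing that content

    def has(self, digest):
        return digest in self.blobs

    def link(self, digest, user):
        self.owners.setdefault(digest, set()).add(user)

    def upload(self, digest, data, user):
        self.blobs[digest] = data
        self.link(digest, user)

def client_store(server, user, data):
    """Client-side deduplication: send the hash first, upload only on a miss."""
    digest = hashlib.sha256(data).hexdigest()
    if server.has(digest):
        server.link(digest, user)      # duplicate found: share the existing copy
        return "deduplicated"
    server.upload(digest, data, user)  # new content: transfer the full data
    return "uploaded"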
Advantages:
1. Requires less space in the storage server.
2. Data Deduplication is performed completely across all blocks.
3. IMPLEMENTATION
POD
In this paper, the system proposes POD, a
performance-oriented Deduplication scheme, to
improve the performance of primary storage systems
in the Cloud by leveraging data Deduplication on the
I/O path to remove redundant write requests while also
saving storage space. Figure 2 shows the sequence
diagram, which illustrates how control flows between
the modules. POD takes a
request-based selective Deduplication approach
(Select-Dedupe) to Deduplicating the I/O redundancy
on the critical I/O path in such a way that it minimizes
the data fragmentation problem. Meanwhile, an
intelligent cache management scheme (iCache) is employed in
POD to further improve read performance and
increase space saving, by adapting to I/O burstiness.
Our extensive trace-driven evaluations show that POD
significantly improves the performance and saves the
capacity of primary storage systems in the Cloud.
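A hedged sketch of removing redundant writes on the I/O path follows; it shows the general idea rather than POD's Select-Dedupe implementation. Each incoming write is fingerprinted, and if the fingerprint is already indexed the request is satisfied by a metadata update alone; otherwise the data is written and indexed.

import hashlib

class WritePathDedup:
    """Remove redundant write requests on the I/O path (simplified sketch)."""

    def __init__(self, storage):
        self.storage = storage        # assumed object exposing write_block(addr, data)
        self.index = {}               # fingerprint -> physical address
        self.mapping = {}             # logical address -> physical address
        self.next_phys = 0

    def write(self, logical_addr, data):
        fingerprint = hashlib.sha1(data).hexdigest()
        phys = self.index.get(fingerprint)
        if phys is not None:
            self.mapping[logical_addr] = phys     # redundant write: metadata update only
            return False                          # no device I/O issued
        phys = self.next_phys
        self.next_phys += 1
        self.storage.write_block(phys, data)      # unique data: perform the write
        self.index[fingerprint] = phys
        self.mapping[logical_addr] = phys
        return True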
Figure 2. Sequence diagram of POD.
iDedup
This section describes iDedup, an inline Deduplication
system specifically targeting latency-sensitive,
primary storage workloads, and compares it with newer
techniques such as POD, also referred to here as I/O
Deduplication. With latency-sensitive
workloads, inline Deduplication has many
challenges: fragmentation leading to extra disk seeks
for reads, Deduplication processing overheads in the
critical path and extra latency caused by IOs for
Dedup-metadata management.
3.1 COMPARATIVE ANALYSIS
In this section, the performance of POD vs iDedup
Deduplication models is evaluated through extensive
trace-driven experiments.
Figure 3. Time delay performance of the different Deduplication schemes: (a) time delay in iDedup; (b) time delay in POD.
A prototype of POD has been implemented as a module
in the Linux operating system, and trace-driven
experiments are used to evaluate its effectiveness and
efficiency. In this paper, POD is compared with the
capacity-oriented scheme iDedup. The two models
(POD vs. iDedup) are compared on two parameters:
time delay and CPU utilization.
Figure 4. CPU utilization performance of the different Deduplication schemes: (a) CPU utilization in iDedup; (b) CPU utilization in POD.
In the first comparison, time delay, four production
files were used for the trace-driven evaluation. Figure
3(a) shows the time delay for the iDedup model, and
Figure 3(b) shows the time delay for the same files in
the POD model. For both deduplication models, the
same files (cmain, owner, clog, search) were uploaded.
The results show that POD is more efficient than iDedup.
In the second comparison, CPU utilization, a single
file named abc.txt was uploaded to both the POD and
iDedup models. Figure 4(a) shows the CPU utilization
for the iDedup model and Figure 4(b) shows the CPU
utilization for the POD model. The results indicate
that POD is more efficient than the iDedup model.
4. CONCLUSION
System and storage consolidation trends are driving
increased duplication of data within storage systems.
Past efforts have been primarily directed towards the
elimination of such duplication for improving storage
capacity utilization. With I/O Deduplication, a
contrary view is taken that intrinsic duplication in a
class of systems which are not capacity-bound can be
effectively utilized to improve I/O performance – the
traditional Achilles’ heel for storage systems. Three
techniques contained within I/O Deduplication work
together to either optimize I/O operations or eliminate
them altogether. An in-depth evaluation of these
mechanisms revealed that together they reduced
average disk I/O times by 28-47%, a large
improvement, all of which can directly impact the
overall application-level performance of disk I/O
bound systems. The content-based caching mechanism
increased memory caching effectiveness by increasing
cache hit rates by 10% to 4x for read operations when
compared to traditional sector-based caching. Head-
position aware dynamic replica retrieval directed I/O
operations to alternate locations on-the-fly and
additionally reduced I/O times by 10-20%. Selective
duplication created additional replicas of popular
content during periods of low foreground I/O activity
and further improved the effectiveness of dynamic
replica retrieval by 23-35%.
FUTURE WORK
I/O Deduplication opens up several directions for
future work. One avenue for future work is to explore
content-based optimizations for write I/O operations.
A possible future direction is to optionally coalesce or
even eliminate altogether write I/O operations for
content that are already duplicated elsewhere on the
disk or alternatively direct such writes to alternate
locations in the scratch space. While the first option
might seem similar to data Deduplication at a high-
level, a primary focus on the performance implications
of such optimizations rather than capacity
improvements is suggested. Any optimization for
writes affects the read-side optimizations of I/O
Deduplication, so a careful analysis and evaluation of
the trade-off points in this design space is important.
The system shares data with existing users when
duplicates are found on the client or server side.
REFERENCES
[1] Jim Gray and Prashant Shenoy. Rules of Thumb in
Data Engineering. Proc. of the IEEE International
Conference on Data Engineering, February 2000.
[2] Charles B. Morrey III and Dirk Grunwald.
Peabody: The Time Travelling Disk. In Proc. of the
IEEE/NASA MSST, 2003.
[3] Windsor W. Hsu, Alan Jay Smith, and Honesty C.
Young. The Automatic Improvement of Locality in
Storage Systems. ACM Transactions on Computer
Systems, 23(4):424–473, Nov 2005.
[4] S. Quinlan and S. Dorward. Venti: A New
Approach to Archival Storage. Proc. of the USENIX
Annual Technical Conference on File and Storage
Technologies, January 2002.
[5] Cyril U. Orji and Jon A. Solworth. Doubly
distorted mirrors. In Proceedings of the ACM
SIGMOD, 1993.
[6] Medha Bhadkamkar, Jorge Guerra, Luis Useche,
Sam Burnett, Jason Liptak, Raju Rangaswami, and
Vagelis Hristidis. BORG: Block-reORGanization for
Selfoptimizing Storage Systems. In Proc. of the
USENIX Annual Technical Conference on File and
Storage Technologies, February 2009.
[7] Sergey Brin, James Davis, and Hector Garcia-
Molina. Copy Detection Mechanisms for Digital
Documents. In Proc. of ACM SIGMOD, May 1995.
[8] Austin Clements, Irfan Ahmad, Murali Vilayannur,
and Jinyuan Li. Decentralized Deduplication in SAN
cluster file systems. In Proc. of the USENIX Annual
Technical Conference, June 2009.
[9] Burton H. Bloom. Space/time trade-offs in hash
coding with allowable errors. Communications of the
ACM, 13(7):422–426, 1970.
[10] Daniel Ellard, Jonathan Ledlie, Pia Malkani, and
Margo Seltzer. Passive NFS Tracing of Email and
Research Workloads. In Proc. of the USENIX Annual
Technical Conference on File and Storage
Technologies, March 2003.
[11] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M.
Tracey. Redundancy Elimination Within Large
Collections of Files. Proc. of the USENIX Annual
Technical Conference, 2004.
[12] Binny S. Gill. On multi-level exclusive caching:
offline optimality and why promotions are better than
demotions. In Proc. of the USENIX Annual Technical
Conference on File and Storage Technologies,
February 2008.
[13] Diwaker Gupta, Sangmin Lee, Michael Vrable,
Stefan Savage, Alex C. Snoeren, George Varghese,
Geoffrey Voelker, and Amin Vahdat. Difference
Engine: Harnessing Memory Redundancy in Virtual
Machines. Proc. Of the USENIX OSDI, December
2008.
[14] Hai Huang, Wanda Hung, and Kang G. Shin.
FS2: Dynamic Data Replication In Free Disk Space
For Improving Disk Performance And Energy
Consumption. In Proc. of the ACM SOSP, October
2005.
[15] Jorge Guerra, Luis Useche, Medha Bhadkamkar,
Ricardo Koller, and Raju Rangaswami. The Case for
Active Block Layer Extensions. ACM Operating
Systems Review, 42(6), October 2008.
[16] N. Jain, M. Dahlin, and R. Tewari. TAPER:
Tiered Approach for Eliminating Redundancy in
Replica Synchronization. In Proc. of the USENIX
Conference on File and Storage Systems, 2005.
[17] Song Jiang, Feng Chen, and Xiaodong Zhang.
Clock-pro: An effective improvement of the clock
replacement. In Proc. of the USENIX Annual
Technical Conference, April 2005.
[18] Andrew Leung, Shankar Pasupathy, Garth
Goodson and Ethan Miller. Measurement and Analysis
of Large-Scale Network File System Workloads. Proc.
of the USENIX Annual Technical Conference, June
2008.
[19] Xuhui Li, Ashraf Aboulnaga, Kenneth Salem,
Aamer Sachedina, and Shaobo Gao. Second-tier cache
management using write hints. In Proc. of the
USENIX Annual Technical Conference on File and
Storage Technologies, 2005.
[20] Nimrod Megiddo and D. S. Modha. ARC: A self-
tuning, low overhead replacement cache. In Proc. of
USENIX Annual Technical Conference on File and
Storage Technologies, 2003.
[21] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L.
Traiger. Evaluation techniques for storage hierarchies.
IBM Systems Journal, 9(2):78–117, 1970.
[22] Aayush Gupta, Raghav Pisolkar, Bhuvan
Urgaonkar, and Anand Sivasubramaniam. Leveraging
Value Locality in Optimizing NAND Flash-based
SSDs, 2011.
[23] Chi Zhang, Xiang Yu, Y. Wang. Configuring and
Scheduling an Eager-Writing Disk Array for a
Transaction Processing Workload. 2002.
[24] Kiran Srinivasan, Tim Bisson, Garth Goodson,
Kaladhar Voruganti. iDedup: Latency-aware, Inline
Data Deduplication for Primary Storage. 2012.
[25] Dina Bitton and Jim Gray. Disk Shadowing. In
Proc. Of the International Conference on Very Large
Data Bases, 1988.
[26] Jiri Schindler, Sandip Shete, Keith A. Smith.
Improving Throughput for Small Disk Requests with
Proximal I/O. 2011.
[27] Mark Lillibridge, Kave Eshghi, Deepavali
Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter
Camble. Sparse indexing: large scale, inline
deduplication using sampling and locality. In Proc. of
the USENIX Annual Technical Conference on File
and Storage Technologies, February 2009.
