MCS10 204(A)
Data Compression
Chapter: Introduction
Manish T I
Associate Professor
Department of CSE
MET’s School of Engineering, Mala
E-mail: manishti2004@gmail.com
Definition
• Data compression is the process of converting an
input data stream (the source stream or the original
raw data) into another data stream (the output, or
the compressed, stream) that has a smaller size. A
stream is either a file or a buffer in memory.
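As a minimal illustration (not from the text), here is a run-length encoding sketch in Python that converts an input stream into a smaller output stream; the function names are our own.

    def rle_encode(data: bytes) -> bytes:
        """Encode runs of repeated bytes as (count, byte) pairs."""
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out.append(run)       # run length (1..255)
            out.append(data[i])   # the repeated byte value
            i += run
        return bytes(out)

    def rle_decode(data: bytes) -> bytes:
        """Invert rle_encode by expanding each (count, byte) pair."""
        out = bytearray()
        for count, value in zip(data[::2], data[1::2]):
            out.extend([value] * count)
        return bytes(out)

    raw = b"aaaaaaaabbbbcc"            # 14 bytes in the input stream
    packed = rle_encode(raw)           # 6 bytes in the output stream
    assert rle_decode(packed) == raw   # lossless: the original is recovered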
Data compression is popular for two reasons:
(1) People like to accumulate data and hate to throw
anything away.
(2) People hate to wait a long time for data transfers.
Data compression is often called source coding.
The input symbols are emitted by a certain information source
and have to be coded before being sent to their destination. The
source can be memoryless or it can have memory.
In the former case, each symbol is independent of its predecessors. In
the latter case, each symbol depends on some of its predecessors
and, perhaps, also on its successors, so they are correlated.
A memoryless source is also termed “independent and identically
distributed,” or IID.
• The compressor or encoder is the program that compresses
the raw data in the input stream and creates an output
stream with compressed (low-redundancy) data.
• The decompressor or decoder converts in the opposite
direction.
• The term codec is sometimes used to describe both the
encoder and decoder.
• “Stream” is a more general term because the compressed
data may be transmitted directly to the decoder, instead of
being written to a file and saved.
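As a concrete encoder/decoder pair, the following sketch uses Python's standard zlib codec; any lossless codec would serve the illustration equally well.

    import zlib

    raw = b"the quick brown fox " * 100      # a highly redundant input stream
    compressed = zlib.compress(raw)          # encoder: raw -> low-redundancy data
    restored = zlib.decompress(compressed)   # decoder: the opposite direction

    assert restored == raw
    print(len(raw), "->", len(compressed), "bytes")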
• A non-adaptive compression method is rigid and does not modify
its operations, its parameters, or its tables in response to the
particular data being compressed.
• In contrast, an adaptive method examines the raw data and
modifies its operations and/or its parameters accordingly (the
sketch after this list illustrates the idea).
• In a 2-pass algorithm, the first pass reads the input stream to
collect statistics on the data to be compressed, and the second
pass does the actual compressing using parameters set by the
first pass. Such a method may be called semi-adaptive.
• A data compression method can also be locally adaptive,
meaning it adapts itself to local conditions in the input stream
and varies this adaptation as it moves from area to area in the
input.
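A minimal sketch of the adaptive idea: the model (here, symbol counts) starts out uniform and is updated after every symbol, so no first pass is needed and the encoder and decoder stay synchronized. All names below are illustrative.

    from collections import Counter

    def adaptive_probabilities(stream: bytes):
        """Yield each symbol's probability under the current model,
        then update the model, the way an adaptive coder would."""
        counts = Counter()
        total = 0
        for symbol in stream:
            # Laplace smoothing keeps unseen byte values at nonzero probability.
            p = (counts[symbol] + 1) / (total + 256)
            yield symbol, p
            counts[symbol] += 1   # the model adapts after each symbol
            total += 1

    for symbol, p in adaptive_probabilities(b"abracadabra"):
        print(chr(symbol), round(p, 3))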
• For the original input stream, we use the terms unencoded,
raw, or original data.
• The contents of the final, compressed stream are considered
the encoded or compressed data.
• The term bit stream is also used in the literature to indicate
the compressed stream.
• Lossy/lossless compression: a lossless method reconstructs the
original data exactly, whereas a lossy method discards some
information in exchange for better compression.
• Cascaded compression: The difference between lossless
and lossy codecs can be illuminated by considering a
cascade of compressions.
• Perceptive compression: A lossy encoder must take
advantage of the special type of data being compressed.
It should delete only data whose absence would not be
detected by our senses.
• Such an encoder employs algorithms based on our understanding of
psychoacoustic and psychovisual perception, so it is
often referred to as a perceptive encoder.
• Symmetrical compression is the case where the
compressor and decompressor use basically the same
algorithm but work in “opposite” directions. Such a
method makes sense for general work, where the same
number of files are compressed as are decompressed.
• A data compression method is called universal if the compressor
and decompressor do not know the statistics of the input stream.
A universal method is optimal if the compressor can produce
compression factors that asymptotically approach the entropy of
the input stream for long inputs.
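The entropy mentioned above can be estimated from symbol frequencies. A minimal sketch of the order-0 (memoryless) estimate:

    import math
    from collections import Counter

    def entropy_bits_per_symbol(data: bytes) -> float:
        """Empirical order-0 entropy: a lower bound, in bits per symbol,
        on what any lossless coder can achieve for a memoryless source."""
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    print(entropy_bits_per_symbol(b"abracadabra"))   # about 2.04 bits/symbol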
• The term file differencing refers to any method that locates and
compresses the differences between two files. Imagine a file A
with two copies that are kept by two users. When a copy is
updated by one user, it should be sent to the other user to keep
the two copies identical; with differencing, only the differences
need to be transmitted instead of the entire file.
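A toy file-differencing sketch using Python's standard difflib; real tools (diff, rsync, bsdiff) are far more elaborate, so treat this only as an illustration of the idea.

    from difflib import SequenceMatcher

    def byte_diff(old: bytes, new: bytes):
        """Return instructions that rebuild `new` from `old`:
        ("copy", i, j) reuses old[i:j]; ("data", b) inserts literal bytes."""
        ops = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2))
            elif j2 > j1:                      # replace or insert
                ops.append(("data", new[j1:j2]))
        return ops

    def apply_diff(old: bytes, ops) -> bytes:
        out = bytearray()
        for op in ops:
            out.extend(old[op[1]:op[2]] if op[0] == "copy" else op[1])
        return bytes(out)

    old = b"The quick brown fox jumps over the lazy dog."
    new = b"The quick red fox jumps over the lazy cat."
    assert apply_diff(old, byte_diff(old, new)) == new   # only the diff is sent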
• Most compression methods operate in the streaming mode,
where the codec inputs a byte or several bytes, processes them,
and continues until an end-of-file is sensed.
• In the block mode, the input stream is read block by block
and each block is encoded separately. The block size in this case
should be a user-controlled parameter, since its size may greatly
affect the performance of the method.
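A sketch of the block mode with a user-controlled block size, compressing each block independently (zlib is used only as a stand-in codec):

    import zlib

    def compress_blocks(stream: bytes, block_size: int = 64 * 1024):
        """Encode each block separately; block_size is the user-controlled
        parameter, and changing it can noticeably change the ratio."""
        for start in range(0, len(stream), block_size):
            yield zlib.compress(stream[start:start + block_size])

    data = b"lorem ipsum dolor sit amet " * 10_000
    blocks = list(compress_blocks(data, block_size=16 * 1024))
    restored = b"".join(zlib.decompress(b) for b in blocks)
    assert restored == data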
• Most compression methods are physical. They look only at the
bits in the input stream and ignore the meaning of the data items
in the input. Such a method translates one bit stream into
another, shorter, one. The only way to make sense of the output
stream (to decode it) is by knowing how it was encoded.
• Some compression methods are logical. They look at individual
data items in the source stream and replace common items with
short codes. Such a method is normally special purpose and can
be used successfully on certain types of data only.
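A toy “logical” method along these lines: the most common items (here, words) are replaced with 1-byte codes. This is our own illustration, not a method from the text, and a real codec would also have to transmit the code table so the decoder can reverse the substitution.

    from collections import Counter

    def build_codes(words, max_codes=2):
        """Map the most frequent words to the 1-byte codes 0x80, 0x81, ...
        (values that never occur in plain ASCII text)."""
        common = [w for w, _ in Counter(words).most_common(max_codes)]
        return {w: bytes([0x80 + i]) for i, w in enumerate(common)}

    text = b"to be or not to be that is the question to be"
    words = text.split()
    codes = build_codes(words)                    # "to" and "be" get codes
    encoded = b" ".join(codes.get(w, w) for w in words)
    assert len(encoded) < len(text)               # common items now cost 1 byte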
Compression performance
The compression ratio is defined as
Compression ratio = (size of the output stream) / (size of the input stream)
A value of 0.6 means that the data occupies 60% of its original size
after compression.
Values greater than 1 mean an output stream bigger than the input
stream (negative compression).
The compression ratio can also be called bpb (bit per bit), since it
equals the number of bits in the compressed stream needed, on
average, to compress one bit in the input stream.
Analogous units are bpp (bits per pixel), used for images, and
bpc (bits per character), used for text.
The term bitrate (or “bit rate”) is a general term for bpb and bpc.
• Compression factor = size of the input stream/size of the output stream.
• The expression 100 × (1 − compression ratio) is also a reasonable measure
of compression performance. A value of 60 means that the output stream
occupies 40% of its original size (or that the compression has resulted in
savings of 60%).
• The compression gain is defined as 100 log_e (reference size / compressed
size); its unit is called percent log ratio.
• The speed of compression can be measured in cycles per byte (CPB). This is
the average number of machine cycles it takes to compress one byte. This
measure is important when compression is done by special hardware.
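The measures defined above, collected in one sketch (sizes in bytes; bpc assumes 8-bit characters):

    def compression_metrics(input_size: int, output_size: int) -> dict:
        """Compression ratio, factor, percent savings, and bpc as defined above."""
        ratio = output_size / input_size
        return {
            "ratio": ratio,                        # < 1 means compression
            "factor": input_size / output_size,    # > 1 means compression
            "savings_percent": 100 * (1 - ratio),  # 100 x (1 - compression ratio)
            "bpc": 8 * output_size / input_size,   # bits per character
        }

    print(compression_metrics(100_000, 60_000))
    # ratio 0.6 (output is 60% of the original), factor ~1.67,
    # savings 40.0 percent, bpc 4.8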
• Mean square error (MSE) and peak signal to noise
ratio (PSNR) are used to measure the distortion
caused by lossy compression of images and movies
(both are computed in the sketch below).
• Relative compression is used to measure the
compression gain in lossless audio compression
methods.
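A minimal sketch of MSE and PSNR for 8-bit samples, using the standard definitions (PSNR is reported in dB with a peak value of 255):

    import math

    def mse(original, reconstructed):
        """Mean squared error between two equal-length sample sequences."""
        return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

    def psnr(original, reconstructed, peak=255):
        """Peak signal-to-noise ratio in dB; higher means less distortion."""
        error = mse(original, reconstructed)
        return math.inf if error == 0 else 10 * math.log10(peak ** 2 / error)

    a = [52, 55, 61, 66, 70, 61, 64, 73]   # original pixel values
    b = [54, 55, 60, 66, 71, 60, 65, 72]   # values after lossy compression
    print(mse(a, b), psnr(a, b))           # small error -> high PSNR (~47.6 dB)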
• The Calgary Corpus is a set of 18 files traditionally
used to test data compression programs. They
include text, image, and object files, for a total of
more than 3.2 million bytes. The corpus can be
downloaded by anonymous ftp.
The Canterbury Corpus is another collection of files, introduced in
1997 to provide an alternative to the Calgary corpus for
evaluating lossless compression methods. It was developed for the
following reasons:
1. The Calgary corpus has been used by many researchers to develop,
test, and compare many compression methods, and there is a
chance that new methods would unintentionally be fine-tuned to
that corpus. They may do well on the Calgary corpus documents
but poorly on other documents.
2. The Calgary corpus was collected in 1987 and is getting old.
“Typical” documents change during a decade (e.g., html
documents did not exist until recently), and any body of
documents used for evaluation purposes should be examined from
time to time.
3. The Calgary corpus is more or less an arbitrary collection of
documents, whereas a good corpus for algorithm evaluation
should be selected carefully.
Probability Model
• This concept is important in statistical data compression methods.
• When such a method is used, a model for the data has to be
constructed before compression can begin.
• A typical model is built by reading the entire input stream, counting
the number of times each symbol appears, and computing the
probability of occurrence of each symbol.
• The data stream is then input again, symbol by symbol, and is
compressed using the information in the probability model.
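A sketch of this two-pass flow: pass 1 builds the probability model, and pass 2 reads the stream again and consults the model. Here, instead of a full entropy coder, we only report each symbol's ideal code length of -log2(p) bits.

    import math
    from collections import Counter

    def build_model(stream: bytes) -> dict:
        """Pass 1: count symbol occurrences and convert them to probabilities."""
        counts = Counter(stream)
        n = len(stream)
        return {sym: c / n for sym, c in counts.items()}

    def ideal_code_lengths(stream: bytes, model: dict):
        """Pass 2 (sketch): an entropy coder would spend about
        -log2(p) bits on a symbol of probability p."""
        for sym in stream:
            yield sym, -math.log2(model[sym])

    data = b"abracadabra"
    model = build_model(data)
    total_bits = sum(bits for _, bits in ideal_code_lengths(data, model))
    print(f"{total_bits / len(data):.2f} bits/symbol")   # about 2.04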
References
• David Salomon, Data Compression: The Complete Reference,
Springer Science & Business Media, 2004.