Text File Compression And Decompression Using Huffman Coding
Last Updated :
23 Jul, 2025
Text files can be compressed to make them smaller and faster to send, and unzipping files on devices has a low overhead. The process of encoding involves changing the representation of a file so that the (binary) compressed output takes less space to store and takes less time to transmit while retaining the ability to reconstruct the original file exactly from its compressed representation. Text files can be of various file types, such as HTML, JavaScript, CSS, .txt, and so on. Text compression is required because uncompressed data can take up a lot of space, which is inconvenient for device storage and file sharing.
How Does The Process Of Compression Work?
The size of the text file can be reduced by compressing it, which converts the text to a smaller format that takes up less space. It typically works by locating similar strings/characters within a text file and replacing them with a temporary binary representation to reduce the overall file size. There are two types of file compression,
- Lossy compression: Lossy compression shrinks a file by permanently removing certain elements, particularly redundant elements.
- Lossless compression: Lossless compression can restore all elements of a file during decompression without sacrificing data and quality.
Text encoding is also of two types:
- Fixed length encoding and
- Variable length encoding.
The two methods differ in the length of the codes. Analysis shows that variable-length encoding is much better than fixed-length encoding. Characters in variable-length encoding are assigned a variable number of bits based on their frequency in the given text. As a result, some characters may require a single bit, while others may require two bits, while still others may require three bits, and so on.
How to retain uniqueness of compressed text?
During the encoding process in compression, every character can be assigned and represented by a variable-length binary code. But, the problem with this approach is its decoding. At some point during the decoding process, two or more characters may have the same prefix of code, causing the algorithm to become confused. Hence, the "prefix rule" is used which makes sure that the algorithm only generates uniquely decodable codes. In this way, none of the codes are prefixed to the other and hence the uncertainty can be resolved.
Hence, for text file compression in this article, we decide to leverage an algorithm that gives lossless compression and uses variable-length encoding with prefix rule. The article also focuses on regenerating the original file using the decoding process.
Compressing a Text File:
We use the Huffman Coding algorithm for this purpose which is a greedy algorithm that assigns variable length binary codes for each input character in the text file. The length of the binary code depends on the frequency of the character in the file. The algorithm suggests creating a binary tree where all the unique characters of a file are stored in the tree's leaf nodes.
- The algorithm works by first determining all of the file's unique characters and their frequencies.
- The characters and frequencies are then added to a Min-heap.
- It then extracts two minimum frequency characters and adds them as nodes to a dummy root.
- The value of this dummy root is the combined frequency of its nodes and this root node is added back to the Min-heap.
- The procedure is then repeated until there is only one element left in the Min-heap.
This way, a Huffman tree for a particular text file can be created.
Steps to build Huffman Tree:
- The input to the algorithm is the array of characters in the text file.
- The frequency of occurrences of each character in the file is calculated.
- Struct array is created where each element includes the character along with their frequencies. They are stored in a priority queue (min-heap), where the elements are compared using their frequencies.
- To build the Huffman tree, two elements with minimum frequency are extracted from the min-heap.
- The two nodes are added to the tree as left and right children to a new root node which contains the frequency equal to the sum of two frequencies. A lower frequency character is added to the left child node and the higher frequency character into the right child node.
- The root node is then again added back to the priority queue.
- Repeat from step 4 until there is only one element left in the priority queue.
- Finally, the tree's left and right edges are numbered 0 and 1, respectively. For each leaf node, the entire tree is traversed, and the corresponding 1 and 0 are appended to their code until a leaf node is encountered.
- Once we have the unique codes for each unique character in the text, we can replace the text characters with their codes. These codes will be stored in bit-by-bit form, which will take up less space than text.
Compressed File Structure:
We've talked about variable length input code generation and replacing it with the file's original characters so far. However, this only serves to compress the file. The more difficult task is to decompress the file by decoding the binary codes to their original value.
This would necessitate the addition of some additional information to our compressed file in order to use it during the decoding process. As a result, we include the characters in our file, along with their corresponding codes. During the decoding process, this aids in the recreation of the Huffman tree.
The structure of a compressed file:
Number of unique characters in the input file |
Total number of characters in the input file |
All characters with their binary codes (To be used for decoding) |
Storing binary codes by replacing the characters of the input file one by one |
Decompressing the Compressed File:
- The compressed file is opened, and the number of unique characters and the total number of characters in the file are retrieved.
- The characters and their binary codes are then read from the file. We can recreate the Huffman tree using this.
- For each binary code:
- A left edge is created for 0, and a right edge is created for 1.
- Finally, a leaf node is formed and the character is stored within it.
- This is repeated for all characters and binary codes. The Huffman tree is thus recreated in this manner.
- The remaining file is now read bit by bit, and the corresponding 0/1 bit in the tree is traversed. The corresponding character is written into the decompressed file as soon as a leaf node is encountered in the tree.
- Step 4 is repeated until the compressed file has been read completely.
In this manner, we recover all of the characters from our input file into a newly decompressed file with no data or quality loss.
Following the steps above, we can compress a text file and then overcome the bigger task of decompressing the file to its original content without any data loss.
Time Complexity: O(N * logN) where N is the number of unique characters as an efficient priority queue data structure takes O(logN) time per insertion, a complete binary tree with N leaves has (2*N - 1) nodes.
Implementation of Huffman Coding using C Language:
Opening Input/Output Files:
C
// Opening input file in read-only mode
int fd1 = open(“sample.txt”, O_RDONLY);
if (fd1 == -1) {
perror("Open Failed For Input File:\n");
exit(1);
}
// Creating output file in write mode
int fd2 = open(“sample - compressed.txt”,
O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
if (fd2 == -1) {
perror("Open Failed For Output File:\n");
exit(1);
}
Function to Initialize and Create Min Heap:
C++
#include <iostream>
#include <utility>
#include <vector>
using namespace std;
// Structure for tree nodes
struct Node {
char character;
int freq;
Node *l, *r;
Node(char c, int f)
: character(c)
, freq(f)
, l(nullptr)
, r(nullptr)
{
}
};
// Structure for min heap
struct Min_Heap {
int size;
vector<Node*> array;
Min_Heap(int s)
: size(s)
, array(s)
{
}
};
// Function to create min heap
Min_Heap* createAndBuildMin_Heap(char arr[], int freq[],
int unique_size)
{
int i;
// Initializing heap
Min_Heap* Min_Heap = new Min_Heap(unique_size);
// Initializing the array of pointers in minheap.
// Pointers pointing to new nodes of character
// and their frequency
for (i = 0; i < unique_size; ++i) {
Min_Heap->array[i] = new Node(arr[i], freq[i]);
}
int n = Min_Heap->size - 1;
for (i = (n - 1) / 2; i >= 0; --i) {
// Standard function for Heap creation
Heapify(Min_Heap, i);
}
return Min_Heap;
}
C
// Structure for tree nodes
struct Node {
char character;
int freq;
struct Node *l, *r;
};
// Structure for min heap
struct Min_Heap {
int size;
struct Node** array;
};
// Function to create min heap
struct Min_Heap* createAndBuildMin_Heap(char arr[],
int freq[],
int unique_size)
{
int i;
// Initializing heap
struct Min_Heap* Min_Heap
= (struct Min_Heap*)malloc(sizeof(struct Min_Heap));
Min_Heap->size = unique_size;
Min_Heap->array = (struct Node**)malloc(
Min_Heap->size * sizeof(struct Node*));
// Initializing the array of pointers in minheap.
// Pointers pointing to new nodes of character
// and their frequency
for (i = 0; i < unique_size; ++i) {
// newNode is a function
// to initialize new node
Min_Heap->array[i] = newNode(arr[i], freq[i]);
}
int n = Min_Heap->size - 1;
for (i = (n - 1) / 2; i >= 0; --i) {
// Standard function for Heap creation
Heapify(Min_Heap, i);
}
return Min_Heap;
}
Java
// Structure for tree nodes
class Node {
char character;
int freq;
Node left, right;
// Constructor to initialize a new node
Node(char character, int freq) {
this.character = character;
this.freq = freq;
this.left = null;
this.right = null;
}
}
// Structure for min heap
class MinHeap {
int size;
Node[] array;
// Constructor to initialize a new min heap
MinHeap(int size) {
this.size = size;
this.array = new Node[size];
}
}
// Function to create min heap
public static MinHeap createAndBuildMinHeap(char[] arr, int[] freq, int unique_size) {
// Initializing heap
MinHeap minHeap = new MinHeap(unique_size);
// Initializing the array of pointers in minheap.
// Pointers pointing to new nodes of character
// and their frequency
for (int i = 0; i < unique_size; i++) {
// newNode is a function
// to initialize new node
minHeap.array[i] = new Node(arr[i], freq[i]);
}
int n = minHeap.size - 1;
for (int i = (n - 1) / 2; i >= 0; i--) {
// Standard function for Heap creation
heapify(minHeap, i);
}
return minHeap;
}
Python
# Structure for tree nodes
class Node:
def __init__(self, character, freq):
self.character = character
self.freq = freq
self.l = None
self.r = None
# Structure for min heap
class Min_Heap:
def __init__(self, size):
self.size = size
self.array = [None]*size
# Function to create min heap
def createAndBuildMin_Heap(arr, freq, unique_size):
# Initializing heap
Min_Heap_ = Min_Heap(unique_size)
# Initializing the array of pointers in minheap.
# Pointers pointing to new nodes of character
# and their frequency
for i in range(unique_size):
# newNode is a function
# to initialize new node
Min_Heap_.array[i] = Node(arr[i], freq[i])
n = Min_Heap_.size - 1
for i in range((n - 1) // 2, -1, -1):
# Standard function for Heap creation
Heapify(Min_Heap_, i)
return Min_Heap_
JavaScript
// Structure for tree nodes
class Node {
constructor(character, freq) {
this.character = character;
this.freq = freq;
this.l = null;
this.r = null;
}
}
// Structure for min heap
class Min_Heap {
constructor(size) {
this.size = size;
this.array = Array(size).fill(null);
}
}
// Function to create min heap
function createAndBuildMin_Heap(arr, freq, unique_size) {
// Initializing heap
const Min_Heap_ = new Min_Heap(unique_size);
// Initializing the array of pointers in minheap.
// Pointers pointing to new nodes of character
// and their frequency
for (let i = 0; i < unique_size; i++) {
// newNode is a function
// to initialize new node
Min_Heap_.array[i] = new Node(arr[i], freq[i]);
}
const n = Min_Heap_.size - 1;
for (let i = Math.floor((n - 1) / 2); i >= 0; i--) {
// Standard function for Heap creation
Heapify(Min_Heap_, i);
}
return Min_Heap_;
}
Function to Build and Create a Huffman Tree:
C
// Function to build Huffman Tree
struct Node* buildHuffmanTree(char arr[], int freq[],
int unique_size)
{
struct Node *l, *r, *top;
while (!isSizeOne(Min_Heap)) {
l = extractMinFromMin_Heap(Min_Heap);
r = extractMinFromMin_Heap(Min_Heap);
top = newNode('$', l->freq + r->freq);
top->l = l;
top->r = r;
insertIntoMin_Heap(Min_Heap, top);
}
return extractMinFromMin_Heap(Min_Heap);
}
Recursive Function to Print Binary Codes into Compressed File:
C
// Structure to store codes in compressed file
typedef struct code {
char k;
int l;
int code_arr[16];
struct code* p;
} code;
// Function to print codes into file
void printCodesIntoFile(int fd2, struct Node* root,
int t[], int top = 0)
{
int i;
if (root->l) {
t[top] = 0;
printCodesIntoFile(fd2, root->l, t, top + 1);
}
if (root->r) {
t[top] = 1;
printCodesIntoFile(fd2, root->r, t, top + 1);
}
if (isLeaf(root)) {
data = (code*)malloc(sizeof(code));
tree = (Tree*)malloc(sizeof(Tree));
data->p = NULL;
data->k = root->character;
tree->g = root->character;
write(fd2, &tree->g, sizeof(char));
for (i = 0; i < top; i++) {
data->code_arr[i] = t[i];
}
tree->len = top;
write(fd2, &tree->len, sizeof(int));
tree->dec
= convertBinaryToDecimal(data->code_arr, top);
write(fd2, &tree->dec, sizeof(int));
data->l = top;
data->p = NULL;
if (k == 0) {
front = rear = data;
k++;
}
else {
rear->p = data;
rear = rear->p;
}
}
}
Function to Compress the File by Substituting Characters with their Huffman Codes:
C
// Function to compress file
void compressFile(int fd1, int fd2, unsigned char a)
{
char n;
int h = 0, i;
// Codes are written into file in bit by bit format
while (read(fd1, &n, sizeof(char)) != 0) {
rear = front;
while (rear->k != n && rear->p != NULL) {
rear = rear->p;
}
if (rear->k == n) {
for (i = 0; i < rear->l; i++) {
if (h < 7) {
if (rear->code_arr[i] == 1) {
a++;
a = a << 1;
h++;
}
else if (rear->code_arr[i] == 0) {
a = a << 1;
h++;
}
}
else if (h == 7) {
if (rear->code_arr[i] == 1) {
a++;
h = 0;
}
else {
h = 0;
}
write(fd2, &a, sizeof(char));
a = 0;
}
}
}
}
for (i = 0; i < 7 - h; i++) {
a = a << 1;
}
write(fd2, &a, sizeof(char));
}
Function to Build Huffman Tree from Data Extracted from Compressed File:
C
typedef struct Tree {
char g;
int len;
int dec;
struct Tree* f;
struct Tree* r;
} Tree;
// Function to extract Huffman codes
// from a compressed file
void ExtractCodesFromFile(int fd1)
{
read(fd1, &t->g, sizeof(char));
read(fd1, &t->len, sizeof(int));
read(fd1, &t->dec, sizeof(int));
}
// Function to rebuild the Huffman tree
void ReBuildHuffmanTree(int fd1, int size)
{
int i = 0, j, k;
tree = (Tree*)malloc(sizeof(Tree));
tree_temp = tree;
tree->f = NULL;
tree->r = NULL;
t = (Tree*)malloc(sizeof(Tree));
t->f = NULL;
t->r = NULL;
for (k = 0; k < size; k++) {
tree_temp = tree;
ExtractCodesFromFile(fd1);
int bin[MAX], bin_con[MAX];
for (i = 0; i < MAX; i++) {
bin[i] = bin_con[i] = 0;
}
convertDecimalToBinary(bin, t->dec, t->len);
for (i = 0; i < t->len; i++) {
bin_con[i] = bin[i];
}
for (j = 0; j < t->len; j++) {
if (bin_con[j] == 0) {
if (tree_temp->f == NULL) {
tree_temp->f
= (Tree*)malloc(sizeof(Tree));
}
tree_temp = tree_temp->f;
}
else if (bin_con[j] == 1) {
if (tree_temp->r == NULL) {
tree_temp->r
= (Tree*)malloc(sizeof(Tree));
}
tree_temp = tree_temp->r;
}
}
tree_temp->g = t->g;
tree_temp->len = t->len;
tree_temp->dec = t->dec;
tree_temp->f = NULL;
tree_temp->r = NULL;
tree_temp = tree;
}
}
Function to Decompress the Compressed File:
C
void decompressFile(int fd1, int fd2, int f)
{
int inp[8], i, k = 0;
unsigned char p;
read(fd1, &p, sizeof(char));
convertDecimalToBinary(inp, p, 8);
tree_temp = tree;
for (i = 0; i < 8 && k < f; i++) {
if (!isroot(tree_temp)) {
if (i != 7) {
if (inp[i] == 0) {
tree_temp = tree_temp->f;
}
if (inp[i] == 1) {
tree_temp = tree_temp->r;
}
}
else {
if (inp[i] == 0) {
tree_temp = tree_temp->f;
}
if (inp[i] == 1) {
tree_temp = tree_temp->r;
}
if (read(fd1, &p, sizeof(char)) != 0) {
convertDecimalToBinary(inp, p, 8);
i = -1;
}
else {
break;
}
}
}
else {
k++;
write(fd2, &tree_temp->g, sizeof(char));
tree_temp = tree;
i--;
}
}
}
When the snippets of code above are combined to form a full implementation of the algorithm and a large corpus of data is passed to it, the following results can be obtained. It clearly demonstrates how a text file can be compressed with a ratio greater than 50% (typically 40-45%) and then decompressed without losing a single byte of data.
Huffman Algorithm Implementation Results
Similar Reads
Basics & Prerequisites
Data Structures
Array Data StructureIn this article, we introduce array, implementation in different popular languages, its basic operations and commonly seen problems / interview questions. An array stores items (in case of C/C++ and Java Primitive Arrays) or their references (in case of Python, JS, Java Non-Primitive) at contiguous
3 min read
String in Data StructureA string is a sequence of characters. The following facts make string an interesting data structure.Small set of elements. Unlike normal array, strings typically have smaller set of items. For example, lowercase English alphabet has only 26 characters. ASCII has only 256 characters.Strings are immut
2 min read
Hashing in Data StructureHashing is a technique used in data structures that efficiently stores and retrieves data in a way that allows for quick access. Hashing involves mapping data to a specific index in a hash table (an array of items) using a hash function. It enables fast retrieval of information based on its key. The
2 min read
Linked List Data StructureA linked list is a fundamental data structure in computer science. It mainly allows efficient insertion and deletion operations compared to arrays. Like arrays, it is also used to implement other data structures like stack, queue and deque. Hereâs the comparison of Linked List vs Arrays Linked List:
2 min read
Stack Data StructureA Stack is a linear data structure that follows a particular order in which the operations are performed. The order may be LIFO(Last In First Out) or FILO(First In Last Out). LIFO implies that the element that is inserted last, comes out first and FILO implies that the element that is inserted first
2 min read
Queue Data StructureA Queue Data Structure is a fundamental concept in computer science used for storing and managing data in a specific order. It follows the principle of "First in, First out" (FIFO), where the first element added to the queue is the first one to be removed. It is used as a buffer in computer systems
2 min read
Tree Data StructureTree Data Structure is a non-linear data structure in which a collection of elements known as nodes are connected to each other via edges such that there exists exactly one path between any two nodes. Types of TreeBinary Tree : Every node has at most two childrenTernary Tree : Every node has at most
4 min read
Graph Data StructureGraph Data Structure is a collection of nodes connected by edges. It's used to represent relationships between different entities. If you are looking for topic-wise list of problems on different topics like DFS, BFS, Topological Sort, Shortest Path, etc., please refer to Graph Algorithms. Basics of
3 min read
Trie Data StructureThe Trie data structure is a tree-like structure used for storing a dynamic set of strings. It allows for efficient retrieval and storage of keys, making it highly effective in handling large datasets. Trie supports operations such as insertion, search, deletion of keys, and prefix searches. In this
15+ min read
Algorithms
Searching AlgorithmsSearching algorithms are essential tools in computer science used to locate specific items within a collection of data. In this tutorial, we are mainly going to focus upon searching in an array. When we search an item in an array, there are two most common algorithms used based on the type of input
2 min read
Sorting AlgorithmsA Sorting Algorithm is used to rearrange a given array or list of elements in an order. For example, a given array [10, 20, 5, 2] becomes [2, 5, 10, 20] after sorting in increasing order and becomes [20, 10, 5, 2] after sorting in decreasing order. There exist different sorting algorithms for differ
3 min read
Introduction to RecursionThe process in which a function calls itself directly or indirectly is called recursion and the corresponding function is called a recursive function. A recursive algorithm takes one step toward solution and then recursively call itself to further move. The algorithm stops once we reach the solution
14 min read
Greedy AlgorithmsGreedy algorithms are a class of algorithms that make locally optimal choices at each step with the hope of finding a global optimum solution. At every step of the algorithm, we make a choice that looks the best at the moment. To make the choice, we sometimes sort the array so that we can always get
3 min read
Graph AlgorithmsGraph is a non-linear data structure like tree data structure. The limitation of tree is, it can only represent hierarchical data. For situations where nodes or vertices are randomly connected with each other other, we use Graph. Example situations where we use graph data structure are, a social net
3 min read
Dynamic Programming or DPDynamic Programming is an algorithmic technique with the following properties.It is mainly an optimization over plain recursion. Wherever we see a recursive solution that has repeated calls for the same inputs, we can optimize it using Dynamic Programming. The idea is to simply store the results of
3 min read
Bitwise AlgorithmsBitwise algorithms in Data Structures and Algorithms (DSA) involve manipulating individual bits of binary representations of numbers to perform operations efficiently. These algorithms utilize bitwise operators like AND, OR, XOR, NOT, Left Shift, and Right Shift.BasicsIntroduction to Bitwise Algorit
4 min read
Advanced
Segment TreeSegment Tree is a data structure that allows efficient querying and updating of intervals or segments of an array. It is particularly useful for problems involving range queries, such as finding the sum, minimum, maximum, or any other operation over a specific range of elements in an array. The tree
3 min read
Pattern SearchingPattern searching algorithms are essential tools in computer science and data processing. These algorithms are designed to efficiently find a particular pattern within a larger set of data. Patten SearchingImportant Pattern Searching Algorithms:Naive String Matching : A Simple Algorithm that works i
2 min read
GeometryGeometry is a branch of mathematics that studies the properties, measurements, and relationships of points, lines, angles, surfaces, and solids. From basic lines and angles to complex structures, it helps us understand the world around us.Geometry for Students and BeginnersThis section covers key br
2 min read
Interview Preparation
Practice Problem