When it comes to designing and optimizing databases, one of the most critical aspects is the choice of storage options. PostgreSQL, like many other relational databases, provides various storage options that can significantly impact performance, data integrity, and overall database efficiency.
In this case study, we'll delve into each of PostgreSQL's main storage options, their characteristics, and the factors that influence their choice, enabling you to make informed decisions about your database's storage strategy. You will also learn how you can archive data in a hybrid environment for long-term storage.
What we will compare:

- Row storage ("heap"), the PostgreSQL default
- Columnar storage using the Citus columnar extension
- CSV files
- Parquet files
Each storage type serves a different purpose and can be used to achieve different goals.
Compliance is a key topic in database engineering, and handling large volumes of audit data matters. Taking data from the Oracle Unified Audit Trail is therefore a good way to demonstrate PostgreSQL's capabilities.
For the purpose of this evaluation we have imported roughly 144 million rows into PostgreSQL using a “heap” (which is the default storage method):
lakehouse=# SELECT count(*) FROM t_row_plain;
   count
-----------
 144417515
(1 row)
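How those rows got there is not shown in detail here. As a rough sketch, assuming the audit trail had first been exported to a CSV file (the path below is purely hypothetical), the load could be done with a plain COPY:

-- hypothetical bulk load of an audit-trail export into the heap table
COPY t_row_plain FROM '/path/to/audit_export.csv' CSV HEADER;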
When storing a typical Oracle audit trail, those 144 million rows translate (without indexes) to roughly 500 bytes per entry, which means that we can expect a table that is roughly 72 GB in size:
lakehouse=# SELECT pg_size_pretty(
                pg_total_relation_size('t_row_plain')
            );
 pg_size_pretty
----------------
 72 GB
(1 row)
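If you want to verify the estimate of roughly 500 bytes per entry yourself, a quick sanity check is to divide the total relation size by the row count (both functions used here are stock PostgreSQL):

-- average on-disk footprint per audit entry (integer division is fine here)
SELECT pg_total_relation_size('t_row_plain') / count(*) AS bytes_per_row
FROM   t_row_plain;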
Heaps have significant advantages but also disadvantages compared to other storage methods: while a normal table needs a lot of space, we can index at will and access single rows in the most efficient way possible.
When talking about audit logs a heap is ideal if you want to search and find single events with lightning speed. Now: What do we mean by lightning fast? Let us take a look at an example:
lakehouse=# explain analyze SELECT *
            FROM   t_row_plain
            WHERE  event_timestamp_utc > '2025-06-23'
            ORDER  BY event_timestamp_utc
            LIMIT  10;
                                QUERY PLAN
-----------------------------------------------------------------
 Limit  (cost=0.57..5.66 rows=10 width=2767)
        (actual time=0.028..0.038 rows=10 loops=1)
   ->  Index Scan using t_row_plain_event_timestamp_utc_idx on t_row_plain
        (cost=0.57..29724708.24 rows=58436231 width=2767)
        (actual time=0.026..0.035 rows=10 loops=1)
         Index Cond: (event_timestamp_utc > '2025-06-23 00:00:00+02'::timestamp with time zone)
 Planning Time: 0.160 ms
 Execution Time: 0.070 ms
(5 rows)
Finding the first 10 entries on the 23rd of June merely takes a tiny fraction of a millisecond (0.038 milliseconds to be precise). This is the power of indexing in a row store.
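The plan above relies on a B-tree index on the timestamp column. Its creation is not shown in this post, but judging by the index name in the plan it was presumably created along these lines:

-- B-tree index on the timestamp column; the default name PostgreSQL
-- generates matches the one visible in the execution plan above
CREATE INDEX ON t_row_plain (event_timestamp_utc);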
What is the purpose of a columnar store? Normally data is stored row by row.
What is wrong with that?
The trick is that there is a lot more redundancy in a column than in a row, which in turn means that the data can be compressed more easily. What is also important is that often one simply wants to operate on a subset of the columns in a row. You might only want to add up your revenue and not worry about postal codes, email addresses and other stuff inside your row. A column store comes in handy to help reduce the amount of I/O.
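As a tiny, purely hypothetical example (the sales table and its revenue column are not part of this data set), an aggregate like the following only needs a single column, so a column store can skip everything else, while a row store has to read the complete rows:

-- only the "revenue" column has to be fetched and decompressed
SELECT sum(revenue) FROM sales;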
The downside of course is that a column store is really bad at OLTP operations such as UPDATE, DELETE and so on. However, if you are running an (ideally) append-only analytical workload this does not matter at all because in most cases you want to leverage the advantages and not sacrifice them for operations that are not needed anyway.
In PostgreSQL we can use the Citus extension to add a column store to the system. Make sure Citus is installed and enabled, then execute:
lakehouse=# CREATE EXTENSION IF NOT EXISTS citus_columnar;
CREATE EXTENSION
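To double-check that the columnar access method is actually available after creating the extension, you can look it up in the pg_am system catalog:

-- the "columnar" table access method should now show up in the catalog
SELECT amname, amtype FROM pg_am WHERE amname = 'columnar';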
Then create a new table using columnar format and copy the data over:
CREATE TABLE t_columnar (LIKE t_row_plain) USING columnar;
INSERT INTO t_columnar SELECT * FROM t_row_plain;

The difference in size is stunning. The table is more than 10 times smaller than the row-oriented storage format:

lakehouse=# SELECT pg_size_pretty(
                pg_total_relation_size('t_columnar')
            );
 pg_size_pretty
----------------
 6676 MB
(1 row)
If you are dealing with huge amounts of data this is really beneficial and helps to reduce the core limitation most databases are facing: the shortage of I/O throughput.
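A typical analytical query against the columnar table only touches the columns it really needs. Counting audit events per day, for example, only has to read the event_timestamp_utc column:

-- daily event counts: only one column has to be read and decompressed
SELECT date_trunc('day', event_timestamp_utc) AS day,
       count(*) AS events
FROM   t_columnar
GROUP  BY 1
ORDER  BY 1;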
Note that you can index a column store, but in most cases this is not how it should be used. If there is a real need for a B-tree index on a column store, it is worth rethinking what is going on:
lakehouse=# CREATE INDEX ON t_columnar (event_timestamp_utc);
CREATE INDEX
What is also important here is that the index will be exactly the same size as in the case of the row store (in our example we are talking about roughly 3 GB for an index on the event_timestamp_utc column).
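If you want to verify this yourself, the overall index size of both tables can be compared directly using the stock pg_indexes_size function:

-- total size of all indexes on the row store vs. the column store
SELECT pg_size_pretty(pg_indexes_size('t_row_plain')) AS row_store_indexes,
       pg_size_pretty(pg_indexes_size('t_columnar'))  AS columnar_indexes;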
For the sake of comparison it makes sense to take a brief look at CSV files which are often used to transfer and download data. Many data sets in the world of Artificial Intelligence are provided in CSV format so I decided to include this in the comparison.
Here is how it works:
lakehouse=# COPY t_row_plain TO '/storage/audit_data.csv' CSV HEADER;
COPY 144417515
What we see here is that the same data represented as pure text translates to roughly 85 GB, which is close to the size one can expect in a row store:
$ ls -lh /storage/audit_data.csv
-rw-r--r-- 1 hs hs 85G  8. Jul 10:02 /storage/audit_data.csv
Obviously one can compress pure text and exchange data easily.
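As a sketch (COPY ... TO PROGRAM requires superuser privileges, and the gzip call as well as the target path are just an example), compression can even be combined with the export in a single step:

-- pipe the CSV export straight into gzip instead of writing plain text
COPY t_row_plain TO PROGRAM 'gzip > /storage/audit_data.csv.gz' CSV HEADER;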
If you want to store time series data, or huge data sets in general, it makes sense to take a look at Parquet files, which offer an excellent way to exchange data and store it in a highly compact form.
In our case the Parquet file resulting from our data set comes in at roughly 5.5 GB, which is slightly better than the Citus columnar store (which of course allows us to process data inside PostgreSQL easily).
If you want to experiment with Parquet yourself you can use simple Python scripts as shown below:
import polars as pl

# Stream the CSV export into a compressed Parquet file without
# loading the entire data set into memory at once.
pl.scan_csv("audit_data.csv").sink_parquet(
    "audit.parquet",
    compression="gzip",
    row_group_size=100_000
)
Alternatively, we recommend taking a look at pg_duckdb which helps with many other operations related to importing and exporting data.
There is no storage format that fits all needs. It therefore makes sense to use what works best for a certain use case. In this post we have compared some of the more common formats to handle data in PostgreSQL.
As a quick summary: the same 144 million audit rows take up roughly 72 GB as a plain heap, 6676 MB in the Citus columnar store, 85 GB as a CSV file and roughly 5.5 GB as a Parquet file.
If you want to learn more about storage and storage processing, we recommend taking a look at CYBERTEC LakeHouse for Scalefield and our other blogs related to this important topic.