
Tuesday, 8 December 2020

Disk storage algorithm

This is a follow-up to the rethinking-key-value-store article, exploring the storage part of the system.



Many data systems support plugin-based storage layers, which opens up a whole set of options: use one off the shelf or build one that suits your needs.

In this post I will share how a new storage system can be built, and later use it to build a time series application on top of it.

Before we go deep into disk-based algorithms, let's look at why non-volatile storage is required at all.

In times when machines with terabytes of RAM are available, why do we have to bother storing stuff on disk?

Here are a couple of good reasons why having a good non-volatile storage manager still makes sense today.

  • It is cheap

Disk is very cheap compared to RAM, so it does not make sense to keep data in the expensive store, especially when the data is not used frequently. A lot of the cloud provider bill can be saved!

  • It is unlimited

Although machines with big RAM are available, RAM is still limited, and it will not keep getting bigger at the same rate as in the past. If we want applications to have the illusion of unlimited memory, then flushing to disk is a must.

  • Allow fast restarts
Think about what happens if an application crashes: it has to rebuild its whole state, and it could take a very long time before the application is available again to serve requests. Saving computed data to disk and restoring from it is much faster.

  • Information exchange 

How do two applications running on different machines communicate? For inter-application communication, in-memory data has to be written in a wire format so that it can be sent over the network.

And many more reasons.

An application has a volatile and a non-volatile area (i.e. RAM and disk), and the storage manager sits in the middle of the two to provide efficient access to data.





RAM and disk are very different types of hardware, and their access patterns are very different too.

On one hand, RAM can be accessed randomly and is fast for both reads and writes.

Disks, by contrast, are accessed in blocks and are slow for both reads and writes. SSDs have improved access times, but sequential access is still recommended to get the best performance.

Storage managers have to use efficient data structures on disk to get the best performance. Another thing is that disk has nothing like malloc to manage allocation: everything is bare metal, and the application developer has to manage allocation, garbage collection, locks, etc.

Disk read/write access is based on a block, which is usually 4 KB, but memory read/write is based on a cache line, which is 64 bytes. This difference in read/write size alone requires new ways of organizing data to get the best out of the device.

All the above problems make writing disk-based algorithms very challenging.

Let's look at some options for storing data on disk.

Generally a file on disk looks something like below. Each block is of a fixed size that depends on the hardware; most vendors use 4 KB blocks. The IO device provides an atomic read/write guarantee at the block level.



Page Layout

Let's unpack a disk block to explore the options for storing data.

Fixed Size

Fixed-size data is a very common and intuitive way to store data in a block, provided the underlying data fits the model; it mostly applies to numeric data types (byte, short, int, long, float & double). It can be made to work for varchar, but padding is required. If the underlying data can be mapped to fixed-size values, this is the best option.


Fixed size has a good random-access property: a specific record can be found with a simple multiplication. For example, to find the record at index 3 we compute index * sizeof(record) (i.e. 3 * 4 = 12) as the offset of the data and read it.
Most applications have variable record sizes, though, so a more flexible storage layout is required.
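As a minimal sketch (Java, assuming 4-byte int records in a single 4 KB block; the names here are illustrative), random access is just offset arithmetic:

import java.nio.ByteBuffer;

public class FixedSizePage {

    private static final int RECORD_SIZE = 4; // fixed-size int records
    private final ByteBuffer block = ByteBuffer.allocate(4 * 1024); // one 4 KB block

    // Offset of a record is a simple multiplication, so access is O(1).
    int read(int recordIndex) {
        return block.getInt(recordIndex * RECORD_SIZE);
    }

    void write(int recordIndex, int value) {
        block.putInt(recordIndex * RECORD_SIZE, value);
    }
}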


Size Prefix

In this approach every record is prefixed with a 4-byte size, followed by the data.
This has an overhead of 4 extra bytes per record, which can sometimes be more than the actual data. Another drawback is that access is sequential: if the last record is required, the full page must be scanned.
One more downside is what happens when records are updated: this causes overflow or fragmentation.




This layout is good for queue-based systems where writes always go at the end and reads are large sequential scans.
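A minimal sketch of the idea (illustrative, not the layout of any particular system): append writes a 4-byte length followed by the payload, while reading record N has to walk every prefix before it.

import java.nio.ByteBuffer;

public class SizePrefixPage {

    private final ByteBuffer block = ByteBuffer.allocate(4 * 1024);

    // Append = 4-byte size prefix + payload; writes always go at the end.
    void append(byte[] record) {
        block.putInt(record.length);
        block.put(record);
    }

    // Sequential access: reading the n-th record means skipping all before it.
    byte[] read(int recordIndex) {
        ByteBuffer view = block.duplicate();
        view.flip(); // restrict the view to what has been written so far
        for (int i = 0; i < recordIndex; i++) {
            int size = view.getInt();
            view.position(view.position() + size); // skip earlier record
        }
        byte[] record = new byte[view.getInt()];
        view.get(record);
        return record;
    }
}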

Slotted Page

This approach is a hybrid one that takes advantage of both the fixed-size and the size-prefix page.





The slotted page is a transformation of the size-prefix page that co-locates related data: all the data is together and all the sizes are together.

A single page contains two regions:
  • Header Region
This section contains metadata about the page: version, page id, hash, number of records, compression flag, etc.

  • Data Region
The data region is subdivided into a data segment and an index segment. The index segment, also called the slot array, holds fixed-size values (2 or 4 bytes each) that point into the data segment.
The data segment is written from the left and the slot array from the right; the page is considered full once there is no space left for either a new record or a new slot.

This approach gives random access to records by using the slot array: every record can be addressed by (page id, record id), and the full file content can be seen as a two-dimensional array.


The slotted page is a very popular layout in many databases. It also allows building sparse or dense indexes based on page & slot.
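A minimal slotted-page sketch (header omitted, field sizes illustrative): record data grows from the left, the slot array grows from the right, and each slot stores the offset of its record, which is what gives random access.

import java.nio.ByteBuffer;

public class SlottedPage {

    private static final int PAGE_SIZE = 4 * 1024;
    private static final int SLOT_SIZE = 4; // each slot: a 4-byte offset into the data segment

    private final ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
    private int dataEnd = 0;           // data segment grows from the left
    private int slotStart = PAGE_SIZE; // slot array grows from the right
    private int records = 0;

    // Returns the new record id, or -1 when the page is full.
    int insert(byte[] record) {
        int needed = 4 /* size prefix */ + record.length + SLOT_SIZE;
        if (dataEnd + needed > slotStart) return -1; // no room for data or slot
        page.putInt(dataEnd, record.length);
        page.position(dataEnd + 4);
        page.put(record);
        slotStart -= SLOT_SIZE;
        page.putInt(slotStart, dataEnd); // slot points at the record
        dataEnd += 4 + record.length;
        return records++;
    }

    // Random access via the slot array: (page id, record id) -> data.
    byte[] read(int recordId) {
        int slotOffset = PAGE_SIZE - (recordId + 1) * SLOT_SIZE;
        int dataOffset = page.getInt(slotOffset);
        byte[] out = new byte[page.getInt(dataOffset)];
        page.position(dataOffset + 4);
        page.get(out);
        return out;
    }
}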


Disk Data Structure 

Now we will take the smallest unit of storage (i.e. the page) and build some data structures on top of it, and finally an application using a disk-based data structure.

Remember that disk has nothing like a malloc API, so we have to build something like pagealloc to enable dynamic allocation of blocks/pages.


The page allocator interface is a low-level API; it looks something like below.

public interface PageAllocator {

    WritePage newPage();

    long commit(WritePage page);

    ReadPage readByPageId(int pageId);

    int noOfPages();

    int pageSize();

    byte version();

    List<PageInfo> pages();

    ReadPage readByPageOffset(long offSet);
}

Application - Time Series Database

Using the page allocator abstraction, we will build a time series database that uses a Sorted String Table as the underlying store.

An SSTable stores immutable rows in files, ordered by some key. The SSTable is the basis of the log-structured merge tree, an alternative to the B+Tree.

The log-structured merge tree powers many popular data stores like Bigtable, HBase, LevelDB, RocksDB, WiredTiger, Cassandra, InfluxDB and many more.


SSTable

An SSTable is made of a couple of ordered in-memory maps plus ordered rows on disk. The storage manager sits right in the middle, managing the sorted structures in memory and on disk.




Writers  

All write requests are handled by writing to a mutable in-memory ordered map; once a map is full, it is converted to a read-only in-memory map, and these are periodically flushed to disk for durability.

Writing to such a system is very fast because it is done using an in-memory data structure.

Readers

Reads are where this gets more interesting, because a read may have to hit multiple data structures to find a record: first the mutable map is scanned, then the immutable maps, and finally the files on disk.
Rather than a single seek it has to do multiple seeks, but since every data structure, whether in memory or on disk, is sorted, each lookup can be processed in O(log N) time.
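A minimal sketch of that read path (illustrative, not the repo's actual code): consult the newest structure first so the most recent value for a key wins, and fall through to older structures only on a miss.

import java.util.List;
import java.util.concurrent.ConcurrentNavigableMap;

public class ReadPath<V> {

    // 1. mutable map, 2. read-only maps (newest first), 3. disk (omitted here)
    V get(String key,
          ConcurrentNavigableMap<String, V> mutable,
          List<ConcurrentNavigableMap<String, V>> immutables) {
        V value = mutable.get(key);
        if (value != null) return value;
        for (ConcurrentNavigableMap<String, V> m : immutables) {
            value = m.get(key);
            if (value != null) return value;
        }
        return null; // not in memory: continue the search on disk
    }
}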

Over time the number of files grows, so a compaction process is required that merges multiple sorted files into a small number of files. This compaction process is what turns the SSTables into a log-structured merge tree.

Some code!
To have something working, I used ConcurrentSkipListMap from the JDK for the in-memory ordered map and the PageAllocator to flush the ordered maps to disk.

SortedStringTable

public interface SortedStringTable<V> {

    void append(String key, V value);

    void iterate(String from, String to, Function<V, Boolean> consumer);

    // API for saving the SSTable to persistent storage
    Collection<PageRecord<V>> buffers();

    void remove(int pageId);

    void flush();
}
Working SSTable code can be found @ sst github.
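To make the in-memory side concrete, here is a minimal memtable sketch built on ConcurrentSkipListMap (a simplified illustration of the approach, not the repo's actual code):

import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Function;

public class MemTable<V> {

    // The skip list keeps keys ordered on every insert, which is exactly
    // what the SSTable write path needs before flushing to disk.
    private final ConcurrentNavigableMap<String, V> map = new ConcurrentSkipListMap<>();

    public void append(String key, V value) {
        map.put(key, value);
    }

    // Visit values with keys in [from, to); consumer returns false to stop early.
    public void iterate(String from, String to, Function<V, Boolean> consumer) {
        for (V value : map.subMap(from, to).values()) {
            if (!consumer.apply(value)) break;
        }
    }
}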


The first data structure is ready for our time series database :-)

Time Series 

The time series application will be built on top of the SSTable.


The time series interface is simple; it looks something like below.

public interface TimeSeriesStore {

    <T> EventInfo insert(T row);

    void gt(LocalDateTime fromTime, Function<EventInfo, Boolean> consumer);

    void lt(LocalDateTime toTime, Function<EventInfo, Boolean> consumer);

    void between(LocalDateTime startTime, LocalDateTime endTime, Function<EventInfo, Boolean> consumer);

    void flush();
}
Time series application code can be found @ timeseries repo.
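A sketch of how this API might be used (the Boolean-return semantics of the consumer are an assumption, and obtaining the store instance is left abstract since the post does not show a factory):

import java.time.LocalDateTime;

public class TimeSeriesExample {

    // The store comes from the timeseries repo; construction is omitted
    // here because the concrete implementation class is not shown in the post.
    static void query(TimeSeriesStore store) {
        LocalDateTime start = LocalDateTime.of(2020, 1, 1, 0, 0);
        LocalDateTime end = LocalDateTime.of(2020, 1, 2, 0, 0);
        store.between(start, end, event -> {
            System.out.println(event); // process each event in the window
            return true;               // assuming true means "keep scanning"
        });
        store.flush();
    }
}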

To experiment with some real time series data, I picked up sample data from the January Yellow Taxi Trip dataset and loaded it into the app; yellow_tripdata_2020-01 has 6+ million records.

Sample time series application using this data can be found @ NYTaxiRides.java

All the code has good unit test coverage, so feel free to hack and learn.

Conclusion

Disk-based algorithms are very cool, and understanding them gives a good idea of how modern data systems work. You might never build a data system from scratch, but knowing these algorithms will definitely help in deciding which data system to pick based on its trade-offs.

Thursday, 26 July 2018

Scala Tuple performance

The tuple is a very powerful construct in a programming language: it allows creating a finite sequence of elements.

Elements in a tuple can be of different types, and a tuple is very easy to declare, e.g. ("something", 1, new Date()).
The nice thing about a tuple is that you only have to decide on the data type of each element, not its name.

Computer science has two hard problems: cache invalidation and naming things.

Tuples help with the naming problem.

Nothing comes for free, and everything has a trade-off. In this post I will share the dark side of tuples, taking Scala's tuple as the example.

What are the different ways to implement a tuple?

Class with an object array
This is the first option that comes to mind; it is good for flexibility but bad for performance:

  • type checks are expensive
  • it has the overhead of array index checking
  • parameters are of object type, which puts memory pressure on both reads and writes
  • it is expensive to make immutable (we will talk about immutability later)
  • no type information, so serialization carries a big overhead

Class with a fixed number of parameters
  This is better than the first one, but it also has a few issues:

  • parameters are of object type, which puts memory pressure
  • mutable, unless the framework or library generates code or maintains fixed-layout classes like (Tuple1, Tuple2, Tuple3...)
  • serialization overhead due to the object types

Scala uses the fixed-number-of-parameters approach.

In the absence of tuples, the poor man's choice was to create a class with N instance variables and give them proper types; Scala's case class is based on that old-school idea.

Let's compare tuple vs case class. I will take a tuple with 4 parameters, 2 of which are primitive types (Int & Double).

Tuple  : (String,String,Int,Double)
Case class : case class Trade(symbol: String, exchange: String, qty: Int, price: Double)

Structurally both are the same and can replace each other.

Memory Test

The benchmark creates N instances of the tuple/case class, puts them in a collection, and measures memory allocation.


Memory usage for the tuple is double: for 5 million objects, the tuple takes 384 MB while the case class takes just 189 MB.

Read performance test
In this test, objects are allocated once and then accessed to do a basic aggregation.



This chart shows the time taken to sum the Double value over 1 million, 5 million, etc. objects.
Reading from the tuple is slow: it takes double the time.

One thing not shown in this chart is the memory pressure created during reads: the tuple creates memory pressure on the read path as well.

These numbers show that the tuple is not good for either memory or CPU usage.
Let's take a deep dive into the code generated for the tuple and the case class to understand why we see these numbers.

Let's look at the Java code the Scala compiler generates for the tuple.
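Roughly, a Tuple4[String, String, Int, Double] erases to something like the following (an illustrative approximation, not a literal scalac dump): all four fields become Object, so the primitives are boxed and every read pays for a cast plus an unbox.

// Illustrative approximation of the erased tuple; field and method
// names mirror Scala's TupleN convention.
public final class Tuple4 {

    public final Object _1; // String stored as Object
    public final Object _2; // String stored as Object
    public final Object _3; // Int boxed to java.lang.Integer
    public final Object _4; // Double boxed to java.lang.Double

    public Tuple4(Object _1, Object _2, Object _3, Object _4) {
        this._1 = _1;
        this._2 = _2;
        this._3 = _3;
        this._4 = _4;
    }

    public Object _3() { return _3; }
}

// Every read of the Int element needs a cast and an unbox:
// int qty = ((Integer) tuple._3()).intValue();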




Scala does well to mark the values final, which buys some read efficiency, but all of that is thrown away by storing the fields as objects and doing a runtime cast every time a value is requested.

Case class code

For the case class, Scala keeps the primitive types, and that is where all the memory and CPU efficiency comes from.
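A rough Java approximation of the compiled case class (again an illustrative sketch): the fields keep their primitive types, so reads need no cast and no unboxing.

public final class Trade {

    private final String symbol;
    private final String exchange;
    private final int qty;      // stays a primitive int
    private final double price; // stays a primitive double

    public Trade(String symbol, String exchange, int qty, double price) {
        this.symbol = symbol;
        this.exchange = exchange;
        this.qty = qty;
        this.price = price;
    }

    public int qty() { return qty; }        // direct primitive access
    public double price() { return price; } // no boxing on the read path
}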

Spark is a very popular large-scale data processing framework, and a lot of production code is written using tuples.
Tuple-based transformations put a lot of load on the GC and have an impact on CPU utilization.
Since the tuple is based entirely on object types, it affects network transfer as well.

Type information is very important for optimization, and that is the reason Spark 2 is based on Dataset, which has a compact representation.

So next time you are looking for a quick improvement, change the tuple to a case class.

Code used for benchmarking is available @ github

Friday, 11 December 2015

Mr Cool - Bloom Filter


Hash-based data structures are the basis of many algorithms; they are used for fast lookups, membership queries, group-by, etc.

Hash codes are also used to build some interesting classes of data structures. In this post I will share one such data structure that has many useful real-world applications.

Take an example: you have millions or billions of items and you want to keep track of the distinct ones.
This is a very simple problem and can be solved using a HashMap, but that requires huge memory proportional to the number of distinct items, because it has to store the actual elements too.

To make this problem a little more interesting, we can put a memory constraint on the solution; effectively we need a space-efficient way to test the membership of an item.

Nothing comes for free in this world, so with less memory we have to trade off accuracy, and in some cases that is fine, e.g. unique visitors on a website, distinct pages visited, etc.

The first intuition for solving this problem is to allocate N buckets and mark a bucket based on the hash code; there is no need to store the actual value, just mark the slot index.







The quick question that comes to mind is what happens in case of a collision. In that case our result will be wrong, but only by some percentage, and there are a couple of ways to improve it:

 - Allocate more buckets but use a single hash function.
 - Use multiple hash functions, each with its own dedicated buckets.

The multiple-hash-function option works very well; using it, results can be around 96% correct!

So if we use multiple hash functions, the data structure will look something like below.















This data structure is called a Bloom filter: it maintains X bit vectors and applies X hash functions to the input value to mark bits.

To check whether a value already exists, just apply the same hash functions and check whether the corresponding bits are set.

In terms of memory, a bit vector of length X is required, and to get more accurate results multiple hash-based bit vectors can be used. Quick math gives a fair idea of the memory requirement: at roughly 10 bits per item with 7 hash functions, a Bloom filter yields about a 1% false-positive rate, so 100 million items fit in around 120 MB.


The Bloom filter is very compact in terms of memory, and its size can be controlled depending on the requirement.
Memory can be further reduced by using a compressed bit vector.
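Putting the pieces together, here is a minimal Bloom filter sketch in Java (illustrative; unlike the implementation linked below, it shares one common bit vector across all hash functions and derives the k hashes from two base hashes):

import java.util.Arrays;
import java.util.BitSet;

public class BloomFilter {

    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th hash from two base hashes (double hashing).
    private int index(byte[] value, int i) {
        int h1 = Arrays.hashCode(value);
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(byte[] value) {
        for (int i = 0; i < hashes; i++) bits.set(index(value, i));
    }

    // May return a false positive, but never a false negative.
    public boolean mightContain(byte[] value) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(value, i))) return false;
        }
        return true;
    }
}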

Trade Off
It is important to know the trade-offs made to achieve this efficiency.

 - It is a probabilistic data structure, meaning results are not 100% correct. The interesting thing about this data structure is that it can give FALSE POSITIVES but never FALSE NEGATIVES, which makes it a good fit for many practical applications.

 - Multiple hash functions are used, so read/write performance depends on the hash functions.

Application
- DB Query filter
One of the most common use cases: useful for avoiding pointless queries to the DB.

- Joins
Yes, Bloom filters provide an alternative way to do joins, especially in distributed systems. Build a Bloom filter on the smaller dataset and send it to the other nodes for joining; since it is very compact in memory, shipping it to a remote system adds little overhead.
Spark, one of the disruptive big data processing systems, uses it for joins.

- Alternative to B-Tree
A B-Tree can also be used for membership queries, but if the B-Tree grows too large to fit in memory, all the benefit is lost.
Cassandra, another big data system, uses a Bloom-filter-style structure on top of its data blocks to avoid IO operations on read requests.

Parquet, another open source project in the big data space, uses it for fast filters.

- Partition Search
This is just another variation of the above use case. Apache Druid, which is again in the big data space for fast analytics, partitions data by time on ingestion, and within each partition columns have Bloom-filter-style indexes for fast filtering.

- Safe Website filter
This is an interesting one: Google Chrome uses a Bloom filter to determine whether a site is safe.


A simple implementation of a Bloom filter is available @ github.
In this implementation each hash function has an independent bit vector, but a single common vector can be used for all the hash functions.