
Thursday, 10 January 2019

Is it worth learning Golang?

I was looking for a new language to learn and Go looked like a very good candidate. It is getting popular due to its simplicity and power.

It was created by some of the best minds of our industry:
  • Robert Griesemer - Google's V8 JavaScript engine, Java HotSpot virtual machine
  • Rob Pike - UNIX, and co-creator of the world's most popular character encoding, UTF-8
  • Ken Thompson - UNIX, the B language (predecessor of C), and co-creator of UTF-8
Now we have so many language choices.
For ease of programming people use dynamic languages like Python, Ruby, JavaScript, etc., and for safety the options are C++, Java, C#, or functional/VM-based languages (Scala, Clojure, etc.).

So it feels like you must give up safety for ease, or vice versa. Some newer languages promise both, but with fancy syntax that makes them really hard to learn.

Go took a very different approach (it still uses curly braces): keep the syntax simple enough that most programmers can read the code, and solve the hard issues like

 - Memory management/garbage collection.
 - Pure value types, no abstraction on top of abstraction, a data-oriented design.
 - Designed for multi-core.
 - Distributed computing support.
 - Access to low-level programming constructs.
 - Portable to many operating systems.
 - Interesting module system and dependency management.
 - Very simple error handling.
 - Interesting support for OOP.
 - Easy to read, with a simple mental model: no hiding of costs like how much memory allocation or CPU processing is required.

Pictures are better than a thousand words, so I picked up some content from GoCon Tokyo.

[Slide from GoCon Tokyo: Efficient]

[Slide from GoCon Tokyo: Concurrency]

If you want to learn a new programming language today, then Go looks like a very interesting choice.
It is not perfect; read about the things the community doesn't like @ go-is-not-good to get an idea of what is left out of Go.

I am starting to learn Go and will be sharing my experience with it.
Let's Go for it :-)

Saturday, 14 September 2013

Concurrent Counter With No False Sharing

This blog is a continuation of the Scalable Counter post.

One of the readers shared results from his system. He ran the test on a 16-core Intel XEON processor and the total time taken for each type of counter was almost the same: although the atomic counter has CAS failures and the other types don't have any, it made no difference in execution time.
A very strange result that needed further investigation.

Another reader pointed out that it could be due to false sharing, so it was worth taking that into account, and I created another class that handles FALSE SHARING.

[Chart: time taken by the different counters. Y axis: time taken to increment 1 million times; X axis: number of threads]

PaddedAtomicCounter is a new type of counter that I added to the test, and it outperforms all the other counters.
It uses cache-line padding to avoid false sharing.
A cache line is 64 bytes on most of today's processors. PaddedAtomicCounter is an int-based counter, so it allocates 16 int slots per counter (16 x 4 bytes = 64 bytes), which keeps each counter on its own cache line. By using this technique we avoid cache pollution, and as a result we see a 16X gain compared to AtomicCounter; without padding the gain was 5X, and with cache-line padding it jumps to 16X.
Cache-line padding needs some extra space, so it is a trade-off of memory for speed; you can choose what you want!
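To make the idea concrete, here is a minimal sketch of a padded, striped counter. This is only an illustration under the 64-byte cache-line assumption, not the exact PaddedAtomicCounter class from the repository:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of a cache-line padded counter: each logical slot is 16 ints
// (16 * 4 bytes = 64 bytes) apart, so two slots never share a cache line
// and threads do not invalidate each other's cached lines.
public class PaddedCounterSketch {
    private static final int PADDING = 16;   // ints per assumed 64-byte cache line
    private final AtomicIntegerArray counts;
    private final int slots;

    public PaddedCounterSketch(int slots) {
        this.slots = slots;
        this.counts = new AtomicIntegerArray(slots * PADDING);
    }

    public void increment() {
        // Map the current thread to one of the padded slots.
        int index = (int) (Thread.currentThread().getId() % slots) * PADDING;
        counts.incrementAndGet(index);
    }

    public long get() {
        long sum = 0;
        for (int i = 0; i < slots; i++) {
            sum += counts.get(i * PADDING);
        }
        return sum;
    }
}
```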

CAS failure rate
Let's look at the CAS failures for the different counters and what they mean for performance.


[Chart: CAS failures for each counter. Y axis: CAS failures in 100Ks; X axis: number of threads]

PaddedAtomicCounter has some CAS failures compared to the other counters, but they make no difference in its execution time.
CAS failure is not the only factor that determines execution time; false sharing makes a significant contribution to it, and this gives a good explanation of the behaviour seen on the XEON processor.

Conclusion
To get better performance you have to take care of a few things:

 - Contention - there are many techniques to avoid it; this blog shows one of them.

 - False sharing - you have to avoid false sharing to get the best out of the processor, and padding is required for that. Some JDK classes, like ThreadLocalRandom, already use padding; we now also have the @Contended annotation in Java to achieve the same thing, and it is used in ForkJoinPool. A small sketch of @Contended is shown below.
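As a rough illustration (my assumption about usage, not code from this post): @Contended is an internal JDK annotation (sun.misc.Contended in JDK 8), and for application classes it only takes effect when the JVM is started with -XX:-RestrictContended.

```java
import sun.misc.Contended;   // internal API; moved to jdk.internal.vm.annotation in JDK 9+

// Sketch only: run with -XX:-RestrictContended so the JVM honours the annotation
// for application classes and pads each field onto its own cache line.
public class ContendedCountersSketch {
    @Contended
    volatile long hits;

    @Contended
    volatile long misses;
}
```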

Code is available @ github

   

Tuesday, 10 September 2013

Scalable Counters For Multi Core

Counters are required everywhere, for example to find the key KPIs of an application: the load on the application, the total number of requests served, KPIs for finding the throughput of the application, and many more.

With all these requirements the complexity of concurrency is also added, and that makes this problem interesting.

How to implement a concurrent counter

 - Synchronized - this was the only option before JDK 1.5; now that we are waiting for the JDK 8 release, it is definitely not the option to pick.

 - Lock based - you should never attempt this for a counter, it will perform very badly.

 - Wait free - Java does not have support for fetch-and-add, so it is a bit difficult to implement.

 - Lock free - with very good support for compare-and-swap, this looks like a good option to use (a minimal sketch follows this list).
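Here is a minimal sketch of such a lock-free counter, written with an explicit compare-and-swap loop so that CAS failures can also be counted (class and method names are illustrative, not the exact code from the repository):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Lock-free counter sketch: retry the CAS until it succeeds and
// record how often another thread won the race.
public class CasCounterSketch {
    private final AtomicInteger value = new AtomicInteger();
    private final AtomicInteger casFailures = new AtomicInteger();

    public void increment() {
        for (;;) {
            int current = value.get();
            if (value.compareAndSet(current, current + 1)) {
                return;                         // CAS succeeded
            }
            casFailures.incrementAndGet();      // lost the race, retry
        }
    }

    public int get()      { return value.get(); }
    public int failures() { return casFailures.get(); }
}
```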

How does a compare-and-swap based counter perform?

I used AtomicInteger for this test; the counter is incremented 1 million times by each thread, and to increase the contention the number of threads is increased gradually. A rough sketch of the test loop is shown below.
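This is roughly what the measurement looks like (a sketch under my assumptions about the harness, not the exact benchmark code from the repository):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Each of N threads increments a shared AtomicInteger 1 million times;
// the total wall-clock time is reported for that thread count.
public class CounterBenchmarkSketch {
    static final int INCREMENTS_PER_THREAD = 1_000_000;

    public static void main(String[] args) throws InterruptedException {
        final int threadCount = Integer.parseInt(args[0]);
        final AtomicInteger counter = new AtomicInteger();

        Thread[] threads = new Thread[threadCount];
        long start = System.nanoTime();
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < INCREMENTS_PER_THREAD; j++) {
                        counter.incrementAndGet();
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(threadCount + " threads took "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```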

Test Machine Details
OS : Windows 8
JDK : 1.7.0.25
CPU : Intel i7-3632QM , 8 Core
RAM : 8 GB











[Chart: time taken by the AtomicInteger counter. Y axis: time taken to increment 1 million times; X axis: number of threads]

As the number of threads increases, the time taken to increment the counter goes up, and that is due to contention.
For a CAS-based counter, it is the CAS failures that cause the slowdown.

Is this the best performance we can get? Definitely not, there are better ways to implement a concurrent counter; let's have a look at them.

Alternate Concurrent Counter
Let's look at some counter implementations that handle contention in a better way (a sketch of the idea follows the list):

 - Core based counter - maintain a counter for each logical core; that way there is less contention. The only issue with this type of counter is that once the number of threads exceeds the number of logical cores, you start noticing contention again.

 - Thread based counter - maintain counters for the total number of threads that will be using the system. This works well when the number of threads is more than the number of logical cores.
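A minimal sketch of the striped-counter idea, using the simple thread-id based slot selection mentioned later in this post (names are illustrative; the core-based and thread-based variants differ only in the number of stripes):

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Striped counter sketch: one slot per core (or per expected thread),
// chosen by thread id, and summed on read.
public class StripedCounterSketch {
    private final AtomicLongArray slots;

    public StripedCounterSketch(int stripeCount) {
        // e.g. Runtime.getRuntime().availableProcessors() for a core-based counter,
        // or the expected number of threads for a thread-based counter.
        this.slots = new AtomicLongArray(stripeCount);
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() % slots.length());
        slots.incrementAndGet(index);
    }

    public long get() {
        long sum = 0;
        for (int i = 0; i < slots.length(); i++) {
            sum += slots.get(i);
        }
        return sum;
    }
}
```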


Let's test it.

[Chart: time taken by the different types of counter. Y axis: time taken to increment 1 million times; X axis: number of threads]

The concurrent counters perform much better than the atomic counter; for 16 threads they are around 5X faster, which is a huge difference!

CAS Failure Rate

[Chart: CAS failures per counter. Y axis: CAS failures in 100Ks; X axis: number of threads]

Due to contention, the atomic counter sees a lot of failures, and they go up exponentially as more threads are added, while the other counters perform pretty well.

Observation
Multi-core machines are becoming easily available, and we have to change the way we handle concurrency; the traditional way of doing it is not going to scale at a time when 24- or 48-core servers are very common.

 - To reduce contention you have to use multiple counters and then aggregate them later.

 - A core-based counter works well if the number of threads is less than or the same as the number of cores.

 - A thread-based counter is good when the number of threads is much more than the number of available cores.

 - The key to reducing contention is identifying the counter to which a thread will write. I have used a simple approach based on the thread id, but much better approaches are available; look at ThreadLocalRandom in JDK 8 for some ideas.

 - The thread-based approach is used by LongAdder in JDK 8, which creates many slots to reduce contention (a small usage example is shown below).
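For completeness, a small example of LongAdder (java.util.concurrent.atomic, JDK 8) doing the same job:

```java
import java.util.concurrent.atomic.LongAdder;

// LongAdder spreads updates over internal cells to reduce contention
// and sums them on read, much like the thread-based counter above.
public class LongAdderExample {
    public static void main(String[] args) {
        LongAdder requests = new LongAdder();

        requests.increment();   // contended updates land on different cells
        requests.add(5);

        System.out.println("total = " + requests.sum());   // prints: total = 6
    }
}
```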

Code for all the counters used in this test is available @ Github