CUDA vs auto-vectorized C/C++

Hi All,

I have to develop a deep neural network. I can run it either on a big cluster with hundreds of CPUs or on a GPU.
I would like to know which would be faster:

  • develop a CUDA version (I don’t mind the time required to exchange memory with the GPU)
  • develop an auto-vectorized C/C++ version that I can run on multiple CPUs.

Does anyone have an idea?

I also have a second question: since the individual operations are simple (mainly arithmetic operations between arrays), the software will not saturate the GPU. So is it possible to run multiple simple processes in parallel on a single GPU?

Thank you for your help.

A good implementation of a large neural network (as long as it isn’t too sparsely connected) should be FLOP-limited, so a GPU would probably have an advantage close to the ratio of the two platforms’ peak FLOPs.

Note that a simple implementation of a neural network would involve simple array updates, but as you mention, these would not be compute-intensive enough to saturate a GPU (or a CPU, for that matter). A better implementation would fuse multiple array operations into heavier-weight kernels, as in the sketch below, but this also means you can’t use off-the-shelf libraries as-is.
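
To make the fusion idea concrete, here is a minimal CUDA sketch (the kernel name and shapes are made up for illustration, not taken from any library). Instead of launching one kernel for a bias add and a second one for the activation, a single fused kernel touches each element once:

    // Fuses a bias add and a tanh activation into one pass over the array.
    // Done as two separate kernels, the intermediate result would make an
    // extra round trip through global memory; fused, it never leaves registers.
    __global__ void fused_bias_tanh(float *out, const float *in,
                                    const float *bias, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tanhf(in[i] + bias[i]);
    }

    // Example launch for n elements:
    //   fused_bias_tanh<<<(n + 255) / 256, 256>>>(d_out, d_in, d_bias, n);

The saving is entirely memory traffic: for bandwidth-bound elementwise work, every kernel you fuse away removes a full read-write pass over the arrays.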

What people commonly do is build networks with locally-connected layers and implement the forward propagation with block-sparse (i.e. batched) SGEMM operations. There will still be some additional operations left over, and you would need to fuse these into the SGEMM kernels (effectively writing your own custom SGEMM) to really get close to peak.
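
For the batched-SGEMM part, one way to drive it is cuBLAS’s cublasSgemmBatched, sketched below under some simplifying assumptions: all blocks share the same m x n x k shape, and dA/dB/dC are device arrays of per-block device pointers (the wrapper name is mine, not part of any API). This covers the bulk matrix work; the leftover elementwise operations would still need the custom fusing described above.

    #include <cublas_v2.h>

    // Hypothetical wrapper: one SGEMM per locally-connected block,
    // C[i] = A[i] * B[i]. All matrices are column-major, as cuBLAS expects:
    // A[i] is m x k, B[i] is k x n, C[i] is m x n.
    void blockSgemm(cublasHandle_t handle,
                    const float **dA, const float **dB, float **dC,
                    int m, int n, int k, int batchCount)
    {
        const float alpha = 1.0f;
        const float beta  = 0.0f;
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           m, n, k,
                           &alpha,
                           dA, m,   // lda = m
                           dB, k,   // ldb = k
                           &beta,
                           dC, m,   // ldc = m
                           batchCount);
    }

Batching matters because each block of a locally-connected layer is individually too small to fill the GPU; submitting them all in one batched call keeps the SMs busy at once.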