Distributed computing in TensorFlow
In this section, you will learn how to distribute computation in TensorFlow. Knowing how to do this is important because it lets you:
- Run more experiments in parallel (for example, searching for hyperparameters with grid search)
- Distribute model training over multiple GPUs (on multiple servers) to reduce training time
One famous use case is a paper published by Facebook showing how to train on ImageNet in 1 hour (instead of weeks): a ResNet-50 was trained on ImageNet using 256 GPUs distributed across 32 servers, with a batch size of 8,192 images.
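To give a feel for what this looks like in practice, here is a minimal sketch of data-parallel training on a single machine with several GPUs, using TensorFlow's tf.distribute.MirroredStrategy (TensorFlow 2.x). The model, dataset, and batch size are illustrative assumptions, not the setup from the Facebook paper:

```python
import tensorflow as tf

# Illustrative sketch: data-parallel training on all local GPUs.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The global batch is split evenly across the replicas.
global_batch_size = 64 * strategy.num_replicas_in_sync

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

# Variables must be created inside the strategy scope so that
# a copy is mirrored on every device.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=global_batch_size, epochs=2)
```

MirroredStrategy keeps a copy of every variable on each GPU and combines the per-device gradients with an all-reduce, so every device ends each step with identical weights.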
Model/data parallelism
There are two main ways to achieve parallelism and scale your task across multiple servers:
- Model Parallelism: When your model does not fit on a single GPU, its layers are split across different devices or servers, each computing one part of the model (see the first sketch after this list).
- Data Parallelism: The same model is replicated on different servers, but each one handles a different batch of data; each server therefore computes a different gradient, and we need some form of synchronization, such as averaging the gradients across servers before updating the shared weights (see the second sketch below).
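As a first sketch of model parallelism, the following hypothetical snippet splits a small Keras model across two devices with tf.device; the device names and layer sizes are assumptions for illustration, and in a real setting the split would follow the model's memory footprint:

```python
import tensorflow as tf

# Hypothetical model-parallel split: the first layer lives on one
# device and the rest on another, so a model too large for a single
# GPU can still run. Device names are assumptions.
with tf.device("/GPU:0"):
    inputs = tf.keras.Input(shape=(1024,))
    hidden = tf.keras.layers.Dense(4096, activation="relu")(inputs)

with tf.device("/GPU:1"):
    # The activations computed on GPU:0 are transferred here
    # between the two layers.
    outputs = tf.keras.layers.Dense(10, activation="softmax")(hidden)

model = tf.keras.Model(inputs, outputs)
```

The price of model parallelism is the device-to-device transfer of activations at every split point, which is why it is usually reserved for models that genuinely cannot fit on one device.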
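As a second sketch, here is the data-parallel idea reduced to its core: two hypothetical replicas compute gradients on different batches, and the gradients are averaged before a single update, mimicking what an all-reduce or parameter server would do in a real cluster:

```python
import tensorflow as tf

# Hypothetical sketch of data parallelism: each replica computes a
# gradient on its own batch, and the gradients are averaged before
# the shared weight is updated (a manual stand-in for all-reduce).
w = tf.Variable(2.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

def replica_gradient(x, y):
    # Each "server" computes the gradient of its own batch loss.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * x - y) ** 2)
    return tape.gradient(loss, [w])[0]

# Two replicas see different batches of the same dataset.
g0 = replica_gradient(tf.constant([1.0, 2.0]), tf.constant([3.0, 5.0]))
g1 = replica_gradient(tf.constant([3.0, 4.0]), tf.constant([7.0, 9.0]))

# Synchronization step: average the per-replica gradients and
# apply a single update to the shared weight.
avg_grad = (g0 + g1) / 2.0
optimizer.apply_gradients([(avg_grad, w)])
```

Averaging the gradients keeps every replica's copy of the weights identical after the update, which is exactly the synchronization problem that strategies such as MirroredStrategy solve for you.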