Hi,
I'm currently trying to get AlphaZero running with full parallelization, but I'm having issues at all levels of parallelization. I'm new to parallelization, so I might also have misunderstood some parts. I'm running it on a machine with 128 CPUs, but I cannot achieve high CPU utilization, no matter whether I try multi-threading or multi-processing.
-
Multi-threading (without GPU):
I have tried starting Julia with different numbers of threads and different AlphaZero parameters, but no matter whether I start `julia -t 128` or `julia -t 20`, `htop` only shows a CPU utilization of around 1200% for this process, so only around 12 threads are working. I was wondering whether that is due to them waiting for the inference server, but I got similar results when using a very small dummy network. Also, `SimParams.num_workers` was 128 and the batch size 64, so shouldn't other workers continue their simulations while some are waiting for the inference server? If inference is the bottleneck, would I be better off with a small batch size or a large one?
When not using a GPU, is there any benefit to batching inference requests at all? I.e., is it better to use multi-threading or multi-processing on one machine?
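As a basic sanity check (just a sketch of a REPL session, nothing AlphaZero-specific), is this the right way to verify how many Julia threads and BLAS threads the process actually has?

```julia
using LinearAlgebra

# What Julia itself was started with (should match the -t flag):
println("Julia threads:    ", Threads.nthreads())
# Dense linear algebra runs on BLAS threads, which are configured separately:
println("BLAS threads:     ", BLAS.get_num_threads())
# Hardware threads available on the machine:
println("Hardware threads: ", Sys.CPU_THREADS)

# If needed, the BLAS thread count can be pinned explicitly, e.g.:
# BLAS.set_num_threads(1)
```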
I also remember seeing a plot of AlphaZero performance over the number of workers somewhere in the AlphaZero.jl documentation or in a post you made (I think), but I cannot find it anymore. Do you happen to know which plot I'm referring to?
-
Multi-processing (without GPU):
When using multiple processes on the same machine (e.g. `julia -p 64`), `htop` shows all workers having a high CPU load during self-play. However, if I understand correctly, this is a waste of resources, since each process has to start its own inference server. Or is this actually better when not using a GPU?
What also confused me is that even when starting a single-threaded `julia -p 64`, `htop` shows multiple threads belonging to the main process during benchmarking (where AlphaZero does not use multi-processing). This is not problematic, I'm just trying to understand what is happening. I don't see how `Util.mapreduce` spawns multiple threads, since `Threads.nthreads()` should be 1. Furthermore, it is 8 threads that are working at full load (for a `julia -p 20` call), not 20, which would be the number of processes. So where does that number 8 come from?
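To make sure I'm reading this correctly, this is the kind of check I had in mind for seeing what each process actually gets (a plain `Distributed` sketch, assuming the processes were started with `-p`):

```julia
using Distributed
@everywhere using LinearAlgebra

# procs() includes the main process (id 1) as well as all workers.
for p in procs()
    nt   = remotecall_fetch(Threads.nthreads, p)
    blas = remotecall_fetch(BLAS.get_num_threads, p)
    println("process $p: $nt Julia thread(s), $blas BLAS thread(s)")
end
```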
-
Using GPU:
When I try running AlphaZero.jl with a GPU, it becomes incredibly slow for some reason, a lot slower than without a GPU. `htop` now shows a CPU usage of around 500%.
The machine has multiple GeForce RTX 2080 Ti cards with 10 GB of memory each. Any ideas what could cause this?
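Before digging further, would a plain CUDA.jl smoke test like the following be a useful data point to rule out a driver or device-selection problem? (Just a sketch, unrelated to the AlphaZero.jl inference server itself.)

```julia
using CUDA

@assert CUDA.functional()            # fail early if the GPU cannot be used
println("Device: ", CUDA.name(CUDA.device()))

A = CUDA.rand(Float32, 2048, 2048)
B = CUDA.rand(Float32, 2048, 2048)
CUDA.@time A * B                     # first call includes kernel compilation
CUDA.@time A * B                     # subsequent calls should be much faster
```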
Here are the parameters I used, in case this is relevant:
```julia
Network = NetLib.SimpleNet

netparams = NetLib.SimpleNetHP(
  width=128,
  depth_common=2,
  depth_phead=2,
  depth_vhead=2,
)

self_play = SelfPlayParams(
  sim=SimParams(
    num_games=5000,
    num_workers=128,
    batch_size=64,
    use_gpu=true,
    reset_every=1,
  ),
  mcts=PWMctsParams(
    num_iters_per_turn=1000,
    cpuct=1.0,
    prior_temperature=1.0,
    temperature=PLSchedule([10, 20, 30, 50], [1.0, 0.8, 0.3, 0.1]),
    dirichlet_noise_ϵ=0.25,
    dirichlet_noise_α=1.,
    k=1.,
    α=0.1,
  )
)

arena = ArenaParams(
  sim=SimParams(
    self_play.sim,
    num_games=128,
    use_gpu=true,
    reset_every=1,
  ),
  mcts=PWMctsParams(
    self_play.mcts,
    temperature=ConstSchedule(0.2),
    dirichlet_noise_ϵ=0.05,
  ),
  update_threshold=0.05)

learning = LearningParams(
  use_gpu=true,
  use_position_averaging=false,
  samples_weighing_policy=CONSTANT_WEIGHT,
  rewards_renormalization=10,
  l2_regularization=1e-4,
  optimiser=Adam(lr=5e-3),
  batch_size=1024,
  loss_computation_batch_size=1024,
  nonvalidity_penalty=1.,
  min_checkpoints_per_epoch=1,
  max_batches_per_checkpoint=5_000,
  num_checkpoints=1)

params = Params(
  arena=arena,
  self_play=self_play,
  learning=learning,
  num_iters=100,
  memory_analysis=nothing,
  ternary_rewards=false,
  use_symmetries=false,
  mem_buffer_size=PLSchedule(
    [      0,        30],
    [400_000, 1_000_000])
)
```
(`PWMctsParams` is for a progressive-widening MCTS I've implemented for continuous states.)
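By progressive widening I mean the usual rule of capping the number of expanded children of a node as a function of its visit count, which is roughly what `k` and `α` above control (illustrative sketch only, not my actual implementation):

```julia
# Illustrative only: the standard progressive-widening cap on the number of
# children of a node with visit count N, parameterized by k and α.
pw_max_children(N::Integer; k=1.0, α=0.1) = max(1, floor(Int, k * N^α))
```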
As a general question, I was also wondering why you removed the asynchronous MCTS version. Is there simply no benefit, because CPU power can just as well be used to parallelize across different MCTS trees instead of within the same tree?
Any help is appreciated!