
Questions on parallelization #71

@johannes-fischer

Description

Hi,

I'm currently trying to run AlphaZero with full parallelization, but I'm having issues at every level. I'm new to parallelization, so I might also have misunderstood some parts. I'm running on a machine with 128 CPUs, but I cannot achieve high CPU utilization, whether I try multi-threading or multi-processing.

  1. Multi-threading (without GPU):
    I have tried starting Julia with different numbers of threads and different AlphaZero parameters. Whether I start julia -t 128 or julia -t 20, htop only shows a CPU utilization of around 1200% for this process, so only about 12 threads are working. I wondered whether this is due to threads waiting on the inference server, but I got similar results when using a very small dummy network. Also, SimParams.num_workers was 128 and the batch size 64, so shouldn't other workers continue their simulations while some are waiting for the inference server? If inference is the bottleneck, would I be better off with a small batch size or a large one?
    [Screenshot from 2021-08-30 13:58: htop showing ~1200% CPU utilization]
    When not using a GPU, is there any benefit to batching inference requests at all? I.e., is it better to use multi-threading or multi-processing on a single machine?
    I also remember seeing a plot of AlphaZero performance over the number of workers somewhere in the AlphaZero.jl documentation or in a post you made (I think), but I cannot find it anymore. Do you happen to know which plot I'm referring to?

  2. Multi-processing (without GPU):
    When using multiple processes (on the same machine, e.g. julia -p 64), htop shows all workers under high CPU load during self-play. However, if I understand correctly, this wastes resources, since each process has to start its own inference server. Or is this actually preferable when not using a GPU?
    What also confused me is that even when calling single-threaded julia -p 64, htop shows multiple threads belonging to the main process during benchmarking (where AlphaZero does not use multi-processing). This is not problematic; I'm just trying to understand what's happening. I don't see how Util.mapreduce could spawn multiple threads, since Threads.nthreads() should be 1. Furthermore, it is 8 threads that are working at full load (these are from a julia -p 20 call), not 20, which would be the number of processes. So where does that number 8 come from?
    [Screenshot from 2021-08-30 13:23: htop showing 8 threads of the main process at full load]

  3. Using GPU:
    When I try running AlphaZero.jl with a GPU, it becomes incredibly slow for some reason, a lot slower than without a GPU. htop now shows a CPU usage of around 500%:
    [Screenshot from 2021-08-30 14:29: htop showing ~500% CPU usage]
    The machine has multiple GeForce RTX 2080 Ti with 10GB memory. Any ideas what could cause this?
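One plausible explanation for the 8 busy threads in point 2 (an assumption on my part, not confirmed): Julia's bundled OpenBLAS maintains its own thread pool, whose size is configured independently of Threads.nthreads() and has often been capped at 8 by default. A quick diagnostic sketch:

```julia
# Diagnostic sketch: check whether the extra threads come from OpenBLAS.
# BLAS's thread pool is configured separately from Julia's task threads.
using LinearAlgebra
println("Julia threads: ", Threads.nthreads())      # 1 for plain `julia -p 64`
println("BLAS threads:  ", BLAS.get_num_threads())  # often capped at 8 by default
# Pinning BLAS to one thread per process avoids oversubscription:
BLAS.set_num_threads(1)
```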

Here are the parameters I used, in case this is relevant:

```julia
Network = NetLib.SimpleNet
netparams = NetLib.SimpleNetHP(
  width=128,
  depth_common=2,
  depth_phead=2,
  depth_vhead=2,
)

self_play = SelfPlayParams(
  sim=SimParams(
    num_games=5000,
    num_workers=128,
    batch_size=64,
    use_gpu=true,
    reset_every=1,
  ),
  mcts=PWMctsParams(
    num_iters_per_turn=1000,
    cpuct=1.0,
    prior_temperature=1.0,
    temperature=PLSchedule([10, 20, 30, 50], [1.0, 0.8, 0.3, 0.1]),
    dirichlet_noise_ϵ=0.25,
    dirichlet_noise_α=1.,
    k=1.,
    α=0.1,
  )
)

arena = ArenaParams(
  sim=SimParams(
    self_play.sim,
    num_games=128,
    use_gpu=true,
    reset_every=1,
  ),
  mcts=PWMctsParams(
    self_play.mcts,
    temperature=ConstSchedule(0.2),
    dirichlet_noise_ϵ=0.05,
  ),
  update_threshold=0.05)

learning = LearningParams(
  use_gpu=true,
  use_position_averaging=false,
  samples_weighing_policy=CONSTANT_WEIGHT,
  rewards_renormalization=10,
  l2_regularization=1e-4,
  optimiser=Adam(lr=5e-3),
  batch_size=1024,
  loss_computation_batch_size=1024,
  nonvalidity_penalty=1.,
  min_checkpoints_per_epoch=1,
  max_batches_per_checkpoint=5_000,
  num_checkpoints=1)

params = Params(
  arena=arena,
  self_play=self_play,
  learning=learning,
  num_iters=100,
  memory_analysis=nothing,
  ternary_rewards=false,
  use_symmetries=false,
  mem_buffer_size=PLSchedule(
    [      0,        30],
    [400_000, 1_000_000])
)
```
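If batching indeed only pays off on a GPU (which is part of my question above), a CPU-only variant of the SimParams might look like this hypothetical, untested sketch:

```julia
# Hypothetical CPU-only variant (untested): one simulation worker per
# Julia thread and no inference batching.
sim_cpu = SimParams(
  num_games=5000,
  num_workers=Threads.nthreads(),
  batch_size=1,
  use_gpu=false,
  reset_every=1,
)
```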

(PWMctsParams are for a progressive widening MCTS I've implemented for continuous states)

As a general question, I was also wondering why you removed the asynchronous MCTS version. Is there simply no benefit, because CPU power can just as well be used to parallelize over different MCTS trees instead of within the same tree?
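To make sure I understand the tradeoff: by "parallelize over different MCTS trees" I mean something like root parallelization, where each worker searches its own independent tree and the visit counts are merged afterwards, rather than sharing one tree via virtual loss. A toy, self-contained sketch (the run_mcts stub below is hypothetical, not AlphaZero.jl API):

```julia
# Toy illustration of root parallelization (not AlphaZero.jl API):
# each task searches an independent tree; visit counts are merged at the end.
using Random

# Stand-in for a single-tree MCTS that returns per-action visit counts.
function run_mcts(rng, nactions, nsims)
    counts = zeros(Int, nactions)
    for _ in 1:nsims
        counts[rand(rng, 1:nactions)] += 1  # placeholder for a real simulation
    end
    return counts
end

function parallel_root_search(nactions, nsims, nworkers)
    results = Vector{Vector{Int}}(undef, nworkers)
    Threads.@threads for i in 1:nworkers
        results[i] = run_mcts(MersenneTwister(i), nactions, nsims ÷ nworkers)
    end
    return reduce(+, results)  # merge visit counts across the independent trees
end
```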

Any help is appreciated!
