Hi,
I'm currently trying to get AlphaZero running with full parallelization, but I'm having issues at all levels of parallelization. I'm new to parallelization, so I might also have misunderstood some parts. I'm running it on a machine with 128 CPUs, but I cannot achieve high CPU utilization, no matter whether I try multi-threading or multi-processing.
-
Multi-threading (without GPU):
I have tried starting Julia with different numbers of threads and different AlphaZero parameters, but no matter whether I start `julia -t 128` or `julia -t 20`, `htop` only shows a CPU utilization of around 1200% for this process, so only around 12 threads are working. I was wondering whether that is due to them waiting for the inference server, but I got similar results when using a very small dummy network. Also, `SimParams.num_workers` was 128 and the batch size 64, so shouldn't other workers continue their simulations while some are waiting for the inference server? If inference is the bottleneck, would I be better off with a small batch size or a large one?
When not using a GPU, is there any benefit to batching inference requests at all? I.e., is it better to use multi-threading or multi-processing on one machine?
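As a basic sanity check (just a sketch of a REPL session, nothing AlphaZero-specific), is this the right way to verify how many Julia threads and BLAS threads the process actually has?

```julia
using LinearAlgebra

# What Julia itself was started with (should match the -t flag):
println("Julia threads:    ", Threads.nthreads())
# Dense linear algebra runs on BLAS threads, which are configured separately:
println("BLAS threads:     ", BLAS.get_num_threads())
# Hardware threads available on the machine:
println("Hardware threads: ", Sys.CPU_THREADS)

# If needed, the BLAS thread count can be pinned explicitly, e.g.:
# BLAS.set_num_threads(1)
```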
I also remember seeing a plot of AlphaZero performance over the number of workers somewhere in the AlphaZero.jl documentation or in a post you made (I think), but I cannot find it anymore. Do you happen to know which plot I'm referring to?
-
Multi-processing (without GPU):
When using multiple processes on the same machine (e.g. `julia -p 64`), `htop` shows all workers having a high CPU load during self-play. However, if I understand correctly, this is a waste of resources, since each process has to start its own inference server. Or is this actually better when not using a GPU?
What also confused me is that even when starting a single-threaded `julia -p 64`, `htop` shows multiple threads belonging to the main process during benchmarking (where AlphaZero does not use multi-processing). This is not problematic, I'm just trying to understand what is happening. I don't see how `Util.mapreduce` spawns multiple threads, since `Threads.nthreads()` should be 1. Furthermore, it is 8 threads that are working at full load (for a `julia -p 20` call), not 20, which would be the number of processes. So where does that number 8 come from?
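To make sure I'm reading this correctly, this is the kind of check I had in mind for seeing what each process actually gets (a plain `Distributed` sketch, assuming the processes were started with `-p`):

```julia
using Distributed
@everywhere using LinearAlgebra

# procs() includes the main process (id 1) as well as all workers.
for p in procs()
    nt   = remotecall_fetch(Threads.nthreads, p)
    blas = remotecall_fetch(BLAS.get_num_threads, p)
    println("process $p: $nt Julia thread(s), $blas BLAS thread(s)")
end
```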
-
Using GPU:
When I try running AlphaZero.jl with a GPU, it becomes incredibly slow for some reason, a lot slower than without a GPU. `htop` now shows a CPU usage of around 500%.
The machine has multiple GeForce RTX 2080 Ti cards with 10 GB of memory each. Any ideas what could cause this?
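Before digging further, would a plain CUDA.jl smoke test like the following be a useful data point to rule out a driver or device-selection problem? (Just a sketch, unrelated to the AlphaZero.jl inference server itself.)

```julia
using CUDA

@assert CUDA.functional()            # fail early if the GPU cannot be used
println("Device: ", CUDA.name(CUDA.device()))

A = CUDA.rand(Float32, 2048, 2048)
B = CUDA.rand(Float32, 2048, 2048)
CUDA.@time A * B                     # first call includes kernel compilation
CUDA.@time A * B                     # subsequent calls should be much faster
```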
Here are the parameters I used, in case this is relevant:
```julia
Network = NetLib.SimpleNet

netparams = NetLib.SimpleNetHP(
  width=128,
  depth_common=2,
  depth_phead=2,
  depth_vhead=2,
)

self_play = SelfPlayParams(
  sim=SimParams(
    num_games=5000,
    num_workers=128,
    batch_size=64,
    use_gpu=true,
    reset_every=1,
  ),
  mcts=PWMctsParams(
    num_iters_per_turn=1000,
    cpuct=1.0,
    prior_temperature=1.0,
    temperature=PLSchedule([10, 20, 30, 50], [1.0, 0.8, 0.3, 0.1]),
    dirichlet_noise_ϵ=0.25,
    dirichlet_noise_α=1.,
    k=1.,
    α=0.1,
  )
)

arena = ArenaParams(
  sim=SimParams(
    self_play.sim,
    num_games=128,
    use_gpu=true,
    reset_every=1,
  ),
  mcts=PWMctsParams(
    self_play.mcts,
    temperature=ConstSchedule(0.2),
    dirichlet_noise_ϵ=0.05,
  ),
  update_threshold=0.05)

learning = LearningParams(
  use_gpu=true,
  use_position_averaging=false,
  samples_weighing_policy=CONSTANT_WEIGHT,
  rewards_renormalization=10,
  l2_regularization=1e-4,
  optimiser=Adam(lr=5e-3),
  batch_size=1024,
  loss_computation_batch_size=1024,
  nonvalidity_penalty=1.,
  min_checkpoints_per_epoch=1,
  max_batches_per_checkpoint=5_000,
  num_checkpoints=1)

params = Params(
  arena=arena,
  self_play=self_play,
  learning=learning,
  num_iters=100,
  memory_analysis=nothing,
  ternary_rewards=false,
  use_symmetries=false,
  mem_buffer_size=PLSchedule(
    [      0,        30],
    [400_000, 1_000_000])
)
```
(`PWMctsParams` is for a progressive-widening MCTS I've implemented for continuous states.)
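By progressive widening I mean the usual rule of capping the number of expanded children of a node as a function of its visit count, which is roughly what `k` and `α` above control (illustrative sketch only, not my actual implementation):

```julia
# Illustrative only: the standard progressive-widening cap on the number of
# children of a node with visit count N, parameterized by k and α.
pw_max_children(N::Integer; k=1.0, α=0.1) = max(1, floor(Int, k * N^α))
```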
As a general question, I was also wondering why you removed the asynchronous MCTS version. Is there simply no benefit, because CPU power can just as well be used to parallelize across different MCTS trees instead of within the same tree?
Any help is appreciated!