Skip to content

Race between grow_thread_array and _Py_qsbr_reserve under free threading #129732

Closed
@hawkinsp

Description

@hawkinsp

Bug report

Bug description:

I don't have a succinct reproducer for this bug yet, but I saw the following race in JAX CI:

jax-ml/jax#26359

WARNING: ThreadSanitizer: data race (pid=208275)
  Write of size 8 at 0x555555d43b60 by thread T121:
    #0 grow_thread_array /__w/jax/jax/cpython/Python/qsbr.c:101:19 (python3.13+0x4a3905) (BuildId: 8f8869b5f3143bd14dda26aa2bf37336b4902370)
    #1 _Py_qsbr_reserve /__w/jax/jax/cpython/Python/qsbr.c:203:13 (python3.13+0x4a3905)
    #2 new_threadstate /__w/jax/jax/cpython/Python/pystate.c:1569:27 (python3.13+0x497df2) (BuildId: 8f8869b5f3143bd14dda26aa2bf37336b4902370)
    #3 PyGILState_Ensure /__w/jax/jax/cpython/Python/pystate.c:2766:16 (python3.13+0x49af78) (BuildId: 8f8869b5f3143bd14dda26aa2bf37336b4902370)
    #4 nanobind::gil_scoped_acquire::gil_scoped_acquire() /proc/self/cwd/external/nanobind/include/nanobind/nb_misc.h:15:43 (xla_extension.so+0xa4fe551) (BuildId: 32eac14928efa68545d22a6013f16aa63a686fef)
    #5 xla::CpuCallback::PrepareAndCall(void*, void**) /proc/self/cwd/external/xla/xla/python/callback.cc:67:26 (xla_extension.so+0xa4fe551)
    #6 xla::XlaPythonCpuCallback(void*, void**, XlaCustomCallStatus_*) /proc/self/cwd/external/xla/xla/python/callback.cc:177:22 (xla_extension.so+0xa500c9a) (BuildId: 32eac14928efa68545d22a6013f16aa63a686fef)
...

  Previous read of size 8 at 0x555555d43b60 by thread T124:
    #0 _Py_qsbr_reserve /__w/jax/jax/cpython/Python/qsbr.c:216:47 (python3.13+0x4a3ad7) (BuildId: 8f8869b5f3143bd14dda26aa2bf37336b4902370)
    #1 new_threadstate /__w/jax/jax/cpython/Python/pystate.c:1569:27 (python3.13+0x497df2) (BuildId: 8f8869b5f3143bd14dda26aa2bf37336b4902370)
    #2 PyGILState_Ensure /__w/jax/jax/cpython/Python/pystate.c:2766:16 (python3.13+0x49af78) (BuildId: 8f8869b5f3143bd14dda26aa2bf37336b4902370)
    #3 nanobind::gil_scoped_acquire::gil_scoped_acquire() /proc/self/cwd/external/nanobind/include/nanobind/nb_misc.h:15:43 (xla_extension.so+0xa4fe551) (BuildId: 32eac14928efa68545d22a6013f16aa63a686fef)
    #4 xla::CpuCallback::PrepareAndCall(void*, void**) /proc/self/cwd/external/xla/xla/python/callback.cc:67:26 (xla_extension.so+0xa4fe551)
    #5 xla::XlaPythonCpuCallback(void*, void**, XlaCustomCallStatus_*) /proc/self/cwd/external/xla/xla/python/callback.cc:177:22 (xla_extension.so+0xa500c9a) (BuildId: 32eac14928efa68545d22a6013f16aa63a686fef)
...

I think what's happening here is that two threads that were not created by Python are calling PyGILState_Ensure concurrently, so they can call into CPython APIs.

This appears to be an unlocked access on shared->array and it would probably be sufficient to move that read under the mutex in _Py_qsbr_reserve.

CPython versions tested on:

3.13

Operating systems tested on:

Linux

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions