[refer #PGPRO-6599]: Avoid race conditions while processing PROFILE_R… #52
Conversation
Not a deadlock, but infinite waiting of the backend process that requested the profile, caused by a lost request. In general, your proposed solution (moving the saving of the request value under the collector lock) resolves your problem. But we might still end up losing PROFILE_RESET requests. You might rewrite the patch to include some feedback mechanics for the backend performing the reset.
Making PROFILE_RESET request processing synchronous was not the goal of these patches. But this synchronization could be easily added.
Yes, the deadlock in the scenario you presented above occurs when successive queries interleave with the collector's request handling. As I have mentioned, making PROFILE_RESET request processing synchronous was not the goal of these patches.
I cannot find the coredump :(
@rzharkov, could you make the necessary changes - remove the init_lock_tag changes from this patch?
Of course, I'll do it. The task is stuck because my reviewer is on vacation :)
No, it's not redundant. Consider the following interleaving:
The following test script hangs sooner or later if
For pg_wait_sampling_reset_profile it doesn't matter, but it's better to keep both functions in sync.
Instead of init_lock_tag, which clashes with the like-named symbol in aqo, we now have two other unprefixed external symbols that may clash with other modules. I think we should prefix all external symbols with "pgws_" to be safe. And it's better to do this in a separate PR. (Roman, I'm sorry for asking you to fix both problems together. Now I see that they are unrelated.)

I will split the patch.
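As an illustration of the convention suggested above (the "pgws_" prefix is taken from the comment; the signature shown here is only a guess, not the extension's actual declaration):

```c
/* Before: an unprefixed external symbol. If another loaded extension
 * (e.g. aqo) exports a function with the same name, the dynamic linker
 * may bind both modules to a single definition. */
void init_lock_tag(LOCKTAG *tag);

/* After: prefixed with the module name, so a clash with another
 * extension's symbols is practically impossible. */
void pgws_init_lock_tag(LOCKTAG *tag);
```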
Hmm, agreed. Approved this fix. @rzharkov, could you rewrite your commit message so that the schedules causing the concurrency issue are exposed in a step-by-step manner? The message should include two schedules (the one you provided and the one from @shinderuk) to motivate the proposed changes: reordering the saving of the request until after the collector lock is acquired in the collector process, and setting the request variable in regular backends under this lock.
Consider the following sequence of events:

1. Session 1 calls pg_wait_sampling_reset_profile and sets the request shared variable to PROFILE_RESET.
2. The collector reads request and saves PROFILE_RESET to a local variable.
3. Session 2 queries pg_wait_sampling_profile, which sets request to PROFILE_REQUEST and waits for the collector in shm_mq_receive.
4. The collector continues and clears shared request, thus dropping PROFILE_REQUEST from Session 2.
5. Session 2 waits indefinitely in shm_mq_receive.

A similar example with query cancellation:

1. Session 1 queries pg_wait_sampling_history and sets request to HISTORY_REQUEST.
2. Session 1 cancels the query while waiting for the collector.
3. The collector reads request and saves HISTORY_REQUEST to a local variable.
4. Session 2 queries pg_wait_sampling_profile, sets request to PROFILE_REQUEST and waits for the collector.
5. The collector continues and responds to HISTORY_REQUEST.
6. Session 2 receives history data and renders it as profile data, returning invalid counts.

These interleavings are avoided by acquiring the collector lock before reading request from shared memory in the collector. But we also need to hold the collector lock when we set request in receive_array in a backend. Otherwise, the following interleaving is possible:

1. Session 1 calls pg_wait_sampling_reset_profile and sets request to PROFILE_RESET.
2. Session 2 queries pg_wait_sampling_profile, acquires and releases the collector lock.
3. The collector acquires the lock, reads request and saves PROFILE_RESET to a local variable.
4. Session 2 sets request to PROFILE_REQUEST.
5. The collector clears request, and PROFILE_REQUEST is lost.
6. Session 2 waits indefinitely in shm_mq_receive.

Same for the second example above. This patch, however, doesn't prevent losing PROFILE_RESET requests:

1. Session 1 calls pg_wait_sampling_reset_profile and sets request to PROFILE_RESET.
2. Session 2 queries pg_wait_sampling_profile before the collector reads request.
3. The collector reads PROFILE_REQUEST, while PROFILE_RESET is lost.

To fix this, we could make pg_wait_sampling_reset_profile wait for the collector, but we decided not to, as losing a PROFILE_RESET isn't critical.

Resolves #48.

Author: Roman Zharkov
Reported-By: Alexander Lakhin
Reviewed-By: Maksim Milyutin, Sergey Shinderuk
@maksm90 I dropped the changes related to init_lock_tag from the patch and rewrote the commit message. Please take a look. Are you satisfied with the commit message?
It looked like harmless refactoring. I would have accepted the patch with this fix.
All right! Approved.
[refer #PGPRO-6599] Avoid race conditions while processing PROFILE_REQUEST and
PROFILE_RESET requests.
After initialization of the "request" variable in the collector.c main loop,
another client backend could change the value of the "collector_hdr->request"
variable. Changing this value from "PROFILE_RESET" to "PROFILE_REQUEST" could
cause a deadlock: the collector processed the "PROFILE_RESET" query while the
client backend waited for data from the "collector_mq" shared memory message
queue. Now we read and write the "collector_hdr->request" variable only while
holding the "PGWS_COLLECTOR_LOCK" lock.
Removed obsolete "read_current_wait()" function definition from
pg_wait_sampling.h.