Re: Intermittent pg_ctl failures on Windows - Mailing list pgsql-hackers
From | [email protected] |
---|---|
Subject | Re: Intermittent pg_ctl failures on Windows |
Date | |
Msg-id | [email protected] Whole thread Raw |
In response to | RE: Intermittent pg_ctl failures on Windows (Badrul Chowdhury <[email protected]>) |
Responses |
Re: Intermittent pg_ctl failures on Windows
|
List | pgsql-hackers |
Hello! We reproduced these errors on out buildfarm and my windows workstation. We used small TAP test that restarts PostgresNode in loop. Additionally, constant WAITS_PER_SEC in the pg_ctl.c file has been increased 1000 times. There are two different problems with pg_ctl: 1 - share access to postmaster.pid; 2 - share access to logfile. Postmaster runs in the shell (CMD.EXE). When the server stops cmd.exe may blocks logfile from been opened by new cmd.exe. > waiting for server to shut down.... done > server stopped > waiting for server to start....The process cannot access the file > because it is being used by another process. > stopped waiting > pg_ctl: could not start server > Examine the log output. To avoid the first error we made two changes: pg_ctl now opens the postmaster.pid file using pgwin32_open() function to correctly handle share locks. On Windows systems we cannot handle ERROR_DELETE_PENDING because GetLastError() returns ERROR_ACCESS_DENIED instead. So we rename the lock files before delete them. To avoid the second error we added the wait for cmd.exe finish in the do_stop() and do_restart() functions. Git patch is in the attachment. On 2018-03-12 18:55, Badrul Chowdhury wrote: > Hi Tom, > > This is a great catch. I am looking into it: I will start by > reproducing the error as you suggested. > > Thanks, > Badrul > > -----Original Message----- > From: Tom Lane [mailto:[email protected]] > Sent: Saturday, March 10, 2018 2:48 PM > To: [email protected] > Subject: Intermittent pg_ctl failures on Windows > > The buildfarm's Windows members occasionally show weird pg_ctl > failures, for instance this recent case: > > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fp.rizon.top%3A443%2Fhttps%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0 > > ### Restarting node "master" > # Running: pg_ctl -D > G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata > -l > G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log > restart waiting for server to shut down.... done server stopped > waiting for server to start....The process cannot access the file > because it is being used by another process. > stopped waiting > pg_ctl: could not start server > Examine the log output. > Bail out! system pg_ctl failed > > or this one: > > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fp.rizon.top%3A443%2Fhttps%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0 > > ### Stopping node "subscriber" using mode fast # Running: pg_ctl -D > c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata > -m fast stop waiting for server to shut down....pg_ctl: could not open > PID file > "c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid": > Permission denied Bail out! system pg_ctl failed > > I'd been writing these off as Microsoft randomness and/or antivirus > interference, but it suddenly occurred to me that there might be a > consistent explanation: since commit f13ea95f9, when pg_ctl is waiting > for server start/stop, it is trying to read postmaster.pid > more-or-less concurrently with the postmaster writing to that file. > On Unix that's not much of a problem, but I believe that on Windows > you have to specifically open the file with sharing enabled, or you > get error messages like these. > The postmaster should be enabling sharing, because port.h redirects > open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing > flags. > But it only does that #ifndef FRONTEND. So pg_ctl is just using naked > open(), which could explain these failures. > > If this theory is accurate, it should be pretty easy to replicate the > problem if you modify the postmaster to hold postmaster.pid open > longer when rewriting it, e.g. stick fractional-second sleeps into > CreateLockFile and AddToDataDirLockFile. > > I'm not in a position to investigate this in detail nor test a fix, > but I think somebody should. > > regards, tom lane -- regars, Roman Zharkov
Attachment
pgsql-hackers by date: