Re: [pgsql-sql] Daily digest v1.2492 (19 messages) - Mailing list pgsql-sql
From | Steve Midgley |
---|---|
Subject | Re: [pgsql-sql] Daily digest v1.2492 (19 messages) |
Date | |
Msg-id | [email protected] Whole thread Raw |
List | pgsql-sql |
Hi John, It sounds like a disk-bound operation, so cpu is not maxed out. I'm not clear on all the details of your operation but it sounds like you're using Java to do row-by-row based inserts, selects and updates within a transaction, from a file. This can be a very slow process if you have many rows. The OS stats you describe fits that theory (but not conclusively). If you are using (psuedo-)code such as: Open file { Read line { select from Pg: "select from [other_table] where val = [line[colN]]" exec to Pg: "insert into [table] (col1, col2, ...) values (line[col1], line[col2]..." } } You can radically speed up such a system by using the "copy" (http://www.postgresql.org/docs/8.2/interactive/sql-copy.html) command to load all the data at once from the file into Pg and then do post-processing with Java/SQL to get all the fields looking right. Doing a bulk update with a join across several tables is so much faster than looping through them with a wrapper in Java (or other lang) you won't believe it. I hope this helps and is on-topic for you. Steve At 09:38 AM 4/3/2007, [email protected] wrote: >Date: Tue, 03 Apr 2007 22:16:13 +0800 >From: John Summerfield <[email protected]> >To: [email protected] >Subject: A long-running transaction >Message-ID: <[email protected]> > >I have a Java (java 1.1) program that I wrote some years ago, to read >records from a text file and insert it into a ostgresql database. > >One of the assumptions I made was that one file contained one day's >data, maybe as many as 1500 records, and I coded it to do the whole >lot >as one transaction so either a single file was loaded in its entirity, > >or none of its data was. > >I lost the Java code, but revived the idea and I've collected about >two >years' data using (Linux) shell scripts, and loading the data using >psql. > >Then, I found the Java code on a disused hard disk:-) > >I made the necessary changes for it to build in java 1.5, and used >psql >to extract data from my new database in the correct format for the old > >program. This time, I have a little more data than I ever loaded at >once >before: >summer@Bandicoot:~$ wc -l testdata >6242217 testdata >summer@Bandicoot:~$ \ls -hl testdata >-rw-r--r-- 1 summer summer 285M 2007-03-28 22:32 testdata >summer@Bandicoot:~$ > >Now, it wouldn't surprise me if postgresql used lots of memory - but >how >much could it possibly need? My laptop, where I first tried this, has >1.25 Gbytes, so I could allow it some. > >It wouldn't surprise me a lot if it used lots of memory and caused all > >sorts of problems, but no, it's chugging away, still using no more RAM > >than it could have had on my old Pentium 133 all those years ago. > >In the process of checking it out, I've set it running on a machine >with >a AMD Sempron(tm) 2400+ running Kubuntu 6.10 (kernel is >2.6.17-6-server-xen0) and 512 Mbytes of RAM. > >This is the java program:-) >summer pts/6 :0.0 Thu20 5days 1:07 1:07 >/usr/bin/gij-4.1 -cp /usr/s >It's been running five days so far, and I can see where it's up to by >attaching strace. It's reading 2k of the input file every few seconds. > >Okay, clearly something's wrong, and I don't think it's all my crddu >code. >No probs swapping: >summer@Bandicoot:~$ free > total used free shared buffers > cached >Mem: 460800 456472 4328 0 860 >262164 >-/+ buffers/cache: 193448 267352 >Swap: 1461872 284 1461588 >summer@Bandicoot:~$ > >It is hitting the disk pretty hard now on this machine, but the >laptop's >still going too, and the disk seems to run about half the time, part >of >a second running, part idle (but the intervals are getting shorter). > >It struck me as fairly curious that neither postgresql nor the >application was hogging the CPU. > >Perhaps the laptop is more interesting: look at the size of the buffer >pool: >summer@Echidna:~> free > total used free shared buffers > cached >Mem: 1295528 1268548 26980 0 3976 >392388 >-/+ buffers/cache: 872184 423344 >Swap: 1941496 32656 1908840 >summer@Echidna:~> >Again, no problem with over-use of RAM, and I'm logged on using KDE >too >and that's running fine. > >It's been running a little longer here: >summer pts/23 28Mar07 5days 25:12 25:11 java -cp >/home/summer/Classes/:/usr/share/p > >This is Sun's Java 1.5 on OpenSUSE 10.2. > > >This is what suggested I should write: >summer@Echidna:~> procinfo >Linux 2.6.18.8-0.1-default (geeko@buildhost) (gcc 4.1.2 20061115) #1 >1CPU [Echidna.] > >Memory: Total Used Free Shared Buffers >Mem: 1295528 1271720 23808 0 3716 >Swap: 1941496 32656 1908840 > >Bootup: Tue Mar 27 18:50:19 2007 Load average: 2.21 2.65 2.69 2/243 >19305 > >user : 1d 3:17:04.03 16.0% page in : 131097310 disk 1: >3079516r20087664w >nice : 0:05:39.64 0.1% page out: 197016649 >system: 2d 20:38:37.13 40.1% page act: 87906251 >IOwait: 2d 0:46:37.33 28.5% page dea: 16218135 >hw irq: 0:44:46.71 0.4% page flt: 306255213 >sw irq: 0:50:04.69 0.5% swap in : 4026 >idle : 1d 0:36:29.73 14.4% swap out: 9552 >uptime: 7d 2:59:20.97 context : 702502442 > >irq 0: 153880209 timer irq 7: 0 parport0 >irq 1: 69402 i8042 irq 8: 2 rtc >irq 2: 0 cascade [4] irq 9: 1696942 acpi >irq 3: 4 irq 10: 1 >irq 4: 4 irq 11: 71842329 >ehci_hcd:usb1, >uhci_ >irq 5: 28545863 Intel 82801DB-ICH4 irq 12: 467432 i8042 >irq 6: 1 irq 14: 25021586 ide0 > >summer@Echidna:~> > > >Look at that line beginning "System:" two days 20 hours in the Linux >kernel. It's my guess that the Linux kernel is spending a great deal >of >time manipulating that buffer pool. > >This shows postgresql taking 60% CPU: >summer@Echidna:~> ps xaru >USER PID %CPU %MEM VSZ RSS TTY STAT START TIME >COMMAND >summer 20237 0.2 1.2 263716 15988 pts/23 Rl+ Mar28 25:11 java >-cp /home/summer/Class >postgres 19321 0.1 0.3 19844 3984 ? D 21:50 0:00 >postgres: summer stocksshar >summer 19344 0.0 0.0 2484 852 pts/25 R+ 21:52 0:00 ps >xaru >summer@Echidna:~> ps xaru >USER PID %CPU %MEM VSZ RSS TTY STAT START TIME >COMMAND >postgres 20248 60.0 0.8 20760 11536 ? D Mar28 5167:27 >postgres: summer sharetrad >postgres 19321 0.1 0.3 19844 4008 ? D 21:50 0:00 >postgres: summer stocksshar >summer 19348 0.0 0.0 2480 848 pts/25 R+ 21:52 0:00 ps >xaru >summer@Echidna:~> > >Note that pid 20248 is the one, the other has nothing to do with the >problem. > >This is the SQL I used to create the table: > String createTableStatement = "create table TJS0102_trades (" > + " trade date" > + ", ttime int2" > + ", ASX varchar(7)" > + ", thigh int4" > + ", tlow int4" > + ", tclose int4" > + ", topen int4" > + ", tvolume int4" > + ", tvalue int4" >// + ")" > + ", unique(ASX,trade,ttime)" > + ");"; > > >For each record, I update a non-key field in another table; the source > >data for that other table is less than a megabyte. > > >I have changed the program so as to load each day's data as a single >transaction; it ran on a Pentium IV 2.8 Ghz with HT, 512 Mbytes in >about >three and an half hours. This is more-or-less what I expected. > >If you agree with me that postgresql should do better, now is a good >time to ask for this as a test case. > >Note, you could also get suitable test data from float.com.au - it's >not >where mine came from, but it should be substantially the same.