From: Koichi S. <koi...@gm...> - 2013-12-09 05:45:27
|
So nice to find the cause finally. Hope your work goes smoothly then. Thank you ; --- Koichi Suzuki 2013/12/6 Sandeep Gupta <gup...@gm...>: > Hi Koichi, > > We finally fixed the mystery of the killing postgres processes during copy. > It was nothing to do with postgres-xc. Turns out that some misconfiguration > of our compute system that resulted in killing the process. Sorry for the > trouble. We have executed several runs with the copy command and they are > working fine so far. > > -Sandeep > > > > On Mon, Nov 18, 2013 at 10:17 PM, Koichi Suzuki <koi...@gm...> > wrote: >> >> Thanks. I'll be looking forward to your test result. I talked with >> Masataka and he was interested that no relevant messages were found in >> dmsg. >> >> Regards; >> --- >> Koichi Suzuki >> >> >> 2013/11/19 Sandeep Gupta <gup...@gm...>: >> > Hi Koichi, >> > >> > Thanks for taking a look. I did some further debugging and I won't rule >> > out the possibility of the the OS killing some postgres processes which >> > leads to copy execution reaching in a inconsistent state. >> > Will do some further tests to confirm this as well. >> > >> > -Sandeep >> > >> > >> > >> > >> > On Mon, Nov 18, 2013 at 1:06 AM, Koichi Suzuki <koi...@gm...> >> > wrote: >> >> >> >> This happens when the message is not null-terminated properly. Most >> >> case is a bug. In this case, different from vanilla PostgreSQL, copy >> >> data is involved in. Please let me find a time to look into it. >> >> >> >> Regards; >> >> --- >> >> Koichi Suzuki >> >> >> >> >> >> 2013/11/16 Sandeep Gupta <gup...@gm...>: >> >> > Forgot to mention that the error is still the same "invalid string >> >> > in >> >> > message". >> >> > >> >> > -Sandeep >> >> > >> >> > >> >> > >> >> > On Fri, Nov 15, 2013 at 8:27 PM, Sandeep Gupta >> >> > <gup...@gm...> >> >> > wrote: >> >> >> >> >> >> Hi Koichi, >> >> >> >> >> >> We tried the copy with the patch but it did not fix the problem >> >> >> for >> >> >> us. >> >> >> If you need any additional details then let me know. >> >> >> >> >> >> Best, >> >> >> -Sandeep >> >> >> >> >> >> >> >> >> On Fri, Nov 15, 2013 at 6:13 AM, Sandeep Gupta >> >> >> <gup...@gm...> >> >> >> wrote: >> >> >>> >> >> >>> Hi Koichi, >> >> >>> >> >> >>> Many thanks for sharing the patch. We would test this in our >> >> >>> environment >> >> >>> and would report back on the result. >> >> >>> >> >> >>> Regarding the very high memory consumption, it is not the >> >> >>> coordinator >> >> >>> that sees the high memory usage. >> >> >>> The setup is following: We have one coordinator and 4 datanodes. >> >> >>> All >> >> >>> the >> >> >>> datanodes run on separate machine >> >> >>> then the coordinator. It is the machine where the datanodes are >> >> >>> running >> >> >>> that have 48GB memory usage. >> >> >>> The other caveat is that that "top" doesn't show that it is the >> >> >>> datanodes >> >> >>> processes using the memory. >> >> >>> >> >> >>> In any case, the last run of copy even with the high usage went >> >> >>> through >> >> >>> fine. >> >> >>> >> >> >>> >> >> >>> Best, >> >> >>> Sandeep >> >> >>> >> >> >>> >> >> >>> >> >> >>> On Fri, Nov 15, 2013 at 2:29 AM, Koichi Suzuki >> >> >>> <koi...@gm...> >> >> >>> wrote: >> >> >>>> >> >> >>>> Yes, I've already found this kind of failure. When disk write is >> >> >>>> quite slower than read, coordinator overshoots. However, in this >> >> >>>> case, I thought we will see that palloc() overflows. If each >> >> >>>> piece >> >> >>>> of COPY TO text data is less than 1GB, then each COPY TO succeeds. >> >> >>>> If the write is much slower than read, each coordinator will >> >> >>>> allocate >> >> >>>> memory for each COPY TO cache going to datanodes, ending up with >> >> >>>> enormous amount of memory consumption and have a good chance to >> >> >>>> run >> >> >>>> OOM killer. Your memory usage looks enormous. Maybe you have >> >> >>>> many COPY TO instances each of which allocates too much memory. >> >> >>>> >> >> >>>> Attached patch fixes COPY TO. Because we have a chance to have >> >> >>>> similar overshoots in COPY FROM, I'd like to write a patch to fix >> >> >>>> this >> >> >>>> too before committing. This is for the current master but will >> >> >>>> work >> >> >>>> with 1.1.x (even with 1.0.x hopefully). >> >> >>>> >> >> >>>> The patch looks to be fine. I asked to test it in such an >> >> >>>> environment and they reported the patch fixed their issue. >> >> >>>> >> >> >>>> I'm not sure but hope this works in your case. May reduce memory >> >> >>>> consumption drastically. Please let me know the result. >> >> >>>> >> >> >>>> I have to allocate some of my resource to fix the same issue in >> >> >>>> COPY >> >> >>>> FROM. I have another request to fix this so if the patch works >> >> >>>> for >> >> >>>> you, I may be able to commit only COPY TO fix first. >> >> >>>> >> >> >>>> Regards; >> >> >>>> --- >> >> >>>> Koichi Suzuki >> >> >>>> >> >> >>>> >> >> >>>> 2013/11/15 Sandeep Gupta <gup...@gm...>: >> >> >>>> > Hi Koichi, >> >> >>>> > >> >> >>>> > I am using the copy command over the text file to populate a >> >> >>>> > table. >> >> >>>> > >> >> >>>> > Sorry for the confusion. I meant 628 -- 638 lines in the >> >> >>>> > pqformat.c. >> >> >>>> > Let me explain this a bit more. I get error in some of the >> >> >>>> > datanode >> >> >>>> > logfiles >> >> >>>> > as "invalid string in message". The message is mentioned only in >> >> >>>> > pqformat.c >> >> >>>> > at line 628. Although I am not able to trace back as to where >> >> >>>> > error >> >> >>>> > may >> >> >>>> > have first ordered in the datanode that triggered this error. >> >> >>>> > >> >> >>>> > In order to debug this I started printing out the rows the >> >> >>>> > coordinator >> >> >>>> > is >> >> >>>> > sending to the datanode (in function DataNodeCopyIn). For some >> >> >>>> > reason >> >> >>>> > the >> >> >>>> > error invalid string in messge error has not happened so far. >> >> >>>> > It >> >> >>>> > seems >> >> >>>> > there is some issue with the flow control of message between >> >> >>>> > coordinator and >> >> >>>> > datanode. >> >> >>>> > Hope this helps in narrowing down what is causing the problem in >> >> >>>> > the >> >> >>>> > first >> >> >>>> > place. >> >> >>>> > >> >> >>>> > >> >> >>>> > However, the new problem I see is that the memory usage goes >> >> >>>> > very >> >> >>>> > high. Of >> >> >>>> > the 48 GB almost all of the memory is getting used. Is this >> >> >>>> > normal? >> >> >>>> > If not can you give suggestion as to how to keep the memory >> >> >>>> > consumption >> >> >>>> > low. >> >> >>>> > >> >> >>>> > -Sandeep >> >> >>>> > >> >> >>>> > >> >> >>>> > >> >> >>>> > On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki >> >> >>>> > <koi...@gm...> >> >> >>>> > wrote: >> >> >>>> >> >> >> >>>> >> Could you let me know what "line 682 -- 637" means? >> >> >>>> >> >> >> >>>> >> Because any communication between coordinator and datanode is >> >> >>>> >> done >> >> >>>> >> using libpq, I don't think XC populates information using raw >> >> >>>> >> stringinfo. In fact, from datanodes, coordinator is just a >> >> >>>> >> client >> >> >>>> >> except for supplying XID and snapshot. However, COPY TO uses >> >> >>>> >> send()/receive(), not through libpq, to send data to datanode. >> >> >>>> >> In >> >> >>>> >> this case, if incoming data file contains NULL in the data, >> >> >>>> >> such >> >> >>>> >> data >> >> >>>> >> may be sent via send() to datanodes. COPY TO assumes incoming >> >> >>>> >> data >> >> >>>> >> in a text format. Do you think you have such situations? >> >> >>>> >> >> >> >>>> >> Analysis may take a bit but let me try. >> >> >>>> >> >> >> >>>> >> Regards; >> >> >>>> >> --- >> >> >>>> >> Koichi Suzuki >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> 2013/11/15 Sandeep Gupta <gup...@gm...>: >> >> >>>> >> > Hi Koichi, Masataka, >> >> >>>> >> > >> >> >>>> >> > Thanks for taking a look. I double checked both memory usage >> >> >>>> >> > and >> >> >>>> >> > the >> >> >>>> >> > dmesg >> >> >>>> >> > log. The kill is not happening due to the out of memory. The >> >> >>>> >> > kernel >> >> >>>> >> > doesn't >> >> >>>> >> > show any processes being killed. >> >> >>>> >> > >> >> >>>> >> > My current suspicion is that something amiss is happening >> >> >>>> >> > over >> >> >>>> >> > the >> >> >>>> >> > network. >> >> >>>> >> > This happens consistently when performing copy over large >> >> >>>> >> > datasets. >> >> >>>> >> > Few >> >> >>>> >> > of >> >> >>>> >> > datanodes end up with "invalid string in message". >> >> >>>> >> > Going over the code in pqformat.c this happens because the >> >> >>>> >> > invariant >> >> >>>> >> > "StringInfo is guaranteed to have a trailing null byte" is >> >> >>>> >> > not >> >> >>>> >> > valid >> >> >>>> >> > (line >> >> >>>> >> > 682 -- 637). I am not sure why this happens. >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > After this the datanodes do something illegal (probably >> >> >>>> >> > buffer >> >> >>>> >> > overflow >> >> >>>> >> > on >> >> >>>> >> > the port..just a guess) that the OS decides to issue a >> >> >>>> >> > sigkill. >> >> >>>> >> > I >> >> >>>> >> > am >> >> >>>> >> > also >> >> >>>> >> > not sure why the OS is not logging this killed operation. >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > Please let me know of suggestions you may have regarding the >> >> >>>> >> > "invalid >> >> >>>> >> > string in message". In particular, where in the code is >> >> >>>> >> > StringInfo >> >> >>>> >> > msg >> >> >>>> >> > is >> >> >>>> >> > getting populated i.e. which code reads the message over the >> >> >>>> >> > network and >> >> >>>> >> > keeps it ready for the copy operation on the datanode. Any >> >> >>>> >> > other >> >> >>>> >> > suggestion >> >> >>>> >> > would also be helpful to trace this bug. >> >> >>>> >> > >> >> >>>> >> > Best, >> >> >>>> >> > Sandeep >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito >> >> >>>> >> > <pg...@gm...> >> >> >>>> >> > wrote: >> >> >>>> >> >> >> >> >>>> >> >> Hi, >> >> >>>> >> >> >> >> >>>> >> >> I think I may be onto something. >> >> >>>> >> >> >> >> >>>> >> >> When the memory is exhausted, the Linux kernel kills process >> >> >>>> >> >> randomly >> >> >>>> >> >> by SIGKILL. >> >> >>>> >> >> The mechanism is called OOM(Out Of Memory) killer. >> >> >>>> >> >> OOM killer logs its activity to kernel mesage buffer. >> >> >>>> >> >> >> >> >>>> >> >> Could you check display message (dmesg) and memory resource? >> >> >>>> >> >> >> >> >>>> >> >> Regards. >> >> >>>> >> >> >> >> >>>> >> >> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta >> >> >>>> >> >> <gup...@gm...> >> >> >>>> >> >> wrote: >> >> >>>> >> >> > Hi Koichi, >> >> >>>> >> >> > >> >> >>>> >> >> > It is a bit of mystery because it does not happen >> >> >>>> >> >> > consistently. >> >> >>>> >> >> > Thanks >> >> >>>> >> >> > for the clarification though. It is indeed helpful. >> >> >>>> >> >> > >> >> >>>> >> >> > -Sandeep >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 >> >> >>>> >> >> > <ko...@in...> >> >> >>>> >> >> > wrote: >> >> >>>> >> >> >> >> >> >>>> >> >> >> Someone is sending SIGKILL to coordinator/datanode >> >> >>>> >> >> >> backend >> >> >>>> >> >> >> processes. >> >> >>>> >> >> >> Although XC (originally PG) code has a handler for >> >> >>>> >> >> >> SIGKILL, >> >> >>>> >> >> >> I >> >> >>>> >> >> >> didn’t >> >> >>>> >> >> >> found >> >> >>>> >> >> >> any code in XC sending SIGKILL to other processes. I’m >> >> >>>> >> >> >> afraid >> >> >>>> >> >> >> there >> >> >>>> >> >> >> could >> >> >>>> >> >> >> be another process sending SIGKILL to them. >> >> >>>> >> >> >> >> >> >>>> >> >> >> Best Regards; >> >> >>>> >> >> >> --- >> >> >>>> >> >> >> Koichi Suzuki >> >> >>>> >> >> >> >> >> >>>> >> >> >> 2013/11/09 2:54、Sandeep Gupta <gup...@gm...> >> >> >>>> >> >> >> のメール: >> >> >>>> >> >> >> >> >> >>>> >> >> >> > I am running single instance of postgres-xc with >> >> >>>> >> >> >> > several >> >> >>>> >> >> >> > datanode >> >> >>>> >> >> >> > Each data node runs on its own machine. >> >> >>>> >> >> >> > After instantiating the cluster the database sits idle. >> >> >>>> >> >> >> > I >> >> >>>> >> >> >> > do >> >> >>>> >> >> >> > not >> >> >>>> >> >> >> > perform >> >> >>>> >> >> >> > any create table or insert operations. However, after >> >> >>>> >> >> >> > some >> >> >>>> >> >> >> > time >> >> >>>> >> >> >> > all >> >> >>>> >> >> >> > the >> >> >>>> >> >> >> > datanode automatically shuts down (log messages >> >> >>>> >> >> >> > attached). >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > Any pointers as to why this maybe happening would be >> >> >>>> >> >> >> > very >> >> >>>> >> >> >> > useful. >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > -Sandeep >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > The log files across all datanode contain various error >> >> >>>> >> >> >> > messages. >> >> >>>> >> >> >> > These >> >> >>>> >> >> >> > include: >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > LOG: statistics collector process (PID 23073) was >> >> >>>> >> >> >> > terminated >> >> >>>> >> >> >> > by >> >> >>>> >> >> >> > signal >> >> >>>> >> >> >> > 9: Killed >> >> >>>> >> >> >> > LOG: WAL writer process (PID 23071) was terminated by >> >> >>>> >> >> >> > signal >> >> >>>> >> >> >> > 9: >> >> >>>> >> >> >> > Killed >> >> >>>> >> >> >> > LOG: terminating any other active server processes >> >> >>>> >> >> >> > WARNING: terminating connection because of crash of >> >> >>>> >> >> >> > another >> >> >>>> >> >> >> > server >> >> >>>> >> >> >> > process >> >> >>>> >> >> >> > DETAIL: The postmaster has commanded this server >> >> >>>> >> >> >> > process >> >> >>>> >> >> >> > to >> >> >>>> >> >> >> > roll >> >> >>>> >> >> >> > back >> >> >>>> >> >> >> > the current transaction and exit, because another >> >> >>>> >> >> >> > server >> >> >>>> >> >> >> > process >> >> >>>> >> >> >> > exited >> >> >>>> >> >> >> > abnormally and possibly corrupted shared memory. >> >> >>>> >> >> >> > HINT: In a moment you should be able to reconnect to >> >> >>>> >> >> >> > the >> >> >>>> >> >> >> > database >> >> >>>> >> >> >> > and >> >> >>>> >> >> >> > repeat your command. >> >> >>>> >> >> >> > LOG: all server processes terminated; reinitializing >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > LOG: checkpointer process (PID 21406) was terminated >> >> >>>> >> >> >> > by >> >> >>>> >> >> >> > signal 9: >> >> >>>> >> >> >> > Killed >> >> >>>> >> >> >> > LOG: terminating any other active server processes >> >> >>>> >> >> >> > WARNING: terminating connection because of crash of >> >> >>>> >> >> >> > another >> >> >>>> >> >> >> > server >> >> >>>> >> >> >> > process >> >> >>>> >> >> >> > DETAIL: The postmaster has commanded this server >> >> >>>> >> >> >> > process >> >> >>>> >> >> >> > to >> >> >>>> >> >> >> > roll >> >> >>>> >> >> >> > back >> >> >>>> >> >> >> > the current transaction and exit, because another >> >> >>>> >> >> >> > server >> >> >>>> >> >> >> > process >> >> >>>> >> >> >> > exited >> >> >>>> >> >> >> > abnormally and possibly corrupted shared memory. >> >> >>>> >> >> >> > LOG: statistics collector process (PID 28881) was >> >> >>>> >> >> >> > terminated >> >> >>>> >> >> >> > by >> >> >>>> >> >> >> > signal >> >> >>>> >> >> >> > 9: Killed >> >> >>>> >> >> >> > LOG: autovacuum launcher process (PID 28880) was >> >> >>>> >> >> >> > terminated >> >> >>>> >> >> >> > by >> >> >>>> >> >> >> > signal >> >> >>>> >> >> >> > 9: Killed >> >> >>>> >> >> >> > LOG: terminating any other active server processes >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > ------------------------------------------------------------------------------ >> >> >>>> >> >> >> > November Webinars for C, C++, Fortran Developers >> >> >>>> >> >> >> > Accelerate application performance with scalable >> >> >>>> >> >> >> > programming >> >> >>>> >> >> >> > models. >> >> >>>> >> >> >> > Explore >> >> >>>> >> >> >> > techniques for threading, error checking, porting, and >> >> >>>> >> >> >> > tuning. Get >> >> >>>> >> >> >> > the >> >> >>>> >> >> >> > most >> >> >>>> >> >> >> > from the latest Intel processors and coprocessors. See >> >> >>>> >> >> >> > abstracts >> >> >>>> >> >> >> > and >> >> >>>> >> >> >> > register >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk_______________________________________________ >> >> >>>> >> >> >> > Postgres-xc-general mailing list >> >> >>>> >> >> >> > Pos...@li... >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > >> >> >>>> >> >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >> >> >>>> >> >> >> >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > ------------------------------------------------------------------------------ >> >> >>>> >> >> > November Webinars for C, C++, Fortran Developers >> >> >>>> >> >> > Accelerate application performance with scalable >> >> >>>> >> >> > programming >> >> >>>> >> >> > models. >> >> >>>> >> >> > Explore >> >> >>>> >> >> > techniques for threading, error checking, porting, and >> >> >>>> >> >> > tuning. >> >> >>>> >> >> > Get >> >> >>>> >> >> > the >> >> >>>> >> >> > most >> >> >>>> >> >> > from the latest Intel processors and coprocessors. See >> >> >>>> >> >> > abstracts >> >> >>>> >> >> > and >> >> >>>> >> >> > register >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk >> >> >>>> >> >> > _______________________________________________ >> >> >>>> >> >> > Postgres-xc-general mailing list >> >> >>>> >> >> > Pos...@li... >> >> >>>> >> >> > >> >> >>>> >> >> > >> >> >>>> >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >> >> >>>> >> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > ------------------------------------------------------------------------------ >> >> >>>> >> > DreamFactory - Open Source REST & JSON Services for HTML5 & >> >> >>>> >> > Native >> >> >>>> >> > Apps >> >> >>>> >> > OAuth, Users, Roles, SQL, NoSQL, BLOB Storage and External >> >> >>>> >> > API >> >> >>>> >> > Access >> >> >>>> >> > Free app hosting. Or install the open source package on any >> >> >>>> >> > LAMP >> >> >>>> >> > server. >> >> >>>> >> > Sign up and see examples for AngularJS, jQuery, Sencha Touch >> >> >>>> >> > and >> >> >>>> >> > Native! >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > http://pubads.g.doubleclick.net/gampad/clk?id=63469471&iu=/4140/ostg.clktrk >> >> >>>> >> > _______________________________________________ >> >> >>>> >> > Postgres-xc-general mailing list >> >> >>>> >> > Pos...@li... >> >> >>>> >> > >> >> >>>> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >> >> >>>> >> > >> >> >>>> > >> >> >>>> > >> >> >>> >> >> >>> >> >> >> >> >> > >> > >> > > > |