|
From: Sandeep G. <gup...@gm...> - 2013-11-18 15:19:13
|
Hi Koichi, Thanks for taking a look. I did some further debugging and I won't rule out the possibility of the the OS killing some postgres processes which leads to copy execution reaching in a inconsistent state. Will do some further tests to confirm this as well. -Sandeep On Mon, Nov 18, 2013 at 1:06 AM, Koichi Suzuki <koi...@gm...>wrote: > This happens when the message is not null-terminated properly. Most > case is a bug. In this case, different from vanilla PostgreSQL, copy > data is involved in. Please let me find a time to look into it. > > Regards; > --- > Koichi Suzuki > > > 2013/11/16 Sandeep Gupta <gup...@gm...>: > > Forgot to mention that the error is still the same "invalid string in > > message". > > > > -Sandeep > > > > > > > > On Fri, Nov 15, 2013 at 8:27 PM, Sandeep Gupta <gup...@gm...> > > wrote: > >> > >> Hi Koichi, > >> > >> We tried the copy with the patch but it did not fix the problem for > us. > >> If you need any additional details then let me know. > >> > >> Best, > >> -Sandeep > >> > >> > >> On Fri, Nov 15, 2013 at 6:13 AM, Sandeep Gupta <gup...@gm... > > > >> wrote: > >>> > >>> Hi Koichi, > >>> > >>> Many thanks for sharing the patch. We would test this in our > environment > >>> and would report back on the result. > >>> > >>> Regarding the very high memory consumption, it is not the coordinator > >>> that sees the high memory usage. > >>> The setup is following: We have one coordinator and 4 datanodes. All > the > >>> datanodes run on separate machine > >>> then the coordinator. It is the machine where the datanodes are > running > >>> that have 48GB memory usage. > >>> The other caveat is that that "top" doesn't show that it is the > datanodes > >>> processes using the memory. > >>> > >>> In any case, the last run of copy even with the high usage went through > >>> fine. > >>> > >>> > >>> Best, > >>> Sandeep > >>> > >>> > >>> > >>> On Fri, Nov 15, 2013 at 2:29 AM, Koichi Suzuki <koi...@gm...> > >>> wrote: > >>>> > >>>> Yes, I've already found this kind of failure. When disk write is > >>>> quite slower than read, coordinator overshoots. However, in this > >>>> case, I thought we will see that palloc() overflows. If each piece > >>>> of COPY TO text data is less than 1GB, then each COPY TO succeeds. > >>>> If the write is much slower than read, each coordinator will allocate > >>>> memory for each COPY TO cache going to datanodes, ending up with > >>>> enormous amount of memory consumption and have a good chance to run > >>>> OOM killer. Your memory usage looks enormous. Maybe you have > >>>> many COPY TO instances each of which allocates too much memory. > >>>> > >>>> Attached patch fixes COPY TO. Because we have a chance to have > >>>> similar overshoots in COPY FROM, I'd like to write a patch to fix this > >>>> too before committing. This is for the current master but will work > >>>> with 1.1.x (even with 1.0.x hopefully). > >>>> > >>>> The patch looks to be fine. I asked to test it in such an > >>>> environment and they reported the patch fixed their issue. > >>>> > >>>> I'm not sure but hope this works in your case. May reduce memory > >>>> consumption drastically. Please let me know the result. > >>>> > >>>> I have to allocate some of my resource to fix the same issue in COPY > >>>> FROM. I have another request to fix this so if the patch works for > >>>> you, I may be able to commit only COPY TO fix first. > >>>> > >>>> Regards; > >>>> --- > >>>> Koichi Suzuki > >>>> > >>>> > >>>> 2013/11/15 Sandeep Gupta <gup...@gm...>: > >>>> > Hi Koichi, > >>>> > > >>>> > I am using the copy command over the text file to populate a table. > >>>> > > >>>> > Sorry for the confusion. I meant 628 -- 638 lines in the pqformat.c. > >>>> > Let me explain this a bit more. I get error in some of the datanode > >>>> > logfiles > >>>> > as "invalid string in message". The message is mentioned only in > >>>> > pqformat.c > >>>> > at line 628. Although I am not able to trace back as to where error > >>>> > may > >>>> > have first ordered in the datanode that triggered this error. > >>>> > > >>>> > In order to debug this I started printing out the rows the > coordinator > >>>> > is > >>>> > sending to the datanode (in function DataNodeCopyIn). For some > reason > >>>> > the > >>>> > error invalid string in messge error has not happened so far. It > >>>> > seems > >>>> > there is some issue with the flow control of message between > >>>> > coordinator and > >>>> > datanode. > >>>> > Hope this helps in narrowing down what is causing the problem in the > >>>> > first > >>>> > place. > >>>> > > >>>> > > >>>> > However, the new problem I see is that the memory usage goes very > >>>> > high. Of > >>>> > the 48 GB almost all of the memory is getting used. Is this normal? > >>>> > If not can you give suggestion as to how to keep the memory > >>>> > consumption > >>>> > low. > >>>> > > >>>> > -Sandeep > >>>> > > >>>> > > >>>> > > >>>> > On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki < > koi...@gm...> > >>>> > wrote: > >>>> >> > >>>> >> Could you let me know what "line 682 -- 637" means? > >>>> >> > >>>> >> Because any communication between coordinator and datanode is done > >>>> >> using libpq, I don't think XC populates information using raw > >>>> >> stringinfo. In fact, from datanodes, coordinator is just a client > >>>> >> except for supplying XID and snapshot. However, COPY TO uses > >>>> >> send()/receive(), not through libpq, to send data to datanode. In > >>>> >> this case, if incoming data file contains NULL in the data, such > data > >>>> >> may be sent via send() to datanodes. COPY TO assumes incoming > data > >>>> >> in a text format. Do you think you have such situations? > >>>> >> > >>>> >> Analysis may take a bit but let me try. > >>>> >> > >>>> >> Regards; > >>>> >> --- > >>>> >> Koichi Suzuki > >>>> >> > >>>> >> > >>>> >> 2013/11/15 Sandeep Gupta <gup...@gm...>: > >>>> >> > Hi Koichi, Masataka, > >>>> >> > > >>>> >> > Thanks for taking a look. I double checked both memory usage and > >>>> >> > the > >>>> >> > dmesg > >>>> >> > log. The kill is not happening due to the out of memory. The > kernel > >>>> >> > doesn't > >>>> >> > show any processes being killed. > >>>> >> > > >>>> >> > My current suspicion is that something amiss is happening over > the > >>>> >> > network. > >>>> >> > This happens consistently when performing copy over large > datasets. > >>>> >> > Few > >>>> >> > of > >>>> >> > datanodes end up with "invalid string in message". > >>>> >> > Going over the code in pqformat.c this happens because the > >>>> >> > invariant > >>>> >> > "StringInfo is guaranteed to have a trailing null byte" is not > >>>> >> > valid > >>>> >> > (line > >>>> >> > 682 -- 637). I am not sure why this happens. > >>>> >> > > >>>> >> > > >>>> >> > After this the datanodes do something illegal (probably buffer > >>>> >> > overflow > >>>> >> > on > >>>> >> > the port..just a guess) that the OS decides to issue a sigkill. > I > >>>> >> > am > >>>> >> > also > >>>> >> > not sure why the OS is not logging this killed operation. > >>>> >> > > >>>> >> > > >>>> >> > Please let me know of suggestions you may have regarding the > >>>> >> > "invalid > >>>> >> > string in message". In particular, where in the code is > StringInfo > >>>> >> > msg > >>>> >> > is > >>>> >> > getting populated i.e. which code reads the message over the > >>>> >> > network and > >>>> >> > keeps it ready for the copy operation on the datanode. Any other > >>>> >> > suggestion > >>>> >> > would also be helpful to trace this bug. > >>>> >> > > >>>> >> > Best, > >>>> >> > Sandeep > >>>> >> > > >>>> >> > > >>>> >> > > >>>> >> > > >>>> >> > > >>>> >> > > >>>> >> > On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito < > pg...@gm...> > >>>> >> > wrote: > >>>> >> >> > >>>> >> >> Hi, > >>>> >> >> > >>>> >> >> I think I may be onto something. > >>>> >> >> > >>>> >> >> When the memory is exhausted, the Linux kernel kills process > >>>> >> >> randomly > >>>> >> >> by SIGKILL. > >>>> >> >> The mechanism is called OOM(Out Of Memory) killer. > >>>> >> >> OOM killer logs its activity to kernel mesage buffer. > >>>> >> >> > >>>> >> >> Could you check display message (dmesg) and memory resource? > >>>> >> >> > >>>> >> >> Regards. > >>>> >> >> > >>>> >> >> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta > >>>> >> >> <gup...@gm...> > >>>> >> >> wrote: > >>>> >> >> > Hi Koichi, > >>>> >> >> > > >>>> >> >> > It is a bit of mystery because it does not happen > >>>> >> >> > consistently. > >>>> >> >> > Thanks > >>>> >> >> > for the clarification though. It is indeed helpful. > >>>> >> >> > > >>>> >> >> > -Sandeep > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 > >>>> >> >> > <ko...@in...> > >>>> >> >> > wrote: > >>>> >> >> >> > >>>> >> >> >> Someone is sending SIGKILL to coordinator/datanode backend > >>>> >> >> >> processes. > >>>> >> >> >> Although XC (originally PG) code has a handler for SIGKILL, I > >>>> >> >> >> didn’t > >>>> >> >> >> found > >>>> >> >> >> any code in XC sending SIGKILL to other processes. I’m > afraid > >>>> >> >> >> there > >>>> >> >> >> could > >>>> >> >> >> be another process sending SIGKILL to them. > >>>> >> >> >> > >>>> >> >> >> Best Regards; > >>>> >> >> >> --- > >>>> >> >> >> Koichi Suzuki > >>>> >> >> >> > >>>> >> >> >> 2013/11/09 2:54、Sandeep Gupta <gup...@gm...> > のメール: > >>>> >> >> >> > >>>> >> >> >> > I am running single instance of postgres-xc with several > >>>> >> >> >> > datanode > >>>> >> >> >> > Each data node runs on its own machine. > >>>> >> >> >> > After instantiating the cluster the database sits idle. I > do > >>>> >> >> >> > not > >>>> >> >> >> > perform > >>>> >> >> >> > any create table or insert operations. However, after some > >>>> >> >> >> > time > >>>> >> >> >> > all > >>>> >> >> >> > the > >>>> >> >> >> > datanode automatically shuts down (log messages attached). > >>>> >> >> >> > > >>>> >> >> >> > Any pointers as to why this maybe happening would be very > >>>> >> >> >> > useful. > >>>> >> >> >> > > >>>> >> >> >> > -Sandeep > >>>> >> >> >> > > >>>> >> >> >> > The log files across all datanode contain various error > >>>> >> >> >> > messages. > >>>> >> >> >> > These > >>>> >> >> >> > include: > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > LOG: statistics collector process (PID 23073) was > terminated > >>>> >> >> >> > by > >>>> >> >> >> > signal > >>>> >> >> >> > 9: Killed > >>>> >> >> >> > LOG: WAL writer process (PID 23071) was terminated by > signal > >>>> >> >> >> > 9: > >>>> >> >> >> > Killed > >>>> >> >> >> > LOG: terminating any other active server processes > >>>> >> >> >> > WARNING: terminating connection because of crash of > another > >>>> >> >> >> > server > >>>> >> >> >> > process > >>>> >> >> >> > DETAIL: The postmaster has commanded this server process > to > >>>> >> >> >> > roll > >>>> >> >> >> > back > >>>> >> >> >> > the current transaction and exit, because another server > >>>> >> >> >> > process > >>>> >> >> >> > exited > >>>> >> >> >> > abnormally and possibly corrupted shared memory. > >>>> >> >> >> > HINT: In a moment you should be able to reconnect to the > >>>> >> >> >> > database > >>>> >> >> >> > and > >>>> >> >> >> > repeat your command. > >>>> >> >> >> > LOG: all server processes terminated; reinitializing > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > LOG: checkpointer process (PID 21406) was terminated by > >>>> >> >> >> > signal 9: > >>>> >> >> >> > Killed > >>>> >> >> >> > LOG: terminating any other active server processes > >>>> >> >> >> > WARNING: terminating connection because of crash of > another > >>>> >> >> >> > server > >>>> >> >> >> > process > >>>> >> >> >> > DETAIL: The postmaster has commanded this server process > to > >>>> >> >> >> > roll > >>>> >> >> >> > back > >>>> >> >> >> > the current transaction and exit, because another server > >>>> >> >> >> > process > >>>> >> >> >> > exited > >>>> >> >> >> > abnormally and possibly corrupted shared memory. > >>>> >> >> >> > LOG: statistics collector process (PID 28881) was > terminated > >>>> >> >> >> > by > >>>> >> >> >> > signal > >>>> >> >> >> > 9: Killed > >>>> >> >> >> > LOG: autovacuum launcher process (PID 28880) was > terminated > >>>> >> >> >> > by > >>>> >> >> >> > signal > >>>> >> >> >> > 9: Killed > >>>> >> >> >> > LOG: terminating any other active server processes > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > ------------------------------------------------------------------------------ > >>>> >> >> >> > November Webinars for C, C++, Fortran Developers > >>>> >> >> >> > Accelerate application performance with scalable > programming > >>>> >> >> >> > models. > >>>> >> >> >> > Explore > >>>> >> >> >> > techniques for threading, error checking, porting, and > >>>> >> >> >> > tuning. Get > >>>> >> >> >> > the > >>>> >> >> >> > most > >>>> >> >> >> > from the latest Intel processors and coprocessors. See > >>>> >> >> >> > abstracts > >>>> >> >> >> > and > >>>> >> >> >> > register > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk_______________________________________________ > >>>> >> >> >> > Postgres-xc-general mailing list > >>>> >> >> >> > Pos...@li... > >>>> >> >> >> > > >>>> >> >> >> > > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general > >>>> >> >> >> > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > > ------------------------------------------------------------------------------ > >>>> >> >> > November Webinars for C, C++, Fortran Developers > >>>> >> >> > Accelerate application performance with scalable programming > >>>> >> >> > models. > >>>> >> >> > Explore > >>>> >> >> > techniques for threading, error checking, porting, and tuning. > >>>> >> >> > Get > >>>> >> >> > the > >>>> >> >> > most > >>>> >> >> > from the latest Intel processors and coprocessors. See > abstracts > >>>> >> >> > and > >>>> >> >> > register > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > > >>>> >> >> > > http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk > >>>> >> >> > _______________________________________________ > >>>> >> >> > Postgres-xc-general mailing list > >>>> >> >> > Pos...@li... > >>>> >> >> > > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general > >>>> >> >> > > >>>> >> > > >>>> >> > > >>>> >> > > >>>> >> > > >>>> >> > > >>>> >> > > ------------------------------------------------------------------------------ > >>>> >> > DreamFactory - Open Source REST & JSON Services for HTML5 & > Native > >>>> >> > Apps > >>>> >> > OAuth, Users, Roles, SQL, NoSQL, BLOB Storage and External API > >>>> >> > Access > >>>> >> > Free app hosting. Or install the open source package on any LAMP > >>>> >> > server. > >>>> >> > Sign up and see examples for AngularJS, jQuery, Sencha Touch and > >>>> >> > Native! > >>>> >> > > >>>> >> > > >>>> >> > > http://pubads.g.doubleclick.net/gampad/clk?id=63469471&iu=/4140/ostg.clktrk > >>>> >> > _______________________________________________ > >>>> >> > Postgres-xc-general mailing list > >>>> >> > Pos...@li... > >>>> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general > >>>> >> > > >>>> > > >>>> > > >>> > >>> > >> > > > |