|
From: Koichi S. <koi...@gm...> - 2013-11-18 09:06:57
|
This happens when the message is not null-terminated properly. Most case is a bug. In this case, different from vanilla PostgreSQL, copy data is involved in. Please let me find a time to look into it. Regards; --- Koichi Suzuki 2013/11/16 Sandeep Gupta <gup...@gm...>: > Forgot to mention that the error is still the same "invalid string in > message". > > -Sandeep > > > > On Fri, Nov 15, 2013 at 8:27 PM, Sandeep Gupta <gup...@gm...> > wrote: >> >> Hi Koichi, >> >> We tried the copy with the patch but it did not fix the problem for us. >> If you need any additional details then let me know. >> >> Best, >> -Sandeep >> >> >> On Fri, Nov 15, 2013 at 6:13 AM, Sandeep Gupta <gup...@gm...> >> wrote: >>> >>> Hi Koichi, >>> >>> Many thanks for sharing the patch. We would test this in our environment >>> and would report back on the result. >>> >>> Regarding the very high memory consumption, it is not the coordinator >>> that sees the high memory usage. >>> The setup is following: We have one coordinator and 4 datanodes. All the >>> datanodes run on separate machine >>> then the coordinator. It is the machine where the datanodes are running >>> that have 48GB memory usage. >>> The other caveat is that that "top" doesn't show that it is the datanodes >>> processes using the memory. >>> >>> In any case, the last run of copy even with the high usage went through >>> fine. >>> >>> >>> Best, >>> Sandeep >>> >>> >>> >>> On Fri, Nov 15, 2013 at 2:29 AM, Koichi Suzuki <koi...@gm...> >>> wrote: >>>> >>>> Yes, I've already found this kind of failure. When disk write is >>>> quite slower than read, coordinator overshoots. However, in this >>>> case, I thought we will see that palloc() overflows. If each piece >>>> of COPY TO text data is less than 1GB, then each COPY TO succeeds. >>>> If the write is much slower than read, each coordinator will allocate >>>> memory for each COPY TO cache going to datanodes, ending up with >>>> enormous amount of memory consumption and have a good chance to run >>>> OOM killer. Your memory usage looks enormous. Maybe you have >>>> many COPY TO instances each of which allocates too much memory. >>>> >>>> Attached patch fixes COPY TO. Because we have a chance to have >>>> similar overshoots in COPY FROM, I'd like to write a patch to fix this >>>> too before committing. This is for the current master but will work >>>> with 1.1.x (even with 1.0.x hopefully). >>>> >>>> The patch looks to be fine. I asked to test it in such an >>>> environment and they reported the patch fixed their issue. >>>> >>>> I'm not sure but hope this works in your case. May reduce memory >>>> consumption drastically. Please let me know the result. >>>> >>>> I have to allocate some of my resource to fix the same issue in COPY >>>> FROM. I have another request to fix this so if the patch works for >>>> you, I may be able to commit only COPY TO fix first. >>>> >>>> Regards; >>>> --- >>>> Koichi Suzuki >>>> >>>> >>>> 2013/11/15 Sandeep Gupta <gup...@gm...>: >>>> > Hi Koichi, >>>> > >>>> > I am using the copy command over the text file to populate a table. >>>> > >>>> > Sorry for the confusion. I meant 628 -- 638 lines in the pqformat.c. >>>> > Let me explain this a bit more. I get error in some of the datanode >>>> > logfiles >>>> > as "invalid string in message". The message is mentioned only in >>>> > pqformat.c >>>> > at line 628. Although I am not able to trace back as to where error >>>> > may >>>> > have first ordered in the datanode that triggered this error. >>>> > >>>> > In order to debug this I started printing out the rows the coordinator >>>> > is >>>> > sending to the datanode (in function DataNodeCopyIn). For some reason >>>> > the >>>> > error invalid string in messge error has not happened so far. It >>>> > seems >>>> > there is some issue with the flow control of message between >>>> > coordinator and >>>> > datanode. >>>> > Hope this helps in narrowing down what is causing the problem in the >>>> > first >>>> > place. >>>> > >>>> > >>>> > However, the new problem I see is that the memory usage goes very >>>> > high. Of >>>> > the 48 GB almost all of the memory is getting used. Is this normal? >>>> > If not can you give suggestion as to how to keep the memory >>>> > consumption >>>> > low. >>>> > >>>> > -Sandeep >>>> > >>>> > >>>> > >>>> > On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki <koi...@gm...> >>>> > wrote: >>>> >> >>>> >> Could you let me know what "line 682 -- 637" means? >>>> >> >>>> >> Because any communication between coordinator and datanode is done >>>> >> using libpq, I don't think XC populates information using raw >>>> >> stringinfo. In fact, from datanodes, coordinator is just a client >>>> >> except for supplying XID and snapshot. However, COPY TO uses >>>> >> send()/receive(), not through libpq, to send data to datanode. In >>>> >> this case, if incoming data file contains NULL in the data, such data >>>> >> may be sent via send() to datanodes. COPY TO assumes incoming data >>>> >> in a text format. Do you think you have such situations? >>>> >> >>>> >> Analysis may take a bit but let me try. >>>> >> >>>> >> Regards; >>>> >> --- >>>> >> Koichi Suzuki >>>> >> >>>> >> >>>> >> 2013/11/15 Sandeep Gupta <gup...@gm...>: >>>> >> > Hi Koichi, Masataka, >>>> >> > >>>> >> > Thanks for taking a look. I double checked both memory usage and >>>> >> > the >>>> >> > dmesg >>>> >> > log. The kill is not happening due to the out of memory. The kernel >>>> >> > doesn't >>>> >> > show any processes being killed. >>>> >> > >>>> >> > My current suspicion is that something amiss is happening over the >>>> >> > network. >>>> >> > This happens consistently when performing copy over large datasets. >>>> >> > Few >>>> >> > of >>>> >> > datanodes end up with "invalid string in message". >>>> >> > Going over the code in pqformat.c this happens because the >>>> >> > invariant >>>> >> > "StringInfo is guaranteed to have a trailing null byte" is not >>>> >> > valid >>>> >> > (line >>>> >> > 682 -- 637). I am not sure why this happens. >>>> >> > >>>> >> > >>>> >> > After this the datanodes do something illegal (probably buffer >>>> >> > overflow >>>> >> > on >>>> >> > the port..just a guess) that the OS decides to issue a sigkill. I >>>> >> > am >>>> >> > also >>>> >> > not sure why the OS is not logging this killed operation. >>>> >> > >>>> >> > >>>> >> > Please let me know of suggestions you may have regarding the >>>> >> > "invalid >>>> >> > string in message". In particular, where in the code is StringInfo >>>> >> > msg >>>> >> > is >>>> >> > getting populated i.e. which code reads the message over the >>>> >> > network and >>>> >> > keeps it ready for the copy operation on the datanode. Any other >>>> >> > suggestion >>>> >> > would also be helpful to trace this bug. >>>> >> > >>>> >> > Best, >>>> >> > Sandeep >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito <pg...@gm...> >>>> >> > wrote: >>>> >> >> >>>> >> >> Hi, >>>> >> >> >>>> >> >> I think I may be onto something. >>>> >> >> >>>> >> >> When the memory is exhausted, the Linux kernel kills process >>>> >> >> randomly >>>> >> >> by SIGKILL. >>>> >> >> The mechanism is called OOM(Out Of Memory) killer. >>>> >> >> OOM killer logs its activity to kernel mesage buffer. >>>> >> >> >>>> >> >> Could you check display message (dmesg) and memory resource? >>>> >> >> >>>> >> >> Regards. >>>> >> >> >>>> >> >> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta >>>> >> >> <gup...@gm...> >>>> >> >> wrote: >>>> >> >> > Hi Koichi, >>>> >> >> > >>>> >> >> > It is a bit of mystery because it does not happen >>>> >> >> > consistently. >>>> >> >> > Thanks >>>> >> >> > for the clarification though. It is indeed helpful. >>>> >> >> > >>>> >> >> > -Sandeep >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 >>>> >> >> > <ko...@in...> >>>> >> >> > wrote: >>>> >> >> >> >>>> >> >> >> Someone is sending SIGKILL to coordinator/datanode backend >>>> >> >> >> processes. >>>> >> >> >> Although XC (originally PG) code has a handler for SIGKILL, I >>>> >> >> >> didn’t >>>> >> >> >> found >>>> >> >> >> any code in XC sending SIGKILL to other processes. I’m afraid >>>> >> >> >> there >>>> >> >> >> could >>>> >> >> >> be another process sending SIGKILL to them. >>>> >> >> >> >>>> >> >> >> Best Regards; >>>> >> >> >> --- >>>> >> >> >> Koichi Suzuki >>>> >> >> >> >>>> >> >> >> 2013/11/09 2:54、Sandeep Gupta <gup...@gm...> のメール: >>>> >> >> >> >>>> >> >> >> > I am running single instance of postgres-xc with several >>>> >> >> >> > datanode >>>> >> >> >> > Each data node runs on its own machine. >>>> >> >> >> > After instantiating the cluster the database sits idle. I do >>>> >> >> >> > not >>>> >> >> >> > perform >>>> >> >> >> > any create table or insert operations. However, after some >>>> >> >> >> > time >>>> >> >> >> > all >>>> >> >> >> > the >>>> >> >> >> > datanode automatically shuts down (log messages attached). >>>> >> >> >> > >>>> >> >> >> > Any pointers as to why this maybe happening would be very >>>> >> >> >> > useful. >>>> >> >> >> > >>>> >> >> >> > -Sandeep >>>> >> >> >> > >>>> >> >> >> > The log files across all datanode contain various error >>>> >> >> >> > messages. >>>> >> >> >> > These >>>> >> >> >> > include: >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > LOG: statistics collector process (PID 23073) was terminated >>>> >> >> >> > by >>>> >> >> >> > signal >>>> >> >> >> > 9: Killed >>>> >> >> >> > LOG: WAL writer process (PID 23071) was terminated by signal >>>> >> >> >> > 9: >>>> >> >> >> > Killed >>>> >> >> >> > LOG: terminating any other active server processes >>>> >> >> >> > WARNING: terminating connection because of crash of another >>>> >> >> >> > server >>>> >> >> >> > process >>>> >> >> >> > DETAIL: The postmaster has commanded this server process to >>>> >> >> >> > roll >>>> >> >> >> > back >>>> >> >> >> > the current transaction and exit, because another server >>>> >> >> >> > process >>>> >> >> >> > exited >>>> >> >> >> > abnormally and possibly corrupted shared memory. >>>> >> >> >> > HINT: In a moment you should be able to reconnect to the >>>> >> >> >> > database >>>> >> >> >> > and >>>> >> >> >> > repeat your command. >>>> >> >> >> > LOG: all server processes terminated; reinitializing >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > LOG: checkpointer process (PID 21406) was terminated by >>>> >> >> >> > signal 9: >>>> >> >> >> > Killed >>>> >> >> >> > LOG: terminating any other active server processes >>>> >> >> >> > WARNING: terminating connection because of crash of another >>>> >> >> >> > server >>>> >> >> >> > process >>>> >> >> >> > DETAIL: The postmaster has commanded this server process to >>>> >> >> >> > roll >>>> >> >> >> > back >>>> >> >> >> > the current transaction and exit, because another server >>>> >> >> >> > process >>>> >> >> >> > exited >>>> >> >> >> > abnormally and possibly corrupted shared memory. >>>> >> >> >> > LOG: statistics collector process (PID 28881) was terminated >>>> >> >> >> > by >>>> >> >> >> > signal >>>> >> >> >> > 9: Killed >>>> >> >> >> > LOG: autovacuum launcher process (PID 28880) was terminated >>>> >> >> >> > by >>>> >> >> >> > signal >>>> >> >> >> > 9: Killed >>>> >> >> >> > LOG: terminating any other active server processes >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > ------------------------------------------------------------------------------ >>>> >> >> >> > November Webinars for C, C++, Fortran Developers >>>> >> >> >> > Accelerate application performance with scalable programming >>>> >> >> >> > models. >>>> >> >> >> > Explore >>>> >> >> >> > techniques for threading, error checking, porting, and >>>> >> >> >> > tuning. Get >>>> >> >> >> > the >>>> >> >> >> > most >>>> >> >> >> > from the latest Intel processors and coprocessors. See >>>> >> >> >> > abstracts >>>> >> >> >> > and >>>> >> >> >> > register >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > >>>> >> >> >> > http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk_______________________________________________ >>>> >> >> >> > Postgres-xc-general mailing list >>>> >> >> >> > Pos...@li... >>>> >> >> >> > >>>> >> >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >>>> >> >> >> >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > ------------------------------------------------------------------------------ >>>> >> >> > November Webinars for C, C++, Fortran Developers >>>> >> >> > Accelerate application performance with scalable programming >>>> >> >> > models. >>>> >> >> > Explore >>>> >> >> > techniques for threading, error checking, porting, and tuning. >>>> >> >> > Get >>>> >> >> > the >>>> >> >> > most >>>> >> >> > from the latest Intel processors and coprocessors. See abstracts >>>> >> >> > and >>>> >> >> > register >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk >>>> >> >> > _______________________________________________ >>>> >> >> > Postgres-xc-general mailing list >>>> >> >> > Pos...@li... >>>> >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >>>> >> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > ------------------------------------------------------------------------------ >>>> >> > DreamFactory - Open Source REST & JSON Services for HTML5 & Native >>>> >> > Apps >>>> >> > OAuth, Users, Roles, SQL, NoSQL, BLOB Storage and External API >>>> >> > Access >>>> >> > Free app hosting. Or install the open source package on any LAMP >>>> >> > server. >>>> >> > Sign up and see examples for AngularJS, jQuery, Sencha Touch and >>>> >> > Native! >>>> >> > >>>> >> > >>>> >> > http://pubads.g.doubleclick.net/gampad/clk?id=63469471&iu=/4140/ostg.clktrk >>>> >> > _______________________________________________ >>>> >> > Postgres-xc-general mailing list >>>> >> > Pos...@li... >>>> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >>>> >> > >>>> > >>>> > >>> >>> >> > |