|
From: Sandeep G. <gup...@gm...> - 2013-11-16 04:32:41
|
Forgot to mention that the error is still the same "invalid string in message". -Sandeep On Fri, Nov 15, 2013 at 8:27 PM, Sandeep Gupta <gup...@gm...>wrote: > Hi Koichi, > > We tried the copy with the patch but it did not fix the problem for us. > If you need any additional details then let me know. > > Best, > -Sandeep > > > On Fri, Nov 15, 2013 at 6:13 AM, Sandeep Gupta <gup...@gm...>wrote: > >> Hi Koichi, >> >> Many thanks for sharing the patch. We would test this in our environment >> and would report back on the result. >> >> Regarding the very high memory consumption, it is not the coordinator >> that sees the high memory usage. >> The setup is following: We have one coordinator and 4 datanodes. All the >> datanodes run on separate machine >> then the coordinator. It is the machine where the datanodes are running >> that have 48GB memory usage. >> The other caveat is that that "top" doesn't show that it is the datanodes >> processes using the memory. >> >> In any case, the last run of copy even with the high usage went through >> fine. >> >> >> Best, >> Sandeep >> >> >> >> On Fri, Nov 15, 2013 at 2:29 AM, Koichi Suzuki <koi...@gm...>wrote: >> >>> Yes, I've already found this kind of failure. When disk write is >>> quite slower than read, coordinator overshoots. However, in this >>> case, I thought we will see that palloc() overflows. If each piece >>> of COPY TO text data is less than 1GB, then each COPY TO succeeds. >>> If the write is much slower than read, each coordinator will allocate >>> memory for each COPY TO cache going to datanodes, ending up with >>> enormous amount of memory consumption and have a good chance to run >>> OOM killer. Your memory usage looks enormous. Maybe you have >>> many COPY TO instances each of which allocates too much memory. >>> >>> Attached patch fixes COPY TO. Because we have a chance to have >>> similar overshoots in COPY FROM, I'd like to write a patch to fix this >>> too before committing. This is for the current master but will work >>> with 1.1.x (even with 1.0.x hopefully). >>> >>> The patch looks to be fine. I asked to test it in such an >>> environment and they reported the patch fixed their issue. >>> >>> I'm not sure but hope this works in your case. May reduce memory >>> consumption drastically. Please let me know the result. >>> >>> I have to allocate some of my resource to fix the same issue in COPY >>> FROM. I have another request to fix this so if the patch works for >>> you, I may be able to commit only COPY TO fix first. >>> >>> Regards; >>> --- >>> Koichi Suzuki >>> >>> >>> 2013/11/15 Sandeep Gupta <gup...@gm...>: >>> > Hi Koichi, >>> > >>> > I am using the copy command over the text file to populate a table. >>> > >>> > Sorry for the confusion. I meant 628 -- 638 lines in the pqformat.c. >>> > Let me explain this a bit more. I get error in some of the datanode >>> > logfiles >>> > as "invalid string in message". The message is mentioned only in >>> pqformat.c >>> > at line 628. Although I am not able to trace back as to where error >>> may >>> > have first ordered in the datanode that triggered this error. >>> > >>> > In order to debug this I started printing out the rows the coordinator >>> is >>> > sending to the datanode (in function DataNodeCopyIn). For some reason >>> the >>> > error invalid string in messge error has not happened so far. It >>> seems >>> > there is some issue with the flow control of message between >>> coordinator and >>> > datanode. >>> > Hope this helps in narrowing down what is causing the problem in the >>> first >>> > place. >>> > >>> > >>> > However, the new problem I see is that the memory usage goes very >>> high. Of >>> > the 48 GB almost all of the memory is getting used. Is this normal? >>> > If not can you give suggestion as to how to keep the memory >>> consumption >>> > low. >>> > >>> > -Sandeep >>> > >>> > >>> > >>> > On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki <koi...@gm...> >>> > wrote: >>> >> >>> >> Could you let me know what "line 682 -- 637" means? >>> >> >>> >> Because any communication between coordinator and datanode is done >>> >> using libpq, I don't think XC populates information using raw >>> >> stringinfo. In fact, from datanodes, coordinator is just a client >>> >> except for supplying XID and snapshot. However, COPY TO uses >>> >> send()/receive(), not through libpq, to send data to datanode. In >>> >> this case, if incoming data file contains NULL in the data, such data >>> >> may be sent via send() to datanodes. COPY TO assumes incoming data >>> >> in a text format. Do you think you have such situations? >>> >> >>> >> Analysis may take a bit but let me try. >>> >> >>> >> Regards; >>> >> --- >>> >> Koichi Suzuki >>> >> >>> >> >>> >> 2013/11/15 Sandeep Gupta <gup...@gm...>: >>> >> > Hi Koichi, Masataka, >>> >> > >>> >> > Thanks for taking a look. I double checked both memory usage and >>> the >>> >> > dmesg >>> >> > log. The kill is not happening due to the out of memory. The kernel >>> >> > doesn't >>> >> > show any processes being killed. >>> >> > >>> >> > My current suspicion is that something amiss is happening over the >>> >> > network. >>> >> > This happens consistently when performing copy over large datasets. >>> Few >>> >> > of >>> >> > datanodes end up with "invalid string in message". >>> >> > Going over the code in pqformat.c this happens because the invariant >>> >> > "StringInfo is guaranteed to have a trailing null byte" is not valid >>> >> > (line >>> >> > 682 -- 637). I am not sure why this happens. >>> >> > >>> >> > >>> >> > After this the datanodes do something illegal (probably buffer >>> overflow >>> >> > on >>> >> > the port..just a guess) that the OS decides to issue a sigkill. I >>> am >>> >> > also >>> >> > not sure why the OS is not logging this killed operation. >>> >> > >>> >> > >>> >> > Please let me know of suggestions you may have regarding the >>> "invalid >>> >> > string in message". In particular, where in the code is StringInfo >>> msg >>> >> > is >>> >> > getting populated i.e. which code reads the message over the >>> network and >>> >> > keeps it ready for the copy operation on the datanode. Any other >>> >> > suggestion >>> >> > would also be helpful to trace this bug. >>> >> > >>> >> > Best, >>> >> > Sandeep >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito <pg...@gm...> >>> >> > wrote: >>> >> >> >>> >> >> Hi, >>> >> >> >>> >> >> I think I may be onto something. >>> >> >> >>> >> >> When the memory is exhausted, the Linux kernel kills process >>> randomly >>> >> >> by SIGKILL. >>> >> >> The mechanism is called OOM(Out Of Memory) killer. >>> >> >> OOM killer logs its activity to kernel mesage buffer. >>> >> >> >>> >> >> Could you check display message (dmesg) and memory resource? >>> >> >> >>> >> >> Regards. >>> >> >> >>> >> >> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta >>> >> >> <gup...@gm...> >>> >> >> wrote: >>> >> >> > Hi Koichi, >>> >> >> > >>> >> >> > It is a bit of mystery because it does not happen >>> consistently. >>> >> >> > Thanks >>> >> >> > for the clarification though. It is indeed helpful. >>> >> >> > >>> >> >> > -Sandeep >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 <ko...@in... >>> > >>> >> >> > wrote: >>> >> >> >> >>> >> >> >> Someone is sending SIGKILL to coordinator/datanode backend >>> >> >> >> processes. >>> >> >> >> Although XC (originally PG) code has a handler for SIGKILL, I >>> didn’t >>> >> >> >> found >>> >> >> >> any code in XC sending SIGKILL to other processes. I’m afraid >>> >> >> >> there >>> >> >> >> could >>> >> >> >> be another process sending SIGKILL to them. >>> >> >> >> >>> >> >> >> Best Regards; >>> >> >> >> --- >>> >> >> >> Koichi Suzuki >>> >> >> >> >>> >> >> >> 2013/11/09 2:54、Sandeep Gupta <gup...@gm...> のメール: >>> >> >> >> >>> >> >> >> > I am running single instance of postgres-xc with several >>> datanode >>> >> >> >> > Each data node runs on its own machine. >>> >> >> >> > After instantiating the cluster the database sits idle. I do >>> not >>> >> >> >> > perform >>> >> >> >> > any create table or insert operations. However, after some >>> time >>> >> >> >> > all >>> >> >> >> > the >>> >> >> >> > datanode automatically shuts down (log messages attached). >>> >> >> >> > >>> >> >> >> > Any pointers as to why this maybe happening would be very >>> useful. >>> >> >> >> > >>> >> >> >> > -Sandeep >>> >> >> >> > >>> >> >> >> > The log files across all datanode contain various error >>> messages. >>> >> >> >> > These >>> >> >> >> > include: >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > LOG: statistics collector process (PID 23073) was terminated >>> by >>> >> >> >> > signal >>> >> >> >> > 9: Killed >>> >> >> >> > LOG: WAL writer process (PID 23071) was terminated by signal >>> 9: >>> >> >> >> > Killed >>> >> >> >> > LOG: terminating any other active server processes >>> >> >> >> > WARNING: terminating connection because of crash of another >>> >> >> >> > server >>> >> >> >> > process >>> >> >> >> > DETAIL: The postmaster has commanded this server process to >>> roll >>> >> >> >> > back >>> >> >> >> > the current transaction and exit, because another server >>> process >>> >> >> >> > exited >>> >> >> >> > abnormally and possibly corrupted shared memory. >>> >> >> >> > HINT: In a moment you should be able to reconnect to the >>> database >>> >> >> >> > and >>> >> >> >> > repeat your command. >>> >> >> >> > LOG: all server processes terminated; reinitializing >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > LOG: checkpointer process (PID 21406) was terminated by >>> signal 9: >>> >> >> >> > Killed >>> >> >> >> > LOG: terminating any other active server processes >>> >> >> >> > WARNING: terminating connection because of crash of another >>> >> >> >> > server >>> >> >> >> > process >>> >> >> >> > DETAIL: The postmaster has commanded this server process to >>> roll >>> >> >> >> > back >>> >> >> >> > the current transaction and exit, because another server >>> process >>> >> >> >> > exited >>> >> >> >> > abnormally and possibly corrupted shared memory. >>> >> >> >> > LOG: statistics collector process (PID 28881) was terminated >>> by >>> >> >> >> > signal >>> >> >> >> > 9: Killed >>> >> >> >> > LOG: autovacuum launcher process (PID 28880) was terminated >>> by >>> >> >> >> > signal >>> >> >> >> > 9: Killed >>> >> >> >> > LOG: terminating any other active server processes >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> ------------------------------------------------------------------------------ >>> >> >> >> > November Webinars for C, C++, Fortran Developers >>> >> >> >> > Accelerate application performance with scalable programming >>> >> >> >> > models. >>> >> >> >> > Explore >>> >> >> >> > techniques for threading, error checking, porting, and >>> tuning. Get >>> >> >> >> > the >>> >> >> >> > most >>> >> >> >> > from the latest Intel processors and coprocessors. See >>> abstracts >>> >> >> >> > and >>> >> >> >> > register >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk_______________________________________________ >>> >> >> >> > Postgres-xc-general mailing list >>> >> >> >> > Pos...@li... >>> >> >> >> > >>> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >>> >> >> >> >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > >>> ------------------------------------------------------------------------------ >>> >> >> > November Webinars for C, C++, Fortran Developers >>> >> >> > Accelerate application performance with scalable programming >>> models. >>> >> >> > Explore >>> >> >> > techniques for threading, error checking, porting, and tuning. >>> Get >>> >> >> > the >>> >> >> > most >>> >> >> > from the latest Intel processors and coprocessors. See abstracts >>> and >>> >> >> > register >>> >> >> > >>> >> >> > >>> >> >> > >>> http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk >>> >> >> > _______________________________________________ >>> >> >> > Postgres-xc-general mailing list >>> >> >> > Pos...@li... >>> >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >>> >> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> ------------------------------------------------------------------------------ >>> >> > DreamFactory - Open Source REST & JSON Services for HTML5 & Native >>> Apps >>> >> > OAuth, Users, Roles, SQL, NoSQL, BLOB Storage and External API >>> Access >>> >> > Free app hosting. Or install the open source package on any LAMP >>> server. >>> >> > Sign up and see examples for AngularJS, jQuery, Sencha Touch and >>> Native! >>> >> > >>> >> > >>> http://pubads.g.doubleclick.net/gampad/clk?id=63469471&iu=/4140/ostg.clktrk >>> >> > _______________________________________________ >>> >> > Postgres-xc-general mailing list >>> >> > Pos...@li... >>> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >>> >> > >>> > >>> > >>> >> >> > |