|
From: Sandeep G. <gup...@gm...> - 2013-11-16 04:27:32
|
Hi Koichi, We tried the copy with the patch but it did not fix the problem for us. If you need any additional details then let me know. Best, -Sandeep On Fri, Nov 15, 2013 at 6:13 AM, Sandeep Gupta <gup...@gm...>wrote: > Hi Koichi, > > Many thanks for sharing the patch. We would test this in our environment > and would report back on the result. > > Regarding the very high memory consumption, it is not the coordinator > that sees the high memory usage. > The setup is following: We have one coordinator and 4 datanodes. All the > datanodes run on separate machine > then the coordinator. It is the machine where the datanodes are running > that have 48GB memory usage. > The other caveat is that that "top" doesn't show that it is the datanodes > processes using the memory. > > In any case, the last run of copy even with the high usage went through > fine. > > > Best, > Sandeep > > > > On Fri, Nov 15, 2013 at 2:29 AM, Koichi Suzuki <koi...@gm...>wrote: > >> Yes, I've already found this kind of failure. When disk write is >> quite slower than read, coordinator overshoots. However, in this >> case, I thought we will see that palloc() overflows. If each piece >> of COPY TO text data is less than 1GB, then each COPY TO succeeds. >> If the write is much slower than read, each coordinator will allocate >> memory for each COPY TO cache going to datanodes, ending up with >> enormous amount of memory consumption and have a good chance to run >> OOM killer. Your memory usage looks enormous. Maybe you have >> many COPY TO instances each of which allocates too much memory. >> >> Attached patch fixes COPY TO. Because we have a chance to have >> similar overshoots in COPY FROM, I'd like to write a patch to fix this >> too before committing. This is for the current master but will work >> with 1.1.x (even with 1.0.x hopefully). >> >> The patch looks to be fine. I asked to test it in such an >> environment and they reported the patch fixed their issue. >> >> I'm not sure but hope this works in your case. May reduce memory >> consumption drastically. Please let me know the result. >> >> I have to allocate some of my resource to fix the same issue in COPY >> FROM. I have another request to fix this so if the patch works for >> you, I may be able to commit only COPY TO fix first. >> >> Regards; >> --- >> Koichi Suzuki >> >> >> 2013/11/15 Sandeep Gupta <gup...@gm...>: >> > Hi Koichi, >> > >> > I am using the copy command over the text file to populate a table. >> > >> > Sorry for the confusion. I meant 628 -- 638 lines in the pqformat.c. >> > Let me explain this a bit more. I get error in some of the datanode >> > logfiles >> > as "invalid string in message". The message is mentioned only in >> pqformat.c >> > at line 628. Although I am not able to trace back as to where error may >> > have first ordered in the datanode that triggered this error. >> > >> > In order to debug this I started printing out the rows the coordinator >> is >> > sending to the datanode (in function DataNodeCopyIn). For some reason >> the >> > error invalid string in messge error has not happened so far. It seems >> > there is some issue with the flow control of message between >> coordinator and >> > datanode. >> > Hope this helps in narrowing down what is causing the problem in the >> first >> > place. >> > >> > >> > However, the new problem I see is that the memory usage goes very high. >> Of >> > the 48 GB almost all of the memory is getting used. Is this normal? >> > If not can you give suggestion as to how to keep the memory consumption >> > low. >> > >> > -Sandeep >> > >> > >> > >> > On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki <koi...@gm...> >> > wrote: >> >> >> >> Could you let me know what "line 682 -- 637" means? >> >> >> >> Because any communication between coordinator and datanode is done >> >> using libpq, I don't think XC populates information using raw >> >> stringinfo. In fact, from datanodes, coordinator is just a client >> >> except for supplying XID and snapshot. However, COPY TO uses >> >> send()/receive(), not through libpq, to send data to datanode. In >> >> this case, if incoming data file contains NULL in the data, such data >> >> may be sent via send() to datanodes. COPY TO assumes incoming data >> >> in a text format. Do you think you have such situations? >> >> >> >> Analysis may take a bit but let me try. >> >> >> >> Regards; >> >> --- >> >> Koichi Suzuki >> >> >> >> >> >> 2013/11/15 Sandeep Gupta <gup...@gm...>: >> >> > Hi Koichi, Masataka, >> >> > >> >> > Thanks for taking a look. I double checked both memory usage and the >> >> > dmesg >> >> > log. The kill is not happening due to the out of memory. The kernel >> >> > doesn't >> >> > show any processes being killed. >> >> > >> >> > My current suspicion is that something amiss is happening over the >> >> > network. >> >> > This happens consistently when performing copy over large datasets. >> Few >> >> > of >> >> > datanodes end up with "invalid string in message". >> >> > Going over the code in pqformat.c this happens because the invariant >> >> > "StringInfo is guaranteed to have a trailing null byte" is not valid >> >> > (line >> >> > 682 -- 637). I am not sure why this happens. >> >> > >> >> > >> >> > After this the datanodes do something illegal (probably buffer >> overflow >> >> > on >> >> > the port..just a guess) that the OS decides to issue a sigkill. I am >> >> > also >> >> > not sure why the OS is not logging this killed operation. >> >> > >> >> > >> >> > Please let me know of suggestions you may have regarding the >> "invalid >> >> > string in message". In particular, where in the code is StringInfo >> msg >> >> > is >> >> > getting populated i.e. which code reads the message over the network >> and >> >> > keeps it ready for the copy operation on the datanode. Any other >> >> > suggestion >> >> > would also be helpful to trace this bug. >> >> > >> >> > Best, >> >> > Sandeep >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito <pg...@gm...> >> >> > wrote: >> >> >> >> >> >> Hi, >> >> >> >> >> >> I think I may be onto something. >> >> >> >> >> >> When the memory is exhausted, the Linux kernel kills process >> randomly >> >> >> by SIGKILL. >> >> >> The mechanism is called OOM(Out Of Memory) killer. >> >> >> OOM killer logs its activity to kernel mesage buffer. >> >> >> >> >> >> Could you check display message (dmesg) and memory resource? >> >> >> >> >> >> Regards. >> >> >> >> >> >> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta >> >> >> <gup...@gm...> >> >> >> wrote: >> >> >> > Hi Koichi, >> >> >> > >> >> >> > It is a bit of mystery because it does not happen >> consistently. >> >> >> > Thanks >> >> >> > for the clarification though. It is indeed helpful. >> >> >> > >> >> >> > -Sandeep >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 <ko...@in...> >> >> >> > wrote: >> >> >> >> >> >> >> >> Someone is sending SIGKILL to coordinator/datanode backend >> >> >> >> processes. >> >> >> >> Although XC (originally PG) code has a handler for SIGKILL, I >> didn’t >> >> >> >> found >> >> >> >> any code in XC sending SIGKILL to other processes. I’m afraid >> >> >> >> there >> >> >> >> could >> >> >> >> be another process sending SIGKILL to them. >> >> >> >> >> >> >> >> Best Regards; >> >> >> >> --- >> >> >> >> Koichi Suzuki >> >> >> >> >> >> >> >> 2013/11/09 2:54、Sandeep Gupta <gup...@gm...> のメール: >> >> >> >> >> >> >> >> > I am running single instance of postgres-xc with several >> datanode >> >> >> >> > Each data node runs on its own machine. >> >> >> >> > After instantiating the cluster the database sits idle. I do >> not >> >> >> >> > perform >> >> >> >> > any create table or insert operations. However, after some time >> >> >> >> > all >> >> >> >> > the >> >> >> >> > datanode automatically shuts down (log messages attached). >> >> >> >> > >> >> >> >> > Any pointers as to why this maybe happening would be very >> useful. >> >> >> >> > >> >> >> >> > -Sandeep >> >> >> >> > >> >> >> >> > The log files across all datanode contain various error >> messages. >> >> >> >> > These >> >> >> >> > include: >> >> >> >> > >> >> >> >> > >> >> >> >> > LOG: statistics collector process (PID 23073) was terminated >> by >> >> >> >> > signal >> >> >> >> > 9: Killed >> >> >> >> > LOG: WAL writer process (PID 23071) was terminated by signal >> 9: >> >> >> >> > Killed >> >> >> >> > LOG: terminating any other active server processes >> >> >> >> > WARNING: terminating connection because of crash of another >> >> >> >> > server >> >> >> >> > process >> >> >> >> > DETAIL: The postmaster has commanded this server process to >> roll >> >> >> >> > back >> >> >> >> > the current transaction and exit, because another server >> process >> >> >> >> > exited >> >> >> >> > abnormally and possibly corrupted shared memory. >> >> >> >> > HINT: In a moment you should be able to reconnect to the >> database >> >> >> >> > and >> >> >> >> > repeat your command. >> >> >> >> > LOG: all server processes terminated; reinitializing >> >> >> >> > >> >> >> >> > >> >> >> >> > LOG: checkpointer process (PID 21406) was terminated by >> signal 9: >> >> >> >> > Killed >> >> >> >> > LOG: terminating any other active server processes >> >> >> >> > WARNING: terminating connection because of crash of another >> >> >> >> > server >> >> >> >> > process >> >> >> >> > DETAIL: The postmaster has commanded this server process to >> roll >> >> >> >> > back >> >> >> >> > the current transaction and exit, because another server >> process >> >> >> >> > exited >> >> >> >> > abnormally and possibly corrupted shared memory. >> >> >> >> > LOG: statistics collector process (PID 28881) was terminated >> by >> >> >> >> > signal >> >> >> >> > 9: Killed >> >> >> >> > LOG: autovacuum launcher process (PID 28880) was terminated by >> >> >> >> > signal >> >> >> >> > 9: Killed >> >> >> >> > LOG: terminating any other active server processes >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> ------------------------------------------------------------------------------ >> >> >> >> > November Webinars for C, C++, Fortran Developers >> >> >> >> > Accelerate application performance with scalable programming >> >> >> >> > models. >> >> >> >> > Explore >> >> >> >> > techniques for threading, error checking, porting, and tuning. >> Get >> >> >> >> > the >> >> >> >> > most >> >> >> >> > from the latest Intel processors and coprocessors. See >> abstracts >> >> >> >> > and >> >> >> >> > register >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk_______________________________________________ >> >> >> >> > Postgres-xc-general mailing list >> >> >> >> > Pos...@li... >> >> >> >> > >> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >> >> >> >> >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> ------------------------------------------------------------------------------ >> >> >> > November Webinars for C, C++, Fortran Developers >> >> >> > Accelerate application performance with scalable programming >> models. >> >> >> > Explore >> >> >> > techniques for threading, error checking, porting, and tuning. Get >> >> >> > the >> >> >> > most >> >> >> > from the latest Intel processors and coprocessors. See abstracts >> and >> >> >> > register >> >> >> > >> >> >> > >> >> >> > >> http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk >> >> >> > _______________________________________________ >> >> >> > Postgres-xc-general mailing list >> >> >> > Pos...@li... >> >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >> >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > >> ------------------------------------------------------------------------------ >> >> > DreamFactory - Open Source REST & JSON Services for HTML5 & Native >> Apps >> >> > OAuth, Users, Roles, SQL, NoSQL, BLOB Storage and External API Access >> >> > Free app hosting. Or install the open source package on any LAMP >> server. >> >> > Sign up and see examples for AngularJS, jQuery, Sencha Touch and >> Native! >> >> > >> >> > >> http://pubads.g.doubleclick.net/gampad/clk?id=63469471&iu=/4140/ostg.clktrk >> >> > _______________________________________________ >> >> > Postgres-xc-general mailing list >> >> > Pos...@li... >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >> >> > >> > >> > >> > > |