From: Nick M. <nm...@gm...> - 2012-08-21 14:44:58
All,

I am currently exploring Postgres-XC as a clustering solution for a project I am working on. The use case is as follows:

- Time-series data from multiple sensors
- Sensors report at various rates, from 50 Hz to once every 5 minutes
- INSERTs (COPYs) on the order of 1000+/s
- No UPDATEs; once the data is in the database we consider it immutable
- Large volumes of data need to be stored (one 50 Hz sensor = ~1.5 billion rows for a year of collection)
- SELECTs need to run as quickly as possible for UI and data analysis
- Number of client connections = 10-20; 95%+ of the INSERTs are done by one node, 99%+ of the SELECTs are done by the rest of the nodes
- Very write-heavy application; reads are not nearly as frequent as writes but usually involve large amounts of data

My current cluster configuration is as follows:

Server A: GTM
Server B: GTM Proxy, Coordinator
Server C: Datanode
Server D: Datanode
Server E: Datanode

My question is: in your documentation you recommend having a coordinator at each datanode; what is the rationale for this?

Do you think it would be appropriate in my situation with so few connections?

Would I get better read performance without hurting my write performance too much (write performance is more important than read)?

Thanks,
Nick
From: Mason S. <ma...@st...> - 2012-08-21 20:33:36
On Tue, Aug 21, 2012 at 10:44 AM, Nick Maludy <nm...@gm...> wrote:

> - INSERTs (COPYs) on the order of 1000+/s

This should not be a problem, even for a single PostgreSQL instance. Nonetheless, I would recommend using COPY when uploading these batches.

> - No UPDATEs; once the data is in the database we consider it immutable

Nice, no need to worry about update bloat and long vacuums.

> - Large volumes of data need to be stored (one 50 Hz sensor = ~1.5 billion rows for a year of collection)

No problem.

> - Number of client connections = 10-20; 95%+ of the INSERTs are done by one node, 99%+ of the SELECTs are done by the rest of the nodes

I am not sure what you mean. One client connection is doing 95% of the inserts? Or 95% of the writes ends up on one single datanode? Same thing with the 99%. Sorry, I am not quite sure I understand.

> - Very write-heavy application; reads are not nearly as frequent as writes but usually involve large amounts of data.

Since you said it is sensor data, is it pretty much one large table? That should work fine for large reads on Postgres-XC. This is sounding like a good use case for Postgres-XC.

> My question is, in your documentation you recommend having a coordinator at each datanode; what is the rationale for this?

You don't necessarily need to. If you have a lot of replicated tables (not distributed), it can help because the coordinator just reads locally without needing to hit another server. It also ensures an even distribution of your workload across the cluster.

The flip side of this is that a dedicated coordinator server can be a less expensive machine than the datanodes, so you can consider price/performance. You can also easily add another dedicated coordinator if it turns out your coordinator is bottlenecked, though you could do that with the other configuration as well.

So, it depends on your workload. If you have 3 datanodes and you also ran a coordinator process on each and load balanced, a local read could be done 1/3rd of the time.

> Do you think it would be appropriate in my situation with so few connections?
>
> Would I get better read performance without hurting my write performance too much (write performance is more important than read)?

If you have the time, ideally I would test it out and see how it performs for your workload. From what you described, there may not be much of a difference.

> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Postgres-xc-general mailing list
> Pos...@li...
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general

--
Mason Sharp

StormDB - http://www.stormdb.com
The Database Cloud - Postgres-XC Support and Service
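The batched-COPY ingestion discussed above can be sketched as follows. This is a minimal, hypothetical example: the `sensor_data` table, its columns, and the driver call in the trailing comment are all illustrative, not part of the original thread. Only the buffer construction is shown concretely; the actual upload would go through a driver such as psycopg2 or through `psql`'s `\copy`.

```python
import io
from datetime import datetime, timezone

def make_copy_buffer(rows):
    """Serialize (sensor_id, ts, value) tuples into a tab-separated
    in-memory buffer suitable for COPY ... FROM STDIN."""
    buf = io.StringIO()
    for sensor_id, ts, value in rows:
        buf.write(f"{sensor_id}\t{ts.isoformat()}\t{value}\n")
    buf.seek(0)
    return buf

# A tiny batch of 50 Hz samples (values are made up).
rows = [
    (1, datetime(2012, 8, 21, 14, 0, 0, tzinfo=timezone.utc), 0.57),
    (1, datetime(2012, 8, 21, 14, 0, 0, 20000, tzinfo=timezone.utc), 0.61),
]
buf = make_copy_buffer(rows)

# With psycopg2 (connection setup not shown), the whole batch would then
# be streamed in a single round trip, e.g.:
#   cur.copy_from(buf, "sensor_data", columns=("sensor_id", "ts", "value"))
```

Batching thousands of rows per COPY keeps per-row protocol overhead off the critical path, which is why COPY is preferred over individual INSERTs here.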
From: Koichi S. <koi...@gm...> - 2012-08-22 00:16:54
2012/8/22 Mason Sharp <ma...@st...>:

> If you have the time, ideally I would test it out and see how it
> performs for your workload. From what you described, there may not be
> much of a difference.

There are a couple of reasons to configure both a coordinator and a datanode in each server:

1) You don't have to worry about load balancing between coordinators and datanodes.
2) If the target data is located locally, you save network communication. In the DBT-1 benchmark, this contributes to the overall throughput.
3) More datanodes, better parallelism. If you have four servers of the same spec, you can have four parallel I/O paths instead of three.

Of course, this depends on your transactions.

Regards;
---
Koichi Suzuki
From: Nick M. <nm...@gm...> - 2012-08-22 13:40:27
Koichi,

Thank you for your insight. I am going to create coordinators on each datanode and try to distribute the connections from my nodes evenly.

Does Postgres-XC have the ability to automatically load balance my connections (say coordinator1 is too loaded, so my connection gets routed to coordinator2)? Or would I have to do this manually?

Mason,

I've commented inline below.

Thank you both for your input,
-Nick

On Tue, Aug 21, 2012 at 8:16 PM, Koichi Suzuki <koi...@gm...> wrote:

> > This should not be a problem, even for a single PostgreSQL instance.
> > Nonetheless, I would recommend using COPY when uploading these batches.

- Yes, our batches of 1000-5000 were working fine with regular Postgres under our current load. However, our load is expected to increase next year, and my benchmarks showed that regular Postgres couldn't keep up with much more than this. I am sorry to mislead you also: these are 5000 messages. Some of our messages are quite complex, containing lists of other messages which may contain lists of yet more messages, etc. We have put these nested lists into separate tables, so saving one message can mean numerous inserts into various tables; I can go into more detail later if needed.

> > I am not sure what you mean. One client connection is doing 95% of the
> > inserts? Or 95% of the writes ends up on one single datanode?

- We currently only have one node in our network which writes to the database, so all of the COPYs come from one libpq client connection. There is one small use case where this isn't true, which is why I said 95%, but to simplify things we can say only one node writes to the database.
- We have several other nodes which do data crunching and display information to users; these nodes do all of the SELECTs.

> > Since you said it is sensor data, is it pretty much one large table?

- Our system collects data from several different types of sensors, so we have a table for each type, along with tables for our application-specific data. I would estimate around 10 tables contain the majority of our data currently.

> > So, it depends on your workload. If you have 3 datanodes and you also
> > ran a coordinator process on each and load balanced, a local read could
> > be done 1/3rd of the time.

- I like your reasoning for having a coordinator on each datanode so we can exploit local reads.
- I have chosen not to have any replicated tables, simply because these tables are expected to grow extremely large and would be too big to fit on one node. My current DISTRIBUTE BY scheme is ROUND ROBIN, so the data is balanced across all of my nodes.
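The distribution choices being weighed here can be declared per table in Postgres-XC's `CREATE TABLE` syntax. A minimal sketch, with a hypothetical `sensor_data` table and columns:

```sql
-- Round-robin: spread rows evenly across all datanodes, one row at a time.
-- Good for balancing a heavy write stream; any SELECT must touch all nodes.
CREATE TABLE sensor_data (
    sensor_id  integer          NOT NULL,
    ts         timestamptz      NOT NULL,
    value      double precision
) DISTRIBUTE BY ROUND ROBIN;

-- Alternative to consider: hash distribution keeps all rows for a given
-- sensor on one datanode, so per-sensor SELECTs can be served by a single
-- node instead of being gathered from every node.
-- ... DISTRIBUTE BY HASH (sensor_id);
```

Which of the two wins depends on whether the large analytical reads scan everything (round robin parallelizes well) or filter on a key such as the sensor id (hash enables single-node pushdown).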
From: Koichi S. <koi...@gm...> - 2012-08-22 14:19:00
Unfortunately, XC does not come with a load balancer. I hope static load balancing works in your case; that is, implement a connection pooler and assign a static (different) access point to each thread.

Hope it helps.
----------
Koichi Suzuki

2012/8/22 Nick Maludy <nm...@gm...>:
> Does Postgres-XC have the ability to automatically load balance my
> connections (say coordinator1 is too loaded, so my connection gets routed
> to coordinator2)? Or would I have to do this manually?
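The static balancing Koichi suggests can be sketched in a few lines: each client thread is pinned to one coordinator, round-robin, with no runtime balancing logic. The endpoint names below are hypothetical, and the connection pooling itself is out of scope here.

```python
import itertools
import threading

# Hypothetical coordinator endpoints; in the configuration discussed in
# this thread, each datanode server would also run a coordinator.
COORDINATORS = ["serverC:5432", "serverD:5432", "serverE:5432"]

class StaticBalancer:
    """Pin each client thread to a fixed coordinator, round-robin."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)
        self._lock = threading.Lock()
        self._assigned = {}

    def endpoint_for(self, thread_name):
        # The first call for a thread pins it to the next coordinator in
        # the cycle; every later call returns the same endpoint.
        with self._lock:
            if thread_name not in self._assigned:
                self._assigned[thread_name] = next(self._cycle)
            return self._assigned[thread_name]

balancer = StaticBalancer(COORDINATORS)
# e.g. balancer.endpoint_for("ui-1") -> "serverC:5432" (and stays that way)
```

A thread would then open its pooled connection against the endpoint it was handed, so load spreads evenly as long as the threads do comparable work.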
From: Michael P. <mic...@gm...> - 2012-08-22 23:26:11
On Wed, Aug 22, 2012 at 10:39 PM, Nick Maludy <nm...@gm...> wrote:
> Does Postgres-XC have the ability to automatically load balance my
> connections (say coordinator1 is too loaded, so my connection gets routed
> to coordinator2)? Or would I have to do this manually?

There is no load balancer included in the XC core package for connecting an application to Coordinators. However, depending on how you defined your table distribution strategy, you might reach a good load balance in both writes and reads between Datanodes and Coordinators.

Regards,
--
Michael Paquier
http://michael.otacoo.com
From: Ashutosh B. <ash...@en...> - 2012-08-23 06:48:22
Hi Nick,

I was going to ask you about the SELECT queries you are firing, but I see that you have sent those in another thread, so I will respond in more detail there.

The following factors matter when it comes to coordinators:

1. Number of connections - The coordinator is the point of contact for applications, so the more connections there are, the higher the load, and thus having multiple coordinators helps there. But in your case, you mentioned that the number of connections is not so large (your current PG system is probably able to handle it), so you may want to look at the other factors.

2. Load on the coordinator - In the case of distributed tables, the coordinator spends CPU time combining results (aggregates, sorting, etc.), so even with a small number of connections a coordinator may get loaded because of query processing. So, in your case, check whether the coordinator machine is reaching its CPU/network/disk I/O/memory limits. If so, try putting the coordinator on a machine different from those where the datanodes are running. You may choose to share that machine with the GTM if needed. This will provide the coordinator with the needed CPU/network/RAM resources, and might actually work for you. In such a case, you may want to give the coordinator a machine with higher CPU/core power and more RAM.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company
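The coordinator-side query processing mentioned above shows up in exactly the kind of analytical query this workload runs. A hedged illustration, using hypothetical table and column names, of a query whose aggregation work lands partly on the coordinator:

```sql
-- For a table distributed across datanodes, each datanode can compute a
-- partial result over its local rows; the coordinator then merges those
-- partial results and produces the final sorted output, which costs
-- coordinator CPU and memory in proportion to the result size.
SELECT sensor_id,
       date_trunc('hour', ts) AS hour,
       avg(value)             AS avg_value
FROM   sensor_data
GROUP  BY sensor_id, date_trunc('hour', ts)
ORDER  BY sensor_id, hour;
```

This is why a coordinator can become the bottleneck even with only 10-20 connections: a handful of large aggregate or sort queries is enough to saturate it.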