From: Koichi S. <koi...@gm...> - 2012-08-22 14:19:00
Unfortunately, XC does not come with a load balancer. I hope static load
balancing works in your case, that is, implement a connection pooler and
assign a static (different) access point to each thread.

Hope it helps.
----------
Koichi Suzuki

2012/8/22 Nick Maludy <nm...@gm...>:
> Koichi,
>
> Thank you for your insight, I am going to create coordinators on each
> datanode and try to distribute my connections from my nodes evenly.
>
> Does PostgresXC have the ability to automatically load balance my
> connections (say coordinator1 is too loaded, my connection would get
> routed to coordinator2)? Or would I have to do this manually?
>
> Mason,
>
> I've commented inline below.
>
> Thank you both for your input,
> -Nick
>
> On Tue, Aug 21, 2012 at 8:16 PM, Koichi Suzuki <koi...@gm...> wrote:
>>
>> ----------
>> Koichi Suzuki
>>
>> 2012/8/22 Mason Sharp <ma...@st...>:
>> > On Tue, Aug 21, 2012 at 10:44 AM, Nick Maludy <nm...@gm...> wrote:
>> >> All,
>> >>
>> >> I am currently exploring PostgresXC as a clustering solution for a
>> >> project I am working on. The use case is as follows:
>> >>
>> >> - Time series data from multiple sensors
>> >> - Sensors report at various rates, from 50Hz to once every 5 minutes
>> >> - INSERTs (COPYs) on the order of 1000+/s
>> >
>> > This should not be a problem, even for a single PostgreSQL instance.
>> > Nonetheless, I would recommend using COPY when uploading these
>> > batches.
>
> - Yes, our batches of 1000-5000 were working fine with regular Postgres
> on our current load. However, our load is expected to increase next year,
> and my benchmarks showed that regular Postgres couldn't keep up with much
> more than this. I am sorry to mislead you also: these are 5000 messages.
> Some of our messages are quite complex, containing lists of other
> messages which may contain lists of yet more messages, etc. We have put
> these nested lists into separate tables, so saving one message can mean
> numerous inserts into various tables. I can go into more detail later if
> needed.
>
>> >> - No UPDATEs; once the data is in the database we consider it
>> >> immutable
>> >
>> > Nice, no need to worry about update bloat and long vacuums.
>> >
>> >> - Large volumes of data need to be stored (one 50Hz sensor = ~1.5
>> >> billion rows for a year of collection)
>> >
>> > No problem.
>> >
>> >> - SELECTs need to run as quickly as possible for UI and data analysis
>> >> - Number of client connections = 10-20; +95% of the INSERTs are done
>> >> by one node, +99% of the SELECTs are done by the rest of the nodes
>> >
>> > I am not sure what you mean. One client connection is doing 95% of the
>> > inserts? Or 95% of the writes end up on one single data node?
>> >
>> > Same thing with the 99%. Sorry, I am not quite sure I understand.
>
> - We currently only have one node in our network which writes to the
> database, so all of the COPYs come from one libpq client connection.
> There is one small use case where this isn't true, so that's why I said
> 95%, but to simplify things we can say only one node writes to the
> database.
>
> - We have several other nodes which do data crunching and display
> information to users; these nodes do all of the SELECTs.
>
>> >> - Very write-heavy application; reads are not nearly as frequent as
>> >> writes but usually involve large amounts of data.
>> >
>> > Since you said it is sensor data, is it pretty much one large table?
>> > That should work fine for large reads on Postgres-XC. This is sounding
>> > like a good use case for Postgres-XC.
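
A minimal sketch of the COPY-style batch loading Mason recommends above,
assuming psycopg2; the sensor_samples table and its columns are made up
for illustration:

    import io
    import psycopg2

    def copy_batch(conn, rows):
        # Stream one batch of (sensor_id, timestamp, value) tuples through
        # COPY instead of issuing thousands of individual INSERTs.
        buf = io.StringIO()
        for sensor_id, ts, value in rows:
            buf.write("%d\t%s\t%s\n" % (sensor_id, ts, value))
        buf.seek(0)
        with conn.cursor() as cur:
            cur.copy_expert(
                "COPY sensor_samples (sensor_id, ts, value) FROM STDIN", buf)
        conn.commit()

    # Connect to any coordinator; hostname and database name are hypothetical.
    conn = psycopg2.connect(host="coord1", dbname="sensors")
    copy_batch(conn, [(1, "2012-08-22 14:00:00.00", 0.731),
                      (1, "2012-08-22 14:00:00.02", 0.742)])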
> - Our system collects data from several different types of sensors, so we
> have a table for each type, along with tables for our application-specific
> data. I would estimate around 10 tables contain the majority of our data
> currently.
>
>> >> My current cluster configuration is as follows:
>> >>
>> >> Server A: GTM
>> >> Server B: GTM Proxy, Coordinator
>> >> Server C: Datanode
>> >> Server D: Datanode
>> >> Server E: Datanode
>> >>
>> >> My question is: in your documentation you recommend having a
>> >> coordinator at each datanode. What is the rationale for this?
>> >
>> > You don't necessarily need to. If you have a lot of replicated tables
>> > (not distributed), it can help because it just reads locally without
>> > needing to hit up another server. It also ensures an even distribution
>> > of your workload across the cluster.
>> >
>> > The flip side of this is that a dedicated coordinator server can be a
>> > less expensive server compared to the data nodes, so you can consider
>> > price/performance. You can also easily add another dedicated
>> > coordinator if it turns out your coordinator is bottlenecked, though
>> > you could do that with the other configuration as well.
>> >
>> > So, it depends on your workload. If you have 3 data nodes and you also
>> > ran a coordinator process on each and load balanced, 1/3rd of the time
>> > a local read could be done.
>
> - I like your reasoning for having a coordinator on each datanode so we
> can exploit local reads.
> - I have chosen not to have any replicated tables simply because these
> tables are expected to grow extremely large and will be too big to fit on
> one node. My current DISTRIBUTE BY scheme is ROUND ROBIN so the data is
> balanced between all of my nodes.
>
>> >> Do you think it would be appropriate in my situation with so few
>> >> connections?
>> >>
>> >> Would I get better read performance, and not hurt my write
>> >> performance too much (write performance is more important than read)?
>> >
>> > If you have the time, ideally I would test it out and see how it
>> > performs for your workload. From what you described, there may not be
>> > much of a difference.
>>
>> There are a couple of reasons to configure both a coordinator and a
>> datanode in each server.
>>
>> 1) You don't have to worry about load balancing between coordinator
>> and datanode.
>> 2) If target data is located locally, you can save network
>> communication. In the DBT-1 benchmark, this contributes to the overall
>> throughput.
>> 3) More datanodes, better parallelism. If you have four servers of
>> the same spec, you can have four parallel I/Os instead of three.
>>
>> Of course, these depend on your transactions.
>>
>> Regards;
>> ---
>> Koichi Suzuki
>>
>> >> Thanks,
>> >> Nick
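
For reference, a sketch of the DISTRIBUTE BY options discussed above
(hash, round robin, replication), issued through psycopg2. Table and
column names are hypothetical, and the exact spelling of the round-robin
keyword may differ between XC releases:

    import psycopg2

    ddl = [
        # Large, append-only sensor data: spread rows across datanodes and
        # co-locate each sensor's rows on one node.
        """CREATE TABLE sensor_samples (
               sensor_id integer,
               ts        timestamptz,
               value     double precision
           ) DISTRIBUTE BY HASH (sensor_id)""",
        # Alternative used in the thread: pure round robin.
        """CREATE TABLE raw_messages (
               msg_id  bigint,
               payload bytea
           ) DISTRIBUTE BY ROUND ROBIN""",
        # Small lookup tables can be replicated so every coordinator
        # reads them locally.
        """CREATE TABLE sensor_types (
               type_id integer PRIMARY KEY,
               name    text
           ) DISTRIBUTE BY REPLICATION""",
    ]

    conn = psycopg2.connect(host="coord1", dbname="sensors")
    with conn.cursor() as cur:
        for stmt in ddl:
            cur.execute(stmt)
    conn.commit()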
>> >
>> > --
>> > Mason Sharp
>> >
>> > StormDB - http://www.stormdb.com
>> > The Database Cloud - Postgres-XC Support and Service
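
And a minimal sketch of the static load balancing suggested at the top of
the thread: each worker thread is pinned to one coordinator from a fixed
list instead of being routed by a dynamic balancer. Coordinator hostnames,
port, and database name are hypothetical; psycopg2 is assumed:

    import threading
    import psycopg2

    COORDINATORS = ["coord1", "coord2", "coord3"]  # one per datanode host

    def worker(thread_idx):
        # Static assignment: thread i always talks to coordinator i mod N.
        host = COORDINATORS[thread_idx % len(COORDINATORS)]
        conn = psycopg2.connect(host=host, port=5432, dbname="sensors")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT count(*) FROM sensor_samples")
                print(thread_idx, host, cur.fetchone()[0])
        finally:
            conn.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(6)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()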