From: Nick M. <nm...@gm...> - 2012-08-22 13:40:27
Koichi,

Thank you for your insight. I am going to create coordinators on each datanode and try to distribute my connections from my nodes evenly. Does Postgres-XC have the ability to automatically load balance my connections (say, if coordinator1 is too loaded, my connection would get routed to coordinator2)? Or would I have to do this manually?

Mason, I've commented inline below.

Thank you both for your input,
-Nick

On Tue, Aug 21, 2012 at 8:16 PM, Koichi Suzuki <koi...@gm...> wrote:
> ----------
> Koichi Suzuki
>
>
> 2012/8/22 Mason Sharp <ma...@st...>:
> > On Tue, Aug 21, 2012 at 10:44 AM, Nick Maludy <nm...@gm...> wrote:
> >> All,
> >>
> >> I am currently exploring Postgres-XC as a clustering solution for a
> >> project I am working on. The use case is as follows:
> >>
> >> - Time series data from multiple sensors
> >> - Sensors report at various rates, from 50 Hz to once every 5 minutes
> >> - INSERTs (COPYs) on the order of 1000+/s
> >
> > This should not be a problem, even for a single PostgreSQL instance.
> > Nonetheless, I would recommend using COPY when uploading these
> > batches.

- Yes, our batches of 1000-5000 were working fine with regular Postgres under our current load. However, our load is expected to increase next year, and my benchmarks showed that regular Postgres couldn't keep up with much more than this. I am sorry if I misled you: these are 5000 messages. Some of our messages are quite complex, containing lists of other messages, which may in turn contain lists of yet more messages, etc. We have put these nested lists into separate tables, so saving one message can mean numerous inserts into various tables; I can go into more detail later if needed.

> >> - No UPDATEs; once the data is in the database we consider it immutable
> >
> > Nice, no need to worry about update bloat and long vacuums.
> >
> >> - Large volumes of data need to be stored (one 50 Hz sensor = ~1.5
> >> billion rows for a year of collection)
> >
> > No problem.
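The COPY-based batching recommended above can be sketched on the client side. This is a minimal sketch, assuming a hypothetical `sensor_data(sensor_id, ts, value)` table: it renders a batch of rows into the tab-separated text format that COPY accepts, so the whole batch goes to the server in one command. With psycopg2, the resulting buffer could be passed to `cursor.copy_from(buf, "sensor_data")` instead of issuing per-row INSERTs.

```python
# Sketch: batch sensor rows into a single COPY text payload.
# Table and column names are hypothetical, not from the thread.
import io

def make_copy_buffer(rows):
    """Render (sensor_id, ts, value) tuples as tab-separated COPY text."""
    lines = ["%s\t%s\t%s" % (sid, ts, val) for sid, ts, val in rows]
    return io.StringIO("\n".join(lines) + "\n")

# A tiny two-row batch from a 50 Hz sensor (20 ms between samples):
batch = [
    (1, "2012-08-21 10:00:00.000", 0.52),
    (1, "2012-08-21 10:00:00.020", 0.53),
]
buf = make_copy_buffer(batch)
payload = buf.getvalue()
# With psycopg2 this buffer would be sent via:
#   cursor.copy_from(buf, "sensor_data")
```

One design note: because the data is immutable once inserted, large COPY batches carry no update-conflict risk, so batch size can be tuned purely for throughput.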
> >> - SELECTs need to run as quickly as possible for UI and data analysis
> >> - Number of client connections = 10-20; 95%+ of the INSERTs are done by
> >> one node, 99%+ of the SELECTs are done by the rest of the nodes
> >
> > I am not sure what you mean. One client connection is doing 95% of the
> > inserts? Or 95% of the writes ends up on one single data node?
> >
> > Same thing with the 99%. Sorry, I am not quite sure I understand.

- We currently have only one node in our network that writes to the database, so all of the COPYs come from one libpq client connection. There is one small use case where this isn't true, which is why I said 95%, but to simplify things we can say only one node writes to the database.
- We have several other nodes that do data crunching and display information to users; these nodes do all of the SELECTs.

> >> - Very write-heavy application; reads are not nearly as frequent as
> >> writes but usually involve large amounts of data.
> >
> > Since you said it is sensor data, is it pretty much one large table?
> > That should work fine for large reads on Postgres-XC. This is sounding
> > like a good use case for Postgres-XC.

- Our system collects data from several different types of sensors, so we have a table for each type, along with tables for our application-specific data. I would estimate around 10 tables contain the majority of our data currently.

> >> My current cluster configuration is as follows:
> >>
> >> Server A: GTM
> >> Server B: GTM Proxy, Coordinator
> >> Server C: Datanode
> >> Server D: Datanode
> >> Server E: Datanode
> >>
> >> My question is: in your documentation you recommend having a
> >> coordinator at each datanode; what is the rationale for this?
> >
> > You don't necessarily need to. If you have a lot of replicated tables
> > (not distributed), it can help because it just reads locally without
> > needing to hit up another server.
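On the load-balancing question raised at the top of this message: Postgres-XC does not route sessions between coordinators by itself, so distributing the 10-20 client connections evenly is normally done in the client or through an external pooler such as pgpool-II. A minimal client-side sketch, with made-up coordinator host names (a real client would open its libpq/psycopg2 connection to whatever address `next_coordinator()` returns):

```python
# Sketch: client-side round-robin assignment of new sessions to
# coordinators. Host names below are hypothetical examples.
from itertools import cycle

COORDINATORS = [
    ("coord1.example.com", 5432),
    ("coord2.example.com", 5432),
    ("coord3.example.com", 5432),
]

_rotation = cycle(COORDINATORS)

def next_coordinator():
    """Return the next coordinator (host, port) in round-robin order."""
    return next(_rotation)

# Six new sessions land on the three coordinators in turn:
assigned = [next_coordinator() for _ in range(6)]
```

A pooler adds health checks and failover on top of this; the pure round-robin above only guarantees even distribution while all coordinators are up.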
It also ensures an even distribution
> > of your workload across the cluster.
> >
> > The flip side is that a dedicated coordinator server can be a less
> > expensive server compared to the datanodes, so you can consider
> > price/performance. You can also easily add another dedicated
> > coordinator if it turns out your coordinator is bottlenecked, though
> > you could do that with the other configuration as well.
> >
> > So, it depends on your workload. If you have 3 datanodes and you also
> > ran a coordinator process on each and load balanced, 1/3rd of the time
> > a local read could be done.

- I like your reasoning for having a coordinator on each datanode so we can exploit local reads.
- I have chosen not to have any replicated tables, simply because these tables are expected to grow extremely large and will be too big to fit on one node. My current DISTRIBUTE BY scheme is ROUND ROBIN, so the data is balanced across all of my nodes.

> >> Do you think it would be appropriate in my situation with so few
> >> connections?
> >>
> >> Would I get better read performance, and not hurt my write performance
> >> too much (write performance is more important than read)?
> >
> > If you have the time, ideally I would test it out and see how it
> > performs for your workload. From what you described, there may not be
> > much of a difference.
>
> There are a couple of reasons to configure both a coordinator and a
> datanode on each server:
>
> 1) You don't have to worry about load balancing between the coordinator
> and the datanode.
> 2) If the target data is located locally, you can save network
> communication. In the DBT-1 benchmark, this contributes to the overall
> throughput.
> 3) More datanodes, better parallelism. If you have four servers of
> the same spec, you can have four parallel I/Os instead of three.
>
> Of course, these depend on your transaction.
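The ROUND ROBIN scheme Nick mentions is declared per table in Postgres-XC's CREATE TABLE extension. A sketch with a hypothetical sensor table (the table name and columns are assumptions, not from the thread):

```sql
-- Hypothetical sensor table; DISTRIBUTE BY ROUND ROBIN spreads rows
-- evenly across all datanodes, matching the write-heavy, no-UPDATE
-- workload described above.
CREATE TABLE sensor_reading (
    sensor_id  integer          NOT NULL,
    ts         timestamptz      NOT NULL,
    value      double precision
) DISTRIBUTE BY ROUND ROBIN;

-- Alternatives: DISTRIBUTE BY HASH(sensor_id) would co-locate one
-- sensor's rows on a single datanode (helping per-sensor reads), and
-- DISTRIBUTE BY REPLICATION suits small lookup tables only.
```

Round robin maximizes write balance but means every datanode participates in most reads; hash distribution trades some balance for more local reads when queries filter on the distribution column.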
> Regards;
> ---
> Koichi Suzuki
>
> So, if you can have
>
> >> Thanks,
> >> Nick
> >>
> >> ------------------------------------------------------------------------------
> >> Live Security Virtual Conference
> >> Exclusive live event will cover all the ways today's security and
> >> threat landscape has changed and how IT managers can respond. Discussions
> >> will include endpoint security, mobile security and the latest in malware
> >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> >> _______________________________________________
> >> Postgres-xc-general mailing list
> >> Pos...@li...
> >> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
> >
> >
> > --
> > Mason Sharp
> >
> > StormDB - http://www.stormdb.com
> > The Database Cloud - Postgres-XC Support and Service