From: Jean-Michel P. - G. <jm...@go...> - 2015-08-12 08:57:33
Dear Lionel,

Thanks for your answers. I agree with most of them, except on the indexing issue. I am completely opposed to database clusters; they do not work and never will.

PostgreSQL (and other databases) works like this:

* Each SQL query is parsed and analysed before execution. The SQL optimizer rewrites the query and executes the best "guessed" query plan. A query plan is the sequence of steps needed to execute the query.

* There are basically two kinds of steps in a query plan: index lookups (essentially in memory) and disc lookups.

* Disc lookups cost more CPU cycles. If your table "from_awl" has 1,000,000 entries and you have to go through all records without an index, it will take at least 1,000,000 CPU cycles on an SSD disc and ten times more on a SATA disc. Conversely, an index lets the database pick up exactly the records it needs, so if the result contains only, say, 5 records and an index is used, it costs only about 5 CPU cycles. The cycle counts are approximate, but the idea is there. Even if a table has a unique index on its OID, that will not speed up searches on other columns. You need an index on EACH column that is used in queries where speed matters.

* If we take the from_awl table as an example, there are only two indexes: the main OID and lastseen. We need more (the effect can be checked with EXPLAIN; see the sketch after this message):

CREATE INDEX from_awl_sendername ON from_awl USING btree (sender_name);
CREATE INDEX from_awl_senderdomain ON from_awl USING btree (sender_domain);
CREATE INDEX from_awl_src ON from_awl USING btree (src);
CREATE INDEX from_awl_fseen ON from_awl USING btree (first_seen);

A nice PostgreSQL server with enough memory and an SSD disc can handle a huge number of queries; I think 100,000 queries per second should not be a problem. If you need to reconnect for each query, you can create a dedicated user with read-only access to the tables and allow authentication without a password in PostgreSQL (a sketch follows below).

There are also bottlenecks on the client side. I did not read the Perl code, but all data searching should be done at the SQL level; avoid searching in Perl code.

To know the real performance of a database, you can log slow queries. In PostgreSQL this is done by setting, in postgresql.conf:

log_min_duration_statement = 30

This is 30 milliseconds. It should be possible to write a small script that populates the database with millions of records and see how it behaves under high load (a sketch of such a script follows below). No SQL query should run for more than 30 ms.

Kind regards,
Jean-Michel
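
As an illustration of the query-plan point above, here is a minimal sketch of how the plan for a typical from_awl lookup can be inspected; the column names are taken from the CREATE INDEX statements in the message, and the literal values are invented:

-- Inspect the plan and the real execution time for one lookup.
EXPLAIN ANALYZE
SELECT *
  FROM from_awl
 WHERE sender_name   = 'john'
   AND sender_domain = 'example.com'
   AND src           = '192.0.2.1';

-- Without the extra indexes the plan reports a "Seq Scan on from_awl"
-- whose cost grows with the table size; with the btree indexes in place
-- it reports an "Index Scan" (or "Bitmap Index Scan") instead.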
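
For the dedicated read-only user, a minimal sketch could look like this; the role name sqlgrey_ro and the database name sqlgrey are assumptions for illustration, not names taken from the project:

-- Create a login role that can only read the tables.
CREATE ROLE sqlgrey_ro LOGIN;
GRANT CONNECT ON DATABASE sqlgrey TO sqlgrey_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sqlgrey_ro;

-- In pg_hba.conf, a "trust" entry limited to the local host then lets
-- this role connect without a password:
-- host  sqlgrey  sqlgrey_ro  127.0.0.1/32  trust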
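
The slow-query threshold can also be set without editing postgresql.conf by hand; this sketch assumes PostgreSQL 9.4 or later, where ALTER SYSTEM is available:

ALTER SYSTEM SET log_min_duration_statement = 30;  -- milliseconds
SELECT pg_reload_conf();                           -- apply without a restart

-- Every statement slower than 30 ms is then written to the server log
-- together with its duration.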
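
Finally, a sketch of the kind of test script mentioned above; it assumes the from_awl column names used in this message (sender_name, sender_domain, src, first_seen, last_seen) and should be adjusted to the real table definition:

-- Populate from_awl with a million synthetic rows.
INSERT INTO from_awl (sender_name, sender_domain, src, first_seen, last_seen)
SELECT 'user' || n,
       'domain' || (n % 50000) || '.example',
       '10.0.' || (n % 256) || '.' || ((n / 256) % 256),
       now() - (n % 86400) * interval '1 second',
       now()
  FROM generate_series(1, 1000000) AS n;

-- With the indexes in place, a lookup on sender_domain should then stay
-- well under the 30 ms threshold even with a million rows:
SELECT count(*) FROM from_awl WHERE sender_domain = 'domain42.example';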