From: Jean-Michel P. - G. <jm...@go...> - 2015-08-12 08:57:33
Dear Lionel,

Thanks for your answers. I agree with most of them, except on the indexing issue. I am completely opposed to database clusters; they do not work and never will.

PostgreSQL (and other databases) works like this:

* Each SQL query is parsed and analysed before execution. The SQL optimizer rewrites the query and executes the best "guessed" query plan. A query plan is the sequence of steps needed to execute the query.

* There are basically two kinds of steps in a query plan: index lookups (essentially in memory) and disc lookups.

* Disc lookups cost more CPU cycles. If your table "from_awl" has 1,000,000 entries and you have to go through all records without an index, it will take at least 1,000,000 CPU cycles on an SSD disc and ten times more on a SATA disc. Conversely, an index lets the database pick up exactly the records it needs, so if the result contains only, say, 5 records and an index is used, it costs only about 5 CPU cycles. The cycle counts are approximate, but the idea is there. Even if a table has a unique index on its OID, that will not speed up searches on other columns. You need an index on EACH column that is used in queries where speed matters.

* If we take the from_awl table as an example, there are only two indexes: the main OID and lastseen. We need more (the effect can be checked with EXPLAIN; see the sketch after this message):

CREATE INDEX from_awl_sendername ON from_awl USING btree (sender_name);
CREATE INDEX from_awl_senderdomain ON from_awl USING btree (sender_domain);
CREATE INDEX from_awl_src ON from_awl USING btree (src);
CREATE INDEX from_awl_fseen ON from_awl USING btree (first_seen);

A nice PostgreSQL server with enough memory and an SSD disc can handle a huge number of queries; I think 100,000 queries per second should not be a problem. If you need to reconnect for each query, you can create a dedicated user with read-only access to the tables and allow authentication without a password in PostgreSQL (a sketch follows below).

There are also bottlenecks on the client side. I did not read the Perl code, but all data searching should be done at the SQL level; avoid searching in Perl code.

To know the real performance of a database, you can log slow queries. In PostgreSQL this is done by setting, in postgresql.conf:

log_min_duration_statement = 30

This is 30 milliseconds. It should be possible to write a small script that populates the database with millions of records and see how it behaves under high load (a sketch of such a script follows below). No SQL query should run for more than 30 ms.

Kind regards,
Jean-Michel
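
As an illustration of the query-plan point above, here is a minimal sketch of how the plan for a typical from_awl lookup can be inspected; the column names are taken from the CREATE INDEX statements in the message, and the literal values are invented:

-- Inspect the plan and the real execution time for one lookup.
EXPLAIN ANALYZE
SELECT *
  FROM from_awl
 WHERE sender_name   = 'john'
   AND sender_domain = 'example.com'
   AND src           = '192.0.2.1';

-- Without the extra indexes the plan reports a "Seq Scan on from_awl"
-- whose cost grows with the table size; with the btree indexes in place
-- it reports an "Index Scan" (or "Bitmap Index Scan") instead.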
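
For the dedicated read-only user, a minimal sketch could look like this; the role name sqlgrey_ro and the database name sqlgrey are assumptions for illustration, not names taken from the project:

-- Create a login role that can only read the tables.
CREATE ROLE sqlgrey_ro LOGIN;
GRANT CONNECT ON DATABASE sqlgrey TO sqlgrey_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sqlgrey_ro;

-- In pg_hba.conf, a "trust" entry limited to the local host then lets
-- this role connect without a password:
-- host  sqlgrey  sqlgrey_ro  127.0.0.1/32  trust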
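
The slow-query threshold can also be set without editing postgresql.conf by hand; this sketch assumes PostgreSQL 9.4 or later, where ALTER SYSTEM is available:

ALTER SYSTEM SET log_min_duration_statement = 30;  -- milliseconds
SELECT pg_reload_conf();                           -- apply without a restart

-- Every statement slower than 30 ms is then written to the server log
-- together with its duration.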
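
Finally, a sketch of the kind of test script mentioned above; it assumes the from_awl column names used in this message (sender_name, sender_domain, src, first_seen, last_seen) and should be adjusted to the real table definition:

-- Populate from_awl with a million synthetic rows.
INSERT INTO from_awl (sender_name, sender_domain, src, first_seen, last_seen)
SELECT 'user' || n,
       'domain' || (n % 50000) || '.example',
       '10.0.' || (n % 256) || '.' || ((n / 256) % 256),
       now() - (n % 86400) * interval '1 second',
       now()
  FROM generate_series(1, 1000000) AS n;

-- With the indexes in place, a lookup on sender_domain should then stay
-- well under the 30 ms threshold even with a million rows:
SELECT count(*) FROM from_awl WHERE sender_domain = 'domain42.example';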