From: Jean-Michel P. - G. <jm...@go...> - 2015-08-11 20:45:46
Dear Lionel,

First, I would like to congratulate you for bringing SQLgrey to the community. I just discovered SQLgrey and it solved my spam problems. Together with SPF validation, DNSBL and other filtering techniques, I now have nearly zero spam.

I am running Debian and I am a big fan of PostgreSQL. The automatic population of the database and the SQL code using prepared queries were a very nice surprise. Previously, I tested other greylisting solutions and they don't have the simplicity and power of SQLgrey. I am surprised to see that some SQLgrey databases have been running for years without incident.

There might be modifications needed to withstand "huge" loads under PostgreSQL. For example, I am thinking of as much as 10,000 queries per second. This does not happen very often, I agree.

First, I think it would be interesting to:

* Use the special inet type for IPs and subnets (sketched just after this message):
http://www.postgresql.org/docs/current/static/datatype-net-types.html

I believe that using the inet type should be compatible with the current code in a fresh database.

SELECT '216.34.181'::cidr   => "216.34.181.0/24"
SELECT '216.34.181.6'::inet => "216.34.181.6"

* The IP4r type could also be used for managing ranges of servers and spammers. IP4r reported 6,000 queries per second on complex projects:
https://www.siafoo.net/article/53

* Add indexes on the domain_awl and from_awl tables. The advantage of CIDR is that it indexes very well. Indexes allow the PostgreSQL planner to build clean search plans, working mainly on indexes and reducing disk access to what should be close to zero.

Just my 2 cents!

Kind regards,
Jean-Michel
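As a rough illustration of the inet/cidr suggestion above, a minimal sketch might look like this. The table and index names are invented for the example (they are not part of SQLgrey), and the inet_ops GiST operator class assumes PostgreSQL 9.4 or later:

-- Illustrative only: a simplified AWL-style table keyed on a cidr network.
CREATE TABLE awl_cidr_example (
    network    cidr NOT NULL PRIMARY KEY,
    first_seen timestamp NOT NULL DEFAULT now()
);

-- The casts mentioned above:
SELECT '216.34.181'::cidr;    -- => 216.34.181.0/24
SELECT '216.34.181.6'::inet;  -- => 216.34.181.6

-- Containment lookups ("which stored network contains this address?") can be
-- served by a GiST index using inet_ops (PostgreSQL 9.4+):
CREATE INDEX awl_cidr_example_net ON awl_cidr_example USING gist (network inet_ops);
SELECT network FROM awl_cidr_example WHERE network >>= '216.34.181.6'::inet;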
From: Lionel B. <lio...@bo...> - 2015-08-11 22:28:40
Le 11/08/2015 22:26, Jean-Michel Pouré - GOOZE a écrit :
> Dear Lionel,
>
> First, I would like to congratulate you for bringing SQLgrey to the
> community. I just discovered SQLgrey and it solved my spam problems.

Happy to be of service. Numerous others have helped too (see the CONTRIB file).

> Together with SPF validation, DNSBL and other filtering techniques, I
> now have nearly zero spam.
>
> I am running Debian and I am a big fan of PostgreSQL. The automatic
> population of the database and the SQL code using prepared queries were
> a very nice surprise.
>
> Previously, I tested other greylisting solutions and they don't have
> the simplicity and power of SQLgrey. I am surprised to see that some
> SQLgrey databases have been running for years without incident.

This is a testament to the robustness of both the database systems and SQLgrey.

> There might be modifications needed to withstand "huge" loads under
> PostgreSQL. For example, I am thinking of as much as 10,000 queries per
> second. This does not happen very often, I agree.

In theory you should be able to reach these levels with PostgreSQL easily, even in write-intensive situations like spam storms, if you use enterprise-level SSDs to store your database. I have not tested SQLgrey itself at these levels though (you may need to distribute the workload over multiple instances).

> First, I think it would be interesting to:
>
> * Use the special inet type for IPs and subnets:
> http://www.postgresql.org/docs/current/static/datatype-net-types.html
>
> I believe that using the inet type should be compatible with the current
> code in a fresh database.
>
> SELECT '216.34.181'::cidr   => "216.34.181.0/24"
> SELECT '216.34.181.6'::inet => "216.34.181.6"

I considered this very early (and in fact I ran some code using these datatypes the very first day I started developing SQLgrey). PostgreSQL is by far my favourite RDBMS, but I wanted SQLgrey to avoid being a PostgreSQL-only project, so I refrained from using proprietary datatypes.

The performance difference should be negligible: a string representation doesn't have much storage overhead (probably <3x) and allows us to use a very simple (and fast) string operation in Perl and plain equality for matching class C networks using the src column. It is used like this with greymethod "smart" (the default) or "classc". Using a cidr datatype could be faster, but I'm not aware of a single SQLgrey installation where this would make a difference. Switching to the cidr datatype could probably give a minor speedup in PostgreSQL, but it would stop SQLgrey from supporting SQLite and MySQL/MariaDB (or make that support complex) and would probably bring a minor slowdown in the Perl part.

> * The IP4r type could also be used for managing ranges of servers and
> spammers. IP4r reported 6,000 queries per second on complex projects:
> https://www.siafoo.net/article/53

I'm not familiar with the IP4r module, but I guess that my answer above applies to it too.

> * Add indexes on the domain_awl and from_awl tables.

They are already indexed: the primary key automatically adds a unique index which should be used by all queries (at least it was designed to).

> The advantage of CIDR is that it indexes very well.

The string index is very efficient too; the only real drawback is the storage. If you have a SQLgrey database where shaving half of the storage requirement would be beneficial, the cidr datatype might be interesting to look at, but I seriously doubt that such a database exists.

Best regards,

Lionel
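To make the string-equality point above concrete, a sketch of the two approaches might look like this. The literal values are placeholders and the exact queries SQLgrey issues are not shown in this thread, so treat this as an illustration of the idea rather than the real code:

-- String approach described above: with greymethod "smart" or "classc" the
-- class C network ends up as text in the src column and is matched by plain
-- equality, which the composite primary key index serves.
SELECT 1
  FROM from_awl
 WHERE src = '216.34.181'
   AND sender_domain = 'example.com'
   AND sender_name = 'info';

-- Hypothetical cidr alternative: store the network itself and match on the
-- cidr value instead (this is the variant discussed and set aside above).
-- WHERE src = '216.34.181.0/24'::cidr AND ...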
From: Karl O. P. <ko...@me...> - 2015-08-12 00:03:24
On Wed, 12 Aug 2015 00:28:30 +0200 Lionel Bouton <lio...@bo...> wrote:

> Le 11/08/2015 22:26, Jean-Michel Pouré - GOOZE a écrit :
> > There might be modifications needed to withstand "huge" loads under
> > PostgreSQL. For example, I am thinking of as much as 10,000 queries per
> > second. This does not happen very often, I agree.
>
> In theory you should be able to reach these levels with PostgreSQL
> easily, even in write-intensive situations like spam storms, if you
> use enterprise-level SSDs to store your database. I have not tested
> SQLgrey itself at these levels though (you may need to distribute the
> workload over multiple instances).

You could always throw hardware at it. The latest Postgres (at least) has multi-master capability.

Karl <ko...@me...>
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
From: Jean-Michel P. - G. <jm...@go...> - 2015-08-12 08:57:33
Dear Lionel,

Thanks for your answers. I agree with most of them, except on the indexing issue. I am completely opposed to database clusters; they do not work and never will.

PostgreSQL (and other databases) work like this:

* Each SQL query is parsed and analysed before execution. The SQL optimizer rewrites the query and executes the best "guessed" query plan. A query plan is the set of steps needed to execute the query.

* There are basically two kinds of steps in a query plan: index lookups (essentially in memory) and disk lookups.

* Disk lookups cost more CPU cycles. If your from_awl table has 1,000,000 entries and you need to go through all records without indexing, it will take you at least 1,000,000 CPU cycles on an SSD and ten times more on a SATA disk. Conversely, indexes allow the database to pick up exactly the needed records. So if there are, say, only 5 records in the result and you are using indexes, it will eat up only 5 CPU cycles. The number of CPU cycles is approximate, but the idea behind it is there. Even if a table has a unique index on its OID, it will not speed up a search on the table. You need an index on EACH column which is used in queries where speed matters.

* If we take the example of the from_awl table, there are only two indexes: the main OID and last_seen. We need more:

CREATE INDEX from_awl_sendername ON from_awl USING btree (sender_name);

CREATE INDEX from_awl_senderdomain ON from_awl USING btree (sender_domain);

CREATE INDEX from_awl_src ON from_awl USING btree (src);

CREATE INDEX from_awl_fseen ON from_awl USING btree (first_seen);

A nice PostgreSQL server with enough memory and an SSD can handle a huge number of queries. I think 100,000 queries per second should not be a problem. If you need to reconnect for each query, you can create a dedicated user with read-only access to the tables and allow authentication without a password in PostgreSQL.

There are also bottlenecks on the client side. I did not read the Perl code, but all data searching should be done at the SQL level, avoiding searches in Perl code.

To know the real performance of a database, it is possible to log slow queries. In PostgreSQL this is done by setting, in postgresql.conf:

log_min_duration_statement = 30

This is 30 milliseconds. It should be possible to create a small script to populate the database with millions of records and see how it behaves under high load (a sketch follows this message). No SQL query should take more than 30 ms.

Kind regards,
Jean-Michel
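A rough sketch of such a test script, assuming the from_awl layout shown later in this thread and purely synthetic data (the names, domains and addresses below are made up):

-- postgresql.conf, as suggested above: log any statement slower than 30 ms.
-- log_min_duration_statement = 30

-- Generate one million synthetic AWL rows.
INSERT INTO from_awl (sender_name, sender_domain, src, first_seen, last_seen)
SELECT 'user' || i,
       'domain' || (i % 50000) || '.example',
       '10.' || ((i / 65536) % 256) || '.' || ((i / 256) % 256),
       now() - (i || ' seconds')::interval,
       now()
  FROM generate_series(1, 1000000) AS g(i);

ANALYZE from_awl;

-- Time a representative lookup and check it stays well under 30 ms.
EXPLAIN ANALYZE
SELECT 1
  FROM from_awl
 WHERE src = '10.0.0'
   AND sender_domain = 'domain1.example'
   AND sender_name = 'user1';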
From: Jean-Michel P. - G. <jm...@go...> - 2015-08-12 10:10:50
Another issue, about:

CREATE INDEX connect_idx ON "connect" USING btree (src, sender_domain, sender_name);

This is an index on three columns. It is effective only if the SQL query includes these three columns. It can be very fast, but it should not be considered the same as THREE different indexes. I agree this is not very well explained in the PostgreSQL documentation, but all databases work the same way, and this index is only effective in a limited number of queries.

Anyway, I don't have enough data to start a performance test. I will write another message on the list to launch a small testing project.

Kind regards,
Jean-Michel
From: Lionel B. <lio...@bo...> - 2015-08-12 12:52:37
Le 12/08/2015 10:57, Jean-Michel Pouré - GOOZE a écrit :
> [...]
> * If we take the example of the from_awl table, there are only two indexes:
> the main OID

No:

sqlgrey=> \d from_awl
              Table "public.from_awl"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 sender_name   | character varying(64)       | not null
 sender_domain | character varying(255)      | not null
 src           | character varying(39)       | not null
 first_seen    | timestamp without time zone | not null
 last_seen     | timestamp without time zone | not null
Indexes:
    "from_awl_pkey" PRIMARY KEY, btree (src, sender_domain, sender_name)
    "from_awl_lseen" btree (last_seen)

There is no index on OID in the SQLgrey database.

> and last_seen. We need more:
>
> CREATE INDEX from_awl_sendername ON from_awl USING btree (sender_name);
>
> CREATE INDEX from_awl_senderdomain ON from_awl USING btree (sender_domain);
>
> CREATE INDEX from_awl_src ON from_awl USING btree (src);
>
> CREATE INDEX from_awl_fseen ON from_awl USING btree (first_seen);

You don't need these indexes, and switching from from_awl_pkey to them would slow SQLgrey down. Run EXPLAIN on any query done by SQLgrey and you should see that it uses the from_awl_pkey index (if it doesn't, your database is probably not in a sane state).

Best regards,

Lionel
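For anyone wanting to verify this on their own installation, the check Lionel suggests could look roughly like the following (the literal values are placeholders for a real AWL entry):

EXPLAIN
SELECT 1
  FROM from_awl
 WHERE src = '216.34.181'
   AND sender_domain = 'example.com'
   AND sender_name = 'info';

-- On a healthy database the plan should show an Index Scan (or Index Only
-- Scan) using from_awl_pkey, i.e. the composite primary key index already
-- covers this access pattern and no extra per-column indexes are needed.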
From: Jean-Michel P. - G. <jm...@go...> - 2015-08-12 13:01:28
Le mercredi 12 août 2015 à 14:52 +0200, Lionel Bouton a écrit :
> You don't need these indexes, and switching from from_awl_pkey to them
> would slow SQLgrey down. Run EXPLAIN on any query done by SQLgrey and
> you should see that it uses the from_awl_pkey index (if it doesn't,
> your database is probably not in a sane state).

Okay, thanks. I am not sure that an index on several columns is useful as a primary index:

"from_awl_pkey" PRIMARY KEY, btree (src, sender_domain, sender_name)

This is not the same as three indexes. It can be very fast when a query is run on all three columns; otherwise, it should be slow. I will try to create a fake database and will report later.

Kind regards,
Jean-Michel
From: Lionel B. <lio...@bo...> - 2015-08-12 12:57:04
Le 12/08/2015 12:10, Jean-Michel Pouré - GOOZE a écrit :
> Another issue, about:
>
> CREATE INDEX connect_idx ON "connect" USING btree (src, sender_domain, sender_name);
>
> This is an index on three columns. It is effective only if the SQL
> query includes these three columns.

No, that's not the case. It is also used when only the first, or the first and second, of these columns appear in a query.

> It can be very fast, but it should
> not be considered the same as THREE different indexes. I agree this is
> not very well explained in the PostgreSQL documentation, but all
> databases work the same way, and this index is only effective in a
> limited number of queries.

The order of the columns in this index was chosen so that all SQLgrey queries can be sped up by it. Having three indexes instead of one would be slower both when updating or inserting data in the database and when selecting records based on conditions on multiple columns, as SQLgrey does.

You only have to use EXPLAIN on queries done by SQLgrey on your database to verify this.

Best regards,

Lionel.
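A quick way to observe the leading-column behaviour described above, using the connect_idx definition quoted earlier (the literal values are placeholders):

-- A condition on the leading column alone can still use the composite index:
EXPLAIN SELECT * FROM "connect" WHERE src = '216.34.181';

-- So can a condition on the first two columns:
EXPLAIN SELECT * FROM "connect"
 WHERE src = '216.34.181' AND sender_domain = 'example.com';

-- Only a condition that skips the leading column (for example, filtering on
-- sender_name alone) is unable to use connect_idx efficiently.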
From: Jean-Michel P. - G. <jm...@go...> - 2015-08-12 13:05:22
Le mercredi 12 août 2015 à 14:56 +0200, Lionel Bouton a écrit :
> You only have to use EXPLAIN on queries done by SQLgrey on your
> database to verify this.

Sorry, I replied before reading this. I am convinced; your index is very interesting indeed.

Kind regards,
Jean-Michel