From: Holger S. <lis...@li...> - 2016-01-26 17:51:10
|
Hi all. I am working on a small private project which also involves sqlgrey integration. Lately, I was playing around with the data and just came to the point I was before: The tables do not have real primary keys. Did anyone try to run sqlgrey with an altered database schema, where the ID is generic and generated for each entry? Also, is sqlgrey still in development? Just asking if it is “worth” raising an Feature Request :) Any help / hints are highly appreciated. Cheers, Holger |
From: Jean-Michel P. - G. <jm...@go...> - 2016-02-08 21:03:37
Attachments:
smime.p7s
|
Le mardi 26 janvier 2016 à 17:31 +0000, Holger Steinhauer a écrit : > I am working on a small private project which also involves sqlgrey > integration. Lately, I was playing around with the data and just came > to the point I was before: The tables do not have real primary keys. > Did anyone try to run sqlgrey with an altered database schema, where > the ID is generic and generated for each entry? > Also, is sqlgrey still in development? Just asking if it is “worth” > raising an Feature Request :) > > Any help / hints are highly appreciated. I posted the same message on Tue, 11 Aug 2015 22:26:05 +0200 Just my 2 cents, I guess this is to allow huge loads, maybe hundred thousands of queries every second. Kind regards, Jean-Michel |
From: Karl O. P. <ko...@me...> - 2016-02-08 22:10:05
|
On Mon, 08 Feb 2016 21:47:17 +0100 Jean-Michel Pouré - GOOZE <jm...@go...> wrote: > Le mardi 26 janvier 2016 à 17:31 +0000, Holger Steinhauer a écrit : > > I am working on a small private project which also involves sqlgrey > > integration. Lately, I was playing around with the data and just > > came to the point I was before: The tables do not have real primary > > keys. Did anyone try to run sqlgrey with an altered database > > schema, where the ID is generic and generated for each entry? > > Also, is sqlgrey still in development? Just asking if it is “worth” > > raising an Feature Request :) You'd have to put forward a use-case. Otherwise it's wasted storage. Karl <ko...@me...> Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein |
From: Lionel B. <lio...@bo...> - 2016-02-08 23:30:15
|
Hi, I didn't answer the original message at first because : - I believed I reacted on a similar subject several years ago, - the message title doesn't actually make sense to me. I suspect some people are coming here with a basic ORM background and don't fully understand the difference between primary keys and object ids (PKs/OIDs in the following). So here is a complete answer.. Le 08/02/2016 23:09, Karl O. Pinc a écrit : > On Mon, 08 Feb 2016 21:47:17 +0100 > Jean-Michel Pouré - GOOZE <jm...@go...> wrote: > >> Le mardi 26 janvier 2016 à 17:31 +0000, Holger Steinhauer a écrit : >>> I am working on a small private project which also involves sqlgrey >>> integration. Lately, I was playing around with the data and just >>> came to the point I was before: The tables do not have real primary >>> keys. Sure they have at least for tables where it makes sense. The notable exception is "connect" because enforcing unicity with a PK would both slow down access to this very time-sensitive table and force SQLgrey to handle collisions which it really doesn't need to. I suspect that what you call "real primary keys" are in fact OIDs (not in the "internal database value" sense, but in the "ORM object id" sense). From the top of my head, OIDs are only useful in the following cases : - identifying objects that don't have any obvious primary key (for example an user on a web application could want to change his login and not break all relations with the data linked to him/her), - simplifying ORMs code and especially how they manage foreign keys (some ORMs can handle composite PKs though), - making indexes on foreign keys accesses faster. We don't use any ORM or foreign keys in SQLgrey and all the objects we manage are uniquely defined with invariable properties so we are free to avoid OIDs as they would serve no purpose at all and just slow us down. >>> Did anyone try to run sqlgrey with an altered database >>> schema, where the ID is generic and generated for each entry? I'm not sure why you would do such a thing. >>> >>> Also, is sqlgrey still in development? Just asking if it is “worth” >>> raising an Feature Request :) There hasn't been any development for a while but I'm not aware of any bug surfacing for a long time either. AFAIK it just works. I'm not likely to put much effort in it. 99% of my developments are in Ruby now and while Perl isn't bad it's not as comfortable for me. So the latest changes have been implemented by other developers. I thought about rewriting it in Ruby some years ago. This would have been an opportunity to make the code more readable (I think I could do a better job on this front) but as I said, it just works (I still use it myself on several personal and corporate servers). > You'd have to put forward a use-case. Otherwise it's wasted storage. Not only would it be wasted storage but it would slow down the database too (dead weight slows you down, for example not being able to keep the whole database in RAM because of the additional weight would have a significant impact on large sites). Jean-Michel Pouré seemed to imply that it could have performance benefits bat I can't see how. There's not a single access to the database that could even use an OID, every object must be accessed based on conditions on its attributes (parts of email addresses, which are the only information relevant to SQLgrey's database a mail server can transmit to it). Best regards, Lionel |
From: Jean-Michel P. - G. <jm...@go...> - 2016-02-09 18:41:11
Attachments:
smime.p7s
|
Le mardi 09 février 2016 à 00:14 +0100, Lionel Bouton a écrit : > Jean-Michel Pouré seemed to imply that it could have performance > benefits bat I can't see how. There's not a single access to the > database that could even use an OID, every object must be accessed > based > on conditions on its attributes (parts of email addresses, which are > the > only information relevant to SQLgrey's database a mail server can > transmit to it). No I don't meant that. As it is the database structure is good. |
From: Lionel B. <lio...@bo...> - 2016-02-08 23:30:19
|
Le 09/02/2016 00:14, Lionel Bouton a écrit : > Hi, > > I didn't answer the original message at first because : > - I believed I reacted on a similar subject several years ago, > - the message title doesn't actually make sense to me. > > I suspect some people are coming here with a basic ORM background and > don't fully understand the difference between primary keys and object > ids (PKs/OIDs in the following). So here is a complete answer.. If people want to dig more OIDs are in fact surrogate keys (by opposition to natural keys which are used in SQLgrey). There's a ton of documentation/discussion on the subject. Best regards, Lionel |
From: incredibleh0lg <lis...@li...> - 2016-02-09 13:38:34
|
Hi. > Le 09/02/2016 00:14, Lionel Bouton a écrit : >> Hi, >> >> I didn't answer the original message at first because : >> - I believed I reacted on a similar subject several years ago, >> - the message title doesn't actually make sense to me. >> >> I suspect some people are coming here with a basic ORM background and >> don't fully understand the difference between primary keys and object >> ids (PKs/OIDs in the following). So here is a complete answer.. > > If people want to dig more OIDs are in fact surrogate keys (by > opposition to natural keys which are used in SQLgrey). There's a ton of > documentation/discussion on the subject. I understand that this can bring up discussions. But I would disagree with the term OIDs (and thanks for the kind words about my “basic" ORM background...). Basically one can choose between Natural Keys and Generated / Artificial Keys. In my experience natural keys are always not the best choice. In every real world commercial project you will end up in changing your keys, because the natural candidates that are so unchangeable, do change. So generic keys are - at least for me - always the safest way to go. One can still use generic primary keys and define constraints that the combination of local part and domain have to be unique. Regarding the speed part: It is a completely valid point to decide to not use real primary keys to improve the performance. But in a normal database design you can achieve this by having views or pre-aggregated tables - or use one of the NoSQL databases. Another point made in this discussion was to give a use case: Third party applications using the data in the database. In my case, I’d like to provide a similar functionality as SQLgrey Webinterface provides. So present the user with relevant entries - e.g. in the connect table - and let him decide what to do with it. In the current format this means, that the request has to carry the WHOLE entry as parameter to get it moved or deleted. This is pretty wasteful and it is the reason why I asked if somebody tested a slightly adjusted schema where generated primary keys are added by the database.In theory, this shouldn’t break the application. Last but not least, the argument with “wasted space” is insignificant today. The only concern I have about this is that this can be an issue for servers with huge traffic, as the keys can get pretty big quite quickly. So having this as an optional feature would be the best. As there is no real further development, it is maybe a good exercise to reimplement it with support for NoSQL databases (which, btw, use generic keys too). But thanks for the feedback anyway - sqlgrey is a really reliable tool which is doing great work for years now. So please do not see this is critic, it was more a feature request / question about experiences :) Cheers, Holger |
From: Lionel B. <lio...@bo...> - 2016-02-09 17:28:43
|
Hi, Le 09/02/2016 14:38, incredibleh0lg a écrit : > [...] > Another point made in this discussion was to give a use case: Third party applications using the data in the database. In my case, I’d like to provide a similar functionality as SQLgrey Webinterface provides. So present the user with relevant entries - e.g. in the connect table - and let him decide what to do with it. I don't think it's a good example because you definitely don't want to modify the connect table content in the back of SQLgrey : this is the table it uses to implement the basic greylisting algorithm. So if you want to modify the connect table you actually want to modify the greylisting algorithm. It seems to me that the two sane approaches if that's what you really want would be to : - implement your changes in SQLgrey itself, not in a separate application, - replace SQLgrey completely or fork it to implement your own algorithm if what you want to do is very different or solves a specific problem which isn't commonly encountered by SQLgrey uers. The same limitations apply to the config/from_awl/domain_awl tables (awl means *automatic* white list), if you try to modify them behind SQLgrey's back you are actually designing a new greylisting program. The only tables I created for external configuration are the opt[in|out]_[email|domain] tables. > In the current format this means, that the request has to carry the WHOLE entry as parameter to get it moved or deleted. In fact the connect table is a special case and doesn't even guarantee that each entry is unique so you can't event select a single entry deterministically. The tables that are supposed to be managed externally only have a single column : so the "WHOLE" entry is actually a single value which is obviously known by any application which would be designed to insert or delete entries (update doesn't make sense for these tables)... See below if "obviously" is not a given for you. > This is pretty wasteful and it is the reason why I asked if somebody tested a slightly adjusted schema where generated primary keys are added by the database.In theory, this shouldn’t break the application. It wouldn't break SQLgrey but it would make developing the application managing SQLgrey's configuration tables more complex and bug-prone (see a detailed explanation below). I think you are probably approaching the SQLgrey database and an external application's database as a single entity (see why I think so below too), which I don't think is a good way to approach the external configuration of SQLgrey. In a single database for any moderately complex application surrogate keys are nearly always the best choice for various reasons. For a database dedicated to a trivial service like SQLgrey this is clearly not the case. Let's use a practical example to demonstrate. Take the optout_email table (the other configuration tables are similar: all have a single column) and create an application managing its content (creating entries when users want to optout and deleting them when they want to activate greylisting again). If you use surrogate keys in the optout_email table the first question you have to ask yourself is : does my application separate database's user objects store : - user emails, - references to the surrogate keys in optout_email, - both and what is used among them to reference the optout_email entries. You made clear that don't want to use the user emails to reference optout_email entries so the only 2 choices for you remaining are the last ones. But actually if you store the references to surrogate keys you will most probably want to keep a copy of the email address too. Not only because it would be convenient in your application which would probably access the Postfix user database (or a database used to create the Postfix one which obviously must have the email addresses itself) but also because optout_email won't have the email if the user didn't choose to opt out. If you want to avoid storing the address at all costs you would make the application and Postfix depend on a SQLgrey change to store all your users' email addresses to solve this. This is not SQLgrey's role: it's a greylisting service, not an email user database. If you want to create your user database you obviously won't try to do so by installing a greylisting service. In conclusion there's no real choice once you have chosen to use surrogate keys : you will have to store both the email and the reference to the surrogate key in the external database in one way or another. So now if you really want to use the surrogate keys instead of the email itself you have to maintain consistency between your reference and the email addresses stored in two separate databases with 2 different codebases managed by 2 different sets of people. This is clearly (I hope so anyway) not something you want to do *at all*. By avoiding surrogate keys SQLgrey actually forces a potential separate database to be inherently consistent with its own database. In case this isn't crystal clear, as an example of the consistency problems consider these common scenarios : - admins create entries manually until enough users ask for direct access and install the configuration application, - admins use batches to create initial entries or maintain a subset of the optout_email table in parallel to the configuration application, these batches probably aren't aware of the existence of the configuration application and don't know how to update its database, - after a crash, on restoring the 2 separate databases (SQLgrey and configuration application) you can get a more recent SQLgrey database with entries not (yet) referenced by the other restored database version, - ... All these scenario imply that there will be entries in optout_email not yet referenced by the configuration application. When creating new users with addresses already in optout_email, how will you create the references to these entries ? The only way is to check that for each user you create in your application there isn't already an entry in the optout_email table with the same email address. So for new users your reference to the optout_email table is the email address and for existing users the reference is the surrogate key : it's bad design and a loud and clear invitation to bugs. Now let's see how you would design the application without surrogate keys. The application has its user objects with their own email attribute. - when you want to activate optout, you use your application's email attribute to find or create an entry in the optout_email table if needed, - when you want to deactivate it, you find and delete the entry with the same email, - when you change the email address of a user, you remove the old entry (if it exists) and create a new one if applicable. This is *trivial* logic. An environment where this is difficult to do would have to prevent you to make the most basic SQL queries imaginable. Why should we even consider a change only useful to people starting to develop external tools by shooting themselves in the foot when choosing their development environment ? Best regards, Lionel |
From: incredibleh0lg <lis...@li...> - 2016-02-09 22:11:03
|
Hi. Thanks @Lionel for the long explanation. I really like to discuss this :) So let’s go into it. > On 9 Feb 2016, at 17:28, Lionel Bouton <lio...@bo...> wrote: > > Hi, > > Le 09/02/2016 14:38, incredibleh0lg a écrit : >> [...] >> Another point made in this discussion was to give a use case: Third party applications using the data in the database. In my case, I’d like to provide a similar functionality as SQLgrey Webinterface provides. So present the user with relevant entries - e.g. in the connect table - and let him decide what to do with it. > > I don't think it's a good example because you definitely don't want to > modify the connect table content in the back of SQLgrey : this is the > table it uses to implement the basic greylisting algorithm. So if you > want to modify the connect table you actually want to modify the > greylisting algorithm. It seems to me that the two sane approaches if > that's what you really want would be to : > - implement your changes in SQLgrey itself, not in a separate application, > - replace SQLgrey completely or fork it to implement your own algorithm > if what you want to do is very different or solves a specific problem > which isn't commonly encountered by SQLgrey uers. > > The same limitations apply to the config/from_awl/domain_awl tables (awl > means *automatic* white list), if you try to modify them behind > SQLgrey's back you are actually designing a new greylisting program. > > The only tables I created for external configuration are the > opt[in|out]_[email|domain] tables. Ok, my bad (or more: my English skills :)) I see and treat SQLgrey’s database as what it is - a database used by an external tool. The “relevant entries” is something I still try to get my head around, as I will have to implement some logic *outside* SQLgrey (+ it’s database) to filter entries. This is definitely not SQLgrey’s job. It is and will always be the greylisting solution of my choice. My “problem”, I try to solve, is to provide some kind of integrated UI where users can manage their own Greylisting settings (opting out, whitelisting one specific email etc.) SQLgrey should do what it is really good at: greylisting. > >> In the current format this means, that the request has to carry the WHOLE entry as parameter to get it moved or deleted. > > In fact the connect table is a special case and doesn't even guarantee > that each entry is unique so you can't event select a single entry > deterministically. > The tables that are supposed to be managed externally only have a single > column : so the "WHOLE" entry is actually a single value which is > obviously known by any application which would be designed to insert or > delete entries (update doesn't make sense for these tables)... See below > if "obviously" is not a given for you. Fair enough. I was just thinking about an option to get an optional surrogate key (to stick with your terminology - every day a new word / term :D) to make it easier to address an entry. As far as I understand the logic for whitelisting an email currently “sitting in” greylisting, we have to create a corresponding entry with the data from the ‘connect' table within the ‘from_awl’ table and delete the entry in ‘connect’. At the moment I would have to submit 4 different values. As the entries in ‘connect’ are not necessarily unique, a surrogate key would make this easier. But life isn’t always easy, is it :) > >> This is pretty wasteful and it is the reason why I asked if somebody tested a slightly adjusted schema where generated primary keys are added by the database.In theory, this shouldn’t break the application. > > It wouldn't break SQLgrey but it would make developing the application > managing SQLgrey's configuration tables more complex and bug-prone (see > a detailed explanation below). > > I think you are probably approaching the SQLgrey database and an > external application's database as a single entity (see why I think so > below too), which I don't think is a good way to approach the external > configuration of SQLgrey. See above, I wasn’t really specific enough. So sorry again :) > > In a single database for any moderately complex application surrogate > keys are nearly always the best choice for various reasons. For a > database dedicated to a trivial service like SQLgrey this is clearly not > the case. Agreed. Just the point I tried to make in my previous email. > > Let's use a practical example to demonstrate. > Take the optout_email table (the other configuration tables are similar: > all have a single column) and create an application managing its content > (creating entries when users want to optout and deleting them when they > want to activate greylisting again). > If you use surrogate keys in the optout_email table the first question > you have to ask yourself is : does my application separate database's > user objects store : > - user emails, > - references to the surrogate keys in optout_email, > - both > and what is used among them to reference the optout_email entries. > > You made clear that don't want to use the user emails to reference > optout_email entries so the only 2 choices for you remaining are the > last ones. > But actually if you store the references to surrogate keys you will most > probably want to keep a copy of the email address too. Not only because > it would be convenient in your application which would probably access > the Postfix user database (or a database used to create the Postfix one > which obviously must have the email addresses itself) but also because > optout_email won't have the email if the user didn't choose to opt out. > If you want to avoid storing the address at all costs you would make the > application and Postfix depend on a SQLgrey change to store all your > users' email addresses to solve this. This is not SQLgrey's role: it's a > greylisting service, not an email user database. If you want to create > your user database you obviously won't try to do so by installing a > greylisting service. Well, I wouldn’t mind to use the email address - although this can be tricky in itself, depending on how the data gets transfered (http-encoded email addresses can be a pain, esp. with + addresses, which I personally use extensively). I absolutely get your point and agree 100%. My application should not “hijack” SQLgrey’s schema (and was never planned to do it). Otherwise it would be worth implementing my own greylisting service. Which would be an interesting challenge in itself, but not very high on my long ToDo list… > > In conclusion there's no real choice once you have chosen to use > surrogate keys : you will have to store both the email and the reference > to the surrogate key in the external database in one way or another. > > So now if you really want to use the surrogate keys instead of the email > itself you have to maintain consistency between your reference and the > email addresses stored in two separate databases with 2 different > codebases managed by 2 different sets of people. This is clearly (I hope > so anyway) not something you want to do *at all*. By avoiding surrogate > keys SQLgrey actually forces a potential separate database to be > inherently consistent with its own database. > > In case this isn't crystal clear, as an example of the consistency > problems consider these common scenarios : > - admins create entries manually until enough users ask for direct > access and install the configuration application, > - admins use batches to create initial entries or maintain a subset of > the optout_email table in parallel to the configuration application, > these batches probably aren't aware of the existence of the > configuration application and don't know how to update its database, > - after a crash, on restoring the 2 separate databases (SQLgrey and > configuration application) you can get a more recent SQLgrey database > with entries not (yet) referenced by the other restored database version, > - ... > All these scenario imply that there will be entries in optout_email not > yet referenced by the configuration application. > When creating new users with addresses already in optout_email, how will > you create the references to these entries ? The only way is to check > that for each user you create in your application there isn't already an > entry in the optout_email table with the same email address. So for new > users your reference to the optout_email table is the email address and > for existing users the reference is the surrogate key : it's bad design > and a loud and clear invitation to bugs. Absolutely agreed. And on this special occasion, I think it is absolutely fine to not normalise the data. It is simply not necessary or helpful. The complete opposite is the case: this would definitely have an performance impact. > > Now let's see how you would design the application without surrogate keys. > The application has its user objects with their own email attribute. > - when you want to activate optout, you use your application's email > attribute to find or create an entry in the optout_email table if needed, > - when you want to deactivate it, you find and delete the entry with the > same email, > - when you change the email address of a user, you remove the old entry > (if it exists) and create a new one if applicable. > This is *trivial* logic. An environment where this is difficult to do > would have to prevent you to make the most basic SQL queries imaginable. > > Why should we even consider a change only useful to people starting to > develop external tools by shooting themselves in the foot when choosing > their development environment ? Couldn’t agree more. Well, I think it was worth asking anyway. Thanks again for the very detailed answer. All the best, Holger |