From: Matthias A. <mat...@gm...> - 2010-05-19 21:48:40
|
Am 19.05.2010, 20:30 Uhr, schrieb Rainer Weikusat: > "Matthias Andree" <mat...@gm...> writes: > > [...] > >> please do (show an advance/preview version) - if you don't want to >> expose it to the list yet for whatever reason, but want my comments, >> feel free to send personal mail. > > I'll happily listen to anyone who has something more constructive to > say about this than "I hate it!". This is my last 'standalone test > program'. It can be compiled and used to do testing/ measurement of > the trie-code based on reading 'unique message IDs' from stdin which > must point to 'something seekable'. The interface is known to be > incomplete and the halfway integrated version contains everything in > here and whatever was additionally needed in order to use this as > 'backend' for the code in uid.c. My present approach to doing lookups > based on 'message numbers' is very simplistic --- just an array going > from 0 .. <highest message number>. There's a potential scalability > issue in there: Realloc with a constant increment is O(n^2) when the > memory area needs to be copied. If the pop3.c code works as I believe > it does (inserts message numbers in increasing order), I will likely Hi Rainer, thanks a lot! Regarding what you're writing: We know the message count in advance, because we query it (through STAT, in pop3_getrange()), so we have a good guess of how many messages there are, and we need very few realloc(). Look for *countp (a pointer passed by the caller) to get the message count. Message numbers are also ephemeral (i. e. not saved to disk), in case that helps. > change this to additionally store the offset of the first element in > the array to keep this shorter for the case when a large mailbox > contains a few new messages at the end. I will also change the > realloc-calculation to something more sensible (usually, this means > doubling the allocated area every time it needs to grow. > > Any feedback on the implementation would be very welcome. Looks quite alright at first glance. As a general remark, I wonder if it would be worthwhile splitting the PATRICIA core code from the fetchmail interfacing code. gets(DEST) is a no-go for me - please use fgets(DEST, sizeof(DEST), stdin), even if it's just in test code. I've often wasted time debugging such test code because it wasn't half as robust as the non-test code. Could I also ask that you spell out implicit "int" types, i. e. change unsigned to unsigned int? Implicit int is obsolecscent usage. It is valid C89, but causes C99 and C++ compiler complaints. Please spell out "unsigned int", or use uintfast_t or something like that. I wonder if using a typedef for the integers in the .h file would be useful. To give a feeling for compiler verbosity, I'm including compiler diagnostics from three compilers below, omitting warnings for unused variables, and more comments inline. ======================================================================================= GCC 4.4: ======================================================================================= $ gcc -O -Wc++-compat -std=c99 -Wall -Wextra -pedantic \ -o uidl_db uidl_db.c -Wno-unused -Dinline=__inline__ # -Dinline for -std=c89 :-) uidl_db.c: In function ‘init_uidl_db’: uidl_db.c:63: warning: request for implicit conversion from ‘void *’ to ‘struct uidl_record **’ not permitted in C++ uidl_db.c: At top level: uidl_db.c:69: warning: ISO C does not allow extra ‘;’ outside of a function uidl_db.c: In function ‘walk_down’: uidl_db.c:83: warning: comparison between signed and unsigned integer expressions uidl_db.c:85: warning: comparison between signed and unsigned integer expressions uidl_db.c:93: warning: request for implicit conversion from ‘void *’ to ‘struct pat_node *’ not permitted in C++ uidl_db.c: In function ‘first_set_bit_in_char’: uidl_db.c:100: warning: implicit declaration of function ‘ffs’ uidl_db.c: In function ‘get_pat_node’: uidl_db.c:152: warning: request for implicit conversion from ‘void *’ to ‘struct pat_node *’ not permitted in C++ uidl_db.c: In function ‘get_uidl_record’: uidl_db.c:242: warning: request for implicit conversion from ‘void *’ to ‘struct uidl_record *’ not permitted in C++ uidl_db.c:245: warning: request for implicit conversion from ‘void *’ to ‘char *’ not permitted in C++ uidl_db.c: In function ‘insert_into_records’: uidl_db.c:262: warning: request for implicit conversion from ‘void *’ to ‘struct uidl_record **’ not permitted in C++ uidl_db.c: In function ‘set_uidl_num’: uidl_db.c:299: warning: request for implicit conversion from ‘void *’ to ‘struct uidl_record **’ not permitted in C++ uidl_db.c: In function ‘find_uidl_by_id’: uidl_db.c:325: warning: comparison between signed and unsigned integer expressions uidl_db.c:327: warning: comparison between signed and unsigned integer expressions uidl_db.c:314: warning: ‘v’ may be used uninitialized in this function /tmp/cc6WZFTd.o: In function `main': uidl_db.c:(.text+0x602): warning: the `gets' function is dangerous and should not be used. I'm not so concerned about the extra ; or the implicit conversions, but I am concerned about the signedness mismatches. The ffs declaration is harmless and fixed with "#include <strings.h>". ======================================================================================= Intel C++ 11.1: ======================================================================================= $ icc -O -w2 -Wdeprecated -Wall -Wcheck -Wp64 -diag-disable 177,1418 -std=c99 -o uidl_db uidl_db.c uidl_db.c(69): remark #424: extra ";" ignored }; ^ uidl_db.c(92): warning #1684: conversion from pointer to same-sized integral type (potential portability problem) offsetof(struct pat_node, ptrs_[2]) ^ uidl_db.c(93): warning #1684: conversion from pointer to same-sized integral type (potential portability problem) : offsetof(struct pat_node, ptrs_[0]))); ^ uidl_db.c(100): warning #266: function "ffs" declared implicitly return ffs(v) - 1; ^ uidl_db.c(119): remark #2259: non-pointer conversion from "int" to "unsigned char" may lose significant bits v = r0->id[ofs] ^ r1->id[ofs]; ^ uidl_db.c(244): warning #2259: non-pointer conversion from "size_t={unsigned long}" to "unsigned int" may lose significant bits id_len = strlen(id); ^ uidl_db.c(317): warning #2259: non-pointer conversion from "size_t={unsigned long}" to "unsigned int" may lose significant bits len = strlen(id); ^ /tmp/iccQw73Ea.o: In function `main': uidl_db.c:(.text+0x6d): warning: the `gets' function is dangerous and should not be used. ======================================================================================= And finally clang 2.7: ======================================================================================= $ clang -pedantic -Wall -O3 -std=c99 -o uidl_db uidl_db.c uidl_db.c:69:2: warning: extra ';' outside of a function }; ^ uidl_db.c:92:6: warning: using extended field designator is an extension [-pedantic] offsetof(struct pat_node, ptrs_[2]) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from uidl_db.c:11: In file included from /usr/include/stdlib.h:33: /usr/local/lib/clang/2.0/include/stddef.h:48:24: note: instantiated from: #define offsetof(t, d) __builtin_offsetof(t, d) ^ uidl_db.c:93:8: warning: using extended field designator is an extension [-pedantic] : offsetof(struct pat_node, ptrs_[0]))); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from uidl_db.c:11: In file included from /usr/include/stdlib.h:33: /usr/local/lib/clang/2.0/include/stddef.h:48:24: note: instantiated from: #define offsetof(t, d) __builtin_offsetof(t, d) ^ uidl_db.c:100:12: warning: implicit declaration of function 'ffs' is invalid in C99 [-Wimplicit-function-declaration] return ffs(v) - 1; ^ ... /tmp/cc-GdRjxd.o: In function `main': uidl_db.c:(.text+0x779): warning: the `gets' function is dangerous and should not be used. ======================================================================================= Further comments inlined below. ======================================================================================= (sorry for the awkward quoting, that's an Opera M2 issue -- if someone has an "automatic mailing list folder/virtual folder generation tool" add-on that behaves similar to Opera so that I don't need to do the 100 lists manually, I'd be grateful and switch soonish.) > ------------- uidl_db.h ------------------- > /* > uidl db > > Copyright (c) 2010 MAD Partners, Ltd. > > This file is being published according to the terms of the > GPLv2 license contained in the 6.3.17 release of fetchmail. > > */ > #ifndef fetchmail_uidl_db_h > #define fetchmail_uidl_db_h > > /* includes */ > #include <stddef.h> > > /* types */ > struct uidl_record { > char *id; > unsigned id_len; > unsigned num, status, pos; > > struct uidl_record *next; > }; > > struct pat_node; > > struct uidl_db > { > struct pat_node *pat_root; > struct uidl_record **records; > unsigned records_max, records_next; What's the difference between _max and _next? Should probably be documented in the .h file later, although it seems quite clear from the init_uid_db() function. > struct uidl_record **num_ndx; > unsigned num_ndx_max; > }; > > /* routines */ > void init_uidl_db(struct uidl_db *db); I wonder if it would be good to start all functions with uidl_* consistently, for namespace purity. > > struct uidl_record *insert_uidl(struct uidl_db *db, > char const *id, unsigned status); > > struct uidl_record *insert_uidl_noindex(struct uidl_db *db, > char const *id, > unsigned status); > > void set_uidl_num(struct uidl_db *db, struct uidl_record *rec, > unsigned num); > > struct uidl_record *find_uidl_by_id(struct uidl_db *db, char const *id); > > static inline struct uidl_record * > find_uidl_by_num(struct uidl_db *db, unsigned num) > { > return num < db->num_ndx_max ? db->num_ndx[num] : NULL; > } > static inline struct uidl_record * > find_uidl_by_pos(struct uidl_db *db, unsigned pos) > { > return pos < db->records_next ? db->records[pos] : NULL; > } > > static inline struct uidl_record * > first_uidl_in_db(struct uidl_db *db, char const *id) > { > return find_uidl_by_id(db, id); > } > struct uidl_record *last_uidl_in_db(struct uidl_db *db, char const *id); > > #endif > --------------------------------------- > > ------------------- uidl_db.c ----------------- > /* > uidl db > > Copyright (c) 2010 MAD Partners, Ltd. > > This file is being published according to the terms of the > GPLv2 license contained in the 6.3.17 release of fetchmail. > */ > > /* includes */ > #include <stdlib.h> > #include <stdio.h> > #include <string.h> Please add, for ffs: #include <strings.h> > #include "uidl_db.h" > > /* constants */ > enum { > MIN_RECORDS = 16 > }; > > /* types */ > struct pat_node { > struct pat_node *ptrs_[3]; > unsigned bit_ndx, bit_mask; > > struct uidl_record *rec; > }; > > /* > The idea behind this is that the 'left' pointer is at -1 and > the 'right' pointer at 1. This implies that dealing with 'symmetric > cases' for left and right pointers isn't necessary. There's always > just 'this pointer' (offset n) and 'the other pointer' (offset -n). > */ > #define ptrs(p) ((p)->ptrs_ + 1) > > /* routines */ > void *xmalloc(size_t len) > { > return malloc(len); > } > > void *xrealloc(void *p, size_t new_size) > { > return realloc(p, new_size); > } > > static inline unsigned bit_ofs(unsigned bit_ndx) > { > return bit_ndx >> 3; > } > > static inline unsigned bit_mask(unsigned bit_ndx) > { > return 1 << (bit_ndx & 7); > } > > void init_uidl_db(struct uidl_db *db) > { > db->pat_root = NULL; > db->records = xmalloc(MIN_RECORDS * sizeof(*db->records)); > db->records_max = MIN_RECORDS; > db->records_next = 0; > > db->num_ndx = NULL; > db->num_ndx_max = 0; > }; > > static struct pat_node *walk_down(struct uidl_db *db, struct uidl_record > *rec, > struct pat_node ***edgep, struct pat_node **parentp) > { > struct pat_node *cur, **edge; > unsigned bit_ndx, v; > int ofs; > > cur = db->pat_root; > ofs = -1; > do { > bit_ndx = cur->bit_ndx; > > if (bit_ofs(bit_ndx) != ofs) { > ofs = bit_ofs(bit_ndx); > v = ofs < rec->id_len ? rec->id[ofs] : 0; > } > > edge = ptrs(cur) + (v & cur->bit_mask ? 1 : -1); > } while ((cur = *edge) && cur->bit_ndx > bit_ndx); > > *parentp = (void *)((char *)edge - (v & bit_mask(bit_ndx) ? > offsetof(struct pat_node, ptrs_[2]) > : offsetof(struct pat_node, ptrs_[0]))); > *edgep = edge; > return cur; > } > > static inline unsigned first_set_bit_in_char(unsigned v) > { > return ffs(v) - 1; > } > > static int find_first_diff_bit(struct uidl_record const *r0, > struct uidl_record const *r1) > { > struct uidl_record const *long_id; > unsigned ofs, max; > unsigned char v; > > max = r0->id_len; > if (max > r1->id_len) { > max = r1->id_len; > long_id = r0; > } else > long_id = r1; > > ofs = 0; > do > v = r0->id[ofs] ^ r1->id[ofs]; > while (!v && ++ofs < max); > > if (!v) { > if (r0->id_len == r1->id_len) return -1; > v = long_id->id[ofs]; > } > > return first_set_bit_in_char(v) + ofs * 8; > } > > static inline int record_id_equal(struct uidl_record const *r0, > struct uidl_record const *r1) > { > return > r0->id_len == r1->id_len > && memcmp(r0->id, r1->id, r0->id_len) == 0; > } > > static struct uidl_record *append_to_list(struct uidl_record **recp, > struct uidl_record *rec) > { > while (*recp) recp = &(*recp)->next; > *recp = rec; > rec->next = NULL; > return rec; > } > > static struct pat_node *get_pat_node(struct uidl_record *rec) > { > struct pat_node *np; > > np = xmalloc(sizeof(*np)); > np->rec = rec; > rec->next = NULL; > return np; > } > > static struct pat_node *get_standalone_node(struct uidl_record *rec) > { > struct pat_node *np; > > np = get_pat_node(rec); > np->bit_ndx = first_set_bit_in_char(*rec->id); > np->bit_mask = bit_mask(np->bit_ndx); > return np; > } > > static inline unsigned bit_set(unsigned bit_ndx, struct uidl_record > const *rec) > { > unsigned ofs; > > ofs = bit_ofs(bit_ndx); > if (ofs >= rec->id_len) return 0; > return rec->id[ofs] & bit_mask(bit_ndx); > } > > static inline struct pat_node *walk_up(unsigned diff_ndx, struct > pat_node **parent) > { > struct pat_node *p, *np; > > np = *parent; > while ((p = *ptrs(np)) && p->bit_ndx > diff_ndx) > np = p; > > *parent = p; > return np; > } > > static struct uidl_record *pat_insert(struct uidl_db *db, > struct uidl_record *rec) > { > struct pat_node *np, *closest, *parent, **edge; > int me, bit_ndx; > > if (!db->pat_root) { > np = get_standalone_node(rec); > ptrs(np)[-1] = *ptrs(np) = NULL; > ptrs(np)[1] = np; > > db->pat_root = np; > return rec; > } > > closest = walk_down(db, rec, &edge, &parent); > > if (closest) { > bit_ndx = find_first_diff_bit(closest->rec, rec); > if (bit_ndx < 0) > return append_to_list(&closest->rec->next, rec); > > np = get_pat_node(rec); > np->bit_ndx = bit_ndx; > np->bit_mask = bit_mask(bit_ndx); > } else > np = get_standalone_node(rec); > > if (parent->bit_ndx > np->bit_ndx) { > closest = walk_up(np->bit_ndx, &parent); > > if (!parent) edge = &db->pat_root; > else edge = ptrs(parent)[-1] == closest ? > ptrs(parent) - 1 : ptrs(parent) + 1; > *ptrs(closest) = np; > } > *edge = np; > *ptrs(np) = parent; > > me = bit_set(np->bit_ndx, rec) ? 1 : -1; > ptrs(np)[me] = np; > ptrs(np)[-me] = closest; > > return rec; > } > > static struct uidl_record *get_uidl_record(char const *id, unsigned > status) > { > struct uidl_record *rec; > unsigned id_len; > > rec = xmalloc(sizeof(*rec)); > > id_len = strlen(id); > rec->id = memcpy(xmalloc(id_len + 1), id, id_len + 1); > rec->id_len = id_len; > rec->status = status; > rec->num = 0; > > return rec; > } > > static struct uidl_record *insert_into_records(struct uidl_db *db, > struct uidl_record *rec) > { > unsigned next, want; > > next = db->records_next; > if (next == db->records_max) { > want = db->records_max *= 2; > db->records = xrealloc(db->records, want * sizeof(rec)); > } > > rec->pos = next; > db->records[next] = rec; > db->records_next = next + 1; > > return rec; > } > > struct uidl_record *insert_uidl(struct uidl_db *db, > char const *id, unsigned status) > { > struct uidl_record *rec; > > rec = insert_uidl_noindex(db, id, status); > return pat_insert(db, rec); > } > > struct uidl_record *insert_uidl_noindex(struct uidl_db *db, > char const *id, unsigned status) > { > struct uidl_record *rec; > > rec = get_uidl_record(id, status); > return insert_into_records(db, rec); > } > > void set_uidl_num(struct uidl_db *db, struct uidl_record *rec, > unsigned num) > { > unsigned want, have; > > have = db->num_ndx_max; > if (num >= have) { > want = num + MIN_RECORDS; > db->num_ndx = xrealloc(db->num_ndx, want * sizeof(*db->num_ndx)); > db->num_ndx_max = want; > > do db->num_ndx[--want] = NULL; while (want > have); > } > > db->num_ndx[num] = rec; > rec->num = num; > } > > struct uidl_record *find_uidl_by_id(struct uidl_db *db, char const *id) > { > static unsigned step_sum, searches; > struct pat_node *np; > struct uidl_record *rec; > unsigned v, bit_ndx, len; > int ofs; > len = strlen(id); > np = db->pat_root; > if (np) { > ofs = -1; > do { > bit_ndx = np->bit_ndx; > > if (bit_ofs(bit_ndx) != ofs) { > ofs = bit_ofs(bit_ndx); > v = ofs < len ? id[ofs] : 0; > } > > np = ptrs(np)[v & np->bit_mask ? 1 : -1]; > } while (np && np->bit_ndx > bit_ndx); > > if (!np) return NULL; > > rec = np->rec; > return rec->id_len == len && memcmp(id, rec->id, len) == 0 ? > rec : NULL; > } > > #if 0 > report(stderr, GT_("id search in unindexed db\n")); > #endif > > ofs = db->records_next; > while (ofs) { > rec = db->records[--ofs]; > if (rec->id_len == len && memcmp(rec->id, id, len) == 0) > return rec; > } > > return 0; > } > > struct uidl_record *last_uidl_in_db(struct uidl_db *db, char const *id) > { > struct uidl_record *rec; > > rec = find_uidl_by_id(db, id); > if (!rec) return NULL; > > while (rec->next) rec = rec->next; > return rec; > } > > static void free_uidl_list(struct uidl_record *rec) > { > if (rec->next) free_uidl_list(rec->next); > > free(rec->id); > free(rec); > } > > static void free_pat_trie(struct pat_node *np) > { > struct pat_node *child; > > child = ptrs(np)[-1]; > if (child && child->bit_ndx > np->bit_ndx) free_pat_trie(child); > > child = ptrs(np)[1]; > if (child->bit_ndx > np->bit_ndx) free_pat_trie(child); > > free_uidl_list(np->rec); > free(np); > } > > void free_uidl_db(struct uidl_db *db) > { > struct uidl_record *rec; > unsigned ndx; > if (db->pat_root) free_pat_trie(db->pat_root); > > free(db->records); > free(db->num_ndx); > } > > /* test code */ > int main(void) > { > struct uidl_db db; > struct uidl_record *rec; > unsigned n; > char buf[80]; > init_uidl_db(&db); > while (gets(buf)) insert_uidl(&db, buf, 0); > > n = 1000; > while (n--) { > fseek(stdin, 0, 0); > while (gets(buf)) { > rec = find_uidl_by_id(&db, buf); > if (!rec || strcmp(rec->id, buf) != 0) > fprintf(stderr, "not found\n"); > } > } > return 0; > } > --------------------------------------------- -- Matthias Andree |