From: Gerald B. <ger...@gm...> - 2011-01-15 17:37:31
On Sat, Jan 15, 2011 at 12:31 PM, Jérôme <rom...@ya...> wrote:
> Gerald,
>
> Is it not possible that bsddb hash collisions slow down import?
>
> I suppose I have a DB synchronization issue! [1]
> Is there anything like fsync()?
> http://www.bash-linux.com/unix-man-fsync.html
>
> Why are the pseudo-random keys (output) not used on all tables, or not
> visibly used (import/export) on some tables [2]?

Not sure I understand the question. All eight object types use BSDDB
hashed indices on their handles. However, if you're lucky, you might see
them come in the same order sometimes, I suppose. Not that you could
count on that.

> [1] http://www.gramps-project.org/bugs/view.php?id=4428
> [2] http://www.gramps-project.org/bugs/view.php?id=4365
>
> thanks,
> Jérôme
>
> Gerald Britton wrote:
>>
>> The handle key is a bsddb hash, thus stored randomly (pseudo-randomly).
>> So export, import, export will not likely produce two identical
>> exports.
>>
>> OTOH, you could run the xml file through xml query and use the "order"
>> clause to sort it.
>>
>> On 1/15/11, jerome <rom...@ya...> wrote:
>>>
>>> Sorry, it was not a request for sorting handles or use of an index.
>>> It was just the only way to get Gramps more idempotent with the
>>> Gramps XML file format, which is not the bsddb.
>>>
>>> True, the title is about iteration and the sort argument, but it was an
>>> illustration related to the ExportXml.py module.
>>>
>>> ImportXml.py (or something else) is rewriting some family objects and
>>> re-ordering events, notes, places, etc. after a simple Gramps XML
>>> round-trip. This cannot affect bsddb performance, right?
>>>
>>> To get a list or to iterate was only one method to try to get Gramps more
>>> idempotent after a simple Gramps XML round-trip without any DB change
>>> done by the user.
>>>
>>> $ gramps -i import.gramps -e export.gramps
>>>
>>> As a user, I thought "import.gramps = export.gramps".
>>>
>>> If it is related to bsddb performance, then this makes me think that
>>> Gramps should not use this XML parser!!!
>>> As Gramps uses ImportXml.py, something might be wrong here ...
>>> If writing XML data to bsddb is not able to keep the order, then
>>> something is strange, because person handles seem to keep the order!
>>>
>>> PS: there is also a performance issue on Gramps XML import (slowdown:
>>> python 2.6/7, bsddb?)
>>>
>>> Jérôme
>>>
>>> --- On Sat 15.1.11, Benny Malengier <ben...@gm...> wrote:
>>>
>>> From: Benny Malengier <ben...@gm...>
>>> Subject: Re: [Gramps-devel] self.db.iter_object_handles(sort_handles=True)
>>> To: "Doug Blank" <dou...@gm...>
>>> Cc: rom...@ya..., gra...@li..., "Gerald Britton" <ger...@gm...>
>>> Date: Saturday 15 January 2011, 16:13
>>>
>>> 2011/1/15 Doug Blank <dou...@gm...>
>>>
>>> On Sat, Jan 15, 2011 at 8:44 AM, Benny Malengier <ben...@gm...> wrote:
>>>
>>>> We should _never_ order on export.
>>>> We should only access things via an index in the database.
>>>
>>> Benny,
>>>
>>> If I understand what you mean, you mean don't sort export by something
>>> *other* than an index. As long as we have an index to sort by, then we
>>> are fine, right? Or did you mean something else?
>>>
>>> For 300000 people, not following the index would be best, as then you
>>> don't hit a database page twice.
>>> When you follow an index, you will jump over your database pages, and the
>>> data is too large to stay in memory.
>>>
>>> It depends on the bsddb structure whether following a sorted record key
>>> has this effect or not.
>>>
>>> So, an index is good, but not as good as just reading the database table
>>> out.
>>> It depends on how much performance you want.
>>> For Gramps as a desktop application I can accept that following an index is
>>> good enough, even if not the best.
>>>
>>> Benny
>>>
>>> -Doug
>>>
>>>> Ordering would mean a huge time penalty on exporting for those with very
>>>> large family trees.
>>>> Even exporting along a bsddb index would be much slower, as now we go
>>>> from database page to database page.
>>>> Just looping over the data and exporting means the hard disk is read the
>>>> least (it goes from database page to database page).
>>>> In other words:
>>>> 1/ default should be just a cursor of the database table, so order
>>>> cannot be maintained
>>>> 2/ ordered output could be optional. If we add an ordered output, it
>>>> should be along an index page of the database, so no in-memory sorting must
>>>> occur before export can be done. I think ID has a sorted index over it. Handle
>>>> normally also, as it is the primary key, and will hence be in some sort
>>>> of B-tree. You must be sure to use the sort index on looping however.
>>>> Benny
>>>> 2011/1/15 Jérôme <rom...@ya...>
>>>>>>
>>>>>> if the round-trip through gramps was idempotent, then the diff would
>>>>>> be empty.
>>>>>
>>>>> Expected result was: minor change on date generation (if generated on
>>>>> another day) and maybe media objects (media paths).
>>>>> I do not expect full idempotence after a round-trip, but currently we
>>>>> cannot easily get the differences. I just wanted to test a complete XML
>>>>> migration before the major release.
>>>>> Jérôme
>>>>> Doug Blank wrote:
>>>>>>
>>>>>> On Fri, Jan 14, 2011 at 4:31 PM, jerome <rom...@ya...> wrote:
>>>>>>>>>
>>>>>>>>> gramps ids could be exotic!
>>>>>>>>
>>>>>>>> Do you mean unique? Anyway it is a good sort-key candidate
>>>>>>>
>>>>>>> ids = [I000001, IAYUTRE235, zharb, /empty/ , etc ...]
>>>>>>> In 'handle' I trust! ;)
>>>>>>>>>
>>>>>>>>> Every time I import a Gramps XML, Gramps rebuilds
>>>>>>>>> (write, DB commit) some objects! Change time is not the same
>>>>>>>>> with a simple import then export.
>>>>>>>>
>>>>>>>> Well, they all need new handles, right? Possibility
>>>>>>>> of collisions. Also with gramps ids.
>>>>>>>
>>>>>>> In fact, I want to keep handles: they should be the keys control.
>>>>>>> My problem could be illustrated by something like:
>>>>>>> $ gramps -i import.gramps -e export.gramps
>>>>>>> $ gunzip < import.gramps > import.xml
>>>>>>> $ gunzip < export.gramps > export.xml
>>>>>>> $ diff -u import.xml export.xml > diff.txt
>>>>>>> where import.gramps is our "Scientific control".
>>>>>>> What should be the content of diff.txt?
>>>>>>> For me, it should be a few lines...
>>>>>>> Unfortunately there is some change (order, change time on family
>>>>>>> objects): that's strange!
>>>>>>
>>>>>> Yes, it would be handy to do this. This might be called "idempotent"
>>>>>> by a mathematician: if the round-trip through gramps was idempotent,
>>>>>> then the diff would be empty.
>>>>>> What we need is:
>>>>>> 1. something smarter than diff for this usage
>>>>>> 2. sort on something that doesn't change (like the handle), just for
>>>>>> this purpose
>>>>>> 3. make it so that the order is preserved
>>>>>> I would lean towards #3. I've "fixed" some other places where the
>>>>>> order was lost. If you let me know which orders are lost, I'll
>>>>>> address them.
>>>>>> -Doug
>>>>>>>
>>>>>>> Jérôme
>>>>>>> --- On Fri 14.1.11, Gerald Britton <ger...@gm...> wrote:
>>>>>>>>
>>>>>>>> From: Gerald Britton <ger...@gm...>
>>>>>>>> Subject: Re: [Gramps-devel] self.db.iter_object_handles(sort_handles=True)
>>>>>>>> To: "jerome" <rom...@ya...>
>>>>>>>> Cc: gra...@li...
>>>>>>>> Date: Friday 14 January 2011, 22:10
>>>>>>>> On Fri, Jan 14, 2011 at 3:59 PM, jerome <rom...@ya...> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I am not certain I understand ...
>>>>>>>>>>> Keys should be handles, no?
>>>>>>>>>>
>>>>>>>>>> Well, that's the question! I can see a case for gramps ids, or
>>>>>>>>>> surnames, or event dates, etc. etc.
>>>>>>>>>
>>>>>>>>> But the handle is the easiest and safest key for ordering our data.
>>>>>>>>
>>>>>>>> Only if that's the order you want
>>>>>>>>>
>>>>>>>>> gramps ids could be exotic!
>>>>>>>>
>>>>>>>> Do you mean unique? Anyway it is a good sort-key candidate
>>>>>>>>>
>>>>>>>>> surnames is not a good key :(
>>>>>>>>
>>>>>>>> I can see that some would like it... makes the XML easier to
>>>>>>>> read by a human
>>>>>>>>>
>>>>>>>>> date => date_object => year, then month, then day, then rank, etc ... = horrible index
>>>>>>>>
>>>>>>>> Probably, but it's just one possibility
>>>>>>>>>
>>>>>>>>> My problem is in plugins/export/ExportXML.py
>>>>>>>>> I saw a sortByID function that is not used, then sometimes the
>>>>>>>>> use of a list (get_...), then iteration (only family handles).
>>>>>>>>>
>>>>>>>>> I thought of using lists sorted by handle to have an
>>>>>>>>> ordering rule. I do not want to group handles; handles will be
>>>>>>>>> grouped in the Gramps XML, so it was not planned to parse
>>>>>>>>> one flat XML file or something like that!
>>>>>>>>>
>>>>>>>>> But it is not my main problem ...
>>>>>>>>> I thought that sorting handles means object lists
>>>>>>>>> will be consistent (Persons, Families, Events, etc ...)
>>>>>>>>>
>>>>>>>>> Every time I import a Gramps XML, Gramps rebuilds
>>>>>>>>> (write, DB commit) some objects! Change time is not the same
>>>>>>>>> with a simple import then export.
>>>>>>>>
>>>>>>>> Well, they all need new handles, right? Possibility
>>>>>>>> of collisions. Also with gramps ids.
>>>>>>>>>
>>>>>>>>> I can understand the random order used by bsddb, but
>>>>>>>>> this should not be done on some objects (like family) and
>>>>>>>>> not on the others.
>>>>>>>>>
>>>>>>>>> In my mind, an import without DB change is like a
>>>>>>>>> "read-only" operation: that is not the case. OK, you are saying that it
>>>>>>>>> is the way used by bsddb. XML files should be able to use
>>>>>>>>> 'diff' or revision control tools. With current Gramps XML
>>>>>>>>> import/export, these tools are limited. :(
>>>>>>>>
>>>>>>>> Yep. You're probably looking for something like a UUID for each
>>>>>>>> record. Not a bad idea but not implemented at the moment.
>>>>>>>>>
>>>>>>>>> Jérôme
>>>>>>>>> --- On Fri 14.1.11, Gerald Britton <ger...@gm...> wrote:
>>>>>>>>>>
>>>>>>>>>> From: Gerald Britton <ger...@gm...>
>>>>>>>>>> Subject: Re: [Gramps-devel] self.db.iter_object_handles(sort_handles=True)
>>>>>>>>>> To: "jerome" <rom...@ya...>
>>>>>>>>>> Cc: gra...@li...
>>>>>>>>>> Date: Friday 14 January 2011, 21:21
>>>>>>>>>> On Fri, Jan 14, 2011 at 3:11 PM, jerome <rom...@ya...> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I am not certain I understand ...
>>>>>>>>>>> Keys should be handles, no?
>>>>>>>>>>
>>>>>>>>>> Well, that's the question! I can see a case for gramps ids, or
>>>>>>>>>> surnames, or event dates, etc. etc.
>>>>>>>>>>>
>>>>>>>>>>> 'self.db.get_{object}_handles(sort_handles=True)' is allowed,
>>>>>>>>>>> not 'self.db.iter_{object}_handles(sort_handles=True)'!
>>>>>>>>>>>
>>>>>>>>>>> There are two questions:
>>>>>>>>>>> 1. Why does Gramps only use self.db.iter_family_handles(), else
>>>>>>>>>>> self.get_{object}_handles(), where {object} is person or
>>>>>>>>>>> event or source or place or repository or note or media object?
>>>>>>>>>> The get_...handles methods return a list, which can be expensive in
>>>>>>>>>> memory and must read all objects in one pass. The iter... methods
>>>>>>>>>> just return one at a time, so are cheaper in memory. So, the iter...
>>>>>>>>>> methods are preferable. OTOH, they cannot do sorting, since by
>>>>>>>>>> definition you need to read all records before you can sort them.
>>>>>>>>>>>
>>>>>>>>>>> 2. Why is the 'sort_handles=True' argument allowed on all
>>>>>>>>>>> primary objects except the family object?
>>>>>>>>>>
>>>>>>>>>> I suppose that there has been no requirement so far so no
>>>>>>>>>> one coded it up.
>>>>>>>>>>>>
>>>>>>>>>>>> The data is not ordered since it
>>>>>>>>>>>> comes from bsddb in random order.
>>>>>>>>>>>
>>>>>>>>>>> This could explain why I will not be able to keep
>>>>>>>>>>> order on XML import (to bsddb). :(
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>> Jérôme
>>>>>>>>>>> --- On Fri 14.1.11, Gerald Britton <ger...@gm...> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> From: Gerald Britton <ger...@gm...>
>>>>>>>>>>>> Subject: Re: [Gramps-devel] self.db.iter_object_handles(sort_handles=True)
>>>>>>>>>>>> To: "jerome" <rom...@ya...>
>>>>>>>>>>>> Cc: gra...@li...
>>>>>>>>>>>> Date: Friday 14 January 2011, 19:53
>>>>>>>>>>>> The data is not ordered since it
>>>>>>>>>>>> comes from bsddb in random order. If
>>>>>>>>>>>> we ordered it, we would have to sort it by some key.
>>>>>>>>>>>>
>>>>>>>>>>>> So, if we did, what keys would you use for:
>>>>>>>>>>>> person
>>>>>>>>>>>> family
>>>>>>>>>>>> event
>>>>>>>>>>>> source
>>>>>>>>>>>> place
>>>>>>>>>>>> repository
>>>>>>>>>>>> note
>>>>>>>>>>>> media object
>>>>>>>>>>>> On Fri, Jan 14, 2011 at 1:36 PM, jerome <rom...@ya...> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> I am trying to get an answer to a question about the
>>>>>>>>>>>>> code: why can we not keep the order of objects after a Gramps
>>>>>>>>>>>>> XML file import followed by an export?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nick pointed out that objects are not ordered on export [1].
>>>>>>>>>>>>>
>>>>>>>>>>>>> Why? I suppose backup scripts or revision control
>>>>>>>>>>>>> tools will work better with ordered objects! Anyway, using
>>>>>>>>>>>>> 'sort_handles=True' works on export, except for family
>>>>>>>>>>>>> handles. Any reason for that? A typo somewhere? On my side?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] http://www.gramps-project.org/bugs/view.php?id=4365
>>>>>>>>>>>>> regards,
>>>>>>>>>>>>> Jérôme
>>>>>>>>>>>>>
>>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>> Protect Your Site and Customers from Malware Attacks
>>>>>>>>>>>>> Learn about various malware tactics and how to avoid them. Understand
>>>>>>>>>>>>> malware threats, the impact they can have on your business, and how you
>>>>>>>>>>>>> can protect your company and customers by using code signing.
>>>>>>>>>>>>> http://p.sf.net/sfu/oracle-sfdevnl
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Gramps-devel mailing list
>>>>>>>>>>>>> Gra...@li...
>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/gramps-devel
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Gerald Britton
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Gerald Britton
>>>>>>>>
>>>>>>>> --
>>>>>>>> Gerald Britton

--
Gerald Britton
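
For what it's worth, the "sort it before you diff it" idea from earlier in the thread can be sketched in a few lines of Python instead of an XML query. This is only an illustration, not Gramps code: the function name sort_by_handle and the tiny sample documents are made up, and it assumes each record element carries a 'handle' attribute and sits directly under a container element below the root.

```python
import xml.etree.ElementTree as ET


def sort_by_handle(xml_text):
    """Re-serialize XML with the records of each container sorted by 'handle'.

    Hypothetical helper: assumes records live one level below the root's
    children and that each record has a 'handle' attribute.
    """
    root = ET.fromstring(xml_text)
    for container in root:
        records = list(container)
        # Leave containers alone unless every child has a handle to sort on.
        if records and all("handle" in r.attrib for r in records):
            records.sort(key=lambda r: r.get("handle"))
            container[:] = records
    return ET.tostring(root, encoding="unicode")


# Two made-up exports that differ only in record order.
first = '<database><people><person handle="_b"/><person handle="_a"/></people></database>'
second = '<database><people><person handle="_a"/><person handle="_b"/></people></database>'

# After normalizing, a plain comparison (or diff) ignores the storage order.
same = sort_by_handle(first) == sort_by_handle(second)
```

Since the sort happens on the exported files and not inside the exporter, it costs nothing for people with very large trees who never need a stable order.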