From: Enno B. <enn...@gm...> - 2013-02-18 17:20:48
|
Hi,

When I try to import a GEDCOM file created by PAF, I see that the current code in trunk is not able to detect the file type by interpreting the first 2 bytes of that file. Instead, I see the errors shown below:

    /home/test/trunk/gramps/plugins/lib/libgedcom.py:7445: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      if line == "\xef\xbb":
    /home/test/trunk/gramps/plugins/lib/libgedcom.py:7449: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      elif line == "\xff\xfe":
    /home/test/trunk/gramps/plugins/lib/libgedcom.py:7455: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      elif line[0] == "\x00" or line[1] == "\x00":
    2013-02-18 18:00:08.556: WARNING: libgedcom.py: line 7482: Foute lijn1 in GEDCOM bestand.

My GEDCOM file starts like this (od -t x1):

    0000000 ef bb bf 30 20 48 45 41 44 0d 0a 31 20 53 4f 55

So I know that I need to get the above code working to be able to import files that I create in PAF or RM. It's in __detect_file_decoder(self, input_file), and it works OK in 3.4. Can anyone help me here? I read a few things about the subject on stackoverflow, but I'm still lost. I use Python 2.7.3, on Mint 14.

I tried opening the file in another mode, but that didn't help. Putting str() at the right side of == helps me to get rid of the Unicode error messages, but it doesn't help to get the import running. I assume that line is sort of a byte string that can't be converted to Unicode, so maybe the whole comparison code must be rewritten, but I don't know how.

My bug report is here:

http://www.gramps-project.org/bugs/view.php?id=6462

And I have enough time to experiment, once I know where to start.

thanks, Enno |
From: John R. <jr...@ce...> - 2013-02-18 18:16:33
|
On Feb 18, 2013, at 9:20 AM, Enno Borgsteede <enn...@gm...> wrote: > Hi, > > When I try to import a GEDCOM file created by PAF, I see that the > current code in trunk is not able to detect the file type by > interpreting the 1st 2 bytes of that file. Instead, I see the error > shown below: > > /home/test/trunk/gramps/plugins/lib/libgedcom.py:7445: UnicodeWarning: > Unicode equal comparison failed to convert both arguments to Unicode - > interpreting them as being unequal > if line == "\xef\xbb": > /home/test/trunk/gramps/plugins/lib/libgedcom.py:7449: UnicodeWarning: > Unicode equal comparison failed to convert both arguments to Unicode - > interpreting them as being unequal > elif line == "\xff\xfe": > /home/test/trunk/gramps/plugins/lib/libgedcom.py:7455: UnicodeWarning: > Unicode equal comparison failed to convert both arguments to Unicode - > interpreting them as being unequal > elif line[0] == "\x00" or line[1] == "\x00": > 2013-02-18 18:00:08.556: WARNING: libgedcom.py: line 7482: Foute lijn1 > in GEDCOM bestand. > > My GEDCOM file starts like this (od -t x1): > > 0000000 ef bb bf 30 20 48 45 41 44 0d 0a 31 20 53 4f 55 > > So I know that I need to get above code working to be able to import > files that I create in PAF or RM. It's in __detect_file_decoder(self, > input_file), and it works OK in 3.4. Can anyone help me here? I read a > few things about the subject on stackoverflow, but I'm still lost here. > I use Python 2.7.3, on Mint 14. > > I tried opening the file in another mode, but that didn't help. Putting > str() at the right side of == helps me to get rid of the Unicode error > messages, but it doesn't help to get the import running. I assume that > line is sort of a byte string that can't be converted to Unicode, so > maybe the whole comparison code must be rewritten, but I don't know how. > > My bug report is here: > > http://www.gramps-project.org/bugs/view.php?id=6462 > > And I have enough time to experiment, once I know where to start. 
Two suggestions: Put a print(line) at libgedcom.py:7445 to see what it's actually reading, and figure out where the file is actually opened and make sure that it's opened in binary mode ("rb"). Regards, John Ralls. |
From: Enno B. <enn...@gm...> - 2013-02-20 14:30:05
|
Hi Benny, > After a GEDCOM import with problems, it is a good idea to run the > check and repair database tool > and the > rebuild secondary indexes / rebuild reference map tools > > It has suggested on this list to do that automatically, but I'm > personally not in favour of this. I ran a search for 'unknown' on Mantis, and found a report that gave me the impression that typeless objects are created on purpose, when unresolved references (to repos, notes, whatever) are found in a GEDCOM file. While I understand the reasoning behind this, I would personally prefer the creation of dummy objects with a type that does not create problems on subsequent GEDCOM export. Then it's up to the user to check those some time later, just like one can check for GEDCOM-import notes. regards, Enno |
From: Tim L. <guy...@gm...> - 2013-02-21 21:39:43
|
enno wrote
> I ran a search for 'unknown' on Mantis, and found a report that gave me the impression that typeless objects are created on purpose, when unresolved references (to repos, notes, whatever) are found in a GEDCOM file.
>
> While I understand the reasoning behind this, I would personally prefer the creation of dummy objects with a type that does not create problems on subsequent GEDCOM export. Then it's up to the user to check those some time later, just like one can check for GEDCOM-import notes.

(I haven't actually looked at the code just now, but...) It is not actually creating typeless objects; indeed there would be no way to do that. It is creating objects of whatever type is required, but is setting some 'characterising attribute' of the object to 'unknown'. Indeed, when we implemented this, we changed the message from talking about unknown type to talking about unknown characterising attribute, as 'unknown type' was both wrong and confusing.

From your error messages, it looks as though it is trying to create a Repository object, and to link the note about 'this is a note about a repository object that has been created' to the repository. But evidently something is going wrong, and possibly the repository object is not correctly created, and possibly the note is not correctly linked. At least the link from an object to the repository (I think it must be a link from a source to a repository, because that is the only type of object that links to repositories) does not seem to be created properly.

Thanks for the info about the problem - it is supposed to be doing what you suggest, but is just not doing it right. |
From: Enno B. <enn...@gm...> - 2013-02-22 21:22:02
|
Hi Tim,

> (I haven't actually looked at the code just now, but...) It is not actually creating typeless objects, indeed there would be no way to do that. It is creating objects of whatever type is required, but is setting some 'characterising attribute' of the object to 'unknown'

Right, and my analysis was wrong. I have a RootsMagic GEDCOM file that has a lot of missing notes, and when I import that, Gramps creates note objects with type unknown. My mistake was that I thought that the unknown type was causing the subsequent GEDCOM export problems. It's not.

I can say that now, because the export runs fine after running check and repair, and after that, those notes of unknown type are still there. In other words, things are indeed like you say they are, meaning that objects are created, but some links are wrong, or not all needed objects are created.

> From your error messages, it looks as though it is trying to create a Repository object, and to link the note about 'this is a note about a repository object that has been created' to the repository. But evidently something is going wrong, and possibly the repository object is not correctly created, and possibly the note is not correctly linked.

In my latest test, which goes wrong in 3.4.3 and trunk, it's not about repositories, but notes, and I can upload the offending GEDCOM as private if that helps. During import more than 1400 notes are created, but apparently that's not enough, since check and repair reports about 12 notes missing.

The error message that I get looks very much like the one in

http://www.gramps-project.org/bugs/view.php?id=5184

which was closed because there was not enough feedback to reproduce it. I can provide that feedback, for 3.4.3 or trunk, so I suggest that the above bug is reopened. Jerome may not like that, because he closed it just yesterday ...

thanks, Enno |
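One way to see which notes a GEDCOM file refers to but never defines, before importing it, is to compare the cross-reference IDs that appear as level-0 records against the IDs used as pointer values. The sketch below is only an illustration of that idea: the helper name `dangling_xrefs` and the sample lines are made up, it treats any value of the form `@ID@` as a pointer, and it says nothing about how the Gramps importer or Check and Repair work internally.

```python
import re

# A value that consists of nothing but @ID@ is treated as a pointer (assumption).
XREF = re.compile(r"^@[^@\s]+@$")

def dangling_xrefs(lines):
    """Return the sorted xref IDs that are pointed to but never defined."""
    defined, used = set(), set()
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        if parts[0] == "0" and XREF.match(parts[1]):
            defined.add(parts[1])            # e.g. "0 @N1@ NOTE ..."
        elif len(parts) == 3 and XREF.match(parts[2].strip()):
            used.add(parts[2].strip())       # e.g. "1 NOTE @N2@"
    return sorted(used - defined)

# Hypothetical sample: @N2@ is referenced but has no record of its own.
sample = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME Jan /Test/",
    "1 NOTE @N1@",
    "1 NOTE @N2@",
    "0 @N1@ NOTE First note",
    "0 TRLR",
]
print(dangling_xrefs(sample))
```

Because definitions and uses are collected first and compared at the end, the order of records in the file does not matter.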
From: jerome <rom...@ya...> - 2013-02-23 08:28:55
|
> http://www.gramps-project.org/bugs/view.php?id=5184
> which was closed because there was not enough feedback to reproduce it.
> I can provide that feedback, for 3.4.3 or trunk, so I suggest that above bug is reopened. Jerome may not like that, because he closed it just yesterday ...

Yes, feel free to re-open it. I suppose no special access rights are needed on the tracker for this? There is a button for it between the 'Relations' section and the 'Attachments' section.

Note that this report was for Gramps 3.3.0, the first release of the 3.3.x branch, and was reported on 2011-08-26. I tried to understand where or what could be wrong, and asked the reporter to run 'Check and Repair' after a GEDCOM import and whether there was a test database for reproducing it. Tim has reviewed the citation section and the GEDCOM handling. Since we could not reproduce it, the code may be different today, the reporter gave no additional information, and the report was for 3.3.0, closing it seemed like one reasonable solution. If someone else can reproduce it, we can re-open it or file a new bug report.

A side note: while playing with an experimental GenoPro to Gramps converter, built on Python core modules and the XML file format[1], I often ran into broken or missing references. I also used an 'unknown' string when I hit this type of problem... We cannot fix all types of errors, and we cannot reject a complete file for one error (even if we can check the syntax and validate the file).

For the experimental converter, the main issues were family relations, status, and individual locations. Python has nice and simple ways of handling that (map, list, dict, etc.). I did not use the simplest ones, but I think I got the expected ideas and concepts! The second issue was related to notes. Every time I saw no right place for some data, I stored the content in a Note object. But, as with the GEDCOM file format, there is no index for notes... About families, we can check references in individuals and in a complete family.
How many passes are needed for a complete family? As said, there are so many ways of generating GEDCOM sequences...

Running 'Check and Repair' after a GEDCOM import can be useful. Otherwise, keep in mind that a GEDCOM file can be corrupted (e.g. have you ever sent a GEDCOM with non-ASCII characters through a mail server that uses a different encoding from the GEDCOM file?). Is there any way to know whether the IDs set in the GEDCOM are good enough for a trusted index? Does the program that generated the GEDCOM aim at exchange and genealogical collaboration? Or does it spend time on having the best possible GEDCOM handling? etc.

[1] http://www.gramps-project.org/bugs/view.php?id=2757

Jérôme |
From: Enno B. <enn...@gm...> - 2013-02-23 17:12:44
|
Hi Jerome,

I created a new issue for 3.4.2 as a followup to this:

> http://www.gramps-project.org/bugs/view.php?id=5184

I was not able to reopen it, and I guess that's because my Mantis status is reporter, not developer.

> But like with Gedcom file format, there is no index for notes...

Hmm, I don't understand what you mean by that. Sorry.

regards, Enno |
From: Jérôme <rom...@ya...> - 2013-02-24 09:08:22
|
Enno,

> I was not able to reopen it, and I guess that's because my Mantis status is reporter, not developer

Maybe only the reporter and people with developer status are allowed to do that?

> I created a new issue for 3.4.2 as a followup to this:
>
> http://www.gramps-project.org/bugs/view.php?id=5184

OK, thank you. Yes, it seems to be the same problem, still there in the last stable release. It also makes the per-project 'summary' tab clearer[1]!

> But like with Gedcom file format, there is no index for notes...
>
> H'm, I don't understand what you mean my that. Sorry.

You can put a note almost everywhere: as a simple note with an ID, as source text, as a transcription in a source/citation, as a comment, etc. The same multiple locations and choices exist for sources and objects, which all have an ID. See the table[2], which gives an overview. Without a safe ID (an xref or whatever key is used for a mapping) you can easily corrupt and lose data... See the quick tests[3] on FAM tags and relations.

Giving an ID to only some comments, texts, and notes is a GEDCOM design choice. One could say that Gramps should follow the GEDCOM logic... Otherwise, when should a person exist:

1. once there is a gender value
2. once there is a name
3. once there is an ID
4. once there is a relation with someone else: parent of, child of, spouse of, partner of, witness at a shared event, mentioned in a source or transcription, etc.

In a GEDCOM family, we assume that a parent with gender male is the father and a parent with gender female is the mother. What happens in a GEDCOM where there is a gender issue: someone with gender male (INDI) set as wife (FAM), or someone with gender female (INDI) set as husband (FAM)? It seems that Gramps inherited these limitations and issues...

Where am I going with this? Well, it is time to have a GEDCOM 5.5.x replacement! 'bettergedcom' or 'gedcomx' sound better... At the least, everything around individual relations and families, sources/citations, places, notes and objects, and privacy is poorly served for exchange.
I do not care about the other goals of those specifications that relate to the data business.

Anyway, maybe the subject of this thread should be updated? 'trunk problem importing GEDCOM files created by Windows programs' could become something less about the OS and more about GEDCOM customizations (variants) and how Gramps currently handles them (also in 3.4.x).

[1] http://www.gramps-project.org/bugs/summary_page.php
[2] http://bertrand-maillard.pagesperso-orange.fr/sylvain/tabletag.html
[3] http://www.gramps-project.org/bugs/view.php?id=2710

regards, Jérôme |
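The gender question raised in this message can at least be detected mechanically. The sketch below scans GEDCOM lines for families whose HUSB or WIFE pointer leads to an individual whose SEX value disagrees with that role. It is a hypothetical helper (`sex_role_mismatches`), it only looks at level-1 SEX, HUSB and WIFE lines, and it does not describe what the Gramps importer does with such records.

```python
def sex_role_mismatches(lines):
    """Return (family, role, person) triples where SEX contradicts the role."""
    sex = {}      # individual xref -> SEX value
    roles = []    # (family xref, "HUSB" or "WIFE", individual xref)
    current = None
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        if parts[0] == "0":
            # Remember which record we are in; records without an xref reset it.
            current = parts[1] if parts[1].startswith("@") else None
        elif parts[0] == "1" and current and len(parts) == 3:
            tag, value = parts[1], parts[2].strip()
            if tag == "SEX":
                sex[current] = value
            elif tag in ("HUSB", "WIFE"):
                roles.append((current, tag, value))
    expected = {"HUSB": "M", "WIFE": "F"}
    # Compare only after the whole file is read, so record order is irrelevant.
    return [(fam, tag, person) for fam, tag, person in roles
            if sex.get(person) not in (None, expected[tag])]

# Hypothetical sample: @I2@ is recorded as male but linked as WIFE.
sample = [
    "0 @I1@ INDI", "1 SEX M",
    "0 @I2@ INDI", "1 SEX M",
    "0 @F1@ FAM", "1 HUSB @I1@", "1 WIFE @I2@",
    "0 TRLR",
]
print(sex_role_mismatches(sample))
```

Collecting everything in a single read and comparing at the end is one answer to the "how many passes" question for this particular check; individuals with no SEX line are skipped rather than reported.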
From: Enno B. <enn...@gm...> - 2013-02-25 17:43:21
|
Hi Jérôme, > Enno, > >> I was not able to reopen it, and I guess that’s because my Mantis status >> is reporter, not developer > > Maybe only the reporter and people having developer status are allowed > to do that? I think so, yes. > Without safe ID (xref or whatever key for a mapping) you can easily > corrupt and lost data... See quick tests[3] on FAM tags and relations. I know, and I have a new thought about that. Here it is: After changing the transaction stuff for citation merge, I tried .gramps import with the test file with 100 001 citations. It takes more than an hour here, but it doesn't crash, and while it was running, I looked at some import source, and found that transactions are OFF during GEDCOM import. Can that be related to the corruptions we find? I assume that transactions were switched off, because GEDCOM imports would also be very slow otherwise, and I haven't tested what happens when I switch them off in XML import too. It's worth a try though, I think, later ... > Well, it is time to have a gedcom 5.5.x replacement! > > 'bettergedcom' or 'gedcomx' sound better... I don't see much happening at better gedcom, but gedcomx sure does interesting things. One thing I see there, is that they got rid of citations, but do allow for nested sources! And the FamilySearch tree also takes a very simple approach to sources, which looks like the opposite of what others are doing. I like that. I'm experimenting a lot with the FS tree, using RootsMagic, and previously Ancestral Quest, and if that's the future of genealogy, we may see more exchange of small datasets, i.e. single persons, families, and associated objects like places, notes, and sources. Do you see that too? cheers, Enno |
From: Peter L. <pet...@te...> - 2013-02-18 18:42:42
|
Den Monday 18 February 2013 18.20.26 skrev Enno Borgsteede: > Hi, > > When I try to import a GEDCOM file created by PAF, I see that the > current code in trunk is not able to detect the file type by > interpreting the 1st 2 bytes of that file. Instead, I see the error > shown below: > > /home/test/trunk/gramps/plugins/lib/libgedcom.py:7445: UnicodeWarning: > Unicode equal comparison failed to convert both arguments to Unicode - > interpreting them as being unequal > if line == "\xef\xbb": > /home/test/trunk/gramps/plugins/lib/libgedcom.py:7449: UnicodeWarning: > Unicode equal comparison failed to convert both arguments to Unicode - > interpreting them as being unequal > elif line == "\xff\xfe": > /home/test/trunk/gramps/plugins/lib/libgedcom.py:7455: UnicodeWarning: > Unicode equal comparison failed to convert both arguments to Unicode - > interpreting them as being unequal > elif line[0] == "\x00" or line[1] == "\x00": > 2013-02-18 18:00:08.556: WARNING: libgedcom.py: line 7482: Foute lijn1 > in GEDCOM bestand. libgedcom tries to compare a bytestring in "line" with an ordinary string (unicode) which fails. /Peter > My GEDCOM file starts like this (od -t x1): > > 0000000 ef bb bf 30 20 48 45 41 44 0d 0a 31 20 53 4f 55 > > So I know that I need to get above code working to be able to import > files that I create in PAF or RM. It's in __detect_file_decoder(self, > input_file), and it works OK in 3.4. Can anyone help me here? I read a > few things about the subject on stackoverflow, but I'm still lost here. > I use Python 2.7.3, on Mint 14. > > I tried opening the file in another mode, but that didn't help. Putting > str() at the right side of == helps me to get rid of the Unicode error > messages, but it doesn't help to get the import running. I assume that > line is sort of a byte string that can't be converted to Unicode, so > maybe the whole comparison code must be rewritten, but I don't know how. 
-- Peter Landgren |
From: Enno B. <enn...@gm...> - 2013-02-18 20:08:22
|
Hi Peter, > Den Monday 18 February 2013 18.20.26 skrev Enno Borgsteede: >> /home/test/trunk/gramps/plugins/lib/libgedcom.py:7445: UnicodeWarning: >> Unicode equal comparison failed to convert both arguments to Unicode - >> interpreting them as being unequal >> if line == "\xef\xbb": >> > libgedcom tries to compare a bytestring in "line" with an ordinary string (unicode) which fails. > > /Peter I got that, but I have no idea on the repair strategy. Tried backquotes around line, and extra ' and \ to the right of the == and that works, but there must be a better way, like forcing the right parameter to be interpreted as a bytestring, just like I would do in C(++). I tried str() on the right string, which let me get rid of the error, but the compare didn't return the right encoding. Elsewhere, I read about using encode, but I haven't tried that yet. thanks, Enno |
From: Enno B. <enn...@gm...> - 2013-02-18 19:54:38
|
Hi John,

> /home/test/trunk/gramps/plugins/lib/libgedcom.py:7445: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
>   if line == "\xef\xbb":
>
> Two suggestions:
>
> Put a print(line) at libgedcom.py:7445 to see what it's actually reading,
> and figure out where it's actually opening the file and make sure that it's doing so in binary ("rb").

I added

    print(`line`)

and got this

    '\xef\xbb'

so the file is being read like it should, even though it seems to be opened as "r".

In other words, I still have to repair the ifs, so that they compare byte strings instead of unicode.

Changing the 1st if to

    if `line` == "'\\xef\\xbb'":

and making the same change to the other ones, including one in UTF8Reader, works, but I hope there is a nicer way to accomplish this.

thanks, Enno |
From: John R. <jr...@ce...> - 2013-02-18 20:30:57
|
On Feb 18, 2013, at 11:54 AM, Enno Borgsteede <enn...@gm...> wrote:

> Hi John,
>> /home/test/trunk/gramps/plugins/lib/libgedcom.py:7445: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
>>   if line == "\xef\xbb":
>> Two suggestions:
>>
>> Put a print(line) at libgedcom.py:7445 to see what it's actually reading,
>> and figure out where it's actually opening the file and make sure that it's doing so in binary ("rb").
> I added
>
>     print(`line`)
>
> and got this
>
>     '\xef\xbb'
>
> so the file is being read like it should, even though it seems to be opened as "r".
>
> In other words, I still have to repair the ifs, so that they compare byte strings instead of unicode.
>
> Changing the 1st if to
>
>     if `line` == "'\\xef\\xbb'":
>
> and making the same change to the other ones, including one in UTF8Reader works, but I hope there is a nicer way to accomplish this.

I think that's a band-aid. line shouldn't be unicode. That it is suggests the file is getting opened with io.open(encoding="utf-8"), which it clearly shouldn't be: it should be opened binary ("rb") with no encoding. Then, so that it will work with both Py2 and Py3, the tests should be

    if line == b"\xef\xbb":

Regards, John Ralls |
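As a minimal sketch of the approach described above (read the start of the file as bytes, then compare against bytes values), the following runs unchanged on Python 2.7 and Python 3. The function name `detect_bom` and the codec names it returns are assumptions made for illustration; this is not the Gramps `__detect_file_decoder` code, and unlike the two-byte tests quoted in this thread it checks the complete byte-order marks from the standard `codecs` module.

```python
import codecs

def detect_bom(head):
    """Guess a codec name from a leading byte-order mark, or return None.

    `head` must be a byte string, for example open(path, "rb").read(4).
    """
    if head.startswith(codecs.BOM_UTF8):                  # b"\xef\xbb\xbf"
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE,              # b"\xff\xfe"
                        codecs.BOM_UTF16_BE)):            # b"\xfe\xff"
        return "utf-16"
    return None

# The first bytes of the PAF file, taken from the od dump earlier in the thread.
paf_head = b"\xef\xbb\xbf0 HEAD\r\n1 SOU"

codec = detect_bom(paf_head)
print(codec)
print(repr(paf_head.decode(codec)))
```

Both returned codec names are standard Python codecs that consume the mark while decoding, so the decoded text starts directly at `0 HEAD`.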
From: Enno B. <enn...@gm...> - 2013-02-18 21:21:27
|
Hi John,

> I think that's a band-aid. line shouldn't be unicode.

I agree.

> That it is suggests the file is getting opened with io.open(encoding="utf-8"), which it clearly shouldn't be: It should be opened binary ("rb") with no encoding.

So, this section in GedcomInfoDB

    if sys.version_info[0] < 3:
        try:
            filepath = os.path.join(DATA_DIR, "gedcom.xml")
            ged_file = open(filepath.encode('iso8859-1'), "r")
        except:
            return
    else:
        try:
            filepath = os.path.join(DATA_DIR, "gedcom.xml")
            ged_file = open(filepath, "rb")
        except:
            return

can be reduced to the else part. Sounds good.

> Then, so that it will work with both Py2 and Py3, the tests should be
> if line == b"\xef\xbb"

That's the info that I needed. Thanks!

regards, Enno |
From: Enno B. <enn...@gm...> - 2013-02-18 22:18:05
|
On 18-02-13 21:30, John Ralls wrote: > Then, so that it will work with both Py2 and Py3, the tests should be > if line == b"\xef\xbb" Something weird here: The Python site says that the b prefix is ignored in Python 2, but my experience shows that it definitely makes a difference in 2.7.3. cheers, Enno |
From: John R. <jr...@ce...> - 2013-02-18 23:16:56
|
On Feb 18, 2013, at 2:17 PM, Enno Borgsteede <enn...@gm...> wrote:

> On 18-02-13 21:30, John Ralls wrote:
>> Then, so that it will work with both Py2 and Py3, the tests should be if line == b"\xef\xbb"
> Something weird here: The Python site says that the b prefix is ignored in Python 2, but my experience shows that it definitely makes a difference in 2.7.3.

Really?

    Python 2.7.3 (default, Jan 6 2013, 14:08:41)
    [GCC 4.2.1 (Apple Inc. build 5659)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> p b"\xec\xbb"
      File "<stdin>", line 1
        p b"\xec\xbb"
                    ^
    SyntaxError: invalid syntax
    >>> b"\xec\xbb"
    '\xec\xbb'
    >>> "\xec\xbb" == b"\xec\xbb"
    True
    >>> u"\xec\xbb" == b"\xec\xbb"
    __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    False
    >>> u"\xec\xbb"
    u'\xec\xbb'
    >>> u"\xec\xbb" == "\xec\xbb"
    False
    >>>

Tells me that if the b isn't ignored, it's not doing anything. But you still need it for Py3.

Regards, John Ralls |
From: Benny M. <ben...@gm...> - 2013-02-19 09:19:34
|
2013/2/19 John Ralls <jr...@ce...>

> On Feb 18, 2013, at 2:17 PM, Enno Borgsteede <enn...@gm...> wrote:
>
> > On 18-02-13 21:30, John Ralls wrote:
> >> Then, so that it will work with both Py2 and Py3, the tests should be if line == b"\xef\xbb"
> > Something weird here: The Python site says that the b prefix is ignored in Python 2, but my experience shows that it definitely makes a difference in 2.7.3.
>
> Really?
>
>     Python 2.7.3 (default, Jan 6 2013, 14:08:41)
>     [GCC 4.2.1 (Apple Inc. build 5659)] on darwin
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> p b"\xec\xbb"
>       File "<stdin>", line 1
>         p b"\xec\xbb"
>                     ^
>     SyntaxError: invalid syntax
>     >>> b"\xec\xbb"
>     '\xec\xbb'
>     >>> "\xec\xbb" == b"\xec\xbb"
>     True
>     >>> u"\xec\xbb" == b"\xec\xbb"
>     __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
>     False
>     >>> u"\xec\xbb"
>     u'\xec\xbb'
>     >>> u"\xec\xbb" == "\xec\xbb"
>     False
>     >>>
>
> Tells me that if the b isn't ignored, it's not doing anything. But you still need it for Py3.

Enno,

Python 2 behaves differently because of pieces like:

    if sys.version_info[0] < 3:
        try:
            filepath = os.path.join(DATA_DIR, "gedcom.xml")
            ged_file = open(filepath.encode('iso8859-1'), "r")
        except:
            return
    else:
        try:
            filepath = os.path.join(DATA_DIR, "gedcom.xml")
            ged_file = open(filepath, "rb")
        except:
            return

The first part is the code from Gramps 3.4, which is working at the moment. So if you now have a failure with that, something somewhere else changed. Even if you only keep the else part in your change, Python 2 might still behave differently, because the conversion functions that are used behave differently. See what is imported from constfunc, e.g. cuni, ...

Benny |
From: Enno B. <enn...@gm...> - 2013-02-19 13:25:14
|
Hi Benny,

> python 2 behaves different because of pieces like:
>
>     if sys.version_info[0] < 3:
>         try:
>             filepath = os.path.join(DATA_DIR, "gedcom.xml")
>             ged_file = open(filepath.encode('iso8859-1'), "r")
>         except:
>             return
>     else:
>         try:
>             filepath = os.path.join(DATA_DIR, "gedcom.xml")
>             ged_file = open(filepath, "rb")
>         except:
>             return
>
> the first part is the code from gramps 3.4, which is working at the moment.

John reminded me that this is not the actual GEDCOM, and to tell you the truth, I have no idea why this code is there. I know that it's used, but in the meantime I found the opening of the actual GEDCOM in importgedcom.py:

    if sys.version_info[0] < 3:
        ifile = open(filename, "rU")
    else:
        ifile = open(filename, "rb")

I wonder what will happen when I change that to "rb".

> So if you now have a fail with that, something somewhere else changed.
> Even if you only do the else part in your change, python2 might still behave different because the conversion functions that are used behave different.
> See what is imported from constfunc, eg cuni, ...

You mean things like this (in importgedcom.py):

    from gramps.gen.constfunc import STRTYPE

Things like that probably explain why the if that gave me a problem works OK in a python shell, where nothing special is imported.

I see lots of 'interesting' imports in libgedcom.py too, like:

    from __future__ import print_function, unicode_literals

    if sys.version_info[0] < 3:
        from cStringIO import StringIO
    else:
        from io import StringIO

But I tend to leave them as is, assuming that they were put there for a reason.

Latest news: When I change the open mentioned above to "rb" for all python versions, my RootsMagic GEDCOM file still imports OK, but the b's before the string literals are still needed. I think I can live with that, especially when someone else can test the attached PAF GEDCOM with python 3.

thanks, Enno |
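The `unicode_literals` import listed above is a plausible explanation for why the `b` prefix matters inside libgedcom.py but not in a bare Python 2 shell: with that import, an unprefixed `"\xef\xbb"` is a unicode string in Python 2, just as it is in Python 3. The sketch below imitates that by writing the `u` prefix explicitly, so that it behaves the same in both interpreters; the variable names are made up for illustration.

```python
# What a plain "\xef\xbb" literal becomes in Python 2 under
# `from __future__ import unicode_literals`, and what it always is in Python 3.
text_literal = u"\xef\xbb"

# What the b prefix gives in both Python 2 and Python 3.
bytes_literal = b"\xef\xbb"

# Stand-in for the two bytes read from a file opened with "rb".
line = bytes(bytearray([0xEF, 0xBB]))

print(line == bytes_literal)   # bytes compared with bytes
print(line == text_literal)    # bytes compared with text
```

In both interpreters the first comparison is equal and the second is unequal; Python 2 additionally prints the UnicodeWarning quoted at the start of this thread for the second one.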
From: Benny M. <ben...@gm...> - 2013-02-19 13:47:46
|
2013/2/19 Enno Borgsteede <enn...@gm...> > Hi Benny, > > python 2 behaves different because of pieces like: >> >> if sys.version_info[0] < 3: >> try: >> filepath = os.path.join(DATA_DIR, "gedcom.xml") >> ged_file = open(filepath.encode('iso8859- >> 1'), "r") >> except: >> return >> else: >> try: >> filepath = os.path.join(DATA_DIR, "gedcom.xml") >> ged_file = open(filepath, "rb") >> except: >> return >> the first part is the code from gramps 3.4, which is working at the >> moment. >> > John reminded me that this is not the actual GEDCOM, and to tell you the > truth, I have no idea why this code is there. I know that it's used, but in > the mean time I found the opening of the actual GEDCOM in importgedcom.py: > > if sys.version_info[0] < 3: > ifile = open(filename, "rU") > else: > ifile = open(filename, "rb") > > I wonder what will happen when I change that to "rb". The < 3 codepath is the old code which is well tested and is working at the moment in version 3.4. So there should be no need to change that, or a lot things will have to be changed. If things fail for you in python 2, it is because somebody changed something somewhere else since the 3.4 code. Benny > > So if you now have a fail with that, something somewhere else changed. >> Even if you only do the else part in your change, python2 might still >> behave different because the conversion functions that are used behave >> different. >> See what is imported from constfunc, eg cuni, ... >> > You mean things like this (in importgedcom.py): > > from gramps.gen.constfunc import STRTYPE > > Things like that probably explain why the if that gave me a problem, works > OK in a python shell, where nothing special is imported. 
> > I see lots of 'interesting' imports in libgedcom.py too, like: > > from __future__ import print_function, unicode_literals > > if sys.version_info[0] < 3: > from cStringIO import StringIO > else: > from io import StringIO > > But I tend to leave them as is, assuming that they were put there for a > reason. > > Latest news: When I change the open mentioned above to "rb" for all python > versions, my RootsMagic GEDCOM file still imports OK, but the b's before > the string literals are still needed. I think I can live with that, > especially when someone else can test the attached PAF GEDCOM with python 3. > > thanks, > > Enno > > |
From: Enno B. <enn...@gm...> - 2013-02-19 13:59:17
|
Hi John,

> On Feb 18, 2013, at 2:17 PM, Enno Borgsteede <enn...@gm...> wrote:
>
>> On 18-02-13 21:30, John Ralls wrote:
>>> Then, so that it will work with both Py2 and Py3, the tests should be
>>> if line == b"\xef\xbb"
>> Something weird here: The Python site says that the b prefix is ignored in Python 2, but my experience shows that it definitely makes a difference in 2.7.3.
> Really?

Yes. And I guess that's because of the types that are imported in Gramps, like Benny said. That's why your experiments in a python shell, and the ones I did here before, yield different results.

The latest code that I run here opens all files as "rb", and still needs those b's before string literals. It works OK on Windows files (PAF and RM) and GEDCOM files created by Gramps itself, in Linux Mint. The only thing that I haven't tested yet is importing a GEDCOM created by a Windows Gramps (3.4.2).

I must add that in order to test GEDCOM files written by Gramps trunk, I first had to change the export code, like this:

Index: gramps/plugins/export/exportgedcom.py
===================================================================
--- gramps/plugins/export/exportgedcom.py (revision 21372)
+++ gramps/plugins/export/exportgedcom.py (working copy)
@@ -236,7 +236,7 @@
         """

         self.dirname = os.path.dirname (filename)
-        self.gedcom_file = io.open(filename, "w", encoding='utf-8')
+        self.gedcom_file = open(filename, "w")
         self._header(filename)
         self._submitter()
         self._individuals()

Which puts me in a situation where I read all GEDCOMs as rb, and write them as w. What logic is that?

regards,

Enno
|
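On the question of reading as "rb" while writing in text mode, one possible way to look at the asymmetry is sketched below: a reader cannot know the encoding of a foreign file before it has inspected the raw bytes, whereas a writer picks the encoding itself. The file name and the sample records are made up for illustration; `utf-8-sig` is the standard codec that drops a leading UTF-8 BOM when decoding.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.ged")  # hypothetical file

# Writing: the program chooses the encoding, so a text-mode handle with an
# explicit encoding is enough. io.open behaves the same on Python 2 and 3.
with io.open(path, "w", encoding="utf-8", newline="") as out:
    out.write(u"0 HEAD\n1 CHAR UTF-8\n0 TRLR\n")

# Reading: the encoding is unknown up front, so read bytes first and
# decode only after the encoding has been worked out.
with io.open(path, "rb") as src:
    raw = src.read()

text = raw.decode("utf-8-sig")  # also strips a UTF-8 BOM if one is present
lines = text.splitlines()
```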
From: Benny M. <ben...@gm...> - 2013-02-19 16:39:51
|
2013/2/19 Enno Borgsteede <enn...@gm...>

> Hi John,
> > On Feb 18, 2013, at 2:17 PM, Enno Borgsteede <enn...@gm...> wrote:
> >
> >> On 18-02-13 21:30, John Ralls wrote:
> >>> Then, so that it will work with both Py2 and Py3, the tests should be
> >>> if line == b"\xef\xbb"
> >> Something weird here: The Python site says that the b prefix is ignored
> >> in Python 2, but my experience shows that it definitely makes a difference
> >> in 2.7.3.
> > Really?
> Yes. And I guess that's because of the types that are imported in
> Gramps, like Benny said. That's why your experiments in a python shell,
> and the ones I did here before, yield different results.
>
> The latest code that I run here opens all files as "rb", and still needs
> those b's before string literals. It works OK on Windows files (PAF and
> RM) and GEDCOM files created by Gramps itself, in Linux Mint. The only
> thing that I haven't tested yet is importing a GEDCOM created by a
> Windows Gramps (3.4.2).
>
> I must add that in order to test GEDCOM files written by Gramps trunk, I
> first had to change the export code, like this:
>
> Index: gramps/plugins/export/exportgedcom.py
> ===================================================================
> --- gramps/plugins/export/exportgedcom.py (revision 21372)
> +++ gramps/plugins/export/exportgedcom.py (working copy)
> @@ -236,7 +236,7 @@
>          """
>
>          self.dirname = os.path.dirname (filename)
> -        self.gedcom_file = io.open(filename, "w", encoding='utf-8')
> +        self.gedcom_file = open(filename, "w")
>          self._header(filename)
>          self._submitter()
>          self._individuals()

Hmm. It is not wrong of Gramps to create its own output in a fixed known encoding, and utf-8 should be the one to use then.

Benny

> Which puts me in a situation where I read all GEDCOMs as rb, and write
> them as w. What logic is that?
>
> regards,
>
> Enno
|
From: John R. <jr...@ce...> - 2013-02-19 17:07:58
|
On Feb 19, 2013, at 8:39 AM, Benny Malengier <ben...@gm...> wrote: > > > > 2013/2/19 Enno Borgsteede <enn...@gm...> > Hi John, > > On Feb 18, 2013, at 2:17 PM, Enno Borgsteede <enn...@gm...> wrote: > > > >> On 18-02-13 21:30, John Ralls wrote: > >>> Then, so that it will work with both Py2 and Py3, the tests should be if line == b"\xef\xbb" > >> Something weird here: The Python site says that the b prefix is ignored in Python 2, but my experience shows that it definitely makes a difference in 2.7.3. > > Really? > Yes. And I guess that's because of the types that are imported in > Gramps, like Benny said. That's why your experiments in a python shell, > and the ones I did here before, yield different results. > > The latest code that I run here opens all files as "rb", and still needs > those b's before string literals. It works OK on Windows files (PAF and > RM) and GEDCOM files created by Gramps itself, in Linux Mint. The only > thing that I haven't tested yet is importing a GEDCOM created by a > Windows Gramps (3.4.2). > > I must add that in order to test GEDCOM files written by Gramps trunk, I > first had to change the export code, like this: > > Index: gramps/plugins/export/exportgedcom.py > =================================================================== > --- gramps/plugins/export/exportgedcom.py (revision 21372) > +++ gramps/plugins/export/exportgedcom.py (working copy) > @@ -236,7 +236,7 @@ > """ > > self.dirname = os.path.dirname (filename) > - self.gedcom_file = io.open(filename, "w", encoding='utf-8') > + self.gedcom_file = open(filename, "w") > self._header(filename) > self._submitter() > self._individuals() > > Hmm. It is not wrong of Gramps to create it's own output in a fixed known encoding, and utf-8 should be the one to use then. > > > Which puts me in a situation where I read all GEDCOMs as rb, and write > them as w. What logic is that? 
> I think the message here is that the gedcom import/export plugins are doing things which have side effects on reading and writing files. It might have seemed a brilliant hack at the time, and it might even have been necessary, but it's making it difficult now to figure out how to get everything wired up correctly for both Py2 and Py3. Regards, John Ralls |
From: John R. <jr...@ce...> - 2013-02-19 19:53:55
|
On Feb 19, 2013, at 5:59 AM, Enno Borgsteede <enn...@gm...> wrote:

> I must add that in order to test GEDCOM files written by Gramps trunk, I first had to change the export code, like this:
>
> Index: gramps/plugins/export/exportgedcom.py
> ===================================================================
> --- gramps/plugins/export/exportgedcom.py (revision 21372)
> +++ gramps/plugins/export/exportgedcom.py (working copy)
> @@ -236,7 +236,7 @@
>          """
>
>          self.dirname = os.path.dirname (filename)
> -        self.gedcom_file = io.open(filename, "w", encoding='utf-8')
> +        self.gedcom_file = open(filename, "w")
>          self._header(filename)
>          self._submitter()
>          self._individuals()

Is that because otherwise you get this?

Traceback (most recent call last):
  File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/gui/plug/export/_exportassistant.py", line 484, in do_prepare
    success = self.save()
  File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/gui/plug/export/_exportassistant.py", line 589, in save
    self.option_box_instance)
  File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 1447, in export_data
    ret = ged_write.write_gedcom_file(filename)
  File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 240, in write_gedcom_file
    self._header(filename)
  File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 323, in _header
    self._writeln(0, "HEAD")
  File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 282, in _writeln
    self.gedcom_file.write("%d %s\n" % (level, token))
TypeError: must be unicode, not str

The correct fix for that is:

--- a/gramps/plugins/export/exportgedcom.py
+++ b/gramps/plugins/export/exportgedcom.py
@@ -271,15 +271,15 @@ class GedcomWriter(UpdateCallback):
                 # make it unicode so that breakup below does the right thin.
                 text = cuni(text)
                 if limit:
-                    prefix = "\n%d CONC " % (level + 1)
+                    prefix = cuni("\n%d CONC " % (level + 1))
                     txt = prefix.join(breakup(text, limit))
                 else:
                     txt = text
-                self.gedcom_file.write("%d %s %s\n" % (token_level, token, txt)
+                self.gedcom_file.write(cuni("%d %s %s\n" % (token_level, token,
                 token_level = level + 1
                 token = "CONT"
         else:
-            self.gedcom_file.write("%d %s\n" % (level, token))
+            self.gedcom_file.write(cuni("%d %s\n" % (level, token)))

     def _header(self, filename):
         """

After applying that, I'm able to round-trip with the following error:

Warn: ADDR overwritten Line 19: 2 ADR1 Not Provided
Error: REPO '' (input as @@) not in input GEDCOM. Record with typifying attribute 'Unknown' created

The imported file was not self-contained.
To correct for that, 1 objects were created and
their typifying attribute was set to 'Unknown'.
Where possible these 'Unknown' objects are
referenced by note N0002.

I'll commit that change after I test it with Py3.

Regards,
John Ralls
|
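The TypeError in the traceback above arises because a handle opened with `io.open(..., encoding=...)` accepts only text, while the formatted line is a byte string on Python 2. A rough sketch of the idea behind the patch follows. `to_text` is a hypothetical stand-in for Gramps' `cuni`, whose actual definition is not shown in this thread, and `io.StringIO` stands in for the output file because it has the same text-only restriction.

```python
import io


def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def writeln(handle, level, token):
    # Coerce the whole formatted line, in the way the patch wraps it in cuni().
    handle.write(to_text("%d %s\n" % (level, token)))


buf = io.StringIO()  # text-only handle, like io.open(..., encoding="utf-8")
writeln(buf, 0, "HEAD")

try:
    buf.write(b"0 HEAD\n")  # a byte string is rejected by a text handle
    rejected = False
except TypeError:
    rejected = True
```

On Python 2, `bytes` is an alias for `str`, so `to_text` decodes ordinary string literals there and passes text through untouched on Python 3.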
From: Enno B. <enn...@gm...> - 2013-02-19 20:43:21
|
Hi John, > Is that because otherwise you get this? > Traceback (most recent call last): > File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/gui/plug/export/_exportassistant.py", line 484, in do_prepare > success = self.save() > File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/gui/plug/export/_exportassistant.py", line 589, in save > self.option_box_instance) > File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 1447, in export_data > ret = ged_write.write_gedcom_file(filename) > File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 240, in write_gedcom_file > self._header(filename) > File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 323, in _header > self._writeln(0, "HEAD") > File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 282, in _writeln > self.gedcom_file.write("%d %s\n" % (level, token)) > TypeError: must be unicode, not str > > The correct fix for that is: > > --- a/gramps/plugins/export/exportgedcom.py > +++ b/gramps/plugins/export/exportgedcom.py > @@ -271,15 +271,15 @@ class GedcomWriter(UpdateCallback): > # make it unicode so that breakup below does the right thin. > text = cuni(text) > if limit: > - prefix = "\n%d CONC " % (level + 1) > + prefix = cuni("\n%d CONC " % (level + 1)) > txt = prefix.join(breakup(text, limit)) > else: > txt = text > - self.gedcom_file.write("%d %s %s\n" % (token_level, token, txt) > + self.gedcom_file.write(cuni("%d %s %s\n" % (token_level, token, > token_level = level + 1 > token = "CONT" > else: > - self.gedcom_file.write("%d %s\n" % (level, token)) > + self.gedcom_file.write(cuni("%d %s\n" % (level, token))) > > def _header(self, filename): > """ That's it, right. 
I filed a bug report for that http://www.gramps-project.org/bugs/view.php?id=6461 and I plan to test your patch later tonight. I'm in a RootsDev conference right now. > After applying that, I'm able to round-trip with the following error: > Warn: ADDR overwritten Line 19: 2 ADR1 Not Provided > Error: REPO '' (input as @@) not in input GEDCOM. Record with typifying attribute 'Unknown' created > > The imported file was not self-contained. > To correct for that, 1 objects were created and > their typifying attribute was set to 'Unknown'. > Where possible these 'Unknown' objects are > referenced by note N0002. Ah, yes, I know those errors. I filed a report for ADDR stuff earlier myself, for 3.4. And I saw that the 'Unknown' objects may create fatal errors on the next GEDCOM export. I'll have to retest that, and see if there's a report for that. > I'll commit that change after I test it with Py3. Thanks. Enno |
From: John R. <jr...@ce...> - 2013-02-19 22:27:58
|
On Feb 19, 2013, at 12:43 PM, Enno Borgsteede <enn...@gm...> wrote: > Hi John, >> Is that because otherwise you get this? >> Traceback (most recent call last): >> File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/gui/plug/export/_exportassistant.py", line 484, in do_prepare >> success = self.save() >> File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/gui/plug/export/_exportassistant.py", line 589, in save >> self.option_box_instance) >> File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 1447, in export_data >> ret = ged_write.write_gedcom_file(filename) >> File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 240, in write_gedcom_file >> self._header(filename) >> File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 323, in _header >> self._writeln(0, "HEAD") >> File "/Users/john/Development/Gramps-Build/Gramps-svn/src/gramps-git/gramps/plugins/export/exportgedcom.py", line 282, in _writeln >> self.gedcom_file.write("%d %s\n" % (level, token)) >> TypeError: must be unicode, not str >> >> The correct fix for that is: >> >> --- a/gramps/plugins/export/exportgedcom.py >> +++ b/gramps/plugins/export/exportgedcom.py >> @@ -271,15 +271,15 @@ class GedcomWriter(UpdateCallback): >> # make it unicode so that breakup below does the right thin. 
>> text = cuni(text) >> if limit: >> - prefix = "\n%d CONC " % (level + 1) >> + prefix = cuni("\n%d CONC " % (level + 1)) >> txt = prefix.join(breakup(text, limit)) >> else: >> txt = text >> - self.gedcom_file.write("%d %s %s\n" % (token_level, token, txt) >> + self.gedcom_file.write(cuni("%d %s %s\n" % (token_level, token, >> token_level = level + 1 >> token = "CONT" >> else: >> - self.gedcom_file.write("%d %s\n" % (level, token)) >> + self.gedcom_file.write(cuni("%d %s\n" % (level, token))) >> def _header(self, filename): >> """ > That's it, right. I filed a bug report for that > > http://www.gramps-project.org/bugs/view.php?id=6461 > > and I plan to test your patch later tonight. I'm in a RootsDev conference right now. >> After applying that, I'm able to round-trip with the following error: >> Warn: ADDR overwritten Line 19: 2 ADR1 Not Provided >> Error: REPO '' (input as @@) not in input GEDCOM. Record with typifying attribute 'Unknown' created >> >> The imported file was not self-contained. >> To correct for that, 1 objects were created and >> their typifying attribute was set to 'Unknown'. >> Where possible these 'Unknown' objects are >> referenced by note N0002. > Ah, yes, I know those errors. I filed a report for ADDR stuff earlier myself, for 3.4. And I saw that the 'Unknown' objects may create fatal errors on the next GEDCOM export. I'll have to retest that, and see if there's a report for that. >> I'll commit that change after I test it with Py3. r21375, backported to gramps40 r21378, and I resolved the bug. Thanks for pointing it out, I hadn't made the connection. Regards, John Ralls |