From: Zsolt F. <zso...@no...> - 2007-01-31 11:11:25
|
Hi,

While thinking about the possibility of having formatted notes (i.e. rich text notes), we've bumped into the problem of speed again.

Once notes carry markup, some functionality will still require the clean text version of the note (e.g. filters, GEDCOM export, etc.). So far we've had two solutions in mind:

a. Keep only the markup version of the text and create the clean text on the fly. This solution has a speed cost, though: converting 100.000 notes with 30 markup tag pairs takes around 11 sec on 1.6GHz + 512MB. This could be unbearable for filtering... At the moment clearing the markup text is done with: "text = re.sub(r'(</?span.*?>)', '', markup_text)"

b. Keep both the markup and the clean text version of the note in memory in parallel. The clean text version could be created e.g. by a background process after loading the db. This solution trades the speed problem for a memory problem, i.e. all notes are doubled in memory.

Any better idea or comment is appreciated.

Cheers,
Zsolt
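For reference, the stripping step quoted above can be written as a small helper. This is only a sketch: the names SPAN_TAG and strip_markup are made up for illustration and are not taken from the gramps source, and precompiling the pattern is an assumption about how one might call it repeatedly, not a measured improvement.

```python
import re

# Same pattern as in the message, compiled once so it can be reused per note.
# SPAN_TAG and strip_markup are hypothetical names, not gramps identifiers.
SPAN_TAG = re.compile(r'</?span.*?>')


def strip_markup(markup_text):
    """Return the note text with every <span ...> and </span> tag removed."""
    return SPAN_TAG.sub('', markup_text)


note = ('Born in <span weight="bold">Budapest</span>, '
        'moved <span style="italic">abroad</span>.')
print(strip_markup(note))
```

Text that contains no span tags passes through unchanged, so the helper can be applied to any note.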
From: <ben...@ug...> - 2007-01-31 11:33:48
|
Zsolt,

I don't fully understand b. A grdb database is never in memory, only parts of it. Getting all notes and creating clean text in memory would be contrary to what the database is for: get what you need without keeping things in memory.

The only alternative I can think of is to store both the markup and the clean version in the database, in the same way people were asking to store namedisplay in the database. For notes this would mean my 540 Mb database would again be much bigger, so I would vote against it.

So, technically, I think you should go with a., but be intelligent about making the clean text. The problem I can think of is filters, but you could apply the filter as follows:

1/ apply the filter to the markup text and, only if you have a positive result, convert to clean text and redo the filter
2/ apply the filter to all allowed markup identifiers. If there is no positive, search in the markup text; if there is a positive, take the markup identifiers into account by only replacing those that give a positive out of the markup text.

In any case, nothing of the conversion to clean text should stay in memory.

About GEDCOM export: people can expect that to take some time, so the extra 5/11 seconds will be no big deal. You actually could give the option to export the marked-up text if GEDCOM doesn't crash on <>. However, how to handle import of <bold> in GEDCOM to non-markup code...

Benny
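Option 1/ from the message above could look roughly like the following for a simple substring filter. This is a sketch under assumptions: note_matches and SPAN_TAG are hypothetical names, the filter is a plain substring test rather than a real gramps filter rule, and the search term is assumed never to be split by a tag in the middle of a word (a term interrupted by markup would be missed by the first pass).

```python
import re

# Hypothetical names; the pattern is the one quoted earlier in the thread.
SPAN_TAG = re.compile(r'</?span.*?>')


def note_matches(markup_text, needle):
    """Two-pass substring filter: cheap test on markup, confirm on clean text."""
    # Pass 1: most notes fail here, so no conversion is done for them.
    if needle not in markup_text:
        return False
    # Pass 2: convert only this note and redo the test, which discards hits
    # that were found inside the tags themselves (e.g. attribute values).
    return needle in SPAN_TAG.sub('', markup_text)


note = '<span weight="bold">Anna</span> Kovacs'
print(note_matches(note, 'Anna'))
print(note_matches(note, 'bold'))
```

The clean text is a temporary inside the function, so nothing of the conversion stays in memory after the call.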
From: Zsolt F. <zso...@no...> - 2007-01-31 12:18:13
|
Benny,

> A grdb database is never in memory, only parts of it. Getting all notes and
> creating clean text in memory would be contrary to what the database is for:
> get what you need without keeping things in memory.

I guess you're right and I misunderstood something. In that case, however, creating a clean text version of the notes only in memory means on-the-fly conversion again.

> I can only think of alternative that you store in the database the markup and
> the clean version, in the same way people were asking to store namedisplay in
> the database. For notes the above would mean my 540 Mb database would again be
> much bigger, so I would vote against it.

I would avoid this option too.

> So, technically, you should go with a. I think, but be intelligent on
> making the clean text:
> The problem I can think of are filters, but you could apply the filter as
> follows:
> 1/apply filter to the markup text and only if you have a positive
> result convert to clean text and redo the filter
> 2/apply filter to all allowed markup identifiers. If no positive, search in
> markup text, if positive, take the markup identifiers into account by only
> replacing those that give positive out of the markup text.

I believe 1. is a good and quick solution for filters.

> You actually could give the option to export the
> marked up code if GEDCOM doesn't crach on <>. However, how to handle import of
> <bold> in GEDCOM to non markup code...

In my understanding GEDCOM export is meant for sharing a database with other people who use different genealogy software. Including the pango markup tags in GEDCOM would therefore show up as gibberish in that other software.

Zsolt
From: Richard T. <rjt...@th...> - 2007-01-31 12:32:47
|
On Wednesday 31 January 2007 11:33, ben...@ug... wrote:
> I can only think of alternative that you store in the database the markup
> and the clean version, in the same way people were asking to store
> namedisplay in the database. For notes the above would mean my 540 Mb
> database would again be much bigger, so I would vote against it.

Why not store the formatted version only _if_ the user actually formatted anything? It should be simple to check whether any formatting was used; if not, the string can simply be stored in the unformatted field and a Null value put in the formatted field. Then those who do not use formatting will not get a bigger database.

Richard
From: Zsolt F. <zso...@no...> - 2007-01-31 12:45:28
|
Richard,

> Why not only store the formatted version _if_ the user actually formatted
> anything. It should be simple to check if any formating was used and if not
> the string can simply be stored in the unformatted field and a Null value can
> be put in the formatted field.

I'm not sure I follow you. Or maybe I was not clear enough. Formatting means starting to use pango markup tags in notes (via a proper GUI, of course). It might be clearer to use the term "rich text" notes. The clean text version of a formatted note means the pure text without the pango markup.

Thus, from the DB point of view, a formatted and an unformatted note are both still a string, and unless a user formats a specific note, nothing will be stored differently.

Maybe my terms are confusing with the existing flowed/formatted properties. Sorry.

Zsolt
From: Don A. <don...@co...> - 2007-01-31 12:52:10
|
Zsolt,

I don't want to store pango strings in the notes. I would rather use the more standard <b>, <i>, and <u>. These are already understood by most of the document interfaces. They should also be simpler to strip out - constants are always faster than regular expressions.

We can convert from pango to the more HTMLish items in the note editor. When we get the string, we convert it to our format before saving it in the db. Doing it there, where everything is interactive, means the user would not notice a delay of a fraction of a second.

Don
From: Zsolt F. <zso...@no...> - 2007-01-31 14:49:23
|
Don,

> I don't want to store pango strings in the notes. I would rather use the
> more standard <b>, <i>, and <u>. These are already understood by most of
> the document interfaces.

Would you like to do this only because it is already understood by many output interfaces, or because you're concerned about the speed of report creation if tags have to be converted? I don't think we can find a common tagging syntax that would be understood by all output formatters, so some conversion may always be required, which in turn will always have some speed cost. I agree, however, that we should use the tagging syntax in gramps that is closest to all the others, to reduce the amount of conversion needed. Is the above the best choice? If yes, fine by me.

> They should also be simpler to strip out -
> constants are always faster than regular expressions.

What exactly do you have in mind? I tried to remove the pango tags using constants with str.replace(), and it gave a worse result than the regexp.

Cheers,
Zsolt
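The str.replace() versus regexp question can be checked on any given machine with a small timing harness like the one below. It is a sketch only: it assumes the fixed <b>, <i>, <u> tag set Don mentions, all names (TAGS, TAG_RE, strip_with_replace, strip_with_regex, sample) are invented for the example, and the timings it prints depend on the machine and the Python version, so no result is claimed here.

```python
import re
import timeit

# Assumed tag set: the three HTML-like tags proposed in the thread.
TAGS = ('<b>', '</b>', '<i>', '</i>', '<u>', '</u>')
TAG_RE = re.compile(r'</?[biu]>')


def strip_with_replace(text):
    """Remove each constant tag with one str.replace() pass per tag."""
    for tag in TAGS:
        text = text.replace(tag, '')
    return text


def strip_with_regex(text):
    """Remove all tags in a single regular-expression pass."""
    return TAG_RE.sub('', text)


sample = 'A <b>bold</b>, an <i>italic</i> and an <u>underlined</u> word. ' * 10

# Print the time for 1000 calls of each variant on the same sample note.
for func in (strip_with_replace, strip_with_regex):
    seconds = timeit.timeit(lambda: func(sample), number=1000)
    print(func.__name__, seconds)
```

Note that the replace variant scans the text once per tag, while the regexp variant scans it once in total, which may be one reason the constants did not come out ahead in the test described above.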
From: Alex R. <sh...@gr...> - 2007-01-31 17:21:45
|
Zsolt,

On Wed, 2007-01-31 at 13:45 +0100, Zsolt Foldvari wrote:
> > Why not only store the formatted version _if_ the user actually formatted
> > anything. It should be simple to check if any formating was used and if not
> > the string can simply be stored in the unformatted field and a Null value can
> > be put in the formatted field.
> I'm not sure I follow you. Or maybe I was not clear enough.

I think what Richard proposes is this: instead of storing the note's single text field, 'blah', we could store a tuple of either ('blah',None) or (None,'<b>blah</b>'), depending on whether we have formatting or not.

In RelLib, it would correspond to the two different fields, just as you propose, but only one will be stored in the db. Then for notes that don't have formatting, we don't have to do any conversion. For those that do have formatting, we convert on the fly when we need plain text.

This is all in addition to the flowed/formatted attribute.

I think I like this approach,
Alex

--
Alexander Roitman http://www.gramps-project.org
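The tuple idea above can be sketched as follows. Everything here is illustrative: pack_note, plain_text and the TAG pattern are hypothetical and are not RelLib code, and detecting formatting by searching for a tag is an assumption made for the example (the note editor could equally well report whether the user applied any formatting).

```python
import re

# Hypothetical generic tag pattern, covering both <b>...</b> and <span ...>.
TAG = re.compile(r'</?[a-zA-Z][^>]*>')


def pack_note(text):
    """Return (plain, markup) with exactly one of the two slots filled."""
    if TAG.search(text):
        return (None, text)
    return (text, None)


def plain_text(stored):
    """Return the clean text, converting only when markup was stored."""
    plain, markup = stored
    if markup is None:
        return plain  # unformatted note: no conversion needed
    return TAG.sub('', markup)  # formatted note: strip tags on the fly


print(pack_note('blah'))
print(pack_note('<b>blah</b>'))
print(plain_text(pack_note('<b>blah</b>')))
```

Only one copy of each note is stored either way, so the database does not grow for users who never format anything.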
From: Zsolt F. <zso...@no...> - 2007-01-31 21:20:50
|
Alex,

> I think what Richard proposes is this: instead of storing note's
> single text field, 'blah', we could store a tuple of
> either ('blah',None) or (None,'<b>blah</b>'),
> depending on whether we have formatting or not.
>
> In RelLib, it would correspond to the two different fields,
> just as you propose, but only one will be stored in the db.
> Then for notes that don't have the formatting, we don't
> have to do any conversion. For those that do have formatting,
> we convert on the fly when we need plain text.

I don't see a big difference. With this approach we simply don't even try to convert what is not formatted. However, trying to delete tags from already clean text takes quite a short time (for the same 100.000 notes I used before, but now without the tags, it takes a bit less than 3 sec). Nevertheless it's undoubtedly an improvement.

Zsolt
From: Eero T. <ee...@us...> - 2007-01-31 18:50:54
|
Hi,

On Wednesday 31 January 2007 19:20, Alex Roitman wrote:
> > > Why not only store the formatted version _if_ the user actually
> > > formatted anything. It should be simple to check if any formating was
> > > used and if not the string can simply be stored in the unformatted
> > > field and a Null value can be put in the formatted field.
> >
> > I'm not sure I follow you. Or maybe I was not clear enough.
>
> I think what Richard proposes is this: instead of storing note's
> single text field, 'blah', we could store a tuple of
> either ('blah',None) or (None,'<b>blah</b>'),
> depending on whether we have formatting or not.
>
> In RelLib, it would correspond to the two different fields,
> just as you propose, but only one will be stored in the db.
> Then for notes that don't have the formatting, we don't
> have to do any conversion. For those that do have formatting,
> we convert on the fly when we need plain text.
>
> This is all in addition to flowed/formatted attribute.

Wouldn't it be easier to have just another boolean attribute for whether the text is "styled" or not?

- Eero
From: Alex R. <sh...@gr...> - 2007-01-31 20:03:28
|
On Wed, 2007-01-31 at 21:00 +0200, Eero Tamminen wrote:
> > I think what Richard proposes is this: instead of storing note's
> > single text field, 'blah', we could store a tuple of
> > either ('blah',None) or (None,'<b>blah</b>'),
> > depending on whether we have formatting or not.
>
> Wouldn't it be easier to have just another boolean attribute for whether
> text is "styled" or not?

Sure, this is more or less the same.

Alex

--
Alexander Roitman http://www.gramps-project.org