From: Mark C. <mar...@gm...> - 2011-04-15 17:52:17
|
Hi; Would it be at all useful to submit .wpd files that don't open correctly with the latest version of libwpd? I work in an organization that makes extensive use of WP 12 and can get my hands on many such documents. |
From: Mark C. <mar...@gm...> - 2011-04-18 15:18:49
|
BTW, I have libwpd 0.9.1 and libwpg 0.2.0 from Arch Linux. Is that new enough? Writerperfect I'll have to install myself (0.8.0).

Mark

--
___________________________
| Coolen Software Solutions
| +1.519.652.9378
| mar...@gm...
|___________________________ |
From: Fridrich S. <fri...@bl...> - 2011-04-19 06:22:01
|
Hi, Mark,

On 18/04/11 17:18, Mark Coolen wrote:
> BTW, I have libwpd 0.9.1 and libwpg 0.2.0 from Arch Linux. Is that new
> enough? Writerperfect I'll have to install myself (0.8.0).

OK, a fix landed in libwpd master after 0.9.1 that solved some issues in the WP6 parser with loading documents that have a mildly corrupted prefix. It would be good to check whether that fix already solves the issue before reporting, I guess.

F. |
From: Edward M. <em...@co...> - 2011-04-19 12:58:35
|
Hello,

Recently I was asked to help someone convert hundreds of WPMac files that include Japanese kanji, files that were created on old Macs that had the Japanese Language Kits installed.

It seems - I could be wrong - that libwpd doesn't convert the characters in those files. The method I found for converting them was a bit roundabout:

Use a PowerPC Mac that runs OS X 10.4 and "Classic" with the Japanese Language Kit installed. Open the WPMac files in WPMac in Classic. Copy the contents of the file to the clipboard. Paste the contents of the clipboard into OS X's TextEdit or any other Unicode-aware Mac application. Save the resulting file as an RTF or DOC file. The resulting file opens correctly in LibreOffice, Pages, Word, etc.

This method obviously requires obsolete hardware and software. I would guess that it would require an enormous amount of effort to support double-byte CJK and other WorldScript-based scripts in libwpd, and that the potential need for it is far too small to justify the effort. But is this something that might someday be possible?

Edward Mendelson
Contributing Editor
PC Magazine |
From: William L. <wr...@gm...> - 2011-04-19 13:16:25
|
On Tue, Apr 19, 2011 at 8:58 AM, Edward Mendelson <em...@co...> wrote:
> This method obviously requires obsolete hardware and software. I would
> guess that it would require an enormous amount of effort to support
> double-byte CJK and other WorldScript-based scripts in libwpd, and that the
> potential need for it is far too small to justify the effort. But is this
> something that might someday be possible in the future?

Actually, it's not really that difficult. Unless Japanese is dramatically different from what we've seen so far, all we should need to make this conversion work is a table mapping from WordPerfect extended characters to their Unicode equivalents. Over the years we've expanded support for languages from only plain Latin to relatively obscure ones like Tibetan, courtesy of mappings submitted by various people.

If you don't have the expertise to create such a mapping yourself, we could probably derive one from (1) a WP document containing all the characters in a Japanese script and (2) the same document converted to RTF/DOC. If you're interested in producing something like this, let us know!

--
William Lachance
wr...@gm... |
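The table lookup William describes, and his idea of deriving a table from a WP document plus its converted counterpart, might be sketched roughly as follows. This is only an illustration in Python, not libwpd code: the (charset, index) pairs, the table contents, and the function names `wp_to_unicode` and `derive_table` are all assumptions, and the derivation assumes the two documents line up character for character.

```python
from typing import Dict, Iterable, Tuple

# A WordPerfect extended character is identified by a (character set, index) pair.
WpCode = Tuple[int, int]

REPLACEMENT = "\ufffd"  # U+FFFD, used when no mapping is known

# Placeholder entries for illustration only; these are NOT values taken from
# libwpd. A real table would come from a contributed mapping.
EXAMPLE_TABLE: Dict[WpCode, int] = {
    (11, 0): 0x3042,
    (11, 1): 0x3044,
}


def wp_to_unicode(code: WpCode, table: Dict[WpCode, int] = EXAMPLE_TABLE) -> str:
    """Look up one WP extended character; fall back to U+FFFD if unmapped."""
    code_point = table.get(code)
    return chr(code_point) if code_point is not None else REPLACEMENT


def derive_table(wp_codes: Iterable[WpCode], converted_text: str) -> Dict[WpCode, int]:
    """Pair each WP code with the character at the same position in the
    converted text (hypothetical helper; assumes a one-to-one alignment)."""
    codes = list(wp_codes)
    if len(codes) != len(converted_text):
        raise ValueError("documents do not line up character for character")
    return {code: ord(char) for code, char in zip(codes, converted_text)}
```

A plain dictionary keeps the sketch short; a table compiled into a C++ library would more likely be a static array indexed by character number.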
From: Edward M. <em...@co...> - 2011-04-19 14:08:17
|
On 4/19/2011 9:16 AM, William Lachance wrote:
> If you don't have the expertise to create such a mapping yourself, we
> could probably derive one from (1) a WP document containing all the
> characters in a Japanese script and (2) one converted to RTF/DOC. If
> you're interested in producing something like this, let us know!

I created a WPMac 3.5e file with a few lines of Japanese kanji. The text is nonsense - I simply typed in random characters because I know about ten words of Japanese and don't know how to type them. But it should give you an idea of how kanji is stored in WPMac files. The file is here:

http://dl.dropbox.com/u/271144/Kanji.wpmac

I'm not expert enough in the WPMac file format to learn anything from it, but perhaps it may be useful to someone who knows a lot more than I do. When I open it in Writer, it's blank except for a single letter "t".

Edward Mendelson
Contributing Editor
PC Magazine |
From: Fridrich S. <fri...@bl...> - 2011-04-19 14:28:53
|
Hello, Edward,

On 19/04/11 16:07, Edward Mendelson wrote:
> I created a WPMac 3.5e file with a few lines of Japanese kanji. The text
> is nonsense - I simply typed in random characters because I know about
> ten words of Japanese and don't know how to type them. But it should
> give you an idea of how kanji is stored in WPMac files. The file is here:
> http://dl.dropbox.com/u/271144/Kanji.wpmac
> I'm not expert enough in the WPMac file format to learn anything from
> it, but perhaps it may be useful to someone who knows a lot more than I
> do. When I open it in Writer, it's blank except for a single letter "t".

OK, they are all in the C8 function. Now if only we could find documentation about the double-byte Mac Script character sets.

F. |
From: Smokey A. <alq...@ar...> - 2011-04-19 18:16:12
|
At 4:28 PM +0200 on 4/19/11, Fridrich Strba wrote:
>OK, they are all in the C8 function. Now only if we could find a
>documentation about the double-byte Mac Script character sets.

Fridrich,

I think you want http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT (and other files in http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/ for other Mac encodings). (Someone else please commit this link to memory for me; I knew it existed but couldn't remember where to find it anymore :-( )

I thought I had remembered you telling me years ago that the WorldScript stuff was stored in the resource fork and was thus near-impossible for libwpd to use; I'm happy to see that appears not to be the case here :-) (maybe it was just the single-byte scripts?)

Best,
Smokey (still here, very busy)

--
Smokey Ardisson
alq...@ar...
http://www.ardisson.org/
------------------------------------------
"He is a fool who has forgotten what became of his ancestry seven generations before him and who does not care what will become of his progeny seven generations after him."
--Kazakh Proverb |
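For readers who want to experiment with the mapping files Smokey links to, here is a rough Python sketch of a parser. It assumes the layout those files describe in their own headers: a hex Mac code, then a hex Unicode code point or a `+`-joined sequence (sometimes with a leading tag such as `<LR>`), then a `#` comment. The sample lines are illustrative only, the last one is made up, and `parse_mapping` is a hypothetical name.

```python
from typing import Dict, Tuple

# Illustrative lines in the style of the mapping files; the last entry is
# invented purely to show a multi-code-point sequence.
SAMPLE = (
    "# Illustrative lines only; see the real files for actual data.\n"
    "0x8140\t0x3000\t# IDEOGRAPHIC SPACE\n"
    "0x889F\t0x4E9C\t# <CJK>\n"
    "0xFFFF\t0x0041+0x0301\t# made-up entry showing a two-code-point sequence\n"
)


def parse_mapping(text: str) -> Dict[int, Tuple[int, ...]]:
    """Return {mac_code: (unicode code points...)} for each data line."""
    table: Dict[int, Tuple[int, ...]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        mac_code = int(fields[0], 16)
        # Skip tags such as <LR>; keep only the hex code points.
        parts = [p for p in fields[1].split("+") if not p.startswith("<")]
        table[mac_code] = tuple(int(p, 16) for p in parts)
    return table
```

Returning a tuple for every entry keeps single code points and sequences in one shape, so a caller can decide separately how to handle the multi-code-point cases.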
From: Fridrich S. <fri...@bl...> - 2011-04-19 23:53:44
Attachments:
Kanji.wpmac.html
|
Hello, Smokey, my old buddy. Nice to know you are alive and kicking :)

On 19/04/11 19:40, Smokey Ardisson wrote:
> http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT (and other
> files in http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/ for other Mac
> encodings). (Someone else please commit this link to memory for me; I
> knew it existed but couldn't remember where to find it anymore :-( )

This is exactly what is needed. Thanks for thinking about it. It is very helpful.

> I thought I had remembered you telling me years ago that the WorldScript
> stuff was stored in the resource fork and was thus near-impossible for
> libwpd to use; I'm happy to see that appears not to be the case here :-)
> (maybe it was just the single-byte scripts?)

Ah, if I remembered all the stupid things I ever said, I would need a head the size of a 40-wheeler. There were other things in the resource fork that we now try to read. Like pictures.

But, just to tell you all: I had another try at this. After some hours of grep, sed, awk and even a C++ generator, I added about 5000 lines of conversion tables to libwpd_internal.cpp, and the document now looks like the attached one.

Judge for yourselves.

F. |
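The kind of table generation Fridrich mentions could look something like the sketch below. It is written in Python rather than the grep/sed/awk/C++ mix he used, and it is only a guess at the general idea: the array name, element type, and layout are hypothetical and do not reflect what libwpd_internal.cpp actually contains.

```python
from typing import Dict


def emit_cpp_table(name: str, mapping: Dict[int, int], first: int, last: int) -> str:
    """Emit a C++ array covering codes first..last (inclusive).

    Codes missing from `mapping` are written as 0x0000. The array name and
    the `unsigned short` element type are assumptions for this sketch.
    """
    lines = [f"static const unsigned short {name}[] = {{"]
    for code in range(first, last + 1):
        lines.append(f"    0x{mapping.get(code, 0):04X}, // 0x{code:04X}")
    lines.append("};")
    return "\n".join(lines)
```

For example, `emit_cpp_table("exampleTable", {0x8140: 0x3000}, 0x8140, 0x8141)` produces a two-entry array whose second slot is the 0x0000 filler.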
From: Fridrich S. <fri...@bl...> - 2011-04-15 21:15:50
|
Hello, Mark,

On 15/04/2011 19:51, Mark Coolen wrote:
> Would it be at all useful to submit .wpd files that don't open
> correctly with the latest version of libwpd?

Very useful indeed, provided that they don't open correctly with the master branch of the libwpd/libwpg/writerperfect mix. I fixed some document-loading problems in recent weeks, so some problems might be duplicates.

> I work in an organization that makes extensive use of WP 12 and can
> get my hands on many such documents.

Very nice! We normally like to add a sample document corresponding to a problem we fixed to our regression testing suite <http://libwpd.git.sourceforge.net/git/gitweb.cgi?p=libwpd/libwpd-regression;a=summary> to avoid the same class of problems biting us ever again. So, if you are submitting a file, it would be nice to specify whether we are legally entitled to commit it there. We like to have the problematic document public in order to lower the bus factor as much as possible.

Thanks for your willingness to help.

Cheers

Fridrich |
From: Mark a. N. <the...@gm...> - 2011-04-18 12:38:42
|
I'll see what I can do. I'll make sure I have the latest version to test on. I'd like to see our organization use LibreOffice as a replacement for WP, so libwpd et al are an important part of this.

Mark |
From: Mark C. <mar...@gm...> - 2011-04-18 15:06:58
|
I'll see what I can do. I'll make sure I have the latest version to test on. I'd like to see our organization use LibreOffice as a replacement for WP, so libwpd et al are an important part of this.

Mark

--
___________________________
| Coolen Software Solutions
| +1.519.652.9378
| mar...@gm...
|___________________________ |
From: Edward M. <em...@co...> - 2011-04-20 00:46:36
|
On 4/19/2011 7:53 PM, Fridrich Strba wrote:
> But, just to tell you all that I had another try on this. And after some
> hours of grep, sed, awk and even a c++ generator, I added about 5000
> lines of conversion tables to libwpd_internal.cpp and the document is
> now looking like the attached one.
>
> Judge by yourselves.

Fridrich,

Amazing!! Deeply impressed!!

May I ask the person I've been working with for a few test files that I can send along?

Edward Mendelson
Contributing Editor
PC Magazine |
From: Edward M. <em...@co...> - 2011-04-20 00:48:23
|
On 4/19/2011 7:53 PM, Fridrich Strba wrote:
> But, just to tell you all that I had another try on this. And after some
> hours of grep, sed, awk and even a c++ generator, I added about 5000
> lines of conversion tables to libwpd_internal.cpp and the document is
> now looking like the attached one.

Not to make your life more difficult, but would it now be possible to add the CHINSIMP and CHINTRAD and KOREAN to the tables? They're all here:

http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/

Plus a few others!

Edward Mendelson
Contributing Editor
PC Magazine |
From: Fridrich S. <fri...@bl...> - 2011-04-20 07:41:43
|
Edward,

On 20/04/2011 02:48, Edward Mendelson wrote:
> Not to make your life more difficult, but would it now be possible to
> add the CHINSIMP and CHINTRAD and KOREAN to the tables?

They are all in already. All CJK two-byte codes are inside
http://libwpd.git.sourceforge.net/git/gitweb.cgi?p=libwpd/libwpd;a=commit;h=1814c4270a1fe13c6ff569985b62349d1707b6dd

;)

f. |
From: Smokey A. <alq...@ar...> - 2011-04-21 06:32:15
|
At 1:53 AM +0200 on 4/20/11, Fridrich Strba wrote:
>On 19/04/11 19:40, Smokey Ardisson wrote:
>>http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT (and other
>>files in http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/ for other Mac
>>encodings). (Someone else please commit this link to memory for me; I
>>knew it existed but couldn't remember where to find it anymore :-( )
>
>This is exactly what is needed. Thanks for thinking about it. It is
>very helpful.

:-) I'm glad I was able to remember/find it again.

>But, just to tell you all that I had another try on this. And after
>some hours of grep, sed, awk and even a c++ generator, I added about
>5000 lines of conversion tables to libwpd_internal.cpp and the
>document is now looking like the attached one.
>
>Judge by yourselves.

Amazing work again!

At 9:41 AM +0200 on 4/20/11, Fridrich Strba wrote:
>Edward,
>On 20/04/2011 02:48, Edward Mendelson wrote:
>> Not to make your life more difficult, but would it now be possible to
>> add the CHINSIMP and CHINTRAD and KOREAN to the tables?
>
>They are all in already. All CJK two-byte codes are inside
>http://libwpd.git.sourceforge.net/git/gitweb.cgi?p=libwpd/libwpd;a=commit;h=1814c4270a1fe13c6ff569985b62349d1707b6dd

Just for my own information, your recent commits have only been for the double-byte languages; there's still no support for the single-byte ones in WP-Mac files, right?

Enjoy the rest of your vacation!

Smokey

--
Smokey Ardisson
alq...@ar...
http://www.ardisson.org/ |
From: Fridrich S. <fri...@bl...> - 2011-04-21 09:49:27
|
Smokey,

On 21/04/2011 08:20, Smokey Ardisson wrote:
> Just for my own information, your recent commits have only been for the
> double-byte languages; there's still no support for the single-byte ones
> in WP-Mac files, right?

I have a problem with this at the moment, because the single-byte scripts are basically in the same range for each charset. I don't know how those are placed in the script function. It is possible that for each group there is one byte indicating which group it is, followed by the code, but I am really not sure. If this were the case, we could probably convert them somehow.

For the other "extended character function", we do the following. We first read the one byte of Mac character code and interpret it using the MacRoman table. If this character is not valid (there are special codes for that), we read the two bytes of the WP5 charset/char pair and interpret them the same way as we do for WP5 documents. I originally implemented this using only the WP5 part but, if I recall correctly, we had some problems with particular Mac characters that were misinterpreted (probably a bug in WP3). So we reverted to giving priority to the Mac character.

It is nevertheless conceivable, although I don't have empirical evidence for it, that if you write a document using another system language, the Mac character will be from a different set. There, we cannot do much unless we somehow get the information about which charset those characters are from. That is also why I was saying that we could use some additional information about what the WorldScript encoding looks like.

So, unless we know which encoding the document uses for the Mac character part of the 0xC0 functions, we are a bit stuck. I almost feel like converting the WP5 pair first and using the Mac char only if the WP5 pair gives no result, since that might cover more cases correctly, even though it might mess up some 2-3 acute vs. grave accents. But the proper way would be to get the information from the docs about which encoding one uses. Anybody volunteering?

F.

--
Please avoid sending me Word, Excel or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html |
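The two decoding orders Fridrich weighs here can be illustrated with a small Python sketch. It is not the libwpd implementation: the set of "invalid" Mac codes, the WP5 table contents, and the `decode_extended` signature are all assumptions made up for the example, since the thread does not give the actual special codes or byte layout. Python's built-in `mac_roman` codec stands in for the MacRoman table.

```python
from typing import Dict, Optional, Tuple

REPLACEMENT = "\ufffd"

# Hypothetical: the thread says special codes mark an invalid Mac character
# but does not say which ones, so 0x00 is only a stand-in.
INVALID_MAC_CODES = frozenset({0x00})

# Placeholder WP5 (charset, char) -> Unicode entry, not taken from libwpd.
EXAMPLE_WP5_TABLE: Dict[Tuple[int, int], int] = {(1, 5): 0x00E9}


def decode_extended(mac_byte: int,
                    wp5_pair: Tuple[int, int],
                    wp5_table: Dict[Tuple[int, int], int] = EXAMPLE_WP5_TABLE,
                    prefer_wp5: bool = False) -> str:
    """Decode one extended character from its Mac byte and WP5 pair.

    By default the Mac byte wins (the behaviour described as current);
    prefer_wp5=True tries the WP5 pair first (the alternative considered).
    """
    mac_char: Optional[str] = None
    if mac_byte not in INVALID_MAC_CODES:
        mac_char = bytes([mac_byte]).decode("mac_roman")

    code_point = wp5_table.get(wp5_pair)
    wp5_char = chr(code_point) if code_point is not None else None

    first, second = (wp5_char, mac_char) if prefer_wp5 else (mac_char, wp5_char)
    return first or second or REPLACEMENT
```

With the placeholder table, `decode_extended(0x80, (1, 5))` yields the MacRoman character for 0x80, while passing `prefer_wp5=True` yields the WP5 mapping instead; the flag makes it easy to compare both orders on the same test documents.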
From: Edward M. <em...@co...> - 2011-04-21 13:46:33
|
On 21 Apr 2011, at 9:37 AM, Fridrich Strba wrote:
> I see in a manual list of script IDs ranging from 0x00 for Roman to 0x20
> for Symbol. Now I would like to see how an eventual arabic text looks in
> a wp document. This could make us somehow understand how to extract this
> information from the 0xC8 function.

I can try to enter some arbitrary Arabic characters into a WPMac file later today and post it to the group. Is that what would be useful?

Or do you want WPDOS 5.1 Arabic? That's ordinary character set 13/14, so I assume you already handle it.

Some sample Arabic WPDOS files that I use for testing are here:

http://www.un.org/popin/unpopcom/32ndsess/gass.htm

The PDF files made from the WP files (on the same page) are of course NOT Unicode-based Arabic, but simply arbitrary symbols.

Edward Mendelson
Contributing Editor
PC Magazine |
From: Fridrich S. <fri...@bl...> - 2011-04-21 14:56:33
|
Thanks, Edward,

On 21/04/11 15:46, Edward Mendelson wrote:
> Some sample Arabic WPDOS files that I use for testing are here:
> http://www.un.org/popin/unpopcom/32ndsess/gass.htm

Here I realized that we were not doing any conversion of Arabic for WP5. Now I have found this program, http://www.lionscribe.com/downloads/files/wp2rtf/wp2rtf.zip, which has some test files for Arabic and for Hebrew. I "implemented" Arabic for WP5. Nevertheless, not properly.

Smokey, could you go through the RTF files, compare them with what wpd2text makes out of the *WP files, and point out to me which glyphs are not correct? And eventually map them to Unicode (but I can do that when I have time)?

Also, could someone who is good at Hebrew please map the missing Hebrew characters for me? I might try, but I am not sure how good it will be, because I will basically be comparing two pictures.

F. |
From: Edward M. <em...@co...> - 2011-04-21 15:14:29
|
On 4/21/2011 10:56 AM, Fridrich Strba wrote:
> Also, someone that is good at hebrew, could you please map the missing
> hebrew characters for me? I might try, but not sure how good it will be
> because I will basically compare two pictures.

I don't know any Hebrew at all, but the character set was small enough for me to map the characters to their Unicode equivalents. If you go to this page:

http://www.columbia.edu/~em36/wpdos/arabicandhebrew.html#hebrewpspdf

and download my WP51HEPS.EXE, you will find an .ALL file named WP51HEPS.ALL. There are character maps in that file that map the WP Hebrew characters to the glyphs in the PostScript fonts also included in the file. As you'll see, these are mapped either to their Adobe "afii" numbers (easily converted to Unicode) or to their Unicode numbers.

There are, if I remember correctly, three characters that aren't directly matched. A Hebrew scholar told me that three characters in the WP Hebrew set are "ghosts" - that is, they were combinations of a consonant and vowel markings that in fact never occur in any real Hebrew text. I don't remember what they are, but I think I tried to find some way to create them by combining real Unicode/Adobe characters, just to be complete.

Please let me know if I can clarify this further. It's been a few years since I did this. I made a start on Arabic, but the task was too complex and there seemed to be too few users who cared.

Edward Mendelson
Contributing Editor
PC Magazine |
From: Edward M. <em...@co...> - 2011-04-21 15:53:51
|
On 4/21/2011 10:56 AM, Fridrich Strba wrote:
> Here I realized that we were not doing any conversion of arabic for WP5.
> Now. I found this program here
> http://www.lionscribe.com/downloads/files/wp2rtf/wp2rtf.zip that has
> some test files for arabic and for hebrew. I "implemented" arabic for
> WP5. Nevertheless, not properly.

I have the LionScribe utility, and used it to convert the CHARACTR.DOC from Arabic WP into RTF format. I've posted it here:

http://dl.dropbox.com/u/271144/CHARACTR.DOC.rtf

I have no way of knowing how accurate the results are, but I'm sure they're better than anything anyone else has tried.

Edward Mendelson
Contributing Editor
PC Magazine |
From: Smokey A. <alq...@ar...> - 2011-04-23 06:19:10
|
At 11:23 PM +0200 on 4/21/11, Fridrich Strba wrote:
>On 21/04/11 21:52, Smokey Ardisson wrote:
>> Do we have reason to believe that the mapping in WP5 is different from
>> the ones I contributed for WP6?
>
>Yes, we have a strong reason to believe that not all are the same. If
>you take the arabic test in the zip I pointed to and convert with a
>fresh checkout of libwpd to html, you will see remarkable differences.
>The advantage though is that that test document is actually having the
>names of the characters near to them, which will make the janitorial
>work a bit easier.

I can't believe that a company would move around blocks of characters from one version of the software to the next, especially when those characters' codes are the canonical ways of identifying them :-P :sigh: However, there were not too many differences/corrections between what's in libwpd_internal.cpp right now for set 13.

Can someone (Edward?) extract the full character sets 13 and 14 from "TEST.WP" from the wp2rtf zip and either run those through wp2rtf or generate PDFs that show the Arabic characters for me? The "TEST_ARA" pair of documents didn't include the first dozen codepoints in set 13 and don't include any of set 14, so I don't have a visual reference :-(

I have set 13 all fixed (with the exception of those first dozen that I don't know what they look like and which have useless names), within the confines of what's available in Unicode and with the limitation of a single-codepoint-to-single-codepoint mapping (wp2rtf produces more "accurate" mappings of some fancy WP codepoints that don't have single matching Unicode glyphs by using two characters).

I think new additions to Unicode since version 4.1 (when I did the WP6 mappings) will let us successfully map some of the additional random diacritical marks WP used to Unicode; if so, I can also fix the old WP6 mappings for those.

>> Looking back through my files from that era, it looks like we ended up
>> punting on "true" conversion for WP-Arabic and ended up mapping to
>> Unicode presentation forms (except for the "stand-alone" forms of the
>> letters, which got mapped to the normal, combining Unicode characters),
>> so that the result was reverse-ordered, unconnected text (I think you
>> were going to use Fribidi to try and reorder, but I remember vaguely
>> that that effort had some problems and you intended to fix things
>> elsewhere in another manner). At some point in the future, we might
>> want to revisit that and map all of the WP-Arabic codepoints to normal,
>> combining forms where possible for WP5/WP6.
>
>Yeah, I tried to do that, but it is a bit too complicated and the
>reverse bidi algorithm is not even defined. Basically, you would have to
>have two marks like OOXML has, to tell that this and this span is part
>of the same run, because if not, you will not have a way to reorder
>spans with different character properties (bold, italic) but part of the
>same phrase. It was a huge mess to do and I really did not have the
>courage to dive into it.

Just pretend I never said that paragraph you replied to; I wasn't completely awake when I wrote that and I had forgotten the key bit of the problem: non-Mac WP required you to enter characters backwards/LTR to begin with (I was thinking for some reason that the characters were in the correct order in the file and would work properly in a bidi-aware word processor if only we'd mapped to the normal codepoints) :-P

Smokey

--
Smokey Ardisson
alq...@ar...
http://www.ardisson.org/ |
From: Fridrich S. <fri...@bl...> - 2011-04-23 11:23:42
|
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 23/04/11 08:18, Smokey Ardisson wrote:
> I can't believe that a company would move around blocks of characters
> from one version of the software to the next, especially when those
> characters' codes are the canonical ways of identifying them :-P
> :sigh: However, there were not too many differences/corrections
> between what's in libwpd_internal.cpp right now for set 13.

Yeah, but I saw differences and I noticed in each one of them they
changed some characters.

> Can someone (Edward?) extract the full character sets 13 and 14 from
> "TEST.WP" from the wp2rtf zip and either run those through wp2rtf or
> generate PDFs that show the Arabic characters for me? The "TEST_ARA"
> pair of documents didn't include the first dozen codepoints in set 13
> and don't include any of set 14, so I don't have a visual reference
> :-(

This is what Edward posted some days ago:
http://dl.dropbox.com/u/271144/CHARACTR.DOC.rtf

It should have the visual references.

> I have set 13 all fixed (with the exception of those first dozen that
> I don't know what they look like and which have useless names),
> within the confines of what's available in Unicode and with the
> limitation of a single-codepoint-to-single-codepoint mapping (wp2rtf
> produces more "accurate" mappings of some fancy WP codepoints that
> don't have single matching Unicode glyphs by using two characters).

It would be actually good to mark somewhere the exact wp characters
(charset, charnumber) and their multicharacter mappings. I will then do
what I do for the double byte script, will handle those where the single
codepoint to single codepoint works and will add a special handling for
those that need to be mapped using two or more chars. In the single
codepoint to single codepoint mapping I will mark them as having 0x0000
conversion which will be a special case. Nevertheless, I will do this
later.

> I think new additions to Unicode since version 4.1 (when I did the
> WP6 mappings) will let us successfully map some of the additional
> random diacritical marks WP used to Unicode; if so, I can also fix
> the old WP6 mappings for those.

Feel free to fix whatever you want. Patches are always welcome :)

Cheers

F.

- --
Please avoid sending me Word, Excel or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.16 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org/

iEYEARECAAYFAk2ytrMACgkQu9a1imXPdA8a8gCfZ/zVfpVif6JmCNpXtEsZ/JTN
z0QAn2DY44glLmbfAzoHnG9ot1f4xrOG
=rRZz
-----END PGP SIGNATURE-----
|
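Editorial illustration: the two-tier lookup described in the message above (a single-codepoint table in which a 0x0000 entry flags characters that need a multi-character mapping) might be sketched as follows. This is Python for brevity, not libwpd's C++. All table entries and the names `SINGLE_MAP`, `MULTI_MAP` and `map_wp_char` are hypothetical placeholders; only the 0x0000 convention comes from the message.

```python
# Hypothetical sample data: (charset, charnumber) -> one Unicode code point.
# The key numbers are invented for illustration and are NOT real WP codes.
SINGLE_MAP = {
    (13, 40): 0x0628,  # placeholder: maps 1:1 to ARABIC LETTER BEH
    (13, 41): 0x0000,  # placeholder: 0x0000 means "see MULTI_MAP"
}

# Hypothetical side table for characters that need two or more code points.
MULTI_MAP = {
    (13, 41): (0x0644, 0x0627),  # placeholder: LAM followed by ALEF
}

def map_wp_char(charset, charnumber):
    """Return the Unicode string for a WP character, or '' if unmapped."""
    key = (charset, charnumber)
    code = SINGLE_MAP.get(key)
    if code is None:
        return ""  # character not present in the table at all
    if code == 0x0000:
        # Special case: the 1:1 table defers to the multi-character table.
        return "".join(chr(c) for c in MULTI_MAP.get(key, ()))
    return chr(code)
```

Keeping the multi-character cases in a separate table leaves the common 1:1 path a plain lookup, with the sentinel check as the only extra cost.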
From: Edward M. <em...@co...> - 2011-04-23 21:01:21
|
On 23 Apr 2011, at 4:42 PM, Edward Mendelson wrote:

>
> On 4/23/2011 4:19 PM, Fridrich Strba wrote:
>> Edward,
>>
>>
>> On Sat, 2011-04-23 at 11:13 -0400, Edward Mendelson wrote:
>>> Also potentially useful: The character map document that shipped with 6.x for DOS is here:
>>> http://dl.dropbox.com/u/271144/CHARACT6.DOC
>>> It includes all 14 sets.
>>
>> Thanks for this one. I ran it through wpd2text and it might be that we
>> actually do a good job here. Now, I could use some help here. If you
>> people could simply do the wpd2text of those characters and see whether
>> all glyphs in all charsets are correctly mapped. If you find an error, just
>> note which charset and char number and what would be the correct unicode
>> mapping. For characters that would correspond to a sequence of 2 or more
>> unicode characters, please write that down, but give me also a closer
>> approximation of 1-1 mapping.
>>
>>
>>> I'll find and post the CHARMAP.TST that shipped with 5.1 Hebrew and Arabic later on.
>>
>> The wp2rtf zip file contains a TEST.WP file that has all the charsets in
>> it. If you have the visual representation, just run it through wpd2text
>> and compare. I would appreciate again to have the information of wrongly
>> mapped glyphs.
>>
>> BTW: I hope I actually fixed (maybe apart about 10 chars) the WP5.1
>> hebrew map yesterday. Please check.
>>
>> Thanks for helping with this
>
> Hello Fridrich,
>
> I am happy to help. Let me make certain that I know what you are asking for.
>
> 1. You want me to test the WPDOS 6.x character map file by running it
> through wpd2text, and compare the output from wpd2text with the original
> file, and report any changes.
>
> 2. The TEST.WP file is (except for the first line) the same as the
> CHARMAP.TST file that shipped with WP 5.1 Arabic and Hebrew. It shows a
> very different character set from the WPDOS 6.x character map file, and
> it is (I believe) in WP5 format. You want me to run it also through
> wpd2text and compare the output with the original, and report any changes.
>
> I will be away much of the evening (New York time) but will try to do
> this some time this weekend.
>
> Edward
>

Just to clarify again:

I've run both files through wpd2text - and the results did not seem useful at all (see the linked file). However, if you want me to run them through wpd2odt, then the results were very useful indeed. I'll check them later.

See the results here:

http://dl.dropbox.com/u/271144/CharTests.zip

And wpd2html produced some very impressive-looking results:

http://dl.dropbox.com/u/271144/testwp5.html
http://dl.dropbox.com/u/271144/character6.html

As I said, I'll have to check these later this weekend.

Edward
|
From: Fridrich S. <fri...@bl...> - 2011-04-24 04:55:16
|
Hello, Edward,

On 23/04/2011 23:01, Edward Mendelson wrote:
> I've run both files through wpd2text - and the results did not seem useful at all (see the linked file). However, if you want me to run them through wpd2odt, then the results were very useful indeed. I'll check them later.
> See the results here:
> http://dl.dropbox.com/u/271144/CharTests.zip

Whichever workflow works the best for you :) is the best for you.

> And wpd2html produced some very impressive-looking results:
> http://dl.dropbox.com/u/271144/testwp5.html
> http://dl.dropbox.com/u/271144/character6.html

Thanks! It is a result of several nights of diving into the unicode
charts. First it was Ariya, then me with Smokey. So happy that it works
so far.

> As I said, I'll have to check these later this weekend.

Ok, just to clarify what would be good:

1) Check what the file should look like and what it looks like. When you
find a difference in glyphs, please note the WP charset and WP char of
the offending character.

In case you want to go further, just check the unicode charts, both
visually and also the names of the characters and provide the right
unicode character.

In case there is no possibility to have direct mapping (i.e. the
character is composed from a combining diacritics and a character), I
would gladly have the multi-character mapping too, but also some close
approximation (character without diacritics for instance) of 1:1 mapping.

I will extend the mechanism so that we can handle sequences of
characters as results, but for the time being, I want first to see
whether it is worth it.

Thanks for your work.

Fridrich
|
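Editorial illustration: one possible way to derive the "close approximation" of a 1:1 mapping requested above from a multi-character mapping. This is a sketch using Python's standard `unicodedata` module, not a description of how libwpd does it; the function name `one_to_one_fallback` is hypothetical.

```python
import unicodedata

def one_to_one_fallback(sequence):
    """Reduce a base-plus-diacritics sequence to a single code point.

    First try canonical composition (NFC): if Unicode has a precomposed
    character for the sequence, that is an exact 1:1 mapping. Otherwise
    fall back to the first non-combining character, i.e. the base letter
    without its diacritics.
    """
    composed = unicodedata.normalize("NFC", sequence)
    if len(composed) == 1:
        return composed
    for ch in unicodedata.normalize("NFD", sequence):
        if unicodedata.combining(ch) == 0:
            return ch
    return ""  # the sequence contained only combining marks (or was empty)

# ALEF + MADDA ABOVE has a precomposed form, so the mapping stays exact.
print(one_to_one_fallback("\u0627\u0653") == "\u0622")
# BEH + SHADDA has no precomposed form, so only the base letter is kept.
print(one_to_one_fallback("\u0628\u0651") == "\u0628")
```

A report for such a character would then carry both values: the full sequence as the accurate mapping and the fallback as the 1:1 approximation.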