
NEW ATTEMPT TO DIGITALIZE THE GAFFIOT DICTIONARY ON WIKISOURCE

Anonymous
2013-03-02
2013-07-19
  • Anonymous

    Anonymous - 2013-03-02

    Dear Pr. OHKUBO,

    Let me first congratulate you on your work on the Gaffiot dictionary. Can you explain to me how you proceeded? If I understand correctly, the letters A, B, and C are done. With what degree of accuracy, do you think?

    I am trying to continue the digitalization of the Gaffiot dictionary on the Wikisource project. The idea is to ask many contributors to digitalize just one page each (for the first pass).

    Each contributor asks me for a page number at the address numerisation.gaffiot (at) hotmail.fr and then has to digitalize that page (i.e. extract the characters). He also has to validate a page corrected by another contributor.

    I made a document for novices to explain how to proceed.

    Perhaps some Japanese Latinists would kindly take part in this huge work? [without redoing what you have already done, of course]

    I could send you my call for contributors (I would translate it into (my poor) English) for you to pass on (or translate into Japanese) to anyone who would be interested in digitalizing one page of the Gaffiot.

    What do you think of that?

    Sincerely yours,

    gérard gréco -> numerisation.gaffiot (at) hotmail.fr

     
  • OHKUBO Katsuhiko

    Thank you very much for finding this project!
    First, sorry for my late reply.
    After the site renewal in February, I didn't know that I (the forum owner) had to log in and approve your message (or delete spam) before it could be read, and the update notification setting had also been cleared...

    I am a software engineer (normal office worker) and an amateur Latinist, not a scholar, nor a professional.
    This project is my private voluntary work (unrelated to my office), without any private/official partners (such as University, Government etc) and no funds.

    The steps of digitalization are as follows:
    1. Scan an image file with ABBYY FineReader 10 (with special settings for special characters (Greek, macrons etc.)) and save it as an HTML file.
    2. Roughly convert the HTML to XML with my Java program, which also converts Greek to Beta Code automatically, etc.
    3. Manually format and fix that XML (this is the main part of digitalization, 30-60 mins per page).
    4. Spell-check and fix the Latin text (= the italic part) with my program.
    5. Check reference texts ("Cic. de Or. 1.2.3" etc.) with my program; it's a rough check, not perfect.
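
As an illustration of step 2, a Greek to Beta Code conversion could look roughly like the sketch below. It is not the program described above: the class name `BetaCodeSketch`, the lowercase Latin letters, and the simplified handling of capitals (an asterisk placed directly before the letter) are all assumptions made for the example.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not the program described above. */
class BetaCodeSketch {
    private static final Map<Character, String> MAP = new HashMap<>();
    static {
        // Lowercase alpha (U+03B1) through omega (U+03C9); both sigma forms map to "s".
        String beta = "abgdezhqiklmncoprsstufxyw";
        for (int i = 0; i < beta.length(); i++) {
            MAP.put((char) (0x03B1 + i), String.valueOf(beta.charAt(i)));
        }
        MAP.put('\u0313', ")");   // smooth breathing
        MAP.put('\u0314', "(");   // rough breathing
        MAP.put('\u0301', "/");   // acute accent
        MAP.put('\u0300', "\\");  // grave accent
        MAP.put('\u0342', "=");   // circumflex
        MAP.put('\u0345', "|");   // iota subscript
        MAP.put('\u0308', "+");   // diaeresis
    }

    /** Decomposes accented letters, then maps each letter and mark separately. */
    static String toBetaCode(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (char c : decomposed.toCharArray()) {
            char lower = Character.toLowerCase(c);
            String mapped = MAP.get(lower);
            if (mapped == null) {
                out.append(c);            // not Greek: keep unchanged
                continue;
            }
            if (lower != c) {
                out.append('*');          // simplified marker for a capital letter
            }
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Greek word quoted in the Tricca entry of the dictionary.
        System.out.println(toBetaCode("\u03A4\u03C1\u03AF\u03BA\u03BA\u03B7"));
    }
}
```

Decomposing to NFD first keeps the table small, because each accent is handled once instead of once per precomposed letter.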

    Since it's a draft phase, I have not proofread any pages yet.
    The French text is not checked by any program (I once tried a French spell checker, but could not get it to work well).


    Your idea (inviting contributors) sounds good if it works well.
    Frankly (sorry to say), digitalization is hard to start, harder to continue, much harder to complete.
    Do you have any experience of digitalizing dictionaries?
    If not, I suggest you create and proofread at least 100 pages by yourself (it's short!).
    You will learn how much you really can do, how long it takes, what is difficult, etc.
    If you are expecting someone's help already, I wonder whether you will be able to complete.
    "Someday" and "Someone" will turn out to be "Never" and "Nobody".

    Since 2009, I have been creating electronic dictionaries using XML files released by the Perseus Digital Library.
    (Only Japanese pages are available, sorry.)
    http://classicalepwing.sourceforge.jp/
    http://classicalepwing.sourceforge.jp/screenshot.html
    http://classicalepwing.sourceforge.jp/usage.html
    http://classicalepwing.sourceforge.jp/usage2.html
    http://classicalepwing.sourceforge.jp/usage3.html
    In 2011, I digitalized almost all of Halsey's etymological dictionary, 1 column, over 200 pages.
    Last year, I also digitalized Weekley's etymological dictionary, 2 columns, over 800 pages. It took 4 months to proofread (it was a wonderful study for me).
    So I believe I can digitalize Gaffiot (3 columns, 1700 pages) completely, even without others' help.

    In 2008, I joined a Latin mailing list (ML) in Japan, which has over 500 amateur members and a few professionals. Before I joined, a group had read all of "De Bello Gallico", and its complete Japanese translation was published in 2009.
    http://www.amazon.co.jp/dp/4256191178
    Other groups are reading Cicero, the Bible, etc. now.
    I have not joined such groups because I have been learning Latin since 2007 at Athénée Français, the most famous commercial French language school (though I have not attended French classes). I am reading Caesar and Ovid now, and have read part of Tacitus's Annales.
    I know there are many eager amateur Latinists in Japan, and I announced on the ML what I am doing, but nobody has helped this project in 2 years!

    I don't know whether there is anyone in France or other countries who wants to contribute and really will.
    I really hope there is, but there may not be...

    Do you have a connection with (or are you able to contact) the team members who digitalized Du Cange?
    http://ducange.enc.sorbonne.fr/
    http://ducange.enc.sorbonne.fr/doc/credits
    It has 10 volumes, 6000 pages!!
    http://ducange.enc.sorbonne.fr/doc/dia/
    It would be wonderful if these members contributed to the Gaffiot dictionary even with NO FUNDS (or got funding from the French Government and started a professional project).
    Maybe I'm too pessimistic, but I don't expect anyone's help (because there isn't any).


    As a software engineer, technically, I think it's not a good way to work on Wikisource directly.
    It's based on simple HTML, easy to update, good for Wikisource, but not good for other purposes.
    Generally, XML with a well-defined structure is used to store big dictionary data.
    For example, Perseus uses TEI-XML, and I also created the Gaffiot data in the same format.
    http://www.perseus.tufts.edu/hopper/opensource/download
    Du Cange's data is also in TEI-XML files.
    The Sanskrit site uses XML, which is not TEI.
    http://www.sanskrit-lexicon.uni-koeln.de/download.html
    XML can easily be converted to another format (HTML etc.) or another viewing style, and some kinds of checks are technically easy.

    Here is my Gaffiot XML (TEI-like), which I really typed (almost) every day. It has already progressed to page 526.
    http://sourceforge.net/projects/digital-gaffiot/files/gaffiot_xml-130310.zip/download
    I converted it to HTML web pages using XSLT etc.
    HTML is closely related to XML (XHTML is an XML format), so it's not so bad, but using XML is better.
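
For readers who have not used XSLT from Java, a minimal sketch of such an XML to HTML conversion follows. The stylesheet and the element names (`entry`, `orth`, `def`) are illustrative assumptions, not the stylesheet actually used for the project; only the standard `javax.xml.transform` API is used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical example of turning one TEI-like entry into HTML. */
class EntryToHtmlSketch {
    // Illustrative stylesheet: headword in bold, definition as plain text.
    private static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='entry'><p><xsl:apply-templates/></p></xsl:template>"
        + "<xsl:template match='orth'><b><xsl:value-of select='.'/></b></xsl:template>"
        + "<xsl:template match='def'><xsl:text> </xsl:text><xsl:value-of select='.'/></xsl:template>"
        + "</xsl:stylesheet>";

    static String toHtml(String entryXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(entryXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT conversion failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<entry><orth>tricennalis</orth><def>de trente ans</def></entry>";
        System.out.println(toHtml(xml));
    }
}
```

The same XML could be fed to a different stylesheet to produce another layout, which is the main argument for keeping the source data in XML.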

    I'm sorry if I have disappointed you and have not been very cooperative.
    I will rethink what I can do for you once there are 100 green-marked pages on Wikisource.

     

    Last edit: OHKUBO Katsuhiko 2013-03-12
  • Anonymous

    Anonymous - 2013-03-22

    Thank you for this frank answer, Mr Ohkubo. You are a new Briareus, who had a hundred arms! And to learn Latin in such a short time! A new Argus too!?! Bravo!

    1. You said: Scan an image file with ABBYY FineReader 10 (with special settings for special characters (Greek, macrons etc.)) and save it as an HTML file.

    I use ABBYY too, release 9. But I have great difficulty with Greek characters, macrons, etc. I tried to train the software on them but I didn't really manage to. Can you send me the add-on you prepared to recognize such characters? It would be a great help to me.

    1. You said : I am a software engineer (normal office worker) and an amateur Latinist, not a scholar, nor a professional.
      This project is my private voluntary work (unrelated to my office), without any private/official partners (such as University, Government etc) and no funds.

    So am I. I am a hydraulics engineer.

    1. You said : If you are expecting someone's help already, I wonder whether you will be able to complete.

    It sounds like you say...

    1. You said: As a software engineer, technically, I think it's not a good way to work on Wikisource directly.
      It's based on simple HTML, easy to update, good for Wikisource, but not good for other purposes.
      Generally, XML with a well-defined structure is used to store big dictionary data.

    You're perfectly right. But I wanted to test collective work. I want to continue this test to complete the letter D. It is a micro-approach, with the advantage of having both Latin and French readers and of validation by a peer.

    But I have a second project: the letter T with a macro-approach like yours, I think. I tested the lex (GNU flex) software on some pages: it is an interesting method; but the problem for me is the poor results Abbyy gives for breve and long characters. Maybe you could help me with this.

    Thank you for your DTD: I will study it with great attention. Thank you for the links too!

    Rendez-vous after the 100 pages! Don't forget me for the Abbyy add-on!

    Ave,

    Gérard Gréco (Gerardus Græcus)

    PS: Do you have an email address? This interface is not very pleasant. Mine: numerisation.gaffiot@hotmail.fr

     
  • OHKUBO Katsuhiko

    Thank you for your reply.

    As for the ABBYY FineReader settings, it's best to read your software's help pages to learn how to set the languages of scanned documents.
    In FineReader 10, there is a language setting page under Tools > Options.
    I created some "User languages" to cover the special characters.

     
  • Anonymous

    Anonymous - 2013-03-27

    Thank you for your "how-to"! I didn't know these features of Abbyy: they work in release 9 too.

    I am trying to do something with the letter T using your method, in parallel with the letter D on Wikisource.

    You are right, the HTML of Wikisource is relatively poor. But I think that a lex program (or successive lex programs) could extract the information (almost) automatically and place it in an XML file.

    I will try to construct an adapted DTD; the TEI 5 one is, I guess, too general for the Gaffiot dictionary. I will send this DTD to you once it is written to get, if you agree, your advice on it.

    Thank you and rendez-vous to the hundred pages ! Havoc !

    GG

     
  • OHKUBO Katsuhiko

    I'm glad about your determination.
    Here is the most important advice:

    DO NOT OVERWORK!
    
    TAKE ENOUGH RESTS FOR YOUR FINGERS!
    

    An excellent English-Japanese translator (whom I admire most) seriously injured her fingers through 20 years of work; she cannot type normally anymore and now uses voice recognition software to input text.
    Take a 10-minute rest after 1 hour of typing.
    It's not an idle threat nor a joke.
    If you feel pain, stop typing and rest your fingers.

    Conversion between Wikisource HTML and XML is not such an important task.
    At the least, I can create an XML -> HTML converter.

    I typed page 555 yesterday; I will be able to release section D (pp.464-567) in April.

     
  • Anonymous

    Anonymous - 2013-04-24

    Hi, dear Pr. Ohkubo,

    Thank you for your advice.

    I am trying to digitalize the last 100 pages of the Gaffiot. Like you, I used FineReader (9) with your settings for breve and long letters. The result is quite bad for the breve and long letters: one error for every two letters. Maybe the image file of the Gaffiot I used was too poor.

    Q: what image file do you use? Is it available on the Net?

    Sincerely yours,

    GG

     
  • OHKUBO Katsuhiko

    Dear Gérard,
    I'm glad of your message.

    Yes, some special settings for FineReader do NOT mean perfect recognition.
    Sometimes breve and long letters are recognized correctly, sometimes not.
    Sometimes these are recognized as à/á/â/ä etc. (but these are necessary for French!).
    I created and use a rough formatting program to change à -> #ăā etc. if these characters are in headword position (it's not a perfect solution, of course; I check and correct them manually).
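
A rough sketch of that kind of headword pass is shown below. It is an assumption-laden illustration rather than the actual program: the headword is taken to be everything before the first comma of an entry line, and the class and method names are invented for the example. A vowel carrying a grave, acute, circumflex or diaeresis is replaced by `#` followed by its breve and macron forms, so that the right one can be chosen by hand.

```java
import java.text.Normalizer;

/** Hypothetical sketch of marking suspicious accents in headwords. */
class HeadwordAccentSketch {
    // Combining grave, acute, circumflex and diaeresis: likely OCR misreadings.
    private static final String SUSPECT_MARKS = "\u0300\u0301\u0302\u0308";
    private static final char BREVE = '\u0306';
    private static final char MACRON = '\u0304';

    /** Replaces each suspicious accented vowel with '#', breve form, macron form. */
    static String flagHeadword(String headword) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < headword.length(); i++) {
            char c = headword.charAt(i);
            String nfd = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            boolean suspect = nfd.length() == 2
                && "aeiouyAEIOUY".indexOf(nfd.charAt(0)) >= 0
                && SUSPECT_MARKS.indexOf(nfd.charAt(1)) >= 0;
            if (!suspect) {
                out.append(c);
                continue;
            }
            char base = nfd.charAt(0);
            out.append('#')
               .append(Normalizer.normalize("" + base + BREVE, Normalizer.Form.NFC))
               .append(Normalizer.normalize("" + base + MACRON, Normalizer.Form.NFC));
        }
        return out.toString();
    }

    /** Flags only the part before the first comma; the French text is left alone. */
    static String flagEntry(String entryLine) {
        int comma = entryLine.indexOf(',');
        if (comma < 0) {
            return entryLine;   // no recognizable headword: leave the line unchanged
        }
        return flagHeadword(entryLine.substring(0, comma)) + entryLine.substring(comma);
    }

    public static void main(String[] args) {
        // a-grave inside the headword is flagged; a-grave in the French gloss is kept.
        System.out.println(flagEntry("tric\u00E0stini, peuple de l\u00E0"));
    }
}
```

Restricting the replacement to the headword matters because the same accented letters are legitimate in the French definitions.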

    I use the original source images behind http://www.lexilogos.com/latin/gaffiot.php ,
    i.e. the files scanned by Prof. OGURISU Hitoshi (http://www.eonet.ne.jp/~ogurisu/index.html),
    http://www.eonet.ne.jp/~ogurisu/Fr/Gaffiot.html
    https://www.wakayama-u.ac.jp/%7Eogurisu/archives/dictionaries/Gaffiot/GaffiotEdic.zip
    (I don't know Prof. Ogurisu, but a few years ago I sent him a mail. He was glad about the reuse of his image files.)

    GaffiotEdic.zip contains over 1700 TIFF files of all pages, 400 dpi, 2425x3763 pixels.
    http://www.lexilogos.com/latin/gaffiot.php uses JPEG files, 1000x1552 pixels.
    JPEG is a "lossy" compression format (and lexilogos also scaled down the image sizes).
    It seems the JPEG images are much noisier than the TIFF images....

     

    Last edit: OHKUBO Katsuhiko 2013-04-25
  • Anonymous

    Anonymous - 2013-05-18

    Thank you very much, dear professor Ohkubo, for your kind advice.

    1. Congratulations on having digitalized the letter D!

    2. I have digitalized nearly the last 100 pages (p. 1600 to the end) of the Gaffiot, from the pictures of Prof. Ogurisu. A few Latin teachers are now reviewing them. I hope it will be finished within a month. I have not transformed them into TEI form.

    The rough result is like this :

    <!-- TRICASTINI p. 1600 -->

    Trĭcastīni, ōrum, m., peuple de la Narbonnaise : Plin. 3, 36.

    Tricca, æ (Triccē, ĕs, Sen. Tro. 831) f. (Τρίκκη), ville de Thessalie : Liv. 32, 13, 5 ; Plin. 4, 29 || -æus, a, um, de Tricca : Avien. Phæn. 206.

    trīcēnārĭus, a, um (triceni), de trente, qui contient le nombre trente : <cl>tricenaria fistula</cl> Frontin. Aq. 29 ; 48, <cf>tuyau de 30 pouces de circonférence</cf> ; <cl>tricenarius (homo)</cl> Sen. Exc. Contr. 3, 3, 5, (homme) de trente ans ; tricenariæ cærimoniæ P. Fest. 71, 10, cérémonie de 30 jours.

    trĭcēni, æ, a (triginta), \P 1 distrib., chacun trente, chaque fois, trente : in singula conclavia tricenos lectos quærere Cic. Verr. 4, 58, chercher trente lits pour chacune des salles à manger \P 2 = triginta, trente : Plin. 18, 144, etc.

    \F gén. pl. tricenum Plin. 7, 164 ; Frontin. Aq. 49.

    trīcennālĭa, ĭum ou ĭōrum, n., fête qui se célèbre tous les trente ans : Oros. 7, 28 ; Prob. App. 196, 10.

    trīcennālis, e, de trente ans : Ruf. Hier. 1, 11.

    trīcenĭum, ĭi, n. (triginta, annus), durée de trente ans : Cod. Just. 7, 31, 1 ; Sid. Ep. 8, 6.

    trīcensĭmāni, v. tricesimani.

    trĭcentēni et trĭcenti, v. trecenteni, trecenti : Col. 5, 2, 5.

    Do you have some software to go further with such material in the direction of the TEI schema?

    1. I found that the TEI schema you used is not as rich as it could be. For example, the quotations are not identified. I think it would be better if it were like this, for example:
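
One possible first step in that direction is sketched here, under assumptions: the `\P n` markers of the rough text are taken to open numbered senses, everything before the first marker is treated as the form, and the input is treated as plain text (inline tags such as `<cl>` would be escaped, not interpreted). The class name and the output element names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: split a rough entry into a form and numbered senses. */
class SenseSplitterSketch {
    // Matches a literal backslash, "P", whitespace and the sense number.
    private static final Pattern SENSE_MARK = Pattern.compile("\\\\P\\s+(\\d+)\\s*");

    static String toXml(String entry) {
        Matcher m = SENSE_MARK.matcher(entry);
        StringBuilder xml = new StringBuilder("<entry>");
        int previousEnd = 0;
        String number = null;   // null until the first sense marker is seen
        while (m.find()) {
            appendPart(xml, number, entry.substring(previousEnd, m.start()));
            number = m.group(1);
            previousEnd = m.end();
        }
        appendPart(xml, number, entry.substring(previousEnd));
        return xml.append("</entry>").toString();
    }

    private static void appendPart(StringBuilder xml, String number, String text) {
        String content = escape(text.trim());
        if (content.isEmpty()) {
            return;
        }
        if (number == null) {
            xml.append("<form>").append(content).append("</form>");
        } else {
            xml.append("<sense n=\"").append(number).append("\">")
               .append(content).append("</sense>");
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String rough = "triceni, ae, a (triginta), \\P 1 distrib., chacun trente "
                     + "\\P 2 = triginta, trente";
        System.out.println(toXml(rough));
    }
}
```

Finer structure (quotations, references, translations) would need further passes or a real grammar, which this sketch does not attempt.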

    <entry>

    <form>
    <orth>vălĕo, ŭi, ĭtum, ēre,</orth>
    <gramGrp>
    <gram type="pos">int.,</gram>
    </gramGrp>
    </form>

    <sense n="1">
    <def>être fort, vigoureux :</def>

    <cit type="example" xml:lang="la">
      <quote>plus potest, qui plus valet</quote>
      <bibl>
        <author>Pl.</author> 
        <title>Truc.</title>
        <biblScope>812,</biblScope>
      </bibl>
      <cit type="translation" xml:lang="fr">
        <quote>il peut le plus, celui qui est le plus fort,</quote>
      </cit>
    </cit>
    
    <xr type="cf">cf.
      <cit>
        <bibl>
          <author>Pl.</author> 
          <title>Amp.</title>
          <biblScope>1103 ;</biblScope>
        </bibl>
      </cit>
    </xr>
    
    <cit type="example" xml:lang="la">
      <quote>ubi vitis valebit</quote>
      <bibl>
         <author>Cat.</author>
     <title>Agr.</title>
         <biblScope>33, 3,</biblScope>
      </bibl> 
      <cit type="translation" xml:lang="fr">
        <quote>quand le cep sera fort</quote>
      </cit>
     </cit>
    
     <note>[dans ce sens <xr type="cf">v. <ref>valens,</ref></xr> très employé par <author>Cic.</author>]</note>
    

    </sense>

    <sense n="||"> [or <sense n="2.1">]

    <def>[avec inf.] avoir la force de : </def>

    etc.

    What do you think of that ?

    1. I made an attempt to print my results with TeX. I have attached a page as an example. The data on this page have not been corrected.

    2. I have started on the 500 preceding pages (in reverse order), from p. 1100 to 1600...

    3. If I (and my very few (you were right...) colleagues) reviewed your work, for example the letter A, to correct typos, what about the copyright? Would the result of these corrections be under a Creative Commons licence?

    Sincerely yours,

    Gérard Gréco.

     
  • OHKUBO Katsuhiko

    GREAT! CONGRATULATIONS!! BRAVO!!!

    It's great that you have Latin teachers to review texts.

    I think Wikisource HTML <-> TEI-XML conversion is my task.
    I got an email from a member of the Perseus Digital Library in April; it said they didn't have enough programmers and workers to add new files or correct TEI-XML files now.
    So making TEI-XML may be useful, but it's not so necessary.
    I have used the TEI-XML format simply because I didn't know another useful format.
    (The head of Perseus, Dr. Gregory Crane, moved(?) to the University of Leipzig in May; he plans to start a new Digital Humanities project. Our work may be useful for it.)
    See below:
    http://sites.tufts.edu/perseusupdates/2013/05/02/reinventing-humanities-publication-project-receives-e1-1-million-grant-from-the-saxon-ministry-of-science-and-european-social-fund/

    This is a real sample of Lewis&Short dictionary.

    Human readable page:
    http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3atext%3a1999.04.0059

    TEI-XML page:
    http://www.perseus.tufts.edu/hopper/xmlchunk?doc=Perseus%3Atext%3A1999.04.0059%3Aentry%3DA1

    I have only checked names of authors and books by my program.
    Here is a part of the program.

    registerValidAuth("ambr", "Abel, abr, Dign.sac, Dign.sacerd, ep, "
        + "Hex, Hexaem, Luc, noe, Off.Min, Pæn, Parad, Virgin");
    registerValidAuth("apul",
        "Ascl, Asclep, de^dogm.Plat, Diph, Diphth, Dogm.Plat, m, flor, herb, "
        + "Mag, Met, Mund, Orth, plat, Socr, Trism");
    registerValidAuth("aug", "Adim, bapt, civ, Concup, conf, Cons.evang, Contin, Corrept, "
        + "Cresc, Crescon, Doctr, Doct.Chr, Doctr.chr, "
        + "Duab.anim, Dulcit, Enchir, ep, Eucher, Evang, Ev.Joh, "
    

    I plan to add XML tags for quotations by writing some new programs, but have not done so yet.
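
How a table built by calls like `registerValidAuth` might then be consulted is sketched below. Only the function name and the sample data come from the excerpt above; the storage, the case-insensitive matching, the stripping of a trailing dot and the `isKnownReference` method are assumptions for illustration, not the real checker.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of an author/work abbreviation check. */
class ReferenceCheckSketch {
    private static final Map<String, Set<String>> WORKS_BY_AUTHOR = new HashMap<>();

    /** Registers a comma-separated list of work abbreviations for one author. */
    static void registerValidAuth(String author, String works) {
        Set<String> known =
            WORKS_BY_AUTHOR.computeIfAbsent(normalize(author), k -> new HashSet<>());
        for (String work : works.split(",")) {
            known.add(normalize(work));
        }
    }

    /** True if the author is registered and the work is listed for that author. */
    static boolean isKnownReference(String author, String work) {
        Set<String> known = WORKS_BY_AUTHOR.get(normalize(author));
        return known != null && known.contains(normalize(work));
    }

    // Assumed normalization: trim, lowercase, drop one trailing dot.
    private static String normalize(String s) {
        String t = s.trim().toLowerCase(Locale.ROOT);
        return t.endsWith(".") ? t.substring(0, t.length() - 1) : t;
    }

    public static void main(String[] args) {
        registerValidAuth("aug", "Adim, bapt, civ, Concup, conf");
        System.out.println(isKnownReference("Aug.", "Conf."));   // registered pair
        System.out.println(isKnownReference("Aug.", "Georg."));  // work not listed
    }
}
```

A check like this only catches unknown abbreviations; it cannot tell whether the book, chapter and section numbers that follow are right.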


    "I made an attempt to print my results with TeX. I have attached a page as an example. The data on this page have not been corrected."

    Excellent work!
    A full-text PDF is much better and very useful for searching.

    "I have started on the 500 preceding pages (in reverse order), from p. 1100 to 1600..."

    No problem.
    I'm at page 587 now, and plan to keep going forward.
    But unfortunately, my office work (creating programs for retail stores; I'm not a 'professor') has been very busy since April, so I have not been able to make progress on the Gaffiot project.
    I will restart the project in August (or maybe after autumn...).

    "If I (and my very few (you were right...) colleagues) reviewed your work, for example the letter A, to correct typos, what about the copyright? Would the result of these corrections be under a Creative Commons licence?"

    I released my XML/HTML as Public Domain (dual license of Creative Commons 0 and GPLv3), so you can change it, re-release it, or do anything you want with it (including commercial use) without my permission.
    http://creativecommons.org/publicdomain/zero/1.0/
    http://www.gnu.org/licenses/license-list.en.html#CC0
    CC0 doesn't have a Share-Alike condition, so you can re-release under another license if you want.
    http://creativecommons.org/licenses/by-sa/3.0/deed.en


    I'm sorry about my first reply to you.
    I didn't believe you would really try and almost finish in such a short time.

    After you release your HTML files (and my office work returns to normal...), I will create some programs to convert formats, check styles, and do whatever else is necessary.
    Maybe I have to re-plan the project for a better goal (I don't know what is better now).

    Sincerely yours,
    Katsuhiko OHKUBO.

     

    Last edit: OHKUBO Katsuhiko 2013-05-20
  • Anonymous

    Anonymous - 2013-06-02

    Hi, dear Dr (if you are not a professor, you are surely a doctor, and if not, out of respect for your initiative and your work, I shall call you doctor if you permit) Ohkubo,

    First, I want to thank you a lot for the gift of your work. How can you accept that anyone could make money from your (huge) work? Can you explain that to me? Is this a feature of the Japanese mentality (which I don't know, I confess to my great shame, apart from a few Kurosawa movies) or only of your own personality?

    I produced a PDF file of your letter A. I made corrections on just one page, only to gauge the amount of correction needed.

    There are very few errors. Congratulations! But, as you said, correction is nonetheless necessary.

    If you want, I can send you the letter A typeset in one PDF file.

    Sincerely yours,
    GG

     
  • Anonymous

    Anonymous - 2013-06-02

    The whole letter A

     
  • OHKUBO Katsuhiko

    Thank you very much for your reply.
    I'm very glad and proud because my work is recognized as a valuable thing.
    That is enough for me, so please call me without Dr. etc...

    I have gotten a great deal of free data from many sites (such as Perseus Digital Library, Zeno.org, Du Cange [http://ducange.enc.sorbonne.fr/], a big morphological data set of Sanskrit by Gérard Huet [http://sanskrit.inria.fr/DATA/XML/], etc.), so I think I must also create some free data as a return gift.

    I like the story of why Michael Hart started Project Gutenberg.
    http://www.gutenberg.org/wiki/Gutenberg:The_History_and_Philosophy_of_Project_Gutenberg_by_Michael_Hart
    He got a valuable thing, so he started to create another valuable thing.

    As a software engineer, sometimes I wonder what valuable software is.
    Almost all software will be gone and vanished in just a few years.
    I want to create (and leave) something that lasts beyond centuries (even though I'm an amateur).
    I believe digital content (such as texts, movies etc.) will do so, and dictionaries are the most basic and important tools of all.


    In the beginning of part A, there are many more errors because of my trial-and-error work style.
    My checking programs have improved as I progressed.

    • Some kinds of Italic / non-Italic style errors can be mechanically corrected (but this is not done yet)
    • All Latin words are spell-checked after dropping hyphens, but the hyphen marks are not dropped in the output XML/HTML (I must improve my programs).
    • Currently, pairs of () and [] are also checked, but they were not checked at the beginning of this project.
    • Currently, some characters which must not appear in Latin words (è, é etc.) are also checked and corrected (but not at the beginning). Sometimes ā, ă etc. are used in Latin quotations.
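
Three of the checks in this list are simple enough to sketch. The code below is an illustration under assumptions, not the real checkers: the class and method names are invented, and the set of forbidden characters is limited to the two given as examples (è, é).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketches of the bracket, accent and hyphen checks. */
class TextChecksSketch {
    /** True if every ( and [ is closed by the matching ) or ] in the right order. */
    static boolean bracketsBalanced(String text) {
        Deque<Character> open = new ArrayDeque<>();
        for (char c : text.toCharArray()) {
            if (c == '(' || c == '[') {
                open.push(c);
            } else if (c == ')' || c == ']') {
                if (open.isEmpty()) {
                    return false;               // closing bracket with no opener
                }
                char opener = open.pop();
                if ((c == ')' && opener != '(') || (c == ']' && opener != '[')) {
                    return false;               // mismatched pair
                }
            }
        }
        return open.isEmpty();                  // anything left open is an error
    }

    /** True if a Latin word contains e-grave or e-acute, which belong to French. */
    static boolean hasFrenchAccent(String latinWord) {
        for (char c : latinWord.toCharArray()) {
            if ("\u00E8\u00E9".indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    /** Form of a Latin word used for spell checking: hyphens removed. */
    static String spellKey(String latinWord) {
        return latinWord.replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(bracketsBalanced("tricenarius (homo) [avec inf.]"));
        System.out.println(bracketsBalanced("trĭcēni, æ, a (triginta"));
        System.out.println(hasFrenchAccent("c\u00E9rimoniae"));
        System.out.println(spellKey("tri-ceni"));
    }
}
```

Keeping `spellKey` separate from the stored text is one way to spell-check without hyphens while leaving the XML/HTML output untouched.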

    To avoid editing a large file, I create gaffiot-A.xml, gaffiot-B.xml, etc. and only check the last file (gaffiot-E.xml now) with the current checkers.
    I believe I must recheck the whole text with the current (and further improved) programs.


    I'm on page 591 now.
    My office work is still very busy....
    After it returns to normal, I will digitize up to page 600, improve and recheck the whole text, and release the updated XML/HTML.

     

    Last edit: OHKUBO Katsuhiko 2013-06-03
  • Anonymous

    Anonymous - 2013-07-18

    Hi Mr Ohkubo,

    I have just finished (in a first pass) a batch of 500 pages: from 1100 up to 1599. Together with the batch of about 100 pages, the pages from 1100 to the end are now digitized.

    (*) The text has been checked in a first pass. The last (nearly) 100 pages (1600 to the end) are being checked (2nd pass) now.

    I am now trying to find a way to detect the structure of the dictionary (nearly) automatically, to distinguish Latin, French, etc., with Lex (flex) and Yacc (Bison).

    Sincerely yours,

    GG

    PS: please give me an address where I can send you materials that I don't want to make public. -> numerisation.gaffiot @ hotmail.fr

     
  • OHKUBO Katsuhiko

    GREAT!!
    500 pages!?!?! Amazing, unbelievable.....
    Please rest your hands well enough....

    I was also just planning to release renewed XML/HTML files on Saturday (up to page 600).
    They don't seem perfect, but they are a little better than the previous version.

    And I will send a mail to you.
    I hope we can decide on a better XML/HTML format, how to check the text, etc.
    Your work encouraged me greatly. Thank you very much.

    PS.
    I cannot read/write French now.
    Last week, I completed the Latin class (I attended for 6 years: elementary/middle/advanced classes).
    Now I'm thinking about attending an elementary French class, which will start in September.

     
