FreeDict FAQ
Last revised on 12-3-2010 by Bansp.
How can I access those dictionaries of yours?
Firstly, a list of Freedict servers is available, where you can find those which offer WWW front-ends. The best one on this list is the official Freedict server, with the WWW front-end located at http://freedict.org/dict. This page gives you an extra bonus: it allows you to install a search plugin in your browser, for accessing dictionary definitions.
The DICT wiki at http://www.dict.org/w/software/software lists the various clients that you can use for querying definitions. A client that is particularly worth mentioning is the Firefox add-on by David Costanzo, available at http://dict.mozdev.org/. To access Freedict dictionaries with it, you have to set the Freedict server as the one to be queried. (The add-on can also display the results directly in the browser window and use various query strategies; see the 'test' section of its homepage.)
Another client with a lot of potential is vOOcabulum, an Open Office extension. It is in alpha and all feedback is warmly welcome.
Finally, there is also the 'working' CSS view. Some of our dictionaries come equipped with a CSS stylesheet that makes it possible to display the raw dictionary XML as if it were a web page. This may be the most imperfect way of accessing dictionaries, especially the large ones (they may choke your browser while loading into memory). Treat this as the last resort -- it is there primarily for the developers, to see if they get things right as they build the XML.
How do I get all the dictionaries from the SVN onto my machine?
You will generally only need the trunk of the repository. To download it, create e.g. an SVN/ directory, cd into it and run
svn co https://freedict.svn.sourceforge.net/svnroot/freedict/trunk freedict
This gets you the entire main tree of the repository, which will generally be too much. If you want a particular pair of languages alone, try
svn co https://freedict.svn.sourceforge.net/svnroot/freedict/trunk/la1-la2 la1-la2
In order to keep the directory updated, enter it and use
svn update
to get the newest stuff.
But if you are not a developer, you usually needn't bother about the SVN — the download page should suffice in most cases. If, however, you know that the project SVN repository contains stuff that is newer than the version offered for download, please be so kind as to notify us at the mailing list or at the Help forum. Neither of these lets you post when not subscribed or logged in (this is an anti-spam measure), so if you'd rather post anonymously, the support-request tracker might be the way to go (thanks in advance!).
(Thanks to the Comprehensive Knowledge Archive Network for inspiration to expand this section.)
What is the significance of the -nophon part of the filename in some dictionaries?
la1-la2-nophon.tei files are created as an intermediate step during the make process of those dictionaries whose headword language (la1) has phonetic data available and is therefore supported by FreeDict's phonetics processing. The process is:
1. convert the dictionary data into la1-la2-nophon.tei
2. add the phonetic information and create la1-la2.tei.
Practically, this means that renaming la1-la2.tei to la1-la2-nophon.tei is our internal kludge to avoid the phase where the dictionary is processed by a text-to-speech system that adds phonetic information (in <phon/> elements). In most cases, you needn't worry about this :-)
But if you are concerned with switching phonetic processing off for your dictionary, put the line supported_phonetics= into the Makefile, right below the line defining DISTFILES.
What is this P4 vs. P5 issue? Which is better?
"P4" and "P5" are version numbers of the TEI standard. P4 was a post-SGML (or "XML-ised SGML") kind of standard, where the Dictionaries module (chapter 12 then, I believe) was still called "Print Dictionaries" and concentrated practically on the way to render print dictionaries as TEI.
P5, from our perspective, can be divided into "early P5", pre-1.0, until August or September 2007, and (let's say) "mature P5", since late 2007. In the mature phase, the Dictionaries module has changed rather drastically (now it's in ch. 9 and the title is just "Dictionaries"), and is now supposed to properly serve the needs of "digital lexicographers" as well. The mature P5 Dictionaries module also means a total break in backwards-compatibility in the multi-/bilingual section: it's not just that <xptr/> and <xref/> (and <TEI.2/>) need renaming -- the entire <trans/> and <tr/> system is gone. It is now replaced with the generalized system featuring <cit/> and <quote/>, meant to handle all foreign text, be it examples of usage or translation equivalents. Read ch. 9 for details.
Back to the initial question: the entire TEI infrastructure follows the development of the standard. This is true of the schema and documentation generator (Roma) as well as the blessed XSLT suite for transforming TEI into HTML, PDF and what not. The entire model of TEI conformance has changed and is now expressed in the form of so-called ODD files that must accompany your TEI file. Additionally, the TEI will cease to maintain P4 in 2012. What this means is that if you are just beginning to create a dictionary or you want to turn an existing dictionary into TEI, then P5 is obviously the only way to go. If you have a P4 dictionary, it's up to you... Migration to P5 would definitely be a plus, but Freedict will still grudgingly support P4, though be warned: all new developments in the tool area target P5, and we will eventually transform all P4 dictionaries into P5, to simplify tool maintenance.
Where do I find the 3-letter symbol for my language?
FreeDict uses ISO 639-3 language codes for identifying dictionary packages.
Where do I find a suitable 2-letter symbol for xml:lang?
Same place, but look at the ISO 639-1 codes. If the language in question doesn't have such a code set, take the corresponding 3-letter ISO 639-3 code. This is according to the BCP 47 document.
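The fallback rule above can be sketched in a few lines of Python. This is only an illustration: the mapping table is a tiny hand-picked excerpt, and the function name xml_lang_for is made up for this example; a real tool would load the complete ISO code tables.

```python
# Hypothetical excerpt of an ISO 639-3 -> ISO 639-1 mapping; a real tool
# would load the complete code tables instead of hard-coding a handful.
ISO_639_3_TO_1 = {
    "bre": "br",
    "deu": "de",
    "eng": "en",
    "fra": "fr",
    "lat": "la",
}


def xml_lang_for(code3):
    """Return the two-letter code if one exists, else keep the three-letter one."""
    return ISO_639_3_TO_1.get(code3, code3)


print(xml_lang_for("lat"))  # Latin has a two-letter code
print(xml_lang_for("kha"))  # Khasi does not, so the 3-letter code is kept
```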
I want to contribute to the project. What do I do?
Great news :-) Now, the first thing that will never hurt is to announce your plans on Freedict-beta, the project mailing list. The next step depends on a lot of issues; here are a few possibilities:
I have access to an interesting dictionary in a different format
The dictionary is on a free license, right? We have a few tools in the making, for translating into TEI P5, so please post on Freedict-beta, and we'll see what can be done.
Depending on whether we already have a dictionary for the given language pair, and on the quality of your source and (possibly) our source, we can think of several ways to proceed (creating a new dictionary, merger, parallel distribution, etc.).
I want to create one from scratch
I guess this is the place to mention a few basic books on lexicography, just in case (TODO).
Contact us on the list, please :-) And we'll move on from there.
I want to expand/change/fix an existing dictionary
Again, the list is a good place to begin. Also, please have a look at the header section of the dictionary, where you may find information on the current developer and their contact details. You can view the header, along with other accompanying documents (e.g. AUTHORS, ChangeLog, etc.) in the web SVN browser -- this is much better than downloading a distribution, because it may happen that some work is being carried out on a dictionary that has not yet been released as a new version. (If the SVN page doesn't come up, please wait some five minutes and retry.) We also have a separate page listing the dictionary maintainers -- you might want to see if someone has already registered their interest in the given language pair and possibly team up with them.
Is this a lexicographic project?
The answer definitely depends on what you understand by lexicographic, and on who you are going to ask. A few personal answers may follow; to my knowledge, this issue has never actually been brought up.
If by lexicographic project you understand a project that deals with markup and distribution of dictionaries, then by all means, FreeDict is a lexicographic project.
Now the first possible personal view (Piotr's). Let me stress: I don't speak for the others here, though I'd be happy if we agreed on this issue. I would like to treat FreeDict as a project that is not lexicographic in the light of any serious definition of lexicography that you can think of. I would like to see FreeDict's function as restricted to disseminating structured content and (almost) absolutely non-normative with regard to (meta)lexicography as science or art. In other words, I would like to avoid making any recommendations regarding the lexicographic choices concerning the macrostructure, microstructure, Part-of-Speech inventory, etc. This is why I have already remarked in some parts of the HOWTO (I always signed those remarks) that I don't think that FreeDict (or TEI) should recommend an inventory of POS values or anything of that sort.
An entirely different project is needed for this purpose and in fact such projects have already been created — let me name two: the old, foundational EAGLES (Expert Advisory Group on Language Engineering Standards) and the ISO TC37/SC4 Language Resources Management Committee.[1] These projects have produced recommendations/guidelines for, among others, digital lexicographers. Our job, as FreeDict, should be IMHO to encourage developers to submit their dictionaries to us and to do so by working on the tools that translate XML/text into TEI P5 and on tools that render such dictionaries nicely, so that developers can see that their work is being used by as many people as we can reach. And that, in turn, means following the DICT distribution framework as well as other systems (in fact, this bit can be the subject of another discussion, so let me stop here).
Above, I said I thought FreeDict should be almost absolutely non-normative because, naturally, some restrictions are imposed by the format. The TEI Dictionaries module allows for a lot of variation, but if we don't want to end up with huge and buggy translators, some reasonable constraints should be enforced. Among them is the ban on the <entryFree/> element, which is anyway meant for paper dictionaries with messy microstructure. Indeed, "no messy entries" can be reasonably stated as the fundamental format-induced requirement, with its particular applications to be defined later, if need be.
What parts of the TEI source are read by FreeDict scripts?
FreeDict scripts extract some information from the source XML; here is the current list:
- edition information
Read from teiHeader/fileDesc/editionStmt/edition. In P5 dictionaries, the @n attribute of the <edition/> element is queried first, and if it does not exist, the entire content of the element is read (in the latter case, it is expected that the content is a version number, such as "0.3", etc.). This information is used for creating filenames of distribution packages.
- maintainer information
Read from /{TEI,TEI.2}/teiHeader/fileDesc/titleStmt/respStmt/ where the <resp/> element has the value "Maintainer". If it is followed by an email address in angle brackets (you need the &lt; entity for that purpose), the address is also used by the system. Below is an example:
<respStmt>
<!-- for the freedict database -->
<resp>Maintainer</resp>
<name>[your name here] &lt;userid@sourceforge.net></name>
</respStmt>
Note that the left angle bracket of a non-element has to be escaped in XML with &lt;.
- status information
This is auxiliary information, read from teiHeader/fileDesc/notesStmt/note[@type='status']. Currently, the recommended values are (from freedict.org):
- 'stable'
- 'big enough to be useful' (from 10000 entries on)
- 'too small' (less than 1000 entries)
- 'low quality'
- 'unknown'
- URL of the source
Read from teiHeader/fileDesc/sourceDesc/*/xptr/@url. As of today (15:50, 1 March 2009 (UTC)), this is hardcoded to use the <xptr/> element, defined by TEI P4.
- author information (for StarDict packages)
Not sure at the moment where this matters. It currently reads the first <name/> element encountered in the first <respStmt/>.
- title information (for StarDict packages)
Read from teiHeader/fileDesc/titleStmt/title.
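For the curious, here is a rough Python sketch of how the lookups listed above could be done for a P5 file. It is not the actual FreeDict build code: the function name read_header_info, the sample header and the maintainer "Jane Doe" are invented for illustration, and the P4-only source URL lookup is left out.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
FILE_DESC = "tei:teiHeader/tei:fileDesc/"


def read_header_info(tei_xml):
    """Collect edition, maintainer, status and title from a P5 header."""
    root = ET.fromstring(tei_xml)
    info = {"edition": None, "maintainer": None, "email": None,
            "status": None, "title": None}

    # Edition: query @n first, fall back to the element content.
    edition = root.find(FILE_DESC + "tei:editionStmt/tei:edition", TEI_NS)
    if edition is not None:
        n = edition.get("n")
        info["edition"] = n if n is not None else "".join(edition.itertext()).strip()

    # Maintainer: the <respStmt> whose <resp> reads "Maintainer".
    for stmt in root.findall(FILE_DESC + "tei:titleStmt/tei:respStmt", TEI_NS):
        resp = stmt.findtext("tei:resp", default="", namespaces=TEI_NS).strip()
        name = stmt.findtext("tei:name", default="", namespaces=TEI_NS).strip()
        if resp == "Maintainer":
            # An optional e-mail address in angle brackets may follow the name.
            match = re.match(r"(.*?)\s*<([^<>\s]+@[^<>\s]+)>\s*$", name)
            if match:
                info["maintainer"], info["email"] = match.groups()
            else:
                info["maintainer"] = name
            break

    # Status: the <note type="status"> inside <notesStmt>.
    status = root.find(FILE_DESC + "tei:notesStmt/tei:note[@type='status']", TEI_NS)
    if status is not None:
        info["status"] = (status.text or "").strip()

    info["title"] = root.findtext(FILE_DESC + "tei:titleStmt/tei:title",
                                  default=None, namespaces=TEI_NS)
    return info


# Invented sample header, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader><fileDesc>
<titleStmt>
<title>Swahili-English Dictionary</title>
<respStmt><resp>Maintainer</resp>
<name>Jane Doe &lt;jane@example.org></name></respStmt>
</titleStmt>
<editionStmt><edition n="0.3">third draft</edition></editionStmt>
<publicationStmt><p>Publication Information</p></publicationStmt>
<notesStmt><note type="status">too small</note></notesStmt>
<sourceDesc><p>Information about the source</p></sourceDesc>
</fileDesc></teiHeader>
</TEI>"""

print(read_header_info(SAMPLE))
```

Running something like this against your own header is a quick way to check that the scripts will find what they expect.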
How big should my dictionary be? How complex?
Keep it simple (for starters). Too many projects have been killed by their developers' desire to code the next wonder of the world, the ultimate IT. Let us subscribe to the Open Source motto "publish early, publish often". Make your dictionary a simple glossary at first,
<entry>
<form>
<orth>alasiri</orth>
</form>
<sense>
<def>afternoon</def>
</sense>
</entry>
possibly with parts of speech and some basic attributes:
<entry xml:id="alasiri">
<form xml:lang="swh">
<orth>alasiri</orth>
</form>
<gramGrp>
<pos>n</pos>
</gramGrp>
<sense>
<def>afternoon</def>
</sense>
</entry>
Initially, it's OK to keep the equivalents inside <def>, and it's OK to separate senses with a semicolon, and equivalents with commas. Later on, you might go for something slightly more complicated, as in:
<entry xml:id="pia-2" n="2">
<form xml:lang="swh">
<orth>pia</orth>
</form>
<gramGrp>
<pos>adv</pos>
</gramGrp>
<sense xml:id="pia-2.1" n="1">
<def>also, too</def>
</sense>
<sense xml:id="pia-2.2" n="2">
<def>equally, likewise</def>
</sense>
</entry>
And if you want to move past this stage, please contact the Freedict-beta list, so that we can talk about that. While we adhere to the schema documented in ch. 9 of the TEI Guidelines, there are some constraints on what our conversion tools can digest. (BTW, above, I assumed that there is an xml:lang="eng" attribute on the <body>; it is also a good idea to keep one <pos> per entry, and treat pairs such as the English verb and noun record as separate; if you need to handle this differently, do contact the mailing list).
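If you start with the flat convention (senses separated by semicolons, equivalents by commas inside a single <def>), moving to numbered senses can be automated. Below is a minimal Python sketch of that step; the helper names split_senses and build_senses are invented here, and the id pattern simply copies the "pia-2.1" style of the example above.

```python
import xml.etree.ElementTree as ET

# Clark notation for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def split_senses(flat_def):
    """Split 'a, b; c, d' into [['a', 'b'], ['c', 'd']]."""
    senses = []
    for chunk in flat_def.split(";"):
        equivalents = [part.strip() for part in chunk.split(",") if part.strip()]
        if equivalents:
            senses.append(equivalents)
    return senses


def build_senses(entry_id, flat_def):
    """Create one numbered <sense> element per semicolon-separated group."""
    elements = []
    for n, equivalents in enumerate(split_senses(flat_def), start=1):
        sense = ET.Element("sense", {XML_ID: f"{entry_id}.{n}", "n": str(n)})
        ET.SubElement(sense, "def").text = ", ".join(equivalents)
        elements.append(sense)
    return elements


for sense in build_senses("pia-2", "also, too; equally, likewise"):
    print(ET.tostring(sense, encoding="unicode"))
```

Note that this only works if semicolons and commas are never used inside an equivalent itself, so check your data before relying on it.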
What is the minimal header allowed by the TEI?
This is a stub of a TEI file:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Title</title>
</titleStmt>
<publicationStmt>
<p>Publication Information</p>
</publicationStmt>
<sourceDesc>
<p>Information about the source</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<entry>...</entry>
</body>
</text>
</TEI>
Freedict expects some extra elements in the header, and requires others (e.g. license statement and project description). Please see above for more concrete information.
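A quick sanity check for the stub can be written in Python. The sketch below takes the three children of <fileDesc> shown in the stub as the required set and reports which of them are missing; the function name missing_header_parts is made up for this example, and the FreeDict-specific extras are not checked.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
# The three <fileDesc> children that appear in the stub above.
REQUIRED = ("titleStmt", "publicationStmt", "sourceDesc")


def missing_header_parts(tei_xml):
    """Return the names of the required <fileDesc> children that are absent."""
    root = ET.fromstring(tei_xml)
    file_desc = root.find(f"{TEI}teiHeader/{TEI}fileDesc")
    if file_desc is None:
        return list(REQUIRED)
    return [name for name in REQUIRED if file_desc.find(f"{TEI}{name}") is None]


# A deliberately incomplete header: <publicationStmt> is left out.
INCOMPLETE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader><fileDesc>
<titleStmt><title>Title</title></titleStmt>
<sourceDesc><p>Information about the source</p></sourceDesc>
</fileDesc></teiHeader>
<text><body/></text>
</TEI>"""

print(missing_header_parts(INCOMPLETE))
```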
paper vs. electronic, cit/quote
From Denis Arnaud:
> In breton bilingual dictionaries the usage is to write the gender
> of the nouns and the plural ending (or irregular form). And when
> it's a composite word, it's all written between the first
> (signifiant) word and the second one. As an example here's the
> translation for week-end (french people use the english word)
> week-end : dibenn g. (où)-sizhun The singular form is dibenn-sizhun
> and plural form dibennoù-sizhun Dibenn means end and sizhun week.
> And dibenn is a masculin noun ('gourel' in breton).
Ok, we have several issues here. The most important thing is that in electronic dictionaries, as opposed to paper dictionaries, where space matters (the publisher pays for it, you pay for it), you really want to use full forms of equivalents, because a) it is more user-friendly (you don't have to "train" the user to understand the system) and b) it is machine-readable, which potentially brings many extra benefits.
So in your example, you want something like:
(assuming xml:lang="fr" for the dictionary text; I use French-like
grammatical codes ['m' for 'masculine', etc.])
<sense>
<cit type="trans">
<quote xml:lang="br">dibenn-sizhun</quote>
<gramGrp>
<gen>m</gen>
<number>sg</number>
</gramGrp>
<form type="inflected" xml:lang="br">
<orth>dibennoù-sizhun</orth>
<gramGrp>
<number>pl</number>
</gramGrp>
</form>
</cit>
</sense>
interpretation:
- "week-end" in fr is "dibenn-sizhun" in br;
- "dibenn-sizhun" is masc sg.,
- its (orthographic) plural form is "dibennoù-sizhun"
(Note: I am assuming that the gender of "dibenn" becomes the gender of "dibenn-sizhun" -- it is rather standard in (a type of) compounds that one of the elements is more important and imposes its own features onto the entire structure.)
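If your source uses the compact paper notation, the full forms can often be generated rather than typed in. The Python sketch below expands the pattern from the quoted example, "dibenn g. (où)-sizhun", into the singular and plural forms. The regular expression and the function name expand_compact are assumptions made for this one pattern only; real sources will need more cases.

```python
import re

# first part, optional gender abbreviation, plural ending in parentheses, rest
COMPACT = re.compile(r"^(\w+)\s+(?:(\w+)\.\s+)?\((\w+)\)(\S*)$")


def expand_compact(notation):
    """Expand 'dibenn g. (où)-sizhun' into full singular and plural forms."""
    match = COMPACT.match(notation.strip())
    if match is None:
        return None
    first, gender, ending, rest = match.groups()
    return {
        "singular": first + rest,
        "plural": first + ending + rest,
        "gender": gender,  # abbreviation as found in the source, e.g. 'g'
    }


print(expand_compact("dibenn g. (où)-sizhun"))
```

The result maps directly onto the <quote>, <gen> and inflected <orth> of the encoding shown above.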
How to encode the inflectional forms of the headword
Sebastian Humenda asks how to encode the forms below:
> Here the definitions like in a dictionary, (n. e.g. is neutrum):
> iubere, iubeo, iussi, iussum - befehlen
> perfectus, -a, -um - vollendet, vollkommen (the 2nd and 3rd form is for feminine/neuter)
> maiestas, -atis f. - Hoheit, Erhabenheit, Größe (-* is the genitive form, used for declination)
The first point is: see the reply to Denis Arnaud's question above on encoding parts of words (or just believe me: you don't want to include parts of words in your dictionary -- that would be such a waste of its potential).
Secondly, a simple format for manual encoding can be something like what follows:
<entry xml:id="iubeo">
<form xml:lang="la">
<orth>iubeo</orth>
<form type="infl">
<orth type="inf">iubere</orth>
<orth type="perf">iussi</orth>
<orth type="sup">iussum</orth>
</form>
</form>
<gramGrp>
<pos>v</pos>
<iType>2</iType> <!-- inflection class (for verbs: conjugation number) -->
</gramGrp>
<sense>
<def>befehlen</def>
</sense>
</entry>
If the above is used, it should be converted by XSLT to the form advocated, among others, by the relevant part of chapter 9 of the TEI Guidelines. The encoding should then look roughly as follows (now I assume, for the sake of exercise, that the infinitive is chosen as the headword, which is not always the practice in Latin dictionaries):
<entry xml:id="iubere">
<form xml:lang="la">
<orth>iubere</orth>
<form type="infl">
<form xml:id="iubere-iubeo">
<orth>iubeo</orth>
<gramGrp xml:lang="de">
<per>1</per>
<number>sg</number>
<mood>ind</mood>
<tns>praes</tns>
<gram type="voice">aktiv</gram>
</gramGrp>
</form>
<form xml:id="iubere-iussi">
<orth>iussi</orth>
<gramGrp xml:lang="de">
<per>1</per>
<number>sg</number>
<mood>ind</mood>
<tns>perf</tns>
<gram type="voice">aktiv</gram>
</gramGrp>
</form>
<!-- and so on for the supine form iussum -->
</form>
</form>
<gramGrp>
<pos>v</pos>
<iType>2</iType> <!-- inflection class (for verbs: conjugation number) -->
</gramGrp>
<sense>
<def>befehlen</def>
</sense>
</entry>
Note that you characterise each of the forms roughly in the same way you characterise the entire lemma (or lexeme), for which the headword is just an identifier (so this raises the question: where to encode the fact that iubere is infinitival? A partial answer is: it's your default here, for all verbs in the entire dictionary, unless marked otherwise).
Not every grammatical category is provided as a separate element, but those that are, are actually specializations of the generic element <gram>. Hence, <tns> = <gram type="tns">.
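To make the equivalence concrete, here is a small Python sketch that rewrites the specialised elements of a <gramGrp> into their generic <gram type="..."> form. The set of element names is limited to those used in the examples on this page, the function name to_generic is invented, and namespaces are ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Specialised elements used in the examples on this page; each is treated
# as shorthand for <gram type="...">, with the element name as the type.
SPECIALISED = {"pos", "gen", "number", "case", "per", "tns", "mood", "iType"}


def to_generic(gram_grp):
    """Rewrite e.g. <tns>perf</tns> as <gram type="tns">perf</gram>, in place."""
    for child in gram_grp:
        if child.tag in SPECIALISED:
            child.set("type", child.tag)
            child.tag = "gram"
    return gram_grp


group = ET.fromstring(
    "<gramGrp><per>1</per><number>sg</number><mood>ind</mood>"
    '<tns>perf</tns><gram type="voice">aktiv</gram></gramGrp>'
)
print(ET.tostring(to_generic(group), encoding="unicode"))
```

Elements that are already generic, like the voice one here, are left alone.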
Note also the <iType> element that holds something that may be referred to as "conjugation/declension/lexical/noun class", depending on the language, assumed grammatical system and the intended purpose. This should assume some verifiable system (e.g. an established grammar of the given language) and it should be used consistently.
Note that in e.g. the Swahili-English dictionary, we follow a different convention, because the plural forms of nouns are actually references to other entries in the dictionary. But here, you supply a mini-paradigm with every entry (which may indeed be very useful to the user).
On to the noun:
<entry xml:id="maiestas">
<form xml:lang="la">
<orth>maiestas</orth>
<form type="infl">
<orth>maiestatis</orth>
<case>gen</case>
</form>
</form>
<gramGrp>
<pos>n</pos>
<gen>f</gen>
</gramGrp>
<sense>
<def>Hoheit, Erhabenheit, Größe</def>
</sense>
</entry>
Remarks:
- note that the <form> for maiestatis is flatter than for the forms of iubere -- whether you keep to the system of iubere (<form> within <form> within <form>) or simplify like here (<form> within <form>) is a matter of the convention that you use; just make sure to be consistent, so that when you decide to e.g. provide the entire paradigms for your nouns, the information can be easily added with (say) XSLT; it is good to document such conventions in the dictionary header, for the sake of users and other developers;
- note that again, the information that the form maiestas is Nominative is your dictionary-wide default; but the main <gramGrp> identifies the features of the lexeme as an abstract object that is realised by the particular forms in the syntactic context;
- these structures assume that you have put something like <text xml:lang="de"> at the top of the dictionary, and you only mark the divergence from that in the <form> element (but in <gramGrp> elements within the <form>, you need to reset the language back to German, if this is what you want -- here we touch upon the interesting issue of active versus passive dictionaries, which deserves a separate treatment);
- and, of course, we use <def> for a simple and not entirely machine-readable way of listing the equivalents.
How else can I help?
In many ways. You can for example make sure to tell us about any mistake or inconsistency in the FAQ or in the HOWTO that you notice. Even a typo counts.
Generally, the bug tracker is the best place to report such stuff, but you can also use the mailing list.
Similarly with the dictionaries themselves. We'll be grateful for error reports. With omissions it's a bit different: if you suggest a new translation and provide a means for us to verify it (just in case -- let's say it's a prank-avoidance mechanism), it can surely be added within hours or days. If you just complain about some word missing, the procedure becomes more difficult and may take a long time to complete. Still, the project trackers are the best way to get such reports to us.
We will also ultimately benefit from your bug reports concerning e.g. DICT clients or servers -- just make sure to direct them to the appropriate address :-)
There are also issues that the project needs or suffers from in one way or another, and where we need community action. This concerns e.g. voting for feature requests or voting for bugs that somehow affect us. Here's a list of possible actions that one can take to support FreeDict:
Sourceforge feature requests
(You can only vote for these if you have a Sourceforge account.)
- Fine-grained SVN permissions -- a good thing for a project with numerous subprojects, such as FreeDict; please consider voting on the svnauthz solution (the one with the highest number of votes; 20 at this moment)
- Wiki syntax highlighting -- to make reading documentation easier (38 votes when it was announced on the freedict-beta list)
- Wiki support for footnotes -- as above, starting from 1 vote
Other possible actions
There is an issue with the XML parser that the FreeDict build system uses, xmllint: it won't support three-letter language codes for the @xml:lang attribute. This is bad for us, because some of the languages we have do not have two-letter (ISO 639-1) codes set, and will never have them by design -- in such cases, @xml:lang should be given a three-letter value from ISO 639-3. But when xmllint sees that, it aborts with an error. If you happen to be registered at Gnome Bugzilla, please consider at least adding your CC to the relevant bug report (this is an equivalent of voting) or supplying a patch...
This bug currently affects at least Khasi, but, in essence, this is an issue of scale: two-character codes exist for 136 (most common) languages. 3-letter codes -- for over 7500. You do the maths. (Hint: we're not a "most-common-languages-only" project.)
