|
From: Salahuddin P. <sal...@gm...> - 2009-05-12 21:02:02
|
Dear all,
I have been working on XML support for অভিধান - Abhidhan, to
enable various applications and tools to use our dictionary.
The basic work is already done, but we need to define a standard XML
format (an XML DTD or XML Schema).
Any suggestions or comments?
Example: test XML output.
<?xml version="1.0" encoding="utf-8"?>
<dictionary>
<search_results>
<dict_entry id="1">
<en_word>read</en_word>
<pos_tag>Noun, singular or mass</pos_tag>
<bn_word>পড়া</bn_word>
</dict_entry>
<dict_entry id="2">
<en_word>read</en_word>
<pos_tag>Verb, base form</pos_tag>
<bn_word>পড়া</bn_word>
</dict_entry>
<dict_entry id="3">
<en_word>read</en_word>
<bn_pronunciation> উচ্চাঃ রীড</bn_pronunciation>
<pos_tag>Verb, non-3rd person singular present</pos_tag>
<bn_word>পাঠ করা</bn_word>
</dict_entry>
</search_results>
</dictionary>
regards
salahuddin
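
For anyone who wants to try the test output from a client program, here is a minimal sketch of reading it with Python's standard `xml.etree.ElementTree` module. The element names come from the example above; the function name `read_entries` and the shortened sample are only illustrative, and the real gateway output may differ.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the test output above (two of the three entries).
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<dictionary>
  <search_results>
    <dict_entry id="1">
      <en_word>read</en_word>
      <pos_tag>Noun, singular or mass</pos_tag>
      <bn_word>পড়া</bn_word>
    </dict_entry>
    <dict_entry id="3">
      <en_word>read</en_word>
      <bn_pronunciation>উচ্চাঃ রীড</bn_pronunciation>
      <pos_tag>Verb, non-3rd person singular present</pos_tag>
      <bn_word>পাঠ করা</bn_word>
    </dict_entry>
  </search_results>
</dictionary>
"""


def read_entries(xml_text):
    """Return one dict per <dict_entry>, keyed by child tag name."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for node in root.iter("dict_entry"):
        entry = {"id": node.get("id")}
        for child in node:
            entry[child.tag] = (child.text or "").strip()
        entries.append(entry)
    return entries


entries = read_entries(SAMPLE)
for e in entries:
    print(e["id"], e["en_word"], e["pos_tag"], e["bn_word"])
```

Optional fields such as `bn_pronunciation` simply do not appear in the dict when an entry lacks them, so callers should use `entry.get(...)` for those.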
|
|
From: Golam M. H. <gmh...@gm...> - 2009-05-13 02:11:23
|
Hi,
On Tue, May 12, 2009 at 5:13 PM, Salahuddin Pasha <sal...@gm...> wrote:
> Basic work is already done, but we need to define a standard XML (XML
> DTD or XML Schema).
> Example: test XML output.
>
> <?xml version="1.0" encoding="utf-8"?>
> <dictionary>
> <search_results>
> <dict_entry id="1">
> <en_word>read</en_word>
> <pos_tag>Noun, singular or mass</pos_tag>
Thanks a lot for your work.
I would suggest that you also include an entry for the Penn tag
for the part of speech (POS), like "NN", "VV" etc. So something like
<penn_tag>NN</penn_tag>
This would be needed if the Anubadok Online interface needs to update its
database using your XML gateway to the Ankur dictionary database.
Cheers,
Golam
|
|
From: Abu Z. <za...@gm...> - 2009-05-13 05:36:32
|
You might also find it helpful to look at the apertium dictionary format,
which is also standard XML. Here is the link to svn for Nepalese Language
(it's the closest language to Bengali in apertium we have so far, and the
Bengali pair is far from finished :( )
http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-bn-en/
I have been working to find some standard tag sets for the Bengali language.
So far I'm also making do with the Penn Treebank tagset, but in the future I
might need to extend it to meet my project's requirements. *However, I
believe the Penn Treebank tagset to be sufficient for a general purpose
dictionary format.*
The attached file contains the Penn Treebank tagset and also the bilingual
dictionary format from apertium.
What I'd like to propose is that, instead of using
<pos_tag>Verb, non-3rd person singular present</pos_tag>
you could create some definitions like verb, person, number, tense and then
use them as properties of the specific entry. It'd be easier to parse in the
future.
--
Regards
Abu Zaher Md. Faridee
http://zaher14.blogspot.com/
|
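
The proposal to split the free-text part of speech into separate properties can be sketched as follows. This is only an illustration: the `<pos>` element, its attribute names and the `FEATURES` table are assumptions made up for the example, not part of Abhidhan or apertium.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the descriptive labels used in the test output
# to separate, machine-readable features. The attribute names are assumed.
FEATURES = {
    "Noun, singular or mass": {"pos": "noun", "number": "singular"},
    "Verb, base form": {"pos": "verb", "form": "base"},
    "Verb, non-3rd person singular present": {
        "pos": "verb",
        "person": "non-3rd",
        "number": "singular",
        "tense": "present",
    },
}


def structured_pos(label):
    """Build a <pos> element whose attributes carry the separate features."""
    attrs = FEATURES.get(label)
    if attrs is None:
        # Unknown label: keep the original text rather than guessing.
        elem = ET.Element("pos")
        elem.text = label
        return elem
    return ET.Element("pos", attrs)


elem = structured_pos("Verb, non-3rd person singular present")
print(ET.tostring(elem, encoding="unicode"))
```

A consumer can then test `elem.get("tense")` directly instead of parsing the English description.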
|
From: Deepayan S. <dee...@gm...> - 2009-05-13 17:06:12
|
On 5/12/09, Salahuddin Pasha <sal...@gm...> wrote:
> Dear all,
>
> I was working on অভিধান - Abhidhan for XML support. To
> enable various application and tools to utilize our dictionary.
>
> Basic work is already done, but we need to define a standard XML (XML
> DTD or XML Schema).
>
> Any suggestion or comments ?
Back in 2003, the bengalinux dictionary list had a discussion on this.
Nothing ever came out of it, and when Golam first started on anubadok,
his emphasis was more specialized. In any case, that discussion may
provide some suggestions.
You can get it from the list archives, and I'm also attaching a
cleaned up and edited version of the thread here:
<thread from May 2003>
----
[Ankur-dictionary] dictionary.dtd
From: Kaushik Ghose <kghose@wa...> - 2003-05-14 04:17
Hi,
here is the descriptor file.
I'm new to XML and DTDs so please go over the semantics as well as the
syntax and see if this serves our purpose...
<?xml version="1.0"?>
<!ELEMENT entry*(word_bn, info_bn*)>
<!ELEMENT word_bn (#CDATA)>
<!ELEMENT info_bn (english, pronounciation_bn,meaning_bn)>
<!ELEMENT english (#CDATA)>
<!ELEMENT pronounciation_bn (#CDATA)>
<!ELEMENT meaning_bn (#CDATA)>
thanks
-kg
----
From: Kaushik Ghose <kghose@wa...> - 2003-05-14 05:12
OK, small correction. Qt's DOM class seems to parse this correctly:
dictionary.dtd
<?xml version="1.0"?>
<!ELEMENT dictionary (entry*)>
<!ELEMENT entry (word_bn, info_bn*)>
<!ELEMENT word_bn (#PCDATA)>
<!ELEMENT info_bn (english?, pronounciation_bn?, meaning_bn?)>
<!ELEMENT english (#PCDATA)>
<!ELEMENT pronounciation_bn (#PCDATA)>
<!ELEMENT meaning_bn (#PCDATA)>
test.xml
<?xml version="1.0"?>
<!DOCTYPE dictionary SYSTEM "dictionary.dtd">
<dictionary>
<entry>
<word_bn>????????????????????? ???????????????</word_bn>
<info_bn>
<english>seedling</english>
<pronounciation_bn>ankur</pronounciation_bn>
<meaning_bn>??????????????????? ???????????
???????????????????????? ??????????????????
??????????????????</meaning_bn>
</info_bn>
</entry>
<entry>
<word_bn>????????????????????? ?????????</word_bn>
<info_bn>
<english>bangla</english>
<pronounciation_bn>bangla</pronounciation_bn>
<meaning_bn>??????????????????? ?????????????????
????????????????????????, ????????????????????????? ???????????
????????????????????????? ?????</meaning_bn>
</info_bn>
<info_bn>
<english>bengali</english>
</info_bn>
</entry>
</dictionary>
thanks
-kg
----
From: Deepayan Sarkar <deepayan@st...> - 2003-05-14 07:03
Ha! A friend of mine once corrected me on this, now I can correct
someone else :) 'pronounciation' should be spelled
'pronunciation'.
I'm not an expert on DTDs (though I know someone who knows much
more, whom I can ask after we make some progress). I find it
very difficult to understand DTD's, and much easier to understand
examples of what the final thing would look like. Let's work that
way, and we can write out the DTD once we decide on the 'look'.
I don't know if you know this, but there's something called
attributes which might be useful. For instance, with multiple
meanings as different parts of speech. Here's an example (I'm using
slightly different tags) --- 'pos' is part of speech, 'plural' is
whether the word has a plural form, etc.:
<entry>
<word>chhaanaa</word>
<info pos="noun" plural="false" origin="deshi">
<meaning>dudh theke toiri ek dhoroner ...</meaning>
<synonyms>...</synonyms>
<antonyms>...</antonyms> ## ???
<translation lang="en">cottage cheese (?)</translation>
<pronunciation>chhaanaa</pronunciation>
</info>
<info pos="noun" origin="tatbhabo"> #it's probably not, but...
<meaning>shishu, bachchaa</meaning>
<translation lang="en">child, young</translation> # comma separated
<translation lang="hn">bachcha</translation> #hindi is hn ? not sure
<pronunciation>chhaanaa</pronunciation>
<derivative form="the">chhaanaaTaa, chhaanaaTi</derivative>
<derivative form="of" num="singular">chhaanaaTir</derivative>
<derivative form="of" num="plural">chhaanaader</derivative>
</info>
</entry>
(I've used romanized bengali in place of what should be bengali, but
you get the idea.)
I think we should handle derivative words here (and not have
separate entries for them; they can be generated from
this). Sanskrit has very systematic rules for 'shabdarup'. Bengali
isn't as systematic, but there are still quite general rules. We can
formulate some rules and list down only derivative words that are
exceptions to that rule. We have the standard forms:
to, by, for, from, of and in
plus maybe plurals, the, a --- anything else ?
Also, Bengali (unlike English) often has many words which mean
exactly the same thing. We might try to think of a way to have a
single entry for all of them.
Can anyone (preferably with a dictionary at hand) think of anything else ?
This is not very important right now, but what's a good format to store
pronunciation ?
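
The idea of general rules plus a list of exceptions could look roughly like this. The suffixes are read off the romanized example forms above (chhaanaaTaa, chhaanaaTir, chhaanaader) and are NOT a real Bengali grammar; the exception form in the usage lines is a made-up placeholder.

```python
# Illustrative suffix rules keyed by (form, num), taken from the example
# derivatives in the message above. A real rule set would need a linguist.
DEFAULT_SUFFIXES = {
    ("the", "singular"): "Taa",
    ("of", "singular"): "Tir",
    ("of", "plural"): "der",
}


def derivatives(word, exceptions=None):
    """Return {(form, num): derived word}; listed exceptions override rules."""
    forms = {key: word + suffix for key, suffix in DEFAULT_SUFFIXES.items()}
    forms.update(exceptions or {})
    return forms


regular = derivatives("chhaanaa")
# "chhaanaa-IRREGULAR" is a placeholder, not an actual Bengali word form.
irregular = derivatives("chhaanaa", {("of", "plural"): "chhaanaa-IRREGULAR"})
print(regular[("of", "singular")], irregular[("of", "plural")])
```

With this layout only the exceptions need to be stored as <derivative> tags; everything else can be regenerated from the rules.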
----
From: Taneem Ahmed <taneem@ey...> - 2003-05-14 08:33
On Wed, 14 May 2003, Kaushik Ghose wrote:
> Hi,
> here is the descriptor file.
> I'm new to XML and DTDs so please go over the semantics as well as the
> syntax and see if this serves our purpose...
>
>
> <?xml version="1.0"?>
> <!ELEMENT entry*(word_bn, info_bn*)>
> <!ELEMENT word_bn (#CDATA)>
> <!ELEMENT info_bn (english, pronounciation_bn,meaning_bn)>
> <!ELEMENT english (#CDATA)>
> <!ELEMENT pronounciation_bn (#CDATA)>
> <!ELEMENT meaning_bn (#CDATA)>
I remember someone mentioned something about multiple language support. Is
it possible to have a general element instead of "english" so that it'll
be easier to expand for other languages?
Taneem
----
From: Taneem Ahmed <taneem@ey...> - 2003-05-14 08:37
Sorry I didn't see Deepayan's mail when I sent my previous e-mail. His
example is what I was talking about :)
Taneem
On Wed, 14 May 2003, Deepayan Sarkar wrote:
----
From: Kaushik Ghose <kghose@wa...> - 2003-05-14 20:54
hi,
On Wed, 14 May 2003, Deepayan Sarkar wrote:
>
> Ha! A friend of mine once corrected me on this, now I can correct
> someone else :) 'pronounciation' should be spelled 'pronunciation'.
>
Okay :), so the new tag for this is <pron> >:D
> I'm not an expert on DTDs (though I know someone who knows much more, whom I
> can ask after we make some progress). I find it very difficult to
> understand DTD's, and much easier to understand examples of what the final
> thing would look like. Let's work that way, and we can write out the DTD
> once we decide on the 'look'.
Sure, I think I've got the hang of elementary DTDs (i.e. at the level I set
out), so I can handle that. Qt's happy, so am I...
> I don't know if you know this, but there's something called attributes which
> might be useful. For instance, with multiple meanings as different parts of
> speech. Here's an example (I'm using slightly different tags) --- 'pos' is
> part of speech, 'plural' is whether the word has a plural form, etc.:
>
> <entry>
> <word>chhaanaa</word>
> <info pos="noun" plural="false" origin="deshi">
> <meaning>dudh theke toiri ek dhoroner ...</meaning>
> <synonyms>...</synonyms>
> <antonyms>...</antonyms> ## ???
> <translation lang="en">cottage cheese (?)</translation>
> <pronunciation>chhaanaa</pronunciation>
> </info>
> <info pos="noun" origin="tatbhabo"> #it's probably not, but...
> <meaning>shishu, bachchaa</meaning>
> <translation lang="en">child, young</translation> # comma separated
> <translation lang="hn">bachcha</translation> #hindi is hn ? not sure
> <pronunciation>chhaanaa</pronunciation>
> <derivative form="the">chhaanaaTaa, chhaanaaTi</derivative>
> <derivative form="of" num="singular">chhaanaaTir</derivative>
> <derivative form="of" num="plural">chhaanaader</derivative>
> </info>
> </entry>
I would suggest only putting in the english synonym, or closest word
- this is a question of size and interfacing. If we have a set of
english synonyms we can then use that to link to an English-German
dict say, or an English-Thai dict to have a bangla-thai dict for ex.
If we start to put in translations for additional languages I think
the file will become very large and slow to load.
As it is, with the bangla word, bangla synonyms, antonyms, meanings
and english synonyms I think we are going to deal with pretty large
files for each bangla alphabet.
Another issue to deal with is what we do with words that have no
direct one word english equivalent.
I couldn't get what "origin" means ? By plural="false" do you mean
it doesn't have a plural form ?
> I think we should handle derivative words here (and not have separate entries
> for them; they can be generated from this). Sanskrit has very systematic
> rules for 'shabdarup'. Bengali isn't as systematic, but there are still quite
> general rules. We can formulate some rules and list down only derivative
> words that are exceptions to that rule. We have the standard forms:
>
> to, by, for, from, of and in
>
> plus maybe plurals, the, a --- anything else ?
This is fine,
> Also, Bengali (unlike English) often has many words which mean exactly the
> same thing. We might try to think of a way to have a single entry for all of
> them.
I would rather not. I'd say link it to the required word by putting that
in the synonym, and in the <meaning> tag put in something like "see blah"
>
> Can anyone (preferably with a dictionary at hand) think of anything else ?
>
>
> This is not very important right now, but what's a good format to store
> pronunciation ?
>
unicode should do fine, there's a provision for the international phonetic
alphabet
http://www.unicode.org/charts/PDF/U0250.pdf
so the next draft layout...
<dictionary>
<entry>
<word_bn> chanaa </word_bn>
<info pos="noun" plural="true" origin="??">
<pron>....</pron>
<meaning_bn> baccha </meaning_bn>
<synonym_bn>...</synonym_bn>
<synonym_bn>...</synonym_bn>
<antonym_bn>...</antonym_bn>
<synonym_en>...</synonym_en>
<synonym_en>...</synonym_en>
<grammar>
<derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
<derivative form="of"
num="singular">chhaanaaTir</derivative>
<derivative form="of" num="plural">chhaanaader</derivative>
</grammar>
</info>
<info pos="noun" plural="false" origin="??">
<pron>...</pron>
<meaning_bn> khabar... </meaning_bn>
</info>
</entry>
</dictionary>
-kg
----
From: Deepayan Sarkar <deepayan@st...> - 2003-05-14 23:25
On Wednesday 14 May 2003 15:53, Kaushik Ghose wrote:
> I would suggest only putting in the english synonym, or closest word -
> this is a question of size and interfacing. If we have a set of english
> synonyms we can then use that to link to an English-German dict say, or
> an English-Thai dict to have a bangla-thai dict for ex.
> If we start to put in translations for additional languages I think the
> file will become very large and slow to load.
Before we go any further, we need to decide how we are eventually planning to
use the XML files.
I don't think XML is a good format for use in any real application. For
example, for a spell-checker to load the XML files directly would be very
inefficient.
Instead, the XML could be a repository of all possible information we might
ever want to have. For a spell checker we could generate something that would
contain only the words and nothing else (that could be a plain text file, or
a database, could be in various different encodings and formats). Generating
this from the XML may take a while, but if we do this once every two months
or so, it shouldn't matter. Similarly for speech synthesis, we could extract
only the actual word and its pronunciation, and leave everything else out.
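
As a rough sketch of that export step, the following pulls only the headwords out of a dictionary file and produces a plain word list for a spell checker. It assumes the `<entry>`/`<word>` layout from the earlier example; the sample data and the name `word_list` are illustrative.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the full XML repository (romanized, as in the thread).
SAMPLE = """<dictionary>
  <entry>
    <word>chhaanaa</word>
    <info pos="noun"><pronunciation>chhaanaa</pronunciation></info>
  </entry>
  <entry>
    <word>gabAkSha</word>
    <info pos="noun"/>
  </entry>
</dictionary>"""


def word_list(xml_text):
    """Return the sorted, de-duplicated headwords and nothing else."""
    root = ET.fromstring(xml_text)
    words = {w.text.strip() for w in root.iter("word") if w.text}
    return sorted(words)


words = word_list(SAMPLE)
print("\n".join(words))
```

The same pattern, selecting `word` plus `pronunciation`, would give the reduced file for speech synthesis.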
From that perspective, I don't think it should matter if the XML files become
large. And of course we don't need to have a single file for each alphabet,
we could split them as much as we want (maybe the first 3 letters identify
each file) as long as given a word it's possible to identify which file that
word belongs to.
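
One possible way to make "given a word, identify its file" concrete is sketched below. The directory name `dictionary` and the `.xml` suffix are assumptions. Note that slicing a Python string counts Unicode code points, so for real Bengali text the "first 3 letters" rule would need to be defined more carefully (for example in terms of whole conjuncts).

```python
from pathlib import PurePosixPath


def file_for_word(word, root="dictionary"):
    """Map a word to <root>/<first three characters>.xml."""
    word = word.strip()
    if not word:
        raise ValueError("empty word")
    # Words shorter than three characters simply use what they have.
    return PurePosixPath(root) / (word[:3] + ".xml")


print(file_for_word("chhaanaa"))
print(file_for_word("gabAkSha"))
```

Because the mapping depends only on the word itself, an editor can open or create the right file without scanning the whole repository.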
As for the translation, I'm not saying that we have to list translations into
all possible languages. But there's no harm in keeping the option. In fact,
initially we won't even have english translations for the words that we
already have. And as you point out, not all words will even have an English
translation. All this wouldn't matter if we allow an arbitrary number
(including 0) of instances of the <translation> tag for each word.
The English->other language idea may not always be the best because there
might be some words which have no proper english version, but could have,
say, hindi versions. We could make it policy to include a non-english
translation only when this is the case. But explicitly ruling out that option
is not a good idea, I think.
> As it is, with the bangla word, bangla synonyms, antonyms, meanings and
> english synonyms I think we are going to deal with pretty large files for
> each bangla alphabet.
>
> Another issue to deal with is what we do with words that have no direct
> one word english equivalent.
>
> I couldn't get what "origin" means ?
Basically tot-somo, tot-bhobo, dishi, bideshi, that sort of stuff.
> By plural="false" do you mean it doesn't have a plural form ?
Yes.
> > I think we should handle derivative words here (and not have separate
> > entries for them. They can be generated from this). Sanskrit has very
> > systematic rules for 'shabdarup'. Bengali isn't as systematic, but there
> > are still quite general rules. We can formulate some rules and list down
> > only derivative words that are exceptions to that rule. We have the
> > standard forms:
> >
> > to, by, for, from, of and in
> >
> > plus maybe plurals, the, a --- anything else ?
>
> This is fine,
>
> > Also, Bengali (unlike English) often has many words which mean exactly
> > the same thing. We might try to think of a way to have a single entry for
> > all of them.
>
> I would rather not. I'd say link it to the required word by putting that
> in the synonym, and in the <meaning> tag put in something like "see blah"
Yes, that should be good enough. Maybe in those cases
<word_bn>gabAkSha</word_bn>
<info ...>
<meaning_bn type="refer">jAnalA</meaning_bn>
</info>
> > Can anyone (preferably with a dictionary at hand) think of anything else
> > ?
> >
> >
> > This is not very important right now, but what's a good format to store
> > pronunciation ?
>
> unicode should do fine, there's a provision for the international phonetic
> alphabet
> http://www.unicode.org/charts/PDF/U0250.pdf
Cool. Does there exist a speech synthesizer which can work from this? That
way we could confirm that we enter the correct pronunciation.
> so the next draft layout...
>
>
> <dictionary>
> <entry>
> <word_bn> chanaa </word_bn>
> <info pos="noun" plural="true" origin="??">
Since most words would have plural="true", we could omit that (the
default would be "true").
> <pron>....</pron>
> <meaning_bn> baccha </meaning_bn>
> <synonym_bn>...</synonym_bn>
> <synonym_bn>...</synonym_bn>
Any problem with giving multiple synonyms comma separated ?
> <antonym_bn>...</antonym_bn>
> <synonym_en>...</synonym_en>
> <synonym_en>...</synonym_en>
I still think a translation tag with a language attribute would be more
appropriate.
> <grammar>
> <derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
> <derivative form="of"
> num="singular">chhaanaaTir</derivative>
> <derivative form="of"
> num="plural">chhaanaader</derivative>
> </grammar>
> </info>
> <info pos="noun" plural="false" origin="??">
> <pron>...</pron>
> <meaning_bn> khabar... </meaning_bn>
> </info>
> </entry>
> </dictionary>
Otherwise looks OK (maybe an optional comment tag for each word),
unless someone else can think of something.
BTW, what's the use of the extra _bn for the tags (not that it matters) ?
Deepayan
----
From: Kaushik Ghose <kghose@wa...> - 2003-05-15 02:57
Hiya,
On Wed, 14 May 2003, Deepayan Sarkar wrote:
> Before we go any further, we need to decide how we are eventually planning to
> use the XML files.
>
> I don't think XML is a good format for use in any real application. For
> example, for a spell-checker to load the XML files directly would be very
> inefficient.
>
> Instead, the XML could be a repository of all possible information we might
> ever want to have. For a spell checker we could generate something that would
> contain only the words and nothing else (that could be a plain text file, or
> a database, could be in various different encodings and formats). Generating
> this from the XML may take a while, but if we do this once every two months
> or so, it shouldn't matter. Similarly for speech synthesis, we could extract
> only the actual word and its pronunciation, and leave everything else out.
>
> From that perspective, I don't think it should matter if the XML files become
> large. And of course we don't need to have a single file for each alphabet,
> we could split them as much as we want (maybe the first 3 letters identify
> each file) as long as given a word it's possible to identify which file that
> word belongs to.
>
> As for the translation, I'm not saying that we have to list translations into
> all possible languages. But there's no harm in keeping the option. In fact,
> initially we won't even have english translations for the words that we
> already have. And as you point out, not all words will even have an English
> translation. All this wouldn't matter if we allow an arbitrary number
> (including 0) of instances of the <translation> tag for each word.
>
Ok, that seems fine. The size of the files will matter for the GUI that
does the dicto editing and any online collaboration tool we come up with
for creating the dicto, but yes, we'll have automated tools to create
(like you said, maybe on the first of every two months) separate file clusters
for spell checkers, thesauri etc. which can be more compact.
Now, for the translation. Are we looking to put in one word that can link
this bangla word to a word in some other dicto ? Or are we looking to give
a translation of it ? For that we can probably end up with two sets of
tags.
<synonym lang ="">...</synonym>
<meaning lang ="">...</meaning>
where synonym is the one word thingy, meaning is well a paragraph or so.
> Yes, that should be good enough. Maybe in those cases
>
> <word_bn>gabAkSha</word_bn>
> <info ...>
> <meaning_bn type="refer">jAnalA</meaning_bn>
> </info>
Yes, good idea, I'd prefer a separate tag <refer> which would do this job.
we could do it via synonyms too, may be everything...
> Cool. Does there exist a speech synthesizer which can work from this ? That
> way we could confirm that we enter the correct pronunciation.
Didn't go much through it but here's a promising site
http://www.vorde.org/prodVordeTech/documents/vorde/split/node28.html
> > so the next draft layout...
> >
> >
> > <dictionary>
> > <entry>
> > <word_bn> chanaa </word_bn>
> > <info pos="noun" plural="true" origin="??">
>
> Since most words would have plural="true", we could omit that (the default
> would be "true").
>
> > <pron>....</pron>
> > <meaning_bn> baccha </meaning_bn>
> > <synonym_bn>...</synonym_bn>
> > <synonym_bn>...</synonym_bn>
>
> Any problem with giving multiple synonyms comma separated ?
>
> > <antonym_bn>...</antonym_bn>
> > <synonym_en>...</synonym_en>
> > <synonym_en>...</synonym_en>
Yeah, I couldn't figure out if commas would tell the parser these are
separate instances, or just one big glob of text, so I played it safe...
> I still think a translation tag with a language attribute would be more
> appropriate.
Yes.
> > <grammar>
> > <derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
> > <derivative form="of"
> > num="singular">chhaanaaTir</derivative>
> > <derivative form="of"
> > num="plural">chhaanaader</derivative>
> > </grammar>
> > </info>
> > <info pos="noun" plural="false" origin="??">
> > <pron>...</pron>
> > <meaning_bn> khabar... </meaning_bn>
> > </info>
> > </entry>
> > </dictionary>
>
> Otherwise looks OK (maybe an optional comment tag for each word), unless
> someone else can think of something.
>
> BTW, what's the use of the extra _bn for the tags (not that it matters) ?
Yeah, that should get replaced by the lang tag.
so here it is (hopefully I remembered everything)
<dictionary>
<entry>
<word>...</word>
<info pos="noun" plural="false" origin="." date=".">
<pron>...</pron>
<synonym lang="bn">...</synonym>
<synonym lang="bn">...</synonym>
<antonym lang="bn">...</antonym>
<synonym lang="en">...</synonym>
<meaning lang="bn">...</meaning>
<meaning lang="en">...</meaning>
<grammar>
<derivative form="the"
num="singular">...</derivative>
</grammar>
</info>
</entry>
</dictionary>
I'll make a DTD and see if I can make a GUI for it...
-kg
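
To illustrate what the `lang` attribute buys over the old `_bn`/`_en` tag suffixes, here is a small sketch that picks out the synonyms or meanings for one language. The sample entry is illustrative filler in the layout above; the helper name `texts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative entry in the agreed layout (romanized sample data).
SAMPLE = """<dictionary>
  <entry>
    <word>chhaanaa</word>
    <info pos="noun" plural="false">
      <synonym lang="bn">bachchaa</synonym>
      <synonym lang="en">child</synonym>
      <meaning lang="en">a young one</meaning>
    </info>
  </entry>
</dictionary>"""


def texts(entry, tag, lang):
    """Collect the text of every <tag lang="..."> below one <entry>."""
    return [e.text for e in entry.iter(tag) if e.get("lang") == lang]


entry = ET.fromstring(SAMPLE).find("entry")
english = texts(entry, "synonym", "en")
bengali = texts(entry, "synonym", "bn")
print(english, bengali)
```

Adding another language later needs no new tag names, only a new `lang` value.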
----
From: Deepayan Sarkar <deepayan@st...> - 2003-05-15 04:13
On Wednesday 14 May 2003 21:56, Kaushik Ghose wrote:
> Ok, that seems fine. The size of the files will matter for the GUI that
> does the dicto editing and any online collaboration tool we come up with
> for creating the dicto, but yes, we'll have automated tools to create
> (like you said, maybe on the first of every two months) separate file clusters
> for spell checkers, thesauri etc. which can be more compact.
Yes, we do need to plan ahead so that individual files don't get very big.
Since the main purpose of the GUI is to enter new words and edit existing
words, the only requirement is that given a word we should be able to figure out
which file it should be in. That way, if the file doesn't exist, the program
could create a blank instance of the XML document object, and if it does
exist, parse it and read it into memory.
As for the file structure, we could consider a separate directory for each
starting character, then one file for each combination of first 3 letters
(I'm not sure what the best way to name these files would be). But we may
need to adjust this depending on how many files per directory and how many
words per file this would make. Could you run through the existing words and
get an estimate (basically count combinations of first 3 characters) ?
> Now, for the translation. Are we looking to put in one word that can link
> this bangla word to a word in some other dicto ? Or are we looking to give
> a translation of it ? For that we can probably end up with two sets of
> tags.
>
> <synonym lang ="">...</synonym>
> <meaning lang ="">...</meaning>
>
> where synonym is the one word thingy, meaning is well a paragraph or so.
Again, no harm in keeping the option (that way, we could potentially have a
bengali to english dictionary as well as a bengali to bengali).
> > Yes, that should be good enough. Maybe in those cases
> >
> > <word_bn>gabAkSha</word_bn>
> > <info ...>
> > <meaning_bn type="refer">jAnalA</meaning_bn>
> > </info>
>
> Yes, good idea, I'd prefer a separate tag <refer> which would do this job.
> we could do it via synonyms too, may be everything...
OK.
> > Any problem with giving multiple synonyms comma separated ?
> >
> > > <antonym_bn>...</antonym_bn>
> > > <synonym_en>...</synonym_en>
> > > <synonym_en>...</synonym_en>
>
> Yeah, I couldn't figure out if commas would tell the parser these are
> separate instances, or just one big glob of text, so I played it safe...
The comma is not special in XML, so it would be interpreted as a single long
string. But we could always interpret them correctly inside applications.
Anyway, it's not that important.
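
For what "interpret them correctly inside applications" might mean in practice: the parser hands back one string, and the application splits it. A minimal sketch, with the helper name `split_values` being an assumption:

```python
import xml.etree.ElementTree as ET


def split_values(element):
    """Split an element's text on commas, dropping surrounding whitespace."""
    text = element.text or ""
    return [part.strip() for part in text.split(",") if part.strip()]


elem = ET.fromstring(
    '<derivative form="the">chhaanaaTaa, chhaanaaTi</derivative>'
)
values = split_values(elem)
print(values)
```

This only works as long as no individual value can itself contain a comma, which is one argument for repeating the tag instead.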
> so here it is (hopefully I remembered everything)
>
> <dictionary>
> <entry>
> <word>...</word>
> <info pos="noun" plural="false" origin="." date=".">
What's date ? The last modification time ?
> <pron>...</pron>
> <synonym lang="bn">...</synonym>
> <synonym lang="bn">...</synonym>
> <antonym lang="bn">...</antonym>
> <synonym lang="en">...</synonym>
> <meaning lang="bn">...</meaning>
> <meaning lang="en">...</meaning>
> <grammar>
> <derivative form="the"
> num="singular">...</derivative>
> </grammar>
> </info>
> </entry>
> </dictionary>
>
> I'll make a DTD and see if I can make a GUI for it...
Great. I have done this sort of programming in Python, but not C++. I might be
able to help once you get something going. I think it might be useful to
start by writing a class to represent a single XML file, with methods to add
and modify tags (rather than directly accessing the XML document object all
the time). That way, if there are minor changes in the DTD, we just need to
modify this class.
Deepayan
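
A rough Python sketch of such a wrapper class is given below, using the standard `xml.etree.ElementTree` module. The class and method names are hypothetical; the point is only that callers go through a few methods, so a small change in the DTD touches one class.

```python
import xml.etree.ElementTree as ET


class DictionaryFile:
    """Thin wrapper so callers never touch the XML tree directly."""

    def __init__(self, root=None):
        # Start from a blank <dictionary> when no parsed tree is supplied.
        self.root = root if root is not None else ET.Element("dictionary")

    @classmethod
    def from_string(cls, xml_text):
        return cls(ET.fromstring(xml_text))

    def find_entry(self, word):
        for entry in self.root.findall("entry"):
            if entry.findtext("word") == word:
                return entry
        return None

    def add_entry(self, word):
        """Return the existing <entry> for word, or create an empty one."""
        entry = self.find_entry(word)
        if entry is None:
            entry = ET.SubElement(self.root, "entry")
            ET.SubElement(entry, "word").text = word
        return entry

    def add_info(self, word, pos, **children):
        """Append an <info pos="..."> block; keyword args become child tags."""
        info = ET.SubElement(self.add_entry(word), "info", {"pos": pos})
        for tag, text in children.items():
            ET.SubElement(info, tag).text = text
        return info

    def to_string(self):
        return ET.tostring(self.root, encoding="unicode")


doc = DictionaryFile()
doc.add_info("gabAkSha", "noun", refer="jAnalA")
doc.add_info("gabAkSha", "noun", pron="...")
xml_out = doc.to_string()
print(xml_out)
```

If a tag is renamed in the DTD later, only the method that writes it has to change.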
----
From: Kaushik Ghose <kghose@wa...> - 2003-05-16 15:07
<?xml version="1.0"?>
<!ELEMENT dictionary (entry*)>
<!ELEMENT entry (word, info*)>
<!ELEMENT word (#PCDATA)>
<!ELEMENT info (refer?, pron?, synonym?, antonym?, meaning?, grammar?)>
<!ATTLIST info
  pos (n|adj|v|adv) "n"
  plural (true|false) "false"
  origin CDATA "????????????"
  date CDATA #IMPLIED>
<!ELEMENT refer (#PCDATA)>
<!ELEMENT pron (#PCDATA)>
<!ELEMENT synonym (#PCDATA)>
<!ATTLIST synonym lang CDATA "bn">
<!ELEMENT antonym (#PCDATA)>
<!ATTLIST antonym lang CDATA "bn">
<!ELEMENT meaning (#PCDATA)>
<!ATTLIST meaning lang CDATA "bn">
<!ELEMENT grammar (derivative?)>
<!ELEMENT derivative (#PCDATA)>
<!ATTLIST derivative form (the|of) "the" num (singular|plural) "singular">
Also, to answer Deepayan's question: by date I was thinking of date of
origin, first use etc.
Will potter with Qt.
Right now, I'm going to hardcode the DTD structure. I can't think of a
simple way of creating an editor that will parse the DTD and configure the
GUI on the fly; fixed boxes for all the elements will be quicker for a DTD
of this size.
PS. try the perl tool at
http://www.sagehill.net/livedtd/download.html
-kg
</thread>
|
|
From: Salahuddin P. <sal...@gm...> - 2009-05-14 16:11:59
|
On May 13, 2009, at 10:57 PM, Deepayan Sarkar wrote:
> Back in 2003, the bengalinux dictionary list had a discussion on this.
> Nothing ever came out of it, and when Golam first started on anubadok,
> his emphasis was more specialized. In any case, that discussion may
> provide some suggestions.
>
> You can get it from the list archives, and I'm also attaching a
> cleaned up and edited version of the thread here:
>
> .......................
Dear Deepayan bhai,
Thank you for your mail. Here is the current, updated example:
<?xml version="1.0" encoding="utf-8"?>
<dictionary>
<search_results>
<dict_entry>
<bdict_id>68218</bdict_id>
<en_word>apple</en_word>
<pos_tag>Proper noun, singular</pos_tag>
<penn_tag>NP</penn_tag>
<bn_pronunciation></bn_pronunciation>
<en_leema></en_leema>
<bn_word>অ্যাপল</bn_word>
<explanation></explanation>
<example>উদাঃ</example>
<status>EDITED</status>
</dict_entry>
</search_results>
</dictionary>
Based on Deepayan bhai's mail, I think we still need to add the fields
below. We will add them in a later version, as we do not have enough
information for these fields now.
origin="deshi"
<synonyms>...</synonyms>
<antonyms>...</antonyms>
<entry>
<info pos="noun" plural="false" origin="deshi">
<synonyms>...</synonyms>
<antonyms>...</antonyms>
</info>
</entry>
<grammar>
<derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
<derivative form="of" num="singular">chhaanaaTir</derivative>
<derivative form="of" num="plural">chhaanaader</derivative>
</grammar>
Another question is which would be better for us: using the <grammar> tag
and storing the information in nested tags, or the plain layout in the
updated example above?
regards
salahuddin
|