File | Date | Author | Commit |
---|---|---|---|
infodb_scripts | 2015-01-27 | | [4daa77] Fixed create*Info.sh headers |
parsers | 2015-02-07 | | [2fdce3] Modified DMLBS parser to replace '\|' with '/' |
scripts | 2015-02-08 | | [7b326a] Added command-line arguments, usage text, and d... |
.gitignore | 2015-02-07 | | [308c16] Added .gitignore to ignore tests/, logs/ direct... |
README.md | 2014-08-25 | | [c9042c] Updated README.md |
logeion_parse.py | 2015-01-03 | | [25bbb4] Added modify flag, enabled status to logeion_parse |

Backend scripts, files, etc. for parsing/updating dictionaries. Feel free to pull this to
add your own dictionaries/try out Logeion for yourself; instructions for doing so are
at the bottom of this README.
NB: If you're running any of the Logeion db-building scripts, run them in the top-level
Logeion directory (i.e. /Users/Shared/Logeion_parsers on stephanus).
If you are not parsing the shortdefs or Greek textbooks, you may skip this step.
To grab Latin and Greek shortdefs, first run:
$ scripts/update_shortdefs.py <lemmastoknow db> <lexicon db>
This will update the lemmastoknow file with modified entries from the lexicon. Then, run
$ scripts/grab_lemmastoknow.py <dico> ...
dico: [HQ | JACT | LTRG | Mastro | shortdefs | all]
The appropriately named files should be in the current directory; put them in the right
spots in the dictionaries directory, making sure that each name matches what the parser
for that dictionary accepts (currently, the filename should be the same one that
grab_lemmastoknow.py spits out). Make sure that `lemmastoknow.sqlite` is in your current
directory. For example, if you need to reparse Hansen & Quinn, LTRG, and the Greek shortdefs,
then do the following (assuming the lemmastoknow file is in your current directory):
$ scripts/update_shortdefs.py <lemmastoknow> <lexicon>
$ scripts/grab_lemmastoknow.py HQ LTRG shortdefs
$ ls *.dat
hq.dat ltrg.dat shortdefs.dat
$ mv hq.dat path/to/HQ/
$ mv ltrg.dat path/to/LTRG/
$ mv shortdefs.dat path/to/GreekShortDefs/
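
If you regenerate these files often, the `mv` steps above can be scripted. The following is only a sketch: the `DESTINATIONS` mapping and the function name `place_dat_files` are hypothetical, and the folder paths are placeholders for wherever HQ, LTRG, and GreekShortDefs live in your checkout.

```python
import shutil
from pathlib import Path

# Hypothetical mapping from generated file to dictionary folder. The README
# only gives "path/to/...", so adjust these to your own layout.
DESTINATIONS = {
    "hq.dat": "dictionaries/HQ",
    "ltrg.dat": "dictionaries/LTRG",
    "shortdefs.dat": "dictionaries/GreekShortDefs",
}


def place_dat_files(src_dir, root, destinations=DESTINATIONS):
    """Move each generated .dat file found in src_dir into its folder under root.

    Returns the list of filenames that were moved.
    """
    moved = []
    for filename, folder in destinations.items():
        src = Path(src_dir) / filename
        if not src.exists():
            continue  # this dictionary was not regenerated; nothing to move
        dest_dir = Path(root) / folder
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest_dir / filename))
        moved.append(filename)
    return moved
```

Calling `place_dat_files('.', '.')` from the top-level directory would move whichever of the three files are present and leave the rest of the tree alone.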
Then run:
$ ./logeion_parse.py <name of dictionary> ([ --latin | --greek | --sidebar ])* ...
to regenerate each dictionary. For example, if you want to parse GreekShortDefs, LatinShortDefs,
and all of the textbooks, you would run:
$ ./logeion_parse.py GreekShortDefs LatinShortDefs --sidebar
If you want to regenerate all dictionaries, then just run:
$ ./logeion_parse.py --all
When it's finished, the new database will be in the current directory as `new_dvlg-wheel.sqlite`.
`parser.log` will detail any errors that the parsers reported. (They should also be visible on
STDOUT.)
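
Since `new_dvlg-wheel.sqlite` is the default output name, and this README warns that running `logeion_parse.py` with that file in the current directory overwrites the dictionaries being parsed, you may want to copy an existing database aside first. This is a minimal sketch, not part of the Logeion scripts; the function name `backup_database` and the `.bak` naming scheme are assumptions.

```python
import shutil
import time
from pathlib import Path


def backup_database(db_path="new_dvlg-wheel.sqlite"):
    """Copy db_path to a timestamped sibling file.

    Returns the backup path, or None if there is no database to back up.
    """
    src = Path(db_path)
    if not src.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Hypothetical naming scheme: <original name>.<timestamp>.bak
    backup = src.with_name("%s.%s.bak" % (src.name, stamp))
    shutil.copy2(str(src), str(backup))
    return backup
```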
$ ./logeion_parse.py <name of dictionary>
Add the `--db` option to target a specific database; for example, to reparse all of the Latin and
Greek dictionaries in `dvlg-wheel.sqlite`, run
$ ./logeion_parse.py --latin --greek --db dvlg-wheel.sqlite
Because `new_dvlg-wheel.sqlite` is the default database name, running `logeion_parse.py` with
`new_dvlg-wheel.sqlite` in your current directory will overwrite whatever dictionaries you are parsing.

Dictionaries consist of two objects: a folder containing the source files, and a parser file
containing a Python function that parses the source files plus other data that provides the
necessary metadata.
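
As an illustration, a parser file might look roughly like the sketch below. The property names `name`, `type`, and `caps` and the keys of each entry come from this README; everything else is an assumption made for the example, including the argument that `parse` takes (the README does not show its real signature), the helper names, and the rule of taking the text before the first comma as the lemma.

```python
import unicodedata

# Metadata properties named in this README; the values are hypothetical.
name = 'NewDico'    # same as the dictionary folder
type = 'latin'      # latin | greek | sidebar (shadows the builtin, as the README's name requires)
caps = 'uncapped'   # uncapped | source | precapped


def strip_diacritics(text):
    """Drop combining marks, e.g. a macron over a vowel."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_entry(raw_head, definition):
    """Build one entry dict from a raw headword such as 'amātus, -a, -um'."""
    lemma = raw_head.split(',')[0].strip()  # assumption: lemma precedes the first comma
    return {
        'head': strip_diacritics(lemma),
        'orth_orig': lemma,
        'content': '%s, %s' % (raw_head, definition),
    }


def parse(raw_entries):
    """Return a list of entry dicts.

    Taking (headword, definition) pairs as input is an assumption for this
    sketch; a real parser would read the files in its dictionary folder.
    """
    return [make_entry(head, definition) for head, definition in raw_entries]
```

With the README's own example, `parse([('amātus, -a, -um', 'beloved')])` yields one entry whose `head` is the plain lemma and whose `content` keeps the full original headword.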
The parser file contains:

- a function `parse`, which returns a list of `dict`s of values. Each entry in the list should have
  one of these forms:
  - `{'head': <lemma>, 'orth_orig': <lemma w/diacritics>, 'content': <entire entry>}`
  - `{'head': <lemma>, 'content': <entire entry>, 'chapter': <chapter #>}`
- the properties `name`, `type`, and `caps`:
  - `name`: name of the dictionary (same as dictionary folder)
  - `type`: (latin|greek|sidebar)
  - `caps`: (uncapped|source|precapped); `uncapped` means that capitalization needs to be applied,
    `source` means that it should serve as a source for capitalization info, and `precapped` means
    that it shouldn't be touched during capitalization.
  - `convert_xml` (optional): (True|False); determines whether the dictionary's content [...]
    `False` or don't include it [...]

Put the parser file in `Logeion_parsers/parsers`. The `logeion_parse.py` script uses the `parsers/`
directory as a plugins directory, i.e. it will automatically load all files in it [...] `name`
property in the parser file.

Put the source files in `Logeion_parsers/dictionaries`, in a folder named after the dictionary.
(For example, `NewDico.xml` should be in `Logeion_parsers/dictionaries/NewDico`.)

[...] `dOrder_<language>` [...] `headword.py` script. If the dictionary is not added to this list it
will [...]

[...] `{'amatus, -a, -um': 'beloved'}` => `{'amatus': 'amatus, -a, -um, beloved'}`, etc.
[...] `{'amātus, -a, -um': 'beloved'}` => `{'amatus': 'amātus, -a, -um, beloved'}`.

[...] `&(gt|lt|amp|apos|quot);`, they should [...]

dvlg-wheel.sqlite:
CREATE TABLE Entries(head text, orth_orig text, content text, dico text, lookupform text);
CREATE INDEX lookupform_index_e on Entries (lookupform);
CREATE TABLE Sidebar(head text, content text, chapter text, dico text, lookupform text);
CREATE INDEX lookupform_index_s on Sidebar (lookupform);
CREATE TABLE LatinHeadwords (head text);
CREATE TABLE GreekHeadwords (head text);
CREATE TABLE Transliterated (normhead text, transhead text);
CREATE INDEX trans_index on Transliterated (transhead);
(greek|latin)Info.db:
CREATE TABLE authorFreqs(lemma text, rank integer, author text, freq float, lookupform text);
CREATE TABLE collocations (lemma text, collocation text, count integer, lookupform text);
CREATE TABLE frequencies (lemma text, rank integer, count integer, rate real, lookupform text);
CREATE TABLE samples (lemma text, rank integer, sample text, author text, work text);
CREATE INDEX aF_l on authorFreqs(lookupform);
CREATE INDEX c_l on collocations(lookupform);
CREATE INDEX f_l on frequencies(lookupform);
CREATE INDEX s_lem on samples(lemma);
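
To see how a parsed entry fits the `Entries` schema above, the following sketch builds the table in a throwaway in-memory SQLite database, inserts one entry, and looks it up through the indexed `lookupform` column. The helper names are hypothetical, and so is the way `lookupform` is derived here (the lowercased `head`); this README does not say how the real scripts compute it.

```python
import sqlite3


def build_entries_db():
    """Create an in-memory database with the Entries table and its index."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE Entries(head text, orth_orig text, content text, '
                 'dico text, lookupform text)')
    conn.execute('CREATE INDEX lookupform_index_e on Entries (lookupform)')
    return conn


def insert_entry(conn, entry, dico):
    """Insert one parsed entry; lookupform = lowercased head is an assumption."""
    conn.execute(
        'INSERT INTO Entries VALUES (?, ?, ?, ?, ?)',
        (entry['head'], entry['orth_orig'], entry['content'], dico,
         entry['head'].lower()),
    )
    conn.commit()


def lookup(conn, lookupform):
    """Return (dico, content) pairs for every entry matching lookupform."""
    cursor = conn.execute(
        'SELECT dico, content FROM Entries WHERE lookupform = ?', (lookupform,))
    return cursor.fetchall()
```

For example, after `insert_entry(conn, entry, 'NewDico')` for an entry whose `head` is `'amatus'`, `lookup(conn, 'amatus')` returns that entry's dictionary name and content.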
If you want to deploy Logeion to a new server, or want to check it out and test it/make it your own,
follow these steps. (Sample databases are located here.)
- [...]
  $ ./logeion_parse.py <name of dictionary>
  [...] `dvlg-wheel-mini.sqlite`, then run
  $ ./logeion_parse.py <name of dictionary> --db dvlg-wheel-mini.sqlite
  (assuming `dvlg-wheel-mini.sqlite` is in your current directory).
- [...] `convert_xml` property is present in the parser file and set to `True`.
- [...] (`logeion-cgi` and `logeion-html`, respectively). The files in the `master` branch are
  configured with the two directories `cgi-bin` and `html` as siblings, though feel free [...]
- [...] `greekInfo.db`, `latinInfo.db`, and `dvlg-wheel.sqlite`. For the last, you may also rename
  `dvlg-wheel-mini.sqlite` or edit