scdict - Browse /scdict-0.0.1 at SourceForge.net

Name
====

    scdict - provides an interface for searching words in scanned dictionaries
    (PDF, DJVU, etc.).

Synopsis
========

scdict [OPTION]

    -help

Print help.

    -version

Print the version.

    -default-dictionary <dictionary>
    -dd <dictionary>

Set the default dictionary.

Quick tutorial
==============

Compilation and installation are quite straightforward provided you have GNU
CLISP installed: `make` and `make install` should suffice.  So suppose you
have the application installed.

Suppose you have a scanned copy of a dictionary (one that has not been OCRed).
The idea is as follows: you type in a word on the console, and the dictionary
opens at the proper page.

Obviously, it suffices for the computer to know the first entry on each page
of the word list, or the last one.  Then given any word, it is possible to
exactly calculate the page where it must be located.  We call the list of such
words an index.  Indexing a dictionary with several hundred pages is quite
boring, but actually it doesn't take that much time and pays off greatly.

Basically, you turn over the pages of a dictionary and put down the last (or
the first) entry on every page.  Thus you obtain a text file.
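
The lookup described above can be sketched in a few lines of Python.  This is
only an illustration, not SCDICT's actual code (SCDICT is written in Common
Lisp).  The function name `find_page`, its parameters, and the plain
case-folded string comparison are assumptions made for the example; the real
programme compares words according to the dictionary's alphabet.

```python
import bisect


def find_page(word, index, first_page=1, first_on_page=False):
    """Return the page on which `word` should be located.

    `index` holds one entry per page, in dictionary order: the last
    entry of each page, or the first one if `first_on_page` is true.
    Assumption: plain case-folded comparison stands in for a real alphabet.
    """
    key = word.casefold()
    keys = [entry.casefold() for entry in index]
    if first_on_page:
        # The page is the last one whose first entry is <= the word.
        position = max(bisect.bisect_right(keys, key) - 1, 0)
    else:
        # The page is the first one whose last entry is >= the word.
        position = min(bisect.bisect_left(keys, key), len(keys) - 1)
    return first_page + position
```

With an index of last entries `["apple", "cat", "dog"]` and a word list
starting at page 21, the word "bear" falls between "apple" and "cat", so it is
looked up on the second indexed page.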

So you've got an index.  SCDICT automates the search.  Moreover, it is
designed with a view to uniform treatment of indices with respect to
installation and sharing.  It also makes an attempt to establish more or less
universal data formats for this purpose.

SCDICT uses JSON as the serialization format.  It is a convenient
human-readable format for serializing scalar data (numbers, strings, Boolean
values), arrays (ordered collections) and objects (unordered collections of
key-value pairs, where each key is a string).  See Wikipedia for details.
Note that strings are quoted, brackets denote arrays, and braces denote
objects.

Suppose you have a Foo-Bar dictionary ~/my-dicts/foo-bar.pdf.  Then you can
create the directory ~/.scdict/data/foo-bar/ and place the index file
foo-bar.ind and the information file foo-bar.json there.  The latter may
contain the following record:

    {
        "name"              : "FOO-BAR",
        "title"             : "The Copious Foo-Bar Dictionary",
        "description"       : "John Doe, The Copious Foo-Bar Dictionary.",
        "alphabet"          : "LATIN",
        "file"              : "foo-bar.pdf",
        "format"            : "PDF",
        "indexFile"         : "foo-bar.ind",
        "indexFormat"       : "PLAIN",
        "indexDirectory"    : "foo-bar/",
        "firstPage"         : 21,
        "firstOnPage"       : false
    }

The fields of an object representing a dictionary are described in detail in
the section Dictionaries.  Note that you indicate "foo-bar/" as the index
directory, tacitly assuming it to be a subdirectory of ~/.scdict/data/, so
that SCDICT will be able to find the index file
~/.scdict/data/foo-bar/foo-bar.ind.  However, SCDICT still won't be able to
find the PDF file.  So you create one more JSON file containing the record

    {
        "name"              : "FOO-BAR",
        "directory"         : "~/my-dicts/"
    }

You still name it foo-bar.json and put it under ~/.scdict/extra.

Now SCDICT is able to find and read both files and combine the information. 

The contents of the directory foo-bar constitute a kind of package, which is
portable in the sense that you can copy it from one computer to another
without needing to modify anything.  You will only have to supply the file
indicating the actual location of the dictionary.

SCDICT can handle multi-volume dictionaries as well.  They have the same
fields as single-volume ones, but certain fields can be array-valued.

At this point there remains only one thing to get everything working: you must
instruct SCDICT how to open files at given pages.  This is done in your
configuration file ~/.scdict/scdictrc.lisp.  See section Configuration Files
and examples therein.

The command-line interface is ideal for searching.  If the programme starts
successfully, it offers you a list of dictionaries, one of them being active.
If it is the dictionary you need, all you have to do is type in word after
word, pressing Enter after each.  However, if you start your input with a
space (or spaces), it is interpreted as a command and not as a word to look
up.  This is how you can change the dictionary if SCDICT guessed wrongly.
Just type

    <Space>d <name of dictionary>

Here `<name of dictionary>` is the dictionary's unique identifier given in the
`name` field of the JSON record.  Note that commands are case insensitive and
so are dictionary names.  There are more commands (see Commands).

By the way, you can set the default dictionary using the `-dd` command line
option.


Configuration files
===================

User preferences are set in the configuration file ~/.scdict/scdictrc.lisp,
while system-wide preferences are kept in +scdict+/scdictrc.lisp, where
+scdict+ typically has the value /usr/local/share/scdict.  User configs
override system ones.  First of all, you will want to use them for setting
default viewers.

The configuration files are Common Lisp source files.

In order to set a default viewer, you can first define a viewer.  A viewer is
regarded as a rule of the kind: take a file name, take a number and return a
shell command that would open the file at the given page.  In other words, a
viewer is a function taking a file name and an integer as arguments and
returning a string.  Say you want to use djview4 as your default DjVu viewer.
A possible command to open a file at a given page would be
 
    djview -page=<page> "<file>"

(see `man djview` for more options).  Note that you should quote the file name
as it may contain spaces.  In Lisp you use the following syntax:

    (defviewer default-djvu-viewer (f p) "djview -page=~A \"~A\"" p f)

Here `defviewer` means that we are defining a new viewer and
`default-djvu-viewer` is its name (you could choose another one to your
liking).  Next, as we mentioned, the viewer must accept two arguments, a file
name and a page number, which we conventionally denoted by `<file>` and
`<page>`.  Here in the Lisp expression we choose the letters f and p to denote
them respectively.  You could choose any two letters (identifiers) you like,
but the first identifier in parentheses after the viewer's name will always
stand for the file name, while the second one will stand for the page.  Next
we see our command in the form of a so-called format string.  The string has
two gaps for the page number and the file name, which are both represented by
~A.  After the string we supply the values to fill in the gaps in the order
the gaps occur, i. e. first the page number, then the file name.  Note that
the format string (as any string) is delimited by double quotes and you must
escape literal double quotes with backslashes.

Here is another example for mupdf:

    (defviewer my-favourite-pdf-viewer (file page)
      "mupdf \"~A\" ~A" file page)

Here the syntax is simpler and can be written in the form
    
    mupdf <file> <page>

We have used the variable names `file` and `page` instead of just f and p.

Observe that you can split Lisp code between multiple lines (anyway, the
computer reads it by parentheses) and it is customary (though not absolutely
necessary) to indent the body of a macro by two spaces.  Further, observe that
Lisp code is typically written in lower case with hyphens as separators
(rather than e. g. underscores).  Of course, a dedicated text editor makes
formatting the code much easier, but if you are not going to code that much,
you can have pretty rc files even without it.

Next, we want to associate our viewers to formats.  This can be done as
follows:

    (set-default-viewer "PDF" #'my-favourite-pdf-viewer)
    (set-default-viewer "DJVU" #'default-djvu-viewer)

The magic word `set-default-viewer` does the trick.  It is followed by the
format designator (the same kind of strings as "format" fields in the JSON
files) and by the name of the viewer preceded by a sharp quote.  You don't
have to worry about sharp quotes, just accept it.

You could spare names and use anonymous viewers created by the macro
`make-viewer`, which takes the same arguments as `defviewer` except the viewer
name, i. e. a lambda list, a format string, and format arguments.  So you
could set a default viewer like that:

    (set-default-viewer "DJVU"
      (make-viewer (f p) "djview -page=~A \"~A\"" p f))

This time without sharp quote, never mind.

Moreover, you can establish viewers for individual dictionaries using the
function set-viewer as follows:
    
    (set-viewer "MY-DICT"
                (make-viewer (f p)
                  "djview -fullscreen -page=~A \"~A\"" p f))

Here we decided to open the dictionary in fullscreen mode and created an
anonymous viewer (`make-viewer` without a sharp quote).  Alternatively we
could have created a named viewer beforehand and used its name (with a sharp
quote).  Funny, eh?
  
Since a viewer is nothing more than a function taking a string and an integer
and returning a string, you can create more sophisticated ones if you ever
need them.
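
To make the contract concrete, here is a model of a viewer written in Python.
It is only an illustration of the idea "file name and page in, shell command
out"; it cannot be used in scdictrc.lisp, where viewers are Lisp functions.
The name `djvu_viewer` and the `fullscreen` parameter are hypothetical, and
quoting is done with `shlex.quote` rather than with the literal double quotes
used in the Lisp examples.

```python
import shlex


def djvu_viewer(file, page, fullscreen=False):
    """Return a shell command that opens `file` at `page` in djview."""
    options = ["-fullscreen"] if fullscreen else []
    options.append(f"-page={page}")
    # shlex.quote protects file names containing spaces or quotes.
    return " ".join(["djview", *options, shlex.quote(file)])
```

A "more sophisticated" viewer is then simply a function with more logic in its
body, for instance one that chooses options depending on the file name.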

Generally, you can use all the standard Common Lisp in the scdictrc.lisp
files; besides, you can use functions from the CL-FAD library without package
prefix, and you have the +scdict+ and +version+ constants holding the
system-wide data directory and version respectively. When the Lisp reader
reads the configuration files, its current package is SCDICTRC. 

Commands
========

SCDICT knows that you are typing a command and not just a word to look up if
you type one or more spaces first.  Stretches of spaces separate the command
and its arguments from one another.  The following commands are available:

    d <name>        Make dictionary <name> the current one.
    info            Information about current dictionary.
    info <name>     Information about dictionary <name>.
    show w          Explain absence of warranty.
    show c          Explain conditions of redistribution.
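
The rule for telling commands from words can be sketched as follows.  This is
a hedged Python illustration, not SCDICT's own code; the function name
`classify_input` and the shape of the returned tuple are assumptions made for
the example.

```python
def classify_input(line):
    """Split one line of user input into (kind, head, arguments).

    Input starting with a space is a command; anything else is a word
    to look up.  Command names are lowercased because commands are
    case insensitive.
    """
    if line.startswith(" "):
        # Stretches of spaces separate the command and its arguments.
        parts = [part for part in line.split(" ") if part]
        if not parts:
            return ("command", "", [])
        return ("command", parts[0].lower(), parts[1:])
    return ("word", line.strip(), [])
```

For example, the input `" d FOO-BAR"` is classified as the command `d` with
the single argument `FOO-BAR`, whereas `"d"` without a leading space is just
the word "d".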


Dictionaries
============

SCDICT can handle single-volume and multi-volume dictionaries.  Their JSON
representations have the same fields, but the interpretation is slightly
different.

First consider a single-volume dictionary.

Obligatory fields:
------------------

**name**  
String (case insensitive, uppercase preferred)  
The identifier of the dictionary.

**file**  
String  
An absolute or relative path to the file of the dictionary (see Where SCDICT
looks for files).

**alphabet**  
String (case insensitive, uppercase preferred)  
The alphabet used in the dictionary (see Alphabets).

**indexFile**  
String  
An absolute or relative path to the index file (see Where SCDICT looks for
files).

**indexFormat**  
String (case insensitive, uppercase preferred)  
The format of the index (see Index Formats).

Optional fields:
----------------

**format**  
String (case insensitive, uppercase preferred)  
The format of the dictionary.  It serves to define the viewer that will open
the dictionary file.  See Configuration Files for setting up viewers.  
Default: the uppercased version of the dictionary file's `extension` (i. e.
the part of the name after the last dot provided it isn't the dot starting the
file name; if there are no dots in the file name except perhaps the dot
starting it, the extension is the empty string).

**directory**  
String  
An absolute or relative path to the directory containing the dictionary file
(see Where SCDICT looks for files).  It has no default value and can be left
out altogether.

**firstPage**  
Integer  
The number of the page in the file corresponding to the first index entry.  
Default: 1.  

**firstOnPage**  
Boolean  
`true` if the first word on each page is indexed; `false` if the last word on
each page is indexed.  
Default: `false`.

**indexDirectory**  
String  
An absolute or relative path to the directory containing the index file (see
Where SCDICT looks for files).  It has no default value and can be left out
altogether.

**title**  
String  
The title of the dictionary.  
Default: "".

**description**  
String  
General information about the dictionary.  
Default: "".

No other fields are handled, even though they may be present.
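
The default for `format` described above (the uppercased extension of the
dictionary file) can be illustrated with a short Python sketch.  The function
name `default_format` is hypothetical, and this is not SCDICT's own code.

```python
import os.path


def default_format(file_name):
    """Return the uppercased extension of `file_name`, or "" if none.

    A dot that starts the file name does not begin an extension.
    """
    base = os.path.basename(file_name)
    dot = base.rfind(".")
    if dot <= 0:  # no dot at all, or only the leading dot
        return ""
    return base[dot + 1:].upper()
```

So `foo-bar.pdf` gets the format "PDF", while a file called `.hidden` or
`noext` gets the empty string.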

Multi-volume dictionaries have the same obligatory and optional fields.  The
fields `alphabet`, `firstOnPage`, `title` and `description` have exactly the
same meanings and properties as for single-volume dictionaries (`firstOnPage`
pertains to the indices of each volume).  The fields `file`, `indexFile`,
`indexFormat`, `format`, `directory`, `firstPage`, and `indexDirectory` have
analogous meanings but can optionally be arrays (of strings, or of integers in
the case of `firstPage`), except for `file` and `indexFile`, which must be
arrays of strings.  If one of these fields is a scalar (in particular, if it
assumes its default value), it is taken to pertain to each volume.  E. g., a
single string for `indexDirectory` means that all the index files reside in
the same directory, and an omitted `firstPage` means that the word list of
each volume starts at page 1.
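
The "scalar pertains to each volume" rule can be sketched in Python as
follows.  The helper name `per_volume` and the file names in the record are
hypothetical, and the sketch is an illustration rather than SCDICT's own code.

```python
def per_volume(record, field, volumes, default=None):
    """Return the value of `field` for each volume as a list."""
    value = record.get(field, default)
    if isinstance(value, list):
        if len(value) != volumes:
            raise ValueError(f"{field}: expected {volumes} items")
        return list(value)
    # A scalar (or the default) applies to every volume.
    return [value] * volumes


# A hypothetical two-volume record.
record = {
    "name": "FOO-BAR",
    "file": ["foo-bar-1.pdf", "foo-bar-2.pdf"],
    "indexFile": ["foo-bar-1.ind", "foo-bar-2.ind"],
    "indexDirectory": "foo-bar/",
}
volumes = len(record["file"])
index_dirs = per_volume(record, "indexDirectory", volumes)
first_pages = per_volume(record, "firstPage", volumes, default=1)
```

Here both index files are looked for in `foo-bar/`, and both word lists are
taken to start at page 1 because `firstPage` is omitted.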

Alphabets
=========

SCDICT is quite flexible regarding alphabets.  Lexicographic sorting with
respect to national alphabets can drastically differ from e. g. Unicode code
points.  For example, in older Spanish spelling *ch* is regarded as a single
letter following *c*, so that *chapa* is listed after *curso*; likewise, *ny*
and *sz* are single letters in Hungarian; in German *o* and *ö* occupy the
same place in alphabetical order, whereas in Swedish *ö* is a separate letter,
and the last one in the alphabet at that; in Danish *aa* is listed together
with *å* at the end of the alphabet, while in German we have a contrary
situation: the ligature *ß* has the value of the two letters *ss*.

SCDICT ships with a few predefined alphabets that cover a lot of languages.
Alphabets are identified by strings (case insensitive, upper case preferred).
It is these strings that are the values of "alphabet" fields in JSON objects
representing dictionaries.

Here is the list of supported alphabets:

*   LATIN   

Twenty-six English letters with optional diacritics and ligatures ß = ss, æ =
ae, œ = oe.  Suitable for English, French, German and a lot of other
languages.  You can type in French words omitting accents (but you can put
accents if you want to, or even put them incorrectly and mix upper and lower
case; that doesn't influence the result).

*   SWEDISH
*   NORWEGIAN
*   DANISH
*   SPANISH
*   POLISH
*   CZECH
*   RUSSIAN

If you need to break a digraph (e. g. to list Aachen at the beginning of a
Danish dictionary, even though ordinarily *aa* = *å*), it suffices to insert a
non-letter character between its elements, e. g. *A|achen*.  The intrusion
will be skipped since it can't be analyzed in terms of the Danish alphabet,
but it will separate the two a's.

JSON files with user alphabets are under ~/.scdict/alphabet.  One file can
contain one or several alphabets.  File names are irrelevant.

If you are not interested in creating your own alphabets (or if the examples
in +scdict+/alphabet are enough for you), you can skip the rest of the
section.

SCDICT regards the alphabet as a union of letters and ligatures.

From the point of view of representation, a letter is a class of equivalent
strings.  For example, in German lexicographic order "a", "A", "ä", and "Ä"
are identical.  Thus we define a (generalized) letter by enumerating such
equivalent strings.  A sequenced collection of letters represents an alphabet.
But the alphabet can also comprise ligatures, i. e. single characters
equivalent to a sequence of characters.

Given a string of characters, it is parsed as follows.  Ligatures explode into
equivalent character sequences.  Then we start from the beginning of the
string and search for the longest match with a letter string.  If there is no
match, we skip the first character; otherwise we collect the letter and
proceed with the rest of the string.  Observe that this parsing algorithm
ignores any non-parsable characters and sequences of characters.
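
The parsing procedure just described can be sketched in Python.  This is an
illustration under stated assumptions, not SCDICT's own code: the function
name `parse_word` is hypothetical, a word is represented by the list of
positions of its letters in the alphabet, and case insensitivity is modelled
by simple lowercasing.

```python
def parse_word(word, letters, ligatures=(), case_sensitive=False):
    """Parse `word` into a list of letter positions in the alphabet.

    `letters` is a sequence whose items are strings or lists of
    equivalent strings; `ligatures` is a sequence of (ligature,
    expansion) pairs.  Assumption: lowercasing models case folding.
    """
    fold = (lambda s: s) if case_sensitive else str.lower
    rank_of = {}
    for rank, letter in enumerate(letters):
        variants = [letter] if isinstance(letter, str) else letter
        for variant in variants:
            rank_of[fold(variant)] = rank
    # Ligatures explode into equivalent character sequences.
    expansion = {fold(lig): fold(exp) for lig, exp in ligatures}
    text = "".join(expansion.get(ch, ch) for ch in fold(word))
    longest = max(map(len, rank_of), default=0)
    ranks, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            rank = rank_of.get(text[i:i + size])
            if rank is not None:
                ranks.append(rank)
                i += size
                break
        else:
            i += 1  # no letter matches here: skip one character
    return ranks
```

Comparing the resulting lists gives the lexicographic order of the alphabet.
With a toy alphabet in which "ch" is a letter following "c", the word *chapa*
parses to a list that sorts after that of *curso*, and `c|hapa` parses as "c"
followed by "h" because the bar is skipped but separates the two characters.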

In terms of JSON, an alphabet is represented as an object with one obligatory
field, `name`, and three optional fields: `letters`, `ligatures`, and
`caseSensitive`.

**name**  
String  
Obligatory identifier.

**letters**  
An array whose elements are strings or arrays of strings.  
Each element of the array `letters` is an array enumerating equivalent
strings.  However, if there is only one string in the equivalence class, this
very string can do instead of the one-element array.  
Default: empty array.

**ligatures**  
An array of two-element arrays of strings  
This is an enumeration of ligatures.  Each item has the form

    [<ligature>, <string>],

where `<ligature>` is a one-character string and `<string>` is its expansion.  
Default: empty array.

**caseSensitive**  
Boolean  
A flag that lets you at least halve the total number of enumerated strings:
when it is false, upper-case and lower-case variants need not be listed
separately.  
Default: false.
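
As an illustration, the following Python sketch reads such an object and
applies the defaults listed above.  The alphabet called "TOY" is made up for
the example and is not one of the alphabets shipped with SCDICT; the function
name `load_alphabet` is likewise hypothetical.

```python
import json


def load_alphabet(text):
    """Read one alphabet object from JSON text and fill in defaults."""
    obj = json.loads(text)
    if "name" not in obj:
        raise ValueError("an alphabet needs a `name` field")
    # A bare string stands for a one-element equivalence class.
    letters = [[item] if isinstance(item, str) else list(item)
               for item in obj.get("letters", [])]
    ligatures = [tuple(pair) for pair in obj.get("ligatures", [])]
    return {
        "name": obj["name"],
        "letters": letters,
        "ligatures": ligatures,
        "caseSensitive": obj.get("caseSensitive", False),
    }


# A made-up alphabet used only for this example.
EXAMPLE = """
{
    "name"      : "TOY",
    "letters"   : [["a", "á"], "b", "c", "ch", "d"],
    "ligatures" : [["æ", "ae"]]
}
"""
alphabet = load_alphabet(EXAMPLE)
```

After loading, every element of `letters` is a list, so "b" and `["a", "á"]`
can be handled uniformly.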

Indices
=======

The simplest kind of index is just a text file containing one word per line.
If a dictionary has such an index, its `indexFormat` is set to "PLAIN".

However, typos in an index can be more easily corrected if page numbers are
prepended to listed entries.  The index format "PLAIN-NUMBERS" implies a text
file with lines matching the regular expression `\d+\s+(.*)`, where only the
part in parentheses is used for searching.  In other words, each line is a
number (a non-empty sequence of digits) followed by one or more whitespace
characters (space or tab) followed by something else, and when SCDICT reads in
such a file, it actually strips the leading number and whitespace.  Note that
the leading number thus doesn't influence the search and is only intended to
facilitate human navigation in the file.

The "SIMPLE" format is somewhat more sophisticated: it is analogous to
"PLAIN", but empty lines and lines starting with # are skipped, thus allowing
for comments in index files.

"SIMPLE-NUMBERS" is the obvious combination of "SIMPLE" and "PLAIN-NUMBERS".

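The four index formats can be read with a short Python sketch like the one
below.  It is an illustration, not SCDICT's own reader; the function name
`read_index` is hypothetical, and raising an error on a numbered line that
does not match the pattern is an assumption (the README does not say what
SCDICT does in that case).

```python
import re

NUMBERED = re.compile(r"\d+\s+(.*)")
FORMATS = {"PLAIN", "PLAIN-NUMBERS", "SIMPLE", "SIMPLE-NUMBERS"}


def read_index(lines, index_format="PLAIN"):
    """Return the list of index entries contained in `lines`."""
    fmt = index_format.upper()
    if fmt not in FORMATS:
        raise ValueError(f"unknown index format: {index_format}")
    simple = fmt.startswith("SIMPLE")
    numbered = fmt.endswith("-NUMBERS")
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if simple and (line == "" or line.startswith("#")):
            continue  # empty lines and comments are skipped
        if numbered:
            match = NUMBERED.fullmatch(line)
            if match is None:  # assumption: treat as an error
                raise ValueError(f"no leading number: {line!r}")
            line = match.group(1)  # strip the number and whitespace
        entries.append(line)
    return entries
```

For instance, the "SIMPLE-NUMBERS" lines `# volume one` and `21 abacus` yield
the single entry "abacus".
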
Where SCDICT looks for files
============================

Briefly: there is the system directory /usr/local/share/scdict (if you
installed SCDICT to default locations) and the user directory ~/.scdict.  They
contain subdirectories.  The contents of the user directory override the
contents of the system one, and the *extra* directory overrides the *data*
directory.  The *data* directory is intended for permanent data, whereas the
*extra* directory is intended for user-specific information such as paths to
PDFs.

By default SCDICT looks for JSON files containing dictionary information in
the *data* and *extra* directories, the latter overriding the former.  Only
the contents of the files matter; file names are irrelevant.  Several JSON
objects (possibly contained in different files) may contain pieces of
information concerning the same dictionary (identified by `name`); in this
case the information is updated.  In theory SCDICT can look for JSON files in
other locations provided that absolute paths are specified in the
corresponding command line options.
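
The way several records about one dictionary are combined can be sketched in
Python as follows.  This is an illustration, not SCDICT's own code; the
function name `merge_records` is hypothetical, and passing the *data* records
before the *extra* records is how the sketch makes the latter win.

```python
def merge_records(*sources):
    """Merge dictionary records from several sources by `name`.

    Later sources override earlier ones field by field.  Names are
    compared case-insensitively.
    """
    merged = {}
    for records in sources:
        for record in records:
            key = record["name"].upper()
            merged.setdefault(key, {}).update(record)
    return merged


data = [{"name": "FOO-BAR", "file": "foo-bar.pdf", "alphabet": "LATIN"}]
extra = [{"name": "foo-bar", "directory": "~/my-dicts/"}]
dictionaries = merge_records(data, extra)
```

After merging, the record for FOO-BAR has both the `file` field from *data*
and the `directory` field from *extra*.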

If `indexFile` is an absolute path, it is assumed to be the path to the index
file.  If it is a relative path, it is prefixed with `indexDirectory`.  If
`indexDirectory` is an absolute path, the result is an absolute path and is
assumed to be the path to the index file.  If `indexDirectory` is a relative
path, the result is a relative path.  It is assumed to be the path to the
index relative to one of the *extra* or *data* directories.  Finally, if
`indexDirectory` and/or `indexFile` are arrays, the same rules are applied
component-wise.
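
These rules can be summarised by a Python sketch that lists the candidate
locations of one index file.  It is an illustration only: the function name
`index_candidates` is hypothetical, and the list of root directories and its
order are assumptions passed in by the caller, since the README does not fix
the exact search order.

```python
import os.path


def index_candidates(index_file, index_directory=None, roots=()):
    """Return the paths where the index file may be found.

    `roots` lists the *extra* and *data* directories to try
    (assumption: the caller decides their order).
    """
    index_file = os.path.expanduser(index_file)
    if os.path.isabs(index_file):
        return [index_file]
    path = index_file
    if index_directory:
        path = os.path.join(os.path.expanduser(index_directory), path)
    if os.path.isabs(path):
        return [path]
    # Still relative: try it under each root directory in turn.
    return [os.path.join(os.path.expanduser(root), path) for root in roots]
```

With `indexFile` "foo-bar.ind", `indexDirectory` "foo-bar/" and the roots
/home/u/.scdict/extra and /home/u/.scdict/data, the candidates are
/home/u/.scdict/extra/foo-bar/foo-bar.ind and
/home/u/.scdict/data/foo-bar/foo-bar.ind.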

If the path to the index file turns out to be relative, SCDICT takes it to be
relative to the *extra* or *data* directories.

SCDICT looks for alphabets in *alphabet* directories.

These principles are likely to change in future releases.

Questions and Answers
=====================

Q.  Why Common Lisp in configuration files?

A.  First of all, I think that it's better to use an existing language rather
than invent yet another scripting language.  Besides, this is way easier to
implement, given that the programme itself is written in Common Lisp.
Languages of the Lisp family are flexible, so they can be easily adapted to
the purposes of configuration.

Q.  How do I manage parentheses in configuration files?

A.  Well, pretty much in the same way as you manage them elsewhere.  I think
most editors are able to at least highlight matching parentheses.  If you
don't code much, you can do without Emacs or vim.  After all, the parentheses
are so cute, aren't they?

Q.  When I open a file in Okular (Evince, ...), the console gets messed up
with messages like

    okular(7185)/kdecore (KConfigSkeleton) KCoreConfigSkeleton::writeConfig:

What should I do?

A.  The simplest solution would be to redirect standard error to /dev/null:

    (defviewer okular (f p) "okular --page ~A \"~A\" 2>/dev/null" p f)

Q.  Okular (Evince, ...) opens files very slowly.

A.  It can't be helped.  You could try more lightweight viewers such as mupdf
for PDF and djview4 for DjVu.  The optimal format for scanned dictionaries
might be indirect DjVu (i. e. a DjVu consisting of multiple files).  By the
way, djview4 does not only open indirect DjVu files, but can also save a
bundled DjVu (single file) in the indirect format.

Q.  djview4 keeps opening the same page even though my command seems right.

A.  You must be using an old version from your repository.  This bug is no
longer present in newer releases.  You can compile djview4 from source.

Q.  Why GNU CLISP?

A.  Readline for example; besides, CLISP is very portable and creates smaller
executables as compared to SBCL.  Well, I just like CLISP.  After all, the
code is mostly ANSI.  It can be made portable using more libraries or at least
supplying alternatives in a few places.  I think I'll do that.

Q.  Why compile it if I can use it from Slime?

A.  Absolutely.  And from slimv, too.

In fact, all that you actually need is in the src/ directory, and the function
SCDICT.INTERFACE:MAIN starts the main loop.  The function takes the list of
parameters as parsed by APPLY-ARGV.  If you don't compile the programme, the
SCDICT.INTERFACE:\*SCDICT\* path is set to NIL and actually ignored, so all
your data is in ~/.scdict; however, you have to copy src/alphabet there
yourself.

Q.  Why JSON and not e. g. S-expressions?

A.  Here is a quotation from CL-JSON library's home page: "Many people find
parentheses difficult, but brackets and braces easy. That has led to many
implementations of JSON. There is no format based on s-expressions implemented
in over 20 languages (yet!)."  I didn't want to be Lisp-centric.  As the JSON
format is fairly universal and widely used, there can arise other
applications, which is good.

Q.  What about a Windows version?

A.  Briefly, I'm just not interested in it.  In theory this is possible, since
CLISP runs under Windows.
Source: README, updated 2014-01-06
scdict Files

A search engine for non-OCRed scanned dictionaries.
