From: <vic...@us...> - 2024-06-26 16:55:19

Revision: 2753
          http://sourceforge.net/p/axsl/code/2753
Author:   victormote
Date:     2024-06-26 16:55:17 +0000 (Wed, 26 Jun 2024)
Log Message:
-----------
Move "marker" attributes to an orthogonal element. Add flow element.

Modified Paths:
--------------
    trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd

Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd
===================================================================
--- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd 2024-06-26 12:49:01 UTC (rev 2752)
+++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd 2024-06-26 16:55:17 UTC (rev 2753)
@@ -19,7 +19,7 @@
 <!--
 The root of the intermediate document.
 -->
-<!ELEMENT axsl-spell-check-input (text*)>
+<!ELEMENT axsl-spell-check-input (flow*)>
 <!ATTLIST axsl-spell-check-input
     xml:lang CDATA #REQUIRED
 >
@@ -26,6 +26,17 @@
 
 
 <!--
+Indicates which flow the descendant elements are in. Some documents have more than one flow. For example, many books
+have both a main flow and a footnote flow. When tracing the location of a spell-check flag, it is useful to know in
+which of these flows the flag was detected.
+-->
+<!ELEMENT flow (text | marker)*>
+<!ATTLIST flow
+    name CDATA #REQUIRED
+>
+
+
+<!--
 Element "text" optionally and recursively may contain child "text" elements. Any
 of these may optionally declare an xml:lang attribute to define the orthography
 that should be used by its descendants for purposes of
@@ -40,18 +51,13 @@
 2. "line" is the (optional) line number in the original document, if known.
 3. "column" is the (optional) column number in the original document, if known.
 4. "xpath" is the (optional) xpath of the element in the original document, if known.
-5. "marker" is the (optional) marker from the original document. This might be, for example, a page number marked in the
-original document that refers back to the backing document. While "line", "column", and "xpath" give clues about the
-location of the original element in the original XML document, "marker" gives a clue about the location in the document
-used to create that original XML document.
 -->
-<!ELEMENT text (#PCDATA | word | foreign)*>
+<!ELEMENT text (#PCDATA | word | foreign | marker)*>
 <!ATTLIST text
     xml:lang CDATA #IMPLIED
     line CDATA #IMPLIED
     column CDATA #IMPLIED
     xpath CDATA #IMPLIED
-    marker CDATA #IMPLIED
 >
 
 
@@ -74,18 +80,13 @@
 2. "line" is the (optional) line number in the original document, if known.
 3. "column" is the (optional) column number in the original document, if known.
 4. "xpath" is the (optional) xpath of the element in the original document, if known.
-5. "marker" is the (optional) marker from the original document. This might be, for example, a page number marked in the
-original document that refers back to the backing document. While "line", "column", and "xpath" give clues about the
-location of the original element in the original XML document, "marker" gives a clue about the location in the document
-used to create that original XML document.
 -->
-<!ELEMENT word (#PCDATA)>
+<!ELEMENT word (#PCDATA | marker)>
 <!ATTLIST word
     xml:lang CDATA #IMPLIED
     line CDATA #IMPLIED
     column CDATA #IMPLIED
     xpath CDATA #IMPLIED
-    marker CDATA #IMPLIED
 >
 
 
@@ -101,18 +102,28 @@
 2. "line" is the (optional) line number in the original document, if known.
 3. "column" is the (optional) column number in the original document, if known.
 4. "xpath" is the (optional) xpath of the element in the original document, if known.
-5. "marker" is the (optional) marker from the original document. This might be, for example, a page number marked in the
-original document that refers back to the backing document. While "line", "column", and "xpath" give clues about the
-location of the original element in the original XML document, "marker" gives a clue about the location in the document
-used to create that original XML document.
 -->
-<!ELEMENT foreign (#PCDATA | word)* >
+<!ELEMENT foreign (#PCDATA | word | marker)* >
 <!ATTLIST foreign
     xml:lang CDATA #IMPLIED
     line CDATA #IMPLIED
     column CDATA #IMPLIED
     xpath CDATA #IMPLIED
-    marker CDATA #IMPLIED
 >
+
+<!--
+A marker from an original document, that is, a document backing the semantic XML that is being parsed to create this
+spell-check document. This might be, for example, a page number in such a backing document. This is in distinction from
+the the "line", "column", and "xpath" attributes, which give clues about the location of the original element in the
+original XML document. Marker elements are orthogonal to the text, word, and foreign elements, so they can occur in the
+middle of a word. This possibility must be considered by consumers of documents of this type to ensure that the entire
+word is considered for spell-checking instead of two fragments of the word.
+
+-->
+<!ELEMENT marker>
+<ATTLIST marker
+    value CDATA #REQUIRED
+>
+
 
 <!-- Last Line of DTD -->

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

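The DTD comment in this revision notes that a marker can fall in the middle of a word and that consumers must still spell-check the whole word. The following is a minimal Java sketch of that idea, assuming a consumer that reads the document with the JDK's DOM parser. The class `MarkerAwareWords` and its methods are hypothetical and not part of aXSL, and the sketch assumes `marker` is an empty element carrying only its `value` attribute.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical consumer-side helper; not part of aXSL. */
final class MarkerAwareWords {

    private MarkerAwareWords() { }

    /** Joins the character data of a "word" element, skipping child elements such as "marker". */
    static String wordText(final Element word) {
        final StringBuilder joined = new StringBuilder();
        final NodeList children = word.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            final Node child = children.item(i);
            final short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                joined.append(child.getNodeValue());
            }
        }
        return joined.toString();
    }

    /** Parses the XML and returns the joined text of every "word" element, in document order. */
    static List<String> words(final String xml) {
        try {
            final DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            final Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            final NodeList nodes = doc.getElementsByTagName("word");
            final List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(wordText((Element) nodes.item(i)));
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Unable to parse spell-check input", e);
        }
    }
}
```

With an invented input such as `<word>exam<marker value="p. 12"/>ple</word>`, `words` yields the single word `example` rather than the fragments `exam` and `ple`.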
From: <vic...@us...> - 2024-06-26 12:49:03

Revision: 2752
          http://sourceforge.net/p/axsl/code/2752
Author:   victormote
Date:     2024-06-26 12:49:01 +0000 (Wed, 26 Jun 2024)
Log Message:
-----------
Add "marker" attribute for data from the document backing the XML.

Modified Paths:
--------------
    trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd

Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd
===================================================================
--- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd 2024-06-07 14:25:23 UTC (rev 2751)
+++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd 2024-06-26 12:49:01 UTC (rev 2752)
@@ -40,6 +40,10 @@
 2. "line" is the (optional) line number in the original document, if known.
 3. "column" is the (optional) column number in the original document, if known.
 4. "xpath" is the (optional) xpath of the element in the original document, if known.
+5. "marker" is the (optional) marker from the original document. This might be, for example, a page number marked in the
+original document that refers back to the backing document. While "line", "column", and "xpath" give clues about the
+location of the original element in the original XML document, "marker" gives a clue about the location in the document
+used to create that original XML document.
 -->
 <!ELEMENT text (#PCDATA | word | foreign)*>
 <!ATTLIST text
@@ -47,6 +51,7 @@
     line CDATA #IMPLIED
     column CDATA #IMPLIED
     xpath CDATA #IMPLIED
+    marker CDATA #IMPLIED
 >
 
 
@@ -69,6 +74,10 @@
 2. "line" is the (optional) line number in the original document, if known.
 3. "column" is the (optional) column number in the original document, if known.
 4. "xpath" is the (optional) xpath of the element in the original document, if known.
+5. "marker" is the (optional) marker from the original document. This might be, for example, a page number marked in the
+original document that refers back to the backing document. While "line", "column", and "xpath" give clues about the
+location of the original element in the original XML document, "marker" gives a clue about the location in the document
+used to create that original XML document.
 -->
 <!ELEMENT word (#PCDATA)>
 <!ATTLIST word
@@ -76,6 +85,7 @@
     line CDATA #IMPLIED
     column CDATA #IMPLIED
     xpath CDATA #IMPLIED
+    marker CDATA #IMPLIED
 >
 
 
@@ -91,6 +101,10 @@
 2. "line" is the (optional) line number in the original document, if known.
 3. "column" is the (optional) column number in the original document, if known.
 4. "xpath" is the (optional) xpath of the element in the original document, if known.
+5. "marker" is the (optional) marker from the original document. This might be, for example, a page number marked in the
+original document that refers back to the backing document. While "line", "column", and "xpath" give clues about the
+location of the original element in the original XML document, "marker" gives a clue about the location in the document
+used to create that original XML document.
 -->
 <!ELEMENT foreign (#PCDATA | word)* >
 <!ATTLIST foreign
@@ -98,6 +112,7 @@
     line CDATA #IMPLIED
     column CDATA #IMPLIED
     xpath CDATA #IMPLIED
+    marker CDATA #IMPLIED
 >
 
 <!-- Last Line of DTD -->

From: <vic...@us...> - 2024-06-07 14:25:26

Revision: 2751
          http://sourceforge.net/p/axsl/code/2751
Author:   victormote
Date:     2024-06-07 14:25:23 +0000 (Fri, 07 Jun 2024)
Log Message:
-----------
Add line and column number to the Token interface.

Modified Paths:
--------------
    trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java

Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java
===================================================================
--- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java 2024-06-07 13:53:44 UTC (rev 2750)
+++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java 2024-06-07 14:25:23 UTC (rev 2751)
@@ -135,6 +135,18 @@
         WritingSystem getWritingSystem();
 
         /**
+         * Returns the line number within the lexer content at which this token begins.
+         * @return The line number, or -1 if unknown.
+         */
+        int getLine();
+
+        /**
+         * Returns the column number within the lexer content at which this token begins.
+         * @return The column number, or -1 if unknown.
+         */
+        int getColumn();
+
+        /**
          * Returns an immutable copy of this token.
          * Tokens are transient and mutable, and in most cases this doesn't matter, as the token will be looked at and
          * immediatly discarded.
@@ -145,7 +157,7 @@
          * Be sure to use an immutable implementation if that is important.
          */
         default Token getImmutableCopy() {
-            return new ImmutableToken(getText(), getTokenType(), getWritingSystem());
+            return new ImmutableToken(getText(), getTokenType(), getWritingSystem(), getLine(), getColumn());
         }
     }
 
@@ -166,16 +178,27 @@
         /** The writing system of the token. */
         private WritingSystem writingSystem;
 
+        /** The line, within the Lexer, at which this token begins. */
+        private int line;
+
+        /** The column, within the Lexer, at which this token begins. */
+        private int column;
+
         /**
          * Constructor.
          * @param text The text of the token.
          * @param type The type of the token.
          * @param writingSystem The writing system of the token.
+         * @param line The line, within the Lexer, at which this token begins.
+         * @param column The column, within the Lexer, at which this token begins.
          */
-        ImmutableToken(final CharSequence text, final TokenType type, final WritingSystem writingSystem) {
+        ImmutableToken(final CharSequence text, final TokenType type, final WritingSystem writingSystem, final int line,
+                final int column) {
             this.text = text.toString();
             this.type = type;
             this.writingSystem = writingSystem;
+            this.line = line;
+            this.column = column;
         }
 
         @Override
@@ -194,6 +217,16 @@
         }
 
         @Override
+        public int getLine() {
+            return this.line;
+        }
+
+        @Override
+        public int getColumn() {
+            return this.column;
+        }
+
+        @Override
         public Token getImmutableCopy() {
             return this;
         }

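The Javadoc in this revision explains that tokens are transient and mutable, and that `getImmutableCopy()` now carries the line and column along with the text. The sketch below illustrates why that matters, under simplifying assumptions: `SketchToken`, `ImmutableSketchToken`, and `ReusableSketchToken` are hypothetical stand-ins written for this example, and they leave out the `TokenType` and `WritingSystem` members of the real interface.

```java
/** Simplified, hypothetical stand-in for the Token interface; TokenType and WritingSystem are left out. */
interface SketchToken {

    /** Returns the text of the token. */
    CharSequence getText();

    /** Returns the line at which the token begins, or -1 if unknown. */
    int getLine();

    /** Returns the column at which the token begins, or -1 if unknown. */
    int getColumn();

    /** Returns a snapshot that keeps its text and location even if this token is later reused. */
    default SketchToken getImmutableCopy() {
        return new ImmutableSketchToken(getText(), getLine(), getColumn());
    }
}

/** Immutable snapshot of a token. */
final class ImmutableSketchToken implements SketchToken {
    private final String text;
    private final int line;
    private final int column;

    ImmutableSketchToken(final CharSequence text, final int line, final int column) {
        this.text = text.toString();
        this.line = line;
        this.column = column;
    }

    @Override public CharSequence getText() { return this.text; }
    @Override public int getLine() { return this.line; }
    @Override public int getColumn() { return this.column; }
    @Override public SketchToken getImmutableCopy() { return this; }
}

/** Hypothetical mutable token that a lexer might overwrite for each new token. */
final class ReusableSketchToken implements SketchToken {
    private final StringBuilder text = new StringBuilder();
    private int line = -1;
    private int column = -1;

    /** Overwrites the content and location of this token in place. */
    void reset(final CharSequence newText, final int newLine, final int newColumn) {
        this.text.setLength(0);
        this.text.append(newText);
        this.line = newLine;
        this.column = newColumn;
    }

    @Override public CharSequence getText() { return this.text; }
    @Override public int getLine() { return this.line; }
    @Override public int getColumn() { return this.column; }
}
```

A caller that needs to report a location later would take the copy before the lexer advances; a copy taken at line 3, column 7 keeps those values after the reusable token is reset.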
From: <vic...@us...> - 2024-06-07 13:53:45

Revision: 2750
          http://sourceforge.net/p/axsl/code/2750
Author:   victormote
Date:     2024-06-07 13:53:44 +0000 (Fri, 07 Jun 2024)
Log Message:
-----------
Make location information more explicit, for better downstream use.

Modified Paths:
--------------
    trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd

Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd
===================================================================
--- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd 2023-10-12 10:45:59 UTC (rev 2749)
+++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd 2024-06-07 13:53:44 UTC (rev 2750)
@@ -37,14 +37,16 @@
 Attributes:
 1. "xml:lang" is used to determine which dictionary(ies) should be used for the
    spell-checking.
-2. "location" stores an optional clue about where the element was located in the
-   original document. This could be the line/column number, "98:24" for example,
-   or perhaps an XPath.
+2. "line" is the (optional) line number in the original document, if known.
+3. "column" is the (optional) column number in the original document, if known.
+4. "xpath" is the (optional) xpath of the element in the original document, if known.
 -->
 <!ELEMENT text (#PCDATA | word | foreign)*>
 <!ATTLIST text
     xml:lang CDATA #IMPLIED
-    location CDATA #IMPLIED
+    line CDATA #IMPLIED
+    column CDATA #IMPLIED
+    xpath CDATA #IMPLIED
 >
 
 
@@ -64,14 +66,16 @@
 Attributes:
 1. "xml:lang" is used to determine which dictionary(ies) should be used for the
    spell-checking.
-2. "location" stores an optional clue about where the element was located in the
-   original document. This could be the line/column number, "98:24" for example,
-   or perhaps an XPath.
+2. "line" is the (optional) line number in the original document, if known.
+3. "column" is the (optional) column number in the original document, if known.
+4. "xpath" is the (optional) xpath of the element in the original document, if known.
 -->
 <!ELEMENT word (#PCDATA)>
 <!ATTLIST word
     xml:lang CDATA #IMPLIED
-    location CDATA #IMPLIED
+    line CDATA #IMPLIED
+    column CDATA #IMPLIED
+    xpath CDATA #IMPLIED
 >
 
 
@@ -80,11 +84,20 @@
 surrounding text. Such content does not mark the end of a processing segment,
 but only an interruption in it.
+
+Attributes:
+1. "xml:lang" is used to determine which dictionary(ies) should be used for the
+   spell-checking.
+2. "line" is the (optional) line number in the original document, if known.
+3. "column" is the (optional) column number in the original document, if known.
+4. "xpath" is the (optional) xpath of the element in the original document, if known.
 -->
 <!ELEMENT foreign (#PCDATA | word)* >
 <!ATTLIST foreign
     xml:lang CDATA #IMPLIED
-    location CDATA #IMPLIED
+    line CDATA #IMPLIED
+    column CDATA #IMPLIED
+    xpath CDATA #IMPLIED
 >
 
 <!-- Last Line of DTD -->

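The removed "location" attribute could hold either a line/column pair such as "98:24" or an XPath, so a producer moving to the separate "line", "column", and "xpath" attributes has to tell the two forms apart. The following is a rough Java sketch of that split. `LegacyLocation` and `parseLineColumn` are hypothetical names, and returning -1 for an unknown value is an assumption of this sketch; in the DTD itself the attributes are simply optional.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical migration helper; not part of aXSL. */
final class LegacyLocation {

    /** Value used by this sketch when the line or column is not known. */
    static final int UNKNOWN = -1;

    /** Matches the "line:column" form, for example "98:24". Digits are capped to avoid int overflow. */
    private static final Pattern LINE_COLUMN = Pattern.compile("(\\d{1,9}):(\\d{1,9})");

    private LegacyLocation() { }

    /**
     * Splits an old-style location value into {line, column}.
     * Anything that is not in "line:column" form (an XPath, for instance) yields {UNKNOWN, UNKNOWN},
     * and the caller would then treat the value as a candidate for the "xpath" attribute.
     */
    static int[] parseLineColumn(final String location) {
        if (location != null) {
            final Matcher matcher = LINE_COLUMN.matcher(location.trim());
            if (matcher.matches()) {
                return new int[] {
                    Integer.parseInt(matcher.group(1)),
                    Integer.parseInt(matcher.group(2)),
                };
            }
        }
        return new int[] {UNKNOWN, UNKNOWN};
    }
}
```

Keeping the three values in separate attributes removes the need for this kind of guessing on the consumer side, which appears to be the point of the log message.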
From: <vic...@us...> - 2023-10-12 10:46:04

Revision: 2749
          http://sourceforge.net/p/axsl/code/2749
Author:   victormote
Date:     2023-10-12 10:45:59 +0000 (Thu, 12 Oct 2023)
Log Message:
-----------
Turn word-placeholder attributes into elements.

Modified Paths:
--------------
    trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-dictionary.dtd

Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-dictionary.dtd
===================================================================
--- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-dictionary.dtd 2023-10-06 11:50:20 UTC (rev 2748)
+++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-dictionary.dtd 2023-10-12 10:45:59 UTC (rev 2749)
@@ -251,11 +251,29 @@
 dictionary exists, but should be found in the dictionary for a different
 country/region.
 -->
-<!ELEMENT word-placeholder (t) >
-<!ATTLIST word-placeholder
-    reason (country-specific | different-country) #REQUIRED
+<!ELEMENT word-placeholder (t, (country-specific | different-country)) >
+
+
+<!--
+A possible reason for a word-placeholder.
+1. country The specific countr(ies) to which this spelling belong(s).
+-->
+<!ELEMENT country-specific EMPTY>
+<!ATTLIST country-specific
+    country CDATA #REQUIRED
 >
 
+
+<!--
+A possible reason for a word-placeholder.
+1. country The specific countr(ies) to which this spelling belong(s).
+-->
+<!ELEMENT different-country EMPTY>
+<!ATTLIST different-country
+    country CDATA #REQUIRED
+>
+
+
 <!ELEMENT comment (#PCDATA) >
 
 <!-- Last Line of DTD -->

From: <vic...@us...> - 2023-10-06 11:50:23

Revision: 2748
          http://sourceforge.net/p/axsl/code/2748
Author:   victormote
Date:     2023-10-06 11:50:20 +0000 (Fri, 06 Oct 2023)
Log Message:
-----------
Add word-placeholder element for axsl-dictionary.

Modified Paths:
--------------
    trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-dictionary.dtd

Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-dictionary.dtd
===================================================================
--- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-dictionary.dtd 2023-10-04 19:30:31 UTC (rev 2747)
+++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-dictionary.dtd 2023-10-06 11:50:20 UTC (rev 2748)
@@ -110,7 +110,8 @@
 This is suitable as a root element for files that need to handle only one
 orthography.
 -->
-<!ELEMENT axsl-dictionary (import-dictionary*, (w | word-group | phrase)*)>
+<!ELEMENT axsl-dictionary (import-dictionary*,
+    (w | word-group | phrase | word-placeholder)*)>
 <!--
 1. id: Used to allow one dictionary to point to another. It is an error for more
 than one dictionary document to have the same id, although that must
@@ -238,6 +239,23 @@
     referenced-word CDATA #REQUIRED
 >
 
+
+<!--
+Sits where a word might sit, but takes that word's place to signal that the word
+should not be entered here.
+1. reason Provides the reason for the placeholder. The possible values are:
+    a. country-specific Can be specified in a country-agnostic dictionary to
+signal that this word should be found in a country-specific dictionary.
+    b. different-country Can be specified in a country-specific dictionary to
+signal that this word should not be used in the country/region for which the
+dictionary exists, but should be found in the dictionary for a different
+country/region.
+-->
+<!ELEMENT word-placeholder (t) >
+<!ATTLIST word-placeholder
+    reason (country-specific | different-country) #REQUIRED
+>
+
 <!ELEMENT comment (#PCDATA) >
 
 <!-- Last Line of DTD -->

From: <vic...@us...> - 2023-10-04 19:30:34

Revision: 2747
          http://sourceforge.net/p/axsl/code/2747
Author:   victormote
Date:     2023-10-04 19:30:31 +0000 (Wed, 04 Oct 2023)
Log Message:
-----------
Minor doc cleanup.

Modified Paths:
--------------
    trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java
    trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java

Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java
===================================================================
--- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java 2023-10-03 16:54:52 UTC (rev 2746)
+++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java 2023-10-04 19:30:31 UTC (rev 2747)
@@ -29,10 +29,7 @@
 import java.util.List;
 
 /**
- * <p>A collection of natural-language words, possibly useful for hyphenation, spell-checking, etc.
- * This interface is part of the "optional" package because it is quite possible to provide orthography information
- * without any implementations of it.
- * It is true that some interfaces include Dictionary in their APIs, but they are in all cases nullable.</p>
+ * <p>A collection of natural-language words, possibly useful for hyphenation, spell-checking, etc.</p>
  */
 public interface Dictionary extends Serializable {
 

Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java
===================================================================
--- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java 2023-10-03 16:54:52 UTC (rev 2746)
+++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java 2023-10-04 19:30:31 UTC (rev 2747)
@@ -28,9 +28,7 @@
 import java.util.Iterator;
 
 /**
- * <p>Implementations know how to break a character sequence into words and interword content.
- * This interface is part of the "optional" package because it is quite possible to provide orthography information
- * without any implementations of this interface.</p>
+ * <p>Implementations know how to break a character sequence into words and interword content.</p>
  *
  * <p>The {@link Lexer} begins processing in an empty and unlocked state.
  * Client code adds content using {@link #addUntokenized(CharSequence, WritingSystem)} and

From: <vic...@us...> - 2023-10-03 16:54:55

Revision: 2746
          http://sourceforge.net/p/axsl/code/2746
Author:   victormote
Date:     2023-10-03 16:54:52 +0000 (Tue, 03 Oct 2023)
Log Message:
-----------
Remove passage of ad-hoc dictionaries when searching for words.

Modified Paths:
--------------
    trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java
    trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java
    trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java

Modified: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java
===================================================================
--- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java 2023-10-03 11:30:36 UTC (rev 2745)
+++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java 2023-10-03 16:54:52 UTC (rev 2746)
@@ -23,13 +23,10 @@
 
 package org.axsl.fotree.text;
 
-import org.axsl.orthography.Dictionary;
 import org.axsl.orthography.Orthography;
 import org.axsl.orthography.OrthographyException;
 import org.axsl.orthography.Word;
 
-import java.util.List;
-
 /**
  * Extension of {@link Orthography} specific to FO tree.
  */
@@ -36,8 +33,7 @@
 public interface FoOrthography extends Orthography {
 
     @Override
-    FoWord recognizeWord(CharSequence wordChars, int offset, int length, Word.PartOfSpeech pos,
-            List<Dictionary> adhocDictionaries);
+    FoWord recognizeWord(CharSequence wordChars, int offset, int length, Word.PartOfSpeech pos);
 
     @Override
     FoWord hyphenateUnrecognizedWord(CharSequence wordChars, int offset, int length);

Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java
===================================================================
--- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java 2023-10-03 11:30:36 UTC (rev 2745)
+++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java 2023-10-03 16:54:52 UTC (rev 2746)
@@ -56,7 +56,7 @@
      * @param index The index into the (conceptual) array of alternatives for {@code wordChars}.
      * If the word is in this dictionary at all, setting this to zero should always return something.
      * @return The word matching the parameters, or null if none matches.
-     * @see Orthography#recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech, List) which can
+     * @see Orthography#recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech) which can
      * also consider other dictionaries as well as derivative forms.
      */
     Word getWord(CharSequence wordChars, int index);

Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java
===================================================================
--- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java 2023-10-03 11:30:36 UTC (rev 2745)
+++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java 2023-10-03 16:54:52 UTC (rev 2746)
@@ -23,8 +23,6 @@
 
 package org.axsl.orthography;
 
-import java.util.List;
-
 /**
  * Characteristics and methods for a writing system or group of related writing systems.
  */
@@ -39,42 +37,32 @@
      * @param length The number of chars in {@code wordContent} that describe the word content.
      * @param pos The part of speech for the word that should be returned.
      * This can be null, implying that a word with no part of speech or any part of speech can be returned.
-     * @param adhocDictionaries Optional list of dictionaries that are independent of this orthography, but which should
-     * be considered when looking for a word.
-     * For example, if a document that is being spell-checked has its own document-specific dictionary, containing words
-     * that are not likely to be found in any standard dictionary, that dictionary could be included here.
      * @return The word matching the parameters, or null if none is found.
      * @see Dictionary#getWord(CharSequence, int) which retrieves a word directly from a dictionary without
      * consideration for other dictionaries or finding derivatives.
      * All handling of derivatives should be handled by implementations of {@link Orthography}.
      */
-    Word recognizeWord(CharSequence wordChars, int offset, int length, Word.PartOfSpeech pos,
-            List<Dictionary> adhocDictionaries);
+    Word recognizeWord(CharSequence wordChars, int offset, int length, Word.PartOfSpeech pos);
 
     /**
-     * Indicates whether {@link #recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech, List)}
+     * Indicates whether {@link #recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech)}
      * would return a word if it were called.
      * For applications such as spell-checkers that are simply validating words, and do not need hyphenation
      * information, this method may provide better performance than checking for a null return from
-     * {@link #recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech, List)}.
+     * {@link #recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech)}.
      * @param wordChars The chars whose word should be retrieved.
      * @param offset The index to the first character in {@code wordContent} that is part of the word content.
      * @param length The number of chars in {@code wordContent} that describe the word content.
      * @param pos The part of speech for the word that should be returned.
      * This can be null, implying that a word with no part of speech or any part of speech can be returned.
-     * @param adhocDictionaries Dictionaries that are bound to something other than the orthography, but which should be
-     * considered when looking for a word.
-     * For example, if a document that is being spell-checked has its own document-specific dictionary, containing words
-     * that are not likely to be found in any standard dictionary, that dictionary could be included here.
      * @return The word matching the parameters, or null if none is found.
      */
-    boolean isRecognizedWord(CharSequence wordChars, int offset, int length, Word.PartOfSpeech pos,
-            List<Dictionary> adhocDictionaries);
+    boolean isRecognizedWord(CharSequence wordChars, int offset, int length, Word.PartOfSpeech pos);
 
     /**
      * Using hyphenation patterns, if available, provides a hyphenated word for a sequence of characters that does not
      * represent a canonical word (i.e. not found by
-     * {@link #recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech, List)}).
+     * {@link #recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech)}).
      * In general, that method should be preferred over this one, as a dictionary along with orthographical rules should
      * provide better hyphenation results than patterns.
      * @param wordChars Together with {@code offset} and {@code length}, describes the sequence of chars for which a

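After this revision, `isRecognizedWord` takes only the character sequence, offset, length, and part of speech. The commit does not say where document-specific words are expected to go instead, so the sketch below is only one possible arrangement: it assumes the word list is handed over once, at construction time. `ToyOrthography` is a hypothetical class written for this example, it does not implement the real `Orthography` interface, and it omits the `Word.PartOfSpeech` parameter.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical toy lookup; it only mirrors the offset/length convention of the real method. */
final class ToyOrthography {

    /** Known words, expected in lower case. */
    private final Set<String> words;

    /** Assumption of this sketch: all word lists, including document-specific ones, are supplied up front. */
    ToyOrthography(final Set<String> words) {
        this.words = Set.copyOf(words);
    }

    /**
     * Indicates whether the chars in wordChars, starting at offset and running for length chars,
     * spell a known word. The comparison ignores case.
     */
    boolean isRecognizedWord(final CharSequence wordChars, final int offset, final int length) {
        final String candidate = wordChars.subSequence(offset, offset + length).toString();
        return this.words.contains(candidate.toLowerCase(Locale.ROOT));
    }
}
```

Passing an offset and length lets a caller test a word inside a larger buffer without first copying it out, although this toy version does make a copy internally for the lookup.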
From: <vic...@us...> - 2023-10-03 11:30:39

Revision: 2745
          http://sourceforge.net/p/axsl/code/2745
Author:   victormote
Date:     2023-10-03 11:30:36 +0000 (Tue, 03 Oct 2023)
Log Message:
-----------
Move more NaturalLanguage code to the attic.

Modified Paths:
--------------
    trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/00-axsl-catalog.xml
    trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd
    trunk/axsl/axsl-orthography/build.gradle

Removed Paths:
-------------
    trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-natural-language.dtd

Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/00-axsl-catalog.xml
===================================================================
--- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/00-axsl-catalog.xml 2023-10-02 22:55:04 UTC (rev 2744)
+++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/00-axsl-catalog.xml 2023-10-03 11:30:36 UTC (rev 2745)
@@ -50,15 +50,6 @@
         uri="./axsl-hyphenation.dtd"/>
 
 
-    <!-- AXSL natural language definition. -->
-    <public
-        publicId="-//aXSL//DTD Natural Language V0.1//EN"
-        uri="./axsl-natural-language.dtd"/>
-    <system
-        systemId="http://www.axsl.org/dtds/0.1/en/axsl-natural-language.dtd"
-        uri="./axsl-natural-language.dtd"/>
-
-
     <!-- AXSL dictionary. -->
     <public
         publicId="-//aXSL//DTD Dictionary V0.1//EN"

Deleted: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-natural-language.dtd
===================================================================
--- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-natural-language.dtd 2023-10-02 22:55:04 UTC (rev 2744)
+++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-natural-language.dtd 2023-10-03 11:30:36 UTC (rev 2745)
@@ -1,68 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-
-<!--
-Document Type Definition (DTD) for an XML document that describes various
-features of a natural language, that is, a language spoken and/or written by
-humans.
-
-The initial purpose of this DTD is to provide a way to document the valid
-letters in the language.
-
-Use the following public and system IDs for this DTD:
-<!DOCTYPE axsl-natural-language
-    PUBLIC "-//aXSL//DTD Natural Language V0.1//EN"
-    "http://www.axsl.org/dtds/0.1/en/axsl-natural-language.dtd">
--->
-
-<!ELEMENT axsl-natural-language (letter-range*, letter*)>
-<!--
-1. iso-639: The ISO-639 code for the language being defined.
--->
-<!ATTLIST axsl-natural-language
-    iso-639 CDATA #REQUIRED
->
-
-<!--
-A range of Unicode code points that are valid letters in this language. By
-"letters" is meant characters other than numbers, symbols, and punctuation marks
-that can properly be found in content for this language.
-
-Applications are expected to handle any canonical normalization of letters in
-this range.
--->
-<!ELEMENT letter-range EMPTY>
-
-<!--
-1. start: The Unicode code point marking the start of the range of valid
-letters. For example, to designate the character "a", use "U+0061".
-2. end: The Unicode code point marking the end of the range of valid letters.
-For example, to designate the character "z", use "U+007A".
--->
-<!ATTLIST letter-range
-    description CDATA #IMPLIED
-    start CDATA #REQUIRED
-    end CDATA #REQUIRED
->
-
-<!--
-A single user-oriented grapheme that is a valid letter in this language. By
-"grapheme" is meant not a byte, or a 16-bit character, or even a Unicode code
-point, but rather one or more Unicode code points that together describe a
-displayable/printable character. For example, an "e" with an acute accent can
-be encoded as either a single code point, U+00E9, or as a combination of the
-code points for "e", U+0065, followed by the code point for the "combining"
-acute accent, U+0301.
--->
-<!ELEMENT letter EMPTY>
-
-<!--
-1. value: The Unicode code point(s) defining the letter. If multiple code points
-comprise the letter, delimit each with a space. For example, to define an "e"
-with an acute accent, set "value" to "U+0065 U+0301".
--->
-<!ATTLIST letter
-    description CDATA #IMPLIED
-    value CDATA #REQUIRED
->
-
-<!-- Last Line of DTD -->

Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd
===================================================================
--- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd 2023-10-02 22:55:04 UTC (rev 2744)
+++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd 2023-10-03 11:30:36 UTC (rev 2745)
@@ -14,11 +14,6 @@
 However, this is not a requirement, and indeed, there would be no way to enforce
 it if it were. Therefore, consult your implementation documentation if in
 doubt.
-
-Use the following public and system IDs for this DTD:
-<!DOCTYPE axsl-natural-language
-    PUBLIC "-//aXSL//DTD Orthography Configuration V0.1//EN"
-    "http://www.axsl.org/dtds/0.1/en/axsl-orthography-config.dtd">
 -->
 
 <!-- Incorporate the axsl-parts-of-speech.dtd into this DTD. -->

Modified: trunk/axsl/axsl-orthography/build.gradle
===================================================================
--- trunk/axsl/axsl-orthography/build.gradle 2023-10-02 22:55:04 UTC (rev 2744)
+++ trunk/axsl/axsl-orthography/build.gradle 2023-10-03 11:30:36 UTC (rev 2745)
@@ -21,7 +21,6 @@
     include 'axsl-area-tree.dtd'
     include 'axsl-dictionary.dtd'
     include 'axsl-hyphenation.dtd'
-    include 'axsl-natural-language.dtd'
     include 'axsl-orthography-config.dtd'
     include 'axsl-parts-of-speech.dtd'
     into '/resources/org/axsl/dtds/'

From: <vic...@us...> - 2023-10-02 22:55:07
Revision: 2744 http://sourceforge.net/p/axsl/code/2744 Author: victormote Date: 2023-10-02 22:55:04 +0000 (Mon, 02 Oct 2023) Log Message: ----------- Move the text token flow interfaces from axsl-orthography to axsl-fotree. Modified Paths: -------------- trunk/axsl/axsl-areatree/src/main/java/org/axsl/area/factory/LineContentFactory.java trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoPunctuation.java trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoTextToken.java trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoTextTokenFlow.java trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoWhitespace.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Word.java Added Paths: ----------- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/TextTokenFlowLocation.java Removed Paths: ------------- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Punctuation.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextToken.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlow.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlowLocation.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Whitespace.java Modified: trunk/axsl/axsl-areatree/src/main/java/org/axsl/area/factory/LineContentFactory.java =================================================================== --- trunk/axsl/axsl-areatree/src/main/java/org/axsl/area/factory/LineContentFactory.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-areatree/src/main/java/org/axsl/area/factory/LineContentFactory.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -41,8 +41,8 @@ import org.axsl.fotree.fo.PageNumberCitation; import org.axsl.fotree.fo.PageNumberCitationLast; import org.axsl.fotree.fo.ScalingValueCitation; +import org.axsl.fotree.text.TextTokenFlowLocation; import 
org.axsl.galley.GlyphAreaSequenceG5; -import org.axsl.orthography.TextTokenFlowLocation; import java.math.BigDecimal; Modified: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoPunctuation.java =================================================================== --- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoPunctuation.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoPunctuation.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -24,11 +24,10 @@ package org.axsl.fotree.text; import org.axsl.kp.KpBox; -import org.axsl.orthography.Punctuation; /** - * Extension of {@link Punctuation} that provides additional methods for FO tree text. + * Punctuation. */ -public interface FoPunctuation extends Punctuation, KpBox, FoTextToken { +public interface FoPunctuation extends KpBox, FoTextToken { } Modified: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoTextToken.java =================================================================== --- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoTextToken.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoTextToken.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -24,13 +24,13 @@ package org.axsl.fotree.text; import org.axsl.kp.KpNode; -import org.axsl.orthography.TextToken; import org.axsl.value.group.TextModifiers; /** - * Extension of {@link TextToken} that provides additional methods for FO tree text. + * <p>The high-level content of a {@link FoTextTokenFlow}, i.e. {@link FoWord}s and interword content + * ({@link FoWhitespace} and {@link FoPunctuation}). */ -public interface FoTextToken extends TextToken, KpNode { +public interface FoTextToken extends CharSequence, KpNode { /** * Returns the number of chars in this token, after applying text modifiers. 
Modified: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoTextTokenFlow.java =================================================================== --- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoTextTokenFlow.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoTextTokenFlow.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -24,21 +24,34 @@ package org.axsl.fotree.text; import org.axsl.kp.KpBranch; -import org.axsl.orthography.TextTokenFlow; -import org.axsl.orthography.TextTokenFlowLocation; import org.axsl.value.WhiteSpaceTreatment; import org.axsl.value.group.TextModifiers; /** - * Extension of {@link TextTokenFlow} specific to FO tree. + * <p>A sequence of {@link FoWord}s and interword content ({@link FoWhitespace} and {@link FoPunctuation}). + * This could contain a paragraph, a clause, or any other sequence of such content. + * In practice, implementations represent a chunk of text that should be included in a paragraph and possibly broken + * into lines. + * Therefore an instance would not ordinarily contain text from more than one paragraph.</p> */ -public interface FoTextTokenFlow extends TextTokenFlow<FoTextToken>, KpBranch { +public interface FoTextTokenFlow extends KpBranch { - @Override + /** + * Returns the number of tokens ({@link FoWord}, {@link FoWhitespace}, and {@link FoPunctuation}) for this token + * flow. + * @return The number of tokens. + */ + int qtyTokens(); + + /** + * Returns a given token ({@link FoWord}, {@link FoWhitespace}, or {@link FoPunctuation}) for this token flow. + * @param index The index to the token that is being queried. + * @return The token at {@code index}. + */ FoTextToken tokenAt(int index); /** - * Returns an extract of this {@link TextTokenFlow} as a {@link CharSequence}. + * Returns an extract of this {@link FoTextTokenFlow} as a {@link CharSequence}. 
* In practice, this is primarily used to obtain the portion of the token flow that should be included on a specific * line of output. * The extract requested is described as a set of bound markers, one for the start, inclusive, and one for the end, @@ -55,4 +68,14 @@ CharSequence extract(TextTokenFlowLocation startLocation, TextTokenFlowLocation endLocation, TextModifiers textModifiers, boolean isStartOfLine, boolean isEndOfLine); + /** + * Creates a location for a specified set of indexes. + * @param tokenIndex The index to the token of the location being marked, which should be in the range 0 through + * 65,535. + * @param segmentIndex The index to the token segment of the location being marked, which should be in the range 0 + * through 127. + * @return A new {@link TextTokenFlowLocation} instance storing the specified parameters. + */ + TextTokenFlowLocation markLocation(int tokenIndex, int segmentIndex); + } Modified: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoWhitespace.java =================================================================== --- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoWhitespace.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoWhitespace.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -24,11 +24,10 @@ package org.axsl.fotree.text; import org.axsl.kp.KpGlue; -import org.axsl.orthography.Whitespace; /** - * Extension of {@link Whitespace} that provides additional methods for FO tree text. + * Whitespace. 
*/ -public interface FoWhitespace extends Whitespace, KpGlue, FoTextToken { +public interface FoWhitespace extends KpGlue, FoTextToken { } Copied: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/TextTokenFlowLocation.java (from rev 2741, trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlowLocation.java) =================================================================== --- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/TextTokenFlowLocation.java (rev 0) +++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/TextTokenFlowLocation.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -0,0 +1,84 @@ +/* + * Copyright 2022 The aXSL Project. + * http://www.axsl.org + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/* + * $LastChangedRevision$ + * $LastChangedDate$ + * $LastChangedBy$ + */ + +package org.axsl.fotree.text; + +import org.axsl.orthography.WordSegment; + +/** + * <p>Description of a location in a {@link FoTextTokenFlow}. + * The ability to describe a location in a {@link FoTextTokenFlow} is useful for applications that wish to subdivide it, + * for example, to place part of it one line of output and other parts on other lines. + * A {@link FoTextTokenFlow} is a shallow tree whose main branches are {@link FoTextToken} instances. + * Within those tokens, {@link FoWord} instances are also branches having child {@link WordSegment} instances. 
+ * By allowing specification of both of these items, we can precisely describe a location within the + * {@link FoTextTokenFlow}.</p> + * + * <p>Although it is also true that these {@link WordSegment} instances can also be thought of as branches having + * {@link Character#TYPE} children, the {@link WordSegment} is atomic with regard to a location within a + * {@link FoTextTokenFlow} instance. + * In other words, by definition, a {@link WordSegment} is indivisible. + * Therefore there is no need nor benefit to storing an index into the {@link WordSegment}.</p> + * + * <p>Design Note: It may seem wasteful to create a type for a struct-like value type instead of using primitives. + * However, we thought that result less wrong than managing a portion of a {@link FoTextTokenFlow} with four index + * parameters instead of two locations. + * The extra memory consumption resulting from the use of objects is partially mitigated by the fact that the one object + * frequently will serve as the ending location for one extract as well as the beginning location for the next. + * The decision was also influenced by the probability that Java will eventually support value types (aka "inline types" + * in Project Valhalla), which will mitigate this, so coding to this interface now will hopefully not need to be changed + * when that support is added.</p> + * + * @see <a href="https://en.wikipedia.org/wiki/Project_Valhalla_(Java_language)">Java Project Valhalla</a> + */ +public interface TextTokenFlowLocation extends Comparable<TextTokenFlowLocation> { + + /** String format possibly useful for {@link Object#toString()}. */ + String TO_STRING_FORMAT = "[%d, %d]"; + + /** The factor applied to token index differences to make scale larger than segment index differences. */ + int TOKEN_INDEX_FACTOR = 1000; + + /** + * Returns the token index of the location. + * @return The token index. + * This is always non-negative. 
+ */ + int getTokenIndex(); + + /** + * Returns the segment index of the location. + * @return The index to the first segment in {@link #getTokenIndex()} for this location. + * This is always non-negative. + */ + byte getSegmentIndex(); + + @Override + default int compareTo(final TextTokenFlowLocation o) { + int returnValue = 0; + returnValue += (getTokenIndex() - o.getTokenIndex()) * TOKEN_INDEX_FACTOR; + returnValue += getSegmentIndex() - o.getSegmentIndex(); + return returnValue; + } + +} Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -115,13 +115,6 @@ /** * <p>One token resulting from the tokenization of this Lexer.</p> - * - * <p>Design note: It is tempting to try to use the existing subinterfaces of {@link org.axsl.orthography.TextToken} - * to combine the {@link #getText()} and {@link #getTokenType()} methods in this type. - * However, the purpose of those interfaces is higher-level than we want here, and using them would be klunky and - * confusing. - * The purpose here is to identify chunks of text that can later be converted into or replaced by those higher-level - * concepts.</p> */ interface Token { Deleted: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Punctuation.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Punctuation.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Punctuation.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -1,31 +0,0 @@ -/* - * Copyright 2016 The aXSL Project. 
- * http://www.axsl.org - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/* - * $LastChangedRevision$ - * $LastChangedDate$ - * $LastChangedBy$ - */ - -package org.axsl.orthography; - -/** - * Punctuation. - */ -public interface Punctuation extends TextToken { - -} Deleted: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextToken.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextToken.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextToken.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -1,32 +0,0 @@ -/* - * Copyright 2021 The aXSL Project. - * http://www.axsl.org - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -/* - * $LastChangedRevision$ - * $LastChangedDate$ - * $LastChangedBy$ - */ - -package org.axsl.orthography; - -/** - * <p>The high-level content of a {@link TextTokenFlow}, i.e. {@link Word}s and interword content ({@link Whitespace} - * and {@link Punctuation}). - */ -public interface TextToken extends CharSequence { - -} Deleted: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlow.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlow.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlow.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -1,60 +0,0 @@ -/* - * Copyright 2021 The aXSL Project. - * http://www.axsl.org - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/* - * $LastChangedRevision$ - * $LastChangedDate$ - * $LastChangedBy$ - */ - -package org.axsl.orthography; - -/** - * <p>A sequence of {@link Word}s and interword content ({@link Whitespace} and {@link Punctuation}). - * This could contain a paragraph, a clause, or any other sequence of such content. - * In practice, implementations represent a chunk of text that should be included in a paragraph and possibly broken - * into lines. 
- * Therefore an instance would not ordinarily contain text from more than one paragraph.</p> - * - * @param <T> The subtype of {@link TextToken} supported by the implementation. - */ -public interface TextTokenFlow<T extends TextToken> { - - /** - * Returns the number of tokens ({@link Word}, {@link Whitespace}, and {@link Punctuation}) for this token flow. - * @return The number of tokens. - */ - int qtyTokens(); - - /** - * Returns a given token ({@link Word}, {@link Whitespace}, or {@link Punctuation}) for this token flow. - * @param index The index to the token that is being queried. - * @return The token at {@code index}. - */ - T tokenAt(int index); - - /** - * Creates a location for a specified set of indexes. - * @param tokenIndex The index to the token of the location being marked, which should be in the range 0 through - * 65,535. - * @param segmentIndex The index to the token segment of the location being marked, which should be in the range 0 - * through 127. - * @return A new {@link TextTokenFlowLocation} instance storing the specified parameters. - */ - TextTokenFlowLocation markLocation(int tokenIndex, int segmentIndex); - -} Deleted: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlowLocation.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlowLocation.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/TextTokenFlowLocation.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -1,82 +0,0 @@ -/* - * Copyright 2022 The aXSL Project. - * http://www.axsl.org - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. 
- * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/* - * $LastChangedRevision$ - * $LastChangedDate$ - * $LastChangedBy$ - */ - -package org.axsl.orthography; - -/** - * <p>Description of a location in a {@link TextTokenFlow}. - * The ability to describe a location in a {@link TextTokenFlow} is useful for applications that wish to subdivide it, - * for example, to place part of it one line of output and other parts on other lines. - * A {@link TextTokenFlow} is a shallow tree whose main branches are {@link TextToken} instances. - * Within those tokens, {@link Word} instances are also branches having child {@link WordSegment} instances. - * By allowing specification of both of these items, we can precisely describe a location within the - * {@link TextTokenFlow}.</p> - * - * <p>Although it is also true that these {@link WordSegment} instances can also be thought of as branches having - * {@link Character#TYPE} children, the {@link WordSegment} is atomic with regard to a location within a - * {@link TextTokenFlow} instance. - * In other words, by definition, a {@link WordSegment} is indivisible. - * Therefore there is no need nor benefit to storing an index into the {@link WordSegment}.</p> - * - * <p>Design Note: It may seem wasteful to create a type for a struct-like value type instead of using primitives. - * However, we thought that result less wrong than managing a portion of a {@link TextTokenFlow} with four index - * parameters instead of two locations. 
- * The extra memory consumption resulting from the use of objects is partially mitigated by the fact that the one object - * frequently will serve as the ending location for one extract as well as the beginning location for the next. - * The decision was also influenced by the probability that Java will eventually support value types (aka "inline types" - * in Project Valhalla), which will mitigate this, so coding to this interface now will hopefully not need to be changed - * when that support is added.</p> - * - * @see <a href="https://en.wikipedia.org/wiki/Project_Valhalla_(Java_language)">Java Project Valhalla</a> - */ -public interface TextTokenFlowLocation extends Comparable<TextTokenFlowLocation> { - - /** String format possibly useful for {@link Object#toString()}. */ - String TO_STRING_FORMAT = "[%d, %d]"; - - /** The factor applied to token index differences to make scale larger than segment index differences. */ - int TOKEN_INDEX_FACTOR = 1000; - - /** - * Returns the token index of the location. - * @return The token index. - * This is always non-negative. - */ - int getTokenIndex(); - - /** - * Returns the segment index of the location. - * @return The index to the first segment in {@link #getTokenIndex()} for this location. - * This is always non-negative. 
- */ - byte getSegmentIndex(); - - @Override - default int compareTo(final TextTokenFlowLocation o) { - int returnValue = 0; - returnValue += (getTokenIndex() - o.getTokenIndex()) * TOKEN_INDEX_FACTOR; - returnValue += getSegmentIndex() - o.getSegmentIndex(); - return returnValue; - } - -} Deleted: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Whitespace.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Whitespace.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Whitespace.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -1,31 +0,0 @@ -/* - * Copyright 2016 The aXSL Project. - * http://www.axsl.org - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/* - * $LastChangedRevision$ - * $LastChangedDate$ - * $LastChangedBy$ - */ - -package org.axsl.orthography; - -/** - * Whitespace. 
 - */ -public interface Whitespace extends TextToken { - -} Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Word.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Word.java 2023-10-02 20:42:03 UTC (rev 2743) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Word.java 2023-10-02 22:55:04 UTC (rev 2744) @@ -64,7 +64,7 @@ * <p>Design Note: Since hyphenation data wants to be static, effort has been made to allow implementations as much * flexibility as possible in how they store and use that data.</p> */ -public interface Word extends TextToken { +public interface Word extends CharSequence { /** * Enumeration of valid parts of speech (sometimes called "word classes") used in natural languages, with one
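Note on rev 2744: the default compareTo in TextTokenFlowLocation orders locations by token index first and segment index second, by weighting the token-index difference with TOKEN_INDEX_FACTOR (1000). Since segment indexes are documented as 0 through 127, a segment difference can never outweigh a token difference. The sketch below is only an illustration of that ordering. The interface is a trimmed copy of the one in the diff, and SimpleLocation and LocationSketch are hypothetical names that are not part of aXSL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LocationSketch {

    /** Trimmed copy of the interface in the diff: only what compareTo needs. */
    public interface TextTokenFlowLocation extends Comparable<TextTokenFlowLocation> {

        int TOKEN_INDEX_FACTOR = 1000;

        int getTokenIndex();

        byte getSegmentIndex();

        @Override
        default int compareTo(final TextTokenFlowLocation o) {
            int returnValue = 0;
            returnValue += (getTokenIndex() - o.getTokenIndex()) * TOKEN_INDEX_FACTOR;
            returnValue += getSegmentIndex() - o.getSegmentIndex();
            return returnValue;
        }
    }

    /** Hypothetical immutable implementation, for illustration only. */
    public static final class SimpleLocation implements TextTokenFlowLocation {

        private final int tokenIndex;
        private final byte segmentIndex;

        public SimpleLocation(final int tokenIndex, final int segmentIndex) {
            this.tokenIndex = tokenIndex;
            this.segmentIndex = (byte) segmentIndex;
        }

        @Override
        public int getTokenIndex() {
            return tokenIndex;
        }

        @Override
        public byte getSegmentIndex() {
            return segmentIndex;
        }

        @Override
        public String toString() {
            // Same pattern as TO_STRING_FORMAT in the diff.
            return String.format("[%d, %d]", tokenIndex, segmentIndex);
        }
    }

    /** Returns a sorted copy, relying only on the default compareTo. */
    public static List<TextTokenFlowLocation> sorted(final List<? extends TextTokenFlowLocation> in) {
        final List<TextTokenFlowLocation> copy = new ArrayList<>(in);
        Collections.sort(copy);
        return copy;
    }

    public static void main(final String[] args) {
        final List<SimpleLocation> unsorted = Arrays.asList(
                new SimpleLocation(3, 0), new SimpleLocation(2, 5), new SimpleLocation(2, 1));
        System.out.println(sorted(unsorted)); // prints [[2, 1], [2, 5], [3, 0]]
    }
}
```

Because the comparison is a weighted difference rather than a two-step compare, it stays cheap, but it does rely on the documented ranges (token index up to 65,535, segment index up to 127) to avoid overflow and to keep the token index dominant.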
From: <vic...@us...> - 2023-10-02 20:42:06
Revision: 2743 http://sourceforge.net/p/axsl/code/2743 Author: victormote Date: 2023-10-02 20:42:03 +0000 (Mon, 02 Oct 2023) Log Message: ----------- Consolidate tokenization concepts into the Lexer. Modified Paths: -------------- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/OrthographyServer.java Modified: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java =================================================================== --- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java 2023-10-02 20:02:33 UTC (rev 2742) +++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java 2023-10-02 20:42:03 UTC (rev 2743) @@ -42,7 +42,14 @@ @Override FoWord hyphenateUnrecognizedWord(CharSequence wordChars, int offset, int length); - @Override + /** + * Converts a sequence of characters into a flow of tokens (words, whitespace, and punctuation). + * @param wordSequenceChars The sequence of characters to be tokenized. + * @param offset The offset in {@code wordSequenceChars} to be tokenized. + * @param length The number of chars in {@code wordSequenceChars} to be tokenized. + * @return The flow of tokens that were tokenized. + * @throws OrthographyException For errors during tokenization. 
+ */ FoTextTokenFlow tokenize(CharSequence wordSequenceChars, int offset, int length) throws OrthographyException; } Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java 2023-10-02 20:02:33 UTC (rev 2742) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java 2023-10-02 20:42:03 UTC (rev 2743) @@ -86,18 +86,6 @@ Word hyphenateUnrecognizedWord(CharSequence wordChars, int offset, int length); /** - * Tokenizes a character sequence, presumably a paragraph-sized chunk, returning a collection of those tokens. - * @param wordSequenceChars The input being tokenized. - * @param offset The index into the first character in {@code wordSequenceChars} that is part of the content. - * @param length The number of chars in {@code wordSequenceChars} that describe the content. - * @return The collection of tokens. - * @throws OrthographyException For errors during tokenization. Throwing a checked exception here allows the client - * to capture more information about the nature of the problem. For example, a SAX parser might be able to indicate - * where in the parsed document a problem occurred if forced to catch and handle it. - */ - TextTokenFlow<?> tokenize(CharSequence wordSequenceChars, int offset, int length) throws OrthographyException; - - /** * Indicates the validity of breaking a line in the middle of a word without hyphenation. * The CJKV languages allow this, since the ideographs are themselves distinct words or word-like concepts. 
 * @return True if and only if it is legal in this orthography to break a line in the middle of a word without Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/OrthographyServer.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/OrthographyServer.java 2023-10-02 20:02:33 UTC (rev 2742) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/OrthographyServer.java 2023-10-02 20:42:03 UTC (rev 2743) @@ -44,4 +44,10 @@ */ Dictionary getDictionary(String dictionaryId); + /** + * Returns the lexer. + * @return The lexer. + */ + Lexer getLexer(); + }
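Note on rev 2743: FoOrthography.tokenize returns a flow of tokens (words, whitespace, and punctuation) that a client reads back through the qtyTokens() and tokenAt(int) accessors. A minimal sketch of walking such a flow follows. TokenFlow here is a reduced, hypothetical stand-in for FoTextTokenFlow, and the list-backed factory exists only for the example; a real flow would come from an orthography implementation.

```java
import java.util.Arrays;
import java.util.List;

public class TokenFlowSketch {

    /** Reduced stand-in for FoTextTokenFlow: only the two accessors used here. */
    public interface TokenFlow {

        /** Returns the number of tokens in this flow. */
        int qtyTokens();

        /** Returns the token at the given index. */
        CharSequence tokenAt(int index);
    }

    /** Hypothetical list-backed flow, for illustration only. */
    public static TokenFlow of(final String... tokens) {
        final List<String> list = Arrays.asList(tokens);
        return new TokenFlow() {
            @Override
            public int qtyTokens() {
                return list.size();
            }

            @Override
            public CharSequence tokenAt(final int index) {
                return list.get(index);
            }
        };
    }

    /** Concatenates every token in order, reconstructing the original text. */
    public static String join(final TokenFlow flow) {
        final StringBuilder builder = new StringBuilder();
        for (int i = 0; i < flow.qtyTokens(); i++) {
            builder.append(flow.tokenAt(i));
        }
        return builder.toString();
    }

    public static void main(final String[] args) {
        final TokenFlow flow = of("Hello", ",", " ", "world");
        System.out.println(flow.qtyTokens()); // prints 4
        System.out.println(join(flow));       // prints Hello, world
    }
}
```

The index-based accessors, rather than an Iterator, fit the way locations are expressed elsewhere in the API as a token index plus a segment index.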
From: <vic...@us...> - 2023-10-02 20:02:35
Revision: 2742 http://sourceforge.net/p/axsl/code/2742 Author: victormote Date: 2023-10-02 20:02:33 +0000 (Mon, 02 Oct 2023) Log Message: ----------- Remove org.axsl.orthography.optional package, moving its interfaces up a level. Modified Paths: -------------- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoDictionary.java trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/OrthographyServer.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Word.java Added Paths: ----------- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java Removed Paths: ------------- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/ Modified: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoDictionary.java =================================================================== --- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoDictionary.java 2023-09-28 12:06:33 UTC (rev 2741) +++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoDictionary.java 2023-10-02 20:02:33 UTC (rev 2742) @@ -23,7 +23,7 @@ package org.axsl.fotree.text; -import org.axsl.orthography.optional.Dictionary; +import org.axsl.orthography.Dictionary; /** * Extension of {@link Dictionary} specific to FO tree. 
Modified: trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java =================================================================== --- trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java 2023-09-28 12:06:33 UTC (rev 2741) +++ trunk/axsl/axsl-fotree/src/main/java/org/axsl/fotree/text/FoOrthography.java 2023-10-02 20:02:33 UTC (rev 2742) @@ -23,10 +23,10 @@ package org.axsl.fotree.text; +import org.axsl.orthography.Dictionary; import org.axsl.orthography.Orthography; import org.axsl.orthography.OrthographyException; import org.axsl.orthography.Word; -import org.axsl.orthography.optional.Dictionary; import java.util.List; Copied: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java (from rev 2723, trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Dictionary.java) =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java (rev 0) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Dictionary.java 2023-10-02 20:02:33 UTC (rev 2742) @@ -0,0 +1,104 @@ +/* + * Copyright 2021 The aXSL Project. + * http://www.axsl.org + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +/* + * $LastChangedRevision$ + * $LastChangedDate$ + * $LastChangedBy$ + */ + +package org.axsl.orthography; + +import org.axsl.i18n.WritingSystem; + +import java.io.Serializable; +import java.util.List; + +/** + * <p>A collection of natural-language words, possibly useful for hyphenation, spell-checking, etc. + * This interface is part of the "optional" package because it is quite possible to provide orthography information + * without any implementations of it. + * It is true that some interfaces include Dictionary in their APIs, but they are in all cases nullable.</p> + */ +public interface Dictionary extends Serializable { + + /** + * Returns the writing system supported by this dictionary. + * @return The writing system supported by this dictionary. + * This should never be null. + */ + WritingSystem getWritingSystem(); + + /** + * Returns the number of alternative ways a given word appears in this dictionary. + * @param wordChars The chars whose word is being queried. + * @return The number of alternatives for this word in this dictionary. + */ + int qtyAlternatives(CharSequence wordChars); + + /** + * Retrieves a word from this dictionary based on an index into its alternatives. + * @param wordChars The chars whose word should be retrieved. + * @param index The index into the (conceptual) array of alternatives for {@code wordChars}. + * If the word is in this dictionary at all, setting this to zero should always return something. + * @return The word matching the parameters, or null if none matches. + * @see Orthography#recognizeWord(CharSequence, int, int, org.axsl.orthography.Word.PartOfSpeech, List) which can + * also consider other dictionaries as well as derivative forms. + */ + Word getWord(CharSequence wordChars, int index); + + /** + * Indicates whether this dictionary stores information about a specific part-of-speech qualifier. + * Raw dictionary data (e.g. XML files) may contain more data than the dictionary needs to support. 
+ * For example, a dictionary that is used only for spell-checking may not care to store information about number or + * gender, but does care about convertibility to other forms. + * @param pos The part of speech being tested. + * Since a word can be of more than one part of speech, and since some qualifiers apply to more than one part of + * speech, this must be provided to disambiguate which qualifier is being tested for. + * @param qualifier The qualifier being tested. + * @return True if and only if this dictionary supports providing information about this qualifier for entries in + * it. + * Note that, even if this method returns false, the dictionary may still be able to provide the same information + * for words derived from words in this dictionary. + * Note also that, even if this method returns true, this means only that the dictionary is <em>able</em> to store + * this data, not that the data has actually been entered into the dictionary. + * @throws NullPointerException If either {@code pos} or {@code qualifier} is null. + * @throws IllegalArgumentException If {@code qualifier} is not applicable to {@code pos}. + */ + boolean supportsQualifiedType(Word.PartOfSpeech pos, Word.PosQualifier qualifier); + + /** + * Indicates whether a given sequence of characters should be excluded as a word for this orthography. + * This is useful for cases where one dictionary can override another. + * For example, the American English word "honor" is spelled "honour" in British English. + * If an American English dictionary uses a British English dictionary as its base, but overrides the spelling of + * this one word, then "honor" should be treated as correct, and "honour" should be treated as incorrect. + * Processors should check to see if a given sequence of characters is excluded before checking with the overridden + * dictionary for validity. + * @param wordChars The characters being tested. 
+ * @return True if and only if {@code wordChars} is necessarily eliminated as a valid word in this orthography. + */ + boolean isExcludedWord(CharSequence wordChars); + + /** + * Returns the list of imported dictionary IDs. + * @return The list of imported dictionary IDs for this dictionary. + * This should never be null, but can be empty. + */ + List<String> getImportedDictionaries(); + +} Copied: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java (from rev 2741, trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java) =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java (rev 0) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Lexer.java 2023-10-02 20:02:33 UTC (rev 2742) @@ -0,0 +1,287 @@ +/* + * Copyright 2021 The aXSL Project. + * http://www.axsl.org + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/* + * $LastChangedRevision$ + * $LastChangedDate$ + * $LastChangedBy$ + */ + +package org.axsl.orthography; + +import org.axsl.i18n.WritingSystem; + +import java.util.Iterator; + +/** + * <p>Implementations know how to break a character sequence into words and interword content. 
+ * This interface is part of the "optional" package because it is quite possible to provide orthography information + * without any implementations of this interface.</p> + * + * <p>The {@link Lexer} begins processing in an empty and unlocked state. + * Client code adds content using {@link #addUntokenized(CharSequence, WritingSystem)} and + * {@link #addWordToken(CharSequence, WritingSystem)} as needed until some logical break point. + * When all content has been added, client code calls the {@link #lock()} method, which puts the Lexer into the locked + * state. + * The results of the tokenization can then be iterated by client code using the {@link #hasNext()} and {@link #next()} + * methods. + * When all results have been iterated, the client code calls {@link #clear()} to remove all content from the Lexer + * and reset it to the unlocked state, so that the sequence can be repeated.</p> + * + * <p>The break point mentioned above to trigger processing should be some point that has an unambiguous "end." + * In other words, there should be no content after it that could affect the results of the content before it. + * A good candidate for this break point is the end of a sentence or paragraph.</p> + * + * <p>General text patterns can of course be used during the tokenization process, but the inclusion of + * {@link TokenType#AMBIGUOUS_LEADING_PUNCTUATION} and {@link TokenType#AMBIGUOUS_TRAILING_PUNCTUATION} is intended to + * prevent the need to consult a dictionary. + * Downstream processes may need to consult a dictionary to resolve whether such items are part of a word or are + * interword content.</p> + */ +public interface Lexer extends Iterator<Lexer.Token> { + + /** + * <p>Enumeration of valid token types that can be returned by this Lexer.</p> + */ + enum TokenType { + + /** Token is a word. */ + WORD, + + /** + * <p>Token is a break between words, which is usually whitespace, but in some cases could be a specialized + * punctuation mark. 
+ * An example of such punctuation occurs in the English sentence "The latitude/longitude of Sydney is + * -33.865143, 151.209900." + * The slash or solidus between "latitude" and "longitude" marks the end of one word and the beginning of + * another, for both spell-checking and line-breaking purposes.</p> + * + * <p>While whitespace between words is generally thought of as being "glue," able to expand and contract in + * size to make the layout of a line visually appealing, please note that this is not necessarily true for all + * tokens returned with a type of {@link TokenType#BREAK}. + * In addition to the slash or solidus discussed above, some Unicode space characters have fixed sizes by + * design.</p> + */ + BREAK, + + /** Token is inter-word punctuation immediately before a word. For line-breaking purposes, it is attached to + * that word, but for word recognition (such as spell-checking), it is not. */ + LEADING_PUNCTUATION, + + /** Token is inter-word punctuation immediately after a word. For line-breaking purposes, it is attached to + * that word, but for word recognition (such as spell-checking), it is not. */ + TRAILING_PUNCTUATION, + + /** Serves a similar purpose as {@link #AMBIGUOUS_TRAILING_PUNCTUATION}, but for leading punctuation. + * We are not aware of any examples where such a marking is needed, but allow for that possibility. */ + AMBIGUOUS_LEADING_PUNCTUATION, + + /** + * <p>Token is punctuation that may be either 1) part of a word, or 2) interword content, and the determination + * of which can only be done by consulting a higher-order resource, such as a dictionary. + * A lexer cannot and should not resolve the ambiguity, but should report it so that higher-order resources + * can resolve it. + * For example, many Latin-script languages use the period (.) character to signify both 1) a full-stop, ending + * a sentence, and 2) an abbreviation. + * The former is inter-word word punctuation, the latter is part of the word. 
+ * Ambiguity arises because both of these uses come at the end of a sequence of word characters. + * Consider the Latin expression "id est," which is abbreviated {@code i.e.} and which abbreviation would never + * occur at the end of a sentence. + * The first period can easily be recognized by the lexer as part of the word since it is surrounded immediately + * by word characters. + * The second is ambiguous. + * Whether it is part of the word "i.e." or it is a full stop after the word "i.e" cannot be resolved without + * consulting a dictionary of some sort. + * By marking it as ambiguous, downstream processes can resolve the ambiguity by consulting such a + * dictionary.</p> + */ + AMBIGUOUS_TRAILING_PUNCTUATION; + + } + + /** + * <p>One token resulting from the tokenization of this Lexer.</p> + * + * <p>Design note: It is tempting to try to use the existing subinterfaces of {@link org.axsl.orthography.TextToken} + * to combine the {@link #getText()} and {@link #getTokenType()} methods in this type. + * However, the purpose of those interfaces is higher-level than we want here, and using them would be klunky and + * confusing. + * The purpose here is to identify chunks of text that can later be converted into or replaced by those higher-level + * concepts.</p> + */ + interface Token { + + /** + * Returns the text of the token. + * @return The text of the token. + */ + CharSequence getText(); + + /** + * Returns the type of the token. + * @return The type of the token. + */ + TokenType getTokenType(); + + /** + * Returns the writing system of the token. + * @return The writing system of the token. + */ + WritingSystem getWritingSystem(); + + /** + * Returns an immutable copy of this token. + * Tokens are transient and mutable, and in most cases this doesn't matter, as the token will be looked at and + * immediatly discarded. + * For cases where a longer-lived version of the token is helpful (such as for testing), use this method to + * create an immutable copy. 
+ * @return An immutable copy of this token. If the token is already immutable, {@code this} may be returned. + * Caveat: There is no way to force {@link #getWritingSystem()} to return an immutable value. + * Be sure to use an immutable implementation if that is important. + */ + default Token getImmutableCopy() { + return new ImmutableToken(getText(), getTokenType(), getWritingSystem()); + } + + } + + /** + * Implementation of {@link Token} that is immutable. + * Caveat: There is no way to force {@link #getWritingSystem()} to return an immutable value. + * Be sure to use an immutable implementation if that is important. + */ + class ImmutableToken implements Token { + + /** The text of the token. */ + private String text; + + /** The type of the token. */ + private TokenType type; + + /** The writing system of the token. */ + private WritingSystem writingSystem; + + /** + * Constructor. + * @param text The text of the token. + * @param type The type of the token. + * @param writingSystem The writing system of the token. + */ + ImmutableToken(final CharSequence text, final TokenType type, final WritingSystem writingSystem) { + this.text = text.toString(); + this.type = type; + this.writingSystem = writingSystem; + } + + @Override + public String getText() { + return this.text; + } + + @Override + public TokenType getTokenType() { + return this.type; + } + + @Override + public WritingSystem getWritingSystem() { + return this.writingSystem; + } + + @Override + public Token getImmutableCopy() { + return this; + } + + @Override + public String toString() { + final StringBuilder builder = new StringBuilder(); + builder.append("["); + builder.append(this.text); + builder.append("], "); + builder.append(this.type.toString()); + builder.append(", "); + builder.append(this.writingSystem.toString()); + return builder.toString(); + } + } + + /** + * Adds untokenized content. + * @param text The untokenized text to be added. 
+ * If the size is less than 1, this token will be ignored. + * @param writingSystem The writing system to be used to tokenize {@code text}. + * @throws IllegalStateException If the Lexer is in the "locked" state. + * @throws IllegalArgumentException If any parameter is null. + */ + void addUntokenized(CharSequence text, WritingSystem writingSystem); + + /** + * <p>Adds a token already known to be a word. + * This is mostly useful for open compound words whose components would not be recognized as words in a downstream + * dictionary. + * (An open compound word is one whose components are separated by a space character). + * The open compound word "Colorado River" has two components, each of which is likely to be recognized as a word + * in an English dictionary. + * However, the open compound word "São Paulo" also has two components, neither of which is likely to be recognized + * as a word in an English dictionary, but, considered as a whole, is likely to be found, it being a major city + * in Brazil. + * Some applications may wish to look for such open compound words downstream of the lexer, but this introduces + * the complexities of 1) how many words should be considered, 2) what directions (forward and/or backward) should + * be considered, and 3) what triggers such a search. + * If the application already knows the word boundaries, that complexity is avoided by using this method.</p> + * + * <p>For content that has no pre-processed tokens, add all content using + * {@link #addUntokenized(CharSequence, WritingSystem)} instead.</p> + * @param text The text of the word token to be added. + * If the size is less than 1, this token will be ignored. + * @param writingSystem The writing system of {@code text}. + * Since this item has already been tokenized, the writing system is not needed for that purpose, but is retained + * for downstream processes, which may need it for other purposes, such as dictionary lookup. 
+ * @throws IllegalStateException If the Lexer is in the "locked" state. + * @throws IllegalArgumentException If any parameter is null. + */ + void addWordToken(CharSequence text, WritingSystem writingSystem); + + /** + * Puts this Lexer into the "locked" state, preventing additional content from being added, and allowing the results + * to be iterated. + */ + void lock(); + + /** + * Clears the content of the lexer and unlocks it so that it can be reused. + */ + void clear(); + + /** + * Returns the next token reported by this lexer from the content added in + * {@link #addUntokenized(CharSequence, WritingSystem)} and + * {@link #addWordToken(CharSequence, WritingSystem)}. + * Caveat: The object returned by this method should be considered to be <em>transient and mutable,</em> and may be + * changed by this Lexer after processing resumes. + * Any values that need to be retained by client applications must be preserved by such applications immediately. + * See {@link Token#getImmutableCopy()} to obtain an immutable copy. + * Also, the {@link CharSequence} returned by {@link Token#getText()} may also be mutable, and, if its value needs + * to be retained by client applications, must be copied into different location. + * The easiest way to convert the value to an immutable object is {@link CharSequence#toString()}. + * @return The next token reported by this lexer. + * @throws IllegalStateException If the Lexer is <em>not</em> in the "locked" state. 
+ */ + Token next(); + +} Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java 2023-09-28 12:06:33 UTC (rev 2741) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Orthography.java 2023-10-02 20:02:33 UTC (rev 2742) @@ -23,8 +23,6 @@ package org.axsl.orthography; -import org.axsl.orthography.optional.Dictionary; - import java.util.List; /** @@ -70,7 +68,7 @@ * that are not likely to be found in any standard dictionary, that dictionary could be included here. * @return The word matching the parameters, or null if none is found. */ - boolean isRecognizedWord(CharSequence wordChars, int offset, int length, org.axsl.orthography.Word.PartOfSpeech pos, + boolean isRecognizedWord(CharSequence wordChars, int offset, int length, Word.PartOfSpeech pos, List<Dictionary> adhocDictionaries); /** Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/OrthographyServer.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/OrthographyServer.java 2023-09-28 12:06:33 UTC (rev 2741) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/OrthographyServer.java 2023-10-02 20:02:33 UTC (rev 2742) @@ -24,7 +24,6 @@ package org.axsl.orthography; import org.axsl.i18n.WritingSystem; -import org.axsl.orthography.optional.Dictionary; /** * Knows how to find orthography resources. 
Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Word.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Word.java 2023-09-28 12:06:33 UTC (rev 2741) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/Word.java 2023-10-02 20:02:33 UTC (rev 2742) @@ -23,7 +23,6 @@ package org.axsl.orthography; -import org.axsl.orthography.optional.Dictionary; import org.axsl.primitive.sequence.ByteSequence; import java.util.Arrays; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
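Editorial sketch: the Lexer Javadoc in rev 2742 describes a lifecycle of add content, lock, iterate, clear. The following is a minimal, self-contained illustration of that lifecycle only. It is not the aXSL implementation: the class name WhitespaceLexerSketch, the whitespace-only tokenization rule, the string-labelled tokens, and the omission of WritingSystem are all assumptions made to keep the example runnable without the aXSL jars.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a Lexer implementation; not aXSL code. */
class WhitespaceLexerSketch implements Iterator<String> {

    private final StringBuilder pending = new StringBuilder();
    private final List<String> tokens = new ArrayList<>();
    private boolean locked;
    private int cursor;

    /** Adds raw text. Null fails fast; empty text is simply ignored. */
    void addUntokenized(final CharSequence text) {
        if (text == null) {
            throw new IllegalArgumentException("text must not be null");
        }
        if (locked) {
            throw new IllegalStateException("Lexer is locked");
        }
        pending.append(text);
    }

    /** Tokenizes everything added so far and switches to the locked state. */
    void lock() {
        if (locked) {
            return;
        }
        int start = 0;
        while (start < pending.length()) {
            // Assumption: a token is a maximal run of whitespace or of non-whitespace.
            final boolean space = Character.isWhitespace(pending.charAt(start));
            int end = start + 1;
            while (end < pending.length() && Character.isWhitespace(pending.charAt(end)) == space) {
                end++;
            }
            tokens.add((space ? "BREAK" : "WORD") + "[" + pending.substring(start, end) + "]");
            start = end;
        }
        locked = true;
    }

    /** Removes all content and returns to the unlocked state. */
    void clear() {
        pending.setLength(0);
        tokens.clear();
        cursor = 0;
        locked = false;
    }

    @Override
    public boolean hasNext() {
        return locked && cursor < tokens.size();
    }

    @Override
    public String next() {
        if (!locked) {
            throw new IllegalStateException("Lexer is not locked");
        }
        if (cursor >= tokens.size()) {
            throw new NoSuchElementException();
        }
        return tokens.get(cursor++);
    }

    /** Runs one full add / lock / iterate / clear cycle over the given chunks. */
    static List<String> tokenize(final String... chunks) {
        final WhitespaceLexerSketch lexer = new WhitespaceLexerSketch();
        for (final String chunk : chunks) {
            lexer.addUntokenized(chunk);
        }
        lexer.lock();
        final List<String> result = new ArrayList<>();
        while (lexer.hasNext()) {
            result.add(lexer.next());
        }
        lexer.clear();
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(tokenize("id ", "est"));
    }
}
```

Content added in separate chunks is tokenized as one unit at lock(), which is why the Javadoc asks callers to lock only at an unambiguous break point such as the end of a sentence or paragraph.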
From: <vic...@us...> - 2023-09-28 12:06:35
Revision: 2741
          http://sourceforge.net/p/axsl/code/2741
Author:   victormote
Date:     2023-09-28 12:06:33 +0000 (Thu, 28 Sep 2023)
Log Message:
-----------
Remove exception for empty content. It should just be ignored. Add exception for null content or writing system for fail-fast.

Modified Paths:
--------------
    trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java

Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java
===================================================================
--- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-27 11:30:27 UTC (rev 2740)
+++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-28 12:06:33 UTC (rev 2741)
@@ -223,9 +223,10 @@
     /**
      * Adds untokenized content.
      * @param text The untokenized text to be added.
+     * If the size is less than 1, this token will be ignored.
      * @param writingSystem The writing system to be used to tokenize {@code text}.
      * @throws IllegalStateException If the Lexer is in the "locked" state.
-     * @throws IllegalArgumentException If the size of {@code text} is less than 1.
+     * @throws IllegalArgumentException If any parameter is null.
      */
     void addUntokenized(CharSequence text, WritingSystem writingSystem);
 
@@ -247,11 +248,12 @@
      * <p>For content that has no pre-processed tokens, add all content using
      * {@link #addUntokenized(CharSequence, WritingSystem)} instead.</p>
      * @param text The text of the word token to be added.
+     * If the size is less than 1, this token will be ignored.
      * @param writingSystem The writing system of {@code text}.
      * Since this item has already been tokenized, the writing system is not needed for that purpose, but is retained
      * for downstream processes, which may need it for other purposes, such as dictionary lookup.
      * @throws IllegalStateException If the Lexer is in the "locked" state.
-     * @throws IllegalArgumentException If the size of {@code text} is less than 1.
+     * @throws IllegalArgumentException If any parameter is null.
      */
     void addWordToken(CharSequence text, WritingSystem writingSystem);
 
From: <vic...@us...> - 2023-09-27 11:30:29
Revision: 2740
          http://sourceforge.net/p/axsl/code/2740
Author:   victormote
Date:     2023-09-27 11:30:27 +0000 (Wed, 27 Sep 2023)
Log Message:
-----------
1. Allow implementations to reject empty content.
2. Improve some variable names.

Modified Paths:
--------------
    trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java

Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java
===================================================================
--- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-24 22:50:00 UTC (rev 2739)
+++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-27 11:30:27 UTC (rev 2740)
@@ -221,12 +221,13 @@
     }
 
     /**
-     * Adds a sequence of untokenized content.
-     * @param sequence The untokenized sequence to be added.
-     * @param writingSystem The writing system to be used to tokenize {@code sequence}.
+     * Adds untokenized content.
+     * @param text The untokenized text to be added.
+     * @param writingSystem The writing system to be used to tokenize {@code text}.
      * @throws IllegalStateException If the Lexer is in the "locked" state.
+     * @throws IllegalArgumentException If the size of {@code text} is less than 1.
      */
-    void addUntokenized(CharSequence sequence, WritingSystem writingSystem);
+    void addUntokenized(CharSequence text, WritingSystem writingSystem);
 
     /**
      * <p>Adds a token already known to be a word.
@@ -245,13 +246,14 @@
      *
      * <p>For content that has no pre-processed tokens, add all content using
      * {@link #addUntokenized(CharSequence, WritingSystem)} instead.</p>
-     * @param sequence The word token to be added.
-     * @param writingSystem The writing system of {@code sequence}.
+     * @param text The text of the word token to be added.
+     * @param writingSystem The writing system of {@code text}.
      * Since this item has already been tokenized, the writing system is not needed for that purpose, but is retained
      * for downstream processes, which may need it for other purposes, such as dictionary lookup.
      * @throws IllegalStateException If the Lexer is in the "locked" state.
+     * @throws IllegalArgumentException If the size of {@code text} is less than 1.
      */
-    void addWordToken(CharSequence sequence, WritingSystem writingSystem);
+    void addWordToken(CharSequence text, WritingSystem writingSystem);
 
     /**
      * Puts this Lexer into the "locked" state, preventing additional content from being added, and allowing the results
From: <vic...@us...> - 2023-09-24 22:50:02
Revision: 2739
          http://sourceforge.net/p/axsl/code/2739
Author:   victormote
Date:     2023-09-24 22:50:00 +0000 (Sun, 24 Sep 2023)
Log Message:
-----------
Rename token type WHITESPACE to BREAK, and add documentation about why.

Modified Paths:
--------------
    trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java

Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java
===================================================================
--- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-24 13:23:47 UTC (rev 2738)
+++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-24 22:50:00 UTC (rev 2739)
@@ -62,8 +62,21 @@
         /** Token is a word. */
         WORD,
 
-        /** Token is inter-word whitespace. */
-        WHITESPACE,
+        /**
+         * <p>Token is a break between words, which is usually whitespace, but in some cases could be a specialized
+         * punctuation mark.
+         * An example of such punctuation occurs in the English sentence "The latitude/longitude of Sydney is
+         * -33.865143, 151.209900."
+         * The slash or solidus between "latitude" and "longitude" marks the end of one word and the beginning of
+         * another, for both spell-checking and line-breaking purposes.</p>
+         *
+         * <p>While whitespace between words is generally thought of as being "glue," able to expand and contract in
+         * size to make the layout of a line visually appealing, please note that this is not necessarily true for all
+         * tokens returned with a type of {@link TokenType#BREAK}.
+         * In addition to the slash or solidus discussed above, some Unicode space characters have fixed sizes by
+         * design.</p>
+         */
+        BREAK,
 
         /** Token is inter-word punctuation immediately before a word. For line-breaking purposes, it is attached to
          * that word, but for word recognition (such as spell-checking), it is not. */
@@ -78,8 +91,8 @@
         AMBIGUOUS_LEADING_PUNCTUATION,
 
         /**
-         * Token is punctuation that may be either 1) part of a word, or 2) interword content, and the determination of
-         * which can only be done by consulting a higher-order resource, such as a dictionary.
+         * <p>Token is punctuation that may be either 1) part of a word, or 2) interword content, and the determination
+         * of which can only be done by consulting a higher-order resource, such as a dictionary.
          * A lexer cannot and should not resolve the ambiguity, but should report it so that higher-order resources
          * can resolve it.
          * For example, many Latin-script languages use the period (.) character to signify both 1) a full-stop, ending
@@ -93,7 +106,8 @@
          * The second is ambiguous.
          * Whether it is part of the word "i.e." or it is a full stop after the word "i.e" cannot be resolved without
          * consulting a dictionary of some sort.
-         * By marking it as ambiguous, downstream processes can resolve the ambiguity by consulting such a dictionary.
+         * By marking it as ambiguous, downstream processes can resolve the ambiguity by consulting such a
+         * dictionary.</p>
          */
         AMBIGUOUS_TRAILING_PUNCTUATION;
 
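Editorial sketch: the BREAK documentation in rev 2739 says a break between words is usually whitespace but can be a punctuation mark such as the solidus in "latitude/longitude". The example below shows one way a tokenizer might apply that idea. The class BreakTokenSketch and the rule that every solidus is a break are assumptions for illustration; a real lexer would need more context (for example, dates or URLs), and this is not the aXSL implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; the names and rules here are assumptions, not aXSL code. */
class BreakTokenSketch {

    /** Assumption: whitespace and the solidus are the only break characters. */
    static boolean isBreak(final char c) {
        return Character.isWhitespace(c) || c == '/';
    }

    /** Splits text into alternating WORD and BREAK runs, labelled for display. */
    static List<String> split(final String text) {
        final List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            final boolean breakRun = isBreak(text.charAt(start));
            int end = start + 1;
            // Extend the run while the characters stay in the same class.
            while (end < text.length() && isBreak(text.charAt(end)) == breakRun) {
                end++;
            }
            tokens.add((breakRun ? "BREAK:" : "WORD:") + text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(final String[] args) {
        // The solidus ends one word and begins another, just as a space would.
        System.out.println(split("latitude/longitude of Sydney"));
    }
}
```

Because the solidus is reported as a BREAK rather than as part of a word, "latitude" and "longitude" can each be spell-checked on their own, which is the motivation given in the commit.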
From: <vic...@us...> - 2023-09-24 13:23:49
Revision: 2738 http://sourceforge.net/p/axsl/code/2738 Author: victormote Date: 2023-09-24 13:23:47 +0000 (Sun, 24 Sep 2023) Log Message: ----------- Remove explicit-token concept. This is now considered to be out-of-scope for a lexer, but should be handled in a dictionary. Modified Paths: -------------- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd =================================================================== --- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd 2023-09-23 15:08:23 UTC (rev 2737) +++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd 2023-09-24 13:23:47 UTC (rev 2738) @@ -27,38 +27,12 @@ "./axsl-parts-of-speech.dtd"> %aXSL-Parts-of-Speech-DTD; -<!ELEMENT axsl-orthography-config (explicit-token-list*, match-rule-list*, +<!ELEMENT axsl-orthography-config (match-rule-list*, derivative-pattern-list*, derivative-factory-list*, dictionary-resource*, hyphenation-patterns-resource*, orthography+)> <!-- -Collection of words, usually abbreviations, that should be recognized by the -lexer as word tokens, to disambiguate them from the end of a sentence. -For example, a lexer might interpret "i.e." as two tokens: "i.e" and "." where -the second is punctuation marking the end of a sentence. -By adding an entry to this list, it will instead be considered as possibly being -a single token "i.e.". -The necessity for this is driven by the fact that the full-stop "." character is -overloaded to mean both 1) a marker for abbreviation and 2) the end of a -sentence. -Note that entries in this list are used to distinguish tokens only, and they -will presumably need to be added to a dictionary as well so that they are -recognized and handled as valid words. 
---> -<!ELEMENT explicit-token-list (explicit-token*) > -<!ATTLIST explicit-token-list - id ID #REQUIRED -> - - -<!ELEMENT explicit-token (#PCDATA)> -<!ATTLIST explicit-token - end-of-sentence (never | possible) #REQUIRED -> - - -<!-- Collection of regex match patterns which, when matched, signify the input as a valid "word" even though not found in any dictionary. --> @@ -168,12 +142,6 @@ > -<!ELEMENT explicit-tokens EMPTY> -<!ATTLIST explicit-tokens - reference IDREF #REQUIRED -> - - <!ELEMENT match-rules EMPTY> <!ATTLIST match-rules reference IDREF #REQUIRED @@ -245,7 +213,7 @@ <!ELEMENT unparsed-hyphenation-patterns (resource-location)> -<!ELEMENT orthography (explicit-tokens*, match-rules*, +<!ELEMENT orthography (match-rules*, derivative-rules?, dictionary?, hyphenation-patterns?, derivative-factories?) > This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <vic...@us...> - 2023-09-23 15:08:25
Revision: 2737 http://sourceforge.net/p/axsl/code/2737 Author: victormote Date: 2023-09-23 15:08:23 +0000 (Sat, 23 Sep 2023) Log Message: ----------- 1. Add token type for ambiguous leading punctuation. 2. Remove numeric indexes from token types. Modified Paths: -------------- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-22 23:09:08 UTC (rev 2736) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-23 15:08:23 UTC (rev 2737) @@ -47,52 +47,36 @@ * A good candidate for this break point is the end of a sentence or paragraph.</p> * * <p>General text patterns can of course be used during the tokenization process, but the inclusion of - * {@link TokenType#AMBIGUOUS_PUNCTUATION} is intended to prevent the need to consult a dictionary. - * Downstream processes may need to consult a dictionary to resolve whether such an item is part of a word or is + * {@link TokenType#AMBIGUOUS_LEADING_PUNCTUATION} and {@link TokenType#AMBIGUOUS_TRAILING_PUNCTUATION} is intended to + * prevent the need to consult a dictionary. + * Downstream processes may need to consult a dictionary to resolve whether such items are part of a word or are * interword content.</p> */ public interface Lexer extends Iterator<Lexer.Token> { - /** The index of {@link TokenType#WORD}. */ - byte WORD_TOKEN_INDEX = 0; - - /** The index of {@link TokenType#WHITESPACE}. */ - byte WHITESPACE_TOKEN_INDEX = 1; - - /** The index of {@link TokenType#LEADING_PUNCTUATION}. */ - byte LEADING_PUNCTUATION_TOKEN_INDEX = 2; - - /** The index of {@link TokenType#TRAILING_PUNCTUATION}. */ - byte TRAILING_PUNCTUATION_TOKEN_INDEX = 3; - - /** The index of {@link TokenType#AMBIGUOUS_PUNCTUATION}. 
*/ - byte AMBIGUOUS_PUNCTUATION_TOKEN_INDEX = 4; - - /** * <p>Enumeration of valid token types that can be returned by this Lexer.</p> - * - * <p>Each enumerated item is attached to a numeric index, which can be used to conveniently convert to and from - * that index. - * This gives implementations the ability to pseudo-extend this enumeration during internal processing by adding - * additional indexes for that purpose.</p> */ enum TokenType { /** Token is a word. */ - WORD(WORD_TOKEN_INDEX), + WORD, /** Token is inter-word whitespace. */ - WHITESPACE(WHITESPACE_TOKEN_INDEX), + WHITESPACE, /** Token is inter-word punctuation immediately before a word. For line-breaking purposes, it is attached to * that word, but for word recognition (such as spell-checking), it is not. */ - LEADING_PUNCTUATION(LEADING_PUNCTUATION_TOKEN_INDEX), + LEADING_PUNCTUATION, /** Token is inter-word punctuation immediately after a word. For line-breaking purposes, it is attached to * that word, but for word recognition (such as spell-checking), it is not. */ - TRAILING_PUNCTUATION(TRAILING_PUNCTUATION_TOKEN_INDEX), + TRAILING_PUNCTUATION, + /** Serves a similar purpose as {@link #AMBIGUOUS_TRAILING_PUNCTUATION}, but for leading punctuation. + * We are not aware of any examples where such a marking is needed, but allow for that possibility. */ + AMBIGUOUS_LEADING_PUNCTUATION, + /** * Token is punctuation that may be either 1) part of a word, or 2) interword content, and the determination of * which can only be done by consulting a higher-order resource, such as a dictionary. @@ -111,52 +95,8 @@ * consulting a dictionary of some sort. * By marking it as ambiguous, downstream processes can resolve the ambiguity by consulting such a dictionary. */ - AMBIGUOUS_PUNCTUATION(AMBIGUOUS_PUNCTUATION_TOKEN_INDEX); + AMBIGUOUS_TRAILING_PUNCTUATION; - /** Array mapping the token types to their indexes. 
*/ - private static final TokenType[] INDEX_ARRAY = - {WORD, WHITESPACE, LEADING_PUNCTUATION, TRAILING_PUNCTUATION, AMBIGUOUS_PUNCTUATION}; - - /** The numeric index of the item. */ - private byte index; - - /** - * Constructor. - * @param index The numeric index of the token type. - */ - TokenType(final byte index) { - this.index = index; - } - - /** - * Returns the numeric index for this token type. - * @return The numeric index. - */ - public byte getIndex() { - return this.index; - } - - /** - * Returns the token type for a given numeric index. - * @param index The index for which a token type is sought. - * @return The token type matching {@code index}, or null if the index does not exist. - */ - public static TokenType fromIndex(final int index) { - if (index < 0 - || index >= INDEX_ARRAY.length) { - return null; - } - return INDEX_ARRAY[index]; - } - - /** - * Returns the number of items in this enumeration. - * @return The number of items in this enumeration. - */ - public static int getCount() { - return INDEX_ARRAY.length; - } - } /**
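Rev 2737 drops the hand-maintained numeric indexes from TokenType. As a hedged illustration of why they may not be missed, the sketch below uses only what every Java enum already provides (values() and java.util.EnumMap) to count constants and key per-type data. The class TokenTypeTallySketch and its nested enum are hypothetical stand-ins that copy the constant names from the diff; they are not the org.axsl types, and nothing here is claimed to be how aXSL implementations use the enum.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch; not part of aXSL. */
final class TokenTypeTallySketch {

    /** Local copy of the constant names as of rev 2737 (assumption: names only, no behavior). */
    enum TokenType {
        WORD,
        WHITESPACE,
        LEADING_PUNCTUATION,
        TRAILING_PUNCTUATION,
        AMBIGUOUS_LEADING_PUNCTUATION,
        AMBIGUOUS_TRAILING_PUNCTUATION
    }

    private TokenTypeTallySketch() { }

    /** Counts how often each token type occurs, keyed by the enum constant itself. */
    static Map<TokenType, Integer> tally(final TokenType... types) {
        final Map<TokenType, Integer> counts = new EnumMap<>(TokenType.class);
        for (final TokenType type : types) {
            counts.merge(type, 1, Integer::sum);
        }
        return counts;
    }

    /** Plays the role of the removed getCount(): the number of declared constants. */
    static int count() {
        return TokenType.values().length;
    }

    public static void main(final String[] args) {
        System.out.println(tally(TokenType.WORD, TokenType.WHITESPACE, TokenType.WORD));
        System.out.println(count());
    }
}
```

An EnumMap is backed by an array indexed by ordinal, so it gives the compactness of the old index array without a second list of constants to keep in sync.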
From: <vic...@us...> - 2023-09-22 23:09:10
|
Revision: 2736 http://sourceforge.net/p/axsl/code/2736 Author: victormote Date: 2023-09-22 23:09:08 +0000 (Fri, 22 Sep 2023) Log Message: ----------- Add ability to convert Tokens into an immutable form. Modified Paths: -------------- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-22 16:16:39 UTC (rev 2735) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-22 23:09:08 UTC (rev 2736) @@ -188,9 +188,85 @@ * @return The writing system of the token. */ WritingSystem getWritingSystem(); + + /** + * Returns an immutable copy of this token. + * Tokens are transient and mutable, and in most cases this doesn't matter, as the token will be looked at and + * immediatly discarded. + * For cases where a longer-lived version of the token is helpful (such as for testing), use this method to + * create an immutable copy. + * @return An immutable copy of this token. If the token is already immutable, {@code this} may be returned. + * Caveat: There is no way to force {@link #getWritingSystem()} to return an immutable value. + * Be sure to use an immutable implementation if that is important. + */ + default Token getImmutableCopy() { + return new ImmutableToken(getText(), getTokenType(), getWritingSystem()); + } + } /** + * Implementation of {@link Token} that is immutable. + * Caveat: There is no way to force {@link #getWritingSystem()} to return an immutable value. + * Be sure to use an immutable implementation if that is important. + */ + class ImmutableToken implements Token { + + /** The text of the token. */ + private String text; + + /** The type of the token. */ + private TokenType type; + + /** The writing system of the token. 
*/ + private WritingSystem writingSystem; + + /** + * Constructor. + * @param text The text of the token. + * @param type The type of the token. + * @param writingSystem The writing system of the token. + */ + ImmutableToken(final CharSequence text, final TokenType type, final WritingSystem writingSystem) { + this.text = text.toString(); + this.type = type; + this.writingSystem = writingSystem; + } + + @Override + public String getText() { + return this.text; + } + + @Override + public TokenType getTokenType() { + return this.type; + } + + @Override + public WritingSystem getWritingSystem() { + return this.writingSystem; + } + + @Override + public Token getImmutableCopy() { + return this; + } + + @Override + public String toString() { + final StringBuilder builder = new StringBuilder(); + builder.append("["); + builder.append(this.text); + builder.append("], "); + builder.append(this.type.toString()); + builder.append(", "); + builder.append(this.writingSystem.toString()); + return builder.toString(); + } + } + + /** * Adds a sequence of untokenized content. * @param sequence The untokenized sequence to be added. * @param writingSystem The writing system to be used to tokenize {@code sequence}. @@ -238,9 +314,10 @@ * Returns the next token reported by this lexer from the content added in * {@link #addUntokenized(CharSequence, WritingSystem)} and * {@link #addWordToken(CharSequence, WritingSystem)}. - * Note that the object returned by this method is <em>mutable,</em> and may be changed by this Lexer after - * processing resumes. + * Caveat: The object returned by this method should be considered to be <em>transient and mutable,</em> and may be + * changed by this Lexer after processing resumes. * Any values that need to be retained by client applications must be preserved by such applications immediately. + * See {@link Token#getImmutableCopy()} to obtain an immutable copy. 
* Also, the {@link CharSequence} returned by {@link Token#getText()} may also be mutable, and, if its value needs * to be retained by client applications, must be copied into different location. * The easiest way to convert the value to an immutable object is {@link CharSequence#toString()}.
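The caveat in rev 2736 (tokens are transient and mutable, so retain a copy) can be illustrated with a small sketch. Everything below is a simplified, hypothetical stand-in: the Token interface carries only getText(), and ReusedToken imitates a lexer that recycles one token object between calls. The real Lexer.Token also exposes getTokenType() and getWritingSystem(), which are omitted here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of aXSL. */
final class ImmutableCopySketch {

    /** Simplified stand-in for Lexer.Token (assumption: text only). */
    interface Token {
        CharSequence getText();

        /** Freezes the current text into a String-backed token. */
        default Token getImmutableCopy() {
            final String frozen = getText().toString();
            return new Token() {
                @Override
                public CharSequence getText() {
                    return frozen;
                }

                @Override
                public Token getImmutableCopy() {
                    return this;
                }
            };
        }
    }

    /** A mutable token that is overwritten in place, as a lexer might do between next() calls. */
    static final class ReusedToken implements Token {
        private final StringBuilder text = new StringBuilder();

        void reset(final String newText) {
            this.text.setLength(0);
            this.text.append(newText);
        }

        @Override
        public CharSequence getText() {
            return this.text;
        }
    }

    private ImmutableCopySketch() { }

    /** Keeps references to the recycled token; every entry ends up showing the last text. */
    static List<String> retainByReference() {
        final ReusedToken token = new ReusedToken();
        final List<Token> kept = new ArrayList<>();
        for (final String s : new String[] {"id", "est"}) {
            token.reset(s);
            kept.add(token);
        }
        return texts(kept);
    }

    /** Keeps immutable copies; each entry preserves the text seen at the time of the copy. */
    static List<String> retainByCopy() {
        final ReusedToken token = new ReusedToken();
        final List<Token> kept = new ArrayList<>();
        for (final String s : new String[] {"id", "est"}) {
            token.reset(s);
            kept.add(token.getImmutableCopy());
        }
        return texts(kept);
    }

    private static List<String> texts(final List<Token> tokens) {
        final List<String> out = new ArrayList<>();
        for (final Token t : tokens) {
            out.add(t.getText().toString());
        }
        return out;
    }
}
```

The difference between the two methods is the whole point: retaining the recycled object loses the earlier text, while retaining the copy does not.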
From: <vic...@us...> - 2023-09-22 16:16:42
|
Revision: 2735 http://sourceforge.net/p/axsl/code/2735 Author: victormote Date: 2023-09-22 16:16:39 +0000 (Fri, 22 Sep 2023) Log Message: ----------- Add index to TokenType enum. Modified Paths: -------------- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-22 13:36:29 UTC (rev 2734) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-22 16:16:39 UTC (rev 2735) @@ -53,24 +53,45 @@ */ public interface Lexer extends Iterator<Lexer.Token> { + /** The index of {@link TokenType#WORD}. */ + byte WORD_TOKEN_INDEX = 0; + + /** The index of {@link TokenType#WHITESPACE}. */ + byte WHITESPACE_TOKEN_INDEX = 1; + + /** The index of {@link TokenType#LEADING_PUNCTUATION}. */ + byte LEADING_PUNCTUATION_TOKEN_INDEX = 2; + + /** The index of {@link TokenType#TRAILING_PUNCTUATION}. */ + byte TRAILING_PUNCTUATION_TOKEN_INDEX = 3; + + /** The index of {@link TokenType#AMBIGUOUS_PUNCTUATION}. */ + byte AMBIGUOUS_PUNCTUATION_TOKEN_INDEX = 4; + + /** - * Enumeration of valid token types that can be returned by this Lexer. + * <p>Enumeration of valid token types that can be returned by this Lexer.</p> + * + * <p>Each enumerated item is attached to a numeric index, which can be used to conveniently convert to and from + * that index. + * This gives implementations the ability to pseudo-extend this enumeration during internal processing by adding + * additional indexes for that purpose.</p> */ enum TokenType { /** Token is a word. */ - WORD, + WORD(WORD_TOKEN_INDEX), /** Token is inter-word whitespace. */ - WHITESPACE, + WHITESPACE(WHITESPACE_TOKEN_INDEX), /** Token is inter-word punctuation immediately before a word. 
For line-breaking purposes, it is attached to * that word, but for word recognition (such as spell-checking), it is not. */ - LEADING_PUNCTUATION, + LEADING_PUNCTUATION(LEADING_PUNCTUATION_TOKEN_INDEX), /** Token is inter-word punctuation immediately after a word. For line-breaking purposes, it is attached to * that word, but for word recognition (such as spell-checking), it is not. */ - TRAILING_PUNCTUATION, + TRAILING_PUNCTUATION(TRAILING_PUNCTUATION_TOKEN_INDEX), /** * Token is punctuation that may be either 1) part of a word, or 2) interword content, and the determination of @@ -81,8 +102,8 @@ * a sentence, and 2) an abbreviation. * The former is inter-word word punctuation, the latter is part of the word. * Ambiguity arises because both of these uses come at the end of a sequence of word characters. - * Consider the Latin expression "id est," which is abbreviated {@code i.e.} and which would never occur at the - * end of a sentence. + * Consider the Latin expression "id est," which is abbreviated {@code i.e.} and which abbreviation would never + * occur at the end of a sentence. * The first period can easily be recognized by the lexer as part of the word since it is surrounded immediately * by word characters. * The second is ambiguous. @@ -90,8 +111,52 @@ * consulting a dictionary of some sort. * By marking it as ambiguous, downstream processes can resolve the ambiguity by consulting such a dictionary. */ - AMBIGUOUS_PUNCTUATION + AMBIGUOUS_PUNCTUATION(AMBIGUOUS_PUNCTUATION_TOKEN_INDEX); + /** Array mapping the token types to their indexes. */ + private static final TokenType[] INDEX_ARRAY = + {WORD, WHITESPACE, LEADING_PUNCTUATION, TRAILING_PUNCTUATION, AMBIGUOUS_PUNCTUATION}; + + /** The numeric index of the item. */ + private byte index; + + /** + * Constructor. + * @param index The numeric index of the token type. + */ + TokenType(final byte index) { + this.index = index; + } + + /** + * Returns the numeric index for this token type. 
+ * @return The numeric index. + */ + public byte getIndex() { + return this.index; + } + + /** + * Returns the token type for a given numeric index. + * @param index The index for which a token type is sought. + * @return The token type matching {@code index}, or null if the index does not exist. + */ + public static TokenType fromIndex(final int index) { + if (index < 0 + || index >= INDEX_ARRAY.length) { + return null; + } + return INDEX_ARRAY[index]; + } + + /** + * Returns the number of items in this enumeration. + * @return The number of items in this enumeration. + */ + public static int getCount() { + return INDEX_ARRAY.length; + } + } /**
From: <vic...@us...> - 2023-09-22 13:36:32
|
Revision: 2734 http://sourceforge.net/p/axsl/code/2734 Author: victormote Date: 2023-09-22 13:36:29 +0000 (Fri, 22 Sep 2023) Log Message: ----------- Add more token type options. Modified Paths: -------------- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-21 12:38:38 UTC (rev 2733) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-22 13:36:29 UTC (rev 2734) @@ -45,6 +45,11 @@ * <p>The break point mentioned above to trigger processing should be some point that has an unambiguous "end." * In other words, there should be no content after it that could affect the results of the content before it. * A good candidate for this break point is the end of a sentence or paragraph.</p> + * + * <p>General text patterns can of course be used during the tokenization process, but the inclusion of + * {@link TokenType#AMBIGUOUS_PUNCTUATION} is intended to prevent the need to consult a dictionary. + * Downstream processes may need to consult a dictionary to resolve whether such an item is part of a word or is + * interword content.</p> */ public interface Lexer extends Iterator<Lexer.Token> { @@ -59,8 +64,34 @@ /** Token is inter-word whitespace. */ WHITESPACE, - /** Token is inter-word punctuation. */ - PUNCTUATION + /** Token is inter-word punctuation immediately before a word. For line-breaking purposes, it is attached to + * that word, but for word recognition (such as spell-checking), it is not. */ + LEADING_PUNCTUATION, + + /** Token is inter-word punctuation immediately after a word. For line-breaking purposes, it is attached to + * that word, but for word recognition (such as spell-checking), it is not. 
*/ + TRAILING_PUNCTUATION, + + /** + * Token is punctuation that may be either 1) part of a word, or 2) interword content, and the determination of + * which can only be done by consulting a higher-order resource, such as a dictionary. + * A lexer cannot and should not resolve the ambiguity, but should report it so that higher-order resources + * can resolve it. + * For example, many Latin-script languages use the period (.) character to signify both 1) a full-stop, ending + * a sentence, and 2) an abbreviation. + * The former is inter-word word punctuation, the latter is part of the word. + * Ambiguity arises because both of these uses come at the end of a sequence of word characters. + * Consider the Latin expression "id est," which is abbreviated {@code i.e.} and which would never occur at the + * end of a sentence. + * The first period can easily be recognized by the lexer as part of the word since it is surrounded immediately + * by word characters. + * The second is ambiguous. + * Whether it is part of the word "i.e." or it is a full stop after the word "i.e" cannot be resolved without + * consulting a dictionary of some sort. + * By marking it as ambiguous, downstream processes can resolve the ambiguity by consulting such a dictionary. + */ + AMBIGUOUS_PUNCTUATION + } /** @@ -103,9 +134,22 @@ void addUntokenized(CharSequence sequence, WritingSystem writingSystem); /** - * Adds a token already known to be a word. - * For content that has no pre-processed tokens, add all content using - * {@link #addUntokenized(CharSequence, WritingSystem)} instead. + * <p>Adds a token already known to be a word. + * This is mostly useful for open compound words whose components would not be recognized as words in a downstream + * dictionary. + * (An open compound word is one whose components are separated by a space character). + * The open compound word "Colorado River" has two components, each of which is likely to be recognized as a word + * in an English dictionary. 
+ * However, the open compound word "São Paulo" also has two components, neither of which is likely to be recognized + * as a word in an English dictionary, but, considered as a whole, is likely to be found, it being a major city + * in Brazil. + * Some applications may wish to look for such open compound words downstream of the lexer, but this introduces + * the complexities of 1) how many words should be considered, 2) what directions (forward and/or backward) should + * be considered, and 3) what triggers such a search. + * If the application already knows the word boundaries, that complexity is avoided by using this method.</p> + * + * <p>For content that has no pre-processed tokens, add all content using + * {@link #addUntokenized(CharSequence, WritingSystem)} instead.</p> * @param sequence The word token to be added. * @param writingSystem The writing system of {@code sequence}. * Since this item has already been tokenized, the writing system is not needed for that purpose, but is retained
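The Javadoc added in rev 2734 leaves the resolution of AMBIGUOUS_PUNCTUATION to a downstream process that can consult a dictionary. A minimal sketch of what such a step might look like follows. The names resolve, Resolution, and sampleDictionary are hypothetical, the dictionary is a plain Set of strings, and the rule (the punctuation belongs to the word if word plus punctuation is a dictionary entry) is an assumption made for illustration, not the aXSL algorithm.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; not part of aXSL. */
final class AmbiguousPunctuationSketch {

    /** The two outcomes named in the Javadoc: part of a word, or interword content. */
    enum Resolution { PART_OF_WORD, INTERWORD }

    private AmbiguousPunctuationSketch() { }

    /**
     * Decides whether an ambiguous trailing mark belongs to the preceding word.
     * Assumption: the mark is part of the word only if the combined text is a dictionary entry.
     */
    static Resolution resolve(final CharSequence word, final CharSequence punctuation,
            final Set<String> dictionary) {
        final String candidate = word.toString() + punctuation.toString();
        return dictionary.contains(candidate) ? Resolution.PART_OF_WORD : Resolution.INTERWORD;
    }

    /** A tiny illustrative dictionary of abbreviations. */
    static Set<String> sampleDictionary() {
        return new HashSet<>(Arrays.asList("i.e.", "e.g."));
    }

    public static void main(final String[] args) {
        final Set<String> dictionary = sampleDictionary();
        // The lexer has already kept the first period of "i.e." inside the word.
        System.out.println(resolve("i.e", ".", dictionary));
        System.out.println(resolve("river", ".", dictionary));
    }
}
```

A real resolver would also have to handle an abbreviation that happens to end a sentence, which this sketch does not attempt.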
From: <vic...@us...> - 2023-09-21 12:38:43
|
Revision: 2733 http://sourceforge.net/p/axsl/code/2733 Author: victormote Date: 2023-09-21 12:38:38 +0000 (Thu, 21 Sep 2023) Log Message: ----------- Add option for nouns to have a number of "number-any". Modified Paths: -------------- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-parts-of-speech.dtd Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-parts-of-speech.dtd =================================================================== --- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-parts-of-speech.dtd 2023-09-20 11:35:59 UTC (rev 2732) +++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-parts-of-speech.dtd 2023-09-21 12:38:38 UTC (rev 2733) @@ -23,7 +23,7 @@ A noun. --> <!ELEMENT noun ( - (singular | plural | pluralizable?)?, + (singular | plural | pluralizable | number-any)?, ((masculine?, feminine?, neuter?) | gender-any)?, convertible-to-possessive?
From: <vic...@us...> - 2023-09-20 11:36:01
|
Revision: 2732 http://sourceforge.net/p/axsl/code/2732 Author: victormote Date: 2023-09-20 11:35:59 +0000 (Wed, 20 Sep 2023) Log Message: ----------- Convert Lexer an Iterator. Track WritingMode for explicit words. Modified Paths: -------------- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java Modified: trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java =================================================================== --- trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-19 18:19:24 UTC (rev 2731) +++ trunk/axsl/axsl-orthography/src/main/java/org/axsl/orthography/optional/Lexer.java 2023-09-20 11:35:59 UTC (rev 2732) @@ -25,7 +25,7 @@ import org.axsl.i18n.WritingSystem; -import java.util.List; +import java.util.Iterator; /** * <p>Implementations know how to break a character sequence into words and interword content. @@ -32,25 +32,73 @@ * This interface is part of the "optional" package because it is quite possible to provide orthography information * without any implementations of this interface.</p> * - * <p>Implementations of this interface can be used in an axsl-orthography-config XML document "lexer" element, to - * specify for the orthography what class should be used to perform the lexing task. + * <p>The {@link Lexer} begins processing in an empty and unlocked state. + * Client code adds content using {@link #addUntokenized(CharSequence, WritingSystem)} and + * {@link #addWordToken(CharSequence, WritingSystem)} as needed until some logical break point. + * When all content has been added, client code calls the {@link #lock()} method, which puts the Lexer into the locked + * state. + * The results of the tokenization can then be iterated by client code using the {@link #hasNext()} and {@link #next()} + * methods. 
+ * When all results have been iterated, the client code calls {@link #clear()} to remove all content from the Lexer + * and reset it to the unlocked state, so that the sequence can be repeated.</p> * - * <p>The {@link Lexer} begins processing in an "empty" state. - * Content is then added to it using {@link #addUntokenized(CharSequence, WritingSystem)} and - * {@link #addWordToken(CharSequence)} as needed until some logical break point. - * The results of the tokenization are then returned by {@link #process()}, which also resets the Lexer to the empty - * state, and the cycle can be repeated. - * This design allows any pre-processing tokenization to be accepted as such.</p> - * * <p>The break point mentioned above to trigger processing should be some point that has an unambiguous "end." - * In other words, there should be no content after it that could affect the results of the content before it. </p> + * In other words, there should be no content after it that could affect the results of the content before it. + * A good candidate for this break point is the end of a sentence or paragraph.</p> */ -public interface Lexer { +public interface Lexer extends Iterator<Lexer.Token> { /** + * Enumeration of valid token types that can be returned by this Lexer. + */ + enum TokenType { + + /** Token is a word. */ + WORD, + + /** Token is inter-word whitespace. */ + WHITESPACE, + + /** Token is inter-word punctuation. */ + PUNCTUATION + } + + /** + * <p>One token resulting from the tokenization of this Lexer.</p> + * + * <p>Design note: It is tempting to try to use the existing subinterfaces of {@link org.axsl.orthography.TextToken} + * to combine the {@link #getText()} and {@link #getTokenType()} methods in this type. + * However, the purpose of those interfaces is higher-level than we want here, and using them would be klunky and + * confusing. 
+ * The purpose here is to identify chunks of text that can later be converted into or replaced by those higher-level + * concepts.</p> + */ + interface Token { + + /** + * Returns the text of the token. + * @return The text of the token. + */ + CharSequence getText(); + + /** + * Returns the type of the token. + * @return The type of the token. + */ + TokenType getTokenType(); + + /** + * Returns the writing system of the token. + * @return The writing system of the token. + */ + WritingSystem getWritingSystem(); + } + + /** * Adds a sequence of untokenized content. * @param sequence The untokenized sequence to be added. * @param writingSystem The writing system to be used to tokenize {@code sequence}. + * @throws IllegalStateException If the Lexer is in the "locked" state. */ void addUntokenized(CharSequence sequence, WritingSystem writingSystem); @@ -59,20 +107,37 @@ * For content that has no pre-processed tokens, add all content using * {@link #addUntokenized(CharSequence, WritingSystem)} instead. * @param sequence The word token to be added. + * @param writingSystem The writing system of {@code sequence}. + * Since this item has already been tokenized, the writing system is not needed for that purpose, but is retained + * for downstream processes, which may need it for other purposes, such as dictionary lookup. + * @throws IllegalStateException If the Lexer is in the "locked" state. */ - void addWordToken(CharSequence sequence); + void addWordToken(CharSequence sequence, WritingSystem writingSystem); /** - * Processes the content added in {@link #addUntokenized(CharSequence, WritingSystem)} and - * {@link #addWordToken(CharSequence)}, returns the word and interword content of that content, and resets the lexer - * to the "empty" state so that it can begin processing again. - * @return The list of word and interword content tokenized from - * {@link #addUntokenized(CharSequence, WritingSystem)} and {@link #addWordToken(CharSequence)}. 
- * Even-numbered elements in the list always contain a word, and odd-numbered indexes always contain interword - * content. - * In the case that the sequence actually starts with interword content (instead of the more normal case of starting - * with a word), the first element (at index 0) will be an empty sequence. + * Puts this Lexer into the "locked" state, preventing additional content from being added, and allowing the results + * to be iterated. */ - List<CharSequence> process(); + void lock(); + /** + * Clears the content of the lexer and unlocks it so that it can be reused. + */ + void clear(); + + /** + * Returns the next token reported by this lexer from the content added in + * {@link #addUntokenized(CharSequence, WritingSystem)} and + * {@link #addWordToken(CharSequence, WritingSystem)}. + * Note that the object returned by this method is <em>mutable,</em> and may be changed by this Lexer after + * processing resumes. + * Any values that need to be retained by client applications must be preserved by such applications immediately. + * Also, the {@link CharSequence} returned by {@link Token#getText()} may also be mutable, and, if its value needs + * to be retained by client applications, must be copied into different location. + * The easiest way to convert the value to an immutable object is {@link CharSequence#toString()}. + * @return The next token reported by this lexer. + * @throws IllegalStateException If the Lexer is <em>not</em> in the "locked" state. + */ + Token next(); + }
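The add / lock / iterate / clear cycle described in the rev 2732 Javadoc can be sketched as follows. The class below is a hypothetical toy, not the aXSL Lexer: it returns plain String tokens, splits only on the space character, and omits WritingSystem and TokenType entirely. Only the state handling is meant to mirror the documented contract (adding while locked and calling next() while unlocked both throw IllegalStateException). Returning false from hasNext() while unlocked is an assumption; the Javadoc does not say.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical toy lexer illustrating the locked/unlocked life cycle; not part of aXSL. */
final class LexerLifecycleSketch implements Iterator<String> {

    private final List<String> tokens = new ArrayList<>();
    private boolean locked = false;
    private int cursor = 0;

    /** Adds raw content, splitting on single spaces only (an illustrative simplification). */
    void addUntokenized(final CharSequence sequence) {
        requireUnlocked();
        final StringBuilder word = new StringBuilder();
        for (int i = 0; i < sequence.length(); i++) {
            final char c = sequence.charAt(i);
            if (c == ' ') {
                flush(word);
                this.tokens.add(" ");
            } else {
                word.append(c);
            }
        }
        flush(word);
    }

    /** Adds content already known to be one word, even if it contains a space. */
    void addWordToken(final CharSequence sequence) {
        requireUnlocked();
        this.tokens.add(sequence.toString());
    }

    /** Stops accepting content and allows iteration. */
    void lock() {
        this.locked = true;
    }

    /** Discards all content and returns to the unlocked state. */
    void clear() {
        this.tokens.clear();
        this.cursor = 0;
        this.locked = false;
    }

    @Override
    public boolean hasNext() {
        return this.locked && this.cursor < this.tokens.size();
    }

    @Override
    public String next() {
        if (!this.locked) {
            throw new IllegalStateException("Lexer is not locked.");
        }
        if (this.cursor >= this.tokens.size()) {
            throw new NoSuchElementException();
        }
        return this.tokens.get(this.cursor++);
    }

    private void requireUnlocked() {
        if (this.locked) {
            throw new IllegalStateException("Lexer is locked.");
        }
    }

    private void flush(final StringBuilder word) {
        if (word.length() > 0) {
            this.tokens.add(word.toString());
            word.setLength(0);
        }
    }

    /** Runs one full cycle and returns the tokens that were iterated. */
    static List<String> demo() {
        final LexerLifecycleSketch lexer = new LexerLifecycleSketch();
        lexer.addUntokenized("visit ");
        lexer.addWordToken("Colorado River");
        lexer.lock();
        final List<String> seen = new ArrayList<>();
        while (lexer.hasNext()) {
            seen.add(lexer.next());
        }
        lexer.clear();
        return seen;
    }
}
```

Because clear() returns the object to the unlocked state, one instance can be reused for each sentence or paragraph, which matches the break points the Javadoc recommends.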
From: <vic...@us...> - 2023-09-19 18:19:26
|
Revision: 2731 http://sourceforge.net/p/axsl/code/2731 Author: victormote Date: 2023-09-19 18:19:24 +0000 (Tue, 19 Sep 2023) Log Message: ----------- Add element "foreign". Remove ability of "text" elements to be nested. Modified Paths: -------------- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd =================================================================== --- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd 2023-09-19 16:53:40 UTC (rev 2730) +++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-spell-check-input.dtd 2023-09-19 18:19:24 UTC (rev 2731) @@ -41,7 +41,7 @@ original document. This could be the line/column number, "98:24" for example, or perhaps an XPath. --> -<!ELEMENT text (#PCDATA | text | word)*> +<!ELEMENT text (#PCDATA | word | foreign)*> <!ATTLIST text xml:lang CDATA #IMPLIED location CDATA #IMPLIED @@ -64,11 +64,27 @@ Attributes: 1. "xml:lang" is used to determine which dictionary(ies) should be used for the spell-checking. +2. "location" stores an optional clue about where the element was located in the + original document. This could be the line/column number, "98:24" for example, + or perhaps an XPath. --> <!ELEMENT word (#PCDATA)> <!ATTLIST word xml:lang CDATA #IMPLIED + location CDATA #IMPLIED > +<!-- +Marks a sequence of text as having a different writing system than the +surrounding text. +Such content does not mark the end of a processing segment, but only an +interruption in it. +--> +<!ELEMENT foreign (#PCDATA | word)* > +<!ATTLIST foreign + xml:lang CDATA #IMPLIED + location CDATA #IMPLIED +> + <!-- Last Line of DTD -->
From: <vic...@us...> - 2023-09-19 16:53:43
|
Revision: 2730 http://sourceforge.net/p/axsl/code/2730 Author: victormote Date: 2023-09-19 16:53:40 +0000 (Tue, 19 Sep 2023) Log Message: ----------- Remove lexer from the orthography configuration. Modified Paths: -------------- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd =================================================================== --- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd 2023-09-19 13:03:57 UTC (rev 2729) +++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd 2023-09-19 16:53:40 UTC (rev 2730) @@ -147,21 +147,7 @@ > -<!ELEMENT lexer EMPTY> <!-- -1. class: The fully-qualified class name of an implementation of - org.axsl.orthography.Lexer. - Such classes know how to break a string of text (a paragraph or block) into - words using the specific rules for a given orthography. - For example, English allows allows an apostrophe or closing single quotation - mark within a word to mark a contraction or possession. ---> -<!ATTLIST lexer - class CDATA #REQUIRED -> - - -<!-- Describes patterns in a resource file that should be excluded when building the resource. --> @@ -259,7 +245,7 @@ <!ELEMENT unparsed-hyphenation-patterns (resource-location)> -<!ELEMENT orthography (explicit-tokens*, lexer, match-rules*, +<!ELEMENT orthography (explicit-tokens*, match-rules*, derivative-rules?, dictionary?, hyphenation-patterns?, derivative-factories?) >
From: <vic...@us...> - 2023-09-19 13:03:59
|
Revision: 2729 http://sourceforge.net/p/axsl/code/2729 Author: victormote Date: 2023-09-19 13:03:57 +0000 (Tue, 19 Sep 2023) Log Message: ----------- Remove link between lexer and writing system in orthography config. Modified Paths: -------------- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd Modified: trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd =================================================================== --- trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd 2023-09-19 12:06:42 UTC (rev 2728) +++ trunk/axsl/axsl-00-dev/doc/web/dtds/0.1/en/axsl-orthography-config.dtd 2023-09-19 13:03:57 UTC (rev 2729) @@ -155,18 +155,9 @@ words using the specific rules for a given orthography. For example, English allows allows an apostrophe or closing single quotation mark within a word to mark a contraction or possession. -2. language-iso-3char: The 3-character ISO-639-2/T code for the language being - configured. For example, for English: "eng". -3. script-iso-4char: The 4-character ISO-15924 code for the script being - configured. For example, for Latin: "Latn". -4. country-iso-3char: The 3-character ISO-3166-1 code for the language being - configured. For example, for Canada: "CAN". --> <!ATTLIST lexer class CDATA #REQUIRED - language-iso-3char CDATA #IMPLIED - script-iso-4char CDATA #IMPLIED - country-iso-3char CDATA #IMPLIED >