EpiDoc: Epigraphic Documents in TEI XML / Request Features / #110 Document: use 'c' for illegible characters that are known to be vowel/consonant/etc.

BODARD Gabriel - 2016-07-07

Note: there is still pressure from the Indic epigraphy community for something more granular than "character", perhaps "letter" vs. "diacritic", since even an illegible consonant is unambiguously distinct from a vowel. I don't think this changes my argument above, but others may differ...

BODARD Gabriel - 2016-07-19

Technically this is not possible in TEI, since <c> may not contain <gap> or any other markup. Unless someone can convince me this is a necessary restriction, I suggest a feature request to TEI to fix this. <c> should be able to contain pretty much anything that <seg> can...

Description has changed:

Diff:

--- old
+++ new
@@ -1,4 +1,4 @@
-There was a proposal on Markup (see **@unit for &lt;gap%gt; in Indic script texts** at http://lsv.uky.edu/scripts/wa.exe?A2=ind1606&L=markup&P=1355) to allow values such as "vowel", "consonant", and "aksara" in `gap/@unit`. My advice, as you'll see in the email at http://lsv.uky.edu/scripts/wa.exe?A2=ind1607&L=markupP=2366, is to separate the observation about the lost or illegible character(s), `tei:gap` from the interpretation (however indubitable) that this character must be a vowel/consonant/diacritic or whatever else. The examples in Marc's initial email, modified to something like:
+There was a proposal on Markup (see **@unit for &lt;gap&gt; in Indic script texts** at http://lsv.uky.edu/scripts/wa.exe?A2=ind1606&L=markup&P=1355) to allow values such as "vowel", "consonant", and "aksara" in `gap/@unit`. My advice, as you'll see in the email at http://lsv.uky.edu/scripts/wa.exe?A2=ind1607&L=markupP=2366, is to separate the observation about the lost or illegible character(s), `tei:gap` from the interpretation (however indubitable) that this character must be a vowel/consonant/diacritic or whatever else. The examples in Marc's initial email, modified to something like:

 ~~~
 <c type="consonant"><gap reason="illegible" quantity="1" unit="character"/></c>i

BODARD Gabriel - 2016-07-19

Merged from [#89].

As requested by Christian Prager on Markup ( http://lsv.uky.edu/scripts/wa.exe?A2=ind1501&L=markup&D=1&O=D&X=7C0AF26B279219B3FE&P=808 ). He's interested in marking up incomplete or damaged logographic characters, for which it's important to say more than simply "underdotted". This is occasionally a desideratum for Greek/Latin as well, but we've never had a decent recommendation for this other than, "Use SVG"!

This discussion needs input from both Sanskritists and Mayanists.

Related

Request Features: ~~#89~~

Last edit: BODARD Gabriel 2016-07-19

BODARD Gabriel - 2016-07-19

assigned_to: Emmanuelle Morlock

BODARD Gabriel - 2016-07-19

Emmanuelle will convene a conversation between the Pyu project in Lyon and the Mayan Worterbuch in Bonn, and poke the EnCoWS list to draw their attention to the Markup discussion.

Last edit: BODARD Gabriel 2016-07-19

Emmanuelle Morlock - 2016-07-19

status: unread --> accepted

Description has changed:

Diff:

--- old
+++ new
@@ -1,4 +1,6 @@
-There was a proposal on Markup (see **@unit for &lt;gap&gt; in Indic script texts** at http://lsv.uky.edu/scripts/wa.exe?A2=ind1606&L=markup&P=1355) to allow values such as "vowel", "consonant", and "aksara" in `gap/@unit`. My advice, as you'll see in the email at http://lsv.uky.edu/scripts/wa.exe?A2=ind1607&L=markupP=2366, is to separate the observation about the lost or illegible character(s), `tei:gap` from the interpretation (however indubitable) that this character must be a vowel/consonant/diacritic or whatever else. The examples in Marc's initial email, modified to something like:
+There was a proposal on Markup (see **@unit for &lt;gap&gt; in Indic script texts** at http://lsv.uky.edu/scripts/wa.exe?A2=ind1606&L=markup&P=1355) to allow values such as "vowel", "consonant", and "aksara" in `gap/@unit`. My advice, as you'll see in the email at
+http://lsv.uky.edu/scripts/wa.exe?A2=ind1607&L=markup&O=D&F=&S=&P=2366
+is to separate the observation about the lost or illegible character(s), `tei:gap` from the interpretation (however indubitable) that this character must be a vowel/consonant/diacritic or whatever else. The examples in Marc's initial email, modified to something like:

 ~~~
 <c type="consonant"><gap reason="illegible" quantity="1" unit="character"/></c>i

BODARD Gabriel - 2016-09-20

Group: future --> 8.23

Emmanuelle Morlock - 2017-02-21

Group: 8.23 --> future

BODARD Gabriel - 2017-10-17

Group: future --> 9.0

Emmanuelle Morlock - 2017-11-24

Group: 9.0 --> future

BODARD Gabriel - 2017-11-24

Where are we with this discussion, out of interest? Can we summarise the current options for encoding lost/illegible but partially interpretable glyphs that we're trying to choose between?

1. petition TEI to allow <gap> inside <c>

2. petition TEI to allow <gap> inside <g>

3. recommend <seg type="character|diacritic|consonant|etc"><gap/></seg> or similar to encode such features

Are there any other options we want to consider, before we make a final decision on this ticket?
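
For concreteness, here is a minimal sketch of what the `<seg>` option could look like when applied to the example from the ticket description (an illegible consonant followed by a legible "i"). The attribute values are carried over from that `<c>` example; the `type` value "consonant" is only one of the values floated in this thread, not an agreed vocabulary.

```xml
<!-- Sketch only: <seg> carries the interpretation ("this was a consonant"), -->
<!-- while <gap> records the observation (one illegible character).          -->
<seg type="consonant"><gap reason="illegible" quantity="1" unit="character"/></seg>i
```

Unlike `<c>`, `<seg>` may already contain `<gap>`, so this form needs no change to the TEI schema.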

Emmanuelle Morlock - 2017-11-28

Interest waned once the project that raised the question chose a more intuitive solution.
In the end it used a straightforward approach, <seg type="graphemic|phonemic|phonetic">, which produces a different kind of bracket in the rendered output after transformation. As that approach is output-oriented, it is not of much help here.

I went through the discussions on encows (https://groups.google.com/forum/#!topic/encows/emN7ZDEnlwY) and markup, and found no option proposed other than embedding <c> in <c> to represent smaller parts of a character.

The discussion is complex because in most cases we encode a transliteration of a writing system. So if something is missing in the inscribed glyph/character sequence, it is difficult to represent it well aligned in the transliterated sequence. As the solution needs to be kept generalised, it proved misleading to try to define and model what a sub-character component should be (for example, how a consonant or vowel can or cannot be aligned with a character, a glyph, or a part of these units). It also seemed that whether an assertion like "it is a vowel" counts as an interpretation is not absolute, but depends heavily on the inner characteristics of a writing system. So exploring that perspective might not lead to a generalised solution, although there was a suggestion of a sort of entity shorthand for "illegible consonant", "illegible vowel", etc. I think in this case it is the TEI feature structure system that should be used, but it might be judged "too complicated".

So the options you summed up meet our need for a simple but flexible encoding system.

I wouldn't be against the possibility of both. But the 3rd is easier to implement, as it is just a recommendation. It also seems clearer, and maybe easier for the encoder to use.

There was a 4th option, outlined briefly, for coping with the notion of a unit smaller than a character: embedding <c> in <c>… (cf. http://lsv.uky.edu/scripts/wa.exe?A2=ind1607&L=MARKUP&F=&S=&P=26438)
but it might be useful only for systems where the subcomponent is a structural sub-unit,
e.g. if a glyph in a writing system is composed of 4 structural parts and one of them is missing…
<g><c><gap/></c><c>xx</c><c>yyy</c><c>zzz</c></g>

So:
- is the 4th option interesting?
- could 3 be added to the GL as a recommendation for release 9, maybe after a quick confirmation mail?
- what do 1 and 2 imply exactly: proposing that possibility to the TEI community first?

BODARD Gabriel - 2017-11-29

Yes, options 1 or 2 would involve first writing to TEI-L with the suggestion, and seeing if there is any taste for it there, and later creating a ticket to request it on the TEI GitHub tracker. It would then be discussed by TEI Council (which includes a couple of EpiDocistas), and probably decided upon about a year later. But if this is not urgent (Pyu are now using a solution very similar to (3) for the interim) maybe that doesn't matter.

The problem with option (4) is that it also isn't legal TEI currently (<g> cannot contain <c>, nor can <c> contain <c>, although of course <c> can contain <g>, which doesn't help us), and I fear it doesn't really reflect how TEI uses either <c> or <g>, so it would probably encounter some resistance and require significant arguing for from someone knowledgeable in the languages (do Christian, Arlo, Marc or Dániel care enough to go on TEI-L and make that case?).

I suspect an email to Markup outlining the 4 possibilities might quickly lead to a consensus that (3) is working, so why mess with it, or that something else entirely is needed, in which case someone (not us!) needs to take the discussion to TEI-L to propose a more general solution.

Emmanuelle Morlock - 2017-12-03

Ok I agree. Message sent: http://lsv.uky.edu/scripts/wa.exe?A2=ind1712&L=markup&F=&S=&X=2599D03FEA67FE8470&Y=emmanuelle.morlock%40mom.fr&P=48

BODARD Gabriel - 2019-01-22

Group: future --> 9.1

Scott Vanderbilt - 2019-10-22

Group: 9.1 --> future

Scott Vanderbilt - 2019-10-22

Bumped -> Future.

BODARD Gabriel - 2020-06-16

status: accepted --> done

BODARD Gabriel - 2020-06-16

The project that requested this feature are now using "<seg type="character|diacritic|consonant|etc"><gap/></seg> or similar to encode such features", so we can close this ticket.
