#360 New attribute @keepHyphen

Milestone: AMBER
Status: pending
Priority: 1(low)
Updated: 2015-06-03
Created: 2012-05-03
No

As a result of the discussion on TEI-L (Re: @break, April 11-20), I'd like to propose a new attribute, @keepHyphen, as a member of att.breaking. (The definition of att.breaking needs to be changed accordingly.)

@keepHyphen indicates whether or not a hyphen preceding the element bearing this attribute is preserved when the line break is removed due to re-rendering of the document.
Status: optional
Datatype: data.enumerated
Sample values include:
yes - a hyphen preceding the element bearing this attribute is preserved when the line break is removed due to re-rendering of the document.
no - a hyphen preceding the element bearing this attribute is omitted when the line break is removed due to re-rendering of the document.
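
To make the proposal concrete, here is a minimal sketch of how the attribute might look on <lb>. Note that @keepHyphen is only the attribute proposed on this ticket and is not defined in TEI; the sample words are invented for illustration, and the use of @break="no" alongside it is an assumption.

```xml
<!-- Hypothetical: @keepHyphen is the attribute proposed here, not part of TEI. -->

<!-- Typographic hyphen: dropped on reflow, giving "monastery". -->
<p>within his monas-<lb break="no" keepHyphen="no"/>tery</p>

<!-- Hyphen that belongs to the spelling: kept on reflow, giving "so-called". -->
<p>the so-<lb break="no" keepHyphen="yes"/>called</p>
```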

Discussion

  • Conal Tuohy

    Conal Tuohy - 2012-05-04

    I think that the practical benefit of the attribute can be achieved in a better way.

    The name "keepHyphen" implies a processing instruction, controlling how to render the document in a re-rendering process in which the lines of text are being re-flowed, but this is not in the spirit of the TEI, which is a system for interpreting and encoding the original text (i.e. it is descriptive rather than prescriptive markup).

    An end-of-line hyphen in a text is either a typographic artefact only (it was used only to mark the fact that a word was broken by a line break), or else it was a real part of the normal spelling of the word, and would have appeared even if the word had not been broken by a line break (such as the word "so-called").

    I believe that the distinction between a "purely typographic" hyphen and a "real" hyphen is in practical terms the same distinction which the @keepHyphen attribute would draw (between a hyphen which should or should not be kept when re-flowing lines of text). Hence I think that the @keepHyphen proposal could be replaced with something which purports to describe the nature of the hyphen in the original text, either using distinct Unicode characters, or some other XML markup. For practical reasons, I would also recommend that any XML markup to describe a hyphen should enclose it rather than follow it, because this is a more natural use of XML.

     
  • Martin de la Iglesia

    You're right, Con: encoding rendering processing instructions is not in the spirit of the TEI. And you're also right about the proposed name "keepHyphen" being somewhat misleading, since we basically want to distinguish two different characters in the source document (though they might look the same). However, I don't see a better way to encode them. Using the Unicode character for "soft hyphen" doesn't work (as stated on BP and TEI-L), because its semantics aren't exactly what we're looking for (and because there's no "soft double hyphen"). If you have any other encoding suggestions, I'd be interested in them.

     
  • James Cummings

    James Cummings - 2012-06-29
    • assigned_to: nobody --> louburnard
     
  • Lou Burnard

    Lou Burnard - 2012-07-05

    Well, one way would be to define two variant glyphs for the hyphen.

    Given
    <glyph xml:id="typographicHyphen">
    <!-- this hyphen was probably introduced just by the printer and can safely be ignored when reformatting -->
    </glyph>
    and
    <glyph xml:id="hardHyphen">
    <!-- this end of line hyphen should always be retained when reformatting -->
    </glyph>

    you could encode them as <g ref="#typographicHyphen">-</g> and <g ref="#hardHyphen">-</g>
    respectively.

    You'd probably prefer some shorter names though!
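
    For context, glyph declarations of this kind would normally sit in a <charDecl> inside the header's <encodingDesc>. A minimal sketch, assuming the shorter ids typoHyphen and realHyphen; the <desc> wording and the <mapping> to a plain hyphen are illustrative assumptions, not something agreed on this ticket:

```xml
<encodingDesc>
  <charDecl>
    <!-- Hyphen introduced only because the word was broken; may be dropped on reflow. -->
    <glyph xml:id="typoHyphen">
      <desc>Typographic (contingent) hyphen marking a broken word</desc>
      <mapping type="standard">-</mapping>
    </glyph>
    <!-- Hyphen that is part of the spelling; always retained on reflow. -->
    <glyph xml:id="realHyphen">
      <desc>Hyphen belonging to the normal spelling of the word</desc>
      <mapping type="standard">-</mapping>
    </glyph>
  </charDecl>
</encodingDesc>
```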

     
  • Martin de la Iglesia

    @louburnard: We had discussed this option before on TEI-L, cf. my e-mail from April 17: "Since these hyphens are so common, <g> bloats the code, and semantically I don't think the information of whether to keep the hyphen when rendering should be given in the hyphen character (I guess BP says about as much). Instead, I feel this is something that should be encoded in <lb>."

     
  • Lou Burnard

    Lou Burnard - 2012-07-06

    Apologies -- I should have gone back to re-read the discussion thread before commenting. Nevertheless, I still think my proposed solution is the only one that makes sense with your stated desires:
    (a) "we basically want to distinguish two different characters in the source document (though they
    might look the same)"
    (b) "something which purports to describe the nature of the hyphen in the original text, either using
    distinct Unicode characters, or some other XML markup."
    Surely it's not inconceivable that the distinction wanted here might be needed when there is no intervening line break? In which case making it a property of the <lb/> tag seems inappropriate. Suppose I have a word hyphenated across a page break? Suppose I have a word hyphenated by an intervening graphic? Suppose I am not recording <lb/>s at all in my text? (There's no way of saying "this <lb/> is here just because of the hyphenation issue".)

     
  • Martin de la Iglesia

    Since @keepHyphen would be a member of att.breaking, you could use it on elements other than <lb>, e.g. on <milestone>, and thus insert such a break almost anywhere in the document.

    Apart from that, I don't see what's wrong with saying "this <lb/> is here just because of the hyphenation issue".

    Another workaround that just came to my mind: if there is an un-hyphenated form of the hyphenated word (i.e. if it's a soft hyphen), it could be supplied in elements like <index>, <w>, or <reg>. What do you think of that?
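
    A minimal sketch of what this workaround might look like with <reg>, assuming the pair is wrapped in <choice> with the original reading in <orig>; the sample word and the literal hyphen are illustrative only:

```xml
<choice>
  <!-- the reading as printed, with the end-of-line hyphen and the line break -->
  <orig>bre-<lb/>thern</orig>
  <!-- the unbroken form, supplied by the encoder -->
  <reg>brethern</reg>
</choice>
```

    Note that the line break ends up inside <orig>, so a processor that wants both the unbroken word and the line breaks has to look at both branches.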

     
  • Lou Burnard

    Lou Burnard - 2012-09-16
    • milestone: --> AMBER
     
  • Elena Pierazzo

    Elena Pierazzo - 2012-09-20
    • assigned_to: louburnard --> epierazzo
     
  • Lou Burnard

    Lou Burnard - 2012-10-22
    • status: open --> closed-rejected
     
  • Lou Burnard

    Lou Burnard - 2012-10-22

    Council reviewed this proposal again at the meeting in Oxford in September 2012. There was a fairly unanimous feeling that this was a processing-related issue which might well be useful in a specific production environment, but that introducing it into the Guidelines in general was not warranted, given the number of other solutions available and discussed on the ticket. I am therefore closing the ticket, and apologise for the delay in doing so.

     

    Last edit: Kevin Hawkins 2013-04-21
  • Kevin Hawkins

    Kevin Hawkins - 2013-04-21
    • status: closed-rejected --> pending
     
  • Kevin Hawkins

    Kevin Hawkins - 2013-04-21

    While we rejected this ticket at our meeting (see minutes at http://www.tei-c.org/Activities/Council/Meetings/tcm52.xml ), we did also ask Elena to find an example to incorporate into the Guidelines implementing Lou's encoding solution given on the ticket. That hasn't been done yet (or if it was, no note was made on the ticket). Reopening.

     
  • James Cummings

    James Cummings - 2013-11-11
    • assigned_to: Elena Pierazzo --> Paul Schaffner
    • Priority: 5 --> 1(low)
     
  • James Cummings

    James Cummings - 2013-11-11

    Assigned to PaulS to provide example of recommended mechanism.

     
  • Elli Mylonas

    Elli Mylonas - 2014-11-19

    Prod Paul to provide an example for Guidelines.

     
  • Lou Burnard

    Lou Burnard - 2015-05-30

    Paul prodded again.

     
  • Paul Schaffner

    Paul Schaffner - 2015-06-02

    While I root around for examples, it appears to me that there is really only one 'clean' solution proposed on the ticket, i.e. one that can readily be implemented without undue complications and requires nothing other than the existing facilities within TEI, namely

    Lou's suggestion that the hyphen-that-stays and the hyphen-that-goes-away (the real hyphen and the contingent hyphen) are really two different characters that happen to look alike. An encoder is free to adopt this interpretation and encode it with <g/>. It can be easily enough applied in most circumstances. It does not depend on context or on encoding practices (e.g. it can be used even when <lb/>s as such are not captured). And it can easily be converted into other solutions should that be needed.

    Martin's suggestion that the difference between a hyphenated and unhyphenated form might be encoded with <orig> and <reg> within <choice> is theoretically fine, but tends to run aground on the difficulties of coping with the intervening element (e.g. <lb/>, <pb/>, <cb/>).

    The original suggestion that an attribute be defined whose semantics characterize the immediately preceding hyphen feels very kludgy, antithetical to the spirit of XML, and likely to lead to confusing situations. What happens when the attribute is present but the hyphen has been omitted? What happens when a series of elements intervene between the two halves of a word (e.g. beau|<figure/><cb/>tiful)? What happens when there is no element at all between the first and second half of the word, as happens frequently in connection with notated music ("O Ca -- na -- da"), which involves what seems to be exactly the same kind of contingent hyphen, albeit one without an intervening line, column, or page break?
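
    On the point that the <g/> encoding can easily be converted into other solutions, here is a minimal XSLT 1.0 sketch of a reflow step. It assumes the glyph ids typoHyphen and realHyphen and a document in the TEI namespace; it only rewrites the two kinds of <g> and deliberately leaves <lb/>, <cb/> and all other elements untouched:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Contingent hyphen (assumed id): emit nothing, so the word halves join. -->
  <xsl:template match="tei:g[@ref='#typoHyphen']"/>

  <!-- Real hyphen (assumed id): replace the <g> with a literal hyphen. -->
  <xsl:template match="tei:g[@ref='#realHyphen']">
    <xsl:text>-</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

    Because the decision hangs on the <g> element alone, the same two templates apply whether or not a break element follows the hyphen.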

     

    Last edit: Paul Schaffner 2015-06-02
  • Paul Schaffner

    Paul Schaffner - 2015-06-02

    This example shows two typographic (contingent) hyphens, one preceding a column break and intervening figure and the other preceding a line break.

    The whyche Denys know<g ref="#typoHyphen"/><cb break="no"/><figure/>ynge and aperceyuynge that this holy man defferred and putt of for to gadre wythin his monasterye wyth hys bre<g ref="#typoHyphen"/><lb break="no"/>thern.

    Source: [Vitas Patrum] STC 14507. Westminster: Wynkyn de Worde, 1495. Pars 1. Chapter 84. Fol. Cxxiii. Image 130 of EEBO copy. http://gateway.proquest.com/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_id=xri:eebo:image:10340:130

    This example shows contingent ('typographic') hyphens when there is no intervening tag of any kind, the words being hyphenated in order to align with notated music:

    <l><notatedMusic/> Leave it to the Planets two,</l>
    <l><notatedMusic/> what we shall here<g ref="#typoHyphen"/>af<g ref="#typoHyphen"/>ter do,</l>
    <l><notatedMusic/> for the joy we now may prove,</l>
    <l><notatedMusic/> take ad<g ref="#typoHyphen"/>vice of present love.</l>

    Source:
    Select Ayres and Dialogues for One, Two, and Three Voyces; to the Theorbo-Lute Or Basse-Viol.
    London: printed by W. Godbid for John Playford, 1659. Wing W2909. Pg. 17.
    http://wwwlib.umi.com/eebo/image/64071/12

    And this example shows a 'real' or permanent hyphen, along with a typographic one, and one that is perhaps ambiguous but has been encoded here as 'real':

    <list>
    <item>Salvia
    <list>
    <item>variegata, painted Sage.</item>
    <item>hortensis rubra, red garden<g ref="#realHyphen"/><lb/>Sage.</item>
    </list>
    </item>
    <item>Saxifraga
    <list>
    <item>alba, white Saxifrage.</item>
    <item>aurea, golden Saxifrage.</item>
    <item>Anglicana alsine folio, En<g ref="#typoHyphen"/><lb/>glish Saxifrage with Chick<g ref="#realHyphen"/><lb/>weed-leaves.</item>
    </list>
    </item>
    </list>

    Source:
    Musaeum Tradescantianum: OR, A COLLECTION OF RARITIES. PRESERVED At South-Lambeth neer London By JOHN TRADESCANT.
    London: Printed by John Grismond, and are to be sold by Nathanael Brooke at the Angel in Cornhill, 1656.
    Wing T2005. Pg. 164.
    http://wwwlib.umi.com/eebo/image/115821/95

     

    Last edit: Paul Schaffner 2015-06-02
  • Martin de la Iglesia

    "What happens when the attribute is present but the hyphen has been omitted? what happens when a series of elements intervene between the two halves of a word? (e.g. beau|<figure/><cb/>tiful)."

    There's always @break="no".

    "And this example shows a 'real' or permanent hyphen, along with a typographic one, and one that is perhaps ambiguous but has been encoded here as 'real':"

    An example of a 'hard' (or 'permanent', or 'real') hyphen from a more recent source would be preferable, so that readers know how the word is supposed to be spelt, e.g. a term in which both parts begin with a capital letter ("Leibniz-Newton controversy").

     
  • Paul Schaffner

    Paul Schaffner - 2015-06-03

    One problem with using the "Lou" solution is that one ends up with too many hyphen characters. There are the 'real' hyphen (a g element) and the 'contingent' hyphen (also a g element), which are fine. But if using those, should one also use the literal hyphen character (-)? And if so, what does the latter signify? Or in what circumstances should it be used?

     
  • Lou Burnard

    Lou Burnard - 2015-06-03

    Two comments from me:
    a) Paul's examples may be lovely but we can't see the images, which makes them less useful for pedagogic purposes; we can't include them in the Guidelines unless they are distributable under an open licence.
    b) I am not sure what you mean by "the literal hyphen character". I might put one inside the element for the benefit of dim-witted processors unable to work out for themselves how to render a <g ref="#contingentHyphen"/>.
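
    A minimal sketch of the two forms in question, assuming a glyph with the hypothetical id contingentHyphen has been declared in the header. With the empty form a processor has to look the glyph up to know what to render; in the second form the content of <g> serves as a fallback:

```xml
<!-- empty <g>: the processor must resolve #contingentHyphen to render anything -->
<p>bre<g ref="#contingentHyphen"/><lb break="no"/>thern</p>

<!-- <g> with a literal hyphen inside, usable as a fallback rendering -->
<p>bre<g ref="#contingentHyphen">-</g><lb break="no"/>thern</p>
```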