#26 simplify encoding of chronological phrases

closed
Syd Bauman
None
4
2006-09-25
2004-09-03
Syd Bauman
No

TEI P4 has four elements, <date>, <dateRange>, <time>,
and <timeRange> for encoding and normalizing text that
describes a point or period in time. (Not counting
<timeline> and <when> which are special purpose
elements for establishing synchronous points.)
The only difference between a date and a time is the
level of precision. The quick description of the
difference between a <date> or <time> and a <dateRange>
or <timeRange> is that the *Range describes a period
greater than the level of precision used.
Furthermore, in its discussion of attributes for these
elements, P4 conflates accuracy and precision (and
also, IIRC, confidence in the accuracy :-), and does
not address whether ranges are inclusive or exclusive.
Thus I am suggesting that this mix of elements and
attributes needs some attention for P5. Some first
suggestions follow.

Since it is easy to indicate a range with the
international standard representation of dates and
times (ISO 8601:2000), the *Range elements are
unnecessary, and should be dropped from P5. The
following example (from P4 6.4.4, source is Virginia
Woolf's _Mrs._Dalloway_) demonstrates an encoding of a
range without <dateRange>.
| Those five years &mdash;
| <date value="1918/1923">1918 to 1923</date>
| &mdash; had been, he suspected,
| somehow very important.
The Guidelines should simply state that the range
specified on value= is inclusive. E.g.
| <date value="1067/1776-07-03">After 1066 but before
| American independence</date>
(Which, of course, could also be encoded
| After <date value="1066">1066</date> but before
| <date value="1776-07-04">American independence</date>
with the same accuracy and precision)
| <date value="1869-10-02/1948-01-30">during the life
| of the Mahatma</date>
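
A minimal sketch of what "inclusive" could mean to a processor: each
end of the range is widened to the first or last day that its
precision covers. The function names are invented for illustration,
only YYYY, YYYY-MM and YYYY-MM-DD forms are handled, and Python's
date type is proleptic Gregorian, so the calendar question for early
years is ignored here.

```python
import calendar
from datetime import date

def expand(value, end):
    """Return the first day (or, if end is True, the last day) covered by
    a reduced-precision ISO 8601 date: YYYY, YYYY-MM or YYYY-MM-DD."""
    parts = [int(p) for p in value.split("-")]
    year = parts[0]
    # A bare year runs from January to December.
    month = parts[1] if len(parts) > 1 else (12 if end else 1)
    if len(parts) > 2:
        day = parts[2]
    else:
        # monthrange() returns (weekday of the 1st, number of days in month).
        day = calendar.monthrange(year, month)[1] if end else 1
    return date(year, month, day)

def inclusive_bounds(value):
    """Expand a value= string ("start/end" or a single date) to a pair of
    inclusive (earliest, latest) dates."""
    start, _, end = value.partition("/")
    return expand(start, end=False), expand(end or start, end=True)

print(inclusive_bounds("1918/1923"))
print(inclusive_bounds("1067/1776-07-03"))
```

Under this reading "1918/1923" covers 1 January 1918 through
31 December 1923, and a bare "1066" covers the whole of that year.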

The exact= attribute of <*Range> should become the
accuracy= attribute of <date> and <time>. The precision
is indicated by the precision of value=.
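
As a rough illustration of reading precision off the lexical form of
value=, the sketch below classifies a few common ISO 8601 shapes. The
level names and the set of shapes recognized are assumptions made for
the example, not anything P4 or P5 defines.

```python
import re

# Ordered from most to least precise; each pattern must match the whole value.
LEVELS = [
    ("minute", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(?:Z|[+-]\d{2}:\d{2})?"),
    ("day", r"\d{4}-\d{2}-\d{2}"),
    ("month", r"\d{4}-\d{2}"),
    ("year", r"\d{4}"),
]

def precision(value):
    """Return the name of the precision level implied by value, or None
    if the value has none of the shapes listed above."""
    for name, pattern in LEVELS:
        if re.fullmatch(pattern, value):
            return name
    return None

print(precision("1066"))
print(precision("1776-07-04"))
```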

Since a date and time indicate the same thing (albeit
with varying precision) and the normalized
representation (ISO 8601) can include both, the
Guidelines should explicitly state that <time> and
<date> are technically interchangeable.

The Guidelines should be explicit about whether a "T"
is to be specified between the date and time fields of
an ISO 8601 value=. (I.e., whether the content of
value= is an ISO 8601 format date followed by
whitespace followed by an ISO 8601 time (e.g.
"2004-09-03 15:24Z") or an ISO 8601 date and time (e.g.
"2004-09-03T15:24Z"). I prefer the latter myself.)

The Guidelines should explicitly prohibit the notation
"24:00" to represent midnight in the value of value=.
(This notation *is* permitted by ISO 8601, one of the
few indications that it was written by committee :-)
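
If both of these rules were adopted (a mandatory "T" and no "24:00"),
a check might look roughly like the sketch below. The pattern and the
function name are hypothetical; the pattern covers only
YYYY-MM-DDThh:mm with optional seconds and zone, and it does not check
that the month and day are themselves valid.

```python
import re

# Date, a literal "T", then hh:mm[:ss] and an optional zone designator.
ISO_DATETIME = re.compile(
    r"\d{4}-\d{2}-\d{2}T"
    r"(?P<hour>\d{2}):(?P<minute>\d{2})(?::\d{2})?"
    r"(?:Z|[+-]\d{2}:\d{2})?"
)

def is_acceptable(value):
    """True if value uses the "T" separator and an hour in 00..23."""
    m = ISO_DATETIME.fullmatch(value)
    if m is None:
        return False  # wrong shape, e.g. whitespace instead of "T"
    # Rejecting hour 24 rules out the "24:00" form of midnight.
    return int(m.group("hour")) <= 23 and int(m.group("minute")) <= 59

print(is_acceptable("2004-09-03T15:24Z"))
print(is_acceptable("2004-09-03 15:24Z"))
print(is_acceptable("2004-09-03T24:00Z"))
```

Midnight would then have to be written as 00:00 on the following day.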

We can imagine two different uses of the value=
attribute of <date> (or <time>, I suppose):
1. regularize the content of <date> into a format that
can easily be parsed and searched
2. normalize the content of <date> to a date along an
agreed upon timeline (aka calendar system)
It might make sense, then, to separate these into two
separate attributes, as one may reasonably want
different values for these purposes. For example, one
might like to regularize the *format* of the Julian
dates in early modern printing, but may well rather not
be bothered trying to figure out what the Gregorian or
proleptic Gregorian (i.e., normalized) value would be.
| <docDate norm="1548-04-07" reg="1548-03-28">The.xxviii.day
| of <name>Marche</name>
| <lb/>the yere of our lorde.
| <lb/>M.D.XLVIII.</docDate>
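
For what it is worth, the step from the regularized Julian value to
the normalized one is mechanical once the calendar is known. The
sketch below uses the standard Julian Day Number arithmetic; the
function name is made up, and it assumes the input year is already
reckoned from 1 January (it does nothing about years that began on
25 March).

```python
from datetime import date

def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date to the (proleptic) Gregorian date
    of the same day, going through the Julian Day Number."""
    a = (14 - month) // 12            # 1 for January and February, else 0
    y = year + 4800 - a
    m = month + 12 * a - 3            # March = 0 ... February = 11
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Python ordinal 1 is 0001-01-01 Gregorian, whose day number is 1721426.
    return date.fromordinal(jdn - 1721425)

print(julian_to_gregorian(1548, 3, 28).isoformat())
```

For the sixteenth century the two calendars differ by ten days, which
matches the reg= and norm= values in the example above.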

Discussion

  • David Sewell
    2004-10-27

    Logged In: YES
    user_id=35379

    Syd,

    I'm jumping into this discussion without having spent a lot
    of time on P5 yet (and please forgive any gaffes in
    technical description), but what I think I might be inclined
    to propose is a way of dealing with dating issues by adding
    a couple of attributes to tei.datable (and/or to the <date>
    element). I'd like to lay out the issues and then share one
    preliminary solution I've come up with.

    My group is going to be dealing with a growing body of
    TEI-encoded documentary editions and digitizations of
    documentary editions. One desideratum in the long run is
    going to be detailed searching and indexing based on
    arbitrary date specifications. For example, a reader may
    enter a search to return all documents in the collection
    that were or may have been written on 16 June 1775.

    Now the dating of documents lacking unique precise dates is
    a very complex matter when you're planning for machine
    processing. (See a TEI-L post from a couple years ago for
    more on this:
    http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind0207&L=tei-l&P=R1420
    ). A letter may bear a dateline of simply "Tuesday";
    internal and/or external evidence may indicate that it must
    have been written sometime from September 1835 through March
    1836. It's not going to do just to code this with <origin
    notBefore="1835-09-01" notAfter="1836-03-31"> because that
    doesn't provide enough constraining information so that an
    automated search would return this document as a possibility
    for 2 February 1836 but not 3 February 1836.
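
    A minimal sketch of the problem, using the dates from the example.
    The function names are invented for illustration. A plain range
    test accepts both candidate days, and only the extra weekday test
    tells them apart.

```python
from datetime import date

not_before = date(1835, 9, 1)
not_after = date(1836, 3, 31)

def in_range(d):
    """All that notBefore/notAfter alone can express."""
    return not_before <= d <= not_after

def possible_for_tuesday_letter(d):
    """Range test plus the dateline evidence that the day was a Tuesday.
    isoweekday() numbers Monday as 1, so Tuesday is 2."""
    return in_range(d) and d.isoweekday() == 2

for candidate in (date(1836, 2, 2), date(1836, 2, 3)):
    print(candidate, in_range(candidate), possible_for_tuesday_letter(candidate))
```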

    I haven't slogged through ISO 8601 enough to know whether it
    provides a robust enough syntax to handle all these cases,
    but I do know that its syntax for recurring time-intervals
    is rebarbative. My own first pass at a solution is a
    "constraints" attribute that would constrain a date range by
    allowing as valid dates only those meeting the criteria. A
    @constraints value would be something like an XHTML @style
    value, consisting of a series of subvalues delimited by
    semicolons. For example, to indicate a Tuesday in March or
    April 1835:

    <origin notBefore="1835-03-01" notAfter="1835-04-30"
    constraint="m: 3-4; dw: 2">

    where "m" is a numeric month, month range, or
    comma-separated list of discrete months, and "dw" is day of
    the week. "dm" could be used for day of the month. There are
    also special cases where one knows, usually based on
    external evidence, that a document must have been written
    during one of several discrete ranges. For example, a letter
    by William James is datelined Chicago, and we know from
    biographical evidence that he was there in the spring of
    1882 and fall of 1883 [I'm making this up]; a constraint
    could be represented like so:

    <origin notBefore="1882-03-15" notAfter="1883-10-18"
    constraint="ranges: 1882-03-15/1882-04-20,
    1883-09-07/1883-10-18">

    Automatic processing can then fairly easily establish the
    first and last possible date and all possible dates in between.
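
    As a rough sketch of that processing, the code below parses a
    constraint string of the kind shown above and lists every date
    that satisfies it. The parsing rules are one guess at how the
    informal syntax would work (keys "m", "dw", "dm" and "ranges",
    with "dw" numbered Monday = 1), and all function names are
    hypothetical.

```python
from datetime import date, timedelta

def parse_numbers(spec):
    """Expand "3-4" or "1,3,5" into a set of integers."""
    result = set()
    for item in spec.split(","):
        lo, _, hi = item.strip().partition("-")
        result.update(range(int(lo), int(hi or lo) + 1))
    return result

def parse_constraint(text):
    """Split "m: 3-4; dw: 2" into {"m": "3-4", "dw": "2"}."""
    rules = {}
    for part in text.split(";"):
        if part.strip():
            key, _, value = part.partition(":")
            rules[key.strip()] = value.strip()
    return rules

def possible_dates(not_before, not_after, constraint):
    """List every date from not_before to not_after (inclusive) that
    satisfies all parts of the constraint string."""
    rules = parse_constraint(constraint)
    months = parse_numbers(rules["m"]) if "m" in rules else None
    weekdays = parse_numbers(rules["dw"]) if "dw" in rules else None
    monthdays = parse_numbers(rules["dm"]) if "dm" in rules else None
    ranges = None
    if "ranges" in rules:
        ranges = []
        for r in rules["ranges"].split(","):
            start, _, end = r.strip().partition("/")
            ranges.append((date.fromisoformat(start), date.fromisoformat(end)))
    found = []
    day = date.fromisoformat(not_before)
    last = date.fromisoformat(not_after)
    while day <= last:
        if ((months is None or day.month in months)
                and (weekdays is None or day.isoweekday() in weekdays)
                and (monthdays is None or day.day in monthdays)
                and (ranges is None or any(a <= day <= b for a, b in ranges))):
            found.append(day)
        day += timedelta(days=1)
    return found

tuesdays = possible_dates("1835-03-01", "1835-04-30", "m: 3-4; dw: 2")
print(tuesdays[0], tuesdays[-1], len(tuesdays))
```

    A search for a given day then reduces to asking whether that day
    is in the returned list.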

    That's about enough to handle search requirements, I think.
    Although it would be redundant with the "ranges" constraint,
    one could also add for convenience a "dates" constraint
    whose value would be a list of one or more unique dates.

    The one other thing needed for automatic processing of large
    numbers of dated documents is some way to indicate how they
    should be indexed in a generated list if that cannot be
    computed from other values. For example, to take the
    preceding example of the James letter from Chicago, suppose
    an editor feels that it really ought to be included at the
    end of all the letters from 1882. Or maybe it should be
    included immediately after a letter from June 1882 that
    refers to Chicago. There's no way to do this via automated
    processing. So I'd propose one last "index" attribute whose
    content is a single date value to be used as the
    pseudo-creation date for indexing purposes. So the James
    example could be expanded to

    <origin notBefore="1882-03-15" notAfter="1883-10-18"
    constraint="ranges: 1882-03-15/1882-04-20,
    1883-09-07/1883-10-18" index="1882-12-31">

    to have it indexed with the end-of-1882 letters. (Obviously
    there are limits to how much fine-tuning of generated
    indexes can be done without creating too many ad hoc
    attributes.)
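
    In processing terms this could be as simple as a sort key that
    prefers the index value when one is present. The sketch below
    assumes each document's attributes have been read into a dict;
    the two neighbouring letters and all identifiers are invented for
    the example.

```python
from datetime import date

def sort_key(doc):
    """Use the editor-supplied index date if present, else notBefore."""
    return date.fromisoformat(doc.get("index") or doc["notBefore"])

letters = [
    # The Chicago letter, with the proposed index attribute.
    {"id": "chicago", "notBefore": "1882-03-15", "notAfter": "1883-10-18",
     "index": "1882-12-31"},
    # Two made-up, precisely dated letters on either side of it.
    {"id": "june-1882", "notBefore": "1882-06-10", "notAfter": "1882-06-10"},
    {"id": "jan-1883", "notBefore": "1883-01-05", "notAfter": "1883-01-05"},
]

ordered = [doc["id"] for doc in sorted(letters, key=sort_key)]
print(ordered)
```

    Without the index value the Chicago letter would sort first, at
    its notBefore date of 15 March 1882.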

    Finally, attributes like these are really needed on the
    <docDate> element as well as on <origin>, because where
    you're marking up something like a print edition of
    correspondence as a single document instance you're not
    going to be able to use <origin> inside the TEI body without
    extending its definition in a way that's probably not
    justifiable.

    So... how much of this makes sense? And to the extent that
    it does, which parts of it look like something that should
    be part of P5 rather than project-specific extensions?

    David

     
  • David Sewell
    2005-01-23

    Logged In: YES
    user_id=35379

    Syd -- looking back at this comment again. One problem with
    allowing something like an ISO 8601 date range as the
    content of a date/@value attribute is that there is no
    corresponding W3C datatype. I.e., if you have <date
    value="1776-12-01/1776-12-31">Dec 1776</date> validation
    using the current P5 schemas (Relax NG or W3C) will complain
    about bad attribute content.

    So date.attributes.value.content would need to be redefined,
    either by some rather ugly use of grouping or (probably
    better) by creating some user-defined datatypes for things
    like date ranges, and adding them to the content model. (Are
    there any user-defined datatypes already in use in P5? Does
    someone know how to create them if not?)
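
    Whatever schema language is used, a user-defined datatype of this
    kind would presumably come down to a lexical pattern. The sketch
    below states one such pattern as a Python regular expression
    rather than in Relax NG or W3C Schema syntax; the pattern is an
    assumption for illustration and checks shape only, not whether
    the month and day are real.

```python
import re

# A year, optionally followed by a month, optionally followed by a day.
ISO_DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"

# Either one such date or two of them separated by a solidus.
DATE_OR_RANGE = re.compile(rf"{ISO_DATE}(?:/{ISO_DATE})?")

def is_date_or_range(value):
    """True if the whole value is a date or a start/end pair of dates."""
    return DATE_OR_RANGE.fullmatch(value) is not None

print(is_date_or_range("1776-12-01/1776-12-31"))
print(is_date_or_range("Dec 1776"))
```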

     
  • Lou Burnard
    2005-08-06

    Logged In: YES
    user_id=1021146

    Lots here to puzzle out...

    1. We can certainly define a new datatype specifically for
    date ranges or expand the definition of
    tei.temporal.expression to include ISO 8601 ranges. I'm not
    sure which would be less confusing though.

    2. I like David's proposal for a constraints attribute a
    lot, but wonder if inventing what looks like a special
    constraints language for it is really advisable. Has this
    actually been put into effect already?

    3. As currently defined, the value attribute on all date
    elements must be in the same calendar, I believe. I find it
    hard to see why you'd want to regularize the format of (say)
    a Julian or other calendar date without also normalizing it,
    but if you did, the best way to do it would be with a <reg>
    nested inside a <choice> inside the <date>, I guess.

     
  • Syd Bauman
    2006-09-25

    • assigned_to: nobody --> sbauman
    • status: open --> closed