Xavier Tannier

DCTFinder Language Rules

Organization

You will find the language-specific rules in the distributed jar, under directory resources/data.

Inside this directory, resources are organized by language (en_GB for British English, en_US for North American English, fr for French). The name of each language directory MUST correspond to the output of Java's Locale.toString() method for that language.
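As a quick check of which directory name a given locale requires, the following sketch prints Locale.toString() for the three languages mentioned above. The class and method names are illustrative only and are not part of DCTFinder.

```java
import java.util.Locale;

class LocaleDirectoryNames {
    // Illustrative helper (not a DCTFinder API): the directory name expected
    // for a locale is simply the value of Locale.toString().
    static String directoryName(Locale locale) {
        return locale.toString();
    }

    public static void main(String[] args) {
        System.out.println(directoryName(Locale.UK));     // en_GB
        System.out.println(directoryName(Locale.US));     // en_US
        System.out.println(directoryName(Locale.FRENCH)); // fr
    }
}
```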

Inside language directories, rules have the following organization:

  • vocabulary
  • tags
  • urls

vocabulary

The vocabulary directory contains two types of files:

  1. Trigger files:

    • trigger.txt contains words that are often associated with a publication date.
    • anti-trigger.txt contains words that can be associated with a date, but are very unlikely to accompany a publication date.
    • inside-trigger.txt contains words that are not date-specific but that often accompany date representations.
    • post-trigger.txt contains words that often follow a publication date in a web page.
  2. Date rules: two-column files associating a regular expression (1st column) with a normalized value (2nd column). The normalized value is:

    • For numeric dates, a sequence of the letters Y, M and D, standing for the year, the month and the day, each associated with the corresponding group in the regular expression. For example:
      ([0123]?\d)[-./]([012]?\d)[-./]([2]\d\d\d) DMY
      means that the first group stands for the day, the second for the month and the third for the year.
    • For names of months, Mn, where n is the number of the month between 1 and 12.
    • For days of the week, a number between 1 and 7 (1 is Monday, 7 is Sunday).
    • For times, a sequence of the letters h, m and s, standing for hours, minutes and seconds, each associated with the corresponding group in the regular expression. For example,
      ([012]?\d):([012345]\d):([012345]\d) hms
      means that the first group stands for the hours, the second for the minutes and the third for the seconds.
      In a time rule, H means that the group represents a half-day (see below).
    • For ordinals representing days (e.g. "1st"), D, which corresponds to the first group of the regular expression in the first column (the first pair of parentheses).
    • For "half-days" (afternoon, morning, AM, PM), either H0 (AM) or H1 (PM)
    • The Z means that the group represents a time zone.
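The group-to-letter mapping described above can be sketched as follows. This is only an illustration of how such a rule might be interpreted, not DCTFinder's actual code: the class and method names are hypothetical, the two columns are passed as separate strings because the column separator is not specified here, and only one-letter-per-group values such as DMY or hms are handled (not Mn, H0/H1 or day-of-week numbers).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRuleSketch {
    // Hypothetical helper: pairs the i-th letter of the normalized value
    // with capturing group i+1 of the first match found in the text.
    static Map<Character, String> apply(String regex, String normalized, String text) {
        Map<Character, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        if (m.find()) {
            for (int i = 0; i < normalized.length() && i < m.groupCount(); i++) {
                fields.put(normalized.charAt(i), m.group(i + 1));
            }
        }
        return fields; // empty when the rule does not match
    }

    public static void main(String[] args) {
        // The two example rules quoted in this section.
        String dateRegex = "([0123]?\\d)[-./]([012]?\\d)[-./]([2]\\d\\d\\d)";
        String timeRegex = "([012]?\\d):([012345]\\d):([012345]\\d)";
        System.out.println(apply(dateRegex, "DMY", "Published on 25/12/2013")); // {D=25, M=12, Y=2013}
        System.out.println(apply(timeRegex, "hms", "at 14:05:09"));             // {h=14, m=05, s=09}
    }
}
```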

Important! If you add a file to this vocabulary directory or remove one from it, run the script compile-rules.sh to update the list of rule files.

tags

  • time-tag-trigger.txt contains regular expressions matching texts that can be used in the class or id attributes of HTML tags containing only a date.
  • title-tag-trigger.txt contains regular expressions matching texts that can be used in the class or id attributes of HTML tags containing the title of the web page.
  • title-tag-anti-trigger.txt contains regular expressions matching texts that can NEVER be used in the class or id attributes of HTML tags containing the title of the web page.
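One plausible reading of how title triggers and anti-triggers interact is sketched below: an anti-trigger match vetoes the tag, otherwise any trigger match accepts it. The patterns are invented for illustration and are not taken from the distributed files, and partial matching with find() is an assumption about how the expressions are applied.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TitleTagSketch {
    // Hypothetical patterns, NOT the contents of title-tag-trigger.txt
    // or title-tag-anti-trigger.txt.
    static final List<Pattern> TRIGGERS =
            Arrays.asList(Pattern.compile("title"), Pattern.compile("headline"));
    static final List<Pattern> ANTI_TRIGGERS =
            Arrays.asList(Pattern.compile("subtitle"), Pattern.compile("site-title"));

    // Returns true when a class or id value may mark the page title:
    // anti-triggers are checked first and always win.
    static boolean mayHoldTitle(String attributeValue) {
        for (Pattern p : ANTI_TRIGGERS) {
            if (p.matcher(attributeValue).find()) {
                return false;
            }
        }
        for (Pattern p : TRIGGERS) {
            if (p.matcher(attributeValue).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(mayHoldTitle("article-title")); // true
        System.out.println(mayHoldTitle("subtitle"));      // false
        System.out.println(mayHoldTitle("sidebar"));       // false
    }
}
```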

urls

The file date-in-url.txt follows the same format as the date rule files of the vocabulary directory. Its rules are matched against the URL of the web page in order to extract a date from this URL, if one is present.
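To illustrate the idea, here is a minimal sketch that applies one URL rule and builds a date from the matched groups. The rule shown (a regular expression with the normalized value YMD) is a made-up example and is not taken from date-in-url.txt. The class and method names are hypothetical as well.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlDateSketch {
    // Hypothetical rule, NOT copied from date-in-url.txt:
    // first column (regular expression) and second column (normalized value).
    static final Pattern REGEX = Pattern.compile("/([12]\\d\\d\\d)/([01]?\\d)/([0123]?\\d)/");
    static final String NORMALIZED = "YMD";

    static Optional<LocalDate> dateFromUrl(String url) {
        Matcher m = REGEX.matcher(url);
        if (!m.find()) {
            return Optional.empty();
        }
        int year = 0, month = 0, day = 0;
        // The i-th letter of the normalized value names capturing group i+1.
        for (int i = 0; i < NORMALIZED.length(); i++) {
            int value = Integer.parseInt(m.group(i + 1));
            switch (NORMALIZED.charAt(i)) {
                case 'Y': year = value; break;
                case 'M': month = value; break;
                case 'D': day = value; break;
                default: break;
            }
        }
        try {
            return Optional.of(LocalDate.of(year, month, day));
        } catch (DateTimeException e) {
            return Optional.empty(); // matched digits that do not form a valid date
        }
    }

    public static void main(String[] args) {
        System.out.println(dateFromUrl("http://example.org/2013/12/25/some-article.html")); // Optional[2013-12-25]
        System.out.println(dateFromUrl("http://example.org/about/"));                       // Optional.empty
    }
}
```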

