
#569 a grobid example

AMBER
open
nobody
None
5(default)
2015-09-15
2015-09-15
No

The text of the standard does a really good job of showing examples of how to encode textually interesting pieces of text (for widely varying definitions of 'textually interesting').

What it perhaps does less well is show how to encode computationally interesting pieces of text.

I'd like to propose adding this example https://github.com/kermitt2/grobid/blob/master/grobid-trainer/resources/dataset/citation/corpus/0953-2048_24_7_075001.header-reference.xml , which is a short file from the grobid training corpus, along with a suitable covering comment, to http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-listBibl.html. Maybe:

"The following <listBibl> is taken from the grobid training corpus. It shows a pair of citations to the same article in the journal Superconductor Science and Technology, illustrating the variety of order and presence of the elements within the <bibl>. Such variety is important because it avoids over-fitting the probabilistic model which grobid generates from the training data."

If there is general agreement, I can work up a patch.

cheers
stuart
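
To make the "variety of order and presence" point concrete, here is a minimal Python sketch that lists the child elements of each <bibl> in a <listBibl> and reports which elements are absent from each entry. The sample markup is invented for illustration and is not the content of the grobid training file; the function name `element_orders` is hypothetical, and the sketch assumes the input carries no XML namespace.

```python
import xml.etree.ElementTree as ET

# Invented sample for illustration only; this is NOT the content of the
# grobid training file. Both entries describe the same placeholder article.
SAMPLE = """
<listBibl>
  <bibl><author>A. Author</author> <title level="j">Journal Name</title> <biblScope unit="volume">1</biblScope> <date>2000</date></bibl>
  <bibl><title level="j">Journal Name</title> <date>2000</date> <author>A. Author</author></bibl>
</listBibl>
"""


def element_orders(list_bibl_xml):
    """Return, for each <bibl>, the tuple of child element names in document order."""
    root = ET.fromstring(list_bibl_xml)
    return [tuple(child.tag for child in bibl) for bibl in root.iter("bibl")]


orders = element_orders(SAMPLE)
for order in orders:
    print(" ".join(order))

# Report which elements seen anywhere in the list are absent from each entry.
all_tags = set().union(*orders)
for order in orders:
    print("missing:", sorted(all_tags - set(order)))
```

Run against a real training file, a summary like this would give a quick view of how much the entries actually differ in order and coverage.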

Discussion

  • Martin Holmes

    Martin Holmes - 2015-09-15

    What does this biblScope mean?

    <biblScope type="pp">075001 (7pp)</biblScope>
    
     
  • Laurent Romary

    Laurent Romary - 2015-09-15

    Seems to be an IOP-specific numbering for article sequence (see http://iopscience.iop.org/0953-2048/24/7). Note that Patrice is aware he should be using @unit instead of @type on biblScope, but software development is what it is, as we know.

     
  • Patrice Lopez

    Patrice Lopez - 2015-09-15

    I haven't updated the training data format in many years... it should be <biblScope unit="page">!
    So we have the first page of the article and its number of pages, which is enough for a tool like Grobid to then normalize the page range. That's why the whole "raw" chunk of information is identified by <biblScope>.

    Patrice (Grobid author :)

     

    Last edit: Patrice Lopez 2015-09-15
  • Martin Holmes

    Martin Holmes - 2015-09-15

    I think encoding a start page and a number of pages in this rather cryptic way is unusual, and I don't think it's something we would want to exemplify in the Guidelines. Take a look at the existing examples for biblScope:

    http://www.tei-c.org/release/doc/tei-p5-doc/en/html/examples-biblScope.html

    If this is supposed to be an example whose primary purpose is to provide input for corpus processing tools, it would be more appropriate to encode page information in more easily parsable ways such as using @from and @to on biblScope, don't you think?
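
As a rough illustration of the two positions above, the following Python sketch parses a raw string such as "075001 (7pp)" into a first page and a page count, then builds a <biblScope unit="page"> element that keeps the raw text and records the first page in @from. This is not Grobid's code: the regular expression and the function names `parse_raw_scope` and `to_bibl_scope` are hypothetical. It deliberately leaves @to unset, since the thread does not say how an end page should be derived from an IOP article number.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern for strings like "075001 (7pp)": a first page
# (or article number) followed by a page count in parentheses.
RAW_SCOPE = re.compile(r"^\s*(?P<first>\d+)\s*\(\s*(?P<count>\d+)\s*pp\.?\s*\)\s*$")


def parse_raw_scope(raw):
    """Return (first, count) for 'first (Npp)' strings, or None if no match."""
    match = RAW_SCOPE.match(raw)
    if match is None:
        return None
    return match.group("first"), int(match.group("count"))


def to_bibl_scope(raw):
    """Build a <biblScope unit="page"> element, keeping the raw text as content."""
    scope = ET.Element("biblScope", {"unit": "page"})
    scope.text = raw
    parsed = parse_raw_scope(raw)
    if parsed is not None:
        first, _count = parsed
        # Only @from is set: how to compute an end page from an article
        # number plus a page count is left open here (an assumption).
        scope.set("from", first)
    return scope


print(ET.tostring(to_bibl_scope("075001 (7pp)"), encoding="unicode"))
```

The first page is kept as a string so that the leading zero in "075001" survives, which matters if the value is really an article identifier and not an ordinary page number.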