token offset

2015-07-11
2015-07-16
  • Giuseppe Attardi

    I need to retain the character offset of tokens, in order to transfer the text annotations to the tokens.

    I am using qlex.tell() like this:

    quex::tokenizer_it qlex(&cin, "UTF-8");
    qlex.token_p_switch(&token);
    while (true) {
        // (*) get next token from the token stream
        const QUEX_TYPE_TOKEN_ID TokenID = qlex.receive();
    
        // (*) check against 'termination'
        if (TokenID == QUEX_TKN_TERMINATION)
          return 0;
        else if (TokenID == QUEX_TKN_EOS) {
          cout << endl;
        } else {
          // (*) print out token
          int offset = qlex.tell() - token.text.size();
          cout << quex::unicode_to_char(token.text) << '\t' << offset << endl;
        }
    }
    

    and it works properly.
    However, if I use the buffer primitives, which is the suggested solution, like this:

    quex::tokenizer_it qlex((QUEX_TYPE_CHARACTER*)0, 0, 0, "UTF-8");
    qlex.token_p_switch(&token);
    while (cin) {
      qlex.buffer_conversion_fill_region_prepare();
      // Read a line from standard input
      cin.getline((char*)qlex.buffer_conversion_fill_region_begin(),
                  qlex.buffer_conversion_fill_region_size());
      if (cin.gcount() == 0)
        return 0;
      qlex.buffer_conversion_fill_region_finish(cin.gcount() - 1);
      while (true) {
        const QUEX_TYPE_TOKEN_ID TokenID = qlex.receive();
    
        // (*) check against 'termination'
        if (TokenID == QUEX_TKN_TERMINATION)
          break;
        else if (TokenID == QUEX_TKN_EOS) {
          cout << endl;
        } else {
          // (*) print out token
          int offset = qlex.tell() - token.text.size();
          cout << quex::unicode_to_char(token.text) << '\t' << offset << endl;
        }
      }
    }
    

    the offsets decrease at some point, and I obtain:

    semanas 239
    según   247
    refiere 253
    .       260
    Anamnesis       256
    Cuartigesta     256
    que     268
    es      272
    traída  275
    a       282
    

    Notice that after "Anamnesis" it stays at 256 before restarting.
    The relevant portion of the input is:

    semanas según refiere.
    Anamnesis
    Cuartigesta que es traída a
    

    I am using version 0.64.8.
    Any suggestion for a fix?

     
  • Giuseppe Attardi

    Further investigation shows that the value returned by tell() is reset to 256 by qlex.buffer_conversion_fill_region_finish(), whenever it is > 256.

     
  • Giuseppe Attardi

    More precisely, the count is reset to 256 within the call to:

    QUEX_NAME(Buffer_move_away_passed_content)(&me->buffer);

    in QUEX_NAME(buffer_conversion_fill_region_end)
    in file:
    quex/code_base/analyzer/member/buffer-access.i

     
  • Giuseppe Attardi

    The bug seems to be that QUEX_NAME(buffer_conversion_fill_region_end) invokes:
    QUEX_NAME(Buffer_move_away_passed_content)(&me->buffer);
    instead of:
    QUEX_NAME(BufferFiller_Converter_move_away_passed_content)((QUEX_NAME(BufferFiller_Converter)<void>*)me->buffer.filler);

     
  • Giuseppe Attardi

    The problem still remains with function:

    qlex.buffer_fill_region_append_conversion_direct()
    

    I suppose it is similar, but I need help in fixing it.

     
  • Giuseppe Attardi

    I fixed the last problem by replacing:

    QUEX_NAME(Buffer_move_away_passed_content)(&me->buffer);
    

    with:

    QUEX_NAME(BufferFiller_Converter_move_away_passed_content)((QUEX_NAME(BufferFiller_Converter)<void>*)me->buffer.filler);
    

    also in:

    QUEX_NAME(buffer_fill_region_append_core)
    

    but I don't know if this might affect some other uses of the function.

     
  • Frank-Rene Schäfer

    No. The 'move-away' of content of the raw buffer must have
    happened in 'buffer_conversion_fill_region_prepare()'.

    I would love to analyze this. But to get me started, please
    provide a minimal example (.cpp, .qx, .txt and Makefile).

    Thanks

     
  • Giuseppe Attardi

    I already posted the code.

     
  • Frank-Rene Schäfer

    Ok, I have now been working for more than half an hour just
    to collect your code, adapt it and get something running.

    Upon 'cat example.txt | lexer' the package that I attached produces:

    semanas 0
    según   8
    refiere 14
    Anamnesis       22
    Cuartigesta     31
    que     43
    es      47
    traída  50
    a       57
    

    Please feel free to adapt the package's content so that it
    reproduces your issue.

     
  • Giuseppe Attardi

    You must use a file that is longer than 256 characters to see the problem.

     
  • Frank-Rene Schäfer

    The configuration parameter

    QUEX_SETTING_BUFFER_MIN_FALLBACK_N
    

    is set to 256. So, when the old content of the engine's buffer is
    moved away, it always keeps this number of bytes and pastes the
    new content after it. The 'prepare()' function does little more than
    move the old content away. The 'finish()' function, in the case
    of converters, first fills a 'raw buffer', then converts its data
    into the engine's buffer. Once the limit of 256 is reached, the
    engine's new content will always start at this offset.
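
    The effect described above can be reproduced with a toy model. This is
    only a sketch, not quex code: `ToyBuffer`, `load()` and `fallback_n` are
    hypothetical names, and the real engine does considerably more. It assumes
    that each refill keeps at most `fallback_n` old characters and places the
    new content right after them.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Toy model (NOT quex code): a buffer that keeps at most `fallback_n`
    // characters of old content whenever new content is loaded.
    class ToyBuffer {
    public:
        explicit ToyBuffer(std::size_t fallback_n) : fallback_n_(fallback_n) {}

        // Drops all old content except its last `fallback_n_` characters,
        // appends `chunk`, and returns the buffer-relative index at which
        // `chunk` begins.
        std::size_t load(const std::string& chunk) {
            const std::size_t keep = std::min(content_.size(), fallback_n_);
            content_.erase(0, content_.size() - keep);
            content_ += chunk;
            // Invariant: the buffer holds the kept tail plus the new chunk.
            assert(content_.size() == keep + chunk.size());
            return keep;
        }

    private:
        std::size_t fallback_n_;
        std::string content_;
    };
    ```

    With `fallback_n` set to 256, loading chunks of 100, 200 and 50 characters
    gives start positions 0, 100 and 256, and every later chunk also starts
    at 256. That is consistent with the offsets reported earlier in this thread.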

     
  • Giuseppe Attardi

    So, what is the solution to get the proper offsets from tell(), rather than the offsets from 256?

     
  • Frank-Rene Schäfer

    Since stream navigation was considered meaningless for direct
    buffer access, the 'tell()' function was not designed to do
    anything useful there. Having pondered it for a while, I think it
    would make sense to accumulate the number of characters and
    report the character index of the current input pointer,
    the way you need it.
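
    One way to picture the accumulation idea: keep a running count of the
    characters dropped from the front of the buffer and add it to the
    buffer-relative position. The sketch below is an assumption about how this
    could look, not the actual quex implementation; `OffsetTracker`,
    `on_discard()` and `absolute()` are hypothetical names.

    ```cpp
    #include <cstddef>

    // Hypothetical helper (NOT part of quex): turns buffer-relative positions
    // into stream-absolute character offsets.
    class OffsetTracker {
    public:
        // Call whenever `n` characters are dropped from the front of the buffer.
        void on_discard(std::size_t n) { discarded_ += n; }

        // Absolute offset of a position given relative to the buffer start.
        std::size_t absolute(std::size_t buffer_relative) const {
            return discarded_ + buffer_relative;
        }

    private:
        std::size_t discarded_ = 0;  // total characters dropped so far
    };
    ```

    For example, after 44 characters have been discarded, a buffer-relative
    position of 256 corresponds to stream offset 300.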

    I will also consider allowing stdin as stream input, at least for
    conversions. The conversion loop seems stable enough to handle
    interrupted input.

    The new version will probably be out at the end of this week.

     
  • Giuseppe Attardi

    Thanks, I look forward to it.

     
  • Frank-Rene Schäfer

    Could you please file a short bug report for this issue? That way,
    I have a reference for the regression tests.

    Thanks.

     
