SAX Parser

  • Matt Harrison

    I have been using your excellent erlsom SAX parser for parsing XML files, and have been successfully parsing documents.

    I now have a 6.8 GB (yes, gigabyte) XML document that I need to parse, and obviously

    {ok, Xml} = file:read_file(SourceFile)

    fails with an enomem error.

    I notice there is a scan_file/2 function that will create a parsed Model, but I don't want the whole thing loaded.

    Is there an equivalent SAX function that reads the file in chunks and raises the SAX events, but never loads the whole file into memory?

    I could not find anything in the docs, and wondered if I had missed something.



    • Willem de Jong

      Yes, this is possible (and documented, but maybe not clear enough - are you using the latest version? Did you get erlsom from CEAN? That version is quite old).

      There is even an example included in the distribution that shows how to do it. I am copying part of it below.

      Good luck,

      %% Example to show how the Erlsom Sax parser can be used in combination
      %% with a 'continuation function'. This enables parsing of very big documents
      %% in a sort of streaming mode.
      %% When the sax parser reaches the end of a block of data, it calls the
      %% continuation function. This should return the next block of data.
      %% the continuation function is a function that takes 2 arguments: Tail and
      %% State.
      %%    - Tail is the (short) list of characters that could not yet be
      %%      parsed, because it is not yet clear whether or not they form a
      %%      special token. Since this still has to be parsed, it should be
      %%      put in front of the next block of data.
      %%    - State is information that is passed by the parser to the callback
      %%      functions transparently. This can be used to keep track of the
      %%      location in the file etc.
      %% The function returns {NewData, NewState}, where NewData is the next
      %% block of data (a binary, or a list of unicode code points), and
      %% NewState is the new value for the State.

      %% 'chunk' is the number of bytes that is read at a time. This should
      %% be tuned for the best result. (109 is obviously not a good value;
      %% it should be bigger than that - try it out.)
      -define(chunk, 109).

      run() ->
         F = fun count_books/2,   %% the callback function that handles the sax events
         G = fun continue_file/2, %% the callback function that returns the next
                                  %% chunk of data
         %% open the file; xml() is assumed to return the name of the input file
         {ok, Handle} = file:open(xml(), [read, raw, binary]),
         Position = 0,
         CState = {Handle, Position, ?chunk},
         SaxCallbackState = undefined,
         %% erlsom:parse_sax() returns {ok, Result, TrailingBytes},
         %% where TrailingBytes is the rest of the input document
         %% that follows after the last closing tag of the XML, and Result
         %% is the value of the State after processing the last SAX event.
         {ok, Result, _TrailingBytes} =
           erlsom:parse_sax(<<>>, SaxCallbackState, F,
             [{continuation_function, G, CState}]),
         %% close file
         ok = file:close(Handle),

         %% Result is a list [{Date, Count}, ...]
         lists:foreach(fun({Date, Count}) ->
                        io:format("Date: ~p - count: ~p~n", [Date, Count])
                       end, Result).

      %% this is a continuation function that reads chunks of data
      %% from a file.
      continue_file(Tail, {Handle, Offset, Chunk}) ->
         %% read the next chunk
         case file:pread(Handle, Offset, Chunk) of
           {ok, Data} ->
             {<<Tail/binary, Data/binary>>, {Handle, Offset + Chunk, Chunk}};
           eof ->
             {Tail, {Handle, Offset, Chunk}}
         end.

      %% the sax event callback. (The original example is truncated here;
      %% the clauses below are only a minimal sketch that counts <book>
      %% elements, not the full logic from the erlsom distribution.)
      count_books(startDocument, _) ->
         orddict:new();
      count_books({startElement, _Uri, "book", _Prefix, _Attributes}, Counts) ->
         orddict:update_counter("book", 1, Counts);
      count_books(endDocument, Counts) ->
         orddict:to_list(Counts);
      count_books(_Event, Acc) ->
         Acc.
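
      The file-based pieces above can be exercised without a huge file: the continuation state is opaque to erlsom, so it can just as well be a list of pre-split binaries. The sketch below is only an illustration (the module name, input, and chunk boundaries are invented; erlsom:parse_sax/4 and the {continuation_function, Fun, State} option are used exactly as in the example above). It counts startElement events while the document arrives in three chunks, one of which deliberately splits a tag in the middle:

      ```erlang
      -module(sax_chunks).
      -export([run/0]).

      %% Feed erlsom:parse_sax/4 from a list of binaries instead of a file.
      %% The continuation function pops one chunk at a time, prepending the
      %% unparsed Tail, just as the file-based continue_file/2 above does.
      run() ->
          Chunks = [<<"<books><bo">>, <<"ok/><book/>">>, <<"</books>">>],
          CountStarts = fun({startElement, _Uri, _Name, _Prefix, _Attrs}, N) -> N + 1;
                           (_Event, N) -> N
                        end,
          Continue = fun(Tail, [Next | Rest]) -> {<<Tail/binary, Next/binary>>, Rest};
                        (Tail, [])            -> {Tail, []}
                     end,
          {ok, StartCount, _Trailing} =
              erlsom:parse_sax(<<>>, 0, CountStarts,
                               [{continuation_function, Continue, Chunks}]),
          StartCount.
      ```

      Splitting `<book` across two chunks is the interesting part: the unparsed `<bo` comes back as Tail and is prepended to the next chunk, which is exactly the mechanism the continuation function relies on.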