read large volumes of data from disk using stxxl

2014-08-27
2014-09-04
  • Isaac Perez
    2014-08-27

    Greetings.

    I'm working on a program that reads and writes SEGY files, which contain large volumes of data. I am looking for the most efficient way to read such volumes of data; loading time is especially important to me.

    My program uses the STXXL library. Generally speaking, I am looking for any advice or help on reading large blocks of data with STXXL.

    In particular, I wonder whether it is possible to load a block of data from disk directly into an STXXL container such as stxxl::vector. Is this possible, and if so, how? If not, what other options do I have?

    Thanks in advance for any help on this problem.

     
  • Timo Bingmann
    2014-08-28

    I don't know what SEGY files are, but the best way to load data into an STXXL program is to use the stxxl::vector(file*) constructor. It only maps the file's data virtually into a vector; it does not read the file up front. The binary file must, of course, contain an array of the same fixed-length data type as the one defined for the stxxl::vector, but that can usually be arranged up front.
    Of course, one should then use the usual parallel disk interface for the real data processing.
    But the virtually mapped vector removes the overhead of reading input and writing output.

     
  • Isaac Perez
    2014-08-28

    Thanks for your reply.

    As for SEGY files, I am not an expert, but from what I have read they contain several types of data: integers, floats, etc. Based on what you've told me, I've thought about mapping the file to a stxxl::vector< unsigned char > and then processing and extracting the data from that vector. Would that be better than reading directly from the file?

    On the other hand (excuse my ignorance), I do not know exactly what you mean by "the usual parallel disk interface". Is it part of the STXXL library?

     
  • Timo Bingmann
    2014-09-02

    You can map any fixed-structure file with an stxxl::vector by specifying the appropriate C struct as the value type.
    A vector of <unsigned char> would be no better than a file mapped with mmap().
    Sorry, by the "parallel disk interface" I meant the usual way STXXL stores temporary data on multiple disks, configured via the .stxxl file. Since the file-based stxxl::vector doesn't use this, you don't get any advantage from parallel disks.
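
    To make "fixed-structure file" concrete, here is a minimal sketch in plain standard C++ (it does not use STXXL). It assumes a hypothetical record type `Sample`, with helper names `write_samples` and `read_sample` invented for illustration. The point is only that when every record has the same size, record i sits at byte offset i * sizeof(Sample), which is the kind of layout a file-backed vector of that struct would rely on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical fixed-length record; the field names are illustrative only.
struct Sample {
    std::int32_t trace_id;
    float        value;
};

// A record that is dumped to disk byte-for-byte must be trivially copyable,
// and its size must be known and stable.
static_assert(std::is_trivially_copyable<Sample>::value,
              "Sample must be trivially copyable");
static_assert(sizeof(Sample) == 8, "unexpected padding in Sample");

// Write the records back to back, so the file is a plain array of Sample.
bool write_samples(const std::string& path, const std::vector<Sample>& samples) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(samples.data()),
              static_cast<std::streamsize>(samples.size() * sizeof(Sample)));
    out.flush();
    return static_cast<bool>(out);
}

// Read record `index` by seeking to index * sizeof(Sample).
// Returns false if the file cannot be opened or the record is not there.
bool read_sample(const std::string& path, std::size_t index, Sample& result) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.seekg(static_cast<std::streamoff>(index * sizeof(Sample)));
    in.read(reinterpret_cast<char*>(&result), sizeof(Sample));
    return in.gcount() == static_cast<std::streamsize>(sizeof(Sample));
}
```

    A file with variable-length records has no such offset formula, which is why it cannot be described by a single struct in this way.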

     
    Last edit: Timo Bingmann 2014-09-02
  • Isaac Perez
    2014-09-03

    SEGY files can have variable-length fields. More precisely, the data consists of a set of n blocks (n unspecified), where each block contains a control header and a sequence of floating-point values. The number of floating-point values is specified in the header of each block and can vary from block to block.

    I explain this because I think that, in this case, it is not possible to define a C struct and map the SEGY file directly onto it. The classes that model the SEGY file hold several pointers and dynamically allocate the memory they need. Since the data size exceeds the available memory, I'm trying to incorporate STXXL to solve this issue. That's what I'm currently trying to do.

    The important question here is this: given the nature of the file, it seems to me (although I would like to be wrong) that my best option is to use a stxxl::vector<uchar>. Are there better options?

     
    • Hi Isaac,
      Some time ago I had a similar problem, where I converted data formats
      to STXXL-compatible formats.
      Have you tried Boost memory-mapped files?
      You can read the control blocks each time and seek to the block you
      need. You should not have to allocate memory for the data.
      Performance depends on the storage, so test it before production
      runs. I use lustrefs with a striped folder.
      If you use lustrefs, you should mount it with local flock support.

      Arman.
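
      The "read the control blocks and seek" idea can be sketched as follows in plain standard C++. This is only an illustration under loud assumptions: the header here is a hypothetical, simplified stand-in that holds nothing but a sample count (it is NOT the real SEGY layout), and the names `BlockHeader`, `BlockEntry`, `write_blocks`, `build_index` and `read_block` are invented for the example. One pass over the headers records where each block's samples start, so that a single block can later be loaded without holding the whole file in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical, simplified block header (NOT the real SEGY layout).
struct BlockHeader {
    std::uint32_t sample_count;  // number of floats that follow this header
};

// One index entry per block: where its samples start and how many there are.
struct BlockEntry {
    std::streamoff data_offset;
    std::uint32_t  sample_count;
};

// Helper that writes a test file: header, samples, header, samples, ...
bool write_blocks(const std::string& path,
                  const std::vector<std::vector<float>>& blocks) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    for (const std::vector<float>& samples : blocks) {
        BlockHeader header{static_cast<std::uint32_t>(samples.size())};
        out.write(reinterpret_cast<const char*>(&header), sizeof(header));
        out.write(reinterpret_cast<const char*>(samples.data()),
                  static_cast<std::streamsize>(samples.size() * sizeof(float)));
    }
    out.flush();
    return static_cast<bool>(out);
}

// Scan the file once: read each header, remember where its samples begin,
// then seek past the samples without reading them.
// Note: this sketch does not validate a truncated final block.
std::vector<BlockEntry> build_index(const std::string& path) {
    std::vector<BlockEntry> index;
    std::ifstream in(path, std::ios::binary);
    BlockHeader header;
    while (in.read(reinterpret_cast<char*>(&header), sizeof(header))) {
        std::streamoff data_offset = in.tellg();
        index.push_back({data_offset, header.sample_count});
        in.seekg(static_cast<std::streamoff>(header.sample_count) *
                     static_cast<std::streamoff>(sizeof(float)),
                 std::ios::cur);
    }
    return index;
}

// Load the samples of one block on demand; returns an empty vector on error.
std::vector<float> read_block(const std::string& path, const BlockEntry& entry) {
    std::vector<float> samples(entry.sample_count);
    std::ifstream in(path, std::ios::binary);
    in.seekg(entry.data_offset);
    in.read(reinterpret_cast<char*>(samples.data()),
            static_cast<std::streamsize>(samples.size() * sizeof(float)));
    if (!in) samples.clear();
    return samples;
}
```

      The index itself is small (one entry per block), so it may fit in memory even when the sample data does not. Whether this is faster than a memory-mapped file depends on the storage and should be measured.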
