[Pytables-users] advice on data representation

Hello,

I've started to write a parser to convert ASTERIX data to HDF5, but I have some problems representing all the data.

I use table objects. I've defined a class for each category record (a record is made of different data items).
See below for an example for category 30.

1. Some data items are optional. Is there a good way to mark a column as valid?
For enum, I can easily add an "uninitialized" default value.
For (U)Int, most of the time there is no good default value that could tell me if the data is valid or not.
I thought about using np.nan, but that's only for float.
Or I could add a bool variable (valid) to each column.
Is there another way?

2. Some data items have a variable length (in fact some fields can be repeated).
In the class I030_050_DESC for example, if FX is set to 1, then the fields are repeated.
I know that fields cannot be of a variable length in a table object.
I could try to use a shape (max_length,) for those columns but the max length would be a bit arbitrary as there is no theoretical limit (even if in practice, it is often quite low).
Or should I try to represent the data using a VLArray?
I found it quite natural to represent my data as a table and I don't really see how I could do the same with an array.
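
A minimal sketch of two of the options above, assuming PyTables 3 or later (`open_file`, `create_table`, `create_vlarray`). The `valid` column, the cut-down `DemoRecord`, the sample values and the companion array name `I030_050_STN` are all made up for illustration. The idea is a boolean flag per optional data item, plus a VLArray whose row i holds the repeated values belonging to row i of the table:

```python
import tables

class I030_180_DESC(tables.IsDescription):
    """Calculated Track Velocity (Polar) with a hypothetical validity flag."""
    valid = tables.BoolCol(dflt=False, pos=0)  # assumption: False = item absent
    SPEED = tables.UInt16Col(pos=1)
    HEADING = tables.UInt16Col(pos=2)

class DemoRecord(tables.IsDescription):
    """Cut-down record used only for this sketch."""
    ff_timestamp = tables.Time32Col(pos=0)
    I030_180 = I030_180_DESC()

# Made-up samples: (timestamp, (speed, heading) or None, STN of each extent)
samples = [
    (1, (250, 90), [12]),
    (2, None, [12, 47, 301]),
]

# In-memory HDF5 file, so nothing is written to disk
h5 = tables.open_file("demo.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0)
table = h5.create_table("/", "cat030", DemoRecord)
# Row i of this VLArray holds the repeated values for row i of the table
stn_extents = h5.create_vlarray("/", "I030_050_STN", atom=tables.UInt16Atom())

row = table.row
for timestamp, velocity, stns in samples:
    row["ff_timestamp"] = timestamp
    row["I030_180/valid"] = velocity is not None
    # Absent items get placeholder zeros; only the flag says they are unset
    speed, heading = velocity if velocity is not None else (0, 0)
    row["I030_180/SPEED"] = speed
    row["I030_180/HEADING"] = heading
    row.append()
    stn_extents.append(stns)  # variable number of values per record
table.flush()
stn_extents.flush()

data = table.read()
valid_flags = data["I030_180"]["valid"].tolist()
valid_speeds = data["I030_180"]["SPEED"][data["I030_180"]["valid"]].tolist()
second_record_stns = stn_extents[1].tolist()
h5.close()
```

The table keeps its fixed-width layout, and the only coupling between the two objects is the shared row index, so both have to be appended to in the same order.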

Cheers,

Benjamin

import tables

class I030_180_DESC(tables.IsDescription):
    """Calculated Track Velocity (Polar)"""
    SPEED = tables.UInt16Col(pos=0)
    HEADING = tables.UInt16Col(pos=1)

class I030_181_DESC(tables.IsDescription):
    """Calculated Track Velocity (Cartesian)"""
    X = tables.Int16Col(pos=0)
    Y = tables.Int16Col(pos=1)

class I030_340_DESC(tables.IsDescription):
    """Last Measured Mode 3/A"""
    V = tables.EnumCol(tables.Enum({
        "Code validated": 0,
        "Code not validated": 1,
        "uninitialized": 255
        }), "uninitialized",
        base="uint8",
        pos=0)
    G = tables.EnumCol(tables.Enum({
        "Default": 0,
        "Garbled code": 1,
        "uninitialized": 255
        }), "uninitialized",
        base="uint8",
        pos=1)
    L = tables.EnumCol(tables.Enum({
        "MODE 3/A code as derived from the reply of the transponder,": 0,
        "Smoothed MODE 3/A code as provided by a local tracker": 1,
        "uninitialized": 255
        }), "uninitialized",
        base="uint8",
        pos=2)
    sb = tables.UInt8Col(pos=3)
    mode_3_a = tables.UInt16Col(pos=4)

class I030_400_DESC(tables.IsDescription):
    """Callsign"""
    callsign = tables.StringCol(7, pos=0)

class I030_050_DESC(tables.IsDescription):
    """Artas Track Number"""
    AUI = tables.UInt8Col(pos=0)
    unused = tables.UInt8Col(pos=1)
    STN = tables.UInt16Col(pos=2)
    FX = tables.EnumCol(tables.Enum({
        "end of data item": 0,
        "extension into next extent": 1,
        "uninitialized": 255
        }), "uninitialized",
        base="uint8",
        pos=3)

class I030Record(tables.IsDescription):
    """Cat 030 record"""
    ff_timestamp = tables.Time32Col()
    I030_010 = I030_010_DESC()
    I030_015 = I030_015_DESC()
    I030_030 = I030_030_DESC()
    I030_035 = I030_035_DESC()
    I030_040 = I030_040_DESC()
    I030_070 = I030_070_DESC()
    I030_170 = I030_170_DESC()
    I030_100 = I030_100_DESC()
    I030_180 = I030_180_DESC()
    I030_181 = I030_181_DESC()
    I030_060 = I030_060_DESC()
    I030_150 = I030_150_DESC()
    I030_140 = I030_140_DESC()
    I030_340 = I030_340_DESC()
    I030_400 = I030_400_DESC()
...
    I030_210 = I030_210_DESC()
    I030_120 = I030_120_DESC()
    I030_050 = I030_050_DESC()
    I030_270 = I030_270_DESC()
    I030_370 = I030_370_DESC()

From: Anthony Scopatz [mailto:sc...@gm...]
Sent: 12 July 2012 00:02
To: Discussion list for PyTables
Subject: Re: [Pytables-users] advice on using PyTables

Hello Benjamin,

Not knowing too much about the ASTERIX format, other than what you said and what is in the links, I would say that this is a good fit for HDF5 and PyTables. PyTables will certainly help you read in the data and manipulate it.

However, before you abandon hachoir completely, I will say it is a lot easier to write HDF5 files in PyTables than to use the HDF5 C API. If hachoir is too slow, have you tried profiling the code to see what is taking up the most time? Maybe you could just rewrite these parts in C? Have you looked into Cythonizing it? Also, you don't seem to be using numpy to read in the data... (there are some tricks needed for ASTERIX here, but nothing insurmountable).
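
A minimal sketch of the profiling step, using only the standard library. `parse_records` is a hypothetical stand-in for the real hachoir-based parser, and the input bytes are made up:

```python
import cProfile
import io
import pstats

def parse_records(data):
    """Hypothetical stand-in for the slow parser under investigation."""
    return [byte & 0x7F for byte in data]

profiler = cProfile.Profile()
profiler.enable()
result = parse_records(bytes(range(256)) * 100)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The functions at the top of the cumulative-time listing are the natural candidates for a C or Cython rewrite.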

I ask the above just so you don't have to rewrite everything from scratch. You are correct, though, that pure Python is probably not sufficient. Feel free to ask more questions here.

Be Well
Anthony

On Wed, Jul 11, 2012 at 6:52 AM, <ben...@lf...> wrote:
Hi,

I'm working with Air Traffic Management and would like to perform checks / compute statistics on ASTERIX data.
ASTERIX is an ATM Surveillance Data Binary Messaging Format (http://www.eurocontrol.int/asterix/public/standard_page/overview.html)

The data consist of a concatenation of consecutive data blocks.
Each data block consists of data category + length + records.
Each record is of variable length and consists of several data items (that are well defined for each category).
Some data items might be present or not depending on a field specification (bitfield).
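
A minimal sketch of that layout in plain Python. It assumes the usual ASTERIX conventions: a one-byte category, a two-byte big-endian length that counts the three header bytes, and a field specification in which each octet carries seven presence bits followed by an extension (FX) bit. The function names and the sample bytes are made up for illustration:

```python
import struct

def iter_data_blocks(buf):
    """Yield (category, payload) for each data block in buf."""
    offset = 0
    while offset + 3 <= len(buf):
        # CAT: 1 byte, LEN: 2 bytes big-endian, header included in LEN
        category, length = struct.unpack_from(">BH", buf, offset)
        if length < 3 or offset + length > len(buf):
            raise ValueError("corrupt data block at offset %d" % offset)
        yield category, buf[offset + 3:offset + length]
        offset += length

def read_fspec(payload, offset=0):
    """Return (present item numbers, offset of the first data item)."""
    present = []
    item_number = 1
    while True:
        octet = payload[offset]
        offset += 1
        for bit in range(7, 0, -1):  # seven presence bits, MSB first
            if octet & (1 << bit):
                present.append(item_number)
            item_number += 1
        if not octet & 1:  # FX bit clear: the field specification ends here
            break
    return present, offset

# Two made-up blocks: category 30 (7 bytes) and category 48 (4 bytes)
stream = bytes([30, 0, 7, 0xA1, 0x40, 0xAA, 0xBB, 48, 0, 4, 0x80])
blocks = list(iter_data_blocks(stream))
first_fspec = read_fspec(blocks[0][1])
```

Splitting a block into its individual records additionally needs the length of every data item in the category, which this sketch does not attempt.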

I started to write a parser using hachoir (https://bitbucket.org/haypo/hachoir/overview), a pure Python library.
But the parsing was far too slow and used a lot of memory.
That's not really usable.

From what I read, PyTables could really help to manipulate and analyze the data.
So I've been thinking about writing a tool (probably in C) to convert my ASTERIX format to HDF5.

Before I start, I'd like confirmation that this seems like a suitable application for PyTables.
Is there an approach other than writing a tool to convert the data to HDF5?

Thanks in advance

Benjamin