Stingray - Schema-Based File Reader Blog

Simple handling of numerous data file formats, even COBOL EBCDIC.

Brought to you by: slott56

Bad Embedded Schema

Without naming names or pointing the finger of blame, let's talk about datasets that are created without much care. To be a little more specific, this is econometric data. Lots of data that's pretty well defined and mostly regular.

Except.

Attributes like ZIP code (or Postal code) often become part of the dataset for a number of reasons. The ZIP code is commonly used to disambiguate similarly-named business enterprises.

What sometimes happens is that your dataset is a "join" between company information, some historical stuff and some performance stuff. Each of these three collections (company, history and performance) has company name, an internal surrogate key and a ZIP code.

The ZIP code gets repeated in the resulting dataset. And the column name is "ZIP CODE" in all of the repeats.

Ideally, they're all the same. And it doesn't matter. But what about attributes which have the same name, but don't have the same value?

The csv.DictReader Problem

The Python csv.DictReader is very handy. It plucks the top row off the sheet to become the attribute names.

However.

Duplicated attribute names are (effectively) ignored.

It's not clear if this is a data quality issue or a bug in csv. One can argue both points. Indeed, my position is that it's a data quality issue and the organization preparing the data set really needs to resolve this.

When we lack a strong negotiating position, there isn't much we can do to get the data fixed. ("We have lots of customers, no one ever complained before. Go away.")

Enter Stingray Reader. With stingray, we can get both the column name schema and we can also preserve the original column ordering. This allows us to figure out a way to meaningfully handle duplicated column names.

The Stingray documentation suggests that we can trivially turn a positional schema into a CSV-like mapping from attribute name to value.

row_by_name = dict(
    (a.name, row_seq.cell(a)) for a in schema
    )
data= row_by_name['column'],to_digit_str(5)

This -- of course -- has the csv.DictReader "bug" of not handling duplicated column names correctly.

Handling Duplicated Column Names

What we can do to handle duplicated column names is only a hair more complex.

schema_by_name = defaultdict( list )
for a in schema:
    schema_by_name[a.name].append( a )

This gives us schema_by_name['name'][index] to pick an occurrence number. For most well-behaved columns, the value of index is zero. For those duplicated columns, however, we have index that is non-zero.

This does make row references a little complex.

zip_2= row.cell(schema_by_name['ZIP'][2]).to_digit_str(5)

However, it also makes them correct and complete.

Posted by 2011-10-06 Labels: python csv stingray schema