Performance issue while parsing csv

Help
Ankit G
2014-03-03
2014-03-04
  • Ankit G

    Ankit G - 2014-03-03

    Hi
    I'm using Super CSV to read a million records. I'm facing performance issues while reading CSV records into a bean - specifically with one particular column, which takes much longer to map than the others. The field is a simple String that takes the values "1" or "0".
    For some records, mapping this column takes a long time (around 500 ms), compared to the other columns, which map in around 2-5 ms each.
    Please find below an extract from the logs generated:

    2014-03-03 12:59:22,130 DEBUG org.dozer.MappingProcessor:352 - MAPPED: CsvDozerBeanData.columns --> CsvPatientDataInputAdapter$CsvPatient.ageInYears VALUES: 64 --> 64 MAPID:
    2014-03-03 12:59:22,131 DEBUG org.dozer.fieldmap.FieldMap:90 - Getting ready to invoke write method on the destination object. Dest Obj: CsvPatientDataInputAdapter$CsvPatient, Dest value: 2
    2014-03-03 12:59:22,630 DEBUG org.dozer.MappingProcessor:352 - MAPPED: CsvDozerBeanData.columns --> CsvPatientDataInputAdapter$CsvPatient.genderCode VALUES: 2 --> 2 MAPID:
    2014-03-03 12:59:22,630 DEBUG org.dozer.fieldmap.FieldMap:90 - Getting ready to invoke write method on the destination object. Dest Obj: CsvPatientDataInputAdapter$CsvPatient, Dest value: null

    Notice the sudden jump of around 500 ms while mapping the third field.
    Please suggest how to fix this.

    Thanks.
    Ankit

  • James Bassett

    James Bassett - 2014-03-03

    Hi Ankit,

    I can't see any reason why this is happening without a bit more information. A working example (e.g. on GitHub) that reproduces the issue would be useful.

    I'd be trying to isolate what is different between that column and the others.

    If you're certain it's that column, you could try ignoring it (i.e. supply null as the field mapping for that column) and see the performance difference. You could try this with various columns to get an approximate idea of how long each column takes.
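
    Something like this, for example (an untested sketch - the Patient bean and CSV data here are made-up stand-ins for your own):

```java
import java.io.StringReader;
import org.supercsv.io.dozer.CsvDozerBeanReader;
import org.supercsv.io.dozer.ICsvDozerBeanReader;
import org.supercsv.prefs.CsvPreference;

public class IgnoreColumnDemo {

    // Hypothetical stand-in for your CsvPatient bean
    public static class Patient {
        private int ageInYears;
        private String genderCode;
        public int getAgeInYears() { return ageInYears; }
        public void setAgeInYears(int ageInYears) { this.ageInYears = ageInYears; }
        public String getGenderCode() { return genderCode; }
        public void setGenderCode(String genderCode) { this.genderCode = genderCode; }
    }

    // Reads the first row, skipping the second (suspect) column entirely
    public static Patient readFirst(String csv) throws Exception {
        try (ICsvDozerBeanReader reader = new CsvDozerBeanReader(
                new StringReader(csv), CsvPreference.STANDARD_PREFERENCE)) {
            // A null entry in the field mapping tells Super CSV to ignore that column
            reader.configureBeanMapping(Patient.class,
                    new String[] { "ageInYears", null, "genderCode" });
            return reader.read(Patient.class);
        }
    }

    public static void main(String[] args) throws Exception {
        Patient p = readFirst("64,IGNORED,2\n");
        System.out.println(p.getAgeInYears() + " " + p.getGenderCode());
    }
}
```

    Swapping the null around between columns should give you a rough per-column cost when you time a full run.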

    As an aside, it looks like your CsvPatient class is an inner class (if it's a static nested class disregard this). If so, then Dozer will be creating instances of your CsvPatientDataInputAdapter class for every instance of the inner class, which will probably have a performance impact. I'd suggest creating a totally independent CsvPatient class to avoid this.
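
    To illustrate why: a non-static inner class has no true no-arg constructor - its constructor secretly takes the enclosing instance as a parameter - so frameworks that instantiate beans reflectively (like Dozer) have to do extra work for it. A quick stdlib-only demo (the class names are made up):

```java
import java.lang.reflect.Constructor;

public class InnerClassDemo {

    // Non-static inner class: every instance holds a hidden reference to an
    // enclosing InnerClassDemo, and its "no-arg" constructor actually takes
    // the outer instance as a parameter.
    class Inner {}

    // Static nested class: behaves like an independent top-level class.
    static class Nested {}

    public static void main(String[] args) throws Exception {
        Constructor<?> c = Inner.class.getDeclaredConstructors()[0];
        System.out.println(c.getParameterCount()); // 1 (the enclosing instance)

        // The static nested class has a genuine no-arg constructor,
        // which reflection-based frameworks can invoke directly:
        Constructor<?> n = Nested.class.getDeclaredConstructor();
        System.out.println(n.getParameterCount()); // 0
    }
}
```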

    Regards,
    James

  • Ankit G

    Ankit G - 2014-03-03

    Hi James

    Due to company policy I won't be able to post on GitHub.
    I tried setting that column's mapping to null, but the problem just moved to the next column. I also tried moving all the inner classes out into independent classes, but that didn't help either.
    The problem is that more than 50% of my application's time is being spent reading the CSV. Can you think of anything else that might be causing this?

    Thanks & Regards
    Ankit

  • James Bassett

    James Bassett - 2014-03-04

    Hi Ankit,

    I wasn't suggesting posting your actual code on GitHub - just a small example that reproduces the problem. I often find the solution just by trying to replicate an issue on a smaller scale - and if that doesn't work, you at least end up with an example other people can use to help investigate. Without more information, I can't think of anything off the top of my head.

    Dozer does introduce significant overhead (in general, not just with Super CSV), but it makes mapping easy and flexible. If performance is a priority, and you're fairly sure Dozer is the culprit, you could try the standard CsvBeanReader instead - though you'll lose the ability to do nested and indexed mapping.
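
    For example, something like this (an untested sketch - Patient stands in for your CsvPatient, and the header row must name the bean fields):

```java
import java.io.StringReader;
import org.supercsv.cellprocessor.ParseInt;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class BeanReaderDemo {

    // Hypothetical stand-in for your CsvPatient bean
    public static class Patient {
        private int ageInYears;
        private String genderCode;
        public int getAgeInYears() { return ageInYears; }
        public void setAgeInYears(int ageInYears) { this.ageInYears = ageInYears; }
        public String getGenderCode() { return genderCode; }
        public void setGenderCode(String genderCode) { this.genderCode = genderCode; }
    }

    public static Patient readFirst(String csv) throws Exception {
        try (ICsvBeanReader reader = new CsvBeanReader(
                new StringReader(csv), CsvPreference.STANDARD_PREFERENCE)) {
            // Column names in the header are used directly as the bean field mapping
            String[] header = reader.getHeader(true);
            // null means "no processing" - the raw String is passed straight through
            CellProcessor[] processors = { new ParseInt(), null };
            return reader.read(Patient.class, header, processors);
        }
    }

    public static void main(String[] args) throws Exception {
        Patient p = readFirst("ageInYears,genderCode\n64,2\n");
        System.out.println(p.getAgeInYears() + " " + p.getGenderCode());
    }
}
```

    No Dozer involved at all, so any overhead left is Super CSV itself (or your cell processors).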

    If you really want to get to the bottom of it, your best bet is to use a Java profiling tool (there's one bundled with NetBeans, for example). That way it should be pretty obvious where the time is being spent, instead of just speculating or tweaking the code and seeing what happens.

    Cheers,
    James