Menu

CSV with different row formats

Help
Criso
2013-02-18
2013-02-19
  • Criso

    Criso - 2013-02-18

    Hello. I need help about how to read a csv file with different row formats using Dozer.

    Here is an example of what my csv is:
    ;;;;;;;;;;;;;
    Record;5049;5044;Bari;;;;;Cagliari;;;;;
    ;;27/09/2012;;;;;;;;;;;
    1;;07/01/1939;58;22;47;49;69;  -   ;  -   ;  -   ;  -   ;  -   ;
    2;;14/01/1939;18;77;33;62;19;  -   ;  -   ;  -   ;  -   ;  -   ;
    3;;08/07/1939;23;46;2;59;1;52;62;77;49;87;
    4;;15/07/1939;27;60;89;80;10;26;62;39;27;32;

    I can't manually edit the csv file because it is en export from another software and I have to import exactly like it is.
    As you can see, the first row is empty, the second one is something similar to a header row, but have some informations I need to read and store.
    About the second row, I need to store the 2 numbers and the strings "Bari" and "Cagliari" (those strings could change in another export).
    The third row is nearly empty and I don't need to store any information about it.
    From the fourth row ahead, I need to store informations differently from what I did for second row.

    I started trying to create 2 mapping array, 2 classes representing the map and 2 processors array, but I noticed I can't change the mapping if I already started to read the file. The second call to beanReader.configureBeanMapping() throws an exception.

    I thought to try to read the file 2 times. The first time I read only the second row and close the reader. The second time I go to read the rows from 4th ahead.
    The problem is that I don't know how to "pass" 2 rows without have to parse or map them just to get to the fourth row.
    However this solution sucks :D

    Another problem is caused by the " - " strings.
    My processor array uses "new Optional(new ParseInt())" and the class representing the maps has 5 Integer attributes.
    When Super CSV have to parse a field with " - ", it returns:
    org.dozer.converters.ConversionException: org.apache.commons.beanutils.ConversionException: Error converting from 'String' to 'Integer' For input string: "-".
    I even tryed to write my own processor, but nothing changed. It seems that the exception fires before the processor is used. It check the field's types in my class and try to cast the string to it. If i change my class' fields from Integer to String, everything works.

    I hope all of that is understandable. If not, I will try to explain again.

     

    Last edit: Criso 2013-02-18
  • James Bassett

    James Bassett - 2013-02-19

    Hi Andrea,

    Thanks for all the info - yes it's understandable :)

    Just to keep things simple, here's an example that uses CsvListReader (I hope you get this via email, because I know the format is going to be awful after I submit this!).

    Essentially, you can use getHeader() to skip over rows you don't want.

    package qdtmr.util.supercsv;
    
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.List;
    
    import org.supercsv.cellprocessor.Optional;
    import org.supercsv.cellprocessor.ParseDate;
    import org.supercsv.cellprocessor.ParseInt;
    import org.supercsv.cellprocessor.Token;
    import org.supercsv.cellprocessor.ift.CellProcessor;
    import org.supercsv.io.CsvListReader;
    import org.supercsv.io.ICsvListReader;
    import org.supercsv.prefs.CsvPreference;
    
    public class DifferentRowFormats {
    
        public static void main(String[] args) throws IOException {
    
            ICsvListReader reader = null;
            try {
                reader = new CsvListReader(new FileReader("input.csv"),
                        CsvPreference.EXCEL_NORTH_EUROPE_PREFERENCE);
    
                reader.getHeader(false); // skip first line (empty)
    
                // processors for reading second line
                final CellProcessor[] processors = new CellProcessor[14];
                // only care about processing these 2 columns
                processors[1] = new ParseInt();
                processors[2] = new ParseInt();
    
                List<Object> secondLine = reader.read(processors);
                System.out.println("Second line: " + secondLine);
    
                reader.getHeader(false); // skip third line (not interested!)
    
                // a lot of the values appear to be ints, but can also be empty/dash
                final CellProcessor optionalParseInt = new Token("  -   ", null,
                        new Optional(new ParseInt()));
    
                // processors for reading the data rows
                final CellProcessor[] dataProcessors = new CellProcessor[] {
                        new ParseInt(), null, new ParseDate("dd/MM/yyyy"),
                        new ParseInt(), new ParseInt(), new ParseInt(),
                        new ParseInt(), new ParseInt(), optionalParseInt,
                        optionalParseInt, optionalParseInt, optionalParseInt,
                        optionalParseInt, null };
    
                List<Object> row;
                while ((row = reader.read(dataProcessors)) != null) {
                    System.out.println("Data row: " + row);
                }
    
            } finally {
                if (reader != null) {
                    reader.close();
                }
            }
        }
    
    }
    

    You can see how I've skipped over the 1st and 3rd rows using getHeader(), and I've defined 2 different processor arrays - one for the 2nd line, and one for the data rows. I've just made guesses for which processors to use - I don't know what the actual data is (something to do with Italy??)!

    I'd suggest only using the Dozer reader if you need indexed or deep mapping - if you just want to map your CSV to a bean then go with CsvBeanReader. It's faster and doesn't need any other dependencies.

    If you do use the Dozer reader - you'll need to configure the mapping before reading (you'll need to call configureBeanMapping() twice - once for each of your 2 classes). Then when reading the second line pass the first class into read(), and when reading the data rows pass the second class into read() (using the appropriate cell processor array).

    I hope this helps.

     
  • Criso

    Criso - 2013-02-19

    Yes, that project has something to do with Italy. It's about "Lotto", an Italian state lottery.
    Two times a week, 5 numbers (from 1 to 90) are drawn for 10 lottery wheels (that have the names of 10 italian cities) plus 5 numbers on a "national" wheel (added recently).
    From the csv I have to generate an instance of the class "Extraction" for each record in the csv (4th row ahead).

    Here an example of the class structure:

    class Exctraction {
        Integer number;
        Date date;
        LinkedList<Wheel> wheels;
    }
    
    class Wheel {
        WheelName name;
        Byte n1;
        Byte n2;
        Byte n3;
        Byte n4;
        Byte n5;
    }
    

    If I correctly understood, with deep mapping and indexed mapping, I can directly create that structure from the csv (the reason why I am using Dozer).

    I will try the "new Token()" code to solve my problem with casts. I simplified a bit the example I reported to you. I use Byte, not Integer and when the csv do not contain a number, it contains "funny_character - funny_character" (eclipse prints "� - �").
    I attached the real CSV if you are interested in it.

    Thank you for your reply, you are very kind :)

     

    Last edit: Criso 2013-02-19
  • Criso

    Criso - 2013-02-19

    Ok, I did everything I needed.

    My new processor is:

    final CellProcessor optionalParseInt = new Token("\uFFFD - \uFFFD ", null, new Optional(new ParseInt()));
    

    Thank you so much for your help and patience.
    Andrea

     
  • James Bassett

    James Bassett - 2013-02-19

    I see now, yes - the Dozer reader is probably a good choice then. Glad to hear you got it working. Don't forget to 'accept' the answers on StackOverflow too :)