
#223 csv has more data columns than header columns

Milestone: v1.0 (example)
Status: closed-fixed
Labels: None
Priority: 5
Updated: 2022-03-27
Created: 2021-04-30
Creator: Lance
Private: No

I have a client providing a csv file with trailing commas on the data rows, like:

Header 1,Header 2,Header 3
value1.1,value1.2,value1.3,
value2.1,value2.2,value2.3,

This is causing opencsv to fail with the following exception

Exception in thread "pool-1-thread-3" java.lang.RuntimeException: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
    at com.opencsv.bean.util.OpencsvUtils.handleException(OpencsvUtils.java:125)
    at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:108)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
    at com.opencsv.bean.HeaderNameBaseMappingStrategy.verifyLineLength(HeaderNameBaseMappingStrategy.java:97)
    at com.opencsv.bean.AbstractMappingStrategy.populateNewBean(AbstractMappingStrategy.java:324)
    at com.opencsv.bean.concurrent.ProcessCsvLine.processLine(ProcessCsvLine.java:131)
    at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:87)
    ... 3 more

The only workaround I've found so far is to pass a custom BufferedReader to CsvToBeanBuilder which removes the trailing commas in readLine()

MappingStrategy<MyRowBean> mappingStrategy = new HeaderColumnNameMappingStrategy<>();
mappingStrategy.setType(MyRowBean.class);
Reader reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8);
BufferedReader bufferedReader = new BufferedReader(reader) {
  // here's my hack: strip a single trailing comma from each line
  @Override
  public String readLine() throws IOException {
    String line = super.readLine();
    if (line == null) {
      return null; // end of stream
    }
    return line.endsWith(",") ? line.substring(0, line.length() - 1) : line;
  }
};
CsvToBean<MyRowBean> csvToBean = new CsvToBeanBuilder<MyRowBean>(bufferedReader)
    .withMappingStrategy(mappingStrategy)
    .build();
List<Foo> myList = csvToBean.stream().map(myMapper).collect(toList());

This feels like a hack to me, but I can't find a nicer way to tell opencsv to ignore the extra blank column in the data rows.

Discussion

  • Andrew Rucker Jones

    This is, of course, expected behavior. Your input data are malformed. As for a way around it, you can imagine the choices are limited. I looked at RowProcessor, but I don't think it would do what you want it to. The workaround you created is probably your best bet.

     
  • Lance

    Lance - 2021-04-30

    I was hoping that opencsv could simply see this as a data column with a null header and an empty string for a value; since MyRowBean doesn't have a mapping for this null column, it would simply be ignored.

    Perhaps MappingStrategy.validateDataRowColumnCount(false)?

     

    Last edit: Lance 2021-04-30
    • Andrew Rucker Jones

      But there is no header. If there were, "Header 3" would have a comma after it as well. Even then, I doubt it would work.

       
  • Lance

    Lance - 2021-04-30

    Yes, so it's a column with a null header

     
  • Scott Conway

    Scott Conway - 2021-05-02

    Lance - believe it or not (and believe me, the next words here will have Andrew writing me to see if my account was hacked) I like your idea of the validateDataRowColumnCount! It still gives us the ability to enforce safety but allows users to say "I'm grown up, I don't need health insurance; if I get sick I'll take care of it." Which is okay until they get into an accident and get hit with the big medical bill.

    We actually used to not count headers and columns. The reason we started was that we had a run of defects where malformed csv files were being processed and users were upset that they were getting null pointer or casting exceptions - the latter because the missing data was in the middle, causing a string to go into a date. So it was just easier to say that columns and headers must match, because if they don't, how do you know what is actually missing?

    While personally I would prefer an external process to make the headers and columns match (by either adding dummy names to the header or removing extraneous data), I think there is a compromise that will get you what you want but still provide some safety.

    Instead of giving the option of turning the validation off completely (though that may still be an option - I have not talked with Andrew yet), how about an option to loosen the validation? For this case, instead of saying you have to have the same number of data columns as headers, we say you have to have at least as many columns as headers; if you have fewer we throw an exception, and any additional columns will be ignored. For you the effect would be exactly the same, but there is a little more safety than ignoring the validation altogether.

    Lance, Andrew - what are your thoughts on this?

    Scott :)

     
  • Lance

    Lance - 2021-05-03

    I've worked with enough poor-quality data files from clients to know that anything goes. So I can definitely see all of the following cases occurring in the wild:
    1. More columns in a data row than header columns (solved via data columns with null header)
    2. More columns in a header row than the data rows (solved via null data cells)
    3. Data rows with differing column counts (solved via null data cells)

    If opencsv could allow all of these cases (via explicit allowX() methods if need be) that would be great. I'd much rather have nulls than be forced to "fix" the file (with another csv parsing library?) before handing it to opencsv.

    But my immediate problem, which I definitely need a solution to, is the one outlined in my original question, where the data rows have a trailing comma and the header row does not.

     
  • Andrew Rucker Jones

    I guess I could live with this. I think I would prefer exactly one switch, though: disableRowLengthChecking() or some such thing.

    It looks like all of the work would be in AbstractMappingStrategy::populateNewBean().

    Just a note: if a field is marked as required and we fill in null because there are insufficient columns, a CsvRequiredFieldEmptyException will still be thrown.

     
  • Lance

    Lance - 2021-05-14

    One of my pet hates is naming booleans in the negative (eg disableX), because it leads to code like

    if (!rowLengthCheckingDisabled) {
       // two negatives make a positive (not of disabled means enabled)
       validateRowLength(...);
    }
    

    I much prefer naming booleans in the positive (eg validateRowLength) and setting the value to false: mappingStrategy.validateRowLength(false)

     

    Last edit: Lance 2021-05-14
  • Scott Conway

    Scott Conway - 2021-06-06

    Hey Lance - sorry it has been so long, but I have been busy and this is the first chance I've had in a while.

    I am still planning on either adding a switch to disable the column count checks or extending HeaderColumnNameMappingStrategy so the child class does not check column counts.

    But there are a couple of things in your posts that concern me, and I want to make sure you know what danger you are putting yourself in by going this route.

    What is causing all the red flags for me is your statement on 5/3 that everything could be fixed just by adding null columns at the end to make everything equal. Nothing could be further from the truth. You are assuming that everything is an optional string and that, if there is a malformed row, it must be the data at the end that is missing.

    Here is an example of one of the bugs we had that caused us to add in the column checks in the first place.

    ID, Name, Date, Position
    1,Lance,2021/05/03,Poster
    2,2021/06/06,Admin
    3,Jones,2021/05/14,Admin

    As you can see, in row 2 the missing data was my name, and this would cause opencsv to try to convert 'Admin' to a date and fail. Adding null columns would not solve this! But checking the number of columns and throwing the exception lets the user know there is data missing, as opposed to all columns being present with data accidentally shifted around.

    Now, if after reading this you understand the risk but are still convinced that your malformed data is only ever going to be malformed in a certain way that will not cause you any harm to continue processing it, then we can still add an option to remove the safety. BUT what I would recommend in the meantime is that you create three tools.

    The first would be an app that gives a count of how many columns are found. So for the above example the app would return:

    Columns in Header: 4

    Columns in Data

    Columns  #rows
    4        2
    3        1

    And if what you are saying is true, there would be only one entry under "Columns in Data", but its number of columns would differ from the number of columns in the header. If there is more than one entry, I would stop there and not even try to process the file. This could be done simply with the CSVReader, looking at the length of each array returned from readNext().
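
    A minimal sketch of that first tool might look like this (assuming opencsv 5.x, where readNext() returns one String[] per row; the class name and argument handling are illustrative):

    import com.opencsv.CSVReader;
    import java.io.FileReader;
    import java.util.Map;
    import java.util.TreeMap;

    public class ColumnCensus {
        public static void main(String[] args) throws Exception {
            try (CSVReader csvReader = new CSVReader(new FileReader(args[0]))) {
                String[] header = csvReader.readNext();
                if (header == null) {
                    System.out.println("Empty file");
                    return;
                }
                System.out.println("Columns in Header: " + header.length);

                // Tally how many data rows have each column count
                Map<Integer, Integer> countsByWidth = new TreeMap<>();
                String[] row;
                while ((row = csvReader.readNext()) != null) {
                    countsByWidth.merge(row.length, 1, Integer::sum);
                }

                System.out.println();
                System.out.println("Columns in Data");
                System.out.println("Columns  #rows");
                countsByWidth.forEach((width, rows) ->
                        System.out.println(width + "        " + rows));
            }
        }
    }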

    The second app would be one to display the rows whose number of columns is a certain size. This way you can quickly find the problem rows. So for the above example:

    findColumnsOfSize sample.csv 3

    2,2021/06/06,Admin
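
    A sketch of that second tool, under the same assumptions (again, the class name and arguments are made up for illustration):

    import com.opencsv.CSVReader;
    import java.io.FileReader;

    public class FindColumnsOfSize {
        public static void main(String[] args) throws Exception {
            int targetSize = Integer.parseInt(args[1]);
            try (CSVReader csvReader = new CSVReader(new FileReader(args[0]))) {
                String[] row;
                while ((row = csvReader.readNext()) != null) {
                    if (row.length == targetSize) {
                        // Rejoin the offending row with commas for inspection
                        System.out.println(String.join(",", row));
                    }
                }
            }
        }
    }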

    And the third would be a preprocessor that basically does what you want and produces a new output file where every row has at least a given number of columns - that number having been determined by running the first app.

    AddColumnsToEnsureSize sample.csv output.csv 4
    one row modified

    This way you have the original untouched and can continue to work with the modified file. And if the failure is not at the end as you are expecting you can still go back to the original file.
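
    And a sketch of the third tool, padding short rows with empty cells (illustrative as well; note that CSVWriter quotes all fields by default, so the output formatting will differ slightly from the input):

    import com.opencsv.CSVReader;
    import com.opencsv.CSVWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.util.Arrays;

    public class AddColumnsToEnsureSize {
        public static void main(String[] args) throws Exception {
            int minSize = Integer.parseInt(args[2]);
            int modified = 0;
            try (CSVReader csvReader = new CSVReader(new FileReader(args[0]));
                 CSVWriter csvWriter = new CSVWriter(new FileWriter(args[1]))) {
                String[] row;
                while ((row = csvReader.readNext()) != null) {
                    if (row.length < minSize) {
                        // Pad the short row with empty strings on the right
                        String[] padded = Arrays.copyOf(row, minSize);
                        Arrays.fill(padded, row.length, minSize, "");
                        row = padded;
                        modified++;
                    }
                    csvWriter.writeNext(row);
                }
            }
            System.out.println(modified + " row(s) modified");
        }
    }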

    I would do at least the first two. Even with the safeties off, at some point you will hit a file where data is not entered. That was what happened to other users and why we put the safeties in opencsv in the first place. And at that point you will have tools in place to identify the malformed lines.

     
  • Lance

    Lance - 2021-06-07

    What is causing all the red flags for me is your statements on 5/3 that everything could be fixed just by adding null columns at the end to make everything equal

    For my case the "missing" cells are at the end of a row.
    So if:
    1. there's one more data column than header columns, I'd expect the last data column to have a null header
    2. there's one more header column than data columns, I'd expect each data row to have a null value for the last header
    3. the data rows have different column counts, I'd expect opencsv to pass nulls in the right-hand cells where appropriate

    As you can see in row 2 the missing data was my name and this would cause opencsv to try and convert 'Admin' to a date and fail.

    Yes, your example is different from mine. In your example a cell is missing from the middle of the row, with no blank cell (eg a double comma) left in the csv. Nothing short of artificial intelligence in opencsv could seamlessly handle this case.

    I'm really just asking opencsv to handle the parsing the same way msexcel opens the file without failing: putting null values on the right-hand side where appropriate.

    As for your example, msexcel would put the date in the name column (and date parsing would fail). I'd expect the same error from opencsv, as the data is corrupt.

     

    Last edit: Lance 2021-06-07
  • Andrew Rucker Jones

    • status: open --> pending
    • assigned_to: Andrew Rucker Jones
     
  • Andrew Rucker Jones

    I have implemented this for the next release.

     
  • Andrew Rucker Jones

    • status: pending --> closed-fixed
     
  • Andrew Rucker Jones

    opencsv 5.5 was released yesterday with this feature.

     
  • roesslerj

    roesslerj - 2022-03-21

    Unfortunately, this is wrongly implemented (HeaderNameBaseMappingStrategy:78):

    if (!this.headerIndex.isEmpty() && numberOfFields != this.headerIndex.getHeaderIndexLength() && !this.forceCorrectRecordLength) {
        throw new CsvRequiredFieldEmptyException(this.type, ResourceBundle.getBundle("opencsv", this.errorLocale).getString("header.data.mismatch"));
    }
    

    meaning that the Exception gets thrown if forceCorrectRecordLength is false. So in order to avoid the exception, one has to set the field to true.

     

    Last edit: roesslerj 2022-03-21
    • Scott Conway

      Scott Conway - 2022-03-27

      roesslerj - I am afraid you are misunderstanding the intent of forceCorrectRecordLength. If it is set to true, then extra columns are ignored, and if there are not enough columns, the missing fields are populated with null. You can read this for yourself in the javadocs - http://opencsv.sourceforge.net/apidocs/index.html

      If it is set to false, we do not force the number of columns to be the same, and if they differ we throw the exception.
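
      For reference, a minimal usage sketch (the boolean constructor here is inferred from the code quoted above, so check the javadocs for the exact constructor or builder in your opencsv version):

      // forceCorrectRecordLength = true: extra columns are ignored,
      // and missing trailing fields are populated with null
      MappingStrategy<MyRowBean> strategy = new HeaderColumnNameMappingStrategy<>(true);
      strategy.setType(MyRowBean.class);
      List<MyRowBean> beans = new CsvToBeanBuilder<MyRowBean>(reader)
              .withMappingStrategy(strategy)
              .build()
              .parse();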

       
