
#223 csv has more data columns than header columns

Milestone: v1.0 (example)
Status: closed-fixed
Labels: None
Priority: 5
Updated: 2022-03-27
Created: 2021-04-30
Creator: Lance
Private: No

I have a client providing a csv file with trailing commas on the data rows, like:

Header 1,Header 2,Header 3
value1.1,value1.2,value1.3,
value2.1,value2.2,value2.3,

This is causing opencsv to fail with the following exception

Exception in thread "pool-1-thread-3" java.lang.RuntimeException: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
    at com.opencsv.bean.util.OpencsvUtils.handleException(OpencsvUtils.java:125)
    at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:108)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
    at com.opencsv.bean.HeaderNameBaseMappingStrategy.verifyLineLength(HeaderNameBaseMappingStrategy.java:97)
    at com.opencsv.bean.AbstractMappingStrategy.populateNewBean(AbstractMappingStrategy.java:324)
    at com.opencsv.bean.concurrent.ProcessCsvLine.processLine(ProcessCsvLine.java:131)
    at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:87)
    ... 3 more

The only workaround I've found so far is to pass a custom BufferedReader to CsvToBeanBuilder which removes the trailing commas in readLine()

MappingStrategy<MyRowBean> mappingStrategy = new HeaderColumnNameMappingStrategy<>();
mappingStrategy.setType(MyRowBean.class);
Reader reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8);
BufferedReader bufferedReader = new BufferedReader(reader) {
  // here's my hack: strip a single trailing comma from each line
  @Override
  public String readLine() throws IOException {
    String line = super.readLine();
    if (line == null) {
      return null; // end of stream
    }
    return line.endsWith(",") ? line.substring(0, line.length() - 1) : line;
  }
};
CsvToBean<MyRowBean> csvToBean = new CsvToBeanBuilder<MyRowBean>(bufferedReader)
    .withMappingStrategy(mappingStrategy)
    .build();
List<Foo> myList = csvToBean.stream().map(myMapper).collect(toList());

This feels like a hack to me, but I can't find a nicer way to tell opencsv to ignore the extra blank column in the data rows.

Discussion

  • Andrew Rucker Jones

    This is, of course, expected behavior. Your input data are malformed. As for a way around it, you can imagine the choices are limited. I looked at RowProcessor, but I don't think it would do what you want it to. The workaround you created is probably your best bet.

     
  • Lance

    Lance - 2021-04-30

    I was hoping that opencsv could simply see this as a data column with a null header and an empty string for a value; since MyRowBean doesn't have a mapping for this null column, it would simply be ignored.

    Perhaps MappingStrategy.validateDataRowColumnCount(false)?

     

    Last edit: Lance 2021-04-30
    • Andrew Rucker Jones

      But there is no header. If there were, "Header 3" would have a comma after it as well. Even then, I doubt it would work.

       
  • Lance

    Lance - 2021-04-30

    Yes, so it's a column with a null header

     
  • Scott Conway

    Scott Conway - 2021-05-02

    Lance - believe it or not (and believe me, the next words here will have Andrew writing me to see if my account was hacked) I like your idea of the validateDataRowColumnCount! It still gives us the ability to enforce safety but allows users to say "I'm grown up, I don't need health insurance; if I get sick I'll take care of it." Which is okay until they get into an accident and get hit with the big medical bill.

    We actually used to not count headers and columns. The reason we started was that we had a run of defects where malformed csv files were being processed and users were upset that they were getting null pointer or casting exceptions - the latter because the missing data was in the middle, causing a string to go into a date. So it was just easier to say that columns and headers must match, because if they don't, how do you know what is actually missing?

    While personally I would prefer an external process to make the headers and columns match (by either adding dummy names to the header or removing extraneous data), I think there is a compromise that will get you what you want but still provide some safety.

    Instead of giving the option of turning the validation off completely (though that may still be an option - I have not talked with Andrew yet), how about an option to loosen the validation? For this case, instead of saying you have to have the same number of data columns as headers, we say you have to have at least as many columns as headers; if you have fewer we throw an exception, and any additional columns will be ignored. For you the effect would be exactly the same, but there is a little more safety than ignoring the validation altogether.

    Lance, Andrew - what are your thoughts on this?

    Scott :)

     
  • Lance

    Lance - 2021-05-03

    I've worked with enough poor-quality data files from clients to know that anything goes. So I can definitely see all of the following cases occurring in the wild:
    1. More columns in a data row than header columns (solved via data columns with null header)
    2. More columns in a header row than the data rows (solved via null data cells)
    3. Data rows with differing column counts (solved via null data cells)

    If opencsv could allow all of these cases (via explicit allowX() methods if need be) that would be great. I'd much rather have nulls than be forced to "fix" the file (with another csv parsing library?) before handing it to opencsv.

    But my immediate problem, which I definitely need a solution to, is the one outlined in my original question, where the data rows have a trailing comma and the header row does not.

     
  • Andrew Rucker Jones

    I guess I could live with this. I think I would prefer exactly one switch, though: disableRowLengthChecking() or some such thing.

    It looks like all of the work would be in AbstractMappingStrategy::populateNewBean().

    Just a note: if a field is marked as required and we fill in null because there are insufficient columns, a CsvRequiredFieldEmptyException will still be thrown.

     
  • Lance

    Lance - 2021-05-14

    One of my pet hates is naming booleans in the negative (eg disableX), because it leads to code like

    if (!rowLengthCheckingDisabled) {
       // two negatives make a positive (not of disabled means enabled)
       validateRowLength(...);
    }
    

    I much prefer naming booleans in the positive (eg validateRowLength) and setting the value to false: mappingStrategy.validateRowLength(false)

     

    Last edit: Lance 2021-05-14
  • Scott Conway

    Scott Conway - 2021-06-06

    Hey Lance - sorry it has been so long, but I have been busy and this is the first chance I've had in a while.

    I am still planning on either adding a switch to disable the column count checks or extending HeaderColumnNameMappingStrategy so the child class does not check column counts.

    But there are a couple of things in your posts that concern me, and I want to make sure you know what danger you are putting yourself in by going this route.

    What is causing all the red flags for me is your statement on 5/3 that everything could be fixed just by adding null columns at the end to make everything equal. Nothing could be further from the truth. You are assuming that everything is an optional string and that, if there is a malformed row, it must be the data at the end that is missing.

    Here is an example of one of the bugs we had that caused us to add in the column checks in the first place.

    ID, Name, Date, Position
    1,Lance,2021/05/03,Poster
    2,2021/06/06,Admin
    3,Jones,2021/05/14,Admin

    As you can see, in row 2 the missing data was my name, and this would cause opencsv to try to convert 'Admin' to a date and fail. Adding null columns would not solve this! But checking the number of columns and throwing the exception lets the user know there is data missing, as opposed to all columns being present with data accidentally shifted around.

    Now, if after reading this you understand the risk but are still convinced that your malformed data is only ever going to be malformed in a certain way that will not cause you any harm to continue processing it, then we can still add an option to remove the safety. BUT what I would recommend in the meantime is that you create three tools.

    The first would be an app that gives a count of how many columns are found. So for the above example the app would return:

    Columns in Header: 4

    Columns in Data

    Columns  #rows
    4        2
    3        1

    And if what you are saying is true, there would be only one entry under "Columns in Data", but its number of columns would differ from the number of columns in the header. If there is more than one entry, I would stop there and not even try to process the file. This could be done simply with the CSVReader, looking at the length of each array returned from readNext().
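
    A minimal sketch of that first tool might look like this (assuming opencsv 5.x, where readNext() returns one String[] per row; the class name and argument handling are illustrative):

    import com.opencsv.CSVReader;
    import java.io.FileReader;
    import java.util.Map;
    import java.util.TreeMap;

    public class ColumnCensus {
        public static void main(String[] args) throws Exception {
            try (CSVReader csvReader = new CSVReader(new FileReader(args[0]))) {
                String[] header = csvReader.readNext();
                if (header == null) {
                    System.out.println("Empty file");
                    return;
                }
                System.out.println("Columns in Header: " + header.length);

                // Tally how many data rows have each column count
                Map<Integer, Integer> countsByWidth = new TreeMap<>();
                String[] row;
                while ((row = csvReader.readNext()) != null) {
                    countsByWidth.merge(row.length, 1, Integer::sum);
                }

                System.out.println();
                System.out.println("Columns in Data");
                System.out.println("Columns  #rows");
                countsByWidth.forEach((width, rows) ->
                        System.out.println(width + "        " + rows));
            }
        }
    }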

    The second app would be one to display the rows whose number of columns is a certain size. This way you can quickly find the problem rows. So for the above example:

    findColumnsOfSize sample.csv 3

    2,2021/06/06,Admin
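
    A sketch of that second tool, under the same assumptions (again, the class name and arguments are made up for illustration):

    import com.opencsv.CSVReader;
    import java.io.FileReader;

    public class FindColumnsOfSize {
        public static void main(String[] args) throws Exception {
            int targetSize = Integer.parseInt(args[1]);
            try (CSVReader csvReader = new CSVReader(new FileReader(args[0]))) {
                String[] row;
                while ((row = csvReader.readNext()) != null) {
                    if (row.length == targetSize) {
                        // Rejoin the offending row with commas for inspection
                        System.out.println(String.join(",", row));
                    }
                }
            }
        }
    }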

    And the third would be a preprocessor that basically does what you want and produces a new output file where every row has at least a given number of columns - that number having been determined by running the first app.

    AddColumnsToEnsureSize sample.csv output.csv 4
    one row modified

    This way you have the original untouched and can continue to work with the modified file. And if the failure is not at the end as you are expecting you can still go back to the original file.
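
    And a sketch of the third tool, padding short rows with empty cells (illustrative as well; note that CSVWriter quotes all fields by default, so the output formatting will differ slightly from the input):

    import com.opencsv.CSVReader;
    import com.opencsv.CSVWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.util.Arrays;

    public class AddColumnsToEnsureSize {
        public static void main(String[] args) throws Exception {
            int minSize = Integer.parseInt(args[2]);
            int modified = 0;
            try (CSVReader csvReader = new CSVReader(new FileReader(args[0]));
                 CSVWriter csvWriter = new CSVWriter(new FileWriter(args[1]))) {
                String[] row;
                while ((row = csvReader.readNext()) != null) {
                    if (row.length < minSize) {
                        // Pad the short row with empty strings on the right
                        String[] padded = Arrays.copyOf(row, minSize);
                        Arrays.fill(padded, row.length, minSize, "");
                        row = padded;
                        modified++;
                    }
                    csvWriter.writeNext(row);
                }
            }
            System.out.println(modified + " row(s) modified");
        }
    }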

    I would do at least the first two. Even with the safeties off, at some point you will hit a file where data is not entered. That was what happened to other users and why we put the safeties in opencsv in the first place. And at that point you will have tools in place to identify the malformed lines.

     
  • Lance

    Lance - 2021-06-07

    What is causing all the red flags for me is your statements on 5/3 that everything could be fixed just by adding null columns at the end to make everything equal

    For my case the "missing" cells are at the end of a row.
    So if:
    1. there's one more data column than header columns, I'd expect the last data column to have a null header
    2. there's one more header column than data columns, I'd expect each data row to have a null value for the last header
    3. the data rows have different column counts, I'd expect opencsv to pass nulls in the right-hand cells where appropriate

    As you can see in row 2 the missing data was my name and this would cause opencsv to try and convert 'Admin' to a date and fail.

    Yes, your example is different from mine. In your example a cell is missing from the middle of the row, with no blank cell (eg a double comma) left in the csv. Nothing short of artificial intelligence in opencsv could seamlessly handle this case.

    I'm really just asking opencsv to handle the parsing the same way msexcel opens the file without failing: putting null values on the right-hand side where appropriate.

    As for your example, msexcel would put the date in the name column (and date parsing would fail). I'd expect the same error from opencsv, as the data is corrupt.

     

    Last edit: Lance 2021-06-07
  • Andrew Rucker Jones

    • status: open --> pending
    • assigned_to: Andrew Rucker Jones
     
  • Andrew Rucker Jones

    I have implemented this for the next release.

     
  • Andrew Rucker Jones

    • status: pending --> closed-fixed
     
  • Andrew Rucker Jones

    opencsv 5.5 was released yesterday with this feature.

     
  • roesslerj

    roesslerj - 2022-03-21

    Unfortunately, this is wrongly implemented (HeaderNameBaseMappingStrategy:78):

    if (!this.headerIndex.isEmpty() && numberOfFields != this.headerIndex.getHeaderIndexLength() && !this.forceCorrectRecordLength) {
        throw new CsvRequiredFieldEmptyException(this.type, ResourceBundle.getBundle("opencsv", this.errorLocale).getString("header.data.mismatch"));
    }
    

    meaning that the Exception gets thrown if forceCorrectRecordLength is false. So in order to avoid the exception, one has to set the field to true.

     

    Last edit: roesslerj 2022-03-21
    • Scott Conway

      Scott Conway - 2022-03-27

      roesslerj - I am afraid you are misunderstanding the intent of forceCorrectRecordLength. If it is set to true, then extra columns are ignored, and if there are not enough columns, the missing fields are populated with null. You can read this for yourself in the javadocs - http://opencsv.sourceforge.net/apidocs/index.html

      If it is set to false, we do not force the number of columns to be the same, and if they differ we throw the exception.
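
      For reference, a minimal usage sketch (the boolean constructor here is inferred from the code quoted above, so check the javadocs for the exact constructor or builder in your opencsv version):

      // forceCorrectRecordLength = true: extra columns are ignored,
      // and missing trailing fields are populated with null
      MappingStrategy<MyRowBean> strategy = new HeaderColumnNameMappingStrategy<>(true);
      strategy.setType(MyRowBean.class);
      List<MyRowBean> beans = new CsvToBeanBuilder<MyRowBean>(reader)
              .withMappingStrategy(strategy)
              .build()
              .parse();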

       
