opencsv / Feature Requests / #139 Ignoring comment lines

Andrew Rucker Jones - 2020-10-25

I doubt if we'd be too excited to implement this, as there is no such thing as a comment in a CSV file. One way of doing it would be to implement your own Reader (or perhaps there's one out there that can already do something like this?) and pass that in to opencsv. Off hand, I don't see a way inside of opencsv to filter out lines beginning with a certain character.

Scott, do you have more insight?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Rucker Jones - 2020-10-25

As to your second question, it's because the code doesn't filter lines at the level of CSVReader. It actually filters them much higher, in CsvToBean.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Green - 2020-10-25

I saw that feature in few other CSV libraries and though it would be pretty easy to implement.
As for second point do you think ignoring empty lines should be available also in CsvReaderBuilder to keep consistency across all readers?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Conway - 2020-10-27

Today I learned you can search past, even closed, requests by keyword. Which came in handy because I knew we had closed a similar issue years ago.

But taking things in reverse order - for ignoring blank lines look at FR#91 (https://sourceforge.net/p/opencsv/feature-requests/91/). There the suggestion was to create a child class of FileInputStream to drop all initial empty lines until you hit a non-empty line.

Personally I don't want to add a ignore initial blank lines because we already have a skipLines in openCSV and there is a question of precidence (do we do both, if so which order, or just do one, if so which, or throw an error).

Now note I said INITIAL lines because a line of csv data can have new lines so if you have data like this.

id,name,note\n
1,test name,"This is a long note.\n\nFormatted to look like a paragraph.\n\nSo you do not want to remove the newlines here.\n"\n

So once the text starts the ignoring of blank lines should stop.

All that said that is why there is a ignoreEmptyLines in CsvToBean and not CSVReader. If you look under the covers what the ignoreEmptyLines does is call the CSVReader and if an empty line comes back and no data has been read yet then it will ignore that data and call the CSVReader again until data is found. You can see this implemented deep down in the SingleLineReader.

/** * The only constructor. * @param csvReader The {@link CSVReader} for reading the input * @param ignoreEmptyLines Whether blank lines of input should be ignored */ public SingleLineReader(CSVReader csvReader, boolean ignoreEmptyLines) { this.csvReader = csvReader; this.ignoreEmptyLines = ignoreEmptyLines; } private boolean isCurrentLineEmpty() { return line.length == 0 || (line.length == 1 && line[0].isEmpty()); } /** * Reads from the {@link CSVReader} provided on instantiation until a * usable line of input is found. * * @return The next line of significant input, or {@code null} if none * remain * @throws IOException If bad things happen during the read * @throws CsvValidationException If a user-defined validator fails */ public String[] readNextLine() throws IOException, CsvValidationException { do { line = csvReader.readNext(); } while (line != null && isCurrentLineEmpty() && ignoreEmptyLines); return getLine(); }

All that said you can try the SingleLineReader to see if that gives you the result you want.

So as for the main part of the request, commented lines, there was an old issue that was a exact copy of this but it was closed for non movement.

Seeing that there are multiple requests for the same thing I am willing to keep this in the planned section but don't have time to work it myself short term.

It is easy but not as easy as you are thinking. If I make a small change to the data example I gave before.

id,name,note\n
#1,test name,"This is a long note.\n\nFormatted to look like a paragraph.\n\nSo you do not want to remove the newlines here.\n"\n

I would have to ignore five lines not one.

We can actually do it easily enough in the SingleLineReader.

public String[] readNextLine() throws IOException, CsvValidationException { do { line = csvReader.readNext(); } while (line != null && isCurrentLineEmpty() && ignoreEmptyLines); return commentChar == getLine()[0].charAt(0) ? readNextLine(0 : getLine(); }

I don't think the final solution can be recursive though because if someone threw in a large file with thousands of contiguous commented lines I think we would risk getting an stack overflow error but you get the idea.

Andrew what do you think?

Last edit: Andrew Rucker Jones 2020-10-27
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andrew Rucker Jones - 2020-10-27
  
  I think I hate horked CSV files.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andrew Rucker Jones - 2020-10-27
  
  Scott makes an excellent point I had not thought far enough to consider: there is a difference between commented lines and commented records. You opened an issue to remove commented lines. But let me build on Scott's example by asking what happens if a line in a multiline record happens to begin with the defined comment character (which is another issue we would have to debate—there are so many possible comment characters). The only logical solution is that you can't filter lines like that. They're part of the data, and you can't expect yet another layer of (non-normed) escaping just to make sure the comment character at the beginning of a line in multiline data is not interpreted as a comment character.
  
  That's what leads Scott to commenting records instead of lines, and yes, that makes it more complicated.
  
  Just to have said it: I don't think it would cause a problem if someone were to comment out every line in a multiline record. opencsv would then simply consider the comment characters part of the (ignored) data and still search for the end of the record.
  
  BUT! What if the comment is truly a comment and has nothing to do with the CSV format? Then we would be trying to interpret it as a record, splitting fields, counting quotation marks, trying to find the end, when in fact there is no end of the record. opencsv would break.
  
  So honestly, I'm not sure it can be done.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Conway - 2020-10-27

After talking with Andrew about this I realize he is completely correct (Andrew - you were right and I was wrong. <bg>) and this will be very difficult if not impossible. </bg>

So there are two ways of looking at this - we either comment by line or comment by record. And the problem with each is what if you have to comment the other.

If you comment by line then the example I originally gave:

id,name,note\n #1,test name,"This is a long note.\n\nFormatted to look like a paragraph.\n\nSo you do not want to remove the newlines here.\n"\n

would have to be commented, by the creator of the file, like so

id,name,note\n #1,test name,"This is a long note.\n#\n#Formatted to look like a paragraph.\n#\n#So you do not want to remove the newlines here.\n#"\n

Which would not be acceptable to the creators of the csv files. This was why I originally suggested a comment by record strategy and just look at the first character of the first item in the array. The problem there I realized after talking with Andrew is what if we have just a plain commented line and it has a character opencsv parses on.

So if we have the following example if we comment by record the first record would comment okay but the next line will cause the next record not to parse because the quote character would start a field that cause the next line, the real record, to be considered part of the commented record.

id,name,note\n
#1,test name,"This is a long note.\n\nFormatted to look like a paragraph.\n\nSo you do not want to remove the newlines here.\n"\n
# character in this range [,"] have special meanings as separator and quote\n
2,real record,"This is a long note.\n\nFormatted to look like a paragraph.\n\nSo you do not want to remove the newlines here.\n"\n

So yes I think this is impossible to do properly.

Now if you are willing to swear you only mean lines, never records, records should never be commented then I would point you back to FR#91 (https://sourceforge.net/p/opencsv/feature-requests/91/) and suggest you create a FileInputStream child class that removes all lines that start with your comment character, create a FileReader with that, and create a CSVReader with that.

Last edit: Andrew Rucker Jones 2020-10-27
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Rucker Jones - 2021-02-22

status: open --> wont-fix

assigned_to: Andrew Rucker Jones
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

IamErwin - 2021-06-24

Fix or not?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andrew Rucker Jones - 2021-07-03
  
  Well, we closed it with the status "wont-fix" four months ago, and in our arguments we laid out that it's actually an impossibility, so no.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

If I extend from the CSVReader and do the following. What could be the problem ?

package com.opencsv;

import com.opencsv.processor.RowProcessor;
import com.opencsv.validators.LineValidatorAggregator;
import com.opencsv.validators.RowValidatorAggregator;
import org.apache.commons.lang3.StringUtils;

import java.io.IOException;
import java.io.Reader;
import java.util.Locale;

public class ExtendedCSVReader extends CSVReader {
    private final String indicatorToSkipLine;

    ExtendedCSVReader(String indicatorToSkipLine, Reader reader, int line, ICSVParser icsvParser, boolean keepCR, boolean verifyReader, int multilineLimit, Locale errorLocale, LineValidatorAggregator lineValidatorAggregator, RowValidatorAggregator rowValidatorAggregator, RowProcessor rowProcessor) {
        super(reader, line, icsvParser, keepCR, verifyReader, multilineLimit, errorLocale, lineValidatorAggregator, rowValidatorAggregator, rowProcessor);
        this.indicatorToSkipLine = indicatorToSkipLine;
    }

    @Override
    public String getNextLine() throws IOException {
        String result = super.getNextLine();
        if (StringUtils.startsWith(result, indicatorToSkipLine)) {
            return getNextLine();
        }
        return result;
    }
}

Andrew Rucker Jones - 2022-12-14

Please read the discussion from October 27, 2020. This solution does not respect the difference between a line and a record.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DrNickIT - 2022-12-15

I have read this and tested it. I didn't have to put # also after the \n . But probably i'm missing something. For me the whole line was skipped. What I expected.
But to think of it maybe i have to test it on a unix machine.

Last edit: DrNickIT 2022-12-15

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nevermind i created an extension of the BufferedReader with the same functionality:

public class SkipLineBufferedReader extends BufferedReader {
    private String indicatorToSkipLine;

    public SkipLineBufferedReader(Reader in, String indicatorToSkipLine) {
        super(in);
        this.indicatorToSkipLine = indicatorToSkipLine;
    }

    @Override
    public String readLine() throws IOException {
        String result = super.readLine();
        if (StringUtils.startsWith(StringUtils.strip(result), indicatorToSkipLine)) {
            return readLine();
        }
        return result;
    }
}

Ignoring comment lines

Group

Searches

Help

#139 Ignoring comment lines

Discussion