I have csv files. They have 7 fields. Some of the fields have text. I am trying to open them in a program called Weka. The text fields are surrounded by double quotes. However sometimes there are double or single quotes inside the text string. In this case Weka won't open the file and produces an error. Can anyone suggest a regular expression to use in notepad++ in the "find and replace" feature, to either escape such a double or single quote or just delete that double or single quote. For example, if the original is
534694458,"2011-11-01 11:56:10","mpelham","Body Worlds Groupon! $25 for Two Tickets to ""Body Worlds & The Brain"" at OMSI - Up to $50... http://t.co/7NG8ThM1","Por<wbr></wbr>tland, OR",0,0
I would like it to be like:
534694458,"2011-11-01 11:56:10","mpelham","Body Worlds Groupon! $25 for Two Tickets to Body Worlds & The Brain at OMSI - Up to $50... http://t.co/7NG8ThM1","Por<wbr></wbr>tland, OR",0,0
Your problem doesn't seem too difficult !
As for the single quote (') isn't the delimiter of your TEXT CSV fields, you
just have to delete them by a simple replacement
Now, for the double quotes ("), we just have to determine in which cases
it's legal :
- a double quote (") is legal if it begins a text field, so
it's preceded by a comma (,)
- a double quote (") is legal if it ends a text field, so
it's followed by a comma (,)
In ALL other cases, it must be DELETED. So :
if the dialog 'Find what' is (?<!,)"(?!,)
and the dialog 'Replace with' is EMPTY
=> ALL the DOUBLE QUOTES, NOT PRECEDED AND NOT FOLLOWED by a
COMMA, will be DELETED
This regular expression with a LOOKBEHIND and a LOOKAHEAD has been tested with
the 6.2.3 version of Notepad++, but I think it'll be OK from the 6.0 version !
I hope I correctly understood your needs and that my post will help you a bit
P.S. You can also find some documentation, about the new PRCE Regular Expressions, used by N++, since the 6.0 version, at the two adresses below :
The FIRST one concerns the syntax of regular expressions in
the SEARCH part
The SECOND one concerns the syntax of regular expressions
in the REPLACEMENT part
This is just another approach, which may prove to be safer.
The double quotes inside the texts in the fields are stored in pairs.
So a double quote character inside the text is recognized from being a pair of them.
You just have to delete all pairs of double quote characters, then.
If they are preceded or followed by a single double quote character, it doesn't matter: you will eventually be left with one (only) if the string of double quote characters consists of an odd number of them.
Assuming all double quote characters preceded by a , or followed by a , may be dangerous. They may not always identify a field/column break. They could be part of the text in the field/column as well.
Hi, Fool4UAnyway and Tomoko,
My search-replacement is just based on the hypothesis of Tomoko, when he said :
The text fields are surrounded by double quotes
So this assertion suppose that the CSV text field of Tomoko's
file doesn't contain any DOUBLE quote (")
Of course, a field may contain a double quote (") itself. But if so, how
to recognize the beginning and the end of the REAL CSV text field ?
Given a CSV line like : "This is","a "very" odd","CSV example"
Can we consider : 3 text fields "This is" "a "very" odd" and "CSV example"
OR 4 text fields "This is" "a " " odd" and "CSV exemple" , with the
word 'very' outside fields ?!
But my FIRST search regex, in my previous post, was too simple because, if a text
field begins or ends a line, it will delete the double quote at the beginning
of a line ( First text field ) and/or the double quote at the end of a line
( Last text field )
So if we use the SEARCH : (?<!(,|\n))"(?!(,|\r))
and REPLACEMENT : NOTHING
the example :
"This is","a "very" odd","CSV example"
"This is","a very odd","CSV example"
and the other example :
"abcde",123456," just a" test "" to see" if ",789012,0,"a""
is transformed in :
"abcde",123456," just a test to see if ",789012,0,"a"
Note that if you test a SINGLE line in a NEW file, my SEARCH-REPLACEMENT is
good ONLY IF that line is terminated with an Windows EOL character ( \r\n )
It is common for CSV files to contain text field values within pairs of double quote characters. Because of that, actual double quote characters inside the field value have to be recognized in some way. This escaping is done by yet another double quote character. So, a pair of double quote characters represents an actual double quote character in the text field. That's why you can delete them by pairs.
This would approach would be a problem if you want to replace the (pairs of) double quote characters by another character: then it really matters if you are at the start of the field value and dealing with the enclosing double quote character.
Your CSV examples aren't valid for the commonality mentioned above.
Also, by enclosing text field values within a pair of double quote characters, it is possible for those texts to contain newline characters, that is, span across multiple lines.
Sign up for the SourceForge newsletter:
You seem to have CSS turned off.
Please don't fill out this field.