Looking for some assistance with the following scenario: I have 9000 htm files that all have a main title in an h1 tag. I have been tasked with removing the main title from each htm file, as the titles are no longer required. These files were created in Word (saved as web filtered htm…don’t ask, not my doing) and have some span tags thrown in.
Question: is there a Regex I can use in Notepad++ to find the 1st occurrence of the h1 tag & its content and remove it (but no other h1 tags on the page)?
Example of h1 tag (or it may be on 2 lines)
<h1><span lang=EN-CA>Main Title</span></h1>
I have been able to come up with a regex, which works, but it locates all h1 tags in the file/page:
(?s)<h1>.+?</h1>
Any suggestions?
EDIT: it seems my example and regex are not displaying as code. Sorry about that and hope it is displaying correctly now.
Last edit: segarn 2015-06-17
Hello segarn,
I think that I found something that could do the job :-))
Just one hypothesis : Do all your HTML files begin with the string <!DOCTYPE ? As I suppose so, type the following search/replacement regexes :
SEARCH (?s)<!DOCTYPE.*?\K\h*<h1>.*?</h1>(?-s).*\R
REPLACE Nothing
Et voilà !
To sum up, this S/R will remove :
The entire line containing the first <h1> tag of a file ( provided that only blank characters precede the tag )
The entire line containing the first </h1> tag of a file
And, of course, all the lines between these two specific lines
Some explanations :
The string "<!DOCTYPE" is used as an anchor. As this string happens once only, at the very beginning of any HTML file, there will be one match, only, per file :-)
Normally, the syntax \A is a PCRE assertion that matches the very beginning of a file, which we should have used as the default anchor. But, because of a bug of the present N++ regex engine, on backward assertions, the equivalent regex (?s)\A.*?\K\h*<h1>.*?</h1>(?-s).*\R wrongly finds all the ranges <h1>.......</h1> :-(((
The \h* syntax represents any sequence, possibly empty, of horizontal blank characters, such as SPACE, TABULATION or NBSP ( the Non Breaking Space character, of code \xA0 ), before the <h1> tag
When a \K syntax occurs, everything previously matched before \K is forgotten, so the first matched characters are, now, the possible sequence of horizontal blanks located just before the first <h1> tag.
The form .*?, between the two tags <h1> and </h1>, tries to match the smallest range between the two consecutive tags.
Once we have detected the whole first range <h1>......</h1>, even on several lines, we must take all the remaining characters of the current line ( where </h1> occurs ! ), as well as its "End of Line" characters !
To do so, I select the default behaviour of the regex engine, with (?-s), that is to say : the dot matches any standard character only. Finally, the \R syntax matches the End of Line characters ( \r\n for a Windows file, \n for a Unix file, or \r for a Mac file )
Due to the \K form, only the block of lines where the first two tags <h1> and </h1> occur is actually matched. Therefore, as you want to remove this main title, the Replace zone must stay empty !
To end with that topic, one more sensible remark : make a test on a few files, copied from the original ones !!
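For such a test outside Notepad++, here is a minimal Python sketch of the same idea. It is only an approximation, not the S/R itself: Python's re module has no \K, \h or \R, so a capturing group, the class [ \t\xa0] and an explicit End of Line alternation stand in for them. The names FIRST_H1_BLOCK, remove_main_title and the sample text are made up for the illustration.

```python
import re

# Python's re module has no \K, \h or \R, so each one is spelled out by hand.
FIRST_H1_BLOCK = re.compile(
    r"(?s:(<!DOCTYPE.*?))"  # group 1: the anchor and everything before the title (stands in for \K)
    r"[ \t\xa0]*"           # horizontal blanks before <h1> (approximates \h*)
    r"<h1>(?s:.*?)</h1>"    # the first <h1>...</h1> range, possibly on several lines
    r"[^\r\n]*"             # the rest of the line holding </h1> (approximates (?-s).*)
    r"(?:\r\n|\n|\r)"       # one End of Line sequence (approximates \R)
)


def remove_main_title(html: str) -> str:
    """Remove the first <h1> block and put back the text matched before it."""
    return FIRST_H1_BLOCK.sub(lambda m: m.group(1), html, count=1)


# Assumed sample: a title split over two lines, then a second <h1> that must stay.
sample = (
    "<!DOCTYPE html>\r\n"
    "<html><body>\r\n"
    "  <h1><span lang=EN-CA>Main\r\n"
    "Title</span></h1>\r\n"
    "<p>Text</p>\r\n"
    "<h1>Second</h1>\r\n"
    "</body></html>\r\n"
)
cleaned = remove_main_title(sample)
print(cleaned)
```

Because group 1 is written back, only the blanks, the title range and the end of its line disappear, which mirrors what \K does in the S/R. A text without the <!DOCTYPE anchor is left unchanged.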
Hope that this regex will be the right one, for you :-))
Best Regards,
guy038
P.S. :
You'll find good documentation about the new Boost C++ Regex library ( similar to the Perl Compatible Regular Expressions ), used by Notepad++ since the 6.0 version, at the TWO addresses below :
http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html
The FIRST link explains the syntax, of regular expressions, in the SEARCH part
The SECOND link explains the syntax, of regular expressions, in the REPLACEMENT part
Last edit: THEVENOT Guy 2015-06-18
Thanks so much THEVENOT Guy. I will give this a try and let you know how it turns out.
Hi Guy,
why not go with the simplest solution for this:
Find:
Explanation:
Find three groups of strings:
1) (.*?) everything before heading h1, shortest string search
2) heading h1, again shortest string search
3) everything after heading h1, longest string search
Replace with groups 1 and 3. (1 and 2 really, because the heading part is not saved in a variable, hence no parentheses around it, so the third part becomes group 2)
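The Find expression itself is not shown in this thread, so the pattern below is only an assumption built from the three-part description above, sketched in Python rather than in Notepad++. THREE_PARTS, drop_first_h1 and the sample page are hypothetical names for the illustration.

```python
import re

# Hypothetical pattern: it follows the three-part description only,
# it is not the original Find expression.
THREE_PARTS = re.compile(
    r"(.*?)"          # 1) everything before the heading, shortest match
    r"<h1>.*?</h1>"   # 2) the heading itself, shortest match, not captured
    r"(.*)",          # 3) everything after the heading, longest match
    re.DOTALL,        # let the dot cross line breaks
)


def drop_first_h1(html: str) -> str:
    """Rebuild the text from the two captured parts, leaving the heading out."""
    return THREE_PARTS.sub(lambda m: m.group(1) + m.group(2), html, count=1)


page = "<p>intro</p>\n<h1>Main Title</h1>\n<p>body</p>\n<h1>Other</h1>\n"
print(drop_first_h1(page))
```

Since the last group is greedy, it swallows the rest of the text, later headings included, so a single match covers the whole file and only the first heading is dropped. Unlike the other approach, the line that held the heading stays behind as an empty line.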
Note to segarn:
Notepad++ can execute a replace operation on a group of files (go to the Find in Files tab). But before you do that, make a backup first; you don't want to manually fix 9000 files if something goes wrong.
Best regards,
Loreia
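For anyone who would rather script the batch run than use Find in Files, here is a rough Python sketch of the "backup first, then replace" advice. It is an assumption-laden illustration, not part of either reply: the function name and folder names are hypothetical, it reuses the regex from the original question, and it works on bytes so that the encoding and line endings of the Word-generated files are left as they are.

```python
import re
import shutil
from pathlib import Path

# The regex from the original question, as a bytes pattern.
H1_BLOCK = re.compile(rb"(?s)<h1>.+?</h1>")


def strip_first_h1_in_folder(folder: Path, backup: Path) -> int:
    """Copy `folder` to `backup`, then remove the first <h1> block of each .htm file.

    Returns the number of files that were changed.
    """
    # Backup first. copytree raises FileExistsError if `backup` already exists,
    # so an older backup is never overwritten. Keep `backup` outside `folder`.
    shutil.copytree(folder, backup)

    changed = 0
    for path in sorted(folder.rglob("*.htm")):
        data = path.read_bytes()
        # count=1 restricts the replacement to the first match in the file.
        new_data, hits = H1_BLOCK.subn(b"", data, count=1)
        if hits:
            path.write_bytes(new_data)
            changed += 1
    return changed
```

A call could look like strip_first_h1_in_folder(Path("pages"), Path("pages_backup")), where both folder names are placeholders. This variant removes only the tag range itself, so the line that held the title remains as an empty line.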
Thanks Loreia! I will try this one also. Backups are always a must;)
S
Both replies work brilliantly! Not sure if one method is "better" than the other.
Thanks again for both of your help & info.
S