
Copy Only certain Sections

Anonymous
2011-09-24
2012-11-13

  • Anonymous
    2011-09-24

    I work with thousands of web pages daily, and I am looking for a plugin or some other way to copy by tags (e.g. <div>, <body>).

    I have several thousand pages, and the only information I want to copy is what sits between <div class="Text"> and </div>. I'd like this done for each page without having to select the text by hand. How would I go about doing this?

    Please answer only if you have or know of a way, whether it be a command, a plugin or something within Notepad++.
    Please do not answer if you don't know, or if you are just going to tell me to use Ctrl+C and Ctrl+V or similar!

     
  • I'm assuming what's between the <div class="Text"> and </div> tags could be multi-line?

    If it's XHTML, the "right" way to do it is XSLT

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="text" />
        <xsl:template match="div[@class='Text']">
            <xsl:value-of select="."/>
        </xsl:template>
    
        <xsl:template match="*">
            <xsl:apply-templates select="*" />
        </xsl:template>
    </xsl:stylesheet>
    

    The XML Tools plugin can run that XSLT file over your HTML and it will produce just the contents of every <div class="Text"> </div> element. 

    You could also use the XPatherizer plugin, and copy the results from the XPath query //div but that will give you more XML as output, which I expect you would then need to run another XSLT over to get what you actually want.

    If it's not XHTML, then you've got two options - either first use HTMLTidy to make it XHTML, or if the contents of the div aren't across multiple lines, then a rough and ready regex to search it.  I say rough and ready, because no matter how good you think the regex is, there's usually a case where it will fall over.

    First do a Search

    Search : .*<div class="Text">(.*)</div>.*
    Mode: Regular expression

    (The regex could be simpler for this step, but this way it's the same regex for both the first and second stages)

    Tick the "mark line", and click "Find all".  That marks every line with such a div on it.
    Search, Bookmark, Inverse bookmarks.
    Search, Bookmark, Delete Bookmarked lines

    Search and replace,
    Search: .*<div class="Text">(.*)</div>.*
    Replace: \1
    Mode: Regular expression
    Replace all.
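
    For anyone who would rather script it, the same two stages (keep only the matching lines, then reduce each to group 1) can be sketched in Python. This is only a rough illustration under the same single-line assumption; the function name extract_single_line and the sample text are made up for the example.

```python
import re

# Same pattern as in the Search and Replace steps above.
PATTERN = re.compile(r'.*<div class="Text">(.*)</div>.*')

def extract_single_line(text):
    """Keep only lines containing the div, each reduced to capture group 1."""
    results = []
    for line in text.splitlines():
        match = PATTERN.fullmatch(line)
        if match:
            results.append(match.group(1))
    return results

# Hypothetical sample page, one div per line.
sample = '''<html><body>
<p>header</p>
<div class="Text">first block</div>
<p>noise</p>
  <div class="Text">second block</div> trailing
</body></html>'''

print(extract_single_line(sample))
```

    Like the Notepad++ version, this falls over as soon as the div spans several lines or contains other divs.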

    If you've got divs inside your Text class div, or divs around your Text class div then this is going to fall over.  The XSLT route you can choose what you want to do  - at the moment, it will just extract the text inside each div

    so if you've got

    <div class="main">blah blah <div class="Text">this is what<div class="bold">you</div> want</div> blah blah</div>
    

    Then you'll get

    this is whatyou want
    

    The search and replace solution would give you

    this is what<div class="bold">you</div> want</div> blah blah
    

    Notice the blah blah on the end.   Someone could dive in and say "ooh, the new version's got a non-greedy operator", but that will then give you

    this is what<div class="bold">you
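
    The greedy and non-greedy results above can be reproduced outside Notepad++ with Python's re module. This is just a small demonstration on the example line; the variable names are arbitrary.

```python
import re

line = ('<div class="main">blah blah <div class="Text">this is what'
        '<div class="bold">you</div> want</div> blah blah</div>')

# Greedy (.*) runs on to the LAST </div> on the line.
greedy = re.sub(r'.*<div class="Text">(.*)</div>.*', r'\1', line)

# Non-greedy (.*?) stops at the FIRST </div>, which closes the inner div.
lazy = re.sub(r'.*<div class="Text">(.*?)</div>.*', r'\1', line)

print(greedy)
print(lazy)
```

    Neither variant can tell which </div> closes the Text div, which is why nesting needs either XSLT or some form of depth counting.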
    

    If it's not XHTML, and the divs are multi-line, then you could use Python Script to run a multiline regular expression over it (see editor.pymlreplace), but you've got similar issues if there's other stuff inside the div.
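
    If a script outside Notepad++ is acceptable, one way around both the multi-line and the nested-div problems is to let an HTML parser track the nesting. The sketch below uses html.parser from the Python standard library; the class name TextDivExtractor and the helper extract_text_divs are invented for this example, and it assumes the class value is spelled exactly "Text".

```python
from html.parser import HTMLParser

class TextDivExtractor(HTMLParser):
    """Collect the text inside every <div class="Text">, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth inside the current target div
        self.current = []   # text pieces of the div being collected
        self.blocks = []    # finished blocks, one string per target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a div nested inside the target div
        elif "Text" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering a target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:  # the target div just closed
                self.blocks.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)

def extract_text_divs(html):
    parser = TextDivExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks

sample = ('<div class="main">blah blah <div class="Text">this is\n'
          'what <div class="bold">you</div> want</div> blah blah</div>')
print(extract_text_divs(sample))
```

    For several thousand pages, the same function could be called on the contents of each file in a folder, for instance by looping over pathlib.Path.glob("*.html").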

    Good luck,
    Dave.

    PS.  A really great way to get no answers to a question, is to use the words "Please do not answer" in your question.  Smart questions

     
  • Vera
    2011-09-24

    @Dave
    Extraordinary rehashed answer to a humiliating request.

    Google-accounts would be a fool not to continue in exactly this manner.

    Still I much appreciate your reply for myself and others to benefit from :-)

     

  • Anonymous
    2011-09-27

    Thank you so much for your help! I will try this!

     
  • THEVENOT Guy
    2012-10-28

    Hello,

    By now, I expect you've found a suitable solution to your problem
      of extracting the relevant text between the tags  <div class='Text'>  and  </div>

    Just for fun, here is another way to get it !

    ONE HYPOTHESIS must hold :

      For example, in the structure  <div class="xxxxxx">………</div>………<div class="yyyyyy">………</div>,

         any line break MUST fall inside one of the ZONES '………', that is to say :

            - in a zone BEGINNING with > and ENDING with </div>

            - BETWEEN two BLOCKS <div class="xxxxxx">………</div>

    If so,

      1) COPY ALL the sample text below into a NEW file


    <div class="main">AAAAAAA<div class="text">IMPORTANT TEXT</div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text">TEXT SPLIT
    IN THREE
    LINES</div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text">AN IMPORTANT TEXT</div>NO GOOD<div class="text"> TO <div class="bold">LEARN</div> VERY</div>STILL NO GOOD<div class="text"> CAREFULLY</div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text">THIS IS <div class="bold">AN IMPORTANT</div> TEXT</div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text"><div class="bold">A TEXTE</div></div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text">THIS IS <div class="bold">A<div class="italic"> VERY</div> IMPORTANT</div> TEXT</div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text">THIS IS <div class="bold">A<div class="italic"> VERY</div> IMPOR<div class="underlined">TANT </div>TEXT</div> TO READ</div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text">THIS IS <div class="bold">AN IMPORTANT</div> TEXT<div class="italic"> ABOUT</div> WAR</div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text">THIS IS <div class="bold">A<div class="italic"> VERY</div> IMPOR<div class="underlined">TANT </div>TEXT TO </div> THINK ABOUT</div>NO GOOD<div class="text"> THAT <div class="bold"> THE</div> VERY</div>STILL NO GOOD<div class="text"> END</div>ZZZZZZ</div>

    <div class="main">AAAAAAA<div class="text">

    THIS IS <div class="bold">A
    <div class="italic"> VERY</div> IM

    POR<div class="underlined">TANT </div>TEXT TO </div> THINK ABOUT</div>
    NO GOOD<div class="text"> TH

    AT <div class="bold"> THE</div> VERY</div>STILL NO
    GOOD<div class="text"> END
    </div>ZZZZZZ</div>


      2) In this NEW tab, type CTRL-H to open the REPLACE dialog

      3) SELECT the radio button 'Regular expression'

      4) If necessary, TICK the box 'Match case'

      5) RUN the FIVE search-and-replace passes below, in that ORDER, on the text of the NEW file :

    FIND             <div class
    REPLACE          #                                   REPLACE  <div class  with the CHARACTER  #

    FIND             </div>
    REPLACE          @                                   REPLACE  </div>      with the CHARACTER  @

    FIND             ^.*?(#="Text")
    REPLACE           \1                                DELETE ALL text BEFORE the FIRST string  #="Text" of a LINE

    FIND             .*?(#([^#@]++|(?1))+@)*
    REPLACE           \1                                EXTRACT ALL the LONGEST MATCHED series  #="Text"……@ , with

                                                                     POSSIBLE NESTED and/or ADJACENT series of  #…..@, INSIDE

    FIND             .*?(?<=[>@])([^#@]+)(?=[#@])|.*\Z
    REPLACE              \1                             EXTRACT ALL the text BEGINNING after the CHARACTER  >  OR  @

                                                                                                             and ENDING before the CHARACTER  #  OR  @

                                                                     ( EVEN MULTI-line text ) and DELETE the LAST character @

      NOTE : The FOURTH search-replacement uses a RECURSIVE CALL (?1) to the SUB-PATTERN  #([^#@]++|(?1))+@

      => Finally, we obtain, below, ALL UPPERCASE text which is INSIDE the MATCHED zones  <div class = "text">…..</div>

    IMPORTANT TEXT

    TEXT SPLIT
    IN THREE
    LINES

    AN IMPORTANT TEXT TO LEARN VERY CAREFULLY

    THIS IS AN IMPORTANT TEXT

    A TEXTE

    THIS IS A VERY IMPORTANT TEXT

    THIS IS A VERY IMPORTANT TEXT TO READ

    THIS IS AN IMPORTANT TEXT ABOUT WAR

    THIS IS A VERY IMPORTANT TEXT TO  THINK ABOUT THAT  THE VERY END

    THIS IS A
    VERY IM

    PORTANT TEXT TO  THINK ABOUT
    TH

    AT  THE VERY
    END
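
    The recursive call is what lets the fourth pass pair each opening tag with its own closing tag. Python's built-in re module has no recursion, so the sketch below imitates the idea with an explicit depth counter over the opening and closing div tags. It is only an illustration under the same assumptions as the sample file: the function name balanced_text is hypothetical, and the tags are assumed to look exactly like <div class="..."> and </div>.

```python
import re

# One token per opening <div class="..."> or closing </div>.
TOKEN = re.compile(r'<div class="([^"]*)">|</div>')

def balanced_text(html, wanted="text"):
    """Return the text inside every <div class="text">, nested divs included."""
    out = []
    depth = 0   # 0 = outside any wanted div
    pos = 0     # end of the previous token
    for tok in TOKEN.finditer(html):
        if depth:
            out.append(html[pos:tok.start()])  # text since the previous tag
        if tok.group(0) == "</div>":
            if depth:
                depth -= 1
        elif depth:
            depth += 1                         # nested div inside a wanted div
        elif tok.group(1) == wanted:
            depth = 1                          # entering a wanted div
        pos = tok.end()
    return "".join(out)

sample = ('<div class="main">AAAAAAA<div class="text">AN IMPORTANT TEXT</div>'
          'NO GOOD<div class="text"> TO <div class="bold">LEARN</div> VERY</div>'
          'STILL NO GOOD<div class="text"> CAREFULLY</div>ZZZZZZ</div>')
print(balanced_text(sample))
```

    On the third sample line this prints the same text as the corresponding result line above.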

    I've written a TUTORIAL about the PCRE Regular Expressions ( Perl Compatible Regular Expressions )
      used in Notepad++ since version 6.0.

    As I'm French, the whole manual is written in French, but you can still pick up some tricks and
      explanations from the lists and examples throughout the tutorial.

    Christian Cuvier ( cchris ), a very well-known contributor, allowed me to host my tutorial
      on his personal site.

    So, you can download this TUTORIAL, in 3 versions (.txt, .pdf, .html), at the address below :

           http://oedoc.free.fr/Regex/TutorielRegex.zip

    I hope it'll be useful to you

    Cheers !

    guy038

    P.S. You can also find some documentation about the new PCRE Regular Expressions used by N++ at the
           two addresses below :

         http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

         http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

         The FIRST one concerns the syntax of regular expressions in the SEARCH part

         The SECOND one concerns the syntax of regular expressions in the REPLACEMENT part

     
  • THEVENOT Guy
    2012-10-28

    Hello,

    I just forgot one thing :

    Of course, the TWO characters # and @, used as LIMITS, in my example, MUST NOT exist in your ORIGINAL file !

    Cheers !

    guy038