Extract Text inside HTML Tag

zfc
2013-08-14
2013-08-15
  • zfc

    zfc - 2013-08-14

    I have a text file containing HTML tags. I wish to extract the text enclosed by .... For example, this is the original text file:
    Line 1: Text 1 Text 2 Text 3 Text 4
    Line 2: Text 5 Text 6

    The expected output is:
    Line 1: Text 2
    Line 2: Text 4
    Line 3: Text 6

    Each piece of text enclosed in a ... tag should be copied to a separate line.
    Is it possible for me to do this using Notepad++?

     
    • Cliff

      Cliff - 2013-08-14

      I have something similar. How do I select all the text between two “searches” in a macro?

      For example, I will find the text “A” and mark it as the beginning of my text selection. Then I will find the text “B”. I want to highlight all the text between “A” and “B” (e.g., 1-5).

      A

      1

      2

      3

      4

      5

      B

       
      • Fool4UAnyway

        Fool4UAnyway - 2013-08-15

        First of all: how many posts are in this thread and what is the number shown in the thread list? It seems replies directly to a single message are not counted...

        For a similar case, the answer is similar as well.

        You simply want to match everything from A up to and including B, and then strip all A's and B's.

        Find:
        YourTagA.*?YourTagB

        If you want to keep only the text between tags A and B, use Text Crawler's Extract function and then remove all A and B tags.
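
        As a rough illustration of the same idea outside an editor, here is a minimal Python sketch. The function name and the sample text are made up for this example. Because the A ... B sample spans several lines, the sketch uses re.DOTALL so that "." also matches line breaks; in Notepad++ the ". matches newline" option in the Find dialog should play the same role, as far as I know.

```python
import re

def extract_between(source, start, end):
    """Return the text found between each start/end marker pair."""
    # Non-greedy (.*?) stops at the nearest end marker, like YourTagA.*?YourTagB.
    # re.DOTALL lets "." match newlines so a span may cross several lines.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    # group(1) is the text without the markers; strip() drops surrounding blank space.
    return [m.group(1).strip() for m in pattern.finditer(source)]

# Hypothetical sample mirroring the A, 1-5, B example from the question.
text = "A\n1\n2\n3\n4\n5\nB\n"
print(extract_between(text, "A", "B"))
```

        Capturing the middle part in a group does the "strip all A's and B's" step in the same pass, which is a design choice of this sketch rather than something the Find dialog does for you.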

         
  • zfc

    zfc - 2013-08-14

    How do I disable the HTML formatting function in this discussion forum? The above text should not be in bold. In the original text file, it is the HTML tag < b >...< /b >
    Line 1: Text 1 < b >Text 2< /b > Text 3 < b >Text 4< /b >
    Line 2: Text 5 < b >Text 6< /b >

     
  • Fool4UAnyway

    Fool4UAnyway - 2013-08-14

    You could use a tool like Text Crawler to Extract the <_b>...</b>_ matches, each to its own line, and then simply remove those tags afterwards.

    http://www.digitalvolcano.co.uk

    Find: "<_b>[^>]+</b>_"
    [Extract]
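
    For anyone without TextCrawler at hand, the same two steps (extract the matches, then remove the tags) can be sketched in Python. This is only an illustration under assumptions: the function name is invented, the sample is the text from the question with the tags written without spaces, and the pattern is the one above with the underscores left out.

```python
import re

# Sample from the question, with the <b> tags written normally.
sample = (
    "Line 1: Text 1 <b>Text 2</b> Text 3 <b>Text 4</b>\n"
    "Line 2: Text 5 <b>Text 6</b>\n"
)

def extract_bold(source):
    """Return the text of every <b>...</b> match, one list item per match."""
    # Step 1: extract every complete <b>...</b> match.
    matches = re.findall(r"<b>[^>]+</b>", source)
    # Step 2: remove the opening and closing tags from each match.
    return [re.sub(r"</?b>", "", match) for match in matches]

# Print each extracted text on its own line.
for item in extract_bold(sample):
    print(item)
```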

    You could also extract those matches only by using a somewhat more advanced regular expression containing look-behind and look-ahead expressions.
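
    A possible form of such an expression is (?<=<b>)[^<]+(?=</b>). It is my own guess at what is meant here, not something taken from TextCrawler. The minimal Python sketch below shows how it behaves on the sample from the question; the names are made up for the example.

```python
import re

# The look-behind (?<=<b>) requires an opening tag just before the match and
# the look-ahead (?=</b>) requires a closing tag just after it; neither tag
# becomes part of the match, so no separate clean-up step is needed.
BOLD_TEXT = re.compile(r"(?<=<b>)[^<]+(?=</b>)")

def extract_bold_lookaround(source):
    """Return only the text between <b> and </b>, without the tags."""
    return BOLD_TEXT.findall(source)

sample_text = (
    "Line 1: Text 1 <b>Text 2</b> Text 3 <b>Text 4</b>\n"
    "Line 2: Text 5 <b>Text 6</b>\n"
)
print("\n".join(extract_bold_lookaround(sample_text)))
```

    Whether a given tool accepts look-behind depends on its regular expression engine, so treat this as something to test rather than rely on.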

    By the way, I surrounded each of the < and > symbols with a pair of underscore characters. The forum interprets these as "show the text in between in italics" and leaves the text between them as it is. (I hate auto-spell-checking.)

    I notice only the outside underscore characters are "interpreted" in the intended way. Any underscore characters in the italic texts are to be ignored. Just pretend they are not there (don't use or copy those).

     
    Last edit: Fool4UAnyway 2013-08-14
    • zfc

      zfc - 2013-08-15

      Thank you! Your regular expression and TextCrawler did exactly what I want!