Need help with regular expression: delete all characters before a specific word

  • ruk4

    Hi, I have a text file like the one below. I want to retrieve the string that comes after "lexicon" ph=" and before the next ".
    For example, I have text like this:

                    <t g2p_method="lexicon" ph="d a4 _4"
                            pos="R">đã<syllable ph="d a4" tone="_4">
                            <ph p="d"/>
                            <ph p="a4"/>
                    <t g2p_method="lexicon"
                        ph="b 72 _2 - s o1 j1 _1 - z uo6a N6a _6a - m 7_X6b t6b _6b"
                            pos="0">Bờ xôi ruộng mật<syllable ph="b 72" tone="_2">
                            <ph p="b"/>
                            <ph p="72"/>
                        <syllable ph="s o1 j1" tone="_1">
                            <ph p="s"/>
                            <ph p="o1"/>
                            <ph p="j1"/>
                        <syllable ph="z uo6a N6a" tone="_6a">
                            <ph p="z"/>
                            <ph p="uo6a"/>
                            <ph p="N6a"/>
                        <syllable ph="m 7_X6b t6b" tone="_6b">
                            <ph p="m"/>
                            <ph p="7_X6b"/>
                            <ph p="t6b"/>

    The text should become:

    d a4 _4
    b 72 _2 - s o1 j1 _1 - z uo6a N6a _6a - m 7_X6b t6b _6b

    My regular expression for finding this string is (lexicon+)([^p]+)(ph="+)([^"]+)("\n+),
    but I don't know how to delete all the characters before the word "lexicon" and after the next quote (").
    Please tell me how to do this.

    Last edit: ruk4 2014-02-20
  • dail8859

    I came up with


    real quick and it seems to work fairly well. Make sure the ". matches newline" option is set and replace it with


    or, if you want UNIX line endings, use


    Not perfect, but I think it will get you close enough to what you want; if not, you seem to know regular expressions well enough to modify mine.

    A quick explanation of why my regex deletes all the characters before the word: I used .*? at the beginning to lazily grab all the text up to the next occurrence of lexicon. Because that text is part of the match, it is removed when the match is replaced.
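
    For anyone who wants to try the lazy .*? idea outside Notepad++, here is a minimal Python sketch. It illustrates the approach only and is not necessarily the same expression as the one used in Notepad++: the pattern, the variable names, and the abridged sample text are my own assumptions, and re.DOTALL stands in for the ". matches newline" option.

```python
import re

# Sample input, abridged from the question (only the first syllable of each <t> is kept).
text = '''<t g2p_method="lexicon" ph="d a4 _4"
        pos="R">đã<syllable ph="d a4" tone="_4">
        <ph p="d"/>
        <ph p="a4"/>
<t g2p_method="lexicon"
    ph="b 72 _2 - s o1 j1 _1 - z uo6a N6a _6a - m 7_X6b t6b _6b"
        pos="0">Bờ xôi ruộng mật<syllable ph="b 72" tone="_2">
        <ph p="b"/>
        <ph p="72"/>
'''

# .*?                 lazily consumes everything before the next 'lexicon"'
# \s+ph="             skips the space or line break before the ph attribute
# ([^"]*)             captures the attribute value (group 1)
# (?:(?!lexicon).)*   consumes the rest, stopping just before the next "lexicon"
pattern = re.compile(r'.*?lexicon"\s+ph="([^"]*)"(?:(?!lexicon).)*', re.DOTALL)

# Replace each match by its captured value plus a UNIX line ending.
result = pattern.sub(r'\1\n', text)

print(result, end="")
# d a4 _4
# b 72 _2 - s o1 j1 _1 - z uo6a N6a _6a - m 7_X6b t6b _6b
```

    The last part of the pattern is there so that the text after the final ph="..." is removed too; without it, whatever follows the last match would be left in the output. In Python specifically, collecting the captured values with re.findall and joining them would be a simpler route to the same result.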

  • ruk4

    Thank you so much, dail, you made my day.
    Your code is totally perfect ^^ It works like a charm :D