Need help with regular expression: delete all characters before a specific word

ruk4
2014-02-20
2014-02-20
  • ruk4
    2014-02-20

    Hi, I have a text file like the one below. I want to retrieve the string that comes after "lexicon" ph=" and before the next ".
    For example, I have text like this:

                    <t g2p_method="lexicon" ph="d a4 _4"
                            pos="R">đã<syllable ph="d a4" tone="_4">
                            <ph p="d"/>
                            <ph p="a4"/>
                        </syllable>
                    </t>
    
                    <t g2p_method="lexicon"
                        ph="b 72 _2 - s o1 j1 _1 - z uo6a N6a _6a - m 7_X6b t6b _6b"
                            pos="0">Bờ xôi ruộng mật<syllable ph="b 72" tone="_2">
                            <ph p="b"/>
                            <ph p="72"/>
                        </syllable>
                        <syllable ph="s o1 j1" tone="_1">
                            <ph p="s"/>
                            <ph p="o1"/>
                            <ph p="j1"/>
                        </syllable>
                        <syllable ph="z uo6a N6a" tone="_6a">
                            <ph p="z"/>
                            <ph p="uo6a"/>
                            <ph p="N6a"/>
                        </syllable>
                        <syllable ph="m 7_X6b t6b" tone="_6b">
                            <ph p="m"/>
                            <ph p="7_X6b"/>
                            <ph p="t6b"/>
                        </syllable>
                    </t>
    

    The text should become this:


    d a4 _4
    b 72 _2 - s o1 j1 _1 - z uo6a N6a _6a - m 7_X6b t6b _6b

    My regular expression to find this string is (lexicon+)([^p]+)(ph="+)([^"]+)("\n+),
    but I don't know how to delete all the characters before the word "lexicon" and after the next quote (").
    Please tell me how to do this.
    Thanks.

     
    Last edit: ruk4 2014-02-20
  • dail8859
    2014-02-20

    I came up with

    .*?"lexicon"\s+ph\s*=\s*"([^"]+)"
    

    real quick, and it seems to work fairly well. Make sure the ". matches newline" option is checked, then use this as the replacement:

    \1\r\n
    

    or, if you want UNIX line endings, use

    \1\n
    

    It's not perfect, but I think it will get you close enough to what you want, and you seem to know regular expressions well enough to modify mine.
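For anyone who wants to try the same find/replace outside the editor, here is a minimal sketch in Python. It assumes Python's `re` module behaves like the editor's regex engine for this particular pattern, with `re.DOTALL` standing in for the ". matches newline" option. The `SAMPLE` text is an abbreviated copy of the data from the question, and the names `SAMPLE`, `PATTERN` and `result` are made up for the illustration.

```python
import re

# Abbreviated copy of the sample from the question (only the first
# syllable of the second entry is kept).
SAMPLE = '''<t g2p_method="lexicon" ph="d a4 _4"
        pos="R">đã<syllable ph="d a4" tone="_4">
        <ph p="d"/>
        <ph p="a4"/>
    </syllable>
</t>

<t g2p_method="lexicon"
    ph="b 72 _2 - s o1 j1 _1 - z uo6a N6a _6a - m 7_X6b t6b _6b"
        pos="0">Bờ xôi ruộng mật<syllable ph="b 72" tone="_2">
        <ph p="b"/>
        <ph p="72"/>
    </syllable>
</t>
'''

# re.DOTALL plays the role of the ". matches newline" checkbox.
PATTERN = re.compile(r'.*?"lexicon"\s+ph\s*=\s*"([^"]+)"', re.DOTALL)

# Replace each match with its captured ph value plus a UNIX newline.
result = PATTERN.sub(r'\1\n', SAMPLE)

# The first two lines are the wanted strings.
for line in result.splitlines()[:2]:
    print(line)
# d a4 _4
# b 72 _2 - s o1 j1 _1 - z uo6a N6a _6a - m 7_X6b t6b _6b
```

In this sketch, everything after the last match (from `pos="0">` onward) is still present in `result`, because nothing after the final "lexicon" entry matches the pattern. That leftover tail has to be deleted separately.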

    A quick explanation of why my regex deletes all the characters before each match: I used .*? at the beginning to lazily grab all text up to the next "lexicon". That text is part of the match but outside the capture group, so the replacement drops it.
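The effect of the lazy `.*?` can be seen on a toy string. The sketch below is in Python, on the assumption that its `re` module treats lazy and greedy quantifiers the same way the editor does. The string `text` and the variable names are invented for the illustration. It also shows `re.findall` as one possible alternative when the goal is only to collect the values.

```python
import re

# Toy input: two "lexicon" entries with unrelated text around them.
text = 'junk "lexicon" ph="A" junk "lexicon" ph="B" tail'

tail = r'"lexicon"\s+ph\s*=\s*"([^"]+)"'

# Lazy prefix: stops at the first "lexicon" it can, so every entry is
# matched in turn.
lazy = re.sub(r'.*?' + tail, r'\1\n', text, flags=re.DOTALL)
print(repr(lazy))    # 'A\nB\n tail'

# Greedy prefix: swallows as much as possible, so only the last entry
# survives in the capture group.
greedy = re.sub(r'.*' + tail, r'\1\n', text, flags=re.DOTALL)
print(repr(greedy))  # 'B\n tail'

# Collecting the captured values directly avoids the leftover tail.
values = re.findall(tail, text)
print(values)        # ['A', 'B']
```

The `findall` variant needs no prefix at all, since it never has to consume the text between entries.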

     
  • ruk4
    2014-02-20

    Thank you so much, dail, you made my day.
    Your code is totally perfect ^^ it works like a charm :D