The Unicon Project / Discussion / General Discussion: String Scanning, Regular Expressions or SNOBOL Pattern Matching

Hi there,

I would like some advice on the best way to tackle a programming task.

I have a bunch of .htm pages that I want to convert into LaTeX.

All of them seem to follow the same structure, so essentially what I'd like to do is:

Look for patterns that I want to keep and do something with them. Some examples:
a. transform <TITLE>A title</TITLE> to \chapter{A title}
b. transform Italic text (wrapped in italic tags) to \emph{Italic text}
c. transform Some text (wrapped in a tag) to Some text[newline]

In reality the tags will be a bit more complex than this, since the HTML seems to be generated and there's a lot of .. tags whose contents are of no interest to me since they don't contain any text.
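
For the simple cases listed above, a single rule might be sketched as follows. The procedure name retag and its parameters are hypothetical, and the sketch assumes that tags appear exactly as given (same case, no attributes) and that every opening tag has a matching closing tag.

```unicon
# Hypothetical helper: rewrite every otag...ctag pair in s as prefix...suffix.
procedure retag(s, otag, ctag, prefix, suffix)
  local result
  result := ""
  s ? {
    # Copy everything up to the next opening tag, then convert the tagged text.
    while result ||:= tab(find(otag)) do {
      move(*otag)                                      # skip the opening tag
      result ||:= prefix || tab(find(ctag)) || suffix  # keep the text inside
      move(*ctag)                                      # skip the closing tag
    }
    result ||:= tab(0)                                 # copy whatever is left
  }
  return result
end

procedure main()
  # Sample input is made up for illustration.
  write(retag("<TITLE>One</TITLE> and <TITLE>Two</TITLE>",
              "<TITLE>", "</TITLE>", "\\chapter{", "}"))
end
```

Because the while loop keeps scanning from the current position, this should rewrite every occurrence in the string, not just the first, and print \chapter{One} and \chapter{Two} for the sample input.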

I started working on this using string scanning but didn't get very far, because my experience with it so far has been with rather short strings. In this case I think I need to do things in a big loop.

$define collect tab

procedure main()
  dir := "/home/northbae/Documents/magus/www.sacred-texts.com/grim/magus/"
  filename := "ma"
  extension := ".htm"

  # Process files ma100.htm to ma167.htm
  every number := 100 to 167 do {
    file := dir || filename || number || extension

    # Open the current file; stop() writes the message and ends the program
    currentFile := open(file, "r") | stop("Could not open file ", file)

    # Read the entire file as a string (assumes it is under 500000 characters)
    text := reads(currentFile, 500000)
    close(currentFile)

    # Scan and extract items of interest
    # This here is where I'm stuck. See comment below at rest := tab(0)
    text ? {
      tab(find("<TITLE>"))
      move(7)
      # find() locates a substring; upto() takes a cset and would stop
      # at the first occurrence of any one of those characters
      title := collect(find("</TITLE>"))
      write("\\chapter{", title, "}")

      tab(find("<HR>\n</P>"))
      move(9)

      # I think this needs to be text := tab(0) and have this entire thing be a loop, like
      # loop
      #     Scan
      #     extract
      #     make new subject be the rest of the string
      # end loop
      rest := tab(0)
    }
  }
end

But after reading a bit about SNOBOL-style pattern matching, I think this might be an easier way to go about it, if I want to look for patterns and replace them with other ones, or rearrange pieces of text in certain ways.

But I'm still unsure how you go about this in a loop, as the patterns I'm looking for could be found more than once in the document. What I want to do is scan for a pattern, do something, then reset the subject to be the rest of the subject, then loop back to be able to check for those patterns again and again.
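
One way to read the loop described here, as a minimal sketch: in Icon and Unicon the scanning position (&pos) advances with every successful tab() or move(), so a while loop inside a single scanning expression can test the same patterns again and again without resetting the subject. The sample string, the <I> and <P> tag names, and the LaTeX output are assumptions for illustration, and tags are assumed to be uppercase with no attributes.

```unicon
procedure main()
  # Made-up sample input; the real pages will be messier.
  html := "<TITLE>A title</TITLE><P>Plain and <I>Italic text</I>.</P>"

  html ? {
    # Each pass consumes something, so the loop ends when the subject is used up.
    while not pos(0) do {
      if ="<TITLE>" then {
        writes("\\chapter{", tab(find("</TITLE>")), "}\n")
        ="</TITLE>"
      }
      else if ="<I>" then {
        writes("\\emph{", tab(find("</I>")), "}")
        ="</I>"
      }
      else if ="</P>" then
        writes("\n")                  # end of paragraph becomes a newline
      else if ="<" then
        tab(find(">") + 1)            # skip any other tag entirely
      else
        writes(tab(upto('<') | 0))    # copy plain text up to the next tag
    }
  }
end
```

For the sample string this should print \chapter{A title} on one line and Plain and \emph{Italic text}. on the next. Adding another rule is a matter of adding another else if branch ahead of the generic tag-skipping branch.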

For some context, here are examples of the htm files I intend to scan: https://www.dropbox.com/sh/68mv6j8evgmv1ok/AACHRY9xUnReX45vM-IZRoQNa?dl=0

I think that just understanding how to set up a loop that scans for patterns and then resets the subject to the rest of the string (from the current position to tab(0)) would be helpful enough to allow me to keep at it.

Regards, and thanks for any pointers.

yves
