Fast Global Search

Help
Enzo
2007-05-17
2012-12-29
  • Enzo

    Enzo - 2007-05-17

    Hello,

    I'm looking for a better (faster) method than ^%GSE.
    I have a huge global (40 Mb) ...

    Perhaps maintaining a text file and using grep ... ;-)
    Or something else?

    Thanks in advance,

    Enzo

     
    • Enzo

      Enzo - 2007-05-18

      Well,

      I could use the 'sed' command to search/modify a file that is a
      real-time replica of a working global. Once sed has written the
      desired results to a new file (or the same one), I could read that
      file and import its lines into a temp global, using Rob Tweed's
      example (Fast Pipes).

      Is there perhaps a better method of doing that?

      Regards,

      Enzo

       
    • James A Self

      James A Self - 2007-05-22

      Enzo,
      A size of 40Mb is not particularly large for a global. I am accustomed to working with globals hundreds of times larger than that in the context of a hospital information system and I expect that some people following this forum commonly work with globals hundreds of times larger again.

      Even extremely large globals can be searched very quickly, if the data records are cross referenced by their content. For instance, a search module that I wrote typically completes searches in just a few seconds when performing queries with possibly complex criteria involving diagnoses, patient age, gender, procedures performed, visit date, etc. on a database including over a million patient visit records.

      If you can give an example indicating the structure of your data and the kind of searches you need to perform, we can give more detailed suggestions.

       
    • Roger Partridge

      Roger Partridge - 2007-05-22

      Enzo:

      It's software, so there are many options. Some of the most prominent are:

      The M approach mentioned by Jim - if you know what you're going to want to search - create and maintain a [cross-]index for it. Take Jim up on his offer of counsel.

      If the search is ad hoc, you can use $ORDER() if you know where the class of information being sought is found, or $QUERY() if you're searching willy-nilly, to go rooting around for your target, probably with $FIND(). This amounts to the approach taken by ^%GSE, but you can make it perform better by making it more specific to your problem (minimizing indirection is very good).

      If such searches are a rare event and pure brute force is the way to go, you can dump data out to a flat file (MUPIP EXTRACT, ZWRITE, ^%GO, etc) and go after your target with a non-M tool.
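
      The ad hoc scan in the second option might look something like the minimal sketch below. It assumes a hypothetical one-level global of the form ^DATA(key)=text; the label scan and the global name are illustrative only and are not taken from ^%GSE.

```mumps
scan(target) ;brute-force substring scan; scan and ^DATA are hypothetical names
 ;-- assumes every node has the form ^DATA(key)=text, one subscript level --;
 new key
 set key=""
 ;$FIND returns 0 when target does not occur in the node's text
 for  set key=$o(^DATA(key)) quit:key=""  if $f(^DATA(key),target) write key,?12,^DATA(key),!
 quit
```

      Called as  do scan("needle")  it reads every node once, so the cost grows with the size of the global.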

      Roger

       
    • Enzo

      Enzo - 2007-06-03

      Hello Jim and Roger,

      the global I have is 400 Mb (sorry!), but it still seems to be small compared to others! ;-)
      The structure is the simplest possible, a global of references with 1 index and 1 field:

      ^REF(cod)="DESCRIPTION_LARGE_80_CHARACTERS"

      But the users continually search for a portion of text inside this description,
      which makes the search too slow because the $ORDER loop has to read all the references.

      For now, I have made a (brute-force) solution, which is maintaining this global
      outside the database, as a text file. The search is done by 'sed' and the
      results are read back from GT.M ... but it works for me.

      Could you give me an example of how to convert this 'ugly-search' into a 'pretty-and-cool-search'? ;-)

      Thank you again,

      Enzo

       
      • James A Self

        James A Self - 2007-06-05

        Enzo,
        If the text items in your data are less than 80 characters each and the total is 400MB, then it seems that you have 5 million items or more. I am curious as to what search times you observe with the queries and methods you have tried.

        If your sed based search takes more than a few seconds, then a simple indexing technique called KWIC (Key Word in Context) might give you dramatically better search times. This would enable rapid retrieval of text items containing specific words or word patterns.

        Below is an example subroutine (initKWIC), that you might consider running to set up a KWIC cross reference on your data in ^REF.

        On modification or deletion of any text item, you could run xrKWIC(item,newText,oldText) to cross reference any new words in the context and to delete any obsolete word/item pairs as needed.

        Running initKWIC might take many minutes to process 400MB depending on the speed of your data storage, but after that many possible searches based on ^KWIC(word,item) would return results almost instantaneously.

        initKWIC  ;initialize KWIC cross reference
          ; for given example data ^REF(cod)="DESCRIPTION_LARGE_80_CHARACTERS"
          new cod set cod=""
          for  set cod=$o(^REF(cod)) quit:cod=""  do xrKWIC(cod,^REF(cod))
          quit

        xrKWIC(item,newText,oldText) ;setup and maintain KWIC index references for the words of one text item
          ;--oldText is required to remove obsolete references, but not to initialize --;

          new i,word,words
         
          ;cross reference new words
          if $l($g(newText)) do
          . set words=$$simpleText(newText)
          . for i=1:1:$l(words," ") set word=$p(words," ",i) if $l(word) do
          . . s words(word)=""
          . . s:'$d(^KWIC(word,item)) ^(item)=""

          ;remove obsolete cross references
          if $l($g(oldText)) do
          . set words=$$simpleText(oldText)
          . for i=1:1:$l(words," ") set word=$p(words," ",i) if $l(word) do
          . . if '$d(words(word)),$d(^KWIC(word,item)) k ^(item)
          quit

        simpleText(text)  ;simplify text to just words and spaces
          ;-- uppercase and remove punctuation (should be 9 spaces between Z and ending quote below.)
          quit $tr(text,"abcdefghijklmnopqrstuvwxyz,./?;:-_!","ABCDEFGHIJKLMNOPQRSTUVWXYZ         ")

        Here is some sample data and the corresponding KWIC cross reference.

        zwr ^REF

        ^REF(1)="example text for KWIC reference test"
        ^REF(2)="The second example text."
        ^REF(3)="Example 3 is simple."

        zwr ^KWIC

        ^KWIC(3,3)=""
        ^KWIC("EXAMPLE",1)=""
        ^KWIC("EXAMPLE",2)=""
        ^KWIC("EXAMPLE",3)=""
        ^KWIC("FOR",1)=""
        ^KWIC("IS",3)=""
        ^KWIC("KWIC",1)=""
        ^KWIC("REFERENCE",1)=""
        ^KWIC("SECOND",2)=""
        ^KWIC("SIMPLE",3)=""
        ^KWIC("TEST",1)=""
        ^KWIC("TEXT",1)=""
        ^KWIC("TEXT",2)=""
        ^KWIC("THE",2)=""
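
        A query against this cross reference could be sketched roughly as follows. This is only an illustration under stated assumptions: findAll is a hypothetical helper that is not part of the routines above, it relies on ^KWIC(word,item) having been built by initKWIC, and it matches whole words only, not arbitrary fragments inside a word.

```mumps
findAll(query,result) ;collect items that contain every word of query (hypothetical helper)
 ;-- result is passed by reference; assumes ^KWIC(word,item) as built by initKWIC --;
 new i,item,n,ok,w,word,words
 kill result
 set words=$$simpleText(query),n=0
 ;split the simplified query into the local array w(1..n)
 for i=1:1:$l(words," ") set word=$p(words," ",i) if $l(word) set n=n+1,w(n)=word
 quit:'n
 ;walk the items indexed under the first word, keep those indexed under all the others
 set item=""
 for  set item=$o(^KWIC(w(1),item)) quit:item=""  do
 . set ok=1
 . for i=2:1:n if '$d(^KWIC(w(i),item)) set ok=0 quit
 . set:ok result(item)=""
 quit
```

        With the sample data above,  do findAll("example text",.hits)  should leave hits(1) and hits(2) defined, since only items 1 and 2 are indexed under both EXAMPLE and TEXT. Starting from the rarest word instead of the first one would reduce the number of items visited.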

         
        • Sean

          Sean - 2007-06-05

          FWIW

          Seeding your $ORDER can help, as will a properly used naked reference '^('.
          By seeding I mean that you take the term you are looking for, back the last ASCII character value off by one, and add a few 'z's.

          So if you are looking for 'EXAMPLE' you can seed with something like 'EXAMPLDzzzzz'

          My Mumps is a bit rusty but I think it comes out something like this:

          s search_term="EXAMPLE"
          search_seed=$e(search_term,1,$l(search_term)-1)+$c($a(search_term)-1)+"zzzzz"

          s search = $o(^BFG(search_seed)) returns the first one
          d  q:search'[search_term
          . s x=^(search)
          . ;do what you want with 'x' here . . .
          . search=$o(^(search))

          The seed drops you in through the pointer blocks to the first data block that has the full reference you're looking at (if it exists).

          The naked reference allows mumps to skip building the full key for the subsequent $order's.

          I've always believed that it is very important to learn how a language is actually implemented to determine what the most efficient coding practices are.

          Just a thought ...

           
          • James A Self

            James A Self - 2007-06-06

            Sean,
            Your example was apparently intended to demonstrate a prefix search (for words beginning with "EXAMPLE", such as "EXAMPLES"), but it has a few problems.

            A FOR command is needed for iteration over ^BFG(search).

            The underbar character ("_") is the concatenation operator in MUMPS and therefore is not an allowable part of any variable name. Let's change "search_term" to just "term" and eliminate "search_seed".

            The "search_seed" technique of taking an arbitrary tiny step backwards before stepping forwards is unnecessary - plus it either crashes or gives an undesirable result when a search term is given that is not in the data. Notice that this first value is not tested to ensure that it contains the search term, so it could be an unrelated word or even empty.

            Here is a corrected version:

            if $d(^BFG(term)) set search=term
            else  set search=$o(^(term))
            for   quit:search'[term  do
            . set x=^(search)
            . do something ;--but don't reference any globals here!--;
            . set search=$o(^(search))

            I don't know (and I doubt) that the naked references provide any significant performance enhancement. I think they can make the code more readable if used carefully, but when stretched out over several lines they do make the code more fragile. For instance, the code above would break if "do something" attempted to store results in a global or to check additional data in a different global or at a different subscript level of the same global.
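
            For comparison, here is a sketch of the same loop written with full global references, so that the action performed for each match is free to touch other globals. The labels prefixScan and visit are hypothetical, and the test uses $EXTRACT for a strict prefix match instead of the contains operator. It assumes a non-empty, non-numeric term, because canonical numeric subscripts collate by value rather than as strings.

```mumps
prefixScan(term) ;visit each first-level subscript of ^BFG that begins with term
 ;-- prefixScan and visit are hypothetical names; ^BFG is the global from the example --;
 new search
 quit:term=""
 set search=term
 ;start at term itself if it exists, otherwise at the next subscript in collating order
 if '$d(^BFG(search)) set search=$o(^BFG(search))
 for  quit:$e(search,1,$l(term))'=term  do
 . do visit(search,$g(^BFG(search)))
 . set search=$o(^BFG(search))
 quit
 ;
visit(key,value) ;placeholder action; it may reference any global safely
 write key,"=",value,!
 quit
```

            The loop ends when $ORDER returns the empty string or the first subscript that no longer starts with term.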

             
    • James A Self

      James A Self - 2007-06-05

      Oops, I should have known better than trying to include MUMPS code in a reply on this forum.

      The code that I posted above (and probably in the email as well) was distorted by the forum software due to the loss of syntactically significant spacing. For anyone interested in reconstructing the code, the missing spaces are actually included in the HTML source code of the forum web page.

      - all of the MUMPS lines should be indented except for the 3 lines labeled: initKWIC, xrKWIC, and simpleText.

      - the Quit command at initKWIC+3 should be followed by 2 spaces, not just 1.

      - the quoted text string at simpleText+2 ABC..Z should end with 9 spaces, corresponding to the 9 punctuation characters ending the previous quoted string.

      I hope you find this useful.

       
    • Jens Wulf

      Jens Wulf - 2007-06-06

      Just one comment:
      To get the search_seed I'd prefer

      s:search_term'="" search_seed=$O(^BFG(search_term),-1)

      For me it's "cleaner" than the "zzzz"s.

      Jens

       
      • Sean

        Sean - 2007-06-06

        Funny, but I don't remember that in DSM-11. 
        Must have come after my time. 
        Ever viewed -16777216::0? 
        ;^)

         
