GT.M High end TP database engine / Discussion / Help: Fast Global Search

Enzo - 2007-05-17

Hello,

I'm looking for a faster method than ^%GSE.
I have a huge global (40 MB) ...

Perhaps maintaining a text file and using grep ... ;-)
Or something else?

Thanks in advance,

Enzo

- Enzo - 2007-05-18
  
  Well,
  
  I could use the 'sed' command to search/modify a file that is a
  real-time replica of a working global. Once sed has written the
  desired results to a new file (or the same one), I could read that
  file and import its lines into a temp global, following the example
  of Rob Tweed (Fast Pipes).
  
  Is there perhaps a better method for doing that?
  
  Regards,
  
  Enzo
  
- James A Self - 2007-05-22
  
  Enzo,
  A size of 40 MB is not particularly large for a global. I am accustomed to working with globals hundreds of times larger than that in the context of a hospital information system, and I expect that some people following this forum commonly work with globals hundreds of times larger again.
  
  Even extremely large globals can be searched very quickly, if the data records are cross referenced by their content. For instance, a search module that I wrote typically completes searches in just a few seconds when performing queries with possibly complex criteria involving diagnoses, patient age, gender, procedures performed, visit date, etc. on a database including over a million patient visit records.
  
  If you can give an example indicating the structure of your data and the kind of searches you need to perform, we can give more detailed suggestions.
  
- Roger Partridge - 2007-05-22
  
  Enzo:
  
  It's software, so there are many options. Some of the most prominent are:
  
  The M approach mentioned by Jim - if you know what you're going to want to search - create and maintain a [cross-]index for it. Take Jim up on his offer of counsel.
  
  If the search is ad hoc, you can go rooting around for your target, probably with $FIND(): use $ORDER() if you know where the class of information being sought is found, or $QUERY() if you're searching willy-nilly. This amounts to the approach taken by ^%GSE, but you can make it perform better by making it more specific to your problem (minimizing indirection is very good).
  
  If such searches are a rare event and pure brute force is the way to go, you can dump data out to a flat file (MUPIP EXTRACT, ZWRITE, ^%GO, etc.) and go after your target with a non-M tool.
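  
  As a rough illustration of the second option, a scan written for one specific global might look like the sketch below. It assumes a hypothetical single-level global ^DATA whose nodes hold text; the label scanDATA is likewise made up for this example. It uses the contains operator ([) rather than $FIND() because only a yes/no answer is needed, and it uses no indirection:
  
  ```mumps
  scanDATA(target) ;write every first-level node of ^DATA whose value contains target
   ;-- ^DATA and this label are hypothetical names for illustration --;
   new sub,val
   set sub=""
   for  set sub=$order(^DATA(sub)) quit:sub=""  do
   . set val=$get(^DATA(sub))
   . quit:val'[target
   . write sub,"=",val,!
   quit
  ```
  
  It would be called as: do scanDATA("some text"). It still reads every node, so it only trims overhead; it does not change the cost of a full scan.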
  
  Roger
  
- Enzo - 2007-06-03
  
  Hello Jim and Roger,
  
  The global I have is 400 MB (sorry!), but it still seems to be smaller than others! ;-)
  The structure is the simplest possible, a global of references with 1 index and 1 field:
  
  ^REF(cod)="DESCRIPTION_LARGE_80_CHARACTERS"
  
  But the users continuously search for a portion of text inside this description, which makes
  the search too slow because the $ORDER loop has to read all the references.
  
  Currently I have a (brute-force) solution: maintaining a copy of this global
  outside the database, as a text file. The search is done by 'sed' and the
  results are read back from GT.M ... but it works for me.
  
  Could you give me an example of how to convert this 'ugly-search' into a 'pretty-and-cool-search'? ;-)
  
  Thank you again,
  
  Enzo
  
  - James A Self - 2007-06-05
    
    Enzo,
    If the text items in your data are less than 80 characters each and the total is 400MB, then it seems that you have 5 million items or more. I am curious as to what search times you observe with the queries and methods you have tried.
    
    If your sed based search takes more than a few seconds, then a simple indexing technique called KWIC (Key Word in Context) might give you dramatically better search times. This would enable rapid retrieval of text items containing specific words or word patterns.
    
    Below is an example subroutine (initKWIC), that you might consider running to set up a KWIC cross reference on your data in ^REF.
    
    On modification or deletion of any text item, you could run xrKWIC(item,newText,oldText) to cross reference any new words in the context and to delete any obsolete word/item pairs as needed.
    
    Running initKWIC might take many minutes to process 400MB depending on the speed of your data storage, but after that many possible searches based on ^KWIC(word,item) would return results almost instantaneously.
    
    initKWIC ;initialize KWIC cross reference
     ; for given example data ^REF(cod)="DESCRIPTION_LARGE_80_CHARACTERS"
     new cod set cod=""
     for  set cod=$o(^REF(cod)) quit:cod=""  do xrKWIC(cod,^REF(cod))
     quit
    
    xrKWIC(item,newText,oldText) ;setup and maintain KWIC index references for the words of one text item
     ;--oldText is required to remove obsolete references, but not to initialize --;
    
     new i,word,words
    
     ;cross reference new words
     if $l($g(newText)) do
     . set words=$$simpleText(newText)
     . for i=1:1:$l(words," ") set word=$p(words," ",i) if $l(word) do
     . . s words(word)=""
     . . s:'$d(^KWIC(word,item)) ^(item)=""
    
     ;remove obsolete cross references
     if $l($g(oldText)) do
     . set words=$$simpleText(oldText)
     . for i=1:1:$l(words," ") set word=$p(words," ",i) if $l(word) do
     . . if '$d(words(word)),$d(^KWIC(word,item)) k ^(item)
     quit
    
    simpleText(text) ;simplify text to just words and spaces
     ;-- uppercase and remove punctuation (should be 9 spaces between Z and ending quote below.)
     quit $tr(text,"abcdefghijklmnopqrstuvwxyz,./?;:-_!","ABCDEFGHIJKLMNOPQRSTUVWXYZ         ")
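    
    For the maintenance step (modification or deletion of a text item), one possible arrangement is a pair of small wrappers that update ^REF and call xrKWIC in the same place. This is only a sketch: the labels setREF and killREF are hypothetical names, and it assumes that every change to ^REF goes through them.
    
    ```mumps
    setREF(cod,text) ;store or replace one text item and update its cross references
     new old
     set old=$get(^REF(cod))
     set ^REF(cod)=text
     do xrKWIC(cod,text,old)
     quit
     ;
    killREF(cod) ;delete one text item and remove its cross references
     new old
     set old=$get(^REF(cod))
     kill ^REF(cod)
     do xrKWIC(cod,"",old)
     quit
    ```
    
    On a multi-user system the two updates would normally be protected by a lock or a transaction; that is left out here to keep the sketch short.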
    
    Here is some sample data and the corresponding KWIC cross reference.
    
    zwr ^REF
    
    ^REF(1)="example text for KWIC reference test"
    ^REF(2)="The second example text."
    ^REF(3)="Example 3 is simple."
    
    zwr ^KWIC
    
    ^KWIC(3,3)=""
    ^KWIC("EXAMPLE",1)=""
    ^KWIC("EXAMPLE",2)=""
    ^KWIC("EXAMPLE",3)=""
    ^KWIC("FOR",1)=""
    ^KWIC("IS",3)=""
    ^KWIC("KWIC",1)=""
    ^KWIC("REFERENCE",1)=""
    ^KWIC("SECOND",2)=""
    ^KWIC("SIMPLE",3)=""
    ^KWIC("TEST",1)=""
    ^KWIC("TEXT",1)=""
    ^KWIC("TEXT",2)=""
    ^KWIC("THE",2)=""
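    
    With the cross reference in place, a search reads ^KWIC instead of scanning ^REF. The sketch below shows one possible lookup. The label findAll and the result array are hypothetical names; it reuses simpleText from the code above, matches whole words only, and returns the items that contain every word of the query.
    
    ```mumps
    findAll(query,result) ;collect items whose text contains every word in query
     ;-- result is passed by reference: do findAll("example text",.hits) --;
     new i,item,n,ok,word,words
     kill result
     set words=$$simpleText(query),n=0
     ;gather the distinct non-empty query words
     for i=1:1:$length(words," ") set word=$piece(words," ",i) if $length(word),'$data(words(word)) set words(word)="",n=n+1
     quit:'n
     ;walk the items listed under one word, keep those listed under all the others
     set word=$order(words("")),item=""
     for  set item=$order(^KWIC(word,item)) quit:item=""  do
     . set ok=1,i=""
     . for  set i=$order(words(i)) quit:i=""  if '$data(^KWIC(i,item)) set ok=0 quit
     . set:ok result(item)=""
     quit
    ```
    
    With the sample data above, do findAll("example text",.hits) should leave hits(1) and hits(2) defined, since item 3 has no "TEXT" entry. The sketch simply starts from the first query word in collating order; starting from the rarest word would do less work.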
    
    - Sean - 2007-06-05
      
      FWIW
      
      Seeding your $ORDER can help as will a properly used naked reference '^('
      By seeding I mean that you take the term you are looking for, back the last ASCII character value off by one, and append a few 'z's.
      
      So if you are looking for 'EXAMPLE' you can seed with something like 'EXAMPLDzzzzz'
      
      My Mumps is a bit rusty but I think it comes out something like this:
      
      s search_term="EXAMPLE"
      search_seed=$e(search_term,1,$l(search_term)-1)+$c($a(search_term)-1)+"zzzzz"
      
      s search = $o(^BFG(search_seed)) returns the first one
      d q:search'[search_term
      . s x=^(search)
      . ;do what you want with 'x' here . . .
      . search=$o(^(search))
      
      The seed drops you in through the pointer blocks to the first data block that has the full reference you're looking at (if it exists).
      
      The naked reference allows mumps to skip building the full key for the subsequent $order's.
      
      I've always believed that it is very important to learn how a language is actually implemented to determine what the most efficient coding practices are.
      
      Just a thought ...
      
      - James A Self - 2007-06-06
        
        Sean,
        Your example was apparently intended to demonstrate a prefix search (for words beginning with "EXAMPLE", such as "EXAMPLES"), but it has a few problems.
        
        A FOR command is needed for iteration over ^BFG(search).
        
        The underbar character ("_") is the concatenation operator in MUMPS and therefore is not an allowable part of any variable name. Let's change "search_term" to just "term" and eliminate "search_seed".
        
        The "search_seed" technique of taking an arbitrary tiny step backwards before stepping forwards is unnecessary - plus it either crashes or gives an undesirable result when a search term is given that is not in the data. Notice that this first value is not tested to ensure that it contains the search term, so it could be an unrelated word or even empty.
        
        Here is a corrected version:
        
         if $d(^BFG(term)) set search=term
         else  set search=$o(^(term))
         for  quit:search'[term  do
         . set x=^(search)
         . do something ;--but don't reference any globals here!--;
         . set search=$o(^(search))
        
        I don't know whether the naked references provide any significant performance enhancement, and I doubt it. I think they can make the code more readable if used carefully, but when stretched out over several lines they do make the code more fragile. For instance, the code above would break if "do something" attempted to store results in a global or to check additional data in a different global or at a different subscript level of the same global.
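        
        For comparison, here is a sketch of the same prefix search written with full references, so the loop body is free to touch other globals. The label prefix is a hypothetical name, and the explicit $extract test is a substitution for the contains operator so that only subscripts beginning with the term match. It assumes string (non-numeric) subscripts, since canonical numbers collate separately.
        
        ```mumps
        prefix(term) ;write every first-level subscript of ^BFG that begins with term
         new key,len
         set len=$length(term) quit:'len
         set key=term
         ;start at term itself if it is a node, otherwise at the next subscript
         if '$data(^BFG(key)) set key=$order(^BFG(key))
         for  quit:key=""  quit:$extract(key,1,len)'=term  do  set key=$order(^BFG(key))
         . write key,!
         quit
        ```
        
        The loop also stops cleanly when the end of the global is reached or when the term is empty.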
        
- James A Self - 2007-06-05
  
  Oops, I should have known better than trying to include MUMPS code in a reply on this forum.
  
  The code that I posted above (and probably in the email as well) was distorted by the forum software due to the loss of syntactically significant spacing. For anyone interested in reconstructing the code, the missing spaces are actually included in the HTML source code of the forum web page.
  
  - all of the MUMPS lines should be indented except for the 3 lines labeled: initKWIC, xrKWIC, and simpleText.
  
  - the Quit command at initKWIC+3 should be followed by 2 spaces, not just 1.
  
  - the quoted text string at simpleText+2 ABC..Z should end with 9 spaces, corresponding to the 9 punctuation characters ending the previous quoted string.
  
  I hope you find this useful.
  
- Jens Wulf - 2007-06-06
  
  Just one comment:
  To get the search_seed I'd prefer
  
  s:search_term'="" search_seed=$O(^BFG(search_term),-1)
  
  For me it's "cleaner" than the "zzzz"s.
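  
  A minimal sketch of how such a seed might be used follows. The variable names are shortened to term and seed because the underscore is not valid in M names; the label first is hypothetical, and ^BFG is the global from the earlier example.
  
  ```mumps
  first(term) ;return the first subscript of ^BFG at or after term
   new seed
   quit:term="" $order(^BFG(""))
   ;step back to the subscript just before term, then forward again
   set seed=$order(^BFG(term),-1)
   quit $order(^BFG(seed))
  ```
  
  Called as set key=$$first("EXAMPLE"), this returns "EXAMPLE" itself when that node exists, otherwise the next subscript after it, or the empty string when nothing follows.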
  
  Jens
  
  - Sean - 2007-06-06
    
    Funny, but I don't remember that in DSM-11.
    Must have come after my time.
    Ever viewed -16777216::0?
    ;^)
    