Hello,
I'm looking for a better (faster) search method than ^%GSE.
I have a huge global (40 MB) ...
Perhaps maintaining a text file and using grep ... ;-)
or what else?
Thanks in advance,
Enzo
Well,
I could use the 'sed' command to search/modify a file, where this
file is a realtime replication of a working global. Once the
sed editor generates the desired results in a new file (or the same one),
I could read this file and import the lines into a temp global, using
the example of Rob Tweed (Fast Pipes).
Perhaps there is another, better method for doing that?
Regards,
Enzo
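For illustration only, here is a Python sketch of that same pipeline idea: filter a flat-file extract for a pattern, then load the matching lines into a temporary key-value structure standing in for the temp global. The one-record-per-line "cod|description" layout is an assumption made for this sketch, not the Fast Pipes format.

```python
# Sketch: filter a flat-file extract of ^REF and load the hits into a
# dict standing in for a temp global. The "cod|description" line
# layout is assumed for illustration.

def search_extract(lines, pattern):
    """Return {cod: description} for every line whose description
    contains the pattern (case-insensitive)."""
    temp_global = {}
    pattern = pattern.upper()
    for line in lines:
        cod, _, desc = line.rstrip("\n").partition("|")
        if pattern in desc.upper():
            temp_global[cod] = desc
    return temp_global

extract = [
    "1|example text for KWIC reference test",
    "2|The second example text.",
    "3|Example 3 is simple.",
]
print(search_extract(extract, "second"))
```

Like the sed approach, this still reads every line of the extract on every search; it only moves the scan outside the database.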
Enzo,
A size of 40Mb is not particularly large for a global. I am accustomed to working with globals hundreds of times larger than that in the context of a hospital information system and I expect that some people following this forum commonly work with globals hundreds of times larger again.
Even extremely large globals can be searched very quickly, if the data records are cross referenced by their content. For instance, a search module that I wrote typically completes searches in just a few seconds when performing queries with possibly complex criteria involving diagnoses, patient age, gender, procedures performed, visit date, etc. on a database including over a million patient visit records.
If you can give an example indicating the structure of your data and the kind of searches you need to perform, we can give more detailed suggestions.
Enzo:
It's software, so there are many options. Some of the most prominent are:
The M approach mentioned by Jim - if you know what you're going to want to search - create and maintain a [cross-]index for it. Take Jim up on his offer of counsel.
If the search is ad hoc, you can use $ORDER() if you know where the class of information being sought is found, or $QUERY() if you're searching willy-nilly, rooting around, probably with $FIND(), for your target. This amounts to the approach taken by ^%GSE, but you can make it perform better by making it more specific to your problem (minimizing indirection is very good).
If such searches are a rare event and pure brute force is the way to go, you can dump data out to a flat file (MUPIP EXTRACT, ZWRITE, ^%GO, etc) and go after your target with a non-M tool.
Roger
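To make the trade-off concrete, here is a small Python model (an illustration, not ^%GSE itself) of the brute-force $ORDER/$QUERY-style scan: walk every key in order and test each value for the target substring. This is the O(n) cost that a maintained cross-index avoids.

```python
# Model of a brute-force scan over an ordered map, analogous to
# $ORDER-ing through every node of a global and testing each value
# with the contains operator. The sample data is hypothetical.

ref = {
    "1": "example text for KWIC reference test",
    "2": "The second example text.",
    "3": "Example 3 is simple.",
}

def brute_force_search(ref, target):
    """Return keys whose value contains target (like '[' in M),
    visiting every node -- cost grows with the size of the global."""
    target = target.upper()
    return [cod for cod in sorted(ref) if target in ref[cod].upper()]

print(brute_force_search(ref, "TEXT"))   # nodes 1 and 2 match
```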
Hello Jim and Roger,
the global I have is 400 MB (sorry!), but it seems to be smaller than others! ;-)
The structure is the simplest possible, a global of references with 1 index and 1 field:
^REF(cod)="DESCRIPTION_LARGE_80_CHARACTERS"
But the user continuously searches for a portion inside this description, making
the search too slow because the $ORDER loop has to read all the references.
Actually, I have made a (brute) solution, which is maintaining this global
outside the database, as a text file. The search is made by 'sed' and the
results are read back from GT.M ... but it works for me.
Could you give me an example for converting this 'ugly search' into a 'pretty-and-cool search'? ;-)
Thank you again,
Enzo
Enzo,
If the text items in your data are less than 80 characters each and the total is 400MB, then it seems that you have 5 million items or more. I am curious as to what search times you observe with the queries and methods you have tried.
If your sed based search takes more than a few seconds, then a simple indexing technique called KWIC (Key Word in Context) might give you dramatically better search times. This would enable rapid retrieval of text items containing specific words or word patterns.
Below is an example subroutine (initKWIC) that you might consider running to set up a KWIC cross reference on your data in ^REF.
On modification or deletion of any text item, you could run xrKWIC(item,newText,oldText) to cross reference any new words in the context and to delete any obsolete word/item pairs as needed.
Running initKWIC might take many minutes to process 400MB, depending on the speed of your data storage, but after that, many possible searches based on ^KWIC(word,item) would return results almost instantaneously.
initKWIC ;initialize KWIC cross reference
 ; for given example data ^REF(cod)="DESCRIPTION_LARGE_80_CHARACTERS"
 new cod set cod=""
 for  set cod=$o(^REF(cod)) quit:cod=""  do xrKWIC(cod,^REF(cod))
 quit
xrKWIC(item,newText,oldText) ;setup and maintain KWIC index references for the words of one text item
 ;--oldText is required to remove obsolete references, but not to initialize --;
 new i,word,words
 ;cross reference new words
 if $l($g(newText)) do
 . set words=$$simpleText(newText)
 . for i=1:1:$l(words," ") set word=$p(words," ",i) if $l(word) do
 . . s words(word)=""
 . . s:'$d(^KWIC(word,item)) ^(item)=""
 ;remove obsolete cross references
 if $l($g(oldText)) do
 . set words=$$simpleText(oldText)
 . for i=1:1:$l(words," ") set word=$p(words," ",i) if $l(word) do
 . . if '$d(words(word)),$d(^KWIC(word,item)) k ^(item)
 quit
simpleText(text) ;simplify text to just words and spaces
 ;-- uppercase and remove punctuation (9 spaces between Z and the ending quote below)
 quit $tr(text,"abcdefghijklmnopqrstuvwxyz,./?;:-_!","ABCDEFGHIJKLMNOPQRSTUVWXYZ         ")
Here is some sample data and the corresponding KWIC cross reference.
zwr ^REF
^REF(1)="example text for KWIC reference test"
^REF(2)="The second example text."
^REF(3)="Example 3 is simple."
zwr ^KWIC
^KWIC(3,3)=""
^KWIC("EXAMPLE",1)=""
^KWIC("EXAMPLE",2)=""
^KWIC("EXAMPLE",3)=""
^KWIC("FOR",1)=""
^KWIC("IS",3)=""
^KWIC("KWIC",1)=""
^KWIC("REFERENCE",1)=""
^KWIC("SECOND",2)=""
^KWIC("SIMPLE",3)=""
^KWIC("TEST",1)=""
^KWIC("TEXT",1)=""
^KWIC("TEXT",2)=""
^KWIC("THE",2)=""
FWIW
Seeding your $ORDER can help as will a properly used naked reference '^('
By seeding I suggest that you take the term you are looking for, back the last ASCII character value off by one, and add a few 'z's.
So if you are looking for 'EXAMPLE' you can seed with something like 'EXAMPLDzzzzz'
My Mumps is a bit rusty but I think it comes out something like this:
s search_term="EXAMPLE"
search_seed=$e(search_term,1,$l(search_term)-1)+$c($a(search_term)-1)+"zzzzz"
s search = $o(^BFG(search_seed)) returns the first one
d q:search'[search_term
. s x=^(search)
. ;do what you want with 'x' here . . .
. search=$o(^(search))
The seed drops you in through the pointer blocks to the first data block that has the full reference you're looking at (if it exists).
The naked reference allows mumps to skip building the full key for the subsequent $order's.
I've always believed that it is very important to learn how a language is actually implemented to determine what the most efficient coding practices are.
Just a thought ...
Sean,
Your example was apparently intended to demonstrate a prefix search (for words beginning with "EXAMPLE", such as "EXAMPLES"), but it has a few problems.
A FOR command is needed for iteration over ^BFG(search).
The underbar character ("_") is the concatenation operator in MUMPS and therefore is not an allowable part of any variable name. Let's change "search_term" to just "term" and eliminate "search_seed".
The "search_seed" technique of taking an arbitrary tiny step backwards before stepping forwards is unnecessary - plus it either crashes or gives an undesirable result when a search term is given that is not in the data. Notice that this first value is not tested to ensure that it contains the search term, so it could be an unrelated word or even empty.
Here is a corrected version:
 if $d(^BFG(term)) set search=term
 else  set search=$o(^(term))
 for  quit:search'[term  do
 . set x=^(search)
 . do something ;--but don't reference any globals here!--;
 . set search=$o(^(search))
I don't know (I doubt) that the naked references provide any significant performance enhancement. I think they can make the code more readable if used carefully but when stretched out over several lines they do make the code more fragile. For instance, the code above would break if "do something" attempted to store results in a global or to check additional data in a different global or at a different subscript level of the same global.
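The corrected loop is essentially a seek-then-scan prefix search over ordered keys. A Python analogue (a model, not the M code) uses bisect on a sorted key list; where the M loop tests "contains" ('[) on keys reached in order from the term, the model uses a direct startswith test, which gives the same prefix behavior.

```python
# Model of the $ORDER prefix search: seek to the first key >= term in
# a sorted list (what $ORDER-ing from just below the term achieves),
# then scan forward while the keys still begin with the term.

import bisect

def prefix_search(keys, term):
    """keys must be sorted; returns all keys beginning with term."""
    hits = []
    i = bisect.bisect_left(keys, term)   # first key >= term
    while i < len(keys) and keys[i].startswith(term):
        hits.append(keys[i])
        i += 1
    return hits

words = sorted(["EXAMPLE", "EXAMPLES", "EXAMINE", "TEXT", "TEST"])
print(prefix_search(words, "EXAMPLE"))   # -> ['EXAMPLE', 'EXAMPLES']
```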
Oops, I should have known better than trying to include MUMPS code in a reply on this forum.
The code that I posted above (and probably in the email as well) was distorted by the forum software due to the loss of syntactically significant spacing. For anyone interested in reconstructing the code, the missing spaces are actually included in the HTML source code of the forum web page.
- all of the MUMPS lines should be indented except for the 3 lines labeled: initKWIC, xrKWIC, and simpleText.
- the Quit command at initKWIC+3 should be followed by 2 spaces, not just 1.
- the quoted text string ABC..Z at simpleText+2 should end with 9 spaces, corresponding to the 9 punctuation characters ending the previous quoted string.
I hope you find this useful.
Just one comment:
To get the search_seed I'd prefer
s:search_term'="" search_seed=$O(^BFG(search_term),-1)
For me it's cleaner than the "zzzz"s.
Jens
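Jens's $O(...,-1) is the reverse form of $ORDER: it returns the key immediately preceding the seed in collating order. A small Python model of both directions over a sorted key list (illustrative only; an empty string stands in for M's "no more keys" result):

```python
# Model of forward and reverse $ORDER over a sorted key list:
# order(keys, k) ~ $O(g(k)), the next key after k;
# order(keys, k, -1) ~ $O(g(k),-1), the key just before k.

import bisect

def order(keys, k, direction=1):
    """keys must be sorted and must not contain ""."""
    if direction == 1:
        i = bisect.bisect_right(keys, k)   # first key > k
        return keys[i] if i < len(keys) else ""
    i = bisect.bisect_left(keys, k)        # first key >= k
    return keys[i - 1] if i > 0 else ""

keys = ["EXAMINE", "EXAMPLE", "EXAMPLES", "TEXT"]
print(order(keys, "EXAMPLE"))       # -> 'EXAMPLES'
print(order(keys, "EXAMPLE", -1))   # -> 'EXAMINE'
```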
Funny, but I don't remember that in DSM-11.
Must have come after my time.
Ever viewed -16777216::0?
;^)