#1170 Excessive memory consumption when using stems

None
invalid
nobody
None
none
1
2013-11-20
2013-04-12
Ian Chapple
No

The situation is as follows. I have 628 index files containing a total of over 17,000,000 indexed terms. To count the unique indexed terms (just over 3,000,000), I have a program that reads in each index file, one by one, and loops over its list of indexed terms (held in the stem iStem.). For each indexed term, it sets the corresponding tail of the stem termStem. to .True. However, the memory consumption when doing this is much higher than I would expect, and the program occasionally fails (System resources exhausted).

Here is the program I am using. If line A and line B are commented out, the memory consumption is reasonable (approx. 200 MB). If they are left active, the memory consumption shoots up to about 1.8 GB, which seems totally disproportionate to what they actually do (setting a stem tail to .True), unless I am missing or misunderstanding what is supposed to be happening.

/* -- Analyse the index files -- ------------------------ */

analyseIndexes:

  dictionaryFolder = dictFolder'@Dictionaries\'

  call SysFileTree dictionaryFolder'*.index', indexList., 'SFO'

  dt = .DateTime~new

  do i=1 to indexList.0
    call processIndex(i)
  end

  exit

/* ----------------------------------------------------- */

processIndex:

  file = indexList.[arg(1)]
  file1 = filespec('N', file)

  iStem.=readFileAsStem(file)

  if iStem.0<1 then
  do
    drop iStem.
    return
  end

  -- iStem. entries have the following form, where the indexed term is the
  -- first word of the entry:
  -- ÖFFNUNG : 500 DE 4 : 1243 DE 7 : 1244 DE 13 : 1739 DE 9 : 1743 DE 7 : 1743 DE 8....

  do j=1 to iStem.0
    fw=firstword(iStem.j)
    if termStem.[fw]<>.True then -- line A
      termStem.[fw]=.True -- line B
  end

  say '['formatElapsedTime(dt~elapsed)'] 'pad(i, ' ', length(indexList.0), 'l')' / 'indexList.0' 'file1' 'pad(iStem.0, ' ', 8, 'l')' 'pad(termStem.~items, ' ', 9, 'l')

  drop iStem.

  return

Discussion

  • Rick McGuire

    Rick McGuire - 2013-04-12

    I don't see anything there that looks like an excessive memory usage to me. Explicitly assigning a stem index value requires the creation of an entry for that particular value, which includes the name of the index and other overhead associated with the indexing. Commenting out those two lines prevents those items from getting created and, more importantly, it allows the index values you are using to be garbage collected, since none of the data read from the files is being retained at all.
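
    The point about each assignment creating an entry can be seen on a small scale. The following is a minimal sketch, not part of the original report: the stem name seen. and the sample terms are made up for illustration, and the test inside the loop mirrors line A and line B of the program above.

    ```rexx
    /* Illustrative only: the stem name and the sample terms are assumptions. */
    sample = 'ALPHA BETA ALPHA GAMMA BETA'

    do k = 1 to words(sample)
      term = word(sample, k)
      -- Same test as line A and line B: only the first sighting of a term
      -- assigns a value, and that assignment is what creates a stem entry.
      if seen.[term] <> .true then
        seen.[term] = .true
    end

    -- items counts the tails that have been assigned a value
    say 'Terms read:  ' words(sample)
    say 'Stem entries:' seen.~items
    ```

    With this sample, the stem should end up holding one entry per distinct term rather than one per term read, and each of those entries stays alive for as long as the stem does.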

     
  • Ian Chapple

    Ian Chapple - 2013-04-12

    Hi Rick,
    thanks for the quick reply.

    If you think this behaviour is normal, that's fine. I was a bit surprised by the rapid increase in memory consumption when all that was being stored was a .True value, which is why I reported it. I do take the point about the overhead needed to make this possible.

     
  • Rick McGuire

    Rick McGuire - 2013-04-12

    In fact, it is almost all overhead. The true value itself is shared, so the per index overhead for the value is just a single pointer. The rest of the memory is occupied by the stem variable item descriptor and a string object for the index name. Given the number of items you're dealing with, I'm not surprised you might run out of memory trying to keep all of that data in memory.
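
    For a sense of scale, the figures quoted in the report give a rough per-term cost. This is only a back-of-envelope sketch: it assumes the whole difference between the two runs is attributable to retained stem entries, which overstates the true per-entry size, because memory that has not yet been garbage collected is counted as well. The variable names are illustrative.

    ```rexx
    /* Figures taken from the report; treating the whole difference as stem
       overhead is an assumption, so read the result as an upper bound. */
    numeric digits 20

    withLinesMB    = 1800      -- approx. 1.8 GB with line A and line B active
    withoutLinesMB = 200       -- approx. 200 MB with them commented out
    uniqueTerms    = 3000000   -- "just over 3,000,000" unique terms

    extraBytes = (withLinesMB - withoutLinesMB) * 1000000
    say 'Approx. bytes per unique term:' format(extraBytes / uniqueTerms, , 0)
    ```

    That works out to a few hundred bytes per unique term at most, covering the item descriptor, the string object for the index name and the pointer to the shared value.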

     
  • Rick McGuire

    Rick McGuire - 2013-11-20

    This does not look like a bug.

     
  • Rick McGuire

    Rick McGuire - 2013-11-20
    • status: open --> invalid
     
