#1170 Excessive memory consumption when using stems

Milestone: None

Status: invalid

Owner: nobody

Labels: None

Pending work items: none

Priority: 1

Updated: 2013-11-20

Created: 2013-04-12

Creator: Ian Chapple

Private: No

The situation is as follows. I have 628 index files containing a total of over 17,000,000 indexed terms. To count the unique indexed terms (just over 3,000,000), I have a program that reads each index file in turn and loops over its list of indexed terms (held in the stem iStem.). For each indexed term, it sets the corresponding tail of the stem termStem. to .True. However, memory consumption when doing this is much higher than you would expect, and the program occasionally fails (System resources exhausted).

Here is the program I am using. If line A and line B are commented out, memory consumption is reasonable (approx. 200 MB). If they are left active, memory consumption shoots up to about 1.8 GB, which seems totally disproportionate to what they actually do (setting a stem index to .True), unless I am missing or misunderstanding what is supposed to be happening.

/* -- Analyse the index files ------------------------- */

analyseIndexes:

dictionaryFolder = dictFolder'@Dictionaries\'

call SysFileTree dictionaryFolder'*.index', indexList., 'SFO'

dt = .DateTime~new

do i=1 to indexList.0
  call processIndex(i)
end

exit

/* ----------------------------------------------------- */

processIndex:

file = indexList.[arg(1)]
file1 = filespec('N', file)

iStem.=readFileAsStem(file)

if iStem.0<1 then
do
  drop iStem.
  return
end

-- iStem. entries have the following form, where the indexed term is the
-- first word of the entry:
-- ÖFFNUNG : 500 DE 4 : 1243 DE 7 : 1244 DE 13 : 1739 DE 9 : 1743 DE 7 : 1743 DE 8....

do j=1 to iStem.0
  fw=firstword(iStem.j)
  if termStem.[fw]<>.True then -- line A
    termStem.[fw]=.True -- line B
end

say '['formatElapsedTime(dt~elapsed)'] 'pad(i, ' ', length(indexList.0), 'l')' / 'indexList.0' 'file1' 'pad(iStem.0, ' ', 8, 'l')' 'pad(termStem.~items, ' ', 9, 'l')

drop iStem.

return

Discussion

Rick McGuire - 2013-04-12

I don't see anything there that looks like an excessive memory usage to me. Explicitly assigning a stem index value requires the creation of an entry for that particular value, which includes the name of the index and other overhead associated with the indexing. Commenting out those two lines prevents those items from getting created and, more importantly, it allows the index values you are using to be garbage collected, since none of the data read from the files is being retained at all.
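
To make the point about entry creation concrete, here is a minimal ooRexx sketch of the same pattern on a tiny in-memory list. The stem name termStem. mirrors the report; the sample words and the variable names words and w are made up for illustration, and the sketch says nothing about the actual size of each entry.

```rexx
/* Sketch only: the sample data below is hypothetical, not from the report */
words = .array~of('ALPHA', 'BETA', 'ALPHA', 'GAMMA', 'BETA')

do w over words
  if termStem.[w] <> .true then   -- same test as line A in the report
    termStem.[w] = .true          -- same assignment as line B: stores an entry
                                  -- that keeps the tail string w alive
end

-- Number of entries now held by the stem (one per distinct term stored)
say 'Entries in termStem.:' termStem.~items
```

Each distinct term that passes through the assignment stays reachable for as long as termStem. exists, which is why the retained memory scales with the number of unique terms rather than with the size of the .True value.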

Ian Chapple - 2013-04-12

Hi Rick,
thanks for the quick reply.

If you think this behaviour is normal, that's fine. I was a bit surprised by the rapid increase in memory consumption when all that was being stored was a .True value, which is why I reported it. I do take the point about the overhead needed to make this possible.

Rick McGuire - 2013-04-12

In fact, it is almost all overhead. The true value itself is shared, so the per index overhead for the value is just a single pointer. The rest of the memory is occupied by the stem variable item descriptor and a string object for the index name. Given the number of items you're dealing with, I'm not surprised you might run out of memory trying to keep all of that data in memory.
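
As a rough back-of-envelope check, using only the figures from the original report (about 200 MB without lines A and B, about 1.8 GB with them, and just over 3,000,000 unique terms), the extra memory per retained term works out as follows. This treats all of the difference as per-entry cost, so it is an upper-bound style estimate rather than a measurement of the interpreter's internal structures:

```latex
\[
\frac{1.8\ \text{GB} - 0.2\ \text{GB}}{3{,}000{,}000\ \text{terms}}
\approx \frac{1.6 \times 10^{9}\ \text{bytes}}{3 \times 10^{6}}
\approx 5.3 \times 10^{2}\ \text{bytes per unique term}
\]
```

That figure covers the index string, the stem entry and any other memory the process holds on to while running, not the shared .True value.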

Rick McGuire - 2013-11-20

This does not look like a bug.

Rick McGuire - 2013-11-20

status: open --> invalid