Salomon - 2019-04-16

Hello all!

I am running HipMer on a plant data set. HipMer runs fine until it is killed by SLURM for trying to exceed the physical memory available on my compute nodes. The final part of the run.out file looks like this:

Starting stage contigMerDepth-31 -k 31 -i ALL_INPUTS.fofn-31.ufx.bin -c UUtigs_contigs-31 -d 7 -D 1.000000 -s 100 -B /dev/shm at 04/16/19 10:52:31

STAGE contigMerDepth_main -k 31 -i ALL_INPUTS.fofn-31.ufx.bin -c UUtigs_contigs-31 -d 7 -D 1.000000 -s 100 -B /dev/shm
Struct size is 32, no padding, 8 shared[] ptr
kmer_length 31
Total number of records is 1027287713
Minimum required shared memory: 279 MB. (1027287713 ufx kmers) If memory runs out re-run with more -shared-heap
Thread 0: Allocating memory for 6111570 reads: 281132220 bytes
Threads done with I/O
Time spent in all_alloc() for buckets is 0.081163 seconds
Time spent in NULL-ing buckets is 0.078106 seconds
Time spent in all_alloc() for heap ptrs and indices is 0.001610 seconds
Time spent in alloc() for my shared heap is 0.087445 seconds
Allocating 31.500 MB per node 256 elements (32 bytes) per-thread for local chunks
Time spent for caching remote pointers is 0.010392 seconds
Threads done with setting-up

*** SET - UP TIME ***
Time spent on setting up the distributed hash table is 0.616881 seconds

** OVERALL TIME BREAKDOWN **

Time for constructing UFX hash table is : 10.879285 seconds

My standard error file additionally contains this:

slurmstepd: error: Job 9523 exceeded memory limit (97778864 > 97320960), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: JOB 9523 ON node06 CANCELLED AT 2019-04-16T10:52:59

The end of the output clearly states that if memory runs out, I should re-run with more -shared-heap. How exactly does that work? It seems that HipMer was allocating more memory than is available on a single node (though only by about 450 MB, if those slurmstepd numbers are kilobytes), so I don't see how increasing the shared heap would help.
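
For what it's worth, here is how I currently read that hint. I assume the stage binaries are Berkeley UPC programs, so the option would go to the UPC launcher, something like the following (the size is a placeholder, not a tested value; the stage command is copied from run.out above):

# placeholder heap size; stage arguments copied from run.out
upcrun -shared-heap=2GB contigMerDepth_main -k 31 -i ALL_INPUTS.fofn-31.ufx.bin -c UUtigs_contigs-31 -d 7 -D 1.000000 -s 100 -B /dev/shm

As I understand it, that option only sets the size of the per-thread UPC shared segment; it does not add any memory to the node, which is why I am confused about how it would help here.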

Or is my problem a different one? Could I improve the situation by increasing the number of nodes (and therefore the total available memory)? Or are there steps within HipMer where the memory cannot be distributed across nodes? I could answer these questions myself by running on more nodes, but it will be a while until more nodes are available on my cluster, so I wanted to get a head start on planning.
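
To make the planning question concrete, this is roughly the batch request I have in mind once more nodes free up (node count, task count, and memory values are placeholders for my cluster, not recommendations):

#!/bin/bash
#SBATCH --nodes=4              # more nodes, hence more aggregate memory, if the stage distributes its data
#SBATCH --ntasks-per-node=16   # placeholder; one task per UPC thread as in my current runs
#SBATCH --mem=90G              # stay under the per-node limit that slurmstepd enforced above
# ... launch HipMer here as in my current job script

If the contigMerDepth stage really does keep the UFX hash table distributed, this should relieve the per-node pressure; if some structure is replicated on every node, it presumably will not.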

Thanks for your help!