Add to tide-index the following option:
--use-disk <integer> - Specify the number of peptides per batch that tide-index writes to disk. By default (when this parameter is set to 0), all peptides are stored in memory. When the parameter is set to a value N, then tide-index will only store at most N peptides in memory at once. Default = 0.
To implement this, you will need to store a series of separate sets of peptides on disk, each in sorted order by mass. After all of these temporary files are created, you will open all of them simultaneously and read through them in parallel, writing out a single file of peptides in sorted order.
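The scheme described above is essentially an external merge sort. Here is a minimal Python sketch of the idea; it is not the Crux C++ code. The function names (write_sorted_batches, read_batch, merge_batches), the (mass, sequence) tuple layout, and the tab-separated temporary file format are all assumptions made for illustration.

```python
import heapq
import os
import tempfile


def write_sorted_batches(peptides, batch_size, tmp_dir):
    """Write peptides to disk in mass-sorted batches of at most batch_size.

    `peptides` is any iterable of (mass, sequence) tuples (hypothetical layout).
    Returns the list of temporary file paths, one per batch.
    """
    paths = []
    batch = []

    def flush():
        if not batch:
            return
        batch.sort()  # sorts by mass first, then by sequence
        fd, path = tempfile.mkstemp(suffix=".tsv", dir=tmp_dir)
        with os.fdopen(fd, "w") as out:
            for mass, seq in batch:
                # repr() of a float round-trips exactly through float()
                out.write(f"{mass!r}\t{seq}\n")
        paths.append(path)
        batch.clear()

    for pep in peptides:
        batch.append(pep)
        if len(batch) >= batch_size:
            flush()
    flush()  # write the final, possibly short, batch
    return paths


def read_batch(path):
    """Stream (mass, sequence) tuples back from one batch file."""
    with open(path) as handle:
        for line in handle:
            mass, seq = line.rstrip("\n").split("\t")
            yield float(mass), seq


def merge_batches(paths):
    """Read all batch files in parallel and yield peptides in global mass order."""
    return heapq.merge(*(read_batch(p) for p in paths))
```

At most batch_size peptides are held in memory while writing, and the merge step holds only one pending peptide per open file.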
It might be useful to put a pseudocode description of your proposed implementation in the issue tracker.
Notes from our discussion:
With protein reversal, we will always allow duplicates. But we will have to store both targets and decoys in the heap.
Eliminate the heap that stores all the decoys, and instead generate the decoys on the fly when writing targets to disk.
Store only the targets and decoys with the current mass, and use those to identify possible duplicates. Once we find a new mass, discard the old, stored targets and decoys.
One potential savings is to replace the current scheme, which has many copies of a given peptide in the heap, each with its own pointer to a single parent protein, with a single copy of each peptide in the heap with many pointers to its multiple parents. Note that if we make this change, then the sort that happens before generating modified peptides can be eliminated.
Find out (and document) what the code does when reversing to get decoys and a duplicate is encountered.
First goal: Store one copy of each peptide in the heap, with pointers to multiple parent proteins.
Second goal: Postpone decoy generation to the very end (except when doing protein reverse).
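The first goal above (one copy of each peptide, with pointers to all of its parent proteins) can be sketched as follows. This is a Python illustration only, not the tide-index data structures: index_peptides, tryptic_spans, and the (protein_id, offset) parent record are hypothetical names, and the digestion rule is a toy one that cleaves after K or R and ignores the proline rule.

```python
from collections import defaultdict


def tryptic_spans(sequence):
    """Toy digestion: yield (offset, length) spans, cleaving after K or R."""
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            yield start, i + 1 - start
            start = i + 1
    if start < len(sequence):
        # trailing peptide that does not end in K or R
        yield start, len(sequence) - start


def index_peptides(proteins, digest=tryptic_spans):
    """Map each distinct peptide sequence to all of its parent locations.

    `proteins` is an iterable of (protein_id, sequence) pairs. Each peptide
    is stored once as a key; its value lists every (protein_id, offset)
    at which it occurs.
    """
    parents = defaultdict(list)
    for protein_id, sequence in proteins:
        for offset, length in digest(sequence):
            peptide = sequence[offset:offset + length]
            parents[peptide].append((protein_id, offset))
    return parents
```

For example, with proteins [("P1", "AAKGGR"), ("P2", "GGRAAK")], the index holds two peptide entries, each with two parent locations, instead of four separate heap entries.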
Notes from today's meeting (Zijin and Damon):
We identified another massive memory inefficiency. The TargetInfo structure contains a ProteinInfo structure, which stores the name of the protein as a string. The TargetInfo doesn't have a pointer to a ProteinInfo -- instead, it creates a copy of a ProteinInfo object passed by reference in the TargetInfo constructor. This means that every peptide in every protein separately stores the protein name! Zijin will try to change this. Hopefully, this will be easier to address than the other goals and might provide an even bigger memory savings.
Zijin showed me why it is not trivial to change the peptide heap so that it only stores each peptide sequence once: currently, the TideIndexPeptide doesn't store the sequence of the peptide. Rather, it stores a pointer into the protein sequence. We discussed two options for changing this:
Currently, TargetInfo also references a protein sequence. The least-disruptive option would be to continue to do that. Another possibility would be to change TargetInfo to store the peptide sequence in a string. This would actually increase the memory footprint significantly, unless we could deallocate each protein's information after we generated all of its peptide targets. Zijin will investigate how these structures are used further downstream, to see which approach makes more sense.
In addition to memory footprint, please be sure to carefully consider any impacts on the running time of tide-search.
Some updates on this issue:
I tried to change the structure of TargetInfo so that, instead of creating a new copy of ProteinInfo for each TargetInfo, only one copy of ProteinInfo is created and stored in heap memory, and each TargetInfo holds a pointer to its corresponding ProteinInfo. However, it turned out that this change increased the memory footprint a little bit. One guess is that this is a measurement artifact: we currently use valgrind's massif tool to measure the maximum memory usage, and massif measures heap memory rather than stack memory, so moving the ProteinInfo objects onto the heap could show up as an increase.
The second change I made was to move the decoy generation to the end. After this change, the code sorts the target peptide heap first. Then, for each target peptide, it generates a decoy peptide, checks it against peptides that have the same mass and length to make sure we do not record duplicate peptides, and finally writes both to the output file. The memory usage also increased with this change. My guess is that, because the corresponding protein information must also be written to a file when the code generates a decoy peptide, the code needs to carry a map from target sequences to TargetInfo all the way to the end, until all of the decoys have been generated.
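The "generate decoys at the end" flow described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Crux code: the decoy rule used here (reverse the peptide while keeping both terminal residues in place) is assumed for the example, reverse_interior and targets_with_decoys are hypothetical names, and the sketch simply drops a decoy that collides with an existing peptide. Because this decoy rule preserves mass and length, each decoy only needs to be checked against the current mass group.

```python
from itertools import groupby


def reverse_interior(seq):
    """Assumed decoy rule: reverse the peptide, keeping both termini fixed."""
    if len(seq) < 3:
        return seq
    return seq[0] + seq[-2:0:-1] + seq[-1]


def targets_with_decoys(sorted_targets):
    """Yield (mass, sequence, is_decoy) from mass-sorted, unique targets.

    `sorted_targets` is an iterable of (mass, sequence) tuples sorted by
    mass. Only one mass group is held in memory at a time.
    """
    for mass, group in groupby(sorted_targets, key=lambda pep: pep[0]):
        targets = [seq for _, seq in group]
        taken = set(targets)
        for seq in targets:
            yield mass, seq, False
        for seq in targets:
            decoy = reverse_interior(seq)
            if decoy in taken:
                continue  # collides with a target or an earlier decoy
            taken.add(decoy)
            yield mass, decoy, True
```

Note that this sketch carries only sequences; the extra memory discussed above comes from also having to keep each target's protein information available until its decoy is written.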
Notes from Wednesday’s meeting (Zijin and Damon):
We are going to change the code back and implement the on-disk option.
First, we are going to move the decoy generation to the beginning, so that both the target and decoy peptides are stored in a heap at the same time. When the total number of peptides in the heap exceeds a certain number, we will sort the heap, write the peptides to a file, and clear the heap. The code will continue doing this until it finishes iterating over all proteins and their peptides.
Then, we will read peptides from all of the files, repeatedly take the one with the smallest mass, and process it the same way the code used to. For peptides that are both decoys and targets, we are going to only record the aux locations of the targets and ignore the decoys.
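The merge step with the "targets win over decoys" rule can be sketched as below. This is a hedged Python illustration, not the planned C++ implementation: merge_prefer_targets and the (mass, sequence, is_decoy) record layout are assumptions, and the sketch drops the decoy record entirely rather than modeling aux locations.

```python
import heapq
from itertools import groupby


def merge_prefer_targets(*sorted_batches):
    """Merge mass-sorted batches of (mass, sequence, is_decoy) records.

    Within each mass group a sequence is emitted once. If it occurs as
    both a target and a decoy, the target is kept and the decoy dropped.
    """
    merged = heapq.merge(*sorted_batches, key=lambda rec: rec[0])
    for mass, group in groupby(merged, key=lambda rec: rec[0]):
        best = {}
        for _, seq, is_decoy in group:
            # False < True, so min() keeps the target flag when both occur.
            best[seq] = min(best.get(seq, True), is_decoy)
        for seq in sorted(best):
            yield mass, seq, best[seq]
```

Since identical sequences always have identical masses, duplicates can only occur within a single mass group, so the dictionary never needs to hold more than the peptides at the current mass.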
Last edit: Zijin Zhang 2016-05-27