
Processing very large number of samples

2023-05-30
2023-06-16
  • Hugo Tavares

    Hugo Tavares - 2023-05-30

    Hi,

    I am trying to use your tools on a large set of samples (thousands of samples from public datasets).
    With this many samples, the identifyUTR step is predicted to take ~20 days on a large HPC and has so far consumed ~400 GB of RAM. This is not feasible on our HPC (we have a job time limit of 7 days).
    In any case, it seems this should be possible to run more efficiently by parallelising some of the steps.

    I'm not fluent in Perl, so I cannot quite figure out which steps can be split safely without affecting the assumptions made by your analysis workflow.
    For example, could I generate a separate annotation BED per gene, run identifyUTR in parallel for each gene, and then somehow combine the results at the end? And if so, how should the results be combined?

    Any recommendations would be welcome.
    Many thanks,
    hugo

  • Congting Ye

    Congting Ye - 2023-06-16

    Hi Hugo,

    Our APAtrap pipeline does not currently support parallel computing. One way to speed up the process is, as you mentioned, to split the bedgraph files and the gene model file by genomic position, and then process the chunks separately and/or simultaneously.

    Since the result for each gene corresponds to a single line in the output file, you can combine the per-chunk results with the "cat" command-line tool.
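    A minimal sketch of that split-and-combine approach, splitting by chromosome. The file names and contents here are toy placeholders, and the identifyUTR invocation is left as a comment because its actual options should be taken from the APAtrap manual:

    ```shell
    #!/bin/sh
    # Split a bedgraph and a gene-model BED by chromosome so each chunk
    # can be processed independently (e.g. as separate HPC jobs).
    set -eu

    # Toy inputs standing in for the real files (hypothetical content).
    printf 'chr1\t0\t100\t5\nchr2\t0\t50\t3\n' > sample.bedgraph
    printf 'chr1\t0\t200\tgeneA\nchr2\t0\t80\tgeneB\n' > genes.bed

    mkdir -p chunks
    # awk writes each line into a per-chromosome file named after column 1.
    awk '{ print > ("chunks/" $1 ".bedgraph") }' sample.bedgraph
    awk '{ print > ("chunks/" $1 ".genes.bed") }' genes.bed

    # For each chromosome, run identifyUTR on its chunk (placeholder:
    # substitute the real command and options from the APAtrap manual).
    for bg in chunks/*.bedgraph; do
        chrom=$(basename "$bg" .bedgraph)
        # identifyUTR ... chunks/"$chrom".genes.bed ... "$bg" > chunks/"$chrom".out
        : # no-op placeholder so the loop runs as-is
    done

    # Since each gene yields exactly one output line, the per-chromosome
    # results can simply be concatenated at the end:
    # cat chunks/*.out > combined.out
    ```

    The same loop could be replaced by one HPC array job per chromosome, which keeps each job well under the 7-day limit.
    
    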

    Best,
    Congting

