
Huge disk space used by intermediate files

Orson
Created: 2018-10-11
Last updated: 2019-03-19
  • Orson

    Orson - 2018-10-11

    Dear kSNP3 team,

    I chose to use kSNP3 because I need to find SNPs in 1,550 genomes. I have experience with Parsnp, but it cannot analyze my data.

    P.S. I am using a machine with 30 cores, 128 GB of RAM, and 950 GB of hard disk.

    I have a few questions. My input files are draft genome assemblies, not reads, roughly 5 MB each.

    1. When I run kSNP3 it generates a huge number of intermediate files; for my data that came to more than 900 GB. Is there an option to remove the intermediate files the program no longer needs? (For a manual workaround, see the sketch after this list.)
    2. Some steps do not use all of my cores even though I pass the option -CPU 25. For example, the step "Removing kmers that occur less than freq=average of median and mean kmer frequency for that genome" works through the files one by one. Is it possible to run steps like this on several CPUs? (See the illustration after this list.)
    3. How much hard disk space do you think I will need to finish the analysis?
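
    For reference, the kind of manual cleanup meant in question 1 (a sketch only; it assumes the Dir.fsplitNNN directories shown in the log below and the TemporaryFilesToDelete folder are the intermediates in question, and that they are only safe to remove after a run has finished):

    # See which intermediates are eating the disk (run inside the output directory)
    du -sh Dir.fsplit* TemporaryFilesToDelete 2>/dev/null | sort -h | tail
    # Once the run is done and the final SNP files are copied somewhere safe:
    rm -rf Dir.fsplit* TemporaryFilesToDelete

    And for question 2, a generic illustration of spreading independent per-genome work across CPUs with GNU parallel (per_genome_command is a placeholder; parallelizing kSNP3's own internal steps would mean editing its scripts):

    ls genomes/*.fasta | parallel -j 25 'per_genome_command {}'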

    Finally:

    **I get this error when running the program. I think it is just a lack of hard disk space, or maybe RAM. What is your opinion? (The failing line is near the end of the log below; see the note after it.)**

    Concatenate results for each genome and sort by locus to create SNPs_all_labelLoci
    Thu Oct 11 11:13:36 UTC 2018
    .
    .
    .
    genome: 790-97_Peru_2007_C in Dir.fsplit99
    genome: GCA_002221085_1_ASM222108v1_2013_NA_C in Dir.fsplit990
    genome: GCA_002221095_1_ASM222109v1_2013_New_England_C in Dir.fsplit991
    genome: GCA_002221145_1_ASM222114v1_2013_NA_C in Dir.fsplit992
    genome: GCA_002221165_1_ASM222116v1_2013_NA_C in Dir.fsplit993
    genome: GCA_002221175_1_ASM222117v1_2013_NA_C in Dir.fsplit994
    genome: GCA_002221185_1_ASM222118v1_2013_NA_C in Dir.fsplit995
    genome: GCA_002221225_1_ASM222122v1_2013_NA_E in Dir.fsplit996
    genome: GCA_002221245_1_ASM222124v1_2013_NA_E in Dir.fsplit997
    genome: GCA_002221265_1_ASM222126v1_2013_NA_C in Dir.fsplit998
    genome: GCA_002221285_1_ASM222128v1_2014_NA_C in Dir.fsplit999
    sort: write failed: /tmp/sort5v6kCZ: No space left on device
    Number_SNPs: 1
    $count_snps: 0
    Finished finding SNPs
    Thu Oct 11 12:36:46 UTC 2018
    rm: cannot remove 'TemporaryFilesToDelete': No such file or directory
    mv: No match.
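
    The failing line suggests that sort ran out of space in /tmp specifically, which is often a small partition separate from the data disk. A possible workaround, assuming GNU coreutils sort (it honors the TMPDIR environment variable); the kSNP3 invocation below is only an example:

    # Point temporary files at the large data disk before launching kSNP3
    export TMPDIR=/path/on/big/disk/tmp   # example path; pick the 950 GB disk
    mkdir -p "$TMPDIR"
    kSNP3 -in in_list -outdir run_out -k 19 -CPU 25   # adjust to your own run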

    I hope you can help. Thank you so much.

    Regards,

    Orson.

  • Iowa

    Iowa - 2019-03-19

    I have the same problem: ~1,300 genomes and more than 800 GB of data generated on the hard drive.

    Is there a solution?

    SOLUTION: I used a dedicated high-capacity SSD.

    I ran into a second problem:

    Now it runs smoothly: Jellyfish runs, but then there is a step where it has to write into dedicated directories. It works fine for the first ~600 genomes, then for every remaining genome it reports
    "awk: write failure (File too large) awk: close failed on file /dev/stdout (File too large)"
    until the last one.
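
    "File too large" from awk is the EFBIG error: a per-process file size limit or a filesystem cap on single-file size was hit, not a full disk. Two quick checks (a sketch; run them in the shell that launches kSNP3):

    ulimit -f   # "unlimited" is what you want; a finite number means a cap is set
    df -Th .    # FAT32/exFAT and some network filesystems cap single-file size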

    It generates a FASTA matrix of ~600 sequences and then starts building trees. At that stage it prints "TOO FEW SPECIES" during the core stage.

    I have used kSNP several times in the past, including recently, but this is the first time I have encountered such problems.

    INFO: All genomes are from the same species; the input for each genome is its final assembly, a multi-FASTA file containing the contigs produced by the assembler (~5 MB per file); the machine has 32 cores, 64 GB of RAM, and 2x 1 TB SSDs.

    **SOLUTION: I checked all my assemblies and some were of poor quality; removing or correcting them let me finish the job with good results. The run still uses nearly 1 TB of disk space, and it is still unable to automatically delete the folder "TemporaryFilesToDelete".**
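
    A quick way to flag candidate assemblies for removal (a sketch only; the directory name and extension are placeholders, and file size and contig count are rough quality proxies at best):

    # List assemblies with contig count and byte size; the smallest and most
    # fragmented files are the first suspects
    for f in assemblies/*.fasta; do
        printf '%s\t%s\t%s\n' "$f" "$(grep -c '^>' "$f")" "$(wc -c < "$f")"
    done | sort -t$'\t' -k3,3n | head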

    Thanks for the support provided.


    Last edit: Iowa 2019-03-26

