
#18 Fatal error (may be due to problems of the input data or parameters)

4.0.1
closed
nobody
None
2015-05-26
2015-05-20
No

This may be a simple fix, but I am quite new to Linux. I am trying to run a hybrid assembly with MIRA and received this error:
Fatal error (may be due to problems of the input data or parameters):


  8008808 reads were detected with names longer than 40 characters (see output
  log for more details).

  While MIRA and many other programs have no problem with that, some older
  programs have restrictions concerning the length of the read name.

  Example given: the pipeline
  CAF -> caf2gap -> gap2caf

Is there a way to fix the titles of the fastq files without interfering with quality information?

Discussion

  • Bastien Chevreux

    I do not think the problem is that you are a Linux newcomer :-)

    You truncated the message as it appeared on the screen. It actually continues like this:


    This is a warning only, but as a couple of people were bitten by this, the default
    behaviour of MIRA is to stop when it sees that potential problem.

    You might want to rename your reads to have <=40 characters. Instead of renaming reads in the input files, maybe the 'rename_prefix' functionality of manifest files is useful for you there.

    On the other hand, you also can ignore this potential problem and force MIRA to
    continue by using the parameter: '-NW:cmrnl=warn' or '-NW:cmrnl=no'


    So, in this special case, MIRA explicitly gives you 3 different solutions to deal with this problem, two of them via a simple parameter option in the manifest file. Is there anything which could be phrased more clearly to help you choose your preferred solution?
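For the first of those solutions, renaming the reads in the input files themselves, a minimal Python sketch could look like the following. It assumes plain four-line FASTQ records (no wrapped sequence lines). The function name `rename_fastq_records` and the `read_` prefix are hypothetical and not part of MIRA. Sequence and quality lines are passed through untouched; only the header is shortened, and the `+` separator line is reduced to a bare `+` so that it cannot repeat the old long name. Pairing conventions that depend on the original names are not handled here.

```python
from itertools import islice


def rename_fastq_records(lines, prefix="read_", max_len=40):
    """Yield FASTQ lines, replacing read names longer than max_len.

    Assumes 4-line records: @header, sequence, +separator, quality.
    New names are prefix + running record number (a hypothetical scheme).
    """
    it = iter(lines)
    count = 0
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            break
        if (len(record) != 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed FASTQ record near read {count + 1}")
        count += 1
        header, seq, _, qual = record
        # The read name is the first token after '@'; the rest is a comment.
        name, _, rest = header[1:].partition(" ")
        if len(name) > max_len:
            name = f"{prefix}{count}"
        yield "@" + name + (" " + rest if rest else "")
        yield seq    # sequence is unchanged
        yield "+"    # bare separator, never repeats the old name
        yield qual   # quality string is unchanged
```

To apply it to a file, read the input line by line and write each yielded line followed by a newline to a new file, so that the original data stays intact.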

     
  • David Sannino

    David Sannino - 2015-05-21

    Thank you for the response, I tried both adding the rename_prefix to the manifest file and using the NW:cmrnl=no parameter, but the same error keeps popping up.

    Example given: the pipeline
    CAF -> caf2gap -> gap2caf
    will stop working at the gap2caf stage if there are read names having > 40
    characters where the names differ only at >40 characters.

    This is a warning only, but as a couple of people were bitten by this, the
    default behaviour of MIRA is to stop when it sees that potential problem.

    You might want to rename your reads to have <= 40 characters. Instead of
    renaming reads in the input files, maybe the 'rename_prefix' functionality
    of manifest files is useful for you there.

    On the other hand, you also can ignore this potential problem and force MIRA
    to continue by using the parameter: '-NW:cmrnl=warn' or '-NW:cmrnl=no'

    ->Thrown: void Assembly::checkForReadNameLength(uint32 stoplength)
    ->Caught: main

    Aborting process, probably due to error in the input data or parametrisation.
    Please check the output log for more information.
    For help, please write a mail to the mira talk mailing list.
    Subscribing / unsubscribing to mira talk, see: http://www.freelists.org/list/mira_talk

     
  • David Sannino

    David Sannino - 2015-05-21

    Never mind, I made a stupid mistake. I believe it is running now. Thank you again for your help.

     
  • David Sannino

    David Sannino - 2015-05-25

    I am now experiencing the following error:
    MIRA warncode: ASCOV_VERY_HIGH
    Title: Very high average coverage

    You are running a genome de-novo assembly and the current best estimation for
    average coverage is 121x (note that this number can be +/- 20% off the real
    value). This is a pretty high coverage, higher than the current warning threshold
    of 80x.

    You should try to get the average coverage not higher than, say, 60x to 100x for
    Illumina data or 40x to 60x for 454 and Ion Torrent data. Hybrid assemblies
    should target a total coverage of 80x to 100x as upper bound. For that, please
    downsample your input data.

    This warning has two major reasons:
    - for MIRA and other overlap based assemblers, the runtime and memory
    requirements for ultra-high coverage projects grow exponentially, so reducing
    the data helps you there
    - for all assemblers, the contiguity of an assembly can also suffer if the
    coverage is too high, i.e. you get more contigs than you would otherwise.
    Causes for this effect can be non-random sequencing errors or low frequency
    sub-populations with SNPs which become strong enough to be mistaken for
    possible repeats.

    Do you have any recommendations for efficiently reducing the coverage of the input data? I've tried using trimmomatic with stringent settings, but the coverage is still quite high.

    Thanks

     
  • Bastien Chevreux

    For this kind of question, please use the MIRA talk mailing list, where other people help me answer all kinds of questions.

    That being said ... two things:
    1) Do not use trimmomatic or other such software on Illumina reads when working with MIRA. See also http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_pd_illumina

    2) Reducing the coverage can be done with a number of different software packages. The easiest way, however (and the one I use often enough), is to simply run the Unix "head" command on the FASTQ files: determine the number of reads you want to keep, multiply by 4 (each FASTQ record spans four lines), and pass that number to head. E.g.: "head -n 4000000 input >output" will extract the first 1 million reads from input to output.
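The arithmetic behind that "head" approach can be sketched in Python as follows. This is an illustration only, under two assumptions that are not from the thread: coverage scales linearly with the number of reads (roughly uniform read lengths), and the FASTQ files use plain four-line records. The function names `reads_to_keep`, `take_first_reads` and `downsample_fastq` are hypothetical. For paired-end data, the same number of reads would have to be taken from both files so that the mates stay in step.

```python
from itertools import islice


def reads_to_keep(total_reads, current_coverage, target_coverage):
    """Estimate how many reads to keep to reach target_coverage.

    Assumes coverage is proportional to the number of reads.
    """
    if target_coverage >= current_coverage:
        return total_reads  # already at or below the target
    return int(total_reads * target_coverage / current_coverage)


def take_first_reads(lines, n_reads):
    """Return an iterator over the first n_reads FASTQ records (4 lines each)."""
    return islice(lines, 4 * n_reads)


def downsample_fastq(in_path, out_path, n_reads):
    """Write the first n_reads records of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(take_first_reads(src, n_reads))
```

For example, going from the reported 121x down to 80x means keeping roughly 80/121, or about two thirds, of the reads; `reads_to_keep(total, 121, 80)` gives the read count to pass to `downsample_fastq`, where `total` is the number of reads in the file.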

     
  • Bastien Chevreux

    • status: open --> closed
     
