
Map read process is complete (100%), but running for hours without an output

Anonymous
2019-05-13
2019-07-24
  • Anonymous

    Anonymous - 2019-05-13

    Hello,
    Please find attached a fastq file of a bean (Phaseolus vulgaris) sample generated on an Illumina MiSeq for GBS analysis and trimmed using cutadapt on Linux. I am trying to map the reads against the P. vulgaris (G19833) reference genome. The map reads process shows 100% progress, but it keeps running for hours without producing an output. I would greatly appreciate your help with this matter.

    Regards,
    Mohammad Erfatpour
    PhD candidate, Dry bean Breeding & Genetics, Department of Plant Agriculture
    University of Guelph, ON, Canada
    Email: merfatpo@uoguelph.ca

     
  • Jorge Duitama

    Jorge Duitama - 2019-05-14

    Dear Mohammad

    Many thanks for your interest in NGSEP. I could download the fastq and reproduce the error. Unfortunately, it is a very weird error that we did not observe in our test data, so I do not have an immediate answer. In the meantime, I could successfully map the reads using the command line, so, if possible, try to map the reads using bowtie2 directly (you can also use bwa). Please also let me know if the error only occurs in this sample or if you have observed the same issue in other samples. I will update this forum as soon as I can find a better solution.

    Best regards

    Jorge

     

    Last edit: Jorge Duitama 2019-05-14
  • Jorge Duitama

    Jorge Duitama - 2019-05-14

    Dear Mohammad

    Looking back at the log of the test run that I made using the command line I found the issue. The preprocessing that you are performing produces empty reads and reads with only one nucleotide. See for example the read with id "M05499:20:000000000-CBWGL:1:1101:23232:3417" or id "M05499:20:000000000-CBWGL:1:1118:14554:1883". This issue is somehow confusing the graphical interface. For the next release we will try to improve the error message to avoid getting into an infinite loop in this case.
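
One way to check a fastq file for this problem before mapping is to count the empty and very short reads. The following is a minimal sketch, not part of NGSEP: the function name `count_short_reads` is hypothetical, and it assumes a plain (uncompressed) fastq file with exactly four lines per read.

```shell
# Hypothetical helper: count reads whose sequence length is <= MAXLEN.
# Assumes an uncompressed fastq with exactly 4 lines per record.
# Usage: count_short_reads reads.fastq [MAXLEN]   (MAXLEN defaults to 1)
count_short_reads() {
  awk -v max="${2:-1}" '
    NR % 4 == 2 {                     # line 2 of each record is the sequence
      total++
      if (length($0) <= max) n_short++
    }
    END { printf "%d of %d reads have length <= %d\n", n_short, total, max }
  ' "$1"
}
```

Calling `count_short_reads Samplet_1.fastq` reports how many reads are empty or a single nucleotide long; a nonzero count suggests the trimming step needs a minimum length setting.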

    Because those reads will not be useful anyway, as a quick fix you can eliminate them by running the following awk command on each of your samples:

    awk '{if(NR%4==1)id=$0;if(NR%4==2)r=$0;if(NR%4==0 && length(r)>20){print id;print r;print "+";print $0}}' Samplet_1.fastq > Samplet_1_L20.fastq

    This command will only keep reads longer than 20 nucleotides. You can also take a look at our demultiplexing functionality as an alternative way to preprocess your reads. In our demultiplexing procedure you can choose the minimum length to keep a read.
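
For readers who want to reuse the awk one-liner above with a different threshold, here is a hedged sketch of the same filter wrapped in a shell function. The name `filter_fastq_longer_than` is hypothetical, and the sketch assumes an uncompressed fastq with four lines per read. Unlike the one-liner, it keeps the original third line of each record instead of replacing it with a bare "+".

```shell
# Hypothetical helper: keep only reads strictly longer than LEN nucleotides.
# Assumes an uncompressed fastq with exactly 4 lines per record.
# Usage: filter_fastq_longer_than IN.fastq OUT.fastq [LEN]   (LEN defaults to 20)
filter_fastq_longer_than() {
  awk -v len="${3:-20}" '
    NR % 4 == 1 { id = $0 }           # read identifier
    NR % 4 == 2 { seq = $0 }          # nucleotide sequence
    NR % 4 == 3 { plus = $0 }         # separator line
    NR % 4 == 0 && length(seq) > len {
      print id; print seq; print plus; print $0   # $0 is the quality line
    }
  ' "$1" > "$2"
}
```

For example, `filter_fastq_longer_than Samplet_1.fastq Samplet_1_L20.fastq` should produce the same reads as the one-liner, and dividing the output of `wc -l < Samplet_1_L20.fastq` by four gives the number of reads kept.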

    Let me know if you have further issues running NGSEP.

    Best regards

    Jorge

     
    • Mohammad Erfatpour

      Hi Jorge,

      Thanks a lot for your quick reply. I added the awk command at the beginning of the sample, but I got the error 'Map reads process has encountered a problem'. So far, using this file without the awk command, I have been able to run 'Wizard Single End', which gives a 3.63% overall alignment rate. I think I still need to work on the dataset to figure out the issue and improve the alignment rate. By the way, thank you very much for making this program freely available and for taking the time to figure out the issue. I find the program very helpful and straightforward, and I hope that I can take full advantage of it soon.

      Best regards,
      Mohammad


       
  • Jorge Duitama

    Jorge Duitama - 2019-05-17

    Hi Mohammad

    Great, I am glad to know that you are finding the software useful. Please find attached a text file with the awk command (some characters can get lost when copying and pasting from the web page) and the fastq file I got. I aligned the file to the latest bean reference genome available in Phytozome (v2.1) and got an alignment rate of 87.89%. You can double-check with diff whether the file you get from the awk command is identical to the attached fastq file.
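
The comparison can be sketched as follows. This is only an illustration: the function name `same_fastq` is hypothetical, and it uses `cmp -s` for the yes/no answer before falling back to `diff` to show where the files first differ.

```shell
# Hypothetical helper: report whether two fastq files are byte-identical.
# Usage: same_fastq MINE.fastq REFERENCE.fastq
same_fastq() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "different"
    # Show only the first few differing lines to keep the output short.
    diff "$1" "$2" | head -n 8
  fi
}
```

If the files differ only in line endings (for example after a transfer through Windows), every line will be reported as different even though the reads are the same.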

    Let me know how things go

    Jorge

     
  • Jorge Duitama

    Jorge Duitama - 2019-07-24

    Hi Mohammad

    I am more than happy to help you, but unfortunately I cannot reply instantly because we are not a for-profit company. Also, you do not need to spam the forum (or my e-mail, for that matter).

    These new reads are all long (over 200 bp), but they definitely do not look like GBS reads. Almost all of them (99.46%) start with a long A mononucleotide run. You may want to run FastQC and see what comes up. Also, they are very unlikely to come from a bean sample, so you may want to check for a sample mix-up or contamination.
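
A quick way to reproduce this kind of check is to count how many reads begin with a run of As. The sketch below is an assumption-laden illustration, not the method used for the figure above: the function name `polya_fraction` is hypothetical, and the default run length of 10 is an arbitrary choice for what counts as a "long" run.

```shell
# Hypothetical helper: percentage of reads starting with at least RUN As.
# Assumes an uncompressed fastq with exactly 4 lines per record.
# Usage: polya_fraction reads.fastq [RUN]   (RUN defaults to 10, an arbitrary choice)
polya_fraction() {
  awk -v run="${2:-10}" '
    BEGIN {
      # Build a string of RUN As: pad an empty string to RUN spaces, then swap.
      prefix = sprintf("%" run "s", "")
      gsub(/ /, "A", prefix)
    }
    NR % 4 == 2 {                     # sequence line of each record
      total++
      if (substr($0, 1, run) == prefix) hits++
    }
    END {
      if (total > 0)
        printf "%.2f%% of %d reads start with at least %d As\n", 100 * hits / total, total, run
    }
  ' "$1"
}
```

A high percentage here is a sign that something is wrong upstream of the mapping step, which FastQC's per-base sequence content plot should also make visible.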

    Best regards

    Jorge

     
