#11 fix parsing of fastq seq names

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2008-09-20

Created: 2008-09-20

Creator: Aaron

Private: No

In some of the data files from NCBI SRA, the fastq format is broken by spaces in the seqname fields. When this happens, the part after the space plus the next line, up to the max length, is taken to be the quality scores, and fastq2bfq will not complain unless one of the qualities is '@'. In that case, it says that there is an "Inconsistant sequence name" and continues parsing the fastq file incorrectly.

To verify this, create a fastq file with spaces in the seqname, then run fastq2bfq followed by bfq2fastq. MAQ will not complain, but the bfq file is totally messed up.
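
For illustration, a minimal input of the kind described might look like the record below. The read name, comment text, bases, and qualities are all made up for this example; real SRA records carry longer names and reads.

```
@read1 extra description
ACGTACGT
+read1 extra description
IIIIIIII
```

Here both the '@' line and the '+' line carry text after a space, which is the case that trips up the parser.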

I think (but have not tested or recompiled MAQ myself) that the problem is in reading the second seqname (at line 86 of seq.c): after coming across a space, it fails to skip ahead to the next '\n' the way line 69 of the same file does. Adding a similar check to make sure that it is at a '\n' should at least allow MAQ to create a proper bfq file.

Thanks,
Aaron Hardin

Discussion

Raymond Wan - 2011-04-11

Thank you for your report! I recently faced the same problem and concluded that you were on the right track. Since the FASTQ standard seems to state that comments can appear after the '+', these comments (which are common in NCBI's data) need to be removed from the buffer.

The while loop on line 86 will read up to the first whitespace. Whatever is left on that line still needs to be consumed, and a line similar to what is on line 69 should do the trick.
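
As an illustration only, here is a standalone sketch of that idea. It uses plain stdio rather than MAQ's actual seq.c code, and `skip_to_eol`, `read_token`, `read_record`, and `MAX_FIELD` are names made up for this example. It assumes four-line records whose fields are shorter than `MAX_FIELD`.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_FIELD 256 /* hypothetical cap on name/sequence length */

/* Consume the rest of the current line; c is the last character read. */
static int skip_to_eol(FILE *fp, int c)
{
    while (c != '\n' && c != EOF)
        c = fgetc(fp);
    return c;
}

/* Read characters up to the first whitespace into buf (truncating at cap).
 * Returns the whitespace character that ended the token, or EOF. */
static int read_token(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    int c;
    while ((c = fgetc(fp)) != EOF && !isspace(c))
        if (n + 1 < cap)
            buf[n++] = (char)c;
    buf[n] = '\0';
    return c;
}

/* Read one four-line record. Returns 1 on success, 0 at EOF or on a
 * malformed record. name, seq and qual must each hold MAX_FIELD bytes. */
static int read_record(FILE *fp, char *name, char *seq, char *qual)
{
    char name2[MAX_FIELD];
    int c;

    while ((c = fgetc(fp)) != EOF && c != '@')
        ; /* find the start of the record */
    if (c == EOF)
        return 0;

    c = read_token(fp, name, MAX_FIELD);
    skip_to_eol(fp, c); /* drop any comment after the first name */

    c = read_token(fp, seq, MAX_FIELD);
    skip_to_eol(fp, c);

    if (fgetc(fp) != '+')
        return 0;
    c = read_token(fp, name2, MAX_FIELD);
    skip_to_eol(fp, c); /* the step discussed above: drop the comment after '+' */

    c = read_token(fp, qual, MAX_FIELD);
    skip_to_eol(fp, c);

    /* The name after '+' is optional; if present it must match. */
    if (name2[0] != '\0' && strcmp(name, name2) != 0)
        return 0;
    return strlen(seq) == strlen(qual);
}
```

Because the quality line is only read after the rest of the '+' line has been discarded, a comment after the second name can no longer be mistaken for quality scores in this sketch.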

If the comment is left in the buffer, the quality scores can become "out of sync", and if none of the qualities is '@', the problem will go unnoticed by the user.

I am uploading a patch now that hopefully solves this problem -- it does basically what you suggested. Thank you!
