Reading FASTQ data

2016-05-23
2016-05-24
  • Will Stokes

    Will Stokes - 2016-05-23

    Does Staden io_lib support reading FASTQ content into a Read object? If not, can you point me to where in the code I would begin to add such support myself? It appears there are multiple "standards" for encoding the quality data, and it would be nice if I could use a library like io_lib to handle that nonsense for me. :-) I see there is a program to convert scf to fastq, but I didn't see anything for the other way around.

     
    • James Bonfield

      James Bonfield - 2016-05-23

      On Mon, May 23, 2016 at 01:12:39PM +0000, Will Stokes wrote:

      > Does Staden io_lib support reading FASTQ content into a Read object?

      No, sorry. It would probably be a bit of a heavy-weight API for fastq
      anyway.

      > If not, can you point me to where in the code I would begin to add
      > such support myself? It appears there are multiple "standards" for
      > encoding the quality data, and it would be nice if I could use a
      > library like io_lib to handle that nonsense for me. :-) I see
      > there is a program to convert scf to fastq, but I didn't see anything
      > for the other way around.

      The only code in the Staden Package that deals with fastq would be in
      Gap5 itself (in the confusingly named staden/src/gap5/fasta.c).
      Possibly that should have been added to io_lib, but for whatever
      reason at the time I didn't (I can't think why - probably just
      laziness).

      However, it doesn't deal with the multiple ways of encoding quality.
      Frankly, I'd be inclined to ignore every fastq variant except the
      standard one, where each quality character is qval + 33 (so '!' is
      quality 0). The others were invented by Illumina (and subsequently
      dropped again, I believe).
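
To make the qval + 33 encoding concrete, here is a minimal Python sketch of decoding one quality string. It is only an illustration of the arithmetic described above, not code from io_lib or Gap5; the function name `decode_phred33` is made up for this example, and the range check assumes quality characters are printable ASCII ('!' through '~').

```python
from typing import List


def decode_phred33(quality_line: str) -> List[int]:
    """Turn a FASTQ quality string into integer quality values.

    Assumes the standard encoding: character code = qval + 33.
    """
    scores = []
    for ch in quality_line:
        code = ord(ch)
        # '!' (33) is quality 0; '~' (126) is the highest printable ASCII character.
        if code < 33 or code > 126:
            raise ValueError(f"not a valid quality character: {ch!r}")
        scores.append(code - 33)
    return scores
```

For example, `decode_phred33("!+5I")` gives the quality values 0, 10, 20 and 40.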

      Processing fastq yourself isn't hard, though, provided you use the
      length of the sequence as the indicator of how many quality values
      to expect. I.e. don't fall into the pitfall of assuming that a line
      of qualities starting with "@" is the next sequence identifier when
      you haven't yet read enough quality values ("@" is itself a valid
      quality character). Other than that, it's such a simple format that
      you can roll your own parser in relatively few lines.
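
As a rough sketch of that advice, the Python generator below reads quality lines until it has as many characters as the sequence, so a quality line that happens to start with "@" is never mistaken for a header. This is not the Gap5 code; the name `read_fastq` and the (name, sequence, quality) tuple layout are assumptions for illustration, and it assumes the separator line starts with "+" and that sequence lines never do.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) tuples from a FASTQ text stream."""
    line = handle.readline()
    while line:
        line = line.rstrip("\r\n")
        if not line:  # tolerate blank lines between records
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got: {line!r}")
        name = line[1:]

        # Sequence may span several lines; it ends at the '+' separator.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError(f"record {name!r}: missing '+' separator")
        seq = "".join(seq_parts)

        # Read qualities by length, not by looking for the next '@'.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError(f"record {name!r}: truncated quality data")
            chunk = line.rstrip("\r\n")
            qual_parts.append(chunk)
            qual_len += len(chunk)
        if qual_len != len(seq):
            raise ValueError(f"record {name!r}: more qualities than bases")

        yield name, seq, "".join(qual_parts)
        line = handle.readline()


if __name__ == "__main__":
    # The first quality line starts with '@' but belongs to record r1.
    demo = "@r1\nACGT\nTT\n+\n@III\nI!\n@r2\nGG\n+\n+@\n"
    for record in read_fastq(io.StringIO(demo)):
        print(record)
```

Counting characters rather than lines also copes with files that wrap the sequence and quality strings over several lines.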

      James

      --
      James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
      | Plurima gyrabant gymbolitare vabo;
      A Staden Package developer: | Et Borogovorum mimzebant undique formae,
      https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

      --
      The Wellcome Trust Sanger Institute is operated by Genome Research
      Limited, a charity registered in England with number 1021457 and a
      company registered in England with number 2742969, whose registered
      office is 215 Euston Road, London, NW1 2BE.

       
      • Will Stokes

        Will Stokes - 2016-05-24

        Thanks, I appreciate the quick reply and suggestion.

        Will Stokes
        Chief Software Architect
