
#397 FASTA multi seq format variant (add support)

open
nobody
5
2011-02-09
2011-02-09
Jon Ison
No

The FASTA fragment-search programs (fastf, fastm and fasts) use a
modified FASTA sequence format for the query sequences, in which each
fragment of the sequence is separated from the next by a ',' and a
newline. For example:

>Seq_2 Serum albumin
MKWVTFIS,
MADCCEKQ,
MREKVLAS,
MPCTEDYL,
MENFVAFV

These programs are very strict about this format, so we would like to
be able to use seqret to force user-submitted input into the appropriate
format (users commonly forget the header line). But EMBOSS does not
support this format :-(.
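
To make the target layout concrete, here is a minimal Python sketch of a writer for it. This is an illustration only, not EMBOSS code: the function name `format_fragments` and its parameters are hypothetical, and it assumes one fragment per line with a comma after every fragment except the last, as in the example above.

```python
def format_fragments(set_id, fragments, description=""):
    """Write fragments as one header line plus comma-separated lines.

    Hypothetical helper: assumes one fragment per line, with a comma
    after every fragment except the last.
    """
    header = f">{set_id} {description}".rstrip()
    # Joining with ",\n" leaves the final fragment without a comma.
    body = ",\n".join(fragments)
    return f"{header}\n{body}\n"


fragments = ["MKWVTFIS", "MADCCEKQ", "MREKVLAS", "MPCTEDYL", "MENFVAFV"]
print(format_fragments("Seq_2", fragments, "Serum albumin"), end="")
```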

Discussion

  • Jon Ison

    Jon Ison - 2011-02-09

    From the input perspective these would map into a set of 5 sequences,
    with identifiers derived from the set identifier (Seq_2), maybe
    something like: Seq_2_1, Seq_2_2, etc. Since the order of the sequences
    may not be known, although they come from the same source, and the gaps
    between them are also unknown, I don't see any sensible way to map them
    into a gapped sequence.

    For this format there can only be one identifier. So in the case where
    an identifier is not provided for the set feel free to generate one
    (e.g. EMBOSS_001 is fine). If you are feeling clever, then I suppose one
    could be created from the individual sequence identifiers (if set),
    assuming they share a common pattern (e.g. Seq_1, Seq_2, Seq_3 becomes Seq).
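
    The input mapping described above could look roughly like the
    following Python sketch. It is not EMBOSS code: `split_fragments` and
    the `fallback_id` parameter are hypothetical names, and the Seq_2_1
    style numbering is just the suggestion made in this comment.

```python
def split_fragments(text, fallback_id="EMBOSS_001"):
    """Return (identifier, sequence) pairs, one per fragment.

    Hypothetical helper: fragment identifiers are the set identifier
    plus a running number, e.g. Seq_2_1, Seq_2_2.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a '>' header line")
    words = lines[0][1:].split()
    # Generate an identifier when the header line carries none.
    set_id = words[0] if words else fallback_id
    # Commas, not newlines, are treated as the fragment separators here.
    fragments = [f for f in "".join(lines[1:]).split(",") if f]
    return [(f"{set_id}_{i}", frag) for i, frag in enumerate(fragments, start=1)]


record = ">Seq_2 Serum albumin\nMKWVTFIS,\nMADCCEKQ,\nMREKVLAS\n"
for identifier, sequence in split_fragments(record):
    print(identifier, sequence)
```

    No attempt is made to order the fragments or to join them into one
    gapped sequence, for the reasons given above.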

    For our case the typical cases are:

    1. Adding a header line, since one has not been supplied.

    2. Mapping a set of individual sequences (usually in fasta) into this
    format to run the search.

    3. Verifying the format prior to launching the search.
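
    Cases 1 and 3 can be sketched in a few lines of Python. This is only
    an illustration of the intended behaviour, not what seqret does:
    `ensure_header` and `is_valid` are hypothetical names, the default
    identifier EMBOSS_001 is taken from the suggestion above, and the
    check assumes one fragment of plain letters per line.

```python
import re


def ensure_header(text, default_id="EMBOSS_001"):
    """Prepend a generated header line when the input has none (case 1)."""
    stripped = text.lstrip()
    if stripped.startswith(">"):
        return stripped
    return f">{default_id}\n{stripped}"


def is_valid(text):
    """Check header plus comma-terminated fragment lines (case 3)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) < 2 or not lines[0].startswith(">"):
        return False
    *inner, last = lines[1:]
    # Every fragment line but the last must end with a comma;
    # the last one must not.
    return (all(re.fullmatch(r"[A-Za-z]+,", line) for line in inner)
            and re.fullmatch(r"[A-Za-z]+", last) is not None)


user_input = "MKWVTFIS,\nMADCCEKQ\n"
print(is_valid(user_input))                 # header missing
print(is_valid(ensure_header(user_input)))  # header added
```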

     
  • Jon Ison

    Jon Ison - 2011-02-10

    > assuming they share a common pattern (e.g. Seq_1, Seq_2, Seq_3 becomes
    > Seq).

    _001 _002 is our usual style to keep the ID length consistent at least
    up to 999.

    > For our case the typical cases are:
    >
    > 1. Adding a header line, since one has not been supplied.
    >
    > 2. Mapping a set of individual sequences (usually in fasta) into this
    > format to run the search.
    >
    > 3. Verifying the format prior to launching the search.

    OK, so the EMBOSS approach would be:

    Test before we try FASTA.

    If it starts with a > fasta header, AND the first sequence line ends
    with a comma,
    then parse each line with a comma as a new sequence and stop when the
    last line is read.

    But ... we would need to insist on the comma so a FASTM format input
    with only one sequence would be read as FASTA.

    Awkward - we would then normally fail it if the user had -sformat fastm
    on the command line.
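
    As a rough Python sketch of that test (the real check would live in
    the EMBOSS C reader; `guess_format` is a hypothetical name used only
    for illustration):

```python
def guess_format(text):
    """Apply the proposed test: '>' header AND first sequence line ends with ','."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) >= 2 and lines[0].startswith(">") and lines[1].endswith(","):
        return "fastm"
    # Anything else falls through to the ordinary FASTA reader.
    return "fasta"


print(guess_format(">Seq_2 Serum albumin\nMKWVTFIS,\nMADCCEKQ\n"))
print(guess_format(">Seq_2 Serum albumin\nMKWVTFIS\n"))
```

    The second call shows the awkward case: a single-fragment input has
    no comma, so it is indistinguishable from plain FASTA.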

    When writing, I think we have to go for saving them up and making up an
    ID for output so that EMBOSS_001 EMBOSS_002 etc. get an ID of EMBOSS_
    (trimmed to EMBOSS) and the IDs could be regenerated on reading again.
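
    A minimal Python sketch of that ID round trip, assuming the IDs share
    a literal common prefix. The names `set_id_from` and `numbered_ids`
    are hypothetical, and this is not how EMBOSS implements it.

```python
import os
import re


def set_id_from(ids, fallback="EMBOSS"):
    """Derive one set ID from the individual sequence IDs."""
    prefix = os.path.commonprefix(list(ids))
    # Trim the trailing digits and underscores: "EMBOSS_00" -> "EMBOSS".
    prefix = re.sub(r"[_\d]+$", "", prefix)
    return prefix or fallback


def numbered_ids(set_id, count):
    """Regenerate per-fragment IDs in the _001, _002 style on reading."""
    return [f"{set_id}_{i:03d}" for i in range(1, count + 1)]


print(set_id_from(["EMBOSS_001", "EMBOSS_002"]))
print(set_id_from(["Seq_1", "Seq_2", "Seq_3"]))
print(numbered_ids("EMBOSS", 3))
```

    Trimming trailing digits is crude: IDs that legitimately end in shared
    digits (accession numbers, for instance) would be shortened too, so a
    real implementation would need a stricter rule.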

    How will you run seqret for this format? Should we assume seqret
    -osformat fastm, and interpret the input as fastm or fasta depending
    on the number of fragments?

     