#397 FASTA multi seq format variant (add support)

Status: open

Owner: nobody

Labels: Sequence formats (24)

Priority: 5

Updated: 2011-02-09

Created: 2011-02-09

Creator: Jon Ison

Private: No

The FASTA fragment-search programs (fastf, fastm and fasts) use a
modified FASTA sequence format for the query sequences, in which the
fragments of a sequence are separated by a ',' and a newline. For example:

>Seq_2 Serum albumin
MKWVTFIS,
MADCCEKQ,
MREKVLAS,
MPCTEDYL,
MENFVAFV

These programs are very specific about this format, so we would like to
be able to use seqret to force user-submitted input into the appropriate
format (users commonly forget the header line). But EMBOSS does not
support this format :-(.

Discussion

Jon Ison - 2011-02-09

From the input perspective these would map into a set of 5 sequences,
with identifiers derived from the set identifier (Seq_2), maybe
something like: Seq_2_1, Seq_2_2, etc. Since the order of the sequences may
not be known, although they come from the same source, and the gaps
between them are also unknown, I don't see any sensible way to map them
into a gapped sequence.

For this format there can only be one identifier. So in the case where
an identifier is not provided for the set, feel free to generate one
(e.g. EMBOSS_001 is fine). If you are feeling clever, then I suppose one
could be created from the individual sequence identifiers (if set),
assuming they share a common pattern (e.g. Seq_1, Seq_2, Seq_3 becomes Seq).
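
As a rough illustration of the input mapping described above, here is a minimal Python sketch (not EMBOSS code). The function name read_fragments and its default_id parameter are hypothetical, and the Seq_2_1 style of numbering is just the suggestion from this comment:

```python
def read_fragments(text, default_id="EMBOSS_001"):
    """Split a comma-separated fragment record into (identifier, sequence) pairs.

    Hypothetical helper: identifiers are built as <set_id>_<n>, and
    default_id is used when the record has no '>' header line.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    set_id = default_id
    if lines and lines[0].startswith(">"):
        header_words = lines.pop(0)[1:].split()
        if header_words:
            set_id = header_words[0]  # first word of the header is the set ID
    # Drop the trailing comma that separates one fragment from the next.
    fragments = [line.rstrip(",").strip() for line in lines]
    fragments = [seq for seq in fragments if seq]
    return [(f"{set_id}_{n}", seq) for n, seq in enumerate(fragments, start=1)]
```

Applied to the Seq_2 example from the request, this would give five entries, Seq_2_1 to Seq_2_5, each holding one fragment and no gap information.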

For our case the typical cases are:

1. Adding a header line, since one has not been supplied.

2. Mapping a set of individual sequences (usually in fasta) into this
format to run the search.

3. Verifying the format prior to launching the search.
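
To make the three cases concrete, here is a small Python sketch of what each would amount to. All three function names are hypothetical, the residue alphabet (letters only) is an assumption, and the rule that the last fragment carries no comma is taken from the example in the request rather than from the FASTA documentation:

```python
import re

def ensure_header(text, set_id="EMBOSS_001"):
    """Case 1: prepend a header line when the submitted text lacks one."""
    stripped = text.lstrip()
    return stripped if stripped.startswith(">") else f">{set_id}\n{stripped}"

def to_fragment_record(sequences, set_id="EMBOSS_001"):
    """Case 2: join individual sequences into one comma-separated record."""
    body = ",\n".join(seq.strip() for seq in sequences)
    return f">{set_id}\n{body}\n"

def is_valid_fragment_record(text):
    """Case 3: one header, then fragments; all but the last end in a comma."""
    lines = text.strip().splitlines()
    if len(lines) < 2 or not lines[0].startswith(">"):
        return False
    *leading, last = [line.strip() for line in lines[1:]]
    # Assumption: residues are plain letters; adjust if other symbols are legal.
    return (all(re.fullmatch(r"[A-Za-z]+,", line) for line in leading)
            and re.fullmatch(r"[A-Za-z]+", last) is not None)
```

In a real service the check in case 3 would presumably run after case 1 or case 2, immediately before the search is launched.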

Jon Ison - 2011-02-10

> assuming they share a common pattern (e.g. Seq_1, Seq_2, Seq_3 becomes
> Seq).

A zero-padded suffix (_001, _002) is our usual style, as it keeps the ID
length consistent at least up to 999.

> For our case the typical cases are:
>
> 1. Adding a header line, since one has not been supplied.
>
> 2. Mapping a set of individual sequences (usually in fasta) into this
> format to run the search.
>
> 3. Verifying the format prior to launching the search.

OK, so the EMBOSS approach would be:

Test before we try FASTA.

If it starts with a > FASTA header, AND the first sequence line ends
with a comma, then parse each line with a comma as a new sequence and
stop when the last line is read.

But ... we would need to insist on the comma, so a FASTM format input
with only one sequence would be read as FASTA.

Awkward - we would then normally fail it if the user had -sformat fastm
on the command line.
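
The test described above could look roughly like this in Python. This is only a sketch of the rule as stated in this comment, not the EMBOSS reader, and the function name is hypothetical:

```python
def looks_like_fragment_format(text):
    """Return True if text has a '>' header and a comma-terminated first sequence line.

    Hypothetical check following the rule in this comment: without the
    comma the input cannot be told apart from plain FASTA.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return (len(lines) >= 2
            and lines[0].startswith(">")
            and lines[1].endswith(","))
```

The awkward case shows up directly: a record with a header and a single fragment has no comma, so the function returns False and the input would fall through to the FASTA reader.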

When writing, I think we have to save the sequences up and make up a
single ID for the output, so that EMBOSS_001, EMBOSS_002, etc. get an ID
of EMBOSS_ (trimmed to EMBOSS), and the individual IDs can be
regenerated on reading again.
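
A minimal Python sketch of that ID handling, under the assumption that the shared ID is the common prefix of the individual IDs with any trailing digits and underscore removed. Both function names and the fallback value are hypothetical choices for illustration:

```python
import os

def set_identifier(ids, fallback="EMBOSS_001"):
    """Derive one set ID from the common prefix of the individual IDs."""
    prefix = os.path.commonprefix(list(ids))
    # EMBOSS_001, EMBOSS_002 share "EMBOSS_00": drop digits, then the underscore.
    prefix = prefix.rstrip("0123456789").rstrip("_")
    return prefix if prefix else fallback

def regenerate_ids(set_id, count):
    """Rebuild zero-padded IDs (_001, _002, ...) when reading the set back."""
    return [f"{set_id}_{n:03d}" for n in range(1, count + 1)]
```

With this rule, EMBOSS_001 and EMBOSS_002 collapse to EMBOSS on output, and reading two fragments back regenerates EMBOSS_001 and EMBOSS_002. IDs that never had a numeric suffix would not round-trip, which is a limitation of the sketch.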

How will you run seqret for this format? Should we assume seqret
-osformat fastm, and interpret the input as fastm or fasta depending
on the number of fragments?
