Read Me

A collection of programs and modules which use the MS-Digest service to
collect digest sequences for a selection of protein sequences from a number of
organisms.

Script/Module   Purpose
-------------   -------

extract         Extracts protein records for a selection of organisms
proteinfind     Finds and extracts records for named proteins
digestresults   Retrieves digest sequences from MS-Digest for a collection of
                input sequences
digestall       Constructs a file containing digest sequences for a selection
                of organisms
fasta2dump      Creates a database import file from a file containing protein
                records

Selecting Input Data for Specific Organisms

If no ready-made input data is available, it is possible to select data for
specific organisms by taking a collection of sequence records and isolating
only those related to those organisms. In this process, the starting input is
the NCBI non-redundant FASTA archive NCBInr.fasta containing protein records.
To select only relevant records from this file, the extract script is run as
follows:


This creates a directory called "FASTA" containing a file for each organism
listed in the organisms.txt file.
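
The selection step can be pictured with the following minimal sketch. It is
not the extract script itself: the function names are hypothetical, and it
assumes that an organism is matched when its name appears anywhere in a
record's header line, which may differ from what extract actually does.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def group_by_organism(lines, organisms):
    """Group records by organism name (assumption: substring match on header)."""
    groups = {name: [] for name in organisms}
    for header, sequence in read_fasta(lines):
        for name in organisms:
            if name in header:
                groups[name].append((header, sequence))
    return groups
```

Each list in the returned dictionary corresponds to the records that would
end up in one organism-specific file.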

Each organism-specific file can then be processed further, selecting only
records containing the names found in the names.txt file and recording these
records in the "candidates" directory, by using the proteinfind script as
follows:

  python <filename>

The <filename> should be replaced by the name of a FASTA file, such as one
located in the FASTA directory; for example:

  python "FASTA/Lactobacillus gasseri"

Note that the quotation marks are important where spaces are used in
filenames. The resulting file will be "candidates/Lactobacillus gasseri".
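
As an illustration of selecting records by name, here is a minimal sketch. It
is not the proteinfind script: the function name is hypothetical, and
case-insensitive substring matching against the header is an assumption.

```python
def select_candidates(records, names):
    """Keep (header, sequence) records whose header mentions any wanted name.

    Assumption: matching is a case-insensitive substring test; blank names
    are ignored.
    """
    wanted = [name.strip().lower() for name in names if name.strip()]
    return [
        (header, sequence)
        for header, sequence in records
        if any(name in header.lower() for name in wanted)
    ]
```

The names would typically be read one per line from a file such as names.txt
before being passed in.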

Producing Results for Individual Organisms

With candidate proteins selected, a digest file can be created using the
digestresults script, as in the following example:

  python "candidates/Lactobacillus gasseri" 70 12345

Here, a minimum ChemScore of 70% is specified, together with an identification
number of 12345; the latter is used only in the header of the written sequence
record. Digest files are written to the "output" directory and retain the name
of the organism. In this case, the resulting file will be called
"output/Lactobacillus gasseri".
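
The effect of the minimum score can be sketched as a simple filter. The
function name and the (sequence, score) pair layout are hypothetical, and
treating the threshold as inclusive is an assumption rather than documented
behaviour.

```python
def filter_by_score(digests, minimum):
    """Keep (sequence, score) pairs whose score is at least the minimum.

    Assumption: a score exactly equal to the minimum is retained.
    """
    return [(sequence, score) for sequence, score in digests if score >= minimum]
```

Under this reading, a minimum of 0 retains every digest sequence.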

A database dump, as opposed to a concatenated digest sequence file, may be
produced as in the following example:

  python "candidates/Lactobacillus gasseri" 70 --dump

Here, a minimum ChemScore of 70% as above dictates the contents of the output
file, but each digest sequence appears on a separate line together with other
fields, including the input sequence and a number of properties, employing a
tab-separated value format.
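
A tab-separated dump of this kind can be written with Python's csv module, as
in the sketch below. The function name and the example fields (digest
sequence, input sequence, score) are hypothetical; the exact fields and their
order in the real dump are not listed here.

```python
import csv
import io


def write_dump(rows, stream):
    """Write one tab-separated line per row to the given text stream."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Usage: one digest sequence per line, followed by its other fields.
out = io.StringIO()
write_dump([("AAK", "MKVAAKLLA", 85.0)], out)
```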

Producing Results for Many Organisms

Instead of manually running the proteinfind and digestresults scripts, one can
use the digestall script, which performs the above work for each of the
organisms of interest in turn, as defined by the organisms_special.txt file:

  python 70

Here, a minimum ChemScore of 70% is specified. Instead of writing out many
files, only a single file called "digests.fasta" is produced, containing the
information for all organisms which produced results. Those organisms which
failed to produce results, either for technical reasons or because of a
genuine absence of digest sequences, will be mentioned in the output of the

To produce results for all organisms, instead of a subset of organisms,
specify the --all-organisms option to the digestall script:

  python 70 --all-organisms

This will use the organisms.txt file instead of the organisms_special.txt file
as the source of the list of organisms to be investigated.
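
The choice between the two organism lists can be sketched as follows. This is
not the digestall script: the use of argparse and the function names are
assumptions made for illustration; only the --all-organisms option and the
two filenames come from the description above.

```python
import argparse


def parse_args(argv):
    """Parse a minimum score and the optional --all-organisms flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("min_score", type=float, help="minimum ChemScore")
    parser.add_argument("--all-organisms", action="store_true")
    return parser.parse_args(argv)


def organisms_file(args):
    """Return the organism list to read, depending on the flag."""
    return "organisms.txt" if args.all_organisms else "organisms_special.txt"
```

With this arrangement, the subset remains the default and the full list is
used only when explicitly requested.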

Example: Making a Database from Human Proteins

Using a file of protein records, "human_uniprot_sprot.fasta", a database was
created as follows:

  python human_uniprot_sprot.fasta 0 --dump

This captured all digest sequences and produced a dump file in the "output"
directory with the same filename.

A database was created using the make_database script in the "resources"
directory:

  python resources/

The original file of protein records was converted to a database import file
as follows:

  python human_uniprot_sprot.fasta dump_human_uniprot_sprot.fasta

This information is used to link the digest records to the original records
in the database.

Then, the data from the retrieval process was imported into the database using
the import_data script in the "resources" directory:

  python resources/

Finally, to extract data from the database, the export_data script in the
"resources" directory was used:

  python resources/

These database-related scripts require the details of the database, plus any
pertinent parameters, to be specified interactively.