Read Me
A collection of programs and modules which use the MS-Digest service to
collect digest sequences for a selection of protein sequences from a number of
organisms.
Script/Module   Purpose
-------------   -------
extract         Extracts protein records for a selection of organisms
proteinfind     Finds and extracts records for named proteins
digestresults   Retrieves digest sequences from MS-Digest for a collection of
                input sequences
digestall       Constructs a file containing digest sequences for a selection
                of organisms
fasta2dump      Creates a database import file from a file containing protein
                records
Selecting Input Data for Specific Organisms
-------------------------------------------
If no ready-made input data is available, data for specific organisms can be
selected by taking a collection of sequence records and keeping only the
records related to those organisms. In this process, the starting input is
the NCBI non-redundant FASTA archive NCBInr.fasta containing protein records.
To select only the relevant records from this file, the extract script is run
as follows:
python extract.py
This creates a directory called "FASTA" containing a file for each organism
listed in the organisms.txt file.
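
The selection step can be pictured with a minimal sketch. It is not the code
of extract.py: the function name split_by_organism is hypothetical, and the
sketch assumes that organism names appear in square brackets in the FASTA
header lines, which may not be how the script itself recognises them.

```python
import re
from collections import defaultdict

def split_by_organism(lines, organisms):
    """Group FASTA lines by the organisms named in their header lines."""
    wanted = set(organisms)
    groups = defaultdict(list)
    current = []  # organisms that the record being read belongs to
    for line in lines:
        if line.startswith(">"):
            # Assumption: organism names are given as "[Name]" in the header.
            found = re.findall(r"\[([^\[\]]+)\]", line)
            current = [name for name in dict.fromkeys(found) if name in wanted]
        # Header and sequence lines are both kept for each matching organism.
        for name in current:
            groups[name].append(line)
    return dict(groups)
```

Each key of the returned dictionary would correspond to one file in the
"FASTA" directory, with the grouped lines as its contents.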
Each organism-specific file can then be processed further, selecting only
records containing the names found in the names.txt file and recording these
records in the "candidates" directory, by using the proteinfind script as
follows:
python proteinfind.py <filename>
The <filename> should be replaced by the name of a FASTA file, such as one
located in the FASTA directory; for example:
python proteinfind.py "FASTA/Lactobacillus gasseri"
Note that the quotation marks are important where spaces are used in
filenames. The resulting file will be "candidates/Lactobacillus gasseri".
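
As a rough illustration of this filtering step, the following sketch keeps
only the records whose header mentions one of the given names. The function
names are hypothetical, and the case-insensitive substring matching is an
assumption; proteinfind.py may match names differently.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def find_named(records, names):
    """Keep records whose header contains any of the names.

    Assumption: matching is a case-insensitive substring test.
    """
    lowered = [name.lower() for name in names if name.strip()]
    return [
        (header, sequence)
        for header, sequence in records
        if any(name in header.lower() for name in lowered)
    ]
```

Here, the names would be the lines read from names.txt, and the records kept
would be those written to the "candidates" directory.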
Producing Results for Individual Organisms
------------------------------------------
With candidate proteins selected, a digest file can be created using the
digestresults script, as in the following example:
python digestresults.py "candidates/Lactobacillus gasseri" 70 12345
Here, a minimum ChemScore of 70% is specified, together with an identification
number of 12345. The latter value is used only in the header of the written
sequence record. Digest files are written to the "output" directory and retain
the name of the organism. Here, the resulting file will be called
"output/Lactobacillus gasseri".
A database dump, as opposed to a concatenated digest sequence file, may be
produced as in the following example:
python digestresults.py "candidates/Lactobacillus gasseri" 70 --dump
As above, a minimum ChemScore of 70% determines the contents of the output
file, but each digest sequence appears on a separate line together with other
fields, including the input sequence and a number of properties, in a
tab-separated value format.
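
A dump file of this kind can be read back with Python's csv module. The
sketch below is only an illustration: the function name read_dump is
hypothetical, and since the order and meaning of the columns are not
described here, the fields are returned as plain strings without
interpretation.

```python
import csv

def read_dump(stream):
    """Return each non-empty line of a tab-separated dump as a list of fields."""
    # QUOTE_NONE leaves any quotation marks in the fields untouched.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]
```

The stream can be any open text file, such as one opened with
open(filename, newline="").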
Producing Results for Many Organisms
------------------------------------
Instead of running the proteinfind and digestresults scripts manually, one can
use the digestall script, which performs the above work for each of the
organisms of interest in turn, as defined by the organisms_special.txt
file:
python digestall.py 70
Here, a minimum ChemScore of 70% is specified. Instead of writing out many
files, only a single file called "digests.fasta" is produced, containing the
information for all organisms which produced results. Those organisms which
failed to produce results, either for technical reasons or because of a
genuine absence of digest sequences, will be mentioned in the output of the
program.
To produce results for all organisms, instead of a subset of organisms,
specify the --all-organisms option to the digestall script:
python digestall.py 70 --all-organisms
This will use the organisms.txt file instead of the organisms_special.txt file
as the source of the list of organisms to be investigated.
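
The behaviour described in this section can be summarised in a short sketch.
It is not the code of digestall.py: the function names are hypothetical, and
the digest argument stands in for whatever retrieves the digest sequences for
one organism.

```python
def organisms_file(args):
    """Choose the organism list according to the --all-organisms option."""
    if "--all-organisms" in args:
        return "organisms.txt"
    return "organisms_special.txt"

def collect(organisms, digest):
    """Run digest(name) for each organism in turn.

    Returns the combined results and the names of organisms that produced
    nothing, whether through an error or a genuine absence of results.
    """
    combined, failed = [], []
    for name in organisms:
        try:
            results = digest(name)
        except Exception:
            # Technical failures are treated like an absence of results.
            results = []
        if results:
            combined.extend(results)
        else:
            failed.append(name)
    return combined, failed
```

In this picture, the combined results correspond to the contents of
"digests.fasta" and the failed names to the organisms mentioned in the output
of the program.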
Example: Making a Database from Human Proteins
----------------------------------------------
Using a file of protein records, "human_uniprot_sprot.fasta", a database was
created as follows:
python digestresults.py human_uniprot_sprot.fasta 0 --dump
This captured all digest sequences and produced a dump file in the "output"
directory with the same filename.
A database was created using the make_database script in the "resources"
directory:
python resources/make_database.py
The original file of protein records was converted to a database import file
as follows:
python fasta2dump.py human_uniprot_sprot.fasta dump_human_uniprot_sprot.fasta
This information is used to link the digest records to the original records
in the database.
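
To give an idea of what such a conversion involves, the following sketch turns
FASTA records into tab-separated lines. The function names and the column
layout (identifier, description, sequence) are assumptions made for
illustration; the actual format written by fasta2dump.py may differ.

```python
def fasta_to_rows(lines):
    """Convert FASTA lines to (identifier, description, sequence) tuples."""
    rows = []
    ident, desc, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if ident is not None:
                rows.append((ident, desc, "".join(chunks)))
            # Assumption: the identifier is the header text up to the first space.
            ident, _, desc = line[1:].partition(" ")
            chunks = []
        elif line and ident is not None:
            chunks.append(line)
    if ident is not None:
        rows.append((ident, desc, "".join(chunks)))
    return rows

def rows_to_dump(rows):
    """Format rows as tab-separated text, one record per line."""
    return "".join("\t".join(row) + "\n" for row in rows)
```

Keeping the identifier in its own column is what would allow digest records
to be matched against the original records after import.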
Then, the data from the retrieval process was imported into the database using
the import_data script in the "resources" directory:
python resources/import_data.py
Finally, to extract data from the database, the export_data script in the
"resources" directory was used:
python resources/export_data.py
These database-related scripts require the details of the database, plus any
other pertinent parameters, to be specified interactively.