Peptide Pattern Recognition

Peptide Pattern Recognition (PPR) is a method for grouping protein sequences and finding short, conserved peptides in each group. The present implementation is considerably improved: it can handle large datasets and complete an analysis in a reasonable time on a desktop computer. For details, please read:
Busk P.K. (2018). Peptide Pattern Recognition for high-throughput protein sequence analysis and clustering. BioRxiv. doi: https://doi.org/10.1101/181917.

PPR consists of the files peptide_pattern_recognition.rb, delete_peps_w_score_1.rb, rank_seed_proteins.rb and make_groups.rb.

PPR was written in the Ruby programming language on a Windows 7 operating system. The latest version of Ruby can be downloaded from http://rubyinstaller.org/.

Installation:
1) Download "Peptide_Pattern_Recognition.zip" to your computer.
2) Unzip "Peptide_Pattern_Recognition.zip".
If the installation was successful, you will now have the following files in the folder Peptide_Pattern_Recognition:
peptide_pattern_recognition.rb
delete_peps_w_score_1.rb
rank_seed_proteins.rb
make_groups.rb
README.pdf
Busk_2018
and the folders GH117 and MAPK containing test input files for PPR analysis.
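
The check in step 2 can also be done with a short script. The following is a minimal sketch that only looks for the four Ruby scripts and the two test folders listed above; the helper name missing_ppr_files is hypothetical and is not part of PPR.

```ruby
# Sketch: report which expected PPR items are missing from a folder.
# File and folder names are copied from the installation list above.
# The helper name missing_ppr_files is hypothetical, not part of PPR.
EXPECTED_PPR_FILES = %w[
  peptide_pattern_recognition.rb
  delete_peps_w_score_1.rb
  rank_seed_proteins.rb
  make_groups.rb
].freeze
EXPECTED_PPR_FOLDERS = %w[GH117 MAPK].freeze

def missing_ppr_files(dir)
  missing = EXPECTED_PPR_FILES.reject { |f| File.file?(File.join(dir, f)) }
  missing + EXPECTED_PPR_FOLDERS.reject { |d| File.directory?(File.join(dir, d)) }
end

# Adjust the folder name to wherever the zip file was unpacked.
missing = missing_ppr_files("Peptide_Pattern_Recognition")
puts missing.empty? ? "All expected files found." : "Missing: #{missing.join(', ')}"
```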

To use PPR:
1) Collect the sequences you wish to analyze in a .txt file.
a) The sequences should be in strict FASTA format:
>name of sequence
sequence
b) As provided, PPR is set up to look for the input file in the output folder.
c) The preferred name for the file is {input name}_unique_sequences.txt.
d) The folders GH117 and MAPK contain examples of the correct format of the input.
2a) When the input is ready, run peptide_pattern_recognition.rb from a command prompt with the following arguments: "output_name" "output_directory_name" "input_file_name" "peptide_length" "limit" "cut_off" "number_of_threads".
2b) Alternatively, open peptide_pattern_recognition.rb in your preferred text editor.
3b) Enter the output name (line 5), the name of the output directory (line 7) and the name of the input file (line 9).
4b) Optional: The default parameters for PPR are described in Busk and Lange, 2013 and Busk et al., 2014 and can be changed.
The most important parameters for PPR analysis are the length of the conserved sequences (peptide length), the number of conserved sequences (limit) and the number of conserved sequences in each protein (cut off). These parameters are defined in the source code (lines 13, 15 and 17) or can be given as arguments when running PPR from a command prompt.
In addition, a number of parameters that are usually not changed from run to run are defined in lines 19-29 of the source code.
5b) The optimal number of threads for PPR analysis depends on the number of processors in your computer and can be set in line 23.
6b) Run the PPR analysis by starting peptide_pattern_recognition.rb from your text editor or from a command prompt.
7) The result will be stored as a number of text files in the output folder.
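
Steps 1 and 2a can be illustrated with a small helper script. This is only a sketch and not part of PPR: it assumes that "strict" FASTA means every record is a ">" header line followed by at least one line of sequence letters with no blank lines, and the method names fasta_problems and ppr_command as well as all example values are hypothetical. The argument order follows step 2a.

```ruby
# Sketch: pre-flight check of the input text and assembly of the command line.
# Assumption: "strict" FASTA = ">" header line, then one or more lines of
# sequence letters, no blank lines. Method names are hypothetical.

# Returns a list of problems; an empty list means the text passed the check.
def fasta_problems(text)
  problems = []
  have_header = false   # at least one header line has been seen
  have_sequence = false # the current record has at least one sequence line
  text.each_line.with_index(1) do |line, number|
    line = line.chomp
    if line.start_with?(">")
      problems << "line #{number}: previous header has no sequence" if have_header && !have_sequence
      problems << "line #{number}: empty sequence name" if line[1..-1].strip.empty?
      have_header = true
      have_sequence = false
    elsif line.match?(/\A[A-Za-z*]+\z/)
      problems << "line #{number}: sequence before first header" unless have_header
      have_sequence = true
    else
      problems << "line #{number}: unexpected content"
    end
  end
  problems << "last header has no sequence" if have_header && !have_sequence
  problems << "no sequences found" unless have_header
  problems
end

# Builds the argument list in the order given in step 2a.
def ppr_command(output_name:, output_dir:, input_file:, peptide_length:, limit:, cut_off:, threads:)
  ["ruby", "peptide_pattern_recognition.rb",
   output_name, output_dir, input_file,
   peptide_length, limit, cut_off, threads].map(&:to_s)
end

# The values below are placeholders, not recommended settings.
cmd = ppr_command(output_name: "GH117", output_dir: "GH117",
                  input_file: "GH117_unique_sequences.txt",
                  peptide_length: 6, limit: 70, cut_off: 10, threads: 4)
puts cmd.join(" ")
# system(*cmd) would start the analysis from the folder containing the script.
```

Passing the arguments as a list to system avoids quoting problems if a name contains spaces.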


Interpretation of the result:

The result of a PPR analysis consists of a number of files containing groups of protein sequences and a number of files containing lists of peptides corresponding to the protein groups.

The protein files are named {input name}_group_1_prots.txt, {input name}_group_2_prots.txt and so on. (For example GH117_group_1_prots.txt).
Each protein in the file has been assigned a score, which is the number of peptides from the corresponding peptide list that can be found in the protein sequence.
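
The score can be illustrated as follows. This is a minimal sketch of the definition given here, not the code used inside PPR; the method name peptide_score and the example sequence and peptides are made up.

```ruby
# Sketch of the score: the number of peptides from a group's peptide list
# that occur at least once in a protein sequence.
# peptide_score and the example data are hypothetical.
def peptide_score(sequence, peptides)
  peptides.uniq.count { |peptide| sequence.include?(peptide) }
end

peptides = %w[MKVLA KVLAG GHWTP]
puts peptide_score("MKVLAGSTT", peptides) # prints 2 (MKVLA and KVLAG are found)
```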

The peptide files are named {input name}_group_1_peps.txt, {input name}_group_2_peps.txt and so on. (For example GH117_group_1_peps.txt). These files can be opened with MS Excel, OpenOffice Calc or similar.
The peptides are listed from the fourth row and downwards.
The first column indicates the median position of the first amino acid in the peptide in all the proteins that contain the peptide.
The second column contains the sequence of the peptide in one letter code.
The third column contains the frequency of the peptide, calculated as the number of proteins that contain the peptide divided by the number of proteins in the group.
The fourth column contains the number of proteins that contain the peptide.
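
A peptide file could be read into a script roughly as follows. This sketch rests on assumptions that this README does not confirm: that the fields are tab-separated, that the first three rows are headers, and that the four fields per peptide appear in the order position, sequence, frequency, number of proteins. The names Peptide and parse_peptide_rows and the example data are hypothetical.

```ruby
# Sketch: turn the text of a *_peps.txt file into records.
# Assumptions: tab-separated fields, three header rows, and the field order
# position, sequence, frequency, protein count. Names are hypothetical.
Peptide = Struct.new(:median_position, :sequence, :frequency, :protein_count)

def parse_peptide_rows(text)
  rows = text.lines.drop(3).map { |line| line.chomp.split("\t") }
  rows.select { |fields| fields.length >= 4 }.map do |pos, seq, freq, count|
    Peptide.new(pos.to_f, seq, freq.to_f, count.to_i)
  end
end

# Made-up example content; real files are read with File.read(path).
example = "header 1\nheader 2\nheader 3\n12\tMKVLA\t0.5\t4\n40\tGHWTP\t1.0\t8\n"
parse_peptide_rows(example).each do |pep|
  puts "#{pep.sequence}: frequency #{pep.frequency}, found in #{pep.protein_count} proteins"
end
```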

The file {input name}_classification_overview.txt contains an overview of the number of proteins in each group. (For example GH117_classification_overview.txt).

The file {input name}_conserved_peptides.txt contains a short list of the conserved peptides and their frequency for each group. (For example GH117_conserved_peptides.txt). This list is suitable as an input to the program Homology to Peptide Patterns (Hotpep) for annotation of new proteins to the groups (Busk et al., 2014, 2017).

Use of PPR for non-commercial purposes is free if you cite:
Busk P.K. (2018). Peptide Pattern Recognition for high-throughput protein sequence analysis and clustering. BioRxiv. doi: https://doi.org/10.1101/181917.
For details, please see the CC BY-NC-SA 4.0 International license as described below.


Good luck!


Peter Busk, 26 February 2018

References:

Busk P.K. (2018). Peptide Pattern Recognition for high-throughput protein sequence analysis and clustering. BioRxiv. doi: https://doi.org/10.1101/181917.

Busk P.K. and Lange L. (2013). Function-based classification of carbohydrate-active enzymes by recognition of short, conserved peptide motifs. Appl Environ Microbiol. 79(11), 3380-91.

Busk P.K., Lange M., Pilgaard B. and Lange L. (2014). Several genes encoding enzymes with the same activity are necessary for aerobic fungal degradation of cellulose in nature. PLoS One 9, e114138.

Busk P.K., Pilgaard B., Lezyk M.J., Meyer A.S. and Lange L. (2017). Homology to Peptide Pattern for Annotation of Carbohydrate-Active Enzymes and Prediction of Function. BMC Bioinformatics 18, 214.







Peptide Pattern Recognition is provided ‘as is’, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) (the "License"); you may not use this file or any other file or directory including subdirectories and files provided in this project except in compliance with the License.
You may obtain a copy of the License at https://creativecommons.org/licenses/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

Due to company policies this license is not valid for persons employed by, studying at or otherwise associated with the Technical University of Denmark. If you are employed by, studying at or otherwise associated with the Technical University of Denmark, please contact the copyright holder to obtain a personal license.