ProCoS

Lava (Lavanya Rishishwar)

ProCoS (Protein Composition Server, script version) is a one-stop tool for computing many types of peptide composition: Pseudo Amino Acid Composition (PAAC)(Chou 2001), Amphiphilic Pseudo Amino Acid Composition (APAAC)(Chou 2005)(Chou & Cai 2005), Five Factor Solution Score (5FSS)(Atchley et al 2005), Amino Acid Composition, Dipeptide Composition, Tripeptide Composition, and so on up to general polypeptide (k-peptide) composition. The program was previously released as an applet and as a server. In keeping with open-sourcing the research, we have released the script version as well.

If you use this script, please cite:
Rishishwar, L., Mishra, N., Pant, B., Pant, K., Pardasani, K.R. (2010). ProCoS - PROtein COmposition Server. Bioinformation, 5(5): 227. PMC: 3040505.

Sequence input/format (option: -i)


ProCoS accepts sequences in FASTA (Pearson) format only.

A few sample sequences may be:

>sp|P80222|ADH1_ALLMI Alcohol dehydrogenase 1 OS=Alligator mississippiensis PE=1 SV=1
STAGKVIKCKAAITWEIKKPFSIEEIEVAPPKAHEVRIKILATGICRSDDHVTAGLLTMP
LPMILGHEAAGVVESTGEGVTSLKPGDKVIPLFVPQCGECMPCLKSNGNLCIRNDLGSPS
GLMADGTSRFTCKGKDIHHFIGTSTFTEYTVVHETAVARIDAAAPLEKVCLIGCGFSTGY
GAAVKDAKVEPGSTCAVFGLGGVGLSTIMGCKAAGASRIIGIDINKDKFAKAKELGATEC
INPLDCKKPIQEVLSEMTGGGVDYSFEVIGRIDTMTAALACCQDNYGTSVIVGVPPASEK
ITFNPMMLFTGRTWKGSVFGGWKSKESVPKLVADYMEKKINLDGLITHTLPFDKINEGFE
LLRTGKSIRSVLTF
>sp|P49645|ADH1_APTAU Alcohol dehydrogenase 1 OS=Apteryx australis GN=ADH1 PE=1 SV=2
MSTAGKVIKCKAAVLWEPKKPFSIEEVEVAPPKAHEVRIKILATGICRSDDHVITGALVR
PFPIILGHEAAGVVESVGEGVTSVKPGDKVIPLFVPQCGECSACLSTKGNLCSKNDIGSA
SGLMPDGTTRFTCKGKAIHHFIGTSTFTEYTVVHETAVAKIAAAAPLEKVCLIGCGFSTG
YGAAVQTAKVEPGSTCAVFGLGGVGLSVVMGCKAAGASRIIAIDINKDKFAKAKELGATD
CVNPKDFTKPIHEVLMEMTGLGVDYSFEVIGHTETMAAALASCHFNYGVSVILGVPPAAE
KISFDPMLLFSGRTWKGSVFGGWKSKDAVPKLVADYMEKKFVLEPLITHTLPFIKINEGF
DLLRKGKSIRSVLVF
>sp|P06525|ADH1_ARATH Alcohol dehydrogenase class-P OS=Arabidopsis thaliana GN=ADH1 PE=1 SV=2
MSTTGQIIRCKAAVAWEAGKPLVIEEVEVAPPQKHEVRIKILFTSLCHTDVYFWEAKGQT
PLFPRIFGHEAGGIVESVGEGVTDLQPGDHVLPIFTGECGECRHCHSEESNMCDLLRINT
ERGGMIHDGESRFSINGKPIYHFLGTSTFSEYTVVHSGQVAKINPDAPLDKVCIVSCGLS
TGLGATLNVAKPKKGQSVAIFGLGAVGLGAAEGARIAGASRIIGVDFNSKRFDQAKEFGV
TECVNPKDHDKPIQQVIAEMTDGGVDRSVECTGSVQAMIQAFECVHDGWGVAVLVGVPSK
DDAFKTHPMNFLNERTLKGTFFGNYKPKTDIPGVVEKYMNKELELEKFITHTVPFSEINK
AFDYMLKGESIRCIITMGA
The script accepts a single sequence or multiple sequences as input, as long as they are in FASTA format.
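To make the input format concrete, here is a minimal Python sketch of reading single or multiple FASTA records. It is not the ProCoS parser; the function name `parse_fasta` and the toy records are made up for illustration, and it assumes standard FASTA in which each header line begins with ">".

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (hypothetical records, not real database entries).
example = """>seq1 first toy record
STAGKVIKCK
AAITWEIKKP
>seq2 second toy record
MSTAGKVIKC
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```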

Composition (option: -c)


The composition option selects which composition(s) to compute. The script accepts positive integers (representing the k value of the k-peptide composition), APAAC, PAAC, 5FSS or PC as valid arguments for this option. Thus, entering

PAAC - will calculate Pseudo Amino Acid Composition (Chou 2001)
APAAC - will calculate Amphiphilic Pseudo Amino Acid Composition (Chou 2005)(Chou & Cai 2005)
5FSS - will calculate Five Factor Score of Amino Acid Composition (Atchley et al 2005)
PC - will calculate PhysioChemical Properties
1 - will calculate Amino Acid Composition
2 - will calculate Dipeptide Composition
3 - will calculate Tripeptide Composition
and so on.
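As an illustration of what an integer argument means, the following Python sketch computes a k-peptide composition. It assumes overlapping windows and reports each k-peptide as a fraction of all windows; the normalisation ProCoS actually uses is not stated here, and the name `kpeptide_composition` is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kpeptide_composition(seq, k):
    """Fraction of each possible k-peptide among the overlapping windows of seq."""
    if k < 1 or k > len(seq):
        raise ValueError("k must be between 1 and the sequence length")
    # One entry per possible k-peptide: 20**k keys in total.
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    total = len(seq) - k + 1
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with non-standard residues are skipped
            counts[window] += 1
    return {pep: n / total for pep, n in counts.items()}


comp1 = kpeptide_composition("AAGA", 1)  # amino acid composition
comp2 = kpeptide_composition("AAGA", 2)  # dipeptide composition
print(comp1["A"], comp1["G"])
print(len(comp2))
```

With k = 1 the result has 20 entries, with k = 2 it has 400, and so on.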

Also, entering

1+2+PAAC+PC will calculate Amino Acid Composition followed by Dipeptide Composition, Pseudo Amino Acid Composition and PhysioChemical Properties
2+1+PAAC+PC will calculate Dipeptide Composition followed by Amino Acid Composition, Pseudo Amino Acid Composition and PhysioChemical Properties
and so on.
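The "+" syntax can be read as an ordered list of requests. The Python sketch below shows one way such a string could be split while preserving order; it is an assumption about the behaviour described above, not the script's own parser, and `split_composition_spec` is a hypothetical name.

```python
NAMED = {"PAAC", "APAAC", "5FSS", "PC"}


def split_composition_spec(spec):
    """Split a spec such as '1+2+PAAC+PC' into an ordered list of requests."""
    items = []
    for token in spec.split("+"):
        token = token.strip()
        name = token.split("(", 1)[0]  # drop any '(...)' arguments
        if token.isdigit() and int(token) >= 1:
            items.append(("k-peptide", int(token)))
        elif name in NAMED:
            items.append((name, token))
        else:
            raise ValueError(f"unrecognised composition: {token!r}")
    return items


print(split_composition_spec("2+1+PAAC+PC"))
```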

5FSS, PAAC and APAAC provide more options. A complete query specification for 5FSS has the form:

5FSS(X;X;X;X;X)
Each X in X;X;X;X;X is either 0 or 1. Setting an X to 1 includes that factor in the calculation, whereas 0 omits it. The five factors are (Atchley et al 2005):
Factor 1 - reflects the simultaneous covariation in the portion of exposed residues versus buried residues, polarity versus nonpolarity, hydrophobicity versus hydrophilicity, and nonbonded energy versus free energy. This factor can be designated as a polarity index.
Factor 2 - a secondary structure factor, which represents the relationship of the amino acids with secondary structure configurations such as helix, turn or coil.
Factor 3 - relates to molecular size or volume.
Factor 4 - reflects the relative amino acid composition in various proteins.
Factor 5 - refers to electrostatic charge, with high coefficients on isoelectric point and net charge.
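The 0/1 mask can be decoded as in the Python sketch below. The factor labels are paraphrased from the description above, `selected_factors` is a hypothetical helper, and treating a bare "5FSS" as "all five factors" is an assumption rather than documented behaviour.

```python
import re

# Labels paraphrased from the factor descriptions; order is Factor 1 to 5.
FACTOR_NAMES = [
    "polarity index",
    "secondary structure",
    "molecular size or volume",
    "relative amino acid composition",
    "electrostatic charge",
]


def selected_factors(query):
    """Return the names of the factors switched on in a 5FSS query string."""
    query = query.strip()
    if query == "5FSS":
        # Assumption: a bare "5FSS" selects all five factors.
        return list(FACTOR_NAMES)
    match = re.fullmatch(r"5FSS\(([01](?:;[01]){4})\)", query)
    if match is None:
        raise ValueError(f"not a valid 5FSS query: {query!r}")
    flags = match.group(1).split(";")
    return [name for name, flag in zip(FACTOR_NAMES, flags) if flag == "1"]


print(selected_factors("5FSS(1;0;1;0;0)"))
```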

The PAAC query specification is as follows:
PAAC(Weight Factor,lambda,X;X;X;X;X;X)
The weight factor can be any decimal value, lambda is any positive integer less than the length of the sequence, and X;X;X;X;X;X selects which of six parameters are used to calculate the Pseudo Amino Acid Composition. In order, the parameters are Hydrophobic; Hydrophilic; Mass of the side chain; pK1 (alpha-COOH); pK2 (NH3); pI (at 25 °C). To use a parameter, set its value to 1 and set the rest to 0. For example, to use Hydrophobic and pK1, set this value to 1;0;0;1;0;0. Likewise, the query PAAC(0.5,10,1;1;1;0;0;0) will calculate PAAC with w = 0.5 and lambda = 10, using the parameters hydrophobic, hydrophilic and mass of the side chain. Detailed information regarding PAAC can be found in the literature (Chou 2001). The following are the acceptable query styles for PAAC:
PAAC(Weight Factor,lambda,X;X;X;X;X;X)
PAAC(Weight Factor,lambda)
PAAC(Weight Factor)
PAAC

APAAC's query style is slightly different, of the form:
APAAC(Weight Factor,lambda)
APAAC(Weight Factor)
APAAC
If the user does not explicitly specify a parameter, the script calculates both compositions with the default parameters, which are weight factor = 0.5, lambda = 10 and X;X;X;X;X;X as 1;1;1;1;1;1.
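The query styles and defaults listed above can be summarised in a short Python sketch. This is an illustrative parser written for this page, not code taken from ProCoS; `parse_paac_query` and the dictionary keys are hypothetical names.

```python
import re

DEFAULT_WEIGHT = 0.5
DEFAULT_LAMBDA = 10
DEFAULT_MASK = [1, 1, 1, 1, 1, 1]  # all six PAAC parameters switched on


def parse_paac_query(query):
    """Parse a PAAC or APAAC query string, filling in the stated defaults."""
    match = re.fullmatch(r"(A?PAAC)(?:\((.*)\))?", query.strip())
    if match is None:
        raise ValueError(f"not a PAAC/APAAC query: {query!r}")
    name, args = match.group(1), match.group(2)
    parts = args.split(",") if args else []
    # PAAC takes up to three arguments, APAAC up to two.
    if len(parts) > (3 if name == "PAAC" else 2):
        raise ValueError(f"too many arguments for {name}")

    result = {"name": name, "weight": DEFAULT_WEIGHT, "lambda": DEFAULT_LAMBDA}
    if name == "PAAC":
        result["mask"] = list(DEFAULT_MASK)
    if len(parts) >= 1:
        result["weight"] = float(parts[0])
    if len(parts) >= 2:
        result["lambda"] = int(parts[1])
        if result["lambda"] < 1:
            raise ValueError("lambda must be a positive integer")
    if len(parts) == 3:
        mask = [int(x) for x in parts[2].split(";")]
        if len(mask) != 6 or any(x not in (0, 1) for x in mask):
            raise ValueError("the parameter mask needs six 0/1 values")
        result["mask"] = mask
    return result


print(parse_paac_query("PAAC(0.5,10,1;1;1;0;0;0)"))
print(parse_paac_query("APAAC(0.1)"))
```

The check that lambda is smaller than the sequence length is left out because it needs the sequence itself.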

Detailed information regarding APAAC can be found in the literature(Chou 2005)(Chou & Cai 2005).

The values used in the parameters can be found at this link

WARNING: Although the script can support any value of n (the composition degree), the number of possible fragments grows as 20^n: degree 3 already gives 8,000 possible fragments and degree 4 gives 160,000. The script may take a long time as the computation becomes intensive. Also, the composition degree should not be higher than the sequence length.
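The growth is easy to check with a few lines of Python (`possible_fragments` is just an illustrative helper):

```python
def possible_fragments(n, alphabet_size=20):
    """Number of distinct n-peptides over the 20 standard amino acids."""
    return alphabet_size ** n


for n in range(1, 6):
    print(n, possible_fragments(n))
```

This prints 20, 400, 8000, 160000 and 3200000 for degrees 1 to 5.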

Break mode (option: --bp & --bpVal)


Break Mode lets the user compute the composition of a sequence in parts. The parts can be formed automatically with equal length (Automated), defined explicitly through breakpoints (Manual), or the whole sequence can be computed at once (Disabled, option not used).

Automated
In this mode the user enters the number of parts into which the sequence should be divided. The sequence is then broken into that many parts of equal length, and the composition of each part is calculated separately and displayed.

For example, if the user enters <3>, the sequence will be divided into three parts of equal length and the output will contain 60 values (in case of amino acid composition), 20 for each part.
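A Python sketch of the automated split follows. How ProCoS handles a length that does not divide evenly is not stated here, so giving the leftover residues to the first parts is an assumption; `split_equal` is a hypothetical name.

```python
def split_equal(seq, parts):
    """Split seq into `parts` consecutive pieces of (nearly) equal length."""
    if parts < 1 or parts > len(seq):
        raise ValueError("parts must be between 1 and the sequence length")
    base, extra = divmod(len(seq), parts)
    pieces, start = [], 0
    for i in range(parts):
        # Assumption: the first `extra` pieces get one additional residue.
        end = start + base + (1 if i < extra else 0)
        pieces.append(seq[start:end])
        start = end
    return pieces


print(split_equal("ABCDEFGHIJ", 3))
```

Each piece would then be passed to the composition calculation separately.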

Manual
In this mode the user specifies the positions of the break points in the sequence, separated by commas.

For example, if the user enters <50,100> or <0,50,100>, the sequence will be divided into three parts, the first from 0 to 50, the next from 51 to 100 and the last from 101 to the end of the sequence, and the output will contain 60 values (in case of amino acid composition), 20 for each part.

In addition, a user who wants to compute the composition of only one particular portion of the sequence, or of a group of portions, can do so by giving the following input to the break mode option:

(starting position of fragment 1)-(ending position of fragment 1);(starting position of fragment 2)-(ending position of fragment 2);...
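Both manual styles can be sketched in Python as below. The sketch assumes that positions are 1-based and inclusive, which matches the "51 to 100" wording above but is otherwise an assumption; `split_at_breakpoints` and `extract_ranges` are hypothetical names.

```python
def split_at_breakpoints(seq, spec):
    """Split seq after each listed position, e.g. '50,100' or '0,50,100'."""
    points = sorted({int(p) for p in spec.split(",")})
    # A leading 0 and any breakpoint at or beyond the end add no new part.
    edges = [0] + [p for p in points if 0 < p < len(seq)] + [len(seq)]
    return [seq[a:b] for a, b in zip(edges, edges[1:])]


def extract_ranges(seq, spec):
    """Extract fragments given as 'start-end;start-end;...' (1-based, inclusive)."""
    fragments = []
    for item in spec.split(";"):
        if not item.strip():
            continue  # tolerate a trailing ';'
        start, end = (int(x) for x in item.split("-"))
        if not 1 <= start <= end <= len(seq):
            raise ValueError(f"range out of bounds: {item!r}")
        fragments.append(seq[start - 1:end])
    return fragments


seq = "A" * 50 + "C" * 50 + "D" * 20  # toy 120-residue sequence
print([len(p) for p in split_at_breakpoints(seq, "50,100")])
print(extract_ranges("ABCDEFGHIJ", "1-3;8-10"))
```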

Output format (option: --outMode)


In script mode, we provide two types of output:

Table
The compositions are displayed as static tables. The tables can be copied to MS Excel or any other spreadsheet program for further processing.
Feature Value Vector
Some users might want the output in the form of vectors. The vector is given as (class) (fragment no):(Composition)(tab)(fragment no):(Composition)..., for example:
(descriptor)
+1 1:0.02 2:0.09 3:3.02 ....
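As a sketch of this layout, the Python function below writes one vector line from a class label and a list of composition values. The separators follow the description above (a space after the class, tabs between pairs); the number formatting and the name `to_feature_vector` are assumptions.

```python
def to_feature_vector(label, values):
    """Format values as '<label> 1:v1<TAB>2:v2<TAB>...' with 1-based indices."""
    # ':g' keeps up to six significant digits and drops trailing zeros.
    pairs = [f"{i}:{v:g}" for i, v in enumerate(values, start=1)]
    return f"{label} " + "\t".join(pairs)


line = to_feature_vector("+1", [0.02, 0.09, 3.02])
print(line)
```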

Include description (option: -d)


This option lets you control the inclusion of description in the output.

Order (option: --order)


Finally, you can choose the order in which the compositions are calculated. Selecting 'Forward' will calculate the compositions of the sequences as they are, while 'Reverse' will calculate the compositions of the sequences in reversed order. For example, if your input sequence is "ASD" and reverse order is selected, the composition will be calculated for the sequence "DSA".
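In Python terms, the two orders amount to the following (a sketch only; `apply_order` is a hypothetical helper, and the exact values the script accepts for this option may differ):

```python
def apply_order(seq, order="Forward"):
    """Return seq unchanged for 'Forward' or reversed for 'Reverse'."""
    if order == "Forward":
        return seq
    if order == "Reverse":
        return seq[::-1]
    raise ValueError(f"unknown order: {order!r}")


print(apply_order("ASD", "Reverse"))
```

The order does not change the amino acid composition (k = 1), but it does change dipeptide and higher compositions, since "AS" in the forward sequence becomes "SA" in the reversed one.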

If you have any feedback or comments, feel free to mail me at lavanyarishishwar@gmail.com. I appreciate your feedback and comments.

Hope you find the script useful!

References


Rishishwar, L., Mishra, N., Pant, B., Pant, K., Pardasani, K.R. (2010). ProCoS - PROtein COmposition Server. Bioinformation, 5(5): 227. PMC: 3040505.
Chou, K.C. (2001). Prediction of protein cellular attributes using pseudo-amino-acid-composition. PROTEINS: Structure, Function, and Genetics. 43:246-255.
Chou, K.C. (2005). Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 21:10-19.
Chou, K.C., Cai, Y.D. (2005). Prediction of membrane protein types by incorporating amphipathic effects. J Chem Inf Model. 45(2):407-413.
Atchley, W.R., Zhao, J., Fernandes, A.D., Drüke, T. (2005). Solving the protein sequence metric problem. Proc Natl Acad Sci. 102:6395-6400.