///////////////////////////////////////////////////////////////////
// //
// PLEK - predictor of lncRNAs and mRNAs based on k-mer scheme //
// Authors: Aimin Li, Junying Zhang //
// Contacts: LiAiminMail@gmail.com, jyzhang@mail.xidian.edu.cn //
// Webcite: https://sourceforge.net/projects/plek/ //
// Version: 1.2 //
// Updated on: January 12, 2016 //
// //
///////////////////////////////////////////////////////////////////
PLEKv2 published:
https://doi.org/10.1186/s12864-024-10662-y
Citation:
Aimin Li, Haotian Zhou, Siqi Xiong, Junhuai Li, Saurav Mallik, Rong Fei,
Yajun Liu, Hongfang Zhou, Xiaofan Wang, Xinhong Hei, Lei Wang.
PLEKv2: predicting lncRNAs and mRNAs based on intrinsic sequence features
and the coding-net model. BMC Genomics 2024, 25(1):756.
https://doi.org/10.1186/s12864-024-10662-y
=====================
INTRODUCTION
=====================
We present a tool called PLEK (predictor of long non-coding RNAs and messenger
RNAs based on k-mer scheme), which uses an improved computational pipeline
based on k-mer and support vector machine (SVM) to distinguish long non-coding
RNAs (lncRNAs) from messenger RNAs (mRNAs).
=====================
INSTALLATION
=====================
Pre-requisite:
1. Linux
2. C/C++ compiler (i.e. gcc, g++)
3. Python 2.5.0 or later versions (http://www.python.org/)
Steps:
1. Download PLEK.1.2.tar.gz from https://sourceforge.net/projects/plek/files/
and decompress it.
$ tar zvxf PLEK.1.2.tar.gz
2. Compile PLEK.
$ cd PLEK.1.2
$ python PLEK_setup.py
=====================
USAGE AND EXAMPLES
=====================
Usage of PLEK.py -- used to differentiate lncRNAs from mRNAs:
python PLEK.py -fasta fasta_file -out output_file -thread number_of_threads
-minlength min_length_of_sequence -isoutmsg 0_or_1 -isrmtempfile 0_or_1
-fasta The name of a fasta file, its sequences are to be predicted.
-out The file name for the results of prediction. Predicted positive
samples are labeled as "Coding", and negative as "Non-coding".
-thread (Optional) The number of threads for running the PLEK program.
The bigger this number is, the faster PLEK runs. Default value: 5.
-minlength (Optional) The minimum length of sequences. The sequences whose
lengths are more than minlength will be processed. Default
value: 200.
-isoutmsg (Optional) Output messages to stdout(screen) or not. "0" means
that PLEK be run quietly. "1" means that PLEK outputs the details
of processing. Default value: 0.
-isrmtempfile (Optional) Remove temporary files or not. "0" means that PLEK
retains temporary files. "1" means that PLEK remove temporary
files. Default value: 1.
Examples:
1. $ python PLEK.py -fasta PLEK_test.fa -out predicted -thread 10
NOTE: To predict the sequences in the 'PLEK_test.fa' file, run the PLEK program with
10 threads. The program outputs the predicted sequences in the file 'predicted'.
2. $ python PLEK.py -fasta PLEK_test.fa -out predicted -thread 10 -minlength 300
NOTE: To predict the sequences in the 'PLEK_test.fa' file, run the PLEK program with
10 threads. The program outputs the predicted sequences in the file 'predicted'.
The sequences with the length of >300nt will be processed (remained).
3. $ python PLEK.py -fasta PLEK_test.fa -out predicted -thread 10 -isrmtempfile 0
NOTE: To predict the sequences in the 'PLEK_test.fa' file, run the PLEK program with
10 threads. The program outputs the predicted sequences in the file 'predicted'.
The details of PLEK run will output to the files with "predicted" as prefix.
4. $ python PLEK.py -fasta PLEK_test.fa -out predicted -thread 10 -isoutmsg 1
NOTE: To predict the sequences in the 'PLEK_test.fa' file, run the PLEK program with
10 threads. The program outputs the predicted sequences in the file 'predicted'.
The details of PLEK run will output to user's screen(stdio).
=====================
Usage of PLEK Modelling -- used to build your own classifier with your
training data (mRNA/lncRNA transcripts):
python PLEKModelling.py -mRNA mRNAs_fasta -lncRNA lncRNAs_fasta
-prefix prefix_of_output -log2c range_of_log2c -log2g range_of_log2g
-thread number_of_threads -model model_file -range range_file
-minlength min_length_of_sequence -isoutmsg 0_or_1 -isrmtempfile 0_or_1
-k k_mer -nfold n_fold_cross_validation -isbalanced 0_or_1
-mRNA mRNA transcripts used to build predictor, in fasta format.
-lncRNA lncRNA transcripts used to build predictor, in fasta format.
-prefix Prefix of the output files.
-log2c (Optional) The specified range of C parameter for the svm parameter
search. Default value: 0,5,1. (from, to, by; 0,1,2,3,4,5)
-log2g (Optional) The specified range of G parameter for the svm parameter
search. Default value: 0,-5,-1. (from, to, by; 0,-1,-2,-3,-4,-5)
-thread (Optional) The number of threads for running the PLEKModelling
program. The bigger this number is, the faster PLEKModelling runs.
Note that a larger thread number means larger consumption of memory.
Default value: 12.
-model (Optional) The name of a predictor model file (an output file
by PLEKModelling.py).
-range (Optional) The name of a svm range file (an output file by
PLEKModelling.py).
-minlength (Optional) The minimum length of sequences. The sequences whose
lengths are more than minlength will be processed. Default
value: 200.
-isoutmsg (Optional) Output messages to stdout(screen) or not. "0" means
that PLEKModelling be run quietly. "1" means that PLEKModelling
outputs the details of processing. Default value: 0.
-isrmtempfile (Optional) Remove temporary files or not. "0" means that PLEKModelling
retains temporary files. "1" means that PLEKModelling remove temporary
files. Default value: 1.
-k (Optional) range of k. k=5 means that we will calculate usage of
1364 k-mer patterns. (k=1, 4 patterns; k=2, 16; k=3, 64; k=4, 256;
k=5, 1024; 1364=4+64+256+1024). Default value: 5.
-nfold (Optional) n-fold cross-validation in search for optimal parameters.
Default value: 10.
-isbalanced (Optional) In the case of isbalanced=1, if the samples are
unbalanced, it will subsample the overrepresented class to obtain an
equal amount of positives and negatives.
Default value: 0.
Examples:
1. $ python PLEKModelling.py -mRNA PLEK_mRNAs.fa -lncRNA PLEK_lncRNAs.fa -prefix 20140531
NOTE: This trains a classifier using the mRNA sequences in the 'PLEK_mRNAs.fa' file
and lncRNA in 'PLEK_lncRNAs.fa'. The program outputs the model in the file
'20140531.model' and the svm-scale range in '20140531.range'.
We can use the new model as follows:
$ python PLEK.py -fasta PLEK_test.fa -out 20140531.predicted -thread 10 \
-range 20140531.range -model 20140531.model
2. $ python PLEKModelling.py -mRNA PLEK_mRNAs.fa -lncRNA PLEK_lncRNAs.fa -prefix 20140601 \
-log2c 1,3,1 -log2g -1,-3,-1 -nfold 2 -k 4
NOTE: This example is used to demonstrate the usage of PLEKModelling.py
in a simple/quick way. User can run this to check if our program can run correctly.
It trains a classifier using the mRNA sequences in the 'PLEK_mRNAs.fa' file
and lncRNA in 'PLEK_lncRNAs.fa'. The range of log2c is 1,2,3. The range of log2g
is -1,-2,-3. Use a 2-fold cross-validation. K is 4, it will calculate the usage
of 340 patterns (4+16+63+256=340). The program outputs the model in the file
'20140601.model' and the svm-scale range in '20140601.range'.
We can use the new model as follows:
$ python PLEK.py -fasta PLEK_test.fa -out 20140601.predicted -thread 10 \
-range 20140601.range -model 20140601.model -k 4
Notes:
(1) In general, it is time-consuming to build a new classifier.
(Generally, it may take <48 hours in 36-threading runs or in
36 pbs parallel jobs.)
(2) The accuracy of a classifier model is connected with the quality and
quantity of training samples. We encourage users to supply as many reliable
samples as possible to PLEKModelling.py to train a classifier. We suggest
that transcripts annotated with 'pseudogene', 'predicted' and 'putative'
be removed before training models.
(3) To parallel run PLEKModelling.py in PBS, please see
PLEK_howto_generate_scripts.pdf
(4) This script is not suitable for organisms with compact genomes.
=====================
COPYING
=====================
GNU Public License version 3 (GPLv3)
Details on http://www.gnu.org/copyleft/gpl.html