Name            Modified     Size
TransMCL-main   2023-06-20
test_dataset    2023-04-26
material        2023-04-03
README.txt      2023-04-26   9.0 kB
Totals: 4 items, 9.0 kB
Folder: test_dataset
A test dataset for checking whether TransMCL is installed correctly and runs successfully.


Folder: material
The files listed below are for testing the methods on the five datasets in our article. Their
contents and formats are described as follows.

1. Model plant datasets
Folder: model_plant
[fasta/] initial transcripts of A.thaliana and E.salsugineum by Trinity, and real protein sequences of reference species 
[input/scripts] running scripts for generating blastp file, abc file, info file, orthogroup file, and running TransMCL
[input/all_sample2fa.list] map file for gene name to species (TABULAR format)
[input/all_sample.faa.length] map file for gene name to gene length (TABULAR format)
[input/info_tra.txt] transcriptome name
[input/info_trinity.fa] raw protein sequences (merged from Ath_trinity.fa and Esa_trinity.fa)
[input/info_tree.nwk] 10 species tree in newick format
[evaluation/Ath_trinity_cdhit.fa] representative transcriptome of A.thaliana by cdhit
[evaluation/Ath_trinity_corset.fa] representative transcriptome of A.thaliana by corset
[evaluation/Ath_trinity_transmcl.fa] representative transcriptome of A.thaliana by TransMCL
[evaluation/Esa_trinity_cdhit.fa] representative transcriptome of E.salsugineum by cdhit
[evaluation/Esa_trinity_corset.fa] representative transcriptome of E.salsugineum by corset
[evaluation/Esa_trinity_transmcl.fa] representative transcriptome of E.salsugineum by TransMCL
[evaluation/dataset] real protein sequences of A.thaliana, E.salsugineum and benchmark orthogroups  
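
The two map files above are described only as TABULAR. As a minimal sketch, assuming each line holds two tab-separated columns (gene name, then species or length), they could be loaded in Python as shown below. The column layout, the function name read_two_column_map, and the example records are assumptions for illustration, not taken from the TransMCL source.

```python
import io


def read_two_column_map(handle):
    """Read a two-column, tab-separated mapping into a dict.

    Assumed layout (not confirmed by the README): key<TAB>value per line.
    """
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mapping[fields[0]] = fields[1]
    return mapping


# Hypothetical records standing in for all_sample2fa.list and
# all_sample.faa.length; the real files may differ.
example_species = io.StringIO("gene1\tAth\ngene2\tEsa\n")
example_length = io.StringIO("gene1\t312\ngene2\t128\n")

gene_to_species = read_two_column_map(example_species)
gene_to_length = {
    gene: int(length)
    for gene, length in read_two_column_map(example_length).items()
}
```

With real data, pass an open file object instead of the io.StringIO stand-ins.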

2. Model animal datasets
Folder: model_animal
[fasta/] initial transcripts of H.sapiens and M.musculus by Trinity, and real protein sequences of reference species
[input/scripts] running scripts for generating blastp file, abc file, info file, orthogroup file, and running TransMCL
[input/all_sample2fa.list] map file for gene name to species (TABULAR format)
[input/all_sample.faa.length] map file for gene name to gene length (TABULAR format)
[input/info_tra.txt] transcriptome name
[input/info_trinity.fa] raw protein sequences (merged from Homo_trinity.fa and Mus_trinity.fa)
[input/info_tree.nwk] 12 species tree in newick format
[evaluation/Homo_trinity_cdhit.fa] representative transcriptome of H.sapiens by cdhit
[evaluation/Homo_trinity_corset.fa] representative transcriptome of H.sapiens by corset
[evaluation/Homo_trinity_transmcl.fa] representative transcriptome of H.sapiens by TransMCL
[evaluation/Mus_trinity_cdhit.fa] representative transcriptome of M.musculus by cdhit
[evaluation/Mus_trinity_corset.fa] representative transcriptome of M.musculus by corset
[evaluation/Mus_trinity_transmcl.fa] representative transcriptome of M.musculus by TransMCL
[evaluation/dataset] real protein sequences of H.sapiens, M.musculus and benchmark orthogroups  
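
Each dataset ships a species tree in newick format with a stated number of species (10 for the model plants, 12 for the model animals). A quick way to confirm that a tree file matches the stated count is to list its leaf labels. The sketch below uses a simple regular expression and assumes unquoted labels without spaces; the function name and the example tree are hypothetical.

```python
import re

# A leaf label directly follows "(" or ","; labels after ")" are internal
# node names and are skipped. Quoted labels are not handled in this sketch.
LEAF_PATTERN = re.compile(r"[(,]\s*([^\s(),:;]+)")


def newick_leaf_names(newick_text):
    """Return the leaf labels of a newick string, in order of appearance."""
    return LEAF_PATTERN.findall(newick_text)


# Hypothetical three-species tree; the real info_tree.nwk will differ.
example_tree = "((Ath:0.1,Esa:0.2)n1:0.05,Osa:0.3);"
leaves = newick_leaf_names(example_tree)
```

For a real file, read its text and compare len(newick_leaf_names(text)) with the number of species given in the listing above.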


3. Three subgroups of angiosperms datasets
Folder: 3subgroups_of_angiosperms
Subfolder: Rosids
[fasta] initial transcripts of A.thaliana by Trinity, and real protein sequences of reference species
[none/scripts] running scripts for generating blastp file, abc file, info file, orthogroup file, and running TransMCL
[none/all_sample2fa.list] map file for gene name to species (TABULAR format)
[none/all_sample.faa.length] map file for gene name to gene length (TABULAR format)
[none/Athaliana_transcriptome.fa] raw protein sequences of A.thaliana
[none/info_tra.txt] transcriptome name
[none/none.tree.nwk] 16 species tree in newick format
[node_a] input files when node_a masked
[node_b] input files when node_b masked
[node_c] input files when node_c masked
[node_d] input files when node_d masked
[evaluation/dataset] real protein sequences and benchmark orthogroups
[evaluation/none.fa] representative transcriptome of A.thaliana by TransMCL 
[evaluation/node_a.fa] representative transcriptome of A.thaliana by TransMCL when node_a masked
[evaluation/node_b.fa] representative transcriptome of A.thaliana by TransMCL when node_b masked
[evaluation/node_c.fa] representative transcriptome of A.thaliana by TransMCL when node_c masked
[evaluation/node_d.fa] representative transcriptome of A.thaliana by TransMCL when node_d masked

Subfolder: Asterids
[fasta] initial transcripts of S.lycopersicum by Trinity, and real protein sequences of reference species
[none/scripts] running scripts for generating blastp file, abc file, info file, orthogroup file, and running TransMCL
[none/all_sample2fa.list] map file for gene name to species (TABULAR format)
[none/all_sample.faa.length] map file for gene name to gene length (TABULAR format)
[none/Slycopersicum_transcriptome.fa] raw protein sequences of S.lycopersicum
[none/info_tra.txt] transcriptome name
[none/none.tree.nwk] 16 species tree in newick format
[node_e] input files when node_e masked
[node_f] input files when node_f masked
[node_g] input files when node_g masked
[node_h] input files when node_h masked
[evaluation/dataset] real protein sequences and benchmark orthogroups
[evaluation/none.fa] representative transcriptome of S.lycopersicum by TransMCL 
[evaluation/node_e.fa] representative transcriptome of S.lycopersicum by TransMCL when node_e masked
[evaluation/node_f.fa] representative transcriptome of S.lycopersicum by TransMCL when node_f masked
[evaluation/node_g.fa] representative transcriptome of S.lycopersicum by TransMCL when node_g masked
[evaluation/node_h.fa] representative transcriptome of S.lycopersicum by TransMCL when node_h masked

Subfolder: Monocots
[fasta] initial transcripts of S.italica by Trinity, and real protein sequences of reference species
[none/scripts] running scripts for generating blastp file, abc file, info file, orthogroup file, and running TransMCL
[none/all_sample2fa.list] map file for gene name to species (TABULAR format)
[none/all_sample.faa.length] map file for gene name to gene length (TABULAR format)
[none/S.italica_transcriptome.fa] raw protein sequences of S.italica
[none/info_tra.txt] transcriptome name
[none/none.tree.nwk] 16 species tree in newick format
[node_i] input files when node_i masked
[node_j] input files when node_j masked
[node_k] input files when node_k masked
[node_m] input files when node_m masked
[evaluation/dataset] real protein sequences and benchmark orthogroups
[evaluation/none.fa] representative transcriptome of S.italica by TransMCL
[evaluation/node_i.fa] representative transcriptome of S.italica by TransMCL when node_i masked
[evaluation/node_j.fa] representative transcriptome of S.italica by TransMCL when node_j masked
[evaluation/node_k.fa] representative transcriptome of S.italica by TransMCL when node_k masked
[evaluation/node_m.fa] representative transcriptome of S.italica by TransMCL when node_m masked
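
The three subfolders above follow the same pattern: one unmasked result (none.fa) plus one result per masked node. As a sketch for checking that a downloaded copy is complete, the expected evaluation files can be generated from that pattern. The node letters are taken from the listing above; the function names and the assumption that the on-disk layout mirrors this listing exactly are illustrative only.

```python
from pathlib import Path

# Masked nodes per subfolder, as listed in this README.
MASKED_NODES = {
    "Rosids": ["a", "b", "c", "d"],
    "Asterids": ["e", "f", "g", "h"],
    "Monocots": ["i", "j", "k", "m"],
}


def expected_evaluation_files(subgroup):
    """Relative paths of the evaluation FASTA files for one subfolder."""
    eval_dir = Path(subgroup) / "evaluation"
    names = ["none.fa"] + [f"node_{n}.fa" for n in MASKED_NODES[subgroup]]
    return [eval_dir / name for name in names]


def missing_evaluation_files(root):
    """Expected files that are absent under root (assumed layout)."""
    root = Path(root)
    return [
        path
        for subgroup in MASKED_NODES
        for path in expected_evaluation_files(subgroup)
        if not (root / path).is_file()
    ]
```

Calling missing_evaluation_files on the path of the 3subgroups_of_angiosperms folder returns an empty list when all fifteen files are present.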

4. Six transcriptome datasets
Folder: six_transcriptome
[fasta/Aly_transcriptome.fa] initial transcripts of A.lyrata, A.thaliana, S.lycopersicum, S.italica, S.viridis, and S.tuberosum by Trinity, and real protein sequences of reference genes
[input/scripts] running scripts for generating blastp file, abc file, info file, orthogroup file, and running TransMCL
[input/all_sample2fa.list] map file for gene name to species (TABULAR format)
[input/all_sample.faa.length] map file for gene name to gene length (TABULAR format)
[input/info.fasta] raw protein sequences (merged from six species)
[input/info_tra.txt] transcriptome name
[input/six_trans.tree.nwk] 16 species tree in newick format
[evaluation/clean.Alyrata.fa] representative transcriptome of A.lyrata by TransMCL
[evaluation/clean.Athaliana.fa] representative transcriptome of A.thaliana by TransMCL
[evaluation/clean.Sitalica.fa] representative transcriptome of S.italica by TransMCL
[evaluation/clean.Slycopersicum.fa] representative transcriptome of S.lycopersicum by TransMCL
[evaluation/clean.Stuberosum.fa] representative transcriptome of S.tuberosum by TransMCL
[evaluation/clean.Sviridis.fa] representative transcriptome of S.viridis by TransMCL
[evaluation/dataset] real protein sequences of six species and benchmark orthogroups

5. Single-molecule sequencing datasets
Folder: single_molecular_sequence
[fasta] initial transcripts of A.thaliana by single-molecule sequencing
[input/scripts] running scripts for generating blastp file, abc file, info file, orthogroup file, and running TransMCL
[input/all_sample2fa.list] map file for gene name to species (TABULAR format)
[input/all_sample.faa.length] map file for gene name to gene length (TABULAR format)
[input/Ath_hq_as.fa] high-quality consensus transcripts of A.thaliana by Iso-Seq
[input/info_tree.nwk] six species tree in newick format
[input/transcriptome.txt] transcriptome name 
[evaluate/clean.fasta] representative transcriptome of A.thaliana by TransMCL
[evaluate/dataset] real protein sequences of A.thaliana
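
The all_sample.faa.length files map each gene name to its sequence length. For readers who want to build a comparable table for their own FASTA input, the sketch below shows one way to compute the lengths. It assumes the gene name is the first whitespace-delimited token of each header line; the function name and example records are hypothetical, and the provided running scripts may produce the file differently.

```python
def fasta_lengths(lines):
    """Map each FASTA record name to its total sequence length."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumed convention: the name is the first token of the header.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Hypothetical records; sequences may span several lines.
example_fasta = [
    ">t1 some description",
    "MKV",
    "LL",
    ">t2",
    "MAT",
]
lengths = fasta_lengths(example_fasta)
```

To write the result in a two-column layout, print f"{name}\t{length}" for each item of the returned dict.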

6. Src
Folder: src
[src/alignment.py] a python module
[src/evaluate.py] a python script to evaluate protein completeness
[src/evaluate_HOG.py] a python script to evaluate integrity of HOG genes
[src/scripts.txt] running scripts for evaluating sequence completeness and HOG integrity
Source: README.txt, updated 2023-04-26