% TSS Classifier
% Timo Lassmann
% December 24th 2013
This document contains the code and description of the TSS classifier used in the FANTOM5 project.
The classifier relies on retrieving sequences from a tome database. See the document TomeDB.pdf in the doc folder for instructions on how to create a database.
To run the classifier simply:
$ tssclass [options] <tome database name> <knownPeaks.bed> <novelPeaks.bed>
Where tome dabase name is the prefix of a tome database, knownPeaks.bed are peaks we believe are real transcriptional start sites based on previous annotaiton and novelPeaks.bed are unannotated peaks.
A training set comprised of both positive and negative sequences is extracted from the data. Gaussian mixture models are trained to capture the relative distribution of 4-mer occurrences surrounding TSSs. Each sequence is scored against all models resulting in a 256 vector of values for each sequence. The latter together with the cluster label is used to derive a random decision tree ensemble model. Finally, the RDT model is used to classify test sequences not used in the training of any models. The entire procedure is repeated many times and predictions for each cluster averaged.
We used the permissive DPI clusters as the basis for our predictions. All clusters within 100bp from a known transcriptional start site were labelled as positive, remaining sequences as negative (see: parsing section below
). We extracted sequences centered on the middle of each CAGE cluster of length 2kb. Increasing the number of mixtures had little effect on the prediction accuracy. We therefore ran our predictor using only a single gaussian. In each run we ran 2,4,6,8,10-fold cross validation, each one 5 times. For each cluster the prediction scores were averaged over all runs in which the cluster was not used for training the model.
The starting annotation file can be dowloaded here:
http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/hg19.cage_peak_ann.txt.gz
Here is BASH code I use to make the positive and negative peak files:
Get all peaks within 100bp of an annotated start site:
gzcat hg19.cage_peak_ann.txt.gz | awk 'BEGIN{OFS="\t"}{if(NR > 8){x = split($4,a,"bp");y = split($1,b,":|,");z = split(b[2],c,".") ; if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) <= 100 ){ print b[1],c[1],c[3],$2,a[1],b[3]} }}' > permissive_100bp_pos.bed
Get all peaks between 100bp and 1kb of an annotated start site:
gzcat hg19.cage_peak_ann.txt.gz | awk 'BEGIN{OFS="\t"}{if(NR > 8){x = split($4,a,"bp");y = split($1,b,":|,");z = split(b[2],c,".") ; if(a[1] != "NA" && (a[1]<0?-a[1]:a[1]) > 100 ){ print b[1],c[1],c[3],$2,a[1],b[3]} }}' > permissive_100bp_neg.bed
Add all unannotated peaks:
gzcat hg19.cage_peak_ann.txt.gz | awk 'BEGIN{OFS="\t"}{if(NR > 8){x = split($4,a,"bp");y = split($1,b,":|,");z = split(b[2],c,".") ; if(a[1] == "NA" ){ print b[1],c[1],c[3],$2,a[1],b[3]} }}' >> permissive_100bp_neg.bed
The TSS classifier produces two output files names:
The former can be used to draw a ROC curve while the latter contain the actual predictions.
The TSS classifier produces an output file called XXX_num_4_roc.txt which can be used to draw a ROC curve. We can use the pROC package in R [^1] to draw a ROC curve:
library(pROC); mat = read.table( "XXXXXX_num_4_roc.txt"); pdf("ROC".pdf,height = 10,width = 10,bg="white",pointsize = 18) plot.roc(mat[,4],mat[,1],print.thres="best",auc.polygon=TRUE,print.auc = TRUE ,grid=TRUE ) dev.off();
The classifier merges both input bed files and appends a prediction score to the forth column. Based on the best threshold shown in the ROC curve above users can select TSS clusters like this:
cat XXX_predictions.bed | awk 'BEGIN{OFS="\t"}{x = split($4,a,",");if(a[x] >= BESTTHRESHOLD){print $1,$2,$3,$4,$5,$6,$2,$3,"60,179,113"}}' > TSS.bed
In the example above I append some extra columns to make the results appear in color in the UCSC genome browser.
Here is the actual code used.
/* tomeTools - a collection of programs to work with CAGE data. Author: Timo Lassmann <timolassmann@gmail.com> Copyright (C) 2013 RIKEN This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. */ #include <stdlib.h> #include <stdio.h> #include <getopt.h> #include <string.h> #include "tomedb.h" #include "tomedb_nuc.h" #include "cauldron.h" #if HAVE_CONFIG_H #include <config.h> #endif #include <gsl/gsl_errno.h> #include <gsl/gsl_rng.h> #include <gsl/gsl_randist.h> #define OPT_SAMPLE 1 #define OPT_WINDOW 2 #define OPT_NUM_TREE 3 #define OPT_NUM_THREAD 4 #define OPT_KMER 5 float** derive_model(float** model, struct genome_coorinates** gc, int numseq, int kmer, int window ); double** fill_prediction_matrix(float** model, double** prediction_matrix, struct tome_data* pos,struct tome_data* neg,int kmer , int window); void usage(); int main (int argc, char * argv[]) { FILE* file = 0; init_logsum(); init_nuc_code(); unsigned int seed2 = (unsigned int) (time(NULL) * (42)); int i,j,c,f,g; char* tmp = 0; int count; //int top =1; int input_num_samples = 0; char** sample_names = 0; int window = 2000; int help = 0; int quiet = 0; int gzipped = 0; char* column_spec = 0; char* outfile = 0; //int numseq; int num_tree = 1000; int num_thread = 8; int kmer = 4; while (1){ static struct option long_options[] ={ {"sample",required_argument,0,OPT_SAMPLE}, {"window",required_argument,0,OPT_WINDOW}, {"ntree",required_argument,0,OPT_NUM_TREE}, {"nthread",required_argument,0,OPT_NUM_THREAD}, {"kmer",required_argument,0,OPT_KMER}, {"out",required_argument,0, 'o'}, {"quiet",0,0,'q'}, {"help",0,0,'h'}, {0, 0, 0, 0} }; int option_index = 0; c = getopt_long_only (argc, argv,"o:qh:",long_options, &option_index); if (c == -1){ break; } switch(c) { case 0: break; case OPT_KMER: kmer = atoi(optarg); break; case OPT_NUM_TREE: num_tree = atoi(optarg); break; case OPT_NUM_THREAD: num_thread = atoi(optarg); break; case OPT_WINDOW: window = atoi(optarg); break; case OPT_SAMPLE: tmp = optarg; count = byg_count(",", tmp); //fprintf(stderr,"%d\n",count); input_num_samples = count+1; sample_names = malloc(sizeof(char*) * (count+1)); for(i = 0; i < count+1;i++){ sample_names[i] = malloc(sizeof(char)* 1000); } f = 0; g = 0; for(i = 0;i < strlen(tmp);i++){ if(tmp[i] != ','){ sample_names[f][g] = tmp[i]; g++; }else{ sample_names[f][g] = 0; f++; g = 0; } } sample_names[f][g] = 0; break; case 'q': quiet = 1; break; case 'h': help = 1; break; case 'o': outfile = optarg; break; case '?': exit(1); break; default: fprintf(stderr,"default\n\n\n\n"); abort (); } } char* db_file = 0; char* pos_file = 0; char* neg_file = 0; if(help){ usage(); exit(EXIT_SUCCESS); } FILE* file_ptr; db_file = argv[optind++]; if(!db_file){ fprintf(stderr,"\nNo database file...\n"); usage(); exit(EXIT_FAILURE); } pos_file = argv[optind++]; if(!pos_file){ fprintf(stderr,"\nNo positive input bed file...\n"); usage(); exit(EXIT_FAILURE); } neg_file = argv[optind++]; if(!neg_file){ fprintf(stderr,"\nNo negative input bed file...\n"); usage(); exit(EXIT_FAILURE); } char* outfile_name; outfile_name = malloc(sizeof(char)*(200)); int num_kmers = pow(4,kmer ); TOMEDB* db = 0; db = main_read_db(db,db_file); int lines_read = 0; // int i,j; //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// //////////////// //////////////// Maximum nuber of regulators or targets - all need to fit in memory for learning //////////////// //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// //param->num_query = 100000; clock_t cStartClock; //FILE* file = 0; //i//nt gzipped = 0; cStartClock = clock(); //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// //////////////// //////////////// Loading & normalizing data //////////////// //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// struct tome_data* td_pos = init_tome_data(1100000,0 ); file = open_track(pos_file, &gzipped); column_spec = get_column_specs_based_on_file_suffix(pos_file); lines_read = read_data_chunk(db,td_pos,column_spec,file); free(column_spec); close_track(file, gzipped); if(lines_read ==2000000){ fprintf(stderr,"Warning: will only use the top %d lines from file %s\n", lines_read,pos_file); } td_pos->us = init_use_samples(db,sample_names,0, input_num_samples ,0); td_pos = add_sequences(db,td_pos, "mid", 0, 0, window); struct tome_data* td_neg = init_tome_data(1100000,0 ); file = open_track(neg_file, &gzipped); column_spec = get_column_specs_based_on_file_suffix(pos_file); lines_read = read_data_chunk(db,td_neg,column_spec,file); free(column_spec); close_track(file, gzipped); if(lines_read ==2000000){ fprintf(stderr,"Warning: will only use the top %d lines from file %s\n", lines_read,neg_file); } td_neg->us = init_use_samples(db,sample_names,0, input_num_samples ,0); td_neg = add_sequences(db,td_neg, "mid", 0, 0, window); int total_numseq = td_pos->numseq + td_neg->numseq; fprintf(stderr,"Read Input Files..."); struct single_target_prediction* sp = init_stp(total_numseq, 0, num_kmers); for(i = 0; i < num_kmers;i++){ sp->var_index[i] = i; } sp->feather = num_kmers; struct split_structure* split = 0; split = malloc(sizeof(struct split_structure)); split->pred = malloc(sizeof(double)*total_numseq); split->samples = malloc(sizeof(int)*total_numseq); split->samples_tmp = malloc(sizeof(int)*total_numseq); split->test_samples = malloc(sizeof(int)*total_numseq); split->test_samples_tmp = malloc(sizeof(int)*total_numseq); split->start = 0; split->end = total_numseq; split->start_test = 0; split->end_test = total_numseq; split->seed = &seed2; double** prediction_matrix = 0; prediction_matrix = malloc(sizeof(double*) * num_kmers); for(i = 0; i < num_kmers;i++){ prediction_matrix[i] = malloc(sizeof(double)* total_numseq); } float* indicator = malloc(sizeof(float) * total_numseq); for(i = 0; i < td_pos->numseq;i++){ td_pos->gc[i]->score = 0.0f; td_pos->gc[i]->count = 0.0f; td_pos->gc[i]->label = 0.0; } for(i = 0; i < td_neg->numseq;i++){ td_neg->gc[i]->score = 0.0f; td_neg->gc[i]->count = 0.0f; td_neg->gc[i]->label = 0.0; } float** model = 0; model = malloc(sizeof(float*) * num_kmers); for(i = 0;i < num_kmers;i++){ model[i] = malloc(sizeof(float) * window); } int tree; int k_fold = 10; int k_fold_repetition = 0; int k_fold_counter = 0; int* sample_list_pos = 0; int* sample_list_neg = 0; int size; double sum,mean_x; sample_list_pos = malloc(sizeof(int) * td_pos->numseq); sample_list_neg = malloc(sizeof(int) * td_neg->numseq); //////// // // Outer loop : over all k ... kfolf_cross validations... // /////
Here I loop through 2 .. 10 fold cross validation.
for(k_fold = 2; k_fold < 12;k_fold+=2){ //////// // // mid loop 1: how often to repeat cross-validation // /////
The X fold validation is repeated five times. Each time the data is split differently.
for(k_fold_repetition = 0; k_fold_repetition < 5;k_fold_repetition++){ for(i = 0; i < td_pos->numseq;i++){ indicator[i] = (float) i; sample_list_pos[i] = -1; } indicator = shuffle_arr_r(indicator, td_pos->numseq ,&seed2); //indicator = shuffle_arr(indicator,td_pos->numseq); size = td_pos->numseq / k_fold; c = 0; j = 0; for(i = 0;i < k_fold;i++){ for(j = i * size;j < (i+1)*size;j++){ sample_list_pos[(int)indicator[j]] = i; } } for(j = size * k_fold-1;j < td_pos->numseq;j++){ sample_list_pos[(int)indicator[j]] = k_fold-1; } for(i = 0; i < td_neg->numseq;i++){ indicator[i] = (float) i; sample_list_neg[i] = -1; } indicator = shuffle_arr_r(indicator, td_neg->numseq ,&seed2); size = td_neg->numseq / k_fold; c = 0; j = 0; for(i = 0;i < k_fold;i++){ for(j = i * size;j < (i+1)*size;j++){ sample_list_neg[(int)indicator[j]] = i; } } for(j = size * k_fold-1;j < td_neg->numseq;j++){ sample_list_neg[(int)indicator[j]] = k_fold-1; }
Here I perform the actual cross validation - using k -1 parts to predict 1 part etc...
//////// // // mid loop 2: carry out cross validation 0 ... kfold how often to repeat cross-validation // ///// for(k_fold_counter = 0;k_fold_counter < k_fold;k_fold_counter++){ // select X pos and X neg - done by labelling... for(i =0; i < td_pos->numseq;i++){ td_pos->gc[i]->label = 0; //bed_pos[i]->label = 0; } for(i =0; i < td_pos->numseq;i++){ //fprintf(stderr,"%d %d\n",sample_list_pos[i] ,k_fold_counter); if(sample_list_pos[i] != k_fold_counter){ //bed_pos[i]->label = 1; td_pos->gc[i]->label = 1; } } for(i =0; i < td_neg->numseq;i++){ //bed_neg[i]->label = 0; td_neg->gc[i]->label = 0; } for(i =0; i < td_neg->numseq;i++){ if(sample_list_neg[i] != k_fold_counter){ //bed_neg[i]->label = 1; td_neg->gc[i]->label = 1; } } cStartClock = clock(); model = derive_model(model, td_pos->gc, td_pos->numseq , kmer, window ); prediction_matrix = fill_prediction_matrix(model, prediction_matrix,td_pos,td_neg , kmer, window); fprintf(stderr,"Trained GMMs in %4.2f seconds\n",(double)( clock() - cStartClock) / (double)CLOCKS_PER_SEC); cStartClock = clock(); for(tree = 0; tree < 30;tree++){ split->seed = &seed2; split->start = 0; split->start_test = 0; c = 0; f = 0; g = 0; for(i= 0 ; i < td_pos->numseq;i++){ split->pred[i] = 1; if( td_pos->gc[i]->label == 0){ split->test_samples[f] = i; split->test_samples_tmp[f] = -1; f++; }else{ split->samples[g] = i; split->samples_tmp[g] = -1; g++; } } //fprintf(stderr,"%d sequences in positive\n",g); for(i= 0 ; i < td_neg->numseq;i++){ split->pred[i+td_pos->numseq] = 0; if(td_neg->gc[i]->label == 0){ split->test_samples[f] = i+td_pos->numseq; split->test_samples_tmp[f] = -1; f++; }else{ split->samples[g] = i+td_pos->numseq; split->samples_tmp[g] = -1; g++; } } //fprintf(stderr,"%d sequences in positive + negative training\n",g); split->end_test = f; split->end = g; sp = light_rdt_tree_no_var_imp(sp, prediction_matrix ,split); } sum = 0.0; c = 0; mean_x = 0.0; for(i = 0; i < total_numseq;i++){ if(sp->calls[i]){ mean_x += split->pred[i]; sum += powf(sp->pred[i]/(float)sp->calls[i] - split->pred[i], 2); c++; } } fprintf(stderr," Kfold:%d Num:%d %f\n",k_fold, k_fold_repetition,sum / (float) c); fprintf(stderr,"Run %d trees in %4.2f seconds\n", 100,(double)( clock() - cStartClock) / (double)CLOCKS_PER_SEC); } sum = 0.0; c = 0; mean_x = 0.0; for(i = 0; i < total_numseq;i++){ if(sp->calls[i]){ mean_x += split->pred[i]; sum += powf(sp->pred[i]/(float)sp->calls[i] - split->pred[i], 2); c++; //fprintf(stdout,"%f %f %f %f\n",sp->pred[i]/(float)sp->calls[i],sp->pred[i],(float)sp->calls[i], split->pred[i]); }else{ fprintf(stderr,"Missing prediction?????\n"); } //} } fprintf(stderr,"Kfold:%d Num:%d %f\n",k_fold, k_fold_repetition,sum / (float) c); } } free(indicator);// = malloc(sizeof(float) * num_samples ); free(sample_list_pos);// = malloc(sizeof(int) * numseq_pos); free(sample_list_neg);// = malloc(sizeof(int) * numseq_neg);
Here I wastefully print out a file that contains for each peak the classifier score and an indication of whether it is a known or novel peak.
sprintf (outfile_name, "%s_num_4_roc.txt",outfile); if ((file_ptr = fopen(outfile_name, "w")) == NULL){ fprintf(stderr,"can't open output\n"); exit(-1); } for(i = 0; i < total_numseq;i++){ if(sp->calls[i]){ fprintf(file_ptr,"%0.4f %f %f %f\n",sp->pred[i]/(float)sp->calls[i],sp->pred[i],(float)sp->calls[i], split->pred[i]); } //} } fclose(file_ptr); sprintf (outfile_name, "%s_predictions.bed",outfile); if ((file_ptr = fopen(outfile_name, "w")) == NULL){ fprintf(stderr,"can't open output\n"); exit(-1); } char strand; for(i = 0; i < td_pos->numseq ;i++){ strand = '+'; if(td_pos->gc[i]->strand){ strand = '-'; } fprintf(file_ptr, "%s\t%d\t%d\t%s,%4.4f\t%f\t%c\n", td_pos->gc[i]->chr,td_pos->gc[i]->start,td_pos->gc[i]->end,td_pos->gc[i]->info,sp->pred[i]/(float)sp->calls[i],td_pos->gc[i]->count,strand ); //fprintf(stdout," %f %f\n",sp->pred[i], gc_targets[0]->expression[i]); } for(i = 0; i < td_neg->numseq ;i++){ strand = '+'; if(td_neg->gc[i]->strand){ strand = '-'; } fprintf(file_ptr, "%s\t%d\t%d\t%s,%4.4f\t%f\t%c\n", td_neg->gc[i]->chr,td_neg->gc[i]->start,td_neg->gc[i]->end,td_neg->gc[i]->info,sp->pred[i+td_pos->numseq ]/(float)sp->calls[i+td_pos->numseq ],td_neg->gc[i]->count,strand ); //fprintf(stdout," %f %f\n",sp->pred[i], gc_targets[0]->expression[i]); } fclose(file_ptr); for(i = 0;i < num_kmers;i++){ free(model[i]);// = malloc(sizeof(float) * window); } free(model);// = malloc(sizeof(float*) * num_kmers); for(i = 0; i < num_kmers;i++){ free(prediction_matrix[i]); } free(prediction_matrix); free(outfile_name); free_tome_data(td_pos); free_tome_data(td_neg); } float** derive_model(float** model, struct genome_coorinates** gc, int numseq, int kmer, int window ) { int i,j,c; int num_kmers = pow(4,kmer ); char* tmp = 0; int key; double* mean = malloc(sizeof(double) * num_kmers); double* sigma = malloc(sizeof(double) * num_kmers); for(i = 0;i < num_kmers;i++){ for(j = 0; j < window;j++){ model[i][j] = 0.1f; } } double S0,S1,S2; for(i = 0; i < numseq;i++){ //if(gc[i]->label == TYPE_POS){ if(gc[i]->label == 1){ tmp = gc[i]->seq; if(gc[i]->len >= window){ for(j = 0; j < window - kmer;j++){ key = 0; for(c = 0; c < kmer;c++){ if(tmp[j+c] > 3){ key = -1; break; } key = key << 2; key |= tmp[j+c]; } if(key != -1){ model[key][j] += 1.0f; } } }else{ fprintf(stderr,"%d: %d len, %s:%d-%d \n", i,gc[i]->len, gc[i]->chr,gc[i]->start,gc[i]->end); } } } for(i =0 ; i < num_kmers;i++){ mean[i] = 0.0f; sigma[i] = 0.0f; S0 = 0.0; S1 = 0.0; S2 = 0.0; for(j = 0; j < window- kmer;j++){ S0 = S0 +model[i][j]; S1 += j *model[i][j] ; S2 += (j*model[i][j]) * ( j*model[i][j]); } mean[i] =S1 / S0 ; sigma[i] = sqrt( (S0 * S2 - pow(S1,2.0)) / ( S0 *(S0-1.0) )); for(j = 0; j < window - kmer;j++){ model[i][j] = log(gsl_ran_gaussian_pdf ((double)j - mean[i], sigma[i] )); // fprintf(stderr,"%f ",model[i][j] ); } //fprintf(stderr," MEAN: %f SIGMA:%f\n", mean[i],sigma[i]); } free(mean); free(sigma); return model; } double** fill_prediction_matrix(float** model, double** prediction_matrix, struct tome_data* pos,struct tome_data* neg,int kmer, int window) { int i,j,c,key; int num_kmers = pow(4,kmer ); char* tmp = 0; for(i = 0; i < num_kmers;i++){ for(j = 0; j < pos->numseq + neg->numseq;j++){ prediction_matrix[i][j] = 0.0; } } for(i = 0; i < pos->numseq ;i++){ tmp = pos->gc[i]->seq; for(j = 0; j < window - kmer;j++){ key = 0; for(c = 0; c < kmer;c++){ if(tmp[j+c] > 3){ key = -1; break; } key = key << 2; key |= tmp[j+c]; } if(key != -1){ prediction_matrix[key][i] += model[key][j]; } } } for(i = 0; i < neg->numseq ;i++){ tmp = neg->gc[i]->seq; for(j = 0; j < window - kmer;j++){ key = 0; for(c = 0; c < kmer;c++){ if(tmp[j+c] > 3){ key = -1; break; } key = key << 2; key |= tmp[j+c]; } if(key != -1){ prediction_matrix[key][i +pos->numseq] += model[key][j]; //bed_pos[i]->expression[key] += p_matrix[key][j] ;// log(score); } } } return prediction_matrix; } void usage() { fprintf(stdout, "\n%s %s, Copyright (C) 2013 Riken <%s>\n",PACKAGE_NAME, PACKAGE_VERSION,PACKAGE_BUGREPORT); fprintf(stdout, "\n"); fprintf(stdout, "Usage: tssclass [options] <db> <knownPeaks.bed> <novelPeaks.bed>\n\n"); fprintf(stdout, "Options:\n"); fprintf(stdout, " -o STR output file prefix.\n"); fprintf(stdout, " -kmer INT kmer size used [4]\n"); fprintf(stdout, " -window INT size of genome sequence to use. [2000].\n"); fprintf(stdout, "\n"); }