From: <tr...@us...> - 2002-08-31 11:04:55
|
Update of /cvsroot/basedb/basedb/exec/source/pca In directory usw-pr-cvs1:/tmp/cvs-serv28128 Added Files: Makefile PCA_README PCA_v01.c TestBig.txt TestSmall.txt Log Message: Added PCA plugin from Greg Voronin --- NEW FILE: Makefile --- FILES = pca FLAGS = -c -O3 -Wall -march=i686 -Ignuplot_i-2.6/src/ C = gcc CC = g++ LDFLAGS = -lgsl -lgslcblas SOURCE = PCA_v01.c gnuplot_i-2.6/src/gnuplot_i.c OBJ_PCA = PCA_v01.o gnuplot_i-2.6/src/gnuplot_i.o OBJ_ALL = Makedep.rule $(OBJ_PCA) all: $(FILES) .cc.o: $(CC) $(FLAGS) -o $@ $< .c.o: $(C) $(FLAGS) -o $@ $< clean: rm -f $(OBJ_ALL) $(FILES) *~ touch -d20020101 Makedep.rule depend: $(CC) $(FLAGS) -MM $(SOURCE) >Makedep.rule Makedep.rule: Makefile @echo "The makefile has changed since the last 'make depend'." install: cp $(FILES) ../../apps # Dependencies for executables pca: $(OBJ_PCA) $(C) $(LDFLAGS) -o $@ $+ # Dependencies for object files include Makedep.rule --- NEW FILE: PCA_README --- Principal Components Analysis Plug-In Tool version 0.01 8/26/02 ----------------------------------------------- Contents: ------- 1) Acknowledgement of Support 2) Introduction and Credits 3) Program Outline 4) System Parameters 5) Package List 6) Compilation, Linking Running 5) Test Runs 7) Integration with BASE 8) To Do List 9) References 1) Acknowledgement of Support: ------------------------ Creation of this software and documentation was supported by a grant from the New Jersey Commission on Spinal Cord Research. Support is also acknowledged from the Spinal Cord Injury Project of the W.M. Keck Center for Collaborative Neuroscience of Rutgers University. Authors: Gregory O. Voronin go...@ya... Ronald P. Hart rh...@an... Neuroscience Gene Expression Laboratory( www.ngelab.org ), W.M. Keck Center Center for Collaborative Neuroscience Rutgers University Piscataway, NJ 08854 2) Introduction and Credits: -------------------- We are releasing this PCA tool as an open source package under the GNU General Public Liscense. A copy of this liscense is provided in the archive file.Please feel free to use, modify and improve this code. It free to use and incorporate into your own projects. If you use this code, remember to credit us for its use. This project could not have been completed with the following projects: a) GNU Gnu Scientific Library - A comprehensive set of tools for numerical analysis of a broad range of problems. As a testimonial to the usefulness of this package, I was able to get the PCA calculation component coded, debugged and tested against real data sets in 1 and 1/2 days of leisurlely programming. This package can be found at: http://www.gnu.org/software/gsl/ b) Nicolas Devillard's gnuplot_i - This package allows you to call and use gnuplot directly from your C program. Compile and run his example programs then look at his code to see how easy it is to incorporate gnuplot funtionality into your code. This can be found at: http://ndevilla.free.fr/gnuplot/ We also include it as part of the PCA download package. 3) Program Outline: -------------- Detailed descriptions of Principal Components Analysis(PCA) can be found in the reference list. The following is an outline of our program: PCA is a multivariate statistics method at reducing data set dimensionality. The calculation of the PCA for a given data set is straight forward, however, the interpretation can be challenging. The reference list, your own judgement and the biology of your experiment should all be used to interpret the results of PCA. This program calculates the PCA of the sample variance-covariance matrix of your data set: a) Data is parsed from a lowess transformed data set in serial BASE file format. b) The natural logarithm of the ratio of expression intensity 1 to expression intensity 2 is then calculated. c) The data matrix of the ln transformed date is then created. Rows correspond to genes and columns correspond to experimental variates. d) The average value of each column is calcuated and a second matrix, called the average matrix is then created. For each column of the average matrix, the row value is equal to the column average of the corresponding column. e) The average matrix is then subtracted from the data matrix. This produces the difference matrix. f) The variance-covariance matrix is calcuated from the product of the transpose of the difference matrix times the difference matrix itself. The resulting marix is a symmetric matrix with all positive real values. Multiplying the entire matrix by ( 1/rows -1 ) yields the final sample variance-covariance matrix. g) The eigenvalues and eigenvectors of the sample variance-covariance matrix are then calculated. The combination of eigenvalues and eigenvectors for a given matrix can be termed the eignensytem of that matrix. The eigensystem is then sorted by the value of the eigenvalue of the system. h) The variance-covariance matrix and the eigensystem( which is stored in a linked list) are used to generate the output files, currently: covariance.html, eigensystem.html, Scree.png and CoeffieicentPlot.png. Some notes: 1) We did not reinvent the wheel when it comes to finding the eigensystem of a matrix. We used the Gnu Scientific Library. In their documentation, GNU Scientific Library Reference Manual Edition 1.2, for GSL Version 1.2 15 July 2002, the developers state: "This chapter describes functions for computing eigenvalues and eigenvectors of matrices. There are routines for real symmetric and complex hermitian matrices, and eigenvalues can be computed with or without eigenvectors. The algorithms used are symmetric bidiagonalization followed by QR reduction. These routines are intended for "small" systems where simple algorithms are acceptable. Anyone interested finding eigenvalues and eigenvectors of large matrices will want to use the sophisticated routines found in LAPACK. The Fortran version of LAPACK is recommended as the standard package for linear algebra. " Always, always check your work. We have tested this our code with a 5000 gene by 5 variate data set. We then compared the eigensystems generated with GeneSpring's PCA and PCA we implemented on Mathcad Plus 6.0 from Mathsoft Inc. with the same data set. The results provided by these three idependent implementations on the same data set were essentially the same. But you must confirm your own results. You are welcome to determine what is meant by a "small" system and report back to us. Lest you think we are lazy in not implementing our own eigensystem routines, we rely on the wisdom of authors of Numerical Recipes in C: The Art of Scientific Computing: "You have probably gathered by now that the solution of eigensystems is a fairly complicated business. It is. It is one of the few subjects covered in this book for which we do not recommend that you avoid canned routines." 2) A Scree plot shows the ordered eigen values against their ordinal number. This can be used to estimate the number of "true" components or rank of your data set. 3) The coefficient plot displays the value of the coeffiecent of a given principal component on the Y axis. Each princiapl component is plotted in a different color. TheX axis corresponds the data set or column to which that coefficient corresponds. 4) System parameters: ----------------- We can confirm succesful compilation and operation on the following system(s): RedHat Linux 7.3 ( 2.4.18 kernel, gcc 2.96-RH, glibc 2.2.4, XFree86 4.2.0 ) BASE 1.0.6 gcc version 2.96 20000731 ( RedHat Linux 7.3 2.96 ) gsl 1.1.1 gnuplot linux version 3.7.1 gnuplot_i-2.6 Gnuplot must be installed and working on your system. On our RedHat 7.3 installation, gnuplot was installed and gsl was provided as an RPM. 5) Package list: ----------- The following files should be present in the tarred and gzipped archive called PCA_v01.tar.gz 1) PCA_v01.c source code 2) PCA_v01 executable 3) gnuplot_i-2.6 N. Devillard's open source c interface to gnuplot 4) TestSmall.txt serial BASEfile fromat test data file 5) TestBig.txt serial BASEfile format test data file 6) PCA_README this file 6) Compilation, Linking and Running: --------------------------- A) Create a directory on your system called /usr/PCA. This will serve as the working directory prior to integrating the plug-in with BASE. Copy the gzipped archive to /usr/PCA then unzip and untar the archive PCA_v01.tar.gz: > mkdir /usr/PCA > cp ./PCA_v01/tar/gz /usr/PCA > gunzip PCA_v01.tar.gz > tar xvf PCA_v01.tar This will extract the files listed in section 4) Package List. B) Try a test run of the executable file PCA_v01 first. If this works on your system, then you won't to go through the compilation and linking procedure( Steps C - ). a) Test to ensure the program is executtable by running the directory command and seeing the that the executable parameter is set: > ls -al The console output of this command should somewhat like the following: drwxrwxr-x 5 1183 root gnuplot_i-2.6 -rw-r--r-- 1 root root PCA_README -rwxr-xr-x 1 root root PCA_v01 -rw-r--r-- 1 root root PCA_v01.c -rw-r--r-- 1 root root PCA_v01.tar -rw-r--r-- 1 root root TestBig.txt -rw-r--r-- 1 root root TestSmall.txt As you can see from the above, the x's are set for PCA_v01. If this is not the case for your output, issue the following command: > chmod 777 PCA_v01 b) To execute the program type the following: > PCA_v01 < TestSmall.txt You should see the following output: The variance-covariance matrix: 0.625100 0.004427 -0.001474 0.004427 0.140561 -0.005588 -0.001474 -0.005588 0.170664 PCA calculations completed. Point your web browser to your current working directory and bring up the following files: Covariance.html Eigensystem.html Scree.png CoefficientPlot.png The four files you see listed can be viewed in your browser by pointing to the working directory where they are created and then clicking on them from the browser window. The working directory is simply the directory in which you run the program. If the above steps work and give the desired output, you do not need to recompile the PCA_v01 from the source code. You can proceed the Section 6, Integrating with BASE. C) To compile from the source code: a) Test the steps from B) once again to be sure you didn't miss anything. b) Since you followed all the steps in B, you have a working directory /usr/PCA. Move to this directory and descend into the subdirectory callled gnuplot_i-2.6: > cd /usr/PCA > cd gnuplot_i-2.6 c) Please consult the instructions in the gnuplot_1-2.6 documentation prior to completing this step. Additionally you should create the example programs provided in with the gnuplot_i-2.6 package. The are very interesting and the source code is sort and easy to read. While in the gnuplot_i-2.6 subdirectory issue the make command: > make To compile and link the Nicolas Devillard's example programs: > make tests f) A file called gnuplot_i.o should exist in the directory /usr/PCA/gnuplot_i-2.6. Copy this file to the directory containing the PCA plug-in source code. This assumes we are using /usr/PCA as our working directory: > cd /usr/PCA/gnuplot_i-2.6 > cp ./gnuplot_i.o /usr/PCA Now descend into the src directory in the gnuplot_i-2.6 sub-directory and copy the header file gnuplot_i.h to the PCA working directory: > cd /usr/PCA/gnuplot_i-2.6/src > cp ./gnuplot_i.h /usr/PCA g) Now you can compile and link the PCA source code in one fell swoop: > gcc -O3 -I/usr/include -lgsl -lgslcblas -lm -o PCA_v01 PCA_v01.c gnuplot_i.o This will create the executable PCA_v01 in the working directory. h) Test the executable according to the instructions in B). 7) Integrating with BASE: ------------------ A) To integrate the executable with BASE, first copy the executable file to the exec/apps directory in the BASE directory tree. This will of course depend on the location of base in your directory tree, ( i.e the base home path ): > cp /usr/PCA/PCA_v01 /usr/local/base/exec/apps/ The execuatable file is now present in the correct directory. B) Now we can tell out BASE installation about the PCA plug-in: a) Log onto BASE as root b) Clickon the Analyze Data section. c) Click on Plug-ins d) Click on Add plug-in. This will bring up a page where you can specify the parameters for the plug-in: Fill in the following fields: Name: Analysis: PCA Name of executable: PCA_v01 Active: Check: On Data format Choose: Serial BASEfile( one section per assay ) Columns Check: postion and molecule Fields Check: intensity1 and intensity2 e) After you have entered the above options, click Accept The PCA Plug-in will now be active and usable: Choose and experiment to run this on. Run the Lowess plug-in. Choose to the [run app] option on the lowess transformed option line. Choose the Analysis: PCA plug-in. Remember that PCA is for multidimensional data that has been lowess transformed. It may take a while to run. Please be patient. 8) To Do List: ---------- 1) Improve parser efficiency, or integrate with Carl's C++ BASEfile parser 2) Provide postscript output file. 3) Resolve diminishing/truncated reporters list issue. 4) Component vs. Component 2D and 3D output with data. 5) Document use/parsing of stdin.txt 6) SQL script for uploading the plug-in definition 9) References: ----------- Principles of Multivariate Analysis: A User's Perspective. Krzanowski, WJ. Oxford University Press 1988. A User's Guide to Principal Components. Jackson, EJ. John Wiley & Sons, Inc. 1991 Principal Components Analysis to Summarize Microarray Experiments: Application to Sporulation Time Series. Raychaudhuri S, Stuart JM, Altman RB. Pacific Symposium on Biocomputing 2000. Honolulu, Hawaii, 452-463, 2000. --- NEW FILE: PCA_v01.c --- // // PCA Plug-In for BioArraySoftware Environment( BASE ) // homepage http://www.ngelab.org // BASE - homepage http://base.thep.lu.se/ // Copyright (C) 2002 Ron Hart, Greg Voronin(rhgv) // // Write to: See Below // This file is part of PCA_v01, a Principal Components Analysis Plug_In for BASE. // // PCA_v01 is free software; you can redistribute it and/or // modify it under the terms of the GNU General Public License // as published by the Free Software Foundation; either version 2 // of the License, or (at your option) any later version. // // PCA_v01 is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the // GNU General Public License for more details. // // You should have received a copy of the GNU General Public License // along with this program; if not, write to the Free Software // Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. // // // Creation of this software and documentation was supported by a grant from the New Jersey Commission on Spinal Cord Research. // Support is also acknowledged from the Spinal Cord Injury Project of the W.M. Keck Center for Collaborative Neuroscience of Rutgers University. // // Authors: // // Gregory O. Voronin go...@ya... // Ronald P. Hart rh...@an... // Neuroscience Gene Expression Laboratory( www.ngelab.org ), // W.M. Keck Center Center for Collaborative Neuroscience // Rutgers University // Piscataway, NJ 08854 // /* ********************************************************************************* */ /* See the file PCA_README for a complete outline of the method */ /* used to perform PCA. Also consult the references cited therein. */ /* ********************************************************************************* */ #include <stdio.h> #include <stdlib.h> #include <math.h> #include <unistd.h> #include <gsl/gsl_matrix.h> #include <gsl/gsl_blas.h> #include <gsl/gsl_eigen.h> #include "gnuplot_i.h" /* **************************************************************************************** Link list structure for the data list. . . Since we know neither the number genes/rows/reporters nor the number of assays/experiments in a set prior to parsing the data file, a dynamic data strucutre seems to be called for. Similarly, eigensystem of the variance-covariance matrix will be stored in a linked list of elements of structure eigenRec. ***************************************************************************************** */ struct lowessRec { int row; int col; double intensity1; double intensity2; double ratio; char *tag; struct lowessRec *next; }; struct eigenRec { char *annotation; int cols; double eigenvalue; gsl_vector *eigenvector; struct eigenRec *next; }; /* ********************************************************************************** These typedefs will allow us to easily use the linked list representing the datalist and eigensystem throughout our code. ************************************************************************************ */ typedef struct lowessRec MYREC; typedef MYREC *LINK; typedef struct eigenRec EIGENREC; typedef EIGENREC *ELINK; /* Global variables used */ /* *********************************************************************************** matrix a is the data matrix representing the natural log transformed expression intensity ratios for the given experiments. Rows are are the genes and columns are the experiments. matrix s is the variance-covariance matrix. This is a symetrical col number X col number sized matrix. ************************************************************************************* */ gsl_matrix *a; gsl_matrix *s; /* ******************************************************************************* These are shared data tags which help to define the contents of a particular structure element in the linked list of data parsed in tjhe lowess transformed data file. ********************************************************************************** */ char *HEAD ="Head"; char *BEGIN = "Begin"; char *DATA = "Data"; LINK dataList, lowessHead; ELINK eigenList; /* ********************************************************************************************************** Function prototype headers *********************************************************************************************************** */ /* ***** data list linked list function protypes ***** */ LINK initializeList(); void addElement( int _row, int _col, double _intensity1, double _intensity2, int _code, LINK list ); void setRowsAndColumns( int maxRow, int maxCol, LINK list ); /* ************************************************************** */ /* ***** eigensystem list linked list function prototypes ***** */ ELINK initializeEList(); void addEigenLink( char *_annotation, ELINK _elist ); /* ******************************************************************** */ /* ***** parsing function prototype ****** */ void parse( LINK dList, ELINK eList ); /* ************************************************* */ /* ****** PCA calculation function prototypes ****** */ gsl_matrix listToMatrix( LINK list, gsl_matrix *m ); void createCovarianceMatrix(gsl_matrix *m, gsl_matrix *sm, ELINK eList ); void calculateEigenSystem( gsl_matrix *m, ELINK eList ); /* ******************************************************************* */ /* ***** output function protoypes ***** */ void outputCovarMatrix( ELINK eList, gsl_matrix *m ); void outputEigenSystem( ELINK eList ); void outputScreePNG( ELINK eList, int cols ); /* *********************************************************************************************************** */ /* ***** Begin Program Function Implementations Here ****** */ /* **************************************************************************************** main method: 1) parse the lowess transformed BASEfile into the into the linked list for data 2) use the now known col and row length parameters here to create the matrices for PCA 3) convert the data linked list to the data matrix 4) perform the PCA calculations 5) output the results of the eigensystem of the variance-covariance matrix. ****************************************************************************************** */ int main( void ) { dataList = initializeList(); eigenList = initializeEList(); parse( dataList, eigenList ); a = gsl_matrix_alloc( dataList->row, dataList->col ); s = gsl_matrix_alloc( dataList->col, dataList->col ); listToMatrix( dataList, a ); createCovarianceMatrix( a, s, eigenList ); calculateEigenSystem( s, eigenList ); outputCovarMatrix( eigenList, s ); outputEigenSystem( eigenList ); outputScreePNG( eigenList, s->size1 ); printf( "PCA calculations completed. Point your web browser to your\n" ); printf( "current working directory and bring up the following files: \n\n" ); printf( "\tCovariance.html\n" ); printf( "\tEigensystem.html\n" ); printf( "\tScree.png\n" ); printf( "\tCoefficientPlot.png\n\n" ); return 0; } void parse( LINK dList, ELINK eList ) { /* jobController.php prepares a tab-delimited text file from the information to it from the database tables Program and ProgramParameter. A text file in the BASEfile format is then created. This file is called stdin.txt jobController then calls your program and redirects the stdin iostream from the console/keyboard to the file stdin.txt. This allows you to look at stdin.txt as a set of strings entered by the user at the keyboard. The parser parses this BASEfile( stdin.txt ) as such. It uses a combination of scanf() and fgets() to do this. The parser keeps track of its position in the BASEfile as it marches through the various sections by means of a set of state variables. These can be viewed as series of boolean on/off switches. The switches are set and reset as the parser changes state. */ /* stand text strings used in the BASEfile format. */ char *SECTION = "section"; char *SPOTS = "spots"; char *COUNT = "count"; char *PERCENT = "%"; char line[100]; char *token; char *delimiters; char inten1[15]; char inten2[15]; char columnTitle[30]; int col; int i; int index = 0; double data, fi1, fi2; /* ****************************************************** The following are parser state variables: 0 = false; 1 = true; They represent the 'states' the parser is in while working its way through the lowess transformed BASEfile ( serial format ). ******************************************************** */ int inSection = 0; int inSpots = 0; int inPercent = 0; int inData = 0; int inCount = 0; int inAnnotation = 0; int rowCount = 0; scanf( "%s", line ); while( line != NULL && !feof( stdin ) ) { if ( line != NULL ) { if ( strcmp( line, SECTION ) == 0 ) { // printf( "in a section\n" ); inSection = 1; inPercent = 0; inData = 0; } if (inSection && strcmp( line, COUNT ) == 0 ) { scanf( "%d", &col ); inSection = 0; inCount = 1; } if( inCount && strcmp( line, PERCENT ) == 0 ) { inCount = 0; inAnnotation = 1; } if ( inAnnotation ) { /* ignore this first line */ fgets( columnTitle, 100, stdin ); for ( i = 0; i < col; i++ ) { fgets( columnTitle, 100, stdin ); addEigenLink( columnTitle, eList ); } inAnnotation = 0; } if ( inSection && strcmp( line, SPOTS ) == 0 ) { /* Oh yeah, were getting close to the data! */ inCount = 0; inSection = 0; inSpots = 1; } if ( inSpots && strcmp( line, PERCENT ) == 0 ) { /* Yeah baby, yeah baby this must be it! */ inPercent = 1; inData = 1; inSpots = 0; } if ( inData ) { if ( strcmp( line, PERCENT ) == 0 ) { /* This element indicates the beginning of a new replicate */ rowCount = 0; addElement( 0, 0, 0.0, 0.0, 1, dList ); } else if ( strcmp( line, SECTION ) == 0 ) { /* a new section of data, no doubt */ inData = 0; // inSection = 1; } else { if ( index == 0 ) { index = index + 1; } else if ( index == 1 ) { index = index + 1; } else if ( index == 2 ) { /* This is the lowess transformed intensity 1. It is being converted from a string to a number. */ fi1 = atof( line ); index = index + 1; } else { /* This is the lowess transformed intensity 1. It is being converted from a string to a number. */ fi2 = atof( line ); rowCount = rowCount + 1; // printf( "The intensities are 1: %f 2: %f\n", fi1, fi2 ); addElement( 0, 0, fi1, fi2, 0, dList ); index = 0; } } } } /* end of first if line != NULL */ scanf( "%s", line ); } /* end of outer most while loop */ setRowsAndColumns( rowCount, col, dList ); } /* The implementation of the linked list for the data list follows. This is a standard linked list implementation available in numerous code cookbooks and data strcutures and algorithm's course text books. */ LINK initializeList() { lowessHead = (LINK)malloc( sizeof( MYREC ) ); if ( lowessHead == NULL ) { printf( "failure to allocate space for the data list\n" ); exit( 1 ); } lowessHead->tag = HEAD; lowessHead->next = NULL; return lowessHead; } void addElement( int _row, int _col, double _intensity1, double _intensity2, int _code, LINK list ) { int rowCount; double r; struct lowessRec *newRec; newRec = (LINK)malloc ( sizeof( MYREC ) ); if ( newRec == NULL ) { printf( "shit!\n" ); exit( 1 ); } /* initialize the new rec values here */ newRec->row = _row; newRec ->col = _col; newRec->intensity1 = _intensity1; newRec->intensity2 = _intensity2; if ( _code == 1 ) { newRec->tag = BEGIN; rowCount = 0; } else if ( _code == 0 ) { newRec->tag = DATA; rowCount = rowCount + 1; r = log( _intensity1 / _intensity2 ); newRec->ratio = r; } newRec->next = NULL; /* walk to the end of the list, but don't fall off! */ while( list->next != NULL ) { list = list->next; } /* add the newRec to the end of the list */ list->next = newRec; } void setRowsAndColumns( int maxRow, int maxCol, LINK list ) { list->row = maxRow; list->col = maxCol; } /* The implementation of the linked list for the eigensystem follows. This is a standard linked list implementation available in numerous code cookbooks and data structures and algorithm's course text books. The annotation string's location does not imply a particular attachment to the eigen vector or eigenvalue value in the linked list. It is a merely a convienent place to store this information and use a later time for output. */ ELINK initializeEList() { ELINK eigenHead; eigenHead = (ELINK)malloc( sizeof( EIGENREC ) ); if ( eigenHead == NULL ) { printf( "failure to allocate space for head of eigen system list\n" ); exit( 1 ); } eigenHead->next = NULL; return eigenHead; } void addEigenLink( char *_annotation, ELINK _elist ) { EIGENREC *newRec; if ( _elist == NULL ) { _elist = (ELINK)malloc( sizeof( EIGENREC ) ); // printf( "allocating a new list\n" ); } newRec = (ELINK)malloc( sizeof( EIGENREC ) ); if ( newRec == NULL ) { printf( "Falied to alloc space for an EIGENREC.\n" ); exit( 1 ); } newRec->annotation = ( char *)malloc( strlen( _annotation ) + 1 ); strcpy( newRec->annotation, _annotation ); while( _elist->next != NULL ) { _elist = _elist->next; } _elist->next = newRec; } /* This function converts the data list linked list into a data matrix, which then can be manpulated by gsl functions */ gsl_matrix listToMatrix( LINK list, gsl_matrix *m ) { int matrixRows = 0; int matrixCols = 0; int currentRow = 0; int currentCol = -1; int rows, cols; int i, j; int index; matrixRows = list->row; matrixCols = list->col; while( list != NULL ) { if ( strcmp( list->tag, BEGIN ) == 0 ) { currentCol = currentCol + 1; currentRow = 0; } if ( strcmp( list->tag, DATA ) == 0 ) { if ( currentRow < matrixRows ) { gsl_matrix_set( m, currentRow, currentCol, list->ratio ); } currentRow = currentRow + 1; } if ( list !=NULL ) { list = list->next; } } /* end traversal outer while */ return *m; } /* The following functions perform the PCA calculation. */ void createCovarianceMatrix( gsl_matrix *m, gsl_matrix *sm, ELINK eList ) { /* calculating the difference matrix, requires first the matrix of column means */ int i, j, index, maxR, maxC; double columnSum, columnAvg, scaling; FILE *fp; ELINK workingList; gsl_matrix *am; gsl_matrix *at; int maxRow = m->size1; int maxCol = m->size2; am = gsl_matrix_alloc( maxRow, maxCol ); columnSum = 0; columnAvg = 0; for ( i = 0; i < maxCol; i++ ) { for ( j = 0; j < maxRow; j++ ) { columnSum = columnSum + gsl_matrix_get( m, j, i ); } columnAvg = columnSum / ( m->size1 ); for ( j = 0; j< maxRow; j++ ) { gsl_matrix_set( am, j, i, columnAvg ); } columnSum = 0; columnAvg = 0; } /* the matrix m will now be turned into the diffrerence matrix */ gsl_matrix_sub( m, am ); scaling = 1.0 / ( m->size1 - 1 ); at = gsl_matrix_alloc( m->size2, m->size1 ); gsl_matrix_transpose_memcpy( at, m ); gsl_blas_dgemm( CblasNoTrans, CblasNoTrans, 1.0, at, m, 0.0, sm ); gsl_matrix_scale( sm, scaling ); printf( "The variance-covariance matrix:\n" ); for ( i = 0; i < sm->size1; i++ ) { for ( j = 0; j < sm->size2; j++ ) { printf( " %f ", gsl_matrix_get( sm, i, j ) ); } printf( "\n" ); } printf( "\n" ); /* the matrix s is now the variance-covariance matrix */ gsl_matrix_free( m ); gsl_matrix_free( am ); gsl_matrix_free( at ); } void calculateEigenSystem( gsl_matrix *m, ELINK eList ) { int i, j, index; gsl_vector *eigenValues; gsl_matrix *working_m, *eigenVectors; gsl_eigen_symmv_workspace *w; eigenValues = gsl_vector_alloc( m->size2 ); working_m = gsl_matrix_alloc( m->size1, m->size2 ); gsl_matrix_memcpy( working_m, m ); eigenVectors = gsl_matrix_alloc( m->size2, m->size2 ); w = gsl_eigen_symmv_alloc( m->size2 ); gsl_eigen_symmv( working_m, eigenValues, eigenVectors, w ); gsl_eigen_symmv_free( w ); gsl_eigen_symmv_sort( eigenValues, eigenVectors, GSL_EIGEN_SORT_VAL_DESC ); /* assign the eigensystem list to the appropriate values */ eList = eList->next; index = 0; while ( eList != NULL && index < eigenValues->size ) { eList->eigenvalue = gsl_vector_get( eigenValues, index ); eList->cols = eigenValues->size; eList->eigenvector = gsl_vector_alloc( eigenValues->size ); gsl_matrix_get_col( eList->eigenvector, eigenVectors, index ); index = index + 1; eList = eList->next; } } /* The following functions create the HTML and text files displaying the results of the PCA computation. */ void outputCovarMatrix( ELINK eList, gsl_matrix *m ) { int i, j, col, row; ELINK head = eList; /* maintain a pointer to the head of the list */ FILE *fp1 = fopen( "Covariance.html", "w" ); if ( fp1 != NULL ) { /* Begin of html output of the variance-covariance matrix */ fprintf( fp1, "<html>" ); fprintf( fp1, "<table border=\"3\" cellpadding=\"3\" cellspacing=\"1\">" ); fprintf( fp1, "<caption><b>Variance-Covariance Matrix of the Data</b></caption>" ); fprintf( fp1, "<tr><td></td>" ); /* Place experiment names across the top of the table */ /* Skip the head of the list */ eList = eList->next; while( eList != NULL ) { fprintf( fp1, "<td align=\"center\"><font color=\"blue\">%s</font></td>", eList->annotation ); eList = eList->next; } fprintf( fp1, "</tr>" ); /* Begin the rows and columns of the variance-covariance matrix by returning to the head of the list to obtain experiment names and used the variance-covariance matrix to display the results */ eList = head; eList = eList->next; row = 0; while ( eList != NULL ) { fprintf( fp1, "</tr>" ); fprintf( fp1, "<td align=\"center\"><font color=\"blue\">%s</font></td>", eList->annotation ); for (col = 0; col<m->size2; col++ ) { fprintf( fp1, "<td align=\"center\">%f</td>", gsl_matrix_get( m, row, col ) ); } eList = eList->next; row = row + 1; fprintf( fp1, "<tr>" ); } /* End of the html output of the variance-covariance matrix */ fprintf( fp1, "</table>" ); fprintf( fp1, "</html>" ); fclose( fp1 ); } } void outputEigenSystem( ELINK eList ) { int i, index; ELINK head; FILE *fp1 = fopen( "Eigensystem.html", "w" ); head = eList; if ( fp1 != NULL ) { /* The files have both been created and can be written too */ /* write to the HTML file */ fprintf( fp1, "<html><head>" ); fprintf( fp1, "<body><Br></Br>" ); eList = eList->next; /* Skip the head of the list */ fprintf( fp1, "<table border=\"3\" cellpadding=\"3\" cellspacing=\"1\">" ); fprintf( fp1, "<caption><b>Principal Components Table</b></caption>" ); fprintf( fp1, "<tr>" ); fprintf( fp1, "<td align=\"center\"><font color=\"red\">Eigenvalue</font></td>" ); fprintf( fp1, "</tr>" ); eList = head; eList = eList->next; index = 1; while ( eList != NULL ) { fprintf( fp1, "<tr>" ); fprintf( fp1, "<td align=\"center\"><font color=\"red\">%f</font></td>", eList->eigenvalue ); fprintf( fp1, "<td align=\"center\"><font color=\"blue\">principal component %d</td>", index ); for ( i = 0; i<eList->cols; i++ ) { fprintf( fp1, "<td align=\"center\">%f</font></td>", gsl_vector_get( eList->eigenvector, i ) ); } fprintf( fp1, "</tr>" ); eList = eList->next; index = index + 1; } fprintf( fp1, "</table>" ); fclose( fp1 ); } } void outputScreePNG( ELINK eList, int cols ) { int i, j; /* component label variables */ int size = 30; int count = 1; char *label = (char *)malloc( size ); double index[ cols ]; double eigenvalues[ cols ]; double coefficients[ cols ]; ELINK head; /* initialize gnuplot sessions */ gnuplot_ctrl *session1 = gnuplot_init(); gnuplot_ctrl *session2 = gnuplot_init(); head = eList; /* Maintain a pointer to the head of the list */ eList = eList->next; i = 0; while ( eList != NULL ) { eigenvalues[ i ] = eList->eigenvalue; index[ i ] = ( double )i; eList = eList->next; i = i + 1; } gnuplot_cmd( session1, "set terminal png color" ); gnuplot_cmd( session1, "set output \"Scree.png\""); gnuplot_setstyle( session1, "linespoints" ); gnuplot_set_xlabel( session1, "Ordinal Position of Eigenvalue" ); gnuplot_set_ylabel( session1, "Eigenvalue" ); gnuplot_plot_xy( session1, index, eigenvalues, cols, "" ); /* close the gnuplot session 1 */ gnuplot_close( session1 ); /* begin output form gnuplot session 2 */ gnuplot_cmd( session2, "set terminal png color" ); gnuplot_setstyle( session2, "linespoints" ); gnuplot_set_xlabel( session2, "Experimental Conditions" ); gnuplot_set_ylabel( session2, "Coefficient Value" ); eList = head; /* Maintain a pointer to the head of the list */ eList = eList->next; while ( eList != NULL ) { for ( j = 0; j < cols; j++ ) { coefficients[ j ] = gsl_vector_get(eList->eigenvector, j ); } snprintf( label, size, "Principal Component %d ", count ); gnuplot_cmd( session2, "set output \"CoefficientPlot.png\""); gnuplot_plot_x( session2, coefficients, cols, label ); count = count + 1; eList = eList->next; } /* close the gnuplot session 2 */ gnuplot_close( session2 ); } --- NEW FILE: TestBig.txt --- BASEfile fit_fraction 0.33 min_x_step 0.1 n_iter 4 section lowess settings % section assays count 3 columns id name Hrs Rx Test Tx annotationColumns Hrs Rx Test Tx % 4 Test Set 1 5 Test Set 2 6 Test Set 3 section spots columns position molecule assayData assayFields intensity1 intensity2 [...2566 lines suppressed...] 2573 336 239.187 261.954 990 5205 320.577 195.822 597 2074 206.474 304.039 636 2231 260.801 240.736 4270 3451 247.211 253.993 977 3823 282.499 222.359 932 1229 191.449 328.474 150 2255 244.294 257.886 2698 343 249.614 252.838 274 2114 215.69 292.656 3512 3034 292.825 216.129 216 1034 248.059 255.254 547 1841 260.216 243.49 87 4955 238.531 265.626 1566 918 287.291 220.641 898 4502 327.925 193.349 2278 3695 192.366 329.964 1480 2246 256.007 248.055 506 1996 265.489 239.196 3800 2461 283.063 224.459 --- NEW FILE: TestSmall.txt --- BASEfile fit_fraction 0.33 min_x_step 0.1 n_iter 4 section lowess settings % section assays count 3 columns id name Hrs Rx Test Tx annotationColumns Hrs Rx Test Tx % 4 Test Set 1 5 Test Set 2 6 Test Set 3 section spots columns position molecule assayData assayFields intensity1 intensity2 assays 4 % 4705 1876 0.182727772 0.639442747 3361 3490 0.178550103 0.734521416 2017 1092 0.75188986 0.946157646 673 4443 0.52163883 0.425572287 4706 4393 0.226277593 0.75675317 3362 4517 0.483571184 0.284209255 2018 4223 0.363921297 0.430420856 674 1501 0.282386944 0.374558038 4707 4540 0.53350036 0.745814532 3363 3266 0.718361323 0.855847816 section spots columns position molecule assayData assayFields intensity1 intensity2 assays 4 % 4705 1876 0.956653582 0.144926138 3361 3490 0.530925839 0.154646906 2017 1092 0.433994784 0.998590057 673 4443 0.474103618 0.36427883 4706 4393 0.382959597 0.793716884 3362 4517 0.257531312 0.304847628 2018 4223 0.426341693 0.119879148 674 1501 0.31309046 0.544443088 4707 4540 0.072738575 0.186893687 3363 3266 0.576870571 0.694229354 section spots columns position molecule assayData assayFields intensity1 intensity2 assays 4 % 4705 1876 0.116390925 0.643916614 3361 3490 0.791205844 0.630646561 2017 1092 0.769312628 0.557026574 673 4443 0.986900081 0.447894876 4706 4393 0.755242497 0.031620397 3362 4517 0.133165695 0.222270832 2018 4223 0.533144379 0.171807537 674 1501 0.252414263 0.34161909 4707 4540 0.422475338 0.672763583 3363 3266 0.68490596 0.375773707 |