File         Date        Author        Commit
cache        2013-08-12  Elana Fertig  [fadec0] added files
config       2013-08-12  Elana Fertig  [fadec0] added files
data         2013-10-24  ejfertig      [5e0441] added additional fields to the repository for c...
datasrc      2013-08-12  Elana Fertig  [fadec0] added files
diagnostics  2013-08-12  Elana Fertig  [fadec0] added files
doc          2013-08-12  Elana Fertig  [fadec0] added files
graphs       2013-08-13  Elana Fertig  [d4b7e1] changed figure size for manuscript
lib          2013-08-12  Elana Fertig  [fadec0] added files
logs         2013-08-12  Elana Fertig  [fadec0] added files
munge        2013-08-12  Elana Fertig  [fadec0] added files
profiling    2013-08-12  Elana Fertig  [fadec0] added files
src          2014-03-25  ejfertig      [5d40fd] additional code to look at alternative models
tests        2013-08-12  Elana Fertig  [fadec0] added files
.gitignore   2013-08-12  Elana Fertig  [fadec0] added files
README.md    2014-06-17  ejfertig      [5ae810] updated README to describe code structure
TODO         2013-08-12  Elana Fertig  [fadec0] added files

Read Me

pSVA

Sample source, procurement process, and other technical variations introduce batch effects into genomics data. Algorithms that remove these artifacts enhance differences between known biological covariates, but they also risk removing intra-group biological heterogeneity and, with it, any personalized genomic signatures. As a result, standard batch correction algorithms, which are designed for class comparison analyses, make it difficult to identify novel subtypes accurately from batch-corrected genomics data. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori.

Therefore, we assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), which uses a new statistical model that is blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict HPV status, pSVA improved cross-study validation even when the sample batches were highly confounded with HPV status in the training set.

Details about the algorithm and these results are published in Parker et al. (2014), Bioinformatics. The code supporting these results is written in R and organized using ProjectTemplate.

Briefly, the file structure is as follows:

  • cache: R objects stored from intermediate analysis, and used to obtain results in the paper
  • data: processed gene expression and sample annotation used for analysis
  • datasrc: code for obtaining and processing raw gene expression data from GEO
  • graphs: plots generated from R code, used as Figures in the manuscript
  • munge: code to preprocess gene expression data into a usable format for analysis; the results are subsequently stored in the cache
  • src: code to obtain published results from data preprocessed according to munge and stored in the cache
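
Because the repository follows the ProjectTemplate layout described above, a session would typically start by loading the project from the repository root. The following is a minimal sketch, assuming the ProjectTemplate package is installed and the working directory is the top-level folder containing config, data, munge, and src. What actually gets loaded depends on the settings in config, and the script names under src are not listed in this README, so the last line only enumerates them.

```r
# Minimal sketch, not the authors' documented workflow.
# Assumption: the working directory is the repository root
# (the folder that contains config/, cache/, data/, munge/, src/).

# install.packages("ProjectTemplate")  # uncomment if the package is missing
library(ProjectTemplate)

# load.project() reads the settings in config/ and, depending on those
# settings, loads cached objects, reads files in data/, and runs the
# preprocessing scripts in munge/.
load.project()

# List the analysis scripts in src/ (file names are not given in the README).
list.files("src", pattern = "\\.R$", full.names = TRUE)
```

An individual analysis script can then be run with `source()` once the project has been loaded.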