By Benjamin C.G. Lee
Modern high-throughput DNA sequencers typically generate short DNA sequences of about 100 bases in length. Because 100 bases is relatively short, a commonly used technique for dealing with longer DNA sequences is to use biochemical manipulations to generate much longer sequences of predetermined length and then to sequence the 100 bases at each end. The DNA sequencer records these "paired-end reads" in two files, with the two reads of each pair located at corresponding positions in the two files.
This "paired file comparison" program compares two such files to validate that they do indeed contain paired-end reads. It compares the metadata of the sequences at corresponding positions in each file. Because both files are validated from start to end, the program ensures that the files correspond to each other and that no data have been inserted or deleted. This process of validating metadata also serves as a reasonable test of data integrity in general, as any data corruption that affects the metadata will be detected.
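The position-by-position metadata check described above can be illustrated with a short Python sketch. This is not the program's actual implementation. It assumes four-line FASTQ records and a common naming convention in which the two reads of a pair share a read name and differ only in a trailing "/1" or "/2" or in the comment field after the first whitespace. The function names read_key, headers and first_mismatch are hypothetical and chosen for illustration only.

```python
import io
import re
from itertools import zip_longest

# Assumption: mates share a read name and differ only in a trailing "/1" or
# "/2", or in the comment field that follows the first whitespace.
_MATE_SUFFIX = re.compile(r"/[12]$")


def read_key(header: str) -> str:
    """Return the part of a FASTQ header expected to match between mates."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header line: {header!r}")
    fields = header[1:].split()
    name = fields[0] if fields else ""
    return _MATE_SUFFIX.sub("", name)


def headers(handle):
    """Yield the header line of each record (assumes 4 lines per record)."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            yield line.rstrip("\r\n")


def first_mismatch(handle1, handle2):
    """Return the 1-based index of the first unpaired record, or None."""
    pairs = zip_longest(headers(handle1), headers(handle2))
    for n, (h1, h2) in enumerate(pairs, start=1):
        # A missing header means one file ended before the other.
        if h1 is None or h2 is None or read_key(h1) != read_key(h2):
            return n
    return None


# Two tiny in-memory "files"; the second record is deliberately mismatched.
f1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n")
f2 = io.StringIO("@r1/2\nTGCA\n+\nIIII\n@r3/2\nAAAA\n+\nIIII\n")
print(first_mismatch(f1, f2))  # prints 2
```

Because the sketch walks both files from start to end in lockstep, an inserted or deleted record shifts every later header out of alignment and is reported at the first position where the read names disagree.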
Please note that the majority of the documentation for the code appears to the right of the code itself.
Run the application from the Windows or Linux command line with the following syntax:
PairedFileComparison "file specification 1" "file specification 2" number_of_threads (buffer size)
The application outputs results to a .log file in the same folder as the executable file. If the files are paired, the results are displayed on the console window as well. If the files are not paired or if there is an error, the console window will close, and the user must check the .log file.
On a CPU with multiple cores, performance tends to be limited by disk I/O bandwidth. When running on 8 concurrent threads with a 3 GHz CPU, the program requires 8 seconds to validate two 4.06 GB FASTQ files each containing 26.5 million reads, an aggregate rate of roughly 1 GB per second.
For sample files, including a sample command line entry and two sample FASTQ files containing metadata, please see the Samples folder under the "Code" tab.
This program was written while working in the Department of Physics and Astronomy at Johns Hopkins University. I would like to thank Dr. Richard Wilton for all of his guidance.