Paired Sequence File Comparison Wiki

Fast validation of FASTQ files containing paired-end reads

Brought to you by: bcglee

By Benjamin C.G. Lee

Summary:

Modern high-throughput DNA sequencers typically generate short DNA sequences ("reads") of about 100 bases in length. Because a 100-base read is relatively short, a commonly used technique for dealing with longer DNA sequences is to use biochemical manipulations to generate much longer fragments of predetermined length and then to sequence the 100 bases at each end. The DNA sequencer records these "paired-end reads" in two files, with the two reads of each pair located at corresponding positions in the two files.

This "paired file comparison" program compares two such files to validate that they do indeed contain paired-end reads. It compares the metadata of the sequences at corresponding positions in each file. Because both files are validated from start to end, the program ensures that the files correspond to each other and that no data have been inserted or deleted. This process of validating metadata also serves as a reasonable test of data integrity in general, as any data corruption that affects the metadata will be detected.
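The position-by-position metadata comparison described above can be illustrated with a minimal Python sketch. It is not the program's own code, and the matching rule is an assumption: it treats each FASTQ record as four lines, takes the first line (starting with "@") as the metadata, and considers two reads paired when their identifiers agree after dropping a trailing "/1" or "/2" suffix or anything after the first whitespace. The function names read_id, headers and first_mismatch are hypothetical.

```python
import io
from itertools import zip_longest


def read_id(header: str) -> str:
    """Return the part of a FASTQ header that both mates are assumed to share."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header: {header!r}")
    # Drop '@', then anything after the first whitespace (e.g. ' 1:N:0:ATCACG').
    fields = header[1:].split()
    name = fields[0] if fields else ""
    # Drop an older-style '/1' or '/2' mate suffix (assumed naming convention).
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def headers(stream):
    """Yield every fourth line (the header) of a FASTQ text stream."""
    for index, line in enumerate(stream):
        if index % 4 == 0:
            yield line.rstrip("\r\n")


def first_mismatch(stream1, stream2):
    """Return the 0-based index of the first unpaired record, or None if all pair up."""
    pairs = zip_longest(headers(stream1), headers(stream2))
    for index, (h1, h2) in enumerate(pairs):
        # A missing header means one file holds more records than the other.
        if h1 is None or h2 is None or read_id(h1) != read_id(h2):
            return index
    return None


if __name__ == "__main__":
    # Two tiny in-memory "files"; the second record differs (read2 vs read3).
    r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n")
    r2 = io.StringIO("@read1/2\nTGCA\n+\nIIII\n@read3/2\nCCAT\n+\nIIII\n")
    print(first_mismatch(r1, r2))
```

Because the two streams are walked in lockstep from start to end, an inserted or deleted record in either file shifts every later header out of alignment and is reported at the first position where the identifiers disagree.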

Please note that the majority of the documentation for the code appears to the right of the code itself.

System requirements:

Operating Systems:
-- Windows 7
-- Linux

Command-line syntax:

Run the application from the Windows or Linux command line with the following syntax:

PairedFileComparison "file specification 1" "file specification 2" number_of_threads (buffer size)

-- "file specification 1" and "file specification 2": the paths of the two FASTQ files.
-- number_of_threads: the number of threads that the program creates and runs in order to validate the metadata more quickly. It is recommended that this number equal the number of cores available on the computer being used.
-- buffer size (optional): the size, in bytes, of the buffered reads of the files.
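
As a hedged illustration of the syntax above, the following Python sketch assembles the argument list for one run and defaults the thread count to the number of cores reported by the operating system. The helper name build_command is hypothetical, and the sketch assumes that the PairedFileComparison executable can be found on the search path.

```python
import os


def build_command(fastq1, fastq2, threads=None, buffer_size=None):
    """Return the argument list for one PairedFileComparison run."""
    if threads is None:
        # os.cpu_count() reports logical cores and may return None.
        threads = os.cpu_count() or 1
    args = ["PairedFileComparison", fastq1, fastq2, str(threads)]
    if buffer_size is not None:
        # Optional fourth argument: size in bytes of the buffered reads.
        args.append(str(buffer_size))
    return args


if __name__ == "__main__":
    # The file names below are placeholders, not files shipped with the project.
    print(build_command("reads_1.fastq", "reads_2.fastq"))
```

The resulting list can be handed to subprocess.run, which passes each element as a separate argument, so paths that contain spaces need no extra quoting.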

Output:

The application writes its results to a .log file in the same folder as the executable file. If the files are paired, the results are also displayed in the console window. If the files are not paired, or if there is an error, the console window closes and the user must check the .log file.

Performance:

On a CPU with multiple cores, performance tends to be limited by disk I/O bandwidth. When running 8 concurrent threads on a 3 GHz CPU, the program takes 8 seconds to validate two 4.06 GB FASTQ files, each containing 26.5 million reads.
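
To put the figures above in terms of I/O bandwidth, the short Python sketch below works out the aggregate rates they imply. It assumes that both files are read in full exactly once and that the 8 seconds is wall-clock time; the function name implied_rates is hypothetical.

```python
def implied_rates(file_size_gb, reads_per_file, elapsed_s, n_files=2):
    """Return (GB read per second, read pairs validated per second)."""
    gb_per_s = n_files * file_size_gb / elapsed_s
    pairs_per_s = reads_per_file / elapsed_s
    return gb_per_s, pairs_per_s


if __name__ == "__main__":
    gb_per_s, pairs_per_s = implied_rates(4.06, 26.5e6, 8)
    print(f"{gb_per_s:.3f} GB/s, {pairs_per_s / 1e6:.2f} million pairs/s")
```

With the numbers quoted above, this comes to roughly 1 GB of file data and about 3.3 million read pairs per second.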

Sample Files:

For sample files, including a sample command line entry and two sample FASTQ files containing metadata, please see the Samples folder under the "Code" tab.

Additional Notes:

 The majority of the documentation for the code 
 appears to the right of the code itself.         //like so

Acknowledgments:

I wrote this program while working in the Department of Physics and Astronomy at Johns Hopkins University. I would like to thank Dr. Richard Wilton for all of his guidance.

Project Admins:

Benjamin Lee

benjaminlee@college.harvard.edu