
Sample Results


CloudBurst has several parameters to control the sensitivity of the alignment algorithm. Here it finds the unambiguous best alignment for 100,000 reads allowing up to 3 mismatches when mapping to the corresponding Streptococcus suis genome. Sample data is available for download at: http://sourceforge.net/projects/cloudburst-bio/files/cloudburst-data/CloudBurst-sample-data/

The small data set used here is *considerably* smaller than those needed for human resequencing projects, but it is helpful for checking that your environment is running correctly. The running time of CloudBurst scales linearly with the number of reads and with the size of the reference, but non-linearly with the number of allowed errors. Furthermore, because this data set contains so few reads, a larger share of the running time goes to overhead rather than to computing actual alignments. For real evaluations of CloudBurst, you should use a larger reference (ideally at least a chromosome of the human genome) and several million reads to capture the full complexity of the problem.


Sample input data

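Download from http://sourceforge.net/projects/cloudburst-bio/files/cloudburst-data/CloudBurst-sample-data/

  • s_suis.fa: Streptococcus suis reference genome sequence
  • 100k.fa: the 100,000 reads to map to the reference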

Format the input data

Note: you can skip this step and use the prepackaged s_suis.br and 100k.br files

$ java -jar ConvertFastaForCloud.jar s_suis.fa s_suis.br
$ java -jar ConvertFastaForCloud.jar 100k.fa 100k.br

  • s_suis.br: reference genome in CloudBurst binary format
  • 100k.br: reads in CloudBurst binary format
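
Before copying anything into the cloud, it can be worth confirming that both binary files exist and are non-empty, since a failed conversion or download would otherwise only show up later in the Hadoop run. The following is a minimal sketch; the helper name check_inputs is made up for illustration and is not part of CloudBurst.

```shell
# check_inputs FILE...: succeed only if every listed file exists and is non-empty.
# Hypothetical helper for illustration; not shipped with CloudBurst.
check_inputs() {
  rc=0
  for f in "$@"; do
    # -s is true only for an existing file with size greater than zero
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

For example, run `check_inputs s_suis.br 100k.br && echo "inputs look ok"` from the directory that holds the two files.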


Sample Run

1. Copy the data files into the cloud

$ hadoop fs -mkdir /data/cloudburst
$ hadoop fs -put s_suis.br /data/cloudburst
$ hadoop fs -put 100k.br /data/cloudburst


2. Run CloudBurst: takes about 3 minutes on 24 cores

$ hadoop jar CloudBurst.jar /data/cloudburst/s_suis.br \
  /data/cloudburst/100k.br /data/results \
  36 36 3 0 1 240 48 24 24 128 16 >& cloudburst.err
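
The long run of positional numbers is easy to mistype, so one option is to give each value a name and print the assembled command for inspection before running it. The sketch below does only that. Apart from the three paths and the limit of 3 mismatches, which this page states, the variable names and their meanings are assumptions about CloudBurst's argument order and should be checked against the CloudBurst documentation. The function name cloudburst_cmd is hypothetical.

```shell
# Paths as used on this page.
REF=/data/cloudburst/s_suis.br   # reference genome in CloudBurst binary format
QRY=/data/cloudburst/100k.br     # reads in CloudBurst binary format
OUT=/data/results                # output directory in the Hadoop filesystem

# ASSUMED labels for the positional arguments; verify before relying on them.
MIN_READ_LEN=36          # assumed: minimum read length
MAX_READ_LEN=36          # assumed: maximum read length
K=3                      # allowed mismatches (this page says up to 3)
ALLOW_DIFFERENCES=0      # assumed: 0 = mismatches only
FILTER_ALIGNMENTS=1      # assumed: 1 = report only the unambiguous best alignment
NUM_MAPPERS=240          # assumed: map tasks for the alignment job
NUM_REDUCERS=48          # assumed: reduce tasks for the alignment job
NUM_FILTER_MAPPERS=24    # assumed: map tasks for the filtering job
NUM_FILTER_REDUCERS=24   # assumed: reduce tasks for the filtering job
BLOCK_SIZE=128           # assumed: block size
REDUNDANCY=16            # assumed: redundancy

# Print the full command line on one line; this does not run Hadoop.
cloudburst_cmd() {
  echo hadoop jar CloudBurst.jar "$REF" "$QRY" "$OUT" \
    "$MIN_READ_LEN" "$MAX_READ_LEN" "$K" "$ALLOW_DIFFERENCES" \
    "$FILTER_ALIGNMENTS" "$NUM_MAPPERS" "$NUM_REDUCERS" \
    "$NUM_FILTER_MAPPERS" "$NUM_FILTER_REDUCERS" "$BLOCK_SIZE" "$REDUNDANCY"
}
```

Calling cloudburst_cmd prints a command line whose arguments should match the one above value for value. Changing a single named value, such as K, is then less error-prone than editing the raw list.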


3. Copy the raw results to the local filesystem

$ hadoop fs -get /data/results/ results


4. Convert the raw results to a text file, sorting to ensure a consistent order

$ java -jar PrintAlignments.jar results | sort -nk4 > 100k.3.txt


Compare Results

Make sure your version of CloudBurst is running correctly by comparing your 100k.3.txt file to the included version. Both are plain-text files of alignments, so you can compare them with the Unix tool diff. If everything is working correctly, diff produces no output. If there are any differences, check your copy of cloudburst.err to see if Hadoop reported any problems. See Troubleshooting for more information.
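
The comparison can be wrapped in a small function. Because the order that sort produces can depend on the locale settings of the machine, this sketch re-sorts both files under the same fixed locale before calling diff, so that only real differences in the alignments are reported. The function name compare_alignments is made up for illustration, and the location of the included reference file is left as a placeholder.

```shell
# compare_alignments MINE EXPECTED
# Re-sort both alignment files identically, then diff them.
# Returns 0 and prints nothing when the files hold the same lines.
# Hypothetical helper for illustration; not shipped with CloudBurst.
compare_alignments() {
  mine_sorted=$(mktemp) || return 2
  expected_sorted=$(mktemp) || return 2
  # Same sort key as in step 4, with a fixed locale for a reproducible order
  LC_ALL=C sort -nk4 "$1" > "$mine_sorted"
  LC_ALL=C sort -nk4 "$2" > "$expected_sorted"
  diff "$mine_sorted" "$expected_sorted"
  rc=$?
  rm -f "$mine_sorted" "$expected_sorted"
  return "$rc"
}
```

For example, run `compare_alignments 100k.3.txt /path/to/included/100k.3.txt && echo "results match"`, replacing the second path with wherever your copy of the included file lives.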
