On 15/07/11 13:27, David Eccles (gringer) wrote:
> Number of contigs: 2
> Total length of contigs: 5383
> Number of contigs >= 500 nt: 2
> Total length of contigs >= 500 nt: 5383
> Number of scaffolds: 1
> Total length of scaffolds: 5652
> Number of scaffolds >= 500 nt: 1
> Total length of scaffolds >= 500: 5652
If I tell Ray to treat that interleaved sequence as single reads, it
still spits out two contigs, despite what I consider to be reasonable
evidence that the sequence that joins the contigs exists in reasonable
numbers in the input data:
$ mpirun -np 10 ../../code/Ray -s
~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta
Number of contigs: 2
Total length of contigs: 5383
Number of contigs >= 500 nt: 2
Total length of contigs >= 500 nt: 5383
Number of scaffolds: 2
Total length of scaffolds: 5383
Number of scaffolds >= 500 nt: 2
Total length of scaffolds >= 500: 5383
$ head -n 3 RayOutput.Contigs.fasta
>contig-6 4084 nucleotides
CTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCC
CTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGA
$ tail -n 3 RayOutput.Contigs.fasta
TTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAG
GATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACT
$ grep 'GTTAACACT[ACTG]CTGGTTATATTGACC'
~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta
| wc -l
120
$ grep 'GTTAACACT[A]CTGGTTATATTGACC'
~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta
| wc -l
75
$ grep 'GTTAACACT[C]CTGGTTATATTGACC'
~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta
| wc -l
0
$ grep 'GTTAACACT[T]CTGGTTATATTGACC'
~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta
| wc -l
0
$ grep 'GTTAACACT[G]CTGGTTATATTGACC'
~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta
| wc -l
45
Ah, of course... it's a bubble. So if I remove that...
$ perl -pe 's/GTTAACACT[ACTG]CTGGTTATATTGACC/GTTAACACTACTGGTTATATTGACC/'
linearised_head100k_interleaved_noN_phiX_region1101.fasta >
debubbled_linearised_head100k_interleaved_noN_phiX_region1101.fasta
$ mpirun -np 10 ../../code/Ray -s
~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/debubbled_linearised_head100k_interleaved_noN_phiX_region1101.fasta
...
Rank 1 is extending seeds [0/0] (completed)
Rank 1 extended 0 seeds out of 0 (-nan%)
Rank 5 is extending seeds [0/0] (completed)
Rank 5 extended 0 seeds out of 0 (-nan%)
Rank 8 is extending seeds [0/0] (completed)
Rank 8 extended 0 seeds out of 0 (-nan%)
Rank 2 is extending seeds [0/0] (completed)
Rank 2 extended 0 seeds out of 0 (-nan%)
Rank 0 is extending seeds [0/0] (completed)
Rank 0 extended 0 seeds out of 0 (-nan%)
Rank 9 is extending seeds [1/1]
Rank 9 starts on a seed, length is 1277 [0/1]
Rank 9 reached 0 vertices (GTGTTAACAGTCGGGAGAGGA) from seed 1
Rank 4 is extending seeds [0/0] (completed)
Rank 4 extended 0 seeds out of 0 (-nan%)
Rank 3 is extending seeds [0/0] (completed)
Rank 3 extended 0 seeds out of 0 (-nan%)
Rank 7 is extending seeds [0/0] (completed)
Rank 7 extended 0 seeds out of 0 (-nan%)
Rank 6 is extending seeds [1/1]
Rank 6 starts on a seed, length is 4063 [0/1]
Rank 6 reached 0 vertices (GCAGGTTGGATACGCCAATCA) from seed 1
Rank 9 reached 1277 vertices from seed 1 (changing direction)
Rank 9 reached 0 vertices (AGTTTTATCGCTTCCATGACG) from seed 1
Rank 6 reached 4063 vertices from seed 1 (changing direction)
Rank 6 reached 0 vertices (AGTTTTATCGCTTCCATGACG) from seed 1
Rank 9 reached 5364 vertices (TGATTGGCGTATCCAACCTGC) from seed 1
Rank 9 (extension done)
Rank 9 is extending seeds [1/1] (completed)
Rank 9 extended 1 seeds out of 1 (100.00%)
Rank 6 reached 5364 vertices (TGATTGGCGTATCCAACCTGC) from seed 1
Rank 6 (extension done)
Rank 6 is extending seeds [1/1] (completed)
Rank 6 extended 1 seeds out of 1 (100.00%)
...
Number of contigs: 1
Total length of contigs: 5384
Number of contigs >= 500 nt: 1
Total length of contigs >= 500 nt: 5384
Number of scaffolds: 1
Total length of scaffolds: 5384
Number of scaffolds >= 500 nt: 1
Total length of scaffolds >= 500: 5384
And this time, the contig appears to match the genome
(http://www.ncbi.nlm.nih.gov/nuccore/HM753704.1 -- slightly different
from the other phiX sequence. There are a few base errors, which means
that a simple grep doesn't work properly:
$ diffseq RayOutput.Scaffolds.fasta phiX174_isolate.fasta -wordsize 10
-outfile report.txt -aoutfeat afeat.txt -boutfeat bfeat.txt
$ cat report.txt
...
#=======================================
#
# Sequence: scaffold-0 from: 1 to: 5384
# HitCount: 11
#
# Compare: HM753704.1 from: 1 to: 5386
#
# scaffold-0 overlap starts at 1
# HM753704.1 overlap starts at 2
#
#
#=======================================
...
#---------------------------------------
#
# Overlap_end: 5384 in scaffold-0
# Overlap_end: 5385 in HM753704.1
#
# SNP_count: 10
# Transitions: 8
# Transversions: 2
#
#---------------------------------------
...
-- David
|