Menu

#28 KeyError in DNA.DNA.reversecomplement()

v1.0_(example)
closed-fixed
ATAC (6)
5
2015-05-11
2015-03-06
Dan Bolser
No

I'm getting many errors of this form in my results:

Start trimming for bp one-to-one-ness
trimMatchOverlapsInX:

posgaps, #abuts, #overlaps, #contained, bp_trimmed= 365858 2728 0 0 0

trimMatchOverlapsInY:

posgaps, #abuts, #overlaps, #contained, bp_trimmed= 365663 2677 0 0 0

Finished trimming for bp one-to-one-ness
At formPerfectRuns, step=4
Time elapsed=402.797996044
from /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/work/seq1vsseq2.matches.extended.uniq.t28.trim making /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/wor
k/seq1vsseq2.matches.extended.uniq.t28.trim.p6
formPerfectRuns step=0
formPerfectRuns step=1
formPerfectRuns step=2
formPerfectRuns step=3
At onlyKeepLongRuns, step=5
Time elapsed=516.475046158
from /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/work/seq1vsseq2.matches.extended.uniq.t28.trim.p6 making /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/
work/seq1vsseq2.matches.extended.uniq.t28.trim.p6.l100
At formPerfectRuns, step=6
Time elapsed=524.731652021
Heal the perfect runs
formPerfectRuns step=0
formPerfectRuns step=1
formPerfectRuns step=2
formPerfectRuns step=3
At squeezeIntraRunGaps, step=7
Time elapsed=635.825762987
from /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/work/seq1vsseq2.matches.extended.uniq.t28.trim.p6.l100.pr to /nfs/nobackup/ensemblgenomes/dbolser/ATAC/1
4/2/work/seq1vsseq2.matches.extended.uniq.t28.trim.p6.l100.pr.sq
KeyError in DNA.DNA.reversecomplement()
The query 0 seq2:128728 15968 98
GATCGTGAGC
TGGGCTGCTACATCCACCTACCTGGTTTCTCGGATTCCGGTAGAATCAACCACCCGATCT
ATCGACACAATTTCCCCCGCCGGTGC
KeyError in DNA.DNA.reversecomplement()
The query 0 seq2:128728 13916 1998
TCCTGCCATCCTGCTGGTCTTGGTCTCGTTGCACCGATATGGAA
ACATTTGCCTAATGCCTCGGGACTCCTTGCCTGTGCTTGCCTCTTTAGCACCAAAGAGGA
AACAAGTACGCTGCACGCTCTGGCACCCGCCTGGTACCCGCCTGGCCTTGGTCGTCATGG
CTTGCATCACAGGAACCTCGTGAGGTGCCCCCTGCCTTGATCTCTCCGCTCCTCGCGAGC
TAGCCTAGCGGGGCTGCCCCCGTGGAGGTCTTGCGTTGTCCGCCTCATGAGGCTTGGCCC
CTCACGAGGGTCTTGAGTTTTCGCTGATGAAGATGGGCCATACATGGCTGCTGGTGGAGC
CACGCCGTGGGCCGCAGGCAGGCAAGTCTGGGGACCCCCGTTCCCAGGACGCCGACAGTA
GCCCCCGGGCCCAAGGCGCGCTCGGGCTTGGCTTCGAGGCGAAGCCAAGGGGCAAGTGTG
GAGCGCCGCGGGCCCGAACAGCCTGCGACCTCGGTTGACGCATGGCGATTGATGGGATAT
GGGCGTCTCCGCTTCCCCACGCTGCCTCGGCAACTGCCCGACTTGACGAGTCCCTGCTGC
A

Scaffold186503 spam:spam spam:GCA_000347335.1
TGTCAGAGGCATACACTGTAGTACCCAACATGCGGAGTACCATCATTGTTCGACTTCACG
AAGGGGCGAAGAGCCAAGAACAGTAGTACGAGTAGTACTACTACTAGAACCGCATGGTGG
AGCGTCTGTACTCTTATTTTTTTATCTCTGATAAAGTACACGTTTATTCCGGAGAAGGGC
GGAAAACAGCAACCCAACATGTAATGAATGAATCGATCAACCAACCATGCTCGAATATAT
CTTGGCCTCCGTGCGAGCCCGTCTGTGTGTTCTC
C141205808 spam:spam spam:GCA_000347335.1
GCCAGCCGGTCGCCCCGCCCCTCAGGTCTCTCCTGCCCTGGGCGCCGGATCTGGCCCGAG
CTCGCCGGGAAAGTAGCCGGAGGCGAAGGAGGTGTGTGTGAGAGAGAGAGGTGGGTCATG
GGAGAGGAGAGGCACGGCGAGTTGGAGCAGAGAAAGGGAGGAAGGCAGCGCCGGAGAAAG
GGCTCGCCGGTCAGCAGCGTACTGGGTGGCGCAGAAGCTTCGAGCGCGAGCTCTCCCAAT
GTACGACGCAGCGCACGGCCTCTTCGTCGGCACTGTTGATGTCGGGGGAACCCTCGCTGA
TGGATCCTTCCAGCCGGCGGCAAGCTATCGCCTTCCGCGGTGAGGCATTTGGGGGTTAAG
CACCGTTCGTAGATGCTTTTGCAAAGCCATTAATGTTTCCGAAACATAGGTCTTGTTGCA
ATAGTCAACAATGTTTTTAAAACAAAGCTCTTGTTTGCAAAGAGGGATCATGCACGGAGA
GATTTGCAACGTCATCAATGTTTCTGAAACAAAGCTCTTGTTGCAGTAGTCACCGATATG
TTTCCGAAACAAAGCTCTTGTTTCAATAGTCATCAACGTGTTTCTGAAACAAAGCTCTTG
TTGCAATAGTCACCGACGTGTTTCTGAAACAAAGATCTTGTTGCAATAGTCAATGCTCTT
GTTGTAAAGAGCAATCGTACGGTTAGCAACCTATATCGGACGGCTCGCGAGGCGGCGGAT
CTTCTGAAAAGATCAACCAGCCGACGCGTAGCACTCCCCATAAATCAAACCATGTATAAT
AAACCGAATTAAAAGAGGTAGAGGAGTGGGTATTGTTTCTTAACTAGGGACAAAAAGAAA
GGCTTCAGAGAGGAAACATCTCCCAAAGGCCCATTAGATAATAGCCCAAATCAACTGAAA
AAAGAGAACGGACCAATCGAGCAGCCGCGGCCCATACAAGAGGTGGCCCGGGGAGAGGCC
GGGTGCCCTGTCTCCTCCGAGGCTTCGGATAGAGCGAGTCTTCTTCTTCCTCCCTGC
KeyError in DNA.DNA.reversecomplement()
The query 0 seq2:128728 13439 360
TCGAACACTGGGGTGCGCATGAAGATCTCTCCC
CCTCTAGCTCACGCACTCACCGTGATCTCAAAGCTCATTGCGTCGAACTCGCAGAACAAG
GGACACAGGGTTTATACTGGTTCGGGCCACCGATGTGGTGTAATACCCTACTCCAGTGTG
GTGTGGTGGATTGCCTCTTGGGCTGAGGATGAACGAGTACAAGGGAAGAACAGCCTCCTG
AGGAGAGGTGTCCTTGAGCTCGGTGAGCTTGTGTGGTTGAGGATGATCTGAATGGCCCAA
GATCTGTTGCCTCCTACTGTGGTGGCTAGTCCTATTTATAGAGGCCCTGGTCCTCTTCCC
AAATGTAGGCGGGAAGGATCC
KeyError in DNA.DNA.reversecomplement()
The query 0 seq2:128728 13221 96
ACCATCA
CCACGTCATCGTGCTACCGGAACTCATCCACTACCTCGCCATCTTTCTAGATCAAGAAGG
CGGAGGACGTCATCAAGCTGAACGTGT
KeyError in DNA.DNA.reversecomplement()
...
<many lines!="">
...
KeyError in DNA.DNA.reversecomplement()
The query 0 seq1:299055 915 960
ACGGTGTACGACCCCGGATATGAAATCTATCACCAGT
CCGGATTCGGCTGTTTTAACAGGTTTGGCTTCTATCCATTTGGTGAATTTATCCACCATG
ACCAGTAAGTATTTTTTTCTTGTGGGTTCCCCCTTTAAGGGGTACAACCATGTCAAGCTC
CTAGACCGCGAAGGGCCATGTAACGGGGATTGTTTGGAGGGCGGTGGGTGGCGTGTGGCT
TTGATTTGCAAAAAGCTGGCAACCGGCGCAGCGTTCGACCAAGTCCTGTGCGTCTGCCCG
GGCCGTCGGCCAATAGAATCCTGTACGGAAGGCCTTGCTTACAACAGCCCGGGCTGCGGC
GTGGTGGCCGCCGAGTCCGGCGTGAATTTCTGCCAAAAGGTTCCGCCCTTCCTCTTCGGA
GATGCACCTTTGAAGAACATCGGTAGTGCTTTTTTTTTACAGTTCTCCCTCATGGACCCT
GTAGGCCTTAGTTCGCCGCACTATGCGGCGTGCCTCATTTTGGTCCTCCGGAAGTTCCTG
CCTAGTTAGGTAGGCCAAGAATGGTTCTGTCCACGGGGCTATGATGGCCATTATTACGTG
GGCTGAAGGTGTTGTTTCGGTGGCACAGCCTCCAATTATGTCAGAGTGTTCGGTATCGAG
TGTTGGGTACGGACTAATATTGCCGGTGTCCCCCTCCCATACCACGGATGGCTTAAACAA
CCTCTCTAAAAAGATGTTGGGGGGACTGCCTCGCGTTTTGCGCCAATGCGGCCGAGGATA
TCCGCCGCCTGATTGTTTTCCCGAGCCACATGGTGAAATTCGAGCCCCTCAAACCGAGCT
GACATCGTTAGGACGGTGTTACGATAGGCCACCATCTTCGGATCCTTGGCATCAAAGTCT
CCATTTATTTGGGATATTGTGAGGTTCGAATCCCCACGCACCTCCAGGCGCTGAATGCCC
ATGGAAA
countLines 377650 inter_run_gap_count 31315
At TrimMatchOverlaps, step=11
Time elapsed=918.060847998
trim the overlaps
trimMatchOverlapsInX:

posgaps, #abuts, #overlaps, #contained, bp_trimmed= 365055 6280 86 21 1022

trimMatchOverlapsInY:

posgaps, #abuts, #overlaps, #contained, bp_trimmed= 364516 6539 104 21 1176

Ran in 1178 seconds.

/nfs/production/panda/ensemblgenomes/development/grabmuel/atac/kmer/atac-driver/correctGaps -m /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/work/seq1vsse
q2.matches.extended.chained.atac -l /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/work/seq1vsseq2.matches.extended.chained.gapsclosed.log > /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/work/seq1vsseq2.matches.extended.chained.gapsclosed.atac

cd /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2 && /nfs/production/panda/ensemblgenomes/development/grabmuel/atac/kmer/atac-driver/clumpMaker -c 5000 -1 -f /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/seq1vsseq2.atac > /nfs/nobackup/ensemblgenomes/dbolser/ATAC/14/2/seq1vsseq2.ref=seq1.clumpCost=5000.atac

1 load the matches
2 sort the matches
3 score the matches
4 mark clumps
5 sort the matches
6 output matches with clumps
...

grep -c "^KeyError in DNA.DNA.reversecomplement" Scratch/14/2/bsub.err
300440

Discussion

  • Dan Bolser

    Dan Bolser - 2015-03-06

    Sorry, can't seem to edit to fix formatting...

     
  • Brian Walenz

    Brian Walenz - 2015-03-06

    The formatting is messed up by the #'s at the start of the line. :-(

    I think this is generally harmless but it might be preventing some local alignment between matches. It's caused by the input fasta not having sequence all on one line.

    You can create fasta like this with 'leaff -f in.fasta -W > out.fasta'. leaff is part of the kmer package.

    It's been years since I've run this pipeline, and I'm fuzzy on the details. You should be able to rescue some of the compute - the kmers and 'seatac' output won't change, only the outputs of the currently running python code. Remove those, and restart, giving it the one-line fasta instead.

     
  • Brian Walenz

    Brian Walenz - 2015-03-06
    • assigned_to: Brian Walenz
     
  • Dan Bolser

    Dan Bolser - 2015-03-06

    Ah, I remember this now. I thought the pipeline did this itself, but obviously not. There are some large regions between blocks (within runs) that are not aligned that we can't understand. I'll try to see if this is it.

    Do you do WGA using another tool?

    Many thanks,
    Dan.

     
  • Brian Walenz

    Brian Walenz - 2015-03-06

    I've been fortunate in avoiding that type of analysis directly. ;-)

    We've used nucmer and atac in the past. Nucmer is slow, but will find all the blocks, where atac is fast, but only finds the 1-to-1 blocks (and duplicates of some unknown size if they're bordered by 1-to-1 blocks).

     
  • Dan Bolser

    Dan Bolser - 2015-03-06

    Yeah, the 1-to-1 requirement is 'breaking' some of our alignments, so I'm trying the merlimit option to see if we can work around it. Unfortunately, any setting of merlimit greater than 1 seems to degrade the results in all but smallest values of mersize :-/

    Thanks for speedy replies... I think we may end up with a hybrid 'atac first pass + LASTZ' type pipeline ...

    Cheers,
    Dan.

     
  • Dan Bolser

    Dan Bolser - 2015-03-19

    Putting the sequence 'all on one line' in the input fasta fixed the issue we were seeing with "large regions between blocks (within runs) that are not aligned that we can't understand".

    You can close this ticket now.

    Cheers,
    Dan.

     
  • Brian Walenz

    Brian Walenz - 2015-05-11

    Revision 2001 fixes this issue by rewriting the input fasta files instead of symlinking to them.

     
  • Brian Walenz

    Brian Walenz - 2015-05-11
    • status: open --> closed-fixed
     

Log in to post a comment.