[Sending this bug report here because I couldn't find the usual reporting place on the SourceForge
project page.]
I'm using the HTSeq module for GFF file reading to read through the first stop codon (a preliminary
Python script is attached), and came across an error right at the end of my file:
$ ./readthrough.py > ./protein_readthrough_S288C_R64-1-1_20110203.fa
Features read: 16405Traceback (most recent call last):
File "./readthrough.py", line 41, in <module>
for feature in gtfFile:
File "/usr/local/lib/python2.7/dist-packages/HTSeq/__init__.py", line 214, in __iter__
strand, frame, attributeStr ) = line.split( "\t", 8 )
ValueError: need more than 1 value to unpack
Looking at the GFF file, I remembered that it has an extra section at the end:
$ tail -n +16423 /home/bioinf/illuminadata/Genome/scer/S288C_reference_genome_R64-1-1_20110203/saccharomyces_cerevisiae_R64-1-1_20110208.gff | head
2-micron SGD CDS 3271 3816 . + 0
Parent=R0030W;Name=R0030W;gene=RAF1;Alias=RAF1;Ontology_term=GO:0003674,GO:0005634,GO:0006276;Note=Anti-repressor%20that%20increases%202%20micron%20plasmid%20copy%20number%20by%20relieving%20repression%20of%20the%20FLP1%20site-specific%20recombinase%20caused%20by%20the%20Rep1-Rep2p%20trascription%20regulator%3B%20also%20itself%20repressed%20by%20the%20Rep1p-Rep2p%20complex;dbxref=SGD:S000029674;orf_classification=Verified
2-micron SGD gene 5308 6198 . - .
ID=R0040C;Name=R0040C;gene=REP2;Alias=REP2;Ontology_term=GO:0003674,GO:0005634,GO:0030543;Note=Master%20regulator%20that%20acts%20in%20concert%20with%20Rep1p%20to%20regulate%20transcript%20levels%20of%20the%20FLP1%20gene%20that%20promotes%20plasmid%20copy%20amplification%3B%20also%20autoregulates%20levels%20of%20its%20own%20transcript;dbxref=SGD:S000029676;orf_classification=Verified
2-micron SGD CDS 5308 6198 . - 0
Parent=R0040C;Name=R0040C;gene=REP2;Alias=REP2;Ontology_term=GO:0003674,GO:0005634,GO:0030543;Note=Master%20regulator%20that%20acts%20in%20concert%20with%20Rep1p%20to%20regulate%20transcript%20levels%20of%20the%20FLP1%20gene%20that%20promotes%20plasmid%20copy%20amplification%3B%20also%20autoregulates%20levels%20of%20its%20own%20transcript;dbxref=SGD:S000029676;orf_classification=Verified
###
##FASTA
>chrI
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
HTSeq seems to be tripping over the FASTA section (or more accurately, over the sequence lines that
follow the ##FASTA directive): those lines contain no tabs, so the split yields a single field.
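The failure can be reproduced in isolation. The snippet below is only a sketch that mirrors the split shown in the traceback; the variable names other than attributeStr are made up for illustration and are not taken from HTSeq:

```python
# A FASTA header line such as the one in the tail output above has no tabs.
fasta_line = ">chrI"

# Same split as in the traceback: at most 8 splits, so at most 9 fields.
fields = fasta_line.split("\t", 8)
print(len(fields))

unpack_failed = False
try:
    # Unpacking into nine names needs nine fields; here there is only one.
    (seqname, source, feature, start, end,
     score, strand, frame, attributeStr) = fields
except ValueError as err:
    unpack_failed = True
    print("unpack failed: %s" % err)
```

The exact wording of the ValueError message differs between Python versions, but the cause is the same.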
I've attached a patch to fix this, which just identifies the three-hash directive in GFF3 files that
indicates records are all done and breaks out of the iteration loop after that is seen. There's
information about other directives here:
http://www.sequenceontology.org/gff3.shtml
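To illustrate the general idea, here is a minimal sketch of a reader loop that stops before the embedded sequence. It is not the attached patch and not HTSeq's actual code; the function name iter_gff_records is hypothetical. The sketch also assumes that the ##FASTA directive is the cut-off to use, since the GFF3 specification describes it as marking the start of the sequence section, whereas ### may also appear earlier in some files:

```python
def iter_gff_records(lines):
    """Yield the nine tab-separated fields of each GFF3 feature line.

    Hypothetical helper for illustration only; stops at the ##FASTA directive.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            # Everything after this directive is sequence data, not features.
            break
        if not line or line.startswith("#"):
            # Skip blank lines, comments and other directives (including ###).
            continue
        fields = line.split("\t", 8)
        if len(fields) != 9:
            raise ValueError("malformed GFF line: %r" % line)
        yield fields


sample = [
    "##gff-version 3\n",
    "2-micron\tSGD\tgene\t5308\t6198\t.\t-\t.\tID=R0040C\n",
    "###\n",
    "##FASTA\n",
    ">chrI\n",
    "CCACACCACACCCACACACC\n",
]
records = list(iter_gff_records(sample))
print(len(records))
```

With this kind of check the sequence lines are never passed to the nine-column split.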
Hope this helps,
- David Eccles (gringer)