From: Henrik L. <Hen...@sl...> - 2012-02-03 09:43:17
|
First, a big thank you to Ole and Nathan for their kind help. I am still not entirely comfortable with perl and regular expressions, but will work through the examples given and see what I can come up with. Second, in the Celera assembly where Cavalidate has marked a huge amount of compressed pairs, Hawkeye reports a sample mean for the insert size of 292.12 and a standard deviation of 138.68 (on the library tab in the main view). Nathan discusses this below. However, in the scaffold view, pairs are marked as compressed or expanded if they are outside the range of 245-585. This seems to be in rather poor agreement with mean insert size and SD reported, and must be why so many pairs are marked as deviating form the norm. This was calculated by Cavalidate, I have not added any info on insert size myself. I get the same bimodal distribution when mapping (using Shrimp2) the Illumina reads on other assemblies, and there is no difference if I use quality filtered reads or not. I have trimmed and filtered the Illumina reads using Trimmomatic with both end trimming and sliding window filtering using a window size of four and a quality value of 15. It seems to me that something went wrong in the lab, although I have contacted the lab and they could see no problems when they did the sequencing. Their quality measures all looked ok. In the end I might stick with my Newbler and Mira assemblies and try to construct high quality scaffolds rather than continuing with the somewhat weird Celera assembly, even if Celera gave me a huge increase in N50. Thanks for all the help, Henrik 2 feb 2012 kl. 23.57 skrev Nathan Watson-Haigh: Hi Henrik, The different file formats contain different info. From my understanding ACE files does explicitly record which reads are from the same template and thus paired/mates. Although Nebler can also output a consed directory structure which has an ACE file and an associated phdball directory. That phdball does contain info that links reads to templates. However, while I think AMOS can read the old-style phd directory, I don’t think it can read the newer phdball directory. In fact my first foray into AMOS has been with the consed output from Newbler. I’ve then done some filtering of reads in the ACE file, which admittedly might have broken something with read pairing etc, but I specify this with a mates file (https://sourceforge.net/apps/mediawiki/amos/index.php?title=Bambus_Manual#The_.mates_file). I have an assembly with 454 shotgun, 3kb and 20kb mates plus illumine paired-ends. Here’s the contents of my mates file (tab delimited): library 3kb 1000 5000 (GK33D5G01|GK33D5G02|GK9RFDF01|GK9RFDF02|GLBLI5G01|GLBLI5G02|GLDJ2JW01|GLKV9FY01).* library 20kb 8000 32000 (F5XZBDV01|F5XZBDV02).* library illimina pe 200 700 (AGRF).* pair (.+?)_left.* (.+?)_right.* NOTE: I don’t have a line to pair up my Illumina reads, only my 454 mate pairs, see below for more info. As per the link above, this specifies 3 mate/pair libraries with a min/max estimate of the size distribution and the last column is the Perl regular expression used to assign the reads to that corresponding library. In my case any read with a prefix of GK33D5G01 or GK33D5G02 etc are assigned to my 3kb library reads with a F5XZBDV01 or F5XZBDV02 prefix are from my 20kb library etc. Now the pair line is used to determine which reads come from the same teamplate. In my case I want all reads that have the same prefix, up to but not including the _left and _right, to be designated as pairs. I’m pretty sure without this, AMOS doesn’t know what reads belong what library or pair. However, the input format might already contain this information, or I may have broken it due to editing my ACE file output from Newbler. I also use the 454Scaffold.txt file output by Newbler to import the scaffold information into the AMOS bank file. I did make a slight edit to the agp2amos script to skip the TLE record if $eid2iid{$ceid} didn’t exist. I can’t remember if that was a side-effect of me edited the raw ACE file ouput by Newbler or not but it made bank-transact fail. Anyway, here’s my bash script for converting my ACE file into an AMOS bank (run from the edit_dir of the consed directory): ace="454Contigs_sub.ace.1" time toAmos -ace $ace -m AH-E.mates -o - | bank-transact -m - -b ${ace%.*}.bnk -c time bank2fasta -i -b ${ace%.*}.bnk > ${ace%.*}.fasta time bank2contig -i -s ${ace%.*}.bnk > ${ace%.*}.sam time ./agp2amos ../../454Scaffolds.txt ${ace%.*}.bnk | bank-transact -m - -b ${ace%.*}.bnk ./amosvalidate ${ace%.*}.bnk I’m also in the midsts of extracting out the 454 and illumine singletons to allow amosvalidate to do the singleton breakpoint analysis. Without running amosvalidate, hawkeye shows a lot of my mates as being stretched/compressed. However, once amsovalidate calculates the insert sizes itself, this drops to being just those outside 2SD’s (~5%). From your plot the only ones outside 2SD’s are those > ~570bp (about 0.8% of your reads. So I’m curious to know why you get compression, which would be pairs smaller than 292.12 – (2*138.68) = 14.76bp. Remember that these insert sizes are estimated from your assembly by measuring the distance between pairs in contigs (and scaffolds?). Thus a poor assembly, with extra bits inserted/deleted in a contig may make this distribution non-normal/bi-modal. I don’t know much about lab side of things but it might be worth checking with the people that made the library to check they did a library validation and to see if there were any anomalies with that? Also, what sort of QC filtering of reads did you do? I did a filter that all positions in the read needed a QV of >=20. I think I’d persist with the Newbler (or other) assembly to see if this is actually an artefact of the Celera assembly process. This is a small subset of my illumina paired-ends from a Newbler assembly: <image003.png> Although due to the large size of my genome, the amount of total data I have, and memory constraints I opted not to pair up my Illumina reads in my AMOS bank file and thus they show up as just single end reads. This way I’m only looking at my 3kb and 20kb mates in a scaffold context. When I get a memory upgrade I may opt to redo the pairing of my illumina reads. I’d do this by adding another line to my mates file like this (assuming they have the /1 and /2 suffix to denote the pairs): pair (.+?)\/1.* (.+?)\/2.* Hope this helps, Nathan <image9ba15c.JPG> Nathan Watson-Haigh Senior Bioinformatician | The Australian Wine Research Institute Waite Precinct, Hartley Grove cnr Paratoo Road, Urrbrae (Adelaide) SA 5064 | Map<http://www.awri.com.au/contact/map.asp> PO Box 197, Glen Osmond SA 5064, Australia •+61 8 83136836 (direct) |Ê+61 8 83136601 | • www.awri.com.au<http://www.awri.com.au/> | AWRI Events<http://www.awri.com.au/events/calendar/> This communication, including attachments, is intended only for the addressee(s) and contains information which might be confidential and/or the copyright of The Australian Wine Research Institute (AWRI) or a third party. If you are not the intended recipient of this communication please immediately delete and destroy all copies and contact the sender. If you are the intended recipient of this communication you should not copy, disclose or distribute any of the information contained herein without the consent of the AWRI and the sender. Any views expressed in this communication are those of the individual sender except where the sender specifically states them to be the views of the AWRI. No representation is made that this communication, including attachments, is free of viruses. Virus scanning is recommended and is the responsibility of the recipient. From: Henrik Lantz [mailto:Hen...@sl...] Sent: Friday, 3 February 2012 1:00 AM To: Nathan Watson-Haigh Cc: AMO...@li...<mailto:AMO...@li...> Subject: Re: [AMOS-help] Amosvalidate on mapped 454 reads Hi Nathan, thanks for the reply. The insert distribution size is bimodal, very far off from a normal distribution (I attach a screenshot). I get the same results when mapping the reads on other assemblies (from Newbler or Mira), so it seems something went wrong in the lab rather than something being wrong with the Celera assembly (or something is wrong with all assemblies, and that sounds unlikely). I have received some help from Michael Schatz who said that Celera is good at handling strange distributions like this, and that the Celera assembly does not have to be negatively affected by it, which was a worry for me. Also, we have 15x 454 single end reads, and 10x 454 3-5 kb mate-pairs too, so we are not dependent on the Illumina data alone. It was mostly intended to correct for homopolymers in our 454 data. Nevertheless, I still want to find regions that might be misassembled. Celera gives me a contig size that is four times bigger than either Mira or Newbler, and I can't help but wonder if Celera is assembling over repeats that should not be assembled. I must admit I have problems understanding how to include the regular expressions you mention, and can't find the necessary "mates" file. CaValidate just used the .asm file from Celera, I have not supplied any estimated insert sizes at any stage. Same goes for my try to use Amosvalidate on the Newbler assembly, I just used the .ace file without any extra information. If you could point me to some information about this mate file, I would be very grateful. I just realized that I can color reads from different libraries differently in Hawkeye, and this helps me spot the 454 reads among the Illumina ones when I look at the Celera assembly, but it is still messy. Still haven't solved this, so all comments are very welcome. /Henrik <image001.png> 1 feb 2012 kl. 22.20 skrev Nathan Watson-Haigh: > I'm new to AMOS, but i hope i can be some help. > > AMOS uses a "mates" file to match up the pairs/mates and assign reads to libraries. you also supply some info about the expected insert size distribution for each of these libraries. one of the first things AMOSvalidate does is to take those pairs and reestimate the insert size distributions. without this step hawkeye uses the estimates you provided to determine mate "happiness", stretch and compression. so if your estimates were off and you don't run Amosvalidate then you could get a lot of unhappy mates. > > I've found it easiest to supply regular expressions for my mates file in order to match up the pairs. > > A quick question about the odd distribution: if this isn't reasonably normally distributed then could the assembler have made some mistakes because it too would assume normality. How far from normal is it!? > > Cheers, > Nathan > > Sent from my Android phone. > > Henrik Lantz <Hen...@sl...<mailto:Hen...@sl...>> wrote: > > > Hi > > We have the genome of a fungus that was assembled de novo using Illumina PE, 454 mate-pair, and 454 single-end reads with Celera Assembler. I ran the assembly though Cavalidate and it turns out that our Illumina paired ends have a funny insert-size distribution, and Cavalidate marks a huge numbers of these as stretched or compressed. This makes it very hard for me to identify regions that truly are mis-assembled repeats, as everything is marked, and this hides any truly problematic regions. If I could restrict the view to just the 454 reads my problem would be solved, but this seems impossible to do. > > I therefore thought to just map/assemble the 454 mate-pairs, which have a normal distributed insert size, on the Celera assembly, in the hope that this would make it easier to spot the truly misassembled regions. My idea was to use Newbler to map the 454 mate-pairs using the Celera contig-file as a reference => convert the .ace file to amos-format using toAmos => create a .bnk using this file and bank-transact => populate the .bnk with features using amosvalidate => view the result in Hawkeye. > > However, when I do this, the mate pair information seems to have been lost at some stage, as all reads are reported as single reads in Hawkeye. > > I am doing something wrong, or is this not possible to do? Is there any other way for me to get what I want, which is a visualization of my 454 mate-pairs mapped/assembled and compressed/stretched regions marked? > > Any suggestions are very welcome. > > /Henrik > ------------------------------------------------------------------------------ > Keep Your Developer Skills Current with LearnDevNow! > The most comprehensive online learning library for Microsoft developers > is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, > Metro Style Apps, more. Free future releases when you subscribe now! > http://p.sf.net/sfu/learndevnow-d2d > > > Nathan Watson-Haigh > Senior Bioinformatician | The Australian Wine Research Institute > Waite Precinct, Hartley Grove cnr Paratoo Road, Urrbrae (Adelaide) SA 5064 | Map > PO Box 197, Glen Osmond SA 5064, Australia > T: +61 8 83136836 (direct) | F: +61 8 83136601 | > www: www.awri.com.au<http://www.awri.com.au> | AWRI Events > > This communication, including attachments, is intended only for the addressee(s) and contains information which might be confidential and/or the copyright of The Australian Wine Research Institute (AWRI) or a third party. If you are not the intended recipient of this communication please immediately delete and destroy all copies and contact the sender. If you are the intended recipient of this communication you should not copy, disclose or distribute any of the information contained herein without the consent of the AWRI and the sender. Any views expressed in this communication are those of the individual sender except where the sender specifically states them to be the views of the AWRI. No representation is made that this communication, including attachments, is free of viruses. Virus scanning is recommended and is the responsibility of the recipient. > > _______________________________________________ > AMOS-help mailing list > AMO...@li...<mailto:AMO...@li...> > https://lists.sourceforge.net/lists/listinfo/amos-help |