From: Arnaud K. <ax...@sa...> - 2004-08-05 12:49:09
Hi,

I want to load BLAST results into GUS. Before running the LoadBlastSimFast module, I want to load the UniProt and EMBL databases into DoTS::ExternalAASequence and DoTS::ExternalNASequence: just the IDs, not the sequences themselves.

I need some help using the GUS::Common::Plugin::InsertNewExternalSequences plugin.

I have the sequences in FASTA format. What regexp do I need? Is --regex_name the only required parameter? I don't know the taxon attached to the sequence entries. Does it matter if I don't give the taxon_id parameter?

Regarding the LoadBlastSimFast module: it seems to parse output in a specific format. Does it require the generateBlastSimilarity.pl script? Where can I get this script?

Cheers,
Arnaud

-- 
Arnaud Kerhornou
The Wellcome Trust Sanger Institute
The Pathogen Sequencing Unit
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
Work: +44 (0) 1223 494955
Fax: +44 (0) 1223 494919
From: Pablo N. M. <pa...@pa...> - 2004-08-05 15:43:57
Attachments:
parseBlastFilesForSimilarity.pl
Hi Arnaud,

I don't know if I'll be able to answer all your questions, but I'll try to answer some, to the extent of my knowledge of GUS so far.

> I have the sequences in FASTA, what regexp shall I need?
> Is only --regex_name parameter required?
> Does it matter if I don't give taxon_id parameter ?

I've run the plugin with these parameters:

    [pablo@mkiwi mcl]$ ga GUS::Common::Plugin::InsertNewExternalSequences \
        --external_database_release_id=38 \
        --regex_source_id='(.*)' \
        --table_name=DoTS::ExternalAASequence \
        --sequencefile=tbrucei \
        --commit

The regex_* parameters, as you probably noted, are the regular expressions used to extract the corresponding fields from the FASTA header (name, source_id, secondary_id, etc.). I've only used regex_source_id. It also doesn't seem to matter if you omit the taxon_id parameter, but then the associations between sequences and taxa in DoTS::AASequenceTaxon obviously won't be created.

> Does LoadBlastSimFast module require generateBlastSimilarity.pl script?
> Where can I get this script ?

This module reads similarity results in a specific format, like:

    >479679 (3 subjects)
      Sum: 479680:1871:1.2e-194:1:353:1:353:1:353:353:353:0:
       HSP1: 479680:353:353:353:1871:1.2e-194:1:353:1:353:0:
      Sum: 488460:1826:7.0e-190:1:353:1:353:1:353:342:348:0:
       HSP1: 488460:342:348:353:1826:7.0e-190:1:353:1:353:0:
    >479680 (3 subjects)
      Sum: 479679:1871:1.2e-194:1:353:1:353:1:353:353:353:0:
       HSP1: 479679:353:353:353:1871:1.2e-194:1:353:1:353:0:

The script parseBlastFilesForSimilarity.pl (attached) will do the trick. I don't know whether there are multiple versions of this script traveling around the list.

    [jdai@headnode mclorth]$ ls /scratch/jdai/Cpgus_vs_Pfgus/ | perl parseBlastFilesForSimilarity.pl \
        --regex='(\S+)' \
        --outputFile=Lm_vs_Lm_parsed \
        --dir=/scratch/jdai/Cpgus_vs_Pfgus/

Hope this is useful,
Pablo

-- 
-----------------------------
Pablo Nascimento Mendes
CTEGD EMF TIPS Fellow
Kissinger Lab
Department of Genetics
University of Georgia
C210 Life Sciences Bldg.
Athens, Georgia 30602
Phone: 706 542-1447
E-mail: pa...@ug...
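To illustrate how a regex_* option can pick a field out of a FASTA header, here is a minimal Python sketch. It is not the plugin's code: the function name `extract_ids`, the sample headers, and the choice to match against the header with the leading ">" removed are all assumptions made for illustration. The plugin itself may well see the ">" too, so check a few of your own headers before committing.

```python
import re


def extract_ids(fasta_lines, pattern):
    """Return the first capture group of `pattern` for every FASTA header line.

    Hypothetical helper: it mimics the idea behind --regex_source_id,
    not the actual behaviour of InsertNewExternalSequences.
    """
    regex = re.compile(pattern)
    ids = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        # Assumption: the pattern is applied to the header without the '>'.
        header = line[1:].rstrip("\n")
        match = regex.search(header)
        if match is None:
            raise ValueError(f"pattern {pattern!r} does not match {header!r}")
        ids.append(match.group(1))
    return ids


# Made-up headers, only for demonstration.
sample = [
    ">id001 example description\n",
    "MSTNPKPQRK\n",
    ">id002 another description\n",
    "MKVLAAGIVG\n",
]

print(extract_ids(sample, r"(.*)"))   # ['id001 example description', 'id002 another description']
print(extract_ids(sample, r"(\S+)"))  # ['id001', 'id002']
```

With `(.*)` the whole header becomes the source ID, which is only what you want when the header holds nothing but the identifier. `(\S+)` stops at the first whitespace and is usually the safer choice when a description follows the ID.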
From: Arnaud K. <ax...@sa...> - 2004-08-09 13:59:02
Pablo N. Mendes wrote:

> I've run the plugin with these parameters:
>
> [pablo@mkiwi mcl]$ ga GUS::Common::Plugin::InsertNewExternalSequences \
>     --external_database_release_id=38 \
>     --regex_source_id='(.*)' \
>     --table_name=DoTS::ExternalAASequence \
>     --sequencefile=tbrucei \
>     --commit
>
> The regex_*, as you probably noted, are the regular expressions to
> extract the referred info from the FASTA header (name, source_id,
> secondary_id, etc.). I've only used the regex_source_id. Also, it seems
> not to matter if you don't give the taxon_id parameter. But you
> obviously won't make the associations in DoTS::AASequenceTaxon between
> sequences and taxa.

It still doesn't work...

> This module reads similarity results in a specific format like:
>
> >479679 (3 subjects)
>   Sum: 479680:1871:1.2e-194:1:353:1:353:1:353:353:353:0:
>    HSP1: 479680:353:353:353:1871:1.2e-194:1:353:1:353:0:
>   Sum: 488460:1826:7.0e-190:1:353:1:353:1:353:342:348:0:
>    HSP1: 488460:342:348:353:1826:7.0e-190:1:353:1:353:0:
> >479680 (3 subjects)
>   Sum: 479679:1871:1.2e-194:1:353:1:353:1:353:353:353:0:
>    HSP1: 479679:353:353:353:1871:1.2e-194:1:353:1:353:0:
>
> The script parseBlastFilesForSimilarity.pl (attached) will do the trick.
> I don't know if there are multiple versions of this script traveling
> around the list.
>
> [jdai@headnode mclorth]$ ls /scratch/jdai/Cpgus_vs_Pfgus/ | perl parseBlastFilesForSimilarity.pl \
>     --regex='(\S+)' \
>     --outputFile=Lm_vs_Lm_parsed \
>     --dir=/scratch/jdai/Cpgus_vs_Pfgus/

Thanks, it works fine.

Arnaud
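For anyone who wants to sanity-check a parsed file before loading it, the similarity format quoted in this thread can be read back with a few lines of Python. This is only a sketch based on the example shown above, not on the LoadBlastSimFast source: the function name `parse_similarities` and the dictionary keys are hypothetical, and the colon-separated fields are kept positional because their meaning is not spelled out in the thread (only the first field is treated as the subject ID, as the example suggests).

```python
def parse_similarities(lines):
    """Group 'Sum:' and 'HSPn:' lines under their '>query' header.

    Returns {query_id: [{"subject": str, "fields": [...], "hsps": [[...], ...]}]}.
    Fields stay as raw strings; the trailing colon on each line leaves an
    empty last element, which is kept as-is.
    """
    results = {}
    query = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # e.g. ">479679 (3 subjects)": the query ID is the first token
            query = line[1:].split()[0]
            results[query] = []
        elif query is None:
            raise ValueError(f"data line before any query header: {line!r}")
        elif line.startswith("Sum:"):
            fields = line[len("Sum:"):].strip().split(":")
            results[query].append(
                {"subject": fields[0], "fields": fields, "hsps": []}
            )
        elif line.startswith("HSP"):
            if not results[query]:
                raise ValueError(f"HSP line without a preceding Sum line: {line!r}")
            _label, _, rest = line.partition(":")
            results[query][-1]["hsps"].append(rest.strip().split(":"))
    return results


# The example from the thread, copied verbatim.
SAMPLE = """\
>479679 (3 subjects)
  Sum: 479680:1871:1.2e-194:1:353:1:353:1:353:353:353:0:
   HSP1: 479680:353:353:353:1871:1.2e-194:1:353:1:353:0:
  Sum: 488460:1826:7.0e-190:1:353:1:353:1:353:342:348:0:
   HSP1: 488460:342:348:353:1826:7.0e-190:1:353:1:353:0:
>479680 (3 subjects)
  Sum: 479679:1871:1.2e-194:1:353:1:353:1:353:353:353:0:
   HSP1: 479679:353:353:353:1871:1.2e-194:1:353:1:353:0:
"""

for query_id, subjects in parse_similarities(SAMPLE.splitlines()).items():
    print(query_id, [s["subject"] for s in subjects])
```

Comparing the number of subjects found per query with the count in the header line is a quick way to spot a truncated file.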