Boulder::Genbank(3pm) | User Contributed Perl Documentation | Boulder::Genbank(3pm) |
Boulder::Genbank - Fetch Genbank data records as parsed Boulder Stones
    use Boulder::Genbank;

    # network access via Entrez
    $gb = Boulder::Genbank->newFh( qw(M57939 M28274 L36028) );

    while ($data = <$gb>) {
        print $data->Accession;

        @introns = $data->features->Intron;
        print "There are ",scalar(@introns)," introns.\n";
        $dna = $data->Sequence;
        print "The dna is ",length($dna)," bp long.\n";

        my @features = $data->features(-type=>[ qw(Exon Source Satellite) ],
                                       -pos=>[90,310] );
        foreach (@features) {
            print $_->Type,"\n";
            print $_->Position,"\n";
            print $_->Gene,"\n";
        }
    }

    # another syntax
    $gb = new Boulder::Genbank(-accessor=>'Entrez',
                               -fetch => [qw/M57939 M28274 L36028/]);

    # local access via Yank
    $gb = new Boulder::Genbank(-accessor=>'Yank',
                               -fetch=>[qw/M57939 M28274 L36028/]);
    while (my $s = $gb->get) {
        # etc.
    }

    # parse a file of Genbank records
    $gb = new Boulder::Genbank(-accessor=>'File',
                               -fetch => '/usr/local/db/gbpri3.seq');
    while (my $s = $gb->get) {
        # etc.
    }

    # parse flatfile records yourself
    open (GB,"/usr/local/db/gbpri3.seq") or die "Can't open file: $!";
    local $/ = "//\n";
    while (<GB>) {
        my $s = Boulder::Genbank->parse($_);
        # etc.
    }
Boulder::Genbank provides retrieval and parsing services for NCBI Genbank-format records. It returns Genbank entries in Stone format, allowing easy access to the various fields and values. Boulder::Genbank is a descendant of Boulder::Stream, and provides a stream-like interface to a series of Stone objects.
>> IMPORTANT NOTE <<
As of January 2002, NCBI has changed their Batch Entrez interface. I have modified Boulder::Genbank so as to use a "demo" interface, which fixes things, but this isn't guaranteed in the long run.
I have written to NCBI, and they may fix this -- or they may not.
>> IMPORTANT NOTE <<
Access to Genbank is provided by three different accessors, which together give access to remote and local Genbank databases. When you create a new Boulder::Genbank stream, you provide one of the three accessors, along with accessor-specific parameters that control what entries to fetch. The three accessors are Entrez (network access to NCBI), Yank (access to a local database), and File (parsing a local file of Genbank records).
It is also possible to parse a single Genbank entry from a text string stored in a scalar variable, returning a Stone object.
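As a minimal sketch of the record-splitting step that precedes string parsing, the following reads entries one at a time by setting the input record separator to the "//" terminator line. The two toy records and their accession names (TOY00001, TOY00002) are invented for illustration, and the call to Boulder::Genbank->parse() is only mentioned in a comment so that the sketch runs without the module installed.

```perl
use strict;
use warnings;

# Two toy records in Genbank flat-file layout (hypothetical data;
# real entries carry many more fields).
my $flatfile = <<'END';
LOCUS       TOY00001  12 bp  DNA
ACCESSION   TOY00001
ORIGIN
        1 aagcttttcc ag
//
LOCUS       TOY00002  8 bp  DNA
ACCESSION   TOY00002
ORIGIN
        1 ggcagtgc
//
END

# Read from an in-memory filehandle so no file on disk is needed.
open(my $fh, '<', \$flatfile) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "//\n";    # each read now returns one whole entry
    while (my $entry = <$fh>) {
        # With the module installed, each $entry could be handed to
        # Boulder::Genbank->parse($entry) to obtain a Stone.
        push @records, $entry;
    }
}
close $fh;

printf "Read %d records\n", scalar @records;
```

Localizing $/ inside a block restores normal line-by-line reading for the rest of the program.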
This section lists the public methods that the Boulder::Genbank class makes available.
    # Network fetch via Entrez, with accession numbers
    $gb = new Boulder::Genbank(-accessor => 'Entrez',
                               -fetch    => [qw/M57939 M28274 L36028/]);

    # Same, but shorter and uses -> operator
    $gb = Boulder::Genbank->new(qw(M57939 M28274 L36028));

    # Network fetch via Entrez, with a query
    $query = 'Homo sapiens[Organism] AND EST[Keyword]';
    $gb = new Boulder::Genbank(-accessor => 'Entrez',
                               -fetch    => $query);

    # Local fetch via Yank, with accession numbers
    $gb = new Boulder::Genbank(-accessor => 'Yank',
                               -fetch    => [qw/M57939 M28274 L36028/]);

    # Local fetch via File
    $gb = new Boulder::Genbank(-accessor => 'File',
                               -fetch    => '/usr/local/genbank/gbpri3.seq');
The new() method creates a new Boulder::Genbank stream on the accessor provided. The three possible accessors are Entrez, Yank and File. If successful, the method returns the stream object. Otherwise it returns undef.
new() takes the following arguments:
    -accessor   Name of the accessor to use
    -fetch      Parameters to pass to the accessor
    -proxy      URL of an HTTP proxy, needed when the Entrez accessor
                must reach NCBI through a firewall.
Specify the accessor to use with the -accessor argument. If not specified, it defaults to Entrez.
-fetch is an accessor-specific argument. The possibilities are:
For Entrez, the -fetch argument may point to a scalar, in which case it is interpreted as an Entrez query string. See http://www.ncbi.nlm.nih.gov/Entrez/linking.html for a description of the query syntax. Alternatively, -fetch may point to an array reference, in which case it is interpreted as a list of accession numbers to retrieve. If -fetch points to a hash, it is interpreted as extended information. See "Extended Entrez Parameters" below.
For Yank, the -fetch argument must point to an array reference containing the accession numbers to retrieve.
For File, the -fetch argument must point to a string-valued scalar, which will be interpreted as the path to the file to read Genbank entries from.
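The Entrez rules above depend only on what kind of value -fetch holds. The following sketch shows that dispatch in isolation; classify_fetch() is a hypothetical helper written for this illustration, not a function provided by Boulder::Genbank, and it does not contact NCBI.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Boulder::Genbank): reports how an
# Entrez -fetch value would be interpreted under the rules above.
sub classify_fetch {
    my ($fetch) = @_;
    return 'undefined'           unless defined $fetch;
    return 'query string'        unless ref $fetch;          # plain scalar
    return 'accession list'      if ref $fetch eq 'ARRAY';   # array reference
    return 'extended parameters' if ref $fetch eq 'HASH';    # hash reference
    return 'unsupported';
}

print classify_fetch('Homo sapiens[Organism] AND EST[Keyword]'), "\n";
print classify_fetch([qw/M57939 M28274 L36028/]), "\n";
print classify_fetch({ -db => 'n' }), "\n";
```

Yank and File are stricter: per the text above, Yank expects an array reference and File expects a plain string.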
For Entrez (and Entrez only) Boulder::Genbank allows you to use a shortcut syntax in which you provide new() with a list of accession numbers:
$gb = new Boulder::Genbank('M57939','M28274','L36028');
    $fh = Boulder::Genbank->newFh('M57939','M28274','L36028');
    while ($record = <$fh>) {
        print $record->asString;
    }

The object returned is a Stone::GB_Sequence object, which is a descendant of Stone.
The Entrez accessor recognizes extended parameters that let you customize the search. Instead of passing a query string scalar or a list of accession numbers as the -fetch argument, pass a hash reference. The hashref should contain one or more of the following keys:
    m   MEDLINE
    p   Protein
    n   Nucleotide
    s   Popset

    -proxy => 'http://www.firewall.com:9000'
If you think you need this, get the correct URL from your system administrator.
As an example, here's how to search for ESTs from Oryza sativa that have been entered or modified since 1999.
    my $gb = new Boulder::Genbank(
        -accessor => 'Entrez',
        -fetch    => { -query => 'Oryza sativa[Organism] AND EST[Keyword] AND 1999[MDAT]',
                       -db    => 'n' });
Each record returned from the Boulder::Genbank stream defines a set of methods that correspond to features and other fields in the Genbank flat file record. Stone::GB_Sequence gives the full details, but they are listed for reference here:
Get the length of the sequence.
Get the start position of the sequence, currently always "1".
Get the end position of the sequence, currently always the same as the length.
features() will search the entry feature list for those features that meet certain criteria. The criteria are specified using the -pos and/or -type argument names, as shown below.
-pos => 1500; # feature must overlap position 1500
or a range of positions in this way:
-pos => [1000,1500]; # 1000 to 1500 inclusive
If no criteria are provided, then features() returns all the features, and is equivalent to calling the Features() accessor.
-type => 'Exon'
or with a list of types, as in
-types => ['Exon','CDS']
The names "-type" and "-types" can be used interchangeably.
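To make the -pos criterion concrete, here is a minimal sketch of an overlap test. overlaps_pos() is a hypothetical helper, not the module's own code, and it assumes that a range given to -pos selects any feature overlapping that range (rather than only features fully contained in it). The coordinates are simple start..end positions taken from the example Stone in this document.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Boulder::Genbank): true if a feature
# spanning $fstart..$fend overlaps the -pos criterion, which is either
# a single position or an array reference [start, end], both inclusive.
sub overlaps_pos {
    my ($fstart, $fend, $pos) = @_;
    my ($qstart, $qend) = ref $pos eq 'ARRAY' ? @$pos : ($pos, $pos);
    return ($fstart <= $qend && $fend >= $qstart) ? 1 : 0;
}

# Features with plain start..end positions from the example record.
my %features = (
    Exon         => [1012, 1099],
    Intron       => [1100, 1208],
    Polya_signal => [1970, 1975],
);

for my $type (sort keys %features) {
    my ($start, $end) = @{ $features{$type} };
    printf "%-12s overlaps [1000,1500]: %s\n",
        $type, overlaps_pos($start, $end, [1000, 1500]) ? 'yes' : 'no';
}
```

Treating a single position as the degenerate range (p, p) lets one comparison serve both forms of -pos.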
Returns a Bio::Seq object from the Bioperl project. Dies with an error message unless the Bio::Seq module is installed.
The tags returned by the parsing operation are taken from the NCBI ASN.1 schema. For consistency, they are normalized so that the initial letter is capitalized, and all subsequent letters are lowercase. This section contains an abbreviated list of the most useful/common tags. See "The NCBI Data Model", by James Ostell and Jonathan Kans in "Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins" (Eds. A. Baxevanis and F. Ouellette), pp 121-144 for the full listing.
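The normalization rule described above (initial capital, all later letters lowercase) can be sketched in one line of Perl. normalize_tag() is a hypothetical name chosen for this illustration, and the raw spellings passed to it are examples only; the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: capitalize the first letter and lowercase the rest,
# as the text describes for parsed tags.
sub normalize_tag {
    my ($tag) = @_;
    return ucfirst lc $tag;
}

# Example raw spellings (illustrative input only).
for my $raw (qw(CDS mRNA polyA_site ACCESSION)) {
    printf "%-12s => %s\n", $raw, normalize_tag($raw);
}
```

This is why features are reached through methods such as Cds and Mrna rather than CDS and mRNA.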
These are tags that appear at the top level of the parsed Genbank entry.
Example:
my $accessionNo = $s->Accession;
    my $A = $s->Basecount->A;
    my $C = $s->Basecount->C;
    my $G = $s->Basecount->G;
    my $T = $s->Basecount->T;
    print "GC content is ",($G+$C)/($A+$C+$G+$T),"\n";
Example:
my $keywords = $s->Keywords;
my @references = $s->Reference;
The Features tag points to a Stone record that contains multiple subtags. Each subtag is the name of a feature which points, in turn, to a Stone that describes the feature's location and other attributes. The full list of features is beyond the scope of this document, but the following are the features that are most often seen:

    Cds          a CDS
    Intron       an intron
    Exon         an exon
    Gene         a gene
    Mrna         an mRNA
    Polya_site   a putative polyadenylation signal
    Repeat_unit  a repetitive region
    Source       More information about the organism and cell type
                 the sequence was derived from
    Satellite    a microsatellite (dinucleotide repeat)
Each feature will contain one or more of the following subtags:
Example:
    foreach ($s->Features->Cds) {
        my $gene = $_->Gene;
        my $position = $_->Position;
        print "Gene $gene ($position)\n";
    }
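The Position value is returned as a string in Genbank location syntax. As a rough sketch, the simplest forms (a single base such as "1989" or a plain range such as "1012..1099") can be converted to numeric coordinates as shown below. simple_span() is a hypothetical helper, not part of the module, and it deliberately declines compound locations such as join(...), order(...), or fuzzy ends like "1182..(1989.1998)".

```perl
use strict;
use warnings;

# Hypothetical helper: return (start, end) for a plain "N" or "N..M"
# position string, or an empty list for anything more complex.
sub simple_span {
    my ($position) = @_;
    return ($1, $2) if $position =~ /^(\d+)\.\.(\d+)$/;
    return ($1, $1) if $position =~ /^(\d+)$/;
    return;    # join(), order(), fuzzy or remote locations
}

for my $position ('1012..1099', '1989', '1182..(1989.1998)') {
    if (my ($start, $end) = simple_span($position)) {
        printf "%s spans %d bp\n", $position, $end - $start + 1;
    }
    else {
        print "$position is not a simple range\n";
    }
}
```

Returning an empty list for unsupported forms lets the caller decide how to handle them instead of guessing at coordinates.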
Boulder, Boulder::Blast
Lincoln Stein <lstein@cshl.org>.
Copyright (c) 1997-2000 Lincoln D. Stein
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See DISCLAIMER.txt for disclaimers of warranty.
The following is an excerpt from a moderately complex Genbank Stone. The Sequence line and several other long lines have been truncated for readability.
Authors=Spritz,R.A., Strunk,K., Surowy,C.S.O., Hoch,S., Barton,D.E. and Francke,U.
Authors=Spritz,R.A., Strunk,K., Surowy,C.S. and Mohrenweiser,H.W.
Locus=HUMRNP7011 2155 bp DNA PRI 03-JUL-1991
Accession=M57939
Accession=J04772
Accession=M57733
Keywords=ribonucleoprotein antigen.
Sequence=aagcttttccaggcagtgcgagatagaggagcgcttgagaaggcaggttttgcagcagacggcagtgacagcccag...
Definition=Human small nuclear ribonucleoprotein (U1-70K) gene, exon 10 and 11.
Journal=Nucleic Acids Res. 15, 10373-10391 (1987)
Journal=Genomics 8, 371-379 (1990)
Nid=g337441
Medline=88096573
Medline=91065657
Features={
  Polya_site={
    Evidence=experimental
    Position=1989
    Gene=U1-70K
  }
  Polya_site={
    Position=1990
    Gene=U1-70K
  }
  Polya_site={
    Evidence=experimental
    Position=1992
    Gene=U1-70K
  }
  Polya_site={
    Evidence=experimental
    Position=1998
    Gene=U1-70K
  }
  Source={
    Organism=Homo sapiens
    Db_xref=taxon:9606
    Position=1..2155
    Map=19q13.3
  }
  Cds={
    Codon_start=1
    Product=ribonucleoprotein antigen
    Db_xref=PID:g337445
    Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
    Gene=U1-70K
    Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPRDAPPPTR...
  }
  Cds={
    Codon_start=1
    Product=ribonucleoprotein antigen
    Db_xref=PID:g337444
    Evidence=experimental
    Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
    Gene=U1-70K
    Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPR...
  }
  Polya_signal={
    Position=1970..1975
    Note=putative
    Gene=U1-70K
  }
  Intron={
    Evidence=experimental
    Position=1100..1208
    Gene=U1-70K
  }
  Intron={
    Number=10
    Evidence=experimental
    Position=1100..1181
    Gene=U1-70K
  }
  Intron={
    Number=9
    Evidence=experimental
    Position=order(M57937:702..921,1..1011)
    Note=2.1 kb gap
    Gene=U1-70K
  }
  Intron={
    Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1208)
    Gene=U1-70K
  }
  Intron={
    Evidence=experimental
    Position=order(M57935:284..406,M57936:1..284,M57937:1..599, <1..>1208)
    Note=first gap-0.14 kb, second gap-0.62 kb
    Gene=U1-70K
  }
  Intron={
    Number=8
    Evidence=experimental
    Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1181)
    Note=first gap-0.14 kb, second gap-0.62 kb
    Gene=U1-70K
  }
  Exon={
    Number=10
    Evidence=experimental
    Position=1012..1099
    Gene=U1-70K
  }
  Exon={
    Number=11
    Evidence=experimental
    Position=1182..(1989.1998)
    Gene=U1-70K
  }
  Exon={
    Evidence=experimental
    Position=1209..(1989.1998)
    Gene=U1-70K
  }
  Mrna={
    Product=ribonucleoprotein antigen
    Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
    Gene=U1-70K
  }
  Mrna={
    Product=ribonucleoprotein antigen
    Citation=[2]
    Evidence=experimental
    Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
    Gene=U1-70K
  }
  Gene={
    Position=join(M57928:207..719,M57929:1..562,M57930:1..577, ...
    Gene=U1-70K
  }
}
Reference=1 (sites)
Reference=2 (bases 1 to 2155)
=
2022-06-08 | perl v5.34.0 |