bp_genbank_ref_extractor - Retrieves all related sequences for a
list of searches on Entrez Gene
bp_genbank_ref_extractor [options] [Entrez Gene
Queries]
This script searches the Entrez Gene database and retrieves
not only the gene sequence but also the related transcript and protein
sequences.
The gene UIDs of multiple searches are collected before attempting
to retrieve them, so each gene is analyzed only once even if it appears
as a result of more than one search.
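The collection step described above can be pictured with a minimal sketch. It is not the script's own code (the script is written in Perl); the function name and the UID values are hypothetical and serve only to illustrate merging the results of several searches so that each gene is handled once.

```python
def collect_unique_uids(results_per_search):
    """Merge the UID lists of several searches, keeping first-seen order."""
    seen = set()
    unique = []
    for uids in results_per_search:
        for uid in uids:
            if uid not in seen:
                seen.add(uid)
                unique.append(uid)
    return unique


# Made-up UIDs: the second search returns one gene already found by the first.
first_search = ["1001", "1002", "1003"]
second_search = ["1002", "1004"]
uids_to_fetch = collect_unique_uids([first_search, second_search])
```

Each UID in the merged list would then be retrieved a single time, regardless of how many searches returned it.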
Note that by default no sequences are saved (see options
and examples).
Several options can be used to fine-tune the script behaviour. It
is possible to obtain extra base pairs upstream and downstream of the gene,
and to control the naming of files and the genome assembly to use.
See the bugs section for problems that can arise when relying on the
default values of options.
- --assembly
- When retrieving the sequence, a specific assembly can be defined. The value
expected is a regex that will be case-insensitive. If it matches more than
one assembly, it will use the first match. It defaults to
"(primary|reference) assembly".
- --debug
- If set, even more output will be printed that may help with debugging.
Unlike the messages from --verbose and --very-verbose, these
will not appear in the log file unless this option is selected. This
option also sets --very-verbose.
- --downstream,
--down
- Specifies the number of extra base pairs to be retrieved downstream of the
gene. These extra base pairs will only affect the gene sequence, not the
transcript or proteins.
- --email
- A valid email used to connect to the NCBI servers. This may be used by
NCBI to contact users in case of problems and before blocking access in
case of heavy usage.
- --format
- Specifies the format in which the sequences will be saved. Defaults to
genbank format. Valid formats are 'genbank' or 'fasta'.
- --genes
- Specifies the name for the gene files. By default, they are not saved. If no
value is given, it defaults to the UID. Possible values are 'uid', 'name',
'symbol' (the official symbol or nomenclature).
- --help
- Display the documentation (this text).
- --limit
- When making a query, limit the results to the first hits, up to this number.
This guards against overly unspecific queries, and a warning
will be given if a query returns more results than the limit. The default
value is 200. Note that this limit applies to each search separately.
- --non-coding,
--nonon-coding
- Some protein coding genes have transcripts that are non-coding. By
default, these sequences are saved as well. --nonon-coding can be
used to ignore those transcripts.
- --proteins
- Specifies the name for the protein files. By default, they are not saved. If
no value is given, it defaults to the accession. Possible values are
'accession', 'description', 'gene' (the corresponding gene ID) and
'transcript' (the corresponding transcript accession).
Note that if not using 'accession', it is possible for files to be
overwritten, since the same gene may encode more than one
protein and different proteins may have the same description.
- --pseudo,
--nopseudo
- By default, sequences of pseudogenes will be saved. --nopseudo can
be used to ignore those genes.
- --save
- Specifies the path for the directory where the sequence and log files will
be saved. If the directory does not exist it will be created, although the
path leading to it must already exist. Files in the directory may be
overwritten if necessary. If unspecified, a directory named "extracted
sequences" in the current directory will be used.
- --save-data
- This option saves the data (gene UIDs, description, product accessions,
etc) to a file. As an optional value, the file format can be specified.
Defaults to CSV.
Currently only CSV is supported.
Saving the data structure as a CSV file requires the
installation of the Text::CSV module.
- --transcripts,
--mrna
- Specifies the name for the transcript files. By default, they are not saved.
If no value is given, it defaults to the accession. Possible values are
'accession', 'description', 'gene' (the corresponding gene ID) and
'protein' (the protein the transcript encodes).
Note that if not using 'accession', it is possible for files to be
overwritten, since the same gene may have more than one
transcript and different transcripts may have the same description. Also,
non-coding transcripts will create problems if using 'protein'.
- --upstream,
--up
- Specifies the number of extra base pairs to be extracted upstream of the
gene. These extra base pairs will only affect the gene sequence, not the
transcript or proteins.
- --verbose,
--v
- If set, program becomes verbose. For an extremely verbose program, use
--very-verbose instead.
- --very-verbose,
--vv
- If set, program becomes extremely verbose. Setting this option
automatically sets --verbose as well. For help in debugging,
consider using --debug.
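The matching rule documented for --assembly (a case-insensitive regex, with the first match winning) can be illustrated with a minimal sketch. It is not the script's own code; the function name and the list of assembly labels are hypothetical, and the use of an unanchored search is an assumption about how "matches" is meant.

```python
import re

# Default pattern as documented for --assembly.
DEFAULT_ASSEMBLY = r"(primary|reference) assembly"


def pick_assembly(available, pattern=DEFAULT_ASSEMBLY):
    """Return the first label matching pattern, ignoring case, or None."""
    regex = re.compile(pattern, re.IGNORECASE)
    for label in available:
        # Assumption: an unanchored search, so the pattern may match anywhere.
        if regex.search(label):
            return label
    return None


# Hypothetical labels, in the order a record might list them.
labels = ["Alternate HuRef", "Reference Assembly", "Primary Assembly"]
chosen_by_default = pick_assembly(labels)
chosen_huref = pick_assembly(labels, "alternate huref")
```

Because the first match is used, a pattern that matches several labels makes the result depend on the order in which the assemblies are listed; a more specific pattern avoids that.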
- "bp_genbank_ref_extractor --transcripts=accession '"homo
sapiens"[organism] AND H2B'"
- Searches Entrez Gene with the query '"homo
sapiens"[organism] AND H2B', and saves the transcript
sequences of the hits. Note that with the default value of --limit, only
some of the hits may be extracted.
- "bp_genbank_ref_extractor --transcripts=accession
--proteins=accession --format=fasta '"homo sapiens"[organism] AND
H2B' '"homo sapiens"[organism] AND MCPH1'"
- Same as the first example, but also searches for '"homo
sapiens"[organism] AND MCPH1', also saves the protein sequences, and
saves everything in the fasta format.
- "bp_genbank_ref_extractor --genes --up=100 --down=500 '"homo
sapiens"[organism] AND H2B'"
- Same search as the first example, but saves the genomic sequences instead,
including 100 bp upstream and 500 bp downstream.
- "bp_genbank_ref_extractor --genes --assembly='Alternate HuRef'
'"homo sapiens"[organism] AND H2B'"
- Same search as the first example, but saves the genomic sequences, taken
from the Alternate HuRef genome assembly instead.
- "bp_genbank_ref_extractor --save-data=CSV '"homo
sapiens"[organism] AND H2B'"
- Same search as the first example, but saves no sequences; it only saves
the results in a CSV file.
- "bp_genbank_ref_extractor --save='search results' --genes=name
--upstream=200 --downstream=500 --nopseudo --nonon-coding --transcripts
--proteins --format=fasta --save-data=CSV '"homo
sapiens"[organism] AND H2B' '"homo sapiens"[organism] AND
MCPH1'"
- Searches Entrez Gene for both '"homo
sapiens"[organism] AND H2B' and '"homo
sapiens"[organism] AND MCPH1' and saves the gene sequences of
all hits (up to the default limit and ignoring pseudogenes) plus 200 bp
upstream and 500 bp downstream of them. It will also save the sequences
of all transcripts and proteins of each gene (but ignoring non-coding
transcripts). It will save the sequences in the fasta format, inside a
directory "search results", and save the
results in a CSV file.
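For the examples that use --upstream and --downstream, the widening of the retrieved gene range can be pictured with a minimal sketch. It is not the script's own code; the function name and coordinates are hypothetical, and the assumption that "upstream" follows the gene's strand (so it lies at higher coordinates for a minus-strand gene) is not stated in this documentation.

```python
def flanked_range(start, end, upstream=0, downstream=0, strand=1):
    """Widen a 1-based inclusive gene range by the requested flanks."""
    if strand >= 0:
        new_start, new_end = start - upstream, end + downstream
    else:
        # Assumption: on the minus strand, upstream lies past the end coordinate.
        new_start, new_end = start - downstream, end + upstream
    # Never ask for positions before the first base of the sequence.
    return max(1, new_start), new_end


# Made-up coordinates, with flanks as in the --up=100 --down=500 example.
plus_strand = flanked_range(1000, 2000, upstream=100, downstream=500)
minus_strand = flanked_range(1000, 2000, upstream=100, downstream=500, strand=-1)
```

Only the gene range is widened in this way; as the option descriptions state, transcript and protein sequences are unaffected.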
- When supplying options, it's possible to omit the value and use the
default. However, when the expected value is a string, the next argument
may be mistaken for the value of the option. For example, when using the
following command:
"bp_genbank_ref_extractor --transcripts
'H2A AND homo sapiens'"
the intent is to search for 'H2A AND homo sapiens', saving only the
transcripts and using the default as base for the filenames. However, the
search terms will be interpreted as the base for the filenames (and,
since that is not a valid identifier, an error will be returned). To prevent
this, you can either specify the values:
"bp_genbank_ref_extractor --transcripts
'accession' 'H2A AND homo sapiens'"
"bp_genbank_ref_extractor
--transcripts='accession' 'H2A AND homo sapiens'"
or you can use the double dash to stop processing options.
Note that this should only be used after the last option. All arguments
supplied after the double dash will be interpreted as search terms:
"bp_genbank_ref_extractor --transcripts
-- 'H2A AND homo sapiens'"
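The same pitfall can be reproduced with any parser that allows optional option values. The sketch below is an analogy only: it uses Python's argparse rather than the Perl option parsing of the script itself, and the parser definition and the parse helper are hypothetical. It shows the search term being consumed as the option value, and the two workarounds described above.

```python
import argparse

# Naming choices taken from the --transcripts description.
NAMING = ("accession", "description", "gene", "protein")


def parse(argv):
    """Return (transcripts, queries), or None if the arguments are rejected."""
    parser = argparse.ArgumentParser(prog="sketch")
    # nargs="?" makes the value optional; const is used when it is omitted.
    parser.add_argument("--transcripts", nargs="?", const="accession",
                        choices=NAMING)
    parser.add_argument("queries", nargs="*")
    try:
        args = parser.parse_args(argv)
    except SystemExit:
        # argparse reports the invalid value on stderr and exits.
        return None
    return args.transcripts, args.queries


query = "H2A AND homo sapiens"
swallowed = parse(["--transcripts", query])          # query taken as the value
explicit = parse(["--transcripts=accession", query])  # value given explicitly
stopped = parse(["--transcripts", "--", query])       # double dash ends options
```

In this sketch the first call is rejected because the search term is not one of the valid naming choices, while the other two calls both yield the default naming and leave the search term intact.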
User feedback is an integral part of the evolution of this and
other Bioperl modules. Send your comments and suggestions preferably to the
Bioperl mailing list. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
Please direct usage questions or support issues to the mailing
list: bioperl-l@bioperl.org
rather than to the module maintainer directly. Many experienced
and responsive experts will be able to look at the problem and quickly address
it. Please include a thorough description of the problem with code and data
examples if at all possible.
Report bugs to the Bioperl bug tracking system to help us keep
track of the bugs and their resolution. Bug reports can be submitted via the
web:
https://github.com/bioperl/%%7Bdist%7D
Carnë Draug <carandraug+dev@gmail.com>
This software is copyright (c) 2011-2015 by Carnë
Draug.
This software is available under the GNU General Public License,
Version 3, June 2007.