`_ that takes a genome sequence and determines a fine type and antibiotic resistance.
The code for this rule is as follows: ::

  #Clinical identification rule

  #Update job viewer status
  update_status({stage=>'Scanning MLST loci'});

  #Scan genome against all scheme 1 (MLST) loci
  scan_scheme(1);

  #Update job viewer status
  update_status({percent_complete=>30, stage=>'Scanning PorA and FetA VRs'});

  #Scan genome against the PorA VR and FetA VR loci
  scan_locus($_) foreach qw(PorA_VR1 PorA_VR2 FetA_VR);

  #Add text to main output
  append_html("Strain type\n");

  #Set variables for the scanned results. These can be found in the
  #$results->{'locus'} hashref
  my %alleles;
  $alleles{$_} = $results->{'locus'}->{$_} // 'ND' foreach qw(PorA_VR1 PorA_VR2);
  $alleles{'FetA_VR'} = $results->{'locus'}->{'FetA_VR'} // 'F-ND';

  #Scheme field values are automatically determined if a complete
  #profile is available. These are stored in the $results->{'scheme'} hashref
  my $st = $results->{'scheme'}->{1}->{'ST'} // 'ND';

  append_html("- P1.$alleles{'PorA_VR1'}, $alleles{'PorA_VR2'}; $alleles{'FetA_VR'}; ST-$st ");

  #Reformat clonal complex using a regular expression, e.g.
  #'ST-11 clonal complex/ET-37 complex' gets rewritten to 'cc11'
  my $cc = $results->{'scheme'}->{1}->{'clonal_complex'} // '-';
  $cc =~ s/ST-(\S+) complex.*/cc$1/;
  append_html("($cc)\n");

  if ($st eq 'ND'){
      append_html("ST not defined. If individual MLST loci have been found "
        . "they will be displayed below:\n");

      #The get_scheme_html function automatically formats output for a scheme.
      #Select whether to display in a table rather than a list, list all loci, and/or list fields.
      append_html(get_scheme_html(1, {table=>1, loci=>1, fields=>0}));
  }

  #Antibiotic resistance
  update_status({percent_complete=>80, stage=>'Scanning penA and rpoB'});
  scan_locus($_) foreach qw(penA rpoB);

  if (defined $results->{'locus'}->{'penA'} || defined $results->{'locus'}->{'rpoB'} ){
      append_html("Antibiotic resistance\n");
      if (defined $results->{'locus'}->{'penA'}){
          append_html("- penA allele: $results->{'locus'}->{'penA'}");

          #If a client isolate database has been defined and values have been defined in
          #the client_dbase_loci_fields table, the values for a field in the isolate database can be
          #retrieved based on isolates that have a particular allele designated.
          #The min_percentage attribute states that only values that are represented by at least that
          #proportion of all isolates that had a value set are returned (null values are ignored).
          my $range = get_client_field(1,'penA','penicillin_range',{min_percentage => 75});
          append_html(" (penicillin MIC: $range->[0]->{'penicillin_range'})") if @$range;
          append_html("\n");
      }
      if (defined $results->{'locus'}->{'rpoB'}){
          append_html("- rpoB allele: $results->{'locus'}->{'rpoB'}");
          my $range = get_client_field(1,'rpoB','rifampicin_range',{min_percentage => 75});
          append_html(" (rifampicin MIC: $range->[0]->{'rifampicin_range'})") if @$range;
          append_html("\n");
      }
      append_html("\n");
  }

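The fallback values and the clonal complex rewrite used in the rule can be tried outside BIGSdb. The following is a minimal sketch, assuming a hand-built ``$results`` hashref in place of the one BIGSdb populates after scanning; the ``strain_designation`` helper and the example allele values are hypothetical and are not part of BIGSdb.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the strain designation string in the same way
# as the rule above, from a hashref laid out like $results.
sub strain_designation {
    my ($res) = @_;
    my %alleles;

    # '//' (defined-or, Perl 5.10+) supplies a fallback when a locus was not found.
    $alleles{$_} = $res->{'locus'}->{$_} // 'ND' foreach qw(PorA_VR1 PorA_VR2);
    $alleles{'FetA_VR'} = $res->{'locus'}->{'FetA_VR'} // 'F-ND';
    my $st = $res->{'scheme'}->{1}->{'ST'}             // 'ND';
    my $cc = $res->{'scheme'}->{1}->{'clonal_complex'} // '-';

    # Same substitution as in the rule: 'ST-4 complex/subgroup IV' becomes 'cc4'.
    $cc =~ s/ST-(\S+) complex.*/cc$1/;
    return "P1.$alleles{'PorA_VR1'}, $alleles{'PorA_VR2'}; $alleles{'FetA_VR'}; ST-$st ($cc)";
}

# Made-up scan results in which FetA_VR was not found.
my $results = {
    locus  => { PorA_VR1 => '5-2', PorA_VR2 => '10' },
    scheme => { 1 => { ST => 4, clonal_complex => 'ST-4 complex/subgroup IV' } },
};
print strain_designation($results), "\n";
```

With these values the missing FetA_VR allele is reported as 'F-ND', while the ST and clonal complex are filled in from the scheme values.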
Rule files
----------
The rule file is placed in a rules directory within the database configuration directory, e.g. /etc/bigsdb/dbases/pubmlst_neisseria_seqdef/rules. Rule files are suffixed with '.rule' and their name should be descriptive since it is used within the interface, e.g. the above rule file is named Clinical_identification.rule (underscores are converted to spaces in the web interface).
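As an illustration of the naming convention, the sketch below derives the name shown in the interface from a rule filename. The ``rule_display_name`` helper is hypothetical and only mirrors the behaviour described above; it is not BIGSdb's own code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BIGSdb): drop the '.rule' suffix and
# convert underscores to spaces, as described for the web interface.
sub rule_display_name {
    my ($filename) = @_;
    ( my $name = $filename ) =~ s/\.rule$//;
    $name =~ tr/_/ /;
    return $name;
}

print rule_display_name('Clinical_identification.rule'), "\n";
```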
Linking to the rule query
-------------------------
Links to the rule query are not automatically placed within the web interface. The above rule query can be called using the following URL:
``_
To place a link to this within the database contents page, an HTML file called job_query.html can be placed in a contents directory within the database configuration directory, e.g. in /etc/bigsdb/dbases/pubmlst_neisseria_seqdef/contents/job_query.html. This file should contain a list entry (i.e. surrounded with ``<li>`` and ``</li>`` tags) that will appear in the 'Query database' section of the contents page.
Adding descriptive text
-----------------------
Descriptive text for the rule, which will appear on the rule query page, can be placed in a file called description.html in a directory with the same name as the rule within the rule directory, e.g. in /etc/bigsdb/dbases/pubmlst_neisseria_seqdef/rules/Clinical_identification/description.html.
.. _mlst_workflow:
.. index::
   pair: adding; MLST scheme
**************************************
Workflow for setting up an MLST scheme
**************************************
The workflow for setting up an MLST scheme is as follows (the example seqdef database is called seqdef_db):

**Seqdef database**

1. Create appropriate loci
2. Create new scheme 'MLST'
3. Add scheme_field 'ST' with primary_key=TRUE (add clonal_complex if you want; set this with primary_key=FALSE)
4. Add each locus as a scheme_member
5. You'll then be able to add profiles

**Isolate database**

1. Create the same loci with the following additional parameters (example locus 'atpD'):

   * dbase_name: seqdef_db
   * dbase_table: sequences
   * dbase_id_field: allele_id
   * dbase_id2_field: locus
   * dbase_id_value: atpD
   * dbase_seq_field: sequence
   * url: something like /cgi-bin/bigsdb/bigsdb.pl?db=seqdef_db&page=alleleInfo&locus=atpD&allele_id=[?]

2. Create scheme 'MLST' with:

   * dbase_name: seqdef_db
   * dbase_table: scheme_1 (or whatever the id of your seqdef scheme is)

3. Add scheme_field ST as before
4. Add loci as scheme_members

.. index::
   pair: locus; adding
*****************************************************
Defining new loci based on annotated reference genome
*****************************************************
An annotated reference genome can be used as the basis of defining loci. The 'Databank scan' function will create an upload table suitable for pasting directly into the batch locus add form of the sequence definition or isolate databases.
Click 'Database scan' on the curator's contents page.
.. image:: /images/administration/database_scan.png
Enter an EMBL or Genbank accession number for a complete annotated genome and press 'Submit'.
.. image:: /images/administration/database_scan2.png
A table of loci will be generated if the accession number is valid.
.. image:: /images/administration/database_scan3.png
Tab-delimited text and Excel format files will be created to be used as the basis for upload files for the sequence definition and isolate databases. Batch sequence files, in text and Excel formats, are also created for defining the first allele once the locus has been set up in the sequence definition database.
.. image:: /images/administration/database_scan4.png
.. index::
   single: genome filtering
.. _genome_filtering:
****************
Genome filtering
****************
Within a genome there may be multiple loci that share allele pools. If an allele sequence is tagged from a genome using only BLAST then there is no way to determine which locus has been identified. It is, however, possible to further define loci by their context, i.e. surrounding sequence.
.. index::
   single: genome filtering; in silico PCR
   single: in silico PCR
Filtering by in silico PCR
==========================
Provided a locus can be predicted to be specifically amplified by a PCR reaction, the genome can be filtered to only look at regions predicted to fall within amplification products of one or more PCR reactions. Since this is in silico, we don't need to worry about problems such as sequence secondary structure, and primers can be any length.
.. figure:: /images/administration/in_silico_pcr.png

   Genome filtering by in silico PCR.

To define a PCR reaction that can be linked to a locus definition, click the add (+) PCR reaction link on the curator's main page.
.. image:: /images/administration/in_silico_pcr2.png
In the resulting web form you can enter values for your two primer sequences (which can be any length), the minimum and maximum lengths of reaction products you wish to consider and a value for the allowed number of mismatches per primer.
.. image:: /images/administration/in_silico_pcr3.png

* id - PCR reaction identifier number.

  * Allowed: integer.

* description - Description of PCR reaction product.

  * Allowed: any text.

* primer1 - Primer 1 sequence.

  * Allowed: nucleotide sequence (IUPAC ambiguous characters allowed).

* primer2 - Primer 2 sequence.

  * Allowed: nucleotide sequence (IUPAC ambiguous characters allowed).

* min_length - Minimum length of predicted PCR product.

  * Allowed: integer.

* max_length - Maximum length of predicted PCR product.

  * Allowed: integer.

* max_primer_mismatch - Number of mismatches allowed in primer sequence.

  * Allowed: integer.
  * Do not set this too high or the simulation will run slowly.

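To make the role of these parameters concrete, here is a deliberately simplified sketch of how a reaction definition could be used to predict products. It is not how BIGSdb performs the simulation: it assumes unambiguous upper-case sequences (no IUPAC codes), considers the forward strand only, and uses a brute-force search. All function names and the example genome are made up for illustration.

```perl
use strict;
use warnings;

# Count positions at which two equal-length strings differ.
sub mismatches {
    my ( $s1, $s2 ) = @_;
    my $count = 0;
    for my $i ( 0 .. length($s1) - 1 ) {
        $count++ if substr( $s1, $i, 1 ) ne substr( $s2, $i, 1 );
    }
    return $count;
}

# Reverse complement of an unambiguous upper-case nucleotide sequence.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Start offsets (0-based) where $primer matches with at most $max mismatches.
sub primer_sites {
    my ( $genome, $primer, $max ) = @_;
    my $len = length $primer;
    return grep { mismatches( substr( $genome, $_, $len ), $primer ) <= $max }
      0 .. length($genome) - $len;
}

# Regions bounded by primer1 and the reverse complement of primer2 whose
# length lies between min_length and max_length.
sub predict_products {
    my ( $genome, $pcr ) = @_;
    my @fwd = primer_sites( $genome, $pcr->{primer1}, $pcr->{max_primer_mismatch} );
    my @rev = primer_sites( $genome, revcomp( $pcr->{primer2} ), $pcr->{max_primer_mismatch} );
    my @products;
    for my $start (@fwd) {
        for my $rev_start (@rev) {
            my $length = $rev_start + length( $pcr->{primer2} ) - $start;
            next if $length < $pcr->{min_length} || $length > $pcr->{max_length};
            push @products, { start => $start, length => $length };
        }
    }
    return @products;
}

# Made-up genome and reaction values for illustration only.
my $genome = 'TTTT' . 'ACGTAC' . 'GGGGGGGGGG' . 'TGAATC' . 'TTTT';
my %pcr    = (
    primer1    => 'ACGTAC',
    primer2    => 'GATTCA',
    min_length => 10,
    max_length => 100,
    max_primer_mismatch => 0,
);
printf "start=%d length=%d\n", $_->{start}, $_->{length} foreach predict_products( $genome, \%pcr );
```

Every window of the genome is compared against each primer, and raising the mismatch allowance generally admits more candidate sites to pair up, which gives some intuition for why a high max_primer_mismatch value slows the simulation.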
Associating this with a particular locus is a two-step process. First, create a locus link by clicking the add (+) PCR locus link on the curator's main page. This link will only appear once a PCR reaction has been defined.
.. image:: /images/administration/in_silico_pcr4.png
Select the locus and PCR reaction name from the dropdown lists to create the link. Second, edit the locus table and set the pcr_filter field to 'true'.
Now when you next perform tag scanning there will be an option to use PCR filtering.
.. index::
   single: genome filtering; in silico hybridization
   single: in silico hybridization
Filtering by in silico hybridization
====================================
An alternative is to define a locus by proximity to a single probe sequence. This is especially useful if you have multiple contigs and the locus in question may be at the end of a contig so that it doesn't have upstream or downstream sequence available for PCR filtering.
.. figure:: /images/administration/in_silico_hybridization.png

   Filtering by in silico hybridization

The process is very similar to setting up PCR filtering, but this time click the nucleotide probe link on the curator's content page.
.. figure:: /images/administration/in_silico_hybridization2.png
Enter the nucleotide sequence and a name for the probe. Next you need to link this to the locus in question. Click the add (+) probe locus links link on the curator's main page. This link will only appear once a probe has been defined.
.. figure:: /images/administration/in_silico_hybridization3.png
Fill in the web form with appropriate values. Required fields have an exclamation mark (!) next to them:

* probe_id - Dropdown list of probe names.

  * Allowed: selection from list.

* locus - Dropdown list of loci.

  * Allowed: selection from list.

* max_distance - Maximum distance of probe from end of locus.

  * Allowed: any positive integer.

* min_alignment - Minimum length of alignment allowed.

  * Allowed: any positive integer.

* max_mismatch - Maximum number of mismatches allowed in alignment.

  * Allowed: any positive integer.

* max_gaps - Maximum number of gaps allowed in alignment.

  * Allowed: any positive integer.

Finally edit the locus table and set the probe_filter field for the specified locus to 'true'.
Now when you next perform tag scanning there will be an option to use probe hybridization filtering.
.. index::
   single: locus positions; setting
.. _genome_positions:
******************************
Setting locus genome positions
******************************
The genome position for a locus can be set directly by editing the locus record. To batch update multiple loci, however, it is much easier to base the positions on a tagged genome. For this method to work, the reference genome must be represented by a single contig.
From the curator's main page, you need to do a query to find the isolate that you will base your numbering on. Click 'isolate query' to take you to a standard query form.
.. image:: /images/administration/genome_positions.png
Perform your search and click the hyperlinked id number of the record.
.. image:: /images/administration/genome_positions2.png
In the isolate record, click the sequence bin 'Display' button to bring up details of the isolate contigs.
.. image:: /images/administration/genome_positions3.png
Click the 'Renumber' button:
.. image:: /images/administration/genome_positions4.png
A final confirmation screen is displayed with the option to remove existing numbering that doesn't appear within the reference genome. Click 'Renumber'.
.. image:: /images/administration/genome_positions5.png
.. index::
   single: composite fields
*************************
Defining composite fields
*************************
Composite fields are virtual fields that don't themselves exist within the database but are made up of values retrieved from other fields or schemes and formatted in a particular way. They are used for display and analysis purposes only and cannot be searched against.
One example of a composite field is used in the Neisseria PubMLST database which has a strain designation composite field made up of serogroup, PorA VR1 and VR2, FetA VR, ST and clonal complex designations in the format:
[serogroup]: P1.[PorA_VR1],[PorA_VR2]: [FetA_VR]: ST-[ST] ([clonal_complex])
e.g. A: P1.5-2,10: F1-5: ST-4 (cc4)
Additionally, the clonal complex field in the above example is converted using a regular expression from 'ST-4 complex/subgroup IV' to 'cc4'.
Composite fields can be added to the database by clicking the add (+) composite fields link on the curator's main page.
.. image:: /images/administration/composite_fields.png
Initially, you just enter a name for the composite field and select the field after which it should be positioned. You can also set whether or not it should be displayed by default in main results tables following a query; this can be overridden by user preferences.
.. image:: /images/administration/composite_fields2.png
Once the field has been created it needs to be defined. This can be done from the query composite field link on the main curator's page.
.. image:: /images/administration/composite_fields3.png
Select the composite field from the list and click 'Update'.
.. image:: /images/administration/composite_fields4.png
From this page you can build up your composite field from snippets of text, isolate field, locus and scheme field values. Enter new values in the boxes at the bottom of the page.
.. image:: /images/administration/composite_fields5.png
Once a field has been added to the composite field, it can be edited by clicking the 'edit' button next to it to add a regular expression to modify its value by specific rules, e.g. in the clonal complex field above, the regular expression is set as: ::

  s/ST-(\S+) complex.*/cc$1/

which extracts one or more non-space characters following the 'ST-' in a string that then contains the word 'complex', and appends these to 'cc' to produce the final string.
This will convert 'ST-4 complex/subgroup IV' to 'cc4'.
You can also define text to be used for when the field value is missing, e.g. 'ND'.
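The way a composite field is assembled from text snippets, field values, regular expressions and missing-value text can be sketched as follows. This is only an illustration of the idea, not BIGSdb's implementation: the component list, the ``composite_value`` function, the missing-value strings other than 'ND', and the use of a code reference to hold the regular expression are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub cc_rewrite {
    my ($value) = @_;
    $value =~ s/ST-(\S+) complex.*/cc$1/;
    return $value;
}

# Hypothetical description of the strain designation composite field:
# literal text snippets interleaved with field values.
my @components = (
    { field => 'serogroup', missing => 'ND' },
    { text  => ': P1.' },
    { field => 'PorA_VR1', missing => 'ND' },
    { text  => ',' },
    { field => 'PorA_VR2', missing => 'ND' },
    { text  => ': ' },
    { field => 'FetA_VR', missing => 'F-ND' },
    { text  => ': ST-' },
    { field => 'ST', missing => 'ND' },
    { text  => ' (' },
    { field => 'clonal_complex', missing => '-', regex => \&cc_rewrite },
    { text  => ')' },
);

# Join the components for one record, using the missing-value text when a
# field has no value and applying the regular expression when one is set.
sub composite_value {
    my ( $components, $record ) = @_;
    my $out = '';
    foreach my $c (@$components) {
        if ( defined $c->{text} ) {
            $out .= $c->{text};
            next;
        }
        my $value = $record->{ $c->{field} };
        if ( !defined $value ) {
            $out .= $c->{missing} // '';
            next;
        }
        $value = $c->{regex}->($value) if $c->{regex};
        $out .= $value;
    }
    return $out;
}

# Made-up isolate record matching the example designation above.
my %isolate = (
    serogroup      => 'A',
    PorA_VR1       => '5-2',
    PorA_VR2       => '10',
    FetA_VR        => 'F1-5',
    ST             => 4,
    clonal_complex => 'ST-4 complex/subgroup IV',
);
print composite_value( \@components, \%isolate ), "\n";
```

For this record the sketch prints ``A: P1.5-2,10: F1-5: ST-4 (cc4)``, the same designation given as the example earlier in this section.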
.. index::
   pair: extended attributes; provenance fields
**********************************************
Extended provenance attributes (lookup tables)
**********************************************
Lookup tables can be associated with an isolate database field such that the database can be queried by extended attributes. An example of this is the relationship between continent and country - every country belongs to a continent but you wouldn't want to store the continent with each isolate record (not only could data be entered inconsistently but it's redundant). Instead, each record may have a country field and the continent is then determined from the lookup table, allowing, for example, a search of isolates limited to those from Europe.
To set up such an extended attribute, click the add (+) isolate field extended attributes link on the curator's main page.
.. image:: /images/administration/extended_attributes.png
Fill in the web form with appropriate values. Required fields have an exclamation mark (!) next to them:

* isolate_field - Dropdown list of isolate fields.

  * Allowed: selection from list.

* attribute - Name of extended attribute, e.g. continent.

  * Allowed: any text (no spaces).

* value_format - Format for values.

  * Allowed: integer/float/text/date.

* value_regex - Regular expression to enforce the format of attribute values.

  * ``^``: the beginning of the string
  * ``$``: the end of the string
  * ``\d``: digit
  * ``\D``: non-digit
  * ``\s``: white space character
  * ``\S``: non white space character
  * ``\w``: alpha-numeric plus '_'
  * ``.``: any character
  * ``*``: 0 or more of previous character
  * ``+``: 1 or more of previous character
  * e.g. ``^F\d-\d+$`` states that a value must begin with an F followed by a single digit, then a dash, then one or more digits, e.g. F1-12

* description - Long description - this isn't currently used but may be in the future.

  * Allowed: any text.

* url - URL used to hyperlink values in the isolate information page. Instances of [?] within the URL will be substituted with the value.

  * Allowed: any valid URL (either relative or absolute).

* length - Maximum length of extended attribute value.

  * Allowed: any positive integer.

* field_order - Integer that sets the order of the field following its parent isolate field.

  * Allowed: any integer.

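The effect of the value_regex and url settings can be illustrated with a short sketch. The two helper functions and the example URL are hypothetical and are not BIGSdb functions; a real implementation would also need to escape the value before placing it in a URL.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $value satisfies the value_regex
# (always true when no regular expression has been set).
sub value_is_valid {
    my ( $value, $value_regex ) = @_;
    return 1 if !defined $value_regex;
    return $value =~ /$value_regex/ ? 1 : 0;
}

# Hypothetical helper: replace every [?] placeholder in the url setting
# with the attribute value.
sub attribute_link {
    my ( $url, $value ) = @_;
    ( my $link = $url ) =~ s/\[\?\]/$value/g;
    return $link;
}

# The example expression from the list above.
my $regex = '^F\d-\d+$';
foreach my $value (qw(F1-12 F12-1 1-12)) {
    print "$value: ", ( value_is_valid( $value, $regex ) ? 'valid' : 'invalid' ), "\n";
}

# Made-up relative URL containing the [?] placeholder.
print attribute_link( '/continents/[?].html', 'Europe' ), "\n";
```

Only 'F1-12' passes the check here: 'F12-1' has two digits before the dash and '1-12' lacks the leading F.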
The easiest way to populate the lookup table is to do a batch update copied from a spreadsheet. Click the batch add (++) isolate field extended attribute values link on the curator's main page (this link will only appear once an extended attribute has been defined).
.. image:: /images/administration/extended_attributes2.png
Download the Excel template:
.. image:: /images/administration/extended_attributes3.png
Fill in the columns with your values, e.g.
+-------------+---------+-----------+------+
|isolate_field|attribute|field_value|value |
+=============+=========+===========+======+
|country      |continent|Afghanistan|Asia  |
+-------------+---------+-----------+------+
|country      |continent|Albania    |Europe|
+-------------+---------+-----------+------+
|country      |continent|Algeria    |Africa|
+-------------+---------+-----------+------+
|country      |continent|Andorra    |Europe|
+-------------+---------+-----------+------+
|country      |continent|Angola     |Africa|
+-------------+---------+-----------+------+
Paste from the spreadsheet into the upload form and click 'Submit'.
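Conceptually, the uploaded rows form a lookup from a field value to its extended attribute value. The sketch below shows this idea with the example rows above; the ``build_lookup`` function, the in-memory hash layout and the isolate records are assumptions made for illustration, and BIGSdb itself stores and queries these values in the database.

```perl
use strict;
use warnings;

# Tab-delimited rows in the column order of the table above:
# isolate_field, attribute, field_value, value.
my @rows = (
    "country\tcontinent\tAfghanistan\tAsia",
    "country\tcontinent\tAlbania\tEurope",
    "country\tcontinent\tAlgeria\tAfrica",
    "country\tcontinent\tAndorra\tEurope",
    "country\tcontinent\tAngola\tAfrica",
);

# Build a nested hash: isolate_field -> attribute -> field_value -> value.
sub build_lookup {
    my @lines = @_;
    my %lookup;
    foreach my $line (@lines) {
        my ( $isolate_field, $attribute, $field_value, $value ) = split /\t/, $line;
        $lookup{$isolate_field}{$attribute}{$field_value} = $value;
    }
    return \%lookup;
}

# Made-up isolate records that store only the country.
my @isolates = (
    { id => 1, country => 'Albania' },
    { id => 2, country => 'Angola' },
    { id => 3, country => 'Andorra' },
);

# Select isolates from Europe without storing the continent in each record.
my $lookup   = build_lookup(@rows);
my @european = grep { ( $lookup->{country}{continent}{ $_->{country} } // '' ) eq 'Europe' } @isolates;
print join( ', ', map { $_->{id} } @european ), "\n";
```

Because the continent is held once per country in the lookup, it cannot be entered inconsistently across isolate records.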
.. index::
   single: settings; validation
***********************
Sequence bin attributes
***********************
It is possible that you will want to store extended attributes for sequence bin contigs when you upload them. Examples may be read length, assembler version, etc. Since there are almost infinite possibilities for these fields, and they are likely to change over time, they are not hard-coded within the database. An administrator can, however, create their own attributes for a specific database and these will then be available in the web form when uploading new contig data. The attributes are also searchable.
To set up new attributes, click the add (+) 'sequence attributes' link on the isolate database curator's index page.
.. image:: /images/administration/sequence_attributes.png
Enter the name of the attribute as the 'key', select the type of data (text, integer, float, date) and an optional short description. Click 'Submit'.
.. image:: /images/administration/sequence_attributes2.png
This new attribute will then be available when uploading contig data.
.. image:: /images/administration/sequence_attributes3.png
*************************************************
Checking external database configuration settings
*************************************************
Click the 'Configuration check' link on the curator's index page.
.. image:: /images/administration/config_check.png
The software will check that required helper applications are installed and executable and, in isolate databases, test every locus and scheme external database to check for connectivity and that data can be retrieved.
.. image:: /images/administration/config_check2.png
Any problems will be highlighted with a red :red:`X`.