There are many different ways that a search on the website can be modified. Below is a list of the different accepted inputs and the parameters that can be modified. Also included are the parameter names that are required when using the API. This section is meant to be a guide to using the website, but further information can be found in the extensive HMMER guide. The parameter names used on the site are typically the same as the command line parameters, with the exception of the input data parameters. Each section is followed by a summary table that can be used as a quick reference.
The searches on the website, when used in the simple mode, hide most of the search parameters and default values are used. Below is a list of the parameters and values used in the default search for each algorithm:
phmmer, hmmscan and jackhmmer take a single protein amino acid sequence as the input, controlled by the seq parameter. The website currently accepts valid FASTA format or simply the amino acid sequence. Alternatively, you can query by sequence accession or identifier (option selected above the input textfield), which will offer suggestions as the name is typed. If you are using the API, then the parameter acc should be set, regardless of whether it is an accession or identifier.
| Parameter Name | seq | acc |
|---|---|---|
| Description | Sets the query sequence | |
| Algorithm(s) | phmmer, hmmscan, jackhmmer | |
| Accepted Values | Protein sequence (FASTA format) | An accession or identifier from one of the supported databases |
| Default | None | None |
| Required | Yes - seq or acc | |
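As a rough illustration of the seq/acc rule in the table above, the following Python sketch builds the input-data parameters for a single-sequence search. The helper name `build_query_params` is hypothetical, no request is sent, and treating "both supplied" as an error is an assumption rather than documented server behaviour.

```python
from urllib.parse import urlencode


def build_query_params(seq=None, acc=None):
    """Return the input-data parameters for phmmer, hmmscan or jackhmmer.

    Assumption: exactly one of `seq` (FASTA or bare residues) or `acc`
    (accession or identifier) is supplied.
    """
    if (seq is None) == (acc is None):
        raise ValueError("supply exactly one of seq or acc")
    if seq is not None:
        return {"seq": seq}
    return {"acc": acc}


# A made-up FASTA record, encoded only to show how the parameter would travel.
fasta = ">example\nMKVLAAGIVGLLLAQ"
body = urlencode(build_query_params(seq=fasta))
```

The same dictionary can be extended with the search parameters described in the rest of this section before it is encoded.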
hmmsearch and jackhmmer can take a multiple protein sequence alignment as an input. The alignment formats currently accepted are:
Currently the parameter for this field is seq, when using the API. We plan to change this.
We have provided example links next to the sequence, multiple sequence alignment and HMM input areas which, when clicked, insert a sequence, alignment or profile HMM. These examples have been chosen to show a result set that demonstrates the various features available on the results pages.
The four different search algorithms can be modified in a variety of ways by changing the search parameters under the advanced option. The different options available are summarised below:
| Algorithm | Sequence Database | HMM Database | Cut-off: E-value | Cut-off: Bit score | Cut-off: Gathering | Gap penalty: Open | Gap penalty: Extend | Gap penalty: Matrix | Filter: No bias |
|---|---|---|---|---|---|---|---|---|---|
| phmmer | x | - | x | o | - | x | x | x | o |
| hmmscan | - | x | x | o | o | - | - | - | o |
| hmmsearch | x | - | x | o | - | - | - | - | o |
| jackhmmer | x | - | x | o | - | o | o | o | o |
Key: x = required/default, - = not applicable, o = optional
The sequence database field changes which target sequence database is searched. On the website, the default varies depending on which continent the HTTP request comes from: for requests from North America, NR is searched, while for the rest of the world UniProt is the default. This is one of the few parameters that is required by phmmer, hmmsearch and jackhmmer.
| Parameter Name | seqdb |
|---|---|
| Description | Sets the target database |
| Algorithm(s) | phmmer, hmmsearch, jackhmmer |
| Accepted Values | nr, uniprotkb, swissprot, pdb, env_nr, unimes, rp |
| Default | nr or uniprotkb (depending on where the request comes from, see above) |
| Required | Yes |
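The table above can be turned into a small client-side check before a request is built. This is only a sketch: the function name `check_seqdb` is hypothetical, and the accepted values are copied from the table rather than queried from the server.

```python
# Values taken from the "Accepted Values" and "Algorithm(s)" rows above.
SEQ_DATABASES = {"nr", "uniprotkb", "swissprot", "pdb", "env_nr", "unimes", "rp"}
SEQDB_ALGORITHMS = {"phmmer", "hmmsearch", "jackhmmer"}


def check_seqdb(algorithm, seqdb):
    """Return the seqdb parameter if it is valid for the given algorithm."""
    if algorithm not in SEQDB_ALGORITHMS:
        raise ValueError(f"{algorithm} does not search a sequence database")
    if seqdb not in SEQ_DATABASES:
        raise ValueError(f"unknown sequence database: {seqdb}")
    return {"seqdb": seqdb}
```

Because the website default depends on where the request originates, passing seqdb explicitly is the simplest way to get the same target database from any location.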
This field indicates which profile HMM database the query should be searched against. Currently, there is only one profile HMM database, Pfam.
| Parameter Name | hmmdb |
|---|---|
| Description | Sets the target hmm database |
| Algorithm(s) | hmmscan |
| Accepted Values | pfam |
| Default | pfam |
| Required | Yes |
All four algorithms allow two categories of cut-off to be set: significance and reporting thresholds. These cut-offs can be defined either as E-values (the default option) or as bit scores. Each category has two values: one for the sequence and one for the hit. A query can match a target in multiple places; each such match is a hit (or domain) and has its own score. The sum of all hits on the sequence is the sequence score.
For example, matching repeating motifs is often difficult because of sequence variation in the repeated motif. However, it can be possible to capture all examples of the motif by relaxing the hit parameter while maintaining a stringent sequence parameter. Multiple matches can then be detected even if they are individually weak, but the sum of these matches must still be sufficient to reach the sequence score threshold, thereby limiting the rate of false positives.
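The interplay between the hit and sequence parameters can be sketched in a few lines of Python. The bit scores below are invented for illustration, the function name is hypothetical, and the sequence score is simply the sum of the hit scores, following the simplified description above rather than HMMER's actual scoring.

```python
def evaluate_sequence(hit_scores, hit_threshold, sequence_threshold):
    """Return the hits that pass and whether the whole sequence passes."""
    kept = [score for score in hit_scores if score >= hit_threshold]
    sequence_score = sum(hit_scores)  # simplified: sum of all hit scores
    return kept, sequence_score >= sequence_threshold


# Four weak copies of a repeat, none of which is convincing on its own.
weak_repeats = [8.0, 7.5, 9.0, 6.5]

# Relaxed hit threshold, stringent sequence threshold: all repeats are kept.
relaxed = evaluate_sequence(weak_repeats, hit_threshold=6.0, sequence_threshold=25.0)

# A stringent hit threshold would hide every individual repeat.
strict = evaluate_sequence(weak_repeats, hit_threshold=22.0, sequence_threshold=25.0)

# A lone weak match does not reach the sequence threshold.
single = evaluate_sequence([8.0], hit_threshold=6.0, sequence_threshold=25.0)
```

With these made-up numbers the relaxed setting keeps all four repeats on a sequence that passes, while a single weak match on an otherwise unrelated sequence is rejected.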
Significance (or inclusion) thresholds are stricter than reporting thresholds and take precedence over them. These determine whether a sequence/hit is significant or not.
Sequence and hit significance E-value thresholds mark matches with E-values less than or equal to the cut-off E-value as significant. The default E-value threshold is 0.01 for sequences and 0.03 for hits. If you are using the API, the incE and incdomE parameters are used to set the sequence and hit E-value thresholds respectively. In the absence of any threshold parameters the server will default to using E-value thresholds with the defaults just described.
| Parameter Name | incE | incdomE |
|---|---|---|
| Description | Sequence E-value threshold | Hit E-value threshold |
| Algorithm(s) | phmmer, hmmscan, hmmsearch, jackhmmer | |
| Accepted Values | 0<x≤10 | 0<x≤10 |
| Default | 0.01 or set to hit threshold if present. | 0.03 or set to sequence threshold if present. |
| Required | No | No |
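The "Default" row above describes a small fallback rule, which might be sketched as follows. The function name is hypothetical and the code only mirrors the table; it is not the server's implementation.

```python
def resolve_inclusion_evalues(incE=None, incdomE=None):
    """Fill in missing significance E-value thresholds as the table describes."""
    if incE is None and incdomE is None:
        incE, incdomE = 0.01, 0.03  # documented defaults
    elif incE is None:
        incE = incdomE  # sequence threshold copies the hit threshold
    elif incdomE is None:
        incdomE = incE  # hit threshold copies the sequence threshold
    for name, value in (("incE", incE), ("incdomE", incdomE)):
        if not 0 < value <= 10:
            raise ValueError(f"{name} must satisfy 0 < x <= 10, got {value}")
    return {"incE": incE, "incdomE": incdomE}
```

For instance, calling `resolve_inclusion_evalues(incE=0.001)` yields 0.001 for both thresholds, whereas calling it with no arguments returns the 0.01 and 0.03 defaults.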
Alternatively, the sequence and hit significance thresholds can be specified as bit scores. Any sequence or hit scoring greater than or equal to the given threshold will be considered significant. By default, the form on the website is filled with typical values, with the sequence cut-off set to 25.0 bits and the hit cut-off set to 22.0 bits. If you are using the API, the incT and incdomT parameters are used to set the sequence and hit bit thresholds respectively. This threshold is not used by default. If only one of these two parameters is set, then the unassigned parameter is set to the value of the assigned one.
| Parameter Name | incT | incdomT |
|---|---|---|
| Description | Sequence bit score threshold | Hit bit score threshold |
| Algorithm(s) | phmmer, hmmscan, hmmsearch, jackhmmer | |
| Accepted Values | x>0 | x>0 |
| Default | 25.0 | 22.0 |
| Required | No | No |
The reporting thresholds control how many matches that fall below the significance threshold are still shown in the results (i.e. reported). As every entity in the target database is compared to the query, reporting all matches could generate vast outputs. However, it can often be useful to view borderline matches, as they may reveal more distant but potentially informative similarities to the model. As with the significance thresholds, there is a value for both the sequence and the hit, which again can be defined as either an E-value or a bit score. Such reported matches are indicated by a yellow background in the results table on the website.
Any sequence or hit with an E-value less than or equal to that given threshold will be reported. By default, the reporting sequence E-value will be set to 1 and the hit cut-off set to 1. In the API, the E and domE parameters are used to set the sequence and hit E-value reporting thresholds respectively. If significance thresholds are set, yet either or both reporting thresholds are undefined, the default form values will be set server side.
| Parameter Name | E | domE |
|---|---|---|
| Description | Sequence E-value threshold (reporting) | Hit E-value threshold (reporting) |
| Algorithm(s) | phmmer, hmmscan, hmmsearch, jackhmmer | |
| Accepted Values | 0<x≤10 | 0<x≤10 |
| Default | 1 or set to hit threshold if present. | 1 or set to sequence threshold if present. |
| Required | No | No |
The sequence and hit reporting thresholds can also be specified as bit scores. Any sequence or hit scoring greater than or equal to that given threshold will be reported. By default, the form on the website is filled with typical values, with the sequence cut-off set to 7.0 bits and the hit cut-off set to 5.0 bits. If you are using the API, the T and domT parameters are used to set the sequence and hit bit thresholds respectively. If significance thresholds are set, yet either or both reporting thresholds are undefined, these default form values will be set server side.
| Parameter Name | T | domT |
|---|---|---|
| Description | Sequence bit score threshold (reporting) | Hit bit score threshold (reporting) |
| Algorithm(s) | phmmer, hmmscan, hmmsearch, jackhmmer | |
| Accepted Values | x>0 | x>0 |
| Default | 7.0 | 5.0 |
| Required | No | No |
Note: The two threshold scoring systems cannot be mixed between significance and reporting, i.e. choose either significance and reporting E-values or significance and reporting bit scores when setting thresholds.
Specific to hmmscan, the gathering threshold tells HMMER to use the sequence and hit thresholds defined in the HMM file. In the case of Pfam, these are set conservatively to ensure that there are no known false positives. Thus, if a query sequence scores with a bit score greater than or equal to the gathering thresholds, then that match can be treated with high confidence. This threshold is the default setting for hmmscan. If you are using the API, you can use the cut_ga parameter to signify that the gathering threshold should be used.
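A client could enforce this rule before submitting a search. The sketch below is a hypothetical helper, not part of the API: it inspects which of the threshold parameter names from the tables above are present and rejects a mixture of E-value and bit score parameters.

```python
# Parameter names taken from the threshold tables in this section.
SIGNIFICANCE = {"E-value": {"incE", "incdomE"}, "bit score": {"incT", "incdomT"}}
REPORTING = {"E-value": {"E", "domE"}, "bit score": {"T", "domT"}}


def threshold_system(params):
    """Return the scoring system used by `params`, or raise if they are mixed."""
    supplied = set(params)
    systems = set()
    for group in (SIGNIFICANCE, REPORTING):
        for system, names in group.items():
            if names & supplied:
                systems.add(system)
    if len(systems) > 1:
        raise ValueError("use either E-value or bit score thresholds, not both")
    # With no threshold parameters the server falls back to E-values.
    return systems.pop() if systems else "E-value"
```

Here `threshold_system({"incE": 0.001, "domE": 1})` is accepted, while combining `incE` with `T` raises an error.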
| Parameter Name | threshold |
|---|---|
| Description | Sets the threshold to gathering threshold |
| Algorithm(s) | hmmscan |
| Accepted Values | cut_ga |
| Required | No |
You can customize the results table to display different columns and/or to restrict the number of rows in the table to a manageable amount. This can be performed before or after your search, with the customization stored in a cookie so that you will not have to keep re-configuring the table after each search. The customization of results is discussed further here.
These are specific to phmmer and jackhmmer (initiated with a single sequence).
The open parameter (called popen in HMMER) sets the probability of opening a gap in an alignment between the target sequence and the model (or query sequence). The default value is 0.02, but it can be set anywhere from 0 (no gaps) to less than 0.5 (more likely to open a gap).
The extend parameter (called pextend in HMMER) sets the probability for extending the gap for a target sequence against the model or query sequence. The default value is 0.4, but can be set anywhere from 0 (less likely to extend) to less than 1 (more likely to extend the gap).
When using phmmer, the query is a single sequence so the residue alignment probabilities are calculated from a substitution matrix. Substitution matrices provide scores that indicate the likelihood of two aligned amino acids appearing due to conservation rather than by chance. There are five different matrices available for selection: BLOSUM45, BLOSUM62 (default), BLOSUM90, PAM30 and PAM70. The BLOSUM matrices are based on observed alignments between amino acids in the BLOCKS database, whereas the PAM matrices have been extrapolated from comparisons of closely related proteins. The different matrices alter the stringency of the alignment, e.g. PAM70 can be used to find more distantly related sequences than PAM30, as PAM30 is more stringent; BLOSUM62 can be used to find more closely related sequences than BLOSUM45, as BLOSUM45 is less stringent.
| Parameter Name | popen | pextend | mx |
|---|---|---|---|
| Description | Gap open penalty | Gap extend penalty | Substitution matrix |
| Algorithm(s) | phmmer, jackhmmer | ||
| Accepted Values | 0≤x<0.5 | 0≤x<1 | BLOSUM45, BLOSUM62, BLOSUM90, PAM30, PAM70 |
| Default | 0.02 | 0.4 | BLOSUM62 |
| Required | Yes - set to default server side if absent | ||
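The ranges and defaults in the table above can be captured in a short validation helper. This is a sketch with a hypothetical function name; it applies the documented defaults when a value is omitted, in the same spirit as the server-side behaviour noted in the "Required" row.

```python
MATRICES = ("BLOSUM45", "BLOSUM62", "BLOSUM90", "PAM30", "PAM70")


def gap_and_matrix_params(popen=0.02, pextend=0.4, mx="BLOSUM62"):
    """Validate the phmmer/jackhmmer gap and matrix settings."""
    if not 0 <= popen < 0.5:
        raise ValueError(f"popen must satisfy 0 <= x < 0.5, got {popen}")
    if not 0 <= pextend < 1:
        raise ValueError(f"pextend must satisfy 0 <= x < 1, got {pextend}")
    if mx not in MATRICES:
        raise ValueError(f"mx must be one of {', '.join(MATRICES)}")
    return {"popen": popen, "pextend": pextend, "mx": mx}
```

Calling `gap_and_matrix_params(mx="PAM30")` keeps the default gap probabilities and only changes the substitution matrix.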
Turning off the bias composition filter can increase sensitivity, but at a high cost in speed, especially if the query has biased residue composition (such as a repetitive sequence region, or a membrane protein with large regions of hydrophobicity). Without the bias filter, too many sequences may pass the filter with biased queries, leading to slower than expected performance; hence it is switched on by default. This feature can be disabled using the nobias parameter.
| Parameter Name | nobias |
|---|---|
| Description | Turns off the bias composition filtering. |
| Algorithm(s) | phmmer, hmmscan, hmmsearch, jackhmmer |
| Accepted Values | 1 |
| Required | No |
By default when performing a phmmer search via the website (and when JavaScript is enabled), a default hmmscan search against the Pfam HMM library is also performed. This feature is not available via the API, but can be mimicked by making separate requests to phmmer and hmmscan.
It is also possible to search protein sequences in a batch mode, rather than pasting sequence after sequence. For both phmmer and hmmscan, files containing multiple sequences in FASTA format can be uploaded via the "Upload a file" link. These sequences will then be searched, in turn, against the specified databases. Note that we have put a limit of 500 sequences per batch request. This is because we need to stop the servers getting overloaded parsing huge sequence files, but there is nothing stopping you from submitting multiple requests. Once the job is submitted, a slightly different results page will be returned, showing a table with each row in that table representing a sequence in your file. This table periodically updates, indicating the progress of your batch job. As results appear in the table, you can view the details. If you have many sequences, you can also request that an e-mail be sent when the batch job has completed.
The jackhmmer batch system operates in a slightly different manner. Under the advanced settings you can select the number of iterations to be performed, and the batch mode will automatically run through each iteration (or until convergence), taking the results and using all the sequences scoring above the significance thresholds to generate the input multiple sequence alignment for the next round. Only one sequence, multiple sequence alignment or profile HMM can be submitted at a time. It is also possible to use the batch mode for hmmsearch, again with a single multiple alignment or profile HMM.
The batch system also works via the API, except that the file parameter is used in place of the seq parameter; the other parameters remain the same. An e-mail notification can be requested using the email parameter.
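Since a batch request is limited to 500 sequences, a larger FASTA file has to be divided before upload. The following sketch shows one way this might be done; the function name `split_fasta` is hypothetical and the parsing is deliberately minimal (a record starts at each line beginning with ">").

```python
def split_fasta(text, max_records=500):
    """Split FASTA text into chunks of at most `max_records` sequences."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            # A new header closes the record collected so far.
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return [
        "\n".join(records[i:i + max_records]) + "\n"
        for i in range(0, len(records), max_records)
    ]


# 1201 made-up records end up in three chunks of 500, 500 and 201 sequences.
example = "".join(f">seq{i}\nMKV\n" for i in range(1201))
chunks = split_fasta(example)
```

Each chunk can then be written to its own file and submitted as a separate batch job.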
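The iteration logic can be pictured with a toy loop. Everything in this sketch is a stand-in: `search_round` is a hypothetical callable that returns the identifiers of sequences above the significance thresholds, the "model" is just a sorted list of identifiers rather than an alignment, and convergence is assumed to mean that a round returns the same set of sequences as the previous one.

```python
def iterate(search_round, query, max_iterations=5):
    """Run rounds until the significant set stops changing or the limit is hit."""
    included = frozenset()
    model = [query]
    for iteration in range(1, max_iterations + 1):
        significant = frozenset(search_round(model))
        if significant == included:
            return included, iteration, True  # converged
        included = significant
        model = sorted(included)  # stand-in for building the next alignment
    return included, max_iterations, False


# Toy data: which sequences each model member is able to find.
toy_neighbours = {"q": {"a"}, "a": {"a", "b"}, "b": {"b"}}


def toy_search(model):
    hits = set()
    for member in model:
        hits |= toy_neighbours.get(member, set())
    return hits


result = iterate(toy_search, "q")
```

In this toy case the second round picks up an extra sequence through the first round's hit, and the third round finds nothing new, so the loop stops early.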
A bit score in HMMER is the log of the ratio of the sequence's probability according to the profile (the homology hypothesis) to the null model probability (the non-homology hypothesis).
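Written as code, the definition above is a single line. The sketch below assumes a base-2 logarithm, which is what makes the unit a bit, and uses invented probabilities; in practice these probabilities are far too small to handle directly and log-space arithmetic would be used instead.

```python
import math


def bit_score(p_profile, p_null):
    """Log-odds score in bits: log2 of P(seq | profile) / P(seq | null)."""
    return math.log2(p_profile / p_null)


# A sequence that is 8 times more probable under the profile scores 3 bits.
score = bit_score(2 ** -10, 2 ** -13)
```

A positive score means the profile explains the sequence better than the null model, and each additional bit doubles the odds in favour of homology.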
An E-value (expectation value) is the number of hits that would be expected to have a score equal to or better than this by chance alone. A good E-value is much less than 1. For example, an E-value of 0.01 means that on average about 1 false positive would be expected in every 100 searches with different query sequences. An E-value around 1 is what we expect just by chance. E-values are widely used because the E-value alone is enough to judge the significance of a match, but note that they vary according to the size of the target database.
Also called the gathering cut-off, the gathering threshold actually consists of two bit scores, a sequence cut-off and a domain cut-off, used to define the significance of a sequence and a hit respectively. These are defined in the profile HMM and set both significance and reporting thresholds so that no insignificant hits are reported.
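The dependence on database size can be made concrete with the standard relationship between a per-comparison P-value and an E-value, namely the P-value multiplied by the number of targets searched. The function name and the numbers below are illustrative only.

```python
def e_value(p_value, n_targets):
    """Expected number of chance hits at this score across the whole database."""
    return p_value * n_targets


# The same per-sequence P-value in a small and a large database.
small_db = e_value(1e-6, 10_000)
large_db = e_value(1e-6, 1_000_000)
```

The identical match is expected roughly once in 100 searches against the smaller database but about once per search against the larger one, which is why E-values from different target databases are not directly comparable.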
The "null model" calculates the probability that the target sequence is not homologous to the query profile and is a one-state HMM configured to generate "random" sequences of the same mean length L as the target sequence, with each residue drawn from a background frequency distribution (a standard i.i.d. model: residues are treated as independent and identically distributed). This background frequency is based on the mean residue frequencies in Swiss-Prot 50.8 (October 2006).
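A minimal sketch of such a null model is shown below. It assumes a uniform background of 1/20 per residue purely for illustration (not the Swiss-Prot frequencies mentioned above) and uses a self-transition probability of L/(L+1), which gives a one-state model a mean emitted length of L. The function name is hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background, standing in for the real residue frequencies.
UNIFORM_BACKGROUND = {aa: 1 / 20 for aa in AMINO_ACIDS}


def null_log2_probability(sequence, background=UNIFORM_BACKGROUND):
    """Log2 probability of `sequence` under a one-state i.i.d. null model."""
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    stay = length / (length + 1)  # self-loop probability for mean length L
    length_term = length * math.log2(stay) + math.log2(1 - stay)
    residue_term = sum(math.log2(background[aa]) for aa in sequence)
    return length_term + residue_term
```

Because every residue is drawn independently, the residue term is just a sum of per-residue log frequencies, so the order of residues in the sequence has no effect on the null probability.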
Profile hidden Markov Models (HMMs) are a way of turning a multiple sequence alignment into a position-specific scoring system, which is suitable for searching databases for remotely homologous sequences.
STOCKHOLM format is a multiple sequence alignment format supported by HMMER.