Future Developments
In the next major release of the site, we aim to increase the scope and flexibility
of target databases.
Latest Changes
Version 1.1, April 2012
Release 1.1 of the site is largely a performance and bug fix release. We have updated the code
behind the scenes to reduce a number of bottlenecks in producing the results, which should result
in shorter load times.
Bug Fixes
- Better error messages - A number of the error messages that we return have
been expanded to provide assistance in preventing the error from occurring.
- Pfam domains - Fixed a bug that was not properly setting the Pfam type
definition (Family, Domain, Repeat, or Motif) as part of the hmmscan post-processing.
Pfam entries tagged with the type 'Repeat' or 'Motif' will now be represented with straight
edges, similar to how they appear on the Pfam website.
Version 1.0, March 2012
The website has been stable over the last 6 months - this is about to change
with this release of the website. Nearly every aspect of the underlying code
has been altered to make improvements and to allow for iterative searches.
New features
- Jackhmmer - We have now implemented the iterative protein search
algorithm from HMMER, jackhmmer. The website version of jackhmmer, unlike the command line
version, takes either a single sequence, multiple sequence alignment or profile HMM as input.
It is possible to run up to five iterations, or successive searches, where the
aligned results of one search are used as the input for the next. With the website,
you can interactively include or exclude sequence matches from the results, so that
they will or will not be used in the following searches.
- Downloads - We have expanded the different alignment formats
to include Clustal, PSI-BLAST and PHYLIP formats. Furthermore, searches initiated
with a single sequence will have the query sequence included in the output
multiple sequence alignment.
- Search interface - The target sequence/HMM database and
thresholds are now displayed on the standard search page. Previously, they were
hidden under the advanced search option.
- Validation - We have changed the way that we do input validation to
make use of some of the latest developments in the Easel software suite that
is utilised by HMMER. It is now possible to upload the following alignment formats:
- Aligned FASTA
- Clustal (and Clustal-like)
- PSI-BLAST
- PHYLIP
- Selex
- GCG/MSF
- STOCKHOLM
- A2M
The website now performs auto-detection on the data input fields, so you do not
have to make the distinction between whether something is a single sequence,
multiple sequence alignment or profile HMM.
- Hmmscan - By default, hmmscan results are expanded to display
a full list of domain hits and their E-values. We have also included
Pfam Clan information.
- Search again - The search again button on the right side of the
search results page navigation bar has been modified to allow you to use the results
of the current search as input for a new search, rather than having to go via a download,
followed by an upload.
- API documentation - With the introduction of jackhmmer,
we have included a description and an example of how to use the API version of this search.
This is a bit more complicated as the job actually comprises multiple searches.
We will be producing some more examples and discussing them via the blog.
- Documentation - We have updated the search/result documentation
in the help section, to describe the use of jackhmmer and the changes to the hmmscan
results table.
- Homepage - The homepage has been modified to include fast access
to the different search algorithms, client OS detection for HMMER binary download
and the inclusion of HMMER tagged blog posts from Cryptogenomicon.
Version 0.9.5, October 2011
New features
- Domain Architecture - this release has focused on producing
a way of visualizing results from phmmer or hmmsearch searches
according to the domain architecture of the sequences matched. This visualization is based
on the domain architecture view used for
Pfam families, where sequences with the same domains found on them are grouped together.
For each architecture, a representative sequence is drawn. The position of the
query sequence match is also shown. We feel that this is a very powerful tool for
the identification of new domains and for understanding potential function when the
query contains domains that are distantly related to those found in Pfam.
Version 0.9, July 2011
New features and improvements
- Taxonomy visualization - the biggest and most important new feature in
this release is the ability to visualize results according to taxonomy. With this
view, which is dependent on JavaScript, the matches to the query are displayed
on a taxonomic tree. Below each level (or node) in the taxonomy tree is
a hit distribution graphic, similar to that shown on the score page. This
sparkline-like graph gives an idea of the distribution of your hits at
that point. The tree is interactive, allowing interrogation of the results within
the tree. Descending the tree levels reduces the numbers of species in the
table below it, with only those species belonging to the visible root of the tree
displayed. It is possible to use the tree to filter results, showing only those
belonging to a specific clade in the score page, by selecting 'show all' at the
bottom of the table.
- Target databases - We have updated all of the underlying sequence databases:
- Pfam - updated to Pfam release 25.0.
- UniProt - SwissProt, Unimes and UniProt versions updated to release 2011_07.
- NCBI data - Updated NR and environmental division of NR to July release.
- PDB - downloaded July 19 2011.
- Representative proteomes - We now include the set of representative proteomes
produced at PIR, which is
based on the July version of UniProt. This dataset contains complete proteomes and
reference viral strains. This new target database works particularly well with
the taxonomy view as the data is redundant according to the species, so it
becomes easy to observe expansions in different lineages.
- Percentage identity and similarity - scores, familiar from BLAST outputs, are now
available for the hits alongside the alignments. The number of aligned positions over which each
is calculated is also included in the table.
- Hit Position graphic - users can now include a small graphic showing the position
of the hits between the query and target sequence. Permuted hit positions are color
coded, highlighting their presence.
- Structure links - we now link to both RCSB and PDBe, so either site can be used to explore the structure.
- Multiple sequence alignments - results can now be downloaded as multiple
sequence alignments. Two formats are currently supported - STOCKHOLM and aligned FASTA.
These can be accessed via the downloads link found at the bottom of the search results
page.
- Downloads section - since release 0.8 there have been a number of user requests
regarding the ability to download results in different formats. In addition to
the alignment formats, you can now download the hit regions or full length sequences
in FASTA format. We have also moved the previous text, JSON and XML formats to the
downloads section, which is linked from the bottom of the search results page.
- HMMER - We have upgraded to a new, internal version of HMMER. On the positive side,
an error in the bias composition pipeline has been fixed (which means
faster phmmer searches). This version also allows us to sort results, such that
the highest scoring match always appears at the top. In the previous release, several high
scoring sequences could all score with E-values approximating to zero, and their order
was arbitrary. Finally, and by no means least, phmmer search results now show
the original query sequence, rather than the most probable sequence. This was causing
methionines to be replaced with leucines as they are five times more likely - Sean says
that is an expected feature - albeit possibly not what people were expecting. The negative
point is that you will not be able to retrieve old results as the data structure has changed,
making it tricky for us to retrofit the website to deal with both results formats.
- Show all - Another feature request was the ability to show all alignments. This
feature is only available when requesting 100 or fewer results per page in the customizable
view.
- Twitter - We now have a HMMER twitter account. Following us here will allow you to
easily follow developments and data updates. We will also use this to inform you
if we are experiencing any problems with our servers.
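The percentage identity figure described in the list above can be illustrated with a minimal sketch. It assumes that identity is computed over the alignment columns in which neither sequence has a gap; the function name and the gap characters are illustrative, and the website's exact definition may differ.

```python
def percent_identity(query_aln, target_aln, gaps="-."):
    """Return (percent identity, number of aligned positions) for two
    equal-length aligned strings. Columns with a gap in either row are skipped."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for q, t in zip(query_aln, target_aln):
        if q in gaps or t in gaps:
            continue  # not an aligned position
        aligned += 1
        if q.upper() == t.upper():
            identical += 1
    if aligned == 0:
        return 0.0, 0
    return 100.0 * identical / aligned, aligned
```

Returning the number of aligned positions alongside the percentage mirrors the table, where a high identity over very few positions should be read with caution.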
Bug Fixes
Not so many bug fixes this time:
- Sequence alignment uploads - were still occasionally being rejected by HMMER,
or worse still giving 500 responses. We have put in a temporary fix to try to identify
when HMMER is not going to accept an alignment - this is often caused by duplicate
sequence labels, trailing tabs and so on. The Easel library that deals with alignment parsing
is currently undergoing a major overhaul in readiness for HMMER 3.1.
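A pre-check of the kind described above might look like the following sketch. It is not the site's actual code: the function name is hypothetical, and it assumes a single, non-interleaved block of lines of the form "label sequence" (interleaved formats legitimately repeat labels in each block).

```python
def find_alignment_problems(lines):
    """Return a list of human-readable problems found in alignment lines
    of the form '<label> <aligned sequence>' (one non-interleaved block)."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if line.rstrip("\r\n").endswith("\t"):
            problems.append(f"line {number}: trailing tab")
        stripped = line.strip()
        # Skip blank lines, markup/comment lines and the end-of-alignment marker.
        if not stripped or stripped.startswith("#") or stripped == "//":
            continue
        label = stripped.split()[0]
        if label in seen:
            problems.append(f"line {number}: duplicate sequence label '{label}'")
        seen.add(label)
    return problems
```

Reporting the line number with each problem is what makes it possible to return an informative message rather than a generic failure.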
Version 0.8, April 2011
New features and improvements
- We have re-factored the results displays to make them more flexible to users'
needs:
- You can now customize the results table. This
includes the switching on and off of columns and the selection of
different numbers of results to display in the table. Customization
can be performed before or after the search is performed.
- The results are now paginated, if the number of
matches exceeds the requested number of rows. The hit
distribution graph can still be used to navigate the results,
jumping to the appropriate page and row.
- The multiple sequence alignments are now blocked.
This uses a bit more of the screen real estate, but on most screens,
the complete pairwise alignment should be instantly visible without
the need for horizontal scrolling. We have also changed the position of the
posterior probability line, such that it is below the aligned sequences.
- Color coding of results, according to whether the sequence falls
below the sequence significance threshold (yellow), or where the
sequence scores above threshold, but where no domains/hits score above
the domain significance thresholds (red).
- Where appropriate, we indicate where the sequence threshold lies
for phmmer and hmmsearch results with a bold red line across the table.
- It is now possible to turn off HMMER's bias filter.
- We have removed some of the fields for the sequence and domain hit
sections, which will only be evident if you use the RESTful API.
These fields were all internal to HMMER. We have added the individual E-value field
to the domain section.
- Behind the scenes we have also been working on speeding
up the server. Nothing spectacular here, but we have more than halved the
time spent by the server performing validation and post-processing.
HMMER really is the bottleneck!
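Jumping from a point on the hit distribution graph to the appropriate page and row, as described in the pagination item above, comes down to simple integer arithmetic. The sketch below assumes a 0-based hit index and 1-based page and row numbers; the function name is hypothetical.

```python
def locate_hit(hit_index, rows_per_page):
    """Map a 0-based hit index to a (page, row) pair, both 1-based."""
    if hit_index < 0:
        raise ValueError("hit_index must not be negative")
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    page, row = divmod(hit_index, rows_per_page)
    return page + 1, row + 1
```

With 100 rows per page, for example, the 251st hit (index 250) lands on page 3, row 51.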
Bug Fixes
- Submission of more than a single sequence in the search textfield was
breaking the queuing system, with some jobs being entered into the queue,
but never being run as they did not pass the quality control checks.
- We noticed that some of the batch jobs were not being run due to the pending
interactive jobs. Interactive jobs will be ignored after 15 minutes.
- There have been some issues validating multiple sequence alignment formats.
We have tightened up the validation and provided some more informative error
messages to help users diagnose problems. We have seen PHYLIP format entered
a few times, but this is not currently supported.
- Long sequences have also been causing some issues when running
phmmer. There are two technical issues with long query sequences.
First, they can take longer than the default 30 seconds
to run, so users submitting long queries
were getting error messages. We have now replaced the fixed 30 second time-out
with a limit that is set dynamically according to the query length.
Second, alignments between a long query
and a long target consume many gigabytes of memory. Consequently, we have
had to stop searches with queries longer than 10,000 amino acids as they
were crashing the back-end search engine.
This is only a stop gap fix, while the HMMER software is optimized to deal
with this sort of search. This only affects approximately 200 sequences
currently found in the large database collections, which are primarily
examples of Titin.
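The two long-sequence measures above (a time-out that grows with the query length, and a hard limit of 10,000 residues) could be combined as in this sketch. The 30 second floor and the 10,000 residue limit come from the text; the scaling rate is a made-up value for illustration only, as the actual formula is not given here.

```python
MAX_QUERY_LENGTH = 10_000    # residues; the limit stated in the text
BASE_TIMEOUT_SECONDS = 30.0  # the previous fixed time-out stated in the text
RESIDUES_PER_SECOND = 20     # hypothetical scaling rate, for illustration only

def search_timeout(query_length):
    """Return a time-out in seconds for a query of the given length,
    or raise ValueError if the query cannot be searched at all."""
    if query_length <= 0:
        raise ValueError("query must contain at least one residue")
    if query_length > MAX_QUERY_LENGTH:
        raise ValueError(
            f"queries longer than {MAX_QUERY_LENGTH} residues are not accepted"
        )
    # Never go below the old fixed time-out; scale up for long queries.
    return max(BASE_TIMEOUT_SECONDS, query_length / RESIDUES_PER_SECOND)
```

Rejecting over-long queries before they are queued keeps them away from the back-end search engine entirely, which is the point of the stop-gap.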
Version 0.7.1, March 2011
New features and improvements
- Hit distribution graph - the results pages for phmmer and
hmmsearch now contain a graph that shows both the distribution of matches in
terms of E-value and the breakdown of hits according to taxonomic
kingdoms.
- Structure links - we now indicate when a structure is
known for a sequence or even part of the sequence. This uses the
SIFTS
resource.
- Dynamic table columns - The results table contains many
different items of information, making it quite wide. As the browser window
is made smaller, some of the columns from the results table will be omitted,
so that the essential information will still be visible.
- View of redundant sequences - Many of the sequence
databases that we support contain 100% identical sequences. Rather than
showing each identical sequence, which would inflate the results table even
more, we show just one sequence. However, we now indicate the redundancy and
show the annotations associated with those redundant sequences.
- Updated documentation - the help pages have been
updated, particularly the pages describing the
RESTful interface. We have
added examples of Python and Java
clients, in addition to the Perl and curl
examples.
- Hmmsearch alignment formats - the first version of
this search interface only accepted STOCKHOLM format. We now accept Clustal,
MSF, aligned FASTA and selex formats, in addition to STOCKHOLM.
- Results tab - There is now a specific results tab in
the top level navigation found on each page. Clicking on this will take you
to a page that allows the retrieval of results via the job's unique
identifier.
- Change log - This change log has been added to catalog
the changes that we make to the site.
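The view of redundant sequences described in the list above can be pictured as grouping hits by their exact sequence and keeping the first one seen as the representative. The sketch below is an illustration under that assumption, not the site's implementation; the function name and the record layout are hypothetical.

```python
def collapse_identical(hits):
    """Group (accession, sequence, description) hits by identical sequence.

    Returns one row per distinct sequence, in order of first appearance.
    The first hit is the representative; the rest are kept as 'redundant'
    so that their annotations can still be shown."""
    groups = {}
    for accession, sequence, description in hits:
        key = sequence.upper()
        if key not in groups:
            groups[key] = {
                "representative": accession,
                "description": description,
                "redundant": [],
            }
        else:
            groups[key]["redundant"].append((accession, description))
    return list(groups.values())
```

Keeping the accession and description of every redundant entry is what allows the table to stay short without losing the associated annotations.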
Bug Fixes
- We have identified an issue with HMMER master that was causing corrupted
responses - a lovely race condition that has been punted back to Sean.
The interim fix means that searches will be 50 msecs slower.
- The redundant information, such as species, contained in the sequence
descriptions has been removed. We have also fixed a bug linking NR
environmental sequences back to the source database.
Version 0.6
New features and improvements
- We have now added a system for batch search, where FASTA files
containing up to 500 sequences can be uploaded and searched. You can
request an email when the results are finished or bookmark the results
page that shows the progress of the batch search.
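A check on the 500 sequence limit for batch uploads could be as simple as counting FASTA header lines. This is a minimal sketch with hypothetical function names; it assumes that every record starts with a ">" at the beginning of a line, and it does not validate the sequences themselves.

```python
MAX_BATCH_SEQUENCES = 500  # the limit stated in the text

def count_fasta_records(text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))

def check_batch(text):
    """Return the number of records, or raise ValueError if the upload
    is empty or exceeds the batch limit."""
    count = count_fasta_records(text)
    if count == 0:
        raise ValueError("no FASTA records found")
    if count > MAX_BATCH_SEQUENCES:
        raise ValueError(
            f"{count} sequences uploaded; the limit is {MAX_BATCH_SEQUENCES}"
        )
    return count
```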
Version 0.5
New features and improvements
- Added icons to indicate external links.
- We have updated to the latest (internal) version of the HMMER
software, which fixes a number of bugs.
Bug Fixes
- There was an issue using the command line tool curl
to fetch results. Curl needs an extra flag to stop a 417 error being
produced. The documentation was updated to include the
"-H 'Expect:'" flag.
- After a search is submitted, the browser is immediately sent a
redirect to the appropriate results page. This restores the expected
browser back button behaviour.
Version 0.4
New features and improvements
- Started to add documentation pages for searches and the API.
- Search results pages now contain the original search details.
- Added a download option for saving search results in different formats
(html version only).
- Results are now loaded in blocks of 1000 matches, rather than all at
once.
Bug Fixes
- Several IE compatibility bugs have been fixed.
Version 0.3
First release
- This is our first release of the website that enables HMMER searches to
be performed over the Web.