Reference: This is an automatically produced HTML version of the first submitted draft of Brenner SE. 1998. "Practical database searching." The Trends Guide to Bioinformatics. 9-12. However, to reference this piece, please cite the original research article:
    Brenner SE, Chothia C, Hubbard TJP. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95:6073-6078.

Practical database searching

 

More sequences have been putatively characterized by database searches, and by BLAST in particular, than by any other single technology. For good reason: BLAST is fast and reliable. However, sequence comparison procedures should be treated as experiments analogous to wet-lab characterization. Their use deserves the same care both in the design of the experiment and in the interpretation of results.

 

The database search experiment

Before planning the BLAST run, it is essential to consider what information is to be learned about the query sequence of interest. It should be borne in mind that database searching can only reveal similarity. However, from this similarity one may infer homology (i.e., evolutionary relationship), and from that, one may be able to further infer function. While the first of these inferences is now reliable for carefully performed sequence comparison, the second is still fraught with challenges. The summary box provides guidelines for performing reliable and sensitive database searches.

Planning a good experiment requires an understanding of the method being applied. Fundamentally, database searches are a simple operation: a query sequence is locally aligned with each of the sequences (called targets) in a database. Most programs, such as BLAST and FASTA, use heuristics to speed the alignment procedure, while the Smith-Waterman algorithm (implemented, for example, in SSEARCH) rigorously compares the query sequence against each target in the database.

A score is computed from each alignment, and the query/target pairs with the best scores are then reported to the user. Statistics are typically used to help interpret these scores. A more detailed description of the process may be found in [Altschul, p.XX].

 

Databases and types of comparison

To formulate the experiment, it is first necessary to decide what types of sequences will be compared: DNA, Protein, or DNA as Protein. If the sequence under consideration either is a protein or codes for a protein, then it is almost always the case that the search should take place at the protein level. Proteins allow one to detect far more distant homology than DNA, for several reasons. In DNA comparisons, there is noise from the rapidly mutated third base position in each codon, and from comparisons of non-coding frames (though this latter issue still arises in DNA as Protein searches). Additionally, amino acids have chemical characteristics, which allow one to assess degrees of similarity rather than simply recognize identity or non-identity. For these reasons, DNA versus DNA comparison (using the blastn program) is typically only used to find identical regions of sequence in a database. One would do such a search to discover whether another group has sequenced or studied a gene, and to learn where it is expressed or where splice junctions occur. In short, protein-level searches are valuable for detecting evolutionarily related genes, while DNA searches are best for locating nearly identical regions of sequence.

Next, it is necessary to select a database to search against (e.g., GenBank, Genomes, nr proteins, ESTs, etc.). For homology searches, the most common database on the NCBI website to search is nr, which stands for non-redundant. The nr protein database combines data from several sources, removes the redundant identical sequences, and yields a collection with nearly all known proteins. The NCBI nr database is frequently updated, to incorporate as many sequences as possible. Obviously, a search won’t identify a sequence that has not been included in the database, and since databases are growing so rapidly, it is essential to keep current. Several specialized databases are also available, each of which is a subset of the nr database.

One may also wish to search DNA databases at the protein level. Programs can do so automatically by first translating the DNA in all six reading frames and then making comparisons with each of these conceptual translations. The nr DNA database (containing most known DNA sequence except GSS, EST, STS, or HTGS sequences) is useful to search when hunting new genes; the identified genes in this database would already be in the protein nr database. Searches against the GSS, EST, STS, and HTGS databases can find new homologous genes, and are especially useful to learn about expression data or genome map location. Because of the different combinations of queries and database types, there are several variants of BLAST (see box).
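As an illustration of the conceptual translation step, the sketch below produces the six reading frames of a DNA sequence. It assumes the standard genetic code and unambiguous A/C/G/T input; the function names are hypothetical, and this is not how any particular BLAST program is implemented.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons not in the table become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six translations would then be compared at the protein level; at most one frame in any given region is usually coding, which is the source of the non-coding-frame noise mentioned earlier.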

Note that it is desirable to use the newest versions of BLAST, which support gapped alignments. (See [Altschul, pXX] for description of gapped alignments.) The older versions are slower, detect fewer homologs, and have problems with some statistics. The programs can be run over the Web and can be downloaded from the FTP site to run locally. Another option is to use the FASTA package, which continues to be updated. The FASTA program can be slower but more effective than BLAST, and the package also contains SSEARCH, an implementation of the rigorous Smith-Waterman algorithm that is slow but most sensitive. Iterative programs like PSI-BLAST require extreme care in their operation, as they can provide very misleading results; however, they have the potential to find more homologs than purely pair-wise methods.

 

Filtering

The statistics for database searches assume that unrelated sequences will look essentially random with respect to each other. However, certain patterns in sequences violate this rule. The most common exceptions are long runs of a small number of different residues (such as a poly-alanine tract). Such regions of sequence may spuriously obtain extremely high match scores. For this reason, the NCBI BLAST server will automatically remove such sections in proteins (replacing them with X) using the seg program if "default filtering" is selected. DNA sequences will be similarly masked by dust. Though these programs automatically remove the majority of problematic matches, some problems invariably slip through; moreover, valid hits may be missed due to masking of part of the sequence. Therefore, it may be helpful to try using different masking parameters.
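The idea of masking can be illustrated with a simple stand-in: replace with X every window whose residue composition has low Shannon entropy. This is not the seg algorithm itself, only a sketch of the general principle; the window size and threshold are illustrative assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    total = len(window)
    return -sum((n / total) * log2(n / total)
                for n in Counter(window).values())


def mask_low_complexity(protein, window=12, threshold=2.2):
    """Replace residues in low-entropy windows with X."""
    masked = list(protein)
    for start in range(len(protein) - window + 1):
        # Entropy is always measured on the original, unmasked sequence.
        if shannon_entropy(protein[start:start + window]) <= threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)
```

Raising the threshold masks more of the sequence and lowering it masks less, which mirrors the trade-off described above between spurious matches and missed valid hits.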

Other sorts of filtering are also often desirable; for example, iterative searches are prone to contamination by regions of proteins that resemble coiled-coils or transmembrane helices. Here, one protein that is similar only because it has the general characteristics may match initially. The profile then emphasizes these inappropriate characteristics, eventually causing many spurious hits. Heavily cysteine-rich proteins can also obtain anomalously high scores. If these characteristics are not filtered, then it is necessary to carefully review the alignment results to ensure that they have not led to incorrect matches.

 

Alignment, algorithmic, and output parameters

Three other sets of parameters also affect search results, but they rarely require careful consideration by most users. First, the matrix and gap parameters determine how similarity between two sequences is scored. When two residues in a protein are aligned, programs use the matrix to determine whether the amino acids are similar (and thus receive a positive score) or very different. The default matrix for BLAST is called BLOSUM62, and the programs will not currently operate reliably with other matrices. The gap parameters determine how much an alignment is penalized for having gaps: the existence parameter is a fixed cost for having a gap, and the per-position parameter is a cost dependent upon the length of the gap. Typically there is a great cost associated with introducing a gap and a small additional cost for each position, such that longer gaps are worse. It is rarely beneficial to change these from their defaults.
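To make the roles of the matrix and the gap parameters concrete, the sketch below scores an alignment that has already been written out with "-" for gaps. The three-residue matrix and the gap costs of 11 and 1 are illustrative assumptions, not a statement of any program's defaults, and `score_alignment` is a hypothetical name.

```python
# Toy symmetric substitution matrix covering only A, L and I (illustrative values).
TOY_MATRIX = {
    frozenset("A"): 4, frozenset("L"): 4, frozenset("I"): 4,
    frozenset("LI"): 2, frozenset("AL"): -1, frozenset("AI"): -1,
}


def score_alignment(query_row, target_row, matrix, existence=11, per_position=1):
    """Score two equal-length aligned rows in which '-' marks a gap."""
    score = 0
    previous_gap = None  # which row the current gap is in, if any
    for q, t in zip(query_row, target_row):
        if q == "-" or t == "-":
            current_gap = "query" if q == "-" else "target"
            if current_gap != previous_gap:
                score -= existence   # fixed cost, charged once per gap
            score -= per_position    # charged for every gapped position
            previous_gap = current_gap
        else:
            score += matrix[frozenset((q, t))]
            previous_gap = None
    return score
```

Under these assumed costs a gap of one position costs 12 and a gap of three positions costs 14, so a single long gap is penalized far less than several short ones.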

The second set of parameters determines the heuristics BLAST uses. By altering these numbers, it is possible to make the program run slower and be more sensitive, or to run faster at the cost of missing more homologs. The complexity of these parameters in BLAST precludes extensive description here. Currently, it is very rare for users to alter these options from the defaults. The FASTA program has one such parameter that a user will often want to set, called ktup. Searches with ktup=1 are slower, but are more sensitive than BLAST; ktup=2 is faster and less effective.

A third set of parameters regulates how many results are reported. By default, the programs will report only matches with an E-value (described below) up to 10. The total number of matches is limited to the best 500, and detailed information with the alignment is provided for up to 100 pairs. To retrieve more matches, these numbers can be increased.

 

Interpretation of results

Interpretation of the results of a sequence database search involves first evaluating the matches, to determine whether they are significant and therefore imply homology. The most effective way of doing so is through use of the statistical scores, or E-values. The E-values are more useful than the raw or bit scores, and they are far more powerful than percentage identity (which is best not even considered unless the identity is very high). Fortunately, the E-values from FASTA, SSEARCH, and NCBI gapped BLAST seem to be accurate and are therefore easy to interpret.

The E-value (or expectation-value) of a match should measure the expected number of sequences in the database which would achieve a given score by chance. Therefore, in the average database search, one expects to find ten random matches with E-value scores below 10; obviously, such matches are not significant. However, lacking better matches, sequences with these scores may provide hints of function or suggest new experiments. Scores below 0.01 would occur by chance only very rarely, and are therefore likely to indicate homology, unless biased in some way. Scores of near 1e-50 are now seen frequently, and these offer extremely high confidence that the query protein is evolutionarily related to the matched target in the database.
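The thresholds discussed above can be expressed as a small helper. The sketch assumes that the number of chance matches follows a Poisson distribution, so that an E-value of E corresponds to a probability of 1 - exp(-E) of seeing at least one chance match that good; the category labels and function names are hypothetical.

```python
from math import expm1


def chance_probability(e_value):
    """Probability of at least one chance match this good (Poisson assumption)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -expm1(-e_value)


def classify(e_value, significant=0.01, report_limit=10.0):
    """Sort a match into the rough categories used in the text."""
    if e_value < significant:
        return "likely homolog"
    if e_value <= report_limit:
        return "possible hint, not significant"
    return "not reported by default"
```

For small values the E-value and the probability are nearly identical, while an E-value of 10 corresponds to near certainty that at least one such match arises by chance.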

Inferring function from the homologous matched sequences is a process still fraught with difficulty. If the score is extremely good and the alignment covers the whole of both proteins, then there is a good chance that they will share the same or a related function. However, it is dangerous to place too much trust in the query having the same function as the matched protein: functions do diverge, and organismal or cellular roles may alter even when biochemical function is unchanged. Moreover, a significant fraction of functional annotations in databases are wrong, so one needs to be suspicious. There are other complexities; for example, if only a portion of the proteins align, they may share a domain which only contributes an aspect of the overall function. It is often the case that all of the highest-scoring hits align to one region of the query, and matches to other regions need to be sought much lower in the score ranking. For this reason, it is necessary to carefully consider the overlap between the query and each of the targets.
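Checking overlap is easy to automate once the alignment coordinates are known. The following sketch assumes 1-based, inclusive coordinates taken from the search output; the function names and the `min_length` cutoff are hypothetical.

```python
def coverage(start, end, length):
    """Fraction of a sequence covered by an alignment from start to end."""
    return (end - start + 1) / length


def uncovered_regions(query_length, hit_ranges, min_length=1):
    """Return query intervals of at least min_length that no hit aligns to."""
    regions = []
    next_free = 1  # first query position not yet covered by any hit
    for start, end in sorted(hit_ranges):
        if start - next_free >= min_length:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if query_length - next_free + 1 >= min_length:
        regions.append((next_free, query_length))
    return regions
```

For a 300-residue query whose top hits align to positions 10-120 and 50-150, `uncovered_regions(300, [(10, 120), (50, 150)], min_length=30)` reports the stretch from 151 to 300, which is where lower-ranked matches would need to be sought.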

Database search methods are also limited because most homologous sequences have diverged too far to be detected by pair-wise sequence comparison methods. Thus, failure to find a significant match does not indicate that no homologs exist in the database; rather, it suggests that either more-powerful computational methods [such as those described on pXX] or experiments would be necessary to locate them.

 

Conclusion

The results of a BLAST run deserve neither excessive faith nor blithe disregard. The BLAST programs are well-tested and reliable indicators of sequence similarity, and their underlying principles are straightforward. Complexities added by the fast algorithm typically need not be carefully considered, because the program and its parameters have been optimized for the hundreds of thousands of runs every day. If one is careful about posing the database search experiment and interprets the results with care, sequence comparison methods can be trusted to rapidly and easily provide an incomparable wealth of biological information.


References


URLs

BLAST website: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST FTP site: ftp://ncbi.nlm.nih.gov/blast/

FASTA FTP site: ftp://ftp.virginia.edu/pub/fasta

FASTA at EBI: http://www2.ebi.ac.uk/fasta3/

 


Summary [Box]


BLAST variants for different searches [Box]

Program   Query     DB        Comparison      Common Use
blastn    DNA       DNA       DNA-level       Seek identical DNA sequences, and splicing patterns
blastp    Protein   Protein   Protein-level   Seek homologous proteins
blastx    DNA       Protein   Protein-level   Query new DNA to find genes and seek homologous proteins
tblastn   Protein   DNA       Protein-level   Search for genes in un-annotated DNA
tblastx   DNA       DNA       Protein-level   Discover gene structure;

N.b., similar program variants are available for FASTA. Protein-level searches of DNA sequences are performed by comparing translations of all six reading frames.