This site is dedicated to aiding biomolecular sequence and structure
searching. It is under construction and currently provides only
limited resources, which are ancillary to our recent paper on
reliable and effective database searching.
Contents
Selected general database-searching resources
- Sequence searches
- Structure searches
- Structure databases
General advice for sequence searching
Some general rules, in order of importance:
- Think about what you are doing at all stages of the
process. The results you obtain from a database search require
as much consideration and understanding as any other biology
experiment.
- Compare proteins, not DNA, if possible. Protein sequences
carry far more information (both because they ignore silent mutations
and because they incorporate information about chemical structure).
So, even if your sequences are derived entirely from DNA, search with
the translated version of the gene. Similarly, DNA databases should
be searched using a 'translating' search algorithm.
- Search a large database. If the sequence
you're trying to find isn't in the database, obviously you won't find
it. Because of the rapid growth of databases, it is essential that
you use a recently updated database. Make sure that the database
is complexity-masked; otherwise, you are likely to find a large
number of spurious matches.
- Use statistical scores (E-values) to interpret the results.
These scores can find 10 times as many distant homologs at the same
rate of error. The statistical scores from FASTA, SSEARCH, and
the new BLAST 2 are reliable. For example, a match with an E-value
of 0.01 would come up at random about once in every 100 different
searches. So, the statistical score gives you a good measure of how
much you should trust the match.
Beware: The scores from the old BLAST 1.4.9 and from WU-BLAST
exaggerate significance very considerably. Moreover, the reported
scores from PSI-BLAST can be misleading by tens of orders of
magnitude! Do not use raw scores or percentage identity, as these
are likely either to miss matches or to find unrelated proteins.
- Among pairwise programs, slower methods (FASTA, SSEARCH) work
slightly better, but the new BLAST 2 and WU-BLAST are nearly as
effective. PSI-BLAST and Intermediate Sequence Searching are more
powerful than pairwise methods, but their results are currently
considerably more difficult to interpret.
- Be aware that even in an ideal database search, most distant
homologs are likely to go undetected. So, if you don't find a
match, that doesn't indicate that there are no homologs in the
database; it just means that they are so distant that your program
cannot find them.
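The E-value rule above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of any search program: the function names are hypothetical, and the conversion from an E-value to a probability assumes that chance matches follow a Poisson distribution, which is the usual assumption but an assumption nonetheless.

```python
import math


def chance_hit_probability(e_value: float) -> float:
    """Probability of at least one chance match at this score in one search.

    Assumes the number of chance matches is Poisson-distributed with
    mean equal to the E-value (an assumption made for this sketch).
    """
    # 1 - exp(-E), computed with expm1 for accuracy at small E.
    return -math.expm1(-e_value)


def expected_chance_hits(e_value: float, n_searches: int) -> float:
    """Expected number of chance matches over n_searches independent searches."""
    return e_value * n_searches


if __name__ == "__main__":
    for e in (10.0, 1.0, 0.01, 1e-5):
        p = chance_hit_probability(e)
        hits = expected_chance_hits(e, 100)
        print(f"E = {e:g}: P(chance match) = {p:.4g}, "
              f"expected chance matches in 100 searches = {hits:g}")
```

For small E-values the probability and the E-value are nearly equal, which is why an E-value of 0.01 can be read as roughly one chance match per 100 searches. For E-values near or above 1 the two diverge, and a match at that level should be expected by chance alone.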
Additional information about sequence searching
For a good, though somewhat shallow, introduction to using bioinformatics
techniques, I recommend the Trends Guide to
Bioinformatics, which was distributed as a free supplement to
all of the November 1998 Trends Journals.
Following is a list of papers that may be of interest.
- Brenner SE. 1998. Practical Database Searching. The Trends
Guide to Bioinformatics. 9-12.
---- A brief guide to database searching, written for the general
molecular biologist. Draws upon the results of the Brenner et al. 1998
Proc. Natl. Acad. Sci. paper, below.
View HTML of first submitted draft.
- Brenner SE, Chothia C,
Hubbard TJP. 1998. Assessing sequence comparison methods with reliable
structurally identified distant evolutionary relationships.
Proc. Natl. Acad. Sci. USA 95:6073-6078.
---- Principally a research paper, but it demonstrates how different
sequence comparison methods compare and measures the reliability of
different scores. Essential for doing reliable database
searching. Request reprint.
- Altschul et al. 1994. Nat. Genet 6:119-129.
---- A good overview of many aspects of sequence searching, but with few
specifics.
- Pearson WR. 1996. Effective protein sequence comparison.
Meth. Enzymol. 266:227-258.
---- This article mixes general background, technical advice, and
research results. Probably the best single reference for how to
do database searching.
- Keith Robison's Guide to Sequence Searching
---- A more complete set of guidelines for database searching, but
now slightly out of date.
- Sternberg MJE. 1996. Protein Structure Prediction: A Practical Approach.
IRL Press.
---- A compilation of many articles on structure prediction.
Supplementary information
Following is supplementary information to Brenner SE, Chothia C,
Hubbard TJP. 1998. Assessing sequence comparison methods with reliable
structurally identified distant evolutionary relationships.
Proc. Natl. Acad. Sci. USA 95:6073-6078.
(Request reprint.)
Additional information will be added in the future.
http://sss.berkeley.edu/
Steven E. Brenner
University of California, Berkeley
brenner@compbio.berkeley.edu