The Sequence and Structure Searching Site

This site is dedicated to aiding biomolecular sequence and structure searching. It is still under construction and currently provides only limited resources, which are ancillary to our recent paper on reliable and effective database searching.

Selected general database-searching resources


General advice for sequence searching

Some general rules, in order of importance:
  1. Think about what you are doing at all stages of the process. The results you obtain from a database search require as much consideration and understanding as any other biology experiment.

  2. Compare proteins, not DNA, if possible. Protein sequences carry far more information (both because they ignore silent mutations and because they incorporate information about chemical structure). So, even if your sequences are derived entirely from DNA, search with the translated version of the gene. Similarly, DNA databases should be searched using a 'translating' search algorithm.

  3. Search a large database. If the sequence you're trying to find isn't in the database, obviously you won't find it. Because databases grow rapidly, it is essential to use a recently updated database. Make sure that the database is complexity-masked; otherwise, you are likely to find a large number of spurious matches.

  4. Use statistical scores (E-values) to interpret the results. These scores can find 10 times as many distant homologs at the same rate of error. The statistical scores from FASTA, SSEARCH, and the new BLAST 2 are reliable. For example, a match with an E-value of 0.01 would be expected to come up at random about once in every 100 different searches. So, the statistical score gives you a good measure of how much you should trust the match.
    Beware: The scores from the old BLAST 1.4.9 and from WU-BLAST exaggerate significance very considerably. Moreover, the reported scores from PSI-BLAST can be misleading by tens of orders of magnitude! Do not use raw scores or percentage identity, as these are likely either to miss matches or to find unrelated proteins.

  5. Among pairwise programs, slower methods (FASTA, SSEARCH) work slightly better, but the new BLAST 2 and WU-BLAST are nearly as effective. PSI-BLAST and Intermediate Sequence Searching are more powerful than pairwise methods, but their results are currently considerably more difficult to interpret.

  6. Be aware that even in an ideal database search, most distant homologs are likely to go undetected. So, if you don't find a match, that doesn't indicate that there are no homologs in the database; it may just mean that they are too distant for your program to find.
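
As a rough illustration of rule 4, the arithmetic behind an E-value can be sketched in Python. The function names here are hypothetical, and the sketch assumes the usual model in which the number of chance matches at a given score follows a Poisson distribution whose mean is the E-value; it is not taken from any particular search program.

```python
import math


def expected_chance_matches(e_value, n_searches=1):
    """Expected number of chance matches at this E-value over n_searches searches.

    An E-value is the number of matches at least this good that are expected
    by chance in one search, so the expectation scales linearly with the
    number of searches.
    """
    return e_value * n_searches


def prob_at_least_one_chance_match(e_value):
    """Probability that a single search yields one or more chance matches.

    Assumes chance matches are Poisson distributed with mean e_value,
    which gives P = 1 - exp(-E). expm1 keeps precision for very small E.
    """
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (10.0, 1.0, 0.01, 1e-6):
        print(
            f"E = {e:g}: "
            f"chance matches per 100 searches = {expected_chance_matches(e, 100):g}, "
            f"P(at least one per search) = {prob_at_least_one_chance_match(e):.3g}"
        )
```

For small E-values the probability and the E-value nearly coincide, which is why an E-value of 0.01 can be read as roughly one chance match per 100 searches. For large E-values they diverge: the E-value keeps growing while the probability can only approach 1.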

Additional information about sequence searching

For a good, though somewhat shallow, introduction to using bioinformatics techniques, I recommend the Trends Guide to Bioinformatics, which was distributed as a free supplement to all of the November 1998 Trends Journals.

Following is a list of papers that may be of interest.

  • Brenner SE. 1998. Practical Database Searching. The Trends Guide to Bioinformatics. 9-12.
    ---- A brief guide to database searching, written for the general molecular biologist. Draws upon the results of the Brenner et al. 1998 Proc. Natl. Acad. Sci. paper, below. View HTML of first submitted draft.

  • Brenner SE, Chothia C, Hubbard TJP. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95:6073-6078.
    ---- Principally a research paper, but it shows how different sequence comparison methods perform relative to one another and measures the reliability of different scores. Essential for doing reliable database searching. Request reprint.

  • Altschul et al. 1994. Nat. Genet. 6:119-129.
    ---- A good overview of many aspects of sequence searching, but few specifics.

  • Pearson WR. 1996. Effective protein sequence comparison. Meth. Enzymol. 266:227-258.
    ---- This article mixes general background, technical advice, and research results. Probably the best single reference for how to do database searching.

  • Keith Robison's Guide to Sequence Searching
    ---- A more complete set of guidelines for database searching, but now slightly out of date.

  • Sternberg MJE. 1996. Protein Structure prediction: A practical approach, IRL Press
    ---- A compilation of many articles on structure prediction.

Supplementary information

Following is supplementary information to Brenner SE, Chothia C, Hubbard TJP. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95:6073-6078. (Request reprint.) Additional information will be added in the future.
  • sdqib40-1.35.seg.fa (called PDB40D in the paper)
  • sdqib90-1.35.seg.fa (called PDB90D in the paper)
  • Other databases, including updated versions of these databases, are available upon request.
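
For readers who want to inspect sequence databases like these, the following is a minimal Python sketch of reading FASTA-formatted text and measuring how much of each sequence is masked. It assumes, without having checked these particular files, that complexity-masked residues are written as the character `x`, a common convention for masked protein databases. The function names and the example entries are hypothetical.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def masked_fraction(sequence, mask_char="x"):
    """Fraction of residues equal to mask_char, ignoring case.

    Assumption: masked residues are written as 'x'. Genuine unknown
    residues written as 'X' are counted as masked as well.
    """
    if not sequence:
        return 0.0
    return sequence.lower().count(mask_char.lower()) / len(sequence)


# Hypothetical two-entry database, used only to show the calls.
example = """>entry1 hypothetical masked protein
MKVLAAGIxxxxxxxxGHT
WQ
>entry2 hypothetical unmasked protein
ACDEFGHIK
"""

records = list(read_fasta(StringIO(example)))

if __name__ == "__main__":
    for header, seq in records:
        print(f"{header}: {len(seq)} residues, {masked_fraction(seq):.1%} masked")
```

To apply the same functions to a file on disk, pass an open text file to `read_fasta` in place of the `StringIO` object.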

http://sss.berkeley.edu/
Steven E. Brenner
University of California, Berkeley

brenner@compbio.berkeley.edu