Saturated BLAST User's Manual

Detect remote homologues, maintain protein families, and hunt new genes

Table of Contents

  1. Introduction
    1. What is Saturated BLAST
    2. What could be done with Saturated BLAST
    3. Reference
  2. Installation
    1. Obtain Saturated BLAST package
    2. System requirement
    3. Install required Perl packages
    4. Install local BLAST programs (optional)
    5. Configure Saturated BLAST
    6. Run Saturated BLAST
  3. Getting started
    1. Input query sequences
    2. Set parameters
    3. Set BLAST search options
    4. Send first search
    5. Repeat search loop
    6. View results
    7. Save results
    8. Restart
  4. Algorithm
    1. Multiple intermediate sequence search
    2. Saturated BLAST search
    3. Some terms
    4. Structure of Saturated BLAST
    5. Filter
    6. Seed selector
    7. Saturated BLAST alignment
    8. Cluster analysis
  5. Usage
    1. Interface
    2. Mouse operation
    3. File
    4. Edit
    5. Set parameters
    6. Set BLAST search
    7. Set seed
    8. Set break point
    9. Set smart filter
    10. Selection
    11. Advanced selection
    12. Display
    13. BLAST alignment
    14. Multiple alignment
    15. Program LOG
    16. Pair alignment
    17. Cluster
    18. Help
    19. Action buttons
  6. FAQ

Introduction

  1. What is Saturated BLAST
  2. What could be done with Saturated BLAST
  3. Reference

What is Saturated BLAST

As the word "Saturated" suggests, this program aims to find as many related sequences as possible in a database. To do so, it adopts the Intermediate Sequence Search (ISS) method.

ISS is a strategy for recognizing distant homologues through transitive sequences. The idea is that when the similarity between two remotely homologous sequences cannot be detected by direct sequence comparison, it can still be established if an intermediate sequence has significant alignment scores to both of them.

ISS and its extension, multiple ISS, which applies more than one intermediate step, have proved to be sensitive and practical. However, a brute-force search that repeatedly runs database searches is time-consuming and hard to automate. Saturated BLAST is a program with a graphical user interface that performs the iterated multiple intermediate sequence search more efficiently and automatically.

The Saturated BLAST package was developed on Linux using Perl as the programming language. Starting with a single query or a set of related sequences, Saturated BLAST runs a BLAST search, organizes the output, identifies representative sequences as search seeds, and then repeatedly uses these new seeds as queries for the next generation of BLAST searches. The graphical user interface and the built-in BLAST result parser, multiple alignment tools and clustering algorithms provide an easy way to edit, visualize, analyze, monitor and control the search.

What could be done with Saturated BLAST

Finding distant homologues is the primary use of Saturated BLAST. It is also a very good tool for maintaining large protein families and for hunting new genes in genomic databases. By the authors' estimate, it saves about 90% of the time needed to do the same work by hand.

Reference

Please cite:
  Weizhong Li, Frederic Pio, Krzysztof Pawlowski & Adam Godzik.
  Saturated BLAST: An automated multiple intermediate sequence search
  used to detect distant homology. Bioinformatics (2000) in press


Installation

  1. Obtain Saturated BLAST package
  2. System requirement
  3. Install required Perl packages
  4. Install local BLAST programs (optional)
  5. Configure Saturated BLAST
  6. Run Saturated BLAST

Obtain Saturated BLAST package

The Saturated BLAST package is distributed from http://bioinformatics.burnham-inst.edu/xblast, the official site of this package. Users can download a gzipped UNIX tar archive with a name of the form xblast_1.00.tar.gz.

System requirement

Saturated BLAST was developed under the Linux operating system. The authors have installed and used it on Red Hat Linux 6.0, 6.1 and 6.2. Since Saturated BLAST is implemented in Perl, it should be possible to install this package on other UNIX systems that support Perl.

UNIX and Perl are therefore the basic requirements for installing Saturated BLAST. Because Saturated BLAST uses some third-party Perl packages (see below), check which version of Perl those packages require.

Install required Perl packages

When looking at the Saturated BLAST scripts, users may notice lines like:
  use Tk;
  use LWP::UserAgent;
  use LWP::Simple;
  use HTTP::Request::Common;
We use two sets of Perl packages in Saturated BLAST: Perl/Tk for the graphical user interface, and libwww-perl for the internet connection. Before you run Saturated BLAST, you should have both installed on your computer.

Perl/Tk is a great toolkit for developing graphical user interfaces; it was written by Nick Ing-Simmons. The package is available from either CPAN or the author's site.

I have installed Perl/Tk on several computers with different operating systems, and the installation has been very smooth. As described in the 'INSTALL' file of the package, after you unpack the distribution you will have a new directory named Tk800.XXX. Enter this directory and do the following:
  perl Makefile.PL
  make 
  make test
  make install
The whole process takes about 15 minutes. There is a demo application, demos/widget; if you can run this program, you can run Saturated BLAST.

Libwww-perl is a general internet tool package for Perl. Since it is a bundled package, you may already have it. Type the following command in your UNIX shell:
  perl -e 'use LWP::UserAgent'
If nothing appears (no error messages), you already have it and can skip the installation. Otherwise, do the following.

Download libwww-perl from either CPAN or its author's site. Please read the instructions carefully: libwww-perl needs some other packages (URI, MIME-Base64, HTML-Parser, libnet, and Digest::MD5), so you may have to install them as well. Don't worry, these installations are very simple. For each package, download it, unpack it, and type
  perl Makefile.PL
  make
  make test
  make install
Repeat this once for each package. My own estimate for installing Perl/Tk and libwww-perl is less than 1.5 hours.

Install local BLAST programs (optional)

Saturated BLAST can run BLAST searches on the remote NCBI BLAST server and also on the local computer. Having a local copy of the BLAST programs is more flexible, because you can build your own databases of interest. We therefore recommend installing the BLAST programs on your computer.

It is very simple. Open the FTP site of NCBI BLAST and download the pre-compiled Linux BLAST programs, executables/blast.linux.tar.Z. Then unpack the compressed file into a directory such as "/data/ncbi" (this directory should match the setting in Saturated BLAST, see the next section), make the directories "/data/ncbi/bin" and "/data/ncbi/db", and move the executable files (blastpgp, blastall and so on) into "/data/ncbi/bin".

The FASTA format databases needed by BLAST are available at ftp://ftp.ncbi.nlm.nih.gov/blast/db. You can also generate your own databases. Please refer to the documents within the BLAST distribution on how to format BLAST databases.

Configure Saturated BLAST

Now you are ready for the last step of the installation. Unpack the archive (for example xblast_1.00.tar.gz) with
  gunzip < xblast_version_number.tar.gz | tar xvf -
and enter the xblast subdirectory. What remains is to adjust some local definitions in the Perl scripts.

In the file xblast.pl, look for a line like:
  my $XBLAST_ROOT = "/usr/local/bin/xblast";
Change this directory to your installation site. In the file blastruntool.pl, change the directory of your local BLAST installation. You will find lines like these:
  my $blast_root = "/data/ncbi"; 
  $ENV{"NCBI"} = $blast_root; 
  $ENV{"BLASTDB"} = "$blast_root/db"; 
  $ENV{"BLASTMAT"} = "$blast_root/data"; 

Run Saturated BLAST

Add the directory where you installed Saturated BLAST to your search path, then just type xblast.pl. A nice window should appear on your screen.

Getting started

  1. Input query sequences
  2. Set parameters
  3. Set BLAST search options
  4. Send first search
  5. Repeat search loop
  6. View results
  7. Save results
  8. Restart
In this section, we take a quick tour of working with Saturated BLAST. Since the program has many adjustable parameters, which are discussed in detail later, we will skip most of the details here.

Saturated BLAST is a window-based application, so first let's have a look at how the program appears on screen. Through this graphical user interface, it is easy to perform complicated jobs with simple mouse clicks.

Saturated BLAST window

Input query sequences

To initialize a new Saturated BLAST search, go to menubar File -> new, and a small window will pop up. Simply supply a job name, paste your query sequence in FASTA format or give the filename of the sequence, and click the [Ok] button. The main display window will refresh itself, and a new line (your query sequence) will be added there.



Set parameters

The most important parameters are set through menubar Set -> parameter. You may open this window to have a look. For now, I suggest you just use the default settings, because each of these items is explained in the following sections; just press the [Ok] button.













Set BLAST search options

Follow menubar Set -> BLAST search to set the program and options for BLAST. The left panel of the window shows a pre-defined BLAST job corresponding to the default BLAST options, which can be set on the right panel. To specify your own BLAST job, change the parameters, select the current BLAST job in the left panel with a mouse click, and press the [Replace] button so that the old job is replaced with yours. Then press the [Ok] button.


Send first search

Now you have defined the query sequence, the parameters and the details of the BLAST search. At the top of the main window, there are several buttons used for sending BLAST searches. Press the [next] button and wait for some time; the window will not respond until the job is finished. Then all the qualified hits of the BLAST search will be tabulated in the window.

You can practice some basic mouse operations on the main window.
(a) When the mouse pointer is moved over the description on a line, the text is redisplayed in a red font. Double-clicking it opens a message window.
(b) To select (or unselect) a sequence, click on the non-text area of that line. Selected sequences are marked with a light-yellow background.
(c) If there is already a Netscape window on the display, double-clicking on the gi number opens a Netscape window connected to the NCBI Entrez database.

Repeat search loop

Some of the sequences have been marked with a red <x>, indicating that they have been selected as new queries for the following searches. A red triangle on the left points to the first of the new queries. There are now several ways to continue the Saturated BLAST search.

(a) Press the [next] button to run the next query.
(b) Set a break point at a query, and press the [cont] button to run searches from the current query up to the query with the break point. To set a break point, first select that query, then choose Set -> break point on from the menubar.
(c) Press the [cont] button to run all the queries one by one until the search is saturated.

Any combination of the above methods can be used to send BLAST jobs. You may also want to check the new results before sending new jobs.

View results

There are several ways to view or analyze the results. These tools are discussed in the following chapters, but you may try some simple functions from the menubar: edit, set, select, display, and tools. Most are self-explanatory, so you can practice without reading further in the manual.

Save results

Before sending each job, Saturated BLAST saves all the results and all the parameters into a restart file: Your_Job_Id.Restart. The original purpose of this file was to recover everything if the program crashes, but it also serves as the output file of Saturated BLAST. You can save it from menubar File -> save.

Saturated BLAST can also save results in other formats: plain FASTA, HTML, a tab-delimited table, and plain text. This makes it easy to view, edit, and analyze the results in third-party software such as ClustalX.

Restart

Since the Restart file stores all the results and parameter settings of a Saturated BLAST search, the search can be "restarted" by importing this file using File -> open.


Algorithm

  1. Multiple intermediate sequence search
  2. Saturated BLAST search
  3. Some terms
  4. Structure of Saturated BLAST
  5. Filter
  6. Seed selector
  7. Saturated BLAST alignment
  8. Cluster analysis

Multiple intermediate sequence search

The basis of Saturated BLAST is the so-called intermediate sequence search (ISS). Two proteins, let's call them query (Q) and target (T), can have similar structures and biological functions but sequences so different that traditional protein sequence comparison algorithms do not identify their relationship. But if there is an intermediate sequence (I) that is similar in sequence to both Q and T, the homology between Q and T can be established. The connection can be written as "Q-I-T".

ISS figure


ISS is a very sensitive method for recognizing remote homology. Its natural extension, the multiple intermediate sequence search (MISS), which makes connections of the form "Q-I1-I2---T", is more powerful than simple ISS. Saturated BLAST uses the MISS strategy to search a database for remotely homologous sequences.
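
The chain idea can be illustrated with a small Python sketch. This is not code from the Saturated BLAST package (which is written in Perl); the `similar` table and the function name `find_chain` are hypothetical, and the table simply records which sequences are assumed to hit each other significantly.

```python
from collections import deque

def find_chain(similar, query, target):
    """Breadth-first search for the shortest chain of significant hits
    linking query to target. Returns None if no chain exists."""
    parents = {query: None}
    queue = deque([query])
    while queue:
        current = queue.popleft()
        if current == target:
            chain = []
            while current is not None:
                chain.append(current)
                current = parents[current]
            return chain[::-1]
        for neighbor in similar.get(current, ()):
            if neighbor not in parents:
                parents[neighbor] = current
                queue.append(neighbor)
    return None

# Hypothetical table of significant pairwise hits (assumed symmetric).
similar = {
    "Q": ["I1"],
    "I1": ["Q", "I2"],
    "I2": ["I1", "T"],
    "T": ["I2"],
}

chain = find_chain(similar, "Q", "T")
print("-".join(chain))
```

Q and T never hit each other directly in this toy table, yet they are connected through the two intermediates I1 and I2.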

Saturated BLAST search

The starting point of Saturated BLAST is a single sequence (although you can input multiple sequences). With this query, Saturated BLAST runs a BLAST search and parses the output. Every hit sequence that meets the user-defined criteria is added to the result database. At the same time, some representatives of the hit sequences are marked as new BLAST queries.

Then the program takes the next query, runs a BLAST search, parses the output, filters the results and selects new queries as before. The program repeats this search loop until no new sequences are found or pre-defined criteria are met.

In this manner, the program can dig out as many related homologous sequences as possible from the database, which is why the program is called Saturated BLAST.
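
The search loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the package's Perl implementation: `run_search`, `is_qualified` and `select_seeds` are hypothetical callbacks standing in for the BLAST run, the filters and the seed selector, and `max_searches` stands in for the pre-defined stop criteria.

```python
from collections import deque

def saturated_search(queries, run_search, is_qualified, select_seeds,
                     max_searches=100):
    """Repeat search / filter / select-seed until no seeds are left
    or max_searches is reached."""
    results = {}        # sequence id -> (parent id, level)
    pending = deque()   # seeds waiting to be searched
    for query in queries:
        results[query] = (None, 0)
        pending.append(query)

    searches = 0
    while pending and searches < max_searches:
        seed = pending.popleft()
        searches += 1
        level = results[seed][1] + 1
        new_hits = []
        for hit in run_search(seed):
            # Skip sequences already in the result database and
            # hits that fail the user-defined criteria.
            if hit["id"] in results or not is_qualified(hit):
                continue
            results[hit["id"]] = (seed, level)
            new_hits.append(hit)
        for hit in select_seeds(new_hits):
            pending.append(hit["id"])
    return results, searches

# Toy "database": which hits each query returns (hypothetical data).
toy_db = {
    "q": [{"id": "a", "expect": 1e-30}, {"id": "b", "expect": 1e-5}],
    "a": [{"id": "q", "expect": 1e-30}, {"id": "c", "expect": 1e-8}],
    "b": [{"id": "d", "expect": 5.0}],
    "c": [],
}

results, searches = saturated_search(
    ["q"],
    run_search=lambda seed: toy_db.get(seed, []),
    is_qualified=lambda hit: hit["expect"] < 1e-3,
    select_seeds=lambda hits: hits,  # toy rule: every new hit is a seed
)
print(sorted(results), searches)
```

In the toy run, sequence c is only reachable through the intermediate a, and the search stops by itself once every seed has been used.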

Some terms

Here, we define some terms used in Saturated BLAST.

Structure of Saturated BLAST

All the sequences found in Saturated BLAST search and the sequences input by user are stored and maintained in a main database.

The BLAST searches and the sequences are governed by several tools, including the seed selector, the BLAST result parser and the filters. Other tools, such as the alignment builder and the cluster function, can be used to visualize and analyze the results.

Filter

Four kinds of filters are designed to confine a MISS to a desired direction and to prevent it from diverging: the redundancy filter, the low significance filter, the keywords filter, and the smart filter.
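
As a rough illustration of how four such filters could be chained, here is a Python sketch. The exact rules of each filter are not given in this section, so everything below is an assumption: the redundancy filter is taken to reject gi numbers already in the result database, the low significance filter to reject hits above an expect threshold, the keywords filter to reject descriptions containing unwanted words, and the smart filter to reject gi numbers the user deleted earlier. All names are hypothetical.

```python
def passes_filters(hit, known_gis, max_expect, banned_keywords, smart_list):
    """Return True if a hit survives all four (assumed) filters."""
    # Redundancy filter: already in the result database.
    if hit["gi"] in known_gis:
        return False
    # Low significance filter: expect value too high.
    if hit["expect"] > max_expect:
        return False
    # Keywords filter: description contains an unwanted word.
    description = hit["description"].lower()
    if any(word.lower() in description for word in banned_keywords):
        return False
    # Smart filter: the user deleted this sequence before.
    if hit["gi"] in smart_list:
        return False
    return True

hits = [
    {"gi": 100, "expect": 1e-40, "description": "protein kinase"},
    {"gi": 101, "expect": 1e-10, "description": "protein kinase homolog"},
    {"gi": 102, "expect": 2.0, "description": "unrelated protein"},
    {"gi": 103, "expect": 1e-8, "description": "Kinase FRAGMENT"},
    {"gi": 104, "expect": 1e-12, "description": "kinase-like protein"},
]

kept = [
    hit["gi"] for hit in hits
    if passes_filters(hit, known_gis={100}, max_expect=1e-3,
                      banned_keywords=["fragment"], smart_list={104})
]
print(kept)
```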

Seed selector

The output of BLAST can be very redundant. Searches with redundant queries do not gain any new information and waste computer time and resources. After each new BLAST search, all the filtered sequences are checked for significance in terms of expect value, sequence identity and sequence length. The significant sequences are then clustered into sub-groups according to sequence identity, and Saturated BLAST selects the longest sequence from each cluster as a seed.
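
A minimal Python sketch of this selection step is shown below. It assumes a simple greedy clustering (each sequence joins the first cluster whose longest member it resembles closely enough), which may differ from the clustering actually used by the program. The thresholds, the `identity` callback and the field names are hypothetical.

```python
def select_seeds(hits, identity, max_expect=1e-3, min_length=50,
                 identity_threshold=0.9):
    """Keep significant hits, cluster them greedily by sequence identity,
    and return the longest sequence of each cluster as a seed."""
    significant = [
        hit for hit in hits
        if hit["expect"] <= max_expect and hit["length"] >= min_length
    ]
    clusters = []
    # Longest first, so the first member of a cluster is its longest one.
    for hit in sorted(significant, key=lambda h: h["length"], reverse=True):
        for cluster in clusters:
            if identity(cluster[0]["id"], hit["id"]) >= identity_threshold:
                cluster.append(hit)
                break
        else:
            clusters.append([hit])
    return [cluster[0] for cluster in clusters]

# Hypothetical pairwise identities between the hits.
pair_identity = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.30,
    frozenset(("b", "c")): 0.35,
}

def identity(x, y):
    return pair_identity[frozenset((x, y))]

hits = [
    {"id": "a", "length": 300, "expect": 1e-50},
    {"id": "b", "length": 280, "expect": 1e-45},
    {"id": "c", "length": 150, "expect": 1e-6},
    {"id": "d", "length": 400, "expect": 5.0},   # not significant
]

seeds = select_seeds(hits, identity)
print([seed["id"] for seed in seeds])
```

Here b is nearly identical to the longer a, so only a and c become seeds, and d never enters the clustering because its expect value is too high.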

Saturated BLAST alignment

Saturated BLAST provides four kinds of alignments, which are derived directly or indirectly from the BLAST search output.

Cluster analysis

The seed selector can cluster sequences, but it only considers sequences with the same parent. In most cases, this is good enough for seed selection. However, when you have hundreds of sequences or more in your result database, you may need a tool to cluster all the sequences, or a selection of them.

Saturated BLAST uses a standard average linkage clustering method, so an all-against-all similarity matrix is calculated from the pairwise alignments available within Saturated BLAST. There are two options for the similarity measure: the normalized alignment score S and the sequence identity.

Given a pairwise alignment, S is calculated as
  S = s - [ln(K m n) / lambda]
where s is the raw score, and m and n are the lengths of the two sequences. For the default BLAST matrix, blosum62, lambda is 0.216 and K is 0.014. When sequence identity is used, it is the percentage of identical residues relative to the shorter sequence of the alignment.
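
As a worked example, the two similarity measures can be computed in Python as follows. The function names are hypothetical; the constants are the lambda and K values quoted above, and the input numbers are made up for illustration.

```python
import math

LAMBDA = 0.216  # lambda quoted above for blosum62
K = 0.014       # K quoted above for blosum62

def normalized_score(raw_score, m, n, lam=LAMBDA, k=K):
    """S = s - ln(K*m*n) / lambda, with m and n the sequence lengths."""
    return raw_score - math.log(k * m * n) / lam

def identity_percent(identical, len_a, len_b):
    """Percentage of identical residues relative to the shorter sequence."""
    return 100.0 * identical / min(len_a, len_b)

# Made-up example: raw score 100 between sequences of length 200 and 300.
print(round(normalized_score(100, 200, 300), 2))
print(identity_percent(45, 120, 150))
```

Because the correction term grows with the product of the lengths, the same raw score counts for less between two long sequences than between two short ones.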


Usage

  1. Interface
  2. Mouse operation
  3. File
  4. Edit
  5. Set parameter
  6. Set BLAST search
  7. Set seed
  8. Set break point
  9. Set smart filter
  10. Selection
  11. Advanced selection
  12. Display
  13. BLAST alignment
  14. Multiple alignment
  15. Program LOG
  16. Pair alignment
  17. Cluster
  18. Help
  19. Action button

Interface

If you have already run Saturated BLAST, you know the graphical user interface. The results are tabulated in the main display window, and the contents of the columns, as shown in the table header from left to right, are:

At the bottom of the window, there is a status bar which displays the number of searches, the number of seeds left, and so on.

Mouse operation

File

Menu File -> open opens a Saturated BLAST Restart file, which was saved either automatically by the program before each BLAST search or by the user. When a new Restart file is read in, it replaces all the data in the current window.

Menu File -> new starts a new Saturated BLAST job. A dialog window will appear, and the user needs to supply the job name and query sequence. The job name should consist of ordinary word characters, because it forms the first part of the names of some temporary and output files. The user can input one or more query sequences in FASTA format or in simple one-letter code. After the OK button is pressed, the input sequences replace the current ones, if any.

Menu File -> save saves the Restart file. The Restart file is the basic output file of Saturated BLAST and can be opened later by the program.

Menu File -> save as saves the results in different formats, including HTML, table and plain text. The HTML output files are frame-based and support dynamic display of messages through JavaScript. The table file is a TAB-delimited text table, so it can be read by software like Microsoft Excel.

Menu File -> export exports all sequences or the selected sequences in different forms: gi numbers, FASTA format and BLAST alignments. The gi numbers are sorted.

Menu File -> quit quits Saturated BLAST program.

Edit

Menu Edit -> delete selected deletes the selected sequences. Actually, these sequences are only marked as deleted; they are still stored in the program, so the deleted sequences can be recovered. Sequences which have children cannot be deleted this way, because otherwise some connections would be broken.

Menu Edit -> delete with children will delete the selected sequences along with their children, grand-children and so on.

Menu Edit -> delete with parent will delete the selected sequences along with their parents and children of their parents.

Menu Edit -> undelete will recover all the deleted sequences.

Menu Edit -> clear deletion removes the deleted sequences from memory and assigns new serial numbers to all the remaining sequences. After this operation, the deleted sequences can no longer be recovered.
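
The behaviour of these delete commands can be modelled with a small Python sketch. It is a hypothetical data model, not the program's actual internals: the class `ResultTable` and its fields are assumptions. Each row keeps its serial number, the serial number of its parent (-1 for an input query) and a "deleted" mark.

```python
class ResultTable:
    """Toy model of the result database with soft deletion."""

    def __init__(self):
        self.rows = []

    def add(self, parent=-1):
        serial = len(self.rows)
        self.rows.append({"serial": serial, "parent": parent,
                          "deleted": False})
        return serial

    def live_children(self, serial):
        return [row["serial"] for row in self.rows
                if row["parent"] == serial and not row["deleted"]]

    def delete(self, serial):
        """Mark one sequence as deleted; refuse if it still has children."""
        if self.live_children(serial):
            return False
        self.rows[serial]["deleted"] = True
        return True

    def delete_with_children(self, serial):
        """Mark a sequence and all of its descendants as deleted."""
        for child in self.live_children(serial):
            self.delete_with_children(child)
        self.rows[serial]["deleted"] = True

    def undelete(self):
        for row in self.rows:
            row["deleted"] = False

    def clear_deletion(self):
        """Drop deleted rows for good and renumber the remaining ones."""
        kept = [row for row in self.rows if not row["deleted"]]
        new_serial = {row["serial"]: i for i, row in enumerate(kept)}
        for row in kept:
            row["parent"] = new_serial.get(row["parent"], -1)
            row["serial"] = new_serial[row["serial"]]
        self.rows = kept

table = ResultTable()
q = table.add()        # input query, parent -1
a = table.add(q)
b = table.add(q)
c = table.add(a)

refused = table.delete(a)   # a has the child c
accepted = table.delete(b)  # b has no children
table.clear_deletion()
print([(row["serial"], row["parent"]) for row in table.rows])
```

After the clear, the former sequence 3 becomes sequence 2 and still points at its parent, which shows why the parent links have to be renumbered together with the serial numbers.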

Menu Edit -> insert sequences allows the user to insert one or more query sequences. All the inserted sequences are marked as seeds; they have a parent of -1 and a level of 0. A file in FASTA format or one-letter code is permitted.

Menu Edit -> search searches the annotations of all the sequences for a simple word. The match is case insensitive. A message window for the next matching sequence will be opened.

Set parameter

Menu Set -> parameter sets various parameters and thresholds.

General parameters:

Parameters for filter:


Parameters for select seed:

Parameters to stop Saturated BLAST:

Set BLAST search

Menu Set -> BLAST search defines the BLAST search program, site and options.

The upper-left panel of the window shows the defined job or job list. These jobs are assigned to each seed after Ok is pressed. Use the Add, Replace, and Delete buttons to add a new job, or to replace or delete the selected job.

Right panel has the BLAST options:

Set seed

Menu Set -> seed marks the selected sequences as seeds.

Menu Set -> unset seed clears the seed mark of the selected sequences.

Menu Set -> clear seed clears all seed marks.

Menu Set -> default seed first clears all seed marks, then sets seeds according to the parameters defined by menu Set -> parameter.

Menu Set -> reuse seed reactivates the selected seeds if they have been used before.

Set break point

Menu Set -> break point on turns on break points on selected seeds.

Menu Set -> break point off turns off break points on selected seeds.

Menu Set -> clear break point turns off all the break points.

Set smart filter

Menu Set -> smart filter enables the smart filter; the program will then remember all the sequences the user deletes.

Menu Set -> show filter list lists the gi numbers of all the sequences in the smart filter list.

Menu Set -> export/import filter list outputs or inputs the smart filter list.

Menu Set -> clear filter list empties the smart filter list.

Selection

Menu Select -> all selects all the sequences and refreshes the main display window.

Menu Select -> clear clears the whole selection.

Menu Select -> inverse inverts the selection.

Menu Select -> logic sets the logical pattern for advanced selection, and Menu Select -> seed.

Advanced selection

Menu Select -> customize can make very complicated selections. A list of restrictions is available for making a selection, and the restrictions are combined with logical "and". For most of them, the user needs to give a comparison operator such as ">", "==", "and" and so on.

For example, if you specify "expect < 0.0001" and "parent == 0", the program will select all the children of sequence 0 with an expect value lower than 0.0001.

Keyword matching and regular expressions are case insensitive. Here, regular expression refers to the regular expressions of the Perl language. For example, "human|mouse" will match sequences whose description contains "human", "Human", "HUMAN", or "mouse". This is very useful; you can also use it to search for sequence patterns.
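
The "and" logic and the case-insensitive pattern matching can be sketched in Python as follows. Python's re module is used here only as a stand-in for Perl regular expressions (a simple alternation such as "human|mouse" behaves the same in both). The function `select` and the row fields are hypothetical.

```python
import re

def select(rows, max_expect=None, parent=None, pattern=None):
    """Return the serial numbers of rows meeting ALL given restrictions."""
    regex = re.compile(pattern, re.IGNORECASE) if pattern else None
    chosen = []
    for row in rows:
        if max_expect is not None and not row["expect"] < max_expect:
            continue
        if parent is not None and row["parent"] != parent:
            continue
        if regex is not None and not regex.search(row["description"]):
            continue
        chosen.append(row["serial"])
    return chosen

# Made-up rows for illustration.
rows = [
    {"serial": 1, "parent": 0, "expect": 1e-20,
     "description": "kinase, HUMAN"},
    {"serial": 2, "parent": 0, "expect": 0.5,
     "description": "kinase, mouse"},
    {"serial": 3, "parent": 1, "expect": 1e-9,
     "description": "hypothetical protein, Mouse"},
    {"serial": 4, "parent": 0, "expect": 1e-6,
     "description": "hypothetical protein, yeast"},
]

by_expect_and_parent = select(rows, max_expect=0.0001, parent=0)
by_pattern = select(rows, pattern="human|mouse")
print(by_expect_and_parent, by_pattern)
```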

If you are not familiar with Perl, consult the Perl manual.

Display

Menu Display -> refresh redraws the main display window.

Menu Display -> all displays all the sequences.

Menu Display -> seed displays all seed sequences.

Menu Display -> selected displays all selected sequences.

Menu Display -> sort by sorts the displayed sequences in order of serial no, gi no, expect, and so on.

Menu Display -> sort mode defines whether the sequences are sorted in ascending or descending order.

Menu Display -> show (gi, hit no, and so on) turns the display of each field on or off.

BLAST alignment

Menu Tool -> BLAST alignment builds a multiple alignment of one seed and its children. This alignment is derived from the BLAST alignments, with the gaps in the seed sequence deleted. The user needs to supply the serial number of the seed sequence. Menu Display -> all, seed or selected can also control which sequences are displayed in this window.
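
The idea of stacking pairwise BLAST alignments on the seed can be sketched in Python as follows. This is an illustration, not the program's code: each pairwise alignment is assumed to be given as the 0-based start position in the seed plus the two aligned strings, and every column where the seed has a gap is dropped so that all children line up against the ungapped seed. The function names and sequences are made up.

```python
def drop_seed_gaps(seed_aln, hit_aln):
    """Remove the columns of a pairwise alignment where the seed has a gap."""
    return "".join(h for s, h in zip(seed_aln, hit_aln) if s != "-")

def stack_on_seed(seed, pairwise):
    """Build a seed-anchored multiple alignment.

    pairwise: list of (name, start, seed_aln, hit_aln), where start is
    the 0-based position in the seed at which the alignment begins."""
    rows = [("seed", seed)]
    for name, start, seed_aln, hit_aln in pairwise:
        core = drop_seed_gaps(seed_aln, hit_aln)
        row = "-" * start + core
        row += "-" * (len(seed) - len(row))  # pad to the seed length
        rows.append((name, row))
    return rows

seed = "MKTAYIAKQR"
pairwise = [
    # hit1 has an extra residue (G) opposite a gap in the seed.
    ("hit1", 0, "MKTA-YIAK", "MRTAGYLAK"),
    # hit2 aligns from seed position 2 and has a gap of its own.
    ("hit2", 2, "TAYIAKQR", "TA--AKQK"),
]

rows = stack_on_seed(seed, pairwise)
for name, row in rows:
    print(f"{name:5s} {row}")
```

The insertion in hit1 is lost in this view, which is the price of keeping every row the same length as the seed.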

Multiple alignment

Menu Tool -> linked alignment calculates the multiple alignment of any selection of sequences, derived directly or indirectly from BLAST results. In the alignment, the first sequence should be the ancestor of all the other sequences; this is the master sequence. This window has a simple pull-down menubar.

Program log

Menu Tool -> view log displays the LOG file, which records some important activities of Saturated BLAST. For example, if a BLAST search fails, the program records it in the LOG file.

Pair alignment

Menu Tool -> align 2 sequences aligns any pair of sequences. The alignment is either derived from BLAST or calculated by a dynamic programming function. The user needs to supply the serial numbers of the two sequences.

Cluster



Menu Tool -> cluster selected clusters the selected sequences into subgroups by the average linkage cluster method. Both sequence identity and the normalized alignment score can be used to cluster the sequences, and the user needs to give the threshold. Because the clustering requires pairwise alignments, the dynamic programming function is an option when they cannot be derived from the BLAST output. The pairwise scores are saved after calculation, so a repeated cluster computation is faster than the first one. This window has a simple pull-down menubar.
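
For readers unfamiliar with average linkage clustering, here is a minimal Python sketch of the general method with a similarity threshold. It is a textbook version under stated assumptions, not the program's own implementation: the similarity values are made up, and the function name and data layout are hypothetical.

```python
def average_linkage(names, similarity, threshold):
    """Merge the two most similar clusters repeatedly, where the
    similarity of two clusters is the average over all member pairs,
    until the best average falls below the threshold."""
    clusters = [[name] for name in names]

    def average(c1, c2):
        total = sum(similarity[a][b] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = average(clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up pairwise sequence identities (as fractions).
names = ["A", "B", "C", "D"]
pairs = {
    ("A", "B"): 0.9, ("C", "D"): 0.8,
    ("A", "C"): 0.2, ("A", "D"): 0.1,
    ("B", "C"): 0.3, ("B", "D"): 0.2,
}
similarity = {name: {} for name in names}
for (a, b), value in pairs.items():
    similarity[a][b] = value
    similarity[b][a] = value

clusters = average_linkage(names, similarity, threshold=0.5)
print(clusters)
```

Since the pairwise values are looked up rather than recomputed, rerunning the clustering with a different threshold is cheap, which matches the remark above about saved pairwise scores.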

Help

Menu Help -> help opens a small inline help window.

Menu Help -> manual (official, local) opens a Netscape window showing this User's Manual from the official Saturated BLAST site or from the user's local copy.

Action button

There are seven action buttons just below the menubar of Saturated BLAST. They are used to control the BLAST search.

FAQ

I need your input! Thanks,