Research

[ CD-HIT | Novel Gene Discovery | Saturated BLAST | FFAS & PDB-BLAST | Modeller ]
[ Virtual Screening | Drug Target Discovery | Solving Biology Problems | Previous Works ]

Summary: My research focuses on developing novel and effective methods for sequence analysis, gene discovery, protein structure and function prediction, structure-based drug discovery, genome analysis, and molecular modeling. I write programs and build databases and web servers to support my research and to serve the research community. I also apply the tools I develop to solve practical biological problems.
CD-HIT

The official web page of CD-HIT is at http://bioinformatics.burnham.org/cd-hi

CD-HIT is a novel, ultra-fast algorithm for clustering large protein databases.

Clustering groups similar sequences into families. It is very useful for classifying proteins into families, analyzing domains, organizing large protein databases, and improving the performance of database searches. However, clustering a large database is very time-consuming: with traditional methods it may take years to cluster the NCBI non-redundant protein database of more than two million sequences. The CD-HIT algorithm can finish such a job on a PC in a few hours. It is hundreds to thousands of times faster than other clustering methods such as NCBI BLASTCLUST.
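To make the idea of grouping similar sequences concrete, here is a minimal sketch of a generic greedy incremental clustering scheme. It is an illustration only and not the CD-HIT implementation: the `identity` function is a crude, alignment-free stand-in chosen for brevity, the names `greedy_cluster` and `identity` are hypothetical, and none of the optimizations that make the real program fast are shown.

```python
def identity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption: no alignment).

    Compares the two sequences position by position and divides the number
    of matches by the length of the shorter sequence.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Cluster sequences greedily, longest first.

    Each sequence joins the first representative it matches at or above
    `threshold`; otherwise it becomes the representative of a new cluster.
    Returns a mapping: representative name -> list of member names.
    """
    clusters: dict[str, list[str]] = {}
    # Process longest sequences first so that they become representatives.
    for name in sorted(seqs, key=lambda k: (-len(seqs[k]), k)):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Toy input (made-up sequences, for illustration only).
toy = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFS",
    "c": "GSSGSSGSSGSSGSSG",
}
print(greedy_cluster(toy, threshold=0.9))
```

In this toy run, `b` is absorbed into the cluster represented by the longer sequence `a`, while `c` starts its own cluster. The comparison against every existing representative is what makes a naive scheme slow on millions of sequences.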

The CD-HIT program is currently used by hundreds of research and educational groups, including some of the world's best-known institutions such as UniProt, PDB, EBI, and TIGR. For example, PDB applies it to develop non-redundant protein sets, and UniProt uses it to prepare the UniRef reference databases.

Novel Gene Discovery
 
Over the past years, I have been developing methods to identify novel genes from genomic sequences. Unlike commonly used ab initio prediction programs such as GENSCAN, I use a sequence homology-based approach.

Given proteins of interest, we use them to search the genome for possible fragments of protein-coding genes with sensitive homology detection tools such as PDB-BLAST, FFAS, and Saturated BLAST (see the following sections). We then perform comprehensive sequence analysis and structural modeling on these fragments, followed by manual gene assembly. In this way we can effectively discover novel genes not identified in public genome annotations. We have successfully identified several novel apoptosis protein families, as well as members of other families such as kinases and glucose transporters.

I recently led a large-scale novel gene discovery project in collaboration with Biogen-IDEC pharmaceuticals. The goal is to identify novel human genes for use as therapeutic targets to treat cancers or inflammatory disorders. We identified a significant number of potential novel targets, which are being validated by an experimental group at Biogen-IDEC.

The detailed structure of my gene discovery system is shown below.

We initially translated the entire human genome in all six reading frames and collected translated peptide fragments that contain no stop codon and are ≥ 30 amino acids long. This resulted in about 26 million fragments. Half of them originated from repetitive DNA sequences and were therefore eliminated from further analysis. The rest were collected in a library of Putative Coding Fragments (PCFs), and sequence profiles were built for these PCFs with the PDB-BLAST and FFAS methods.
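The fragment collection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the project: it assumes the standard genetic code, treats any codon containing a non-ACGT character as the unknown residue `X`, and interprets "fragments without stop codon" as the stretches left after splitting each translated frame at stop codons. The function names are hypothetical, and repeat filtering is not shown.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> amino acid, with '*' marking a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate DNA codon by codon; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_fragments(dna: str, min_len: int = 30) -> list[tuple[str, str]]:
    """Return (frame, peptide) pairs for stop-free stretches of >= min_len residues.

    Frames are labelled +1, +2, +3 for the forward strand and -1, -2, -3 for
    the reverse complement.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    fragments = []
    for sign, strand in strands.items():
        for offset in range(3):
            protein = translate(strand[offset:])
            # Splitting at '*' leaves only stretches that contain no stop codon.
            for peptide in protein.split("*"):
                if len(peptide) >= min_len:
                    fragments.append((f"{sign}{offset + 1}", peptide))
    return fragments


# Made-up example: a start codon, 30 alanine codons, then a stop codon.
example = "ATG" + "GCT" * 30 + "TAA"
for frame, peptide in six_frame_fragments(example):
    print(frame, len(peptide), peptide)
```

On a whole genome the same logic would be applied chromosome by chromosome, ideally in a streaming fashion rather than on one in-memory string.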

Overview

Given one or more known proteins, the system:

  • a) First searches the protein database with an intermediate sequence or profile search to retrieve all the close and remote homologues of this protein family.
  • b) Selects representatives of the family (depending upon diversity within the family) by applying an appropriate clustering tool such as CD-HIT or PSI-CD-HIT. These representative sequences are then used to retrieve a conserved sequence pattern from the multiple sequence alignment.
  • c) Builds sequence profile for each representative protein from
  • d) Performs profile-sequence and profile-profile search against PCFs to collect genomic fragments.
  • e) Calculates a broad array of fragment properties, such as sequence complexity, exon probability, position relative to known genes and ESTs, and the score for matching a specific sequence pattern.
  • f) Annotates the fragments by comparing them against known protein profile databases such as PDB, Pfam and UniProt.
  • g) Filters out hits that are random or that correspond to known genes. It also ranks fragments according to all available calculated results. More detailed analyses are performed on the top-ranking fragments.
  • h) Full-length genes are assembled.
  • i) Fragments and assembled genes are tested experimentally.
  • j) Fragments, predicted genes, and their annotations are built into a relational database.

The figure above is a snapshot of the novel gene annotation server.

 

Saturated BLAST
 
The official web page of Saturated BLAST is at http://bioinformatics.burnham.org/xblast

Saturated BLAST is a program that implements an automated multiple intermediate sequence search (MISS) method. It is a robust and efficient tool for detecting remote homology, analyzing protein families, and discovering new genes.

MISS is powerful in remote homology detection. However, it produces many redundant and false positive hits. Saturated BLAST includes several techniques designed to address these problems.
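The general idea of an intermediate sequence search can be sketched as a breadth-first expansion in which the hits of one search become the queries of the next. The sketch below is an assumption-laden illustration, not Saturated BLAST itself: `search` is a hypothetical callable standing in for a real homology search, `max_rounds` is an invented parameter, and the only problem it handles is redundancy (each sequence is collected once). Controlling false positives is not shown.

```python
from collections import deque
from typing import Callable, Iterable


def intermediate_sequence_search(
    query: str,
    search: Callable[[str], Iterable[str]],
    max_rounds: int = 3,
) -> dict[str, int]:
    """Breadth-first search in which hits from each round seed the next round.

    Returns each sequence ID found, mapped to the round in which it was first
    reached (the query itself is round 0).
    """
    found = {query: 0}
    queue = deque([query])
    while queue:
        current = queue.popleft()
        depth = found[current]
        if depth >= max_rounds:
            continue  # do not expand beyond the allowed number of rounds
        for hit in search(current):
            if hit not in found:  # skip redundant hits already collected
                found[hit] = depth + 1
                queue.append(hit)
    return found


# Toy "database": each ID lists the IDs a direct search would return.
toy_hits = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["E"]}
print(intermediate_sequence_search("A", lambda q: toy_hits.get(q, []), max_rounds=2))
```

Here `C` is not a direct hit of `A` but is reached through the intermediate `B`, which is exactly how remote homologues are picked up. The same transitivity is also how false positives can spread, which is why extra filtering is needed in practice.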

Besides MISS, Saturated BLAST also incorporates other powerful homology detection tools such as PDB-BLAST, multiple alignment tools, a clustering function, and many other unique utilities.

This program has been used extensively by our group in gene finding, protein sequence analysis, and fold prediction. We have found many new genes for several biological groups. It was a very useful tool in our CASP4 prediction.

 

FFAS and PDB-BLAST
 
The official web page of FFAS is at http://ffas.burnham.org/FFAS The old web page of FFAS is at http://bioinformatics.burnham.org/FFAS

FFAS is a sensitive sequence profile-profile alignment program that ranked first in fold recognition in the CAFASP competition.

The official web page of PDB-BLAST is at http://bioinformatics.burnham.org/pdb_blast

PDB-BLAST is a PSI-BLAST-based method: it builds a sequence profile from a curated, non-redundant, comprehensive protein database and then uses this profile to search a target database. PDB-BLAST doubles fold recognition accuracy and reduces database search time by up to 10-fold.
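The two-step idea (build a profile on a large database, then reuse it against a smaller target database) can be sketched with the `psiblast` program from NCBI BLAST+. This is an assumption: the source does not say which PSI-BLAST executable or options PDB-BLAST uses, and the database names, file names, and iteration count below are placeholders. The function only assembles the two command lines; it does not run them.

```python
def pdb_blast_commands(
    query_fasta: str,
    profile_db: str,
    target_db: str,
    pssm_path: str = "query.pssm",
    iterations: int = 3,
) -> tuple[list[str], list[str]]:
    """Build two psiblast command lines for a profile-then-search workflow.

    Step 1 iterates the query against a large database and saves the profile.
    Step 2 searches the target database starting from that saved profile.
    """
    build_profile = [
        "psiblast",
        "-query", query_fasta,
        "-db", profile_db,
        "-num_iterations", str(iterations),
        "-out_pssm", pssm_path,   # save the profile for reuse
        "-out", "build_profile.out",
    ]
    search_target = [
        "psiblast",
        "-in_pssm", pssm_path,    # start from the saved profile, not the raw query
        "-db", target_db,
        "-out", "search_target.out",
    ]
    return build_profile, search_target


# Placeholder names; the databases must already be formatted for BLAST+.
step1, step2 = pdb_blast_commands("query.fasta", "nr_clustered", "pdb_seqs")
print(" ".join(step1))
print(" ".join(step2))
```

Each list can be passed to `subprocess.run(..., check=True)` once BLAST+ and the databases are installed. Keeping the expensive profile-building step separate means the saved profile can be reused against several target databases.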


Number of correct vs wrong hits in database search

FFAS is more sensitive and PDB-BLAST is faster, so combining the two tools is the best strategy for homology detection.

Structure Based Virtual Lead Screening
 

I have two years of hands-on drug development experience at Quorex Pharmaceuticals Inc. (since acquired by Pfizer), where I was heavily involved in the development of RaLead, a computational approach for rapid lead identification and optimization.

The key elements of RaLead that I developed include a high-throughput virtual compound screening method, a curated and diverse screening compound library, combinatorial library modeling, and in-depth protein-ligand interaction calculations.


RaLead played an important role in Quorex' accelerated drug discovery

 

Drug Target Discovery
 
As the leader of the bioinformatics project at Quorex Pharmaceuticals, one of my responsibilities was selecting drug targets. Quorex's drug development focused on infectious diseases, and we were looking for novel targets in bacterial genomes.

We developed a high-throughput annotation system for pathogenic bacterial genomes and built an annotation database. We evaluated all the pathogenic bacterial genomes and identified a list of targets suitable for structure-based drug development.

I wrote a unique structure-sequence alignment program called Q-eye. It can map sequence features on 3D structures and vice versa. It can analyze a set of disconnected fragments or residues, so it can effectively evaluate properties such as degree of conservation of the binding sites of targets.


Alignment of binding site residues of two antibacterial targets, GyrB and ParE, by Q-eye
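The kind of mapping described for Q-eye can be illustrated with a small sketch that carries residue positions from one sequence to another through a pairwise alignment and then scores the conservation of a disconnected set of residues. This is not Q-eye's code: the function names are hypothetical, the alignment is assumed to be given as two gapped strings of equal length, and the sequences in the example are invented rather than taken from GyrB or ParE.

```python
def map_positions(aligned_a: str, aligned_b: str, gap: str = "-") -> dict[int, int]:
    """Map 0-based residue indices of sequence A to those of sequence B.

    Only alignment columns in which both sequences have a residue are mapped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    mapping: dict[int, int] = {}
    i = j = 0  # running residue indices in the ungapped sequences
    for x, y in zip(aligned_a, aligned_b):
        if x != gap and y != gap:
            mapping[i] = j
        if x != gap:
            i += 1
        if y != gap:
            j += 1
    return mapping


def site_identity(aligned_a: str, aligned_b: str, site_a: list[int], gap: str = "-") -> float:
    """Fraction of the listed A residues that align to an identical residue in B.

    `site_a` may be any set of residue indices, contiguous or not, such as
    the residues lining a binding site.
    """
    if not site_a:
        return 0.0
    mapping = map_positions(aligned_a, aligned_b, gap)
    seq_a = aligned_a.replace(gap, "")
    seq_b = aligned_b.replace(gap, "")
    same = sum(1 for i in site_a if i in mapping and seq_a[i] == seq_b[mapping[i]])
    return same / len(site_a)


# Invented alignment; indices 0, 2 and 4 of A play the role of a binding site.
aln_a = "MK-TAYI"
aln_b = "MKQTA-I"
print(map_positions(aln_a, aln_b))
print(site_identity(aln_a, aln_b, [0, 2, 4]))
```

Once residue indices are tied to structure coordinates, the same mapping lets a feature defined on the 3D structure of one protein be looked up in the sequence of another.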

 

Web Modeller
 
The official web page of Web-modeller is at http://bioinformatics.burnham.org/modeller

Web Modeller is an automated comparative modelling server that provides a web interface to Sali's Modeller program. You provide an alignment, and the server does the rest to build a 3D model.

This server is also connected to the PDB-BLAST and FFAS servers, so a few clicks take you through the whole process from database search to 3D model building.

 

Case Studies of Biological Problems
 
Solving practical biological problems is an important part of my research. I have published several papers with experimental groups covering various research fields, including the discovery of novel genes and protein families, in-depth analysis of protein-substrate interactions, protein structure and function prediction, homology modeling, and experimental data analysis. The figure below, which shows the results of an analysis of MMPs, is one such example.


(left) clustering analysis of catalytic activities of three Matrix Metalloproteases (MMPs) on different substrates (picture generated by my own program). (right) docking modeling of substrate binding on MMPs

 

Previous works
 
Before I came to the US, I participated in the following research projects: bioinformatics servers, protein structure prediction, protein design, drug design, and protein electrostatics.

Here is one of my previous research pages.