| CD-HIT |
CD-HIT is a novel, ultra-fast algorithm for clustering large protein databases. Clustering groups similar sequences into families. It is very useful for classifying proteins into families, analyzing domains, organizing large protein databases, and improving the performance of database searches. However, clustering a large database is very time-consuming; with traditional methods, it may take years to cluster the NCBI non-redundant protein database of more than two million sequences. The CD-HIT algorithm can finish such a job on a PC in a few hours. It is hundreds to thousands of times faster than other clustering methods such as NCBI BLASTCLUST. The CD-HIT program is currently used by hundreds of research and educational groups, including some of the world's best-known institutions such as UniProt, PDB, EBI, and TIGR. For example, PDB applies it to develop non-redundant protein sets, and UniProt uses it to prepare the UniRef reference databases.
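To make the idea of grouping similar sequences into families concrete, here is a minimal sketch of greedy incremental clustering in Python. It is an illustration only, not the CD-HIT implementation: the `ungapped_identity` measure, the `greedy_cluster` function, the 0.9 threshold, and the toy sequences are all assumptions chosen to keep the example short, and a real tool would use proper alignments and fast filtering.

```python
def ungapped_identity(a, b):
    """Fraction of identical positions over the shorter sequence.

    Assumption: a crude stand-in for a real alignment-based identity.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs, threshold=0.9):
    """Cluster sequences greedily, longest first.

    seqs: dict mapping sequence name -> sequence string.
    Returns a dict mapping each representative name -> list of member names.
    """
    # Process longer sequences first so they become cluster representatives.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    clusters = {}
    for name in order:
        for rep in clusters:
            if ungapped_identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)  # similar enough: join this family
                break
        else:
            clusters[name] = [name]  # no match: start a new family
    return clusters


# Hypothetical toy input, for illustration only.
example = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MKTAYIAKQRQISFVKSHFS",
    "p3": "GSHMLEDPVAGTWQ",
}
clusters = greedy_cluster(example, threshold=0.9)
print(clusters)
```

In this toy run, `p2` joins the family represented by `p1`, while `p3` starts its own family. Each sequence is compared only against representatives, never against every other sequence, which is what keeps this style of clustering cheap.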
|
| Novel Gene Discovery | |
|
Over the past several years, I have been developing methods to identify novel genes from genomic sequences. Unlike commonly used ab initio prediction programs such as GENSCAN, I use a sequence homology based approach. Given proteins of interest, we use them to search the genome for possible fragments of protein-coding genes with sensitive homology detection tools such as PDB-BLAST, FFAS, and Saturated BLAST (see the following sections). We then perform comprehensive sequence analysis and structural modeling on these fragments, followed by manual gene assembly. This lets us effectively discover novel genes that are not identified in public genome annotations. We have successfully identified several novel apoptosis protein families and other families such as kinases and glucose transporters. I recently led a large-scale novel gene discovery project in collaboration with Biogen-IDEC Pharmaceuticals. The goal is to identify novel human genes for use as therapeutic targets to treat cancers or inflammatory disorders. We identified a significant number of potential novel targets, which are being validated by an experimental group at Biogen-IDEC. The detailed structure of my gene discovery system is shown below.
|
| Saturated BLAST |
|
The official web page of Saturated BLAST is at http://bioinformatics.burnham.org/xblast
Saturated BLAST is a program that implements an automated multiple intermediate sequence search (MISS) method. It is a robust and efficient tool for detecting remote homology, analyzing protein families, and discovering new genes. MISS is powerful in remote homology detection; however, it produces many redundant and false positive hits. In Saturated BLAST, several techniques have been designed to address these problems. Besides MISS, other powerful homology detection tools such as PDB-BLAST, multiple alignment tools, a clustering function, and many other unique utilities are also built into Saturated BLAST. This program has been used extensively by our group in gene finding, protein sequence analysis, and fold prediction. We have found many new genes for several biological groups. It was also a very useful tool in our CASP4 predictions.
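The general idea behind intermediate sequence searching can be sketched as a breadth-first expansion: hits from one search become queries for the next, so distant relatives are reached through intermediates. The Python sketch below is an illustration under stated assumptions, not the Saturated BLAST code. The `search` callable, the `max_rounds` limit, and the toy hit table are hypothetical; a real run would call a homology search tool and would also need the redundancy and false-positive filtering described above.

```python
from collections import deque


def intermediate_sequence_search(query, search, max_rounds=3):
    """Expand a homology search through intermediate sequences.

    query: identifier of the starting sequence.
    search: hypothetical callable returning the hit identifiers for a sequence.
    max_rounds: how many search rounds away from the query to expand.
    Returns a dict mapping each found identifier -> round in which it was found.
    """
    found = {query: 0}
    queue = deque([query])
    while queue:
        current = queue.popleft()
        depth = found[current]
        if depth >= max_rounds:
            continue  # do not search further from sequences at the round limit
        for hit in search(current):
            if hit not in found:  # skip redundant hits already collected
                found[hit] = depth + 1
                queue.append(hit)
    return found


# Hypothetical toy hit table: A finds B, B finds C, C finds D.
toy_hits = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
result = intermediate_sequence_search(
    "A", lambda s: toy_hits.get(s, []), max_rounds=2
)
print(result)
```

Here `C` is never hit directly by `A` but is reached through the intermediate `B`, while `D` lies beyond the round limit. Capping the number of rounds is one simple way to limit the drift toward false positives that unrestricted expansion tends to produce.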
|
| FFAS and PDB-BLAST |
|
The official web page of FFAS is at http://ffas.burnham.org/FFAS
The old web page of FFAS is at http://bioinformatics.burnham.org/FFAS
FFAS is a sensitive sequence profile-profile alignment program, which ranked first in fold recognition in the CAFASP competition.
The official web page of PDB-BLAST is at http://bioinformatics.burnham.org/pdb_blast
PDB-BLAST is a PSI-BLAST based method; it builds a sequence profile from a curated, comprehensive, non-redundant protein database, and then uses this profile to search a target database. PDB-BLAST doubles the fold recognition accuracy and reduces database search time by up to 10-fold.
[Figure: Number of correct vs. wrong hits in database search] FFAS is more sensitive and PDB-BLAST is faster; combining these two tools is the best strategy for homology detection. |
| Structure Based Virtual Lead Screening |
|
I have two years of hands-on drug development experience at Quorex Pharmaceuticals Inc. (now acquired by Pfizer), where I was closely involved in the development of RaLead, a computational approach for rapid lead identification and optimization. The key elements of RaLead that I developed included a high-throughput virtual compound screening method, a curated and diverse screening compound library, combinatorial library modeling, and in-depth protein-ligand interaction calculation.
[Figure: RaLead played an important role in Quorex's accelerated drug discovery]
|
| Drug Target Discovery |
|
As the leader of the bioinformatics project at Quorex Pharmaceuticals, one of my responsibilities was selecting drug targets. Quorex's drug development focused on infectious diseases, and we were looking for novel targets in bacterial genomes. We developed a high-throughput annotation system for pathogenic bacterial genomes and built an annotation database. We evaluated all the pathogenic bacterial genomes and identified a list of targets suitable for structure-based drug development. I wrote a unique structure-sequence alignment program called Q-eye. It can map sequence features onto 3D structures and vice versa. It can analyze a set of disconnected fragments or residues, so it can effectively evaluate properties such as the degree of conservation of the binding sites of targets.
[Figure: Alignment of binding site residues of two antibacterial targets, GyrB and ParE, by Q-eye]
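Mapping sequence features onto a structure, and a set of disconnected structure residues back onto a sequence, comes down to translating positions through an alignment. The Python sketch below shows one way this could be done. It is not Q-eye's code: the function names, the 1-based numbering, the `-` gap character, and the toy alignment are assumptions for illustration.

```python
def map_positions(aligned_seq, aligned_struct, gap="-"):
    """Map 1-based sequence positions to 1-based structure positions.

    Both arguments are rows of the same pairwise alignment, so they must
    have equal length. Positions aligned to a gap are left out of the map.
    """
    if len(aligned_seq) != len(aligned_struct):
        raise ValueError("alignment rows must have the same length")
    mapping = {}
    i = j = 0  # residues consumed so far in each row
    for a, b in zip(aligned_seq, aligned_struct):
        if a != gap and b != gap:
            mapping[i + 1] = j + 1
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return mapping


def map_site_to_sequence(site_residues, mapping):
    """Translate structure residue numbers (e.g. a binding site) to sequence
    positions; residues facing a gap map to None."""
    inverse = {struct_pos: seq_pos for seq_pos, struct_pos in mapping.items()}
    return {res: inverse.get(res) for res in site_residues}


# Hypothetical toy alignment between a sequence and a structure's sequence.
seq_to_struct = map_positions("MK-TAY", "MKLT-Y")
site = map_site_to_sequence([2, 3, 4], seq_to_struct)
print(seq_to_struct)
print(site)
```

Because the site is passed as an arbitrary list of residue numbers, it does not need to be contiguous, which matches the use case of evaluating scattered binding site residues.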
|
| Web Modeller |
|
The official web page of Web-modeller is at http://bioinformatics.burnham.org/modeller
Web Modeller, an automated comparative modelling server, is a web interface to Sali's Modeller program. You provide an alignment, and the server does the rest to build a 3D model. The server is also connected to the PDB-BLAST and FFAS servers, so a few clicks take you through the whole process from database searching to 3D model building.
|
| Case Studies of Biological Problems |
|
Solving practical biological problems is an important part of my research. I have published several papers with experimental groups covering various research fields, including finding novel genes and protein families, in-depth analysis of protein-substrate interactions, protein structure and function prediction, homology modeling, and experimental data analysis. The figure below, showing results from an analysis of MMPs, is one such example.
[Figure: (left) clustering analysis of the catalytic activities of three matrix metalloproteases (MMPs) on different substrates (picture generated by my own program); (right) docking model of substrate binding on MMPs]
|
| Previous work |
|
Before I came to the US, I participated in the following research projects: bioinformatics servers, protein structure prediction, protein design, drug design, and protein electrostatics. Here is one of my previous research pages.
|