WO2003072701A1

WO2003072701A1 - A system for analyzing dna-chips using gene ontology and a method thereof

Info

Publication number: WO2003072701A1
Application number: PCT/KR2003/000400
Authority: WO
Inventors: Yang-Suk Kim; Jung-Uk Hur; Sung-Geun Lee
Original assignee: Istech Co., Ltd.
Priority date: 2002-02-28
Filing date: 2003-02-28
Publication date: 2003-09-04
Also published as: KR100431620B1; KR20030071225A; AU2003212669A1

Abstract

The present invention relates to a system and a method for biological analysis of gene expression patterns of DNA chips or DNA microarrays by mathematical modeling of the Gene Ontology™ hierarchical structure. Disclosed is a system for analyzing DNA chip data using Gene Ontology™, comprising: means for receiving statistical clustering results of DNA chip data and for assigning appropriate GO identifiers to each gene pertaining to a given cluster; means for converting each GO identifier assigned to said gene into a GO code using a GO code file; means for selecting a proper process among three predetermined processes that adopt a pseudo-distance, designating the necessary parameters, and extracting an optimal branch; means for extracting biological meanings from each extracted optimal branch; and, optionally, visualizing means for displaying the optimal branch, GO code and biological meanings of a given cluster of genes. The present invention enables a systematic and automated biological analysis of gene expression patterns of DNA chips by modeling the GO hierarchy.

Description

A SYSTEM FOR ANALYZING DNA-CHIPS USING GENE ONTOLOGY AND A METHOD THEREOF

Technical Field

The present invention relates to a system for DNA microarray analysis using Gene Ontology™ and a method thereof, and more specifically to a system and a method for biologically analyzing a gene expression pattern of a DNA chip or microarray assay by modeling the hierarchical structure of gene ontology (hereinafter referred to as "GO").

Background Art

Since the discovery of the double helix structure of DNA by Watson and Crick in 1953, the discovery of restriction enzymes and the development of hybridization techniques and the polymerase chain reaction (PCR) have greatly contributed to the understanding of life phenomena at the molecular level. Also, to meet the rising need for a comprehensive and non-fragmentary understanding of life phenomena that show complicated regulation mechanisms, studies such as the human genome project (HGP) have been made to identify the functions of base sequences, resulting in the development of DNA chips. In order to efficiently utilize the results of the HGP and DNA chips, studies on bioinformatics and functional genomics are highly active.

Biochips are broadly divided into microarray chips and microfluidics chips. A microarray chip contains thousands or tens of thousands of DNA or protein samples arranged at regular intervals, and thus can process an analyte to identify its binding pattern. Microarray chips generally refer to DNA chips and protein chips. DNA chips have been the most dominant biochips to date. Microfluidics chips pass a small amount of analyte over the chip in a controlled flow and analyze the reaction pattern of the analyte with the molecules on the chip or with a sensor.

DNA chips are made by spotting a target DNA, cDNA or oligonucleotide on a glass slide, nitrocellulose membrane or silicon. In other words, DNA chips consist of a small-sized solid on which cDNA or oligonucleotide probes with known base sequences are micro-arrayed at predetermined positions.

DNA chips, if hybridized with a probe labeled with a radioactive isotope or fluorescent dye, can be used in identification of gene mutations and levels of gene expression, single nucleotide polymorphism (SNP), diagnosis of diseases, high-throughput screening (HTS) and so on. When a sample DNA fragment to be analyzed is combined to a DNA chip, the probe affixed to the DNA chip and the base sequence of the sample DNA fragment are hybridized depending on the level of complementarity. It is possible to analyze the base sequence of the sample DNA by detecting and understanding the hybridization by an optical or radioactive chemical method. If DNA chips are utilized, expression information of genes can be easily and rapidly obtained. DNA chips are now used for development of new drugs and medical diagnosis.

Both statistical methods and biological methods are used to analyze DNA microarray data. Based on the gene expression levels detected by an image analysis, genes showing the common expression pattern are clustered by a statistical method. It is possible to give a general biological meaning to the cluster and validate the biological reliability of the cluster based on the known function of each gene.

Previously, biological validation has been made by obtaining information about the functions of genes from biomedical literature or existing biological databases and then comparing the functions of genes with the DNA microarray data. Available databases are the NCBI (National Center for Biotechnology Information) providing basic DNA information, the MIPS (Munich Information Center for Protein Sequences) and CGAP (Cancer Genome Anatomy Project) providing functional category information, and the Swiss-Prot Protein Knowledgebase providing annotated protein sequence information. However, most biological identification works have been done manually by researchers or scientists. Due to the diversity of biological terms, it has been difficult to perform systematic and automated biological analyses.

As a well-known biological database, for example, Swiss-Prot provides protein information and classifies the functions of proteins by keywords. However, there is no correlation or hierarchy between the keywords used for the classification, which makes it difficult to perform automated biological analyses of DNA chips. Further, the group information of particular fields, such as the CGAP (Cancer Genome Anatomy Project), provides highly focused information for those fields only. And since the group information covers an overly broad function, it is difficult to cover a detailed function.

As an effort to overcome such difficulties, GO terms offered by the Gene Ontology Consortium can be used. Ontology here means a classification system for biological terms and vocabularies. The goal of the Gene Ontology Consortium is to construct a controlled and unified system of biological terms. The Consortium provides about 10,000 dynamic controlled terms (the number of terms varies as necessary) that can be applied to describe the roles of genes and gene products in all organisms. Gene Ontology™ (GO) shows the relationships between genes and the keywords assigned to each gene, and it is applicable to bioinformatics.

The GO terms, amounting to about 10,000, have a tree-like hierarchical structure called a DAG (Directed Acyclic Graph). The GO terms can be used to find a biological meaning when analyzing a DNA chip. Generally, GO terms are classified into three categories reflecting the biological roles of genes: i) molecular function, ii) biological process and iii) cellular component. Hierarchically controlled vocabularies are established for each category. The three categories are not mutually exclusive; all three are descriptive of a gene.

Disclosure of the Invention

Therefore, the present invention has been made in view of the above-mentioned problems, and it is an object of the present invention to provide a system and a method for DNA chip analysis utilizing Gene Ontology™ to enable a systematic biological analysis of a gene expression pattern of a DNA chip test by modeling of a GO hierarchy.

Another object of the present invention is to provide a method for extracting representative functions which are most common and ideal among the genes contained in a cluster formed by statistical clustering of DNA chip test results, utilizing GO terms and the hierarchical tree structure.

Brief Description of the Drawings

The foregoing and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:

FIG. 1 is a view showing the construction of a system for DNA chip analysis using Gene Ontology™ according to the present invention.

FIG. 2 shows one example of the GO tree structure according to the present invention.

FIG. 3 shows one example of a modification of the GO tree structure in text format according to the present invention.

FIG. 4 shows one example of a conversion of extracted GO codes according to the present invention.

FIG. 5 is a view briefly showing the principle of finding an optimal branch using GO according to the present invention.

FIG. 6 is a view showing the principle of measuring a pseudo-distance according to the present invention.

FIG. 7 is an operation flow of the analysis of a DNA chip using GO according to the present invention.

Best Mode for Carrying Out the Invention

In order to accomplish the objects of the present invention, there is provided a system for DNA chip analysis using Gene Ontology™, comprising: a) means for receiving statistical clustering results of DNA chip data and for assigning appropriate GO identifiers to each gene pertaining to a given cluster; b) means for converting each GO identifier assigned to said gene into a GO code using a GO code file; c) means for selecting a proper process among three predetermined processes that adopt a pseudo-distance, designating the necessary parameters, and extracting an optimal branch; d) means for extracting biological meanings from each extracted optimal branch; and e) optionally, visualizing means for displaying the optimal branch, GO code and biological meanings of a given cluster of genes. The visualizing means displays summarized information on the GO code, optimal branch and biological meanings of a given cluster of genes in the form of a table or in the form of a graphical tree structure.

The optimal branch gives a proper weight to each level of the GO tree structure. The pseudo-distance "Pd(v1,v2)," wherein v1 and v2 represent nodes, is the weight of the level corresponding to the GO code of the optimal branch formed by nodes v1 and v2. When v1 and v2 are the same, the Pd value is zero.

In a cluster or a group "G" of GO codes, each optimal branch is computed using the maximum pseudo-distance (max_pd) and the average pseudo-distance (aver_pd). Given a multiset G = {v1, v2, v3, v4, ..., vn} of GO codes, max_pd and aver_pd are defined as follows:

$$\mathrm{max\_pd}(G) = \max\{\, pd(v_i, v_j) \,\} \quad (1 \le i < j \le n)$$

$$\mathrm{aver\_pd}(G) = \frac{\sum_{1 \le i < j \le n} pd(v_i, v_j)}{{}_{n}C_{2}} = \frac{2 \sum_{1 \le i < j \le n} pd(v_i, v_j)}{n(n-1)}$$

Among possible combinations of codes, the lowest values of max_pd and aver_pd are finally considered as optimal. The maximum pseudo-distance (max_pd) is used to roughly evaluate clusters. If the optimal branch of a cluster is located at a higher level, it will be likely that the cluster may include bad genes which do not share common biological characteristics with the other genes in that cluster.

The average pseudo-distance (aver_pd) shows how well genes are clustered with the same functional categories, and how frequently similar codes are observed.
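As a worked illustration with hypothetical values (the pseudo-distances below are assumed for the example and do not come from the disclosure), take a cluster of three GO codes, n = 3, with pd(v1,v2) = 1, pd(v1,v3) = 3 and pd(v2,v3) = 3. The two measures then evaluate to:

$$\mathrm{max\_pd}(G) = \max\{1, 3, 3\} = 3$$

$$\mathrm{aver\_pd}(G) = \frac{1 + 3 + 3}{{}_{3}C_{2}} = \frac{7}{3} = \frac{2 \times 7}{3 \times 2} \approx 2.33$$

Here the single pair with a small pseudo-distance lowers aver_pd, while max_pd is set entirely by the most distant pair.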

The predetermined processes of the means for extracting an optimal branch comprise i) a basic process, ii) an N-level selective process and iii) a percentage selective process. The means designates a proper process among them, together with the necessary parameters, to extract an optimal branch.

The basic process utilizes the maximum pseudo-distance (max_pd) and average pseudo-distance (aver_pd) of all nodes in the GO tree structure. The results obtained by the basic process roughly show the biological meanings of a given cluster. The N-level selective process predesignates and computes each level of the optimal branch, observes formation of the optimal branch at a particular level N and infers by analogy the biological meaning at a lower level. The percentage selective process predesignates the percentage of genes pertaining to the optimal branch and shows all combinations of genes in percentages desired by a user. The N-level selective process shows both the first candidate GO code combination and the next candidate combinations, to reflect the fact that a single gene can be involved in two or more functions.

In order to accomplish the objects of the present invention, there is also provided a method for DNA chip analysis using Gene Ontology™, comprising: a) receiving statistical clustering results of DNA chip data and assigning appropriate GO identifiers to each gene pertaining to a given cluster; b) converting each GO identifier assigned to said gene into a GO code using a GO code file; c) selecting a proper process among three predetermined processes that adopt a pseudo-distance, designating the necessary parameters, and extracting an optimal branch; d) extracting biological meanings from each extracted optimal branch; and e) optionally, displaying the optimal branch, GO code and biological meanings of a given cluster of genes.

At the step of extracting an optimal branch, the predetermined processes comprise i) a basic process, ii) an N-level selective process and iii) a percentage selective process. A proper one of these processes and the necessary parameters are designated to extract an optimal branch.

Hereinafter, a preferred embodiment of the present invention will be described with reference to the accompanying drawings.

FIG. 1 is a view showing the construction of a system for DNA chip analysis using GO. The system for DNA chip analysis comprises: an input section (110) for inputting statistical clustering results of said DNA chip data; a GO identifier assigning section (130) for assigning GO identifiers, using a GO identifier index file (120), to each gene pertaining to each cluster of the inputted clustering results; a GO identifier/GO code converting section (140) for converting each GO identifier assigned to the corresponding genes into a GO code utilizing a GO code file; an optimal branch extracting section (220) for selecting a predetermined process according to a pseudo-distance algorithm (210) to designate a necessary parameter for said GO code and extracting an optimal branch; and a biological meanings extracting section (230) for extracting biological meanings from each optimal branch. The system may further comprise a visualization module (310) for displaying the extracted optimal branch, the GO code and the biological meanings of a cluster.

The present invention aims to extract the GO terms that are the commonest or most functionally-related among genes pertaining to a cluster formed by statistical clustering of DNA chip data, utilizing GO terms and GO tree structure.

To this end, the present invention assigns GO terms to each gene, extracts an optimal branch by mathematically utilizing the GO hierarchical tree structure, and efficiently displays the results of the optimal branch extraction.

FIG. 2 shows an example of the GO tree structure according to the present invention. The highest level is the GO level. The second level consists of the three categories, i.e., molecular function, biological process and cellular component. The lower levels (3rd, 4th and 5th) form a tree-like inheritance structure. FIG. 3 shows an example of a modification of the GO tree in a text format according to the present invention. Strictly speaking, GO in its original form is not a tree but a mathematical graph called a DAG, that is, a directed graph without cycles. In the present invention, the GO structure is simply converted into a GO tree structure. Further, FIG. 4 shows an example of conversion from GO terms to GO codes according to the present invention. This drawing illustrates the GO codes output by the GO code converting section (140).

An optimal branch refers to the lowest nodes among the nodes including the greatest number of genes at the bottom in a tree structure. The optimal branch is a broad term representing all the functions of genes included in the nodes at the bottom. The system of the present invention assigns genes pertaining to a given cluster in the GO tree structure, finds the optimal branch through the pseudo-distance algorithm and displays the results. GO terms are assigned to corresponding genes by text mining of various biological databases. For the allocation of GO terms, such information, at the DNA level or the protein level as provided by UniGene, LocusLink, Swiss-Prot and MGI, is utilized together with direct comparison of identifiers and sequence similarity searches. Also, gene identifier conversion files provided by each database that participated in the GO Consortium are utilized to assign GO terms.

UniGene of the NCBI (National Center for Biotechnology Information) provides gene information at the DNA level. LocusLink, which is the result of the reference sequence project of the NCBI, provides information about the functions of genes and representative sequences. Swiss-Prot of the Swiss Institute of Bioinformatics provides gene information at the protein level. MGI (Mouse Genome Informatics) provides integrated access to data on the genomics of laboratory mice.

In the present invention, in order to compute an optimal branch in GO tree and to find representative GO terms over a given cluster based on the optimal branch, all nodes in the GO tree are GO-coded. As shown in FIG. 4, each GO code is a sequence of numbers, here 15 numbers. Note that the sequence length 15 is variable according to the versions of GO syntax files. Each number in a GO code sequence represents the positional information at each step. Since a unique GO code is assigned to each node in GO tree, GO terms are distinguishable even if the same terms are used in different nodes in the GO tree structure.
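The conversion of the GO DAG into a coded tree can be illustrated with a minimal Python sketch. It assumes that a GO code is modeled as a tuple of child positions along the path from the root (the disclosure only states that each number carries positional information at each step), and the function name `unfold_dag` and the toy terms are hypothetical. A term with two parents appears at two tree nodes and therefore receives two distinct codes.

```python
from collections import defaultdict


def unfold_dag(children, root):
    """Unfold a DAG into a tree and give every tree node a positional code.

    children: dict mapping a term to the ordered list of its child terms.
    Returns a dict mapping each term to the list of its codes, one code
    per path from the root (assumed encoding: tuple of child positions).
    """
    codes = defaultdict(list)

    def visit(term, code):
        codes[term].append(code)
        for position, child in enumerate(children.get(term, []), start=1):
            visit(child, code + (position,))

    visit(root, (1,))
    return dict(codes)


# Toy DAG (hypothetical terms): "D" has two parents, "A" and "B".
toy_dag = {
    "GO": ["A", "B"],
    "A": ["C", "D"],
    "B": ["D"],
}
toy_codes = unfold_dag(toy_dag, "GO")
print(toy_codes["D"])  # two codes, one per tree position of "D"
```

Real GO code files may pad every code to a fixed length (15 numbers in FIG. 4); the unpadded tuples here are a simplification.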

FIG. 5 is a view for briefly explaining the principle of finding an optimal branch using GO tree according to the present invention.

Computationally, the optimal branch can be found using GO codes. At each level of the GO tree, a weight is defined, and from these weights a pseudo-distance between GO codes is defined.

FIG. 6 is a view for explaining the principle of measuring a pseudo-distance between nodes according to the present invention. The pseudo-distance is defined as follows.

Pd(v1,v2) is the weight of the level at which the optimal branch formed by nodes v1 and v2 is located. When v1 and v2 are the same, the Pd value is zero. Ultimately, a combination of GO codes is selected through the following pseudo-distance concept.

$$pd(v_1, v_2) = \begin{cases} \text{weight of the level where the optimal branch between } v_1 \text{ and } v_2 \text{ is located} & (v_1 \ne v_2) \\ 0 & (v_1 = v_2) \end{cases}$$

$$\mathrm{max\_pd}(G) = \max\{\, pd(v_i, v_j) \,\} \quad (1 \le i < j \le n)$$

$$\mathrm{aver\_pd}(G) = \frac{\sum_{1 \le i < j \le n} pd(v_i, v_j)}{{}_{n}C_{2}} = \frac{2 \sum_{1 \le i < j \le n} pd(v_i, v_j)}{n(n-1)}$$
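The pseudo-distance definitions can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: GO codes are modeled as tuples of integers, the level of the optimal branch of two codes is taken to be the length of their common prefix, and the level weight is assumed to be `MAX_LEVEL` minus that length, so that branches nearer the root weigh more. The disclosure does not fix the weights. The names `common_level`, `best_combination` and `MAX_LEVEL` are hypothetical, and breaking ties on max_pd by aver_pd is likewise an assumption.

```python
from itertools import combinations, product

MAX_LEVEL = 5  # assumed depth of the toy tree; real GO codes are longer


def common_level(code_a, code_b):
    """Number of leading positions shared by two GO codes."""
    level = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        level += 1
    return level


def pd(code_a, code_b):
    """Pseudo-distance: 0 for identical codes, otherwise the assumed
    weight (MAX_LEVEL - level) of the deepest branch shared by both."""
    if code_a == code_b:
        return 0
    return MAX_LEVEL - common_level(code_a, code_b)


def max_pd(codes):
    """Largest pairwise pseudo-distance (needs at least two codes)."""
    return max(pd(a, b) for a, b in combinations(codes, 2))


def aver_pd(codes):
    """Mean pairwise pseudo-distance over all nC2 pairs."""
    pairs = list(combinations(codes, 2))
    return sum(pd(a, b) for a, b in pairs) / len(pairs)


def best_combination(codes_per_gene):
    """Choose one code per gene so that (max_pd, aver_pd) is lowest."""
    return min(product(*codes_per_gene),
               key=lambda combo: (max_pd(combo), aver_pd(combo)))


# Toy cluster: the first gene has two candidate codes, the others one each.
cluster = [
    [(1, 1, 2), (1, 2, 1)],
    [(1, 1, 2, 3)],
    [(1, 1, 1)],
]
best = best_combination(cluster)
print(best, max_pd(best), aver_pd(best))
```

The exhaustive search over `product(*codes_per_gene)` grows quickly with the number of multi-code genes, so it is suitable only for small illustrative clusters.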

The maximum pseudo-distance (max_pd) is used to roughly evaluate clusters. If the optimal branch of a cluster is located at a higher level, the cluster is likely to include bad genes which do not share the common characteristics with the other genes in that cluster.

The average pseudo-distance (aver_pd) shows how well genes are clustered in a given cluster with similar functional categories, and how frequently similar GO codes are observed.

The pseudo-distance is applicable to three processes: a basic process, an N-level selective process and a percentage selective process.

The basic process has two modules using the maximum pseudo-distance (max_pd) and average pseudo-distance (aver_pd) of all nodes in the GO tree structure. The results of the basic process show the overall biological meanings of a given cluster. In the N-level selective and percentage selective processes, a user can designate particular limits. The N-level selective process predesignates the level of an optimal branch so that the formation of the optimal branch at a particular level N can be easily computed. Also, the N-level selective process enables the user to easily infer by analogy the biological meanings at a lower level, which is not possible in the basic process. The N-level selective process shows both the first candidate GO code combination and the next candidate combinations, to reflect the fact that a single gene can be involved in two or more functions. The percentage selective process predesignates the percentage of genes pertaining to the optimal branch and finds all combinations of genes in percentages desired by a user. Like the N-level selective process, the percentage selective process can fully show the functional diversity of genes.
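One plausible reading of the two selective processes is sketched below. The sketch is an assumption-laden illustration: it counts genes per branch rather than reproducing the disclosed pseudo-distance computation, GO codes are modeled as tuples of integers, a branch is identified by a code prefix, and the function names `branches_at_level` and `branches_covering` are hypothetical.

```python
from collections import Counter


def branches_at_level(codes_per_gene, n):
    """N-level style ranking: level-n branches ordered by the number of
    genes that have at least one code beneath them (first candidate
    first, next candidates after it)."""
    counts = Counter()
    for gene_codes in codes_per_gene:
        # A gene counts once per branch even if several codes fall there.
        prefixes = {code[:n] for code in gene_codes if len(code) >= n}
        counts.update(prefixes)
    return counts.most_common()


def branches_covering(codes_per_gene, percent):
    """Percentage style selection: the deepest branches that contain at
    least `percent` percent of the genes in the cluster."""
    counts = Counter()
    for gene_codes in codes_per_gene:
        prefixes = {code[:k] for code in gene_codes
                    for k in range(1, len(code) + 1)}
        counts.update(prefixes)
    needed = percent / 100 * len(codes_per_gene)
    hits = [branch for branch, count in counts.items() if count >= needed]
    # Drop any branch that still has a qualifying descendant.
    return sorted(b for b in hits
                  if not any(len(o) > len(b) and o[:len(b)] == b
                             for o in hits))


# Toy cluster of four genes; the first gene has two candidate codes.
cluster = [
    [(1, 1, 2), (1, 2, 1)],
    [(1, 1, 2, 3)],
    [(1, 1, 1)],
    [(1, 2, 1, 4)],
]
print(branches_at_level(cluster, 2))
print(branches_covering(cluster, 50))
print(branches_covering(cluster, 75))
```

Because a gene with several codes is counted under each of its branches, the ranking surfaces secondary candidates as well as the first one, which is the behavior the text attributes to both selective processes.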

FIG. 7 is the flow diagram of DNA chip analysis using GO according to the present invention. The method of analysis comprises the steps of: receiving statistical clustering results of DNA chip data (S10) and assigning GO identifiers to each gene pertaining to given cluster (S20); converting each GO identifier assigned to the corresponding genes into a GO code using GO code file (S30); selecting a process among basic process (S41), N-level selective process (S42) and percentage selective process (S43) according to pseudo-distance algorithm (S40) to designate a necessary parameter for said GO code and extracting an optimal branch (S50); extracting a biological meaning of each extracted optimal branch (S60); and displaying an optimal branch of a cluster and its GO code (S70).

Referring to FIG. 1 and FIG. 7, the system for biological analysis of the gene expression pattern of DNA chip using GO structure according to the present invention is comprised of three broad sections 100, 200 and 300. The operation of each section will be described in detail by reference to FIG. 7.

First, GO identifiers and their GO codes are assigned to each gene in a given cluster that is obtained from a statistical clustering method. More specifically, when clustering results are inputted (S10), GO identifiers are assigned to each gene within a cluster (S20) based on the index file, in which GO identifiers have previously been assigned to genes through data mining of various databases. Subsequently, each GO identifier assigned to genes in a given cluster is converted into a GO code (S30) using the GO code file, in which all nodes of the GO tree structure are coded.
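Steps S10 to S30 amount to two table lookups, which can be sketched as below. The sketch assumes the index file and the GO code file have already been loaded into dictionaries; the disclosure does not specify their file formats. The function name `assign_go_codes`, the gene names and the GO identifiers (made-up values in the "GO:" plus seven digits format) are hypothetical.

```python
def assign_go_codes(cluster_genes, gene_to_go_ids, go_id_to_codes):
    """Map every gene of a cluster to the GO codes of its GO identifiers.

    gene_to_go_ids stands in for the GO identifier index file (S20) and
    go_id_to_codes for the GO code file (S30). Genes without any GO
    identifier get an empty list.
    """
    codes_per_gene = {}
    for gene in cluster_genes:
        codes = []
        for go_id in gene_to_go_ids.get(gene, []):
            for code in go_id_to_codes.get(go_id, []):
                if code not in codes:  # keep each code once per gene
                    codes.append(code)
        codes_per_gene[gene] = codes
    return codes_per_gene


# Toy lookup tables with made-up identifiers and codes.
gene_to_go_ids = {
    "geneA": ["GO:9999991", "GO:9999992"],
    "geneB": ["GO:9999992"],
}
go_id_to_codes = {
    "GO:9999991": [(1, 1, 2)],
    "GO:9999992": [(1, 1, 2, 3), (1, 2, 1)],  # term at two tree positions
}

cluster = ["geneA", "geneB", "geneC"]  # clustering result (S10)
codes_per_gene = assign_go_codes(cluster, gene_to_go_ids, go_id_to_codes)
print(codes_per_gene)
```

Genes that end up with an empty code list, such as `geneC` here, would need to be excluded or reported separately before an optimal branch is computed.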

A proper process among the basic process (S41), the N-level selective process (S42) and the percentage selective process (S43) is chosen according to the pseudo-distance algorithm, and the necessary parameters are designated. An optimal branch is then computed (S50) based on the pseudo-distance in each process. Also, the biological meanings of the optimal branch are extracted (S60).

The optimal branch extracted for genes in each cluster and the GO code assigned to the genes are displayed. Summarized information on the GO code for each gene, the optimal branch and the biological meanings can be displayed in the form of a table or a graphical tree.

The pseudo-distance algorithm is also applicable to a different type of biochip, the protein chip: it can be utilized to analyze a protein chip in the same way as it is utilized to analyze a DNA chip in FIGs. 1 and 7.

While this invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not limited to the disclosed embodiment and the drawings; on the contrary, it is intended to cover various modifications and variations within the spirit and scope of the appended claims.

Industrial Applicability

As can be seen from the foregoing, the present invention enables a systematic and automated biological analysis of gene expression patterns of DNA chip assays by a mathematical modeling of GO hierarchy. Also, the present invention can extract the biological functions that are commonest and most optimal among genes within a cluster formed by a statistical clustering method of DNA chip data, utilizing GO terms and tree structure.

Claims

What is claimed is:

1. A system for DNA chip analysis using Gene Ontology™, comprising: a) means for receiving statistical clustering results of DNA chip data and for assigning appropriate GO identifiers to each gene pertaining to a given cluster; b) means for converting each GO identifier assigned to said gene into a GO code using a GO code file; c) means for selecting a proper process among three predetermined processes that adopt pseudo-distance to designate necessary parameters and for extracting an optimal branch; and d) means for extracting biological meanings from each extracted optimal branch.

2. The system according to claim 1, further comprising (e) visualizing means for displaying the optimal branch, GO code and biological meanings of a given cluster of gene.

3. The system according to claim 2, wherein said visualizing means displays the summarized information on the GO code, optimal branch and biological meanings of a given cluster of genes in a form of table, or in a form of a graphical tree structure.

4. The system according to claim 1, wherein said optimal branch gives a proper weight to each level of the GO tree structure.

5. The system according to claim 1, wherein said pseudo-distance "Pd(v1,v2)," wherein v1 and v2 represent nodes, is a weight of the level corresponding to a code of the optimal branch formed by nodes v1 and v2, the Pd value being zero when v1 and v2 are the same, and wherein the optimal branch is obtained based on the maximum pseudo-distance (max_pd) and the average pseudo-distance (aver_pd), which are defined as follows in a group "G" of GO codes or in a given cluster (G = {v1, v2, v3, v4, ..., vn}):

$$\mathrm{max\_pd}(G) = \max\{\, pd(v_i, v_j) \,\} \quad (1 \le i < j \le n)$$

$$\mathrm{aver\_pd}(G) = \frac{\sum_{1 \le i < j \le n} pd(v_i, v_j)}{{}_{n}C_{2}} = \frac{2 \sum_{1 \le i < j \le n} pd(v_i, v_j)}{n(n-1)}$$

the lowest values of max_pd and aver_pd being finally obtained among possible combinations of GO codes.

6. The system according to claim 5, wherein said maximum pseudo-distance (max_pd) is used to roughly evaluate clusters and shows that, if the optimal branch of a cluster is located at a higher level, the cluster is likely to include bad genes which do not share the common biological characteristic with the other genes in that cluster.

7. The system according to claim 5, wherein said average pseudo-distance (aver_pd) shows how well GO codes are clustered in a given cluster with similar functional categories and how frequently similar GO codes are observed.

8. The system according to claim 1, wherein the predefined process of corresponding means for extracting an optimal branch comprises i) basic process, ii) N-level selecting process and iii) percentage selecting process, proper one of the three processes and necessary parameters being designated to extract an optimal branch.

9. The system according to claim 8, wherein said basic process utilizes the maximum pseudo-distance (max_pd) and average pseudo-distance (aver_pd) of all nodes in the GO tree structure, the results obtained by said basic process showing the overall biological meanings of a given cluster.

10. The system according to claim 8, wherein said N-level selecting process computes an optimal branch of a cluster at pre-designated level N, observes formation of the optimal branch at a particular level N and analogizes the biological meanings at a lower level.

11. The system according to claim 10, wherein said N-level selecting process shows both the first candidate of GO code combination and the next candidates of combinations to reflect the diversity that a single gene can be involved in two or more functions.

12. The system according to claim 8, wherein said percentage selecting process predesignates the percentage of genes pertaining to the optimal branch and shows all combinations of genes in percentages desired by a user.

13. A method for DNA chip analysis using Gene Ontology™, comprising the steps of: a) receiving statistical clustering results of DNA chip data and assigning appropriate GO identifiers to each gene pertaining to a given cluster; b) converting each GO identifier assigned to said gene into a GO code using a GO code file; c) selecting a proper process among three predetermined processes that adopt pseudo-distance to designate necessary parameters and extracting an optimal branch; and d) extracting biological meanings from each extracted optimal branch.

14. The method according to claim 13, further comprising (e) step of displaying the optimal branch, GO code and biological meanings of a given cluster of gene.

15. The method according to claim 13, wherein said pseudo-distance "Pd(v1,v2)," wherein v1 and v2 represent nodes, is a weight of the level corresponding to a code of the optimal branch formed by nodes v1 and v2, the Pd value being zero when v1 and v2 are the same, and wherein the optimal branch is obtained based on the maximum pseudo-distance (max_pd) and the average pseudo-distance (aver_pd), which are defined as follows in a group "G" of GO codes or in a given cluster (G = {v1, v2, v3, v4, ..., vn}):

$$\mathrm{max\_pd}(G) = \max\{\, pd(v_i, v_j) \,\} \quad (1 \le i < j \le n)$$

$$\mathrm{aver\_pd}(G) = \frac{\sum_{1 \le i < j \le n} pd(v_i, v_j)}{{}_{n}C_{2}} = \frac{2 \sum_{1 \le i < j \le n} pd(v_i, v_j)}{n(n-1)}$$

the lowest values of max_pd and aver_pd being finally obtained among possible combinations of GO codes.

16. The method according to claim 15, wherein said maximum pseudo-distance (max_pd) is used to roughly evaluate clusters and shows that, if the optimal branch of a cluster is located at a higher level, the cluster is likely to include bad genes which do not share the common biological characteristic with the other genes in that cluster.

17. The method according to claim 15, wherein said average pseudo-distance (aver_pd) shows how well GO codes are clustered in a given cluster with similar functional categories and how frequently similar GO codes are observed.

18. The method according to claim 13, wherein the predefined process of corresponding step for extracting an optimal branch comprises i) basic process, ii) N-level selecting process and iii) percentage selecting process, proper one of the three processes and necessary parameters being designated to extract an optimal branch.