WO2004051544A2

WO2004051544A2 - Methods and products for representing and analyzing complexes of biological molecules

Info

Publication number: WO2004051544A2
Application number: PCT/CA2003/001883
Authority: WO
Inventors: Christopher Hogue; Gary Bader
Original assignee: Mount Sinai Hospital
Priority date: 2002-12-02
Filing date: 2003-12-02
Publication date: 2004-06-17
Also published as: WO2004051544A3; AU2003287815A1

Abstract

The invention relates to the analysis of relationships among and between objects, including systems, methods and products for representing and analyzing complexes of biological molecules in interaction networks. In particular, the invention provides systems, methods and products for identifying relationships in a data set of objects comprising detecting densely connected regions in the data set based solely on connectivity data of the objects.

Description

TITLE: METHODS AND PRODUCTS FOR REPRESENTING AND ANALYZING COMPLEXES OF BIOLOGICAL MOLECULES

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by any one of the patent disclosure, as it appears in the Patent and Trademark files or records, but otherwise reserves all copyright rights whatsoever.

FIELD OF THE INVENTION

The invention is in the field of analysis of relationships, in particular biological relationships, and more particularly in the field of systems, methods and products for representing and analyzing relationships among objects in interaction networks, in particular complexes of biological molecules in a biological molecular interaction network.

BACKGROUND OF THE INVENTION

Protein interaction, biochemical pathway, protein structure and gene expression data is accumulating at a rapid pace [1-9], necessitating the creation of bioinformatics systems for its storage, management, visualization, and analysis. One such system is the Biomolecular Interaction Network Database (BIND - http://www.bind.ca) [15]. Designed to store information on molecular interactions, complexes and pathways in a very structured and standard way [15], BIND is an effective platform for data-mining and analysis.

Currently, most proteomics data is available for the model organism Saccharomyces cerevisiae, by virtue of the availability of a defined and relatively stable proteome, full genome clone libraries [10,11], established molecular biology experimental techniques and an assortment of well designed genomics databases [12-14]. Using BIND as an integration platform, 15,143 yeast protein-protein interactions have been collected among 4,825 proteins (about 75% of the yeast proteome). Much larger data sets than this will be available for other well studied model organisms as well as for the human proteome. These complex data sets present a formidable challenge for computational biology to develop automated data mining analyses for knowledge discovery. There is a fundamental need to synthesize and integrate the disparate sources of genomics and proteomics data into biologically meaningful information.

SUMMARY OF THE INVENTION

Disclosed are methods, systems, and products for effectively identifying relationships between and among objects in an interaction network based solely on connectivity data. The relationships which may be identified are similarities/dissimilarities between objects. In an aspect, the methods, systems and products are based on vertex weighting by local neighbourhood density and outward traversal from a locally dense seed vertex to isolate regions according to selected parameters. The method has a directed mode that allows fine-tuning of clusters of interest without considering the rest of the network and allows examination of cluster interconnectivity which is particularly relevant for biological molecular interaction networks.

Thus, the invention provides a computer-implemented method for identifying relationships in a set of objects comprising detecting densely connected regions in the set based solely on connectivity data of the objects. In an embodiment, the invention provides a computer-implemented method for identifying potential relationships in a data set of objects wherein the objects are represented as vertices and the relationships between and among the objects are represented by interconnecting edges, the method comprising performing a density operation on the data set of objects and traversing outward from a selected seed vertex to isolate densely connected regions.

Thus, in accordance with an aspect, the invention provides a computer-implemented method for identifying relationships between objects in a data set of objects comprising:

(a) implementing a relational graph comprising the data set of objects represented as vertices and the relationship between the objects represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices from the relational graph wherein each vertex is assigned a weight based on its local network density; and

(c) performing an operation on a vertex weighted graph comprising choosing a seed vertex with a selected assigned weight and recursively moving outward from the seed vertex to include vertices based on a threshold assigned weight, to produce a subgraph embodying relationships among the objects, the objects in a relationship represented by vertices and the relationship between the objects represented by edges interconnecting the vertices.
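
Step (a) above can be illustrated with a minimal sketch, assuming the connectivity data arrives as a list of pairwise interactions. The function name `build_relational_graph`, the toy vertex labels, and the choice to ignore self-interactions are assumptions made for illustration and are not taken from the disclosure.

```python
from collections import defaultdict

def build_relational_graph(interactions):
    """Return an undirected adjacency map {vertex: set of neighbours}.

    `interactions` is an iterable of (a, b) pairs. Self-interactions are
    ignored here (an assumption of this sketch), but the vertex is kept.
    """
    graph = defaultdict(set)
    for a, b in interactions:
        if a == b:
            graph[a]  # touching the key registers the vertex with no edge
            continue
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

# Hypothetical toy data set: objects A-D and their pairwise relationships.
interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
graph = build_relational_graph(interactions)
```

Storing neighbours as sets keeps the later density and traversal operations simple, since the edges inside any candidate region can be counted with set intersections.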

The invention also provides a computer-implemented method for identifying objects that form a relationship comprising:

(a) implementing a relational graph comprising a data set of one or more known objects of a relationship and unknown objects that are not known to be part of the relationship, wherein the objects are represented as vertices and the relationships between the objects are represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices representing the known objects and unknown objects wherein each vertex is assigned a weight based on its local network density;

(c) performing an operation on the vertex weighted graph to identify unknown objects in the relationship by recursively moving outward from a seed vertex representing a known object in the relationship, based on a threshold assigned weight, to add other vertices representing unknown objects in the relationship;

(d) optionally repeating step (c) for the other vertices until no more vertices can be added based on the threshold assigned weight; and

(e) producing a subgraph embodying known and unknown objects in the relationship that are represented by the vertices identified in (c) and (d).

The invention still further provides a computer-implemented method for determining the putative function of a selected object comprising:

(a) implementing a relational graph comprising a data set comprising the selected object and other objects that potentially are in a relationship with the selected object, wherein the objects are represented as vertices and the relationship between the objects is represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices representing the selected objects and other objects wherein each vertex is assigned a weight based on its local network density; and

(c) performing an operation on the vertex weighted graph to identify objects that form a relationship with the selected objects by recursively moving outward from a seed vertex representing the selected objects, based on a threshold assigned weight, to include vertices representing the other objects;

(d) producing a subgraph embodying relationships comprising the selected object and other objects; and

(e) determining the putative function of the selected object based on the nature, structure or function of the other objects in the relationship.

In an aspect the invention provides methods, systems, and products for effectively identifying densely connected regions of a data set of biological molecules which correspond to molecular complexes, based solely on connectivity data. This approach applied to the analysis of interaction networks, particularly protein interaction networks, performs well using minimal qualitative information.

Thus, the invention provides a computer-implemented method for identifying molecular complexes in a data set of biological molecules comprising detecting densely connected regions in the data set based solely on connectivity data of the biological molecules. In an embodiment, the invention provides a computer-implemented method for identifying potential molecular complexes in a data set of biological molecules wherein the biological molecules are represented as vertices and the relationships between the biological molecules are represented by interconnecting edges, the method comprising performing a density operation on the data set of biological molecules and traversing outward from a selected seed vertex to isolate densely connected regions to thereby identify potential molecular complexes.

In a particular embodiment, Applicants have implemented the "Molecular Complex Detection" (MCODE) algorithm and it has been evaluated using protein interaction and complex information from the yeast Saccharomyces cerevisiae.

Predicting molecular complexes from protein interaction data, using the methods, systems, and products of the invention provides another level of functional annotation above other guilt-by-association methods. Since sub-units of a molecular complex generally function towards the same biological goal, prediction of an unknown protein as part of a complex also allows increased confidence in the annotation of that protein.

The methods, systems, and products of the invention make the visualization of large networks manageable by extracting the dense regions around a protein of interest. This is important since current visualization tools present on many interaction databases [15], originally based on the Sun Microsystems embedded spring graph layout Java applet, do not scale well to large networks (http://java.sun.com/applets/jdk/1.1/demo/GraphLayout/example1.html).

The present invention provides a powerful tool for the analysis of large data sets of biological molecules and the discovery of novel gene and protein relationships as well as for the corroboration of relational data. Thus, the present invention provides an electronic system, computer-implemented methods, and program product in which relationships, in particular molecular complexes, are stored, manipulated and/or graphically output on a display or other output device. The relationships, in particular molecular complexes, are represented by subgraphs, in particular molecular complex subgraphs. The invention also provides a comprehensive model to identify, organize, analyze, and store relationships of objects in an interaction network, in particular molecular complexes of a biological molecular interaction network.

The invention further provides algorithms to identify, organize, and analyze relationships of objects in an interaction network, in particular molecular complexes of a biological molecular interaction network. The invention still further provides a software program to implement and visualize relationships of objects in an interaction network, in particular molecular complexes of a biological molecular interaction network.

The invention contemplates a database for the storage and organization of relationships of objects in an interaction network, in particular molecular complexes of a biological molecular interaction network. The invention also contemplates methods for predicting biochemical, signal transduction and gene regulatory circuit pathways for an organism using information obtained from the methods of the invention on biomolecular interactions.

The systems, methods, and products of the present invention identify molecular complexes by identifying densely connected regions of a network. Thus the present invention is particularly useful in predicting molecular complexes in large protein-protein interaction networks. The systems and methods of the invention can be used to analyze connectivity and relationships between molecular complexes.

The methods, systems, and products of the invention are useful for identifying the function of unknown proteins.

The methods, systems, and products of the invention can be used to identify locally dense regions around a known disease protein and to excise proteins that interact with the disease protein. These excised proteins may be suitable targets for drug development.

Thus, in accordance with an aspect, the invention provides a computer-implemented method for identifying molecular complexes in a data set of biological molecules comprising:

(a) implementing a relational graph comprising the data set of biological molecules represented as vertices and the relationship between the biological molecules represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices from the relational graph wherein each vertex is assigned a weight based on its local network density; and

(c) performing an operation on a vertex weighted graph comprising choosing a seed vertex with a selected assigned weight and recursively moving outward from the seed vertex to include vertices based on a threshold assigned weight, to produce a molecular complex subgraph embodying molecular complexes with biological molecules in the molecular complexes represented by vertices and the relationship between the biological molecules in the molecular complexes represented by edges interconnecting the vertices.

The invention also provides a computer-implemented method for identifying biological molecules that form a molecular complex comprising:

(a) implementing a relational graph comprising a data set of one or more known biological molecules of the molecular complex and unknown biological molecules that are not known to be part of the molecular complex, wherein the biological molecules are represented as vertices and the relationship between the biological molecules is represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices representing the known biological molecules and unknown biological molecules wherein each vertex is assigned a weight based on its local network density;

(c) performing an operation on the vertex weighted graph to identify unknown biological molecules in the molecular complex by recursively moving outward from a seed vertex representing a known biological molecule in the molecular complex, based on a threshold assigned weight, to add other vertices representing unknown biological molecules;

(d) optionally repeating step (c) for the other vertices until no more vertices can be added based on the threshold assigned weight; and

(e) producing a molecular complex subgraph embodying known and unknown biological molecules in the molecular complex that are represented by the vertices identified in (c) and (d).

In an embodiment, the seed vertex represents a biological molecule associated with disease.

The invention still further provides a computer-implemented method for determining the putative function of a selected biological molecule comprising:

(a) implementing a relational graph comprising a data set comprising the selected biological molecule and other biological molecules that potentially may form molecular complexes with the selected biological molecule, wherein the biological molecules are represented as vertices and the relationship between the biological molecules is represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices representing the selected biological molecules and other biological molecules wherein each vertex is assigned a weight based on its local network density; and

(c) performing an operation on the vertex weighted graph to identify biological molecules that form molecular complexes with the selected biological molecule by recursively moving outward from a seed vertex representing the selected biological molecule, based on a threshold assigned weight, to include vertices representing the other biological molecules;

(d) producing a molecular complex subgraph embodying molecular complexes comprising the selected biological molecule and other biological molecules; and

(e) determining the putative function of the selected biological molecule based on the nature, structure or function of the other biological molecules in the molecular complexes.

In embodiments of the invention, the data set is a biological molecular interaction network. The methods of the invention can be implemented as computer software. By way of example, a software program can be written using any suitable programming language. A software program implementing the method of the invention may have the following features: (a) implementation of relational graphs and the ability to store in a local and/or remote database; and (b) implementation of operators that directly manipulate the relational graphs to identify relationships between objects, in particular molecular complexes. Therefore the invention also provides a computer program product for performing a method of the invention.

In an aspect, the invention contemplates a computer program product for performing a density operation upon one or more relational graphs wherein the relational graphs comprise a data set of objects, in particular biological molecules, represented as vertices and the relationship between the objects, in particular biological molecules, represented by edges interconnecting the vertices, and wherein the computer program product comprises a computer data medium on which is carried a means for performing a density operation on the relational graphs to produce one or more weighted vertex graphs embodying vertices from the relational graph wherein each vertex is assigned a weight based on its local network density.

The computer data medium may further comprise a means for performing an operation on a vertex weighted graph to choose a vertex with a selected assigned weight and to recursively move outward from the selected vertex to include vertices based on a threshold assigned weight, thereby producing a subgraph embodying objects represented by vertices and the relationship between the objects represented by edges interconnecting the vertices.

The invention further contemplates a computer program product comprising a computer data medium on which is carried means for identifying a subset of objects in densely connected regions of a relational graph and a means for performing a density operation upon the subset to produce a subgraph embodying relationships with objects in the relationships represented by vertices and the relationship between the objects represented by edges interconnecting the vertices.

In particular, the computer data medium may further comprise a means for performing an operation on a vertex weighted graph to choose a vertex with a selected assigned weight and to recursively move outward from the selected vertex to include vertices based on a threshold assigned weight, thereby producing a molecular complex subgraph embodying molecular complexes with biological molecules in the molecular complexes represented by vertices and the relationship between the biological molecules in the molecular complexes represented by edges interconnecting the vertices. The invention further particularly contemplates a computer program product comprising a computer data medium on which is carried means for identifying a subset of biological molecules in densely connected regions of a relational graph and a means for performing a density operation upon the subset to produce a molecular complex subgraph embodying molecular complexes with biological molecules in the molecular complexes represented by vertices and the relationship between the biological molecules in the molecular complexes represented by edges interconnecting the vertices.

The methods of the invention can be implemented on a programmed general purpose computer system. The methods can also be implemented on a special purpose computer, a programmed microprocessor or microcontroller and peripheral integrated circuit element, an ASIC or other integrated circuit, a hardwired electronic or logic circuit (e.g. discrete element circuit), or a programmable logic device (e.g. PLD, PLA, FPGA or PAL), or the like.

In an aspect, the invention provides a system for electronically identifying and/or visualizing molecular complexes in a data set of biological molecules comprising:

(a) a computer having relational graphs comprising the biological molecules wherein the biological molecules are represented by vertices and the relationship between the biological molecules is represented by edges interconnecting the vertices;

(b) a density operation for processing such relational graphs to produce vertex weighted graphs embodying vertices from the relational graphs wherein each vertex is assigned a weight based on its local network density;

(c) an operation for processing vertex weighted graphs to choose a seed vertex with a selected assigned weight and recursively moving outward from the seed vertex to include vertices based on a threshold assigned weight, to produce a molecular complex subgraph embodying molecular complexes with biological molecules in the molecular complexes represented by vertices and the relationship between the biological molecules in the molecular complexes represented by edges interconnecting the vertices; and

(d) an imaging operation for producing images of the molecular complexes;

said system, in response to data requests, creating and transmitting to a plurality of end-users, the images defining the molecular complexes.

A data set of objects, in particular biological molecules, used in the invention may be scale free. In particular, the relational graph may have a vertex connectivity distribution that follows a power law.
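
As a rough illustration of what a vertex connectivity distribution is, the following sketch tabulates how many vertices have each degree. In a scale-free graph the count of vertices with degree k falls off approximately as a power of k, so a log-log plot of this table would be roughly linear. The function name `degree_distribution` and the toy graph are hypothetical, and no fitting procedure is implied by the disclosure.

```python
from collections import Counter

def degree_distribution(graph):
    """Map each degree to the number of vertices having that degree.

    `graph` is an undirected adjacency map {vertex: set of neighbours}.
    """
    return Counter(len(neighbours) for neighbours in graph.values())

# Hypothetical toy graph: one hub and several sparsely connected vertices.
graph = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub", "e"},
    "e": {"d"},
}
dist = degree_distribution(graph)
```

Here most vertices have degree 1 and a single vertex has degree 4, which is the qualitative pattern (many poorly connected vertices, few hubs) expected of a scale-free network.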

In an embodiment, a subgraph, in particular a molecular complex subgraph, may have at least two vertices that may represent different types of objects (e.g. biological molecules). In another embodiment, a subgraph (e.g. a molecular complex subgraph) may have at least two edges that may represent different types of relationships between the objects (e.g. biological molecules).

The methods, systems, and products of the invention may also include operations to filter out objects (in particular filter out molecular complexes) that do not contain at least two edges; add objects to the relationships (in particular add biological molecules to the molecular complexes); remove objects (in particular remove biological molecules from the molecular complexes); add objects, in particular biological molecules, that have not been encountered when recursively moving outward from the seed vertex; and/or remove objects, in particular biological molecules, that are connected to the seed vertex by a single edge. These operations may be used to refine complexes according to selective criteria, in particular biological criteria.

The methods, systems, and products of the invention may also comprise an operation to visualize data from the vertex weighted graph and/or subgraph, in particular molecular complex subgraph. In an embodiment, a product or system of the invention preferably includes visualization software for the demonstration of the data resulting from computation using the relational graph. Such software can be written in any suitable programming language, for example, the Java programming language.

In the methods, systems and products of the invention, the identified relationships, in particular molecular complexes, may be assigned a score and ranked. In an aspect, the score is the product of the subgraph density, in particular molecular complex subgraph density [C=(V,E) where V = the number of vertices and E = the number of edges], and the number of vertices in the subgraph.

The local network density utilized in the methods, systems and products of the invention may be based on the highest k-core of the neighbourhood of a vertex, wherein the k-core represents the number of vertices with at least k edges to other vertices in the core, i.e. a network of minimum degree k. Viewed another way, the local network density may be based on a core clustering coefficient of a vertex v defined as the density of the highest k-core of the immediate vertices connected to v, including v. In an aspect of the invention, k represents the highest value corresponding to the most densely connected region of the relational graph. In an embodiment of the invention, k is at least 2.
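
The quantities described above can be sketched as follows. This is an illustration under stated assumptions, not the applicants' implementation: density is taken here as the number of edges present divided by the maximum possible number for a graph without self-loops, n(n-1)/2, which the passage does not spell out; all function names and the toy graph are hypothetical. The passage also does not say how the core level k and the core density are combined into a single vertex weight, so the sketch returns both.

```python
def density(vertices, graph):
    """Edges among `vertices` divided by the maximum possible, n(n-1)/2.

    Assumes an undirected graph without self-loops.
    """
    n = len(vertices)
    if n < 2:
        return 0.0
    edges = sum(len(graph[v] & vertices) for v in vertices) // 2
    return edges / (n * (n - 1) / 2)

def highest_k_core(vertices, graph):
    """Return (k, core) for the largest k whose k-core is non-empty."""
    best_k, best_core = 0, set(vertices)
    k, core = 1, set(vertices)
    while True:
        # Peel vertices with fewer than k neighbours inside the core.
        changed = True
        while changed:
            low = {v for v in core if len(graph[v] & core) < k}
            core -= low
            changed = bool(low)
        if not core:
            return best_k, best_core
        best_k, best_core = k, set(core)
        k += 1

def core_clustering_coefficient(v, graph):
    """Return (k, density of the highest k-core of v's neighbourhood incl. v)."""
    neighbourhood = graph[v] | {v}
    k, core = highest_k_core(neighbourhood, graph)
    return k, density(core, graph)

def complex_score(vertices, graph):
    """Score of a subgraph: its density multiplied by its vertex count."""
    return density(vertices, graph) * len(vertices)

# Hypothetical toy graph: a four-vertex clique A-D with E hanging off D.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
print(core_clustering_coefficient("D", graph))  # (3, 1.0)
print(core_clustering_coefficient("E", graph))  # (1, 1.0)
```

In this toy graph the pendant vertex E and the clique member D both have a core density of 1.0, and only the core level k (1 versus 3) separates them. Any weighting that ignored k would not distinguish a single edge from a clique.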

The density operation utilized in the methods, systems, and products of the invention may amplify the weighting of heavily interconnected graph regions while removing noise.

In an aspect of the invention, the threshold assigned weight is determined as a percentage of the weight of the seed vertex.
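
A minimal sketch of the outward traversal with a percentage-based threshold follows. It assumes that the threshold equals the seed weight multiplied by (1 - VWP), where VWP is the vertex weight percentage parameter. That formula is an assumption chosen to agree with the description of Figure 4, in which larger values of the parameter give larger predicted complexes. The function name `grow_complex`, the use of an explicit stack in place of recursion, the inclusive comparison, and the toy weights are likewise assumptions.

```python
def grow_complex(seed, graph, weights, vwp):
    """Collect vertices reachable from `seed` whose weight meets the threshold.

    threshold = weights[seed] * (1 - vwp)   (assumed form, see lead-in)
    `graph` is an undirected adjacency map {vertex: set of neighbours}.
    """
    threshold = weights[seed] * (1.0 - vwp)
    members = {seed}
    stack = [seed]
    while stack:
        v = stack.pop()
        for neighbour in graph[v]:
            # Only expand through vertices that are dense enough themselves.
            if neighbour not in members and weights[neighbour] >= threshold:
                members.add(neighbour)
                stack.append(neighbour)
    return members

# Hypothetical toy graph and vertex weights.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
weights = {"A": 3.0, "B": 3.0, "C": 3.0, "D": 3.0, "E": 1.0}

tight = grow_complex("A", graph, weights, vwp=0.2)    # threshold 2.4
relaxed = grow_complex("A", graph, weights, vwp=0.7)  # threshold about 0.9
```

With the stricter setting only the clique A-D is returned. With the relaxed setting the pendant vertex E is also included, since its weight of 1.0 clears the lower threshold.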

In an embodiment of the invention, the MCODE algorithm described herein is used to find locally dense regions of a graph. MCODE uses vertex-weighting based on a core clustering coefficient to measure the cliquishness of the neighborhood of a vertex.

The disclosed methods, systems, and products can be implemented and used at varying levels from software components to integrated packages with user-interface which allows a wide range of applications.

Different graph tools can be implemented. In addition, the software programs of the invention may be readily interfaced with other software packages such as common statistical packages.

Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.

DESCRIPTION OF THE DRAWINGS AND TABLES

The invention will be better understood with reference to the drawings in which:

Figure 1 illustrates the effect of overlap score threshold on number of predicted and matched known complexes for the Gavin evaluation. Average and maximum number of predicted and matched known complexes seen during MCODE parameter optimization (840 parameter combinations) plotted as a function of overlap score threshold. As the stringency for the closeness that a predicted complex must match a known complex is increased (increase in overlap score), fewer predicted complexes match known complexes. The curves do not correspond to the best parameter set, but rather are an average of results from all tried parameter combinations.

Figure 2 shows the number of predicted and matched known complexes at overlap score threshold of 0.2. The number of known complexes matched to MCODE predicted complexes is plotted against number of MCODE predicted complexes, both with an overlap score above 0.2.

Figure 3 provides examples of Gavin benchmark complexes missed and hit by MCODE. Protein complexes are represented as graphs using the spoke model. Vertices represent proteins and edges represent experimentally determined interactions. Blue vertices are baits in the Gavin et al. study. A) A CDC3 complex hand-annotated by Gavin et al. that was missed by MCODE because of a lack of connectivity information among sub-components. This complex annotation was the result of a single co-immunoprecipitation experiment. B) The Arp2/3 complex as annotated by Gavin et al. and as found by MCODE with parameters optimized to the data set. Note the five extra proteins that have minimal connectivity to the main cluster. C) The protein connection map seen from the crystal structure of the Arp2/3 complex. The crystal structure is from Bos taurus (cow), but is assumed to be very similar to yeast based on very high similarity between cow and yeast Arp2/3 subunits.

Figure 4 illustrates the effect of vertex weight percentage parameter on predicted complex size. As the vertex weight percentage parameter of MCODE is increased, the number of predicted complexes steadily decreases and the average and largest size of predicted complexes increases exponentially. The y-axis follows a logarithmic scale. For reference, the average and maximum size of the MIPS benchmark complexes are 6 and 81, respectively and of the Gavin benchmark complexes are 11.8 and 88, respectively.

Figure 5 illustrates the overlap score distributions of Pre HTMS and AllYeast interaction sets with MIPS complex benchmark optimized MCODE parameter sets. The number of MCODE predicted complexes in the pre-large scale mass spectrometry (Pre HTMS) and AllYeast protein-protein interaction sets with a given overlap score threshold compared to the MIPS benchmark complex set is shown. The majority of predicted complexes have an overlap score of zero meaning that they had no overlap with the catalogue of known MIPS protein complexes.

Figure 6 shows the sensitivity vs. specificity plots of MCODE results among various data sets. Specificity is plotted versus sensitivity of the best MCODE results at an overlap score above 0.2 against both the MIPS (Panel A) and Gavin (Panel B) complex benchmarks. Panel A shows that there are no large inherent differences among interaction data sets resulting from significantly different experimental methods (data set: sensitivity, specificity; Y2H:0.10,0.27; Benchmark:0.29,0.36; HTP Only:0.14,0.24; Pre HTMS:0.27,0.31; AllYeast:0.27,0.26; Gavin Spoke:0.10,0.38). Panel B shows that the Gavin benchmark is expectedly biased towards the Gavin interaction data set and thus should not be used as a general benchmark (data set: sensitivity, specificity; Y2H:0.03,0.10; Benchmark:0.11,0.16; HTP Only:0.24,0.33; Pre HTMS:0.10,0.13; AllYeast:0.27,0.26; Gavin Spoke:0.31,0.79).

Figure 7 illustrates that the second highest ranked MCODE predicted complex is involved in RNA processing and modification. This complex incorporates the known polyadenylation factor I complex (Cft1, Cft2, Fip1, Pap1, Pfs2, Pta1, Ysh1, Yth1 and Ykl059c) and contains other proteins highly connected to this complex, some of unknown function. The fact that the unknown proteins (Yor179c and Pti1) connect more to known RNA processing/modification proteins than to other proteins in the larger data set likely indicates that these proteins function in RNA processing/modification. This complex was most highly ranked by MCODE from the predicted complexes in the AllYeast interaction set.

Figure 8 shows an MCODE predicted complex involved in cytokinesis. This predicted complex incorporates the known Septin complex (Cdc3, Cdc10, Cdc11 and Cdc12) involved in cytokinesis and other cytokinesis related proteins. The Yal027w protein is of unknown function, but likely functions in cell cycle control according to this figure, possibly of cytokinesis. This complex was ranked 23rd by MCODE from the predicted complexes in the AllYeast interaction set.

Figure 9 shows the effect of complex score threshold on MCODE prediction accuracy. MCODE complexes equal to or greater than a specific score were compared to a benchmark comprising the combined MIPS and Gavin benchmarks. Accuracy was calculated as the number of known complexes better or equal to the threshold score divided by the total number of predicted complexes (matching and non-matching) at that threshold. A complex was deemed to match a known complex if it had an overlap score above 0.2. The number of predicted complexes that matched known complexes at each score threshold is shown as labels on the plot.

Figure 10 shows an MCODE predicted complex that is too large (relaxed parameters). An example of a predicted complex that incorporates two complexes, proteasome (left) and an RNA processing complex (right). These should probably be predicted as separate complexes as can be seen by the clear distinction of biological role annotation on one side of this layout compared to the other (purple versus blue). This figure, however, shows the large amount of overall connectivity between these two complexes. This complex was ranked fourth by MCODE from the predicted complexes in the AllYeast interaction set with slightly relaxed parameters compared to the optimized prediction.

Figure 11 illustrates MCODE in directed mode. MCODE was used in directed mode to further study the complex in Figure 10 by using seed vertices from high density regions of the two parts of this complex. A) The result of examining the Lsm complex using MCODE parameters that are too relaxed (haircut=TRUE, fluff=FALSE, VWP=0.05). B) The final Lsm complex using MCODE parameters of haircut=TRUE, fluff=FALSE and VWP=0 seeded with Lsm4. C) The final 26S proteasome complex seeded with Rpt1 using MCODE parameters haircut=TRUE, fluff=TRUE and VWP=0.2. Visible here are two regions of density in this complex corresponding to the 20S proteolytic subunit (left side - mainly Pre proteins) and the 19S regulatory subunit (right side - mainly Rpt and Rpn proteins).

Figure 12 illustrates examining complex connectivity with MCODE. The complexes shown here are known to be nuclear localized and are involved in protein degradation (19S proteasome subunit), mRNA processing (Lsm complex and mRNA Cleavage/Polyadenylation complex), cell cycle (anaphase promoting complex) and transcription (SAGA transcriptional activation complex).

The invention will be better understood with reference to the Tables in which:

Table 1 - Summary of MCODE Results with Best Parameters on Various Data Sets

Statistics and a summary of results are shown for the various data sets used to evaluate MCODE. 'Gavin Spoke' is the Gavin et al. data set represented as binary interactions using the spoke model; 'Pre HTMS' is the set of all yeast interactions not including the recent high-throughput mass spectrometry studies[6,7]; 'AllYeast' is the set of all yeast interactions that were collected; 'Benchmark' is a set of interactions found in the literature from YPD, MIPS and PreBIND; 'HTP Only' is the combination of all large-scale and high-throughput yeast two-hybrid and mass spectrometry data sets; 'Y2H' is the set of all yeast two-hybrid results from large-scale and literature sources. The 'Best MCODE Parameters' are formatted as haircut True or False, fluff True or False\VWP\Fluff Density Threshold Parameter.

Table 2 - Average Number of YPD and GO Annotation Terms in Complex Sets

The average number of YPD and GO functional annotation terms per protein in an MCODE predicted complex is shown for MCODE predicted complexes on the AllYeast set, the MIPS complex database and the MCODE random model. A lower number indicates that the complexes from a set contain more functionally related proteins (or unannotated proteins). In the cases of multiple annotation, all terms are taken into account. Even though there are multiple annotation terms per protein and a variable amount of unannotated proteins per complex, these numbers should perform well in relative comparisons based on the assumption that the distribution of the latter two factors is similar in each data set.

Table 3 - Statistics for Top, Middle and Bottom Five Scoring Optimized MCODE Predicted Complexes Found in All Known Yeast Protein Interaction Data Set

Score is defined as the product of the complex subgraph density and the number of vertices (proteins) in the complex subgraph (D_C x |V|). This ranks larger, denser complexes higher in the results.

Density is calculated using the loop formula if homodimers exist in the complex, otherwise the 'no loop' formula is used. The cell role column is a manual combination of annotation terms for the proteins reported in the complex.

Table 4 - Example of a MCODE specification.

Table 5 - Additional Files - All Yeast Predicted Complexes

This Table comprises examples of annotation and report files for complexes found using MCODE on the set of all yeast interactions. Basic instructions for using Pajek are also included.

DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION

The disclosed methods, systems, and products identify relationships among objects. "Objects" include but are not limited to any entity, data, property, attribute, component, element, ingredient, or item where it is useful to represent the relationships between instances of or different ones of such entity, data, property, attribute, component, element, ingredient, or item. Examples of objects are processes, machines, compositions of matter, compounds, devices, financial data, instruments, trends, traits or characteristics, scientific properties, traits and characteristics associated with humans or other animals, software products, and machines. In a particular embodiment, the object is a biological molecule.

"Relationship(s)" refer to any characterizations shared with, linking, correlating, identifying, or otherwise describing any two or more objects such as biological molecules. A relationship may represent similarities or dissimilarities of objects.

A "biological molecule" refers to any molecule or portion thereof or multi-molecular assembly or composition that has a biological origin or is related to a molecule that has a biological origin. Biological molecules include but are not limited to genes, open reading frames, expressed sequence tags, single nucleotide polymorphisms, sequence tag sites, nucleic acids, DNA, RNA, mRNA, cDNA, proteins, peptides, enzymes, metabolites, carbohydrates, exons, introns, cleavage fragments, restriction fragments, amino acid modifications, protein domains, DNA or RNA secondary or tertiary structures, nucleic acid motifs, protein motifs, and metal ions.

"Molecular complex(es)" refers to assemblages composed of two or more biological molecules. For example a molecular complex may be composed of two or more proteins, nucleic acids, protein(s) and nucleic acid(s), and protein(s) and lipid(s). Each component of the molecular complex binds together by non- covalent or covalent bonds (e.g. disulphide covalent bonds). There is no limitation on the number of the biological molecules of the complex. Preferably, a molecular complex comprises two, three, four, five, six, seven, eight, nine, ten, fifteen, twenty, twenty-five, or thirty interacting biological molecules that potentially have a common origin, function, structure, mechanism, or activity.

"Interaction network" refers to a collection or data set of information regarding relationships among objects. The collection of information may be incomplete (i.e. relationships between objects may not be known), and it may be inexact i.e. the relationships may be defined in terms of allowed ranges or limits. Information regarding relationships may be derived from observation, measurement, and a priori knowledge. "Biological molecular interaction network" refers to an interaction network (i.e. a collection or data set of information regarding relationships) among biological molecules. A biological molecular interaction network may represent the entire interaction map of a proteome that specifies the entire signal transduction and metabolic networks of a cell such as a yeast cell.

The relationships between objects, in particular biological molecules, are embodied in relational graphs. A "relational graph" is a representation of objects, in particular biological molecules, as vertices and the interconnections among the vertices, designated by a set of edges.

The term "graph" or "graphs" refers to mathematical representations recognized as graphs and is not intended to be limited to visual depictions of data, although such depictions are encompassed by the disclosed invention.

"Computer data medium" includes but is not limited to magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as floptical disks; and hardware devices that are particularly configured to store and perform program instructions such as read-only memory devices (ROM) and random access memory (RAM). It may also be embodied in a carrier wave traveling over an appropriate medium including but not limited to airwaves, optical lines, and electric lines.

"Vertex" or "Vertices" refers to a representation of an object, in particular a biological molecule or a set of biological molecules. Generally, each vertex represents one object, in particular a protein molecule. A name that uniquely identifies an object (e.g. biological molecule) can be used to label a vertex.

"Edge" refers to the connection or relationship between vertices (objects, in particular biological molecules). It is generally defined by a pair of vertices. However, in some embodiments of the invention it may be defined by more than two vertices i.e. as a set representing a relationship that involves more than a pair of interactions.

The relationships of biological molecules embodied by the edges may be quantitative or qualitative, and include but are not limited to (a) the association of biological molecules in time, space, or logical meaning; (b) the physical or logical state of the biological molecules; (c) the spatial distance between genes on a chromosome; (d) the real value measurement of time or kinetic information; (e) logical relationship measured by Euclidean and other distance metrics in feature space, by correlation coefficient as a statistical metric, or by values of fuzzy set membership function; (f) causal relationship measured by conditional probability; (g) physical or genetic distances between genes, open reading frames, single nucleotide polymorphisms, expressed sequence tags, sequence tag sites, or a combination thereof; (h) protein-protein interactions; (i) protein-nucleic acid interactions; (j) protein expression regulation; (k) gene expression regulation; (l) nucleic acid-nucleic acid interactions; (m) interaction networks of biological molecules (i.e. signal transduction pathways); (n) sequence similarity between genes or proteins; (o) structural similarity between proteins; (p) disease states; (q) physical or logical states such as activation or inhibition; (r) radiation hybrid mapping distances between biological molecules; and (s) metabolic pathways.

An edge of a relational graph may comprise directional labels indicating the direction of the edge if the relationship is directional between the two vertices (e.g. labels representing direction of chemical action or direction of information flow); and, a table that stores properties of the relationship between the vertices. A relational graph may comprise quantitative and statistical information. In an aspect, a relational graph comprises a reliability index or p-value on the edges. The index or value may be determined by considering factors such as the number of different types of experiments that find the same interaction, the quality of the experiment, the date the experiment was conducted, and other factors that pertain to the reliability of the interaction including the standing of a publication describing the interaction. A relational graph generally comprises at least 2, 3, 5, 10, 20, 50, 100, 500, 1000, 2000, 3000, 4000, 5000, 10,000 different types of vertices, which may be linked by some number of edges.

In an embodiment of the invention, at least two vertices in a relational graph may represent different types of biological molecules. In another embodiment, at least one or two edges in a relational graph represent different types of relationships between the biological molecules represented by the vertices connected by the edges.

Relational graphs can be stored as large databases, for example, a relational graph database can comprise large data sets from a variety of sources, such as gene expression analysis, proteome analysis, genome mapping, and functional genome annotation.

Examples of data sets that can be embodied in relational graphs are PreBIND (http://bioinfo.mshri.on.ca/prebind/), BIND (http://www.bind.ca), Gene Ontology [38], the MIPS [14], Database of Interacting Proteins (DIP) [16] (http://dip.doe-mbi.ucla.edu) and YPD [15] protein interaction catalogues, SGD [13], RefSeq [39], Gene Registry (http://genome-www.stanford.edu/Saccharomyces/registry.html), the genes from the yeast deletion consortium [12], GO terms [38], and the protein-protein interaction data sets of Gavin et al. [9] and Uetz et al. (2000, Nature 403 (6770):623-7). Data sets may comprise all of the experimentally known or hypothesized biomolecular interactions of a single organism. For example, the data set available at http://pim.hybrigenics.com contains protein-protein interactions known in the bacterium H. pylori.

The data sets should not be considered to be static entities or to be limited to the data they contain at the time of the application. The data sets are a source of data to utilize in the methods of the invention. In an embodiment of the invention, a relational graph embodies protein-protein interactions or nucleic acid-nucleic acid interactions of an organism such as mouse, human, bacteria, viruses, or yeast. In a particular embodiment, a relational graph embodies a protein-protein interaction data set and comprises a set of vertices representing all the proteins involved in the data set and edges connecting the vertices representing the interacting proteins. In particular, a relational graph embodies protein-protein interaction data sets from BIND, MIPS, and Gavin et al [9].

A relational graph may be in the form of text files and it may be in parsable ASCII text, machine readable text, or binary text. A relational graph may be stored on any data source or medium (e.g. disk, file, network).

A relational graph may be a clique. A "clique" refers to a maximally connected relational graph i.e. each vertex is connected to every other possible vertex in the relational graph.

A "density operation" or "density operator" used in the disclosed method is an operation or function that manipulates, separates, splits, or filters a relational graph based on the density of the relational graph. A density operation or density operator manipulates the relational graphs as objects.

"Density" refers to the level of connectivity of a relational graph. The more edges (the higher the density), the more likely the objects, in particular biological molecules form a relationship (e.g. molecular complex). The density of a graph may be defined by the following formula:

D_G = |E|/|E|_max

where |E| is the number of edges and |E|_max is the theoretical maximum number of edges.
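By way of non-limiting illustration, the density formula can be sketched in Python. The function name graph_density and its arguments are hypothetical and are not part of MCODE; the two expressions used for |E|_max are those given for graphs with and without loops in the Algorithm section of this description.

```python
def graph_density(num_vertices, num_edges, allow_loops=False):
    """Return D_G = |E| / |E|_max for an undirected graph.

    allow_loops selects the 'loop' formula |V|(|V|+1)/2 for |E|_max;
    otherwise the 'no loop' formula |V|(|V|-1)/2 is used.
    """
    if allow_loops:
        max_edges = num_vertices * (num_vertices + 1) // 2
    else:
        max_edges = num_vertices * (num_vertices - 1) // 2
    if max_edges == 0:
        # A graph with no possible edges is given density 0.0 (an assumption).
        return 0.0
    return num_edges / max_edges


# A clique on four vertices has all 6 possible edges, so its density is 1.0.
print(graph_density(4, 6))
```

A clique therefore has density 1.0 under this definition, and removing edges lowers the value towards 0.0.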

A density operation or density operator produces one or more product graphs embodying subsets of vertices with a high level of connectivity. A relational graph may be selected where everything is connected to everything else with at least "k" edges.

Subsets of a relational graph can be selected that reflect an independent group of biological molecules with similar function, structure, or activity. In an embodiment, subsets of molecular complexes are selected from relational graphs embodying protein-protein interaction data sets.

Description of an Embodiment: The disclosed methods and systems can be understood further by reference to the following example system relating to biological molecules.

Algorithm

The MCODE algorithm operates in three stages: vertex weighting, complex prediction and, optionally, post-processing to filter or add proteins in the resulting complexes by certain connectivity criteria. A network of interacting molecules can be intuitively modeled as a graph, where vertices are molecules and edges are molecular interactions. If temporal pathway or cell signaling information is known, it is possible to create a directed graph with arcs representing direction of chemical action or direction of information flow, otherwise an undirected graph is used. Using this graph representation of a biological system allows graph theoretic methods to be applied to aid in analysis and solve biological problems. This graph theory approach has been used by other biomolecular interaction database projects such as DIP[16], CSNDB[17], TRANSPATH[18], EcoCyc[19] and WIT[20] and is discussed by Wagner and Fell[21].

Algorithms for finding clusters, or locally dense regions, of a graph are an ongoing research topic in computer science and are often based on network flow/minimum cut theory [22,23] or more recently, spectral clustering[24]. To find locally dense regions of a graph, MCODE instead uses a vertex-weighting scheme based on the clustering coefficient, C_i, which measures 'cliquishness' of the neighborhood of a vertex[25]. C_i = 2n/(k_i(k_i-1)), where k_i is the vertex size of the neighborhood of vertex i and n is the number of edges in the neighborhood (the immediate neighborhood density of v not including v). A clique is defined as a maximally connected graph. There is no standard graph theory definition of density, but definitions are normally based on the level of connectivity of a graph. Density of a graph, G=(V,E), with number of vertices, |V|, and number of edges, |E|, is defined here as |E| divided by the theoretical maximum number of edges possible for the graph, |E|_max. For a graph with loops (an edge connecting back to its originating vertex), |E|_max = |V|(|V|+1)/2 and for a graph with no loops, |E|_max = |V|(|V|-1)/2. So, density of G, D_G = |E|/|E|_max, is thus a real number ranging from 0.0 to 1.0.

The first stage of MCODE, vertex weighting, weights all vertices based on their local network density using the highest k-core of the vertex neighborhood. A k-core is a graph of minimal degree k (graph G, for all v in G, deg(v) >= k). The highest k-core of a graph is the central most densely connected subgraph. The term core-clustering coefficient of a vertex, v, is defined to be the density of the highest k-core of the immediate neighborhood of v (vertices connected directly to v) including v (note that C_i does not include v). The core-clustering coefficient is used here instead of the clustering coefficient because it amplifies the weighting of heavily interconnected graph regions while removing the many less connected vertices that are usually part of a biomolecular interaction network, known to be scale-free[6,21,26-29]. A scale-free network has a vertex connectivity distribution that follows a power law, with relatively few highly connected vertices (high degree) and many vertices having a low degree. A given highly connected vertex, v, in a dense region of a graph may be connected to many vertices of degree one (singly linked vertex). These low degree vertices do not interconnect within the neighborhood of v and thus would reduce the clustering coefficient, but not the core-clustering coefficient. The final weight given to a vertex is the product of the vertex core-clustering coefficient and the highest k-core level, k_max, of the immediate neighborhood of the vertex. This weighting scheme further boosts the weight of densely connected vertices. This specific weighting function is based on local network density. Many other functions are possible and some may have better performance for this algorithm.
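The vertex-weighting stage described above may be sketched as follows. This is a minimal Python illustration, not the ANSI C implementation; all function names (build_graph, k_core, highest_k_core, vertex_weight and so on) are hypothetical, the graph is assumed undirected, and the 'no loop' density formula is assumed throughout.

```python
def build_graph(edges):
    """Build an undirected adjacency map {vertex: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def induced(adj, vertices):
    """Subgraph induced on the given vertices."""
    vs = set(vertices)
    return {v: adj[v] & vs for v in vs}

def density(adj):
    """|E| / |E|_max using the 'no loop' formula."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return edges / (n * (n - 1) / 2)

def k_core(adj, k):
    """Progressively remove vertices of degree < k."""
    sub = {v: set(nbrs) for v, nbrs in adj.items()}
    while True:
        low = [v for v, nbrs in sub.items() if len(nbrs) < k]
        if not low:
            return sub
        for v in low:
            for u in sub.pop(v):
                if u in sub:
                    sub[u].discard(v)

def highest_k_core(adj):
    """Return (k_max, core), trying k = 1, 2, ... until the core is empty."""
    k, core = 0, {v: set(nbrs) for v, nbrs in adj.items()}
    while True:
        nxt = k_core(core, k + 1)
        if not nxt:
            return k, core
        k, core = k + 1, nxt

def clustering_coefficient(adj, v):
    """C_i = 2n / (k_i (k_i - 1)); the neighbourhood excludes v."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    nbrs = induced(adj, adj[v])
    n = sum(len(x) for x in nbrs.values()) / 2
    return 2 * n / (k * (k - 1))

def vertex_weight(adj, v):
    """Core-clustering coefficient times k_max; the neighbourhood includes v."""
    k_max, core = highest_k_core(induced(adj, adj[v] | {v}))
    return k_max * density(core)

# Vertex 'a' sits in a 4-clique and also has three singly linked neighbours.
clique = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
pendants = [("a", "p"), ("a", "q"), ("a", "r")]
g = build_graph(clique + pendants)
print(clustering_coefficient(g, "a"))  # 0.2
print(vertex_weight(g, "a"))           # 3.0
```

In this toy graph the three singly linked neighbours pull the clustering coefficient of 'a' down to 0.2, while the highest k-core of its neighbourhood is still the 4-clique (k_max = 3, density 1.0), which is the effect the core-clustering coefficient is intended to capture.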

The second stage, molecular complex prediction, takes as input the vertex weighted graph, seeds a complex with the highest weighted vertex and recursively moves outward from the seed vertex, including vertices in the complex whose weight is above a given threshold, which is a given percentage away from the weight of the seed vertex. This is the vertex weight percentage (VWP) parameter. If a vertex is included, its neighbours are recursively checked in the same manner to see if they are part of the complex. A vertex is not checked more than once, since complexes cannot overlap in this stage of the algorithm (see below for a possible overlap condition). This process stops once no more vertices can be added to the complex based on the given threshold and is repeated for the next highest unseen weighted vertex in the network. In this way, the densest regions of the network are identified. The vertex weight threshold parameter defines the density of the resulting complex. A threshold that is closer to the weight of the seed vertex identifies a smaller, denser network region around the seed vertex.
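The complex prediction stage may be sketched as follows. This Python fragment is illustrative only and rests on stated assumptions: the threshold is taken to be seed weight x (1 - VWP), a vertex whose weight equals the threshold is accepted, and a vertex is marked as seen only when it joins a complex. The function name predict_complexes and the toy weights are hypothetical.

```python
def predict_complexes(adj, weights, vwp):
    """Grow complexes outward from seeds in order of decreasing weight.

    adj     -- {vertex: set(neighbours)} for an undirected graph
    weights -- {vertex: weight} from the vertex-weighting stage
    vwp     -- vertex weight percentage, between 0.0 and 1.0
    """
    seen = set()
    complexes = []
    for seed in sorted(weights, key=weights.get, reverse=True):
        if seed in seen:
            continue
        # Assumed reading of "a given percentage away from the seed weight".
        threshold = weights[seed] * (1.0 - vwp)
        members, stack = [], [seed]
        seen.add(seed)
        while stack:
            v = stack.pop()
            members.append(v)
            for u in adj[v]:
                if u not in seen and weights[u] >= threshold:
                    seen.add(u)
                    stack.append(u)
        complexes.append(members)
    return complexes

# Toy chain a-b-c-d-e with hand-assigned weights.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
weights = {"a": 3.0, "b": 3.0, "c": 2.8, "d": 1.0, "e": 1.0}
print(predict_complexes(adj, weights, 0.1))
print(predict_complexes(adj, weights, 0.0))
```

With VWP = 0.1 vertex 'c' joins the complex seeded at 'a', whereas with VWP = 0 it does not, illustrating that a threshold closer to the seed weight yields a smaller region around the seed.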

The third stage is post-processing. Complexes are filtered if they do not contain at least a 2-core (graph of minimum degree 2). The algorithm may be run with the 'fluff' option, which increases the size of the complex according to a given 'fluff' parameter between 0.0 and 1.0. For every vertex in the complex, v, its neighbors are added to the complex if they have not yet been seen and if the neighborhood density (including v) is higher than the given fluff parameter. Vertices that are added by the fluff parameter are not marked as seen, so there can be overlap among predicted complexes with the fluff parameter set. If the algorithm is run using the 'haircut' option, the resulting complexes are 2-cored, thereby removing the vertices that are singly connected to the core complex. If both options are specified, fluff is run first, then haircut.
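The 2-core filter and the 'haircut' option may be illustrated with the following Python sketch. The function names two_core and haircut_and_filter are hypothetical, and the 'fluff' option is deliberately not shown.

```python
def two_core(adj, members):
    """Return the 2-core of the subgraph induced on members (may be empty)."""
    keep = set(members)
    while True:
        pruned = {v for v in keep if len(adj[v] & keep) >= 2}
        if pruned == keep:
            return keep
        keep = pruned

def haircut_and_filter(adj, complexes):
    """Apply the haircut to each complex and drop those with no 2-core."""
    trimmed = (two_core(adj, c) for c in complexes)
    return [c for c in trimmed if c]

# A triangle a-b-c with a singly connected vertex d, plus a separate path x-y-z.
adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"},
    "x": {"y"}, "y": {"x", "z"}, "z": {"y"},
}
print(haircut_and_filter(adj, [["a", "b", "c", "d"], ["x", "y", "z"]]))
```

Here the haircut removes 'd' from the first complex, and the second complex is filtered out entirely because a simple path contains no 2-core.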

Resulting complexes from the algorithm are scored and ranked. The complex score is defined as the product of the complex subgraph, C=(V,E), density and the number of vertices in the complex subgraph (D_c x |V|). This ranks larger more dense complexes higher in the results. Other scoring schemes are possible. MCODE may also be run in a directed mode where a seed vertex is specified as a parameter. In this mode, MCODE only runs once to predict the single complex that the specified seed is a part of. Typically, when analyzing complexes in a given network, one would find all complexes present (undirected mode) and then switch to the directed mode for the complexes of interest. The directed mode allows one to experiment with MCODE parameters to fine tune the size of the resulting complex according to existing biological knowledge of the system. In directed mode, MCODE will first pre-process the input network to ignore all vertices with higher vertex weight than the seed vertex. If this were not done, MCODE would preferentially branch out to denser regions of the graph, if they exist, which could belong to a separate, but denser complex. Thus, a seed vertex for directed mode should always be the highest density vertex among the suspected complex. There is an option to turn this pre-processing step off, which will allow seeded complexes to branch out into denser regions of the graph, if desired.
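The scoring and ranking step may be sketched as follows in Python. The function names are hypothetical, and the 'no loop' density formula is assumed, so homodimer edges are not handled in this illustration.

```python
def complex_score(adj, members):
    """Score = D_C x |V| for the subgraph induced on members."""
    vs = set(members)
    n = len(vs)
    if n < 2:
        return 0.0
    edges = sum(len(adj[v] & vs) for v in vs) / 2
    density = edges / (n * (n - 1) / 2)
    return density * n

def rank_complexes(adj, complexes):
    """Order complexes from highest to lowest score."""
    return sorted(complexes, key=lambda c: complex_score(adj, c), reverse=True)

# A 4-clique (a, b, c, d) and a 3-vertex path (x, y, z).
adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c"},
    "x": {"y"}, "y": {"x", "z"}, "z": {"y"},
}
print(complex_score(adj, ["a", "b", "c", "d"]))  # 4.0
print(rank_complexes(adj, [["x", "y", "z"], ["a", "b", "c", "d"]]))
```

A clique of n vertices scores n under this scheme, so larger and denser complexes rise to the top of the ranking.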

The time complexity of the entire algorithm is polynomial O(nmh³) where n is the number of vertices, m is the number of edges and h is the vertex size of the average vertex neighbourhood in the input graph, G. This comes from the vertex-weighting step. Finding a k-core in a graph proceeds by progressively removing vertices of degree < k until all remaining vertices are connected to each other by degree k or more, and is thus O(n²). The highest k-core is found by trying to find k-cores from one up until all vertices have been found and cannot go beyond a number of steps equal to the highest degree in the graph. Thus, the highest k-core step is O(n³). Since this k-core step operates only on the neighbourhood of a vertex, the n in this case is the number of vertices in the average neighbourhood of a vertex, h. The inner loop of the algorithm only operates twice for every edge in the input graph, thus is O(2mh³). The outer loop operates once on all vertices in the input graph, thus the entire time complexity of the weighting stage is O(n2mh³) = O(nmh³). The complex prediction stage is O(n) and the optional post-processing step can be up to O(cs²), where c is the number of complexes that were found in the previous step and s is the number of vertices in the largest complex, that is, O(cs²) to find the 2-core once for each complex.

Even though the fastest min-cut graph clustering algorithms are faster, at O(n²logn)[30], MCODE has a number of advantages. Since weighting is done once and comprises most of the time complexity, many algorithm parameters can be tried, in O(n), once weighting is complete. This is useful when evaluating many different parameters. MCODE is relatively easy to implement and since it is local density based, has the advantage of a directed mode and a complex connectivity mode. These two modes are generally not useful in typical clustering applications, but are useful for examining molecular interaction networks. Additionally, only those proteins above a given local density threshold are assigned to complexes. This is in contrast to many clustering applications that force all data points to be part of clusters, whether they truly should be part of a cluster or not.

Implementation

MCODE has been implemented in ANSI C using the cross-platform NCBI Toolkit (http://www.ncbi.nlm.nih.gov/IEB) and the BIND graph library in the SLRI Toolkit (http://sourceforge.net/projects/slritools). Both of these source code libraries are freely available. The MCODE program has been compiled and tested on UNIX, Mac OS X and Windows. In an embodiment, a yeast gene name dictionary is used to recognize input and generate output, and the MCODE executable works for yeast proteins in a user friendly manner. The MCODE algorithm, however, is completely general, via the graph theory abstraction, to any graph and thus to any biomolecular interaction network. MCODE binaries are available from ftp://ftp.mshri.on.ca/pub/BIND/Tools/MCODE.

The following non-limiting example is illustrative of the present invention:

EXAMPLE

The following materials and methods were used in the studies described in the Example.

Materials and Methods

Data Sources

All protein interaction data sets from MIPS[13], Gene Ontology[42] and PreBIND (http://bioinfo.mshri.on.ca/prebind/) were collected as described previously[6]. The YPD protein interaction data are from March 2001 and were originally requested from Proteome, Inc. (http://www.proteome.com).

Other interaction data sets are from BIND (http://www.bind.ca). A BIND yeast import utility was developed to integrate data from SGD[12], RefSeq[43], Gene Registry (http://genome-www.stanford.edu/Saccharomyces/registry.html), the list of essential genes from the yeast deletion consortium[11] and GO terms[42]. This database ensures proper matching of yeast gene names among the multiple data sets that may use different names for the same genes. The yeast proteome used here is defined by SGD and RefSeq and contains 6,334 ORFs including the mitochondrial chromosome. Before performing comparisons, the various interaction data sets were entered into a local instance of BIND as pairwise protein interaction records. The MIPS complex catalogue was downloaded in February 2002.

The protein interaction data sets used here were composed as follows. 'Gavin Spoke' is the spoke model of the raw purifications from Gavin et al.[7]. 'Y2H' is all known large-scale[2-5,10] combined with normal yeast two-hybrid results from MIPS. 'HTP Only' is only high-throughput or large-scale data[2-7,10]. The 'Benchmark' set was constructed from MIPS, YPD and PreBIND as previously described[6]. 'Pre HTMS' was composed of all yeast sets except the recent large-scale mass spectrometry data sets[6,7]. 'AllYeast' was the combination of all above data sets. All data sets are non-redundant.

Network Visualization

Visualization of networks was performed using the Pajek program for large network analysis[39] (http://vlado.fmf.uni-lj.si/pub/networks/pajek/) as described previously[6,10] using the Kamada-Kawai graph layout algorithm followed by manual vertex adjustments and was formatted using CorelDraw 10. Power law analysis was also accomplished as previously described[6].

Results

Evaluation of MCODE

The evaluation of MCODE requires a set of experimentally determined biomolecular interactions and a set of associated experimentally determined molecular complexes. Currently, the largest source for such data is for proteins from the budding yeast, Saccharomyces cerevisiae. Recently, a large-scale mass spectrometry study by Gavin et al.[7] provided a large data set of protein interactions with manually annotated molecular complexes. Also available are the protein interaction and complex tables of MIPS[13] and YPD[14]. MCODE was used to automatically predict protein complexes in the collected protein-protein interaction data sets. Resulting complexes were then matched to known molecular complexes from Gavin et al. (the Gavin benchmark) and the MIPS benchmark using an overlap score. Parameter optimization was then used to maximize the biological relevance of predicted complexes according to the given benchmarks. YPD was not used.

To ensure that MCODE is not unduly affected by the expected high false-positive rate in large-scale interaction data sets, large-scale and literature derived MCODE predictions were compared. MCODE was then used to predict complexes in the entire set of machine readable protein-protein interactions that could be collected for yeast. Complexes of interest were then further examined using the directed mode and complex connectivity mode of MCODE.

Evaluation of MCODE using the Gavin data set of protein interactions and complexes

In this study, all forms of protein interaction data available were to be used. This requires mixing of different types of experiments, such as yeast two-hybrid and co-immunoprecipitation. Two-hybrid results are inherently pairwise, whereas copurification results are sets of one or more identified proteins. For a copurification result, only a set of size 2 can be directly considered a pairwise interaction, otherwise it must be modeled as a hypothetical interaction. Biochemical copurifications can be thought of as populations of complexes with some underlying pairwise protein interaction topology that is unknown from the experiment. In the general case of the purification used by Gavin et al., one affinity tagged protein was used as bait to pull associated proteins out of a yeast cell lysate. The two extreme cases for the topology underlying the population of complexes from a single purification experiment are a minimally connected 'spoke' model, where the data are modeled as direct bait-associated protein pairwise interactions, and a maximally connected 'matrix' model, where the data are modeled as all proteins connected to all others in the set. The real topology of the set of proteins must lie somewhere between these two extremes.

Population of complexes: C = {b, c, d, e} (b = bait)

Spoke model hypothetical interactions: i_S = {b-c, b-d, b-e}

Matrix model hypothetical interactions: i_M = {b-b, b-c, b-d, b-e, c-c, c-d, c-e, d-d, d-e, e-e}

Advantages of the spoke model are that it is biologically intuitive, biologists often represent their copurification results in this manner, and it is about 3 times more accurate than the matrix model [31]. Disadvantages are that it could misrepresent interactions. The matrix model, alternatively, cannot misrepresent interactions, as all possible interactions are generated, but this is at the cost of generating a large number of false interactions. Matrix topologies are also physically implausible for larger complexes because of increased possibility of steric clash if all subunits are interacting with all others. Ultimately, the spoke model should be reasonable for use in evaluating MCODE.
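The spoke and matrix models can be illustrated with a short Python sketch that reproduces the hypothetical interactions listed above for a purification with bait b and associated proteins c, d and e. The function names spoke_model and matrix_model are hypothetical.

```python
from itertools import combinations_with_replacement

def spoke_model(bait, associated):
    """Pair the bait with each co-purified protein."""
    return [(bait, p) for p in associated]

def matrix_model(proteins):
    """Pair every protein with every protein in the set, itself included."""
    return list(combinations_with_replacement(proteins, 2))

purification = ["b", "c", "d", "e"]  # 'b' is the bait
spoke = spoke_model(purification[0], purification[1:])
matrix = matrix_model(purification)
print(len(spoke), spoke)
print(len(matrix), matrix)
```

For this set of four proteins the spoke model yields 3 hypothetical interactions and the matrix model yields 10, which shows how quickly the matrix model inflates the number of hypothetical interactions as purifications grow.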

Gavin et al. raw data from 588 biochemical purifications were represented using the spoke model, described above, to get 3,225 hypothetical protein-protein interactions among 1,363 proteins for input to MCODE. A list of 232 manually annotated protein complexes based on the original purification data reported by Gavin et al. was filtered to remove five reported 'complexes' each composed of a single protein and six complexes of two or three proteins that were already in the data set as part of a larger complex. This yielded a filtered set of 221 complexes that were used to evaluate MCODE, although some of these complexes have significant overlap to other complexes in the set. To evaluate which parameter choice would allow automatic prediction of protein complexes from the spoke modeled Gavin et al. interaction set that best matched the manually annotated complexes, MCODE was run using all four possible combinations of the two Boolean parameters (haircut: true/false, fluff: true/false) over a full range of 20 vertex weight percentage (VWP) and fluff parameters (0 to 0.95 in 0.05 increments). During this parameter optimization process, MCODE was limited to find complexes of size two or higher.

A scoring scheme was developed to determine how effectively an MCODE predicted complex matched a complex from the benchmark set of complexes. In this case, the benchmark complex set was the Gavin et al. hand-annotated complex set. The overlap score was defined as ω = i²/(a·b), where i is the size of the intersection set of a predicted complex with a known complex, a is the size of the predicted complex and b is the size of the known complex. A protein is part of the intersection set only if it is present in both predicted and known complexes. Thus, a predicted complex that has no proteins in a known complex has ω = 0 and a predicted complex that perfectly matches a known complex has ω = 1. Also, predicted complexes that fully overlap, but are much larger or much smaller than any known complexes, will get a low ω. The overlap score of a predicted complex vs. a benchmark complex is then a measure of biological significance of the prediction, assuming that the benchmark set of complexes is biologically relevant. The best parameter choice for MCODE on this protein interaction data set is one that predicts a set of complexes that match the largest number of benchmark complexes above a threshold ω. Since there is overlap in the Gavin benchmark complex database, a predicted complex may match more than one known complex with a high ω.
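The overlap score can be computed as in the following Python sketch. The function name and the protein identifiers P1 to P10 are hypothetical and serve only to illustrate the three cases described above.

```python
def overlap_score(predicted, known):
    """Return omega = i^2 / (a * b) for a predicted and a known complex."""
    predicted, known = set(predicted), set(known)
    if not predicted or not known:
        return 0.0
    i = len(predicted & known)  # proteins present in both complexes
    return i * i / (len(predicted) * len(known))


perfect = overlap_score({"P1", "P2", "P3"}, {"P1", "P2", "P3"})
disjoint = overlap_score({"P1", "P2"}, {"P3", "P4"})
# A known complex of 2 proteins fully contained in a predicted complex of 10.
too_large = overlap_score({f"P{n}" for n in range(1, 11)}, {"P1", "P2"})
print(perfect, disjoint, too_large)  # 1.0 0.0 0.2
```

The last case shows why a prediction that fully contains a known complex can still score poorly: the intersection is 2, so ω = 4/(10 × 2) = 0.2.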

To choose an overlap score threshold that maximizes the biological relevance of the predicted complexes without filtering away too many predictions, each of the 840 parameter combinations tested during the parameter optimization stage was examined. The number of MCODE predicted complexes was plotted against the number of matched known complexes over a range of ω thresholds from 'no threshold' to 0.1 to 0.9 (in 0.1 increments). If no ω threshold is used, a predicted complex only needs at least one protein in common with a known complex to be considered a match. If predicted and known complexes are only counted as a match when their ω is above a specific threshold, the number of matched complexes declines with increasing ω threshold, as shown in Figure 1. Interestingly, the average and maximum number of matched known complexes drops more quickly from zero until a ω threshold of 0.2 than from 0.2 to 0.9, indicating that many predicted complexes only have one or a few proteins that overlap with known complexes. A ω threshold of 0.2 to 0.3 thus seems to filter out most predicted complexes that have insignificant overlap with known complexes.
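The effect of the ω threshold on match counts can be sketched in Python as follows. The complexes below are invented toy data, the function names are hypothetical, and the sketch assumes that a match at a given threshold means ω greater than or equal to that threshold.

```python
def overlap(p, k):
    """Overlap score i^2 / (a * b) for two non-empty sets of proteins."""
    i = len(p & k)
    return i * i / (len(p) * len(k))


def count_matches(predicted, known, threshold=None):
    """Return (predicted complexes with a match, known complexes matched)."""
    def is_match(p, k):
        w = overlap(p, k)
        # No threshold: one shared protein is enough to count as a match.
        return w > 0 if threshold is None else w >= threshold

    matched_predicted = sum(any(is_match(p, k) for k in known) for p in predicted)
    matched_known = sum(any(is_match(p, k) for p in predicted) for k in known)
    return matched_predicted, matched_known


# Toy data: the first two known complexes overlap heavily with each other.
predicted = [{"A", "B", "C", "D"}, {"E", "F"}]
known = [{"A", "B", "C"}, {"A", "B", "C", "D", "X"}, {"F", "G", "H", "I", "J", "K"}]

print(count_matches(predicted, known))       # (2, 3)
print(count_matches(predicted, known, 0.2))  # (1, 2)
```

In the toy data, one predicted complex matches two overlapping known complexes, so more known complexes are matched than predicted complexes, and the weak single-protein match disappears once a threshold of 0.2 is applied.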

Figure 2 shows the range of number of complexes predicted and number of known complexes matched for the 0.2 ω threshold over all tried MCODE parameters. A y=x line is also plotted to show that data points tend to be skewed towards a higher number of matched known complexes than predicted complexes because of the redundancy in the Gavin complex benchmark. Data points closest to the upper right portion of the graph maximize both number of matched known complexes and number of predicted complexes. MCODE parameter combinations that result in these data points therefore optimize MCODE on this data set (according to the overlap score threshold). This result shows that the number of predicted complexes should be similar to the number of matched known complexes for a parameter choice to be reasonable, although the number of matched known complexes may be larger because of some commonality among complexes in the benchmark set. The parameter combination corresponding to the best data point (63,88) at an overlap score threshold of 0.2 is haircut=FALSE, fluff=TRUE, VWP=0.05 and a fluff density threshold between 0 and 0.1. These parameter optimization results for MCODE over this data set were stable over a range of ω thresholds up to 0.5. Above 0.5, the result was not stable as there were generally too few predicted complexes with high overlap scores (Figure 1).

A specificity versus sensitivity analysis [32] was also performed. The number of true positives (TP) was defined as the number of MCODE predicted complexes with ω over a threshold value, and the number of false positives (FP) as the total number of predicted MCODE complexes minus TP. The number of false negatives (FN) equals the number of known benchmark complexes not matched by predicted complexes.

Sensitivity was defined as [TP/(TP+FN)] and specificity was defined as [TP/(TP+FP)]. The MCODE parameter choice that optimizes both specificity and sensitivity is the same as from the above analysis. The optimal sensitivity of this analysis was ~0.31 and the corresponding specificity was ~0.79.
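These definitions can be written out as a short Python sketch. The function and argument names are hypothetical. The counts in the example are taken from the MIPS evaluation reported later in this description (166 predicted complexes, of which 52 matched 64 of the 208 benchmark complexes), on the assumption that TP is the 52 matching predictions and FN is the 208 − 64 unmatched benchmark complexes. Note that TP/(TP+FP), called specificity here, is elsewhere commonly called precision or positive predictive value.

```python
def sensitivity_specificity(n_predicted, n_predicted_matched, n_known, n_known_matched):
    """Apply the definitions above to raw complex counts."""
    tp = n_predicted_matched          # predictions with omega over the threshold
    fp = n_predicted - tp             # remaining predictions
    fn = n_known - n_known_matched    # benchmark complexes not matched
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    return sensitivity, specificity


sens, spec = sensitivity_specificity(166, 52, 208, 64)
print(round(sens, 2), round(spec, 2))  # 0.27 0.31
```

Under these assumptions the sketch reproduces the sensitivity and specificity reported for the MIPS benchmark.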

The 63 MCODE predicted complexes only matched 88 of the 221 complexes in the known data set, indicating that MCODE could not recapitulate the majority of the Gavin complex benchmark solely using protein connectivity information. As mentioned above, there are more matched complexes than predicted because of some redundancy in the benchmark. This low sensitivity is not surprising, since many of the hand-annotated complexes were created directly from single co-immunoprecipitation results, which are not highly interconnected in the spoke model. For example, Cdc3 was used as a bait to co-immunoprecipitate Cdc10, Cdc11, Cdc12 and Ydl225w. A complex was annotated as containing these five proteins, but only Cdc3 was used as bait. If more elements of a complex are used as baits, the proteins become more interconnected and more readily predicted by MCODE. A good example of this is the Arp2/3 complex, which is highly conserved in eukaryotes and is involved in actin cytoskeleton rearrangement. The structure of this complex is known by X-ray crystallography[33], thus actual protein-protein interactions from the structure can be matched up to the co-immunoprecipitation results. MCODE predicted all seven components of the Arp2/3 complex crystal structure and five extra proteins using the optimized parameters. Six out of the seven Arp2/3 subunits were used as baits by Gavin et al. and the resulting benchmark complex included the five extra proteins that MCODE also predicted (Nog2, Pfk1, Prt1, Cct8 and Cct5) that are not in the crystal structure. Cct5 and Cct8 are known to be involved in actin assembly, but Nog2, Pfk1 and Prt1 are not. These extra proteins likely represent non-specific binding in the experimental approach. These two cases are shown diagrammatically in Figure 3. Interestingly, using the haircut parameter would remove all five extra proteins that are not in the crystal structure, leaving only the seven that are present. This shows that while the parameter optimization allows maximum matching of the hand-annotated known complexes, these complexes may not all be physiologically relevant and thus another parameter set may better predict 'real' complexes.

To explore the effect of certain MCODE parameters on resulting predicted complexes, various features of the predicted complexes were examined while changing specific parameters and keeping all else constant. Linearly increasing the VWP parameter increased the size of the predicted complexes exponentially while reducing the number of complexes predicted in a linear fashion. Figure 4 shows this effect with both fluff and haircut parameters turned off. At high VWP values, very large complexes were predicted and these encompassed most of the data set, thus were not very useful.

Because using haircut=TRUE would have led MCODE to predict the Arp2/3 complex perfectly (according to the crystal structure as discussed above), whether the haircut parameter has any general effect on the number of matched predicted complexes was examined. Setting haircut=TRUE had no significant effect on the number of complexes predicted at high ω thresholds, but generally reduced the number of matched known complexes at low ω thresholds (0 to 0.1) compared to haircut=FALSE. Since the haircut=TRUE option removes less-connected proteins on the fringe of a predicted complex and this reduces the number of predicted complexes with low overlap scores, these fringe proteins likely contribute to low-level overlap (<0.2 ω) of the known complexes.

The effect of changing the fluff density threshold when setting fluff=TRUE on the number of matched benchmark complexes was also investigated. Linearly increasing the fluff density threshold in the MCODE post-processing step linearly decreased the number of matched complexes above an overlap score of 0.2.

Evaluation of MCODE using MIPS data set of protein interactions and complexes

Since the Gavin et al. data set was developed by only one group using a single experimental method, it may not accurately represent protein complex knowledge for yeast. The MIPS protein complex catalogue (http://mips.gsf.de/proj/yeast/catalogues/complexes/) is a curated set of 260 protein complexes for yeast that was compiled from the literature and is thus a more realistic data set comprised of varied experiments from many labs using different techniques. After filtering away 50 'complexes' each composed of a single protein and 2 highly similar complexes, 208 complexes were left for the MIPS known set. This set did not include information from the recent large-scale mass spectrometry studies[6,7]. The MIPS complex catalogue is currently the best available public resource for yeast protein complexes. MCODE was run again with a full combination of parameters, this time over a set of 9088 protein-protein interactions among 4379 proteins which did not include the recent large-scale mass spectrometry studies but included all interactions from the MIPS, YPD and PreBIND databases as well as from the majority of large-scale yeast two-hybrid experiments to date[2-4,10,32]. This interaction set is termed 'Pre HTMS'. All of the interactions in this set were published before the last update specified on the MIPS protein complex catalogue and many are included in the MIPS protein interaction table, thus it was assumed that the MIPS complex catalogue took into account the information in the known interaction table. Protein complexes found by MCODE in this set were compared to the MIPS protein complex catalogue to evaluate how well MCODE performed at locating protein complexes ab initio.

The same evaluation of MCODE that was done using the Gavin et al. data set was performed with the MIPS data set. From this analysis, including specificity versus sensitivity plots (optimized sensitivity = ~0.27 and specificity = ~0.31), the MIPS complex benchmark optimized parameters were haircut=TRUE, fluff=TRUE, VWP=0.1 and a fluff density threshold of 0.2. This result was stable up to a ω threshold of 0.6, after which it was difficult to evaluate the results as there were generally too few predicted complexes above the high ω thresholds. This parameter combination led MCODE to predict 166 complexes of which 52 matched 64 MIPS complexes with a ω of at least 0.2. Examining the ω distribution for this parameter set reveals that, even though this prediction is optimized, most of the predicted complexes do not show overlap to those in the known MIPS set (Figure 5). The complexes predicted here are also different from those predicted from the Gavin interaction data. Nine complexes have an overlap score above 0.2 between these two sets, with the highest overlap score being 0.43 and all the rest being below 0.27. This might signify that the MIPS complex catalogue is not complete, that there is not enough data in the data set that MCODE was run on, or that a human annotated definition of a complex does not perfectly match a graph density based definition. The effect of the VWP parameter on complex size and of the haircut and fluff parameters on number of matched complexes was very similar to that seen when evaluating MCODE on the Gavin complex benchmark.

Effect of Data Set Properties on MCODE

Since many large-scale protein interaction data sets from yeast are known to contain a high level of false positives [35], the effect these might have on MCODE predictions was examined. Sensitivity vs. specificity was plotted for MCODE predictions, with parameters chosen to maximize these values at a ω threshold of 0.2 against the MIPS and Gavin complex benchmarks for the various data sets (Figure 6).

MCODE predictions on the high-throughput data sets, termed 'Gavin Spoke', 'Y2H' and 'HTP only' (see Methods), are about as specific as the literature derived interaction data set, but not as sensitive (Figure 6A). MCODE predictions on interaction data sets containing the literature derived benchmark, labelled 'Benchmark', 'Pre HTMS' and 'AllYeast', are generally more sensitive and specific than those containing just the large-scale interaction sets. Since the specificity drops from Benchmark to Pre HTMS to AllYeast, with increasing amounts of large-scale data, it could be argued that addition of this data negatively affects MCODE. However, large-scale data is known to contain a high number of false positives, so it should be expected that these false positives would not randomly contribute to the formation of dense regions, which are highly unlikely to occur by chance (see below). More complexes should be predicted with the addition of the large-scale data, assuming this data explores previously unseen regions of the interactome, but the high number of false positives should limit the number of new complexes compared to the number of added interactions. The MIPS complex benchmark used here is not expected to contain complexes newly found in large-scale studies, explaining the decrease in specificity. This is exactly what occurs in this analysis. In an effort to further test the effect of large-scale data on MCODE prediction performance, the Benchmark interaction data set was augmented with the addition of interactions from large-scale experiments that only connect proteins in the Benchmark set with each other. Over 3100 interactions were added to the Benchmark data set to create a set of over 6400 interactions. MIPS complex benchmark optimised MCODE predicted 52 complexes matching 66 MIPS benchmark complexes, almost exactly the same number of complexes found using the Benchmark set by itself (Table 1). These analyses strongly suggest the addition of large-scale experimentally derived interactions does not unduly affect the prediction of complexes by MCODE.

It can be seen from Figure 6B that the Gavin complex benchmark set is biased towards the Gavin et al. spoke modeled interaction data. This is expected and is the main reason why the less biased MIPS complex set is used throughout this work as a benchmark instead of the Gavin set.

Since the result of a co-immunoprecipitation experiment is a set of proteins, which are modeled as binary interactions using the spoke method, it was decided to evaluate whether this affects complex prediction compared to an experimental system that generates purely binary interaction results, such as yeast two-hybrid. As can be seen in Table 1, MCODE does find known complexes in the 'Y2H' set of only yeast two-hybrid results, thus this set does contain dense regions that are known protein complexes. This being said, the Y2H set is the least dense of all data sets examined here, so it is expected to have fewer dense regions of the network and thus fewer MCODE predictable complexes per number of proteins present in the set.

MCODE predicts a similar number of complexes and finds a similar number of known complexes in the Y2H and Gavin Spoke data sets, indicating that these data sets are not significantly different from each other in the number of dense network regions that they contain, even though they are different sizes. Taken together, the latter results and those in Figure 6B show that the spoke model is a reasonable representation of the Gavin et al. tandem affinity purification data.

Predicting Complexes in the Yeast Interactome

Given that MCODE performed reasonably well on test data, it was decided to predict complexes in a much larger network. All machine-readable protein-protein interaction data from various data sets[2-7,10,13,14] were collected and integrated to form a non-redundant set of 15,143 experimentally determined yeast protein interactions encompassing 4,825 proteins, or approximately three quarters of the proteome. This set was termed 'AllYeast'. MCODE was parameter optimized, as above, using the MIPS benchmark. The best resulting parameter set was haircut=TRUE, fluff=TRUE, VWP=0 and a fluff density threshold of 0.1. With these parameters, MCODE predicted 209 complexes, of which 54 matched 63 MIPS benchmark complexes above an overlap score of 0.2 (see examples in Table 5). Complexes found in this manner should be further studied using MCODE in directed mode by specifying a seed vertex and trying different parameters to examine how large a complex can get before seemingly biologically irrelevant proteins are added (see below).

Figure 5 shows that even when a large set of interactions is used as input to MCODE, most of the MCODE predicted complexes do not match well with known complexes in MIPS. The complex size distribution of MCODE predicted complexes matches the shape of the MIPS set, but the MCODE complexes are on average larger (Average MIPS size=6.0, Average MCODE Predicted size=9.7). The average number of YPD and GO functional annotation terms per protein in an MCODE predicted complex is similar to that of MIPS complexes (Table 2). This seems to indicate that MCODE is predicting complexes that are functionally relevant. Also, closer examination of the top, middle and bottom five scoring MCODE complexes shows that MCODE can predict biologically relevant complexes (Table 3).

Many of the 209 predicted complexes are of size 2 (9 predicted complexes) or 3 (54 predicted complexes). Complexes of this size may not be significant since it is easy to create high density subgraphs of size 2 or 3, but it becomes combinatorially more difficult to randomly create high density subgraphs as the size of the subgraph increases. To examine the relevance of these small predicted complexes of size 2 or 3, the sensitivity and specificity of the optimized MCODE predictions were calculated against the MIPS complex benchmark while disregarding the small complexes. First, complexes of size 2, then of size 3, were removed from the optimized MCODE predicted complex set. Removing each of these sets independently resulted in only small sensitivity and specificity changes. Because both sets overlap the MIPS benchmark, small complexes have been reported as predictions. Also, because MCODE found these small complexes in regions of high local density, they may be good cores for further examination with MCODE in directed mode, especially since the haircut option was turned on here to produce them.

Complexes that are larger and denser are ranked higher by MCODE and these generally correspond to known complexes. Interestingly, some MCODE complexes contain unknown proteins that are highly connected to known complex subunits. For example, the second highest ranked MCODE complex is involved in RNA processing/modification and contains the known polyadenylation factor I complex (Cft1, Cft2, Fip1, Pap1, Pfs2, Pta1, Ysh1, Yth1 and Ykl059c). Seven other proteins involved in mainly RNA processing/modification (Fir1, Hca4, Pcf11, Pti1, Ref2, Rna14, Ssu72) and protein degradation (Uba2 and Ufd1) are highly connected within this predicted complex. Two unknown proteins, Pti1 and Yor179c, are highly connected to RNA processing/modification proteins and are therefore likely involved in the same process (Figure 7). Pti1 may be an unknown component of the polyadenylation factor I complex. The 23rd highest ranked predicted complex is interesting in that it is involved in cell polarity and cytokinesis and contains two proteins of unknown function, Yhr033w and Yal027w. Yal027w interacts with two kinases, Gin4 and Kcc4, which in turn interact with the components of the Septin complex (Cdc3, Cdc10, Cdc11 and Cdc12) (Figure 8).

Significance of MCODE Predictions

Recent research on modeling complex systems [21,24,26]21,25, 27] has found that networks such as the world wide web, metabolic networks[25] and protein-protein interaction networks[34] are scale-free. That is, the connectivity distribution of the vertices of the graph follows a power law, with many vertices of low degree and few vertices of high degree. Scale-free networks are known to have large clustering coefficients, or clustered regions of the graph. In biological networks, at least in yeast, these clustered regions seem to correspond to molecular complexes and these subgraphs are what MCODE is designed to find. To test the significance of clustered regions in biological networks, 100 random permutations of the large set of all 15,143 yeast interactions were made. If the graph to be randomised is considered as a set of edges between two vertices (v₁, v₂), a network permutation is made by randomly permuting the set of all v₂ vertices. The random networks have the same number of edges and vertices as the original network and follow a power-law connectivity distribution as do the original data sets. Running MCODE with the same parameters as the original network (haircut=TRUE, fluff=TRUE, VWP=0 and a fluff density threshold of 0.1) on the 100 random networks resulted in an average of 27.4 (SD=4.4) complexes per network. The size distribution of complexes found by MCODE did not match that of the complexes found in the original network, as some complexes found in the random networks were composed of >1500 proteins. One random network that had an approximately average number of predicted complexes (27) was parameter optimized using the MIPS benchmark to see how parameter choice affects the size distribution and number of predicted complexes.
Parameters of haircut=TRUE, fluff=TRUE, VWP=0.1 and a fluff density threshold of zero produced the maximal number of 81 complexes for this network, but these complexes were composed of on average 27 proteins (without counting an outlier complex of size 1961), which is much larger than normal (e.g. larger than the MIPS set average of 6.0). None of these predicted complexes matched any MIPS complexes above an overlap score of 0.1. Also, the random network complexes had a much higher average number of YPD and GO annotation terms per protein per complex than for MIPS or MCODE on the original network (Table 2). This indicates, as expected, that the random network complexes are composed of a higher level of unrelated proteins than complexes in the original network. Thus, the number, size and functional composition of complexes that MCODE predicts in the large set of all yeast interactions are highly unlikely to occur by chance.
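The network permutation described above can be sketched in Python as follows. The function name, the seed argument and the toy edge list are hypothetical. The sketch does not filter out self-loops or duplicate edges that the shuffle may create, since the description does not state how these were handled.

```python
import random


def permute_network(edges, seed=None):
    """Randomly permute the second endpoint of every edge (v1, v2)."""
    rng = random.Random(seed)
    v1 = [a for a, _ in edges]
    v2 = [b for _, b in edges]
    rng.shuffle(v2)  # the v1 column keeps its order, the v2 column is shuffled
    return list(zip(v1, v2))


# Toy network with hypothetical vertex names.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
randomized = permute_network(edges, seed=1)
```

Because the two endpoint columns are only re-paired, the randomized network has the same number of edges, and every vertex appears in each column as often as it did in the original edge list.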

To evaluate the effectiveness of the scoring scheme, which scores larger, more dense complexes higher than smaller, more sparse complexes, the accuracy of MCODE predictions at various score thresholds was examined. As the score threshold for inclusion of complexes is increased, fewer complexes are included, but a higher percentage of the included complexes match complexes in the benchmark. This is at the expense of sensitivity, as many benchmark matching complexes are not included at higher score thresholds (Figure 9). For example, of the ten predicted complexes with an MCODE score greater than or equal to six, nine match a known complex in either the MIPS or Gavin benchmark above a 0.2 threshold overlap score, yielding an accuracy of 90%. All five complexes that had an MCODE score greater than or equal to seven matched known complexes. Thus, complexes that score highly on our simple density based scoring scheme are very likely to be real.

Directed Mode of MCODE

To simulate an obvious example where the directed mode of MCODE would be useful, MCODE was run with relaxed parameters (haircut=TRUE, fluff=TRUE, VWP=0.05 and a fluff density threshold of 0.2) compared to the best parameters on the AllYeast network. The resulting fourth highest ranked complex, when visualized, shows two clustered components and represents two protein complexes, the proteasome and an RNA processing complex, both found in the nucleus (Figure 10). This is an example of where a lower VWP parameter would have been superior since it would have divided this large complex into two more functionally related complexes. The highest weighted vertices in the center of each of the two dense regions in Figure 10 are the Rpt1 and Lsm4 proteins. MCODE was run in directed mode starting with these two proteins over a range of VWP parameters from 0 to 0.2, at 0.05 increments. For Lsm4, the parameter set of haircut=TRUE, fluff=FALSE, VWP=0 was used to find a core complex, which contained 9 proteins fully connected to each other (Dcp1, Kem1, Lsm2, Lsm3, Lsm4, Lsm5, Lsm6, Lsm7 and Pat1). Above this VWP parameter, the core complex branched out into proteasome subunit proteins, which are not part of the Lsm complex (see Figure 11A). Using this VWP parameter, combinations of haircut and fluff parameters were used to further expand the core complex. This process was stopped when the predicted complexes began to include proteins of sufficiently different known biological function to the seed vertex. Proteins such as Vam6 and Yor320c were included in the complex at moderate fluff parameters (0.4-0.6), but not at higher fluff parameters, and these are known to be localized in membranes outside of the nucleus, thus are likely not functionally related to the Lsm complex proteins. Therefore, the 9 proteins listed above were decided to be the final complex (Figure 11B). This is intuitive because of their maximal density (a 9-clique).

Using this same method of known biological role "titration" on Rpt1 found a complex of 34 proteins (Gal4, Gcn4, Hsm3, Lhs1, Nas6, Pre1, Pre2, Pre3, Pre4, Pre5, Pre6, Pre7, Pre9, Pup3, Rpn10, Rpn11, Rpn13, Rpn3, Rpn5, Rpn6, Rpn7, Rpn8, Rpn9, Rpt1, Rpt2, Rpt3, Rpt4, Rpt6, Rri1, Scl1, Sts1, Ubp6, Ydr179c, Ygl004c) and 160 interactions using the parameter set haircut=TRUE, fluff=TRUE, VWP=0.2 and a fluff density threshold of 0.3. Two regions of density can be seen here corresponding to the two known subunits of the 26S proteasome. The 20S proteolytic subunit of the proteasome is comprised of 15 proteins (Pre1 to Pre10, Pup1, Pup2, Pup3, Scl1 and Ump1) of which Pre7, Pre8, Pre10, Pup1, Pup2 and Ump1 are not found with MCODE. The 19S regulatory subunit of the proteasome is known to have 21 subunits (Nas6, Rpn1 to Rpn13, Rpt1 to Rpt6 and Ubp6) of which Rpn1, Rpn2, Rpn4, Rpn12 and Rpt5 are not found with MCODE. Known complex components not found by MCODE are not present in regions of high enough local density in the interaction network, possibly because not enough experiments involving these proteins are present in the data set. Figure 11C shows the final Rpt1 seeded complex. Of note, Ygl004c is unknown and binds to almost every Rpt and Rpn protein in the complex, although all of these interactions were from a single immunoprecipitation experiment[6]. As well, Rri1 and Ydr179c have unknown function and both bind to each other and to Rpn5. Thus one would predict that these three unknown proteins function with or as part of the 26S proteasome. The protein Hsm3 binds to eight other 19S subunits and is involved in DNA mismatch repair pathways, but is not known to be part of the proteasome, although all of these Hsm3 interactions are from a particular large-scale experiment[7].

Interestingly, Gal4, a transcription factor involved in galactose metabolism, is found to be part of the proteasome complex. While this metabolic functionality seems unrelated to protein degradation, it has recently been shown that the binding is physiologically relevant[37]. These cases illustrate the possible unreliability of both functional annotation and interaction data, but also that seemingly unrelated proteins should not be immediately discounted if found to be part of a complex by MCODE.

Of note, the known topology of the 26S proteasome[38] compares favourably with the complex visualization of Figure 11C without considering stoichiometry. Thus, if enough interactions are known, visualizing complexes may reveal the rough structural outline of large complexes. This should be expected when dealing with actual physical protein-protein interactions since there are few allowed topologies for large complexes considering the specific set of defining interactions and steric clashes between protein subunits.

Complex Connectivity

MCODE may also be used to examine the connectivity and relationships between molecular complexes. Once a complex is known using the directed mode, the MCODE parameters can be relaxed to allow branching out into other complexes. The MCODE directed mode preprocessing step must also be turned off to allow MCODE to branch into other connected complexes, which may reside in denser regions of the graph than the seed vertex. As an example, this was done with the Lsm4 seeded complex (Figure 12). MCODE parameters were relaxed to haircut=TRUE, fluff=FALSE, VWP=0.2 although they could be further relaxed for greater extension out into the network.

Discussion

This method represents an initial step in taking advantage of the protein function data being generated by many large-scale protein interaction studies. As the experimental methods are further developed, an increasing amount of data will be produced which will require computational methods for efficient interpretation. The algorithm described here allows the automated prediction of protein complexes from qualitative protein-protein interaction data and is thus able to help predict the function of unknown proteins and aid in the understanding of the functional connectivity of molecular complexes in the cell. The general nature of this method may allow complex prediction for molecules other than proteins as well, for example metabolic complexes that include small molecules. MCODE may be combined with a graph visualization system to ease the understanding of the relationships among molecules in the data set. The Pajek program is used for large network analysis[37] with the Kamada-Kawai graph layout algorithm[38]. Kamada-Kawai models the edges in the graph as springs, randomly places the vertices in a high energy state and then attempts to minimize the energy of the system over a number of time steps. The result is that the Euclidean distance, here in a plane, is close to the graph-theoretic or path distance between the vertices. The vertices are visually clustered based on connectivity.

Biologically, this visualization can allow one to see the rough structural outline of large complexes, if enough interactions are known, as evidenced in the proteasome complex analysis above (Figure 11C).

It is important to note and understand the limitations of the current experimental methods (e.g. yeast two-hybrid and co-immunoprecipitation) and the protein interaction networks that these techniques generate when analyzing the resulting data. One common class of false-positive interactions arising from many different kinds of experimental methods is that of indirect interactions. For instance, an interaction may be seen between two proteins using a specific experimental method, but in reality, those proteins do not physically bind each other, and one or more other molecules that are generally part of the same complex mediate the observed interaction. As can be seen for the Arp2/3 complex shown in Figure 3, when pairwise interactions between all combinations of proteins in a complex are studied, this creates a very dense graph.

Interestingly, this false-positive effect is normally considered a disadvantage, but is an advantage with MCODE as it increases the density in the region of the graph containing a complex, which can then be more easily predicted. Apart from the experimental factors that lead to false-positive and false-negative interactions, representational limitations also exist computationally. Temporal and spatial information is not currently described in interaction networks. A complex found by the MCODE approach may not actually exist even though all of the component proteins bind each other in vitro. Those proteins may never exist in an organism at the same time and place. For example, molecular complexes that perform different functions sometimes have common subunits as with the three types of eukaryotic RNA polymerases.

Complex stoichiometry, another important aspect of biological data, is not represented either. While it is possible to include full stoichiometry in a graph representation of a biomolecular interaction network, many experimental methods do not provide this information, so a homo-multimeric complex is normally represented as a simple homodimer. When an experiment does provide stoichiometry information, it is not stored in most current databases, such as MIPS and YPD. Thus, one is forced to return to the primary literature to extract the data, an extremely time-consuming task for large data sets.

Some quantitative and statistical information is present when integrating results of large-scale approaches and this is not used in the current graph model. For instance, the number of different types of experiments that find the same interaction, the quality of the experiment, the date the experiment was conducted (newer methods may be superior in certain aspects) and other factors that pertain to the reliability of the interaction could all be considered to determine a reliability index or p-value on edges in the graph. For instance, one may wish to rank results published in high-impact journals above other journals and rank classical purification methods above high-throughput yeast two-hybrid techniques when determining the quality of the interaction data. It may also be possible to weight vertices on the graph by other quality criteria, such as whether a protein is hypothetical from a gene prediction or not or whether a protein is expressed at a particular time and place in the cell. For example, if one were interested in a certain stage of the cell cycle, proteins that are known to be absent at that stage could be reduced in weight (VWP in the case of MCODE) compared to proteins that are present. It should be noted that any weighting scheme that tries to assess the quality of an interaction might make false assumptions that would prevent the discovery of new and interesting data.

This work shows that the structure of a biological network can define complexes, which can be seen as dense regions. This may be attributed to indirect interactions accumulating in the literature. Thus, interaction data taken out of context may be erroneous. For instance, if one has a collection of protein interactions from various different experiments done at different times in different labs from a specific complex that form a clique, and if one chooses an interaction from this clique, how can one verify whether it is indirect or not? One would only begin to know if one had a very detailed description of the experiment from the original papers where one could tell the amount of work and quality of work that went into measuring each interaction. Thus with only a qualitative view of interactions, in reference to Dobzhansky [39], nothing in the biomolecular interaction network would make sense except in light of molecular complexes and the functional connections between them. If one had a highly detailed representation of each interaction including time, place, experimental condition, number of experiments, binding sites, chemical actions and chemical state information, one would be able to computationally delve into molecular complexes to resolve topology, structure, function and mechanism down to the atomic level. This information would also help to judge the biological relevance of an interaction. Thus, databases like BIND [15] are required to store this information. The integration of known qualitative and quantitative molecular interaction data in a machine-readable format should allow increasingly accurate protein interaction, molecular complex and pathway prediction, including actual binding site and mechanism information in a sequence and structural context.
Based on the scale-free network analysis, it would seem that real biological networks are organized differently than random models of scale-free networks in that they have higher clustering coefficients around specific regions (complexes) and the vertices in these regions are related to each other by biological function. Thus, attempts to model biological networks and their evolution in a global way solely using the statistics of scale-free networks may not work; rather, modeling should take into account as much extant biological knowledge as possible.

Improvements on MCODE could include different, possibly adaptive, vertex scoring functions taking into account, for example, the local density of the network past the immediate neighborhood of a vertex, and the inclusion of functional annotation and p-values on edges. Time, space and stoichiometry should also be represented on networks and in visualization systems. The process of 'functional annotation titration' in the directed mode of MCODE could be automated.

Conclusions

MCODE effectively finds densely connected regions of a molecular interaction network, many of which correspond to known molecular complexes, based solely on connectivity data. That this approach to analyzing protein interaction networks performs well using minimal qualitative information implies that a large amount of available knowledge is buried in large protein interaction networks. More accurate data mining algorithms and systems models could be constructed to understand and predict interactions, complexes and pathways by taking into account more existing biological knowledge. Structured molecular interaction data resources such as BIND will be vital in creating these resources.

Examples of an MCODE specification are set out in Table 4.

Having illustrated and described the principles of the invention in a preferred embodiment, it should be appreciated by those skilled in the art that the invention can be modified in arrangement and detail without departure from such principles. We claim all modifications coming within the scope of the following claims.

All publications, patents and patent applications referred to herein are incorporated by reference in their entirety to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety.

Table 1

Table 2

Table 3

Table 4 Pseudocode

Stage 1: Vertex Weighting

procedure MCODE-VERTEX-WEIGHTING
  input: graph: G = (V,E)
  for all v in G do
    N = find neighbors of v to depth 1
    K = Get highest k-core graph from N
    k = Get highest k-core number from N
    d = Get density of K
    Set weight of v = k x d
  end for
end procedure

Stage 2: Molecular Complex Prediction

procedure MCODE-FIND-COMPLEX
  input: graph: G = (V,E); vertex weights: W; vertex weight percentage: d; seed vertex: s
  if s already seen then return
  for all v neighbors of s do
    if weight of v > (weight of s)(1 - d) then add v to complex C
    call: MCODE-FIND-COMPLEX (G, W, d, v)
  end for
end procedure

procedure MCODE-FIND-COMPLEXES
  input: graph: G = (V,E); vertex weights: W; vertex weight percentage: d
  for all v in G do
    if not already seen v then call: MCODE-FIND-COMPLEX (G, W, d, v)
  end for
end procedure

Stage 3: Post-Processing (optional)

procedure MCODE-FLUFF-COMPLEX
  input: graph: G = (V,E); vertex weights: W; fluff density threshold: d; complex graph: C = (U,F)
  for all u in C do
    if weight of u > d then add u to complex C
  end for
end procedure

procedure MCODE-POST-PROCESS
  input: graph: G = (V,E); vertex weights: W; haircut flag: h; fluff flag: f; fluff density threshold: t; set of predicted complex graphs: C
  for all c in C do
    if c not 2-core then filter
    if h is TRUE then 2-core complex
    if f is TRUE then call: MCODE-FLUFF-COMPLEX (G, W, t, c)
  end for
end procedure

Overall Process:

procedure MCODE
  input: graph: G = (V,E); vertex weight percentage: d; haircut flag: h; fluff flag: f; fluff density threshold: t; set of predicted complex graphs: C
  call: W = MCODE-VERTEX-WEIGHTING (G)
  call: C = MCODE-FIND-COMPLEXES (G, W, d)
  call: MCODE-POST-PROCESS (G, W, h, f, t, C)
end procedure
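The first two stages of the pseudocode, together with the 2-core filter of the third, can be sketched in Python as follows. This is an illustrative reading rather than the reference implementation, and it rests on several assumptions that are not fixed by the pseudocode: density is computed without loops as edges divided by n(n-1)/2, seeds are visited from highest to lowest weight, the inclusion threshold is held at a percentage of the original seed's weight, and the recursion is replaced by an explicit stack. All function names and the six-vertex example graph are hypothetical. The haircut and fluff options are omitted.

```python
def highest_k_core(adj):
    """Return (k, vertices) of the highest k-core of an undirected graph."""
    core = {v: set(ns) - {v} for v, ns in adj.items()}  # ignore self-loops
    best_k, best = 0, set(core)
    k = 1
    while core:
        low = [v for v in core if len(core[v]) < k]
        while low:  # peel vertices whose degree is below k
            for v in low:
                for w in core.pop(v):
                    core[w].discard(v)
            low = [v for v in core if len(core[v]) < k]
        if core:
            best_k, best = k, set(core)
        k += 1
    return best_k, best

def induced(adj, vertices):
    """Subgraph restricted to the given vertex set."""
    return {u: set(adj[u]) & vertices for u in vertices}

def density(adj, vertices):
    """Edges present divided by edges possible (assumption: no loops)."""
    n = len(vertices)
    if n < 2:
        return 0.0
    degree_sum = sum(len((set(adj[u]) & vertices) - {u}) for u in vertices)
    return (degree_sum / 2) / (n * (n - 1) / 2)

def vertex_weights(adj):
    """Stage 1: weight of v = k x d for the highest k-core of v's neighbourhood."""
    weights = {}
    for v in adj:
        sub = induced(adj, set(adj[v]) | {v})
        k, core = highest_k_core(sub)
        weights[v] = k * density(sub, core)
    return weights

def find_complexes(adj, weights, d):
    """Stage 2: grow a complex outward from each unseen seed vertex."""
    seen, complexes = set(), []
    for seed in sorted(adj, key=lambda v: (-weights[v], v)):
        if seed in seen:
            continue
        threshold = weights[seed] * (1 - d)  # assumption: fixed by the first seed
        members, stack = {seed}, [seed]
        seen.add(seed)
        while stack:
            s = stack.pop()
            for v in adj[s]:
                if v not in seen and weights[v] > threshold:
                    seen.add(v)
                    members.add(v)
                    stack.append(v)
        complexes.append(members)
    return complexes

def mcode(adj, d=0.2):
    """Stages 1 and 2, then drop complexes that contain no 2-core."""
    weights = vertex_weights(adj)
    found = find_complexes(adj, weights, d)
    return [c for c in found if highest_k_core(induced(adj, c))[0] >= 2]

# Hypothetical graph: a four-vertex clique A-B-C-D with a tail D-E-F.
adj = {
    "A": {"B", "C", "D"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"}, "E": {"D", "F"}, "F": {"E"},
}
print(vertex_weights(adj)["A"])  # 3.0 (3-core of density 1)
print(mcode(adj))                # a single complex containing A, B, C and D
```

In this example the clique vertices all receive weight 3.0, while the tail vertices score at most 1.0 and fall below the threshold, so only the clique survives the 2-core filter.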

Table 5 - Page 1

>46|MCD Generated Complex|MCDOutput

Pre7 Ura7 Hsm3 Rpn5 Rpn6 Rpn4 Rpt2 Rpt3 Rpn9 Rad23 Hyp2 Gcn4 Pre1 Rpn3 Pup3 Rad24 Rpt6 Sel1 Ygl004c Dbf2 Pre9 Nas6 Ecm29 Rpn1 Rpn10 Rpt1 Lhs1 Rpn13 Pre8 Yku70 Pre5 Rfc3 Rad50 Mkt1 Rfc4 Pre6 Rpt5 Rpt4 Rpn8 Ypl070w Pre2 Rpn7 Rpn11 Ubp6 Pre4 Rpn12

>19|MCD Generated Complex|MCDOutput

Pta1 Ref2 Pcf11 Cft1 Uba2 Fir1 Ufd1 Pti1 Hca4 Fip1 Mpe1 Pap1 Cft2 Ysh1 Rna14 Pfs2 Ssu72 Yor179c Yth1

>56|MCD Generated Complex|MCDOutput

Fun19 Myo4 Hht1 Spt7 Taf5 Srb6 Ybr270c Sgf29 Taf2 Cdc36 Rpo21 Luc7 Taf12 Swi5 Taf10 Ngg1 Adr1 Pcf11 Yap6 Spt3 Ada2 Spt15 Taf6 Sgf73 Pdr1 Gcn5 Taf1 Ire1 Yng2 Tra1 Cdc23 Sap185 Prp40 Spt8 Taf8 Taf13 Taf11 Yap1 Rna14 Ubp8 Taf7 Taf9 Fet4 Hht2 Hhf2 Spt20 Med7 Ahc1 Rpb2 Esa1 Hfi1 Gal4 Taf14 Taf3 Hac1 Epl1

>18|MCD Generated Complex|MCDOutput

Cdc27 Ybr270c Apc11 Apc4 Spt2 Dmc1 Doc1 Cdc23 Rpt1 Cdc16 Sic1 Apc9 Apc2 Leu3 Apc1 Apc5 Spc29 Cdc26

>15|MCD Generated Complex|MCDOutput

Trs20 Uso1 Gsg1 Trs23 Trs120 Trs31 Kre11 Bet1 Gyp6 Bet3 Sec22 Fks1 Bet5 Trs130 Trs33

>10|MCD Generated Complex|MCDOutput

Ref2 Pcf11 Cft1 Pti1 Hca4 Fip1 Mpe1 Ysh1 Rna14 Pfs2

>36|MCD Generated Complex|MCDOutput

Hek2 Hmt1 Mud1 Sro9 Air2 Luc7 Vps41 Prp42 Tif4632 Snu71 Smd1 Tif4631 Nam8 Prp8 Yju2 Mud2 Prp40 Msl5 Clf1 Smd3 Bur2 Smd2 Yhc1 Sen1 Rse1 Cus1 Rpd3 Nrd1 Pub1 Nab3 Cbc2 Brr1 Sgv1 Smx3 Sto1 Smx2

>21|MCD Generated Complex|MCDOutput

Rrp42 Rrp45 Ygr090w Rrp46 Mtr3 Ski6 Rrp4 Yhr081w Utp10 Mas1 Gsp1 Ecm16 Ist1 Csl4 Rrp40 Dis3 Rrp6 Ski7 Myo2 Rrp43 Utp20

>49|MCD Generated Complex|MCDOutput

Rox3 Rpg1 Cyc8 Med8 Srb6 Srb8 Tup1 Rpo21 Med2 Yap6 Srb7 Ssn2 Srb4 Nut1 Cwh41 Pgd1 Ygr090w Srb5 Srb2 Med6 Tim44 Rpb3 Gtt1 Ime1 Rgr1 Yap1 Arg80 Mcm1 Med11 Caf40 Sin4 Asi2 Ynl086w End3 Ssn8 Cit1 Cse2 Urk1 Pop2 Med7 Alr1 Gal11 Ckb2 Med4 Cpa1 Ssn3 Med1 Nut2 Rsc8

>19|MCD Generated Complex|MCDOutput

Uso1 Sly1 Sec20 Sec34 Sec35 Sec24 Bet1 Ykt6 Sed5 Bos1 Sec22 Sec21 Ufe1 Rud3 Sly41 Sar1 Sec23 Tip20 Ypt1

>55|MCD Generated Complex|MCDOutput

Sef1 Lsm2 Ybr094w Sif2 Rad18 Air2 Dhh1 Cop1 Trm3 Dop1 Sec26 Lsm6 Sac7 Yel015w Gcd11 Lsm4 Spr6 Dse1 Lsm5 Kem1 Ceg1 Smd1 Prp31 Prp8 Neo1 Far1 Gzf3 Iml1 Ste6 Snu114 Bas1 Shm2 Rps28b Smd2 Vac14 Lsm3 Zds2 Prp24 Hsh155 Sec21 Gcr2 Lsm7 Top2 Dcp1 Ckb2 Sme1 Rps28a Ris1 Yor320c Bem3 Prp4 Gdb1 Pat1 Yfl066c Smx2

>13|MCD Generated Complex|MCDOutput

Spp381 Gcn2 Prp3 Smb1 Smd1 Prp31 Prp8 Snu114 Smd2 Snu66 Dib1 Prp4 Smx3

>12|MCD Generated Complex|MCDOutput

Arx1 Nug1 Nsa2 Lsg1 Sda1 Bud20 Mdn1 Ynl182c Rlp7 Nog2 Nog1 Ycr072c

>10|MCD Generated Complex|MCDOutput

Pre7 Ubc4 Pph22 Pre1 Sel1 Pup2 Pre3 Pup1 Pre2 Pre4

>6|MCD Generated Complex|MCDOutput

Sec1 Sec9 Sso2 Snc2 Sso1 Sro7

Table 5 - Page 2

Table 5 - Page 3

*Vertices 46
1 "Rpn4" ic Gray
2 "Pre6" ic Gray
3 "Pup3" ic Gray
4 "Rpn10" ic Gray
5 "Gcn4" ic Black
6 "Ura7" ic Black
7 "Hsm3" ic Yellow
8 "Hyp2" ic Blue
9 "Rpn5" ic Gray
10 "Ypl070w" ic Black
11 "Nas6" ic Gray
12 "Pre5" ic Gray
13 "Rpt3" ic Gray
14 "Rpn9" ic Gray
15 "Rad23" ic Yellow
16 "Sel1" ic Gray
17 "Rpn12" ic Gray
18 "Dbf2" ic Red
19 "Rpn11" ic Gray
20 "Pre8" ic Gray
21 "Rpt5" ic Gray
22 "Rpt6" ic Gray
23 "Rpn13" ic Gray
24 "Rfc4" ic Yellow
25 "Rfc3" ic Yellow
26 "Pre9" ic Gray
27 "Rpt4" ic Gray
28 "Pre2" ic Gray
29 "Rpn7" ic Gray
30 "Rpn3" ic Gray
31 "Rpn8" ic Gray
32 "Rpn6" ic Gray
33 "Ecm29" ic Black
34 "Rpn1" ic Gray
35 "Rpt1" ic Gray
36 "Rpt2" ic Gray
37 "Ygl004c" ic Black
38 "Pre1" ic Gray
39 "Pre7" ic Gray
40 "Mkt1" ic Black
41 "Lhs1" ic Black
42 "Rad50" ic Yellow
43 "Rad24" ic Black
44 "Yku70" ic Yellow
45 "Pre4" ic Gray
46 "Ubp6" ic Gray
*Edges

1 4 1 w 1
1 17 1 w 1
2 38 1 w 1
2 20 1 w 1
2 4 1 w 1
2 16 1 w 1

Table 5 - Page 4

Table 5 - Page 5

Rank  Name   Density    Core Density (CD)  Core Level (CL)  Score (CD*CL)
1     Taf12  0.566667   0.87619            13               11.3905
2     Tra1   0.43083    0.87619            13               11.3905
3     Taf6   0.256684   0.87619            13               11.3905
4     Ngg1   0.327586   0.87619            13               11.3905
5     Spt20  0.672515   0.87619            13               11.3905
6     Taf9   0.40942    0.87619            13               11.3905
7     Taf5   0.230159   0.87619            13               11.3905
8     Spt8   0.87619    0.87619            13               11.3905
9     Spt3   0.75       0.87619            13               11.3905
10    Gcn5   0.358974   0.87619            13               11.3905
11    Ada2   0.344828   0.87619            13               11.3905
12    Hfi1   0.8        0.87619            13               11.3905
13    Spt7   0.413043   0.87619            13               11.3905
14    Taf10  0.393162   0.87619            13               11.3905
15    Doc1   0.75641    0.878788           10               8.78788
16    Cdc26  0.659341   0.878788           10               8.78788
17    Apc4   0.878788   0.878788           10               8.78788
18    Cdc27  0.878788   0.878788           10               8.78788
19    Cdc16  0.525      0.878788           10               8.78788
20    Apc1   0.75641    0.878788           10               8.78788
21    Cdc23  0.320346   0.878788           10               8.78788
22    Apc11  0.75641    0.878788           10               8.78788
23    Apc5   0.878788   0.878788           10               8.78788
24    Apc9   0.580952   0.878788           10               8.78788
25    Apc2   0.492647   0.878788           10               8.78788
26    Cft2   0.654412   0.780952           10               7.80952
27    Pap1   0.437229   0.780952           10               7.80952
28    Pta1   0.489474   0.780952           10               7.80952
29    Lsm8   0.0664683  0.836364           9                7.52727
30    Lsm1   0.103608   0.836364           9                7.52727
31    Lsm7   0.303333   0.833333           9                7.5
32    Lsm2   0.0894214  0.833333           9                7.5
33    Lsm3   0.26087    0.833333           9                7.5
34    Lsm6   0.297101   0.833333           9                7.5
35    Dcp1   0.383399   0.833333           9                7.5
36    Pat1   0.166387   0.833333           9                7.5
37    Kem1   0.529412   0.833333           9                7.5
38    Lsm5   0.156472   0.833333           9                7.5
39    Lsm4   0.132404   0.833333           9                7.5
40    Rpt1   0.32197    0.735294           10               7.35294
41    Rpn10  0.264634   0.735294           10               7.35294
42    Rpt3   0.169683   0.735294           10               7.35294
43    Rpn6   0.370115   0.735294           10               7.35294
44    Yth1   0.558333   0.794872           9                7.15385

Table 5 - Page 6

Abp1 Cla4

Abp1 Rvs167

Abp1 Srv2

Abp1 Ynl094w

Abp1 Yor284w

Acf2 Rvs167

Act1 Cof1

Act1 Las17

Act1 Pfy1

Act1 Rvs167

Act1 Srv2

Aip1 Srv2

Yil079c Cdc24

Apg17 Apg17

Apg17 Exo84

Apg17 Myo1

Apg17 Nip100

Apg17 Rho1

Apg17 Rho2

Apg1 Sro77

Apg17 Dad2

Apg7 Sso2

Arp1 Nip100

Bcy1 Sro77

Bem1 Boi2

Bem1 Cdc24

Bem1 Cdc42

Bem1 Far1

Bem1 Sec15

Bem1 Swe1

Bem1 Yel043

Bem1 Caf130

Bub2 Gid

Bud2 Cln2

Bud8 Ste20

Bud8 Ykl082c

Cap1 Cap2

Cap1 Gic2

Cap1 Ypr171w

Cdc11 Bem4

Cdc11 Cdc12

Cdc11 Nfi1

Cdc11 Spr28

Cdc11 Yor084w

Cdc11 Zds2

Cdc12 Bem4

Cdc12 Cdc12

Cdc12 Cla4

Table 5 - Page 7

Beginner Pajek Instructions

Pajek is a program for large network analysis that can be applied to the study of biomolecular interaction networks, such as protein interaction networks.

Setup:

To set up Pajek on your Windows-compatible computer:

1. Download Pajek from http://vlado.fmf.uni-lj.si/pub/networks/pajek/

2. Follow instructions to install Pajek

3. Run Pajek

4. Load a network (follow instructions below)

5. Draw network (follow instructions below)

6. Change the following default settings for Pajek in the 'Options' menu of the draw window to make your networks look nicer.

7. Options->Colors->BackGround - change to white

8. Options->Colors->vertices - change to 'As defined on input file'

9. Options->Colors->edge - change to 'As defined on input file'

10. Options->Colors->arcs - change to 'As defined on input file'

11. Options->Size->Vertices - change to '6'

12. Options->Size->Arrows - change to '10'

13. Options->Size->Font - change to '10'

The Basics

Pajek contains three windows that you should be familiar with:

1. Main window: This appears when you first start the program and is used to load and save networks and to perform analysis.

2. Results window: Messages from Pajek are displayed in this window.

3. Draw window: This window appears when you select draw from the main window 'Draw' menu.

Load a network: Click the button with a picture of a file folder on it in the Network section of the main window and pick a Pajek .net file.

Save a network: Click the button with the floppy disk on it in the Network section of the main window. Networks are saved with their spatial layouts intact.

Display a network: Select Draw from the Draw menu of the main window.

Network visual layout: In the Draw window, you can automatically lay out a network. A network layout uses a computer algorithm to reposition nodes (circles) and edges (lines) to try to minimize visual overlap. This does not work well for very large networks. You can choose to lay out a network using some of the algorithms in the Layout menu of the Draw window such as Layout->Energy->Fruchterman Reingold and Layout->Energy->Kamada-Kawai.

Rotating the network in the Draw window:

Press x, X, y, Y, z or Z to rotate the network in the appropriate direction.

Zooming:

Select Options->Transform->Resize from the Draw window Options menu. You can then enter the factor you want to zoom in each of the x, y or z directions. Other options are also available from the Options->Transform menu.

Manipulating nodes: You can move nodes using the mouse button. Ctrl-L turns on labels and Ctrl-D turns the labels off.

Table 5 - Page 8

Network Analysis

Getting information about a network: Select Info->Network->General from the Main window. Click Ok on the resulting dialog box. The network information appears in the Results window as follows:

Number of vertices (n): 42
Number of arcs: 40
0 loops
Number of edges: 0
0 loops
Density1 [loops allowed] = 0.0226757
Density2 [no loops allowed] = 0.0232288

The number of vertices (nodes, circles) is shown. This is the number of proteins in a protein interaction network. Edges (lines) represent interactions and arcs are directed edges. Loops are homodimeric interactions. Density of a network is the ratio of actual edges present in a network to the number of possible interactions in a network.
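The two density values in the sample output can be reproduced from the vertex and arc counts alone. The short Python check below assumes that, for directed arcs, the number of possible interactions is n x n when loops are allowed and n x (n - 1) when they are not. This assumption is inferred from the numbers shown rather than taken from the Pajek documentation, and the function name is hypothetical.

```python
def network_densities(n_vertices, n_arcs):
    """Return (density with loops allowed, density without loops) for directed arcs."""
    with_loops = n_arcs / (n_vertices * n_vertices)
    without_loops = n_arcs / (n_vertices * (n_vertices - 1))
    return with_loops, without_loops

d1, d2 = network_densities(42, 40)
print(f"{d1:.7f}")  # 0.0226757
print(f"{d2:.7f}")  # 0.0232288
```

With 42 vertices and 40 arcs this gives 40/1764 and 40/1722, matching the Density1 and Density2 lines above.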

Find the diameter of the network: The diameter is the longest minimal-length path between any two nodes in the network. From the Net menu, click Net->Paths between two vertices->Diameter. The results from this operation can be seen in the Results window, e.g. "Longest path from xxx(yy) to aaa(bb). Diameter is d", where xxx and aaa are protein names and yy and bb are their node numbers in the network.
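For readers who want to check a diameter outside Pajek, the definition above can be sketched in Python with a breadth-first search from every vertex. This is a minimal illustration for small, unweighted, undirected networks and is not how Pajek computes the value. The function names are hypothetical. The four example interactions are taken from the interaction list in this table.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path length, in edges, from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def diameter(adj):
    """Return (length, start, end) of the longest shortest path in the network."""
    best = (0, None, None)
    for u in sorted(adj):
        for v, d in bfs_distances(adj, u).items():
            if d > best[0]:
                best = (d, u, v)
    return best

# Build an undirected adjacency list from a few interaction pairs.
pairs = [("Abp1", "Srv2"), ("Act1", "Cof1"), ("Act1", "Pfy1"), ("Act1", "Srv2")]
adj = {}
for a, b in pairs:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

length, start, end = diameter(adj)
print(f"Longest path from {start} to {end}. Diameter is {length}")
```

Here the longest shortest path runs Abp1, Srv2, Act1, Cof1, so the diameter is 3.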

Looking at paths between two proteins:

From the Net menu: Click "Net->Paths between two vertices->One shortest" and enter either the labels (protein names) or the vertex numbers of your proteins of interest when prompted. Pajek will ask some questions about the operation; the default answers are fine. When you draw the network again, only the shortest path between your selected proteins will be shown.

Finding the cores of a network:

Go to the Net menu and select Net->Partitions->Core->all (the options input, output and all are only relevant in a graph with arrows). A partition file will appear in the Partition section of the main window. You can get information about this partition in the Info menu. To extract a core, select from the Operations menu, Operations->Extract From Network->Partition and type the start and end partition to extract. E.g. if the highest core in the network is a 7 core, then you can extract that 7 core by extracting partition 7 to 7. If you wanted the 6 core of the same network, you can extract partition 6 to 7.

This simple set of instructions explains the basic Pajek operations and represents a tiny sampling of this powerful program. A more detailed manual on Pajek geared for graph theorists is available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/pajekman.htm
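The core partition and the "extract partition low to high" step described above can be mimicked in a few lines of Python. The sketch assigns each vertex its core number by repeatedly peeling low-degree vertices, then selects the vertices whose core number falls in the requested range. It is a simple illustration for undirected networks, not Pajek's own procedure, and the function names and example graph are hypothetical.

```python
def core_numbers(adj):
    """Core number of each vertex: the largest k for which it belongs to the k-core."""
    remaining = {v: set(ns) - {v} for v, ns in adj.items()}  # ignore self-loops
    core = {}
    k = 0
    while remaining:
        low = [v for v in remaining if len(remaining[v]) <= k]
        if not low:
            k += 1  # nothing left to peel at this level
            continue
        for v in low:
            core[v] = k
            for w in remaining.pop(v):
                remaining[w].discard(v)
    return core

def extract_partition(core, low, high):
    """Vertices whose core number lies between low and high, inclusive."""
    return {v for v, k in core.items() if low <= k <= high}

# Hypothetical network: a triangle A-B-C with a pendant vertex D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
core = core_numbers(adj)
print(core)                                   # D has core number 1, the others 2
print(sorted(extract_partition(core, 2, 2)))  # ['A', 'B', 'C']
print(sorted(extract_partition(core, 1, 2)))  # ['A', 'B', 'C', 'D']
```

As in the Pajek example, extracting from the highest partition alone yields the highest core, while widening the range downward yields the next lower core.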

File Format

It is possible to edit the Pajek .net files in a text editor like Notepad for Windows or SimpleText for MacOS to create your own networks that can then be visualized using Pajek. More information about the Pajek file format is available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/draweps.htm
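Instead of typing a .net file by hand, a small script can generate one from a list of interaction pairs. The Python sketch below emits only the elements that appear in the example file shown earlier in this table: a *Vertices section with numbered, quoted labels and an optional ic colour, followed by an *Edges section with one line per interaction and a value of 1. The function name and its colors argument are hypothetical, and other Pajek file features are not covered.

```python
def to_pajek_net(edges, colors=None):
    """Return the text of a minimal Pajek .net file for an undirected edge list."""
    colors = colors or {}
    names = sorted({name for pair in edges for name in pair})
    index = {name: i for i, name in enumerate(names, start=1)}  # Pajek numbers from 1
    lines = [f"*Vertices {len(names)}"]
    for name in names:
        color = colors.get(name)
        suffix = f" ic {color}" if color else ""
        lines.append(f'{index[name]} "{name}"{suffix}')
    lines.append("*Edges")
    for a, b in edges:
        lines.append(f"{index[a]} {index[b]} 1")
    return "\n".join(lines) + "\n"

# Two interactions from the list in this table; the colour choice is arbitrary.
text = to_pajek_net([("Cap1", "Cap2"), ("Cdc11", "Cdc12")], colors={"Cap1": "Gray"})
print(text)
```

The returned string can be saved with a .net extension and loaded from the Network section of the Pajek main window.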

Exporting Figures

You can export figures from the Draw menu in Pajek. If you want a quick screen shot, you can export as a bitmap. For figure quality, the best results come from exporting EPS/PS or SVG graphics and importing the result into e.g. the latest version of CorelDraw or Adobe Illustrator. Once in these powerful drawing programs, you can recolor, edit and change the fonts in the resulting PS file.

References

1. Fields S: Proteomics. Proteomics in genomeland. Science 2001, 291: 1221-1224.

2. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 623-627.

3. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98: 4569-4574.

4. Drees BL, Sundin B, Brazeau E, Caviston JP, Chen GC, Guo W et al.: A protein interaction map for cell polarity development. J Cell Biol 2001, 154: 549-571.

5. Fromont-Racine M, Mayes AE, Brunet-Simon A, Rain JC, Colley A, Dix I et al.: Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast 2000, 17: 95-110.

6. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415: 180-183.

7. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415: 141-147.

8. Christendat D, Yee A, Dharamsi A, Kluger Y, Savchenko A, Cort JR et al.: Structural proteomics of an archaeon. Nat Struct Biol 2000, 7: 903-909.

9. Kim SK, Lund J, Kiraly M, Duke K, Jiang M, Stuart JM et al.: A gene expression map for Caenorhabditis elegans. Science 2001, 293: 2087-2092.

10. Tong AH, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L et al.: A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002, 295: 321-324.

11. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B et al.: Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 1999, 285: 901-906.

12. Chervitz SA, Hester ET, Ball CA, Dolinski K, Dwight SS, Harris MA et al.: Using the Saccharomyces Genome Database (SGD) for analysis of protein similarities and structure. Nucleic Acids Res 1999, 27: 74-78.

13. Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A et al.: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000, 28: 37-40.

14. Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS et al.: YPD, PombePD and WormPD: model organism volumes of the BioKnowledge library, an integrated resource for protein information. Nucleic Acids Res 2001, 29: 75-79.

15. Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW: BIND-The biomolecular interaction network database. Nucleic Acids Res 2001, 29: 242-245.

16. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 2002, 30: 303-305.

17. Takai-Igarashi T, Nadaoka Y, Kaminuma T: A database for cell signaling networks. J Comput Biol 1998, 5: 747-754.

18. Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V et al.: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 2000, 28: 316-319.

19. Karp PD, Riley M, Saier M, Paulsen IT, Paley SM, Pellegrini-Toole A: The EcoCyc and MetaCyc databases. Nucleic Acids Res 2000, 28: 56-59.

20. Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov EJ, Kyrpides N et al.: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28: 123-125.

21. Wagner A, Fell DA: The small world inside large metabolic networks. Proc R Soc Lond B Biol Sci 2001, 268: 1803-1810.

22. Flake GW, Lawrence S, Giles CL, Coetzee FM: Self-Organization of the Web and Identification of Communities. IEEE Computer 2002, 35: 66-71.

23. Goldberg AV: Finding a Maximum Density Subgraph. Technical Report UCB/CSD University of California, Berkeley, CA 1984, 84.

24. Ng A, Jordan M, Weiss Y: On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14: Proceedings of the 2001 2001.

25. Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393: 440-442.

26. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407: 651-654.

27. Albert R, Jeong H, Barabasi AL: Error and attack tolerance of complex networks. Nature 2000, 406: 378-382.

28. Barabasi AL, Albert R: Emergence of scaling in random networks. Science 1999, 286: 509-512.

29. Fell DA, Wagner A: The small world of metabolism. Nat Biotechnol 2000, 18: 1121-1122.

30. Hartuv E, Shamir R: A clustering algorithm based on graph connectivity. Information processing letters 1999, 76: 175-181.

31. Bader GD, Hogue CW: Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol 2002, 20: 991-997.

32. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16: 412-424.

33. Robinson RC, Turbedsky K, Kaiser DA, Marchand JB, Higgs HN, Choe S et al.: Crystal structure of Arp2/3 complex. Science 2001, 294: 1679-1684.

34. Mayes AE, Verdone L, Legrain P, Beggs JD: Characterization of Sm-like proteins in yeast and their association with U6 snRNA. EMBO J 1999, 18: 4321-4331.

35. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S et al.: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417: 399-403.

36. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411: 41-42.

37. Gonzalez F, Delahodde A, Kodadek T, Johnston SA: Recruitment of a 19S proteasome subcomplex to an activated promoter. Science 2002, 296: 548-550.

38. Bochtler M, Ditzel L, Groll M, Hartmann C, Huber R: The proteasome. Annu Rev Biophys Biomol Struct 1999, 28: 295-317.

39. Batagelj V, Mrvar A: Pajek - Program for Large Network Analysis. Connections 1998, 2: 47-57.

40. Kamada T, Kawai S: An algorithm for drawing general undirected graphs. Information processing letters 1989, 31: 7-15.

41. Dobzhansky T: Nothing in Biology Makes Sense Except in the Light of Evolution. American Biology Teacher 1973, 35: 125-129.

42. The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genet 2000, 25: 25-29.

43. Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001, 29: 137-140.

Claims

WE CLAIM:

1. A computer-implemented method for identifying relationships in a data set of objects comprising detecting densely connected regions in the data set based solely on connectivity data of the objects.

2. A computer-implemented method of claim 1 wherein the objects are represented as vertices and the relationships between the objects are represented by interconnecting edges, the method comprising performing a density operation on the data set of objects and traversing outward from a selected seed vertex to detect densely connected regions.

3. A computer-implemented method of claim 1 or 2 wherein the objects are biological molecules.

4. A computer-implemented method of claim 1 or 2 wherein the data set of objects is a biological molecular interaction network.

5. A computer-implemented method for identifying relationships between objects in a data set of objects comprising:

6. A computer-implemented method for identifying objects that form a relationship comprising:

(a) implementing a relational graph comprising a data set of one or more known objects of a relationship and unknown objects that are not known to be part of the relationship, wherein the objects are represented as vertices and the relationship between the objects is represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices representing the known objects and unknown objects wherein each vertex is assigned a weight based on its local network density;

(c) performing an operation on the vertex weighted graph to identify unknown objects in the relationship by recursively moving outward from a seed vertex representing a known object in the relationship, based on a threshold assigned weight, to add other vertices representing unknown objects in the relationship;

7. A computer-implemented method as claimed in claim 5 or 6 wherein in the subgraph at least two of the vertices represent different types of objects.

8. A computer-implemented method as claimed in claim 5 or 6 wherein in step (c) relationships that do not contain at least two edges are filtered out.

9. A computer-implemented method as claimed in claim 5 or 6, further comprising adding objects to the relationships and/or removing objects from the relationships.

10. A computer-implemented method as claimed in claim 5 or 6 wherein in step (c) objects are added that have not been encountered when recursively moving outward from the seed vertex.

11. A computer-implemented method as claimed in claim 5 or 6, wherein in step (c) objects are removed that are connected to the seed vertex by a single edge.

12. A computer-implemented method as claimed in claim 5 or 6 wherein the relationships in (c) are assigned a score and are ranked.

13. A computer-implemented method as claimed in claim 12 wherein the score is the product of the subgraph density [C=(V,E) where V = the number of vertices and E = the number of edges], and the number of vertices in the subgraph.

14. A computer-implemented method as claimed in any one of claims 5 to 13 further comprising an operation to visualize data from a vertex weighted graph and/or subgraph.

15. A computer-implemented method as claimed in any preceding claim wherein the data set of objects is scale free.

16. A computer-implemented method for identifying molecular complexes in a data set of biological molecules comprising:

(a) implementing a relational graph comprising the data set of biological molecules represented as vertices and the relationship between the biological molecules represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices from the relational graph wherein each vertex is assigned a weight based on its local network density; and

(c) performing an operation on a vertex weighted graph comprising choosing a seed vertex with a selected assigned weight and recursively moving outward from the seed vertex to include vertices based on a threshold assigned weight, to produce a molecular complex subgraph embodying molecular complexes with biological molecules in the molecular complexes represented by vertices and the relationship between the biological molecules in the molecular complexes represented by edges interconnecting the vertices.

17. A computer-implemented method as claimed in claim 16 wherein in the molecular complex subgraph at least two of the vertices represent different types of biological molecules.

18. A computer-implemented method as claimed in claim 16 wherein in step (c) molecular complexes that do not contain at least two edges are filtered out.

19. A computer-implemented method as claimed in claim 16, further comprising adding biological molecules to the molecular complexes and/or removing biological molecules from the molecular complexes.

20. A computer-implemented method as claimed in claim 16 wherein in step (c) biological molecules are added that have not been encountered when recursively moving outward from the seed vertex.

21. A computer-implemented method as claimed in claim 16, wherein in step (c) biological molecules are removed that are connected to the seed vertex by a single edge.

22. A computer-implemented method as claimed in any one of claims 16 to 21 wherein the molecular complexes in (c) are assigned a score and are ranked.

23. A computer-implemented method as claimed in claim 22 wherein the score is the product of the molecular complex subgraph density [C=(V,E) where V = the number of vertices and E = the number of edges], and the number of vertices in the subgraph.

24. A computer-implemented method as claimed in any of claims 16 to 23 further comprising an operation to visualize data from a vertex weighted graph and/or molecular complex subgraph.

25. A computer-implemented method as claimed in any of claims 16 to 24 wherein the data set of biological molecules is scale free.

26. A computer-implemented method as claimed in any preceding claim wherein the local network density is based on the highest k-core of the neighbourhood of a vertex, wherein the k-core is a network of minimum degree k.

27. A computer-implemented method as claimed in claim 26 wherein the local network density is based on a core clustering coefficient of a vertex v defined as the density of the highest k-core of the immediate vertices connected to v, including v.
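The core clustering coefficient of claims 26 and 27 may be sketched as follows, as an illustration rather than a definitive implementation. The sketch assumes an undirected graph without self-loops stored as a mapping from each vertex to its set of neighbours, and assumes density is edges divided by V(V-1)/2. The function names are hypothetical.

```python
def highest_k_core(adjacency, vertices):
    """Return (k, core) for the highest k-core of the subgraph induced on
    `vertices`, found by repeatedly peeling vertices of degree below k."""
    remaining = set(vertices)
    best_k, best_core = 0, set(remaining)
    k = 1
    while remaining:
        changed = True
        while changed:
            low = {v for v in remaining if len(adjacency[v] & remaining) < k}
            changed = bool(low)
            remaining -= low
        if remaining:
            best_k, best_core = k, set(remaining)
        k += 1
    return best_k, best_core


def density(adjacency, vertices):
    """Edges among `vertices` divided by the maximum possible (no self-loops)."""
    n = len(vertices)
    if n < 2:
        return 0.0
    edges = sum(len(adjacency[v] & vertices) for v in vertices) / 2
    return edges / (n * (n - 1) / 2)


def core_clustering_coefficient(adjacency, v):
    """Return (k, density of the highest k-core) for the neighbourhood of v,
    where the neighbourhood includes v itself."""
    neighbourhood = adjacency[v] | {v}
    k, core = highest_k_core(adjacency, neighbourhood)
    return k, density(adjacency, core)


# Hypothetical toy graph: a triangle A-B-C with a tail C-D-E.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
k_a, cc_a = core_clustering_coefficient(adjacency, "A")
k_d, cc_d = core_clustering_coefficient(adjacency, "D")
```

For vertex A the neighbourhood is the triangle itself, giving a 2-core of density 1.0. For vertex D the neighbourhood is a three-vertex path, whose highest core is only a 1-core of density 2/3.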

28. A computer-implemented method as claimed in claim 27 wherein k is at least 2.

29. A computer-implemented method as claimed in claim 27 wherein k represents the highest value corresponding to the most densely connected region of the relational graph.

30. A computer-implemented method as claimed in any preceding claim wherein the density operation in step (b) amplifies the weighting of heavily interconnected graph regions while removing noise.

31. A computer-implemented method as claimed in any preceding claim wherein the threshold assigned weight in step (c) is determined as a percentage of the weight of the seed vertex.
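One way to read claim 31 is sketched below. It assumes the percentage expresses how far a vertex weight may fall below the seed weight, so that the threshold is the seed weight multiplied by (1 - percentage). This interpretation, and the names `weight_threshold` and `vertex_weight_percentage`, are assumptions. The claim itself states only that the threshold is a percentage of the seed weight.

```python
def weight_threshold(seed_weight, vertex_weight_percentage):
    """Threshold weight derived from the seed weight.

    Assumed reading: vertices may deviate from the seed weight by at most
    `vertex_weight_percentage` (a fraction between 0 and 1).
    """
    return seed_weight * (1.0 - vertex_weight_percentage)


# Hypothetical weights; the seed weight is 4.0 and a 25% deviation is allowed.
weights = {"A": 4.0, "B": 3.5, "C": 3.0, "D": 2.0}
threshold = weight_threshold(weights["A"], 0.25)
eligible = {v for v, w in weights.items() if w >= threshold}
```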

32. A computer-implemented method of claim 3 for identifying biological molecules that form a molecular complex comprising:

(a) implementing a relational graph comprising a data set of one or more known biological molecules of the molecular complex and unknown biological molecules that are not known to be part of the molecular complex, wherein the biological molecules are represented as vertices and the relationship between the biological molecules is represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices representing the known biological molecules and unknown biological molecules wherein each vertex is assigned a weight based on its local network density;

(c) performing an operation on the vertex weighted graph to identify unknown biological molecules in the molecular complex by recursively moving outward from a seed vertex representing a known biological molecule in the molecular complex, based on a threshold assigned weight, to add other vertices representing unknown biological molecules;

(d) optionally repeating step (c) for the other vertices until no more vertices can be added based on the threshold assigned weight; and

(e) producing a molecular complex subgraph embodying known and unknown biological molecules in the molecular complex that are represented by the vertices identified in (c) and (d).

33. A computer-implemented method as claimed in claim 32 wherein the seed vertex represents a biological molecule associated with a disease.

34. A computer-implemented method of claim 3 for determining the putative function of a selected biological molecule comprising:

(a) implementing a relational graph comprising a data set comprising the selected biological molecule and other biological molecules that potentially may form molecular complexes with the selected biological molecule, wherein the biological molecules are represented as vertices and the relationship between the biological molecules is represented by edges interconnecting the vertices;

(b) performing a density operation on the relational graph to produce a vertex weighted graph embodying vertices representing the selected biological molecule and other biological molecules wherein each vertex is assigned a weight based on its local network density; and

(d) producing a molecular complex subgraph embodying molecular complexes comprising the selected biological molecule and other biological molecules; and

(e) determining the putative function of the selected biological molecule based on the structure or function of the other biological molecules in the molecular complexes.

35. A computer program product for performing a method as claimed in any one of claims 1 to 34.

36. A computer program product for performing a density operation upon one or more relational graphs wherein the relational graphs comprise a data set of objects represented as vertices and the relationship between the objects represented by edges interconnecting the vertices, and wherein the computer program product comprises a computer data medium on which is carried a means for performing a density operation on the relational graphs to produce one or more weighted vertex graphs embodying vertices from the relational graph wherein each vertex is assigned a weight based on its local network density.

37. A computer program product as claimed in claim 36 wherein the computer data medium further comprises a means for performing an operation on the vertex graph to choose a vertex with a selected assigned weight and to recursively move outward from the selected vertex to include vertices based on a threshold assigned weight, thereby producing a subgraph embodying relationships between objects with the objects represented by vertices and the relationship between the objects represented by edges interconnecting the vertices.

38. A computer program product as claimed in claim 36 or 37 wherein the objects are biological molecules.

39. A computer program product as claimed in claim 36 wherein the objects are biological molecules and wherein the computer data medium further comprises a means for performing an operation on the vertex graph to choose a vertex with a selected assigned weight and to recursively move outward from the selected vertex to include vertices based on a threshold assigned weight, thereby producing a molecular complex subgraph embodying molecular complexes with biological molecules in the molecular complexes represented by vertices and the relationship between the biological molecules in the molecular complexes represented by edges interconnecting the vertices.

40. A computer program product comprising a computer data medium on which is carried means for identifying a subset of objects in densely connected regions of a relational graph and a means for performing a density operation upon the subset to produce a subgraph embodying relationships between the objects with objects represented by vertices and the relationship between the objects represented by edges interconnecting the vertices.

41. A computer program product as claimed in claim 40 comprising a computer data medium on which is carried means for identifying a subset of biological molecules in densely connected regions of a relational graph and a means for performing a density operation upon the subset to produce a molecular complex subgraph embodying molecular complexes with biological molecules in the molecular complexes represented by vertices and the relationship between the biological molecules in the molecular complexes represented by edges interconnecting the vertices.

42. A computer program product as claimed in claim 41 wherein in the molecular complex subgraph at least two of the vertices represent different types of biological molecules and/or at least two edges represent different types of relationships between the biological molecules.

43. A system for electronically identifying and optionally visualizing related objects in a data set comprising:

(a) a computer having relational graphs comprising the objects wherein the objects are represented by vertices and the relationship between the objects represented by edges interconnecting the vertices;

(b) a density operation for processing such relational graphs to produce vertex weighted graphs embodying vertices from the relational graphs wherein each vertex is assigned a weight based on its local network density;

(c) an operation for processing vertex weighted graphs to choose a seed vertex with a selected assigned weight and recursively moving outward from the seed vertex to include vertices based on a threshold assigned weight, to produce a subgraph embodying relationships with the objects in the relationships represented by vertices and the relationship between the objects represented by edges interconnecting the vertices; and

(d) an imaging operation for producing images of the relationships; the system, in response to data requests, creating and transmitting to a plurality of end-users, the images defining the relationships.

44. A system for electronically identifying and optionally visualizing molecular complexes in a data set of biological molecules comprising:

(a) a computer having relational graphs comprising the biological molecules wherein the biological molecules are represented by vertices and the relationship between the biological molecules represented by edges interconnecting the vertices;

(b) a density operation for processing such relational graphs to produce vertex weighted graphs embodying vertices from the relational graphs wherein each vertex is assigned a weight based on its local network density;

(c) an operation for processing vertex weighted graphs to choose a seed vertex with a selected assigned weight and recursively moving outward from the seed vertex to include vertices based on a threshold assigned weight, to produce a molecular complex subgraph embodying molecular complexes with biological molecules in the molecular complexes represented by vertices and the relationship between the biological molecules in the molecular complexes represented by edges interconnecting the vertices; and

(d) an imaging operation for producing images of the molecular complexes; the system, in response to data requests, creating and transmitting to a plurality of end-users, the images defining the molecular complexes.

45. A method for displaying on a computer screen information concerning molecular complexes comprising retrieving an image defining a molecular complex from a system as claimed in claim 43 or 44.