I need to use validated PPI data in my research. Is there a database of experimentally validated (for example, by mass spectrometry) PPI data for cancer?
One of the most popular PPI databases is STRING. It is not specific to cancer, but because so much research is done in that field, you will find a lot of data derived from cancer research.
This site provides a platform to facilitate discovery of new mechanisms to control tumorigenesis through the integration of genomic, pharmacological, clinical, and structural data with the network of cancer-associated protein-protein interactions experimentally detected in cancer cells.
Li et al. (2017) The OncoPPi network of cancer-focused protein-protein interactions to inform biological insights and therapeutic strategies. Nat Commun 8: 14356.
Yeast Interactome Project
The interactome of an organism is the network formed by the full set of physical interactions that can occur in a physiologically relevant dynamic range between all its macromolecules, including protein-protein, DNA-protein, and RNA-protein interactions. Our goal is to develop a more complete yeast protein-protein interaction (PPI) map, leveraging the knowledge we’ve gained from systematic binary interaction mapping efforts. Yeast affords us the opportunity to interrogate a comprehensive set of protein-coding genes and to integrate our systematic interaction data with high-quality literature-curated information and with other systematic datasets that are amenable to orthogonal validation. The generation of interactome network maps with increasing sensitivity and quality is a necessary, albeit not sufficient, step in the quest to generate predictive macromolecular models at the scale of the whole cell. We developed a conceptual framework to assess interactome models based on quantitative benchmarking of screening and validation assays against reference sets. With this we have shown that the set of high-quality, systematic yeast interactome data sets covers ~20% of the predicted yeast binary interactome, albeit interrogating less than 75% of yeast protein-coding genes. Utilizing novel technologies, we aim to extend coverage to 50% to enable a profoundly improved understanding of the interactome network of budding yeast. For quality control, binary interactions are validated in standardized orthogonal interaction assays, allowing us to generate and analyze a third-generation, high-quality binary interactome map of S. cerevisiae.
YI-I: We carried out a comparative quality assessment of current yeast interactome datasets, demonstrating that high-throughput yeast two-hybrid (Y2H) provides high-quality binary interaction information. As a large fraction of the yeast binary interactome remains to be mapped, we developed an empirically controlled mapping framework to produce a "second-generation" high-quality, high-throughput Y2H dataset covering ~20% of all yeast binary interactions. Both Y2H and affinity purification followed by mass spectrometry (AP/MS) data are of equally high quality but of a fundamentally different and complementary nature, resulting in networks with different topological and biological properties. This binary map is enriched for transient signaling interactions and inter-complex connections, with a highly significant clustering between essential proteins. Rather than correlating with essentiality, protein connectivity correlates with genetic pleiotropy.
Graphical representation of three different types of yeast interactome datasets. The structure of the binary interactome network is obviously different from the structure of the co-complex interactome network. The network structure of the literature-curated dataset resembles that of the co-complex dataset, even though the literature-curated datasets are reported to contain mostly binary interactions. All datasets can be downloaded here.
Interactome Projects at CCSB
Human interactome mapping is the flagship project of CCSB. A first map of the human binary interactome (Rual et al Nature 2005) was obtained by yeast two-hybrid (Y2H) screening for direct, binary interactions within a "Space I" matrix of ~8,000 x 8,000 ORFs contained in Human ORFeome v1.1 (Rual et al Genome Res 2004). We have developed an empirical framework that quantitatively measures the parameters of screening completeness, assay sensitivity, sampling sensitivity, and precision, and used this framework to estimate the size of the human binary interactome as ~130,000 ± 32,000 binary interactions (Venkatesan et al Nat Methods 2009). Using a novel next-generation sequencing strategy to identify interaction pairs (Yu et al Nat Methods 2011), we have carried out a Y2H screen for interactions within a “Space II” matrix of ~13,000 x 13,000 ORFs contained in Human ORFeome v5.1. We report ~14,000 new direct, binary interactions (Rolland et al Cell 2014), bringing the total number of unique binary interactions to
Viruses intrinsically depend on their host cell during the course of infection and can elicit pathological phenotypes similar to those arising from mutations (Gulbahce et al PLoS Comput Biol 2012). We applied a systematic integrated pipeline to investigate, at genome scale, perturbations of host interactome networks induced by individual gene products encoded by members of four functionally related, yet biologically distinct, families of DNA tumor viruses: polyomaviruses, papillomaviruses, adenoviruses, and Epstein-Barr virus (Rozenblatt-Rosen et al Nature 2012). By yeast two-hybrid we screened 123 viral ORFs against ~13,000 human ORFs, obtaining 454 validated binary interactions between 53 viral proteins and 307 human target proteins. By tandem affinity purification followed by mass spectrometry (TAP-MS), we reproducibly mapped 3,787 viral-host co-complex associations involving 54 viral proteins and the products of 1,079 unambiguously identified host genes.
Plants have unique features that evolved in response to environmental and ecological challenges. Accounts of the complex cellular networks that underlie plant-specific functions are missing. We reported a proteome-wide binary protein-protein interaction map from a search space size of ~8,000 x 8,000 ORFs for the interactome network of the plant Arabidopsis thaliana. This interactome map contains
6,200 highly reliable interactions between
2% of the full Arabidopsis biophysical binary interactome.
Worm Interactome version 8 contains 3,864 binary protein-protein interactions for C. elegans assembled from the WI-2007 high-throughput yeast two-hybrid screen (1,816 new interactions reported in Simonis et al Nat Methods 2009), the WI-2004 high-throughput yeast two-hybrid screen (1,735 interactions reported in Li et al Science 2004), and a compendium of data from medium-throughput yeast two-hybrid screens (554 interactions).
Yeast Interactome version 1 (CCSB-YI1) contains high-quality yeast two-hybrid protein-protein interactions for S. cerevisiae. It includes 1,809 interactions among 1,278 proteins, comprising ~10% of the complete yeast binary interactome estimated at ~18,000 ± 4,500 interactions (Yu et al Science 2008). To obtain a more comprehensive binary yeast interactome, CCSB-YI1 was combined with Ito-core and Uetz-screen datasets to produce Y2H-union, which contains 2,930 binary interactions among 2,018 proteins, ~20% of the whole yeast binary interactome.
Fragmentome: Many protein-protein interactions are mediated through independently folding modular domains. Proteome-wide efforts to model interactome networks have necessarily neglected the modular organization of proteins. We developed an experimental “fragmentome” strategy to efficiently identify interaction domains (Boxem et al Cell 2008). We used this strategy to generate a domain-based interactome network for proteins involved in C. elegans early embryonic cell divisions.
The Human Interactome Mapping Project is the flagship project of CCSB. A first draft map of "Space-I" of the human interactome, which describes interactions within the matrix of 8000x8000 ORFs contained in Human ORFeome v1.1, was published in 2005 (Rual et al.). We have used these data to develop a conceptual framework that incorporates screen completeness, assay detectability and screen saturation to estimate the size of the human interactome (Venkatesan et al, submitted). Currently, work is in progress to map space II (12k x 12k) and III (16k x 16k) of the human interactome at three-fold coverage.
2,000 metabolically related genes and cloning their open reading frames (ORFs); ii) identifying protein-protein interactions among the metabolic gene products; and iii) building protein interaction maps and metabolic networks that will ultimately allow the development of testable predictions of C. reinhardtii physiology, including gene lethality and rates of growth under defined environmental conditions.
Chemical biology approaches to target validation in cancer
Target validation is especially critical in the context of drugging the cancer genome and clinical drug resistance.
We review how chemical biology approaches benefit target validation.
We illustrate how critically assessed small molecule chemical inhibitors complement genetic approaches.
We highlight recent progress, including reagents for less druggable targets.
Target validation is a crucial element of drug discovery. Especially given the wealth of potential targets emerging from cancer genome sequencing and functional genetic screens, and also considering the time and cost of downstream drug discovery efforts, it is essential to build confidence in a proposed target, ideally using different technical approaches. We argue that complementary biological and chemical biology strategies are essential for robust target validation. We discuss recent progress in the discovery and application of high quality chemical tools and other chemical biology approaches to target validation in cancer. Among other topical examples, we highlight the emergence of designed irreversible chemical tools to study potential target proteins and oncogenic pathways that were hitherto regarded as poorly druggable.
PIMKL: Towards interpretable phenotype prediction using pathway information
Predicting disease progression based on molecular data obtained from diseased tissue samples, and stratifying or classifying patients accordingly, is a crucial step in helping clinicians personalize and design effective treatments. Although a number of machine learning algorithms have been proposed for this task, many of them fail to provide interpretable predictions, a key requirement for adoption of such methods in the health care industry.
IBM has created PIMKL (pathway-induced multiple kernel learning), a new machine learning algorithm that combines high predictive performance with interpretability when predicting phenotypes from molecular data. PIMKL achieves this by exploiting prior knowledge on molecular interactions. Specifically, we feed PIMKL information on molecular networks and pathways describing well-defined biomolecular processes. Using a machine learning technique known as multiple kernel learning, PIMKL identifies molecular pathways that are important for the classification of patient groups. Thanks to the interpretability of the model, the insights it provides on differences between patient groups could lead to a better understanding of cancer progression.
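The idea of turning pathway knowledge into kernels can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the published PIMKL implementation: it assumes each pathway is an undirected, unweighted gene graph, uses the unnormalized graph Laplacian L = D - A to define a per-pathway kernel k(x, y) = x^T L y (computed edge by edge), and combines pathways with fixed, hypothetical weights. In multiple kernel learning those weights would be learned from labeled samples. The names `pathway_kernel`, `combined_kernel`, and all example genes and values are made up for illustration.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, float]          # gene name -> measurement
Edges = List[Tuple[str, str]]      # undirected pathway edges


def pathway_kernel(x: Sample, y: Sample, edges: Edges) -> float:
    """x^T L y for the unnormalized Laplacian of an unweighted graph.

    For L = D - A this equals the sum over edges (i, j) of
    (x_i - x_j) * (y_i - y_j), so no matrix needs to be built.
    """
    return sum((x[i] - x[j]) * (y[i] - y[j]) for i, j in edges)


def combined_kernel(x: Sample, y: Sample,
                    pathways: Dict[str, Edges],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of pathway kernels (weights are assumed, not learned)."""
    return sum(weights[name] * pathway_kernel(x, y, edges)
               for name, edges in pathways.items())


# Hypothetical toy data: three genes, two pathways.
x = {"g1": 1.0, "g2": 0.0, "g3": 2.0}
y = {"g1": 0.0, "g2": 1.0, "g3": 1.0}
pathways = {"p1": [("g1", "g2")], "p2": [("g2", "g3"), ("g1", "g3")]}
weights = {"p1": 0.75, "p2": 0.25}

print(combined_kernel(x, y, pathways, weights))
```

Because each pathway contributes its own weight, a large learned weight points to a pathway that matters for separating the patient groups, which is the source of the interpretability described above.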
In our study, we specifically addressed the task of predicting whether a breast cancer patient will suffer a relapse within five years after first treatment. In order to benchmark PIMKL, we compared it to 14 other similar algorithms that have been previously applied to six breast cancer cohorts. PIMKL consistently outperformed its counterparts or ranked among the top-performing algorithms on each single cohort.[vii]
The molecular signatures identified by PIMKL turned out to also be highly relevant when we applied the algorithm on unseen data from a different cohort to that used for training. This is an indication that the knowledge obtained on disease progression can be transferred across datasets and cohorts. A particular strength of PIMKL showed up when we trained it with noisy data from different omic layers, i.e. genetic, epigenetic or protein data. Even then, PIMKL proved able to discard the noise and select the most informative data without reducing its performance.
We deployed PIMKL on the IBM Cloud. Researchers can access the web service or use the open source code to run experiments on their own data and obtain stable molecular signatures. Learn more about PIMKL on the project web site.
PIMKL exploits a machine learning technique called multiple kernel learning and prior knowledge on molecular interactions to predict phenotypes from molecular data of diverse origins.
The three presented algorithms demonstrate how machine learning approaches can be exploited to advance biomedical research on complex diseases such as cancer. Our work also shows that it is possible to incorporate explainability into the algorithms, thereby reinforcing trust while also guiding the search for the underlying disease mechanisms. We continue working to further refine these solutions. By making them publicly available, we hope to maximize their positive impact in the scientific community.
[iv] Matteo Manica, Ali Oskooei, Jannis Born, Vigneshwari Subramanian, Julio Sáez-Rodríguez, María Rodríguez Martínez, “Towards Explainable Anticancer Compound Sensitivity Prediction via Multimodal Attention-based Convolutional Encoders”, Workshop on Computational Biology, ICML, 2019.
[v] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, “Distributed representations of words and phrases and their compositionality”, Proc. 26th International Conference on Neural Information Processing Systems (NIPS’13), 3111–3119, 2013.
[vi] Matteo Manica, Roland Mathis, Joris Cadow, María Rodríguez Martínez, “Context-specific interaction networks from vector representation of words”, Nature Machine Intelligence 1(4), 181–190, 2019.
[vii] Matteo Manica, Joris Cadow, Roland Mathis, María Rodríguez Martínez, “PIMKL: Pathway-Induced Multiple Kernel Learning”, npj Systems Biology and Applications 5(1), 8, 2019.
Research Staff Member in Cognitive Health Care and Life Sciences, IBM Research-Zurich
III. THE SYSTEMS BIOLOGY APPROACH
The ability to collect data for a large number of molecular species from the same sample, including data measuring gene expression and the levels of many proteins and metabolites, opens the possibility of obtaining network-level data. This type of data also makes possible a more discovery-oriented approach. The ultimate goal of a systems biology approach to cancer is to understand a relevant biological network with the help of a mathematical model derived from a data-mining effort on heterogeneous, system-level data. Systems biology projects therefore require a specialized experimental design, several steps of data analysis, and an iterative approach alternating between computational prediction and experimental verification. This process is schematically represented in Figure 2.
The systems biology strategy. The goal of a systems biology approach is to construct predictive mathematical models of biological systems. This figure depicts the general strategy to be followed. The starting point is existing biological knowledge about the system in question. It is used to construct a general mathematical framework for a model. The model might contain several unknown biological parameters, such as kinetic rate constants or diffusion rates. The model is refined through known parameter values and experimental data, to which the model can be fitted through simulation. At this stage, the model is used to generate biological hypotheses that can be tested experimentally. Through an iterative process of hypothesis and experiment, the model is refined until it accurately captures the key biological features of the system to be modeled.
The fundamental premise of systems biology is that a complete understanding of a molecular network requires not just knowledge of its parts but also of their dynamic interactions. In order to construct meaningful system-level models, quantitative whole-cell measurements of the network's components as they change over time in response to one or more perturbations are needed. They form the basis of a computational model of the network. The model incorporates one or more hypotheses about the mechanisms that are responsible for system dynamics and provides a theoretical explanation of the observations. Model validation with further experiments tests these hypotheses. A validated model can then serve as a tool for the generation of further hypotheses. As such, the mathematical model represents a tool for hypothesis generation about a system that is typically too large and complex to understand on the basis of a static wiring diagram. Essential features of molecular networks, such as feedback loops and synergistic effects of different pathways, cannot be discovered and understood without the theoretical basis provided by the model.
Constructing dynamic mathematical models of networks is often not possible, however, due to the lack of appropriate time course data sets. In these cases it is still beneficial and often possible to construct and analyze static networks representing, for example, all possible protein-protein interactions in a cell type. Many data sets from functional genomics are available for this purpose. In Section VI we will describe two such examples.
Another important tool set in the systems biology repertoire is network inference from system-level data, typically DNA microarray data measuring gene transcription levels, 2D-gel, or large-scale mass spectrometry data. Inference methods allow the construction of phenomenological networks solely based on information available in the data, without the bias of prior biological knowledge or preconceived notions of the network structure. These phenomenological models can then be used to focus attention on certain features and provide constraints for building mechanistic models like the one in .
Sometimes network inference methods are referred to as “top-down” modeling and mechanistic models as “bottom-up.” As these terms have begun to enter into general use, it is worth commenting on their distinction. The availability of high-throughput data sets such as DNA microarrays or large-scale protein and metabolite profiles has opened up the possibility to create system-wide models of networks through inference of relationships between network nodes from data sets. This approach, sometimes referred to as top-down modeling, is in contrast to the more traditional approach of constructing mathematical or statistical models by first identifying a list of molecular species and their interactions and then constructing a model that encodes these relationships. Model parameters are then frequently estimated by fitting the model to available data. This approach might be called bottom-up modeling. (An example is provided in the next case study.) The fundamental difference between the two approaches is that in the bottom-up approach one begins with a determination of which molecules are important and which interactions are to be considered. In the top-down approach the aim is to reverse-engineer this information in an unbiased way from the system-level data. Network inference has emerged as a central problem in systems biology. A variety of algorithms have been developed for this purpose, whose output ranges from statistical models such as ARACNE and Bayesian network models, providing correlation measures of network nodes, to discrete models, such as Boolean networks, and systems of ordinary differential equations, giving full dynamic network models. In most cases, top-down models are phenomenological rather than mechanistic, unlike bottom-up models. The ultimate goal is to combine the two approaches, with top-down models providing constraints for the bottom-up approach.
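To make the notion of a discrete dynamic model concrete, here is a minimal sketch of a Boolean network with synchronous updates. The three-node negative feedback loop, the node names, and the update rules are hypothetical and chosen only to show how a feedback loop produces dynamics (here, an oscillation) that a static wiring diagram does not reveal.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]

# Hypothetical rules: A activates B, B activates C, C represses A.
RULES: Dict[str, Callable[[State], bool]] = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}


def step(state: State) -> State:
    """Synchronous update: every node reads the previous state."""
    return {node: rule(state) for node, rule in RULES.items()}


def simulate(initial: State, n_steps: int) -> List[State]:
    """Return the trajectory [initial, state after 1 step, ...]."""
    trajectory = [initial]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1]))
    return trajectory


trajectory = simulate({"A": True, "B": False, "C": False}, 6)
for t, s in enumerate(trajectory):
    print(t, "".join("1" if s[n] else "0" for n in "ABC"))
```

In a top-down setting the rules themselves would be inferred from time-course data; in a bottom-up setting they would be written down from known biology, as they are here.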
Systems biology approaches to the study of cancer to date have mostly focused on the construction and understanding of the molecular networks being altered by malignant transformations. But several studies have been undertaken that move the application of systems biology techniques closer to the clinic. While still far from the personalized medicine paradigm suggested by the genomics revolution, it is feasible to envision some of the ways in which parameters in mathematical models might depend on characteristics of individuals and determine the nature of dynamic processes related to cancer progression or diagnosis and treatment. Relevant questions impacting clinical practice are:
How can we use network-level analysis to improve prognostic forecasting? Case Study 1 in Section V suggests an answer with respect to the prediction of the metastatic potential of breast tumors. Using a combination of DNA microarray data and protein-protein interaction information, the study proposes subnetworks of the protein-protein interaction network as predictors, rather than collections of individual genes.
How can we use alterations of the molecular network in malignant cells rather than differential gene expression in order to identify oncogenes? Case Study 2 focuses on this question in the context of B cell lymphomas. Using a variety of data sets and statistical techniques, a network is established that integrates a range of molecular interactions. Oncogenes in different lymphoma phenotypes are identified by the changes in their network neighborhood rather than by differential expression.
How can we use multi-scale dynamic models spanning the molecular and the tissue level in order to improve treatment methods? This question is addressed in Case Study 3, focused on improved protocols for radiation therapy. Using a dynamic mathematical model that integrates the molecular, cellular, and tissue level scales, one can study strategies for optimal timing of radiation treatment sessions.
Before presenting the case studies that address specific instances of these questions, we first review briefly the systems biology literature as it pertains to cancer. This review is not intended to be comprehensive, but rather illustrative of the systems biology approach.
Results and discussion
Bait selection and analytical processing
An initial set of 407 human bait proteins was selected based on known or implied disease associations and functional annotation. These proteins are implicated in a diverse set of biological processes and pathways. The most well-represented biological process categories among the set of baits are protein modification, cell cycle, transcription and signal transduction, reflecting the choice of bait proteins that are fundamental to essential cellular processes. Many of the baits also have known disease associations, the most well represented being breast cancer, colon cancer, diabetes and obesity, reflecting our objective to target important human diseases. Approximately 10% of the baits selected were hypothetical or poorly annotated proteins, chosen in some cases for their homology to proteins with disease or functional associations of interest. The data set reported here maps interactions for 338 of the initial set of bait proteins. A complete listing of the bait proteins and a representative biological process from the Gene Ontology (GO) ( Ashburner et al, 2000 ), where available, is provided in Supplementary Table I. (See Supplementary Information for further details on bait selection and disease associations.)
Analytical processing and mass spectrometry were carried out as described in Materials and methods. In total, 1034 individual immunoprecipitation experiments were resolved by SDS–PAGE, and proteins visualized by colloidal Coomassie stain. Processing of the corresponding gel lanes yielded 16 321 gel bands that were processed by mass spectrometry generating over 400 000 MS/MS spectra that matched a peptide sequence in the database. For over half of the baits, replicated immunoprecipitation experiments were performed. Figure 1 shows a breakdown of the total set of experiments by type.
Prey identification, scoring and filtering
As shown in Figure 1, our data set consists of both replicated and single-pass immunoprecipitation experiments. An additional level of redundancy arises from the fact that prey proteins may or may not be restricted to single gel bands in a given lane. We, therefore, devised a data-processing pipeline that would consolidate, organize and remove redundancy and provide us with an accurate master list of the prey proteins identified for each bait. Figure 2 provides an overview of this process. Having extracted each gel band and acquired the MS/MS data (Figure 2A), each data file was searched using the Mascot (www.matrixscience.com) search engine. Low-quality peptide and protein hits were then removed by applying score threshold rules (see Materials and methods). All protein and peptide hits corresponding to gel bands from the same bait were combined and clustered using the algorithm outlined in Figure 2D. The algorithm collapses proteins into clusters if their respective sets of matching peptides are proper subsets of one another (i.e., if one set of peptides is completely redundant with respect to the other). A representative protein (termed the ‘anchor’ of the cluster) can then be selected from the cluster. The anchor is selected by ranking the proteins within a cluster by score or number of peptides and then choosing the top-ranked protein. Ties may be broken by consideration of other attributes, such as quantity and quality of annotation for each protein. This process removes redundancy at the level of the peptide matches and protein sequence: each of the anchor proteins is guaranteed to be non-redundant with respect to its complement of matching peptides. A small proportion (0.5%) of the reported interactions are, however, redundant at the level of the gene locus, that is, multiple reported prey proteins map to the same gene name. These may represent instances of protein isoforms or variants and are accordingly left in the data set.
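The clustering step described above can be sketched as a greedy pass over the proteins identified for one bait. This is an illustrative reading of the description, not the authors' code: the function name, the input layout, and the tie-breaking order (number of peptides, then score, then name) are assumptions, and identical peptide sets are merged along with proper subsets.

```python
from typing import Dict, List, Set, Tuple


def cluster_by_peptide_subsets(
    peptides: Dict[str, Set[str]],
    scores: Dict[str, float],
) -> List[Tuple[str, List[str]]]:
    """Group proteins whose peptide sets are contained in an anchor's set.

    Returns a list of (anchor, members) pairs; members includes the anchor.
    """
    # Visit proteins from most to fewest peptides, then by descending score,
    # so the first protein of each cluster is its top-ranked member (anchor).
    ordered = sorted(peptides, key=lambda p: (-len(peptides[p]), -scores[p], p))
    clusters: List[Tuple[str, List[str]]] = []
    for protein in ordered:
        for anchor, members in clusters:
            # Redundant if every peptide is already explained by the anchor.
            if peptides[protein] <= peptides[anchor]:
                members.append(protein)
                break
        else:
            clusters.append((protein, [protein]))
    return clusters


# Hypothetical identifications for a single bait.
example_peptides = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b"},
    "P3": {"d"},
    "P4": {"a", "b", "c"},
}
example_scores = {"P1": 120.0, "P2": 60.0, "P3": 45.0, "P4": 90.0}

for anchor, members in cluster_by_peptide_subsets(example_peptides, example_scores):
    print(anchor, sorted(members))
```

Each anchor then stands in for its cluster in the master prey list, so no two reported preys are supported by a redundant set of peptides.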
Filtering out spurious and nonspecific proteins
In order to minimize the number of false-positive interactions, we applied an empirical filtering process to remove spurious/contaminant proteins and nonspecifically interacting proteins (Figure 2E). Three filtering steps were applied to the interaction network. These steps are summarized in Table I. First was the identification of the bait protein itself (bait–bait interactions were removed from the network; 97% of baits were identified at least once). Second, 207 interactions corresponding to instances of spill-over from one gel lane to the next were removed. Finally, we built a database of spuriously occurring proteins and contaminants based on 202 control (vector only) immunoprecipitation experiments. Those proteins occurring in ⩾2.5% of control experiments were removed from the data set, as were proteins interacting with ⩾5% of baits. This combined set of proteins includes many common contaminants of mass spectrometry experiments (such as human keratins) as well as proteins observed to bind nonspecifically, and includes protein families such as tubulins, ribosomal proteins and heat-shock proteins.
|Filtering step||Baits||Unique proteins||Interactions|
|Unfiltered interaction network||407||2826||24 540|
|Remove bait–bait interactions (a)||407||2826||24 211|
|Remove spill-over interactions (b)||407||2826||24 005|
|Remove frequent binders and control experiment proteins (c)||338||2235||6463|
- (a) Those instances where the bait protein was identified in the mass spectrometry experiment.
- (b) Observations of apparent spill-over from one gel lane to another detected by manual examination of gels and peptide/protein identification data.
- (c) Frequent binders are defined as prey proteins identified for ⩾5% of baits; control proteins are those ‘prey’ proteins identified in ⩾2.5% of control experiments.
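The two automatable filters (bait–bait removal and the frequency cutoffs) can be sketched as follows. The function name and data layout are assumptions made for illustration; the 2.5% and 5% thresholds come from the text. Spill-over removal is omitted because it was done by manual inspection of gels.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Set, Tuple

Interaction = Tuple[str, str]  # (bait, prey)


def filter_interactions(
    interactions: List[Interaction],
    control_runs: List[Set[str]],
    control_cutoff: float = 0.025,
    binder_cutoff: float = 0.05,
) -> List[Interaction]:
    """Drop bait-bait pairs, control contaminants, and frequent binders."""
    baits = {bait for bait, _ in interactions}
    kept = [(b, p) for b, p in interactions if b != p]

    # Contaminants: proteins seen in >= control_cutoff of control experiments.
    n_controls = len(control_runs)
    control_counts = Counter(p for run in control_runs for p in set(run))
    contaminants = {p for p, n in control_counts.items()
                    if n_controls and n / n_controls >= control_cutoff}

    # Frequent binders: preys seen with >= binder_cutoff of all baits.
    baits_per_prey = defaultdict(set)
    for b, p in kept:
        baits_per_prey[p].add(b)
    frequent = {p for p, bs in baits_per_prey.items()
                if len(bs) / len(baits) >= binder_cutoff}

    return [(b, p) for b, p in kept
            if p not in contaminants and p not in frequent]


# Hypothetical data: 40 baits with one specific prey each, plus noise.
demo = [(f"B{i}", f"P{i}") for i in range(40)]
demo += [("B0", "TUBB"), ("B1", "TUBB"), ("B2", "TUBB"),  # frequent binder
         ("B5", "KRT1"),                                   # control contaminant
         ("B7", "B7")]                                     # bait-bait
demo_controls = [{"KRT1"}, {"KRT1"}] + [set() for _ in range(38)]

filtered = filter_interactions(demo, demo_controls)
print(len(demo), "->", len(filtered))
```

Note that the frequent-binder rule only becomes meaningful with many baits: with a handful of baits, a single observation already exceeds 5%.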
Interaction confidence scores
As many peptide and protein identification metrics (scores, expect values, number of peptides, peptide coverage, etc.) can be used to assess the overall confidence of a prey identification, we sought to combine several of these metrics and generate an overall measure of the confidence of each prey observation (Figure 2G). Data corresponding to the set of 18 well-replicated (⩾5 immunoprecipitation experiments) baits (see Figure 1) were used as a training set to build a partial least squares (PLS)-based regression model of prey protein reproducibility, whereby the reproducibility was the dependent variable and six predictor variables were selected to build the model. The six predictor variables were the Mascot score for the prey in the lane, the total number of peptides observed for the prey in the lane, the rank of the prey protein in the lane, a binary value indicating whether the prey is in fact the bait protein itself, the maximum Mascot score for the prey across all of the bands in which it was observed in the lane and the best rank for the prey protein across all bands in the lane. The model was trained on the replicated set of baits and then applied to the remaining data set. Where multiple experiments were performed for a bait, an averaged predicted reproducibility value was calculated across all of the observations of the given prey protein. Finally, this value was normalized to between 0 and 1 and reported as the interaction confidence score. In a small number of cases (approximately 5% of the reported interactions), an interaction confidence score was not calculated, because one of the predictor variables was not available. 
For example, in some cases, a protein identified as present in a given lane may not be judged as present in any of the individual bands in that lane (peptides corresponding to the protein may be present in different bands, none of which scores highly enough for the protein to be considered as present); in these cases, the predictor variable corresponding to the best rank for the protein across all bands in the lane is not calculated. For 18% of the interactions in the accompanying data set, the interaction confidence score is reported as 0. These prey protein observations should still be interpreted as valid, as they meet the required search engine score thresholds.
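The last two steps of the scoring procedure (averaging predicted reproducibility over repeated observations, then rescaling to the range 0 to 1) can be sketched as below. The regression model itself is not reproduced here; the input is assumed to be its per-observation output. The source does not state how the normalization was done, so min-max scaling is an assumption, as are the function name and the use of `None` for observations with a missing predictor.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple

Observation = Tuple[str, str, Optional[float]]  # (bait, prey, predicted value)


def confidence_scores(
    observations: Iterable[Observation],
) -> Dict[Tuple[str, str], Optional[float]]:
    """Average predictions per bait-prey pair, then min-max scale to [0, 1]."""
    grouped = defaultdict(list)
    for bait, prey, value in observations:
        grouped[(bait, prey)].append(value)

    averaged: Dict[Tuple[str, str], Optional[float]] = {}
    for pair, values in grouped.items():
        usable = [v for v in values if v is not None]
        # No score when every observation lacks a prediction.
        averaged[pair] = mean(usable) if usable else None

    known = [v for v in averaged.values() if v is not None]
    if not known:
        return averaged
    low, span = min(known), max(known) - min(known)

    scores: Dict[Tuple[str, str], Optional[float]] = {}
    for pair, value in averaged.items():
        if value is None:
            scores[pair] = None
        elif span == 0:
            scores[pair] = 0.0
        else:
            scores[pair] = (value - low) / span
    return scores


# Hypothetical model outputs.
demo_observations = [
    ("A", "X", 0.2), ("A", "X", 0.4),  # replicated observation of A-X
    ("A", "Y", 0.9),
    ("B", "W", 0.1),
    ("B", "Z", None),                  # predictor unavailable
]
for pair, score in confidence_scores(demo_observations).items():
    print(pair, score)
```

Under min-max scaling the lowest-scoring pair receives exactly 0, which is one possible reason a score of 0 should not be read as "no evidence".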
We validated our interaction scoring metric in several ways, demonstrating its utility as a measure of interaction confidence. First, using our training data set of reproduced baits, we performed a 10-fold cross-validation of the model, and measured the ability of our model to estimate prey reproducibility (see Supplementary Information). We found good correlation (r=0.66) between the observed reproducibility and predicted reproducibility across our training set. Second, by analyzing the subset of known interactions in the data set (see subsequent section), we observed that the interaction confidence scores assigned to the set of known interactions were significantly higher than those scores assigned to previously unknown interactions; the set of known interactions has a mean interaction confidence score of 0.43, whereas the mean of the entire set of interaction confidence scores is 0.21, a statistically significant difference (Wilcoxon rank sum test P≪0.0001). Third, we analyzed the set of reciprocal interactions in the data set. In our study, no explicit effort was made to test bait–prey interactions reciprocally (i.e., to use the observed prey proteins as baits and see whether the original bait proteins are identified). A small number of interactions (21) were, however, observed reciprocally in the data set. The interaction confidence scores of these 21 reciprocally observed interactions (mean=0.43) were significantly higher (Wilcoxon rank sum test P≪0.0001) than the set of interactions for which a reciprocal interaction was not observed (mean=0.25) or indeed the whole data set (mean=0.21). These observations show that the interaction confidence score is a useful means of ranking the interactions for subsequent data mining. To keep the analysis of such a large data set tractable, we focused our in-depth interpretation primarily on interactions with score ⩾0.3, corresponding to approximately one-third (2251 interactions) of the data set. 
This threshold was chosen because most interactions between subunits of well-characterized protein complexes represented in the data set (the proteasome and eukaryotic translation initiation factors—see below) have scores ⩾0.3. In addition, for 85% of prey proteins with interaction score ⩾0.3, two or more distinct peptide sequences were identified, consistent with emerging guidelines for mass spectrometry-based protein identification ( Bradshaw et al, 2006 ).
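The score comparisons above rely on the Wilcoxon rank sum test. As a rough illustration of what that test computes, the following is a minimal pure-Python sketch using the normal approximation. The function name `rank_sum_test` is hypothetical, ties receive average ranks but no tie correction is applied to the variance, and nothing here is taken from the authors' actual analysis code.

```python
import math


def rank_sum_test(xs, ys):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (U statistic for xs, z score, two-sided P-value).
    Assumption: no tie correction of the variance, so this is a sketch only.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a block of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the block
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(xs), len(ys)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u1, z, p_two_sided
```

In practice the inputs would be the confidence scores of, for example, known versus previously unknown interactions; a statistics package with exact and tie-corrected variants would be preferable for real analyses.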
Computational assessment and validation
Other types of genomic information, when combined with protein–protein interactions, can provide stronger evidence of functional relationships between genes. Several methods of using these orthogonal genomic data to computationally assess high-throughput protein–protein interaction data have been proposed, such as comparison with gene expression, analysis of paralogous interactions and use of functional and subcellular localization information ( Deane et al, 2002; von Mering et al, 2002; Rual et al, 2005 ). In this section, we present a computational assessment of the IP-HTMS data set by integrating three classes of genomic information: other human protein–protein interaction data sources, GO annotations and gene expression microarray data.
An important consideration when integrating other data types is how to count the protein–protein interactions ( von Mering et al, 2002 ). Two paradigms for modeling protein–protein interaction data have been proposed: the ‘spoke’ model, whereby each bait is assumed to interact with each of its observed prey proteins, and the ‘matrix’ model, whereby the bait and all of the preys are assumed to interact with each other ( Bader and Hogue, 2002 ). We adopted the ‘spoke’ model for all of our analyses (unless stated otherwise), as the ‘matrix’ model has been shown to produce higher rates of false positives ( Bader and Hogue, 2002 ). We recognize, however, the limitations of the ‘spoke’ model, in particular that bait–prey interactions identified in immunoprecipitation experiments may not actually represent direct physical interactions between the bait and prey protein.
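The difference between the two counting paradigms can be made concrete with a small sketch. The function names `spoke_pairs` and `matrix_pairs` and the protein labels in the usage note are illustrative assumptions, not identifiers from the study.

```python
from itertools import combinations


def spoke_pairs(bait, preys):
    """Spoke model: the bait is paired with each of its observed preys."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_pairs(bait, preys):
    """Matrix model: every member of the purification is paired with every other."""
    members = {bait, *preys}
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}
```

For one bait with three preys, `spoke_pairs("B", ["P1", "P2", "P3"])` yields three pairs whereas `matrix_pairs` yields six, which illustrates why the matrix model inflates the number of candidate (and potentially false-positive) interactions.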
Comparison to other protein–protein interaction data sources
Previous reports have in general found relatively little overlap between protein–protein interaction data sets ( Bader and Hogue, 2002 ). For example, a recent comparison of a comprehensive literature-curated catalog of yeast interactions to all available high-throughput yeast interactions showed only a 14% overlap ( Reguly et al, 2006 ). As pointed out by the latter authors, however, it is important to distinguish between the absolute intersection of the two data sets (the number of interactions in common between the data sets being compared) and the intersection of ‘interaction space’ covered by each data set. For the IP-HTMS platform, the interaction space is the space covered by the set of bait proteins. For example, in comparing the IP-HTMS data set to a Y2H data set, we identify the IP-HTMS space as those Y2H interactions for which one or more of the interactors correspond to an IP-HTMS bait. Performing the comparisons in this way allows for realistic estimates of how interactions are recapitulated across different studies and technology platforms.
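The distinction between the absolute intersection and the intersection within the interaction space can be sketched as follows. This is a minimal illustration that treats interactions as unordered pairs; the helper names `as_pair_set`, `interaction_space` and `space_overlap` are hypothetical.

```python
def as_pair_set(pairs):
    """Normalize (a, b) tuples to unordered pairs."""
    return {frozenset(pair) for pair in pairs}


def interaction_space(other_pairs, baits):
    """Interactions from another data set that involve one or more IP-HTMS baits."""
    bait_set = set(baits)
    return {pair for pair in as_pair_set(other_pairs) if pair & bait_set}


def space_overlap(ip_pairs, other_pairs, baits):
    """Return (shared interactions, size of space, shared fraction of the space)."""
    space = interaction_space(other_pairs, baits)
    shared = as_pair_set(ip_pairs) & space
    fraction = len(shared) / len(space) if space else 0.0
    return len(shared), len(space), fraction
```

Restricting the denominator to the space covered by the baits is what makes percentages from different platforms comparable; dividing by the full size of the other data set would understate the overlap.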
We compared the IP-HTMS data set to three other sources of human protein–protein interactions: a collation of known interactions ( Ramani et al, 2005 ), a set of interactions predicted from lower eukaryotic interactome maps ( Lehner and Fraser, 2004 ) and a high-throughput Y2H study ( Rual et al, 2005 ). The overlaps between these data sets and the IP-HTMS data set are summarized in Table II. They range from 6 to 11%, broadly in line with observations of the overlap between the human Y2H data set and literature-curated interactions (2–8%) ( Rual et al, 2005 ). By randomly permuting the IP-HTMS bait–prey interactions and re-computing the overlaps, we confirmed that the overlaps are significantly greater than would be expected by chance (P<0.0001). Similar comparisons in yeast, between IP-HTMS interactions ( Ho et al, 2002 ) and literature-curated interactions and between tandem affinity purification interactions ( Gavin et al, 2002 ) and literature-curated interactions, show 20 and 30% overlaps, respectively ( Reguly et al, 2006 ), suggesting that a much greater proportion of the yeast interactome has been cataloged than of the human interactome.
|Protein–protein interaction data set||Known (a)||Predicted (b)||Experimental (Y2H) (c)|
|Interactions||31 183||20 469||6727|
|IP-HTMS baits featured in data set (d)||216||123||94|
|Overlap with IP-HTMS space (e)||2332||668||366|
|Intersection with IP-HTMS (number of interactions, percentage of total) (f)||149, 6.4%||78, 11.4%||29, 7.9%|
|Randomly permuted intersection with IP-HTMS (min, mean, max) (g)||7, 14.3, 25||3, 8.0, 14||0, 1.8, 7|
|Statistical significance of intersection (fold-enrichment, P-value) (h)||∼10-fold, P<0.0001||∼10-fold, P<0.0001||∼15-fold, P<0.0001|
- a Ramani et al (2005)
- b Lehner and Fraser (2004)
- c Rual et al (2005)
- d IP-HTMS baits (from total of 343) featuring in the data set.
- e Number of interactions in the data set featuring one or more IP-HTMS baits.
- f Number of shared interactions between data set and IP-HTMS.
- g Number of shared interactions between randomly permuted (1000 iterations) IP-HTMS and data set.
- h Fold enrichment of observed intersection over intersection expected by chance.
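The permutation rows of the table can be reproduced in spirit with a sketch like the one below. The text does not state exactly how the bait–prey interactions were permuted, so shuffling prey labels across the bait–prey pairs is an assumption here, as are the function name `permuted_overlap` and the add-one empirical P-value estimator.

```python
import random


def permuted_overlap(ip_pairs, reference_pairs, iterations=1000, seed=0):
    """Compare the observed overlap with overlaps after shuffling prey labels.

    ip_pairs: list of (bait, prey) tuples; reference_pairs: iterable of pairs.
    Returns (observed, min, mean, max, fold enrichment, empirical P-value).
    """
    rng = random.Random(seed)
    reference = {frozenset(pair) for pair in reference_pairs}
    baits = [bait for bait, _ in ip_pairs]
    preys = [prey for _, prey in ip_pairs]
    observed = len({frozenset(pair) for pair in ip_pairs} & reference)
    counts = []
    for _ in range(iterations):
        shuffled = preys[:]
        rng.shuffle(shuffled)  # assumption: permute prey labels among the pairs
        permuted = {frozenset(pair) for pair in zip(baits, shuffled)}
        counts.append(len(permuted & reference))
    mean = sum(counts) / iterations
    fold = observed / mean if mean else float("inf")
    # Add-one estimator so the empirical P-value is never exactly zero.
    p_value = (1 + sum(c >= observed for c in counts)) / (1 + iterations)
    return observed, min(counts), mean, max(counts), fold, p_value
```

The fold enrichment is simply the observed intersection divided by the mean permuted intersection, which matches how the last two rows of the table relate to each other (for example, 149 against a permuted mean of 14.3).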
The sets of interactions in common between the human IP-HTMS interactions and each of the other three data sets are themselves overlapping: of the total of 256 overlapping interactions between IP-HTMS and the other three data sets, 82 are found in two or more of the overlapping sets. We also note that interactions in common between the IP-HTMS and other sources of human protein–protein interactions have in general significantly higher confidence scores. The mean confidence scores for the interactions in common between IP-HTMS and the known set, IP-HTMS and the predicted set, and IP-HTMS and the Y2H set are 0.43, 0.43 and 0.42, respectively, higher than expected by chance (P≪0.0001, Wilcoxon rank sum test) given the overall distribution of confidence scores.
As already mentioned, it is probable that some of the bait–prey interactions identified in IP-HTMS experiments may not actually represent direct physical interactions between the bait and prey protein, but instead interactions between preys. To explore this further, we first extended our comparisons by considering the matrix of all possible interactions in the IP-HTMS data set (i.e., including all possible prey–prey interactions for each bait). Of the matrix of ∼225K possible IP-HTMS interactions, 1678 are in common with the known set (statistically significantly greater than expected by chance, P<0.0001). Although the accuracy of considering the matrix of all interactions is expected to be lower than when only considering bait–prey interactions ( Bader and Hogue, 2002 ), clearly many valid interactions remain to be discovered from this broader approach.
Second, we compared our IP-HTMS interactions to the literature using the Pathway Studio software (Ariadne Genomics). This software enables rapid annotation of protein–protein interactions with literature mined from various sources. Using this approach, 145 protein–protein interactions in our IP-HTMS data set were annotated as present in the literature. In order to identify those IP-HTMS interactions that represent indirect interactions between bait and prey, we mined the literature in the following way. Bait–prey pairs from our IP-HTMS experiments that have literature validation in the Pathway Studio database were selected. The interaction network was then expanded by extracting all known interactors from the literature that are within two edges of the prey. We then overlapped the experimental interactions with the expanded network such that for each bait we considered all paths of length two where the (bait, prey) and the (bait, interactor of prey) pairs are both in IP-HTMS, and hence, the (prey, interactor of prey) pair can be inferred. We did the same for paths of length three, and we enumerated all the distinct length-one pairs from the literature that were part of the overlapping paths. This allows us to significantly expand the validation of our data set using the literature by including not just bait–prey but also prey–prey interactions. With our additional analysis, the total number of observed interactions that are reinforced by the literature increases to 375. This represents a 2.6-fold increase in validation, corresponding to 6% of all of our interactions. This set of interactions is provided in Supplementary Table IV. We have utilized this approach in a detailed way to extend networks for individual bait proteins. An example of this is provided in Supplementary Figure III. Only four direct interactors of VHL from our data set matched with the literature. Using our novel approach, we extended the interaction network surrounding VHL within two literature edges.
This increased the number of proteins seen in the VHL IP-HTMS experiment that are linked to VHL through the literature to 13 (three-fold increase). The nine new associations are indirect but are linked through known interactors of VHL.
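The length-two case of the literature expansion described above can be sketched as follows. This is an illustrative reading of the procedure, not the Pathway Studio workflow itself: the function name `literature_supported_prey_pairs` and the data layout (a dict from bait to its set of preys, plus a list of literature pairs) are assumptions.

```python
def literature_supported_prey_pairs(preys_by_bait, literature_pairs):
    """Infer prey-prey pairs supported by paths of length two.

    For each bait, a (prey, other prey) pair is kept when the bait-prey pair
    is itself literature-validated, the prey-other pair is in the literature,
    and both proteins were observed as preys of that bait.
    """
    literature = {frozenset(pair) for pair in literature_pairs}
    neighbors = {}
    for a, b in literature_pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    supported = set()
    for bait, preys in preys_by_bait.items():
        for prey in preys:
            if frozenset((bait, prey)) not in literature:
                continue  # start only from literature-validated bait-prey pairs
            for other in neighbors.get(prey, ()):
                if other in preys and other != prey:
                    supported.add(frozenset((prey, other)))
    return supported
```

Extending the same idea to paths of length three would repeat the neighbor lookup one step further from each supported prey.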
Evolutionary relationships between genes both across and within species have been proposed as sources for discovery and confirmation of protein–protein interactions ( Matthews et al, 2001 ). In yeast, interactions between pairs of proteins have been shown to be of higher confidence if interactions also occur between paralogs of the interactors ( Deane et al, 2002 ). The latter authors developed the paralogous verification method, and showed that in yeast the method was able to predict 40% of true interactions with a 1% false-positive rate ( Deane et al, 2002 ).
We explored the utility of this method for assessment of the IP-HTMS data set by first collating a set of 1999 groups of human paralogs (representing 6023 human genes) from the InParanoid database ( O'Brien et al, 2005 ). Cross-referencing to the IP-HTMS data set identified 834 interactions for which both bait and prey could be assigned one or more paralogs. Overall, 154 of these 834 interactions (18%) had one or more paralogous interactions. The set of 154 paralogous interactions is provided as Supplementary Information (Supplementary Table III).
In many cases, these paralogous interactions consist of a single bait interacting with two or more related (paralogous) prey proteins. We also wished to test the rate at which paralogous baits identify the same or related prey proteins. The IP-HTMS data set provides an opportunity to do this, because for 16 of the IP-HTMS baits, one or more paralogs have also been used as baits. These 16 baits correspond to 157 interactions for which paralogs were assigned, and 57 of these interactions are paralogous (36%). One caveat to analyzing the IP-HTMS data in this way is that it is not possible to distinguish between independent interactions of paralogous baits with the same or related prey proteins and the scenario whereby paralogous baits interact with each other (e.g., heterodimers) and that complex then identifies the same set of preys regardless of which bait is used. The set of 16 paralogous baits includes three members of the 14-3-3 protein family, YWHAB, YWHAQ and YWHAZ. These proteins are known to form homo- and heterodimers in vivo ( Jones et al, 1995 ) and together contribute 35 of the 57 interactions from paralogous baits. Nevertheless, this is a useful demonstration of the reproducibility of paralogous baits: the three 14-3-3 baits identify 117 prey proteins in total, 33 of which are identified by more than one of the baits. Finally, we note that interactions supported by a paralogous interaction have significantly higher interaction confidence scores: the set of 154 paralogous interactions has a mean score of 0.33, as compared to 0.21 across the whole data set (Wilcoxon rank sum test P≪0.0001). As pointed out by Deane et al (2002), the paralogous verification method is useful only where paralogs can be identified. This is only possible for a relatively small fraction (834 out of 6463 interactions) of the IP-HTMS data set.
Nevertheless, we believe that this first preliminary analysis of paralogous interactions in the human interactome illustrates the potential for further in-depth studies as our ability to assign paralogs improves and our knowledge of the human interactome increases.
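A simplified version of the paralogous verification idea can be sketched as below: an interaction counts as paralogous when another observed interaction connects the same pair of paralog groups. The function name `paralogous_interactions` and the `family_of` mapping (gene to paralog group identifier) are assumptions made for illustration; the original method of Deane et al (2002) and the authors' implementation may differ in detail.

```python
def paralogous_interactions(pairs, family_of):
    """Find interactions supported by a paralogous interaction.

    pairs: iterable of (bait, prey); family_of: dict gene -> paralog group id
    (only genes with assigned paralogs appear in the dict).
    Returns (assignable interactions, supported interactions).
    """
    by_family_pair = {}
    for bait, prey in set(pairs):
        if bait in family_of and prey in family_of:
            key = (family_of[bait], family_of[prey])
            by_family_pair.setdefault(key, set()).add((bait, prey))
    assignable = set()
    supported = set()
    for group in by_family_pair.values():
        assignable |= group
        if len(group) > 1:  # another interaction links the same two families
            supported |= group
    return assignable, supported
```

With the numbers reported above, the assignable set would correspond to the 834 interactions and the supported set to the 154 paralogous interactions.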
Biological process and pathway enrichment
To gain an overview of the classes of proteins identified as preys for each of the baits, we used the GO (slim subsets) to analyze biological process and cellular component category representation. In both cases, the distribution of prey proteins among the categories is similar to the distribution of categories among bait proteins: the most well-represented bait biological process categories (protein modification, protein biosynthesis, cell cycle, transcription and signal transduction) are also the most well-represented prey protein categories.
We used the GO annotation to analyze the degree to which bait and prey interactors share the same or related GO categories. For high-throughput yeast data, the fractions of interactions for which both interactors have the same high-level biological process or cellular component categories have been estimated at 20 and 27%, respectively ( Reguly et al, 2006 ). For our human IP-HTMS data, these fractions are 12 and 20%, respectively. To illustrate these associations in more detail, we generated bait–prey coincidence maps (Figure 3) in which the association between each combination of bait and prey GO categories is tested using a contingency table and statistical test (Fisher exact test). Each combination of bait GO category i and prey GO category j is represented as a cell in the matrix, and the color of the cell represents the statistical significance of the association between bait category i and prey category j. We also implemented a permutation procedure to characterize the distribution of P-values derived from random associations (see Materials and methods). The permutation-based P-value for each bait–prey category combination was calculated as the fraction of permutations in which the Fisher exact test P-value was less than the observed ‘real’ P-value. On this basis, the bait–prey category combinations with P-values less than or equal to 0.0001 are all judged to be highly significant: smaller P-values for each of these category combinations were not observed across 1000 independent random permutations of the bait–prey categories.
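To make the contingency-table step concrete, the sketch below builds the 2×2 table for one bait category i and prey category j and computes a Fisher exact P-value. The text does not say whether the test was one- or two-sided, so the one-sided (enrichment) form is an assumption, as are the function names `coincidence_table` and `fisher_enrichment_p` and the dict-of-sets annotation layout.

```python
from math import comb


def coincidence_table(pairs, bait_categories, prey_categories, i, j):
    """Count bait-prey pairs by membership of bait in i and prey in j.

    Returns (a, b, c, d): both, bait only, prey only, neither.
    """
    a = b = c = d = 0
    for bait, prey in pairs:
        bait_in_i = i in bait_categories.get(bait, ())
        prey_in_j = j in prey_categories.get(prey, ())
        if bait_in_i and prey_in_j:
            a += 1
        elif bait_in_i:
            b += 1
        elif prey_in_j:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact P-value: probability of a count >= a in the
    top-left cell under the hypergeometric null with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favorable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favorable / comb(n, col1)
```

A permutation-based P-value in the sense described above could then be obtained by shuffling the category labels, recomputing `fisher_enrichment_p` each time, and counting how often the permuted value falls below the observed one.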
This analysis revealed a significant tendency of baits to interact with prey proteins implicated in the same or similar biological process (Figure 3A and B). For example, the most significant bait–prey biological process category combinations were protein biosynthesis/protein biosynthesis (P=1.7e−09) and catabolism/catabolism (P=2.3e−08). These correspond to two highly connected clusters of interacting proteins representing known macromolecular complexes: translation initiation and elongation factors, and the proteasome (both discussed in more detail below). Similar results were obtained for the cellular component categories (Figure 3C and D), except that significant off-diagonal associations were also seen. Most notably, a significant enrichment is seen between baits assigned to the plasma membrane and endoplasmic reticulum/Golgi preys. This enrichment is largely due to two baits from the tumor necrosis factor receptor super-family (TNFRSF14 and TNFRSF5) interacting with several endoplasmic reticulum prey proteins. These two baits interact with an overlapping set of endoplasmic reticulum-associated proteins including several components of the microsomal signal peptidase complex and endoplasmic reticulum-associated protein disulfide isomerase family members. Although it is not clear what the actual biological explanation might be, we believe that these are not spurious observations as this group of prey proteins is also identified using the TRAF6 bait (TNF receptor-associated factor), a known mediator of signaling from TNFRSF5.
Integrated analysis of the IP-HTMS and GO categories also facilitated discovery of some very specific but potentially biomedically important interactions. Relatively few proteins in the IP-HTMS data set are assigned to the peroxisome (17 interactions involve a peroxisomal bait or prey). Of these interactions, a single interaction was observed between a peroxisomal bait and a peroxisomal prey: PHYH (phytanoyl-CoA 2-hydroxylase) bait identified ABCD3 (ATP-binding cassette, subfamily D) as a prey. Defects in the functioning of both PHYH and ABCD3 are implicated in Zellweger's syndrome and other peroxisomal biogenesis disorders, a set of potentially severe (fatal) inherited diseases ( Moser, 1999; Steinberg et al, 2006 ). In addition, several studies have shown interactions between ABCD proteins and peroxisomal biogenesis factors (PEX proteins) and between PHYH and PEX proteins ( Liu et al, 1999; Gloeckner et al, 2000 ). To our knowledge, our observation is the first indication of a protein–protein interaction between PHYH and ABCD3.
Cross-referencing gene expression information
Increased similarity of gene expression profiles for genes encoding interacting proteins has been demonstrated in yeast ( Ge et al, 2001 ). Preliminary evidence that this may also be the case in higher eukaryotes has been reported for Caenorhabditis elegans ( Li et al, 2004 ) and in humans ( Hahn et al, 2005; Rual et al, 2005 ). In the latter case, enrichment for higher gene expression correlation was seen for both literature-derived interactions and, albeit at a lower level, for the experimentally derived data set ( Rual et al, 2005 ). One of the principal issues in attempting to measure whether a relationship exists between gene expression and protein interaction data sets is the incompleteness and arbitrary nature of selecting appropriate human gene expression data. Rather than select individual data sets over which co-expression could be measured, we made use of a compendium of co-expression measurements generated from 3924 microarrays from 60 different human studies ( Lee et al, 2004 ). Co-expression links in this study are defined as positive or negative based upon their position within the extremes of the distributions of correlation for each study ( Lee et al, 2004 ). Figure 4 shows the respective fractions of positive and negative co-expression links for several sets of interaction data. For the complete set of approximately 9 million co-expression measurements, a slight bias towards positive measurements was observed ( Lee et al, 2004 ). We first confirmed that the ratio of positive to negative co-expression counts for measurements within the IP-HTMS space (i.e., where one or more of the pair of coexpressed genes corresponded to an IP-HTMS bait) was broadly similar to the bias observed in the complete data set (respective positive to negative ratios are 1.25 and 1.32).
We then found elevated positive to negative ratios for the IP-HTMS data set, the human Y2H data set ( Rual et al, 2005 ) and the set of known interactions ( Ramani et al, 2005 ), suggesting that human gene pairs encoding interacting proteins are also more likely to be coexpressed. The magnitudes of the positive to negative ratios for the IP-HTMS and Y2H data sets are similar (2.33 and 2.50, respectively), whereas the ratio for the known set is significantly higher (4.75). By randomly sampling sets of co-expression pairs from the IP-HTMS space (1028 pairs per set, the same size as the observed overlap; mean ratio=1.32, maximum ratio=1.68), we also confirmed that the ratio of positive to negative co-expression counts for the IP-HTMS data set is statistically significantly higher than expected by chance (P<1e−6, 1 million iterations).
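The random-sampling check on the positive to negative ratio can be illustrated with a short sketch. Co-expression links are represented here simply as +1 (positive) or −1 (negative); the function names `pos_neg_ratio` and `sampled_ratio_p`, the default iteration count and the sampling-without-replacement choice are assumptions, not details given in the text.

```python
import random


def pos_neg_ratio(signs):
    """Ratio of positive to negative co-expression links."""
    positives = sum(1 for s in signs if s > 0)
    negatives = sum(1 for s in signs if s < 0)
    return positives / negatives if negatives else float("inf")


def sampled_ratio_p(space_signs, observed_ratio, sample_size,
                    iterations=2000, seed=0):
    """Fraction of random samples from the space whose ratio reaches the
    observed ratio (an empirical P-value; assumption: sampling without
    replacement)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(iterations):
        sample = rng.sample(space_signs, sample_size)
        if pos_neg_ratio(sample) >= observed_ratio:
            hits += 1
    return hits / iterations
```

In the setting described above, `space_signs` would hold the links within the IP-HTMS space, `sample_size` would be 1028 and `observed_ratio` would be 2.33; a P-value bound such as P<1e−6 corresponds to no sampled ratio reaching the observed one in a million iterations.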
We have also used the integrated IP-HTMS and gene co-expression data for further in-depth discovery of functional relationships between genes. The LYAR (Ly-1 antibody reactive) protein was originally isolated from a mouse T-cell leukemia cell line and shown to be predominantly nucleolar-localized ( Su et al, 1993 ). As an IP-HTMS bait, LYAR identified 79 prey proteins, and of these, 32 were also found as coexpressed genes in the co-expression database ( Lee et al, 2004 ). Twelve of these co-expression links are classed as stringent (co-expression observed across three or more gene expression studies) ( Lee et al, 2004 ), and are represented in Figure 5. All of the 12 co-expressors/interactors are positively coexpressed and are nonrandomly distributed within the distribution of all co-expression P-values for LYAR (see Figure 5). Indeed, two LYAR interactors, BRIX and DKC1, are the two most highly coexpressed genes for LYAR across the complete co-expression database. All of the 12 coexpressing/interacting proteins (except one hypothetical protein) have been documented as nucleolar proteins (see Figure 5). Overall, these coherent co-expression/interaction patterns are not uncommon in our data set: 32 IP-HTMS baits show stringent co-expression with two or more of their prey proteins.
Biological interpretation of the interaction network
Global visualization of the IP-HTMS data set
To aid interpretation of the IP-HTMS data set, we visualized the interaction network in two ways. First, to globally visualize the data set, we developed the bait–bait connectivity map (Figure 6A and B). This visualization reduces the complexity and highlights salient features of the data set by representing only bait proteins and the degree to which they share prey proteins. Second, we visualized selected fragments of the complete (baits and preys) interaction map (Figure 6C–F). The biological significance of two of these maps (the NIMA family kinase, Nek6 interactions and translation initiation and elongation) is discussed in more detail in subsequent sections.
Several features of the data set are clear from the bait–bait map in Figure 6A. First, as shown in the lower part of the graph, many baits (approximately 30%) are poorly connected; that is, the set of prey proteins identified is quite distinct from the set of preys identified for any other bait. This is a consequence of both the empirical filtering that was applied to the data set (whereby frequent prey proteins and proteins observed in the control experiments that would otherwise tend to join all baits to one another were removed) and the fact that the baits selected for the study are proteins implicated in a wide variety of diseases, processes, pathways and complexes. Second, where data from multiple baits from the same complex and process are available, those baits are well connected to one another. Several of these interconnected sets of baits are indicated in Figure 6A and B (cross-referenced by roman numerals). For example, the five baits corresponding to the proteasome included in the study form a largely distinct, well-connected network as shown in Figure 6A and B, panel iv. The complete interaction map for these five baits is shown in Figure 6C. The identified prey proteins include many core and regulatory components of the proteasome. Other well interconnected sets of baits include spliceosome complex components (Figure 6A and B, panel i), chromatin remodeling components (Figure 6A and B, panel ii), the translation and elongation factor baits (Figure 6A and B, panel iii), the 14-3-3 protein baits (Figure 6A and B, panel v) and sumoylation pathway components (Figure 6A and B, panel vi). For several of these well-connected bait clusters, we have also represented the corresponding complete interaction maps (Figure 6A and B, panel iii corresponds to Figure 6F; panel iv corresponds to Figure 6C; and panel vi corresponds to Figure 6D).
Each of the 14-3-3 baits (Figure 6A and B, panel v) identified a largely overlapping set of preys, an anticipated result given that these proteins form homo- and heterodimers in vivo ( Jones et al, 1995 ). A subset of the experiments for the four 14-3-3 baits included in our study were previously reported and analyzed in-depth ( Jin et al, 2004 ) (approximately 60% of the 14-3-3 prey proteins reported in the current study were reported by Jin et al (2004) ). In addition, these authors analyzed the domain profiles of the identified prey proteins and validated the interaction with the Rho GTPase activator, AKAP13, an interaction identified in our study with two (YWHAB and YWHAG) of the four 14-3-3 baits.
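The idea behind the bait–bait connectivity map, representing only baits and the degree to which they share preys, can be sketched as below. The text does not define how the degree of sharing was quantified for Figure 6, so using the raw count of shared preys as the edge weight is an assumption, as are the function names `bait_bait_edges` and `unconnected_baits`.

```python
from itertools import combinations


def bait_bait_edges(preys_by_bait, min_shared=1):
    """Weight each pair of baits by the number of prey proteins they share.

    preys_by_bait: dict bait -> set of preys.
    Returns dict (bait1, bait2) -> shared prey count, for counts >= min_shared.
    """
    edges = {}
    for bait1, bait2 in combinations(sorted(preys_by_bait), 2):
        shared = preys_by_bait[bait1] & preys_by_bait[bait2]
        if len(shared) >= min_shared:
            edges[(bait1, bait2)] = len(shared)
    return edges


def unconnected_baits(preys_by_bait, edges):
    """Baits whose prey sets are distinct from those of every other bait."""
    connected = {bait for pair in edges for bait in pair}
    return set(preys_by_bait) - connected
```

Baits from the same complex would be expected to form densely weighted clusters in such a map, while the poorly connected baits noted above would appear in the output of `unconnected_baits`.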
NIMA family kinases and the mitotic cascade
The NIMA (never in mitosis gene a) kinase was originally described in Aspergillus nidulans as a key regulator of entry into the mitotic cycle. Families of NIMA-related kinases (Nek) have since been found to be widely distributed in eukaryotes, with a conserved role in regulation of mitosis ( Lu and Hunter, 1995; O'Connell et al, 2003 ). In humans, 11 members of the Nek family have been described. Nek6 was previously shown to be essential for mitotic progression in human cells, and was suggested to be particularly important for the metaphase–anaphase transition ( Yin et al, 2003 ) and chromatin condensation ( Hashimoto et al, 2002 ). Expression analysis also suggested an association of Nek family members with chromosome instability and cancer ( Bowers and Boylan, 2004; Hayward and Fry, 2005 ). Nek6 bait was used in three IP-HTMS experiments, and 42 prey proteins were identified (see the interaction map in Figure 6E). Of particular interest in this set of Nek6 interacting proteins are those with roles in the mitotic cascade, chromosomal remodeling and regulation of the cell cycle. Both Nek7 and Nek9 were identified as interacting proteins with high confidence (Nek9 is the highest scoring prey for Nek6). In addition, these two prey proteins are quite specific: Nek7 was only identified in the Nek6 experiments, whereas Nek9 was identified with one other bait, GABARAPL2. Previous work showed that Nek9 activates Nek6 during mitosis and possibly regulates Nek7 as well ( Belham et al, 2003 ). In addition, previous immunoprecipitation experiments showed that Nek6 binds Nek9 (Nercc1) ( Roig et al, 2002 ). The Nek6 experiments also identified key components of the cohesin and condensin complexes (SMC1L1, SMC2L1, MTB, CNAP1, CSPG6 and MCM7) required for condensation, segregation and structural maintenance of chromosomes during mitosis, in addition to several microtubule-associated proteins (EML2, EML3 and EML4).
Although not well understood, the latter ‘echinoderm microtubule-associated’ proteins are thought to play a role in regulation of microtubule dynamics during mitosis ( Eichenmuller et al, 2002 ). In our study, these proteins were only identified with the Nek6 bait. Nek6 also identified several known cell-cycle regulators (WEE1, CDC37 and RBL1), although we note that the retinoblastoma-like protein, RBL1, was previously shown to bind MCM7 ( Sterner et al, 1998 ), suggesting that Nek6 may not bind directly to both of these proteins. Besides the strong association of Nek6 with the mitotic cycle, Nek6 interacts with several proteins involved in nuclear import/export (TNPO3, IPO4, XPO5 and NUP93). Interestingly, NIMA kinase appears to be required for conformational changes to the nuclear pore complex during mitosis in Aspergillus ( De Souza et al, 2004 ). Furthermore, a direct interaction between the nucleoporin 93 kDa (NUP93) and NIMA kinase has been shown and suggested to be required for nuclear accumulation of mitotic regulators ( De Souza et al, 2003 ). Our data therefore suggest that the Nek family members are also required for nuclear pore complex regulation in higher eukaryotes.
Translation initiation and elongation factors
The molecular mechanisms underlying protein synthesis in eukaryotic organisms are complex and only partially understood ( Kapp and Lorsch, 2004 ). The eukaryotic translation process can be divided into four steps: initiation—the assembly of the ribosome at the initiation codon, elongation—the positioning of aminoacyl tRNAs into the acceptor site, termination—occurring when a stop codon is encountered, and finally the recycling of the ribosomal machinery. As part of our protein interaction mapping, we selected six eukaryotic translation initiation factor (EIF) proteins as baits (EIF2B1, EIF3S10, EIF4A1, EIF4A2, EIF4EBP1 and GC20). A total of 222 interactions were identified for these six baits, primarily with GC20 (162 interactions) and EIF4A2 (42 interactions). Seventy-five interactions have an interaction confidence score greater than 0.3, and 60% of these interactions are with other eukaryotic initiation factor proteins or components of the translational machinery. We focus our discussion here on this subset of the interactions. Our results recapitulate many of the known complexes and steps involved in translation initiation and demonstrate both the specificity and sensitivity of the IP-HTMS approach. Figure 6F shows a bait–prey interaction map for the six initiation factor baits. All of the interactions are shown except for GC20 and EIF4A2 baits, for which only selected prey proteins are shown.
The first step of translation initiation is the formation of a ternary complex between GTP, Met-tRNA and EIF2 and binding of this complex and other EIFs to the 40S ribosomal complex to form the 43S preinitiation complex ( Pestova et al, 2001 ). We observed several complexes that participate in this process. GC20 is a homolog of the yeast SUI1/EIF1 protein, known to be required for binding of the GTP/Met-tRNA/EIF2 complex to the 40S ribosome ( Majumdar et al, 2003 ). In our experiments, the GC20 bait identified several components of the EIF2 complex.
EIF3 is also required for generation of a stable 40S pre-initiation complex. Our experiments with GC20 and EIF3S10 identified many of the EIF3 components; EIF1 (the GC20 homolog) has previously been shown to interact with EIF3 ( Fletcher et al, 1999 ).
The EIF3S10 experiments demonstrate the specificity and sensitivity of the IP-HTMS approach: this bait identified eight prey proteins, seven of which are documented EIF3 subunits. Interestingly, the remaining prey protein, GA17 (dendritic cell protein), contains a Proteasome/COP9/Initiation factor (PCI) domain, a domain of unknown function that is seen in components of multi-subunit complexes such as the proteasome, COP9 and EIF3. Our results support recent work suggesting that GA17 is an additional subunit of EIF3 ( Unbehaun et al, 2004 ).
The next step in the process is mRNA binding to form the 43S pre-initiation complex. EIF4H is known to interact with EIF4A as part of this process and was observed in our experiments ( Richter et al, 1999 ). Both EIF4A and EIF4H were observed in the raw data for the GC20 immunoprecipitation experiments, although EIF4A was removed based on our filtering criteria and EIF4H assigned a low interaction confidence score.
Eukaryotic messenger RNAs contain a modified guanosine, termed a cap, at their 5′ ends. For translation to proceed, binding of an initiation factor, EIF4E, to the cap structure is required ( Richter and Sonenberg, 2005 ). EIF4B binds near the 5′-terminal cap of mRNA in the presence of EIF4F and ATP. EIF4G1, EIF4G2, EIF4E and EIF4A are known components of the EIF4F multi-subunit complex, all of which were observed in our experiments with the EIF4 baits. EIF4E and protein translation as a whole are regulated in part by the EIF4E binding protein, EIF4EBP1 ( Haghighat et al, 1995 ). In our experiments, the EIF4EBP1 bait identified a single prey protein, EIF4E. The PDCD4 (programmed cell death 4) protein was identified as a prey in both EIF4A1 and EIF4A2 experiments. The PDCD4 gene product has been reported to be a tumor and transformation suppressor and proposed as a target for cancer therapy ( Lankat-Buttgereit and Goke, 2003 ). PDCD4 has also been shown to inhibit translation through its binding to EIF4A and EIF4G ( Yang et al, 2004; Zakowicz et al, 2005 ). Our results support these reports and suggest that PDCD4 interacts very specifically with the translation machinery: PDCD4 was seen only with the EIF4A1 and EIF4A2 baits.
Finally, EIF2B functions to recycle the EIF2–GDP complex and recreate EIF2–GTP, which is then ready for a subsequent round of initiation. Immunoprecipitation using EIF2B1 identified six prey proteins, two of which (EIF2B3 and EIF2B5) are documented EIF2B components.
IP-HTMS has provided us with a snapshot of the interactions occurring during the complex process of eukaryotic translation initiation. With six bait proteins covering the major processes of initiation, we are able to identify many relevant interacting proteins and provide a rich data set for further discovery.
This study presents the first high-throughput analysis of native protein complexes by IP-HTMS in a human cell line. As illustrated in this report, our data set provides for both recapitulation of known complexes and discovery of new interactions and complexes. Although our data set maps interactions for proteins implicated in a broad range of pathways and processes, we anticipate that future, focused applications of the IP-HTMS approach will begin to probe in greater depth the impact of disease states and drug treatments on human protein–protein interactions.
This research was funded by grants from the National Institutes of Health (P50AI150476, U19AI135990, U19AI135972, R01AI143292, U54CA209891, R01AG059751, U54NS100717, P01HL146366 and U01MH115747 to NJK), from the Defense Advanced Research Projects Agency (HR0011-19-2-0020, HR00111-9-S0092-FP-FP-002 and HR00111-92-0021 to NJK), and by funding from F. Hoffmann-La Roche and Vir Biotechnology. We are grateful to Qiongyu Li, Monita Muralidharan, Robyn Kaake, and Ruth Hüttenhain for their input on this manuscript.
Protein interaction networks are essential in promoting our understanding of cellular phenotypes, and recording of a reliable, static representation of possible human protein–protein interactions is in progress (Stelzl, 2013; Rolland et al, 2014). However, conditional protein interaction rewiring is key to cellular responses (Ideker & Krogan, 2012; Woodsmith & Stelzl, 2014). Analysis of interaction rewiring provides mechanistic insight into cellular processes (Hegele et al, 2012) and a basis to assess the impact of genetic variation in disease (Zhong et al, 2009; Wei et al, 2014). In cellular responses, signals are often propagated through PTMs such as tyrosine phosphorylation. Reader proteins mediate these changing PTMs through their recognition (Seet et al, 2006), widely reshaping interaction networks in response to the signaling state of the cell.
This study presents a data set of 292 phospho-tyrosine-dependent interactions generated by a large-scale Y2H approach employing human tyrosine kinases (pY-Y2H), which covers part of an interaction space previously unamenable to direct experimental testing. Our approach assays kinase-dependent interactions of full-length proteins in a crowded cellular environment. To ensure specificity of the assay and avoid fitness defects of the yeast due to tyrosine kinase overexpression, bait, prey and protein kinases are expressed at very low levels (Worseck et al, 2012). The interaction patterns obtained show high specificity with respect to human kinases. Kinase-dependent yeast growth as an indicator of phosphorylation-dependent interactions requires two recognition events, that is, prey phosphorylation by the kinase and binding of the phosphorylated protein by the phospho-recognition domain-containing pY-reader protein. Two recognition events decrease the chance of spurious or false-positive signals compared to non-conditional binary protein interactions. However, it is important to note that the interaction patterns observed depend on several parameters that can vary greatly between kinases and interacting pairs. These include: (i) protein expression levels, (ii) kinase activity, (iii) kinase specificity and (iv) interaction specificity. For example, ABL2 promotes the highest number of pY-dependent protein interactions, possibly because its catalytic activity is optimal for the pY-Y2H assay.
Non-receptor tyrosine kinases are subject to tight regulation in human cells as they are typically inhibited in the absence of stimuli (Blume-Jensen & Hunter, 2001). It is prohibitively difficult to control the origin of phosphorylation in any human system as tyrosine kinases have localized activity levels, largely overlapping substrate specificity and function in signaling cascades. Saccharomyces cerevisiae does not contain bona fide tyrosine kinases, and any trans-regulatory components are not in place when screening for pY-PPI in yeast. However, in the pY-Y2H system, we unambiguously reveal kinase candidates which can phosphorylate human proteins, thereby promoting pY-dependent interactions. In an assay particularly designed to control for the absence of interactions, we tested a selected set of 37 interactions using kinase-dead versions of the nine tyrosine kinases used in the pY-Y2H screen. No comparable interaction signal was obtained with the kinase-dead versions for any of the tested interactions, which led us to conclude that the vast majority of the kinase-dependent interactions are indeed phosphorylation dependent in the pY-Y2H assay. Phospho-tyrosine binding proteins are multi-domain proteins, 24 of which contain a tyrosine kinase domain as well. Using them as bait, we found both phospho-dependent interactions and phospho-independent interactions. In the latter cases, we cannot exclude that these interactions are phosphorylation dependent as the phosphorylation of the interaction partner could be due to the kinase activity of the bait itself.
Network analysis of the data set confirms that phospho-tyrosine signaling has evolved as a highly connected modular system governing processes involved in cellular signaling, cell–cell communication and growth response pathways related to cancer (Lim & Pawson, 2010). Substrate binder annotation statistics also revealed strong agreement with our current knowledge of pY function, validating our pY-PPI network. The PPI search was performed through matrix screening examining 13,900 proteins (Entrez GeneID level, covered by 17,000 ORFs) in a highly parallel fashion, searching for pY-dependent binary relationships in a biologically unbiased manner. Nevertheless, we identified pY-dependent interactions revolving around a module of relatively few interacting proteins in the human proteome, in agreement with the observation that PTMs, including tyrosine phosphorylation, selectively accumulate on human protein complexes (Woodsmith et al, 2013). However, widespread Y-phosphorylation was reported on more than 5,000 human proteins in large-scale phospho-proteomics studies (Tan et al, 2009; Hornbeck et al, 2012) and our literature analysis indicates that many more human pY-PPIs are still to be discovered. A fraction of the recorded pY sites may be spurious mass spectrometry identifications or may not function through recognition by pY-reader proteins. However, as with any interaction detection method, false negatives can arise in the pY-Y2H approach. Low expression levels of the hybrid proteins (Venkatesan et al, 2009; Worseck et al, 2012), kinases insufficiently active under the conditions used, and the fact that only a subset of the human non-RTKs was tested all limit coverage of pY-mediated interactions.
Using known phospho-tyrosine sites in known linear amino acid motifs, we predicted that about 1/6 of the interactions are likely mediated through linear peptide motif recognition. Supported by the high fraction of interactions that can be validated in co-IPs in mammalian cells, the data suggest that unknown linear epitopes surrounding the phospho-tyrosine sites or the full-length protein context are crucial for interaction specificity. Thus, this data set provides a wealth of pY-dependent links to investigate potentially novel modes of pY recognition case by case. For example, the SH2 domain of GRB2 is known to bind [pYxNx] peptides that form a β-turn conformation with the specificity-determining N(+2) residue next to W121 in the GRB2 EF-loop closing the binding site (Rahuel et al, 1996). However, a series of peptides with amino acids other than N at the (+2) position (including A, D, E, G, H, I, M, P, Q and R) have been found to bind the GRB2(SH2) domain in systematic array studies with peptides resembling known pY sites (Miller et al, 2008; Tinti et al, 2013). In a series of pull-down experiments (Fig 4A–D) and a standard peptide binding array with purified proteins (Fig 4E and Supplementary Fig S9), we identified Y124 of TSPAN2 as a critical site for GRB2 and PIK3R3 binding. Also in this case, the amino acid sequence surrounding TSPAN2-Y124 may not support binding of the pY peptide in the canonical binding mode (Higo et al, 2013), suggesting further investigation.
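The canonical [pYxNx] preference described above can be illustrated with a small sequence scan. This is only a minimal sketch: the function name and the example peptide are hypothetical, and a sequence scan cannot tell whether a tyrosine is actually phosphorylated, so hits are candidate sites at best.

```python
import re

# Canonical GRB2(SH2) preference from the text: Y-x-N-x, i.e. N at position +2.
# The lookahead lets overlapping windows be reported as separate hits.
YXNX = re.compile(r"(?=(Y.N.))")

def find_yxnx_sites(sequence):
    """Return (1-based tyrosine position, 4-residue window) for each Y-x-N-x match."""
    return [(m.start() + 1, m.group(1)) for m in YXNX.finditer(sequence)]

# Hypothetical peptide for illustration, not a real protein sequence.
example = "AAYVNVGGYAEPYSNQ"
sites = find_yxnx_sites(example)
print(sites)
```

Tyrosines that lack N at +2, such as the second Y in the example, are skipped by this pattern, which is exactly the kind of site the array studies cited above suggest may still bind.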
The role of tetraspanins in cancer is connected to their interactions with integrins, matrix metalloproteases and receptor kinases (Sadej et al, 2014). For example, CD9 and CD151, like other TSPANs, are directly associated with RTKs such as DDR1 (Castro-Sanchez et al, 2010), MET (Klosek et al, 2005; Franco et al, 2010) or EGFR (Murayama et al, 2004, 2008). On the basis of those reports, it can be speculated that TSPANs are targets of RTKs or of other yet unidentified TK activity (Bordoli et al, 2014). Information about tyrosine phosphorylation has been collected in large-scale phospho-proteomics data acquisition projects for 17 TSPAN family members (Hornbeck et al, 2012). In particular, TSPAN8 is reported to be phosphorylated at Y122 (Moritz et al, 2010), which aligns in the corresponding region with Y124 of TSPAN2. In A549 lung adenocarcinoma cells, it has been shown that silencing of CD151 has major effects on downstream tyrosine phosphorylation signaling events involving p130Cas, FAK, paxillin and Src (Yamada et al, 2008). CD151 or CD9 influence cytoskeletal dynamics, inducing cellular spreading and long protrusions similar to the phenotype observed when transfecting TSPAN2 into HEK293 cells (Wang et al, 2011; Blumenthal et al, 2012). Otsubo et al demonstrated that RNAi-mediated TSPAN2 knockdown decreases cell motility and invasive activity in small airway epithelial cells, the putative origin of lung adenocarcinomas, and promotes survival in metastatic mouse models (Otsubo et al, 2014). Our data open up a possible function of phospho-tyrosine 124-mediated interactions related to TSPAN2's involvement in adenocarcinoma invasion and migration. However, as we do not directly demonstrate TSPAN2-Y124 phosphorylation in vivo, we cannot rule out indirect effects of the Y124F mutation, for example, on TSPAN2 glycosylation, related to the observed cellular phenotype.
Thus, we provide a mechanistic entry point to unravel the effects of elevated TSPAN2 levels in cancer such as lung adenocarcinomas (Otsubo et al, 2014). The hypothesis is that conditional pY-dependent interactions with different adaptor proteins, GRB2 or PIK3R3, respectively, may be altered in cancer, suggesting further investigation of TSPAN2 and its interactions in the transition of cells to abnormal growth (Lazo, 2007; Hemler, 2014; Otsubo et al, 2014).
The more detailed characterization of the TSPAN2–GRB2 and TSPAN2–PIK3R3 interactions illustrates how PPIs are spatially constrained. The signaling status of the cell dictates pY-dependent interactions and can give rise to a multitude of differential PPI networks involving an overlapping set of proteins, each specifying the pY-mediated signal flow under certain conditions. Signaling hubs, like GRB2 or PIK3R3, can mediate several distinct responses, triggered, for example, by elevated kinase activity or alterations in local protein concentration, simultaneously and possibly independently. Therefore, assessing the consequences of conditional interactions, that is, edge phenotypes, rather than alterations of proteins alone may prove important for a better understanding of the molecular changes that occur, for example, during cancer development. To this end, our pY-PPI data set may serve as a useful resource.
Protein interaction network of IAV with Homo sapiens
After integrating all the interaction data from literature and databases, the total number of protein interactions between IAV and Homo sapiens is 1477. After removal of the duplicate interactions, this number becomes 1027. Figure 1 shows the protein-protein interaction network constructed in Cytoscape. The number of nodes in the network is 829, among which 14 are viral proteins while 815 are human proteins. The viral protein NS1 has the highest number of interactions (397) with human proteins, followed by NP and PB2 with 184 and 117 associations respectively. Based on the network, we utilized CytoHubba to rank individual nodes by different features as described in the Materials and Methods section. Figure 2 represents the rankings of viral proteins based on their respective numbers of associations with host proteins (see Table 2 for detailed data). Furthermore, CytoCluster was used to cluster the nodes, and from the result we identified the highly connected hubs and potential protein complexes in the biological networks. Figure 3 represents a single highly connected cluster produced by the ClusterOne algorithm, one of the six algorithms in CytoCluster. Through network analyzer tools, we also obtained important statistical features of the network. Table 3 displays some important statistical parameters of the protein-protein interaction network between IAV and human.
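The bookkeeping described above (removing duplicate interactions, counting viral and human nodes, ranking viral proteins by number of human partners) can be sketched in a few lines. This is an illustration on toy data, not the Cytoscape or CytoHubba workflow used in the study; the host names H1 to H3 are placeholders, and the average-degree formula 2E/N is one common definition that may differ from what a given network analyzer reports.

```python
from collections import Counter

# Toy (viral protein, human protein) records; host names are placeholders.
raw_interactions = [
    ("NS1", "H1"), ("NS1", "H2"), ("NS1", "H1"),  # last pair is a duplicate
    ("NP", "H2"), ("PB2", "H3"), ("NP", "H2"),    # last pair is a duplicate
]

# Keep a single copy of each virus-host pair.
unique_interactions = sorted(set(raw_interactions))

viral_nodes = {viral for viral, _ in unique_interactions}
human_nodes = {human for _, human in unique_interactions}

# Degree of a viral protein = number of distinct human partners.
viral_degree = Counter(viral for viral, _ in unique_interactions)
ranking = viral_degree.most_common()

# Average degree of an undirected graph: each edge contributes to two nodes.
n_nodes = len(viral_nodes) + len(human_nodes)
average_degree = 2 * len(unique_interactions) / n_nodes

print(len(raw_interactions), "->", len(unique_interactions), "interactions")
print(ranking, round(average_degree, 2))
```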
Protein-protein interaction network of Influenza A virus with host Homo sapiens constructed in Cytoscape. The network contains 829 nodes (proteins), among which 14 are viral proteins while 815 are human proteins. The total number of edges (interactions) is 1051, and the highly connected nodes tend to form clusters and hubs in the network. Viral proteins NS1, NP and PB2 are shown in a bigger size due to their higher numbers of associations with human proteins. The average number of interactions for a node is 2.4, which means the network is not very dense.
A pie chart of IAV proteins based on their numbers of interactions with human proteins
A highly connected subgraph formed between viral protein NS1 with human proteins
KEGG and GO enrichment analyses of top-notch IAV-associated host proteins
Among the host proteins that interact with IAV proteins, we found that certain human proteins are associated with more viral proteins than others. Human proteins LNX2 and MEOX2 interact with 7 and 6 viral proteins respectively. Host proteins TFCP2, PRKRA and DVL2 were each found to interact with 5 viral proteins, while eight proteins are associated with 4 viral proteins each. Figure 4 shows the resulting graph of KEGG pathway analysis for the top 13 IAV-interacting host proteins. From Fig. 4, it can be seen that the set of genes is highly enriched in the pathway of basal cell carcinoma, which is a type of skin cancer. The other pathways in which the gene set is enriched are the pathways of synaptic vesicle cycle, collecting duct acid secretion, cocaine addiction, ferroptosis, and cortisol synthesis and secretion. Gene Ontology analysis for these highly IAV-interacting host proteins shows that the gene products are involved in important biological processes including snRNA transcription from RNA polymerase III promoter, beta-catenin destruction complex disassembly, response to increased oxygen levels, production of siRNA involved in RNA interference, cellular response to oxygen levels, and certain other processes. The complete depiction of the Gene Ontology information, i.e. biological processes, molecular functions and cellular components, is shown in Figures S1, S2 and S3 in the supplementary materials.
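Selecting host proteins by the number of distinct viral partners, as done above for the top 13 proteins, amounts to a simple grouping step. The sketch below uses hypothetical host names and helper functions of our own naming; it is not taken from the study's pipeline.

```python
from collections import defaultdict

def viral_partners_per_host(interactions):
    """Map each host protein to its number of distinct viral partners."""
    partners = defaultdict(set)
    for viral, host in interactions:
        partners[host].add(viral)
    return {host: len(virals) for host, virals in partners.items()}

def hosts_with_at_least(interactions, k):
    """Hosts with >= k viral partners, most connected first, ties by name."""
    counts = viral_partners_per_host(interactions)
    selected = [host for host, n in counts.items() if n >= k]
    return sorted(selected, key=lambda host: (-counts[host], host))

# Toy data: viral protein names are real IAV proteins, host names are placeholders.
toy = [
    ("HA", "hostA"), ("NA", "hostA"), ("PA", "hostA"), ("HA", "hostA"),
    ("HA", "hostB"), ("NA", "hostB"),
    ("HA", "hostC"),
]
counts = viral_partners_per_host(toy)
top_hosts = hosts_with_at_least(toy, 2)
print(counts, top_hosts)
```

Using a set per host means a repeated record, such as the second ("HA", "hostA") pair, does not inflate the count.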
KEGG pathways in which the 13 highly IAV-interacting host factors are enriched. The color of the bar shows the intensity of the gene set enriched in a specific pathway. The lighter the color, the more enriched the genes are in that pathway
Node clustering analysis
In a protein interaction network, node clusters, that is, highly interconnected node groups, generally consist of two classes of modules: protein complexes and functional modules. With node clustering analysis, researchers have identified some significant functional modules in protein-protein interaction studies on Type 2 Diabetes and Burkitt’s lymphoma. In order to detect the potential protein complexes and functionally important modules in the IAV-human protein interaction network, node clustering was performed with different methods including the hierarchical, k-means, k-medoid, density-based and spectral-based clustering algorithms. The hierarchical clustering algorithm builds a tree that connects nodes in the network hierarchically, while the density-based algorithm detects densely connected subgraphs in the network. Figure 5 shows multiple clusters made for the IAV-human protein interaction network by MCODE, CytoCluster, ClusterViz and ClusterOne respectively. The cluster detected by MCODE has 10 nodes, of which HA, NA, PA, PB1 and M2 are viral proteins while LNX2, TFCP2, DVL2, DVL3 and MEOX2 are human proteins. In the cluster made by CytoCluster, there are 9 nodes, among which PB1-F2 is a viral protein while MIF, UGP2, PLAUR, PASK, ARHGEF1, GNB2, PLSCR1 and SNRK are human proteins. There are 9 nodes in the cluster detected by ClusterViz, where HA is a viral protein while SLC25A1, ARF1, HLA-DRB1, COL413BP, SLC25A6, ARF4, HLA-B and EEF1G are human proteins. The cluster made by ClusterOne has 6 nodes, with NP as a viral protein and KPNA6, ACTN4, MOV10, NCBP1 and LARP1 as human proteins. These nodes are densely interconnected with each other but sparsely associated with the other nodes, and the clusters formed by them contain potential protein complexes and functionally important modules in the IAV-human protein interaction network.
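The idea that a cluster is "densely interconnected but sparsely associated with the other nodes" can be made concrete by comparing edges inside a candidate node set with edges leaving it. The score below is a simplified, hypothetical cohesiveness-style ratio for illustration only; it is not the scoring actually implemented by MCODE, CytoCluster, ClusterViz or ClusterOne.

```python
def cluster_edge_counts(edges, cluster):
    """Count edges fully inside the cluster and edges crossing its boundary."""
    members = set(cluster)
    internal = boundary = 0
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside == 2:
            internal += 1
        elif inside == 1:
            boundary += 1
    return internal, boundary

def cohesiveness(edges, cluster):
    """Fraction of the cluster's edges that stay inside it (0.0 if it has none)."""
    internal, boundary = cluster_edge_counts(edges, cluster)
    total = internal + boundary
    return internal / total if total else 0.0

# Toy undirected network with placeholder node names.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
tight = cohesiveness(toy_edges, {"a", "b", "c"})   # triangle with one outgoing edge
loose = cohesiveness(toy_edges, {"c", "d"})        # one internal, three boundary edges
print(tight, loose)
```

A value near 1 indicates a group whose connections are mostly internal, which is the property the clusters reported above are described as having.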
Node clusters made for IAV-human protein interaction network in Cytoscape with different clustering algorithms including MCODE a), CytoCluster b), ClusterViz c) and ClusterOne d)
Common host factors involved in IAV and HCV infection pathways
In our previous study on the protein interaction network of Hepatitis C Virus with Homo sapiens, we identified potential human proteins that interact with HCV viral proteins. Here, by comparing the results from the current and previous studies, we detected common host factors involved in both viral infection pathways. Out of 815 host factors from the current study and 940 identified in the previous study, 138 host proteins were found in both studies, which means that these gene products are involved in both HCV and IAV infectious pathways. Table S1 in the supplementary materials lists the 138 common host factors. Figure 6 shows the results of KEGG pathway analysis for these proteins. It can be seen that these proteins are not only involved in HCV and IAV infectious pathways but also enriched in the infectious pathways of Hepatitis B Virus, viral carcinogens, measles, human T-cell leukemia virus 1 and human cytomegalovirus.
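The comparison of the two host-factor lists is a set intersection. A minimal sketch follows, with placeholder gene identifiers and a helper name of our own choosing; the real lists are the ones referenced in Table S1.

```python
def overlap_summary(hosts_a, hosts_b):
    """Shared members of two host-factor lists and the share of each list they cover."""
    a, b = set(hosts_a), set(hosts_b)
    shared = a & b
    return {
        "shared": sorted(shared),
        "n_shared": len(shared),
        "fraction_of_a": len(shared) / len(a) if a else 0.0,
        "fraction_of_b": len(shared) / len(b) if b else 0.0,
    }

# Placeholder identifiers standing in for the IAV and HCV host-factor lists.
iav_hosts = ["G1", "G2", "G3", "G4"]
hcv_hosts = ["G3", "G4", "G5"]
summary = overlap_summary(iav_hosts, hcv_hosts)
print(summary)
```

With the counts reported above, 138 shared proteins correspond to about 17% of the 815 IAV host factors and about 15% of the 940 HCV host factors.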
KEGG pathways enriched among the common host genes involved in the infectious pathways of both IAV and HCV
These authors contributed equally: Elham Amjad and Solmaz Asnaashari.
Biotechnology Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
Elham Amjad, Solmaz Asnaashari, Babak Sokouti & Siavoush Dastmalchi
School of Pharmacy, Tabriz University of Medical Sciences, Tabriz, Iran
B.S. and S.D. contributed to the design and implementation of the research. E.A. and S.A. worked out the numerical calculations and outcomes for the experiment. All authors (E.A., S.A., B.S. and S.D.) discussed and aided in interpreting the results and contributed to the final manuscript.