PrismEXP is published at PeerJ and can be accessed here: https://peerj.com/articles/14927/

When using PrismEXP please cite:
Lachmann A, Rizzo KA, Bartal A, Jeon M, Clarke DJB, Ma’ayan A. 2023. PrismEXP: gene annotation prediction from stratified gene-gene co-expression matrices. PeerJ 11:e14927 https://doi.org/10.7717/peerj.14927

Tutorial Video

This video is 18 minutes long and discusses the basic principle of PrismEXP as well as ways you can use and interact with the software.

The PrismEXP algorithm

The PrismEXP algorithm is based on the idea behind the Simpson's Paradox. The Simpson's Paradox states that two variables can exhibit relationships (e.g., correlation) that differ between subsets of the data compared to their relationship across the entire data. An example of this behavior is depicted in the figure below. The figure is showing five clusters in which when observed individually, we see a positive correlation between variable X and Y, but the correlation is negative when all samples are considered.

(Source: Wikimedia, Pace~svwiki, distributed under CC-BY-SA 4.0 licence)

PrismEXP tries to maximize the information from large collection of gene expression studies, namely, ARCHS4, to make gene annotation predictions. To achieve this, gene expression samples are clustered into groups of samples with a similar expression profile. This clustering is unsupervised. It automatically groups samples into clusters likely related to their natural division into tissues, cell types, and cell lines. A correlation matrix is computed for each cluster (e.g., 100 clusters).

So how does PrismEXP make annotation predictions for a specific gene?

\(g_1\) Assuming we have computed \(N\) correlation matrices and a gene \(g_1\), we now need a database with known gene annotations in the form of a GMT file. A GMT file contains, in each row, a gene annotation term, for example a biological pathway, followed by a set of gene symbols. All genes following the biological term are annotated with it. The genes in a row of a GMT file are also called a gene set. In this example, all genes in the gene set are annotated as members of the pathway. PrismEXP tries to extend a given gene set with genes that have similarities to the genes of that set. In the simple example of a single correlation matrix, a logical solution would be to rank all genes based on their average correlation to the genes in the gene set. This method already works quite well in recovering known gene annotations and was used previously in the software tools and web-based resources ARCHS4 and Geneshot.

Example:

Gene set \(\mathbb{S}\), with \(|\mathbb{S}| = M\) from GO_Biological_Processes_2021:
mismatch repair (GO:0006298) = {MLH1, LIG1, RPA2, RPA3, RPA1, MSH2, ...}
average correlation for: \[cor(\mathbb{S}, g_1) = \frac{cor(\textrm{MLH1}, g_1) + cor(\textrm{LIG1}, g_1) + cor(\textrm{RPA2}, g_1) + \ldots }{M}\]


But in PrismEXP, instead of just having one correlation matrix, we have many. We require a strategy of ranking genes while taking the average correlation score from each correlation matrix into account. Here is where the machine learning step comes into play. PrismEXP takes in multiple average correlation scores (derived from correlation matrices \(C_1, \ldots, C_N\)) and maps them to a single association score \(a_{\textrm{score}}\): \[a_{\textrm{score}}(\mathbb{S}, g_1, \{C_1, \ldots, C_N\}) = f(cor(\mathbb{S}, g_1 | C_1), \ldots, cor(\mathbb{S}, g_1 | C_N))\]
The function \(f\) represents a machine learning regression algorithm. The choice of the regression algorithm is not not critical for PrismEXP. Any suitable algorithm would work for this. For example, Random Forest performs well. PrismEXP uses the lightGBM algorithm that is a gradient-boosting framework that uses tree-based learning algorithms. lightGBM is related to xgboost with improved performance.

Training the model

To train \(f\) we select a random pair of gene \(g_x\) and gene set \(\mathbb{S}\). There are now two scenarios: \(g_x \in \mathbb{S}\) or \(g_x \notin \mathbb{S}\). If \(g_x \in \mathbb{S}\) remove the gene from \(\mathbb{S}\) (\(\mathbb{S} - \{g_x\}\)) and caclulate all average correlation scores \(cor(\mathbb{S} - \{g_x\}, g_x, C_{i})\), for \(1 \leq i \leq N\). \(f\) should produce a 1 if \(g_x \in \mathbb{S}\), otherwise 0. We sample many such pairs to build our training set. The idea is that PrismEXP can learn dependencies between different correlation matrices helping the algorithm to differentiate between true and false samples.

As a result of training, PrismEXP will take \(N\) correlation matrices, a gene set \(\mathbb{S}\), and a gene \(g_x\) and will output an association score. The higher the association score, the more likely \(g_x\) belongs to the set \(\mathbb{S}\).

The PrismEXP website

Below is a guide that covers the various features of the PrismEXP website. This website provides access to precomputed PrismEXP predictions for several commonly used gene set libraries.
This is the top of the page of the PrismEXP site. It contains an input field for gene symbols, and a submit button. Enter any gene of interest to receive predictions about its likely annotations. When you start typing, valid gene symbol suggestions will be presented.

After submitting the gene symbol, the PrismEXP website will display precomputed predicted and known gene annotations for several gene set libraries. For each gene set library, there will be a separate table. The Gene AUC value represents how well known gene annotations are predicted by the PrismEXP algorithm from the given library. In this example, the AUC of SOX2 ChEA annotations is 0.706.

In the result table, one of the columns is the set reliability. It is the AUC of the gene set with respect to all genes. In this example, the target genes of the transcription factor RING1B were predicted with an AUC of 0.856. This score is available because all gene predictions were precomputed.

If an annotation was previously known, the row in the prediction table is highlighted in cyan. In the example, two of the top three predictions are already known annotations for the gene.
Once a gene symbol is searched on the PrismEXP website, there will be a second option to calculate gene annotations using the PrismEXP Appyter. Since the Appyter is performing real-time predictions on a pre-trained model, all gene set libraries in Enrichr are possible to be selected. Select a desired library and select the Appyters button. This will redirect the user to the Appyter page. Calculating the predictions takes around 2 minutes since the features have to be computed from 100 correlation matrices.

The PrismEXP Appyter

The PrismEXP Appyter can be accessed from the PrismEXP website, or by navigating directly to https://appyters.maayanlab.cloud/#/PrismEXP. Once a gene symbol is submitted for analysis from the PrismEXP website, there will be an option to submit the gene for analysis with the PrismEXP Appyter. Since the Appyter is performing real-time predictions with a pre-trained model, all gene set libraries in Enrichr can be selected. You can select a desired library, and then click on the Appyter button. This will redirect you to the Appyter page. Calculating the predictions takes about 2 minutes since the features have to be computed from 100 correlation matrices.

Once the Appyter started executing, it will take approximately 2 minutes to finish the calculations. The result will include a validation plot displaying the ROC curve for the known gene annotations in the provided gene set library. This file can also be downloaded as a PDF or as a PNG image files. Generally, a high AUC can give you confidence that other highly ranked gene annotations are relevant for the gene of interest.

The prediction table is the final output of the PrismEXP Appyter. It shows a ranking of terms from the given gene set library. The top term is the most associated with your queried gene by the PrismEXP algorithm. The table contains four columns (predictions, z-score, p-value, and Bonferroni). The predictions column contains the direct output from PrismEXP. In this example, HEMOPOIESIS is the top prediction. The z-score is calculated by computing a z-score normalization over the whole prediction column. p-values are calculated from the z-score, assuming a normal distribution. The Bonferroni column contains p-values corrected for multiple hypothesis testing. It should be noted that due to the nature of the regression model, the ranking is more important than the p-values. As a result of the model, p-values are usually not very significant. Since p-values are calculated from the population of the association scores, very small p-values indicate that there are few associations for the given gene in the gene set library.