The LINCS Data Portal 3.0 (LDP3) enables users to query the datasets and signatures generated by the LINCS consortium. LDP3 is built with the Signature Commons framework (Figure 1). LDP3 has RESTful APIs, and a dedicated web interface for exploring the LINCS data. In this documentation, we provide details on how the LINCS data was processed, ingested, and benchmarked. We also provide tutorials on how to use the website and the APIs.
Level 3 L1000 gene expression profile datasets were downloaded from CLUE.io (https://clue.io/) on June 2, 2021, and divided into batches based on the batch ID of each profile. For example, profiles corresponding to the IDs LJP008_NEU_24H_X1_B21:N15 and LJP008_NEU_24H_X2_B21:O06 would both belong to the batch LJP008_NEU_24H.
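The batching rule can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: it assumes that the batch ID is always the first three underscore-separated fields of the profile ID, as in the example above, and the helper names `batch_id` and `group_by_batch` are hypothetical.

```python
from collections import defaultdict


def batch_id(profile_id):
    """Return the batch portion of a profile ID.

    Assumption: the batch is the first three '_'-separated fields,
    e.g. LJP008_NEU_24H_X1_B21:N15 -> LJP008_NEU_24H.
    """
    return "_".join(profile_id.split("_")[:3])


def group_by_batch(profile_ids):
    """Group profile IDs into a dict keyed by batch ID."""
    batches = defaultdict(list)
    for pid in profile_ids:
        batches[batch_id(pid)].append(pid)
    return dict(batches)


profiles = ["LJP008_NEU_24H_X1_B21:N15", "LJP008_NEU_24H_X2_B21:O06"]
print(group_by_batch(profiles))
```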
Using the Level 5 signature metadata file, also downloaded from CLUE.io on June 2, 2021, the replicate profile IDs corresponding to each signature were identified. Each signature was then computed using the Characteristic Direction method [1], which compares the signature's replicate profiles (the “perturbations”) to all other profiles in the batch to which the signature belongs (the “controls”). The resulting signature is a vector of coefficients representing the differential expression of each gene in the perturbation profiles when compared to the control profiles.
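The perturbation/control split described above can be illustrated as follows. This is only a sketch of the partitioning step, not of the Characteristic Direction computation itself. The function name `split_batch` and the profile IDs in the example are hypothetical.

```python
def split_batch(batch_profiles, replicate_ids):
    """Partition one batch into perturbation and control profiles.

    Profiles listed as replicates of the signature are the perturbations;
    every other profile in the same batch serves as a control.
    """
    replicates = set(replicate_ids)
    perturbations = [p for p in batch_profiles if p in replicates]
    controls = [p for p in batch_profiles if p not in replicates]
    return perturbations, controls


# Hypothetical profile IDs, used only for illustration.
batch = ["BATCH_A_24H_X1:A01", "BATCH_A_24H_X2:A01",
         "BATCH_A_24H_X1:B02", "BATCH_A_24H_X2:C03"]
signature_replicates = ["BATCH_A_24H_X1:A01", "BATCH_A_24H_X2:A01"]

perts, ctrls = split_batch(batch, signature_replicates)
print(perts)
print(ctrls)
```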
All signatures may be accessed individually via the persistent_id field in the metadata for each signature. Compiled signatures by perturbation type can also be accessed from the Download page.
The characteristic direction signatures were benchmarked against three other differential gene-expression analysis methods: fold change, limma [2], and MODZ [3]. Signatures corresponding to the perturbation of cells by the drug dexamethasone were chosen for benchmarking. Specifically, the chosen perturbations are 10 uM treatments of dexamethasone, followed by expression profiling at 24 hours of exposure in the A549 cell line. Each signature was computed by each of the four differential gene expression analysis methods, including the characteristic direction. The top up- and down-regulated genes were identified for each method applied to each signature, and then compared to the NR3C1/Glucocorticoid Receptor target gene set from the ENCODE ChIP-seq library available from ChEA3 [4], as well as to manually computed up and down gene sets from all dexamethasone-related GEO studies from CREEDS [5]. Overlap between the signature-ranked genes and the gene sets was visualized with bridge plots.
The bridge plots below show the results after the signatures computed for all dexamethasone perturbations were averaged per method, so that each line corresponds to a single differential gene expression analysis method. In general, the characteristic direction method performs better than the MODZ method and is comparable to limma. It is important to note that the L1000 assay only profiles the expression of 978 genes, while the expression of ~12,000 genes is inferred. Because these signatures are compared to independent data that are expected to have some concordance with them, we consider methods that increase this concordance to be better. However, other benchmarks should also be applied.
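For readers unfamiliar with bridge plots, the sketch below shows one common way to compute the underlying random walk: walking down the ranked gene list, the line steps up when a gene belongs to the reference gene set and down otherwise, with step sizes chosen so that the walk returns to zero. The exact normalization used for the LDP3 figures is not stated here, so treat this as an assumption; the function name `bridge_walk` and the gene labels are hypothetical.

```python
def bridge_walk(ranked_genes, gene_set):
    """Return the running-sum walk of a bridge plot.

    Steps up by 1/(number of hits) for genes in gene_set and down by
    1/(number of misses) otherwise, so the walk ends at zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need at least one hit and one miss")
    up, down = 1.0 / n_hits, 1.0 / n_miss
    walk, height = [], 0.0
    for hit in hits:
        height += up if hit else -down
        walk.append(height)
    return walk


ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]  # most up-regulated first
targets = {"G1", "G2", "G5"}                   # reference gene set
walk = bridge_walk(ranked, targets)
print([round(h, 3) for h in walk])
```

A walk that rises early and stays above zero indicates that the reference genes are concentrated at the top of the ranking.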
LDP3's signature search enables users to perform signature similarity searches across over 1.5 million L1000 and other signatures. Users can input up and down genes which can be validated against all the genes that are present in the LDP3 database. This automated validation also provides synonym suggestions for non-standard gene names. Enrichment analysis is performed against seven LINCS datasets, with each dataset corresponding to a specific perturbation type. The top mimicker and reverser signatures are then displayed as bar charts.
Figure 4 Gene terms entered as queries to the up and down input boxes are validated automatically against the genes within the LDP3 database. Genes are labeled based on whether they are found in the database, while suggestions are provided for non-standard gene names and synonyms.
Figure 5 Users can query LDP3 with a single gene with three options. The first option, (1) converts the gene into two sets of the mostly positively and negative correlated co-expressed genes based on data complied from the ARCHS4 resource [5]. These top co-expressed genes are used as the up and down input gene sets. The second option, (2) finds signatures with the gene name make part of the perturbagen, for example, if the gene was knocked down or over expressed. The search with this option directs the user to the metadata search panel. The third option, (3) returns signatures that maximally up or down regulate the input gene (more information is provided below under the metadata search subheading).
Figure 6 The top mimicker and reverser signatures for each query input signature are displayed as bar charts. Hovering over the bars in the chart displays a tooltip with more information about the signature, for example, the signature full name and the enrichment analysis z-scores.
Figure 7 Expanded view. Each panel of the results from a query is expandable. Upon expansion, users can view the top mimickers and reversers in one bar chart and two tables. A dedicated search bar is made available for filtering the search results by term. Download buttons are also provided for downloading the tables.
Figure 8 Scatter plot visualization of the enrichment results. Users can view the signature search results on a scatter plot with the z-up and z-down scores as the x and y axes, respectively. On these plots, each point represents a matching signature. Signatures located in the top right quadrant are the mimickers, while those in the bottom left quadrant are the reversers. Hovering over a point on the plot shows the metadata of the signature as well as the z-scores. Users can change how the points are colored by clicking on the radio buttons at the right. Furthermore, users can perform a signature search against signatures with a specific perturbagen by (1) clicking on the top perturbagen, (2) searching for a perturbagen using the text field, or (3) clicking on a node with the desired perturbagen. This search shows how consistently the perturbagen induces the same changes across many conditions.
LDP3's metadata search enables users to search the content of the LINCS dataset to identify available datasets to download and analyze. Users can input any search term and then they are presented with matching entries from the LDP3 dataset. Once results are returned, further filtering is provided to narrow down the search.
Figure 9 Metadata search. LDP3 provides a metadata search engine that enables users to search for datasets, signatures, or genes by submitting any search term. Such a search term can be, for example, an assay, a cell line, a drug, a disease, or a gene name. The search results can be further refined by subsetting them by DSGC, dataset, cell line, and perturbation.
Figure 10
For L1000 metadata signature search results, users have the option to download the returned signatures as (1) a full rank signature file, or (2) a GMT file containing the top up- and down-regulated genes. Furthermore, users have the option to apply these top up- and down-regulated genes as input for signature search, or send them to Enrichr for enrichment analysis [6-8].
Figure 11 For L1000 metadata signature search results, users have the option to download the returned signatures as (1) a full rank signature file, or (2) a GMT file containing the top up- and down-regulated genes. Furthermore, users have the option to apply these top up- and down-regulated genes as input for signature search, or send them to Enrichr for enrichment analysis [6-8].
When a user clicks on a result from the metadata search, they are taken to the dedicated metadata page of that entry. Depending on the dataset type, this page displays the metadata of the entry, as well as the entry's related signatures, genes, and datasets.
Figure 12 Gene landing pages that contain the signatures that maximally up- or down-regulate the gene are provided by LDP3. Results from these pages can be filtered by DSGC, dataset, cell line, or perturbagen.
Figure 13 Signature dedicated landing pages in LDP3 display the available metadata for each signature as well as the top up- and down-regulated genes for each signature.
Figure 14 Dataset pages show the metadata of the dataset along with the relevant download links, as well as the signatures (if any) that belong to that dataset.
Figure 15 DSGC dedicated landing pages can be accessed from the DSGCs tab. These pages list the available datasets produced by each LINCS DSGC.
The DCIC-processed LINCS L1000 data and the other LINCS dataset packages can be downloaded from the LDP3 download page. Users can download the coefficient tables, GMT files of the up and down gene sets, as well as the predicted RNA-Seq profiles created by applying a deep learning algorithm (CycleGAN) [11].
Figure 16 Screenshot from the LDP3 Download Page
LDP3 can be accessed via well-documented RESTful APIs. These APIs are documented with OpenAPI and smartAPI [12]. This documentation can be viewed by going to the API page.
Figure 17 Screenshot from the LDP3 API page
The next section provides examples on how to use the LDP3 RESTful APIs.
In this section, we describe the two types of RESTful APIs provided by LDP3 for users who wish to access the LINCS data programmatically. The metadata API (https://ldp3.cloud/metadata-api) provides fast full-text searches as well as querying of data aggregates. Metadata searches are structured using LoopBack 4 queries. The data API (https://ldp3.cloud/data-api), on the other hand, provides enrichment analysis against LINCS signatures.
Here, we explore several use cases and provide Python code that users can use as a template for more complex queries. To start, make sure you have the requests library installed via pip. More technical information about the LDP3 API is available here: https://ldp3.cloud/#/API.
Users can utilize the full-text search capabilities of LDP 3.0 to search for datasets, signatures, or genes using the following URL https://ldp3.cloud/metadata-api/
Suppose we want to search for datasets (libraries) that contain the word proteomics. We structure our query as follows:
import requests
import json

API_URL = "https://ldp3.cloud/metadata-api/libraries/find"

# Full-text search over the library metadata, limited to two results
payload = {
    "filter": {
        "where": {
            "meta": {
                "fullTextSearch": "proteomics"
            }
        },
        "limit": 2
    }
}

res = requests.post(API_URL, json=payload)
results = res.json()
print(json.dumps(results, indent=2))
LDP 3.0 stores its metadata as semi-structured JSON serialized entries. Because of this, we can also filter results via metadata fields. Here we show how to find signatures that are perturbed with a CRISPR Knockdown:
import requests
import json

API_URL = "https://ldp3.cloud/metadata-api/signatures/find"

# Filter on a specific metadata field instead of a full-text search
payload = {
    "filter": {
        "where": {
            "meta.pert_type": "CRISPR Knockdown"
        },
        "limit": 2
    }
}

res = requests.post(API_URL, json=payload)
results = res.json()
print(json.dumps(results, indent=2))
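Both queries above share the same envelope: a `filter` object holding a `where` clause and optional modifiers. As a convenience, that envelope could be assembled by a small helper such as the one sketched below. The helper name `build_find_payload` is hypothetical, and only the keys that already appear in this documentation (`where`, `limit`, `fields`) are used.

```python
import json


def build_find_payload(where, limit=None, fields=None):
    """Assemble the JSON body for a <model>/find request."""
    flt = {"where": where}
    if limit is not None:
        flt["limit"] = limit
    if fields:
        flt["fields"] = list(fields)
    return {"filter": flt}


payload = build_find_payload({"meta.pert_type": "CRISPR Knockdown"}, limit=2)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the `json` argument of `requests.post`, exactly as in the examples above.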
Although JSON serialization allows for diverse metadata fields, LDP 3.0 metadata still follows certain structures, which are defined here. These structures make queries such as the previous example possible: in that example, we can be sure that the field meta.pert_type exists because it is required. An easy way to get the fields available for querying without going through the validators is to fetch them from https://ldp3.cloud/metadata-api/<model>/key_count. This returns the available keys and how many entries have each field. For example, to view the available search keys for the genes (entities):
import requests
import json

API_URL = "https://ldp3.cloud/metadata-api/entities/key_count"

payload = {
    "limit": 15
}

# The filter is passed as a JSON-encoded query parameter
res = requests.get(API_URL, params={"filter": json.dumps(payload)})
results = res.json()
print(json.dumps(list(results.keys()), indent=2))
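Since the response pairs each key with the number of entries that have it, the keys can be ordered by coverage to see which fields are safest to query. The sketch below assumes the response is a flat JSON object mapping key names to integer counts; the helper name `most_common_keys` and the example keys and counts are hypothetical and are not real output from the API.

```python
def most_common_keys(key_counts, n=5):
    """Return the n (key, count) pairs with the highest counts.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    ordered = sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:n]


# Hypothetical response, for illustration only.
example = {"meta.symbol": 100, "meta.description": 40, "meta.ncbi_gene_id": 80}

for key, count in most_common_keys(example, n=2):
    print(key, count)
```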
The count endpoint https://ldp3.cloud/metadata-api/<model>/count accepts GET requests to count the number of entries in a model. Users can also pass a where filter to get a filtered count. Here we show how to get the number of datasets (libraries) that use the L1000 mRNA profiling assay.
import requests
import json

API_URL = "https://ldp3.cloud/metadata-api/libraries/count"

# The where clause is passed directly, without a surrounding filter object
payload = {
    "meta.assay": "L1000 mRNA profiling assay"
}

res = requests.get(API_URL, params={"where": json.dumps(payload)})
results = res.json()
print(json.dumps(results, indent=2))
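Because the count endpoint is called with GET, the where clause travels in the query string as URL-encoded JSON. The sketch below uses only the Python standard library to show roughly what that URL looks like and to confirm that the clause survives a round trip. It performs no network request, and the exact percent-encoding produced by other HTTP clients may differ slightly.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

API_URL = "https://ldp3.cloud/metadata-api/libraries/count"
where = {"meta.assay": "L1000 mRNA profiling assay"}

# Serialize the where clause to JSON, then URL-encode it as a query parameter
url = API_URL + "?" + urlencode({"where": json.dumps(where)})
print(url)

# Decode the query string again to verify the round trip
decoded = json.loads(parse_qs(urlsplit(url).query)["where"][0])
print(decoded)
```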
The value count endpoint https://ldp3.cloud/metadata-api/<model>/value_count is used to count the values of a specific field. This is particularly useful for obtaining the top assays, cell lines, or perturbations of a model. Below we show how to get the top 25 cell lines of the signatures perturbed with dexamethasone.
import requests
import json

API_URL = "https://ldp3.cloud/metadata-api/signatures/value_count"

# Count cell lines among the signatures perturbed with dexamethasone
payload = {
    "where": {
        "meta.pert_name": "dexamethasone"
    },
    "fields": ["meta.cell_line"],
    "limit": 25
}

res = requests.get(API_URL, params={"filter": json.dumps(payload)})
results = res.json()
print(json.dumps(results, indent=2))
To perform a signature search, first convert the input gene symbols to their LDP3 entity IDs using the metadata API:
import requests
import json

METADATA_API = "https://ldp3.cloud/metadata-api/"
DATA_API = "https://ldp3.cloud/data-api/api/v1/"

input_gene_set = {
    "up_genes": ["TARBP1", "APP", "RAP1GAP", "UFM1", "DNAJA3", "PCBD1", "CSRP1"],
    "down_genes": ["CEBPA", "STAT5B", "DSE", "EIF4EBP1", "CARD8", "HLA-DMA", "SERPINE1"]
}
all_genes = input_gene_set["up_genes"] + input_gene_set["down_genes"]

# Look up the entities whose symbol matches any of the input genes
payload = {
    "filter": {
        "where": {
            "meta.symbol": {
                "inq": all_genes
            }
        },
        "fields": ["id", "meta.symbol"]
    }
}
res = requests.post(METADATA_API + "entities/find", json=payload)
entities = res.json()

# Sort the returned entity IDs into the up and down sets
for_enrichment = {
    "up_entities": [],
    "down_entities": []
}
for e in entities:
    symbol = e["meta"]["symbol"]
    if symbol in input_gene_set["up_genes"]:
        for_enrichment["up_entities"].append(e["id"])
    elif symbol in input_gene_set["down_genes"]:
        for_enrichment["down_entities"].append(e["id"])

print(json.dumps(for_enrichment, indent=2))
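Gene symbols that are not present in the database are silently absent from the response above, so it can be useful to check which inputs were not matched before running the enrichment. The sketch below assumes the same entity shape used in the preceding code (`id` and `meta.symbol`); the helper name `unmatched_symbols` and the mock entities are hypothetical.

```python
def unmatched_symbols(requested, entities):
    """Return the requested symbols that have no matching entity."""
    found = {e["meta"]["symbol"] for e in entities}
    return [g for g in requested if g not in found]


# Mock response with placeholder IDs, for illustration only.
mock_entities = [
    {"id": "hypothetical-id-1", "meta": {"symbol": "APP"}},
    {"id": "hypothetical-id-2", "meta": {"symbol": "UFM1"}},
]

missing = unmatched_symbols(["APP", "UFM1", "NOT_A_GENE"], mock_entities)
print(missing)
```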
For general use, the ranktwosided endpoint takes as input the up and down gene sets, a limit, and the database to use for enrichment. Obtaining the list of available databases for enrichment is discussed in the next section. The data API returns the top matching signatures along with their enrichment scores, ranked by the absolute product of the z-up and z-down scores. A positive z-score means that the genes in the gene set are positioned at the top of the ranking, whereas a negative z-score means that they are positioned at the bottom of the ranking. You can optionally multiply z-down and direction-down by -1 to be consistent with the scatter plots available from the LDP3 user interface. Using this convention, reversers have negative z-up and z-down, while mimickers have positive z-up and z-down.
query = {
    **for_enrichment,
    "limit": 10,
    "database": "l1000_xpr"
}
res = requests.post(DATA_API + "enrich/ranktwosided", json=query)
results = res.json()

# Optional: multiply z-down and direction-down by -1
for i in results["results"]:
    i["z-down"] = -i["z-down"]
    i["direction-down"] = -i["direction-down"]

print(json.dumps(results, indent=2))
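The ranking and the mimicker/reverser convention described above can be expressed as two small functions. This sketch assumes that z-down has already been multiplied by -1 as in the optional step, and that each result carries a `z-up` key alongside `z-down` (the `z-up` key name is inferred from the text, not confirmed). The function names and the mock scores are hypothetical.

```python
def classify(z_up, z_down):
    """Label a signature using the scatter-plot convention."""
    if z_up > 0 and z_down > 0:
        return "mimicker"
    if z_up < 0 and z_down < 0:
        return "reverser"
    return "mixed"


def rank_by_abs_product(results):
    """Sort results by |z-up * z-down|, largest first."""
    return sorted(results,
                  key=lambda r: abs(r["z-up"] * r["z-down"]),
                  reverse=True)


# Mock scores, for illustration only.
mock_results = [
    {"uuid": "a", "z-up": 2.0, "z-down": 1.5},
    {"uuid": "b", "z-up": -3.0, "z-down": -2.0},
    {"uuid": "c", "z-up": 1.0, "z-down": -4.0},
]

ranked = rank_by_abs_product(mock_results)
labels = [(r["uuid"], classify(r["z-up"], r["z-down"])) for r in ranked]
print(labels)
```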
We can now resolve the UUIDs of the returned signatures using the metadata API:
# Index the enrichment results by signature UUID
sigids = {i["uuid"]: i for i in results["results"]}

payload = {
    "filter": {
        "where": {
            "id": {
                "inq": list(sigids.keys())
            }
        }
    }
}
res = requests.post(METADATA_API + "signatures/find", json=payload)
signatures = res.json()

# Merge the scores and the metadata
for sig in signatures:
    uid = sig["id"]
    scores = sigids[uid]
    scores.pop("uuid")
    sig["scores"] = scores

print(json.dumps(signatures, indent=2))
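Once scores and metadata are merged, the result can be flattened into a table for downstream analysis. The sketch below writes a few columns to CSV with the standard library. The metadata fields `pert_name` and `cell_line` appear in earlier queries in this documentation, but whether every signature has them is an assumption, so missing values are written as empty cells. The function name and the mock record are hypothetical.

```python
import csv
import io


def signatures_to_csv(signatures):
    """Flatten merged signature records into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "pert_name", "cell_line", "z-up", "z-down"])
    for sig in signatures:
        meta = sig.get("meta", {})
        scores = sig.get("scores", {})
        writer.writerow([
            sig.get("id", ""),
            meta.get("pert_name", ""),
            meta.get("cell_line", ""),
            scores.get("z-up", ""),
            scores.get("z-down", ""),
        ])
    return buf.getvalue()


# Mock merged record, for illustration only.
mock_signatures = [{
    "id": "sig-1",
    "meta": {"pert_name": "dexamethasone", "cell_line": "A549"},
    "scores": {"z-up": 2.5, "z-down": 1.0},
}]

csv_text = signatures_to_csv(mock_signatures)
print(csv_text)
```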
To view the available databases for signature search, users can use the /listdata endpoint as follows:
import requests
import json

API_URL = "https://ldp3.cloud/data-api/api/v1/listdata"

# List the databases that can be passed to the enrichment endpoint
res = requests.post(API_URL)
databases = res.json()
print(json.dumps(databases, indent=2))
[1] Clark, N. R., Hu, K. S., Feldmann, A. S., Kou, Y., Chen, E. Y., Duan, Q., & Ma’ayan, A. (2014). The characteristic direction: a geometrical approach to identify differentially expressed genes. BMC bioinformatics, 15(1), 1-16. doi: 10.1186/1471-2105-15-79. PMID: 24650281. PMCID: PMC4000056.
[2] Ritchie, M. E., Phipson, B., Wu, D. I., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research, 43(7), e47-e47. doi: 10.1093/nar/gkv007. PMID: 25605792. PMCID: PMC4402510
[3] Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, Gould J, Davis JF, Tubelli AA, Asiedu JK, Lahr DL, Hirschman JE, Liu Z, Donahue M, Julian B, Khan M, Wadden D, Smith IC, Lam D, Liberzon A, Toder C, Bagul M, Orzechowski M, Enache OM, Piccioni F, Johnson SA, Lyons NJ, Berger AH, Shamji AF, Brooks AN, Vrcic A, Flynn C, Rosains J, Takeda DY, Hu R, Davison D, Lamb J, Ardlie K, Hogstrom L, Greenside P, Gray NS, Clemons PA, Silver S, Wu X, Zhao WN, Read-Button W, Wu X, Haggarty SJ, Ronco LV, Boehm JS, Schreiber SL, Doench JG, Bittker JA, Root DE, Wong B, Golub TR. (2017). A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171(6), 1437-1452. doi: 10.1016/j.cell.2017.10.049. PMID: 29195078. PMCID: PMC5990023.
[4] Keenan AB, Torre D, Lachmann A, Leong AK, Wojciechowicz ML, Utti V, Jagodnik KM, Kropiwnicki E, Wang Z, Ma'ayan A. (2019). ChEA3: transcription factor enrichment analysis by orthogonal omics integration. Nucleic acids research, 47(W1), W212-W224. doi: 10.1093/nar/gkz446. PMID: 31114921; PMCID: PMC6602523.
[5] Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. (2018). Massive mining of publicly available RNA-seq data from human and mouse. Nature communications, 9(1), 1-10. doi: 10.1038/s41467-018-03751-6. PMID: 29636450 PMCID: PMC5893633
[6] Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, Clark NR, Ma'ayan A. (2013). Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC bioinformatics, 14(1), 1-14. doi: 10.1186/1471-2105-14-128. PMID: 23586463 PMCID: PMC3637064
[7] Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, Koplev S, Jenkins SL, Jagodnik KM, Lachmann A, McDermott MG, Monteiro CD, Gundersen GW, Ma'ayan A. (2016). Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic acids research, 44(W1), W90-W97. doi: 10.1093/nar/gkw377. PMID: 27141961 PMCID: PMC4987924
[8] Xie Z, Bailey A, Kuleshov MV, Clarke DJB., Evangelista JE, Jenkins SL, Lachmann A, Wojciechowicz ML, Kropiwnicki E, Jagodnik KM, Jeon M, & Ma’ayan A. (2021). Gene set knowledge discovery with Enrichr. Current protocols, 1(3), e90. doi: 10.1002/cpz1.90. PMID: 33780170 PMCID: PMC8152575
[9] Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J, Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O, Edmunds S, Evelo CT, Finkers R, Gonzalez-Beltran A, Gray AJ, Groth P, Goble C, Grethe JS, Heringa J, 't Hoen PA, Hooft R, Kuhn T, Kok R, Kok J, Lusher SJ, Martone ME, Mons A, Packer AL, Persson B, Rocca-Serra P, Roos M, van Schaik R, Sansone SA, Schultes E, Sengstag T, Slater T, Strawn G, Swertz MA, Thompson M, van der Lei J, van Mulligen E, Velterop J, Waagmeester A, Wittenburg P, Wolstencroft K, Zhao J, Mons B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 3(1), 1-9. doi: 10.1038/sdata.2016.18. PMID: 26978244 PMCID: PMC4792175
[10] Clarke DJB, Wang L, Jones A, Wojciechowicz ML, Torre D, Jagodnik KM, Jenkins SL, McQuilton P, Flamholz Z, Silverstein MC, Schilder BM, Robasky K, Castillo C, Idaszak R, Ahalt SC, Williams J, Schurer S, Cooper DJ, de Miranda Azevedo R, Klenk JA, Haendel MA, Nedzel J, Avillach P, Shimoyama ME, Harris RM, Gamble M, Poten R, Charbonneau AL, Larkin J, Brown CT, Bonazzi VR, Dumontier MJ, Sansone SA, Ma'ayan A. (2019). FAIRshake: toolkit to evaluate the FAIRness of research digital resources. Cell systems, 9(5), 417-421. doi: 10.1016/j.cels.2019.09.011. PMID: 31677972 PMCID: PMC7316196
[11] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223-2232). Paper: arXiv:1703.10593
[12] Zaveri, A., Dastgheib, S., Wu, C., Whetzel, T., Verborgh, R., Avillach, P., Korodi, G., Terryn, R., Jagodnik, K.M., Assis, P., & Dumontier, M. (2017). smartAPI: towards a more intelligent network of web APIs. In European Semantic Web Conference (pp. 154-169). Springer, Cham. doi: 10.1007/978-3-319-58451-5_11