ARCHS4 provides access to gene counts from HiSeq 2000 and HiSeq 2500 platforms for human and mouse experiments. The website allows download of the data in H5 format for programmatic access, as well as a 3-dimensional view of the sample space. Search features allow browsing of the data by metadata annotation, signature similarity, and functional enrichment. Selected sample sets can be downloaded into a tab-separated file through auto-generated R scripts for further analysis. Reads are aligned with Kallisto using a custom cloud computing platform. Human samples are aligned against the GRCh38 cDNA reference and mouse samples against the GRCm38 cDNA reference. All processable files from the GEO/SRA database since June 2018 are processed and available for download.
Raw read counts can be downloaded by selecting the Download option at the top of the ARCHS4 website. The raw counts are provided separately for human and mouse samples in H5 format, which compresses the gene count data efficiently. Tab-separated files can be extracted with the provided code or with the auto-generated scripts from the web interface.
Selected sample and gene sets are displayed in the Search Results section. Under Downloads, scripts can be downloaded for the given gene sets. In the case of gene sets, a text file containing the gene symbols is provided.
Gene sets can be exported to Enrichr for further analysis.
The ARCHS4 Chrome extension is a browser extension that adds content to the landing pages of RNA-seq datasets on the Gene Expression Omnibus (GEO) when their samples have been processed by ARCHS4. The extension adds links to download files that contain the gene-level read counts, as well as a heatmap summary of the expression data from the processed samples, visualized with Clustergrammer. The ARCHS4 Chrome extension can be installed from the Chrome Web Store. An example of an enhanced GEO landing page for series GSE77243 is shown.
Gene landing pages are accessible through the search fields on the top right of the ARCHS4 interface. Genes can be searched by Entrez gene symbol. The gene search returns the expression distribution across major tissues and cell lines, together with predictions of biological function and regulatory properties of the target gene. If a gene is already known to be part of a predicted gene set, the terms are marked in green. If a gene has a sufficient number of prior knowledge annotations, a ROC curve shows how well the prior knowledge about the gene can be recovered.
The data view port displays samples or genes relative to the other samples and genes in the dataset. Samples/Genes with similar expression are clustered. The layout is computed with t-SNE.
The visualization is dynamic and allows rotation and zooming. The two icons on the top right are the manual selection tool and a toggle button that moves the view port into a smaller window to the left of the webpage.
The colors of samples and gene sets can be modified in the result section.
The data view can display 4 precalculated data visualizations: human samples, human genes, mouse samples and mouse genes. To toggle between species, select either the human or mouse button on the left. To change from sample view to gene view select one of the yellow buttons. A new view will automatically be loaded on selection. Prior selections are saved and will be loaded again if the former mode is selected.
Hierarchical Data Format (HDF) is an open source file format for storing large datasets. It allows programmatic access to matrix entries by row and column indices while compressing the data efficiently. The H5 files provided by ARCHS4 contain raw read counts as well as detailed metadata extracted from GEO.
row | group | name | otype | dclass | dim |
---|---|---|---|---|---|
0 | / | data | H5I_GROUP | ||
1 | /data | expression | H5I_DATASET | INTEGER | 307268 x 35238 |
2 | / | meta | H5I_GROUP | ||
3 | /meta | genes | H5I_GROUP | ||
4 | /meta/genes | chromosome | H5I_DATASET | STRING | 35238 |
5 | /meta/genes | ensembl_gene_id | H5I_DATASET | STRING | 35238 |
6 | /meta/genes | gene_biotype | H5I_DATASET | STRING | 35238 |
7 | /meta/genes | gene_symbol | H5I_DATASET | STRING | 35238 |
8 | /meta/genes | genes | H5I_DATASET | STRING | 35238 |
9 | /meta/genes | start_position | H5I_DATASET | STRING | 35238 |
10 | /meta | info | H5I_GROUP | ||
11 | /meta/info | author | H5I_DATASET | STRING | ( 0 ) |
12 | /meta/info | contact | H5I_DATASET | STRING | ( 0 ) |
13 | /meta/info | creation-date | H5I_DATASET | STRING | ( 0 ) |
14 | /meta/info | laboratory | H5I_DATASET | STRING | ( 0 ) |
15 | /meta/info | version | H5I_DATASET | INTEGER | ( 0 ) |
16 | /meta | samples | H5I_GROUP | ||
17 | /meta/samples | channel_count | H5I_DATASET | STRING | 307268 |
18 | /meta/samples | characteristics_ch1 | H5I_DATASET | STRING | 307268 |
19 | /meta/samples | contact_address | H5I_DATASET | STRING | 307268 |
20 | /meta/samples | contact_city | H5I_DATASET | STRING | 307268 |
21 | /meta/samples | contact_country | H5I_DATASET | STRING | 307268 |
22 | /meta/samples | contact_institute | H5I_DATASET | STRING | 307268 |
23 | /meta/samples | contact_name | H5I_DATASET | STRING | 307268 |
24 | /meta/samples | contact_zip | H5I_DATASET | STRING | 307268 |
25 | /meta/samples | data_processing | H5I_DATASET | STRING | 307268 |
26 | /meta/samples | data_row_count | H5I_DATASET | STRING | 307268 |
27 | /meta/samples | extract_protocol_ch1 | H5I_DATASET | STRING | 307268 |
28 | /meta/samples | geo_accession | H5I_DATASET | STRING | 307268 |
29 | /meta/samples | instrument_model | H5I_DATASET | STRING | 307268 |
30 | /meta/samples | last_update_date | H5I_DATASET | STRING | 307268 |
31 | /meta/samples | library_selection | H5I_DATASET | STRING | 307268 |
32 | /meta/samples | library_source | H5I_DATASET | STRING | 307268 |
33 | /meta/samples | library_strategy | H5I_DATASET | STRING | 307268 |
34 | /meta/samples | molecule_ch1 | H5I_DATASET | STRING | 307268 |
35 | /meta/samples | organism_ch1 | H5I_DATASET | STRING | 307268 |
36 | /meta/samples | platform_id | H5I_DATASET | STRING | 307268 |
37 | /meta/samples | readsaligned | H5I_DATASET | FLOAT | 307268 |
38 | /meta/samples | readstotal | H5I_DATASET | FLOAT | 307268 |
39 | /meta/samples | relation | H5I_DATASET | STRING | 307268 |
40 | /meta/samples | series_id | H5I_DATASET | STRING | 307268 |
41 | /meta/samples | singlecellprobability | H5I_DATASET | FLOAT | 307268 |
42 | /meta/samples | source_name_ch1 | H5I_DATASET | STRING | 307268 |
43 | /meta/samples | status | H5I_DATASET | STRING | 307268 |
44 | /meta/samples | submission_date | H5I_DATASET | STRING | 307268 |
45 | /meta/samples | taxid_ch1 | H5I_DATASET | STRING | 307268 |
46 | /meta/samples | title | H5I_DATASET | STRING | 307268 |
47 | /meta/samples | type | H5I_DATASET | STRING | 307268 |
44 | /meta | genes | H5I_DATASET | STRING | 35238 |
Scripts to extract tab-separated gene expression files can be created through the graphical user interface of ARCHS4. The script has to be executed as an R script. R is free and can be downloaded from www.r-project.org; the RStudio development environment is available from www.rstudio.com. Upon execution, the script downloads the full gene expression file, unless it is already present in the working directory, and then extracts the selected samples.
# R script to download selected samples
# Copy code and run on a local machine to initiate download
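# Check for dependencies and install if missing (rhdf5 is distributed through Bioconductor)
if (!requireNamespace("rhdf5", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  BiocManager::install("rhdf5")
}
library("rhdf5")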
destination_file = "human_matrix_v9.h5"
extracted_expression_file = "GSM2679484_expression_matrix.tsv"
url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix_v9.h5"
# Check if the gene expression file was already downloaded; if it is not in the current directory, download it from the repository
if(!file.exists(destination_file)){
    print("Downloading compressed gene expression matrix.")
    download.file(url, destination_file, quiet = FALSE, mode = 'wb')
}
# Selected samples to be extracted
samp = c("GSM2679452","GSM2679453","GSM2679454","GSM2679455","GSM2679456","GSM2679457","GSM2679458","GSM2679459","GSM2679460","GSM2679461","GSM2679462","GSM2679463","GSM2679464","GSM2679465","GSM2679466","GSM2679467","GSM2679468","GSM2679469","GSM2679470","GSM2679471","GSM2679472","GSM2679473","GSM2679474","GSM2679475","GSM2679476","GSM2679477","GSM2679478","GSM2679479","GSM2679480","GSM2679481","GSM2679482",
"GSM2679483","GSM2679484","GSM2679485","GSM2679486","GSM2679487","GSM2679488","GSM2679489","GSM2679490","GSM2679491","GSM2679492","GSM2679493","GSM2679494","GSM2679495","GSM2679496","GSM2679497","GSM2679498","GSM2679499","GSM2679500","GSM2679501","GSM2679502","GSM2679503","GSM2679504","GSM2679505","GSM2679506","GSM2679507","GSM2679508","GSM2679509","GSM2679510","GSM2679511","")
# Retrieve information from compressed data
samples = h5read(destination_file, "meta/samples/geo_accession")
genes = h5read(destination_file, "meta/genes/genes")
# Identify columns to be extracted
sample_locations = which(samples %in% samp)
# extract gene expression from compressed data
expression = t(h5read(destination_file, "data/expression", index=list(sample_locations, 1:length(genes))))
H5close()
rownames(expression) = genes
colnames(expression) = samples[sample_locations]
# Print file
write.table(expression, file=extracted_expression_file, sep="\t", quote=FALSE, col.names=NA)
print(paste0("Expression file was created at ", getwd(), "/", extracted_expression_file))
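The tab-separated file written by the script can be loaded back into R for further analysis. The following is a minimal sketch, assuming the layout produced by `write.table` with `col.names=NA` (gene identifiers in the first column, one column per sample). To stay self-contained it writes a small made-up matrix to a temporary file; `toy` and `tsv_path` are hypothetical names, not part of the ARCHS4 script.

```r
# Toy stand-in for the extracted matrix: genes in rows, samples in columns.
# The values and the temporary file are made up for illustration.
toy <- matrix(c(10L, 0L, 5L, 3L, 7L, 1L), nrow = 3,
              dimnames = list(c("A1BG", "A2M", "TP53"),
                              c("GSM2679452", "GSM2679453")))
tsv_path <- tempfile(fileext = ".tsv")
write.table(toy, file = tsv_path, sep = "\t", quote = FALSE, col.names = NA)

# Read the file back: the first column holds the gene identifiers,
# and check.names = FALSE keeps the sample accessions unchanged.
loaded <- as.matrix(read.table(tsv_path, header = TRUE, sep = "\t",
                               row.names = 1, check.names = FALSE))
```

For a real extraction, `tsv_path` would be replaced by the value of `extracted_expression_file`.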
Samples extracted for a specified tissue can originate from multiple series with slightly different experimental conditions. If desired, batch effects can be removed from the gene expression with the ComBat function of the sva package.
# R script to download selected samples
# Copy code and run on a local machine to initiate download
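# Check for dependencies and install if missing (all three packages are distributed through Bioconductor)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
for (pkg in c("rhdf5", "preprocessCore", "sva")) {
    if (!requireNamespace(pkg, quietly = TRUE)) BiocManager::install(pkg)
}
library("rhdf5")
library("preprocessCore")
library("sva")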
destination_file = "human_matrix_v9.h5"
extracted_expression_file = "GSM2679484_expression_matrix.tsv"
url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix_v9.h5"
# Check if the gene expression file was already downloaded; if it is not in the current directory, download it from the repository
if(!file.exists(destination_file)){
    print("Downloading compressed gene expression matrix.")
    download.file(url, destination_file, quiet = FALSE, mode = 'wb')
}
# Selected samples to be extracted
samp = c("GSM2679452","GSM2679453","GSM2679454","GSM2679455","GSM2679456","GSM2679457","GSM2679458","GSM2679459","GSM2679460","GSM2679461","GSM2679462","GSM2679463","GSM2679464","GSM2679465","GSM2679466","GSM2679467","GSM2679468","GSM2679469","GSM2679470","GSM2679471","GSM2679472","GSM2679473","GSM2679474","GSM2679475","GSM2679476","GSM2679477","GSM2679478","GSM2679479","GSM2679480","GSM2679481","GSM2679482",
"GSM2679483","GSM2679484","GSM2679485","GSM2679486","GSM2679487","GSM2679488","GSM2679489","GSM2679490","GSM2679491","GSM2679492","GSM2679493","GSM2679494","GSM2679495","GSM2679496","GSM2679497","GSM2679498","GSM2679499","GSM2679500","GSM2679501","GSM2679502","GSM2679503","GSM2679504","GSM2679505","GSM2679506","GSM2679507","GSM2679508","GSM2679509","GSM2679510","GSM2679511","")
# Retrieve information from compressed data
samples = h5read(destination_file, "meta/samples/geo_accession")
genes = h5read(destination_file, "meta/genes/genes")
# Identify columns to be extracted
sample_locations = which(samples %in% samp)
# extract gene expression from compressed data
expression = t(h5read(destination_file, "data/expression", index=list(sample_locations, 1:length(genes))))
H5close()
rownames(expression) = genes
colnames(expression) = samples[sample_locations]
# normalize samples and correct for differences in gene count distribution
expression = log2(expression+1)
expression = normalize.quantiles(expression)
rownames(expression) = genes
colnames(expression) = samples[sample_locations]
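# correct batch effects in gene expression, using the GEO series of each sample as its batch
series = h5read(destination_file, "meta/samples/series_id")[sample_locations]
batchid = match(series, unique(series))
correctedExpression <- ComBat(dat=expression, batch=batchid, par.prior=TRUE, prior.plots=FALSE)
# Write the corrected expression to file
write.table(correctedExpression, file=extracted_expression_file, sep="\t", quote=FALSE, col.names=NA)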
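To make the quantile normalization step in the script above more concrete, here is a base R sketch of the general technique: every sample (column) is mapped onto a shared reference distribution, taken as the mean of the sorted columns. This is an illustration only; `quantile_normalize` is a hypothetical helper, and its tie handling may differ from `normalize.quantiles` in preprocessCore.

```r
# Quantile normalization of the columns of a matrix (base R sketch).
# Each column is forced onto the same distribution: the mean of the
# sorted columns. Ties are broken by order of appearance, which may
# differ from how preprocessCore handles them.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  reference <- rowMeans(apply(x, 2, sort))
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Made-up log2 expression values: 4 genes (rows) x 3 samples (columns).
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8), nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))
normalized <- quantile_normalize(toy)
```

After normalization all columns contain the same set of values, so differences between samples reflect only the ranking of genes within each sample.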
Metadata search parses the tissue description field from GEO to find matches with the entered search term. The search ignores spaces and is case insensitive. Matching samples are highlighted in the data viewer and an entry is added to the result list.
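The matching rule described above (case insensitive, spaces ignored) can be sketched in base R as follows. This is an assumption about how such a search might behave, not the ARCHS4 implementation; `normalize_text`, `search_metadata`, and the example descriptions are hypothetical.

```r
# Normalize text as the description suggests: lower case, no whitespace.
normalize_text <- function(x) gsub("[[:space:]]+", "", tolower(x))

# Return the indices of descriptions that contain the search term.
search_metadata <- function(descriptions, term) {
  which(grepl(normalize_text(term), normalize_text(descriptions), fixed = TRUE))
}

# Made-up tissue descriptions for illustration.
descriptions <- c("Bone Marrow", "bonemarrow derived macrophages",
                  "Liver", "Whole blood")
hits <- search_metadata(descriptions, "bone marrow")
```

Here `hits` holds the positions of the first two descriptions, since both contain "bonemarrow" once case and spaces are removed.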
We preselected a series of tissues grouped by cellular system. This allows simple browsing of the data for tissues of interest. Some tissue selections can return no results for either mouse or human samples.
Signature search takes a list of highly expressed genes and a list of lowly expressed genes and identifies samples that match the given input. Gene expression is z-score normalized across samples to obtain the relative expression of each gene.
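As an illustration of the idea, the sketch below z-scores each gene across samples and scores every sample by the mean z-score of the up genes minus the mean z-score of the down genes. The scoring rule is an assumption made for this example, as the exact ARCHS4 scoring is not described here; `signature_scores` and the toy data are hypothetical.

```r
# Score samples against an up/down gene signature (illustrative sketch).
# expression: numeric matrix, genes in rows and samples in columns.
# Genes with zero variance across samples would yield NaN z-scores
# and should be filtered out beforehand.
signature_scores <- function(expression, up_genes, down_genes) {
  # z-score each gene across samples
  z <- t(scale(t(expression)))
  up <- intersect(up_genes, rownames(z))
  down <- intersect(down_genes, rownames(z))
  # Assumed scoring rule: mean z of up genes minus mean z of down genes
  colMeans(z[up, , drop = FALSE]) - colMeans(z[down, , drop = FALSE])
}

# Made-up expression values: 4 genes (rows) x 3 samples (columns).
toy <- matrix(c(9, 1, 5,
                8, 2, 5,
                1, 9, 5,
                2, 8, 5), nrow = 4, byrow = TRUE,
              dimnames = list(c("G1", "G2", "G3", "G4"), c("S1", "S2", "S3")))
scores <- signature_scores(toy, up_genes = c("G1", "G2"), down_genes = c("G3", "G4"))
ranked <- names(sort(scores, decreasing = TRUE))
```

In this toy example S1 ranks first because G1 and G2 are high and G3 and G4 are low in that sample, while S2 shows the opposite pattern and ranks last.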
Enrichment search highlights samples that are enriched in gene sets from 9 gene set libraries (CHEA 2016, KEA 2016, Encode TF ChIP-seq 2016, KEGG 2016, MGI mammalian phenotype, Human phenotype, GO Biological Process, GO cellular component, GO molecular function).
Searching by gene symbol opens the gene landing page. A set of genes can be highlighted by selecting a gene set library and a corresponding gene set.
Field | Value |
---|---|
Method | POST |
URL | https://maayanlab.cloud/custom/rooky |
Returns | |
Parameters | |
Example code | |
Example results | |
The top 100 predictions of gene set membership across multiple domains are shown in the tables on the gene page. Gene set membership is predicted by association: if a gene is highly correlated with known members of a gene set, it receives a high z-score in the membership prediction. Functions and gene set memberships that are already known for the gene are highlighted in green. If a gene is extensively annotated, a ROC curve shows how well the known annotations could be recovered by the algorithm. If no image is shown, the gene does not have enough prior gene set memberships to build a reliable ROC curve.
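The membership-by-association idea can be sketched as follows: each gene is scored by its mean correlation with the known members of a gene set, and the scores are then z-scored across genes. The details (mean correlation, exclusion of self-correlation, z-scoring across all genes) are assumptions for illustration and may differ from the ARCHS4 algorithm; `membership_zscores` and the simulated data are hypothetical.

```r
# Membership-by-association sketch.
# correlation: square gene x gene correlation matrix with gene names as dimnames.
# members: known members of one gene set (at least two are needed here,
# because self-correlations are ignored).
membership_zscores <- function(correlation, members) {
  members <- intersect(members, rownames(correlation))
  diag(correlation) <- NA  # ignore self-correlation
  raw <- rowMeans(correlation[, members, drop = FALSE], na.rm = TRUE)
  (raw - mean(raw)) / sd(raw)
}

set.seed(1)
# Made-up expression: genes A, B, C follow a shared pattern; D and E are noise.
pattern <- rnorm(50)
expression <- rbind(A = pattern + rnorm(50, sd = 0.3),
                    B = pattern + rnorm(50, sd = 0.3),
                    C = pattern + rnorm(50, sd = 0.3),
                    D = rnorm(50),
                    E = rnorm(50))
correlation <- cor(t(expression))
z <- membership_zscores(correlation, members = c("A", "B"))
```

Gene C is not a known member, but because it is strongly correlated with A and B it receives a higher z-score than the unrelated genes D and E.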
The gene correlation table contains the 100 genes most correlated with the gene of interest. The Pearson correlation is calculated over all samples and tissues. The gene list can be uploaded to Enrichr for further investigation.
The tissue and cell line expression atlases are calculated from samples in ARCHS4. The tissues are grouped at multiple levels and cover a wide range of cellular contexts. Since samples of any given tissue can come from many distinct laboratories, the conditions under which the samples were created are not identical, and various subtypes of tissues can be mixed. In contrast to GTEx, this can reveal the variability observed in non-homogeneous sample groups.
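A table of this kind can be approximated from any expression matrix with base R. The sketch below computes the Pearson correlation between one gene and all other genes and keeps the top entries; `top_correlated` and the toy matrix are hypothetical, and the real table is computed over the full ARCHS4 sample collection.

```r
# Return the n genes most correlated with a gene of interest.
# expression: numeric matrix, genes in rows and samples in columns.
top_correlated <- function(expression, gene, n = 100) {
  target <- expression[gene, ]
  others <- expression[rownames(expression) != gene, , drop = FALSE]
  # cor() correlates each column of t(others), i.e. each gene, with the target
  r <- cor(t(others), target, method = "pearson")[, 1]
  head(sort(r, decreasing = TRUE), n)
}

# Made-up expression values: 4 genes (rows) x 5 samples (columns).
toy <- rbind(G1 = c(1, 2, 3, 4, 5),
             G2 = c(2, 4, 6, 8, 11),
             G3 = c(5, 4, 3, 2, 1),
             G4 = c(1, 3, 2, 5, 4))
top <- top_correlated(toy, "G1", n = 2)
```

For the toy data the two genes returned are G2 and G4, in that order; G3 is anti-correlated with G1 and is dropped.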
Please acknowledge ARCHS4 in your publications by citing the following reference:
Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 9. Article number: 1366 (2018), doi:10.1038/s41467-018-03751-6