ARCHS4 Help



If you would like to receive updates on the ARCHS4 data and stay informed about new data releases, consider signing up for the newsletter.


First Steps

About ARCHS4

ARCHS4 provides access to gene counts from HiSeq 2000 and HiSeq 2500 platforms for human and mouse experiments. The website offers the data for download in H5 format for programmatic access, as well as a 3-dimensional view of the sample space. Search features allow browsing of the data by metadata annotation, signature similarity, and functional enrichment. Selected sample sets can be downloaded as tab-separated files through auto-generated R scripts for further analysis. Reads are aligned with Kallisto using a custom cloud computing platform. Human samples are aligned against the GRCh38 cDNA reference and mouse samples against the GRCm38 cDNA reference. All processable files from the GEO/SRA database since June 2018 are processed and available for download.


Download files

Raw read counts can be downloaded by selecting the Download option at the top of the ARCHS4 website. The raw counts are provided separately for human and mouse samples in H5 format, which compresses the gene count data efficiently. Tab-separated files can be extracted with the provided code or with the auto-generated scripts from the web interface.

Selected sample and gene sets are displayed in the Search Results section. Under downloads, scripts can be downloaded for the selected sets. For gene sets, a text file containing the gene symbols is provided.

Gene sets can be exported to Enrichr for further analysis.

Chrome Extension

The ARCHS4 Chrome extension is a browser extension that adds content to the landing pages of RNA-seq datasets available on the Gene Expression Omnibus (GEO) when samples have been processed by ARCHS4. The extension adds links to download files that contain the aligned reads mapped to genes with counts, as well as a heatmap visualization summary of the expression data from the processed samples using Clustergrammer. The ARCHS4 Chrome extension can be installed from the Chrome Web Store. An example of an enhanced GEO landing page for series GSE77243 is shown.


Gene landing pages

Gene landing pages are accessible through the search fields on the top right of the ARCHS4 interface. Genes can be searched by Entrez gene symbol. The gene search returns the expression distribution for major tissues and cell lines, together with predictions of biological function and regulatory properties of the target gene. If a gene is already known to be part of a predicted gene set, the term is marked in green. If a gene has a sufficient number of prior knowledge annotations, an ROC curve shows how well the prior knowledge about the gene can be recovered.


Data visualization

WebGL data viewer

The data view port displays samples or genes relative to the other samples and genes in the dataset. Samples/Genes with similar expression are clustered. The layout is computed with t-SNE.



Viewer options

The data view can display four precalculated data visualizations: human samples, human genes, mouse samples, and mouse genes. To toggle between species, select either the human or mouse button on the left. To change from sample view to gene view, select one of the yellow buttons. A new view is loaded automatically on selection. Prior selections are saved and are loaded again when the former mode is selected.

The colors of sample and gene sets can be modified in the result section.

Viewer user interaction

The visualization is dynamic and allows rotation and zooming. The two icons on the top right are the manual selection tool and a toggle button that moves the view port into a smaller window to the left of the webpage.


Data handling

About the H5 file format

Hierarchical Data Format (HDF) is an open source file format for large data storage. It allows programmatic access to matrix entries by column and row indices while providing efficient data compression. The H5 files provided by ARCHS4 contain raw read counts as well as detailed metadata extracted from GEO.



    group          name                   otype        dclass   dim
0   /              data                   H5I_GROUP
1   /data          expression             H5I_DATASET  INTEGER  307268 x 35238
2   /              meta                   H5I_GROUP
3   /meta          genes                  H5I_GROUP
4   /meta/genes    chromosome             H5I_DATASET  STRING   35238
5   /meta/genes    ensembl_gene_id        H5I_DATASET  STRING   35238
6   /meta/genes    gene_biotype           H5I_DATASET  STRING   35238
7   /meta/genes    gene_symbol            H5I_DATASET  STRING   35238
8   /meta/genes    genes                  H5I_DATASET  STRING   35238
9   /meta/genes    start_position         H5I_DATASET  STRING   35238
10  /meta          info                   H5I_GROUP
11  /meta/info     author                 H5I_DATASET  STRING   ( 0 )
12  /meta/info     contact                H5I_DATASET  STRING   ( 0 )
13  /meta/info     creation-date          H5I_DATASET  STRING   ( 0 )
14  /meta/info     laboratory             H5I_DATASET  STRING   ( 0 )
15  /meta/info     version                H5I_DATASET  INTEGER  ( 0 )
16  /meta          samples                H5I_GROUP
17  /meta/samples  channel_count          H5I_DATASET  STRING   307268
18  /meta/samples  characteristics_ch1    H5I_DATASET  STRING   307268
19  /meta/samples  contact_address        H5I_DATASET  STRING   307268
20  /meta/samples  contact_city           H5I_DATASET  STRING   307268
21  /meta/samples  contact_country        H5I_DATASET  STRING   307268
22  /meta/samples  contact_institute      H5I_DATASET  STRING   307268
23  /meta/samples  contact_name           H5I_DATASET  STRING   307268
24  /meta/samples  contact_zip            H5I_DATASET  STRING   307268
25  /meta/samples  data_processing        H5I_DATASET  STRING   307268
26  /meta/samples  data_row_count         H5I_DATASET  STRING   307268
27  /meta/samples  extract_protocol_ch1   H5I_DATASET  STRING   307268
28  /meta/samples  geo_accession          H5I_DATASET  STRING   307268
29  /meta/samples  instrument_model       H5I_DATASET  STRING   307268
30  /meta/samples  last_update_date       H5I_DATASET  STRING   307268
31  /meta/samples  library_selection      H5I_DATASET  STRING   307268
32  /meta/samples  library_source         H5I_DATASET  STRING   307268
33  /meta/samples  library_strategy       H5I_DATASET  STRING   307268
34  /meta/samples  molecule_ch1           H5I_DATASET  STRING   307268
35  /meta/samples  organism_ch1           H5I_DATASET  STRING   307268
36  /meta/samples  platform_id            H5I_DATASET  STRING   307268
37  /meta/samples  readsaligned           H5I_DATASET  FLOAT    307268
38  /meta/samples  readstotal             H5I_DATASET  FLOAT    307268
39  /meta/samples  relation               H5I_DATASET  STRING   307268
40  /meta/samples  series_id              H5I_DATASET  STRING   307268
41  /meta/samples  singlecellprobability  H5I_DATASET  FLOAT    307268
42  /meta/samples  source_name_ch1        H5I_DATASET  STRING   307268
43  /meta/samples  status                 H5I_DATASET  STRING   307268
44  /meta/samples  submission_date        H5I_DATASET  STRING   307268
45  /meta/samples  taxid_ch1              H5I_DATASET  STRING   307268
46  /meta/samples  title                  H5I_DATASET  STRING   307268
47  /meta/samples  type                   H5I_DATASET  STRING   307268
44  /meta          genes                  H5I_DATASET  STRING   35238


Parsing H5 file

Scripts to extract tab-separated gene expression files can be created through the graphical user interface of ARCHS4. The generated script must be run in R. R is free software (www.r-project.org), and a free version of the RStudio environment can be downloaded from www.rstudio.com. Upon execution, the script should install all required dependencies and then download the full gene expression file before extracting the selected samples.


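# R script to download selected samples
# Copy code and run on a local machine to initiate download

# Check for dependencies and install if missing (rhdf5 is distributed through Bioconductor)
if (!requireNamespace("rhdf5", quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("rhdf5")
}
library("rhdf5")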

destination_file = "human_matrix_v9.h5"
extracted_expression_file = "GSM2679484_expression_matrix.tsv"
url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix_v9.h5"

# Check whether the gene expression file is already in the current directory; if not, download it from the repository
if(!file.exists(destination_file)){
    print("Downloading compressed gene expression matrix.")
    download.file(url, destination_file, quiet = FALSE, mode = 'wb')
}

# Selected samples to be extracted
samp = c("GSM2679452","GSM2679453","GSM2679454","GSM2679455","GSM2679456","GSM2679457","GSM2679458","GSM2679459","GSM2679460","GSM2679461","GSM2679462","GSM2679463","GSM2679464","GSM2679465","GSM2679466","GSM2679467","GSM2679468","GSM2679469","GSM2679470","GSM2679471","GSM2679472","GSM2679473","GSM2679474","GSM2679475","GSM2679476","GSM2679477","GSM2679478","GSM2679479","GSM2679480","GSM2679481","GSM2679482",
"GSM2679483","GSM2679484","GSM2679485","GSM2679486","GSM2679487","GSM2679488","GSM2679489","GSM2679490","GSM2679491","GSM2679492","GSM2679493","GSM2679494","GSM2679495","GSM2679496","GSM2679497","GSM2679498","GSM2679499","GSM2679500","GSM2679501","GSM2679502","GSM2679503","GSM2679504","GSM2679505","GSM2679506","GSM2679507","GSM2679508","GSM2679509","GSM2679510","GSM2679511","")

# Retrieve information from compressed data
samples = h5read(destination_file, "meta/samples/geo_accession")
genes = h5read(destination_file, "meta/genes/genes")

# Identify columns to be extracted
sample_locations = which(samples %in% samp)

# extract gene expression from compressed data
expression = t(h5read(destination_file, "data/expression", index=list(sample_locations, 1:length(genes))))
H5close()
rownames(expression) = genes
colnames(expression) = samples[sample_locations]

# Print file
write.table(expression, file=extracted_expression_file, sep="\t", quote=FALSE, col.names=NA)
print(paste0("Expression file was created at ", getwd(), "/", extracted_expression_file))



Batch effect correction

Samples extracted for a specified tissue can originate from multiple series with slightly different experimental conditions. If desired, batch effects can be removed from the gene expression with the ComBat function of the sva library.



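# R script to download selected samples and correct batch effects
# Copy code and run on a local machine to initiate download

# Check for dependencies and install if missing (all three are distributed through Bioconductor)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
for (pkg in c("rhdf5", "preprocessCore", "sva")) {
    if (!requireNamespace(pkg, quietly = TRUE)) BiocManager::install(pkg)
}
library("rhdf5")
library("preprocessCore")
library("sva")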

destination_file = "human_matrix_v9.h5"
extracted_expression_file = "GSM2679484_expression_matrix.tsv"
url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix_v9.h5"

# Check whether the gene expression file is already in the current directory; if not, download it from the repository
if(!file.exists(destination_file)){
    print("Downloading compressed gene expression matrix.")
    download.file(url, destination_file, quiet = FALSE, mode = 'wb')
}

# Selected samples to be extracted
samp = c("GSM2679452","GSM2679453","GSM2679454","GSM2679455","GSM2679456","GSM2679457","GSM2679458","GSM2679459","GSM2679460","GSM2679461","GSM2679462","GSM2679463","GSM2679464","GSM2679465","GSM2679466","GSM2679467","GSM2679468","GSM2679469","GSM2679470","GSM2679471","GSM2679472","GSM2679473","GSM2679474","GSM2679475","GSM2679476","GSM2679477","GSM2679478","GSM2679479","GSM2679480","GSM2679481","GSM2679482",
"GSM2679483","GSM2679484","GSM2679485","GSM2679486","GSM2679487","GSM2679488","GSM2679489","GSM2679490","GSM2679491","GSM2679492","GSM2679493","GSM2679494","GSM2679495","GSM2679496","GSM2679497","GSM2679498","GSM2679499","GSM2679500","GSM2679501","GSM2679502","GSM2679503","GSM2679504","GSM2679505","GSM2679506","GSM2679507","GSM2679508","GSM2679509","GSM2679510","GSM2679511","")

# Retrieve information from compressed data
samples = h5read(destination_file, "meta/samples/geo_accession")
genes = h5read(destination_file, "meta/genes/genes")

# Identify columns to be extracted
sample_locations = which(samples %in% samp)

# extract gene expression from compressed data
expression = t(h5read(destination_file, "data/expression", index=list(sample_locations, 1:length(genes))))
H5close()
rownames(expression) = genes
colnames(expression) = samples[sample_locations]

# normalize samples and correct for differences in gene count distribution
expression = log2(expression+1)
expression = normalize.quantiles(expression)

rownames(expression) = genes
colnames(expression) = samples[sample_locations]

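# correct batch effects in gene expression, using the GEO series of each sample as its batch
# (series was undefined before; it is read here from the series_id field of the H5 file)
series = h5read(destination_file, "meta/samples/series_id")[sample_locations]
batchid = match(series, unique(series))
correctedExpression <- ComBat(dat=expression, batch=batchid, par.prior=TRUE, prior.plots=FALSE)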
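
For readers who want to prototype outside R, the normalization step of the script above (log2 transform followed by quantile normalization) can be sketched in plain Python. This is an illustrative sketch only, not the preprocessCore implementation: the function name quantile_normalize and the toy count matrix are assumptions made for the example, and ties are broken by row order rather than averaged.

```python
import math


def quantile_normalize(matrix):
    """Quantile-normalize the columns of a genes x samples matrix (list of rows)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Mean of the k-th smallest value across all samples, for every rank k.
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [
        sum(col[k] for col in sorted_columns) / n_cols for k in range(n_rows)
    ]

    # Replace each value by the mean that belongs to its rank within its sample.
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = rank_means[rank]
    return normalized


# Toy raw counts (hypothetical): 4 genes (rows) x 3 samples (columns).
counts = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]
log_counts = [[math.log2(value + 1) for value in row] for row in counts]
normalized = quantile_normalize(log_counts)

for row in normalized:
    print([round(value, 3) for value in row])
```

After this step every sample has the same distribution of values, which is the property the R script relies on before running ComBat.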


Search tools

Metadata search

Metadata search parses the tissue description field from GEO to find matches with the entered search term. The search ignores spaces and is case insensitive. Results are highlighted in the data viewer and a result is added to the result list.
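
The matching rule described above (spaces ignored, case insensitive) can be illustrated with a short Python sketch. It is an assumption about how such matching might be implemented, not the ARCHS4 server code; the function names, the substring matching, and the sample identifiers are hypothetical.

```python
def normalize(text):
    """Remove all whitespace and lowercase the text."""
    return "".join(text.split()).lower()


def metadata_search(term, tissue_by_sample):
    """Return the samples whose tissue description contains the search term."""
    needle = normalize(term)
    return [
        sample
        for sample, description in tissue_by_sample.items()
        if needle in normalize(description)
    ]


# Hypothetical tissue descriptions keyed by placeholder sample identifiers.
tissue_by_sample = {
    "SAMPLE_1": "Bone Marrow",
    "SAMPLE_2": "Liver",
    "SAMPLE_3": "bonemarrow derived macrophage",
}

print(metadata_search("bone marrow", tissue_by_sample))
```

Under this rule, "Bone Marrow", "bone marrow" and "bonemarrow" are all treated as the same search term.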

We preselected a set of tissues, grouped by cellular system. This allows simple browsing of the data for tissues of interest. Some tissue selections can return no results for either mouse or human samples.


Signature search

Signature search takes a list of highly expressed genes and a list of lowly expressed genes and identifies samples that match the given input. The gene expression is z-score normalized across samples to identify the relative expression of each gene.
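
A minimal Python sketch of this idea follows. Only the z-score normalization across samples is taken from the description above; the scoring rule (mean z-score of the up genes minus mean z-score of the down genes), the function names, and the toy data are assumptions made for illustration and may differ from what ARCHS4 computes.

```python
from statistics import mean, pstdev


def zscore_across_samples(expression):
    """Z-score each gene's values across samples; constant genes map to 0."""
    z = {}
    for gene, values in expression.items():
        mu, sd = mean(values), pstdev(values)
        z[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return z


def signature_scores(expression, sample_ids, up_genes, down_genes):
    """Score each sample: high when up genes are high and down genes are low."""
    z = zscore_across_samples(expression)
    scores = {}
    for j, sample in enumerate(sample_ids):
        up = mean(z[g][j] for g in up_genes if g in z)
        down = mean(z[g][j] for g in down_genes if g in z)
        scores[sample] = up - down
    return scores


# Toy data (hypothetical): two genes measured in three samples.
sample_ids = ["SAMPLE_1", "SAMPLE_2", "SAMPLE_3"]
expression = {"GENE_UP": [1, 2, 9], "GENE_DOWN": [8, 5, 2]}

scores = signature_scores(expression, sample_ids, ["GENE_UP"], ["GENE_DOWN"])
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Because each gene is normalized across samples first, a gene with generally high counts does not dominate the score.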



Enrichment search

Enrichment search highlights samples that are enriched in gene sets from 9 gene set libraries (CHEA 2016, KEA 2016, Encode TF ChIP-seq 2016, KEGG 2016, MGI Mammalian Phenotype, Human Phenotype, GO Biological Process, GO Cellular Component, GO Molecular Function).

Gene search

Searching by gene symbol opens the gene landing page. A set of genes can be highlighted by selecting a gene set library and a corresponding gene set.

API

Get similarity score of samples for a gene expression profile

Method: POST
URL: https://maayanlab.cloud/custom/rooky

Returns
    name            Query name.
    samples         Integer portion of the GSM ID. Complete the GSM ID by prepending "GSM" to the integer portion: (1462973, 2270363, 1897293) -> (GSM1462973, GSM2270363, GSM1897293).
    similarity      Similarity z-scores. Large positive values represent similar signatures; negative values represent reversed signatures.

Parameters
    type            'full_signature' or 'geneset'. For 'geneset', add the fields upgenes and downgenes, each holding an array of gene symbols.
    signatureName   Name of the query.
    species         "human" or "mouse".
    siggenes        Array of gene symbols; must be in the same order as the expression values in signature.
    signature       Unnormalized gene counts; must be in the same order as siggenes.

Example code
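import requests

URL = 'https://maayanlab.cloud/custom/rooky'

# Full-signature query: siggenes and signature are parallel arrays,
# so signature[i] is the unnormalized count for siggenes[i]
data = {
    "type": "full_signature",
    "signatureName": "example_query",
    "species": "human",
    "siggenes": ["A1BG", "AADACL3", "AASS", "ABCD4", "ABTB2", "TNS1", "TPTEP1"],
    "signature": [1000, 21300, 1231, 0, 19923, 4000, 20000]
}

# Send the query as a JSON request body
r = requests.post(URL, json=data)

print(r.status_code)  # HTTP status code of the response
print(r.json())       # parsed JSON response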

Example results

{
    'name': ['example_query'],
    'samples': [1462973, 2270363, 1897293, ...],
    'similarity': [[-0.9423], [1.1858], [-0.6544], ...]
}
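
The two helpers below sketch how a client might prepare a 'geneset' query and post-process a response of the shape shown above. They run offline and send nothing. The helper names are hypothetical, and the assumption that a 'geneset' query carries signatureName and species alongside upgenes and downgenes is inferred from the parameter list rather than confirmed.

```python
import json


def build_geneset_query(name, species, up_genes, down_genes):
    """Assemble the JSON body for a 'geneset' query (assumed field layout)."""
    return {
        "type": "geneset",
        "signatureName": name,
        "species": species,
        "upgenes": list(up_genes),
        "downgenes": list(down_genes),
    }


def rank_samples(result):
    """Pair full GSM IDs with similarity scores, most similar first."""
    gsm_ids = ["GSM" + str(number) for number in result["samples"]]
    scores = [row[0] for row in result["similarity"]]
    return sorted(zip(gsm_ids, scores), key=lambda pair: pair[1], reverse=True)


query = build_geneset_query(
    "example_geneset", "human", ["A1BG", "AASS"], ["TNS1", "TPTEP1"]
)
print(json.dumps(query, indent=2))

# Response values copied from the example results above (truncated to three).
result = {
    "name": ["example_query"],
    "samples": [1462973, 2270363, 1897293],
    "similarity": [[-0.9423], [1.1858], [-0.6544]],
}
ranked = rank_samples(result)
for gsm, score in ranked:
    print(gsm, score)
```

Sorting by descending score puts the samples most similar to the query first and those with reversed signatures last.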


Gene pages

Functional prediction

The top 100 predictions of gene set membership across multiple domains are shown in the tables on the gene page. Gene set membership is predicted by association: if a gene is highly correlated with known members of a gene set, it receives a high z-score in the membership prediction. Known functions and gene set memberships of the gene are highlighted in green. If a gene is extensively annotated, an ROC curve shows how well the known annotations could be recovered by the algorithm. If no image is shown, the gene does not have enough prior gene set memberships to build a reliable ROC curve.
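
The sketch below illustrates prediction by association and the ROC-based check in plain Python. It is a simplified illustration under stated assumptions, not the ARCHS4 algorithm: the score is taken to be the mean correlation with the other members of a gene set, z-scored across gene sets, and the correlations, gene names, and gene sets are hypothetical.

```python
from statistics import mean, pstdev


def association_scores(gene, correlation, gene_sets):
    """Mean correlation between `gene` and the other members of each gene set."""
    scores = {}
    for name, members in gene_sets.items():
        others = [m for m in members if m != gene and m in correlation[gene]]
        if others:
            scores[name] = mean(correlation[gene][m] for m in others)
    return scores


def to_z_scores(scores):
    """Z-score the association scores across gene sets."""
    values = list(scores.values())
    mu, sd = mean(values), pstdev(values)
    return {k: (v - mu) / sd if sd > 0 else 0.0 for k, v in scores.items()}


def roc_auc(z, known):
    """Area under the ROC curve: chance that a known set outranks an unknown one."""
    positives = [score for name, score in z.items() if name in known]
    negatives = [score for name, score in z.items() if name not in known]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical correlations of GENE_A with other genes.
correlation = {
    "GENE_A": {"GENE_B": 0.9, "GENE_C": 0.8, "GENE_D": 0.1, "GENE_E": -0.2}
}
gene_sets = {
    "set_1": ["GENE_A", "GENE_B", "GENE_C"],  # GENE_A is a known member
    "set_2": ["GENE_D", "GENE_E"],
    "set_3": ["GENE_C", "GENE_D"],
}
known = {"set_1"}

z = to_z_scores(association_scores("GENE_A", correlation, gene_sets))
auc = roc_auc(z, known)
print(z, auc)
```

An area under the curve close to 1 means the known annotations rank near the top of the predictions; a value near 0.5 means they are recovered no better than chance.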

Gene correlation

The gene correlation table contains the 100 genes most correlated with the gene of interest. The Pearson correlation is calculated over all samples and tissues. The gene list can be uploaded to Enrichr for further investigation.
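
As a rough illustration, the ranking can be sketched in plain Python. The function names and the toy expression values are assumptions made for the example; the real table is computed over all ARCHS4 samples.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)


def most_correlated(target, expression, top_n=100):
    """Return up to top_n (gene, correlation) pairs, most correlated first."""
    ranked = sorted(
        (
            (pearson(expression[target], values), gene)
            for gene, values in expression.items()
            if gene != target
        ),
        reverse=True,
    )
    return [(gene, r) for r, gene in ranked[:top_n]]


# Toy expression values (hypothetical): one list of per-sample values per gene.
expression = {
    "GENE_A": [1, 2, 3, 4],
    "GENE_B": [2, 4, 6, 8.5],
    "GENE_C": [4, 3, 2, 1],
    "GENE_D": [1, 3, 2, 4],
}

top = most_correlated("GENE_A", expression, top_n=3)
for gene, r in top:
    print(gene, round(r, 3))
```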

Tissue expression atlas

The tissue and cell line expression atlases are calculated from samples in ARCHS4. The tissues are grouped at multiple levels and cover a wide range of cellular contexts. Because samples of any given tissue can come from many distinct laboratories, the conditions under which the samples were created are not identical, and various tissue subtypes can be mixed. In contrast to GTEx, the atlas can therefore report the variability observed in non-homogeneous sample groups.


Terms of use

Use

Source code is available under the Apache License 2.0. The provided gene expression files are available under the Creative Commons Attribution 4.0 International License.
All data is free to use for non-commercial purposes. For commercial use please contact MSIP.

Citation

Please acknowledge ARCHS4 in your publications by citing the following reference:
Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 9. Article number: 1366 (2018), doi:10.1038/s41467-018-03751-6

Disclaimer

ARCHS4 is not to be used for treating or diagnosing human subjects. ARCHS4 or any documents available from this server are provided as is without any warranty of any kind, either express, implied, or statutory, including, but not limited to, any implied warranties of merchantability, fitness for particular purpose and freedom from infringement, or that ARCHS4 or any documents available from this server will be error free. The Ma'ayan lab makes no representations that the use of ARCHS4 or any documents available from this server will not infringe any patent or proprietary rights of third parties. In no event will the Ma'ayan lab or any of its members be liable for any damages, including but not limited to direct, indirect, special or consequential damages, arising out of, resulting from, or in any way connected with the use of ARCHS4 or documents available from this server.




© Ma'ayan Lab.