Welcome to the Dimensionality Reduction Tool by the Molecular and Genomics Informatics Core (MaGIC).
Dimensionality reduction summarizes datasets with large numbers of features in a small number of dimensions. This lets you visualize similarities and dissimilarities between samples in your dataset. Ideally, samples that are expected to be similar should group together: for example, your control samples should cluster with each other, and your treatment samples with each other. This type of dimensionality reduction is usually performed using PCA, tSNE, or UMAP.
Each dimensionality reduction method has its own use. In many datasets they will tell a similar story, but it is best to decide which one fits your experimental design. In general, PCA is used for smaller datasets (such as RNA-seq with a few samples), while tSNE and UMAP become more appropriate as the number of samples increases, such as for scRNA-seq.
Principal component analysis (PCA) is a common linear method of dimensionality reduction. In essence, the variable features in your dataset are distilled into components (eigenvectors of the covariance matrix), each of which captures the maximum amount of remaining variance.
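As a rough illustration of the idea, the following Python sketch computes principal components with NumPy. It is not the tool's own implementation. The function name pca, the toy data, and the group labels are all hypothetical, and the samples are assumed to be in rows with features in columns.

```python
import numpy as np

def pca(values, n_components=2):
    """Project samples onto the top principal components.

    values: array-like of shape (n_samples, n_features).
    Returns (scores, explained_variance_ratio).
    """
    X = np.asarray(values, dtype=float)
    # Center each feature so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered matrix: the rows of Vt are the eigenvectors of the
    # covariance matrix, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Sample coordinates along each component.
    scores = U[:, :n_components] * S[:n_components]
    # Fraction of total variance captured by each component.
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]

# Hypothetical toy data: 6 samples x 4 features, two well-separated groups.
rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=(3, 4))
treated = rng.normal(loc=5.0, scale=1.0, size=(3, 4))
data = np.vstack([control, treated])

scores, explained = pca(data)
print(scores.shape)  # one row per sample, one column per component
print(explained)     # share of variance captured by PC1 and PC2
```

In this toy case the two groups differ in every feature, so the first component should separate them. Real data will rarely be this clean.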
t-distributed Stochastic Neighbor Embedding (tSNE) is a non-linear, neighbor-based dimensionality reduction method. Essentially, it positions each sample in the embedding based on its distances to neighboring samples (cells, in scRNA-seq), typically measured in PCA space.
Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction method. UMAP is similar to tSNE, but it scales more efficiently to large datasets and tends to preserve more of the global structure alongside local distances, which helps delineate groups.
The data for dimensionality reduction can come from any type of count or measurement data. A few examples are RNA-seq normalized hit counts, Luminex assay counts, plant petal size/color, etc. If your dataset has few samples, start with the PCA plots. Both tSNE and UMAP are designed for larger datasets and will require custom hyperparameter tweaks to run, even with the demo data.
To use this tool, you must have at minimum a tsv/csv table with the first column containing your row identifiers (for example Gene IDs), followed by one column of count data per sample (for example VST normalized hit counts). Additionally, you need a metadata table. Its first column should contain the sample names, matching the column names in the count data. Each subsequent column can include any non-measured data, such as grouping variables.
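Before uploading, it can help to confirm that the two tables agree on sample names. The Python sketch below shows one way to do that with the standard library only. It assumes a layout with row identifiers in the first column of the count table and one column per sample. The file contents, sample names, and the helper names read_table and check_inputs are hypothetical and are not part of the tool.

```python
import csv
import io

def read_table(text, delimiter=","):
    """Parse delimited text into (header, rows), skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]
    return rows[0], rows[1:]

def check_inputs(count_text, meta_text, delimiter=","):
    """Return the sample names that appear in only one of the two tables."""
    count_header, _ = read_table(count_text, delimiter)
    _, meta_rows = read_table(meta_text, delimiter)
    # Skip the identifier column; the remaining headers are sample names.
    count_samples = set(count_header[1:])
    # The first metadata column holds the sample names.
    meta_samples = {row[0] for row in meta_rows}
    missing_from_meta = sorted(count_samples - meta_samples)
    missing_from_counts = sorted(meta_samples - count_samples)
    return missing_from_meta, missing_from_counts

# Hypothetical count table: gene IDs in the first column, one column per sample.
counts_csv = """gene_id,ctrl_1,ctrl_2,trt_1,trt_2
GeneA,10.2,9.8,14.1,13.7
GeneB,5.5,5.9,5.1,5.4
"""

# Hypothetical metadata table: sample names first, then grouping variables.
metadata_csv = """sample,group
ctrl_1,control
ctrl_2,control
trt_1,treatment
trt_2,treatment
"""

missing_from_meta, missing_from_counts = check_inputs(counts_csv, metadata_csv)
print(missing_from_meta, missing_from_counts)  # both lists are empty when names match
```

For tab-separated files, pass delimiter="\t" to the same functions.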
Caution. These are some of the tSNE hyperparameters. We strongly recommend you do not tweak these unless you understand the changes being made. Some of these will drastically increase runtime and memory consumption or trigger crashes.
Caution. These are some of the UMAP hyperparameters. We strongly recommend you do not tweak these unless you understand the changes being made. Some of these will drastically increase runtime and memory consumption or trigger crashes.