Note
This repository is in active development
The DicePlot package allows you to create visualizations (dice plots) for datasets with more than two categorical variables and additional continuous variables. This tool is particularly useful for exploring complex categorical data and their relationships with continuous variables.
To install the DicePlot package, follow these steps:
Ensure that you have R installed on your system. You can download it from The Comprehensive R Archive Network (CRAN). Alternatively, create a conda environment:
conda create -n diceplot -c conda-forge r-base -y
conda activate diceplot
The DicePlot package depends on several other R packages. Install them by running:
install.packages(c(
"devtools",
"dplyr",
"ggplot2",
"tidyr",
"data.table",
"ggdendro"
))
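If some of these packages are already present, reinstalling all of them is unnecessary. The following is a minimal sketch that installs only what is missing; `missing_packages` is a hypothetical helper written for this example and is not part of DicePlot.

```r
# Packages listed in the installation step above
required <- c("devtools", "dplyr", "ggplot2", "tidyr", "data.table", "ggdendro")

# Hypothetical helper: return the subset of `pkgs` whose namespace cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)

# Only install interactively, so that scripts never trigger downloads by surprise
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

The `interactive()` guard is a design choice for this sketch; remove it if you want the installation to run from a script as well.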
You have three options for installing the DicePlot package:
Option 1: Install from CRAN
install.packages("diceplot")
Option 2: Install from GitHub
# Install devtools if you haven't already
install.packages("devtools")
# Install DicePlot from GitHub
devtools::install_github("maflot/DicePlot/diceplot")
Option 3: Install from a local copy
Download the repository and run the following code to install the package:
install.packages("$path on your local machine$/DicePlot/diceplot", repos = NULL, type="source")
After installation, load the DicePlot package into your R session:
library(diceplot)
Here is a simple example of how to use the DicePlot v0.1.2 package.
For more examples, check the tests/ folder.
# Load necessary libraries
library(diceplot)
library(tidyr)
library(data.table)
library(ggplot2)
library(dplyr)
library(tibble)
library(grid)
library(cowplot)
library(RColorBrewer)
First, we define the cell types, pathways, pathway groups, pathology variables, and assign colors to pathology variables:
# Define common variables
cell_types <- c("Neuron", "Astrocyte", "Microglia", "Oligodendrocyte", "Endothelial")
pathways <- c(
"Apoptosis", "Inflammation", "Metabolism", "Signal Transduction", "Synaptic Transmission",
"Cell Cycle", "DNA Repair", "Protein Synthesis", "Lipid Metabolism", "Neurotransmitter Release",
"Oxidative Stress", "Energy Production", "Calcium Signaling", "Synaptic Plasticity", "Immune Response"
)
# Assign groups to pathways
pathway_groups <- data.frame(
Pathway = pathways,
Group = c(
"Linked", "UnLinked", "Other", "Linked", "UnLinked",
"UnLinked", "Other", "Other", "Other", "Linked",
"Other", "Other", "Linked", "UnLinked", "Other"
),
stringsAsFactors = FALSE
)
pathology_variables <- c("AD", "Cancer", "Flu", "ADHD", "Age", "Weight")
# Assign colors to pathology variables
n_colors <- length(pathology_variables)
colors <- brewer.pal(n = n_colors, name = "Set1")
z_colors <- setNames(colors, pathology_variables)
Explanation:
- Cell Types: A list of different cell types involved in the study.
- Pathways: Biological pathways relevant to the cell types.
- Pathway Groups: Categorization of pathways into 'Linked', 'UnLinked', or 'Other'.
- Pathology Variables: Medical conditions or variables of interest.
- Colors Assignment: Assigning a unique color to each pathology variable for visualization.
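The color assignment above relies on RColorBrewer, whose "Set1" palette provides at most 9 colors. As a hedged alternative, the sketch below builds the same kind of named color vector with base R only. `make_z_colors` is a hypothetical helper, and the choice of the "Set 2" HCL palette is an assumption made for illustration.

```r
pathology_variables <- c("AD", "Cancer", "Flu", "ADHD", "Age", "Weight")

# Hypothetical helper: one color per level, named by level
make_z_colors <- function(levels, palette = "Set 2") {
  # Duplicate levels would produce duplicate names in the color vector
  stopifnot(!anyDuplicated(levels))
  setNames(grDevices::hcl.colors(length(levels), palette = palette), levels)
}

z_colors <- make_z_colors(pathology_variables)
print(z_colors)
```

The result has the same shape as the `z_colors` vector built with `brewer.pal`: a character vector of hex colors whose names are the pathology variables.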
Now we finalize the data and plot the dice plot:
# Create dummy data
set.seed(123)
data <- expand.grid(CellType = cell_types, Pathway = pathways, stringsAsFactors = FALSE)
data <- data %>%
rowwise() %>%
mutate(
PathologyVariable = list(sample(pathology_variables, size = sample(1:length(pathology_variables), 1)))
) %>%
unnest(cols = c(PathologyVariable))
# Merge the group assignments into the data
data <- data %>%
left_join(pathway_groups, by = "Pathway")
# Use the dice_plot function with new parameter names
p = dice_plot(
data = data,
x = "CellType",
y = "Pathway",
z = "PathologyVariable",
group = "Group",
group_alpha = 0.6,
title = "Dice Plot with 6 Pathology Variables",
z_colors = z_colors,
custom_theme = theme_minimal(),
min_dot_size = 2,
max_dot_size = 4
)
print(p)
# Simply save the plot using the ggplot functions
# ggsave("./diceplot_example.png", p, width = 8, height = 9)
Explanation:
- Data Creation: We create a data frame that contains all combinations of cell types and pathways.
- Assign Pathology Variables: For each combination, we randomly assign one or more pathology variables.
- Merge Groups: We add the group information to each pathway.
- Plotting: We directly call dice_plot to generate and display the dice plot with the specified parameters.
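To make the long format concrete, the sketch below rebuilds a smaller version of the dummy data with base R only and counts how many pathology variables (dots) each cell type and pathway combination holds. The reduced variable lists and the `dots_per_tile` name are choices made for this illustration.

```r
set.seed(123)
cell_types <- c("Neuron", "Astrocyte", "Microglia")
pathways <- c("Apoptosis", "Inflammation")
pathology_variables <- c("AD", "Cancer", "Flu", "ADHD", "Age", "Weight")

# One row per (CellType, Pathway) combination
grid <- expand.grid(CellType = cell_types, Pathway = pathways, stringsAsFactors = FALSE)

# For each combination, draw a random non-empty subset of pathology variables
rows <- lapply(seq_len(nrow(grid)), function(i) {
  k <- sample(length(pathology_variables), 1)
  data.frame(
    CellType = grid$CellType[i],
    Pathway = grid$Pathway[i],
    PathologyVariable = sample(pathology_variables, size = k),
    stringsAsFactors = FALSE
  )
})
long_data <- do.call(rbind, rows)

# Count the rows (dots) per combination
dots_per_tile <- aggregate(
  PathologyVariable ~ CellType + Pathway,
  data = long_data,
  FUN = length
)
names(dots_per_tile)[3] <- "n_dots"
print(dots_per_tile)
```

Each combination appears once per assigned pathology variable in `long_data`, which is the layout passed to `dice_plot` through the `x`, `y`, and `z` arguments in the example above.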
A Domino Plot is a specialized visualization from the DicePlot package that allows you to display differential expression data across multiple categorical variables. It's particularly useful for visualizing how gene expression changes across different cell types, conditions, and contrasts.
The plot uses colors to represent up/down-regulation and size to represent statistical significance. This example uses data from the ZEBRA database, a hierarchically integrated gene expression atlas of the murine and human brain at single-cell resolution.
Before starting, ensure you have the following packages installed:
install.packages(c("dplyr", "tidyr", "ggplot2", "diceplot"))
For this tutorial, we'll use a dataset derived from human cortex samples that contains differential expression analysis results comparing gene expression between sexes across various neurological conditions. The dataset includes:
- gene: Gene symbols
- cell_type: Different cell types in the brain
- contrast: Different disease conditions compared to control (e.g., "MS-CT" compares Multiple Sclerosis to Control)
- sex: The contrast variable (male vs female)
- logFC: Log fold change values
- PValue and FDR: Statistical significance measures
library(dplyr)
library(tidyr)
library(ggplot2)
library(diceplot)
# Load dataset
zebra.df = read.csv(file = "data/ZEBRA_sex_degs_set.csv")
genes = c("SPP1","APOE","SERPINA1","PINK1","ANGPT1","ANGPT2","APP","CLU","ABCA7")
zebra.df <- zebra.df %>% filter(gene %in% genes) %>%
filter(contrast %in% c("MS-CT","AD-CT","ASD-CT","FTD-CT","HD-CT")) %>%
mutate(cell_type = factor(cell_type, levels = sort(unique(cell_type)))) %>%
filter(PValue < 0.05)
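Before plotting, it can be worth confirming that the input table has the columns described above. The sketch below does this on a toy stand-in for the ZEBRA table; the values are made up for illustration only, and `check_columns` is a hypothetical helper, not a DicePlot function.

```r
# Toy stand-in for the ZEBRA table (values are invented, not real results)
toy_df <- data.frame(
  gene = c("APOE", "APOE", "CLU"),
  cell_type = c("Astrocyte", "Microglia", "Astrocyte"),
  contrast = c("AD-CT", "AD-CT", "MS-CT"),
  sex = c("male", "female", "male"),
  logFC = c(0.8, -0.4, 1.2),
  PValue = c(0.01, 0.03, 0.2),
  FDR = c(0.04, 0.09, 0.4),
  stringsAsFactors = FALSE
)

required_cols <- c("gene", "cell_type", "contrast", "sex", "logFC", "PValue", "FDR")

# Hypothetical helper: stop early with a readable message if columns are missing
check_columns <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

check_columns(toy_df, required_cols)

# Same significance filter as in the pipeline above, written in base R
filtered <- toy_df[toy_df$PValue < 0.05, ]
print(filtered)
```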
Let's start with a basic domino plot using the default parameters:
p_basic <- domino_plot(
data = zebra.df, # Input data
gene_list = genes, # List of genes to include
var_id = "contrast", # Variable that identifies different conditions
x = "gene", # Variable for x-axis
y = "cell_type", # Variable for y-axis
contrast = "sex", # Contrast variable (e.g., male vs female)
log_fc = "logFC", # Column name for log fold change
p_val = "FDR" # Column name for p-values
)
# Display the plot
print(p_basic$domino_plot)
Now, let's create a more customized version with specific dot sizes and logFC limits:
p_advanced <- domino_plot(
data = zebra.df,
gene_list = genes,
var_id = "contrast",
x = "gene",
y = "cell_type",
contrast = "sex",
log_fc = "logFC",
p_val = "FDR",
min_dot_size = 1, # Minimum dot size for least significant results
max_dot_size = 3, # Maximum dot size for most significant results
logfc_limits = c(min(zebra.df$logFC)-1, max(zebra.df$logFC)+1) # Custom logFC color scale limits, padded by 1 on each side
)
# Display the plot
print(p_advanced$domino_plot)
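When choosing custom logFC limits, it can help to keep them symmetric around zero so that the neutral color of a diverging scale stays at logFC = 0. Whether `domino_plot()` centers the scale on its own is not stated here, so treat the following as a sketch under that assumption. `symmetric_limits` is a hypothetical helper and the logFC values are made up.

```r
# Made-up log fold changes, for illustration only
logFC <- c(-2.3, -0.4, 0.9, 1.7)

# Hypothetical helper: limits of the form c(-m, m), optionally padded by `pad`
symmetric_limits <- function(values, pad = 0) {
  m <- max(abs(values), na.rm = TRUE) + pad
  c(-m, m)
}

limits <- symmetric_limits(logFC, pad = 1)
print(limits)
```

The resulting vector could then be passed as `logfc_limits` in place of the hand-written `c(min, max)` pair.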
The domino_plot() function returns a list with several components:
- domino_plot: The main plot object
- Other components that vary based on the version you're using
You can access the main plot using p_advanced$domino_plot.
Since the domino plot returns a ggplot2 object, you can further customize it using standard ggplot2 functions:
p_custom <- p_advanced$domino_plot +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, size = 14),
legend.position = "bottom"
) +
labs(title = "Differential Expression Across Cell Types and Conditions")
# Display the customized plot
print(p_custom)
# Save the plot
ggsave("domino_plot_example.png", p_custom, width = 10, height = 8, dpi = 300)
You can create a faceted domino plot to separate results by a particular variable:
p_faceted <- domino_plot(
data = zebra.df,
gene_list = genes,
var_id = "contrast",
x = "gene",
y = "cell_type",
contrast = "sex",
log_fc = "logFC",
p_val = "FDR",
min_dot_size = 1,
max_dot_size = 3
)$domino_plot +
facet_wrap(~contrast, scales = "free_y") +
theme(
strip.background = element_rect(fill = "lightgray"),
strip.text = element_text(face = "bold")
)
# Display the faceted plot
print(p_faceted)
# Save the faceted plot
ggsave("domino_plot_faceted.png", p_faceted, width = 14, height = 10, dpi = 300)
The domino_plot() function has several important parameters:
Parameter | Description |
---|---|
data | Input data frame containing all necessary variables |
gene_list | List of genes to include in the plot |
var_id | Variable that identifies different conditions |
x | Variable for x-axis (typically genes) |
y | Variable for y-axis (typically cell types) |
contrast | Contrast variable (e.g., sex, treatment) |
log_fc | Column name for log fold change values |
p_val | Column name for p-values |
min_dot_size | Minimum dot size for least significant results |
max_dot_size | Maximum dot size for most significant results |
logfc_limits | Custom limits for logFC color scale |
In a domino plot:
- Color: Represents the direction and magnitude of change
  - Red typically indicates upregulation (positive logFC)
  - Blue typically indicates downregulation (negative logFC)
  - The intensity of color represents the magnitude of change
- Size: Represents statistical significance
  - Larger dots indicate more statistically significant results (smaller p-values)
  - Smaller dots indicate less statistically significant results (larger p-values)
- Position: Shows the combination of categorical variables
  - x-axis: Typically genes
  - y-axis: Typically cell types
  - Facets (if used): Can represent different conditions or contrasts
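To illustrate the size encoding, the sketch below maps p-values to dot sizes by rescaling -log10(p) linearly into the range between a minimum and maximum size. This is one plausible mapping, shown as an assumption; it is not necessarily the formula diceplot uses internally, and `p_to_size` is a hypothetical helper.

```r
# Hypothetical helper: rescale -log10(p) linearly into [min_size, max_size]
p_to_size <- function(p, min_size = 1, max_size = 3) {
  score <- -log10(p)
  rng <- range(score, na.rm = TRUE)
  # If all p-values are equal, fall back to the midpoint size
  if (diff(rng) == 0) {
    return(rep((min_size + max_size) / 2, length(p)))
  }
  min_size + (score - rng[1]) / diff(rng) * (max_size - min_size)
}

p_values <- c(0.05, 0.01, 0.001, 0.0001)
sizes <- p_to_size(p_values)
print(sizes)
```

With this mapping the largest p-value receives `min_size`, the smallest receives `max_size`, and everything in between grows with the order of magnitude of the p-value.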
This tutorial has prerequisites that are not installed by default with the diceplot package itself. Before proceeding, install the required R packages:
install.packages(c("sf", "ggplot2", "diceplot", "dplyr", "cowplot", "rnaturalearth"))
We use a dataset containing city locations in Saarland, along with their log-transformed distances to France, Switzerland, Luxembourg, and Rheinland-Pfalz.
- name: City name
- lon/lat: Geographical coordinates
- dice: Number of dice dots (fixed at 4)
- log_France, log_Swiss, log_Luxembourg, log_Rheinlandpfalz: Log-transformed distances to respective regions
library(sf)
library(ggplot2)
library(diceplot)
library(dplyr)
library(cowplot)
library(rnaturalearth)
# Define custom dice face positions
var_positions <- data.frame(
x_offset = c(-0.3, 0.3, -0.3, 0.3),
y_offset = c(0.3, 0.3, -0.3, -0.3),
var = c("log_France", "log_Swiss", "log_Luxembourg", "log_Rheinlandpfalz")
)
# Load Germany state boundaries
germany_states <- ne_states(country = "Germany", returnclass = "sf")
saarland <- germany_states[germany_states$name == "Saarland", ]
# Define city locations and distances
cities <- data.frame(
name = c("Saarbrücken", "Saarlouis", "Homburg", "Britten", "Merzig", "Lebach", "Ottweiler"),
dice = 4,
lon = c(6.996, 6.751, 7.339, 6.784, 6.639, 6.913, 7.167),
lat = c(49.234, 49.315, 49.320, 49.481, 49.442, 49.407, 49.400),
France = c(14, 12, 38, 27, 18, 27, 36),
Swiss = c(190, 204, 195, 221, 220, 210, 206),
Luxembourg = c(51, 31, 67, 23, 17, 35, 52),
Rheinlandpfalz = c(29, 27, 6, 16, 20, 12, 16)
)
# Convert to spatial object
cities_sf <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4326)
cities_sf$log_France <- log(cities_sf$France)
cities_sf$log_Swiss <- log(cities_sf$Swiss)
cities_sf$log_Luxembourg <- log(cities_sf$Luxembourg)
cities_sf$log_Rheinlandpfalz <- log(cities_sf$Rheinlandpfalz)
create_custom_legends_for_map <- function(var_positions, dot_size, legend_text_size = 9) {
legend_data <- var_positions %>% mutate(x = x_offset + 1, y = y_offset + 1)
ggplot() +
geom_point(data = legend_data, aes(x = x, y = y), size = dot_size, color = "black") +
geom_point(data = legend_data, aes(x = x, y = y), size = dot_size + 0.5, shape = 1, color = "black") +
coord_fixed(ratio = 1, xlim = c(0.5, 2.5), ylim = c(0.5, 1.5), expand = FALSE) +
geom_text(
data = legend_data,
aes(
x = ifelse(x > 0, x + 0.15, x - 0.15),
y = ifelse(y > 0, y + 0.15, y - 0.15),
label = var,
hjust = ifelse(x < 0, 1, 0),
vjust = ifelse(y > 0, 0, 1)
),
size = legend_text_size / 3, color = "black"
) +
ggtitle("Dice arrangement") +
theme_void()
}
geom_dice_sf can be added to a ggplot like any other layer:
# Generate legend plot
legend_plot <- create_custom_legends_for_map(var_positions, dot_size = 3)
# Generate main dice plot
main_plot <- ggplot() +
geom_sf(data = saarland, fill = "lightblue", color = "black") +
geom_dice_sf(
sf_data = cities_sf,
dice_value_col = "dice",
face_color = c("log_France", "log_Swiss", "log_Luxembourg", "log_Rheinlandpfalz"),
dice_size = 0.5,
dot_size = 3
) +
geom_text(
data = cities_sf,
mapping = aes(x = st_coordinates(cities_sf)[,1],
y = st_coordinates(cities_sf)[,2],
label = name),
nudge_y = 0.03, size = 3
) +
ggtitle("Saarland with Dice Markers Showing Log-Scaled Distances to Borders") +
theme_minimal()
# Combine main plot and legend
final_plot <- plot_grid(main_plot, legend_plot, ncol = 2, rel_widths = c(4, 1))
# Display the final plot
final_plot
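The four log-transformed columns in this tutorial are created one by one. The sketch below derives them in a loop on a plain data frame and checks that the labels in `var_positions` match the resulting column names, which keeps the legend in sync with the columns passed to `face_color`. It uses three of the cities from the table above and base R only; the consistency check is an addition for illustration, not something the package requires.

```r
# Distances copied from the cities table above (Saarlouis, Homburg, Merzig)
cities <- data.frame(
  name = c("Saarlouis", "Homburg", "Merzig"),
  France = c(12, 38, 18),
  Swiss = c(204, 195, 220),
  Luxembourg = c(31, 67, 17),
  Rheinlandpfalz = c(27, 6, 20)
)

regions <- c("France", "Swiss", "Luxembourg", "Rheinlandpfalz")

# Add one log_<region> column per region
for (region in regions) {
  cities[[paste0("log_", region)]] <- log(cities[[region]])
}

# Dice face layout, with labels generated from the same region names
var_positions <- data.frame(
  x_offset = c(-0.3, 0.3, -0.3, 0.3),
  y_offset = c(0.3, 0.3, -0.3, -0.3),
  var = paste0("log_", regions)
)

# Every legend label should correspond to an existing column
stopifnot(all(var_positions$var %in% names(cities)))
print(cities[, c("name", var_positions$var)])
```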
To use dice plots in Python, please refer to pyDicePlot.
For full documentation and additional examples, please refer to the documentation.
- Visualize Complex Data: Easily create plots for datasets with multiple categorical variables.
- Customization: Customize plots with titles, labels, and themes.
- Integration with ggplot2: Leverages the power of ggplot2 for advanced plotting capabilities.
We welcome contributions from the community! If you'd like to contribute:
- Fork the repository on GitHub.
- Create a new branch for your feature or bug fix.
- Submit a pull request with a detailed description of your changes.
If you have any questions, suggestions, or issues, please open an issue on GitHub.
- Update the examples to real world data
- Move example files out of test to example
- Add prototype for geom_dice_sf function
- See examples/geom_dice_sf_test2.R
- Add proper legend to the plot, remove intermediate plot
- Default logfc crop to NULL
If you use this code or the R and Python packages for your own work, please cite diceplot as:
M. Flotho, P. Flotho, A. Keller, "Diceplot: A package for high dimensional categorical data visualization," arXiv, 2024. doi:10.48550/arXiv.2410.23897
BibTeX entry:
@article{flotea2024,
author = {Flotho, M. and Flotho, P. and Keller, A.},
title = {Diceplot: A package for high dimensional categorical data visualization},
year = {2024},
journal = {arXiv preprint},
doi = {10.48550/arXiv.2410.23897}
}
[1] Flotho, M., Flotho, P., Keller, A. (2024). Diceplot: A package for high dimensional categorical data visualization. arXiv preprint. https://doi.org/10.48550/arXiv.2410.23897
[2] Flotho, M., Amand, J., Hirsch, P., Grandke, F., Wyss-Coray, T., Keller, A., Kern, F. (2023). ZEBRA: a hierarchically integrated gene expression atlas of the murine and human brain at single-cell resolution. Nucleic Acids Research, 52(D1), D1089-D1096. https://doi.org/10.1093/nar/gkad990