Co-occurrence matrix #11
Something like this? But only one side? Sources as columns/rows and the number of shared records across source pairs.
library(tidyverse)
# Dummy data
SourceX <- paste0("Source", seq(1,20))
# text for tooltip:
data <- data %>%
# ggplot
p <- ggplot(data, aes(SourceX, SourceY, fill= shared_records, text=text)) +
# plotly
ggplotly(p, tooltip="text") |
We could design a bespoke plot to include these elements (gg_miss will fail where there are no missing data... which might happen I guess?) |
For sure - I meant to suggest showing the number of articles per source, rather than missing values in this plot, just based on this visualisation. Present/absent in a source is very similar to missing/not-missing in variable, so we might be able to derive some of the code from naniar - licence permitting .... |
In fact, the
if (!require(pacman)) install.packages("pacman")
#> Loading required package: pacman
pacman::p_load(tidyverse, UpSetR)
data <- data.frame(
article = 1:100,
source1 = rbinom(100, 1, .5),
source2 = rbinom(100, 1, .2),
source3 = rbinom(100, 1, .1),
source4 = rbinom(100, 1, .6),
source5 = rbinom(100, 1, .7),
source6 = rbinom(100, 1, .3)
)
data[-1] %>% upset(order.by = "freq")
Created on 2022-02-18 by the reprex package (v2.0.1) |
That's helpful - though I probably wouldn't show the information twice (i.e. above and below the diagonal). Something that might be interesting (though probably not essential enough for a standard visualisation) is the overlap, which would not be symmetrical but might help with prioritizing databases (i.e. if 90% of results of DB1 are in DB2, but only 10% of results of DB2 are in DB1, then it might be worth focusing on DB2 over DB1 in future work / updates?) |
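The asymmetry described here can be checked with a small worked example. This is only a sketch with made-up record IDs (`db1`, `db2` and their contents are hypothetical), chosen so that the numbers match the 90% / 10% case mentioned above.

```r
# Hypothetical toy data: record IDs returned by two databases
db1 <- 1:10            # 10 records
db2 <- c(2:10, 11:91)  # 90 records, 9 of them shared with db1

# Number of records found in both databases
shared <- length(intersect(db1, db2))

# Share of each database's records that also appear in the other one
overlap_db1_in_db2 <- shared / length(db1)  # 9 / 10 = 0.9
overlap_db2_in_db1 <- shared / length(db2)  # 9 / 90 = 0.1
```

The same count of shared records gives two different percentages because the denominators differ, which is why an overlap matrix of this kind is not symmetrical.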
I like that idea to show asymmetry!
From: Lukas Wallrich ***@***.***>
Sent: 21 February 2022 12:38
To: ESHackathon/CiteSource ***@***.***>
Cc: nealhaddaway ***@***.***>; Author ***@***.***>
Subject: Re: [ESHackathon/CiteSource] Co-occurrence matrix (Issue #11)
Here is an example with the databases being listed both on the x and y axis, which can show overlapping citations between databases. (this is a snapshot of only part of a sheet so the numbers do not add up)
|
Yes, I think it would be cleaner to only include one side of the diagonal. Would an option to switch between % overlap and # of records (shown in the image I posted) solve this? We could also add a total row to show the total % of overlap. |
I now gave the two visualisations a go (# overlapping and % overlapping) as I started wondering how the data wrangling would actually work ... I hope this didn't just duplicate someone else's work. Does this look helpful? Then I can turn it into a proper function on a branch. While I still like the message of the % overlap, I'm not sure if this is the most effective visualisation for that - maybe it would be worth trying to space out the rows to make clear that they are the primary dimension in that case? Any other suggestions?
if (!require(pacman)) install.packages("pacman")
#> Loading required package: pacman
pacman::p_load(tidyverse)
data <- data.frame(
article = 1:500,
source1 = rbinom(500, 1, .5),
source2 = rbinom(500, 1, .2),
source3 = rbinom(500, 1, .1),
source4 = rbinom(500, 1, .6),
source5 = rbinom(500, 1, .7),
source6 = rbinom(500, 1, .1),
source7 = rbinom(500, 1, .25),
source8 = rbinom(500, 1, .4)
)
sources <- data %>% select(matches("source"))
cooc_mat <- purrr::map_dfr(names(sources), function (source) {
sources[sources[source] == 1,] %>% colSums()
}) %>% select(all_of(names(sources))) %>% as.matrix()
# To plot co-occurrences
cooc_mat_plot <- cooc_mat
cooc_mat_plot[upper.tri(cooc_mat_plot)] <- NA
cooc_mat_plot %>% as_tibble() %>% mutate(DB1 = names(sources)) %>%
pivot_longer(-DB1, names_to = "DB2", values_to = "records") %>%
ggplot(aes(DB1, DB2, fill = records)) +
geom_tile() +
scale_fill_gradient(low="white") +
geom_text( aes(label=records))
#> Warning: Removed 28 rows containing missing values (geom_text).
# To plot overlap
overlap_matrix <- cooc_mat/diag(cooc_mat)
labels_matrix <- overlap_matrix
labels_matrix[TRUE] <- paste(round(overlap_matrix, 2) * 100, "%")
diag(labels_matrix) <- diag(cooc_mat)
labels_df <- labels_matrix %>% as_tibble() %>% mutate(DB1 = names(sources)) %>%
pivot_longer(-DB1, names_to = "DB2", values_to = "label")
overlap_matrix %>% as_tibble() %>% mutate(DB1 = names(sources)) %>%
pivot_longer(-DB1, names_to = "DB2", values_to = "records") %>%
ggplot(aes(DB1, DB2, fill = records)) +
geom_tile() +
scale_fill_gradient(low="white", labels = scales::percent, limits = c(0, .999)) +
geom_text(data = labels_df, aes(label=label, fill = NULL)) +
labs(fill = "Overlap", caption = "Note: Percentages are share of records in row also found in column,
number of results in each database is shown on the diagonal")
Created on 2022-02-22 by the reprex package (v2.0.1) |
These look really fantastic. Can we add a column/row that shows the % unique? I didn't understand the idea of spacing out the rows for the % visual. I'm open to other visuals for sure. |
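One possible way to get a "% unique" figure per source is sketched below. It assumes the same 0/1 layout as the dummy data above (one row per record, one column per source) and treats a record as unique when it appears in exactly one source; the names `in_one_source_only` and `unique_summary` are made up for illustration.

```r
set.seed(42)  # only to make the dummy data reproducible
sources <- data.frame(
  source1 = rbinom(500, 1, .5),
  source2 = rbinom(500, 1, .2),
  source3 = rbinom(500, 1, .1)
)

# A record counts as unique if exactly one source contains it
in_one_source_only <- rowSums(sources) == 1

n_records <- colSums(sources)
n_unique  <- colSums(sources[in_one_source_only, , drop = FALSE])

# One row per source: total records, unique records, and % unique
unique_summary <- data.frame(
  source     = names(sources),
  records    = n_records,
  unique     = n_unique,
  pct_unique = round(100 * n_unique / n_records, 1)
)
unique_summary
```

The resulting `pct_unique` column could then be appended to the heatmap as an extra row or column, though how best to display it is left open here.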
I really like the way that this visual looks and how it might be able to show the data. I just want to make sure I understand it 100%. The set size is the number of records that were part of the .ris, so databases OR unique strings. In this case, if they were databases, the 3rd database had 2 unique records, the 4th database had 7 unique records, and they shared 4 records that were just between them and 9 that were also in database 1 and database 2. Just to make sure I'm thinking about its use the same way. |
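The reading described here (set size as the total per source, intersection bars as records found in exactly that combination of sources) can be reproduced by hand on a tiny example. The data and the helper names `set_size`, `pattern` and `intersection_size` are hypothetical and only meant to illustrate the counting.

```r
# Hypothetical toy data: one row per record, 1 = record found in that source
sources <- data.frame(
  source1 = c(1, 1, 0, 0, 1, 0),
  source2 = c(1, 0, 1, 0, 1, 0),
  source3 = c(0, 0, 1, 1, 1, 1)
)

# Set size: total number of records per source
set_size <- colSums(sources)

# Intersection size: number of records per exact combination of sources
pattern <- apply(sources, 1, function(row) {
  paste(names(sources)[row == 1], collapse = " & ")
})
intersection_size <- table(pattern)

set_size
intersection_size
```

In this toy data `source3` has a set size of 4, of which 2 records appear in `source3` only; the other two are counted under the combinations they share with other sources.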
Yes, that's all exactly right. We should definitely rename "Set size" for clarity. Maybe "intersection size" as well, though I am not 100% sure what would be clearer?
if (!require(pacman)) install.packages("pacman")
#> Loading required package: pacman
pacman::p_load(tidyverse, UpSetR)
data <- data.frame(
article = 1:100,
source1 = rbinom(100, 1, .5),
source2 = rbinom(100, 1, .2),
source3 = rbinom(100, 1, .1),
source4 = rbinom(100, 1, .6),
source5 = rbinom(100, 1, .7),
source6 = rbinom(100, 1, .3)
)
data[-1] %>% upset(order.by = "freq", sets.x.label = "Number of records") |
I also love the way this visualisation looks! Maybe "overlapping citation count" or something for the intersection size? |
Thank you for explaining this Trevor! I was having a hard time interpreting this at first, but now that I get it, I really like it. "Number of records" is a good label for the small graph, and I agree with Kaitlyn's suggestion of "Overlapping citation count" for the large graph. Could also be "Overlapping record count". |
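A sketch of how both suggested labels could be set, assuming the `sets.x.label` and `mainbar.y.label` arguments of `UpSetR::upset()` and the same kind of dummy data as in the earlier examples (reduced to four sources here to keep it short):

```r
library(UpSetR)

set.seed(1)  # only to make the dummy data reproducible
data <- data.frame(
  article = 1:100,
  source1 = rbinom(100, 1, .5),
  source2 = rbinom(100, 1, .2),
  source3 = rbinom(100, 1, .6),
  source4 = rbinom(100, 1, .7)
)

# sets.x.label renames the small bar chart ("Set Size" by default),
# mainbar.y.label renames the large one ("Intersection Size" by default)
upset(
  data[-1],
  order.by = "freq",
  sets.x.label = "Number of records",
  mainbar.y.label = "Overlapping record count"
)
```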
I have now created the functions to produce the heatmaps and upset plots in the plots branch. Two notes - mostly regarding defaults.
|
Love how these are looking. The first one now makes it really easy to see how many unique records each database is contributing. Would there be a way to make these interactive at all, in the sense of being able to click on a bar and get a list of the records. Or maybe that output would be provided in another format...I'm just thinking it would be nice to see what those unique records are. Also like how the bottom graph is shaping up. I think the Set Size is a useful addition! |
We should be able to add interactivity with plotly() or by appending javascript links, yeah :) |
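As a rough sketch of the plotly route, `plotly::ggplotly()` can convert a ggplot heatmap into an interactive widget with custom hover text. The data frame `shared` and its values are made up for illustration, and this only covers tooltips; listing the underlying records on click would need additional work.

```r
library(ggplot2)
library(plotly)

# Hypothetical long-format counts of shared records per source pair
shared <- data.frame(
  DB1 = c("source1", "source2", "source2"),
  DB2 = c("source1", "source1", "source2"),
  records = c(120, 35, 80)
)
shared$text <- paste0(shared$DB1, " & ", shared$DB2, ": ", shared$records, " records")

# `text` is not used by ggplot2 itself; it is picked up by ggplotly()
p <- ggplot(shared, aes(DB1, DB2, fill = records, text = text)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue")

# Convert the static plot and show only the custom label on hover
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
```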
I'm closing this as most of it is done. Still open: we have another issue for interactivity of plots #22. I will open a new issue to add the set size to heat maps. |
We need to produce a co-occurrence matrix with sources as columns/rows and the number of shared records across source pairs.
Not sure if co-occurrence or frequency matrix is the right word - can't find anything that shows frequencies rather than correlations or p-values, but maybe because it's very basic...
But something that looks like this:
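For reference, a matrix of shared-record counts like the one requested can be sketched in base R with `crossprod()`, assuming a 0/1 membership table with one row per record and one column per source (the dummy `sources` data is hypothetical):

```r
set.seed(123)  # only to make the dummy data reproducible
sources <- data.frame(
  source1 = rbinom(100, 1, .5),
  source2 = rbinom(100, 1, .2),
  source3 = rbinom(100, 1, .6)
)

# t(X) %*% X on a 0/1 membership matrix counts, for every pair of sources,
# the records present in both; the diagonal holds each source's total
cooc_mat <- crossprod(as.matrix(sources))
cooc_mat
```

Because the result holds counts rather than correlations or p-values, "co-occurrence matrix" seems a reasonable name for it.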