Co-occurrence matrix #11

Closed
nealhaddaway opened this issue Feb 17, 2022 · 19 comments
Labels
ESHackathon Help wanted Extra attention is needed Output Data output visualisation

Comments

@nealhaddaway
Contributor

We need to produce a co-occurrence matrix with sources as columns/rows and the number of shared records across source pairs.

Not sure if co-occurrence or frequency matrix is the right word - can't find anything that shows frequencies rather than correlations or p-values, but maybe because it's very basic...
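
One way to read this request: if every record is a row and every source is a 0/1 column, the matrix of pairwise shared-record counts is the cross-product of that table with itself. A minimal sketch with made-up data (the `incidence` matrix and its source names are purely illustrative, not project data):

```r
# Hypothetical incidence matrix: one row per record, one 0/1 column per source
incidence <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    1, 1, 1,
    0, 1, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("source1", "source2", "source3"))
)

# t(incidence) %*% incidence: cell [i, j] counts records found in both
# source i and source j; the diagonal holds the total records per source
cooc <- crossprod(incidence)
cooc
```

The result is symmetric, which is why showing only one side of the diagonal loses no information.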

But something that looks like this:

@LukasWallrich
Collaborator

I wonder if the naniar::gg_miss_upset visualisation would be a better / alternative style for this? That would allow one to see which combinations of database sources records were found in, and I'd find it helpful to see the size of each set included in the chart.

image

@DrMattG
Collaborator

DrMattG commented Feb 18, 2022

Something like this? But only one side?
```r
# Sources as columns/rows and the number of shared records across source pairs.

library(tidyverse)
library(hrbrthemes)
library(plotly)

# Dummy data

SourceX <- paste0("Source", seq(1, 20))
SourceY <- paste0("Source", seq(1, 20))
data <- expand.grid(SourceX = SourceX, SourceY = SourceY)
data$shared_records <- round(runif(400, 0, 25))
# Blank out the diagonal (a source compared with itself)
data <- data %>%
  mutate(shared_records = ifelse(SourceX == SourceY, NA, shared_records))

# Text for tooltip

data <- data %>%
  mutate(text = paste0("Shared_records: ", round(shared_records, 2), "\n"))

# ggplot

p <- ggplot(data, aes(SourceX, SourceY, fill = shared_records, text = text)) +
  geom_tile() +
  labs(x = "", y = "") +
  theme_ipsum() +
  theme(axis.text.x = element_text(angle = 45))

# plotly

ggplotly(p, tooltip = "text")
```

@DrMattG
Collaborator

DrMattG commented Feb 18, 2022

> I wonder if the naniar::gg_miss_upset visualisation would be a better / alternative style for this? That would allow one to see which combinations of database sources records were found in, and I'd find it helpful to see the size of each set included in the chart.

image

We could design a bespoke plot to include these elements (gg_miss will fail where there are no missing data... which might happen I guess?)

@LukasWallrich
Collaborator

> We could design a bespoke plot to include these elements (gg_miss will fail where there are no missing data... which might happen I guess?)

For sure - I meant to suggest showing the number of articles per source, rather than missing values in this plot, just based on this visualisation. Present/absent in a source is very similar to missing/not-missing in variable, so we might be able to derive some of the code from naniar - licence permitting ....

@LukasWallrich
Collaborator

In fact, the UpSetR::upset function that the naniar vis is based on directly gives the desired result, without worrying about missing data. This visualisation is obviously only helpful when the number of sources is quite small - but that seems to be a realistic situation, maybe with a long tail grouped as "other"?

```r
if (!require(pacman)) install.packages("pacman")
#> Loading required package: pacman
pacman::p_load(tidyverse, UpSetR)

data <- data.frame(
  article = 1:100,
  source1 = rbinom(100, 1, .5),
  source2 = rbinom(100, 1, .2),
  source3 = rbinom(100, 1, .1),
  source4 = rbinom(100, 1, .6),
  source5 = rbinom(100, 1, .7),
  source6 = rbinom(100, 1, .3)
)

data[-1] %>% upset(order.by = "freq")
```

Created on 2022-02-18 by the reprex package (v2.0.1)

@TNRiley
Collaborator

TNRiley commented Feb 21, 2022

Here is an example with the databases listed on both the x and y axes, which can show overlapping citations between databases. (This is a snapshot of only part of a sheet, so the numbers do not add up.)

chrome-capture

@LukasWallrich
Collaborator

> Here is an example with the databases being listed both on the x and y axis, which can show overlapping citations between databases. (this is a snapshot of only part of a sheet so the numbers do not add up)

That's helpful - though I probably wouldn't show the information twice (i.e. above and below the diagonal). Something that might be interesting (though probably not essential enough for a standard visualisation) is the overlap, which would not be symmetrical but might help with prioritizing databases (i.e. if 90% of results of DB1 are in DB2, but only 10% of results of DB2 are in DB1, then it might be worth focusing on DB2 over DB1 in future work / updates?)
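
To make the asymmetry concrete, here is a small worked sketch with invented counts chosen to match the 90% / 10% example (DB1 with 10 records, DB2 with 90, and 9 records in both). Dividing each row of the count matrix by that database's own total gives the share of its records that also appear in the other database:

```r
# Made-up counts: DB1 has 10 records, DB2 has 90, and 9 are in both
cooc <- matrix(
  c(10,  9,
     9, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("DB1", "DB2"), c("DB1", "DB2"))
)

# R recycles the diagonal down each column, so row i is divided by the
# size of database i: overlap[i, j] = share of i's records also found in j
overlap <- cooc / diag(cooc)
round(overlap, 2)
```

Here `overlap["DB1", "DB2"]` is 9/10 while `overlap["DB2", "DB1"]` is 9/90, so the same 9 shared records read very differently depending on which database is taken as the base.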

@nealhaddaway
Contributor Author

nealhaddaway commented Feb 21, 2022 via email

@TNRiley
Collaborator

TNRiley commented Feb 22, 2022

Yes, I think it would be cleaner to only include one side of the diagonal. Would an option for % overlap vs. # of records (shown in the image I posted) solve this? We could also add a total row to show the total % of overlap.

@LukasWallrich
Collaborator

I now gave the two visualisations a go (# overlapping and % overlapping) as I started wondering how the data wrangling would actually work ... I hope this didn't just duplicate someone else's work. Does this look helpful? Then I can turn it into a proper function on a branch.

While I still like the message of the % overlap, I'm not sure if this is the most effective visualisation for that - maybe it would be worth trying to space out the rows to make clear that they are the primary dimension in that case? Any other suggestions?

```r
if (!require(pacman)) install.packages("pacman")
#> Loading required package: pacman
pacman::p_load(tidyverse)

data <- data.frame(
  article = 1:500,
  source1 = rbinom(500, 1, .5),
  source2 = rbinom(500, 1, .2),
  source3 = rbinom(500, 1, .1),
  source4 = rbinom(500, 1, .6),
  source5 = rbinom(500, 1, .7),
  source6 = rbinom(500, 1, .1),
  source7 = rbinom(500, 1, .25),
  source8 = rbinom(500, 1, .4)
)

sources <- data %>% select(matches("source"))

cooc_mat <- purrr::map_dfr(names(sources), function(source) {
  sources[sources[source] == 1, ] %>% colSums()
}) %>% select(all_of(names(sources))) %>% as.matrix()

# To plot co-occurrences

cooc_mat_plot <- cooc_mat

cooc_mat_plot[upper.tri(cooc_mat_plot)] <- NA

cooc_mat_plot %>% as_tibble() %>% mutate(DB1 = names(sources)) %>%
  pivot_longer(-DB1, names_to = "DB2", values_to = "records") %>%
  ggplot(aes(DB1, DB2, fill = records)) +
  geom_tile() +
  scale_fill_gradient(low = "white") +
  geom_text(aes(label = records))
#> Warning: Removed 28 rows containing missing values (geom_text).

# To plot overlap

overlap_matrix <- cooc_mat / diag(cooc_mat)

labels_matrix <- overlap_matrix
labels_matrix[TRUE] <- paste(round(overlap_matrix, 2) * 100, "%")

diag(labels_matrix) <- diag(cooc_mat)

labels_df <- labels_matrix %>% as_tibble() %>% mutate(DB1 = names(sources)) %>%
  pivot_longer(-DB1, names_to = "DB2", values_to = "label")

overlap_matrix %>% as_tibble() %>% mutate(DB1 = names(sources)) %>%
  pivot_longer(-DB1, names_to = "DB2", values_to = "records") %>%
  ggplot(aes(DB1, DB2, fill = records)) +
  geom_tile() +
  scale_fill_gradient(low = "white", labels = scales::percent, limits = c(0, .999)) +
  geom_text(data = labels_df, aes(label = label, fill = NULL)) +
  labs(fill = "Overlap", caption = "Note: Percentages are share of records in row also found in column, 
       number of results in each database is shown on the diagonal")
```

Created on 2022-02-22 by the reprex package (v2.0.1)

@TNRiley
Collaborator

TNRiley commented Feb 23, 2022

These look really fantastic. Can we add a column/row that shows the % unique?

I didn't understand the idea of spacing out the rows for the % visual. I'm open to other visuals for sure.
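
As a rough sketch of what a "% unique" column could be based on: a record counts as unique when it appears in exactly one source, and the share is taken relative to that source's total. The `sources` table below is invented for illustration, and the names `n_unique` and `pct_unique` are assumptions rather than anything in the package:

```r
# Hypothetical 0/1 table: one row per record, one column per source
sources <- data.frame(
  source1 = c(1, 1, 1, 0, 0),
  source2 = c(1, 0, 1, 1, 0),
  source3 = c(0, 0, 1, 0, 1)
)

# Records found in exactly one source
only_one <- rowSums(sources) == 1

# Per source: number of unique records and their share of that source's total
n_unique <- colSums(sources[only_one, , drop = FALSE])
pct_unique <- n_unique / colSums(sources)

round(100 * pct_unique, 1)
```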

@TNRiley
Collaborator

TNRiley commented Feb 23, 2022

image

I really like the way that this visual looks and how it might be able to show the data. I just want to make sure I understand it 100%.

The set size is the number of records that were part of the .ris, so databases OR unique strings
The filled dots are where each record was found
The intersection size number is how many records were found in that unique overlap combination

In this case if they were databases... the 3rd database had 2 unique records, the 4th database had 7 unique records, and they shared 4 records between them (that were just between them) and 9 that were also in database 1 and database 2.

Just to make sure I'm thinking about its use the same way
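
The reading above can be checked by hand on a small table. The sketch below, using invented data and illustrative names (`records`, `combo`), labels each record with the exact combination of databases it was found in and then counts records per combination, which corresponds to the intersection sizes, alongside the per-database totals, which correspond to the set sizes:

```r
# Hypothetical 0/1 table: one row per record, one column per database
records <- data.frame(
  db1 = c(1, 1, 0, 0, 0, 1),
  db2 = c(1, 1, 0, 0, 0, 0),
  db3 = c(1, 0, 1, 1, 0, 0),
  db4 = c(1, 0, 0, 1, 1, 0)
)

# Label each record with the exact set of databases it was found in
combo <- apply(records, 1, function(x) {
  paste(names(records)[x == 1], collapse = " & ")
})

# Intersection size: number of records per exact combination
intersection_size <- table(combo)

# Set size: total number of records per database
set_size <- colSums(records)

intersection_size
set_size
```

Each record falls into exactly one combination, so the intersection sizes sum to the number of records, whereas the set sizes count a record once for every database it appears in.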

@LukasWallrich
Collaborator

> I really like the way that this visual looks and how it might be able to show the data. I just want to make sure I understand it 100%

Yes, that's all exactly right. We should definitely rename "Set size" for clarity. Maybe "intersection size" as well, though I am not 100% sure what would be clearer?

Code
```r
if (!require(pacman)) install.packages("pacman")
#> Loading required package: pacman
pacman::p_load(tidyverse, UpSetR)

data <- data.frame(
  article = 1:100,
  source1 = rbinom(100, 1, .5),
  source2 = rbinom(100, 1, .2),
  source3 = rbinom(100, 1, .1),
  source4 = rbinom(100, 1, .6),
  source5 = rbinom(100, 1, .7),
  source6 = rbinom(100, 1, .3)
)

data[-1] %>% upset(order.by = "freq", sets.x.label = "Number of records")
```

@kaitlynhair
Collaborator

I also love the way this visualisation looks!

Maybe "overlapping citation count" or something for the intersection size?

@rootsandberries
Collaborator

Thank you for explaining this Trevor! I was having a hard time interpreting this at first, but now that I get it, I really like it. "Number of records" is a good label for the small graph, and I agree with Kaitlyn's suggestion of "Overlapping citation count" for the large graph. Could also be "Overlapping record count".

@LukasWallrich
Collaborator

I have now created the functions to produce the heatmaps and upset plots in the plots branch.

Two notes - mostly regarding defaults.

  1. I changed the default order of the upset plots to focus on the number of sources that records are found in - that seems to be of the greatest practical interest (i.e. how many unique per source, per pair of sources - or conversely, shared by all/most sources). The user can still switch it to sort by frequency across any combination of sources.

image

  2. By default, the co-occurrence matrices are now sorted based on the number of records, with the largest source on top (but the user can turn that off so that the ordering in the data is preserved). In addition, it might be nice to show the size of each source as in the example below? However, that is not trivial, so not something I can implement at the moment. If we want it, the best way might be to use the superheat package.

image
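
For readers wondering what sorting by number of records amounts to, here is a minimal base R sketch. It is not the code from the plots branch; the matrix and its values are made up, and it only illustrates reordering rows and columns together by the diagonal (the per-source totals):

```r
# Hypothetical co-occurrence matrix; the diagonal holds each source's size
cooc <- matrix(
  c(3, 2, 1,
    2, 5, 2,
    1, 2, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("source1", "source2", "source3"),
    c("source1", "source2", "source3")
  )
)

# Order sources from largest to smallest, applying the same order to
# rows and columns so the matrix stays symmetric
ord <- order(diag(cooc), decreasing = TRUE)
cooc_sorted <- cooc[ord, ord]
cooc_sorted
```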

@rootsandberries
Collaborator

Love how these are looking. The first one now makes it really easy to see how many unique records each database is contributing. Would there be a way to make these interactive at all, in the sense of being able to click on a bar and get a list of the records? Or maybe that output would be provided in another format... I'm just thinking it would be nice to see what those unique records are.

Also like how the bottom graph is shaping up. I think the Set Size is a useful addition!
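
Independently of any interactivity, the list of records behind a given bar could be produced as a plain table. The sketch below is only an illustration under assumed column names; `records_in_combo()` is a hypothetical helper, not an existing function in the package:

```r
# Hypothetical data: one row per record with 0/1 flags per source
records <- data.frame(
  title   = c("Paper A", "Paper B", "Paper C", "Paper D"),
  source1 = c(1, 1, 0, 0),
  source2 = c(1, 0, 1, 0),
  source3 = c(0, 0, 1, 1),
  stringsAsFactors = FALSE
)
source_cols <- c("source1", "source2", "source3")

# Return the records found in exactly the given combination of sources
# (and in no other source), i.e. the records behind one bar of the plot
records_in_combo <- function(data, cols, combo) {
  target <- as.numeric(cols %in% combo)
  is_match <- apply(data[cols], 1, function(x) all(x == target))
  data[is_match, , drop = FALSE]
}

# Records unique to source1
records_in_combo(records, source_cols, "source1")

# Records shared by source2 and source3 only
records_in_combo(records, source_cols, c("source2", "source3"))
```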

@nealhaddaway
Contributor Author

We should be able to add interactivity with plotly() or by appending javascript links, yeah :)

@LukasWallrich
Collaborator

I'm closing this as most of it is done.

Still open: we have another issue for interactivity of plots #22.

I will open a new issue to add the set size to heat maps.


7 participants