-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathnevada.qmd
303 lines (221 loc) · 15.4 KB
/
nevada.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
---
title: "nevada"
vignette: >
%\VignetteIndexEntry{nevada}
%\VignetteEngine{quarto::html}
%\VignetteEncoding{UTF-8}
knitr:
opts_chunk:
collapse: true
comment: '#>'
bibliography: references.bib
---
```{r}
#| label: setup
library(nevada)
```
*Network-valued data* are data in which the statistical unit is a network itself. This is the data with which we can make inference on *populations of networks* from *samples of networks*. The [**nevada**](https://astamm.github.io/nevada/) package proposes a specific `nvd` class to handle network-valued data. Inference from such samples is made possible though a 4-step procedure:
- Choose a suitable representation of your samples of networks.
- Choose a suitable distance to embed your representation into a nice metric space.
- Choose one or more test statistics to define your alternative hypothesis.
- Compute an empirical permutation-based approximation of the null distribution.
The package focuses for now on the two-sample testing problem and assumes that
all networks from both samples share the same node structure.
There are two types of questions that one can ask:
1. Is there a difference between the distributions that generated the two
observed samples?
2. Can we localize the differences between the distributions on the node structure?
The [**nevada**](https://astamm.github.io/nevada/) package offers a dedicated function for answering each of these two questions:
- [`test2_global()`](https://astamm.github.io/nevada/reference/test2_global.html); for more details, please see @lovato2020,
- [`test2_local()`](https://astamm.github.io/nevada/reference/test2_local.html); for more details, please see @lovato2021.
## The `nvd` class for network-valued data
In [**nevada**](https://astamm.github.io/nevada/), network-valued data are
stored in an object of class `nvd`, which is basically a list of
[**igraph**](https://igraph.org/r/) objects. We provide:
- a constructor [`nvd()`](https://astamm.github.io/nevada/reference/nvd.html) which allows the user to simulate samples of networks using some of the popular models from [**igraph**](https://igraph.org/r/). Currently, one can use:
- the stochastic block model,
- the $k$-regular model,
- the GNP model,
- the small-world model,
- the PA model,
- the Poisson model,
- the binomial model.
The constructor simulates networks with 25 nodes.
- a function [`as_nvd()`](https://astamm.github.io/nevada/reference/as_nvd.html) to coerce lists of [**igraph**](https://igraph.org/r/) objects into an object of class `nvd`.
## Network representation
There are currently 3 possible matrix representations for a network. Let $G$
be a network with $N$ nodes.
### Adjacency matrix
A $N$ x $N$ matrix $W$ is an *adjacency matrix* for
$G$ if element $W_{ij}$ indicates if there is an edge between vertex $i$
and vertex $j$: $$
W_{ij}=
\begin{cases}
w_{i,j}, & \mbox{if } (i,j) \in E \mbox{ with weight } w_{i,j}\\
0, & \mbox{otherwise.}
\end{cases}
$$
In [**nevada**](https://astamm.github.io/nevada/), this representation can be
achieved with
[`repr_adjacency()`](https://astamm.github.io/nevada/reference/representations.html).
### Laplacian matrix
The *Laplacian matrix* $L$ of the network $G$ is defined in the following way:
$$ L = D(W) - W, $$
where $D(W)$ is the diagonal matrix whose $i$-*th* diagonal element is the
degree of vertex $i$.
In [**nevada**](https://astamm.github.io/nevada/), this representation can be
achieved with
[`repr_laplacian()`](https://astamm.github.io/nevada/reference/representations.html).
### Modularity matrix
The elements of the *modularity matrix* $B$ are given by
$$ B_{ij} = W_{ij} - \frac{d_i d_j}{2m}, $$
where $d_i$ and $d_j$ are the degrees of vertices $i$ and $j$ respectively, and
$m$ is the total number of edges in the network.
In [**nevada**](https://astamm.github.io/nevada/), this representation can be
achieved with
[`repr_modularity()`](https://astamm.github.io/nevada/reference/representations.html).
### Choosing a representation for an object of class `nvd`
Instead of going through every single network in a sample to make its
representation, [**nevada**](https://astamm.github.io/nevada/) provides the
[`repr_nvd()`](https://astamm.github.io/nevada/reference/repr_nvd.html) function
which does exactly that for an object of class `nvd`.
```{r}
x <- nvd(sample_size = 3L, model = "gnp", n = 24L, p = 1/3)
repr_nvd(x, representation = "laplacian")
```
## Distances between networks
It is possible to choose which distance consider in the analysis. Let $G$ and
$H$ be two networks with $N$ nodes each and suppose that $X$ and $Y$ are the
matrix representations of $G$ and $H$, respectively. The user can currently
choose among 4 distances: Hamming, Frobenius, spectral and root-Euclidean.
### Hamming distance
$$ \rho_H(G,H)=\frac{1}{N(N-1)}\sum_{i \neq j}^N \bigl\arrowvert X_{i,j}-Y_{i,j} \bigr\arrowvert. $$
In [**nevada**](https://astamm.github.io/nevada/), this distance can be computed with [`dist_hamming()`](https://astamm.github.io/nevada/reference/distances.html).
### Frobenius distance
$$ \rho_F(G,H) = \left\| X - Y \right\|_F^2 = \sum_{i \neq j}^N \bigl ( X_{i,j}-Y_{i,j} \bigr )^2. $$
In [**nevada**](https://astamm.github.io/nevada/), this distance can be computed
with
[`dist_frobenius()`](https://astamm.github.io/nevada/reference/distances.html).
### Spectral distance
$$ \rho_S(G,H)=\sum_{i \neq j}^N \bigl ( \Lambda^X_{i,j}-\Lambda^Y_{i,j} \bigr )^2, $$
where $\Lambda^X$ and $\Lambda^Y$ are the diagonal matrices with eigenvalues on
the diagonal given by the spectral decomposition of the matrix representations
of $G$ and $H$.
In [**nevada**](https://astamm.github.io/nevada/), this distance can be computed
with
[`dist_spectral()`](https://astamm.github.io/nevada/reference/distances.html).
### Root Euclidean distance
$$ \rho_{RE}(G,H) = \left\| X^{1/2} - Y^{1/2} \right\|_F^2. $$
Note that this distance is not compatible with all matrix representations as it requires that the representation be semi-positive definite.
In [**nevada**](https://astamm.github.io/nevada/), this distance can be computed
with
[`dist_root_euclidean()`](https://astamm.github.io/nevada/reference/distances.html).
### Computing a matrix of pairwise distances for an object of class `nvd`
Pre-computation of the matrix of pairwise distances for samples of networks
alleviates the computational burden of permutation testing. This is why
[**nevada**](https://astamm.github.io/nevada/) provides the convenient
[`dist_nvd()`](https://astamm.github.io/nevada/reference/dist_nvd.html) function
which does exactly that for an object of class `nvd`.
```{r}
x <- nvd(sample_size = 3L, model = "gnp", n = 24L, p = 1/3)
dist_nvd(x, representation = "laplacian", distance = "hamming")
```
## Test statistics
The [**nevada**](https://astamm.github.io/nevada/) package has been designed to work well with the [**flipr**](https://lmjl-alea.github.io/flipr/) package, which handles the permutation scheme once suitable representation, distance and test statistics have been chosen. The most efficient way to two-sample testing with network-valued data pertains to use statistics based on inter-point distances, that is pairwise distances between observations.
### Available statistics
#### From [**flipr**](https://lmjl-alea.github.io/flipr/)
A number of test statistics along this line have been proposed in the literature, including ours [@lovato2020]. As these test statistics rely on inter-point distances, they are not specific to network-valued data. As such, they can be found in [**flipr**](https://lmjl-alea.github.io/flipr/). We adopt the naming convention that a test statistic function shall start with the prefix `stat_`. All statistics based on inter-point distances are named with the suffix `_ip`. Here is the list of test statistics based on inter-point distances that are currently available in [**flipr**](https://lmjl-alea.github.io/flipr/):
- [`stat_student_ip()`](https://lmjl-alea.github.io/flipr/reference/two-sample-stats.html) and its alias `stat_t_ip()` implement a Student-like test statistic based on inter-point distances proposed by @lovato2020;
- [`stat_fisher_ip()`](https://lmjl-alea.github.io/flipr/reference/two-sample-stats.html) and its alias `stat_f_ip()` implement a Fisher-like test statistic based on inter-point distances proposed by @lovato2020;
- [`stat_bg_ip()`](https://lmjl-alea.github.io/flipr/reference/two-sample-stats.html) implements the statistic proposed by @biswas2014nonparametric;
- [`stat_energy_ip()`](https://lmjl-alea.github.io/flipr/reference/two-sample-stats.html) implements the class of energy-based statistics as proposed by @szekely2013energy;
- [`stat_cq_ip()`](https://lmjl-alea.github.io/flipr/reference/two-sample-stats.html) implements the statistic proposed by @chen2010two;
- [`stat_mod_ip()`](https://lmjl-alea.github.io/flipr/reference/two-sample-stats.html) implements a statistic that computes the mean of inter-point distances;
- [`stat_dom_ip()`](https://lmjl-alea.github.io/flipr/reference/two-sample-stats.html) implements a statistic that computes the distance between the medoids of the two samples, possibly standardized by the pooled corresponding variances.
#### From [**nevada**](https://astamm.github.io/nevada/)
There are also 3 statistics proposed in @chen2018weighted that are based on a similarity graph built on top of the distance matrix:
- [`stat_original_edge_count()`](https://astamm.github.io/nevada/reference/statistics.html),
- [`stat_generalized_edge_count()`](https://astamm.github.io/nevada/reference/statistics.html),
- [`stat_weighted_edge_count()`](https://astamm.github.io/nevada/reference/statistics.html).
There are also Student-like statistics available only for Frobenius distance for which we can easily compute the Fréchet mean. These are:
- [`stat_student_euclidean()`](https://astamm.github.io/nevada/reference/statistics.html),
- [`stat_welch_euclidean()`](https://astamm.github.io/nevada/reference/statistics.html).
#### Write your own test statistic function
In addition to the test statistic functions already implemented in [**flipr**](https://lmjl-alea.github.io/flipr/) and [**nevada**](https://astamm.github.io/nevada/), you can also implement your own function. Test statistic functions compatible with [**flipr**](https://lmjl-alea.github.io/flipr/) should have at least two mandatory input arguments:
- `data` which is either a concatenated list of size $n_x + n_y$ regrouping the data points of both samples or a distance matrix of size $(n_x + n_y) \times (n_x + n_y)$ stored as an object of class `dist`.
- `indices1` which is an integer vector of size $n_x$ storing the indices of the data points belonging to the first sample in the current permuted version of the data.
The [**flipr**](https://lmjl-alea.github.io/flipr/) package provides a helper function `use_stat(nsamples = 2, stat_name = )` which makes it easy for users to create their own test statistic ready to be used by [**nevada**](https://astamm.github.io/nevada/). This function creates and saves a `.R` file in the `R/` folder of the current working directory and populates it with the following template:
```{r, eval=FALSE}
#' Test Statistic for the Two-Sample Problem
#'
#' This function computes the test statistic...
#'
#' @param data A list storing the concatenation of the two samples from which
#' the user wants to make inference. Alternatively, a distance matrix stored
#' in an object of class \code{\link[stats]{dist}} of pairwise distances
#' between data points.
#' @param indices1 An integer vector that contains the indices of the data
#' points belong to the first sample in the current permuted version of the
#' data.
#'
#' @return A numeric value evaluating the desired test statistic.
#' @export
#'
#' @examples
#' # TO BE DONE BY THE DEVELOPER OF THE PACKAGE
stat_{{{name}}} <- function(data, indices1) {
n <- if (inherits(data, "dist"))
attr(data, "Size")
else if (inherits(data, "list"))
length(data)
else
stop("The `data` input should be of class either list or dist.")
indices2 <- seq_len(n)[-indices1]
x <- data[indices1]
y <- data[indices2]
# Here comes the code that computes the desired test
# statistic from input samples stored in lists x and y
}
```
For instance, a [**flipr**](https://lmjl-alea.github.io/flipr/)-compatible version of the $t$-statistic with pooled variance will look like:
```{r, eval=FALSE}
stat_student <- function(data, indices1) {
n <- if (inherits(data, "dist"))
attr(data, "Size")
else if (inherits(data, "list"))
length(data)
else
stop("The `data` input should be of class either list or dist.")
indices2 <- seq_len(n)[-indices1]
x <- data[indices1]
y <- data[indices2]
# Here comes the code that computes the desired test
# statistic from input samples stored in lists x and y
x <- unlist(x)
y <- unlist(y)
stats::t.test(x, y, var.equal = TRUE)$statistic
}
```
### Usage
#### Naming conventions
Test statistics are passed to the functions `test2_global()` and `test2_local()` via the argument `stats` which accepts a character vector in which:
- statistics from [**nevada**](https://astamm.github.io/nevada/) expected to be named without the `stat_` prefix (e.g. `"original_edge_count"` or `"student_euclidean"`).
- statistics from [**flipr**](https://lmjl-alea.github.io/flipr/) are expected to be named without the `stat_` prefix but adding the `flipr:` prefix (e.g., `"flipr:student_ip"`).
- statistics from any other package **pkg** are expected to be named without the `stat_` prefix but adding the `pkg:` prefix.
```{r}
x <- nvd(sample_size = 10L, model = "gnp", n = 24L, p = 1/3)
y <- nvd(sample_size = 10L, model = "degree" , out_degree = rep(2, 24L), method = "configuration")
test2_global(
x = x,
y = y,
representation = "laplacian",
distance = "frobenius",
stats = c("flipr:student_ip", "flipr:fisher_ip"),
seed = 1234
)$pvalue
```
Note that you can also refer to test statistic function from [**nevada**](https://astamm.github.io/nevada/) using the naming `"nevada:original_edge_count"` as you would do for test statistics from [**flipr**](https://lmjl-alea.github.io/flipr/). This is mandatory for instance if you have not yet loaded [**nevada**](https://astamm.github.io/nevada/) in your environment via `library(nevada)`.
#### Using multiple test statistics
In permutation testing, the choice of a test statistic determines the alternative hypothesis, while the null hypothesis is always that the distributions that generated the observed samples are the same. This means that if you were to use the Student statistic `stat_student_ip()` for instance, then what you would be actually testing is whether the means of the distributions are different. If you'd rather be sensitive to differences in variances of the distributions, then you should go with the Fisher statistic `stat_fisher_ip()`.
You can also be sensitive to multiple aspects of a distribution when testing via the permutation framework. This is achieved under the hood by the [**flipr**](https://lmjl-alea.github.io/flipr/) package which implements the so-called *non-parametric combination* (NPC) approach proposed by @pesarin2010permutation when you provide more than one test statistics in the `stats` argument. You can read [this article](https://lmjl-alea.github.io/flipr/articles/alternative.html) to know more about its implementation in [**flipr**](https://lmjl-alea.github.io/flipr/). The bottom line is that, for example, you can choose both the Student and Fisher statistics to test simultaneously for differences in mean and in variance.
## References