# (PART\*) Software details {-}
# Excel file and dataset details {#dataset-object-details}
---
output:
  rmarkdown::pdf_document:
    fig_caption: yes
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval.after = "fig.cap"
)
library(RJafroc)
```
## Introduction
This chapter is included to document recent Excel file format changes and the new dataset structure.
## ROC dataset {#dataset-object-details-roc-dataset}
```{r echo=TRUE}
x <- DfReadDataFile("R/quick-start/rocCr.xlsx", newExcelFileFormat = TRUE)
```
### The structure of a factorial ROC dataset object {#dataset-object-details-structure-roc-dataset}
`x` is a `list` with `r length(x)` members: `ratings`, `lesions` and `descriptions`.
```{r}
str(x, max.level = 1)
```
The `x$ratings` member contains 3 sub-lists.
```{r}
str(x$ratings)
```
* `x$ratings$NL`, with dimension [2, 5, 8, 1], contains the ratings of normal cases. The first dimension (2) is the number of treatments, the second (5) is the number of readers and the third (8) is the total number of cases. For ROC datasets the fourth dimension is always unity. The five extra values ^[With only 3 non-diseased cases why does one need 8 values?] in the third dimension of `x$ratings$NL`, which are filled with `-Inf`, are needed for compatibility with FROC datasets.
* `x$ratings$LL`, with dimension [2, 5, 5, 1], contains the ratings of abnormal cases. The third dimension (5) corresponds to the 5 diseased cases.
* `x$ratings$LL_IL`, equal to `NA`, is there for compatibility with LROC data; `IL` denotes incorrect localizations.
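The array shapes described above can be mimicked with plain base R arrays. The following is a minimal sketch using made-up ratings; it is not the output of `DfReadDataFile()`, and the names `nTrt`, `nRdr`, `nNor`, `nAbn`, `toyNL` and `toyLL` are chosen here for illustration only.

```{r}
# Toy dimensions matching the example: 2 treatments, 5 readers,
# 3 non-diseased and 5 diseased cases (assumed values, not read from a file)
nTrt <- 2; nRdr <- 5; nNor <- 3; nAbn <- 5

# NL array: one slot per case (non-diseased + diseased), initialised to -Inf
toyNL <- array(-Inf, dim = c(nTrt, nRdr, nNor + nAbn, 1))
# In an ROC dataset only the first nNor case slots hold ratings
toyNL[, , 1:nNor, 1] <- sample(1:5, nTrt * nRdr * nNor, replace = TRUE)

# LL array: one slot per diseased case, one rating each
toyLL <- array(sample(1:5, nTrt * nRdr * nAbn, replace = TRUE),
               dim = c(nTrt, nRdr, nAbn, 1))

dim(toyNL)
dim(toyLL)
# The diseased-case slots of the NL array are all -Inf
all(toyNL[, , (nNor + 1):(nNor + nAbn), 1] == -Inf)
```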
The `x$lesions` member contains 3 sub-lists.
```{r}
str(x$lesions)
```
* The `x$lesions$perCase` member is a vector with 5 ones representing the 5 diseased cases in the dataset.
* The `x$lesions$IDs` member is an array with 5 ones.
```{r}
x$lesions$weights
```
The `x$lesions$weights` member is an array with 5 ones. These are irrelevant for ROC datasets; they are there for compatibility with FROC datasets.
`x$descriptions` contains 7 sub-lists.
```{r}
str(x$descriptions)
```
* `x$descriptions$fileName` is intended for internal use.
* `x$descriptions$type` indicates that this is an `ROC` dataset.
* `x$descriptions$name` is intended for internal use.
* `x$descriptions$truthTableStr` is intended for internal use, see Section \@ref(dataset-object-truth-table-str).
* `x$descriptions$design` specifies the dataset design, which is "FCTRL" in the present example ("FCTRL" = a factorial dataset).
* `x$descriptions$modalityID` is a vector with two elements `"0"` and `"1"`, the names of the two modalities.
* `x$descriptions$readerID` is a vector with five elements `"0"`, `"1"`, `"2"`, `"3"` and `"4"`, the names of the five readers.
### The `FP` worksheet {#dataset-object-details-read-datafile-correspondence-nl-fp}
![](images/quick-start/rocCrFp.png){width=100%}
* The list member `x$ratings$NL` is an array with `dim = c(2,5,8,1)`.
    + The first dimension (2) comes from the number of modalities.
    + The second dimension (5) comes from the number of readers.
    + The third dimension (8) comes from the **total** number of cases.
    + The fourth dimension is always 1 for an ROC dataset.
* The value of `x$ratings$NL[1,5,2,1]`, i.e., `r x$ratings$NL[1,5,2,1]`, corresponds to row 15 of the FP table, i.e., to `ModalityID` = 0, `ReaderID` = 4 and `CaseID` = 2.
* The value of `x$ratings$NL[2,3,2,1]`, i.e., `r x$ratings$NL[2,3,2,1]`, corresponds to row 24 of the FP table, i.e., to `ModalityID` = 1, `ReaderID` = 2 and `CaseID` = 2.
* All values for case index > 3 and case index <= 8 are `-Inf`. For example the value of `x$ratings$NL[2,3,4,1]` is `-Inf`. This is because there are only 3 non-diseased cases. The extra length is needed for compatibility with FROC datasets.
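The correspondence between worksheet rows and array elements can be illustrated with a toy long-format table. This is only a sketch: the data frame `fpToy`, its `Rating` column and its contents are hypothetical, and the code does not claim to reproduce how `DfReadDataFile()` fills the array internally.

```{r}
# Hypothetical long-format FP table: 2 modalities x 5 readers x 3 non-diseased cases
fpToy <- expand.grid(CaseID = 1:3, ReaderID = 0:4, ModalityID = 0:1)
fpToy$Rating <- seq_len(nrow(fpToy))  # made-up ratings, one per row

# 8 case slots in total (3 non-diseased + 5 diseased), initialised to -Inf
nlToy <- array(-Inf, dim = c(2, 5, 8, 1))

# IDs start at 0 and array indices at 1, hence the "+ 1"; here the
# non-diseased CaseIDs 1, 2, 3 coincide with case indices 1, 2, 3
idx <- cbind(fpToy$ModalityID + 1, fpToy$ReaderID + 1, fpToy$CaseID, 1)
nlToy[idx] <- fpToy$Rating

# Row of the toy table holding ModalityID = 0, ReaderID = 4, CaseID = 2
rowToy <- which(fpToy$ModalityID == 0 & fpToy$ReaderID == 4 & fpToy$CaseID == 2)
nlToy[1, 5, 2, 1] == fpToy$Rating[rowToy]

# Slots for case indices 4 to 8 were never written and remain -Inf
all(nlToy[, , 4:8, 1] == -Inf)
```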
### The `TP` worksheet {#dataset-object-details-read-datafile-correspondence-ll-tp}
![](images/quick-start/rocCrTp.png){width=100%}
* The list member `x$ratings$LL` is an array with `dim = c(2,5,5,1)`.
    + The first dimension (2) comes from the number of modalities.
    + The second dimension (5) comes from the number of readers.
    + The third dimension (5) comes from the number of diseased cases.
    + The fourth dimension is always 1 for an ROC dataset.
* The value of `x$ratings$LL[1,1,5,1]`, i.e., `r x$ratings$LL[1,1,5,1]`, corresponds to row 6 of the TP table, i.e., to `ModalityID` = 0, `ReaderID` = 0 and `CaseID` = 74.
* The value of `x$ratings$LL[1,2,2,1]`, i.e., `r x$ratings$LL[1,2,2,1]`, corresponds to row 8 of the TP table, i.e., to `ModalityID` = 0, `ReaderID` = 1 and `CaseID` = 71.
* The value of `x$ratings$LL[1,4,4,1]`, i.e., `r x$ratings$LL[1,4,4,1]`, corresponds to row 21 of the TP table, i.e., to `ModalityID` = 0, `ReaderID` = 3 and `CaseID` = 74.
* The value of `x$ratings$LL[1,5,2,1]`, i.e., `r x$ratings$LL[1,5,2,1]`, corresponds to row 23 of the TP table, i.e., to `ModalityID` = 0, `ReaderID` = 4 and `CaseID` = 71.
* There are no `-Inf` values in `x$ratings$LL`: `any(x$ratings$LL == -Inf)` = `r any(x$ratings$LL == -Inf)`. This is true for any ROC dataset.
### caseIndex vs. caseID {#dataset-object-details-read-datafile-correspondence-case-index-vs-case-id}
* The `caseIndex` is the array index used to access elements in the NL and LL arrays. The case-index is always an integer in the range 1, 2, ..., up to the array length. Remember that unlike C++, R indexing starts from 1.
* The `caseID` is any integer value, including zero, used to uniquely label the cases.
* Regardless of what order they occur in the worksheet, the non-diseased cases are always ordered first. In the current example the case indices are 1, 2 and 3, corresponding to the three non-diseased cases with `caseIDs` equal to 1, 2 and 3.
* Regardless of what order they occur in the worksheet, in the NL array the diseased cases are always ordered *after* the last non-diseased case. In the current example the case indices in the `NL` array are 4, 5, 6, 7 and 8, corresponding to the five diseased cases with `caseIDs` equal to 70, 71, 72, 73, and 74. In the `LL` array they are indexed 1, 2, 3, 4 and 5. Some examples follow:
* `x$ratings$NL[1,3,2,1]`, a FP rating, refers to `ModalityID` 0, `ReaderID` 2 and `CaseID` 2 (since the modality and reader IDs start with 0).
* `x$ratings$NL[2,5,4,1]`, a FP rating, refers to `ModalityID` 1, `ReaderID` 4 and `CaseID` 70, the first diseased case; this is `-Inf`.
* `x$ratings$NL[1,4,8,1]`, a FP rating, refers to `ModalityID` 0, `ReaderID` 3 and `CaseID` 74, the last diseased case; this is `-Inf`.
* `x$ratings$NL[1,3,9,1]`, a FP rating, is an illegal value, as the third index cannot exceed 8.
* `x$ratings$NL[1,3,8,2]`, a FP rating, is an illegal value, as the fourth index cannot exceed 1 for an ROC dataset.
* `x$ratings$LL[1,3,1,1]`, a TP rating, refers to `ModalityID` 0, `ReaderID` 2 and `CaseID` 70, the first diseased case.
* `x$ratings$LL[2,5,4,1]`, a TP rating, refers to `ModalityID` 1, `ReaderID` 4 and `CaseID` 73, the fourth diseased case.
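The mapping from `caseID` to `caseIndex` described above can be sketched with `match()`. The helper functions `nlCaseIndex()` and `llCaseIndex()` are hypothetical names introduced for this illustration; they are not part of `RJafroc`.

```{r}
# Case IDs of the example, as they might be declared in the Truth worksheet
normalIDs   <- c(1, 2, 3)             # non-diseased cases
abnormalIDs <- c(70, 71, 72, 73, 74)  # diseased cases

# Hypothetical helper: case index in the NL array (non-diseased cases first)
nlCaseIndex <- function(caseID) match(caseID, c(normalIDs, abnormalIDs))
# Hypothetical helper: case index in the LL array (diseased cases only)
llCaseIndex <- function(caseID) match(caseID, abnormalIDs)

nlCaseIndex(c(2, 70, 74))
llCaseIndex(c(70, 73))
llCaseIndex(2)  # NA: a non-diseased case has no slot in the LL array
```

Because `match()` works on the declared order of the IDs rather than on their numeric values, the same idea applies when the case IDs are arbitrary integers.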
## FROC dataset {#dataset-object-details-froc-dataset}
![](images/software-details/frocCrTruth.png){width=100%}
### The structure of a factorial FROC dataset {#dataset-object-details-structure-froc-dataset}
```{r}
x <- DfReadDataFile("images/software-details/frocCr.xlsx", newExcelFileFormat = TRUE)
```
The dataset `x` is a `list` variable with 3 members: `x$ratings`, `x$lesions` and `x$descriptions`.
```{r}
str(x, max.level = 1)
```
The `x$ratings` member contains 3 sub-lists.
```{r}
str(x$ratings)
```
* There are `K2 = 5` diseased cases (the length of the third dimension of `x$ratings$LL`) and `K1 = 3` non-diseased cases (the length of the third dimension of `x$ratings$NL` minus `K2`).
* `x$ratings$NL`, a [2, 3, 8, 2] array, contains the NL ratings on non-diseased and diseased cases.
* `x$ratings$LL`, a [2, 3, 5, 3] array, contains the ratings of LLs on diseased cases.
* `x$ratings$LL_IL` is `NA`; this field applies only to an LROC dataset, where it contains the incorrect localizations on diseased cases.
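The case counts can be read directly off the array dimensions. The sketch below uses placeholder arrays with the shapes quoted above; the variable names are illustrative, and reading the fourth dimension of the NL array as the maximum number of NL marks per case is an assumption not stated in this section.

```{r}
# Placeholder arrays with the FROC shapes quoted above (contents are irrelevant)
nlFroc <- array(-Inf, dim = c(2, 3, 8, 2))
llFroc <- array(-Inf, dim = c(2, 3, 5, 3))

nAll <- dim(nlFroc)[3]   # total number of cases
nDis <- dim(llFroc)[3]   # K2, number of diseased cases
nNon <- nAll - nDis      # K1, number of non-diseased cases
maxNL <- dim(nlFroc)[4]  # assumed: maximum number of NL marks per case
maxLL <- dim(llFroc)[4]  # maximum number of lesions per case
c(K1 = nNon, K2 = nDis, maxNL = maxNL, maxLL = maxLL)
```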
The `x$lesions` member contains 3 sub-lists.
```{r}
str(x$lesions)
```
* `x$lesions$perCase` is a vector containing the number of lesions in each diseased case, i.e., `r x$lesions$perCase`.
* `max(x$lesions$perCase)` is the maximum number of lesions per case, i.e., `r max(x$lesions$perCase)`.
* `x$lesions$weights` contains the weights of the lesions.
```{r}
x$lesions$weights
```
The weights for the first diseased case are 0.3 and 0.7. The weight for the second diseased case is 1. For the third diseased case the three weights are 1/3 each, etc. For each diseased case the finite weights sum to unity.
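The sum-to-unity property can be checked by summing only the finite entries in each row. The matrix below is a hand-made stand-in: the first three rows follow the weights quoted above, the last two rows are made-up values, and the use of `-Inf` as padding for non-existent lesions is an assumption mirroring the convention used for missing ratings.

```{r}
# Toy weights: one row per diseased case, one column per possible lesion
wToy <- rbind(
  c(0.3, 0.7,  -Inf),
  c(1,   -Inf, -Inf),
  c(1/3, 1/3,  1/3),
  c(0.1, 0.9,  -Inf),  # made-up values
  c(1,   -Inf, -Inf)   # made-up values
)

# Sum only the finite entries of each row
rowTotals <- apply(wToy, 1, function(w) sum(w[is.finite(w)]))
# all.equal() tolerates floating-point rounding in 1/3 + 1/3 + 1/3
isTRUE(all.equal(rowTotals, rep(1, nrow(wToy))))
```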
`x$descriptions` contains 7 sub-lists.
```{r}
str(x$descriptions)
```
* `x$descriptions$fileName` is for internal use.
* `x$descriptions$type` is `r x$descriptions$type`, which specifies the data collection method.
* `x$descriptions$name` is for internal use.
* `x$descriptions$truthTableStr` is for internal use; it quantifies the structure of the dataset; it is explained in the next section.
* `x$descriptions$design` is `r x$descriptions$design`; it specifies the study design.
* `x$descriptions$modalityID` is a vector with two elements `r x$descriptions$modalityID` naming the two modalities.
* `x$descriptions$readerID` is a vector with three elements `r x$descriptions$readerID` naming the three readers.
### `truthTableStr` {#dataset-object-truth-table-str}
* For this dataset `I` = 2, `J` = 3 and `K` = 8.
* `truthTableStr` is a `2 x 3 x 8 x 4` array, i.e., `I` x `J` x `K` x (maximum number of lesions per case plus 1); the `plus 1` is needed to accommodate non-diseased cases.
* Each entry in this array is either `1`, meaning the corresponding interpretation happened, or `NA`, meaning the corresponding interpretation did not happen.
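The layout of such an array can be sketched in base R for a fully crossed design, in which every reader interprets every case in every modality. This is an illustration under assumed inputs, not the construction used inside `RJafroc`; the names `nMod`, `nRead`, `nNormal`, `lesionsPerCase` and `tts` are hypothetical, and the lesion counts are assumed values.

```{r}
# Assumed inputs: 2 modalities, 3 readers, 3 non-diseased cases and
# lesion counts for the 5 diseased cases
nMod <- 2; nRead <- 3; nNormal <- 3
lesionsPerCase <- c(2, 1, 3, 2, 1)
nCases <- nNormal + length(lesionsPerCase)

# Fourth dimension: slot 1 for non-diseased cases, slots 2, 3, ... for lesions 1, 2, ...
tts <- array(NA, dim = c(nMod, nRead, nCases, max(lesionsPerCase) + 1))
tts[, , 1:nNormal, 1] <- 1
for (k in seq_along(lesionsPerCase)) {
  # lesion l of diseased case k occupies slot l + 1
  tts[, , nNormal + k, 1 + seq_len(lesionsPerCase[k])] <- 1
}

dim(tts)
all(tts[, , 1:3, 1] == 1)
all(is.na(tts[, , 4:8, 1]))
```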
#### Explanation for non-diseased cases
Since the fourth index is set to 1, in the following code only non-diseased cases yield ones and all diseased cases yield `NA`.
```{r}
all(x$descriptions$truthTableStr[,,1:3,1] == 1)
all(is.na(x$descriptions$truthTableStr[,,4:8,1]))
```
#### Explanation for diseased cases with one lesion
Since the fourth index is set to 2, in the following code all non-diseased cases yield `NA` and all diseased cases yield 1 as all diseased cases have at least one lesion.
```{r}
all(is.na(x$descriptions$truthTableStr[,,1:3,2]))
all(x$descriptions$truthTableStr[,,4:8,2] == 1)
```
#### Explanation for diseased cases with two lesions
Since the fourth index is set to 3, in the following code all non-diseased cases yield `NA`; the first diseased case `70` yields 1 (this case contains two lesions); the second diseased case `71` yields `NA` (this case contains only one lesion); the third diseased case `72` yields 1 (this case contains three lesions); the fourth diseased case `73` yields 1 (this case contains two lesions); the fifth diseased case `74` yields `NA` (this case contains one lesion).
```{r}
# all non diseased cases
all(is.na(x$descriptions$truthTableStr[,,1:3,3]))
# first diseased case
all(x$descriptions$truthTableStr[,,4,3] == 1)
# second diseased case
all(is.na(x$descriptions$truthTableStr[,,5,3]))
# third diseased case
all(x$descriptions$truthTableStr[,,6,3] == 1)
# fourth diseased case
all(x$descriptions$truthTableStr[,,7,3] == 1)
# fifth diseased case
all(is.na(x$descriptions$truthTableStr[,,8,3]))
```
#### Explanation for diseased cases with three lesions
Since the fourth index is set to 4, in the following code all non-diseased cases yield `NA`; the first diseased case `70` yields `NA` (this case contains two lesions); the second diseased case `71` yields `NA` (this case contains one lesion); the third diseased case `72` yields 1 (this case contains three lesions); the fourth diseased case `73` yields `NA` (this case contains two lesions); the fifth diseased case `74` yields `NA` (this case contains one lesion).
```{r}
# all non diseased cases
all(is.na(x$descriptions$truthTableStr[,,1:3,4]))
# first diseased case
all(is.na(x$descriptions$truthTableStr[,,4,4]))
# second diseased case
all(is.na(x$descriptions$truthTableStr[,,5,4]))
# third diseased case
all(x$descriptions$truthTableStr[,,6,4] == 1)
# fourth diseased case
all(is.na(x$descriptions$truthTableStr[,,7,4]))
# fifth diseased case
all(is.na(x$descriptions$truthTableStr[,,8,4]))
```
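The checks above amount to counting, for each diseased case, how many lesion slots are not `NA`. The sketch below rebuilds a toy array with the same layout and recovers the lesion counts from it; the array `ttsToy` and the counts in `perCaseToy` are assumptions made for illustration and are not read from the dataset.

```{r}
# Toy array with the same 2 x 3 x 8 x 4 layout (assumed lesion counts)
perCaseToy <- c(2, 1, 3, 2, 1)
ttsToy <- array(NA, dim = c(2, 3, 8, 4))
ttsToy[, , 1:3, 1] <- 1
for (k in 1:5) ttsToy[, , 3 + k, 1 + seq_len(perCaseToy[k])] <- 1

# For modality 1 and reader 1, count the non-NA lesion slots (2 to 4)
# of each diseased case (case indices 4 to 8)
recovered <- apply(ttsToy[1, 1, 4:8, 2:4], 1, function(s) sum(!is.na(s)))
recovered
identical(as.numeric(recovered), perCaseToy)
```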
### The FP worksheet
The NL marks are recorded in the `FP` (or `NL`) worksheet:
![](images/software-details/frocCrFp.png){width=100%}
* The common vertical length is 22 in this example.
* `ReaderID`: the reader labels: `0`, `1`, `2`, as declared in the `Truth` worksheet.
* `ModalityID`: the modality labels: `0` or `1`, as declared in the `Truth` worksheet.
* `CaseID`: `1`, `2`, `3`, `71`, `72`, `73`, `74`, as declared in the `Truth` worksheet; note that not all cases have NL marks on them.
* `NL_Rating`: the ratings of the NL marks, which can occur on non-diseased and on diseased cases.
### The TP worksheet
The LL marks are recorded in the `TP` (or `LL`) worksheet, shown below.
![](images/software-details/frocCrTp.png){width=100%}
* This worksheet has the ratings of LLs on diseased cases.
* `ReaderID`: the reader labels: these must be from `0`, `1`, `2`, as declared in the `Truth` worksheet.
* `ModalityID`: `0` or `1`, as declared in the `Truth` worksheet.
* `CaseID`: these must be from `70`, `71`, `72`, `73`, `74`, as declared in the `Truth` worksheet; not all diseased cases have LL marks.
* `LL_Rating`: the ratings of the LL marks.
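The requirement that worksheet IDs must come from those declared in the `Truth` worksheet can be checked with `%in%`. The following is a sketch on a hand-made table: the data frame `tpToy`, its ratings and the list `declared` are hypothetical, and the check is not the validation performed by `DfReadDataFile()`.

```{r}
# IDs as they might be declared in the Truth worksheet (taken from the example)
declared <- list(ReaderID = 0:2, ModalityID = 0:1, CaseID = 70:74)

# Hypothetical TP worksheet with one deliberately invalid row (CaseID 3)
tpToy <- data.frame(
  ReaderID   = c(0, 1, 2, 1),
  ModalityID = c(0, 0, 1, 1),
  CaseID     = c(70, 72, 74, 3),
  LL_Rating  = c(5.2, 3.1, 4.4, 2.0)
)

# Flag rows whose IDs were not declared
bad <- !(tpToy$ReaderID %in% declared$ReaderID) |
  !(tpToy$ModalityID %in% declared$ModalityID) |
  !(tpToy$CaseID %in% declared$CaseID)
tpToy[bad, ]
```

Only the last row is flagged, because `CaseID` 3 is a non-diseased case and therefore cannot carry an LL mark.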