22-dataset-structures.Rmd

# (PART\*) Software details {-}

# Excel file and dataset details {#dataset-object-details}

---
output:
  rmarkdown::pdf_document:
    fig_caption: yes        
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval.after = "fig.cap"
)

library(RJafroc)
```


## Introduction

This chapter is included to document recent Excel file format changes and the new dataset structure.


## ROC dataset {#dataset-object-details-roc-dataset}

```{r echo=TRUE}
x <- DfReadDataFile("R/quick-start/rocCr.xlsx", newExcelFileFormat = TRUE)
```             

### The structure of a factorial ROC dataset object  {#dataset-object-details-structure-roc-dataset}


`x` is a `list` with `r length(x)` members: `ratings`, `lesions` and `descriptions`.


```{r}
str(x, max.level = 1)
```             


The `x$ratings` member contains 3 sub-lists.

```{r}
str(x$ratings)
```             

* `x$ratings$NL`, with dimension [2, 5, 8, 1], contains the ratings of normal cases. The first dimension (2) is the number of treatments, the second (5) is the number of readers and the third (8) is the total number of cases. For ROC datasets the fourth dimension is always unity. The five extra values ^[With only 3 non-diseased cases why does one need 8 values?] in the third dimension, of `x$ratings$NL` which are filled with `NAs`, are needed for compatibility with FROC datasets.

* `x$ratings$LL`, with dimension [2, 5, 5, 1], contains the ratings of abnormal cases. The third dimension (5) corresponds to the 5 diseased cases.

* `x$ratings$LL_IL`, equal to NA', is there for compatibility with LROC data, `IL`  denotes incorrect-localizations.


The `x$lesions` member contains 3 sub-lists.


```{r}
str(x$lesions)
```             


* The `x$lesions$perCase` member is a vector with 5 ones representing the 5 diseased cases in the dataset. 

* The `x$lesions$IDs` member is an array with 5 ones.


```{r}
x$lesions$weights
```             

`x$lesions$weights` member is an array with 5 ones. These are irrelevant for ROC datasets. They are there for compatibility with FROC datasets.


`x$descriptions` contains 7 sub-lists.

```{r}
str(x$descriptions)
```             


* `x$descriptions$fileName` is intended for internal use. 
* `x$descriptions$type` indicates that this is an `ROC` dataset. 
* `x$descriptions$name` is intended for internal use. 
* `x$descriptions$truthTableStr` is intended for internal use, see Section \@ref(dataset-object-truth-table-str).
* `x$descriptions$design` specifies the dataset design, which is "FCTRL" in the present example ("FCTRL" = a factorial dataset).
* `x$descriptions$modalityID` is a vector with two elements `"0"` and `"1"`, the names of the two modalities. 
* `x$readerID` is a vector with five elements  `"0"`, `"1"`, `"2"`, `"3"` and `"4"`, the names of the five readers. 


### The `FP` worksheet {#dataset-object-details-read-datafile-correspondence-nl-fp}

![](images/quick-start/rocCrFp.png){width=100%}

* The list member `x$ratings$NL` is an array with `dim = c(2,5,8,1)`. 
    + The first dimension (2) comes from the number of modalities. 
    + The second dimension (5) comes from the number of readers. 
    + The third dimension (8) comes from the **total** number of cases. 
    + The fourth dimension is always 1 for an ROC dataset. 
* The value of `x$ratings$NL[1,5,2,1]`, i.e., `r x$ratings$NL[1,5,2,1]`, corresponds to row 15 of the FP table, i.e., to `ModalityID` = 0, `ReaderID` = 4 and `CaseID` = 2.
* The value of `x$ratings$NL[2,3,2,1]`, i.e., `r x$ratings$NL[2,3,2,1]`, corresponds to row 24 of the FP table, i.e., to `ModalityID` 1, `ReaderID` 2 and `CaseID` 2.
* All values for case index > 3 and case index <= 8 are `-Inf`. For example the value of `x$ratings$NL[2,3,4,1]` is `-Inf`. This is because there are only 3 non-diseased cases. The extra length is needed for compatibility with FROC datasets.


### The `TP` worksheet {#dataset-object-details-read-datafile-correspondence-ll-tp}

![](images/quick-start/rocCrTp.png){width=100%}

* The list member `x$ratings$LL` is an array with `dim = c(2,5,5,1)`. 
    + The first dimension (2) comes from the number of modalities. 
    + The second dimension (5) comes from the number of readers. 
    + The third dimension (5) comes from the number of diseased cases. 
    + The fourth dimension is always 1 for an ROC dataset. 
* The value of `x$ratings$LL[1,1,5,1]`, i.e., `r x$ratings$LL[1,1,5,1]`, corresponds to row 6 of the TP table, i.e., to `ModalityID` = 0, `ReaderID` = 0 and `CaseID` = 74.
* The value of `x$ratings$LL[1,2,2,1]`, i.e., `r x$ratings$LL[1,2,2,1]`, corresponds to row 8 of the TP table, i.e., to `ModalityID` = 0, `ReaderID` = 1 and `CaseID` = 71.
* The value of `x$ratings$LL[1,4,4,1]`, i.e., `r x$ratings$LL[1,4,4,1]`, corresponds to row 21 of the TP table, i.e., to `ModalityID` = 0, `ReaderID` = 3 and `CaseID` = 74.
* The value of `x$ratings$LL[1,5,2,1]`, i.e., `r x$ratings$LL[1,5,2,1]`, corresponds to row 23 of the TP table, i.e., to `ModalityID` = 0, `ReaderID` = 4 and `CaseID` = 71.
* There are no `-Inf` values in `x$ratings$LL`: `any(x$ratings$LL == -Inf)` = `r any(x$ratings$LL == -Inf)`. This is true for any ROC dataset.


### caseIndex vs. caseID {#dataset-object-details-read-datafile-correspondence-case-index-vs-case-id}

* The `caseIndex` is the array index used to access elements in the NL and LL arrays. The case-index is always an integer in the range 1, 2, ..., up to the array length. Remember that unlike C++, R indexing starts from 1.
* The `caseID` is any integer value, including zero, used to uniquely label the cases.
* Regardless of what order they occur in the worksheet, the non-diseased cases are always ordered first. In the current example the case indices are 1, 2 and 3, corresponding to the three non-diseased cases with `caseIDs` equal to 1, 2 and 3.
* Regardless of what order they occur in the worksheet, in the NL array the diseased cases are always ordered *after* the last non-diseased case. In the current example the case indices in the `NL` array are 4, 5, 6, 7 and 8, corresponding to the five diseased cases with `caseIDs` equal to 70, 71, 72, 73, and 74. In the `LL` array they are indexed 1, 2, 3, 4 and 5. Some examples follow:
* `x$ratings$NL[1,3,2,1]`, a FP rating, refers to `ModalityID` 0, `ReaderID` 2 and `CaseID` 2 (since the modality and reader IDs start with 0).
* `x$ratings$NL[2,5,4,1]`, a FP rating, refers to `ModalityID` 1, `ReaderID` 4 and `CaseID` 70, the first diseased case; this is `-Inf`.
* `x$ratings$NL[1,4,8,1]`, a FP rating, refers to `ModalityID` 0, `ReaderID` 3 and `CaseID` 74, the last diseased case; this is `-Inf`.
* `x$ratings$NL[1,3,9,1]`, a FP rating, is an illegal value, as the third index cannot exceed 8.
* `x$ratings$NL[1,3,8,2]`, a FP rating, is an illegal value, as the fourth index cannot exceed 1 for an ROC dataset.
* `x$ratings$LL[1,3,1,1]`, a TP rating, refers to `ModalityID` 0, `ReaderID` 2 and `CaseID` 70, the first diseased case.
* `x$ratings$LL[2,5,4,1]`, a TP rating, refers to `ModalityID` 1, `ReaderID` 4 and `CaseID` 73, the fourth diseased case.

## FROC dataset {#dataset-object-details-froc-dataset}

![](images/software-details/frocCrTruth.png){width=100%}

### The structure of a factorial FROC dataset {#dataset-object-details-structure-froc-dataset}


```{r}
x <- DfReadDataFile("images/software-details/frocCr.xlsx", newExcelFileFormat = TRUE)
```             


The dataset `x` is a `list` variable with 3 members: `x$ratings`, `x$lesions` and `x$descriptions`.


```{r}
str(x, max.level = 1)
```             


The `x$ratings` member contains 3 sub-lists.


```{r}
str(x$ratings)
```             


* There are `K2 = 5` diseased cases (the length of the third dimension of `x$ratings$LL`) and `K1 = 3` non-diseased cases (the length of the third dimension of `x$ratings$NL` minus `K2`). 
* `x$ratings$NL`, a [2, 3, 8, 2] array, contains the NL ratings on non-diseased and diseased cases. 
* `x$ratings$LL`, a [2, 3, 5, 3] array, contains the ratings of LLs on diseased cases.
* `x$ratings$LL_IL` is `NA`, this field applies to an LROC dataset (contains incorrect localizations on diseased cases).


The `x$lesions` member contains 3 sub-lists.


```{r}
str(x$lesions)
```             


* `x$lesions$perCase` is the number of lesions per diseased case vector, i.e., `r x$lesions$perCase`.
* `max(x$lesions$perCase)` is the maximum number of lesions per case, i.e., `r `max(x$lesions$perCase)`.
* `x$lesions$weights` is the weights of lesions.


```{r}
x$lesions$weights
```             


The weights for the first diseased case are 0.3 and 0.7. The weight for the second diseased case is 1. For the third diseased case the three weights are 1/3 each, etc. For each diseased case the finite weights sum to unity. 


`x$descriptions` contains 7 sub-lists.


```{r}
str(x$descriptions)
```             


* `x$descriptions$filename` is for internal use. 
* `x$descriptions$type` is `r x$descriptions$type`, which specifies the data collection method. 
* `x$descriptions$name` is for internal use. 
* `x$descriptions$truthTableStr` is for internal use; it quantifies the structure of the dataset; it is explained in the next section.
* `x$descriptions$design` is `r x$descriptions$design`; it specifies the study design. 
* `x$descriptions$modalityID` is a vector with two elements `r x$descriptions$modalityID` naming the two modalities. 
* `x$readerID` is a vector with three elements `r x$descriptions$readerID` naming the three readers. 


### `truthTableStr` {#dataset-object-truth-table-str}

* For this dataset `I` = 2, `J` = 3 and `K` = 8.
* `truthTableStr` is a `2 x 3 x 8 x 4` array, i.e., `I` x `J` x `K` x (maximum number of lesions per case plus 1 - the `plus 1 ` is needed to accommodate non-diseased cases). 
* Each entry in this array is either `1`, meaning the corresponding interpretation happened, or `NA`, meaning the corresponding interpretation did not happen. 

#### Explanation for non-diseased cases

Since the fourth index is set to 1, in the following code only non-diseased cases yield ones and all diseased cases yield `NA`. 

```{r}
all(x$descriptions$truthTableStr[,,1:3,1] ==1)
all(is.na(x$descriptions$truthTableStr[,,4:8,1]))
```             


#### Explanation for diseased cases with one lesion

Since the fourth index is set to 2, in the following code all non-diseased cases yield `NA` and all diseased cases yield 1 as all diseased cases have at least one lesion. 

```{r}
all(is.na(x$descriptions$truthTableStr[,,1:3,2]))
all(x$descriptions$truthTableStr[,,4:8,2] == 1)
```             


#### Explanation for diseased cases with two lesions

Since the fourth index is set to 3, in the following code all non-diseased cases yield `NA`; the first diseased case `70` yields 1 (this case contains two lesions); the second disease case `71` yields `NA` (this case contains only one lesion); the third disease case `72` yields `NA` (this case contains only two lesions); the fourth disease case `73` yields 1 (this case contains two lesions); the fifth disease case `74` yields `NA` (this case contains one lesion). 

```{r}
# all non diseased cases
all(is.na(x$descriptions$truthTableStr[,,1:3,3]))
# first diseased case
all(x$descriptions$truthTableStr[,,4,3] == 1)
# second diseased case
all(is.na(x$descriptions$truthTableStr[,,5,3]))
# third diseased case
all(x$descriptions$truthTableStr[,,6,3] == 1)
# fourth diseased case
all(x$descriptions$truthTableStr[,,7,3] == 1)
# fifth diseased case
all(is.na(x$descriptions$truthTableStr[,,8,3]))
```             


#### Explanation for diseased cases with three lesions

Since the fourth index is set to 4, in the following code all non-diseased cases yield `NA`; the first diseased case `70` yields `NA` (this case contains two lesions); the second disease case `71` yields `NA` (this case contains one lesion); the third disease case `72` yields `NA` (this case contains two lesions); the fourth disease case `73` yields 1 (this case contains three lesions); the fifth disease case `74` yields `NA` (this case contains one lesion). 

```{r}
# all non diseased cases
all(is.na(x$descriptions$truthTableStr[,,1:3,4]))
# first diseased case
all(is.na(x$descriptions$truthTableStr[,,4,4]))
# second diseased case
all(is.na(x$descriptions$truthTableStr[,,5,4]))
# third diseased case
all(x$descriptions$truthTableStr[,,6,4] == 1)
# fourth diseased case
all(is.na(x$descriptions$truthTableStr[,,7,4]))
# fifth diseased case
all(is.na(x$descriptions$truthTableStr[,,8,4]))
```             


### The FP worksheet

These are found in the `FP` or `NL` worksheet:

![](images/software-details/frocCrFp.png){width=100%}

* The common vertical length is 22 in this example. 
* `ReaderID`: the reader labels: `0`, 1`, `2`, as declared in the `Truth` worksheet. 
* `ModalityID`: the modality labels: `0` or `1`, as declared in the `Truth` worksheet. 
* `CaseID`: `1`, `2`, `3`, `71`, `72`, `73`, `74`, as declared in the `Truth` worksheet; note that not all cases have NL marks on them.  
* `NL_Rating`: the ratings of non-diseased cases.


### The TP worksheet

These are found in the `TP` or `LL` worksheet, see below.

![](images/software-details/frocCrTp.png){width=100%}

* This worksheet has the ratings of diseased cases. 
* `ReaderID`: the reader labels: these must be from `0`, `1`, `2`, as declared in the `Truth` worksheet. 
* `ModalityID`: `0` or `1`, as declared in the `Truth` worksheet. 
* `CaseID`: these must be from `70`, `71`, `72`, `73`, `74`, as declared in the `Truth` worksheet; not all diseased cases have LL marks.   
* `LL_Rating`: the ratings of diseased cases.