---
knit: bookdown::preview_chapter
---
# PISA: Programme for International Student Assessment
<!--
description of data and why is this interesting
processing the data
examining selected features, including gender, tvs, books by country using dotplots and maps
within country
how you can tell the data is synthetic
-->
Every three years, the Programme for International Student Assessment (PISA) surveys educational systems across the globe by testing 15-year-olds on math, science and reading. It measures their readiness to meet life challenges and their preparedness for the workforce. As of 2020, according to the PISA web site, more than 90 countries and 3 million students have been involved.
Students, schools and parents are also asked to complete extensive questionnaires. Schools provide information about their resources, staff qualifications and staffing levels. Parents are asked about the household environment and student support. Students are asked about their interest in different subjects and their friend networks.
The data is used by individual countries to inform education policy and improve learning. The news media take each release as an opportunity to compare their country against others, and to raise concerns about girls and math. Here's a sample of headlines after the 2018 data release:
*Vital Signs: Australia's slipping student scores will lead to greater income inequality* [Richard Holden, The Conversation](https://theconversation.com/vital-signs-australias-slipping-student-scores-will-lead-to-greater-income-inequality-128301)
*In China, Nicholas studied maths 20 hours a week. In Australia, it's three* [Michael Fowler and Adam Carey, Sydney Morning Herald](https://www.smh.com.au/education/in-china-nicholas-studied-maths-20-hours-a-week-in-australia-it-s-three-20191203-p53ggv.html)
*New Zealand top-end in OECD's latest PISA report but drop in achievements 'worrying'* [Jessica Long and Mandy Te, Stuff](https://www.stuff.co.nz/national/education/117890945/new-zealand-topend-in-oecds-latest-pisa-report-but-drop-in-achievements-worrying)
*Not even mediocre? Indonesian students score low in math, reading, science: PISA report* [Karina M. Tehusijarana, The Jakarta Post](https://www.thejakartapost.com/news/2019/12/04/not-even-mediocre-indonesian-students-score-low-in-math-reading-science-pisa-report.html)
*A significant gender gap in maths performance in favour of male students has returned, despite closing in 2015* [Natassia Chrysanthos, Sydney Morning Herald](https://www.smh.com.au/national/nsw/urgent-need-to-address-maths-performance-as-nsw-slumps-in-international-test-20191203-p53ge2.html)
## Data access
The data, from 2000 through to the most recent survey, can be accessed at [http://www.oecd.org/pisa/data/](http://www.oecd.org/pisa/data/). The file format differs from year to year, but in recent years the data is provided in proprietary binary formats (SPSS and SAS). The variables collected differ slightly from year to year, although there is a core set of variables that are always included and coded identically.
There are multiple files associated with each survey, containing the student questionnaire responses and test scores, school and parent questionnaire responses, two cognitive item responses, and associated data dictionaries. In the following analysis we focus on the 2015 student data. It is approximately 580Mb and contains 615 attributes on more than 270,000 students.
## Data pre-processing
Read the data directly from the web site.
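```{r createdb, eval=FALSE}
# From http://www.oecd.org/pisa/data/2015database/ download
# the SPSS format zip file `Student questionnaire data file (419MB)'
library(haven)
# read_sav() reads SPSS files, so the URL should point at the SPSS zip
# (assumed here to be the SPSS counterpart of the SAS file name)
pisa_2015 <- read_sav("https://webfs.oecd.org/pisa/PUF_SPSS_COMBINED_CMB_STU_QQQ.zip")
```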
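Create a local SQLite database so that summaries and subsets can be computed quickly.
```{r make_sqlite, eval=FALSE}
library(tidyverse)
library(dbplyr)
library(sqldf)  # attaches RSQLite, which provides SQLite()
library(DBI)
# Write the full student table to a local SQLite file
db <- dbConnect(SQLite(), dbname="data/PISA.sqlite")
dbWriteTable(conn=db, name="student", value=pisa_2015)
# List the variables that are now available in the table
dbListFields(db, "student")
dbDisconnect(db)
```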
Use the database to extract a subset of variables.
```{r make_counts, eval=FALSE}
db <- dbConnect(SQLite(), dbname="data/PISA.sqlite")
tb <- tbl(db, "student")
# Country, gender, first plausible value for each test, and survey weight
scores <- tb %>%
  select(CNT, ST004D01T, PV1MATH, PV1READ, PV1SCIE, SENWT) %>%
  collect()
scores <- scores %>%
  rename(gender=ST004D01T,
         math=PV1MATH, reading=PV1READ, science=PV1SCIE,
         w=SENWT)
save(scores, file="data/pisa_scores.rda")
dbDisconnect(db)
```
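
The write-then-query pattern can be tried on a small scale without touching the PISA files. The following is a minimal sketch, assuming the `DBI` and `RSQLite` packages are installed; it uses an in-memory database and a made-up table (`toy`, with a hypothetical `OTHER` column) rather than the real student data.

```r
library(DBI)

# In-memory database, so nothing is written to disk (toy data, not PISA)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(CNT = c("AUS", "AUS", "NZL"),
                  PV1MATH = c(500, 520, 495),
                  OTHER = c(1, 2, 3))
dbWriteTable(con, "student", toy)

# Only the requested columns are pulled into R
sub <- dbGetQuery(con, "SELECT CNT, PV1MATH FROM student")
dbDisconnect(con)
```

With several hundred columns in the real table, selecting inside the database avoids loading variables that the analysis never uses.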
## Examining the gender gap across countries
The gender gap in math is widely discussed, the concern being that girls tend to score lower than boys *on average* in standardized math tests. The PISA data provides an opportunity to explore the gender gap across numerous countries.
### Clean up country codes
Each country is coded using three letters, which mirror the international standard ISO codes, except for a few that are unique to this data. In order to join the data with country names or map data, these need to be recoded or the records dropped.
```{r ISO_recode}
library(tidyverse)
library(ISOcodes)
data("ISO_3166_1")
# Load data
load("data/pisa_scores.rda")
# The country information will be used to join the data with map data
# and the ISOcodes package provides information about codes and country
scores <- scores %>%
  mutate(CNT=recode(CNT, "QES"="ESP", "QCH"="CHN", "QAR"="ARG", "TAP"="TWN")) %>%
  filter(CNT != "QUC") %>%
  filter(CNT != "QUD") %>%
  filter(CNT != "QUE") %>%
  mutate(gender=factor(gender, levels=c(1,2), labels=c("female","male")))
```
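
One way to find which codes need attention is to compare the survey codes against the ISO list. The following is a minimal sketch on made-up vectors; `pisa_codes`, `iso_codes` and `lookup` are illustrative names standing in for `unique(scores$CNT)`, `ISO_3166_1$Alpha_3` and the recoding rules, and are not objects from the analysis.

```r
# Toy vectors standing in for the survey codes and the ISO codes
pisa_codes <- c("AUS", "QES", "TAP", "NZL", "QUC")
iso_codes  <- c("AUS", "ESP", "NZL", "TWN")

# Codes present in the survey but absent from the ISO list
unmatched <- setdiff(pisa_codes, iso_codes)

# Recode with a named lookup vector; codes without an entry are kept as-is
lookup  <- c(QES = "ESP", TAP = "TWN")
recoded <- ifelse(pisa_codes %in% names(lookup), lookup[pisa_codes], pisa_codes)
```

Any code still unmatched after recoding (here `"QUC"`) is a candidate for dropping.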
### Compute weighted means by country and gender
Each observation in the student records has an associated survey weight. The weight reflects how well the student's demographic group is represented in the sample relative to the population. Using the weights to compute a weighted average produces an estimate that better reflects the population mean. The math gap is measured by subtracting the girls' average from the boys' average.
```{r gender_means}
# Weighted mean of each test by country and gender,
# then the gap (male minus female) for each test
score_gap <- scores %>%
  group_by(CNT, gender) %>%
  summarise(math=weighted.mean(math, w=w, na.rm=TRUE),
            reading=weighted.mean(reading, w=w, na.rm=TRUE),
            science=weighted.mean(science, w=w, na.rm=TRUE)) %>%
  pivot_longer(cols=math:science, names_to="test", values_to="score") %>%
  pivot_wider(names_from=gender, values_from=score) %>%
  mutate(gap = male - female) %>%
  pivot_wider(id_cols=CNT, names_from=test, values_from=gap)
```
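
To see why the weights matter, here is a small worked sketch with made-up numbers (not PISA values). One of three sampled students represents many more students in the population than the other two, so the weighted mean is pulled toward that student's score.

```r
# Toy data (made up): three sampled students and their survey weights
toy_math <- c(400, 500, 600)
toy_w    <- c(1, 1, 8)

unweighted <- mean(toy_math)
weighted   <- weighted.mean(toy_math, w = toy_w)

# weighted.mean() computes sum(w * x) / sum(w)
by_hand <- sum(toy_w * toy_math) / sum(toy_w)
```

The unweighted mean is 500, while the weighted mean is (400 + 500 + 8 × 600) / 10 = 570.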
Confidence intervals for the population mean difference can be constructed using the bootstrap. The 90% confidence intervals are computed below, and these are joined with the mean difference for each country.
```{r}
library(boot)
# Statistic for boot(): weighted male minus female math mean
# computed on the resampled rows i of d
cifn <- function(d, i) {
  x <- d[i,]
  ci <- weighted.mean(x$math[x$gender=="male"],
                      w=x$w[x$gender=="male"], na.rm=TRUE)-
    weighted.mean(x$math[x$gender=="female"],
                  w=x$w[x$gender=="female"], na.rm=TRUE)
  ci
}
# Percentile interval: 5th and 95th of 100 sorted bootstrap values
bootfn <- function(d) {
  r <- boot(d, statistic=cifn, R=100)
  l <- sort(r$t)[5]
  u <- sort(r$t)[95]
  ci <- c(l, u)
  return(ci)
}
# One interval per country: row 1 holds lower bounds, row 2 upper bounds
score_gap_boot <- scores %>%
  split(.$CNT) %>% purrr::map(bootfn) %>% as_tibble() %>%
  pivot_longer(cols=everything(), names_to="CNT", values_to="value") %>%
  arrange(CNT) %>%
  mutate(bound=rep(c("ml","mu"), length(unique(scores$CNT)))) %>%
  pivot_wider(names_from = bound, values_from = value)
score_gap <- score_gap %>%
  left_join(score_gap_boot, by="CNT")
```
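
The idea behind the percentile bootstrap can be sketched without the `boot` package. The example below uses base R `sample()` and `quantile()` on simulated scores for one hypothetical country; the data, the seed and the names `toy`, `gap` and `boot_gaps` are all assumptions made for illustration.

```r
set.seed(2015)  # arbitrary seed, only so the sketch is reproducible

# Made-up scores and weights for one hypothetical country
toy <- data.frame(
  gender = rep(c("female", "male"), each = 50),
  math   = c(rnorm(50, mean = 500, sd = 90), rnorm(50, mean = 510, sd = 90)),
  w      = runif(100, min = 0.5, max = 1.5)
)

# Weighted male minus female mean for one data set
gap <- function(d) {
  m <- d$gender == "male"
  weighted.mean(d$math[m], w = d$w[m]) - weighted.mean(d$math[!m], w = d$w[!m])
}

# Resample rows with replacement and recompute the gap each time
boot_gaps <- replicate(1000, gap(toy[sample(nrow(toy), replace = TRUE), ]))

# Percentile interval: the middle 90% of the bootstrap distribution
ci <- quantile(boot_gaps, probs = c(0.05, 0.95))
```

If the interval contains 0, the data does not give clear evidence of a gap in either direction for that country.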
### Examine the math gap by country
To add country names to the plot, we join the data with a standard ISO code database. Some country names are shortened or tidied for display. The mean differences are plotted by country, with countries sorted from largest difference to smallest.
```{r}
score_gap <- score_gap %>%
  left_join(ISO_3166_1[,c("Alpha_3", "Name")], by=c("CNT"="Alpha_3")) %>%
  rename(name = Name)
score_gap$name[score_gap$CNT == "KSV"] <- "Kosovo"
score_gap <- score_gap %>%
  mutate(name = recode(name, "Czechia"="Czech Republic",
                       "Korea, Republic of"="South Korea",
                       "Macedonia, Republic of"="Macedonia",
                       "Moldova, Republic of"="Moldova",
                       "Russian Federation"="Russia",
                       "Taiwan, Province of China"="Taiwan",
                       "Trinidad and Tobago"="Trinidad",
                       "United States"="USA",
                       "United Kingdom"="UK",
                       "Viet Nam"="Vietnam"))
```
```{r gap_dots, fig.height=8, fig.width=6, out.width="100%"}
library(forcats)
ggplot(data=score_gap, aes(x=fct_reorder(name, math), y=math)) +
  geom_hline(yintercept=0, colour="red") +
  geom_point() +
  geom_errorbar(aes(ymin=ml, ymax=mu), width=0) +
  coord_flip() +
  xlab("") + ylab("Gender gap") + ylim(c(-35, 35))
```
### What we learn
The math gap is not universal, and in some countries girls score higher than boys on average. There are also many countries where there is no significant difference between the gender averages.
## Your turn
1. A problem with the dot plot is that the mean difference axis uses negative numbers whenever the girls' average is higher than the boys'. A more accurate reflection of the gender difference would be to use only positive numbers, and label left of 0 as "girls higher" and right of 0 as "boys higher". It could also be useful to have a finer resolution of grid lines on the mean difference. Improve the mean difference axis so that it has positive numbers only, enhanced labels, and ticks at increments of 5 points.
2. Re-do the analysis using reading scores by gender. Discuss the reading gap across the globe.
## Mapping the scores
Generating a map that uses colour to represent the mean difference requires joining the data to a map data set. There's a world map as polygons available in the `ggplot2` package, which enables a quick and easy map to be constructed. The map data needs a few corrections to match the names with the scores data. Colour can be important when representing a numeric variable, so the colour scale is manually constructed.
```{r eval=FALSE}
library(ggthemes)
world_map <- map_data("world")
world_map$region[world_map$subregion == "Hong Kong"] <- "Hong Kong"
world_map$region[world_map$subregion == "Macao"] <- "Macao"
to_map <- left_join(world_map, score_gap, by=c("region"="name"))
ggplot(to_map, aes(map_id = region)) +
  geom_map(aes(fill=math), map = world_map,
           color="grey70", size=0.1) +
  scale_fill_gradient2("Math gap", limits=c(-35, 35), na.value="grey90",
                       low="#1B9E77", high="#D95F02", mid="white") +
  expand_limits(x = world_map$long, y = world_map$lat) +
  theme_few() +
  theme(legend.position = "bottom",
        legend.key.width=unit(1.5, "cm"),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank())
```
We learn that there are a few regions where the gender gap is reversed or non-existent. These tend to be Scandinavian countries, South-East Asia and the Middle East.
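
Because the join is on names, any mismatch between the two sources silently leaves a country without a value. The following is a minimal sketch on made-up data frames (`toy_map` and `toy_gap` are illustrative, and the gap values are invented); it uses base R `merge()` with `all.x = TRUE`, which keeps every map region in the same way as the `left_join()` call, although `merge()` sorts the rows by the key.

```r
# Toy stand-ins for the map regions and the score data (values made up)
toy_map <- data.frame(region = c("Australia", "Fiji", "USA"),
                      stringsAsFactors = FALSE)
toy_gap <- data.frame(name = c("Australia", "USA", "Hong Kong"),
                      math = c(6, 9, 2),
                      stringsAsFactors = FALSE)

# Keep every map region; regions without a score get NA
joined   <- merge(toy_map, toy_gap, by.x = "region", by.y = "name", all.x = TRUE)
no_score <- joined$region[is.na(joined$math)]

# Names in the score data that have no matching map region
unmatched_scores <- setdiff(toy_gap$name, toy_map$region)
```

Regions listed in `no_score` would be drawn in the `na.value` colour, while names in `unmatched_scores` point to entries that need correcting, as was done for Hong Kong and Macao.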
## Your turn
Make the world map using reading gap, and summarise your findings.
## Change in scores over time
FIXME: Add some `learningtower` examples.