workflow.qmd

---
title: "Assessing the distributional impacts of rail schemes"
author: "Rahulan Chandrasekaran"
date: "now"
format:
  html:
    toc: true
    anchor-sections: true
    link-external-newwindow: true
    code-fold: true
    code-overflow: wrap
execute: 
  cache: false
  warning: false
  error: true
  echo: true
  messsage: false
---

## Introduction

This document explains how to perform distributional impacts analysis (DIA) on the modelled user (or direct transport) benefits of a rail intervention, in accordance with [TAG Unit A4.2](https://www.gov.uk/government/publications/tag-unit-a4-2-distributional-impact-appraisal). 

The R code below has been developed specifically for rail schemes modelled with MOIRA or MOIRA2.

This analysis requires the user to provide two CSV files as inputs, each containing modelled estimates of changes in generalised journey time (in minutes) for pairs of origin and destination stations - described by their NLC identifiers - for trips of a specific journey purpose. One file should have modelled user benefits for commuting trips, and the other file should have the equivalent for leisure trips.^[This analysis does not use any data based on business trips, as business trip user benefits are not suitable for distributional impacts analysis.]

The code first converts the two sets of modelled GJT changes into monetised user benefits by station. For this step, the user benefits associated with a given pair of origin and destination stations is split equally between the origin and destination.^[This is in contrast to alternative assumptions typically used in the context of multi-modal transport models, such as assigning all of a link's benefits to the origin station (especially when the modelled time period is restricted to the morning peak).]

This yields estimates of user benefits by modelled station, which is then mapped to LSOAs using an assignment method where LSOAs with population-weighted centroids within a certain distance of a station are then paired with that station. Mapping user benefits to LSOAs allows for comparison against income deprivation data, which is also available at LSOA level. 

Our final table summarises the distribution of scheme user benefits by LSOA across all five income deprivation quintiles (1 = most deprived 20%). Within the table, each quintile is assigned a grade based on its constituent impact area LSOAs' relative shares of modelled user benefits and population. 

That is, for all impact area LSOAs within an income deprivation quintile:

| Quintile benefits  | Comparison vs quintile population share | Grade |
|--------|--------|--------|
| Positive (benefits) | 5 perc. points or more **greater** than quintile pop. share   | ✔✔✔ |
| Positive (benefits) | Within ±5 perc. points of quintile pop. share   | ✔✔ |
| Positive (benefits) | 5 perc. points or more **smaller** than quintile pop. share   | ✔ |
| Zero | -  | Neutral |
| Negative (disbenefits) | 5 perc. points or more **smaller** than quintile pop. share   | ❌ |
| Negative (disbenefits) | Within ±5 perc. points of quintile pop. share   | ❌❌|
| Negative (disbenefits) | 5 perc. points or more **greater** than quintile pop. share   | ❌❌❌ |

: System for grading of transport user benefits DIs for each of the social groups. Source: TAG Unit A4.2, Table 8. {tbl-colwidths="[40,40,20]"}

## Data sources

- User-supplied data on scheme benefits: 
  - Format: 3 column CSV: origin station NLC; destination staion NLC; modelled change in generalised journey time (minutes) for journeys of a specific purpose (either commuting or leisure)
  - Worked example: based on TRU 2019 analysis
- Reference data on station NLCs and coordinates
  - For the worked example, we will use the station coordinates and NLCs from the TRU scheme modelling.
  - However, this will be replaced with a larger dataset constructed initially from ORR stations data.
- Population by LSOA: ONS, 2020 mid-year population estimates (all ages), published in September 2021. Link to data: [.xlsx](https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/lowersuperoutputareamidyearpopulationestimates/mid2020sape23dt2/sape23dt2mid2020lsoasyoaestimatesunformatted.xlsx)
- Income deprivation quintiles by LSOA, from the 2019 English indices of deprivation, published by MHCLG. [Link to IMD homepage](https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019).
- Population-weighted centroids by LSOA: available on the ONS Open Geography portal. [Link](https://geoportal.statistics.gov.uk/datasets/54a76a36e3b4420a9ea83fcf7994525d_0/).
- Appraisal parameters: 
  - Value of time for commuting journeys: £9.95, from the TAG Data Book.
  - Value of time for leisure journeys: £4.54, from the TAG Data Book.

## Code / outputs

First, we load the necessary R packages and define the distance to use for pairing LSOAs with stations.  

```{r packages, include = FALSE}
# Load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr, readr, data.table, glue, here, janitor, sf, cli, 
               gt, tmap, geojsonio, stringr, assertthat, tictoc)

tictoc::tic("Runtime") # start timer

```

```{r user_options}

# Are scheme commuting/leisure benefits unmonetised?
commuting_benefits_unmonetised <- TRUE
leisure_benefits_unmonetised <- TRUE

# Scheme benefits data filenames 
commuting_benefits_filename <- "user_benefits_commuting.csv"
leisure_benefits_filename <- "user_benefits_leisure.csv"

# Buffer distance to use when assigning LSOAs to stations by distance to 
# LSOA population-weighted centroids (units: metres)
station_buffer <- 1500

# Values of time for commuting and leisure purposes
# Source: TAG Data Book, 2010 prices
vot_commuting_tag <- 9.95
vot_leisure_tag <- 4.54

```


We read in the user-supplied scheme benefits data, assumed to be located in the relative paths `data/user_benefits_commuting.csv` and `data/user_benefits_leisure.csv`, before converting the estimated travel time savings to monetary values, summing across the two journey purposes, and aggregating the results by station. The first few rows of the resulting data is displayed at the end. 

```{r, eval = TRUE}
# Read data on modelled scheme GJT changes by purpose (commuting or leisure) 
# for all modelled pairs of origin and destination stations, before monetising
# GJT changes if necessary

ReadStationGJTCByPurpose <- function(data_path, filename, vot, to_monetise){
  
  bens <- readr::read_csv(glue("{data_path}{filename}"), 
                  col_names = c("origin_tlc", "dest_tlc", "gjtc")) %>% 
    mutate(gjtc =  gjtc * (vot/60) * to_monetise) %>% 
    filter(!if_any(everything(), is.na))
  
  bens_orig <- bens %>% 
    group_by(origin_tlc) %>%
    summarise(gjtc = sum(gjtc, na.rm = T) / 2) %>% 
    rename("tlc" = origin_tlc)
  
  bens_dest <- bens %>% 
    group_by(dest_tlc) %>%
    summarise(gjtc = sum(gjtc, na.rm = T) / 2) %>% 
    rename("tlc" = dest_tlc)
  
  assertthat::are_equal(sum(bens$gjtc), 
                        sum(bens_orig$gjtc) + sum(bens_dest$gjtc)) 
  
  bens_agg <- rbind(bens_orig, bens_dest)
  return(bens_agg)
}
  
commuting_bens <- ReadStationGJTCByPurpose(
  data_path = glue("{here()}/data/"),
  filename = commuting_benefits_filename,
  vot = vot_commuting_tag,
  to_monetise = commuting_benefits_unmonetised
)

leisure_bens <- ReadStationGJTCByPurpose(
  data_path = glue("{here()}/data/"),
  filename = leisure_benefits_filename,
  vot = vot_leisure_tag,
  to_monetise = leisure_benefits_unmonetised
)

total_bens <- rbind(commuting_bens, leisure_bens) %>% 
  group_by(tlc) %>% 
  summarise(gjtc = sum(gjtc, na.rm = T)) %>% 
  arrange(tlc)

head(total_bens)
```

Next, we read in our reference data on station NLCs and coordinates. For the worked example, we will use a dataset constructed from the modelling of the scheme that provides the user benefits for our worked example. This is a subset of all non-closed rail stations in England, but this does not affect the analysis. However, in future we will replace this with a dataset built on ORR stations data.

```{r}
stations_filename <- "tru_stations.csv"
stations <- readr::read_csv(glue("{here()}/data/{stations_filename}"), 
                  col_names = c("name", "tlc", "easting", "northing")) %>% 
  filter(!if_any(everything(), is.na)) %>% 
  distinct(.keep_all = T) %>% 
  dplyr::inner_join(total_bens, by = c("tlc")) %>% # add benefits by station
  mutate(gjtc = gjtc * -1) %>% 
  filter(!(str_detect(name, "Zone "))) # exclude non-rail London zones (eg "Zone Blackfriars")
```


Next, we need LSOA population data.

```{r}
# Population (all ages) in mid-2020 by 2011 LSOA
# Source: ONS (https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/lowersuperoutputareamidyearpopulationestimates/mid2020sape23dt2/sape23dt2mid2020lsoasyoaestimatesunformatted.xlsx, sheet: "Mid-2020 Persons", cols A-G)
lsoa_pop_filename <- "lsoa11_pop_2020.csv"
lsoa_pop <- readr::read_csv(glue("{here()}/data/{lsoa_pop_filename}")) %>% 
  janitor::clean_names() %>% 
  select(lsoa_code, "pop" = all_ages) %>% 
  rename("lsoa11cd" = lsoa_code) %>% 
  as.data.table()

head(lsoa_pop)
```


Followed by LSOA income deprivation quintile data, which requires converting deciles to quintiles. 


```{r}
# LSOA income deprivation quintiles, 2019
# Source: MHCLG, English indices of deprivation 2019
imd_filename <- "File_7_-_All_IoD2019_Scores__Ranks__Deciles_and_Population_Denominators_3.csv"
imd <- readr::read_csv(glue("{here()}/data/{imd_filename}")) %>% 
  janitor::clean_names() %>% 
  select(lsoa_code_2011, 
         "income_decile" = income_decile_where_1_is_most_deprived_10_percent_of_lso_as) %>% 
  mutate(income_quintile = case_when(
    income_decile >= 9 ~ 5,
    income_decile >= 7 ~ 4,
    income_decile >= 5 ~ 3,
    income_decile >= 3 ~ 2,
    income_decile >= 1 ~ 1,
    TRUE ~ NA_integer_
  )) %>% 
  rename("lsoa11cd" = lsoa_code_2011) %>% 
  select(-c(income_decile)) %>% 
  as.data.table()

head(imd)
```


Finally, we import a dataset containing the coordinates of all (2011 definition) LSOA population-weighted centroids in England and Wales in 2022. 

```{r, eval = TRUE}
# Read data on 2021 LSOA population-weighted centroids
# Source: https://geoportal.statistics.gov.uk/datasets/ons::lsoa-dec-2021-pwc-for-england-and-wales/explore
lsoa_pwc_filename <- "LSOA_Dec_2011_PWC_in_England_and_Wales_2022_1923591000694358693.csv"
lsoa_pwc <- readr::read_csv(glue("{here()}/data/{lsoa_pwc_filename}")) %>% 
  janitor::clean_names() %>% 
  filter(!(stringr::str_sub(lsoa11cd, 1, 1) == "W")) %>%  # exclude Welsh LSOAs
  select(lsoa11cd, x, y)

head(lsoa_pwc)
```


At this stage, we can plot our scheme benefits data (by station) on a map. 


```{r map_options, include = FALSE}
tmap_options(check.and.fix = T)
tmap_mode("view")
```


```{r, eval = TRUE, include = TRUE}
# Make bubble map of user benefits by modelled station
stations %>% 
  sf::st_as_sf(coords = c("easting", "northing"), crs = 27700) %>%
  tm_shape(.) +
  tm_bubbles(size = "gjtc", col = "blue", alpha = 0.2)
```

We continue with our analysis by assigning user benefits to LSOAs, which requires identifying all pairs of modelled stations and population-weighted LSOA centroids within `r paste0(station_buffer)`m of each other. Each LSOA that is paired with a modelled station in this way is then assigned a portion of the station's user benefits based on its share of the total population across all station-paired LSOAs.

::: {.callout-note}
This is undoubtedly the most computationally taxing step in the process, but we can exploit the speed of the [**data.table**](https://rdatatable.gitlab.io/data.table/) package to avoid memory-related issues. This works successfully for our worked example - which has 716 modelled stations - where this step is implemented within 5-10 seconds. However, further testing will be required to determine whether the step can be completed successfully for schemes with many more modelled stations.  
:::


```{r}
stations <- as.data.table(stations)
lsoa_pwc <- as.data.table(lsoa_pwc)

station_pairs <- vector("list", nrow(stations))

for(s in seq(1:nrow(stations))){
  station_pairs[[s]] <- as.data.table(stations[s,])[, as.list(lsoa_pwc), 
                                                    by = stations[s,]]
  station_pairs[[s]] <- station_pairs[[s]][, 
                                           dist := (((easting - x)^2) + ((northing - y)^2)) ^ 0.5
                                           ][dist <= station_buffer]
  if(s %% 100 == 0){cli::cli_inform("Processed {s}/{nrow(stations)} stations")}
}

final <- rbindlist(station_pairs)
rm(station_pairs)

```


Now that we've obtained a dataset showing scheme user benefits by LSOA, for all LSOAs with population-weighted centroids within `r paste0(station_buffer)`m of a modelled station, we can add our other LSOA variables (population and income deprivation quintile) before calculating each LSOA's share of its paired station's population and associated user benefits.

Then we aggregate the data by income deprivation quintile and produce our final user benefits DI table.

```{r, eval = TRUE}

# Add LSOA income deprivation quintile and 2020 population data 
final_2 <- final[imd, 
               on = "lsoa11cd", 
               income_quintile := i.income_quintile][
                 lsoa_pop,
                 on = "lsoa11cd",
                 pop := i.pop
               ] %>% 
  as_tibble() %>%
  
  # Calculate total station population (across paired LSOAs)
  group_by(tlc) %>% 
  mutate(sta_pop = sum(pop)) %>%  
  ungroup() %>% 
  
  # Calculate LSOA population shares
  # And then LSOA benefits (based on pop. shares)
  mutate(pop_share = pop / sta_pop,
         lsoa_bens = gjtc * pop_share) %>% 
  filter(!if_any(everything(), is.na))

# Finish compiling benefits by LSOA
bens_by_lsoa <- final_2 |>
  select(lsoa11cd, lsoa_bens)

head(bens_by_lsoa)
```

```{r}
# Produce TAG A4.2 user benefits DI table
# Group benefits/disbenefits data by income quintile, assign grade based on 
# comparing each quintile's share of benefits (disbenefits) and its population
# share. Uses {gt} package to produce a table in HTML format. 


bens_by_quintile <- final_2 %>% 
  group_by(income_quintile) %>% 
  summarise(lsoas = n(),
            bens = sum(lsoa_bens),
            pop = sum(pop)) |> 
  mutate(disbens = if_else(bens < 0, bens * -1, 0)) |> 
  mutate(bens = if_else(bens > 0, bens, 0)) |> 
  select(income_quintile, lsoas, bens, disbens, pop)

bens_by_quintile <- bens_by_quintile |> 
  mutate(share_bens = bens / sum(bens),
         share_disbens = disbens / sum(disbens),
         share_pop = pop / sum(pop)) %>% 
  mutate(grade = case_when(
    share_bens > 0 & share_bens - share_pop >= 0.05 ~ "✔✔✔",
    share_bens > 0 & share_bens - share_pop >= -0.05 ~ "✔✔",
    share_bens > 0 & share_bens - share_pop < -0.05 ~ "✔",
    share_bens == 0 ~ "Neutral",
    share_disbens > 0 & share_disbens - share_pop >= 0.05 ~ "❌❌❌",
    share_disbens > 0 & share_disbens - share_pop >= -0.05 ~ "❌❌",
    share_disbens > 0 & share_disbens - share_pop <= -0.05 ~ "❌",
    TRUE ~ NA_character_
  )) %>% 
  mutate(bens = bens / 1000000,
         disbens = disbens / 1000000,
         pop = pop / 1000000)

# Make final HTML table
bens_by_quintile |> 
  gt() |> 
  sub_missing(missing_text = "-") |> 
  
  # Amend column formats
  fmt_number(columns = 2, decimals = 0) %>% # lsoas
  fmt_number(columns = c(3:5), decimals = 1) %>% # total bens/disbens/pop
  fmt_percent(columns = c(6:8), decimals = 0) %>% # shares of bens/disbens/pop
  
  # Amend column labels
  cols_label(
    income_quintile = "Income Deprivation Quintile",
    lsoas = "LSOAs",
    bens = "Total benefits",
    disbens = "Total disbenefits",
    pop = "Population",
    share_bens  = "Share of user benefits",
    share_disbens = "Share of user disbenefits",
    share_pop = "Share of population in impact area",
    grade = "Assessment"
    ) %>%
  
  # Table header
  tab_header(title = "User Benefits Distributional Analysis") %>% 
  
  # Add table footnotes
  tab_footnote(
    footnote = "1 = most deprived quintile; 5 = least deprived quintile.",
    locations = cells_column_labels(columns = income_quintile)
  ) %>% 
  tab_footnote(
    footnote = "£m, 2010 prices, for a specific modelled year and scenario.",
    locations = cells_column_labels(columns = bens)
  ) %>% 
  tab_footnote(
    footnote = glue("Summed across all English LSOAs in each quintile with a population-weighted centroid within {station_buffer}m of a scheme-modelled station."),
    locations = cells_column_labels(columns = pop)
  ) %>% 
  
  # Add/format summary totals
  grand_summary_rows(
    columns = c(lsoas),
    fns = list(Total = ~sum(., na.rm = T)),
    fmt = list(~fmt_integer(.))
  ) %>% 
  grand_summary_rows(
    columns = c(bens, disbens, pop),
    fns = list(Total = ~sum(., na.rm = T)),
    fmt = list(
        ~fmt_number(., decimals = 1),
        ~fmt_number(., decimals = 1),
        ~fmt_number(., decimals = 1)
      )
  ) %>% 
  grand_summary_rows(
    columns = c(share_bens, share_disbens, share_pop),
    fns = list(Total = ~sum(., na.rm = T)),
    fmt = list(
        ~fmt_percent(., decimals = 0),
        ~fmt_percent(., decimals = 0),
        ~fmt_percent(., decimals = 0)
      )
  ) 
```


```{r}
tictoc::toc(log = TRUE)

sessionInfo()
```