01_z01_groundwater_pumping.rmd


### Project Overview

---

**Objective:**

* Understand the impact of groundwater pumping on groundwater level. 

---

**Datasets**

* Groundwater pumping by irrigation wells in Chase, Dundy, and Perkins Counties in the southwest corner of Nebraska 
* Groundwater levels observed at USGS monitoring wells located in the three counties and retrieved from the National Water Information System (NWIS) maintained by USGS using the `dataRetrieval` package.

---

**Econometric Model**

In order to achieve the project objective, we will estimate the following model:

$$
 y_{i,t} - y_{i,t-1} = \alpha + \beta gw_{i,t-1} + v
$$

where $y_{i,t}$ is the depth to groundwater table^[the distance from the surface to the top of the aquifer] in March^[For our geographic focus of southwest Nebraska, corn is the dominant crop type. Irrigation for corn happens typically between April through September. For example, this means that changes in groundwater level ($y_{i,2012} - y_{i,2011}$) captures the impact of groundwater pumping that occurred April through September in 2011.] in year $t$ at USGS monitoring well $i$, and $gw_{i,t-1}$ is the total amount of groundwater pumping that happened within the 2-mile radius of the monitoring well $i$. 

---

**GIS tasks**

* read an ESRI shape file as an `sf` (spatial) object 
  - use `sf::st_read()`
* download depth to water table data using the `dataRetrieval` package developed by USGS 
  - use `dataRetrieval::readNWISdata()` and `dataRetrieval::readNWISsite()`
* create a buffer around USGS monitoring wells
  - use `sf::st_buffer()`
* convert a regular `data.frame` (non-spatial) with geographic coordinates into an `sf` (spatial) objects
  - use `sf::st_as_sf()`  and `sf::st_set_crs()`
* reproject an `sf` object to another CRS
  - use `sf::st_transform()`
* identify irrigation wells located inside the buffers and calculate total pumping
  - use `sf::st_join()`
+ create maps 
  * use the `tmap` package 

---

**Preparation for replication**

Run the following code to install or load (if already installed) the `pacman` package, and then install or load (if already installed) the listed package inside the `pacman::p_load()` function.

```{r demo1_packages}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  sf, # vector data operations
  dplyr, # data wrangling
  dataRetrieval, # download USGS NWIS data
  lubridate, # Date object handling
  modelsummary, # regression table generation
  lfe # fast regression with many fixed effects
)
```

### Project Demonstration

The geographic focus of the project is the southwest corner of Nebraska consisting of Chase, Dundy, and Perkins County (see Figure \@ref(fig:NE-county) for their locations within Nebraska). Let's read a shape file of the three counties represented as polygons. We will use it later to spatially filter groundwater level data downloaded from NWIS.

```{r NE_county_data, echo = FALSE, results = "hide"}
#--- Nebraska counties ---#
NE_county <-
  st_read(
    dsn = "Data",
    layer = "cb_2018_us_county_20m"
  ) %>%
  filter(STATEFP == "31") %>%
  mutate(NAME = as.character(NAME)) %>%
  st_transform(32614)

three_counties <-
  st_read(dsn = "Data", layer = "urnrd") %>%
  st_transform(32614)
```

```{r Demo1_read_urnrd_borders}
three_counties <-
  st_read(dsn = "Data", layer = "urnrd") %>%
  #--- project to WGS84/UTM 14N ---#
  st_transform(32614)
```

```{r NE-county, fig.cap = "The location of Chase, Dundy, and Perkins County in Nebraska", echo = F}
#--- map the three counties ---#
tm_shape(NE_county) +
  tm_polygons() +
  tm_shape(three_counties) +
  tm_polygons(col = "blue", alpha = 0.3) +
  tm_layout(frame = FALSE)
```
---

We have already collected groundwater pumping data, so let's import it. 

```{r Demo1_urnrd_gw_read}
#--- groundwater pumping data ---#
(
  urnrd_gw <- readRDS("Data/urnrd_gw_pumping.rds")
)
```

`well_id` is the unique irrigation well identifier, and `vol_af` is the amount of groundwater pumped in acre-feet. This dataset is just a regular `data.frame` with coordinates. We need to convert this dataset into a object of class `sf` so that we can later identify irrigation wells located within a 2-mile radius of USGS monitoring wells (see Figure \@ref(fig:sp-dist-wells) for the spatial distribution of the irrigation wells).

```{r convert_to_sf}
urnrd_gw_sf <-
  urnrd_gw %>%
  #--- convert to sf ---#
  st_as_sf(coords = c("lon", "lat")) %>%
  #--- set CRS WGS UTM 14 (you need to know the CRS of the coordinates to do this) ---#
  st_set_crs(32614)

#--- now sf ---#
urnrd_gw_sf
```

```{r sp-dist-wells, fig.cap = "Spatial distribution of irrigation wells", echo = F}
tm_shape(three_counties) +
  tm_polygons() +
  tm_shape(unique(urnrd_gw_sf, by = "well_id")) +
  tm_symbols(size = 0.1, col = "blue") +
  tm_layout(frame = FALSE)
```
---

Here are the rest of the steps we will take to create a regression-ready dataset for our analysis.

1. download groundwater level data observed at USGS monitoring wells from National Water Information System (NWIS) using the `dataRetrieval` package 
2. identify the irrigation wells located within the 2-mile radius of the USGS wells and calculate the total groundwater pumping that occurred around each of the USGS wells by year 
3. merge the groundwater pumping data to the groundwater level data

---

Let's download groundwater level data from NWIS first. The following code downloads groundwater level data for Nebraska from Jan 1, 1990, through Jan 1, 2016.

```{r gwl_data_download, eval = F}
#--- download groundwater level data ---#
NE_gwl <-
  readNWISdata(
    stateCd = "Nebraska",
    startDate = "1990-01-01",
    endDate = "2016-01-01",
    service = "gwlevels"
  ) %>%
  dplyr::select(site_no, lev_dt, lev_va) %>%
  rename(date = lev_dt, dwt = lev_va)

#--- take a look ---#
head(NE_gwl, 10)
```

```{r read_NW_gwl, echo = F}
NE_gwl <- readRDS("Data/NE_gwl.rds")

#--- take a look ---#
head(NE_gwl, 10)
```

`site_no` is the unique monitoring well identifier, `date` is the date of groundwater level monitoring, and `dwt` is depth to water table. 

We calculate the average groundwater level in March by USGS monitoring well (right before the irrigation season starts):^[`month()` and `year()` are from the `lubridate` package. They extract month and year from a `Date` object.]

```{r avg_march_deptn}
#--- Average depth to water table in March ---#
NE_gwl_march <-
  NE_gwl %>%
  mutate(
    date = as.Date(date),
    month = month(date),
    year = year(date),
  ) %>%
  #--- select observation in March ---#
  filter(year >= 2007, month == 3) %>%
  #--- gwl average in March ---#
  group_by(site_no, year) %>%
  summarize(dwt = mean(dwt))

#--- take a look ---#
head(NE_gwl_march, 10)
```

Since `NE_gwl` is missing geographic coordinates for the monitoring wells, we will download them using the `readNWISsite()` function and select only the monitoring wells that are inside the three counties.  

```{r NE_sites}
#--- get the list of site ids ---#
NE_site_ls <- NE_gwl$site_no %>% unique()

#--- get the locations of the site ids ---#
sites_info <-
  readNWISsite(siteNumbers = NE_site_ls) %>%
  dplyr::select(site_no, dec_lat_va, dec_long_va) %>%
  #--- turn the data into an sf object ---#
  st_as_sf(coords = c("dec_long_va", "dec_lat_va")) %>%
  #--- NAD 83 ---#
  st_set_crs(4269) %>%
  #--- project to WGS UTM 14 ---#
  st_transform(32614) %>%
  #--- keep only those located inside the three counties ---#
  .[three_counties, ]
```

---

We now identify irrigation wells that are located within the 2-mile radius of the monitoring wells^[This can alternatively be done using the `st_is_within_distance()` function.]. We first create polygons of 2-mile radius circles around the monitoring wells (see Figure \@ref(fig:buffer-map)).

```{r create_buffer}
buffers <- st_buffer(sites_info, dist = 2 * 1609.34) # in meter
```

```{r buffer-map, fig.cap = "2-mile buffers around USGS monitoring wells", echo = F}
tm_shape(three_counties) +
  tm_polygons() +
  tm_shape(buffers) +
  tm_polygons(col = "red", alpha = 0, 2) +
  tm_shape(sites_info) +
  tm_symbols(size = 0.1) +
  tm_layout(frame = FALSE)
```

We now identify which irrigation wells are inside each of the buffers and get the associated groundwater pumping values. The `st_join()` function from the `sf` package will do the trick.

```{r Demo_join_buffer_gw, cache = FALSE}
#--- find irrigation wells inside the buffer and calculate total pumping  ---#
pumping_nearby <- st_join(buffers, urnrd_gw_sf)
```

Let's take a look at a USGS monitoring well (`site_no` = $400012101323401$).

```{r take_a_look}
filter(pumping_nearby, site_no == 400012101323401, year == 2010)
```

As you can see, this well has seven irrigation wells within its 2-mile radius in 2010.   

Now, we will get total nearby pumping by monitoring well and year. 

```{r Demo1_summary_by_buffer, cache = TRUE}
(
  total_pumping_nearby <-
    pumping_nearby %>%
    #--- calculate total pumping by monitoring well ---#
    group_by(site_no, year) %>%
    summarize(nearby_pumping = sum(vol_af, na.rm = TRUE)) %>%
    #--- NA means 0 pumping ---#
    mutate(
      nearby_pumping = ifelse(is.na(nearby_pumping), 0, nearby_pumping)
    )
)
```

---

We now merge nearby pumping data to the groundwater level data, and transform the data to obtain the dataset ready for regression analysis.

```{r Demo_nearby_merge}
#--- regression-ready data ---#
reg_data <-
  NE_gwl_march %>%
  #--- pick monitoring wells that are inside the three counties ---#
  filter(site_no %in% unique(sites_info$site_no)) %>%
  #--- merge with the nearby pumping data ---#
  left_join(., total_pumping_nearby, by = c("site_no", "year")) %>%
  #--- lead depth to water table ---#
  arrange(site_no, year) %>%
  group_by(site_no) %>%
  mutate(
    #--- lead depth ---#
    dwt_lead1 = dplyr::lead(dwt, n = 1, default = NA, order_by = year),
    #--- first order difference in dwt  ---#
    dwt_dif = dwt_lead1 - dwt
  )

#--- take a look ---#
dplyr::select(reg_data, site_no, year, dwt_dif, nearby_pumping)
```

---

Finally, we estimate the model using `feols()` from the `fixest` package (see [here](https://cran.r-project.org/web/packages/fixest/vignettes/fixest_walkthrough.html) for an introduction).

```{r Demo_reg_dwt}
#--- OLS with site_no and year FEs (error clustered by site_no) ---#
reg_dwt <-
  feols(
    dwt_dif ~ nearby_pumping | site_no + year,
    cluster = "site_no",
    data = reg_data
  )
```

Here is the regression result.

```{r reg_dwt_disp}
modelsummary(
  reg_dwt,
  stars = TRUE,
  gof_omit = "IC|Log|Adj|Within|Pseudo"
)
```

---