Summary

This document contains exploratory plots and tables summarizing the CGA reservoir dataset, as well as a fit of a survival model to explore various factors that may influence when reservoirs were constructed.

Results of the survival analysis show that prior land use, reservoir size, and percent saturation at time of construction may have influenced when reservoirs were built, and that some of those influences may have varied in strength between the two CGAs Grand Prairie and Cache River. In particular, we see there is a three-way interaction between CGA, prior land use, and reservoir size. It is that reservoir size influences construction time only in previously non-cropland areas and only in Grand Prairie. On all previously cropland areas, and all areas in Cache River, we see very little difference in construction time by reservoir size.

Version history

Setup

Load the data. Convert to a spatial object so that we can make maps if needed. There is a mistake in one County value in the data. Correct that. Create an ID column for each reservoir and also a numeric column for the last year in the timespan we know it was built (so the latest possible year each reservoir could have been built, for plotting purposes). Combine empty and <Null> entries to "Unknown" for prior land use. Also, load the county boundaries for the state of Arkansas to use as background for the maps. Project everything into a map projection appropriate for Arkansas.

library(data.table)
library(ggplot2)
library(sf)
library(USAboundaries)
library(purrr)
library(gt)
library(stringr)
library(survival)
library(survminer)
library(icenReg)
library(ggdist)
library(glue)

reservoirs <- fread('project/Yaeger_ReservoirDataset_Oct2022.csv')
reservoirs[County == '<=1975', County := 'Lonoke']
reservoirs[, ID := 1:.N]
reservoirs[, OID_ := NULL]
reservoirs[, LOCATIONID := NULL]
MSLcols <- grep('MSLft|MRVA', names(reservoirs), value = TRUE)
reservoirs[, c(MSLcols) := NULL]
reservoirs[, last_year := map_int(strsplit(TimeSpan, split = '[[:punct:]]', fixed = FALSE), ~ as.integer(.[length(.)]))]
reservoirs[Prior_LULC %in% c('', '<Null>'), Prior_LULC := 'Unknown']

reservoirs_sp <- st_as_sf(reservoirs, coords = c('X', 'Y'), crs = 4326) |>
  st_transform(crs = 26952)
AR_counties <- us_counties(resolution = 'high', states = 'AR') |>
  st_transform(crs = 26952)

Make a long-form time series of groundwater depth by year and identify for each year whether the reservoir at that location was built previously, was built this year (or during the 5-year time span immediately preceding the date when groundwater depth was measured), or had yet to be built at that time. edit: Add percent saturation to this time series.

static_attrs <- c('ID', 'CGA', 'Type', 'County', 'Area_ha', 'Area_ac', 'TimeSpan', 'last_year', 'Prior_LULC', 'X', 'Y')
reservoir_depth_timeseries <- melt(reservoirs, id.vars = static_attrs, variable.name = 'year', value.name = 'depth')
reservoir_depth_timeseries[, ts_var := ifelse(grepl('^PctSat', year), 'pct_sat', 'depth')]
reservoir_depth_timeseries[, year := as.numeric(gsub('DTW_Y|PctSat', '', year))]
reservoir_depth_timeseries[, built_this_year := fcase(year > last_year, 'built previously', year == last_year, 'built this year', year < last_year, 'will be built later')]

reservoir_depth_timeseries <- dcast(reservoir_depth_timeseries, paste(paste(static_attrs, collapse = '+'), '+ built_this_year + year ~ ts_var'), value.var = 'depth')

Exploratory plots and tables

Note: Code is not shown for this portion.

Map of reservoirs with some of their attributes

Plot the reservoirs on a map with size of point corresponding to its area, shape corresponding to whether it’s a pond or a reservoir, and color for when it was built. Dark colors indicate older and light colors newer. Note the size of the point is proportional to the area of the reservoir on a log scale. It looks like the majority of the older reservoirs are in the southern region and there is a much higher density of reservoirs there, and a greater number of large reservoirs in the south (basically all of the ones with >100 hectares size are down in the southern part). You probably already know all this but it is useful for me to visualize.

Make some tables or plots of the different attributes of the reservoirs, both individually and crossed with one another.

Here is a histogram of reservoir area by CGA. Notice log scale x-axis. Grand Prairie has more overall as well as many more in the large end of the distribution.

Here are the same histograms as above, with separate panels for ponds and reservoirs.

Here are tables of reservoir counts by some attribute crossed with CGA, with some colors to help visualize.

timespan counts by CGA
timespan when built Cache River Grand Prairie total
<=1975 21 178 199
1976-1980 8 36 44
1981-1985 10 51 61
1986-1990 17 40 57
1991-1995 10 10 20
1996-2000 18 76 94
2001-2005 21 135 156
2006-2010 14 53 67
2011-2015 24 53 77
prior land use counts by CGA
prior land use Cache River Grand Prairie total
Cropland 75 296 371
Mixed: Cropland and Pond 4 14 18
Mixed: Cropland and Wooded area 2 6 8
Mixed: Cropland, Natural Feature 17 22 39
Mixed: Edge of Field, Natural Feature 11 68 79
Natural low-lying feature 8 35 43
Wooded area 5 13 18
Unknown 21 178 199
waterbody type counts by CGA
type of waterbody Cache River Grand Prairie total
Pond 25 74 99
Reservoir 118 558 676

Time series plot of groundwater depth and new reservoir construction

Here is a time series plot showing, for each reservoir individually, a time trend of depth to groundwater (partially transparent because many trends are stacked on top of each other). Superimposed on that are boxplots of the depth of groundwater separated by when the reservoirs at each time point were built (using the last year of the timespan that we know it was built in).

The plot obviously shows the water table getting deeper over time in both CGAs. To me, just eyeballing it right now, it does seem that the new reservoirs being built in each year, in both CGAs, basically track the average depth to groundwater prevailing at that time. (The distributions of the data in the blue boxplots are not any different than the distributions in the red and green boxplots.) So that would imply, in a way, that reservoir builders aren’t basing their decision of whether to build a reservoir on how deep the groundwater is at that location. They are building them based on other reasons, so the depth to groundwater at that location is basically random with respect to their decision. As a result, it just tracks the average for that region. This especially appears to be true in the Grand Prairie CGA. We could make a statistical model to estimate actual parameters for that trend but for now this is just an exploratory idea.

One other thing to note, which I mentioned in the Teams conversation the other day, is that it would also be enlightening to have groundwater depth data for locations in which no reservoirs were ever built. That would really help nail down if this is a significant trend. We might find that locations without reservoirs (where no one has ever built a reservoir to date) have much shallower depths to groundwater than the places where reservoirs have been built at some point in history. That would mean that yes, reservoirs are being built in places where it is harder to access groundwater than places where they aren’t being built (which would be a different conclusion than what I am drawing from just eyeballing the graph here). If the groundwater depth data is being derived from some continuous data layer, we could actually use that to make that comparison.

Same time series plot as above, with ponds removed

Here is the same plot as above, plotting only the reservoirs and not the ponds. To me the broad trends look very similar.

Time series plot of reservoir size and new reservoir construction

This plot only shows the new reservoirs built each year in each CGA, and a distribution of their areas in hectares. As above when we were dealing with area, notice logarithmic scale on the y-axis.

Again I am basing this on an eyeball test only, but I do not see much of a trend in the size of new reservoir construction over time. A statistical model could be used to look more rigorously but this figure does not seem to show much of one. (But note especially in Cache River that new reservoir construction after 1995 may be bigger than old ones, though it is not a strong trend.)

Here is the same plot as above but with an arithmetic y-axis.

Here is the arithmetic y-axis plot excluding all reservoirs greater than 50 ha in area (only 26 of the 775 are removed from this plot).

Same size by time plot as above, separated by ponds and reservoirs

I decided to plot both ponds and reservoirs as separate bars here. It does not change the overall story that there does not seem to be any obvious trend in size of new construction of reservoirs over time.

Time series plot of percent saturation

These plots were added on October 19. They are the same format as the groundwater depth plots, with year on the x-axis, but using the percent saturation variable on the y-axis instead.

The percent saturation trend is a little different than the depth trend, especially now that the corrections were made (Oct 21 version). We see a peak in 1995 for many of the locations in Grand Prairie but not Cache River. But we do also see the general downward trend resulting from aquifer depletion in both regions. We still do not see a big difference between the percent saturation and the status of reservoir construction in any given year, just as we didn’t for depth.

Statistical models

I decided that the best way to analyze this dataset is survival analysis, otherwise known as time-to-event analysis. As I said before, it would be an even better model if we had the history of some locations where reservoirs were not built. However, we can do a decent job with this dataset. We have left-censoring in the dataset because some of the reservoirs were built before 1975 and we do not know exactly when. Also, we have 5-year intervals for timespan so we have to account for the fact that we can only say the reservoir was built sometime in that 5-year interval. The R package survival is able to account for both of those things.

Let’s use a proportional-hazards method to estimate survival curves as a function of the different properties: location (CGA), area (log-transformed), percent aquifer saturation at time of construction, and prior land cover. Create a column for the lower and upper bound of when the reservoir was built from the timespan column. If it was built before 1975, use NA for the lower bound. Also adjust so that 1975 is year 0.

Another adjustment is that there are a large number of prior land use categories. To simplify the model, let’s categorize them into two categories: one is cropland and all the mixed categories that contain cropland, the other is everything else that is not cropland, including forest, natural features, field edge, and unknown.

Finally, standardize area and percent saturation (after log transforming area) so that we can compare the coefficients.

edits: On October 19, I edited the model in a few ways:

year_of_const_data <- reservoir_depth_timeseries[year == last_year & Type == 'Reservoir']
year_of_const_data[, lower_bound := as.numeric(str_extract(TimeSpan, '^[0-9]+')) - 1975]
year_of_const_data[, upper_bound := as.numeric(str_extract(TimeSpan, '[0-9]+$')) - 1975]

year_of_const_data[, Prior_LULC_binary := ifelse(grepl('Cropland', Prior_LULC), 'cropland', 'other')]

scaled_data <- year_of_const_data[, .(Area_ha, pct_sat)]
scaled_data[, log_area := log(Area_ha)]
scaled_data <- scale(scaled_data)

year_of_const_data[, scaled_area := scaled_data[, 'log_area']]
year_of_const_data[, scaled_pctsat := scaled_data[, 'pct_sat']]

# Manually create CGA by land use interaction term because it seems like there is a bug creating 3-way interactions in ic_bayes
year_of_const_data[, CGA_by_LULC := factor(CGA == 'Grand Prairie' & Prior_LULC_binary == 'other')]

# icenReg fit
# Note: samples = number of samples across all chains.
set.seed(1021)
fit <- ic_bayes(Surv(time = lower_bound, time2 = upper_bound, type = 'interval2') ~ CGA + Prior_LULC_binary + scaled_area + scaled_pctsat + CGA_by_LULC + CGA:scaled_area + CGA:scaled_pctsat + CGA_by_LULC:scaled_area + CGA_by_LULC:scaled_pctsat, 
                data = year_of_const_data,
                model = 'ph', dist = 'loglogistic',
                controls = bayesControls(samples = 4000, chains = 4, burnIn = 9000, thin = 10))

Parameter estimates

First let’s look at a table and graph of parameter estimates. It is somewhat counter-intuitive to interpret these results, and it confused me at first. A negative coefficient on a variable means that as that variable increases, the “relative hazard” of having a reservoir built increases. So a positive coefficient means that higher values of the variable are associated with reservoirs being built earlier, and a negative coefficient means that higher values of the variable are associated with reservoirs being built later.

samples_long <- as.data.table(fit$samples)
samples_long[, id := 1:nrow(.SD)]
samples_long <- melt(samples_long, id.vars = 'id', variable.name = 'parameter')
samples_long[, hazard_ratio := exp(value)]

param_qs <- samples_long[, median_qi(value, .width = c(0.66, 0.90, 0.95)), by = parameter]

Because this figure shows standardized coefficients, we can compare them. The point estimates of the coefficients are black dots, and the 66%, 90%, and 95% quantile credible intervals are in progressively lighter shades of blue. A dotted line is plotted at zero (no effect). The coefficients are counter-intuitive, so here is my interpretation of the results edited after updating model:

  • The overall effect of CGA is positive indicating that Grand Prairie reservoirs tended to be built earlier (makes sense looking at the map).
  • The effect of “other,” or non-cropland, prior land use is positive meaning that reservoirs built on non-cropland areas were more likely to be built earlier, and reservoirs built on cropland later. This also makes sense to me.
  • The effect of area is weakly positive indicating larger reservoirs tend to be built earlier but this is not a strong main effect because of the significant interactions.
  • The effect of percent saturation at time of construction is very positive, meaning that reservoirs built at locations with more saturated aquifers were more likely to be built earlier. This still may not be a valid causal relationship but is potentially more causal than water table depth, because the percent saturation variable is more localized. However it should still be interpreted with caution.
  • The interaction between CGA (Grand Prairie) and prior land use (non-cropland) is weakly positive. This means that the effect where non-cropland was built earlier is stronger in Grand Prairie than Cache River. But we probably do not have sufficient evidence to really conclude that.
  • The interaction between CGA and area is weakly negative.
  • The interaction between CGA (Grand Prairie) and % saturation is negative. This means that the trend where reservoirs over more saturated aquifers were built earlier is weaker in Grand Prairie and stronger in Cache River. Again it is important to note this may not represent a valid causal relationship.
  • The three-way interaction between CGA, prior land use, and reservoir area is weakly positive. Three-way interactions are hard to interpret but the basic interpretation, which you can see better by looking at the area figure below, is that the size effect differs both by CGA and prior land use: the trend of large reservoirs being built earlier is strongest on non-cropland areas in Grand Prairie.

Here is a table with the same median and quantile credible interval values as presented in the figure above.

parameter median 66% QI 90% QI 95% QI
CGA (Grand Prairie effect) 0.442 (0.279, 0.6) (0.166, 0.722) (0.115, 0.783)
prior land use (non-cropland effect) 0.681 (0.465, 0.908) (0.271, 1.06) (0.209, 1.127)
reservoir area 0.212 (0.096, 0.324) (0.017, 0.405) (-0.023, 0.443)
% saturation at construction 0.620 (0.494, 0.75) (0.407, 0.831) (0.364, 0.867)
CGA × prior land use (Grand Prairie × non-cropland) 0.318 (0.083, 0.562) (-0.088, 0.741) (-0.164, 0.824)
CGA (Grand Prairie) × area −0.186 (-0.322, -0.055) (-0.417, 0.045) (-0.462, 0.089)
CGA (Grand Prairie) × % saturation −0.507 (-0.649, -0.366) (-0.744, -0.258) (-0.8, -0.211)
CGA × prior land use × area (Grand Prairie × non-cropland × area) 0.168 (0.078, 0.263) (0.013, 0.323) (-0.024, 0.353)
CGA × prior land use × % saturation (Grand Prairie × non-cropland × % saturation) 0.137 (0.05, 0.218) (-0.007, 0.281) (-0.033, 0.312)

Parameter estimates as hazard ratios

If we exponentiate the parameter estimates we get them expressed as hazard ratios associated with a change of 1 unit in the predictor. Thus a value >1 indicates the reservoirs are more likely to be built earlier as the predictor increases, and <1 indicates they are more likely to be built later.

hazard_ratio_qs <- samples_long[, median_qi(hazard_ratio, .width = c(0.66, 0.90, 0.95)), by = parameter]

Below is a similar plot to the one above but with all parameter median estimates and credible intervals transformed with exp() to express them as hazard ratios, as well as a table with the same format as the one above.

parameter median 66% QI 90% QI 95% QI
CGA (Grand Prairie effect) 1.556 (1.321, 1.822) (1.181, 2.059) (1.122, 2.189)
prior land use (non-cropland effect) 1.976 (1.592, 2.479) (1.312, 2.886) (1.233, 3.087)
reservoir area 1.237 (1.101, 1.382) (1.017, 1.5) (0.977, 1.557)
% saturation at construction 1.859 (1.639, 2.116) (1.502, 2.295) (1.439, 2.379)
CGA × prior land use (Grand Prairie × non-cropland) 1.375 (1.086, 1.753) (0.916, 2.098) (0.849, 2.281)
CGA (Grand Prairie) × area 0.830 (0.724, 0.946) (0.659, 1.046) (0.63, 1.093)
CGA (Grand Prairie) × % saturation 0.602 (0.523, 0.694) (0.475, 0.772) (0.449, 0.81)
CGA × prior land use × area (Grand Prairie × non-cropland × area) 1.183 (1.081, 1.3) (1.013, 1.381) (0.977, 1.423)
CGA × prior land use × % saturation (Grand Prairie × non-cropland × % saturation) 1.146 (1.051, 1.243) (0.993, 1.324) (0.967, 1.366)

Plots of reservoir-building trend by variable

In the following plots, the “stairstep” line shows the trend of time of building reservoirs observed in the data. It starts at 0% (no reservoirs have been built yet and all sites where reservoirs will eventually be built are empty) and grows to 100% when all reservoirs have been built. So a line that rises more steeply means that reservoirs were built more quickly. Superimposed on the observed data stairstep line is a trendline with the posterior median estimate from the model, which is a smooth trend of modeled building probability instead of the 5-year interval steps of the observed data. Surrounding each trend are shaded areas with the 66%, 90%, and 95% quantile credible intervals around the trend in progressively lighter shading.

The plots will (hopefully) show two things:

  • that the model does a reasonably good job of fitting the data because the modeled probabilities follow the observed trends fairly closely
  • where the trends differ by CGA and/or reservoir attribute, as shown by separation between the posterior estimates of the trends

I am showing three different comparisons of trends, in each case with a different panel for the Cache River and Grand Prairie areas:

  • Trend for each prior land use type within each CGA
  • Trend for large and small within each CGA
  • Trend for more saturated and less saturated at time of construction within each CGA

The trends are shown holding the other variables constant at median value, and only showing the trend for reservoirs built on cropland for the latter two plots. This does not affect the interpretation because no interaction between cropland and reservoir size or groundwater depth was modeled.

# Predictions for all the data combinations
# Calculate the area values for 25%, 50%, and 75% quantiles from the data
# Do separately for each CGA
qprobs <- c(.25, .5, .75)
q_area <- year_of_const_data[, list(area_bin = c('low', 'mid', 'high'), scaled_area = quantile(scaled_area, probs = qprobs)), by = .(CGA, Prior_LULC_binary, CGA_by_LULC)]
q_pctsat <- year_of_const_data[, list(pctsat_bin = c('low', 'mid', 'high'), scaled_pctsat = quantile(scaled_pctsat, probs = qprobs)), by = .(CGA, Prior_LULC_binary, CGA_by_LULC)]
pred_dat <- merge(q_area, q_pctsat, all = TRUE, allow.cartesian = TRUE)

# median cutoffs for each CGA x land use combination
med_area <- q_area[area_bin == 'mid']
med_area[, area_bin := NULL]
setnames(med_area, 'scaled_area', 'area_cutoff')
med_pctsat <- q_pctsat[pctsat_bin == 'mid']
med_pctsat[, pctsat_bin := NULL]
setnames(med_pctsat, 'scaled_pctsat', 'pctsat_cutoff')

ci_list <- map(c(.66, .9, .95), function(w) {
  ci <- survCIs(fit, newdata = pred_dat, q = 0:40, ci_level = w)
  ci_dt <- as.data.table(cbind(pred_dat, width = w))
  ci_dt[, data := ci$cis]
  ci_dt[, as.data.frame(data), by = c(names(pred_dat), 'width')]
})

ci_dt <- rbindlist(ci_list)
setnames(ci_dt, old = c('estimate..mean.', 'estimate..median.'), new = c('mean', 'median'))
ci_wide <- dcast(ci_dt, CGA + Prior_LULC_binary + CGA_by_LULC + area_bin + pctsat_bin + Time + mean + median ~ width, value.var = c('lower', 'upper'))
ci_wide[, year := Time + 1975]

# Get raw data summaries for step plot
years <- seq(1975, 2015, 5)

pct_built <- function(x) data.frame(year = years, built = cumsum(table(factor(x, levels=years)))/length(x))

nbuilt_cga_x_lulc <- year_of_const_data[, pct_built(last_year), by = .(CGA, Prior_LULC_binary, CGA_by_LULC)]

year_of_const_data[med_pctsat, on = .NATURAL, pctsat_cutoff := i.pctsat_cutoff]
year_of_const_data[med_area, on = .NATURAL, area_cutoff := i.area_cutoff]

year_of_const_data[, area_bin := ifelse(scaled_area <= area_cutoff, 'low', 'high')]
year_of_const_data[, pctsat_bin := ifelse(scaled_pctsat <= pctsat_cutoff, 'low', 'high')]
nbuilt_cga_x_area <- year_of_const_data[, pct_built(last_year), by = .(CGA, area_bin, Prior_LULC_binary, CGA_by_LULC)]
nbuilt_cga_x_pctsat <- year_of_const_data[, pct_built(last_year), by = .(CGA, pctsat_bin, Prior_LULC_binary, CGA_by_LULC)]

Trend by prior land use type for each CGA

In this figure, we can see that both in the observed and modeled data, reservoirs on non-cropland areas were built earlier than reservoirs on cropland (holding both size and % saturation constant at an average value). This effect is more pronounced in Grand Prairie. The uncertainty intervals are also narrower in Grand Prairie because there are many more data points there to fit the trend on.

Trend by reservoir size for each CGA & land use

In this figure, now that we have correctly incorporated the three-way interaction, we can see that larger reservoirs were built somewhat earlier than smaller ones (holding % saturation constant at an average value) only on non-cropland areas in Grand Prairie. Everywhere else (on cropland in Grand Prairie, and everywhere in Cache River), there is little to no effect of reservoir area on construction time. Again, the uncertainty intervals are also narrower in Grand Prairie because of larger sample size.

Trend by percent saturation for each CGA & land use

In this figure, we can see that both in the observed and modeled data, reservoirs over more saturated aquifers were built earlier than reservoirs over less saturated aquifers (holding reservoir size constant at an average value). This effect is somewhat weaker than the land use type effect, and the effect size is roughly similar in each CGA. As I have noted before, this should be interpreted with caution.