This document contains exploratory plots and tables summarizing the CGA reservoir dataset, as well as a fit of a survival model to explore various factors that may influence when reservoirs were constructed.
Results of the survival analysis show that prior land use, reservoir size, and percent saturation at time of construction may have influenced when reservoirs were built, and that some of those influences may have varied in strength between the two CGAs Grand Prairie and Cache River. In particular, we see there is a three-way interaction between CGA, prior land use, and reservoir size. It is that reservoir size influences construction time only in previously non-cropland areas and only in Grand Prairie. On all previously cropland areas, and all areas in Cache River, we see very little difference in construction time by reservoir size.
Load the data. Convert to a spatial object so that we can make maps
if needed. There is a mistake in one County value in the data. Correct
that. Create an ID column for each reservoir and also a numeric column
for the last year in the timespan we know it was built (so the latest
possible year each reservoir could have been built, for plotting
purposes). Combine empty and <Null>
entries to
"Unknown"
for prior land use. Also, load the county
boundaries for the state of Arkansas to use as background for the maps.
Project everything into a map projection appropriate for Arkansas.
library(data.table)
library(ggplot2)
library(sf)
library(USAboundaries)
library(purrr)
library(gt)
library(stringr)
library(survival)
library(survminer)
library(icenReg)
library(ggdist)
library(glue)
reservoirs <- fread('project/Yaeger_ReservoirDataset_Oct2022.csv')
reservoirs[County == '<=1975', County := 'Lonoke']
reservoirs[, ID := 1:.N]
reservoirs[, OID_ := NULL]
reservoirs[, LOCATIONID := NULL]
MSLcols <- grep('MSLft|MRVA', names(reservoirs), value = TRUE)
reservoirs[, c(MSLcols) := NULL]
reservoirs[, last_year := map_int(strsplit(TimeSpan, split = '[[:punct:]]', fixed = FALSE), ~ as.integer(.[length(.)]))]
reservoirs[Prior_LULC %in% c('', '<Null>'), Prior_LULC := 'Unknown']
reservoirs_sp <- st_as_sf(reservoirs, coords = c('X', 'Y'), crs = 4326) |>
st_transform(crs = 26952)
AR_counties <- us_counties(resolution = 'high', states = 'AR') |>
st_transform(crs = 26952)
Make a long-form time series of groundwater depth by year and identify for each year whether the reservoir at that location was built previously, was built this year (or during the 5-year time span immediately preceding the date when groundwater depth was measured), or had yet to be built at that time. edit: Add percent saturation to this time series.
static_attrs <- c('ID', 'CGA', 'Type', 'County', 'Area_ha', 'Area_ac', 'TimeSpan', 'last_year', 'Prior_LULC', 'X', 'Y')
reservoir_depth_timeseries <- melt(reservoirs, id.vars = static_attrs, variable.name = 'year', value.name = 'depth')
reservoir_depth_timeseries[, ts_var := ifelse(grepl('^PctSat', year), 'pct_sat', 'depth')]
reservoir_depth_timeseries[, year := as.numeric(gsub('DTW_Y|PctSat', '', year))]
reservoir_depth_timeseries[, built_this_year := fcase(year > last_year, 'built previously', year == last_year, 'built this year', year < last_year, 'will be built later')]
reservoir_depth_timeseries <- dcast(reservoir_depth_timeseries, paste(paste(static_attrs, collapse = '+'), '+ built_this_year + year ~ ts_var'), value.var = 'depth')
Note: Code is not shown for this portion.
Plot the reservoirs on a map with size of point corresponding to its area, shape corresponding to whether it’s a pond or a reservoir, and color for when it was built. Dark colors indicate older and light colors newer. Note the size of the point is proportional to the area of the reservoir on a log scale. It looks like the majority of the older reservoirs are in the southern region and there is a much higher density of reservoirs there, and a greater number of large reservoirs in the south (basically all of the ones with >100 hectares size are down in the southern part). You probably already know all this but it is useful for me to visualize.
Make some tables or plots of the different attributes of the reservoirs, both individually and crossed with one another.
Here is a histogram of reservoir area by CGA. Notice log scale x-axis. Grand Prairie has more overall as well as many more in the large end of the distribution.
Here are the same histograms as above, with separate panels for ponds and reservoirs.
Here are tables of reservoir counts by some attribute crossed with CGA, with some colors to help visualize.
timespan counts by CGA | |||
timespan when built | Cache River | Grand Prairie | total |
---|---|---|---|
<=1975 | 21 | 178 | 199 |
1976-1980 | 8 | 36 | 44 |
1981-1985 | 10 | 51 | 61 |
1986-1990 | 17 | 40 | 57 |
1991-1995 | 10 | 10 | 20 |
1996-2000 | 18 | 76 | 94 |
2001-2005 | 21 | 135 | 156 |
2006-2010 | 14 | 53 | 67 |
2011-2015 | 24 | 53 | 77 |
prior land use counts by CGA | |||
prior land use | Cache River | Grand Prairie | total |
---|---|---|---|
Cropland | 75 | 296 | 371 |
Mixed: Cropland and Pond | 4 | 14 | 18 |
Mixed: Cropland and Wooded area | 2 | 6 | 8 |
Mixed: Cropland, Natural Feature | 17 | 22 | 39 |
Mixed: Edge of Field, Natural Feature | 11 | 68 | 79 |
Natural low-lying feature | 8 | 35 | 43 |
Wooded area | 5 | 13 | 18 |
Unknown | 21 | 178 | 199 |
waterbody type counts by CGA | |||
type of waterbody | Cache River | Grand Prairie | total |
---|---|---|---|
Pond | 25 | 74 | 99 |
Reservoir | 118 | 558 | 676 |
Here is a time series plot showing, for each reservoir individually, a time trend of depth to groundwater (partially transparent because many trends are stacked on top of each other). Superimposed on that are boxplots of the depth of groundwater separated by when the reservoirs at each time point were built (using the last year of the timespan that we know it was built in).
The plot obviously shows the water table getting deeper over time in both CGAs. To me, just eyeballing it right now, it does seem that the new reservoirs being built in each year, in both CGAs, basically track the average depth to groundwater prevailing at that time. (The distributions of the data in the blue boxplots are not any different than the distributions in the red and green boxplots.) So that would imply, in a way, that reservoir builders aren’t basing their decision of whether to build a reservoir on how deep the groundwater is at that location. They are building them based on other reasons, so the depth to groundwater at that location is basically random with respect to their decision. As a result, it just tracks the average for that region. This especially appears to be true in the Grand Prairie CGA. We could make a statistical model to estimate actual parameters for that trend but for now this is just an exploratory idea.
One other thing to note, which I mentioned in the Teams conversation the other day, is that it would also be enlightening to have groundwater depth data for locations in which no reservoirs were ever built. That would really help nail down if this is a significant trend. We might find that locations without reservoirs (where no one has ever built a reservoir to date) have much shallower depths to groundwater than the places where reservoirs have been built at some point in history. That would mean that yes, reservoirs are being built in places where it is harder to access groundwater than places where they aren’t being built (which would be a different conclusion than what I am drawing from just eyeballing the graph here). If the groundwater depth data is being derived from some continuous data layer, we could actually use that to make that comparison.
Here is the same plot as above, plotting only the reservoirs and not the ponds. To me the broad trends look very similar.
This plot only shows the new reservoirs built each year in each CGA, and a distribution of their areas in hectares. As above when we were dealing with area, notice logarithmic scale on the y-axis.
Again I am basing this on an eyeball test only, but I do not see much of a trend in the size of new reservoir construction over time. A statistical model could be used to look more rigorously but this figure does not seem to show much of one. (But note especially in Cache River that new reservoir construction after 1995 may be bigger than old ones, though it is not a strong trend.)
Here is the same plot as above but with an arithmetic y-axis.
Here is the arithmetic y-axis plot excluding all reservoirs greater than 50 ha in area (only 26 of the 775 are removed from this plot).
I decided to plot both ponds and reservoirs as separate bars here. It does not change the overall story that there does not seem to be any obvious trend in size of new construction of reservoirs over time.
These plots were added on October 19. They are the same format as the groundwater depth plots, with year on the x-axis, but using the percent saturation variable on the y-axis instead.
The percent saturation trend is a little different than the depth trend, especially now that the corrections were made (Oct 21 version). We see a peak in 1995 for many of the locations in Grand Prairie but not Cache River. But we do also see the general downward trend resulting from aquifer depletion in both regions. We still do not see a big difference between the percent saturation and the status of reservoir construction in any given year, just as we didn’t for depth.
I decided that the best way to analyze this dataset is survival
analysis, otherwise known as time-to-event analysis. As I said before,
it would be an even better model if we had the history of some locations
where reservoirs were not built. However, we can do a decent job with
this dataset. We have left-censoring in the dataset because some of the
reservoirs were built before 1975 and we do not know exactly when. Also,
we have 5-year intervals for timespan so we have to account for the fact
that we can only say the reservoir was built sometime in that 5-year
interval. The R package survival
is able to account for
both of those things.
Let’s use a proportional-hazards method to estimate survival curves
as a function of the different properties: location (CGA), area
(log-transformed), percent aquifer saturation at time of construction,
and prior land cover. Create a column for the lower and upper bound of
when the reservoir was built from the timespan column. If it was built
before 1975, use NA
for the lower bound. Also adjust so
that 1975 is year 0.
Another adjustment is that there are a large number of prior land use categories. To simplify the model, let’s categorize them into two categories: one is cropland and all the mixed categories that contain cropland, the other is everything else that is not cropland, including forest, natural features, field edge, and unknown.
Finally, standardize area and percent saturation (after log transforming area) so that we can compare the coefficients.
edits: On October 19, I edited the model in a few ways:
year_of_const_data <- reservoir_depth_timeseries[year == last_year & Type == 'Reservoir']
year_of_const_data[, lower_bound := as.numeric(str_extract(TimeSpan, '^[0-9]+')) - 1975]
year_of_const_data[, upper_bound := as.numeric(str_extract(TimeSpan, '[0-9]+$')) - 1975]
year_of_const_data[, Prior_LULC_binary := ifelse(grepl('Cropland', Prior_LULC), 'cropland', 'other')]
scaled_data <- year_of_const_data[, .(Area_ha, pct_sat)]
scaled_data[, log_area := log(Area_ha)]
scaled_data <- scale(scaled_data)
year_of_const_data[, scaled_area := scaled_data[, 'log_area']]
year_of_const_data[, scaled_pctsat := scaled_data[, 'pct_sat']]
# Manually create CGA by land use interaction term because it seems like there is a bug creating 3-way interactions in ic_bayes
year_of_const_data[, CGA_by_LULC := factor(CGA == 'Grand Prairie' & Prior_LULC_binary == 'other')]
# icenReg fit
# Note: samples = number of samples across all chains.
set.seed(1021)
fit <- ic_bayes(Surv(time = lower_bound, time2 = upper_bound, type = 'interval2') ~ CGA + Prior_LULC_binary + scaled_area + scaled_pctsat + CGA_by_LULC + CGA:scaled_area + CGA:scaled_pctsat + CGA_by_LULC:scaled_area + CGA_by_LULC:scaled_pctsat,
data = year_of_const_data,
model = 'ph', dist = 'loglogistic',
controls = bayesControls(samples = 4000, chains = 4, burnIn = 9000, thin = 10))
First let’s look at a table and graph of parameter estimates. It is somewhat counter-intuitive to interpret these results, and it confused me at first. A negative coefficient on a variable means that as that variable increases, the “relative hazard” of having a reservoir built increases. So a positive coefficient means that higher values of the variable are associated with reservoirs being built earlier, and a negative coefficient means that higher values of the variable are associated with reservoirs being built later.
samples_long <- as.data.table(fit$samples)
samples_long[, id := 1:nrow(.SD)]
samples_long <- melt(samples_long, id.vars = 'id', variable.name = 'parameter')
samples_long[, hazard_ratio := exp(value)]
param_qs <- samples_long[, median_qi(value, .width = c(0.66, 0.90, 0.95)), by = parameter]
Because this figure shows standardized coefficients, we can compare them. The point estimates of the coefficients are black dots, and the 66%, 90%, and 95% quantile credible intervals are in progressively lighter shades of blue. A dotted line is plotted at zero (no effect). The coefficients are counter-intuitive, so here is my interpretation of the results edited after updating model:
Here is a table with the same median and quantile credible interval values as presented in the figure above.
parameter | median | 66% QI | 90% QI | 95% QI |
---|---|---|---|---|
CGA (Grand Prairie effect) | 0.442 | (0.279, 0.6) | (0.166, 0.722) | (0.115, 0.783) |
prior land use (non-cropland effect) | 0.681 | (0.465, 0.908) | (0.271, 1.06) | (0.209, 1.127) |
reservoir area | 0.212 | (0.096, 0.324) | (0.017, 0.405) | (-0.023, 0.443) |
% saturation at construction | 0.620 | (0.494, 0.75) | (0.407, 0.831) | (0.364, 0.867) |
CGA × prior land use (Grand Prairie × non-cropland) | 0.318 | (0.083, 0.562) | (-0.088, 0.741) | (-0.164, 0.824) |
CGA (Grand Prairie) × area | −0.186 | (-0.322, -0.055) | (-0.417, 0.045) | (-0.462, 0.089) |
CGA (Grand Prairie) × % saturation | −0.507 | (-0.649, -0.366) | (-0.744, -0.258) | (-0.8, -0.211) |
CGA × prior land use × area (Grand Prairie × non-cropland × area) | 0.168 | (0.078, 0.263) | (0.013, 0.323) | (-0.024, 0.353) |
CGA × prior land use × % saturation (Grand Prairie × non-cropland × % saturation) | 0.137 | (0.05, 0.218) | (-0.007, 0.281) | (-0.033, 0.312) |
If we exponentiate the parameter estimates we get them expressed as hazard ratios associated with a change of 1 unit in the predictor. Thus a value >1 indicates the reservoirs are more likely to be built earlier as the predictor increases, and <1 indicates they are more likely to be built later.
hazard_ratio_qs <- samples_long[, median_qi(hazard_ratio, .width = c(0.66, 0.90, 0.95)), by = parameter]
Below is a similar plot to the one above but with all parameter
median estimates and credible intervals transformed with
exp()
to express them as hazard ratios, as well as a table
with the same format as the one above.
parameter | median | 66% QI | 90% QI | 95% QI |
---|---|---|---|---|
CGA (Grand Prairie effect) | 1.556 | (1.321, 1.822) | (1.181, 2.059) | (1.122, 2.189) |
prior land use (non-cropland effect) | 1.976 | (1.592, 2.479) | (1.312, 2.886) | (1.233, 3.087) |
reservoir area | 1.237 | (1.101, 1.382) | (1.017, 1.5) | (0.977, 1.557) |
% saturation at construction | 1.859 | (1.639, 2.116) | (1.502, 2.295) | (1.439, 2.379) |
CGA × prior land use (Grand Prairie × non-cropland) | 1.375 | (1.086, 1.753) | (0.916, 2.098) | (0.849, 2.281) |
CGA (Grand Prairie) × area | 0.830 | (0.724, 0.946) | (0.659, 1.046) | (0.63, 1.093) |
CGA (Grand Prairie) × % saturation | 0.602 | (0.523, 0.694) | (0.475, 0.772) | (0.449, 0.81) |
CGA × prior land use × area (Grand Prairie × non-cropland × area) | 1.183 | (1.081, 1.3) | (1.013, 1.381) | (0.977, 1.423) |
CGA × prior land use × % saturation (Grand Prairie × non-cropland × % saturation) | 1.146 | (1.051, 1.243) | (0.993, 1.324) | (0.967, 1.366) |
In the following plots, the “stairstep” line shows the trend of time of building reservoirs observed in the data. It starts at 0% (no reservoirs have been built yet and all sites where reservoirs will eventually be built are empty) and grows to 100% when all reservoirs have been built. So a line that rises more steeply means that reservoirs were built more quickly. Superimposed on the observed data stairstep line is a trendline with the posterior median estimate from the model, which is a smooth trend of modeled building probability instead of the 5-year interval steps of the observed data. Surrounding each trend are shaded areas with the 66%, 90%, and 95% quantile credible intervals around the trend in progressively lighter shading.
The plots will (hopefully) show two things:
I am showing three different comparisons of trends, in each case with a different panel for the Cache River and Grand Prairie areas:
The trends are shown holding the other variables constant at median value, and only showing the trend for reservoirs built on cropland for the latter two plots. This does not affect the interpretation because no interaction between cropland and reservoir size or groundwater depth was modeled.
# Predictions for all the data combinations
# Calculate the area values for 25%, 50%, and 75% quantiles from the data
# Do separately for each CGA
qprobs <- c(.25, .5, .75)
q_area <- year_of_const_data[, list(area_bin = c('low', 'mid', 'high'), scaled_area = quantile(scaled_area, probs = qprobs)), by = .(CGA, Prior_LULC_binary, CGA_by_LULC)]
q_pctsat <- year_of_const_data[, list(pctsat_bin = c('low', 'mid', 'high'), scaled_pctsat = quantile(scaled_pctsat, probs = qprobs)), by = .(CGA, Prior_LULC_binary, CGA_by_LULC)]
pred_dat <- merge(q_area, q_pctsat, all = TRUE, allow.cartesian = TRUE)
# median cutoffs for each CGA x land use combination
med_area <- q_area[area_bin == 'mid']
med_area[, area_bin := NULL]
setnames(med_area, 'scaled_area', 'area_cutoff')
med_pctsat <- q_pctsat[pctsat_bin == 'mid']
med_pctsat[, pctsat_bin := NULL]
setnames(med_pctsat, 'scaled_pctsat', 'pctsat_cutoff')
ci_list <- map(c(.66, .9, .95), function(w) {
ci <- survCIs(fit, newdata = pred_dat, q = 0:40, ci_level = w)
ci_dt <- as.data.table(cbind(pred_dat, width = w))
ci_dt[, data := ci$cis]
ci_dt[, as.data.frame(data), by = c(names(pred_dat), 'width')]
})
ci_dt <- rbindlist(ci_list)
setnames(ci_dt, old = c('estimate..mean.', 'estimate..median.'), new = c('mean', 'median'))
ci_wide <- dcast(ci_dt, CGA + Prior_LULC_binary + CGA_by_LULC + area_bin + pctsat_bin + Time + mean + median ~ width, value.var = c('lower', 'upper'))
ci_wide[, year := Time + 1975]
# Get raw data summaries for step plot
years <- seq(1975, 2015, 5)
pct_built <- function(x) data.frame(year = years, built = cumsum(table(factor(x, levels=years)))/length(x))
nbuilt_cga_x_lulc <- year_of_const_data[, pct_built(last_year), by = .(CGA, Prior_LULC_binary, CGA_by_LULC)]
year_of_const_data[med_pctsat, on = .NATURAL, pctsat_cutoff := i.pctsat_cutoff]
year_of_const_data[med_area, on = .NATURAL, area_cutoff := i.area_cutoff]
year_of_const_data[, area_bin := ifelse(scaled_area <= area_cutoff, 'low', 'high')]
year_of_const_data[, pctsat_bin := ifelse(scaled_pctsat <= pctsat_cutoff, 'low', 'high')]
nbuilt_cga_x_area <- year_of_const_data[, pct_built(last_year), by = .(CGA, area_bin, Prior_LULC_binary, CGA_by_LULC)]
nbuilt_cga_x_pctsat <- year_of_const_data[, pct_built(last_year), by = .(CGA, pctsat_bin, Prior_LULC_binary, CGA_by_LULC)]
In this figure, we can see that both in the observed and modeled data, reservoirs on non-cropland areas were built earlier than reservoirs on cropland (holding both size and % saturation constant at an average value). This effect is more pronounced in Grand Prairie. The uncertainty intervals are also narrower in Grand Prairie because there are many more data points there to fit the trend on.
In this figure, now that we have correctly incorporated the three-way interaction, we can see that larger reservoirs were built somewhat earlier than smaller ones (holding % saturation constant at an average value) only on non-cropland areas in Grand Prairie. Everywhere else (on cropland in Grand Prairie, and everywhere in Cache River), there is little to no effect of reservoir area on construction time. Again, the uncertainty intervals are also narrower in Grand Prairie because of larger sample size.
In this figure, we can see that both in the observed and modeled data, reservoirs over more saturated aquifers were built earlier than reservoirs over less saturated aquifers (holding reservoir size constant at an average value). This effect is somewhat weaker than the land use type effect, and the effect size is roughly similar in each CGA. As I have noted before, this should be interpreted with caution.