Summary

This document is supplementary information for the manuscript:

Taliercio, E., D. Eickholt, Q. D. Read, T. Carter, N. Waldeck, and B. Fallen. 2023. Parental Choice and Intrapopulation Selection for Seed Size Impacts Uprightness of Progeny Derived from Interspecific Hybridization between Glycine max and Glycine soja.

This includes all code required to reproduce the analysis described as “Experiment 3” in the manuscript.

Change log

13 Feb 2023: combine all analysis into single notebook
09 Feb 2023: update Bayes Factor code for new package versions
03 Feb 2023: additional minor edits to labels on figures
13 Jan 2023: minor edits to figure formatting
29 Mar 2022: clean version to be included with repo (for initial submission)
08 Mar 2022: add post-hoc contrasts and Bayes factors to specifically test some of the differences between groups; improve figures and tables
13 Jan 2022: first version

Setup

Load the necessary packages.

library(data.table)

library(ggplot2)

library(tidyr)

library(brms)

library(emmeans)

library(tidybayes)

library(ggtext)

library(bayestestR)

library(effectsize)

library(Rutilitybelt) # This package may be installed by calling remotes::install_github('qdread/Rutilitybelt')



options(mc.cores = 4, brms.backend = 'cmdstanr', brms.file_refit = 'never')

Load data and set factor levels.

dat <- fread('data/counts_seedwt.csv')



dat[, size_ordered := ordered(size, levels = c('< 8 3/4', '> 8 3/4', '> 10', '> 10 3/4', '> 11 1/2', '> 12'))]

dat[, size_combined := factor(size_combined, levels = c('< 8 3/4', '> 8 3/4', '> 10 3/4', '> 11 1/2', '> 12'))]

dat[, soja := factor(soja)]

Exploratory data viz

Make plots showing the values of number selected per number planted for the different combinations of predictors. Logit transform on y axis.

pct_scale <- scale_y_continuous(trans = 'logit', name = 'percent selected', breaks = c(0.001, 0.003, 0.01, 0.03, 0.10), labels = scales::percent_format(accuracy = 0.1))

soja_facet <- facet_wrap(~ soja, labeller = as_labeller(function(x) paste0('*G. soja* parent: PI', x)))



ggplot(dat |> transform(F3_location = ifelse(dat$F3_location == 'CLA', 'CCRS', 'UCPRS'),

                        F2_location = ifelse(dat$F2_location == 'Caswell', 'CRS', 'LCPRS'),

                        max = gsub('NC ', 'NC-', max)), 

       aes(x = max, y = n_selected/n_planted, fill = interaction(F2_location, F3_location, sep = ' &times; '))) +

  geom_boxplot(position = 'dodge') +

  soja_facet +

  labs(x = '*G. max* parent', fill = 'F<sub>2</sub> location &times; F<sub>3</sub> location') +

  pct_scale +

  theme(legend.position = 'bottom',

        axis.title.x = element_markdown(),

        legend.title = element_markdown(),

        legend.text = element_markdown(),

        strip.text = element_markdown())

#ggsave('C:/Users/qdread/onedrive_usda/ars_projects/taliercio/figs/FIG09_selectabilitybynursery.png', dpi = 400, height = 4, width = 7.5)

Also look at seed size.

ggplot(dat |> transform(max = gsub('NC ', 'NC-', max)), aes(x = max, y = n_selected/n_planted, fill = size_ordered)) +

  geom_boxplot(position = 'dodge') +

  soja_facet + pct_scale +

  scale_fill_viridis_d(name = 'seed size class', 

                       labels = c('&#60; 8 &frac34;', '&#62; 8 &frac34;', '&#62; 10', '&#62; 10 &frac34;', '&#62; 11 &frac12;', '&#62; 12')) +

  labs(x = '*G. max* parent') +

  theme(legend.position = 'bottom',

        axis.title.x = element_markdown(),

        strip.text = element_markdown(),

        legend.text = element_markdown())

#ggsave('C:/Users/qdread/onedrive_usda/ars_projects/taliercio/figs/FIG10_selectabilitybysizeclass.png', dpi = 400, height = 4, width = 7.5)

Fitting model

The statstical model should answer the following research questions:

Does the G. max parent have an effect on number of harvestable plants?
Does the wild parent have an effect on number of harvestable plants? (Specifically, does 407191 produce more harvestable plants?)
Is there an effect of seed size on number of harvestable plants, or is there an interaction between seed size and G. max and/or G. soja parent?
Is there an effect of F2 or F3 location that needs to be considered?

Based on this we will have a 3-way interaction between G. max parent, G. soja parent, and seed size as the fixed effects, and F2 nursery location and F3 nursery location as crossed random effects (random intercepts).

The other consideration is that the response variable is number selected per number planted. This is a binomial trials situation, where we are modeling how the probability of being selected varies with our different fixed and random effects. Each seed planted is a binary trial that has a certain probability of being selected (outcome = 1) or not selected (outcome = 0).

All of this can be combined into a single Bayesian binomial mixed-effects regression model that will enable us to answer all the questions that the design is set up to answer, while properly controlling for any random effects that might arise from the different nurseries.

In the following model, I am using the continuous variable for 100-seed weight to represent seed size. That makes the model a lot easier to deal with than if we used the categories. The formula includes the full three-way interaction and the random intercept terms. It is a binomial model with a logit link function, meaning we are comparing how the log odds changes among the different groups. Some priors are added to limit the coefficients to a reasonable range centered around zero which makes the model converge better.

fit_binom_seedwt_full <- brm(

  n_selected | trials(n_planted_trials) ~ max * soja * ave_100sw + (1|F2_location) + (1|F3_location),

  data = dat, family = binomial,

  prior = c(

    prior('student_t(3, 0, 3)', class = 'b')

  ),

  iter = 6000, warmup = 4000, chains = 4, seed = 910, 

  file = 'data/fit_binomial_seedwt_full'

)

First, let’s look at the model diagnostics to make sure it converged. All Rhat statistics (potential scale reduction factor which should be less than 1.05 for all parameters) are close to 1 which means it converged.

summary(fit_binom_seedwt_full)

##  Family: binomial 

##   Links: mu = logit 

## Formula: n_selected | trials(n_planted_trials) ~ max * soja * ave_100sw + (1 | F2_location) + (1 | F3_location) 

##    Data: dat (Number of observations: 160) 

##   Draws: 4 chains, each with iter = 6000; warmup = 4000; thin = 1;

##          total post-warmup draws = 8000

## 

## Group-Level Effects: 

## ~F2_location (Number of levels: 2) 

##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS

## sd(Intercept)     0.49      0.63     0.02     2.25 1.01     1205     1351

## 

## ~F3_location (Number of levels: 2) 

##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS

## sd(Intercept)     0.45      0.55     0.02     2.03 1.01     1152     2002

## 

## Population-Level Effects: 

##                                    Estimate Est.Error l-95% CI u-95% CI Rhat

## Intercept                             -3.54      0.51    -4.61    -2.40 1.01

## maxNCDunphy                            0.81      0.19     0.42     1.19 1.00

## maxNCRaleigh                          -0.30      0.22    -0.73     0.12 1.00

## maxTCHM06M240                         -1.11      0.22    -1.55    -0.66 1.00

## soja425045                            -2.98      0.25    -3.46    -2.50 1.00

## ave_100sw                             -0.04      0.03    -0.10     0.02 1.00

## maxNCDunphy:soja425045                -1.72      0.30    -2.31    -1.13 1.00

## maxNCRaleigh:soja425045                0.99      0.34     0.32     1.65 1.00

## maxTCHM06M240:soja425045              -0.47      0.33    -1.11     0.17 1.00

## maxNCDunphy:ave_100sw                 -0.08      0.03    -0.15    -0.01 1.00

## maxNCRaleigh:ave_100sw                 0.03      0.04    -0.04     0.11 1.00

## maxTCHM06M240:ave_100sw                0.18      0.04     0.11     0.26 1.00

## soja425045:ave_100sw                   0.19      0.05     0.10     0.28 1.00

## maxNCDunphy:soja425045:ave_100sw       0.25      0.05     0.14     0.35 1.00

## maxNCRaleigh:soja425045:ave_100sw     -0.28      0.06    -0.39    -0.16 1.00

## maxTCHM06M240:soja425045:ave_100sw     0.07      0.06    -0.04     0.18 1.00

##                                    Bulk_ESS Tail_ESS

## Intercept                              1759     1565

## maxNCDunphy                            1764     2740

## maxNCRaleigh                           1922     3004

## maxTCHM06M240                          1841     2869

## soja425045                             1403     2591

## ave_100sw                              1525     2716

## maxNCDunphy:soja425045                 1456     2671

## maxNCRaleigh:soja425045                1718     3017

## maxTCHM06M240:soja425045               1741     3114

## maxNCDunphy:ave_100sw                  1629     2668

## maxNCRaleigh:ave_100sw                 1769     2819

## maxTCHM06M240:ave_100sw                1693     2798

## soja425045:ave_100sw                   1338     2638

## maxNCDunphy:soja425045:ave_100sw       1296     2477

## maxNCRaleigh:soja425045:ave_100sw      1616     2847

## maxTCHM06M240:soja425045:ave_100sw     1586     2706

## 

## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS

## and Tail_ESS are effective sample size measures, and Rhat is the potential

## scale reduction factor on split chains (at convergence, Rhat = 1).

Graphs of model results

Now let’s look at the random effect estimates. On this plot and the following, the point is the median, the thick bar is the 66% quantile credible interval (about 1 SD), and the thin bar is the 95% quantile credible interval (about 2 SD).

f2_effect <- fit_binom_seedwt_full %>%

  spread_draws(r_F2_location[F2_location,term])

f3_effect <- fit_binom_seedwt_full %>%

  spread_draws(r_F3_location[F3_location,term])



ranef_both <- rbindlist(list(f2_effect,f3_effect), use.names = FALSE)

ranef_both[, location := factor(F2_location, levels = rev(c('Caswell','Hugo','FF','CLA')), labels = rev(c('CRS (F<sub>2</sub>)','LCPRS (F<sub>2</sub>)','UCPRS (F<sub>3</sub>)','CCRS (F<sub>3</sub>)')))]



ggplot(ranef_both,

       aes(y = location, x = r_F2_location)) +

  stat_sample_slabinterval(geom = 'pointinterval', point_interval = median_qi) +

  geom_vline(xintercept = 0, color = "red", lty = 'dashed') +

  scale_x_continuous(limits = c(-0.5, 0.5)) +

  labs(y = 'nursery', x = 'effect') +

  theme(axis.text.y = element_markdown())

#ggsave('C:/Users/qdread/onedrive_usda/ars_projects/taliercio/figs/randomeffects.png', dpi = 400, height = 4, width = 5)

Next, plot the estimated marginal means for the different crosses. Here we see that the probability of selecting the progeny is uniformly higher for soja parent PI407191. The NC-Raleigh max parent may be a little worse and the NC-Dunphy max parent a little better, but this is a relatively weak effect and does not depend on the soja parent.

emm_max_soja <- emmeans(fit_binom_seedwt_full, ~ max + soja, type = "response")



max_soja_toplot <- gather_emmeans_draws(emm_max_soja) |> setDT()

max_soja_toplot[, p := plogis(.value)]



ggplot(max_soja_toplot, aes(y = max, x = p)) +

  stat_sample_slabinterval(geom = 'pointinterval', point_interval = median_qi) +

  facet_wrap(~ soja, nrow = 2, labeller = labeller(soja = function(x) paste('*G. soja* parent:', x))) +

  scale_x_continuous(name = 'probability of selection', trans = 'logit', breaks = c(.001, .002, .005, .01, .02, .05, .10), expand = expansion(mult = c(0, 0)), limits = c(0.0005, 0.11), labels = scales::percent_format(accuracy = 0.1)) +

  labs(y = '*G. max* parent') +

  theme(axis.title.y = element_markdown(), strip.text = element_markdown())

#ggsave('marginalmeans_crosses.png', dpi = 400, height = 7, width = 5)

These are the estimated marginal trends (slopes of the selectability vs. seed weight line) for the different crosses, as well as for the individual max and soja parents averaged across the other parent. Most of them are positive as expected (increasing seed weight leads to higher percent progeny selected) but not universally. Some are neutral or even negative. The trend noticed in the exploratory plots, that there is a positive trend for selectability with seed weight in PI425045 but not PI407191, is supported by the model.

swt_trends_max_soja <- emtrends(fit_binom_seedwt_full, ~ max + soja, var = 'ave_100sw', type = 'response')



swt_trends_toplot <- gather_emmeans_draws(swt_trends_max_soja) |> setDT()



ggplot(swt_trends_toplot, aes(y = max, x = .value)) +

  stat_sample_slabinterval(geom = 'pointinterval', point_interval = median_qi) +

  geom_vline(xintercept = 0, color = 'red', linetype = 'dashed') +

  facet_wrap(~ soja, nrow = 2, labeller = labeller(soja = function(x) paste('*G. soja* parent:', x))) +

  scale_x_continuous(name = 'slope of selectability vs. seed weight trend') +

  labs(y = '*G. max* parent') +

  theme(axis.title.y = element_markdown(), strip.text = element_markdown())

#ggsave('seedweight_slopes_crosses.png', dpi = 400, height = 7, width = 5)



swt_trends_max <- emtrends(fit_binom_seedwt_full, ~ max, var = 'ave_100sw', type = 'response')



swt_trends_max_toplot <- gather_emmeans_draws(swt_trends_max) |> setDT()



ggplot(swt_trends_max_toplot, aes(y = max, x = .value)) +

  stat_sample_slabinterval(geom = 'pointinterval', point_interval = median_qi) +

  geom_vline(xintercept = 0, color = 'red', linetype = 'dashed') +

  scale_x_continuous(name = 'slope of selectability vs. seed weight trend') +

  labs(y = '*G. max* parent') +

  theme(axis.title.y = element_markdown(), strip.text = element_markdown())

#ggsave('seedweight_slopes_max.png', dpi = 400, height = 4, width = 5)



swt_trends_soja <- emtrends(fit_binom_seedwt_full, ~ soja, var = 'ave_100sw', type = 'response')



swt_trends_soja_toplot <- gather_emmeans_draws(swt_trends_soja) |> setDT()



ggplot(swt_trends_soja_toplot, aes(y = soja, x = .value)) +

  stat_sample_slabinterval(geom = 'pointinterval', point_interval = median_qi) +

  geom_vline(xintercept = 0, color = 'red', linetype = 'dashed') +

  scale_x_continuous(name = 'slope of selectability vs. seed weight trend') +

  labs(y = '*G. soja* parent') +

  theme(axis.title.y = element_markdown(), strip.text = element_markdown())

#ggsave('seedweight_slopes_soja.png', dpi = 400, height = 4, width = 5)

Post hoc contrasts

Contrasts between pairs of marginal means allow us to make quantifiable statements about differences between the means in the manuscript. These contrasts are odds ratios (ratio of the selection odds between the two groups being compared). The Bayesian model does not require any correction for multiple comparisons. An odds ratio of 1, or log odds ratio of 0, means no difference between the two groups, i.e. an identical selection probability in both.

emm_max_soja_raw <- emmeans(fit_binom_seedwt_full, ~ max | soja)

contr_max_soja <- contrast(emm_max_soja_raw, 'pairwise')



emm_max_raw <- emmeans(fit_binom_seedwt_full, ~ max)

contr_max <- contrast(emm_max_raw, 'pairwise')



emm_soja_raw <- emmeans(fit_binom_seedwt_full, ~ soja)

contr_soja <- contrast(emm_soja_raw, 'pairwise')

Post hoc contrasts for the slope trends.

swt_trends_max_bysoja <- emtrends(fit_binom_seedwt_full, ~ max | soja, var = 'ave_100sw', type = 'response')



swt_contr_max_soja <- contrast(swt_trends_max_bysoja, 'pairwise')

swt_contr_max <- contrast(swt_trends_max, 'pairwise')

swt_contr_soja <- contrast(swt_trends_soja, 'pairwise')

Bayes factors

To assessing the strength of evidence for the different effects and comparisons in our model we calculate Bayes factors (BF). Our prior estimate for the difference between the treatments was centered around zero, so the prior essentially represents a null hypothesis. If BF = 1, we have not changed our belief about the parameter at all. The prior estimate (of zero) is just as likely to be true as before. If BF > 1, we have evidence for our new estimate being true. The higher the BF, the more evidence. But if BF < 1, we have more evidence of the prior (zero) estimate being true than we did before we fit the model.

Bayes factors for the pairwise contrasts between the selectability estimates for the different crosses. This requires finding versions of the estimated marginal means and contrasts using the prior model.

fit_prior <- unupdate(fit_binom_seedwt_full)



contr_max_prior <- emmeans(fit_prior, ~ max) |> contrast('pairwise')

contr_soja_prior <- emmeans(fit_prior, ~ soja) |> contrast('pairwise')

contr_max_soja_prior <- emmeans(fit_prior, ~ max | soja) |> contrast('pairwise')



bf_contr_max <- bayesfactor(contr_max, prior = contr_max_prior)

bf_contr_soja <- bayesfactor(contr_soja, prior = contr_soja_prior)

bf_contr_max_soja <- bayesfactor(contr_max_soja, prior = contr_max_soja_prior)



interpbf_contr_max <- interpret_bf(bf_contr_max$log_BF, log = TRUE)

interpbf_contr_soja <- interpret_bf(bf_contr_soja$log_BF, log = TRUE)

interpbf_contr_max_soja <- interpret_bf(bf_contr_max_soja$log_BF, log = TRUE)

Bayes factors for the seed weight slope trends.

swt_trends_max_soja_prior <- emtrends(fit_prior, ~ max | soja, var = 'ave_100sw', type = 'response')

swt_trends_max_prior <- emtrends(fit_prior, ~ max, var = 'ave_100sw', type = 'response')

swt_trends_soja_prior <- emtrends(fit_prior, ~ soja, var = 'ave_100sw', type = 'response')



bf_swt_max_soja <- bayesfactor(swt_trends_max_soja, prior = swt_trends_max_soja_prior)

bf_swt_max <- bayesfactor(swt_trends_max, prior = swt_trends_max_prior)

bf_swt_soja <- bayesfactor(swt_trends_soja, prior = swt_trends_soja_prior)



interpbf_swt_max_soja <- interpret_bf(bf_swt_max_soja$log_BF, log = TRUE)

interpbf_swt_max <- interpret_bf(bf_swt_max$log_BF, log = TRUE)

interpbf_swt_soja <- interpret_bf(bf_swt_soja$log_BF, log = TRUE)

Bayes factors for the pairwise contrasts between the seed weight slopes.

swt_contr_max_soja_prior <- contrast(swt_trends_max_soja_prior, 'pairwise')

swt_contr_max_prior <- contrast(swt_trends_max_prior, 'pairwise')

swt_contr_soja_prior <- contrast(swt_trends_soja_prior, 'pairwise')



bf_swt_contr_max_soja <- bayesfactor(swt_contr_max_soja, prior = swt_contr_max_soja_prior)

bf_swt_contr_max <- bayesfactor(swt_contr_max, prior = swt_contr_max_prior)

bf_swt_contr_soja <- bayesfactor(swt_contr_soja, prior = swt_contr_soja_prior)



interpbf_swt_contr_max_soja <- interpret_bf(bf_swt_contr_max_soja$log_BF, log = TRUE)

interpbf_swt_contr_max <- interpret_bf(bf_swt_contr_max$log_BF, log = TRUE)

interpbf_swt_contr_soja <- interpret_bf(bf_swt_contr_soja$log_BF, log = TRUE)

The BFs are displayed along with the contrasts and their quantiles in tables below.

Results in tabular form

Here are some tables that reproduce some of the information from the graphs above. In all cases, I give parameter estimate (median) and lower and upper bounds of the 66% and 95% credible intervals. Everything is given to 3 significant figures.

Random effects

Random effects
location	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper
CRS (F₂)	0.01860	-0.837	0.712	-0.224	0.169
LCPRS (F₂)	-0.03850	-0.909	0.636	-0.290	0.103
CCRS (F₃)	0.00943	-0.855	0.676	-0.210	0.155
UCPRS (F₃)	-0.03290	-0.909	0.624	-0.265	0.100

Marginal means of crosses

Marginal means of selection probability for each cross
max parent	soja parent	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper
N7103	407191	0.02140	0.008230	0.06780	0.01600	0.03330
NC Dunphy	407191	0.02980	0.011500	0.09550	0.02230	0.04660
NC Raleigh	407191	0.01930	0.007480	0.06200	0.01440	0.03020
TCHM06-240	407191	0.02140	0.008290	0.06830	0.01600	0.03350
N7103	425045	0.00351	0.001350	0.01130	0.00262	0.00550
NC Dunphy	425045	0.00390	0.001510	0.01240	0.00292	0.00610
NC Raleigh	425045	0.00160	0.000619	0.00506	0.00119	0.00249
TCHM06-240	425045	0.00328	0.001270	0.01040	0.00245	0.00514

Marginal means averaged by parent type

Marginal means of selection probability for each *G. max* parent
max parent	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper
N7103	0.00865	0.00335	0.0279	0.00648	0.01350
NC Dunphy	0.01080	0.00417	0.0344	0.00808	0.01690
NC Raleigh	0.00555	0.00216	0.0177	0.00415	0.00869
TCHM06-240	0.00837	0.00325	0.0267	0.00628	0.01310

Marginal means of selection probability for each *G. soja* parent
soja parent	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper
407191	0.0226	0.00878	0.07250	0.01700	0.03540
425045	0.0029	0.00112	0.00925	0.00218	0.00455

Marginal trends of selectability versus seed weight for each cross

Quantiles and Bayes factors are provided. Bayes factors provide a measure of the strength of evidence that the slope is different than our prior estimate centered around zero. Evidence “in favor of” means we think it is not zero, and evidence “against” means it is consistent with an estimate centered around zero.

Selectability vs seed weight trends for crosses
max parent	soja parent	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
N7103	407191	-0.043	-0.098	0.015	-0.069	-0.015	0.0273	very strong evidence against
NC Dunphy	407191	-0.121	-0.156	-0.086	-0.138	-0.104	5760	extreme evidence in favour of
NC Raleigh	407191	-0.010	-0.054	0.037	-0.033	0.013	0.00502	extreme evidence against
TCHM06-240	407191	0.140	0.093	0.188	0.117	0.164	1220	extreme evidence in favour of
N7103	425045	0.150	0.076	0.223	0.114	0.187	24.5	strong evidence in favour of
NC Dunphy	425045	0.317	0.285	0.348	0.302	0.332	9.84 × 10¹⁸	extreme evidence in favour of
NC Raleigh	425045	-0.093	-0.155	-0.031	-0.122	-0.062	0.394	anecdotal evidence against
TCHM06-240	425045	0.399	0.358	0.441	0.379	0.420	1.61 × 10¹⁹	extreme evidence in favour of

Marginal trends of selectability versus seed weight averaged by parent type

Selectability vs seed weight trends for *G. max* parents
max parent	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
N7103	0.054	0.007	0.101	0.031	0.076	0.0794	strong evidence against
NC Dunphy	0.098	0.075	0.122	0.087	0.110	7.29 × 10⁶	extreme evidence in favour of
NC Raleigh	-0.051	-0.091	-0.012	-0.070	-0.032	0.0897	strong evidence against
TCHM06-240	0.270	0.238	0.302	0.254	0.286	4.12 × 10²⁸	extreme evidence in favour of

Selectability vs seed weight trends for *G. soja* parents
soja parent	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
407191	-0.008	-0.032	0.015	-0.02	0.003	0.00377	extreme evidence against
425045	0.194	0.166	0.220	0.18	0.207	8.2 × 10¹¹	extreme evidence in favour of

Post-hoc contrasts: odds ratios of selection probability between crosses

First the contrasts of pairs of max parents within soja parent are given. Then the contrasts of the averaged max and soja parents are given.

Quantiles and Bayes factors are included. Verbal interpretation of Bayes factor is provided according to the cutoffs in Jeffreys 1961. Evidence “in favor of” means we think the probability of selection is different for the pair being compared, and evidence “against” means we think it’s probably the same.

Odds ratios between crosses
max parents contrasted	soja parent	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
N7103 - NC Dunphy	407191	0.716	0.647	0.790	0.683	0.751	3080	extreme evidence in favour of
N7103 - NC Raleigh	407191	1.109	0.998	1.235	1.053	1.167	0.0154	very strong evidence against
N7103 - (TCHM06-240)	407191	0.995	0.894	1.104	0.945	1.046	0.00288	extreme evidence against
NC Dunphy - NC Raleigh	407191	1.549	1.410	1.697	1.480	1.620	1.68 × 10⁶	extreme evidence in favour of
NC Dunphy - (TCHM06-240)	407191	1.388	1.271	1.517	1.328	1.451	6.26 × 10⁴	extreme evidence in favour of
NC Raleigh - (TCHM06-240)	407191	0.896	0.814	0.989	0.854	0.941	0.0171	very strong evidence against
N7103 - NC Dunphy	425045	0.899	0.796	1.015	0.848	0.953	0.00862	extreme evidence against
N7103 - NC Raleigh	425045	2.192	1.904	2.527	2.046	2.352	5.75 × 10⁸	extreme evidence in favour of
N7103 - (TCHM06-240)	425045	1.070	0.946	1.207	1.007	1.136	0.00356	extreme evidence against
NC Dunphy - NC Raleigh	425045	2.442	2.173	2.739	2.304	2.578	6.08 × 10¹³	extreme evidence in favour of
NC Dunphy - (TCHM06-240)	425045	1.190	1.088	1.306	1.138	1.244	0.85	anecdotal evidence against
NC Raleigh - (TCHM06-240)	425045	0.487	0.435	0.548	0.461	0.516	1.15 × 10⁸	extreme evidence in favour of

Odds ratios between pairs of *G. max* parents
max parent 1	max parent 2	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
N7103	NC Dunphy	0.802	0.744	0.868	0.773	0.833	86.1	very strong evidence in favour of
N7103	NC Raleigh	1.560	1.429	1.704	1.493	1.629	3.08 × 10⁷	extreme evidence in favour of
N7103	TCHM06-240	1.032	0.952	1.117	0.992	1.072	0.00228	extreme evidence against
NC Dunphy	NC Raleigh	1.943	1.805	2.100	1.874	2.014	1.06 × 10²⁰	extreme evidence in favour of
NC Dunphy	TCHM06-240	1.285	1.206	1.372	1.245	1.325	9.83 × 10⁴	extreme evidence in favour of
NC Raleigh	TCHM06-240	0.661	0.613	0.714	0.637	0.687	2.77 × 10⁸	extreme evidence in favour of

Odds ratio between pair of *G. soja* parents
soja parent 1	soja parent 2	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
407191	425045	7.793	7.382	8.238	7.589	8.009	2.77 × 10⁷³	extreme evidence in favour of

Post-hoc contrasts: difference in slopes of seed weight trends between crosses

First the contrasts of pairs of max parents within soja parent are given. Then the contrasts of the averaged max and soja parents are given.

Quantiles and Bayes factors are included. Evidence “in favor of” means we think the difference in slopes is nonzero for the pair being compared, and evidence “against” means it is consistent with a difference of zero between the slopes of the pair being compared.

Comparison of seed weight trend slopes between crosses
max parents contrasted	soja parent	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
N7103 - NC Dunphy	407191	0.079	0.012	0.146	0.047	0.110	0.174	moderate evidence against
N7103 - NC Raleigh	407191	-0.032	-0.105	0.039	-0.068	0.002	0.0165	very strong evidence against
N7103 - (TCHM06-240)	407191	-0.183	-0.257	-0.110	-0.218	-0.148	128	extreme evidence in favour of
NC Dunphy - NC Raleigh	407191	-0.111	-0.169	-0.053	-0.139	-0.083	3.52	moderate evidence in favour of
NC Dunphy - (TCHM06-240)	407191	-0.261	-0.321	-0.201	-0.291	-0.233	1.41 × 10⁷	extreme evidence in favour of
NC Raleigh - (TCHM06-240)	407191	-0.151	-0.216	-0.085	-0.183	-0.118	15.8	strong evidence in favour of
N7103 - NC Dunphy	425045	-0.167	-0.250	-0.087	-0.206	-0.127	13.4	strong evidence in favour of
N7103 - NC Raleigh	425045	0.243	0.146	0.341	0.196	0.289	48.1	very strong evidence in favour of
N7103 - (TCHM06-240)	425045	-0.250	-0.333	-0.165	-0.290	-0.207	921	extreme evidence in favour of
NC Dunphy - NC Raleigh	425045	0.410	0.340	0.480	0.376	0.443	1.49 × 10⁹	extreme evidence in favour of
NC Dunphy - (TCHM06-240)	425045	-0.082	-0.135	-0.030	-0.108	-0.056	0.381	anecdotal evidence against
NC Raleigh - (TCHM06-240)	425045	-0.492	-0.568	-0.416	-0.528	-0.455	5.19 × 10¹⁰	extreme evidence in favour of

Comparison of seed weight trend slopes between pairs of *G. max* parents
max parent 1	max parent 2	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
N7103	NC Dunphy	-0.221	-0.296	-0.142	-0.257	-0.183	0.027	very strong evidence against
N7103	NC Raleigh	0.444	0.357	0.533	0.401	0.488	7.02	moderate evidence in favour of
N7103	TCHM06-240	0.031	-0.049	0.110	-0.008	0.070	8780	extreme evidence in favour of
NC Dunphy	NC Raleigh	0.664	0.590	0.742	0.628	0.700	507	extreme evidence in favour of
NC Dunphy	TCHM06-240	0.251	0.187	0.316	0.220	0.282	4.51 × 10⁶	extreme evidence in favour of
NC Raleigh	TCHM06-240	-0.414	-0.489	-0.337	-0.452	-0.375	9.98 × 10⁹	extreme evidence in favour of

Comparison of seed weight trend slopes between pair of *G. soja* parents
soja parent 1	soja parent 2	estimate	95% CI lower	95% CI upper	66% CI lower	66% CI upper	Bayes factor	interpretation
407191	425045	2.053	1.999	2.109	2.027	2.081	3.18 × 10⁷	extreme evidence in favour of

Additional figure: size class trends with model and data overlaid

Here is a plot that has both the fitted values and data values (average 100-seed weight) for each cross showing the selectability vs. seed weight trends. This is an alternative to Fig 5 in the MS version from early March.

pred_grid <- CJ(max = unique(dat$max), soja = unique(dat$soja), ave_100sw = seq(3.2, 9.2, 0.1), n_planted_trials = 1000)



fitted_vals <- fitted(fit_binom_seedwt_full, newdata = pred_grid, re_formula = ~ 0, summary = FALSE)



fitted_quant <- cbind(pred_grid, t(fitted_vals))

fitted_quant <- melt(fitted_quant, id.vars = 1:4)

fitted_quant[, value := value/n_planted_trials]

fitted_quant <- fitted_quant[, as.data.frame(t(quantile(value, probs = q_probs))), by = .(max, soja, ave_100sw)]

setnames(fitted_quant, c('max','soja','ave_100sw', q_cols))



ggplot(dat, aes(x = ave_100sw)) +

  geom_point(aes(y = n_selected/n_planted)) +

  geom_ribbon(data = fitted_quant, aes(ymin = q0.17, ymax = q0.83), alpha = 0.3) +

  geom_line(data = fitted_quant, aes(y = q0.5)) +

  geom_line(data = fitted_quant, aes(y = q0.025), linetype = 'dotted') +

  geom_line(data = fitted_quant, aes(y = q0.975), linetype = 'dotted') +

  facet_grid(max ~ soja) +

  pct_scale +

  labs(x = '100-seed weight') +

  theme(legend.position = 'bottom',

        axis.title.x = element_markdown(),

        strip.text = element_markdown(),

        legend.text = element_markdown())

#ggsave('selectabilitybysize_modelsanddata.png', dpi = 400, height = 8, width = 5)

Seed color analysis

Setup

Load additional packages and data.

library(tidyverse)

library(gt)



dat <- read_csv('data/seedcolor.csv')

Exploratory plot

Plot the proportions of each seed color for the F2 derived and selected populations of each cross.

The differences are not dramatic but there may be some. The selected populations have somewhat more light colored seeds than the F2 derived populations, but the NC-Raleigh x PI424045 selected population also has more black colored seeds than its F2 derived population.

Statistical model

This is a multinomial regression with main effect of max parent in the cross, population (F2 derived vs. selected), and their interaction. There are no random effects because all nursery locations are summed together. The response variables are the color counts.

fit_seedcolor_f2vsselected <- brm(

  cbind(light, mid, brown, dark) | trials(n_total) ~ max * population,

  data = dat, family = multinomial(link = 'logit', refcat = 'light'),

  prior = c(

    prior('normal(0, 2)', class = 'b')

  ),

  iter = 3000, warmup = 2000, chains = 4, seed = 2024, 

  file = 'data/fit_seedcolor_f2vsselected'

)

Generate predictions and contrasts from fitted model

Get the posterior fitted value of the proportion (population-level prediction) for each of the populations for each cross from the model, and their uncertainty intervals. Take the contrasts as the ratio of posterior probabilities within each color between the selected and the F2-derived populations.

pred_data <- dat %>%

  select(population, max) %>%

  mutate(n_total = 1) %>%

  add_epred_draws(fit_seedcolor_f2vsselected) %>%

  mutate(.category = fct_recode(.category, `mid-green` = 'mid', black = 'dark'))



pred_data_wide <- pred_data %>%

  ungroup %>% select(-.row) %>%

  pivot_wider(names_from = population, values_from = .epred) %>%

  mutate(ratio = selected/`F2 derived`)

Get quantiles of the predicted values and contrasts.

pred_quants <- pred_data_wide %>% 

  group_by(max, .category) %>%

  median_qi(`F2 derived`, selected, ratio, .width = c(.66, .95))

Results

Graph of the estimated mean probabilities of each color

Here are the median posterior probabilities of each color, with thick error bars for the 66% quantile credible interval, and thin error bars for the 95% interval. These are plotted to compare the selected population with the F2 derived within each cross.

Judging from the plot, we can see there are some differences. The selected population has a greater proportion of light-colored seed (posterior proportion is greater than 30%) than the F2 derived population (around 20% to 25%). This is compensated for by the selected population having less mid-colored seed. As we saw in the bar plot above, the NC-Raleigh selected population also has more black-colored seed. Note: The F2 derived population estimates have much higher uncertainty because they are taken from a much smaller sample size. The selected population counts were obtained by simply adding up the counts from all the replicates and ignoring the F2 and F3 nursery location variation, so there is a much larger sample size.

Graph of the ratio of probabilities of each color: F2 derived vs. selected

To better quantify the evidence for difference in color between F2 derived and selected, these are the posterior values for the ratio between F2 derived and selected, and their 66% and 95% quantile credible intervals. A value > 1 indicates higher proportion of a color in the selected population.

We have solid evidence of differences in the relative probabilities. In both crosses, the F2 derived has a lower probability of light-colored seed and a higher probability of mid-colored seed. The F2 derived population has a lower probability of black-colored seed only in the NC-Raleigh x PI424045 cross.

Table of all seed color results

This table contains all the data presented in the above two figures (posterior medians of fitted probabilities, their ratios F2 derived/selected, and 66% and 95% quantile credible intervals for every value).

color	population	estimate	66% QI	95% QI
NC-Raleigh x PI424045
light	F2 derived	0.251	(0.214, 0.29)	(0.178, 0.337)
light	selected	0.340	(0.319, 0.362)	(0.298, 0.384)
light	ratio	1.350	(1.16, 1.61)	(0.985, 1.96)
mid-green	F2 derived	0.301	(0.26, 0.345)	(0.223, 0.395)
mid-green	selected	0.165	(0.148, 0.181)	(0.133, 0.2)
mid-green	ratio	0.547	(0.46, 0.649)	(0.386, 0.787)
brown	F2 derived	0.182	(0.15, 0.219)	(0.118, 0.263)
brown	selected	0.132	(0.118, 0.148)	(0.104, 0.165)
brown	ratio	0.729	(0.585, 0.911)	(0.464, 1.17)
black	F2 derived	0.259	(0.219, 0.301)	(0.183, 0.342)
black	selected	0.361	(0.34, 0.383)	(0.32, 0.407)
black	ratio	1.390	(1.19, 1.65)	(1.03, 2.02)
TCHM06-240 x PI424045
light	F2 derived	0.212	(0.169, 0.259)	(0.128, 0.316)
light	selected	0.347	(0.334, 0.36)	(0.321, 0.374)
light	ratio	1.640	(1.33, 2.06)	(1.1, 2.71)
mid-green	F2 derived	0.331	(0.28, 0.387)	(0.227, 0.451)
mid-green	selected	0.226	(0.215, 0.238)	(0.204, 0.25)
mid-green	ratio	0.683	(0.58, 0.814)	(0.497, 1)
brown	F2 derived	0.223	(0.178, 0.274)	(0.137, 0.33)
brown	selected	0.219	(0.208, 0.231)	(0.195, 0.243)
brown	ratio	0.978	(0.795, 1.24)	(0.649, 1.61)
black	F2 derived	0.224	(0.177, 0.275)	(0.136, 0.329)
black	selected	0.207	(0.196, 0.219)	(0.184, 0.231)
black	ratio	0.922	(0.748, 1.18)	(0.615, 1.54)

Selectable progeny from soybean crosses

Quentin D. Read

February 13, 2023

Summary

Change log

Setup

Exploratory data viz

Fitting model

Graphs of model results

Post hoc contrasts

Bayes factors

Results in tabular form

Random effects

Marginal means of crosses

Marginal means averaged by parent type

Marginal trends of selectability versus seed weight for each cross

Marginal trends of selectability versus seed weight averaged by parent type

Post-hoc contrasts: odds ratios of selection probability between crosses

Post-hoc contrasts: difference in slopes of seed weight trends between crosses

Additional figure: size class trends with model and data overlaid

Seed color analysis

Setup

Exploratory plot

Statistical model

Generate predictions and contrasts from fitted model

Results

Graph of the estimated mean probabilities of each color

Graph of the ratio of probabilities of each color: F2 derived vs. selected

Table of all seed color results