Ordinal Outcomes, Part 2: Break Things Yourself

Preamble
In Part 1, we argued that running linear regression on an ordinal outcome quietly commits you to an estimand — the conditional mean — that often fails to summarize what you actually care about. We walked through a pre–post comparison of two AI coding assistants (Model A and Model B), where Model B polarized users in opposite directions along an observable prompt-length covariate. Ignoring prompt length, Model A and Model B had identical average satisfaction. Conditioning on prompt length, three models produced different stories about the direction and magnitude of the change: OLS compressed the decision-relevant distributional shifts by roughly half and even got the sign wrong in the off-diagonal cells; the proportional-odds cumulative probit (PO probit or PO CP) recovered the distributional shape better but still came in 8 to 10 percentage points shy at the tails; the category-specific cumulative probit (CS probit or CS CP) matched the observed cell distributions exactly, producing accurate probability contrasts with uncertainty-communicating 95% CIs, p-values, and plots ready to hand to a non-technical collaborator. Part 1 also added formal model-fit comparisons that ranked the three models in the same order as the contrast table (OLS < PO probit < CS probit).
In this post, we give you control of the data and modeling features. Below is an interactive tool¹ for generating your own ordinal two-condition (i.e., two treatment arms) comparisons. You can dial the sample size, pick a shape for the baseline response distribution, set the effect size, turn a proportional-odds violation on and off, and optionally fit a category-specific cumulative probit alongside OLS and the PO probit. The app simulates the data, fits the models, and overlays their predicted category probabilities on the observed distribution, with 95% CIs on the per-category contrasts and AIC / BIC on a common categorical scale.
Our goal is not to walk you through the math again — Part 1 did that. Rather, it is to let you develop a feel for how the choice of model matters under various conditions. The five guided tours below point to canned settings that illustrate failure modes mentioned in Part 1, including a bimodal-baseline counterfactual tour (Tour 5) that goes beyond Part 1 by showing how OLS's in-sample failures are exacerbated when predicting out-of-sample (here, under counterfactual extrapolation). We render each tour's expected output as a static figure so you can read the post end-to-end without waiting on the live app to load. But you are not limited to the canned settings; the dashboard lets you take control and examine your own modeling choices and distributional assumptions.
The app’s covariate support is limited to a single binary subgroup toggle (the Include subgroup covariate checkbox, used in Tour 2 below to recreate Part 1’s polarization-by-prompt-length pattern in microcosm). It does not let you simulate continuous, multiple, or differently-named covariates of your own design — adding those dimensions would explode the UI surface, and for those cases the code patterns in the Bring your own data section below should help you translate directly to your own data frame.
A brief methodological note before we start. The app uses frequentist models (lm for linear, MASS::polr for the PO probit, ordinal::clm for the category-specific probit, and ordinal::clm with a scale = ~ treatment term for the heteroscedastic PO probit) rather than Bayesian ones, and pulls 95% CIs for the category-level contrasts from marginaleffects. I (Jon) wanted the app to be fast and to use a minimal set of R packages so it could be hosted on a free public-Shiny tier. For production applied work, we would still reach for a Bayesian multilevel cumulative probit (see our JQC paper and online supplement), but for building intuition polr, clm, and marginaleffects are plenty.²
The app
If the embed above is slow to wake up, open the app in its own tab instead.
Five guided tours
The app gives you six baseline shapes, a treatment-effect slider, a three-way effect-type selector (constant shift, asymmetric PO violation, mean-preserving polarization), a latent-SD-ratio slider for variance shifts, an optional subgroup covariate for heterogeneous treatments, a category-specific-probit toggle, a heteroscedastic-PO-probit toggle (visible only when neither the covariate nor the CS toggle is on), a counterfactual-multiplier slider for out-of-sample extrapolation, and two sample-size / category dials. Briefly:
- Baseline shape sets the response distribution in the control condition (Model A). Pick Symmetric / Normal-like for the easy case, Bimodal (polarized) for the Yoda-satisfaction case from Part 1, Skewed toward Satisfied / Dissatisfied for one-sided distributions, Uniform for a flat reference, or Zero-inflated / J-shape for rare-event ordinal outcomes.
- Effect type + β (latent-SD shift) specify how Model B differs from Model A. Constant shift is the standard PO setup; asymmetric is a PO violation concentrated at one end of the scale; polarization preserves the mean while widening both tails.
- Latent SD ratio controls heteroscedasticity — set above 1 to make the treated condition’s latent distribution wider than the control’s; this drives the Tour 4 failure mode that neither OLS nor the standard PO probit can recover.
- Subgroup covariate + interaction magnitude turn on a binary subgroup (think prompt length in Part 1) whose treatment effect differs between levels; with β = 0 and a positive interaction magnitude, the marginal effect is null while the subgroup effects are large and opposing.
- Category-specific probit toggle adds the saturated clm(..., nominal = ~ treatment) fit alongside OLS and the PO probit.
- Heteroscedastic PO probit toggle adds the clm(..., scale = ~ treatment) fit, which fits a per-condition latent variance — the right tool when the treated condition is more (or less) dispersed than the control condition. Available only when the subgroup covariate is off and the CS-probit toggle is off (its UI checkbox hides otherwise). Demonstrated in Tour 4.
- Counterfactual multiplier scales the fitted treatment effect for out-of-sample prediction in the Counterfactual tab.
That is a lot of combinations. Some of them are uninteresting. A few represent the whole point of this series. Below are five canned settings — one that demonstrates where OLS will generally “work” and four that demonstrate distinct modes by which OLS goes wrong (and how cumulative probit, with its category-specific extension, does or does not fix it). The presets in the app’s sidebar load these scenarios in one click.
Tour 1: When linear is roughly fine
The easy case. Under Tour 1’s settings, OLS and the PO probit produce nearly identical contrasts and AIC values within a few points of each other on the Coefficients & Fit tab. This is what “OLS basically works” looks like — and worth keeping in mind as a reference point for the four tours that follow.
Settings. Symmetric / Normal-like baseline. β = 0.4. Effect type = constant. SD ratio = 1.0. No covariate. CS probit off. N = 500. Seed = 1138.
This is about as close as you get to a scenario where a linear model is roughly defensible on ordinal data. The baseline distribution is tent-shaped, the latent shift is uniform across thresholds (proportional odds holds), the variance is the same in both conditions, and there is no unobserved heterogeneity. Nothing about the data-generating process badly violates OLS’s assumptions. (Caveat: Remember that we have to “hack” the OLS model to generate predicted category probabilities because it is designed to model the conditional mean; the PO probit model generates them naturally.)
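To make that "hack" concrete, here is a minimal sketch of the binning trick, assuming a 1-to-5 scale with bin edges at k ± 0.5. The edges are our assumption; the app's internal binning may differ in detail, and (as Tour 5's out-of-bounds column shows) the app additionally tracks predicted mass beyond the scale rather than collapsing it.

```r
# Convert an OLS fit's normal predictive distribution into implied category
# probabilities by integrating the normal density between half-point bin
# edges. Tails are collapsed into the end categories here; the app's Tour 5
# display tracks the out-of-bounds mass beyond the scale separately.
ols_implied_probs <- function(mu, sigma, K = 5) {
  edges <- c(-Inf, seq(1.5, K - 0.5, by = 1), Inf)  # bin edges at k + 0.5
  p <- pnorm(edges[-1], mu, sigma) - pnorm(edges[-length(edges)], mu, sigma)
  setNames(p, paste0("cat", seq_len(K)))
}
ols_implied_probs(mu = 3.4, sigma = 1.1)  # e.g., a fitted cell mean and residual SD
```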
What to notice. Red OLS-implied triangles and purple PO-probit dots both sit close to the observed bars. In the app’s Coefficients & CIs tab, the OLS and probit coefficients agree directionally; they disagree on magnitude only because they are on different units (OLS in scale points, probit in latent SDs, related by roughly the response-scale standard deviation). The per-category contrasts from the two models are similar. If all your real ordinal data looked like this, you could plausibly get away with OLS for most decisions. The remaining tours show what “plausibly” is doing in that sentence.
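To sanity-check the unit conversion yourself, here is a back-of-envelope using the m_ols and m_ord fits from the Bring your own data section below; the "treatmentB" coefficient name assumes treatment is a two-level A/B factor in your data.

```r
# Rough unit conversion: dividing the OLS coefficient (scale points) by the
# outcome SD should land near the probit coefficient (latent SDs) when the
# data are as well-behaved as Tour 1's.
coef(m_ols)[["treatmentB"]] / sd(dat$y)  # OLS effect, rescaled to SD units
coef(m_ord)[["treatmentB"]]              # probit effect, native latent-SD units
```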
Tour 2: Subgroup polarization (the Part 1 analog)
Settings. Symmetric / Normal-like baseline. β = 0. Effect type = constant shift. Latent SD ratio = 1.6. Subgroup covariate on, interaction magnitude = 1.6. CS probit on. N = 1000. Seed = 1138.
This is the two-condition, two-subgroup analog of Part 1's worked example, dialed up for visual punch. Both subgroups start with the same symmetric baseline distribution. The treatment shifts the low subgroup's latent distribution upward by +0.8 SDs and the high subgroup's downward by −0.8 SDs, and on top of that the treated condition's latent distribution has 1.6× the SD of the control condition's — a combination that shows up in real applied data whenever one observable factor (think prompt length, ideology, baseline severity) drives opposing treatment effects across subgroups and the treatment increases overall variability. Marginalized over subgroups, the average treatment effect is approximately zero. Conditional on subgroup, the treatment's effects are large, opposing, and would be obvious to any analyst who bothered to look within subgroup.
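For readers who want to poke at this pattern outside the app, here is a hedged sketch of the data-generating process as described above — our reconstruction, not the app's exact code, and the threshold placements in particular are assumptions.

```r
# Opposing +/-0.8 SD latent shifts by subgroup, plus a 1.6x latent SD in
# the treated condition, cut into 5 ordinal categories at fixed thresholds.
set.seed(1138)
n <- 1000
treatment <- rep(c("A", "B"), each = n / 2)
subgroup  <- sample(c("low", "high"), n, replace = TRUE)
shift <- ifelse(treatment == "B" & subgroup == "low",   0.8,
         ifelse(treatment == "B" & subgroup == "high", -0.8, 0))
sd_lat <- ifelse(treatment == "B", 1.6, 1.0)
latent <- rnorm(n, mean = shift, sd = sd_lat)
y <- cut(latent, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf),
         labels = 1:5, ordered_result = TRUE)
round(prop.table(table(treatment, subgroup, y), margin = 1:2), 2)
```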
| Subgroup | Category | Observed Δ | OLS-implied Δ | CS-probit Δ [95% CI] |
|---|---|---|---|---|
| high | 1 | 31% | 16% | 31% [26%, 36%] |
| high | 2 | -1% | 8% ⚠ | -1% [-6%, 4%] |
| high | 3 | -17% | -5% | -17% [-22%, -12%] |
| high | 4 | -15% | -11% | -15% [-20%, -10%] |
| high | 5 | 1% | -8% ⚠ | 1% [-2%, 5%] |
| low | 1 | -1% | -9% | -1% [-5%, 3%] |
| low | 2 | -16% | -12% | -16% [-21%, -11%] |
| low | 3 | -13% | -4% | -13% [-19%, -8%] |
| low | 4 | -3% | 9% ⚠ | -3% [-8%, 2%] |
| low | 5 | 33% | 17% | 33% [28%, 38%] |
What to notice. The four panels show that the treatment has a real, meaningful, and opposing effect in each subgroup: the low subgroup is pushed strongly toward Very Satisfied, the high subgroup strongly toward Very Dissatisfied. An analyst who fits lm(Y ~ treatment) and reads the marginal coefficient will conclude the treatment does nothing, because the subgroups cancel out on the response-scale mean. An analyst who fits lm(Y ~ treatment * subgroup) and reads the Interaction tests box in the app’s Coefficients & CIs tab will see both OLS and probit reporting a meaningful interaction — directionally agreeing, but on different scales. Either way, the per-subgroup, per-category contrasts (visible in the contrast table) tell the most actionable story: who is helped, who is hurt, and by how much in each cell of the response.
This is the failure mode that motivates Part 1’s prompt-length covariate, and it generalizes to most applied settings where heterogeneous treatment effects are real (e.g., fear of crime by political ideology, satisfaction by baseline severity, intervention response by demographic). Whenever an unmodeled covariate drives heterogeneity in opposite directions across subpopulations, the marginal mean misleads. The cumulative-probit framework does not save you from this if you do not model the interaction — but it gives you, via per-cell category contrasts with CIs, a much more direct way to communicate the heterogeneity to a non-technical audience than a scale-point interaction coefficient does. The Coefficients & Fit tab’s AIC numbers reinforce the story: with the interaction in the model, the PO probit and CS probit beat OLS substantially on within-sample fit.
Note: The Counterfactual tab also behaves a little differently in this mode than in the others. With a covariate in the model, it renders a 2×2 bar grid (one row per subgroup, one column per model), with the counterfactual multiplier scaling each subgroup's treatment effect separately. Sliding the multiplier above 1× shows how the heterogeneous treatment effect would extrapolate if amplified — typically toward saturated category masses under the PO probit and toward impossible out-of-bounds predicted mass under OLS, with the sign flip from the off-diagonal cells preserved and amplified.
Tour 3: Skewed-tail compression
Settings. Skewed-toward-Satisfied baseline. β = 0.6. Effect type = constant. Latent SD ratio = 1.0. CS probit off. N = 1000. Seed = 1138.
A fairly realistic satisfaction-survey shape: most users already like the product, a long tail does not. The treatment is genuinely effective and uniform on the latent scale, and because the baseline is skewed, most of the action happens at the Very Satisfied end.
What to notice. The OLS coefficient will tell you the mean rises by something like 0.3 to 0.4 scale points; the probit coefficient will say the latent distribution shifts by about 0.6 SDs. Both models agree the treatment works. Yet look at the Very Satisfied column: the observed jump is on the order of 20 percentage points, the probit-predicted jump tracks it closely, and the OLS-implied jump is noticeably smaller. OLS is forced to spread its predicted mass symmetrically around the predicted mean, but the actual distribution piles up against the ceiling at category 5. If you care about the Very Satisfied category specifically — perhaps because those users are the most likely to recommend the product — the linear model is understating the size of the effect by a meaningful fraction. The probit model is not. The Coefficients & Fit tab’s AIC reflects this directly: the PO probit comes in well below OLS on a common categorical scale, which is the within-sample analog of the visual tail-compression failure (aka, ceiling effects).
Tour 4: Variance-only shift
Settings. Symmetric / Normal-like baseline. β = 0.0. Effect type = constant. Latent SD ratio = 1.6. No covariate. Heteroscedastic PO probit on (CS probit off). N = 1000. Seed = 1138.
This is the failure mode that few people discuss. The treatment does not shift the latent mean — it shifts the latent variance. The treated condition has a wider latent distribution than the control condition: the same average opinion, but more variability around it. This produces more mass at categories 1 and 5, less in the middle. It is a real and perhaps even common pattern (treatments that increase polarization, surveys with novel items that produce wider response distributions, interventions with heterogeneous effects of unknown source) and neither OLS nor the standard PO probit will recover it.
Only the heteroscedastic PO probit (clm(..., scale = ~ treatment)) recovers the marginal shape correctly, by explicitly fitting a per-condition latent variance.
What to notice. OLS pools the residual SD across conditions, so its bin probabilities for both conditions use the same predictive width, which understates the treated condition's spread and overstates the control condition's. The standard PO probit does not have a scale parameter at all (its assumption is that the latent SD is the same in both groups), so its predicted probabilities are also wrong in both conditions. The correct model here is the heteroscedastic PO probit — clm(..., scale = ~ treatment), available via the Also fit heteroscedastic PO probit toggle in the app sidebar — which explicitly fits a per-condition latent variance and recovers the observed proportions correctly. The category-specific probit (toggled off here, but available as an alternative) also recovers the marginal cell probabilities exactly, but by spending parameters on per-threshold differences rather than by recognizing this as a scale shift; in the app's Coefficients & Fit tab the heteroscedastic probit's AIC sits well below the PO probit's (and its BIC slightly below the CS probit's), reflecting that the variance-shift specification is the right structural choice when this pattern is what you suspect.³
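The call behind the toggle (spelled out in footnote 3), using the dat and y_ord setup from the Bring your own data section below:

```r
# Heteroscedastic PO probit: one latent mean shift plus a per-condition
# latent SD. clm's scale model is log-linear, so exponentiating the scale
# coefficient in the summary output estimates the latent SD ratio.
library(ordinal)
m_het <- clm(y_ord ~ treatment, scale = ~ treatment,
             data = dat, link = "probit")
summary(m_het)  # scale coefficient is reported on the log scale
```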
Tour 5: Counterfactual extrapolation under a bimodal baseline
Settings. Bimodal (polarized) baseline. β = 0.5. Effect type = constant. Latent SD ratio = 1. No covariate. CS probit off. N = 1000. Seed = 1138. Counterfactual multiplier = 3×.
This is the bimodal “Yoda-satisfaction” pattern from Part 1’s panel C, fit with a moderate latent shift. The data-generating mechanism is otherwise innocent: proportional odds holds, no variance shift, no covariate, no hidden tricks. The challenge is purely that the observed response distribution piles up at the two ends of the scale, with very little mass in the middle, and OLS’s normal assumption cannot accommodate that shape.
What to notice at the observed effect (1×). Before any counterfactual extrapolation, OLS is already in trouble. Because the observed distribution is bimodal — heavy mass at Very Dissatisfied and Very Satisfied, very little in the middle — the pooled OLS residual SD comes in around 1.8, large enough that the fitted normal leaks mass off both ends of the scale. At the Model A baseline (Model B’s pre-treatment counterpart), OLS predicts roughly 8% of responses below 1 and 8.5% above 5: about one-sixth of its predicted mass sits on response values the scale cannot produce, with the treatment off and the multiplier at 1.
The treated cell (Model B at 1×) sharpens the picture at the top of the scale. The observed share of users in the Very Satisfied category is 66%, and PO probit recovers this essentially exactly. OLS, by contrast, predicts only 18% strictly inside the Very Satisfied bin, plus another 18% leaking past the scale ceiling. Even when an analyst applies the natural workaround — collapse the impossible-value mass back into the top category, on the reasoning that “the model meant to predict more Very Satisfied users” — the collapsed estimate is 36%, still 30 percentage points below truth. The workaround does not save OLS on the data the analyst actually has.
What to notice at the counterfactual (3×). Crank the multiplier and the visual failure becomes vivid. OLS’s predicted mean for the treated cell at 3× is 5.54 — past the upper boundary of the response scale itself. Half of OLS’s predicted mass (about 51%) now sits in the “above 5” out-of-bounds column. The model is, in its own self-report, predicting that the average user response will be above the highest available category. PO probit cannot do this; its predictions live in the latent space and project back through the fitted thresholds into a distribution over the K valid response categories at every multiplier. At 3×, it predicts P(Very Satisfied) ≈ 94% — a saturated, in-scale extrapolation. The CDF (cumulative distribution function) panel in the top-right makes this contrast directly visible: PO probit’s step CDF reaches 1 exactly at category K, while OLS’s continuous CDF is still climbing well past the response-scale ceiling.
A counterintuitive but important wrinkle. Compare the OLS-versus-PO-probit gap at 1× and at 3×. At 1×, collapsed-OLS (i.e., capping implied probabilities >5 at =5) predicts 36% Very Satisfied against PO probit’s 66% — a 30-point gap. At 3×, collapsed-OLS predicts 72% against PO probit’s 94% — only a 22-point gap. The gap is smaller at 3× than at 1×. The mechanism is reversal-by-pile-up: as the multiplier grows, OLS leaks more mass past the ceiling, and the collapse-to-top-category workaround scoops that impossible mass back into the visible scale, partially absorbing the failure. Extrapolation does not make the analyst’s interpretation uniformly worse — it makes the failure mode more visible (the predicted mean past the scale ceiling is hard to miss) while, in this case, the substantive interpretive error is actually largest on the data the analyst actually has.
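The arithmetic behind these numbers is plain normal-CDF algebra. Here is a sketch using the figures quoted above (treated-cell mean 5.54 at 3×, pooled residual SD of about 1.8); the k ± 0.5 bin edges are our assumption, though they are consistent with the percentages in the text.

```r
# OLS's treated-cell predictions at the 3x counterfactual, from the normal
# distribution it fits: mass past the scale ceiling, mass strictly inside
# the Very Satisfied bin, and the collapse-to-top-category workaround.
mu <- 5.54; sigma <- 1.8
p_above_ceiling <- 1 - pnorm(5.5, mu, sigma)                  # ~0.51
p_very_sat      <- pnorm(5.5, mu, sigma) - pnorm(4.5, mu, sigma)
p_collapsed     <- 1 - pnorm(4.5, mu, sigma)                  # ~0.72
round(c(ceiling = p_above_ceiling, bin5 = p_very_sat, collapsed = p_collapsed), 2)
```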
This is the same lesson Part 1’s fit-and-out-of-sample section makes in the bias-variance / measurement-theory framing. Predictive adequacy at one layer of the model does not protect against interpretive failure at another. OLS can predict the categorical distribution adequately in some scenarios (Tour 1) and badly in others (this tour), but the conditional-mean estimand OLS natively reports remains a number without a defensible referent for ordinal outcomes regardless of how the predictive distribution looks. Tour 5 is the version of that argument you can hold in your hand — and, if you have an applied audience, the version they will remember.
Bonus: break things further
Once you have run the five tours, try these:
- Unobservable mixture polarization. Load the Tour 1 preset, then set effect type to Mean-preserving polarization with no covariate, β around 1.0–1.5, on a symmetric baseline. The marginal post-treatment distribution becomes broader and bimodal-shaped, exactly the way a Tour 2 marginal would look if you collapsed it over the (unobserved) subgroup. This is the harder case for an applied analyst: heterogeneity exists, no covariate is available to break it apart, and the mixture mode lets you simulate what your data would look like if there were a hidden subgroup driving opposite effects you cannot measure (a hedged simulation sketch follows this list). OLS and the PO probit both report a null effect; the CS probit recovers the marginal shape but cannot tell you why it is bimodal.
- Asymmetric (PO violation) mode. Set effect type to Asymmetric (PO violation) with a non-zero β. The latent shift is concentrated at the bottom of the scale and fades out at the top — users get pushed up out of the dissatisfied range but barely move toward Very Satisfied. The PO probit’s single treatment coefficient cannot summarize this; the CS probit’s per-threshold shifts can. This is the floor-effect variant of a PO violation, and the same pattern with reverse-orientation thresholds produces the ceiling-effect variant; both are pervasive in applied work (interventions with diminishing returns, fear measures with most respondents already near the floor, satisfaction measures with most respondents already near the ceiling). A reviewer of our JQC paper requested that we be more explicit about these patterns, and this tour is the version of the demonstration we wish we had been able to point them at.
- Zero-inflated baseline with a small β. Rare-event ordinal outcomes (self-reported serious crime, high-severity service complaints) follow this shape. OLS is almost always misleading here because most observations sit at category 1 and the residual SD is nearly meaningless. This scenario is closely related to Supplementary Simulation 1b from our JQC supplement.
- Very small sample sizes (N = 100 per condition) with any shape. Probit coefficients get noisy quickly in small samples, where OLS is somewhat more robust. This is one of the few practical advantages of OLS on ordinal outcomes, and it tends to disappear at moderate N.
- Compound failures. Stack modes — for example, an asymmetric PO violation on a skewed baseline, or a subgroup interaction at small N. The app does not stop you, and the patterns you see can be quite striking. Real applied data routinely has more than one of these going on at once.
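Here is the mixture-polarization sketch promised in the first bullet — again a hedged reconstruction of the idea, not the app's code.

```r
# Mean-preserving polarization as a latent mixture: treated units get a
# +/- beta latent shift with equal probability, so the mean is unchanged
# while both tails fatten.
set.seed(1138)
n <- 1000; beta <- 1.2
treatment <- rep(c("A", "B"), each = n / 2)
shift <- ifelse(treatment == "B", sample(c(-beta, beta), n, replace = TRUE), 0)
latent <- rnorm(n, mean = shift, sd = 1)
y <- cut(latent, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf), labels = 1:5)
prop.table(table(treatment, y), margin = 1)  # broader, bimodal-shaped in B
```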
Bring your own data
If you have an ordinal outcome and a treatment indicator in a data frame called dat, the minimum-viable version of what the app is doing — extended to include the category-specific probit and cell-level contrasts with CIs, as in Part 1’s worked example — is:
```r
library(MASS)
library(ordinal)
library(marginaleffects)

dat$y_ord <- factor(dat$y, ordered = TRUE)

# OLS — the default workflow
m_ols <- lm(y ~ treatment, data = dat)
summary(m_ols)

# Proportional-odds cumulative probit
m_ord <- polr(y_ord ~ treatment, data = dat, method = "probit", Hess = TRUE)
summary(m_ord)

# Category-specific cumulative probit (drops PO)
m_ord_cs <- clm(y_ord ~ 1, nominal = ~ treatment, data = dat, link = "probit")
summary(m_ord_cs)

# Predicted category probabilities by treatment (from PO probit)
predict(m_ord,
        newdata = data.frame(treatment = levels(dat$treatment)),
        type = "probs")

# Per-category treatment contrasts with 95% CIs (from CS probit).
# This is the "applied deliverable" table from Part 1.
avg_comparisons(
  m_ord_cs,
  variables = "treatment",
  type = "prob"
)
```

If you also have a covariate that plausibly drives heterogeneity — prompt length, ideology, job role, device type, whatever it is in your data — include it as an interaction, and use by = to break the contrasts out by covariate level:
```r
m_cs_cov <- clm(y_ord ~ 1,
                nominal = ~ treatment * covariate,
                data = dat, link = "probit")

avg_comparisons(
  m_cs_cov,
  variables = "treatment",
  by = "covariate",
  type = "prob"
)
```

Two additional things we would do before trusting the output on real data:
- Plot the raw response distribution by treatment condition (and, if available, by covariate level) before fitting anything. If it looks tent-shaped and roughly symmetric, carry on with a PO cumulative probit and expect it to track OLS closely on the contrasts. If it looks bimodal, heavily skewed, or zero-inflated, the cumulative probit family is almost certainly giving you better answers than OLS — and the category-specific extension (clm with nominal = ~ treatment) will track the cell distributions much more faithfully than the PO version.
- Check whether proportional odds looks defensible. The app's PO-violation toggle should give you some intuition for what a violation looks like in the predicted-probability output. For formal tests on real data, brant::brant() on a polr fit is a reasonable starting point; for model comparison with proper uncertainty, fit both the PO and category-specific versions in brms (or compare clm fits via AIC / BIC) and pick the one the data prefer (see the sketch after this list).
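A minimal sketch of those two checks in code, reusing the dat and y_ord objects from above. Note that the Brant test is defined for the logit link, so we refit polr with its default logistic link first.

```r
# Formal PO check plus information-criterion comparison of the clm variants.
library(brant)
m_logit <- MASS::polr(y_ord ~ treatment, data = dat, Hess = TRUE)  # logit link
brant(m_logit)  # per-variable chi-square tests of the PO assumption

m_po <- ordinal::clm(y_ord ~ treatment, data = dat, link = "probit")
m_cs <- ordinal::clm(y_ord ~ 1, nominal = ~ treatment,
                     data = dat, link = "probit")
AIC(m_po, m_cs)
BIC(m_po, m_cs)
```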
Series wrap-up
Over these two posts, we have tried to make three arguments, in roughly the order a skeptical reader’s objections come up:
- There is a real problem. Linear regression on ordinal outcomes commits you to an estimand (the conditional mean) that frequently does not summarize what you care about. This is not a small-sample issue or a rounding issue; it is a choice about what question you are answering.
- There is a workable alternative. The cumulative probit family has a clean generative story, fits in a single line of R via MASS::polr (the proportional-odds version, suitable as a default) or ordinal::clm with nominal = ~ treatment (the category-specific extension that drops the PO assumption when it doesn't hold), and produces predicted category probabilities — and, with marginaleffects, per-cell tail-probability contrasts with uncertainty-communicating CIs — that you can hand to a non-technical collaborator without a translation step.
- The difference matters in cases you actually encounter. Bimodal distributions, skewed distributions, tail-concentrated effects, and PO violations are common in applied research across both academic and industry settings. In those cases, the linear model does not merely give you a slightly noisier answer; it gives you a systematically wrong one, sometimes including a sign flip on specific tail probabilities, and it gives you the wrong answer most confidently about the part of the distribution that matters most.
A note worth surfacing for anyone who has been clicking through the app’s tours. In well-behaved cases — symmetric baseline, moderate latent shift, no PO violation, no scale or boundary pathology, no unmodeled heterogeneity — OLS and the bare PO cumulative probit produce numerically similar per-category contrasts and similar AIC values on the Coefficients & Fit tab. That is not a flaw in either model; it is OLS happening to be approximately right because the data were not pathological enough to break it, while the PO probit collapses to nearly the same answer because both are single-shift models. The argument for defaulting to the cumulative probit family is therefore not that it always wins on contrast magnitudes or fit metrics — much of the time it ties. The argument is that probit fits the per-condition category proportions correctly even when contrasts agree, gives you category-level estimands and predicted probabilities with CIs as the native output rather than as a postprocessing step, extrapolates safely outside the observed range (Tour 5 above), and when the OLS-friendly assumptions break — Tour 2’s polarization, Tour 3’s tail compression, Tour 4’s variance shift, Tour 5’s bimodal baseline — the AIC table reflects the gap, the category-specific extension is one toggle away, and Part 1’s formal AIC + held-out RPS pipeline ports directly to your data. The choice is between defaulting to a model that fits the data structurally and taking on the risk that OLS happens to be approximately right; we think the bargain favors the former.
If any of this felt obvious, we are glad. If any of it was new, then try running the Tour 2 settings in the app, look at the per-cell predicted probabilities and 95% CIs in the contrast table, and ask whether the OLS coefficients in your field’s published papers have ever delivered that level of detail or accuracy. For instance, do analysts tend to answer what fraction of users/participants/units land in each category under each condition, and with what uncertainty? The cumulative probit, paired with marginaleffects::avg_comparisons, hands you that estimand in three lines of R. A slight modification — switching from MASS::polr to ordinal::clm with nominal = ~ treatment — drops the proportional-odds assumption and gives you category-specific contrasts that recover the observed cell distributions exactly, which is what you want precisely when PO is violated and the standard probit’s single coefficient is no longer enough. OLS does not, and cannot, give you either.
Finally, it is worth being precise about what the upgrade is. The familiar OLS p-value is not what is missing — summary(lm(...)) will print one happily, and the app’s coefficients tab includes one too. Yet, it qualifies a coefficient (a scale-point shift in the mean of an ordinal variable) that is commonly what applied decisions ride on but, as we have argued throughout the series, rarely what they should ride on. The cumulative probit’s category-specific predicted probabilities, paired with marginaleffects CIs, replace one number-with-a-p with K paired estimates and K confidence intervals, all on a scale (the proportion of users in each response category) that decision-makers can reason about directly. Significance was never the problem here; the coefficient it qualified was. Consider how many applied decisions, and how many published findings, get adjudicated on a single mean-difference coefficient and its p-value when the more interesting and meaningful underlying questions were about specific category probabilities and how confident we should be about them.
Thanks for reading. If you build something interesting on top of this, or catch an error in the app, or find a real dataset where the probit story flips the conclusion from a published linear analysis, we would love to hear about it!
These were the estimands you were looking for.

Further reading
- Liddell & Kruschke (2018) — the paper that sparked all of this for us.
- Bürkner & Vuorre (2019) — friendly Bayesian ordinal tutorial.
- Harrell, “Violation of Proportional Odds is Not Fatal” — a measured defense of cumulative models even when PO is mildly violated.
- McElreath’s Statistical Rethinking — provides generative intuition for ordered categorical models.
- Brauer & Day (2025) and supplement — the full Bayesian multilevel version of the approach sketched here.
- marginaleffects — the R package we use for per-category treatment contrasts with confidence intervals.
- Posit Connect Cloud — free public-Shiny hosting; that is where the embedded app above lives.
Footnotes
1. The interactive tool is a Shiny app hosted on Posit Connect Cloud and embedded in the post via <iframe>. See footnote 2 for cold-start behavior and how to clone the source.
2. The app is a standard server-side Shiny app, hosted free on Posit Connect Cloud. If the iframe is slow to come up, that is a cold start — the host puts the app to sleep when idle and spins it back up on demand. Refreshing usually fixes it. The full source is in shiny/app.R alongside the post; you can also clone and run it locally with shiny::runApp("shiny/app.R").
3. clm from the ordinal package supports scale = ~ treatment to fit a latent-SD shift directly. This is uncommon in applied work but pedagogically important: ordinal data can carry signal in the variance as well as in the mean, and a model that only fits a mean shift will miss it. The toggle in the app sidebar is gated to the no-covariate, CS-probit-off case; the call is clm(y_ord ~ treatment, scale = ~ treatment, data = dat, link = "probit"). The Bayesian analog (a brms::brm cumulative probit with group-varying discrimination) is the version we would actually use in research; see Bürkner and Vuorre's (2019) ordinal regression primer for how to specify an unequal-variance (disc) Bayesian version in R. For the full frequentist specification see the ordinal package's vignette (Christensen, 2019).
Reuse
Citation
@online{brauer2026,
author = {Brauer, Jon and Day, Jake},
title = {Ordinal {Outcomes,} {Part} 2: {Break} {Things} {Yourself}},
date = {2026-05-13},
url = {https://reluctantcriminologists.com/blog-posts/017/part2-break-things-yourself.html},
langid = {en}
}