Simpson’s paradox

March 19, 2024 | 1 Comment

Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined. This result is often encountered in social-science and medical-science statistics, and is particularly problematic when frequency data are unduly given causal interpretations. The paradox can be resolved when confounding variables and causal relations are appropriately addressed in the statistical modeling (e.g., through cluster analysis). Simpson's paradox has been used to illustrate the kind of misleading results that the misuse of statistics can generate.

Lance Bialas comments:

Exactly the challenge we face at the macro level currently. Averages are giving us only an opaque picture right now: real incomes, mortgage coupons, debt servicing.

William Huggins writes:

I teach people to use paired t-tests as a way of sidestepping the "inappropriate grouping" that causes Simpson's paradox. If you blend groups together rather than comparing like with like, it's easy to make a mistake.

Zubin Al Genubi asks:

Would a larger sample size help avoid the paradox? Or higher confidence levels?

William Huggins answers:

T-tests (difference of means) aren't going to solve the problem because they compare the average of group 1 to the average of group 2. If those groups are each composed of two subgroups with distinct traits (highs and lows, whatever), then the average of group 1 is really a weighted average of its two subgroups. Heuristically, we tend to assume that those subgroups all have the same size, but that's not necessarily true, which can create the "paradox" in which group 1 has a higher average than group 2, but group 1a < group 2a and group 1b < group 2b.

The way to avoid it is to not group obviously different populations and then use the average to describe the distribution. But if that's already been done, the next question becomes whether or not there is some "pairing dimension" which explains some of the internal variation in both groups (a test for the average difference between paired data points is powerful if your data meet the conditions). Otherwise, consider a cluster analysis of some kind to see if you can break the larger group into its components (often statistically messy if you use multiple dimensions, easier when you use one or two and can do it by eye to some extent).
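
To make the weighted-average point concrete, here is a minimal Python sketch. The subgroup sizes and means are made-up numbers chosen purely for illustration (they are not data from this discussion), and `weighted_mean` is a hypothetical helper name.

```python
def weighted_mean(subgroups):
    """Overall mean of a group, given {name: (size, mean)} for its subgroups."""
    total_size = sum(size for size, _ in subgroups.values())
    return sum(size * mean for size, mean in subgroups.values()) / total_size


# Illustrative numbers only: (subgroup size, subgroup mean).
# Group 1 is mostly "highs"; group 2 is mostly "lows".
group_1 = {"high": (90, 80.0), "low": (10, 40.0)}
group_2 = {"high": (10, 85.0), "low": (90, 45.0)}

overall_1 = weighted_mean(group_1)
overall_2 = weighted_mean(group_2)

# Group 2 wins inside every subgroup...
for name in ("high", "low"):
    print(name, group_1[name][1], "<", group_2[name][1])

# ...yet group 1 has the higher blended average, because its mix is
# dominated by the high-scoring subgroup.
print("overall:", overall_1, ">", overall_2)

# With equal subgroup sizes the reversal goes away.
equal_1 = weighted_mean({k: (50, m) for k, (_, m) in group_1.items()})
equal_2 = weighted_mean({k: (50, m) for k, (_, m) in group_2.items()})
print("equal sizes:", equal_1, "<", equal_2)
```

Under these assumed numbers the blended averages are 76 and 49, even though group 2 is ahead by 5 points in both subgroups. Nothing about the subgroup means changes in the last step; only the mix does.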

Jordan Low wonders:

Isn't the simplest explanation that one test is comparing student GPAs (simply reflecting grades), without accounting for the fact that smarter students take harder classes?

William Huggins agrees:

100%. When I write letters of rec for grad school (for undergrads who came out of sometimes large courses), among other things I include contextual data including rank in class, t-stat relative to class mean, and a raw count of how often they scored above the mean on assessments. When they've taken more than one course with me (have had some reach 4), I run the paired t-test and report their average spread over my own course means as well. Recipients still have to benchmark "what a top student from that school" looks like in practice, but it's my (admittedly nerdy) way of helping others make sense of the letter grade on the related transcript.
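
As a rough illustration of the paired comparison described above, the sketch below computes the average spread and the paired t-statistic by hand, using only Python's standard library. The scores and course means are invented for the example, and the variable names are hypothetical; this is one plausible reading of the calculation, not the commenter's actual procedure.

```python
import math
import statistics

# Hypothetical data: one student's scores in four courses, each paired
# with the class mean for that same course.
student_scores = [88, 92, 79, 85]
course_means = [74, 81, 70, 77]

# Pairing compares like with like: each score against its own course mean.
diffs = [score - mean for score, mean in zip(student_scores, course_means)]

n = len(diffs)
mean_diff = statistics.mean(diffs)                # average spread over course means
std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std dev (n - 1) over sqrt(n)
t_stat = mean_diff / std_err                      # paired t-statistic, n - 1 degrees of freedom

print(f"average spread over course means: {mean_diff:.1f}")
print(f"paired t-statistic ({n - 1} df): {t_stat:.2f}")
```

With these made-up numbers the average spread is 10.5 points and the t-statistic is about 7.94. Turning that into a p-value needs the t distribution with n - 1 degrees of freedom, which the standard library does not provide; SciPy's `scipy.stats.ttest_rel` is one option if it is available.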
