Simpson’s paradox

March 19, 2024

Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined. This result is often encountered in social-science and medical-science statistics, and is particularly problematic when frequency data are unduly given causal interpretations. The paradox can be resolved when confounding variables and causal relations are appropriately addressed in the statistical modeling (e.g., through cluster analysis). Simpson's paradox has been used to illustrate the kind of misleading results that the misuse of statistics can generate.
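A small, self-contained illustration in plain Python (the counts are invented for illustration) shows how the reversal can appear once subgroups are pooled:

    # Invented counts: treatment A has the higher success rate inside each
    # subgroup, yet treatment B looks better once the subgroups are pooled.
    groups = {
        # subgroup: {treatment: (successes, trials)}
        "mild":   {"A": (81, 87),   "B": (234, 270)},
        "severe": {"A": (192, 263), "B": (55, 80)},
    }

    totals = {"A": [0, 0], "B": [0, 0]}
    for name, arms in groups.items():
        for treatment, (s, n) in arms.items():
            totals[treatment][0] += s
            totals[treatment][1] += n
            print(f"{name:>6} {treatment}: {s / n:.1%}")

    for treatment, (s, n) in totals.items():
        print(f"pooled {treatment}: {s / n:.1%}")

Within "mild" and within "severe", A wins (93.1% vs 86.7% and 73.0% vs 68.8%), but the pooled rates come out 78.0% for A and 82.6% for B, because the two treatments are applied to the subgroups in very different proportions.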

Lance Bialas comments:

Exactly the challenge we face at the macro level right now. Averages only give us an opaque picture: real incomes, mortgage coupons, debt servicing.

William Huggins writes:

i teach people to use paired t-tests as a way of sidestepping the "inappropriate grouping" that causes Simpson's paradox. if you blend up groups rather than comparing like to like, it's easy to make a mistake.

Zubin Al Genubi asks:

Would a larger sample size help avoid the paradox? Or higher confidence levels?

William Huggins answers:

t-tests (difference of means) aren't going to solve the problem because they compare the average of group 1 to the average of group 2. if those groups are composed of two subgroups with distinct traits (highs and lows, whatever) then the average of group 1 is really a weighted average of its two sub-groups. heuristically, we tend to assume that those subgroups all have the same size, but it's not necessarily true, which can create the "paradox" in which group 1 has a higher average than group 2, but group 1a < group 2a, and group 1b < group 2b.
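A quick worked example of that weighted-average effect (numbers invented):

    # (mean, n) for the two subgroups inside each group. group 2 is higher in
    # both subgroups, but most of its members sit in the low-scoring subgroup.
    subgroups = {
        "group 1": {"a": (80, 90), "b": (40, 10)},
        "group 2": {"a": (85, 10), "b": (45, 90)},
    }

    for group, parts in subgroups.items():
        overall = sum(m * n for m, n in parts.values()) / sum(n for _, n in parts.values())
        print(group, "subgroup means:", {k: m for k, (m, _) in parts.items()}, "overall:", overall)

Group 2 beats group 1 in both subgroups (85 > 80 and 45 > 40), yet group 1's overall mean comes out 76 versus group 2's 49, purely because the weights differ.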

the way to avoid it is to not group obviously different populations and then use the average to describe the distribution. but if that's already been done, the next question becomes whether or not there is some "pairing dimension" which explains some of the internal variation in both groups (a test for average difference between paired data points is powerful if your data meets the conditions). otherwise, consider a cluster analysis of some kind to see if you can't break the larger group into its components (often statistically messy if you use multiple dimensions, easier when you use one or two and can do it by eye to some extent).
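A minimal sketch of the pairing-dimension idea, assuming numpy and scipy are available (the subjects and measurements are simulated):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # 30 hypothetical subjects measured under two conditions. each subject has
    # its own baseline (the pairing dimension); condition 2 adds a true +2.
    baseline = rng.normal(50, 15, size=30)
    cond1 = baseline + rng.normal(0, 2, size=30)
    cond2 = baseline + 2 + rng.normal(0, 2, size=30)

    # unpaired comparison: large between-subject variation tends to swamp the effect
    print(stats.ttest_ind(cond1, cond2))

    # paired comparison: differencing within each subject removes that variation
    print(stats.ttest_rel(cond1, cond2))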

Jordan Low wonders:

Isn't the simplest explanation that one test is comparing student GPAs (simply reflecting grades), without accounting for the fact that smarter students take harder classes?

William Huggins agrees:

100%. when I write letters of rec for grad school (for undergrads who sometimes come out of large courses), among other things I include contextual data: rank in class, t-stat relative to the class mean, and a raw count of how often they scored above the mean on assessments. when they've taken more than one course with me (some have taken as many as 4), I run the paired t-test and report their average spread over my own course means as well. recipients still have to benchmark what "a top student from that school" looks like in practice, but it's my (admittedly nerdy) way of helping others make sense of the letter grades on the transcript.
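For concreteness, a rough sketch of those contextual statistics (the scores are simulated, and "t-stat relative to the class mean" is read here as a standardized distance, which may differ from the exact statistic being reported):

    import numpy as np

    rng = np.random.default_rng(1)
    n_students, n_assessments = 120, 6

    # rows = students, columns = assessments; the student being recommended is row 0
    scores = rng.normal(78, 10, size=(n_students, n_assessments))
    scores[0] += 8                              # give that student a consistent edge

    student_avg = scores[0].mean()
    course_avgs = scores.mean(axis=1)           # each student's course average
    assessment_means = scores.mean(axis=0)      # class mean on each assessment

    rank = 1 + int(np.sum(course_avgs > student_avg))
    z = (student_avg - course_avgs.mean()) / course_avgs.std(ddof=1)
    above = int(np.sum(scores[0] > assessment_means))

    print(f"rank in class: {rank} of {n_students}")
    print(f"standardized distance from the class mean: {z:+.2f}")
    print(f"assessments above the class mean: {above} of {n_assessments}")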


Comments


1 Comment so far

  1. junggun lim on March 27, 2024 5:33 am

    To address Simpson’s Paradox, consider the following approaches:

    Stratification: This involves breaking down the dataset into different strata or layers that are homogeneous within themselves. By analyzing each stratum separately, you can understand the true nature of the relationship within each subgroup. This approach is somewhat similar to clustering, as it involves grouping data, but stratification often relies on known characteristics or categories within the dataset.

    Multivariate Analysis: Conducting a multivariate analysis, such as regression analysis with multiple variables, can help control for confounding variables that might be responsible for Simpson’s Paradox. By including these variables in your analysis, you can get a more accurate picture of the relationships between your primary variables of interest (see the sketch after this list).

    Causal Analysis: Employing causal analysis techniques, such as causal diagrams or structural equation modeling, can help in understanding the underlying causal relationships between variables. This can be particularly useful in distinguishing between correlation and causation and understanding if the observed paradox is due to a confounding factor.

    Longitudinal Analysis: If the data allows, conducting a longitudinal analysis to observe how trends change over time within each group can provide insights that might resolve the paradox. Sometimes the paradox arises from cross-sectional data analysis, and a temporal perspective can offer clarity.

    Data Segmentation: Similar to clustering, data segmentation involves dividing the data into more homogenous segments based on certain criteria. By analyzing these segments separately, one can understand the nuances of the data that might be obscured when aggregating across all segments.
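    A minimal sketch of the stratification and multivariate-adjustment points above, assuming numpy, pandas, and statsmodels are available (the data are simulated, with a confounder z that drives both the exposure x and the outcome y):

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(42)
        n = 1000

        z = rng.binomial(1, 0.5, n)                  # confounder, e.g. a severity stratum
        x = rng.binomial(1, 0.2 + 0.6 * z, n)        # exposure, assigned more often when z = 1
        y = 2.0 * x - 5.0 * z + rng.normal(0, 1, n)  # true effect of x on y is +2

        df = pd.DataFrame({"x": x, "y": y, "z": z})

        # naive pooled estimate ignores z; here the sign even reverses
        print(smf.ols("y ~ x", df).fit().params["x"])

        # stratified estimates (item 1): roughly +2 within each stratum
        for value, stratum in df.groupby("z"):
            print("z =", value, smf.ols("y ~ x", stratum).fit().params["x"])

        # adjusted estimate (item 2): roughly +2 once z is included
        print(smf.ols("y ~ x + z", df).fit().params["x"])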
