本文介绍 A/B 实验过程中多重假设检验的问题，实验策略迭代优化的时候，可能会影响多个指标。业务上，我们希望能在保持住某些指标的情况下（约束指标），努力提升一些指标，很少有对所有指标都带来显著提升的情况。所以，在一次 A/B 实验中做了多个检验，其中一些指标向好，另一些指标向坏，在探索和利用的精细平衡之间，常常出现跷跷板现象。逐个检验指标还是同时检验多个指标以避免假阳性问题呢？当然是后者，在统计学上，这是一个多重假设检验的问题。

1 比例齐性检验

四个独立二项总体 \(\mathrm{Binom}(n_i, p_i),i=1,2,3,4\) 的比例齐性检验问题如下：

\[ H_0: p_1 = p_2 = p_3 = p_4 \quad H_1: \mathrm{至少有一个} p_i \mathrm{不一致} \]

下面考虑一个真实数据问题，数据集来自函数 prop.test() 的帮助文档。

实验数据
实验编号	总样本量（病人总数）	吸烟人数	吸烟比例
1	86（\(n_1\)）	83	0.965（\(p_1\)）
2	93（\(n_2\)）	90	0.968（\(p_2\)）
3	136（\(n_3\)）	129	0.949（\(p_3\)）
4	82（\(n_4\)）	70	0.854（\(p_4\)）
总计	397	372	0.937（\(\bar{p}\)）

smokers <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)

R 语言中函数 prop.test() 可以用来做独立二项总体的比例齐性检验。

prop.test(x = smokers, n = patients,
          alternative = "two.sided",
          conf.level = 0.95)

## 
## 	4-sample test for equality of proportions without continuity correction
## 
## data:  smokers out of patients
## X-squared = 13, df = 3, p-value = 0.006
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 
## 0.9651 0.9677 0.9485 0.8537

P 值小于 0.05，意味着拒绝 \(H_0\) ，至少有一个 \(p_i\) 与其它的不同。要搞清楚是哪两个样本的比例存在显著差异，需要两两逐对比较（pairwise comparisons）检验。

# 等价于
chisq.test(
  x = rbind(smokers, patients - smokers)
)

## 
## 	Pearson's Chi-squared test
## 
## data:  rbind(smokers, patients - smokers)
## X-squared = 13, df = 3, p-value = 0.006

实际上，是一个 \(4\times2\) 的列联表独立性检验，检验统计量的形式如下：

\[ \chi^2 = \frac{1}{(1-\bar{p})\bar{p}}\sum_{i=1}^{4}n_i(p_i-\bar{p})^2 \]

它服从自由度为 \((4-1)\times(2-1)=3\) 的卡方分布，这个检验统计量是如下一般形式的特例。

\[ \chi^2 = \sum_{i=1}^{4}\sum_{j=1}^{2}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} \]

关于上式的详细说明，以及函数 prop.test() 和 chisq.test() 的关系见文章A/B 实验之两样本的比率检验，此处不再赘述。

2 多重假设检验

函数 pairwise.prop.test() 可以做两两逐对比较的多重检验，四个组共计 \(4*(4-1)/2 = 6\) 个检验问题。如果每个检验问题都独立地看，显著性水平都控制在 0.05，也就是犯第一类错误的概率控制在 0.05。如果将每个检验看作一次事件，就是说检验出错的事件概率为 0.05。那 6 次检验至少有一个检验出错的概率为 1 减去 6 个检验都不出错的概率。

# 6 个检验都不出问题的概率
(1 - 0.05)^6 # 小于 0.95

## [1] 0.7351

# 至少有一个检验出错的概率
1 - (1 - 0.05)^6 # 大于 0.05

## [1] 0.2649

显然，远大于 0.05。同时检验这 6 个问题，不能每个检验问题的显著性水平都设置为 0.05 ，而是要 6 个检验问题整体的显著性水平（犯第一类错误的概率）控制在 0.05 。一个自然的想法是将每个检验的显著性水平降低，比如为 \(\alpha/k\) ，\(k\) 是检验的个数。这是为了控制假阳性率，这需要调整每个检验的 P 值，函数 pairwise.prop.test() 默认采用的 P 值矫正的方法是 holm 。

# R 语言软件内置的 P 值矫正方法如下
p.adjust.methods

## [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
## [6] "BY"         "fdr"        "none"

# 默认方法 holm
pairwise.prop.test(x = smokers, n = patients)

## Warning in prop.test(x[c(i, j)], n[c(i, j)], ...): Chi-squared approximation
## may be incorrect
## Warning in prop.test(x[c(i, j)], n[c(i, j)], ...): Chi-squared approximation
## may be incorrect
## Warning in prop.test(x[c(i, j)], n[c(i, j)], ...): Chi-squared approximation
## may be incorrect

## 
## 	Pairwise comparisons using Pairwise comparison of proportions 
## 
## data:  smokers out of patients 
## 
##   1    2    3   
## 2 1.00 -    -   
## 3 1.00 1.00 -   
## 4 0.12 0.09 0.12
## 
## P value adjustment method: holm

计算结果输出 6 个检验问题的 P 值。从中可以看出，第 4 组和第 2 组的比率相较其他组之间的差异更大，接近统计显著性水平 0.05。输出的警告是函数调用 prop.test() 两两检验时发出的，表明一个列联表中有的格子的 E 值（期望值，见前面的公式）小于 5，三次警告说明有三组统计检验对应的列联表中有 E 值小于 5。当我们逐个单独检验时，能发现那两组差异是最大的。

# 多重检验对单组检验的 P 值进行了调整
x = c(83, 90, 129, 70)
n = c(86, 93, 136, 82)
compare.levels <- function(i, j) {
   prop.test(x[c(i, j)], n[c(i, j)])$p.value
}
level.names = names(x) %||% seq_along(x)
pairwise.table(compare.levels, level.names, p.adjust.method = "holm")

## Warning in prop.test(x[c(i, j)], n[c(i, j)]): Chi-squared approximation may be
## incorrect
## Warning in prop.test(x[c(i, j)], n[c(i, j)]): Chi-squared approximation may be
## incorrect
## Warning in prop.test(x[c(i, j)], n[c(i, j)]): Chi-squared approximation may be
## incorrect

##        1       2      3
## 2 1.0000      NA     NA
## 3 1.0000 1.00000     NA
## 4 0.1186 0.09322 0.1238

多重检验对单组检验的 P 值进行了调整。

2.1 Holm 调整

调整过程分三步：

将所有假设（共 \(m\) 个），做检验并计算 P 值，将 P 值由小到大排列，记住重排 P 值后，P 值对应的假设，显著性水平 \(\alpha=0.05\) 。
逐步比较
- 从最小的 P 值开始，对于第 \(k\) 个排序后的 P 值 \(p^{(k)}\) ，计算阈值 \(\alpha/(m - k + 1)\)
- 如果 \(p^{(k)}\) 小于阈值 \(\alpha/(m - k + 1)\) ，则拒绝该 P 值对应的假设，并继续到下一个 \(k\) ，否则，停止，不再拒绝该假设及之后的所有假设。
结论：所有被拒绝的假设都视为显著，未被拒绝的假设保持现状。

# 第二组和第四组 
(m <- prop.test(x = c(90, 70), n = c(93, 82),
          alternative = "two.sided"))

## 
## 	2-sample test for equality of proportions with continuity correction
## 
## data:  c(90, 70) out of c(93, 82)
## X-squared = 5.9, df = 1, p-value = 0.02
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.0181 0.2101
## sample estimates:
## prop 1 prop 2 
## 0.9677 0.8537

m$p.value

## [1] 0.01554

第二组和第四组的 P 值是最小的，但是它并没有小于阈值 \(1/6*\alpha\) ，所以它以及 P 值比它大的组都不能被拒绝。

# P 值 m$p.value 被 holm 方法调整后的值
p.adjust(m$p.value, method = "holm", n = 6)

## [1] 0.09322

调整后的 P 值，凡是比 0.05 大都不能视为显著。

m$p.value * 6

## [1] 0.09322

3 参考文献

示例数据来自书籍《Statistical Methods for Rates & Proportions》（第三版）Joseph L. Fleiss, 189 页. 打包在 R 环境中，见函数 prop.test() 或 chisq.test() 的帮助文档。

A/B 实验之多重假设检验

黄湘云

1 比例齐性检验

2 多重假设检验

2.1 Holm 调整

3 参考文献