临床研究的最后一道防线（四）：倾向性评分匹配PSM在Python的实现

文章来源：企鹅号 - 小胖统计

临床研究的最后一道防线（四）：倾向性评分匹配(propensity score matching, PSM) 在Python的实现

No.25介绍了SPSS实现倾向性评分匹配(propensity score matching, PSM)的具体流程，如果用于基本临床试验或者发表论文已经足够，但当进阶进行1:2或者1：N多重有放回匹配时SPSS的劣势就暴露无遗，因此这一讲围绕脚本语言Python实现PSM的流程进行详细探讨。

运算前Python需要安装numpy、scipy、pandas、scikit-learn与PSM算法数据包ctmatching-0.0.6-source.zip (md5)，具体安装流程与细节参见24讲。数据文件仍然选择25讲里使用的re78。

进入脚本界面，首先引入上述数据包：

import pandas as pd

import numpy as np

from ctmatching import psm, load_re78

载入数据集re78：

control, treatment = load_re78()

使用len函数可看到处理组treatment的观察数目为185，对照组control的观察数目为429。

len(control)

Out[78]: 429

len(treatment)

Out[79]: 185

使用help函数查阅psm的说明：help（psm）

psm(control, treatment, use_col=None, stratify_order=None, independent=True, k=1)

Propensity score matching main function.

If you want to know the inside of the psm algorithm, check

:func:`stratified_matching`, :func:`non_stratified_matching`,

:func:`non_repeat_index_matching`, :func:`independent_index_matching`.

otherwise, just read the parameters' definition.

Suppose we have m1 control samples, m2 treatment samples. Sample is n-dimension vector.

:param control: control group sample data, m1 x n matrix. Example::

[[c1_1, c1_2, ..., c1_n], # c means control

[c2_1, c2_2, ..., c2_n],

...,

[cm1_1, cm1_2, ..., cm1_n],]

:type control: numpy.ndarray

:param treatment: control group sample data, m2 x n matrix. Example::

[[t1_1, t1_2, ..., t1_n], # t means treatment

[t2_1, t2_2, ..., t2_n],

...,

[tm1_1, tm1_2, ..., tm1_n],]

:type treatment: numpy.ndarray

:param use_col: (default None, use all) list of column index. Example::

[0, 1, 4, 6, 7, 9] # use first, second, fifth, ... columns

:type use_col: list or numpy.ndarray

:param stratify_order: (default None, use normal nearest neighbor)

list of list. Example::

# for input data has 6 columns

# first feature has highest priority

# [second, third, forth] features' has second highest priority by mean of euclidean distance

# fifth feature has third priority, ...

[[0], [1, 2, 3], [4], [5]]

:type stratify_order: list of list

:param independent: (default True), if True, same treatment sample could be matched to different control sample.

:type independent: Boolean

:param k: (default 1) Number of samples selected from control group.

:type k: int

:returns selected_control_index: all control sample been selected for

entire treatment group.

:returns selected_control_index_for_each_treatment: selected control sample for each treatment sample.

selected_control_index: selected control sample index. Example (k = 3)::

(m2 * k)-length array: [7, 120, 43, 54, 12, 98, ..., 71, 37, 14]

selected_control_index_for_each_treatment: selected control sample index for each treatment sample. Example (k = 3)::

# for treatment[0], we have control[7], control[120], control[43]

# matched by mean of stratification.

[[7, 120, 43],

[54, 12, 98],

...,

[71, 37, 14],]

:raises InputError: if the input parameters are not legal.

:raises NotEnoughControlSampleError: if don't have sufficient data for independent index matching.

头晕吧！简单一点。

selected_control, selected_control_each_treatment = psm(

control, treatment, use_col=[1,2,3,4,5,6], stratify_order=None,

independent=True, k=2)

psm需要调用的五个关键参数是：对照组（control）、处理组（treatment）、匹配的变量（use_col）, 匹配优先级stratify_order，independent：if True, same treatment sample could be matched to different control sample，即可以进行有放回多重匹配，k为一个处理组匹配的对照组个数，这里选为2，即采用1:2匹配。

selected_control为选择出的对照组，selected_control_each_treatment为处理组匹配的对照组编号。

进行for循环嵌套，外置循环treatment_sample在treatment中进行，index为对照组编号。内循环目的是寻找匹配出的对照组control[index[i]]（i的取值范围为0,1）。

for treatment_sample, index in zip(treatment, selected_control_each_treatment):

print treatment_sample

print("matches")

for i in range(2):

print control[index[i]]

以下是结果，匹配出来后可进行后续的进一步比较分析，这里不再罗列。

[u'NSW183', 1, 35, 9, 1, 0, 1, 1, 13602.43, 13830.64, 12803.97]

matches

[u'PSID27', 0, 36, 9, 1, 0, 1, 1, 13256.4, 8457.484, 0.0]

[u'PSID6', 0, 37, 9, 1, 0, 1, 1, 13685.48, 12756.05, 17833.2]

=======================================

[u'NSW184', 1, 35, 8, 1, 0, 1, 1, 13732.07, 17976.15, 3786.628]

matches

[u'PSID27', 0, 36, 9, 1, 0, 1, 1, 13256.4, 8457.484, 0.0]

[u'PSID6', 0, 37, 9, 1, 0, 1, 1, 13685.48, 12756.05, 17833.2]

=======================================

[u'NSW185', 1, 33, 11, 1, 0, 1, 1, 14660.71, 25142.24, 4181.942]

matches

[u'PSID380', 0, 34, 12, 1, 0, 1, 0, 0.0, 0.0, 18716.88]

[u'PSID293', 0, 31, 12, 1, 0, 1, 0, 0.0, 42.96774, 11023.84]

发表于: 2018-03-122018-03-12 21:06:23
原文链接：http://kuaibao.qq.com/s/20180312G1LB7G00?refer=cp_1026
腾讯「腾讯云开发者社区」是腾讯内容开放平台帐号（企鹅号）传播渠道之一，根据《腾讯内容开放平台服务协议》转载发布内容。
如有侵权，请联系 cloudcommunity@tencent.com 删除。

扫码

添加站长进交流群

领取专属 10元无门槛券

私享最新 技术干货

临床研究的最后一道防线（四）：倾向性评分匹配PSM在Python的实现

相关快讯

扫码

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐