我有一个包含patientID的数据,下一列包含疾病,其中包含多个用逗号分隔的类别。我需要找到每个疾病类别的患者总数和每个类别的患者百分比。我尝试了正常的方法,它给出了正确的频率,但不是百分比。
数据如下所示。
ID Type_of_illness
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
空白处代表的不是疾病。我首先将疾病分离到单独的列中,但随后被困在那里,不知道如何处理以找出百分比。
我写的代码如下:
Data new;
set old;
array P(3) L1 L2 L3;
do i to dim(p);
p(i)=scan(type_of_illness,i,',');
end;
run;
然后我创建了一个新的专栏,将所有的疾病复制到它上面,所以我认为它会给我正确的频率,但它没有给我正确的百分比。
data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;
我必须创建一个表,类似于
*Total numer of patients Percent*
.......................................
lf5
lf7
lf6
lf11
lf12
lf13
请帮帮忙。
发布于 2014-05-21 15:44:52
您正在尝试输出非互斥组(每种疾病)的百分比。在SAS中,如何做到这一点并不明显。
以下代码采用Joe的输入代码,但在从事件数据确定百分比时采用了另一种方法(如果您愿意,可以使用“long”数据集)。我更喜欢这样,而不是在患者层面上为每种疾病创建一个二进制变量(一个“广泛的”数据集),因为对我来说,这很快就会变得笨拙。也就是说,如果你继续做一些建模,那么一个“宽”的数据集通常更有用。
下面的代码产生如下输出:
| | Pats | Pats | | | Mean | | |
| | with 0 |with 1+ | % with | Num | events | | |
| |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12 | 19| 7| 27| 7| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7 | 24| 2| 8| 2| 1.0| 0.00| 1|
---------------------------------------------------------------------------------------|
请注意,患者5在您的疾病lf5数据中重复出现。我的代码只计算这条记录一次。如果是慢性病,这是可以的,但如果是急性的,就不行了。此外,我的代码还包括没有事件的分母中的患者。
最后,您可以在mycodestock.com代码共享站点=> https://mycodestock.com/public/snippet/11251上看到使用日期的代码的另一个示例
下面是上表的代码:
options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
proc sort;
by id;
run;
** Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
by id;
run;
** Create event table (1 row per patient*event);
** NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
set have;
attrib grp length = $5 label = 'Illness';
do i = 1 to countc(type_of_illness, ',') + 1;
grp = scan(type_of_illness, i, ',');
if grp ne '' then output;
end;
run;
** Count the number of events each patient had for each grp;
** NOTE: The NODUPKEY in the PROC SORT remove duplicate records (within PAT & GRP);
** NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort in = events out = perc2_summ_grp_pat nodupkey;
by grp id;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
by grp;
class id;
output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;
** Add a denominator variable - value '1' for each row.;
** Ensure when num_events = 0 the value is set to missing;
** Create a flag variable - set to 1 - if a patient has a record (no matter how many);
data perc2_summ_grp_pat;
set perc2_summ_grp_pat;
denom = 1;
if num_events = 0 then num_events = .;
flg_scripts = ifn(num_events, 1, .);
run;
proc tabulate data = perc2_summ_grp_pat format=comma8.;
title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
title2 "Units: Patients - within each group level";
title3 "The statistics summarises the number of events (not whether a patient had at least 1 event)";
title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";
class grp;
var denom flg_scripts num_events;
table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record')
num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;
发布于 2014-05-21 05:42:40
你这样做是对的,但你需要选择不同的百分比。通常百分比是“整个数据集的百分比”,这意味着它将是基数的三倍。你想要基于疾病的百分比。这意味着每种疾病都需要1/0。
一个缺点是您的自动表中有0;您将不得不将表输出到数据集并删除它们,然后继续打印/报告结果数据集以仅获得1-或者使用proc SQL生成表。
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;
data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
else L[_t]=0;
end;
run;
proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;
发布于 2014-05-21 22:15:01
多标签格式解决方案很有趣,因此我将单独介绍它。
使用相同的have,我们创建了一种格式,它接受每种疾病的组合,并为其中的每种疾病输出一行,即,如果您有"1,2,3",它将输出行
1,2,3 = 1
1,2,3 = 2
1,2,3 = 3
启用多标签格式并使用class
-enabled proc (如proc tabulate
),然后您可以使用它来允许每个受访者计算每个标签值,但不会多次计入总数。
data for_procformat;
set have;
start=type_of_illness; *start is the input to the format;
hlo=' m'; *m means multilabel, adding a space
here to leave room for the o later;
type='c'; *character format - n is numeric;
fmtname='$ILLF'; *whatever name you like;
do _t = 1 to countw(type_of_illness,','); *for each 'word' do this once;
label=scan(type_of_illness,_t,','); *label is the 'result' of the format;
if not missing(label) then output;
end;
if _n_=1 then do; *this block adds a row to deal with values;
hlo='om'; *not defined (in this case, just missings);
label='No Illness'; *the o means 'other';
output;
end;
run;
proc sort data=for_procformat nodupkey; *remove duplicates (which there will be many);
by start label;
run;
proc format cntlin=for_procformat; *import the formats;
quit;
proc tabulate data=have;
class type_of_illness/mlf missing ; *mlf means multilabel formats;
format type_of_illness $ILLF.; *apply said format;
tables type_of_illness,n pctn; *and your table;
run;
https://stackoverflow.com/questions/23770153
复制相似问题