文章/答案/技术大牛

发布

社区首页 >问答首页 >基于多列SAS的百分比计算

问基于多列SAS的百分比计算
EN

Stack Overflow用户

提问于 2014-05-21 05:19:22

回答 3查看 1.3K关注 0票数 2

我有一个包含patientID的数据，下一列包含疾病，其中包含多个用逗号分隔的类别。我需要找到每个疾病类别的患者总数和每个类别的患者百分比。我尝试了正常的方法，它给出了正确的频率，但不是百分比。

数据如下所示。

ID    Type_of_illness  
4   lf13  
5   lf5,lf11    
63      
13  lf12    
85      
80      
15      
20      
131 lf6,lf7,lf12  
22      
24      
55  lf12  
150 lf12  
34  lf12  
49  lf12  
151 lf12  
60      
74      
88      
64      
82  lf13  
5   lf5,lf7  
112     
87  lf17  
78      
79  lf16  
83  lf11

空白处代表的不是疾病。我首先将疾病分离到单独的列中，但随后被困在那里，不知道如何处理以找出百分比。

我写的代码如下：

    Data new;
    set old;
    array P(3) L1 L2 L3;
    do i to dim(p);
    p(i)=scan(type_of_illness,i,',');
    end;
    run;

然后我创建了一个新的专栏，将所有的疾病复制到它上面，所以我认为它会给我正确的频率，但它没有给我正确的百分比。

data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;

我必须创建一个表，类似于

*Total numer of patients    Percent*  
.......................................   
lf5         
lf7         
lf6         
lf11            
lf12            
lf13

请帮帮忙。

sas

回答 3

Stack Overflow用户

发布于 2014-05-21 15:44:52

您正在尝试输出非互斥组(每种疾病)的百分比。在SAS中，如何做到这一点并不明显。

以下代码采用Joe的输入代码，但在从事件数据确定百分比时采用了另一种方法(如果您愿意，可以使用“long”数据集)。我更喜欢这样，而不是在患者层面上为每种疾病创建一个二进制变量(一个“广泛的”数据集)，因为对我来说，这很快就会变得笨拙。也就是说，如果你继续做一些建模，那么一个“宽”的数据集通常更有用。

下面的代码产生如下输出：

|                       |  Pats  |  Pats  |        |        |  Mean  |        |        |
|                       | with 0 |with 1+ | % with |  Num   | events |        |        |
|                       |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11                   |      24|       2|       8|       2|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12                   |      19|       7|      27|       7|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13                   |      24|       2|       8|       2|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16                   |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17                   |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5                    |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6                    |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7                    |      24|       2|       8|       2|     1.0|    0.00|       1|
---------------------------------------------------------------------------------------|

请注意，患者5在您的疾病lf5数据中重复出现。我的代码只计算这条记录一次。如果是慢性病，这是可以的，但如果是急性的，就不行了。此外，我的代码还包括没有事件的分母中的患者。

最后，您可以在mycodestock.com代码共享站点=> https://mycodestock.com/public/snippet/11251上看到使用日期的代码的另一个示例

下面是上表的代码：

options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";

data have;
  format type_of_illness $30.;
  infile datalines truncover;
  input ID Type_of_illness $;
  datalines;
  4 lf13
  5 lf5,lf11
  63
  13 lf12
  85
  80
  15
  20
  131 lf6,lf7,lf12
  22
  24
  55 lf12
  150 lf12
  34 lf12
  49 lf12
  151 lf12
  60
  74
  88
  64
  82 lf13
  5 lf5,lf7
  112
  87 lf17
  78
  79 lf16
  83 lf11
  ;;;;
proc sort;
  by id;
run;

**  Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
  by id;
run;

**  Create event table (1 row per patient*event);
**  NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
  set have;
  attrib grp length = $5 label = 'Illness';

  do i = 1 to countc(type_of_illness, ',') + 1;
    grp = scan(type_of_illness, i, ',');
    if grp ne '' then output;
  end;
run;

**  Count the number of events each patient had for each grp;
**  NOTE: The NODUPKEY in the PROC SORT remove duplicate records (within PAT & GRP);
**  NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort in = events out = perc2_summ_grp_pat nodupkey;
  by grp id;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
  by grp;
  class id;
  output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;

**  Add a denominator variable - value '1' for each row.;
**  Ensure when num_events = 0 the value is set to missing;  
**  Create a flag variable - set to 1 - if a patient has a record (no matter how many);  
data perc2_summ_grp_pat;
  set perc2_summ_grp_pat;
  denom = 1;
  if num_events = 0 then num_events = .;
  flg_scripts = ifn(num_events, 1, .);
run;

proc tabulate data = perc2_summ_grp_pat format=comma8.;
  title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
  title2 "Units: Patients - within each group level";
  title3 "The statistics summarises the number of events (not whether a patient had at least 1 event)";
  title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";

  class grp;
  var denom flg_scripts num_events;
  table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record') 
                num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;

票数 1

Stack Overflow用户

发布于 2014-05-21 05:42:40

你这样做是对的，但你需要选择不同的百分比。通常百分比是“整个数据集的百分比”，这意味着它将是基数的三倍。你想要基于疾病的百分比。这意味着每种疾病都需要1/0。

一个缺点是您的自动表中有0；您将不得不将表输出到数据集并删除它们，然后继续打印/报告结果数据集以仅获得1-或者使用proc SQL生成表。

data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;

data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
  if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
  else L[_t]=0;
end;
run;

proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;

票数 0

Stack Overflow用户

发布于 2014-05-21 22:15:01

多标签格式解决方案很有趣，因此我将单独介绍它。

使用相同的have，我们创建了一种格式，它接受每种疾病的组合，并为其中的每种疾病输出一行，即，如果您有"1,2,3"，它将输出行

1,2,3 = 1
1,2,3 = 2
1,2,3 = 3

启用多标签格式并使用class-enabled proc (如proc tabulate )，然后您可以使用它来允许每个受访者计算每个标签值，但不会多次计入总数。

data for_procformat; 
set have;
start=type_of_illness;                     *start is the input to the format;
hlo=' m';                                  *m means multilabel, adding a space 
                                            here to leave room for the o later;
type='c';                                  *character format - n is numeric;
fmtname='$ILLF';                           *whatever name you like;
do _t = 1 to countw(type_of_illness,',');  *for each 'word' do this once;
  label=scan(type_of_illness,_t,',');      *label is the 'result' of the format;
  if not missing(label) then output;       
end;
if _n_=1 then do;                          *this block adds a row to deal with values;
  hlo='om';                                *not defined (in this case, just missings);
  label='No Illness';                      *the o means 'other';
  output;
end;
run;

proc sort data=for_procformat nodupkey;    *remove duplicates (which there will be many);
by start label;
run;

proc format cntlin=for_procformat;         *import the formats;
quit;

proc tabulate data=have;
class type_of_illness/mlf missing ;        *mlf means multilabel formats;
format type_of_illness $ILLF.;             *apply said format;
tables type_of_illness,n pctn;             *and your table;
run;

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/23770153

复制

相似问题

问基于多列SAS的百分比计算
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问基于多列SAS的百分比计算EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问基于多列SAS的百分比计算
EN