Analyze and Manage Your Output with PROC MEANS

NESUG 2012
Coders' Corner
Si Gao and Xin Li, University at Albany, SUNY
ABSTRACT
PROC MEANS is among the most flexible and powerful procedures in the SAS® system. It lets one analyze data and
generate and output descriptive statistics efficiently and in a straightforward fashion. This paper explores one
application of PROC MEANS in depth. Sometimes we need to provide summary statistics at different grouping levels
(e.g. male, female and both). Instead of working at each grouping level separately and merging the results later,
PROC MEANS can generate a combined output data set directly. As illustrated by an example with health data,
PROC MEANS is especially convenient when there are many grouping levels. The code provided can be easily
revised to fit a variety of practical projects. This paper offers insight into a specific application of
PROC MEANS, in addition to the broad range of existing applications in the literature.
DESCRIPTION
PROC MEANS is a flexible procedure that is widely used to analyze data and produce descriptive statistics.
What many programmers may overlook is its equally powerful ability to manage and generate output data sets.
This paper illustrates its application in data management through a practical example. Suppose we have a
data set of lengths of stay for hospitals across states. We want to generate an output data set that
contains statistics on the length of stay, e.g. frequency, median and 90th percentile, broken down by state,
hospital and age group (youth and adult) (see Fig. 1). The coding becomes harder if, besides the youth and
adult groups, we also want an additional row of statistics for all ages combined for each state and
hospital (as illustrated in Fig. 2). The output data set then contains several different combinations of
the class variables. There are several ways to produce such output, but it is hard to avoid running multiple
procedures to generate each combination first and then merging all levels together. Fortunately, with
PROC MEANS this task becomes straightforward and can be accomplished in only a few lines of code.
Data:
Fig. 1
Final output data:
Fig. 2
STEP 1. GENERATE STATISTICS
Assume we have raw data listing recipients with their hospital (provider), state, age group and length of
stay at the hospital, saved as ‘sample’. The code below generates several descriptive statistics and saves the
output data in ‘sample_stat’.
PROC MEANS DATA=sample NOPRINT;
    VAR length_stay;
    CLASS state age_group prov_name;
    OUTPUT OUT=sample_stat
        MEDIAN=Median_days
        Q1=Lower_Quartile
        Q3=Upper_Quartile
        P90=_90th_PCTL;
RUN;
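To try the PROC MEANS step above without the original data, a small stand-in for ‘sample’ can be built first. The sketch below is only an illustration: the state codes, provider names and lengths of stay are invented, and only the variable names and the age group labels (1-Adult, 2-Youth) come from the paper. The character lengths are an assumption, chosen wide enough to hold the labels assigned later in Step 3.

```sas
/* Hypothetical toy data; all values below are invented for illustration. */
DATA sample;
    /* Lengths wide enough for labels such as '3-All Ages' used in Step 3 */
    LENGTH state $ 12 prov_name $ 30 age_group $ 10;
    INFILE DATALINES DLM=',' DSD;
    INPUT state $ prov_name $ age_group $ length_stay;
    DATALINES;
NY,Hospital A,1-Adult,5
NY,Hospital A,2-Youth,2
NY,Hospital B,1-Adult,7
NY,Hospital B,2-Youth,3
NJ,Hospital C,1-Adult,4
NJ,Hospital C,2-Youth,1
;
RUN;
```

Running this DATA step before the PROC MEANS step gives a ‘sample_stat’ data set small enough to inspect by eye.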
The output data set contains every possible combination of the class variables we specify, as shown in Fig. 3.
The column _TYPE_ is a variable generated automatically by SAS® that identifies which combination of the three
class variables a row belongs to. If a class variable is missing (blank) in a row, that row holds statistics
aggregated over all levels of that variable, broken down by the remaining class variables. For example, rows
with _TYPE_=6 show the state totals for each age group. Similarly, rows with _TYPE_=4 show the state totals
for all ages.
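One way to make the _TYPE_ values easier to read is the CHARTYPE option of PROC MEANS, which writes _TYPE_ as a character string of 0s and 1s, one position per CLASS variable in the order listed. The following is a minimal sketch and not part of the original code; the output name ‘sample_stat_chk’ is hypothetical.

```sas
/* Sketch: same CLASS request as Step 1, but with _TYPE_ written as a bit string. */
PROC MEANS DATA=sample NOPRINT CHARTYPE;
    VAR length_stay;
    CLASS state age_group prov_name;
    OUTPUT OUT=sample_stat_chk MEDIAN=Median_days;
RUN;

/* List each distinct _TYPE_ value once, with the number of rows it covers. */
PROC FREQ DATA=sample_stat_chk;
    TABLES _TYPE_ / NOCUM NOPERCENT;
RUN;
```

With this CLASS order, the string '110' corresponds to numeric _TYPE_=6 (state and age_group present, prov_name aggregated) and '100' corresponds to _TYPE_=4.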
The column ‘_TYPE_’ is the key variable we need in order to manage and build the output data set from the raw
data set. Since we want the output to contain statistics for each age group and for all ages combined (both
1-Adult and 2-Youth in ‘AGE_GROUP’) for every hospital, each state total and the nationwide aggregate, we need to
keep only certain combinations of the class variables ‘STATE’, ‘AGE_GROUP’ and ‘PROV_NAME’.
Fig. 3
STEP 2. KEEP DESIRABLE COMBINATIONS OF CLASS VARIABLES
Inspecting the output data, we find that the desired combinations are those with _TYPE_ equal to 0, 2, 4, 5, 6
and 7, as in Fig. 3. Note that the order in which variables are listed in the CLASS statement matters! If you
change the order of the three class variables, the same _TYPE_ values may indicate entirely different
combinations. This is a common source of mistakes among SAS® users. Although SAS® users may be familiar with the
rule by which ‘_TYPE_’ values are assigned, we suggest that programmers open the output data set and visually
check the combinations of class variables. Instead of Step 1 and Step 2, there is also an alternative way to
generate output data with only the desired combinations of class variables, with the help of the WAYS statement.
We will illustrate it shortly.
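As a sketch of how the combinations can be requested up front, the example below uses the TYPES statement of PROC MEANS rather than WAYS. WAYS selects combinations by the number of class variables involved, whereas TYPES names individual combinations, which fits the set 0, 2, 4, 5, 6 and 7 more directly. This is an illustration under that substitution, not the authors' own code; writing the result to ‘sample_out’ so that Step 3 can read it is likewise an assumption.

```sas
/* Sketch: request only the wanted combinations with TYPES,
   so no filtering on _TYPE_ is needed afterwards. */
PROC MEANS DATA=sample NOPRINT;
    VAR length_stay;
    CLASS state age_group prov_name;
    TYPES ()                          /* _TYPE_=0: nationwide, all ages     */
          age_group                   /* _TYPE_=2: nationwide by age group  */
          state                       /* _TYPE_=4: state total, all ages    */
          state*prov_name             /* _TYPE_=5: provider total, all ages */
          state*age_group             /* _TYPE_=6: state total by age group */
          state*age_group*prov_name;  /* _TYPE_=7: provider by age group    */
    OUTPUT OUT=sample_out
        MEDIAN=Median_days
        Q1=Lower_Quartile
        Q3=Upper_Quartile
        P90=_90th_PCTL;
RUN;
```

The _TYPE_ values in the output are still numbered according to the order of the CLASS statement, so the cleaning logic that keys on _TYPE_ should not need to change.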
Note that the purpose of this example is to illustrate the flexibility of PROC MEANS in managing output data
sets. We would like to make our procedure as general as possible, so that it can be adapted to a wide range of
uses. We therefore turn to SAS® macros. Next, we define a macro that accepts a varying number of parameter
values, i.e. the values of _TYPE_ that we want to keep in the output data set. Programmers can thus keep
whichever combinations they like.
In defining the SAS® macro, we use the PARMBUFF option, which allows a different number of arguments to be
supplied each time the macro is invoked. To keep a different set of combinations, simply list the corresponding
numbers in the call to the macro ‘Keep’, separated by commas.
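OPTIONS SYMBOLGEN;  /* write resolved macro variable values to the log */
%MACRO Keep / PARMBUFF;
    /* SYSPBUFF holds the full argument list, parentheses included */
    DATA Sample_Out;
        SET Sample_stat;
        %LET num=1;
        %LET type=%SCAN(&SYSPBUFF,&num);
        /* Emit one IF ... THEN OUTPUT statement per listed _TYPE_ value */
        %DO %WHILE(&type NE);
            IF _type_=&type. THEN OUTPUT;
            %LET num=%EVAL(&num+1);
            %LET type=%SCAN(&SYSPBUFF,&num);
        %END;
    RUN;
%MEND Keep;
%Keep(0,2,4,5,6,7);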
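For the call above, the %DO %WHILE loop emits one IF statement per listed value, so the generated DATA step should look roughly like the first block below. When the list of values is fixed, a WHERE statement with the IN operator is a simpler way to keep the same rows; it is shown as an alternative sketch, not as the paper's method.

```sas
/* Approximate code generated by %Keep(0,2,4,5,6,7) */
DATA Sample_Out;
    SET Sample_stat;
    IF _type_=0 THEN OUTPUT;
    IF _type_=2 THEN OUTPUT;
    IF _type_=4 THEN OUTPUT;
    IF _type_=5 THEN OUTPUT;
    IF _type_=6 THEN OUTPUT;
    IF _type_=7 THEN OUTPUT;
RUN;

/* Alternative sketch without a macro: the same rows are kept */
DATA Sample_Out;
    SET Sample_stat;
    WHERE _TYPE_ IN (0, 2, 4, 5, 6, 7);
RUN;
```

The macro version remains the more convenient choice when the list of _TYPE_ values changes from one call to the next.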
STEP 3. DATA CLEANING
Next we fill in the blank cells of the aggregate rows, clean up the output data set, and then sort the data:
DATA sample_out (DROP=_TYPE_ RENAME=(_FREQ_=FREQUENCY));
    /* Labels are truncated if STATE, PROV_NAME or AGE_GROUP are shorter than */
    /* the text assigned below, so check their lengths in sample_stat first.  */
    SET sample_out;
    /* _TYPE_=0: nationwide, all ages */
    IF _TYPE_=0 THEN DO;
        GROUPING=' Nationwide Total';
        STATE = ' Nationwide';
        PROV_NAME=' Nationwide Total';
        AGE_GROUP='3-All Ages';
    END;
    /* _TYPE_=2: nationwide, by age group */
    IF _TYPE_=2 THEN DO;
        GROUPING=' Nationwide Total';
        STATE = ' Nationwide';
        PROV_NAME=' Nationwide Total';
    END;
    /* _TYPE_=4: state total, all ages */
    IF _TYPE_=4 THEN DO;
        GROUPING=' Statewide Total';
        PROV_NAME=' '||TRIM(STATE)||' Total';
        AGE_GROUP='3-All Ages';
    END;
    /* _TYPE_=5: provider total, all ages */
    IF _TYPE_=5 THEN DO;
        GROUPING=' Provider Total';
        AGE_GROUP='3-All Ages';
    END;
    /* _TYPE_=6: state total, by age group */
    IF _TYPE_=6 THEN DO;
        GROUPING=' Statewide Total';
        PROV_NAME=' '||TRIM(STATE)||' Total';
    END;
    /* _TYPE_=7: provider, by age group */
    IF _TYPE_=7 THEN DO;
        GROUPING=' Provider Total';
    END;
RUN;
PROC SORT DATA=sample_out;
    BY STATE PROV_NAME AGE_GROUP;
RUN;
After the cleaning and sorting, we obtain the final output data set shown in Fig. 2. In total, we run PROC
MEANS only once to obtain statistics for all possible combinations of the class variables, and then keep the
desired ones. Without this feature of PROC MEANS, we would have to aggregate the data separately for each
combination, calculate the statistics at each level, and then merge the results. PROC MEANS therefore greatly
reduces the coding burden and gives a straightforward and efficient solution, especially when the number of
class variables is large.
CONCLUSION
Through an example, we have illustrated output data management using PROC MEANS. It can quickly generate a wide
range of descriptive statistics at different levels of combination of the class variables in a flexible fashion,
and it can be applied in a wide range of data management and analysis projects.
REFERENCES
[1] SAS® 9.2 SQL Procedure: User’s Guide
[2] SAS® 9.2 Macro Language: Reference
[3] Andrew Karp: ‘Steps to Success with PROC MEANS’, SUGI 29, Paper 240-29
ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Si Gao
University at Albany, SUNY
1400 Washington Ave.
Albany, NY, 12222
[email protected]
Xin Li
University at Albany, SUNY
1400 Washington Ave.
Albany, NY, 12222
[email protected]