Analysis of Single-factor Experiment with Categorical Outcomes

Data Presentation
Numerical Summary Measures
Chung-Yi Li, PhD
Dept. of Public Health, College of Med.
NCKU
Outline for Data Presentation
• Types of Numerical Data
• Tables
– Frequency Distributions
– Relative Frequency
• Graphs
– Bar/Pie Charts
– Histograms
– Frequency Polygons
– Stem & Leaf Plot
– One-Way Scatter Plots
– Box Plots
– Two-Way Scatter Plots
– Line Graphs
Outline for Numerical Summary
Measures
• Measures of Central Tendency
– Mean / Median / Mode
• Measures of Dispersion
– Range
– Interquartile Range
– Variance and Standard Deviation
– Coefficient of Variation
• Grouped Data
– Grouped Mean / Grouped Variance
Types of Numerical Data
• Nominal
– Dichotomous/binary: gender (1=females and
0=males)
– Categorical: blood type (1=O, 2=A, 3=B, and 4=AB) or
race/ethnicity
• Ordinal
– Level of severity: 1=fatal, 2=severe, 3=moderate, and
4=minor
– Likert scale: level of agreement, from 1=least agree to
5=most agree
• Ranked
– Leading causes of death/cancer in Taiwan
• Interval scale
– Temperature (C)
• Ratio scale
– Body height, weight, concentration of white
blood cell
Tables for Continuous Data
Guidelines
• Closed-ended intervals are better than open-ended
ones when constructing a frequency table, because
they provide more information.
• Intervals should be exhaustive (every observation
falls in some interval) and mutually exclusive
(no observation falls in more than one).
• Frequency tables for continuous data can be
somewhat misleading…
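The guidelines above can be sketched in Python. This is only an illustration under stated assumptions: the function name `frequency_table`, the half-open interval convention [lower, upper), and the ages are all invented for the example and do not come from the slides.

```python
from collections import Counter


def frequency_table(values, edges):
    """Count values in half-open intervals [edges[i], edges[i+1]).

    Half-open intervals are mutually exclusive by construction. A value
    outside [edges[0], edges[-1]) raises an error, so the table is
    guaranteed to be exhaustive for the data it accepts.
    """
    counts = Counter()
    for v in values:
        if not edges[0] <= v < edges[-1]:
            raise ValueError(f"{v} falls outside the table's intervals")
        for lo, hi in zip(edges, edges[1:]):
            if lo <= v < hi:
                counts[(lo, hi)] += 1
                break
    n = len(values)
    # Each row: lower edge, upper edge, frequency, relative frequency.
    return [(lo, hi, counts[(lo, hi)], counts[(lo, hi)] / n)
            for lo, hi in zip(edges, edges[1:])]


# Hypothetical ages in whole years, invented for illustration only.
ages = [18, 22, 23, 27, 29, 30, 31, 34, 36, 41]
table = frequency_table(ages, edges=[15, 20, 25, 30, 35, 45])
for lo, hi, freq, rel in table:
    print(f"{lo}-{hi - 1}: frequency={freq}, relative frequency={rel:.2f}")
```

Closed ends on both sides (15 and 45 here) keep every class width known, which is what makes the relative frequencies comparable across rows.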
Comment
• Grouping a continuous variable might not
be biologically plausible. For example, in
maternal and child health (MCH) studies,
maternal age is normally categorized as
<20, 20-24, 25-29, 30-34, and >=35.
• A woman aged 29 is physiologically more
similar to a woman aged 30 than to one
aged 25, yet the grouping places 29 with 25
and separates it from 30.
No Concern for Tabulating
Categorical Data
Bar Chart
Pie Chart
Histogram
How About This One?
Frequency Polygons
Stem-and-Leaf Plots
Comment
• Preserves each individual measurement, so it is
not practical for large data sets
• Stem is the first digit(s) of a measurement; leaf
is its last digit
• Most useful for two-digit numbers, more
cumbersome for numbers with three or more digits
20: X
30: XXX
40: XXXX
50: XX
60: X
Stem | Leaf
2* | 1
3* | 244
4* | 2468
5* | 26
6* | 4
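As a hedged sketch, the stem-and-leaf display above can be rebuilt in a few lines of Python. The function name `stem_and_leaf` is an assumption for this example, and the data values are read back from the display itself (stem 3 with leaves 2, 4, 4 gives 32, 34, 34, and so on).

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group two-digit integers by tens digit (stem) and units digit (leaf)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 34 -> stem 3, leaf 4
        stems[stem].append(leaf)
    # Join the leaves of each stem into one string, stems in ascending order.
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}


# Values recovered from the display on this slide.
data = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem}* | {leaves}")
```

Note that this sketch omits stems with no observations; a fuller version would print empty stems so that gaps in the distribution stay visible.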
One-Way Scatter Plots
Two-Way Scatter Plots
Box Plots
Comment
• Descriptive method to convey information
about measures of location and dispersion
– Box-and-whisker plots
• Construction of box plot
– Box spans the IQR (first to third quartile)
– Line inside the box marks the median
– Whiskers extend to the smallest and largest observations
– Other conventions can be used, especially to
represent extreme values
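The numbers behind a box plot built this way can be computed with Python's standard library. This is a sketch, assuming the "inclusive" quartile convention of `statistics.quantiles`; other conventions give slightly different quartiles. The function name `box_plot_summary` and the data are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the quantities needed to draw a basic box-and-whisker plot."""
    ordered = sorted(values)
    # Three cut points that split the ordered data into four equal parts.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],    # end of lower whisker
        "q1": q1,             # bottom of box
        "median": median,     # line inside box
        "q3": q3,             # top of box
        "max": ordered[-1],   # end of upper whisker
        "iqr": q3 - q1,       # height of box
    }


# Hypothetical measurements, invented for illustration only.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12]
summary = box_plot_summary(data)
print(summary)
```

Whiskers drawn to the minimum and maximum follow the construction on this slide; plotting libraries often shorten them and mark extreme values separately.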
Good for Making Comparisons
Line Graphs
Summary
• In practice, descriptive statistics play a
major role
– Always the first 1-2 tables/figures in a paper
– Statistician needs to know about each
variable before deciding how to analyze to
answer research questions
• In any analysis, 90% of the effort goes into
setting up the data
– Descriptive statistics are part of that 90%
Measures of Central Tendency
• Mean
– Arithmetic mean
– Geometric mean
• Median
• Mode
Arithmetic Mean (population)
Suppose we have N measurements of a particular
variable in a population. We denote these N
measurements as:
X1, X2, X3,…,XN
where X1 is the first measurement, X2 is the second,
etc.
Definition
The mean, more accurately called the arithmetic mean, is defined as
the sum of the observed measurements divided by the number of observations:
$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
Arithmetic Mean
• Probably most common of the measures of
central tendency
– A.K.A. ‘Average’
• Definition
– Most appropriate for normally distributed data,
although we tend to use it regardless of distribution
– μ for population mean
Comment
• Weakness
– Influenced by extreme values
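• Translations
– Additive: adding a constant c to every observation
adds c to the mean
– Multiplicative: multiplying every observation by c
multiplies the mean by c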
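A short numerical check of these points, written as a sketch with made-up values (the list `values`, the constant `c`, and the outlier 100 are assumptions for illustration):

```python
from statistics import mean

values = [2, 4, 6, 8]  # hypothetical measurements
c = 10                 # arbitrary constant

shifted = [x + c for x in values]  # additive translation
scaled = [x * c for x in values]   # multiplicative translation

print(mean(values), mean(shifted), mean(scaled))

# Weakness: a single extreme value pulls the mean strongly toward it.
with_outlier = values + [100]
print(mean(with_outlier))
```

The shifted mean equals the original mean plus c, and the scaled mean equals c times the original mean, which is why a change of units (for example cm to m) can be applied to the mean directly.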
Geometric Mean
• Used to describe data with an extreme
skewness to the right
– Ex., Laboratory data: lipid measurements
• Definition
– Antilog of the mean of the log xi
• Used to calculate mean of a log-normal
distribution
• Definition
– Antilog of the mean of the log xi
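The "antilog of the mean of the logs" definition translates directly into code. The sketch below uses natural logarithms (any base gives the same result as long as the antilog uses the same base); the function name and the three data values are assumptions for illustration.

```python
import math
import statistics


def geometric_mean(values):
    """Antilog of the mean of the log-transformed values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


# Hypothetical right-skewed measurements.
values = [1, 10, 100]
gm = geometric_mean(values)
am = sum(values) / len(values)
print(f"geometric mean = {gm:.2f}, arithmetic mean = {am:.2f}")

# Cross-check against the standard library (Python 3.8+).
assert math.isclose(gm, statistics.geometric_mean(values))
```

For right-skewed data like these, the geometric mean sits well below the arithmetic mean because the log transform shrinks the influence of the largest values.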
Median
• Frequently used if there are extreme values in a
distribution or if the distribution is non-normal
• Definition
– The value that divides the ‘ordered array’ into two
equal parts
• If the number of observations n is odd, the median is the
((n+1)/2)th ordered observation
– Ex.: Median of 11 observations is the 6th observation
• If n is even, the median is the midpoint between the two
middle ordered observations, the (n/2)th and the (n/2+1)th
– Ex.: Median of 12 observations is the midpoint between
the 6th and 7th
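The odd and even cases can be written out as a small function. This is an illustrative sketch; the function name `median_of` and the test data (the integers 1 to 11 and 1 to 12) are assumptions chosen to mirror the two examples on this slide.

```python
def median_of(values):
    """Median of a list: middle value if n is odd, midpoint of the two middle values if even."""
    ordered = sorted(values)  # the 'ordered array'
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)th observation, which is index n//2 when counting from 0.
        return ordered[mid]
    # Even n: midpoint between the (n/2)th and (n/2+1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


eleven = list(range(1, 12))   # 11 observations: 1, 2, ..., 11
twelve = list(range(1, 13))   # 12 observations: 1, 2, ..., 12
print(median_of(eleven))      # the 6th observation
print(median_of(twelve))      # midpoint between the 6th and 7th
```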
Mode
• Not used very frequently in practice
• Definition
– Value that occurs most frequently in data set
• If all values different, no mode
• May be more than one mode
– Bimodal or multimodal
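A minimal sketch of these rules in Python. The function name `modes` is an assumption; it returns an empty list when every value is different, following the "no mode" convention on this slide (Python's own `statistics.multimode` would instead return every value in that case).

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] if every value occurs only once."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # all values different: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))      # one mode
print(modes([1, 1, 2, 2, 3]))   # bimodal
print(modes([1, 2, 3]))         # no mode
```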
Why Measures of Dispersion?
Range
Inter-Quartile Range
Percentiles and Quartiles
• Definition of percentiles
– Given a set of n observations x1, x2,…, xn, the pth
percentile P is the value of X such that p percent or less
of the observations are less than P and (100-p)
percent or less are greater than P
– P10 indicates 10th percentile, etc.
• Definition of quartiles
– First quartile is P25
– Second quartile is median or P50
– Third quartile is P75
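The percentile definition above can be checked mechanically. The sketch below searches the observed values for those that satisfy both conditions; the function name `percentile_candidates` and the data are assumptions for illustration, and software packages typically use interpolation rules that can give slightly different answers.

```python
def percentile_candidates(values, p):
    """Observed values P with at most p% of the data below P and at most (100-p)% above P."""
    n = len(values)
    result = []
    for c in sorted(set(values)):
        below = sum(v < c for v in values)
        above = sum(v > c for v in values)
        # Integer arithmetic avoids floating-point rounding in the comparison.
        if below * 100 <= p * n and above * 100 <= (100 - p) * n:
            result.append(c)
    return result


# Hypothetical measurements, invented for illustration only.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12]
for p in (10, 25, 50, 75):
    print(f"P{p}: {percentile_candidates(data, p)}")
```

When two adjacent observations both satisfy the definition (for example P50 of an even-sized data set), the function returns both; the usual convention is then to report their midpoint.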
Variance and Standard
Deviation (population)
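• Suppose we have N measurements of a
particular variable in a population: X1, X2,
X3,…,XN
• The mean is μ, defined as
$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$, and we define:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$ as variance
$\sigma = \sqrt{\sigma^2}$ as standard deviation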
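As a sketch, the population variance (mean squared deviation from μ, dividing by N) and its square root can be computed by hand and compared with the standard library. The three-value population is chosen only for illustration.

```python
import math
from statistics import pstdev, pvariance

population = [1, 2, 3]  # a tiny population, for illustration only

N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N  # divide by N, not N-1
sigma = math.sqrt(sigma2)

print(f"mu = {mu}, variance = {sigma2:.3f}, SD = {sigma:.3f}")

# Cross-check against the standard library's population functions.
assert math.isclose(sigma2, pvariance(population))
assert math.isclose(sigma, pstdev(population))
```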
Variance and Standard
Deviation (sample)
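Suppose we have n measurements of a particular
variable in a sample: x1, x2, x3,…,xn
The mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, and we define:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ as sample variance
$s = \sqrt{s^2}$ as standard deviation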
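For a sample, the only change from the population calculation is the divisor n-1. A sketch with a hypothetical five-value sample, cross-checked against `statistics.variance` and `statistics.stdev`, which also divide by n-1:

```python
import math
from statistics import stdev, variance

sample = [4, 8, 6, 5, 7]  # hypothetical sample, invented for illustration

n = len(sample)
x_bar = sum(sample) / n
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # divide by n-1
s = math.sqrt(s2)

print(f"mean = {x_bar}, sample variance = {s2}, sample SD = {s:.3f}")

# Cross-check against the standard library's sample functions.
assert math.isclose(s2, variance(sample))
assert math.isclose(s, stdev(sample))
```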
Why n-1 for Sample Variance and SD?
Population=[1,2,3], μ=2, σ²=0.667
n=2, repeated sampling (with replacement)

Sample   Values   Variance, divisor n-1   Variance, divisor n
1        [1,1]    0                       0
2        [1,2]    0.5                     0.25
3        [1,3]    2                       1
4        [2,1]    0.5                     0.25
5        [2,2]    0                       0
6        [2,3]    0.5                     0.25
7        [3,1]    2                       1
8        [3,2]    0.5                     0.25
9        [3,3]    0                       0
Average           0.667                   0.333
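The averages on this slide can be reproduced by enumerating all nine ordered samples of size 2 drawn with replacement from the population [1, 2, 3]. The sketch below uses exact fractions so the comparison with σ² = 2/3 involves no rounding; the helper name `sum_of_squares` is an assumption.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance, 2/3


def sum_of_squares(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)


# All ordered samples of size 2, drawn with replacement: (1,1), (1,2), ..., (3,3).
samples = list(product(population, repeat=2))

avg_n_minus_1 = sum(sum_of_squares(s) / (len(s) - 1) for s in samples) / len(samples)
avg_n = sum(sum_of_squares(s) / len(s) for s in samples) / len(samples)

print(f"population variance    = {float(sigma2):.3f}")
print(f"average with divisor n-1 = {float(avg_n_minus_1):.3f}")
print(f"average with divisor n   = {float(avg_n):.3f}")
```

Dividing by n-1 recovers the population variance on average, while dividing by n underestimates it (here by half, since n = 2).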
s² is expected to be an
unbiased estimate of σ²
(its average over repeated samples equals σ²)
Coefficient of Variation
• Relative variation rather than absolute
variation such as standard deviation
• Definition of C.V.:
$CV = \frac{s}{\bar{x}} \times 100\%$
Comment
• Useful in comparing variation between two
distributions
– Used particularly in comparing laboratory
measures to identify those determinations
with more variation
– Also used in QC analyses for comparing
observers
A Class of Students
Body weight: Mean=60 kg; SD=5 kg
Body height: Mean=170 cm; SD=10 cm


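Which variable has greater variation, weight or height?
• By SD: 10 cm > 5 kg? The units differ, so the two SDs
cannot be compared directly.
• By CV: 10 cm/170 cm (5.9%) < 5 kg/60 kg (8.3%), so weight
varies more relative to its mean.
• Unlike the range, IQR, variance, and SD, the CV has no unit.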
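The comparison for this class of students can be reproduced in a few lines. This is a sketch using the means and SDs given on the slide; the function name `coefficient_of_variation` is an assumption.

```python
def coefficient_of_variation(sd, mean):
    """SD as a percentage of the mean; the units cancel, so the result is unitless."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


cv_weight = coefficient_of_variation(sd=5, mean=60)    # kg / kg
cv_height = coefficient_of_variation(sd=10, mean=170)  # cm / cm

print(f"CV of weight = {cv_weight:.2f}%")
print(f"CV of height = {cv_height:.2f}%")
print("greater relative variation:", "weight" if cv_weight > cv_height else "height")
```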
Software
• Statistical software
– SAS
– SPSS
– Stata
– Minitab
• Graphical software
– SigmaPlot
– PowerPoint
– Excel