Data Presentation
Numerical Summary Measures
Chung-Yi Li, PhD
Dept. of Public Health, College of Med., NCKU

Outline for Data Presentation
• Types of Numerical Data
• Tables
  – Frequency Distributions
  – Relative Frequency
• Graphs
  – Bar/Pie Charts
  – Histograms
  – Frequency Polygons
  – Stem & Leaf Plot
  – One-Way Scatter Plots
  – Box Plots
  – Two-Way Scatter Plots
  – Line Graphs

Outline for Numerical Summary Measures
• Measures of Central Tendency
  – Mean / Median / Mode
• Measures of Dispersion
  – Range
  – Interquartile Range
  – Variance and Standard Deviation
  – Coefficient of Variation
• Grouped Data
  – Grouped Mean / Grouped Variance

Types of Numerical Data
• Nominal
  – Dichotomous/binary: gender (1=female, 0=male)
  – Categorical: blood type (1=O, 2=A, 3=B, 4=AB) or race/ethnicity
• Ordinal
  – Level of severity: 1=fatal, 2=severe, 3=moderate, 4=minor
  – Likert scale: level of agreement, from 1=least agree to 5=most agree
• Ranked
  – Leading causes of death/cancer in Taiwan
• Interval scale
  – Temperature (°C)
• Ratio scale
  – Body height, weight, white blood cell concentration

Tables for Continuous Data

Guidelines
• Closed-ended intervals are better than open-ended ones when constructing a frequency table, as they provide more information.
• Intervals should be exhaustive and mutually exclusive.
• Frequency tables for continuous data are somewhat misleading…

Comment
• Grouping a continuous variable might not be biologically plausible. For example, in maternal and child health (MCH) studies, maternal ages are normally categorized into <20, 20-24, 25-29, 30-34, and >=35.
• Women aged 29 are physiologically more similar to women aged 30 than to women aged 25, yet this grouping places them with the 25-year-olds.

No Concern for Tabulating Categorical Data

Bar Chart

Pie Chart

Histogram

How About This One?
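As a rough sketch of the frequency and relative-frequency tables discussed above, the following Python snippet tabulates ages into the MCH age groups. The ages are hypothetical values made up for illustration, ages are assumed to be whole completed years, and the outer bounds 0 and 200 are assumptions used only to give the open-ended groups (<20, >=35) a concrete range.

```python
from collections import Counter

# Hypothetical maternal ages in completed years (illustrative only).
ages = [18, 22, 24, 25, 29, 30, 31, 34, 35, 41]

# Age groups from the MCH example: (label, lowest age, highest age), both inclusive.
# The bounds 0 and 200 are assumptions standing in for the open ends.
groups = [
    ("<20", 0, 19),
    ("20-24", 20, 24),
    ("25-29", 25, 29),
    ("30-34", 30, 34),
    (">=35", 35, 200),
]


def age_group(age):
    """Return the label of the single group that contains this age."""
    for label, low, high in groups:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} falls outside every group")


# Frequency: number of observations per group.
counts = Counter(age_group(a) for a in ages)
total = len(ages)

# Relative frequency: each group's share of the total, as a percentage.
table = [(label, counts[label], 100.0 * counts[label] / total)
         for label, _, _ in groups]

for label, freq, rel in table:
    print(f"{label:>6} {freq:>4} {rel:6.1f}%")
```

Because the groups are mutually exclusive and together cover every possible age, each observation is counted exactly once and the relative frequencies sum to 100%.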
Frequency Polygons

Stem-and-Leaf Plots

Comment
• Preserves the individual measurements, so it is not useful for large data sets
• The stem is the first digit(s) of each measurement; the leaf is the last digit
• Most useful for two-digit numbers, more cumbersome for three or more digits

  20: X
  30: XXX
  40: XXXX
  50: XX
  60: X

  Stem | Leaf
  2*   | 1
  3*   | 244
  4*   | 2468
  5*   | 26
  6*   | 4

One-Way Scatter Plots

Two-Way Scatter Plots

Box Plots

Comment
• Descriptive method to convey information about measures of location and dispersion
  – Also called box-and-whisker plots
• Construction of box plot
  – Box spans the IQR
  – Line at median
  – Whiskers at smallest and largest observations
  – Other conventions can be used, especially to represent extreme values

Good for Making Comparisons

Line Graphs

Summary
• In practice, descriptive statistics play a major role
  – Always the first 1-2 tables/figures in a paper
  – The statistician needs to know about each variable before deciding how to analyze the data to answer the research questions
• In any analysis, 90% of the effort goes into setting up the data
  – Descriptive statistics are part of that 90%

Measures of Central Tendency
• Mean
  – Arithmetic mean
  – Geometric mean
• Median
• Mode

Arithmetic Mean (population)
Suppose we have N measurements of a particular variable in a population. We denote these N measurements as X1, X2, X3, …, XN, where X1 is the first measurement, X2 is the second, etc.
Definition: more accurately called the arithmetic mean, it is the sum of the observed measurements divided by the number of observations, $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$.

Arithmetic Mean
• Probably the most common measure of central tendency
  – A.K.A. ‘Average’
• Definition
  – Suited to normally distributed data, although we tend to use it regardless of distribution
  – μ for population mean

Comment
• Weakness
  – Influenced by extreme values
• Translations
  – Additive: adding a constant c to every observation adds c to the mean
  – Multiplicative: multiplying every observation by a constant c multiplies the mean by c

Geometric Mean
• Used to describe data with extreme skewness to the right
  – Ex.: laboratory data such as lipid measurements
• Used to calculate the mean of a log-normal distribution
• Definition
  – Antilog of the mean of the log xi: $\bar{x}_G = \operatorname{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log x_i\right)$

Median
• Frequently used if there are extreme values in a distribution or if the distribution is non-normal
• Definition
  – The value that divides the ‘ordered array’ into two equal parts
• With an odd number of observations, the median is the ((n+1)/2)th observation
  – Ex.: the median of 11 observations is the 6th observation
• With an even number of observations, the median is the midpoint between the middle two observations
  – Ex.: the median of 12 observations is the midpoint between the 6th and 7th

Mode
• Not used very frequently in practice
• Definition
  – The value that occurs most frequently in the data set
• If all values are different, there is no mode
• There may be more than one mode
  – Bimodal or multimodal

Why Measures of Dispersion?

Range

Inter-Quartile Range

Percentiles and Quartiles
• Definition of percentiles
  – Given a set of n observations x1, x2, …, xn, the pth percentile P is the value of X such that p percent or less of the observations are less than P and (100-p) percent or less are greater than P
  – P10 indicates the 10th percentile, etc.
• Definition of quartiles
  – First quartile is P25
  – Second quartile is the median, or P50
  – Third quartile is P75

Variance and Standard Deviation (population)
• Suppose we have N measurements of a particular variable in a population: X1, X2, X3, …, XN
• With the population mean $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$, we define:
  – $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$ as the variance
  – $\sigma = \sqrt{\sigma^2}$ as the standard deviation

Variance and Standard Deviation (sample)
Suppose we have n measurements of a particular variable in a sample: x1, x2, x3, …, xn.
With the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, we define:
  – $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ as the sample variance
  – $s = \sqrt{s^2}$ as the sample standard deviation

Why n-1 for Sample Variance and SD?
Population = [1, 2, 3]: μ = 2, σ² = 0.667
Repeated sampling with n = 2 (with replacement):

  Sample   Values   Variance (divisor n-1)   Variance (divisor n)
  1        [1,1]    0                        0
  2        [1,2]    0.5                      0.25
  3        [1,3]    2                        1
  4        [2,1]    0.5                      0.25
  5        [2,2]    0                        0
  6        [2,3]    0.5                      0.25
  7        [3,1]    2                        1
  8        [3,2]    0.5                      0.25
  9        [3,3]    0                        0
  Average           0.667                    0.333

Only the n-1 divisor averages to the population variance of 0.667: s² is expected to be an unbiased estimate of σ².

Coefficient of Variation
• Relative variation rather than absolute variation such as standard deviation
• Definition of C.V.: $CV = \frac{s}{\bar{x}} \times 100\%$

Comment
• Useful in comparing variation between two distributions
  – Used particularly in comparing laboratory measures to identify those determinations with more variation
  – Also used in QC analyses for comparing observers

A Class of Students
Body weight: Mean = 60 kg; SD = 5 kg
Body height: Mean = 170 cm; SD = 10 cm
Which variable has greater variation: weight or height?
• By SD: 10 cm > 5 kg? The units differ, so the two SDs cannot be compared directly.
• By CV: 10 cm / 170 cm (5.9%) < 5 kg / 60 kg (8.3%), so weight varies more.
• Of the measures covered here, CV is the only one without a unit.

Software
• Statistical software
  – SAS
  – SPSS
  – Stata
  – Minitab
• Graphical software
  – Sigmaplot
  – Power Point
  – Excel
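The "Why n-1" table and the weight/height CV comparison above can be checked with a minimal sketch. Python and its standard `statistics` module are an assumption here (they are not among the packages listed on the Software slide), and the helper `coefficient_of_variation` is a hypothetical name introduced only for this illustration.

```python
from itertools import product
from statistics import fmean, pvariance, variance

# Population from the "Why n-1" slide.
population = [1, 2, 3]
sigma2 = pvariance(population)  # population variance, divisor N

# Every ordered sample of size 2 drawn with replacement (9 samples).
samples = list(product(population, repeat=2))

# Average of the sample variances computed with the n-1 divisor.
mean_var_n_minus_1 = fmean(variance(s) for s in samples)

# Average of the variances computed with the n divisor
# (pvariance applied to each sample divides by the sample size).
mean_var_n = fmean(pvariance(s) for s in samples)


def coefficient_of_variation(mean, sd):
    """Return the CV as a percentage: 100 * SD / mean."""
    return 100.0 * sd / mean


# Class-of-students example: weight (kg) and height (cm).
cv_weight = coefficient_of_variation(60, 5)
cv_height = coefficient_of_variation(170, 10)

print(round(sigma2, 3), round(mean_var_n_minus_1, 3), round(mean_var_n, 3))
print(round(cv_weight, 1), round(cv_height, 1))
```

Averaged over all nine samples, the n-1 version reproduces the population variance while the n version gives only half of it, which is the point of the slide. The CV comparison ranks weight as the more variable measure even though its SD is numerically smaller.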