Part 6: Descriptive Statistics

Descriptive statistics are often the first step in the analysis of data. They summarize data in columns (or rows) with several measures that describe central tendency (mean, median), scatter and normality. Some measures sound trivial, e.g. the mean, but did you know that the median is sometimes more representative than the mean, for example when a few extreme values pull the mean away from the bulk of the data?

Minimum
The smallest value in a data set.

Maximum
The largest value in a data set.

Sum
The result of adding all values of a data group.

Mean (arithmetic)
The mean is the average. It is the sum of all values divided by the number of values.
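
As a minimal sketch, the four measures above can be computed with Python's standard library. The values are the pesticide concentrations (mg/L) from the lake example in this article; the variable names are only illustrative.

```python
import math
import statistics

# Pesticide concentrations in mg/L (from the lake example in this article)
concentrations = [0.21, 0.17, 0.25, 0.12, 0.18, 0.22, 0.26]

minimum = min(concentrations)            # smallest value
maximum = max(concentrations)            # largest value
total = math.fsum(concentrations)        # sum, using error-compensated summation
mean = statistics.mean(concentrations)   # sum divided by the number of values

print(f"min={minimum}, max={maximum}, sum={total:.2f}, mean={mean:.3f}")
```

The mean rounds to 0.201 mg/L, the value quoted in the lake example.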

Standard deviation
The standard deviation describes the scatter or variation between values in a group of data. We want to minimize the variation of measured values, for example by optimizing our experiment or measuring device. A mean is often meaningless without the standard deviation. A mean price of $400,000 for property in a neighborhood does not tell us much if the most expensive house costs $750,000 and the most affordable $50,000 (large variation), which probably reflects a large gap in the size and condition of the houses. However, with a price of $450,000 for the most expensive house and $350,000 for the most affordable (small variation), we get the same mean price of $400,000 plus the information that the houses are unlikely to differ much in size and condition. That is why we should always report the mean together with the standard deviation, as mean ± standard deviation.
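
The house price comparison can be sketched in a few lines of Python. The two three-house neighborhoods below are hypothetical; they are built only from the highest, lowest and mean prices mentioned above.

```python
import statistics

# Hypothetical three-house neighborhoods built around the prices in the text
wide_spread = [50_000, 400_000, 750_000]
narrow_spread = [350_000, 400_000, 450_000]

for name, prices in [("wide", wide_spread), ("narrow", narrow_spread)]:
    mean = statistics.mean(prices)
    sd = statistics.stdev(prices)   # sample standard deviation (n - 1 in the denominator)
    print(f"{name}: ${mean:,.0f} ± ${sd:,.0f}")
```

Both neighborhoods have the same mean, and only the standard deviation reveals how different they are.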

Standard error of the mean
The standard error of the mean is the standard deviation divided by the square root of the number of values. It is essentially the estimated standard deviation of the mean itself, which is useful when we have only a single data group and cannot determine the mean repeatedly from replicate experiments. The standard error becomes smaller as the number of values increases, and the estimated mean becomes correspondingly more reliable.
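
A short sketch of this calculation, assuming the sample standard deviation (n - 1 in the denominator) is used. The helper name `standard_error` is hypothetical, and the data are the pesticide concentrations from the lake example in this article.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Pesticide concentrations in mg/L from the lake example in this article
concentrations = [0.21, 0.17, 0.25, 0.12, 0.18, 0.22, 0.26]
sem = standard_error(concentrations)
print(f"SEM = {sem:.4f} mg/L")

# For a fixed standard deviation, quadrupling n halves the standard error.
sd = 0.05  # illustrative value
for n in (4, 16, 64):
    print(n, sd / math.sqrt(n))
```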

Confidence interval of the mean
With the confidence interval we can express the precision of the mean with a defined probability (usually 95%). The confidence interval depends on the sample size and the variability (standard deviation).

Let us assume we measure the concentration of a pesticide at multiple sites in a lake. The concentrations are the following: 0.21, 0.17, 0.25, 0.12, 0.18, 0.22 and 0.26 mg/L. The mean and standard deviation are 0.201 ± 0.049 mg/L. The 95% confidence interval ranges from 0.156 to 0.247 mg/L, i.e. we can be 95% confident that the true mean lies in this range. In this example it is a wide range, not far from the minimum and maximum values, but this can be realistic considering that a few sampling sites may have been close to the input source of the pesticide and other sites further away from it.
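
The numbers in this example can be reproduced with the sketch below. It assumes the interval is based on Student's t distribution, and the critical value 2.447 (two-sided 95%, 6 degrees of freedom) is copied from a t-table rather than computed.

```python
import math
import statistics

concentrations = [0.21, 0.17, 0.25, 0.12, 0.18, 0.22, 0.26]  # mg/L

n = len(concentrations)
mean = statistics.mean(concentrations)
sd = statistics.stdev(concentrations)   # sample standard deviation
sem = sd / math.sqrt(n)                 # standard error of the mean

# Two-sided 95% critical value of Student's t for n - 1 = 6 degrees of freedom,
# taken from a t-table (assumption: the article's interval is t-based).
T_CRIT_95_DF6 = 2.447

margin = T_CRIT_95_DF6 * sem
ci_low, ci_high = mean - margin, mean + margin
print(f"mean ± SD: {mean:.3f} ± {sd:.3f} mg/L")
print(f"95% CI: {ci_low:.3f} to {ci_high:.3f} mg/L")
```

For another sample size the critical value has to be looked up again, because it depends on the degrees of freedom.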

Geometric mean
The geometric mean is the nth root of the product of all n values. It is used to average indicators whose ranges differ greatly. For example, suppose we want to rank companies using environmental sustainability (scored between 0 and 5) and financial security (scored between 0 and 100) as indicators. We need the geometric mean to calculate the mean score and rank the companies. If we used the arithmetic mean, financial security would be weighted much more heavily than environmental sustainability.
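
The ranking effect can be illustrated with a small sketch. The two companies and their scores are hypothetical, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

# Hypothetical scores: (environmental sustainability 0-5, financial security 0-100)
companies = {
    "Company A": (4.5, 50),
    "Company B": (1.5, 60),
}

arithmetic = {}
geometric = {}
for name, scores in companies.items():
    arithmetic[name] = statistics.mean(scores)
    geometric[name] = statistics.geometric_mean(scores)  # nth root of the product
    print(f"{name}: arithmetic={arithmetic[name]:.2f}, geometric={geometric[name]:.2f}")
```

With the arithmetic mean, Company B ranks first because its 10-point lead in financial security dominates. With the geometric mean, Company A ranks first because its threefold lead in sustainability counts proportionally.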

Coefficient of variation
The coefficient of variation (CV) is the relative variability of a data group. It equals the standard deviation divided by the mean and can be expressed either as a fraction or as a percentage. Reporting a CV is only meaningful if the variable has a true zero value, e.g. weight, length or pressure. Temperature measured in units other than Kelvin (K) has an arbitrary zero value, so reporting a CV is not meaningful.

We report a CV if we want to compare the variability of multiple variables with different units. For example, we cannot compare the standard deviations of length and weight, but we can compare their CVs.
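
Both points can be sketched in Python. All measurements below are hypothetical, and the helper `coefficient_of_variation` is an illustrative name that assumes the sample standard deviation.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (as a fraction)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical measurements in different units
lengths_cm = [10, 12, 11, 13, 14]
weights_g = [200, 260, 230, 310, 250]
print(f"CV length: {coefficient_of_variation(lengths_cm):.1%}")
print(f"CV weight: {coefficient_of_variation(weights_g):.1%}")

# The same temperatures in two scales give different CVs,
# because 0 °C is an arbitrary zero point while 0 K is a true zero.
temps_celsius = [18.0, 20.0, 22.0]
temps_kelvin = [t + 273.15 for t in temps_celsius]
print(f"CV in °C: {coefficient_of_variation(temps_celsius):.2%}")
print(f"CV in K:  {coefficient_of_variation(temps_kelvin):.2%}")
```

The standard deviations of length and weight carry different units, but their CVs are unitless and can be compared directly. The temperature lines show why an arbitrary zero makes the CV depend on the chosen scale.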

Skewness and kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. If the values of skewness and kurtosis (reported as excess kurtosis, i.e. kurtosis minus 3) are both close to zero, the data group is consistent with a normal distribution.
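
One way to compute both measures is from central moments, as in the sketch below. It uses the simple moment-based definitions without small-sample correction; statistics packages often apply corrections, so their values may differ slightly. The function names and data are illustrative.

```python
import statistics

def central_moment(values, order):
    """Mean of (x - mean) ** order over all values."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** order for x in values) / len(values)

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis m4 / m2**2 - 3 (zero for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # illustrative values
right_skewed = [1, 1, 2, 2, 3, 10]   # illustrative values with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric series has zero skewness and a negative excess kurtosis (flat top), while the second series has positive skewness because of its long right tail.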

Median and percentiles
A percentile is the value of a variable below which a certain percentage of observations fall. For example, the 75th percentile is the value (or score) below which 75 percent of the observations may be found. The term percentile is often used in the reporting of scores from norm-referenced tests. For example, if a score is at the 86th percentile, it is higher than 86% of the scores.

The median is the 50th percentile, i.e. when the data are ordered by increasing value, the value in the center of the series is the median. If the data group contains an even number of values, the median is the mean of the two central values.

For example, for the series 1, 2, 3, 4, 5, 6, 7, 8 the median is 4.5 (the mean of 4 and 5).
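
The series above can be checked with Python's `statistics` module. The quartile values depend on the interpolation convention, which differs between software packages; the sketch uses the module's default "exclusive" method. The income values are hypothetical and show how one extreme observation affects the mean but not the median.

```python
import statistics

series = [1, 2, 3, 4, 5, 6, 7, 8]
median = statistics.median(series)   # mean of the two central values, 4 and 5
print(median)

# Quartiles (25th, 50th and 75th percentiles); statistics.quantiles defaults
# to the "exclusive" interpolation method.
q1, q2, q3 = statistics.quantiles(series, n=4)
print(q1, q2, q3)

# Hypothetical incomes with one extreme observation: the mean is pulled
# towards the outlier, the median is not.
incomes = [28_000, 30_000, 32_000, 35_000, 400_000]
print(statistics.mean(incomes), statistics.median(incomes))
```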

Doing descriptive statistics with MaxStat 3.5 is very easy, as the whole statistical analysis is done in three simple steps within a single dialog window. MaxStat guides users with little experience through their statistical analysis. Download a trial version of MaxStat 3.5 at www.maxstat.de
