# Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. They provide a quick overview and highlight its main characteristics. Common measures include measures of central tendency (such as the mean, median, and mode) and measures of dispersion (such as the standard deviation and variance).

## What type of indicators to use?

Nominal: Frequency, Mode

Ordinal: Frequency, Mode, Median

Scale (Interval, Ratio): Frequency, Mode, Median, Mean, Range, Std. Deviation, Skewness, Kurtosis

## Measures of central tendency

Mode: The most frequently occurring value.

Median: the middle value when the data are sorted. Example: the median monthly salary is 156.000 Ft. This means that 50% of our respondents earn less than 156.000 Ft and 50% earn more than 156.000 Ft.

Mean: the average. Example: the average monthly salary is 156.000 Ft, so on average people earn 156.000 Ft per month. The mean is a good measure only when the variable is approximately normally distributed, because its value is influenced by outliers. If the data are not normally distributed, it is more appropriate to use the median than the mean.
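To make the effect of outliers concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are invented for illustration and are not survey data; the last value is a deliberate outlier.

```python
import statistics

# Hypothetical monthly salaries in Ft; the last value is a deliberate outlier.
salaries = [120_000, 140_000, 156_000, 156_000, 170_000, 190_000, 900_000]

mode = statistics.mode(salaries)      # most frequently occurring value
median = statistics.median(salaries)  # middle value of the sorted data
mean = statistics.mean(salaries)      # arithmetic average

# The same mean with the outlier dropped, for comparison.
mean_without_outlier = statistics.mean(salaries[:-1])

print(f"mode={mode}, median={median}")
print(f"mean with outlier={mean:.0f}, without={mean_without_outlier:.0f}")
```

In this made-up sample a single very high salary pulls the mean well above the median, while the median and mode stay at 156.000 Ft.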

## Measures of dispersion

Measures of dispersion tell us more about how our data are distributed.

**Standard Deviation:** roughly the average distance of a score from the mean. This tells us how widely spread out our distribution is. A low SD means the values are close to the mean, whereas a high SD means the values are spread out over a large range.

**Skewness** is a measure of the symmetry of a distribution. A perfectly normal distribution has a skewness statistic of 0. The larger the deviation from 0, the more consideration you might give to transforming your data in some way to bring it closer to normal.

Positive skewness (long right tail), typically: Mode < Median < Mean

Negative skewness (long left tail), typically: Mean < Median < Mode

**Kurtosis** tells us whether our distribution is peaked or flat. It is a measure of the shape of the curve: whether the bell is normal, flat, or peaked. The kurtosis of a perfectly normal distribution is 3; many statistical packages instead report excess kurtosis (kurtosis minus 3), for which a normal distribution scores 0. Example: Kurtosis = 1.28. If this is an excess kurtosis value, it indicates positive kurtosis, meaning our distribution is relatively peaked, with heavier tails than a normal distribution and therefore more prone to outliers.
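As a rough illustration of how these two shape statistics can be computed, the sketch below uses the simple population (moment-based) formulas. The helper name `moment_shape` and the sample data are assumptions made for this example; statistical packages usually apply bias-corrected sample formulas, so their output will differ slightly on small samples.

```python
import statistics

def moment_shape(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # subtract 3 so a normal curve scores 0
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]         # evenly spread, no tail
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value creates a right tail

skew_sym, kurt_sym = moment_shape(symmetric)
skew_right, kurt_right = moment_shape(right_skewed)

print(f"symmetric:    skewness={skew_sym:.2f}, excess kurtosis={kurt_sym:.2f}")
print(f"right-skewed: skewness={skew_right:.2f}, excess kurtosis={kurt_right:.2f}")
```

The symmetric sample has a skewness of 0 and a negative excess kurtosis (flatter than a normal curve), while the right-tailed sample has a positive skewness.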

**Variance:** the squared standard deviation. It is reported less frequently than the standard deviation because it is expressed in squared units.

**Range:** maximum – minimum.
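A short sketch of the standard deviation, variance, and range using Python's standard `statistics` module. The `scores` list is made up for illustration. Note that `stdev` and `variance` use the sample formulas (dividing by n - 1), while `pstdev` uses the population formula (dividing by n).

```python
import statistics

# Hypothetical scores with a mean of 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(scores)       # sample SD, divides by n - 1
sample_var = statistics.variance(scores)   # sample variance = sample SD squared
population_sd = statistics.pstdev(scores)  # population SD, divides by n
value_range = max(scores) - min(scores)    # maximum - minimum

print(f"sample SD={sample_sd:.3f}, sample variance={sample_var:.3f}")
print(f"population SD={population_sd:.3f}, range={value_range}")
```

Which version to report depends on whether the data are treated as a sample or as the whole population; most statistical packages default to the sample version.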

## Outliers and extreme values

There are several methods for finding outliers and extreme values in a dataset, depending on the type of data and the nature of the distribution. Here are some common approaches:

- Box plot: A box plot is a graphical representation of the distribution of the data. It displays the median, the quartiles, and the extreme values (outliers) of the dataset. Outliers are the data points that fall outside the whiskers of the box plot, which typically extend at most 1.5 times the interquartile range (IQR) from the upper or lower quartile.
- Z-score: A Z-score is a statistical measure that indicates how many standard deviations a data point is from the mean of the dataset. Any data point with a Z-score greater than 3 or less than -3 is considered an outlier.
- Interquartile range (IQR) method: The IQR is the range of the middle 50% of the data, between the 25th and 75th percentiles. Any data point outside the range of Q1 - 1.5 × IQR to Q3 + 1.5 × IQR is considered an outlier.
- Modified Z-score: The modified Z-score is a variation of the Z-score method that is less sensitive to outliers. It uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation. Any data point with a modified Z-score greater than 3.5 or less than -3.5 is considered an outlier.
- Tukey's method: Tukey's method is a combination of the box plot and IQR methods. It defines outliers as any data points that fall outside the range of Q1 - 1.5 × IQR to Q3 + 1.5 × IQR, where Q1 and Q3 are the 25th and 75th percentiles, respectively. This is the same rule that the box plot whiskers are based on.
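The three numeric rules above can be sketched with the standard library alone. The function names, the sample data, and the choice of the "inclusive" quartile method are assumptions made for this illustration; quartile conventions differ between packages, so the fences may shift slightly elsewhere. The constant 0.6745 is the commonly used scaling factor that makes the MAD comparable to the standard deviation for normally distributed data.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / sd) > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside Q1 - k*IQR .. Q3 + k*IQR (the IQR / Tukey rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

def modified_zscore_outliers(values, threshold=3.5):
    """Values whose median/MAD-based score exceeds `threshold` in magnitude."""
    median = statistics.median(values)
    mad = statistics.median([abs(x - median) for x in values])
    if mad == 0:
        # More than half of the values equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified Z-score cannot be computed")
    return [x for x in values if abs(0.6745 * (x - median) / mad) > threshold]

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 11, 12, 100]

print("Z-score:         ", zscore_outliers(data))
print("IQR rule:        ", iqr_outliers(data))
print("Modified Z-score:", modified_zscore_outliers(data))
```

All three rules flag the value 100 in this sample. With small samples the plain Z-score is the least reliable of the three, because the outlier itself inflates the mean and standard deviation used to judge it.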

Once you have identified potential outliers using one or more of these methods, you should examine the data to determine whether they are genuine outliers or errors in the data. You may want to remove them from the dataset or handle them separately in your analysis if they are genuine outliers.