#### Chapter 11 Language of Descriptive Statistics

**Section 11.2 Frequency Distributions and Percentage Calculation**

# 11.2.5 Types of Diagrams

Qualitative and quantitative discrete data obtained from a sample are often presented graphically by **bar charts**.

##### **Info 11.2.20**

The bar chart shows the absolute or relative frequencies as a function of a finite number of property values in the sample. The bar lengths are proportional to the values they represent.

This is now illustrated by an example. The species of each of $10$ trees at the forest's edge was determined. The possible characteristic attributes are ${a}_{1}$ = oak, ${a}_{2}$ = beech and ${a}_{3}$ = spruce.

A sample resulted in the following original list:

| $i$ | $1$ | $2$ | $3$ | $4$ | $5$ | $6$ | $7$ | $8$ | $9$ | $10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ${x}_{i}$ | ${a}_{2}$ | ${a}_{1}$ | ${a}_{1}$ | ${a}_{3}$ | ${a}_{1}$ | ${a}_{2}$ | ${a}_{1}$ | ${a}_{1}$ | ${a}_{3}$ | ${a}_{3}$ |

This original list results in the following empirical frequency table:

| Attribute | Absolute frequency | Relative frequency | In $\%$ |
| --- | --- | --- | --- |
| Oak | $5$ | $0.5$ | $50$ |
| Beech | $2$ | $0.2$ | $20$ |
| Spruce | $3$ | $0.3$ | $30$ |

The bar chart corresponding to this empirical frequency table is shown in the figure below.

Bar chart
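The frequency table can be reproduced from the original list with a few lines of code. The following Python sketch assumes the mapping ${a}_{1}$ = Oak, ${a}_{2}$ = Beech, ${a}_{3}$ = Spruce, which is inferred from the frequencies in the table; all variable names are illustrative only. It prints a simple text bar whose length is proportional to the absolute frequency.

```python
from collections import Counter

# Original list x_1..x_10 from the example, coded as a1, a2, a3.
original_list = ["a2", "a1", "a1", "a3", "a1", "a2", "a1", "a1", "a3", "a3"]

# Assumed mapping, inferred from the frequency table in the text.
names = {"a1": "Oak", "a2": "Beech", "a3": "Spruce"}

n = len(original_list)
absolute = Counter(original_list)

# Rows of (attribute, absolute frequency, relative frequency, percent).
table = [
    (names[a], absolute[a], absolute[a] / n, 100 * absolute[a] / n)
    for a in sorted(names)
]

for name, H, h, pct in table:
    # Text bar: one '#' per observation, so length is proportional to H.
    print(f"{name:<7} {H:>2} {h:>5.2f} {pct:>4.0f}%  {'#' * H}")
```

Since the bar lengths are proportional to the frequencies, the same picture results whether absolute or relative frequencies are drawn; only the axis scale changes.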

Qualitative properties are often represented by **pie charts**:

##### **Info 11.2.21**

A slice is assigned to each characteristic attribute according to its relative frequency, where

${\alpha}_{j}={360}^{\circ}\cdot {h}_{j}.$

Here, ${\alpha}_{j}$ is the angle (in degree measure) of the slice (circular sector) that corresponds to the attribute $j$ within the original list $x=({x}_{1},{x}_{2},\dots ,{x}_{n})$, and ${h}_{j}$ is the relative frequency of this attribute.

This is again illustrated by an example.

A total of $n=1000$ households were asked how satisfied they were with a new kind of barbecue. The possible answers were: very satisfied (1), satisfied (2), less satisfied (3) and not satisfied (4).

The survey resulted in the following empirical frequency table.

| Attribute | Absolute frequency | Relative frequency | Percentage |
| --- | --- | --- | --- |
| Very satisfied | $100$ | $0.1$ | $10\%$ |
| Satisfied | $240$ | $0.24$ | $24\%$ |
| Less satisfied | $480$ | $0.48$ | $48\%$ |
| Not satisfied | $180$ | $0.18$ | $18\%$ |
| Sum | $1000$ | $1$ | $100\%$ |

The corresponding angles are, according to the Info Box above,

- ${\alpha}_{1}={360}^{\circ}\cdot 0.1={36}^{\circ}$,

- ${\alpha}_{2}={360}^{\circ}\cdot 0.24={86.4}^{\circ}$,

- ${\alpha}_{3}={360}^{\circ}\cdot 0.48={172.8}^{\circ}$,

- ${\alpha}_{4}={360}^{\circ}\cdot 0.18={64.8}^{\circ}$.
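The same angles can be computed directly from the absolute frequencies of the survey. The Python sketch below is only an illustration of the rule "angle equals $360^{\circ}$ times relative frequency"; the dictionary and variable names are assumptions made for this example.

```python
# Absolute frequencies from the barbecue survey (n = 1000 households).
absolute = {
    "Very satisfied": 100,
    "Satisfied": 240,
    "Less satisfied": 480,
    "Not satisfied": 180,
}

n = sum(absolute.values())

# Slice angle in degrees: alpha_j = 360 * h_j with h_j = H_j / n.
# Multiplying before dividing avoids rounding h_j as an intermediate step.
angles = {attr: 360 * H / n for attr, H in absolute.items()}

for attr, alpha in angles.items():
    print(f"{attr:<15} {alpha:6.1f} degrees")

# The slices together should fill the full circle (up to rounding error).
total = sum(angles.values())
print(f"Sum of angles: {total:.1f} degrees")
```

Checking that the angles add up to $360^{\circ}$ is a quick way to detect a miscounted or missing attribute.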

This results in the following pie chart:

It is often pointless to present all possible attributes in a diagram. It is more convenient to classify them and draw only the frequencies of the classes into a diagram. This is the only way to visualise the frequencies of continuous characteristics in a bar or pie chart.

Let $X$ be a quantitative (continuous) property, and $x=({x}_{1},{x}_{2},\dots ,{x}_{n})$ the original list for a sample of size $n$. An empirical frequency distribution is obtained according to the following approach:

- Find the minimum and the maximum sample value, i.e.

${x}_{(1)}=\min \{{x}_{1},{x}_{2},\dots ,{x}_{n}\}\quad \text{and}\quad {x}_{(n)}=\max \{{x}_{1},{x}_{2},\dots ,{x}_{n}\}.$

- List these and all values in between, rounded to the required fractional digit and sorted by size. This converts the (continuous) property $X$ into a discrete property.

- Prepare a tally sheet and draw the corresponding empirical frequency distribution.
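The three steps above can be sketched in Python as follows. The helper name `tally`, its `digits` parameter and the sample values are hypothetical and serve only as an illustration; they are not taken from the course material.

```python
from collections import Counter


def tally(values, digits=1):
    """Tally rounded values on the full grid from minimum to maximum."""
    step = 10 ** (-digits)
    # Rounding turns the continuous property into a discrete one.
    rounded = [round(v, digits) for v in values]
    lo, hi = min(rounded), max(rounded)
    counts = Counter(rounded)
    # Number of grid steps between the smallest and the largest value.
    steps = round((hi - lo) / step)
    # All grid values in between, including those that never occur.
    grid = [round(lo + i * step, digits) for i in range(steps + 1)]
    return [(g, counts[g]) for g in grid]


# Hypothetical measurements, chosen only to illustrate the procedure.
sample = [3.31, 3.58, 3.34, 3.72, 3.56]

for value, count in tally(sample):
    print(f"{value:.1f}  {'|' * count}  ({count})")
```

In this small example the values $3.4$ and $3.5$ lie between the minimum and the maximum but do not occur in the sample, so they appear in the tally with frequency zero.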

The empirical frequency distribution of a continuous property can be very broad. In particular, zero frequencies may appear for measurement values that do not occur in the original list (sample). As a result, the empirical frequency table becomes bulky and hard to read. Hence, a **classification** is carried out to reduce the amount of data (data reduction). In effect, this corresponds to a reduction of measurement accuracy (rounding!).

##### **Info 11.2.22**

**Classes** are half-open intervals of the form $({t}_{j};{t}_{j+1}]$ with ${t}_{j}<{t}_{j+1}$.

There is no general rule defining the number $k$ of classes or the size of a class. However, the following guidelines are recommended:

- Uniform classification: Find ${x}_{(1)}=\min \{{x}_{1},{x}_{2},\dots ,{x}_{n}\}$ and ${x}_{(n)}=\max \{{x}_{1},{x}_{2},\dots ,{x}_{n}\}$. Then divide the interval $({x}_{(1)}-\epsilon ;{x}_{(n)}+\epsilon ]$ with small $\epsilon >0$ into $k$ uniform, non-overlapping, half-open subintervals.

- Avoid classes that are too small or too large.

- If possible, avoid classes with only a few observations.

- Use approximately $k\approx \sqrt{n}$ equally sized classes, where $n$ is the sample size.
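A uniform classification along these guidelines could look like the following Python sketch. The function names `uniform_classes` and `class_index`, the default value of `eps` and the sample values are assumptions made for illustration; the guidelines themselves do not prescribe a particular choice of $\epsilon$ or of the rounding of $\sqrt{n}$.

```python
import math


def uniform_classes(values, eps=0.05, k=None):
    """Return the k + 1 boundaries of k equal-width classes
    covering the interval (min - eps; max + eps]."""
    n = len(values)
    if k is None:
        # Rule of thumb k ~ sqrt(n); rounding to an integer is an assumption.
        k = max(1, round(math.sqrt(n)))
    lo = min(values) - eps
    hi = max(values) + eps
    width = (hi - lo) / k
    return [lo + j * width for j in range(k + 1)]


def class_index(x, boundaries):
    """0-based index j of the half-open class (t_j; t_{j+1}] containing x."""
    for j in range(len(boundaries) - 1):
        if boundaries[j] < x <= boundaries[j + 1]:
            return j
    raise ValueError("value lies outside the classification")


# Hypothetical sample of size n = 9, so k = 3 classes are used.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
boundaries = uniform_classes(values, eps=0.5)
counts = [0] * (len(boundaries) - 1)
for x in values:
    counts[class_index(x, boundaries)] += 1

print("Boundaries:", boundaries)
print("Class frequencies:", counts)
```

Because the classes are half-open on the left, a value that coincides with a boundary ${t}_{j+1}$ is counted in class $j$ and not in class $j+1$, so every observation falls into exactly one class.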

A histogram is obtained through the following approach: let $x=({x}_{1},{x}_{2},\dots ,{x}_{n})$ be an original list for a sample of size $n$ of a quantitative property $X$.

- Use a classification into $k$ classes. Let the interval of the $j$th class, $j=1,2,\dots ,k$, be $({t}_{j};{t}_{j+1}]$.

- Let ${H}_{j}$ be the number of sample values in the interval $({t}_{j};{t}_{j+1}]$ for $j=1,2,\dots ,k$. The numbers ${H}_{j}$ are also called absolute class frequencies.

- For each $j\in \{1,2,\dots ,k\}$ draw a rectangle over the base $({t}_{j};{t}_{j+1}]$ of height ${d}_{j}$ with the area ${d}_{j}\cdot ({t}_{j+1}-{t}_{j})={h}_{j}=\frac{{H}_{j}}{n}$. The areas ${h}_{j}$ are the relative class frequencies.

The total area of these rectangles equals $1$.

This approach is now illustrated by a detailed example. In a data centre, the processing time (in s, rounded to one fractional digit) of $20$ program jobs was determined. This resulted in the following original list of a sample of size $n=20$:

3.9 | 3.3 | 4.6 | 4.0 | 3.8 |

3.8 | 3.6 | 4.6 | 4.0 | 3.9 |

3.9 | 3.9 | 4.1 | 3.7 | 3.6 |

4.6 | 4.0 | 4.0 | 3.8 | 4.1 |

The smallest value is $3.3$ s, the largest value is $4.6$ s, and the increment is $0.1$ s. According to the guidelines above, approximately $k\approx \sqrt{n}=\sqrt{20}\approx 4.5$ classes should be used. Here, we use the following classification into $k=4$ classes of width $0.4$ s or $0.3$ s.

| Class | Interval $({t}_{j};{t}_{j+1}]$, $j=1,2,3,4$ | Data |
| --- | --- | --- |
| Class 1 | $(3.25;3.65]$ | "From $3.3$ to $3.6$" |
| Class 2 | $(3.65;3.95]$ | "From $3.7$ to $3.9$" |
| Class 3 | $(3.95;4.25]$ | "From $4.0$ to $4.2$" |
| Class 4 | $(4.25;4.65]$ | "From $4.3$ to $4.6$" |

The table of the absolute and relative frequencies has the following form:

| Class | Absolute class frequency ${H}_{j}$ | Relative class frequency ${h}_{j}$ |
| --- | --- | --- |
| Class 1 | $3$ | $0.15$ |
| Class 2 | $8$ | $0.4$ |
| Class 3 | $6$ | $0.3$ |
| Class 4 | $3$ | $0.15$ |

The heights of the $k=4$ rectangles are as follows:

- Class 1: ${d}_{1}\cdot ({t}_{2}-{t}_{1})={d}_{1}\cdot 0.4={h}_{1}=0.15$, i.e. ${d}_{1}=\frac{3}{8}=0.375$.

- Class 2: ${d}_{2}\cdot ({t}_{3}-{t}_{2})={d}_{2}\cdot 0.3={h}_{2}=0.4$, i.e. ${d}_{2}=\frac{4}{3}=1.\overline{3}$.

- Class 3: ${d}_{3}\cdot ({t}_{4}-{t}_{3})={d}_{3}\cdot 0.3={h}_{3}=0.3$, i.e. ${d}_{3}=1$.

- Class 4: ${d}_{4}\cdot ({t}_{5}-{t}_{4})={d}_{4}\cdot 0.4={h}_{4}=0.15$, i.e. ${d}_{4}=\frac{3}{8}=0.375$.
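The class frequencies and rectangle heights of this example can be checked with a short script. The Python sketch below uses the original list and the class boundaries given above; the variable names are chosen for illustration only.

```python
# Processing times in seconds (original list of the example, n = 20).
times = [
    3.9, 3.3, 4.6, 4.0, 3.8,
    3.8, 3.6, 4.6, 4.0, 3.9,
    3.9, 3.9, 4.1, 3.7, 3.6,
    4.6, 4.0, 4.0, 3.8, 4.1,
]

# Class boundaries t_1, ..., t_5 of the k = 4 classes (t_j; t_{j+1}].
boundaries = [3.25, 3.65, 3.95, 4.25, 4.65]

n = len(times)
k = len(boundaries) - 1

# Absolute class frequencies H_j: count values with t_j < x <= t_{j+1}.
H = [
    sum(1 for x in times if boundaries[j] < x <= boundaries[j + 1])
    for j in range(k)
]

# Relative class frequencies h_j = H_j / n (the rectangle areas).
h = [H_j / n for H_j in H]

# Rectangle heights d_j = h_j / (t_{j+1} - t_j).
d = [h[j] / (boundaries[j + 1] - boundaries[j]) for j in range(k)]

for j in range(k):
    print(f"Class {j + 1}: H = {H[j]}, h = {h[j]:.2f}, d = {d[j]:.3f}")

# Total area of all rectangles; it should equal 1 up to rounding error.
area = sum(d[j] * (boundaries[j + 1] - boundaries[j]) for j in range(k))
print(f"Total area: {area:.6f}")
```

Because the classes have different widths, the heights ${d}_{j}$ are not proportional to the frequencies ${H}_{j}$: classes 1 and 4 each contain three values but are wider than classes 2 and 3, so their rectangles are comparatively low.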

Thus, we have the following histogram: