Homework 1

Problem 1

Suppose we have the following sample of daily weekday afternoon (3 to 7pm) lead concentrations (in micrograms per cubic meter, $\mu g/m^3$) recorded by an air-monitoring station near the San Diego Freeway in Los Angeles during the fall of 1976:

$$\begin{array}{rrrrrrrrrr} 6.7 & 5.4 & 5.2 & 6.0 & 8.7 & 6.0 & 6.4 & 8.3 & 5.3 & 5.9 \\ 7.6 & 5.0 & 6.9 & 6.8 & 4.9 & 6.3 & 5.0 & 6.0 & 7.2 & 8.0 \\ 8.1 & 7.2 & 10.9 & 9.2 & 8.6 & 6.2 & 6.1 & 14.1 & 10.6 & 8.4 \end{array}$$
  1. Calculate the sample mean $\mean{x}$, sample variance $s^2$, and sample standard deviation $s$.
  2. How many observations lie within one standard deviation from the mean? How many lie within two standard deviations from the mean?
Solution.
  1. This is a straightforward computation. Just keep in mind that to calculate $s^2$, you use $\frac{1}{n-1}$ instead of $\frac{1}{n}$.

    $$\boxed{ \begin{aligned} \mean{x} &\approx 7.2 \\ s^2 &\approx 4.2 \\ s &\approx 2.0 \end{aligned} }$$
  2. There are $24$ observations (strictly) within one standard deviation from the mean, and $29$ (strictly) within two standard deviations from the mean. An efficient way to do this by hand is to calculate the range the observations need to lie in. For instance, for $x$ to be within one standard deviation from the mean, $x$ needs to satisfy

    $$\mean{x} - s \leq x \leq \mean{x} + s.$$
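
    Using the unrounded values $\mean{x} \approx 7.233$ and $s \approx 2.045$, this interval is approximately $\br{5.19, 9.28}$, and the corresponding two-standard-deviation interval is approximately $\br{3.14, 11.32}$; counting the observations that land in each interval gives the $24$ and $29$ above.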

Python Code

data = [6.7, 5.4,  5.2, 6.0, 8.7, 6.0, 6.4,  8.3,  5.3, 5.9,
        7.6, 5.0,  6.9, 6.8, 4.9, 6.3, 5.0,  6.0,  7.2, 8.0,
        8.1, 7.2, 10.9, 9.2, 8.6, 6.2, 6.1, 14.1, 10.6, 8.4]

def sample_mean(data):
    n = len(data)
    return sum(data) / n


def sample_variance(data):
    x_bar = sample_mean(data)
    n = len(data)
    return sum([(x - x_bar) ** 2 for x in data]) / (n - 1)


def sample_standard_deviation(data):
    return sample_variance(data) ** 0.5


x_bar = sample_mean(data)
s2 = sample_variance(data)
s = sample_standard_deviation(data)

within_s = len([x for x in data if abs(x - x_bar) <= s])
within_2s = len([x for x in data if abs(x - x_bar) <= 2 * s])

print("x-bar:", x_bar)  # 7.233333333333333
print("s^2:  ", s2)     # 4.182298850574712
print("s:    ", s)      # 2.0450669550346543
print()
print("within  s of mean:", within_s)   # 24
print("within 2s of mean:", within_2s)  # 29

Problem 2

Suppose that a linear transformation is applied to each of the observations $x_1, x_2, \ldots, x_n$ in a set of data; that is, a transformed data set $y_1, y_2, \ldots, y_n$ is created from the original data via the equation

$$y_i = ax_i + b, \quad i = 1, 2, \ldots, n, \quad a, b \in \R.$$

Show that if $\mean{x}$ and $s_x^2$ are the sample mean and sample variance of the original data, then the sample mean and sample variance of the transformed data are given by

$$\begin{aligned} \mean{y} &= a\mean{x} + b \\ s_y^2 &= a^2 s_x^2. \end{aligned}$$
Solution.

For the mean,

$$\begin{aligned} \mean{y} &= \frac{1}{n} \sum_{i=1}^n y_i \\ &= \frac{1}{n} \sum_{i=1}^n \p{ax_i + b} \\ &= a\p{\frac{1}{n} \sum_{i=1}^n x_i} + \frac{1}{n} \sum_{i=1}^n b \\ &= a\mean{x} + b. \end{aligned}$$

For the sample variance,

$$\begin{aligned} s_y^2 &= \frac{1}{n-1} \sum_{i=1}^n \p{y_i - \mean{y}}^2 \\ &= \frac{1}{n-1} \sum_{i=1}^n \p{ax_i + b - \p{a\mean{x} + b}}^2 \\ &= \frac{1}{n-1} \sum_{i=1}^n \p{ax_i - a\mean{x}}^2 \\ &= a^2\p{\frac{1}{n-1} \sum_{i=1}^n \p{x_i - \mean{x}}^2} \\ &= a^2 s_x^2. \end{aligned}$$
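
Python Code

Not part of the proof, but as a quick numerical sanity check, here is a small sketch using Python's statistics module (whose variance function uses the same $\frac{1}{n-1}$ convention as above); the data are the first ten observations from Problem 1, and the values of a and b are arbitrary choices for illustration.

import statistics

# Arbitrary linear transformation y = a*x + b applied to some sample data.
x = [6.7, 5.4, 5.2, 6.0, 8.7, 6.0, 6.4, 8.3, 5.3, 5.9]  # first ten observations from Problem 1
a, b = 2.5, -3.0
y = [a * xi + b for xi in x]

# Sample mean: mean(y) should equal a * mean(x) + b (up to floating-point rounding).
print(statistics.mean(y), a * statistics.mean(x) + b)

# Sample variance: variance(y) should equal a**2 * variance(x)
# (statistics.variance divides by n - 1, matching the definition used here).
print(statistics.variance(y), a ** 2 * statistics.variance(x))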

Problem 3

Given the data set provided in Problem 1, calculate:

  1. the first and third sample quartiles $\widetilde{\pi}_{0.25}, \widetilde{\pi}_{0.75}$, and the interquartile range $\mathrm{IQR}$;
  2. the $10^{\mathrm{th}}$ and $90^{\mathrm{th}}$ sample percentiles;
  3. are there any suspected outliers or outliers in the sample?
Solution.

When sorted, the data is

$$\begin{array}{rrrrrrrrrr} 4.9 & 5.0 & 5.0 & 5.2 & 5.3 & 5.4 & 5.9 & 6.0 & 6.0 & 6.0 \\ 6.1 & 6.2 & 6.3 & 6.4 & 6.7 & 6.8 & 6.9 & 7.2 & 7.2 & 7.6 \\ 8.0 & 8.1 & 8.3 & 8.4 & 8.6 & 8.7 & 9.2 & 10.6 & 10.9 & 14.1 \end{array}$$

Recall that to compute $\widetilde{\pi}_p$, we first decompose $\p{n+1}p = d + r$, where $d$ is an integer and $0 \leq r < 1$. In our case, $n = 30$, so

$$\p{30 + 1} \cdot 0.25 = 7 + 0.75.$$

Thus, $\widetilde{\pi}_{0.25} = y_d + r\p{y_{d+1} - y_d} = y_7 + 0.75\p{y_8 - y_7} = 5.9 + 0.75 \cdot \p{6.0 - 5.9} = 5.975$. The remaining percentiles are calculated similarly:

$$\boxed{ \begin{aligned} \widetilde{\pi}_{0.25} &= 5.975 \\ \widetilde{\pi}_{0.75} &= 8.325 \\ \widetilde{\pi}_{0.1} &= 5.02 \\ \widetilde{\pi}_{0.9} &= 10.46 \end{aligned} }$$
$$\boxed{\mathrm{IQR} = \widetilde{\pi}_{0.75} - \widetilde{\pi}_{0.25} = 2.35}$$
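
As another worked example, for the $90^{\mathrm{th}}$ percentile we have $\p{30 + 1} \cdot 0.9 = 27 + 0.9$, so $\widetilde{\pi}_{0.9} = y_{27} + 0.9\p{y_{28} - y_{27}} = 9.2 + 0.9 \cdot \p{10.6 - 9.2} = 10.46$.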

Suspected outliers are the observations $x$ that satisfy

$$\begin{aligned} \widetilde{\pi}_{0.25} - 3\mathrm{IQR} &\leq x \leq \widetilde{\pi}_{0.25} - 1.5\mathrm{IQR}, \quad\text{or} \\ \widetilde{\pi}_{0.75} + 1.5\mathrm{IQR} &\leq x \leq \widetilde{\pi}_{0.75} + 3\mathrm{IQR}, \end{aligned}$$

and outliers satisfy

$$\begin{aligned} x &< \widetilde{\pi}_{0.25} - 3\mathrm{IQR}, \quad\text{or} \\ x &> \widetilde{\pi}_{0.75} + 3\mathrm{IQR}. \end{aligned}$$

You can check by hand that $14.1$ is the only suspected outlier and that there are no outliers.
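
Concretely, the upper fences are $\widetilde{\pi}_{0.75} + 1.5\mathrm{IQR} = 8.325 + 3.525 = 11.85$ and $\widetilde{\pi}_{0.75} + 3\mathrm{IQR} = 8.325 + 7.05 = 15.375$, while the lower fences ($2.45$ and $-1.075$) lie below every observation; only $14.1$ falls between $11.85$ and $15.375$, and no observation exceeds $15.375$.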

Python Code

import math


def sample_percentile(y, p):
    # Decompose (n + 1) * p into integer part d and fractional part r,
    # then interpolate between the d-th and (d + 1)-th order statistics.
    x = (len(y) + 1) * p
    d = math.floor(x)
    r = x - d

    # y[d] = y_{d+1} since arrays start at 0
    return y[d - 1] + r * (y[d] - y[d - 1])


y = sorted(data)

pi_25 = sample_percentile(y, 0.25)
pi_75 = sample_percentile(y, 0.75)
iqr = pi_75 - pi_25

pi_10 = sample_percentile(y, 0.1)
pi_90 = sample_percentile(y, 0.9)

suspected = [
    x
    for x in data
    if (pi_25 - 3 * iqr <= x and x <= pi_25 - 1.5 * iqr)
    or (pi_75 + 1.5 * iqr <= x and x <= pi_75 + 3 * iqr)
]

outliers = [x for x in data if x < pi_25 - 3 * iqr or pi_75 + 3 * iqr < x]

print("pi_25:", pi_25)  # 5.975
print("pi_75:", pi_75)  # 8.325000000000001
print("iqr:  ", iqr)    # 2.3500000000000014
print()
print("pi_10:", pi_10)  # 5.0200000000000005
print("pi_90:", pi_90)  # 10.460000000000003
print()
print("suspected outliers: ", suspected)  # [14.1]
print("outliers:           ", outliers)   # []

Problem 4

Let $Y_1 < Y_2 < \cdots < Y_8$ be the order statistics of eight independent observations from a continuous-type distribution with $70^{\mathrm{th}}$ percentile $\pi_{0.7} = 27.3$.

  1. Determine $\P\p{Y_7 < 27.3}$.
  2. Find $\P\p{Y_5 < 27.3 < Y_8}$.
Solution.

First recall that being the $70^{\mathrm{th}}$ percentile means

$$\P\p{X_1 < \pi_{0.7}} = 0.7.$$
  1. We need at least $7$ of the $X_i$'s to be less than $27.3$, so
$$\begin{aligned} \P\p{Y_7 < 27.3} &= \sum_{k=7}^8 \binom{8}{k} \P\p{X_1 < 27.3}^k \p{1 - \P\p{X_1 < 27.3}}^{8-k} \\ &= \boxed{\sum_{k=7}^8 \binom{8}{k} \p{0.7}^k \p{0.3}^{8-k}}. \end{aligned}$$
  2. We need at least $5$ of the $X_i$'s to be less than $27.3$ (so that $Y_5 < 27.3$), but strictly fewer than $8$ (so that $27.3 < Y_8$), so
$$\begin{aligned} \P\p{Y_5 < 27.3 < Y_8} &= \sum_{k=5}^{8-1} \binom{8}{k} \P\p{X_1 < 27.3}^k \p{1 - \P\p{X_1 < 27.3}}^{8-k} \\ &= \boxed{\sum_{k=5}^7 \binom{8}{k} \p{0.7}^k \p{0.3}^{8-k}}. \end{aligned}$$
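
Python Code

The boxed sums are already a complete answer; if decimal values are wanted, a short sketch in the spirit of the earlier code sections, using math.comb, evaluates them.

from math import comb

p = 0.7  # P(X_1 < 27.3), since 27.3 is the 70th percentile
n = 8

# Part 1: at least 7 of the 8 observations fall below 27.3.
p1 = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(7, 9))

# Part 2: at least 5 but fewer than 8 fall below 27.3.
p2 = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(5, 8))

print(p1)  # approximately 0.2553
print(p2)  # approximately 0.7482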

Problem 5

Let $W_1 < W_2 < \cdots < W_n$ be the order statistics of $n$ independent observations from a uniform distribution $U\p{0, 1}$.

  1. Show that

    $$\E\br{W_r^2} = \frac{r\p{r+1}}{\p{n+1}\p{n+2}}.$$
  2. Find the variance of $W_r$.

Solution.
  1. Recall that the cdf of the order statistic $W_r$ is

    $$F_{W_r}\p{w} = \P\p{W_r \leq w} = \sum_{k=r}^n \binom{n}{k} F\p{w}^k \p{1 - F\p{w}}^{n-k},$$

    where $F, f$ are the cdf and pdf of the original random variables. In this case, $f\p{x} = 1$ for $0 < x < 1$ and $F\p{x} = x$ for $0 < x < 1$, so the cdf of $W_r$ is

    $$F_{W_r}\p{w} = \sum_{k=r}^n \binom{n}{k} w^k \p{1 - w}^{n-k}.$$

    The last step is a computation. There are two ways to do it: (i) by using the pdf of $W_r$, or (ii) from the cdf alone. I'll do it via (ii) since I prefer using the cdf, and I think the computation will be a good review. (i) is easier, however, since it skips the integration by parts in step (1) below.

    I'll number the harder steps and explain them afterwards. If $f_{W_r}$ is the pdf, then

    $$\begin{aligned} \E\br{W_r^2} &= \int_0^1 w^2 f_{W_r}\p{w} \,\diff{w} \\ &= \left. w^2 F_{W_r}\p{w} \right\rvert_0^1 - 2 \int_0^1 w F_{W_r}\p{w} \,\diff{w} && \p{1} \\ &= 1 - 2 \int_0^1 \sum_{k=r}^n \binom{n}{k} w^{k+1} \p{1 - w}^{n-k} \,\diff{w} \\ &= 1 - 2 \sum_{k=r}^n \binom{n}{k} \int_0^1 w^{\p{k+2}-1} \p{1 - w}^{\p{n-k+1}-1} \,\diff{w} \\ &= 1 - 2 \sum_{k=r}^n \binom{n}{k} \frac{\Gamma\p{k + 2} \Gamma\p{n - k + 1}}{\Gamma\p{n + 3}} && \p{2} \\ &= 1 - 2 \sum_{k=r}^n \frac{n!}{k!\p{n - k}!} \frac{\p{k + 1}!\p{n - k}!}{\p{n + 2}!} \\ &= 1 - 2 \sum_{k=r}^n \frac{k + 1}{\p{n + 2}\p{n + 1}} \\ &= 1 - \frac{2}{\p{n + 1}\p{n + 2}} \sum_{k=r}^n \p{k + 1} && \p{3} \\ &= 1 - \frac{2}{\p{n + 1}\p{n + 2}} \cdot \frac{\p{n + r + 2}\p{n - r + 1}}{2} \\ &= \frac{r\p{r + 1}}{\p{n + 1}\p{n + 2}}. \end{aligned}$$

    (1) is integration by parts with

    $$\begin{cases} u = w^2 \\ \diff{v} = f_{W_r}\p{w} \,\diff{w} \end{cases} \implies \begin{cases} \diff{u} = 2w \,\diff{w} \\ v = F_{W_r}\p{w}. \end{cases}$$

    (2) is from the pdf of a $\operatorname{Beta}\p{\alpha, \beta}$ distribution, which tells us

    $$\int_0^1 x^{\alpha-1} \p{1 - x}^{\beta-1} \,\diff{x} = \frac{\Gamma\p{\alpha} \Gamma\p{\beta}}{\Gamma\p{\alpha + \beta}},$$

    where $\Gamma\p{n} = \p{n-1}!$ for positive integers $n$.

    (3) is the sum of an arithmetic sequence. The result looks a little different from the standard formula since the sum starts at $k = r$ instead of $k = 0$, but you can fix that by using

    $$\sum_{k=r}^n a_k = \sum_{k=0}^n a_k - \sum_{k=0}^{r-1} a_k.$$
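
    Concretely, with $a_k = k + 1$ this gives the value used in step $\p{3}$:

    $$\sum_{k=r}^n \p{k + 1} = \sum_{k=0}^n \p{k + 1} - \sum_{k=0}^{r-1} \p{k + 1} = \frac{\p{n+1}\p{n+2}}{2} - \frac{r\p{r+1}}{2} = \frac{\p{n + r + 2}\p{n - r + 1}}{2}.$$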
  2. Recall that for any random variable $X$,

    $$\Var X = \E\br{X^2} - \p{\E\br{X}}^2.$$

    We already know that $\E\br{W_r} = \frac{r}{n+1}$ and we computed the second moment in part 1, so plugging everything in,

    $$\Var W_r = \frac{r\p{r+1}}{\p{n+1}\p{n+2}} - \p{\frac{r}{n+1}}^2 = \boxed{\frac{r\p{n-r+1}}{\p{n+1}^2\p{n+2}}}.$$

    (Side note: The expectation and variance of $W_r$ agree with those of a $\operatorname{Beta}\p{r, n - r + 1}$ random variable!)
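
Python Code

As a rough spot-check of both formulas (not required by the problem), here is a small Monte Carlo sketch; the values of n, r, and the trial count are arbitrary choices for illustration.

import random

random.seed(0)
n, r, trials = 10, 3, 200_000

# Simulate W_r: sort n uniform draws and take the r-th smallest.
samples = []
for _ in range(trials):
    w = sorted(random.random() for _ in range(n))
    samples.append(w[r - 1])  # w[r - 1] is W_r since lists are 0-indexed

second_moment = sum(x ** 2 for x in samples) / trials
mean = sum(samples) / trials
variance = second_moment - mean ** 2

# Each pair should agree to a couple of decimal places.
print(second_moment, r * (r + 1) / ((n + 1) * (n + 2)))
print(variance, r * (n - r + 1) / ((n + 1) ** 2 * (n + 2)))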