Suppose we have the following sample of daily weekday afternoon (3 to 7 pm) lead concentrations (in micrograms per cubic meter, μg/m³) recorded by an air-monitoring station near the San Diego Freeway in Los Angeles during the fall of 1976:
Calculate the sample mean $\bar{x}$, sample variance $s^2$, and sample standard deviation $s$.
How many observations lie within one standard deviation from the mean? How many lie within two standard deviations from the mean?
Solution.
This is a straightforward computation. Just keep in mind that to calculate $s^2$, you use $\frac{1}{n-1}$ instead of $\frac{1}{n}$.
$$\bar{x} \approx 7.2, \qquad s^2 \approx 4.2, \qquad s \approx 2.0.$$
There are 24 observations (strictly) within one standard deviation from the mean, and 29 (strictly) within two standard deviations from the mean. An efficient way to do this by hand is to calculate the range the observations need to lie in. For instance, for $x$ to be within one standard deviation from the mean, $x$ needs to satisfy
$$\bar{x} - s \le x \le \bar{x} + s.$$
Python Code
data = [6.7, 5.4, 5.2, 6.0, 8.7, 6.0, 6.4, 8.3, 5.3, 5.9,
        7.6, 5.0, 6.9, 6.8, 4.9, 6.3, 5.0, 6.0, 7.2, 8.0,
        8.1, 7.2, 10.9, 9.2, 8.6, 6.2, 6.1, 14.1, 10.6, 8.4]

def sample_mean(data):
    n = len(data)
    return sum(data) / n

def sample_variance(data):
    x_bar = sample_mean(data)
    n = len(data)
    # divide by n - 1, not n
    return sum([(x - x_bar) ** 2 for x in data]) / (n - 1)

def sample_standard_deviation(data):
    return sample_variance(data) ** 0.5

x_bar = sample_mean(data)
s2 = sample_variance(data)
s = sample_standard_deviation(data)
within_s = len([x for x in data if abs(x - x_bar) <= s])
within_2s = len([x for x in data if abs(x - x_bar) <= 2 * s])

print("x-bar:", x_bar)  # 7.233333333333333
print("s^2: ", s2)  # 4.182298850574712
print("s: ", s)  # 2.0450669550346543
print()
print("within s of mean:", within_s)  # 24
print("within 2s of mean:", within_2s)  # 29
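As a cross-check on the hand-rolled functions, the same quantities can be obtained from Python's standard `statistics` module, whose `variance` and `stdev` also divide by $n-1$. This is only a sketch for comparison; the bound names `lo1`, `hi1`, `lo2`, `hi2` are mine.

```python
import statistics

# Same 30 lead-concentration readings as in the solution code.
data = [6.7, 5.4, 5.2, 6.0, 8.7, 6.0, 6.4, 8.3, 5.3, 5.9,
        7.6, 5.0, 6.9, 6.8, 4.9, 6.3, 5.0, 6.0, 7.2, 8.0,
        8.1, 7.2, 10.9, 9.2, 8.6, 6.2, 6.1, 14.1, 10.6, 8.4]

x_bar = statistics.mean(data)    # sample mean
s2 = statistics.variance(data)   # sample variance (divides by n - 1)
s = statistics.stdev(data)       # sample standard deviation

# Compute each interval once, then count the observations inside it.
lo1, hi1 = x_bar - s, x_bar + s
lo2, hi2 = x_bar - 2 * s, x_bar + 2 * s
within_s = sum(lo1 <= x <= hi1 for x in data)
within_2s = sum(lo2 <= x <= hi2 for x in data)

print(round(x_bar, 1), round(s2, 1), round(s, 1))  # 7.2 4.2 2.0
print(within_s, within_2s)                         # 24 29
```

Computing the interval endpoints first mirrors the by-hand approach: find the range, then count.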
Problem 2
Suppose that a linear transformation is applied to each of the observations $x_1, x_2, \ldots, x_n$ in a set of data; that is, a transformed data set $y_1, y_2, \ldots, y_n$ is created from the original data via the equation
$$y_i = a x_i + b, \qquad i = 1, 2, \ldots, n, \qquad a, b \in \mathbb{R}.$$
Show that if $\bar{x}$ and $s_x^2$ are the sample mean and sample variance of the original data, then the sample mean and sample variance of the transformed data are given by
$$\bar{y} = a\bar{x} + b, \qquad s_y^2 = a^2 s_x^2.$$
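Before proving the claim, it can help to sanity-check it numerically. The sketch below (not a proof) applies a transformation with arbitrarily chosen constants `a` and `b` to a small sample and compares both sides of the identities $\bar{y} = a\bar{x} + b$ and $s_y^2 = a^2 s_x^2$. The sample and the constants are illustrative choices of mine.

```python
import statistics

x = [6.7, 5.4, 5.2, 6.0, 8.7]  # any small sample works here
a, b = -2.5, 4.0               # hypothetical constants for the check

# Apply y_i = a * x_i + b to every observation.
y = [a * xi + b for xi in x]

mean_x, var_x = statistics.mean(x), statistics.variance(x)
mean_y, var_y = statistics.mean(y), statistics.variance(y)

# Each pair should agree up to floating-point rounding.
print(mean_y, a * mean_x + b)
print(var_y, a ** 2 * var_x)
```

Note that the shift `b` drops out of the variance, and a negative `a` still gives a positive variance because it enters as `a ** 2`.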
You can check by hand that 14.1 is the only suspected outlier and that there are no outliers.
Python Code
import math

def sample_percentile(y, p):
    # y must be sorted; use the (n + 1) * p position rule
    x = (len(y) + 1) * p
    d = math.floor(x)
    r = x - d
    # y[d] = y_{d+1} since arrays start at 0
    return y[d - 1] + r * (y[d] - y[d - 1])

y = sorted(data)
pi_25 = sample_percentile(y, 0.25)
pi_75 = sample_percentile(y, 0.75)
iqr = pi_75 - pi_25
pi_10 = sample_percentile(y, 0.1)
pi_90 = sample_percentile(y, 0.9)

# suspected outliers: between 1.5 and 3 IQRs beyond the quartiles
suspected = [
    x
    for x in data
    if (pi_25 - 3 * iqr <= x and x <= pi_25 - 1.5 * iqr)
    or (pi_75 + 1.5 * iqr <= x and x <= pi_75 + 3 * iqr)
]
# outliers: more than 3 IQRs beyond the quartiles
outliers = [x for x in data if x < pi_25 - 3 * iqr or pi_75 + 3 * iqr < x]

print("pi_25:", pi_25)  # 5.975
print("pi_75:", pi_75)  # 8.325000000000001
print("iqr: ", iqr)  # 2.3500000000000014
print()
print("pi_10:", pi_10)  # 5.0200000000000005
print("pi_90:", pi_90)  # 10.460000000000003
print()
print("suspected outliers: ", suspected)  # [14.1]
print("outliers: ", outliers)  # []
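The percentile rule used above, position $(n+1)p$ with linear interpolation, matches the `"exclusive"` method of `statistics.quantiles` in the standard library (Python 3.8+), so that function gives an independent check. This is a sketch for comparison only; the variable names are mine.

```python
import statistics

data = [6.7, 5.4, 5.2, 6.0, 8.7, 6.0, 6.4, 8.3, 5.3, 5.9,
        7.6, 5.0, 6.9, 6.8, 4.9, 6.3, 5.0, 6.0, 7.2, 8.0,
        8.1, 7.2, 10.9, 9.2, 8.6, 6.2, 6.1, 14.1, 10.6, 8.4]

# n=4 returns the three quartile cut points; n=10 returns the nine deciles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
deciles = statistics.quantiles(data, n=10, method="exclusive")
pi_10, pi_90 = deciles[0], deciles[-1]
iqr = q3 - q1

# Same classification as above: 1.5 to 3 IQRs out is "suspected", beyond 3 is an outlier.
suspected = [x for x in data
             if q1 - 3 * iqr <= x <= q1 - 1.5 * iqr
             or q3 + 1.5 * iqr <= x <= q3 + 3 * iqr]
outliers = [x for x in data if x < q1 - 3 * iqr or x > q3 + 3 * iqr]

print(round(q1, 3), round(q3, 3), round(iqr, 3))  # 5.975 8.325 2.35
print(round(pi_10, 2), round(pi_90, 2))           # 5.02 10.46
print(suspected, outliers)                        # [14.1] []
```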
Problem 4
Let $Y_1 < Y_2 < \cdots < Y_8$ be the order statistics of eight independent observations from a continuous-type distribution with 70th percentile $\pi_{0.7} = 27.3$.
Determine $P(Y_7 < 27.3)$.
Find $P(Y_5 < 27.3 < Y_8)$.
Solution.
First recall that being the 70th percentile means
$$P(X_1 < \pi_{0.7}) = 0.7.$$
We need at least 7 of the $X_i$'s to be less than 27.3, so
$$P(Y_7 < 27.3) = \sum_{k=7}^{8} \binom{8}{k} (0.7)^k (0.3)^{8-k} \approx 0.2553.$$
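The number of the eight observations that fall below 27.3 is binomial with $n = 8$ and $p = 0.7$, so both probabilities in this problem reduce to binomial sums. Here is a minimal sketch of that computation using only the standard library; the helper name `binom_pmf` is my own. For the second part, $Y_5 < 27.3 < Y_8$ means at least 5 observations are below 27.3 but the largest is not, that is, between 5 and 7 of them are below.

```python
from math import comb

n, p = 8, 0.7  # eight observations, each below 27.3 with probability 0.7

def binom_pmf(k, n, p):
    """P(exactly k of n independent observations fall below the percentile)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# P(Y_7 < 27.3): at least 7 of the 8 observations are below 27.3.
p_a = sum(binom_pmf(k, n, p) for k in range(7, n + 1))

# P(Y_5 < 27.3 < Y_8): at least 5 but at most 7 observations are below 27.3.
p_b = sum(binom_pmf(k, n, p) for k in range(5, 8))

print(round(p_a, 4))  # 0.2553
print(round(p_b, 4))  # 0.7482
```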
where $F, f$ are the cdf, pdf of the original random variables. In this case, $f(x) = 1$ for $0 < x < 1$ and $F(x) = x$ for $0 < x < 1$, so the cdf of $W_r$ is
$$F_{W_r}(w) = \sum_{k=r}^{n} \binom{n}{k} w^k (1-w)^{n-k}.$$
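To make the formula concrete, the sketch below evaluates this cdf for one choice of $n$, $r$, and $w$ and compares it with a simulation that sorts $n$ uniform draws and checks whether the $r$th smallest is at most $w$. The function names, the particular values of $n$, $r$, $w$, and the seed are all illustrative assumptions of mine.

```python
import random
from math import comb

def cdf_order_stat_uniform(w, r, n):
    """Binomial-sum cdf of the r-th order statistic of n Uniform(0, 1) draws."""
    return sum(comb(n, k) * w ** k * (1 - w) ** (n - k) for k in range(r, n + 1))

def empirical_cdf(w, r, n, trials=20000, seed=0):
    """Monte Carlo estimate of P(W_r <= w); seeded so the run is repeatable."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        if sample[r - 1] <= w:  # sample[r - 1] is the r-th smallest value
            hits += 1
    return hits / trials

n, r, w = 5, 2, 0.4
exact = cdf_order_stat_uniform(w, r, n)
approx = empirical_cdf(w, r, n)

print(round(exact, 5))  # 0.66304
print(approx)           # should land close to the exact value
```

The simulated value only approximates the formula; with 20000 trials the two typically agree to about two decimal places.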
The last step is a computation. There are two ways to do it: (i) by using the pdf given, or (ii) from the cdf alone. I'll do it via (ii) since I prefer using the cdf, and I think the computation will be a good review. (i) will be easier, however, and will just require step (1) below.
I'll number the harder steps and explain them afterwards. If $f_{W_r}$ is the pdf, then
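(2) is from the pdf of a $\text{Beta}(\alpha, \beta)$ distribution, which tells us
$$\int_0^1 x^{\alpha-1} (1-x)^{\beta-1} \, dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)},$$
where $\Gamma(n) = (n-1)!$ for positive integers $n$.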
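If you want to convince yourself of the Beta integral identity, a quick numerical check is easy. The sketch below approximates the integral with a midpoint rule and compares it with the gamma-function expression via `math.gamma`; the function name `beta_integral`, the step count, and the choice $\alpha = 3$, $\beta = 4$ are mine.

```python
from math import gamma

def beta_integral(alpha, beta, steps=100_000):
    """Midpoint-rule approximation of the integral of x^(alpha-1) (1-x)^(beta-1) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += x ** (alpha - 1) * (1 - x) ** (beta - 1)
    return total * h

alpha, beta = 3, 4
numeric = beta_integral(alpha, beta)
closed_form = gamma(alpha) * gamma(beta) / gamma(alpha + beta)

# With integer arguments the closed form is 2! * 3! / 6! = 1/60.
print(numeric, closed_form)
```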
(3) is the sum of an arithmetic sequence. The result is a little different from what's there since the sum starts at $k = r$ instead of $k = 0$, but you can fix that by using
$$\sum_{k=r}^{n} a_k = \sum_{k=0}^{n} a_k - \sum_{k=0}^{r-1} a_k.$$
Recall that for any random variable $X$,
$$\operatorname{Var} X = E[X^2] - (E[X])^2.$$
We already know that $E[W_r] = \frac{r}{n+1}$ and we computed the second moment in part 1, so plugging everything in,