2. K-nn Regression

K-nn method (Regression)

The best possible value for $\hat f(x_0)$ (in terms of expected squared prediction error) is

$$\hat f(x_0) = E[Y_0] = f(x_0).$$

Given data $(x_1, Y_1), \dots, (x_n, Y_n)$, how do we learn/estimate $f(x_0)$?

First: assume $x_1, \dots, x_n$ take only values 0 or 1 and $x_0 = 0$.
A natural approach:

$$\hat f(0) = \mathrm{Average}(Y_i : x_i = 0) = \frac{\sum_{i=1}^n Y_i\, I\{x_i = 0\}}{\sum_{i=1}^n I\{x_i = 0\}}$$

Here,

$$I\{x_i = 0\} = \begin{cases} 1, & \text{if } x_i = 0, \\ 0, & \text{otherwise.} \end{cases}$$

This works in data sets where $x_1, \dots, x_n$ take few distinct values and we are interested in predicting the outcome for one of those values.

Idea: instead of requiring $x_i = x_0$, take the $K$ 'closest' $x_i$.
K-nn (K nearest neighbours) method:

$$\hat f(x_0) = \frac{1}{K} \sum_{i=1}^n Y_i\, I\{x_i \text{ among the } K \text{ closest to } x_0\}$$
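A minimal sketch of this estimator in code (illustrative only: a one-dimensional predictor, absolute-value distance, ties ignored, and made-up data; `knn_predict` is a hypothetical helper, not a library function):

```python
import numpy as np

# Minimal K-nn regression estimate at a single point x0
# (sketch: 1-D predictor, |x_i - x0| as distance, ties ignored).
def knn_predict(x_train, y_train, x0, k):
    nearest = np.argsort(np.abs(x_train - x0))[:k]  # indices of the K closest x_i
    return y_train[nearest].mean()                  # average their Y_i

# Toy usage with simulated data:
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(50)
print(knn_predict(x, y, x0=0.5, k=5))
```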

Bias-Variance

Variance

Let $Y_i = f(x_i) + \varepsilon_i$, with $\varepsilon_i$ i.i.d., $E[\varepsilon_i] = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2$, the $x_i$ fixed, and $\hat f(x_0)$ the K-nn estimator.
Then, provided the $K$ closest points $x_i$ to $x_0$ are unique:

$$\mathrm{Var}[\hat f(x_0)] = \frac{\sigma^2}{K}.$$

Derivation

Let $I_i = I\{x_i \text{ among the } K \text{ closest to } x_0\}$. Then

$$\mathrm{Var}\Big(\frac{1}{K} \sum_{i=1}^n I_i Y_i\Big) \overset{\text{ind.}}{=} \frac{1}{K^2} \sum_{i=1}^n \mathrm{Var}(I_i Y_i) = \frac{1}{K^2} \sum_{i=1}^n I_i \underbrace{\mathrm{Var}(Y_i)}_{\sigma^2} = \frac{1}{K^2}\, K \sigma^2 = \frac{\sigma^2}{K}.$$
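As a quick sanity check (not part of the notes), a small simulation with made-up data should reproduce $\sigma^2/K$:

```python
import numpy as np

# Monte Carlo check (illustrative) that Var[f_hat(x0)] = sigma^2 / K
# for the K-nn estimator with a fixed design x_i = i/n.
rng = np.random.default_rng(0)
n, K, sigma, n_rep = 100, 7, 2.0, 20_000
x = np.arange(1, n + 1) / n
f = np.sin(2 * np.pi * x)                  # any smooth f works here
idx = np.argsort(np.abs(x - 0.5))[:K]      # the K nearest x_i to x0 = 0.5 (fixed)

est = np.empty(n_rep)
for r in range(n_rep):
    Y = f + sigma * rng.standard_normal(n)  # Y_i = f(x_i) + eps_i
    est[r] = Y[idx].mean()                  # K-nn estimate at x0

print(est.var(), sigma**2 / K)              # both should be close to 4/7
```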

Bias

For simplicity: $x_i = i/n$, $i = 1, \dots, n$,
$f : [0,1] \to \mathbb{R}$ twice continuously differentiable,
$K = 2\ell + 1$,
$x_0 = j/n$ with $\ell < j < n - \ell$.
Then:

$$E[\hat f(x_0)] = \frac{1}{K} \sum_{u=-\ell}^{\ell} f\Big(\frac{j+u}{n}\Big) = f(x_0) + \frac{1}{24} f''(x_0) \Big(\frac{K}{n}\Big)^2 + r_{K,n},$$

where $r_{K,n}$ is a remainder term, 'small' under suitable conditions.

Remark: if we let $\ell = \ell_n \to \infty$ with $\ell_n / n \to 0$ in the analysis above, this would be Asymptotic Statistics.

Derivation

The $K$ nearest neighbours of $x_0 = j/n$ are

$$\frac{j-\ell}{n}, \frac{j-\ell+1}{n}, \dots, \frac{j}{n}, \frac{j+1}{n}, \dots, \frac{j+\ell}{n},$$

so $I_i = 1$ iff $i = j-\ell, j-\ell+1, \dots, j+\ell$. Hence

$$\begin{aligned}
E\Big[\frac{1}{K} \sum_{i=1}^n I_i Y_i\Big] &= \frac{1}{K} \sum_{u=-\ell}^{\ell} E[Y_{j+u}] = \frac{1}{K} \sum_{u=-\ell}^{\ell} f\Big(\frac{j+u}{n}\Big) \\
&= \frac{1}{K} \sum_{u=-\ell}^{\ell} \Big( f\Big(\frac{j}{n}\Big) + \frac{u}{n} f'\Big(\frac{j}{n}\Big) + \frac{1}{2} \Big(\frac{u}{n}\Big)^2 f''\Big(\frac{j}{n}\Big) + \cdots \Big) \\
&= \underbrace{\frac{1}{K} \sum_{u=-\ell}^{\ell} f\Big(\frac{j}{n}\Big)}_{f(x_0)} + \underbrace{f'\Big(\frac{j}{n}\Big)\, \frac{1}{K} \sum_{u=-\ell}^{\ell} \frac{u}{n}}_{0} + \underbrace{\frac{1}{2} \frac{1}{K} \sum_{u=-\ell}^{\ell} \Big(\frac{u}{n}\Big)^2 f''\Big(\frac{j}{n}\Big)}_{\frac{1}{2} \frac{1}{K} f''(x_0) \sum_{u=-\ell}^{\ell} u^2/n^2} + \frac{1}{K} \cdot \text{remainder}.
\end{aligned}$$
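To see where the constant $\frac{1}{24}$ comes from (a step the derivation leaves implicit), the quadratic term can be closed with the standard sum-of-squares identity $\sum_{u=-\ell}^{\ell} u^2 = \frac{\ell(\ell+1)(2\ell+1)}{3}$:

$$\frac{1}{2K} f''(x_0) \sum_{u=-\ell}^{\ell} \frac{u^2}{n^2} = \frac{f''(x_0)}{2Kn^2} \cdot \frac{\ell(\ell+1)(2\ell+1)}{3} = \frac{\ell(\ell+1)}{6 n^2} f''(x_0) \approx \frac{1}{24} f''(x_0) \Big(\frac{K}{n}\Big)^2,$$

since $K = 2\ell + 1$ and $\ell(\ell+1) \approx (K/2)^2$.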

Validation

$$\mathrm{MSE}_{\text{train}} = \frac{1}{n} \sum_{i=1}^n (\hat f(x_i) - Y_i)^2, \quad \text{computed on the training data } (x_1, Y_1), \dots, (x_n, Y_n)$$

This is optimistic, since $\hat f$ was fit to the same data; one should instead use a test set.

Test Error

Population-level quantity at a fixed value of the predictor $x_0$: for a new data point $Y_0$ that is independent of $\hat f(x_0)$,

$$E[(\hat f(x_0) - Y_0)^2].$$

Population-level quantity across a range of predictors: if $(x_1^{\mathrm{te}}, Y_1^{\mathrm{te}}), \dots, (x_N^{\mathrm{te}}, Y_N^{\mathrm{te}})$ is independent of $\hat f$, the average value of the expected MSE is

$$\frac{1}{N} \sum_{i=1}^N E[(\hat f(x_i^{\mathrm{te}}) - Y_i^{\mathrm{te}})^2].$$

If we had an additional sample, called a test sample, $(x_1^{\mathrm{te}}, Y_1^{\mathrm{te}}), \dots, (x_N^{\mathrm{te}}, Y_N^{\mathrm{te}})$, which was not used to build $\hat f$, we could try to approximate this error by

$$\mathrm{MSE}_{\text{test}} = \frac{1}{N} \sum_{i=1}^N (\hat f(x_i^{\mathrm{te}}) - Y_i^{\mathrm{te}})^2.$$
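In code this is just the average squared error over held-out points; a brief sketch (simulated data, with the same illustrative `knn_predict` helper as before):

```python
import numpy as np

# Compute MSE_test on a held-out sample (illustrative; data are simulated).
def knn_predict(x_train, y_train, x0, k):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)

x_tr, y_tr = x[:150], y[:150]       # used to build f_hat
x_te, y_te = x[150:], y[150:]       # never used for fitting

pred = np.array([knn_predict(x_tr, y_tr, x0, k=5) for x0 in x_te])
print(np.mean((pred - y_te) ** 2))  # MSE_test
```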

$\mathrm{MSE}_{\text{test}}$ should converge to the population-level quantity.
This is not really the WLLN, since $\hat f$ relies on the other observations (so all summands share $\hat f$) and both quantities involve $N$.

So we split the data into a fitting part and a validation part, but we have two issues:

  1. The split is random, so our selected model might depend on the split.
  2. Splitting in half might give the wrong bias-variance trade-off; normally more data means less variance:

$$\mathrm{Var}(\hat f) = \frac{\sigma^2}{K}, \qquad \mathrm{bias}(\hat f) \approx c \Big(\frac{K}{n}\Big)^2.$$

Improving Cross-Validation

Average the validation error over several random splits and/or use a larger portion of the data for learning and a smaller one for validation.

L-fold Cross-Validation

Let $\mathcal{K} = \{k_1, \dots, k_C\}$ denote the set of candidate values for $k$ in k-nn, i.e. we want to select the best one (in terms of test error) of those values.

  1. Divide the data into $L$ disjoint parts (folds) of roughly equal size. Call the folds $S_1, \dots, S_L$.

  2. For $\ell = 1, \dots, L$ and $k \in \mathcal{K}$, compute the k-nn estimator based on the data
     $\{(x_i, Y_i) : i \notin S_\ell\}$ (the data not in fold $S_\ell$).
     Call the resulting estimator $\hat f_{\ell,k}$, $k \in \mathcal{K}$. For example, $\hat f_{1,k}$ is trained on the other folds and then applied to fold $S_1$.

  3. Compute the test error on the $\ell$-th fold:

     $$e_\ell(k) = \sum_{i \in S_\ell} (\hat f_{\ell,k}(x_i) - Y_i)^2.$$

  4. The selected $k$ is the value with the smallest resulting error $\sum_{\ell=1}^L e_\ell(k)$ over $k \in \mathcal{K}$, i.e. $k^* = \arg\min_{k \in \mathcal{K}} \sum_{\ell=1}^L e_\ell(k)$ (a code sketch follows below).
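A minimal sketch of this selection procedure, under the same illustrative assumptions as before (1-D predictor, simulated data; `cv_errors` and the fold assignment are made up for the example):

```python
import numpy as np

# L-fold cross-validation for choosing k in k-nn (illustrative sketch).
def knn_predict(x_train, y_train, x0, k):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

def cv_errors(x, y, candidates, L=5, seed=0):
    """Return errs[j, l] = e_l(k_j), the error of candidate k_j on fold l."""
    folds = np.random.default_rng(seed).permutation(len(x)) % L
    errs = np.zeros((len(candidates), L))
    for j, k in enumerate(candidates):
        for l in range(L):
            tr, va = folds != l, folds == l       # fit off-fold, test on fold
            pred = np.array([knn_predict(x[tr], y[tr], x0, k) for x0 in x[va]])
            errs[j, l] = np.sum((pred - y[va]) ** 2)
    return errs

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 120)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(120)

candidates = [1, 3, 5, 9, 15]
errs = cv_errors(x, y, candidates, L=5)
k_star = candidates[int(np.argmin(errs.sum(axis=1)))]  # argmin_k sum_l e_l(k)
print(k_star)
```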

Standard Error Rule

For each value of $k$, L-fold cross-validation returns estimated errors on each of the $L$ folds.
Call those errors $e_1(k), \dots, e_L(k)$.

Those errors $e_1(k), \dots, e_L(k)$ are random variables (random splits, random data, ...).

To reduce noise in model selection, and to select simpler models, one can use the one standard error rule.

  1. For each $k \in \mathcal{K}$ and each $\ell = 1, \dots, L$ compute $e_\ell(k)$ as in L-fold cross-validation.

  2. Find $k_0$ which minimizes

     $$L^{-1} \sum_{j=1}^L e_j(k).$$

  3. Compute $\widehat{\mathrm{sd}}(k_0)$, the sample standard deviation of $e_1(k_0), \dots, e_L(k_0)$.

  4. Select the largest $k$ among those $k$ which satisfy (see the sketch after this list)

     $$L^{-1} \sum_{j=1}^L e_j(k) \le L^{-1} \sum_{j=1}^L e_j(k_0) + \underbrace{L^{-1/2}\, \widehat{\mathrm{sd}}(k_0)}_{\text{estimate of sd of the sample mean}}.$$
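A self-contained sketch of the rule (the error matrix below is made up purely for illustration; in practice it would come from L-fold cross-validation as above):

```python
import numpy as np

# One-standard-error rule on a matrix errs[j, l] = e_l(k_j)
# (values invented for illustration).
candidates = [1, 3, 5, 9, 15]
errs = np.array([[9.1, 8.7, 9.5, 8.9, 9.2],    # k = 1
                 [7.2, 7.0, 7.6, 7.1, 7.3],    # k = 3
                 [6.8, 6.9, 7.4, 6.7, 7.0],    # k = 5  (smallest mean error)
                 [6.9, 7.1, 7.3, 6.8, 7.15],   # k = 9  (within one SE of k = 5)
                 [8.0, 8.3, 8.6, 7.9, 8.2]])   # k = 15

L = errs.shape[1]
means = errs.mean(axis=1)                      # L^{-1} sum_j e_j(k)
j0 = int(np.argmin(means))                     # index of k_0
threshold = means[j0] + errs[j0].std(ddof=1) / np.sqrt(L)
selected = max(k for k, m in zip(candidates, means) if m <= threshold)
print(selected)                                # largest k within one SE: 9
```

Selecting the largest eligible $k$ picks the simplest (smoothest) k-nn model whose error is statistically indistinguishable from the minimum.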

![[IMG_F06955603B1A-1.jpeg|384x233]]

Model Assessment

  1. Divide the data into a training set and a test set.
  2. Use the training set to fit the candidate models. This includes tuning parameters with cross-validation.
  3. Use the final models to predict the data in the test set and compute the test error.

Set aside 20%-30% of the data for the test set.
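Putting the pieces together, a compact end-to-end sketch of this recipe (same illustrative assumptions and helper as in the earlier sketches):

```python
import numpy as np

# Model assessment: hold out a test set, tune k by CV on the training
# part only, then report MSE_test once (illustrative sketch).
def knn_predict(x_train, y_train, x0, k):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)

n_te = len(x) // 4                              # ~25% set aside for testing
x_te, y_te, x_tr, y_tr = x[:n_te], y[:n_te], x[n_te:], y[n_te:]

folds = rng.permutation(len(x_tr)) % 5          # 5-fold CV within training data

def cv_error(k):
    err = 0.0
    for l in range(5):
        tr, va = folds != l, folds == l
        pred = np.array([knn_predict(x_tr[tr], y_tr[tr], x0, k) for x0 in x_tr[va]])
        err += np.sum((pred - y_tr[va]) ** 2)
    return err

k_star = min([1, 3, 5, 9, 15], key=cv_error)    # tuned without touching the test set
pred = np.array([knn_predict(x_tr, y_tr, x0, k_star) for x0 in x_te])
print(k_star, np.mean((pred - y_te) ** 2))      # final test error
```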

Drawbacks of K-nn Regression

Curse of dimensionality: the method does not work well when there are many predictors, especially
if some of those predictors are not important.

It is difficult to work with qualitative predictors (student status, color of a car, etc.).

The resulting model is not very interpretable.