K-Nearest Neighbors
1. K-Nearest Neighbors (KNN) Cheat Sheet
High-Level Overview:
K-Nearest Neighbors (KNN) is a supervised learning technique that uses the entire training dataset directly for making predictions. When a new data point comes in, KNN finds the \( k \) closest points in the training set (its "neighbors") and uses their labels (for classification) or values (for regression) to decide the prediction.
It’s simple: no complex training, just comparing new instances to existing examples. However, it can be slow on very large datasets and is sensitive to the choice of distance metric and to the scale of the features.
Key Concepts:
- Instance-Based: KNN stores all training examples and uses them at prediction time.
- Distance Metric: Usually Euclidean distance, but others can be used.
- Number of Neighbors (\(k\)): Controls the balance between being too sensitive (small \(k\)) and too general (large \(k\)).
- Classification & Regression:
- Classification: Majority vote among nearest neighbors.
- Regression: Average (or median) of the neighbors’ values.
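As a minimal sketch of the regression case (the sweetness-to-price numbers below are made up purely for illustration), scikit-learn's KNeighborsRegressor averages the target values of the \( k \) nearest neighbors:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data: one feature (sweetness) and a numeric target (price).
X = np.array([[0.2], [0.4], [0.5], [0.7], [0.9]])
y = np.array([1.0, 1.5, 1.7, 2.4, 3.0])

# With k=3, the prediction is the mean target of the 3 closest training points.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)
print(reg.predict([[0.6]]))  # neighbors are 0.5, 0.7, 0.4 -> mean of 1.7, 2.4, 1.5
For a query of 0.6, the three nearest training points are 0.5, 0.7, and 0.4, so the prediction is the mean of their targets, about 1.87.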
Distance Measure (Euclidean Distance):
Given two points \(\mathbf{x} = (x_1, x_2, \ldots, x_n)\) and \(\mathbf{y} = (y_1, y_2, \ldots, y_n)\): \[ d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} \]
This defines the “closeness” of points. If features are on different scales, normalization or standardization is often needed.
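As a quick sketch, the same distance can be computed with NumPy (the helper name is just illustrative):
import numpy as np

def euclidean_distance(x, y):
    """Straight-line (Euclidean) distance between two feature vectors."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

print(euclidean_distance([0.9, 0.7, 0.6], [0.3, 0.4, 0.5]))  # approx. 0.68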
How KNN Predicts (Classification):
- Pick \( k \), the number of neighbors.
- Calculate the distance from the new point to all training points.
- Find the \( k \)-closest points.
- Use a majority vote of their labels to determine the predicted class.
If \(N_k(\mathbf{x}_{new})\) is the set of \( k \)-nearest neighbors: \[ \hat{y}_{new} = \text{mode}(\{y_i : \mathbf{x}_i \in N_k(\mathbf{x}_{new})\}) \]
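Put together, a bare-bones version of this procedure (a sketch, not an optimized implementation; X_train and y_train are assumed to be NumPy arrays) looks like:
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]  # mode of their labels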
More Complex Example (Fruit Classification):
Suppose you have a dataset of fruits. Each fruit is described by features:
- Color Intensity (C) - e.g., 0 = pale, 1 = very vibrant color
- Sweetness (S) - e.g., from 0 (not sweet) to 1 (very sweet)
- Firmness (F) - e.g., from 0 (soft) to 1 (firm)
Example training set:
- Fruit A: C=0.9, S=0.7, F=0.6 → Apple
- Fruit B: C=0.3, S=0.4, F=0.5 → Orange
- Fruit C: C=0.8, S=0.8, F=0.7 → Apple
- Fruit D: C=0.4, S=0.3, F=0.4 → Orange
- Fruit E: C=0.85, S=0.6, F=0.65 → Apple
- Fruit F: C=0.25, S=0.45, F=0.55 → Orange
Now you get a new fruit with features: C=0.75, S=0.7, F=0.6, and you want to predict if it’s an Apple or Orange.
Steps:
- Pick \( k \), say \( k=3 \).
- Compute the distance to each training fruit:
- d(new, A), d(new, B), d(new, C), d(new, D), d(new, E), d(new, F)
- Find the 3 closest. For this data they are A, C, and E (all Apples); the distance sketch after this example works them out.
- Majority vote: 3 Apples vs. 0 Oranges → Predict Apple.
Because the new fruit is similar (in color, sweetness, and firmness) to known apples, KNN classifies it as an Apple.
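To make the distance step concrete, the sketch below reproduces the calculation for this example (values copied from the training set above):
import numpy as np

fruits = np.array([
    [0.9, 0.7, 0.6],     # A: Apple
    [0.3, 0.4, 0.5],     # B: Orange
    [0.8, 0.8, 0.7],     # C: Apple
    [0.4, 0.3, 0.4],     # D: Orange
    [0.85, 0.6, 0.65],   # E: Apple
    [0.25, 0.45, 0.55],  # F: Orange
])
new_fruit = np.array([0.75, 0.7, 0.6])

# Euclidean distance from the new fruit to each training fruit
dists = np.sqrt(((fruits - new_fruit) ** 2).sum(axis=1))
for name, d in zip("ABCDEF", dists):
    print(f"d(new, {name}) = {d:.2f}")
# A, C, and E come out closest (0.15 each), so the 3-NN vote is 3 Apples to 0 Oranges.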
Choosing \(k\):
- Small \( k \): Model is more sensitive to noise (overfitting).
- Large \( k \): Model becomes smoother and may lose distinct class boundaries (underfitting).
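In practice, \( k \) is often chosen by cross-validation. A sketch with scikit-learn's GridSearchCV, on synthetic data purely for illustration:
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data just for illustration; use your own labeled dataset here.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Cross-validation picks the k with the best held-out accuracy,
# balancing the overfitting/underfitting trade-off described above.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["n_neighbors"])
Odd values of \( k \) are a common choice for binary classification because they avoid tied votes.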
Differences from K-Means:
- Labels:
- KNN: Supervised, requires labeled data.
- K-Means: Unsupervised, no labels needed.
- Purpose:
- KNN: Predict labels (classification or regression output) for new points.
- K-Means: Find natural clusters in unlabeled data.
- Computation:
- KNN: Computationally cheap to set up, but prediction can be slow if the dataset is large.
- K-Means: More computation up front (to find the clusters), but quick to assign new points once the clusters are found.
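To make the contrast concrete, here is a minimal sketch calling both on the same fruit features (labels are dropped for K-Means, since it is unsupervised):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.9, 0.7, 0.6], [0.3, 0.4, 0.5], [0.8, 0.8, 0.7],
              [0.4, 0.3, 0.4], [0.85, 0.6, 0.65], [0.25, 0.45, 0.55]])
y = np.array(['Apple', 'Orange', 'Apple', 'Orange', 'Apple', 'Orange'])

# KNN: supervised. fit() mostly stores the data; the distance work happens at predict() time.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# K-Means: unsupervised. fit() does the work of finding 2 clusters from X alone;
# assigning a new point afterwards is just a distance check per centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_point = [[0.75, 0.7, 0.6]]
print(knn.predict(new_point))  # a class label, e.g. 'Apple'
print(km.predict(new_point))   # a cluster index (0 or 1), not a label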
Step-by-Step Summary:
- Collect Labeled Data: Gather examples with known labels.
- Define Distance & \(k\): Decide on a distance metric and choose \( k \).
- Predict New Instances: Find the \( k \)-closest examples and do a majority vote (classification) or average (regression).
- Evaluate & Tune: Adjust \( k \) and optionally preprocess data (scaling) to improve results.
ML Engineer Perspective:
- Use KNN as a baseline classifier or regressor because it’s simple and intuitive.
- Works well for small to medium datasets; can be slow on very large datasets.
- Distance-based: Ensure features are scaled and consider which metric best suits your data.
Code Example (Python with scikit-learn):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Example data: fruits with features [Color Intensity, Sweetness, Firmness]
X = np.array([
    [0.9, 0.7, 0.6],     # Apple
    [0.3, 0.4, 0.5],     # Orange
    [0.8, 0.8, 0.7],     # Apple
    [0.4, 0.3, 0.4],     # Orange
    [0.85, 0.6, 0.65],   # Apple
    [0.25, 0.45, 0.55],  # Orange
])
y = np.array(['Apple', 'Orange', 'Apple', 'Orange', 'Apple', 'Orange'])

# Initialize KNN with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predict label for a new fruit: C=0.75, S=0.7, F=0.6
new_fruit = np.array([[0.75, 0.7, 0.6]])
prediction = knn.predict(new_fruit)
print("Predicted label for new fruit:", prediction[0])
In this example, the new fruit is classified as 'Apple' because its closest neighbors are predominantly apples.
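If you want to verify which neighbors drove that vote, the fitted classifier can report them directly (this snippet continues from the code above and reuses knn, new_fruit, and y):
# Which training points were the 3 nearest, and how far away were they?
distances, indices = knn.kneighbors(new_fruit)
print("Neighbor indices:", indices[0])    # positions in X (here the three Apples)
print("Distances:", distances[0])
print("Neighbor labels:", y[indices[0]])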
Key Takeaways:
- KNN uses the entire training set at prediction time; there is no separate “training” step beyond storing the data.
- Works for classification (majority vote) and regression (average of neighbors).
- Parameter \( k \) and feature scaling are crucial.
- Simple and interpretable baseline method.