K Nearest Neighbor Algorithm in Python – Python Code

1. What is KNN?

K Nearest Neighbors (KNN) is a non-parametric, instance-based (or “lazy”) learning method: it stores the training data and defers all computation until prediction time.

Used for both:

  • Classification: Predicts the class of a new data point by the majority class among its k nearest neighbors.
  • Regression: Predicts the value of a new data point by averaging the values of its k nearest neighbors.
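
Before turning to scikit-learn, it helps to see how little machinery the classification case actually needs. The helper below is a minimal from-scratch sketch (the function name knn_predict is just for illustration; everything later in this article uses scikit-learn's optimized implementation):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]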

2. KNN Implementation in Python – Steps in Using KNN

The first decision is choosing k:

  • A smaller k can lead to overfitting (the model becomes sensitive to noise).
  • A larger k can lead to underfitting (the model may overlook important details).

2.1: Creating Feature & Target Arrays

from sklearn.datasets import load_iris

irisData = load_iris()
X = irisData.data      # Features
y = irisData.target    # Target
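
If you want a quick sanity check on what was loaded, inspect the shapes and names:

print(X.shape, y.shape)           # (150, 4) (150,)
print(irisData.feature_names)     # the four sepal/petal measurements
print(irisData.target_names)      # ['setosa' 'versicolor' 'virginica']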

2.2: Splitting Data

Use train_test_split to split the data into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
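
The iris classes are balanced, so this split is fine as-is. For imbalanced data you can additionally pass stratify=y so that both splits keep the original class proportions:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)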

2.3: Building the Model

Instantiate the classifier/regressor with a chosen k (e.g., n_neighbors=7):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)  # Fit the model
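
Besides n_neighbors, KNeighborsClassifier exposes a few options that are often tuned alongside k, for example distance-weighted voting and the Manhattan metric (the defaults are weights='uniform' and p=2, i.e. Euclidean distance):

knn_weighted = KNeighborsClassifier(n_neighbors=7, weights='distance', p=1)
knn_weighted.fit(X_train, y_train)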

2.4: Making Predictions

predictions = knn.predict(X_test)
print(predictions)
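
For classification you can also retrieve the fraction of neighbor votes per class with predict_proba:

proba = knn.predict_proba(X_test[:5])   # one row per sample, one column per iris class
print(proba)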

2.5: Measuring Accuracy

Quickly measure accuracy with .score() or other metrics:

from sklearn.metrics import accuracy_score

acc = knn.score(X_test, y_test)
# or: acc = accuracy_score(y_test, predictions)
print("Accuracy:", acc)

3. Finding the Optimal K

A straightforward method is to test different values of k and compare accuracy on training vs. test sets:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load data
irisData = load_iris()
X = irisData.data
y = irisData.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

neighbors = np.arange(1, 9)  # Trying k from 1 to 8
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Plot results
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

By examining the resulting plot, you can see which k strikes the best balance between training and testing accuracy (i.e., avoids overfitting and underfitting).
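
A single train/test split can be noisy on a dataset as small as iris, so the curve above may shift if you change random_state. Repeating the sweep with cross-validation (here via cross_val_score) gives a more stable estimate for each k:

from sklearn.model_selection import cross_val_score

for k in neighbors:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")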

4. Industry Practice: Hyperparameter Tuning

In practice, rather than manually looping over possible k values, you often use hyperparameter tuning techniques such as Grid Search or Randomized Search with cross-validation. For example:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Parameter grid to explore different k values and weight schemes
param_grid = {
    'n_neighbors': np.arange(1, 21),
    'weights': ['uniform', 'distance']
}

# 5-fold cross-validation
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

This systematically tests various hyperparameters and reports which combination yields the best cross-validation performance.
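
Because refit=True by default, grid_search.best_estimator_ is already retrained on the full training set with the winning parameters, so you can evaluate it directly on the held-out test data:

best_knn = grid_search.best_estimator_
print("Test accuracy:", best_knn.score(X_test, y_test))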

5. Final Takeaways

  1. KNN is simple to understand and implement but can be computationally expensive for large datasets (it requires distance calculations to all training points for each prediction).
  2. Always consider feature scaling (e.g., using StandardScaler), since distance-based methods are sensitive to differences in feature scales; a short pipeline sketch follows this list.
  3. Choice of k: Balance between overfitting (small k) and underfitting (large k).
  4. Evaluate your model using appropriate metrics (accuracy, F1 score, etc.) and hyperparameter tuning techniques to get the most out of KNN.
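
On point 2: a common pattern is to chain StandardScaler and KNeighborsClassifier in a Pipeline, so the scaling is learned from the training data only (a minimal sketch, reusing the iris split from above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaled_knn = Pipeline([
    ('scaler', StandardScaler()),                 # zero mean, unit variance per feature
    ('knn', KNeighborsClassifier(n_neighbors=7))
])
scaled_knn.fit(X_train, y_train)
print("Scaled KNN accuracy:", scaled_knn.score(X_test, y_test))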

In short, the KNN algorithm is a powerful, beginner-friendly technique for classification and regression. By iterating over various k values (or employing a more systematic hyperparameter tuning approach), you can achieve an optimal balance in your model’s performance on both training and test data.

Satyam Sharma is a tech writer passionate about making complex topics in Machine Learning, Data Structures, and emerging technologies accessible and engaging. With a knack for breaking down intricate concepts, he crafts compelling articles, tutorials, and guides that bridge the gap between innovation and understanding. When he’s not writing, Satyam enjoys exploring the latest advancements in AI and contributing to open-source projects.