Scikit-Learn Tutorial: Machine Learning with Python
Complete guide to machine learning with scikit-learn. Learn classification, regression, clustering, model evaluation, and building ML pipelines.
Moshiour Rahman
What is Scikit-Learn?
Scikit-learn is Python’s most popular machine learning library. It provides simple and efficient tools for data analysis and modeling, built on NumPy, SciPy, and Matplotlib.
Installation
pip install scikit-learn
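If the install succeeded, you can confirm the version directly; any recent 1.x release should work for the examples below (a minimal check, nothing tutorial-specific assumed):
import sklearn
print(sklearn.__version__)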
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
Machine Learning Workflow
Data → Preprocessing → Train/Test Split → Model Training → Evaluation → Prediction
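As a quick illustration of that flow end to end, here is a minimal sketch using the built-in iris dataset and logistic regression; any classifier with fit/predict would slot in the same way:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                                # Data
X_train, X_test, y_train, y_test = train_test_split(             # Train/Test Split
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()                                         # Preprocessing
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)   # Model Training
print(accuracy_score(y_test, model.predict(X_test)))             # Evaluation / Prediction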
Loading Datasets
Built-in Datasets
from sklearn.datasets import load_iris, load_digits, make_classification
# Note: load_boston was removed in scikit-learn 1.2; fetch_california_housing is the usual regression replacement
# Classification dataset
iris = load_iris()
X, y = iris.data, iris.target
print(f"Features: {iris.feature_names}")
print(f"Target: {iris.target_names}")
print(f"Shape: {X.shape}")
# Generate synthetic data
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=15,
n_redundant=5,
random_state=42
)
From Pandas DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
Data Preprocessing
Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y # Maintain class distribution
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler: Mean=0, Std=1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use same transformation
# MinMaxScaler: Range [0, 1]
minmax_scaler = MinMaxScaler()
X_scaled = minmax_scaler.fit_transform(X)
# RobustScaler: Better for outliers
robust_scaler = RobustScaler()
X_scaled = robust_scaler.fit_transform(X)
Handling Missing Values
from sklearn.impute import SimpleImputer
# Replace with mean
imputer = SimpleImputer(strategy='mean') # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X)
# For categorical features
cat_imputer = SimpleImputer(strategy='most_frequent')
Encoding Categorical Variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Label encoding (for target labels; use OrdinalEncoder for ordinal features)
le = LabelEncoder()
y_encoded = le.fit_transform(['cat', 'dog', 'bird', 'cat'])
print(y_encoded)  # [1 2 0 1] - classes are sorted alphabetically: bird=0, cat=1, dog=2
# One-hot encoding (nominal)
ohe = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
X_encoded = ohe.fit_transform(X[['category_column']])
Classification
Logistic Regression
from sklearn.linear_model import LogisticRegression
# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Probability predictions
y_proba = model.predict_proba(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt
# Train
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
# Predict and evaluate
y_pred = dt.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Visualize tree
plt.figure(figsize=(20, 10))
# feature_names and class_names come from your dataset (e.g., iris.feature_names, iris.target_names)
tree.plot_tree(dt, feature_names=feature_names, class_names=class_names, filled=True)
plt.show()
# Feature importance
importance = pd.DataFrame({
'feature': feature_names,
'importance': dt.feature_importances_
}).sort_values('importance', ascending=False)
print(importance)
Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
n_estimators=100,
max_depth=10,
min_samples_split=5,
random_state=42,
n_jobs=-1 # Use all CPU cores
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Feature importance
importance = pd.DataFrame({
'feature': feature_names,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
Support Vector Machine (SVM)
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm.fit(X_train_scaled, y_train) # SVM benefits from scaling
y_pred = svm.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Regression
Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
# Coefficients
print(f"Coefficients: {lr.coef_}")
print(f"Intercept: {lr.intercept_}")
Ridge and Lasso Regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Ridge (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso (L1 regularization - feature selection)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# ElasticNet (L1 + L2)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
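To see the feature-selection effect of L1 regularization in practice, here is a small sketch on synthetic data (the dataset and alpha values are illustrative, not part of the tutorial's example) comparing how many coefficients each model keeps:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic regression data: 20 features, only 5 carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=20,
                                 n_informative=5, noise=10.0, random_state=42)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))  # typically close to 5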
Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
Clustering
K-Means
from sklearn.cluster import KMeans
# Find optimal k using elbow method
inertias = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
plt.plot(K, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Train with optimal k
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)
# Cluster centers
print(kmeans.cluster_centers_)
Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Agglomerative clustering
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
clusters = agg.fit_predict(X)
# Dendrogram
linkage_matrix = linkage(X, method='ward')
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Dendrogram')
plt.show()
DBSCAN
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X)
# Number of clusters (excluding noise labeled as -1)
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
print(f"Number of clusters: {n_clusters}")
Model Evaluation
Classification Metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report, roc_auc_score, roc_curve
)
# Basic metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted'):.4f}")
# Classification report
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
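For a more readable view, scikit-learn 1.0+ can also plot the confusion matrix directly from predictions:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()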
# ROC AUC (binary classification)
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"ROC AUC: {auc:.4f}")
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
Cross-Validation
from sklearn.model_selection import cross_val_score, cross_validate
# Simple cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
# Multiple metrics
results = cross_validate(
model, X, y, cv=5,
scoring=['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'],
return_train_score=True
)
for metric in ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']:
    print(f"{metric}: {results[f'test_{metric}'].mean():.4f}")
Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
rf, param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
# Use best model
best_model = grid_search.best_estimator_
Randomized Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(5, 50),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': uniform(0.1, 0.9)
}
random_search = RandomizedSearchCV(
rf, param_distributions,
n_iter=100,
cv=5,
scoring='accuracy',
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
Pipelines
Basic Pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Pipeline with Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['gender', 'occupation', 'city']
# Preprocessing pipelines
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(random_state=42))
])
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
Pipeline with Grid Search
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', SVC())
])
param_grid = {
'classifier__C': [0.1, 1, 10],
'classifier__kernel': ['rbf', 'linear'],
'classifier__gamma': ['scale', 'auto']
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
Saving and Loading Models
import joblib
# Save model
joblib.dump(model, 'model.pkl')
joblib.dump(pipeline, 'pipeline.pkl')
# Load model
loaded_model = joblib.load('model.pkl')
loaded_pipeline = joblib.load('pipeline.pkl')
# Make predictions with loaded model
predictions = loaded_model.predict(X_new)
Practical Example
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Load data
df = pd.read_csv('customer_churn.csv')
# Prepare features and target
X = df.drop(['customer_id', 'churn'], axis=1)
y = df['churn']
# Encode categorical variables (fit a fresh encoder per column)
for col in X.select_dtypes(include=['object']).columns:
    X[col] = LabelEncoder().fit_transform(X[col])
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Cross-validation (note: scaling before CV leaks fold statistics; a Pipeline avoids this)
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
# Evaluate on test set
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
# Feature importance
importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(importance.head(10))
Summary
| Task | Common Algorithms |
|---|---|
| Classification | Logistic Regression, Random Forest, SVM |
| Regression | Linear Regression, Ridge, Random Forest |
| Clustering | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | PCA, t-SNE |
Scikit-learn provides a consistent API across all algorithms, making it easy to experiment and build production-ready ML models.
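That consistency means swapping algorithms is usually a one-line change. A small sketch, reusing the train/test split from the practical example above (the classifier choices here are illustrative):
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Every estimator follows the same fit / predict / score contract
for clf in (LogisticRegression(max_iter=200), RandomForestClassifier(), SVC()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))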