Hugging Face Transformers: Complete Python Tutorial
Master Hugging Face Transformers for NLP tasks. Learn text classification, named entity recognition, question answering, and fine-tuning models.
Moshiour Rahman
What is Hugging Face Transformers?
Hugging Face Transformers is a widely used open-source Python library for working with pre-trained transformer models. Through the Hugging Face Hub, it gives you access to thousands of models for NLP, computer vision, and audio tasks.
Installation
pip install transformers torch
pip install datasets evaluate accelerate
Pipelines (Easiest Way)
Text Classification
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product! It's amazing!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Batch processing
results = classifier([
    "This is fantastic!",
    "This is terrible.",
    "It's okay, nothing special."
])
Named Entity Recognition
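# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True argument)
ner = pipeline("ner", aggregation_strategy="simple")
text = "Apple Inc. was founded by Steve Jobs in California."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")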
# Apple Inc.: ORG (0.99)
# Steve Jobs: PER (0.99)
# California: LOC (0.99)
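To make the grouping step less of a black box, here is a minimal pure-Python sketch of how per-token B-/I- tags can be merged into whole entities. The `group_entities` helper and the hand-written tags are hypothetical illustrations, not the library's internal code, and a real model tags sub-word pieces rather than whole words.

```python
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (text, label) spans."""
    entities = []
    current_words, current_label = [], None
    for token, tag in zip(tokens, tags):
        label = tag[2:] if tag != "O" else None
        # A new group starts on a B- tag or whenever the label changes
        starts_new = tag.startswith("B-") or label != current_label
        if current_words and (label is None or starts_new):
            entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        if label is not None:
            current_words.append(token)
            current_label = label
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Hypothetical token-level predictions for the sentence used above
tokens = ["Apple", "Inc.", "was", "founded", "by", "Steve", "Jobs", "in", "California", "."]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "B-PER", "I-PER", "O", "B-LOC", "O"]
grouped = group_entities(tokens, tags)
print(grouped)
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PER'), ('California', 'LOC')]
```

The pipeline also averages or otherwise combines the token scores for each group, which this sketch leaves out.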
Question Answering
qa = pipeline("question-answering")
context = """
Python is a high-level programming language created by Guido van Rossum.
It was first released in 1991 and is known for its simple syntax.
"""
result = qa(
    question="Who created Python?",
    context=context
)
print(f"Answer: {result['answer']} (score: {result['score']:.2f})")
# Answer: Guido van Rossum (score: 0.98)
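Extractive QA models score every token as a possible answer start and a possible answer end, and the answer is the best-scoring valid span. The sketch below shows that selection step in plain Python. The `best_span` helper, the logits, and the `max_answer_len` default are made-up values for illustration, and the real pipeline handles extra details such as mapping tokens back to characters.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) for the most likely span with start <= end."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Hypothetical model outputs for a short context
tokens = ["created", "by", "Guido", "van", "Rossum", "."]
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.5, 0.4, 5.0, 0.2]
start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)  # Guido van Rossum
```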
Text Generation
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "Machine learning is",
    max_length=50,
    num_return_sequences=1,
    do_sample=True,  # make sampling explicit: temperature is ignored under greedy decoding
    temperature=0.7
)
print(result[0]['generated_text'])
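If the temperature setting feels abstract, the following sketch shows what it does to next-token probabilities: the logits are divided by the temperature before the softmax. The `softmax_with_temperature` helper and the three logits are hypothetical, and actual generation applies this over the model's full vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize with softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.7)
flat = softmax_with_temperature(logits, temperature=1.5)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Values below 1.0 concentrate probability on the top token, so output is more predictable. Values above 1.0 flatten the distribution, so output is more varied.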
Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience without being explicitly
programmed. It focuses on developing computer programs that can access
data and use it to learn for themselves. The process begins with
observations or data, such as examples, direct experience, or instruction.
"""
summary = summarizer(article, max_length=50, min_length=25)
print(summary[0]['summary_text'])
Translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you today?")
print(result[0]['translation_text'])
# Bonjour, comment allez-vous aujourd'hui?
Zero-Shot Classification
classifier = pipeline("zero-shot-classification")
text = "I need to book a flight to New York for next week"
labels = ["travel", "finance", "technology", "food"]
result = classifier(text, labels)
print(f"Label: {result['labels'][0]} (score: {result['scores'][0]:.2f})")
# Label: travel (score: 0.97)
Working with Models and Tokenizers
Loading Models
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# For classification
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)
Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, how are you?"
# Basic tokenization
tokens = tokenizer.tokenize(text)
print(tokens) # ['hello', ',', 'how', 'are', 'you', '?']
# Full encoding
encoding = tokenizer(
    text,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # PyTorch tensors
)
print(encoding.keys()) # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(encoding['input_ids'])
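BERT-style tokenizers split unknown words into sub-word pieces using a greedy longest-match-first rule, with `##` marking a piece that continues a word. Here is a rough sketch of that rule with a tiny, hypothetical vocabulary. The `wordpiece` function is an illustration only, and the real tokenizer also lowercases, splits on punctuation, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, assumed for this example only
vocab = {"token", "##ization", "##ize", "hello"}
print(wordpiece("tokenization", vocab))  # ['token', '##ization']
print(wordpiece("hello", vocab))         # ['hello']
print(wordpiece("xyz", vocab))           # ['[UNK]']
```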
Batch Tokenization
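texts = [
    "First sentence here.",
    "Second sentence is longer than the first one.",
    "Third."
]
encodings = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)
# padding=True pads to the longest sequence in the batch, not to max_length,
# so the second dimension is the token count of the longest text (well under 128 here)
print(encodings['input_ids'].shape)  # torch.Size([3, N])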
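A short sketch of what batch padding does under the hood may help: shorter sequences are filled with a pad ID, and the attention mask marks which positions are real tokens. The `pad_batch` helper and the token IDs are invented for illustration, and its truncation is simplified (a real tokenizer keeps the closing special token when it truncates).

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-ID lists to the longest one and build attention masks."""
    if max_length is not None:
        # Simplified truncation: just cut off anything past max_length
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Made-up token IDs for three texts of different lengths
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 7, 102]]
ids, mask = pad_batch(batch, max_length=128)
print(len(ids), len(ids[0]))  # 3 7
print(ids[2])                 # [101, 7, 102, 0, 0, 0, 0]
print(mask[2])                # [1, 1, 1, 0, 0, 0, 0]
```

The batch is padded to 7 tokens, the length of its longest member. The `max_length=128` limit only caps sequences that would otherwise be longer.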
Fine-Tuning Models
Dataset Preparation
from datasets import load_dataset, Dataset
import pandas as pd
# Load from Hugging Face Hub
dataset = load_dataset("imdb")
print(dataset)
# Or from pandas
df = pd.DataFrame({
    'text': ['Great product!', 'Terrible experience', 'It was okay'],
    'label': [1, 0, 1]
})
dataset = Dataset.from_pandas(df)
Tokenize Dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
tokenized_dataset = dataset.map(tokenize_function, batched=True)
Training with Trainer API
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
from datasets import DatasetDict, load_dataset
import numpy as np
from evaluate import load
# Load dataset
raw = load_dataset("imdb")
# load_dataset returns a DatasetDict, which has no select() method,
# so take a smaller shuffled subset of each split individually
dataset = DatasetDict({
    "train": raw["train"].shuffle(seed=42).select(range(1000)),
    "test": raw["test"].shuffle(seed=42).select(range(1000)),
})
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized = dataset.map(tokenize, batched=True)
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
# Metrics
accuracy = load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
# Training arguments
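training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,  # warm up over the first 10% of steps; a fixed 500 steps would outlast a run on 1,000 examples
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True
)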
# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics
)
# Train
trainer.train()
# Evaluate
results = trainer.evaluate()
print(results)
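The metric function in the training script receives raw logits, not class labels, which is why it takes an argmax first. Here is the same logic in plain Python so each step is visible. The `accuracy_from_logits` helper and the numbers are hypothetical stand-ins for what the `Trainer` passes in during evaluation.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example, then compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four examples and two classes
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
labels = [0, 1, 0, 0]
metrics = accuracy_from_logits(logits, labels)
print(metrics)  # {'accuracy': 0.75}
```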
Save and Load Fine-Tuned Model
# Save
trainer.save_model("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")
# Load
from transformers import pipeline
classifier = pipeline(
    "sentiment-analysis",
    model="./my-fine-tuned-model",
    tokenizer="./my-fine-tuned-model"
)
result = classifier("This movie was great!")
Embeddings
Get Text Embeddings
from transformers import AutoTokenizer, AutoModel
import torch
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling
    attention_mask = inputs['attention_mask']
    token_embeddings = outputs.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embedding = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return embedding[0].numpy()
# Get embeddings
emb1 = get_embedding("Machine learning is fascinating")
emb2 = get_embedding("AI and ML are interesting")
# Calculate similarity
from numpy import dot
from numpy.linalg import norm
similarity = dot(emb1, emb2) / (norm(emb1) * norm(emb2))
print(f"Similarity: {similarity:.4f}")
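The mean-pooling and similarity steps above are easier to follow on small numbers. This sketch repeats them in plain Python on two-dimensional toy vectors. The `mean_pool` and `cosine_similarity` helpers and all values are assumptions for illustration, whereas real embeddings from this model have several hundred dimensions.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    return [t / max(count, 1e-9) for t in total]


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Three token vectors; the last one is padding and must not affect the result
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # [2.0, 3.0]
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Without the mask, the padding vector would pull the average toward `[9.0, 9.0]`, which is why the attention mask is part of the pooling step.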
Image Classification
from transformers import pipeline
from PIL import Image
import requests
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
# Load image
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)
results = classifier(image)
for result in results[:3]:
    print(f"{result['label']}: {result['score']:.2f}")
Text-to-Image (Stable Diffusion)
# Requires the separate diffusers package (pip install diffusers) and a CUDA-capable GPU
from diffusers import StableDiffusionPipeline
import torch
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
prompt = "A beautiful sunset over mountains, digital art"
image = pipe(prompt).images[0]
image.save("generated.png")
Using with GPU
import torch
from transformers import pipeline
# Check GPU availability
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")
# Use GPU for inference
classifier = pipeline("sentiment-analysis", device=device)
# Or move model to GPU manually
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is present
Summary
| Task | Pipeline Name | Example Model |
|---|---|---|
| Sentiment | sentiment-analysis | distilbert-base-uncased-finetuned-sst-2-english |
| NER | ner | dslim/bert-base-NER |
| QA | question-answering | distilbert-base-cased-distilled-squad |
| Summarization | summarization | facebook/bart-large-cnn |
| Translation | translation | Helsinki-NLP/opus-mt-* |
| Generation | text-generation | gpt2 |
| Zero-shot classification | zero-shot-classification | facebook/bart-large-mnli |
Hugging Face Transformers makes it straightforward to apply pre-trained models to common NLP tasks, from one-line pipeline inference to fine-tuning on your own data.
Moshiour Rahman
Software Architect & AI Engineer
Enterprise software architect with deep expertise in financial systems, distributed architecture, and AI-powered applications. Building large-scale systems at Fortune 500 companies. Specializing in LLM orchestration, multi-agent systems, and cloud-native solutions. I share battle-tested patterns from real enterprise projects.