Training a Model on Multiple GPUs with Data Parallelism
This article is divided into two parts; they are:
• Data Parallelism
• Distributed Data Parallelism

If you have multiple GPUs, you can combine them to work like a single, more capable device: each GPU processes a slice of every batch, so you can train with larger batches and finish training faster.
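As a minimal sketch of the idea (the toy model and tensor sizes here are illustrative, not taken from the article), `torch.nn.DataParallel` replicates a model across every visible GPU and splits each input batch among them:

```python
import torch
import torch.nn as nn

# A toy model; any nn.Module works the same way
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

if torch.cuda.device_count() > 1:
    # Replicate the model on all visible GPUs; each replica
    # receives a slice of the input batch
    model = nn.DataParallel(model)
model = model.to("cuda")

x = torch.randn(64, 128, device="cuda")  # the batch is split across GPUs
y = model(x)                             # outputs are gathered back on GPU 0
print(y.shape)  # torch.Size([64, 10])
```

For serious multi-GPU training, `DistributedDataParallel` is generally preferred over `DataParallel`, which is what the second part of the article covers.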
Train a Model Faster with torch.compile and Gradient Accumulation
This article is divided into two parts; they are:
• Using `torch.compile`
• Gradient Accumulation
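A minimal sketch of both techniques together, assuming PyTorch 2.x (the model, learning rate, and accumulation factor are illustrative):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
model = torch.compile(model)  # PyTorch 2.x: JIT-compile into optimized kernels

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # gradients from 4 small batches act as one large batch

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(16, 128, device=device)
    target = torch.randint(0, 10, (16,), device=device)
    loss = loss_fn(model(x), target)
    # Scale the loss so the accumulated gradient is an average, not a sum
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # update weights once per accum_steps batches
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` makes the accumulated gradient equivalent to that of a single batch four times larger, without the memory cost of holding that batch at once.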
Training a Model with Limited Memory using Mixed Precision and Gradient Checkpointing
This article is divided into three parts; they are:
• Floating-point Numbers
• Automatic Mixed Precision Training
• Gradient Checkpointing

Let’s get started! The default data type in PyTorch is the IEEE 754 32-bit floating-point format, also known as single precision.
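A compact sketch combining the two techniques, assuming a CUDA device (the layer sizes are illustrative): `autocast` runs the forward pass in half precision while `GradScaler` guards against gradient underflow, and `checkpoint` recomputes activations during the backward pass instead of storing them.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda"
block1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(device)
block2 = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(
    list(block1.parameters()) + list(block2.parameters()), lr=0.01
)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid fp16 underflow

x = torch.randn(32, 512, device=device)
target = torch.randint(0, 10, (32,), device=device)

with torch.autocast(device_type="cuda", dtype=torch.float16):
    # checkpoint() discards block1's activations in the forward pass
    # and recomputes them during backward, trading compute for memory
    h = checkpoint(block1, x, use_reentrant=False)
    loss = nn.functional.cross_entropy(block2(h), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```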
Practical Agentic Coding with Google Jules
If you have an interest in agentic coding, there’s a pretty good chance you’ve heard of Google Jules.
Evaluating Perplexity on Language Models
This article is divided into two parts; they are:
• What Is Perplexity and How to Compute It
• Evaluate the Perplexity of a Language Model with HellaSwag Dataset

Perplexity is a measure of how well a language model predicts a sample of text.
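Concretely, perplexity is the exponential of the average per-token cross-entropy. A minimal sketch, using the small GPT-2 checkpoint from Hugging Face `transformers` purely for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels=input_ids, the model returns the average per-token
    # cross-entropy (negative log-likelihood) as outputs.loss
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)  # perplexity = exp(average NLL)
print(f"Perplexity: {perplexity.item():.2f}")
```

A lower perplexity means the model assigns higher probability to the observed tokens.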
The Journey of a Token: What Really Happens Inside a Transformer
Large language models (LLMs) are based on the transformer architecture, a complex deep neural network whose input is a sequence of token embeddings.
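As a small illustration of that input (a sketch with made-up vocabulary size and token IDs), an embedding layer maps each integer token ID to a learned vector, and this sequence of vectors is what the transformer layers consume:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768  # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)

# A "sentence" as a sequence of token IDs, as a tokenizer would produce
token_ids = torch.tensor([[101, 2023, 2003, 1037, 7953, 102]])

embeddings = embedding(token_ids)  # one d_model-dim vector per token
print(embeddings.shape)  # torch.Size([1, 6, 768])
```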
Pretrain a BERT Model from Scratch
This article is divided into three parts; they are:
• Creating a BERT Model the Easy Way
• Creating a BERT Model from Scratch with PyTorch
• Pre-training the BERT Model

If your goal is to create a BERT model so that you can train it on your own data, using the Hugging Face `transformers` […]
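For a flavor of the easy way, here is a sketch using `BertConfig` and `BertForMaskedLM` from `transformers`; the scaled-down hyperparameters are illustrative, not the article’s:

```python
from transformers import BertConfig, BertForMaskedLM

# A small, hypothetical configuration; the defaults reproduce BERT-base
config = BertConfig(
    vocab_size=30_522,
    hidden_size=256,        # smaller than BERT-base's 768, for quick experiments
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)

# Randomly initialized weights, ready for pretraining on your own data
model = BertForMaskedLM(config)
print(sum(p.numel() for p in model.parameters()))  # parameter count
```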
K-Means Cluster Evaluation with Silhouette Analysis
Clustering models in machine learning must be assessed by how well they separate data into meaningful groups with distinctive characteristics.
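As a quick sketch of silhouette analysis on synthetic data (the dataset and the range of k are made up for illustration), scikit-learn’s `silhouette_score` averages each point’s silhouette coefficient, with values near 1 indicating tight, well-separated clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known cluster structure
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Try several values of k and compare the average silhouette coefficient
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```

On data like this, the score should peak near the true number of clusters, making silhouette analysis a practical way to choose k.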
The Complete Guide to Docker for Machine Learning Engineers
Machine learning models often behave differently across environments.
Preparing Data for BERT Training
This article is divided into four parts; they are:
• Preparing Documents
• Creating Sentence Pairs from Documents
• Masking Tokens
• Saving the Training Data for Reuse

Unlike decoder-only models, BERT’s pretraining is more complex.
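As an illustration of the masking step alone (a simplified sketch: the token IDs are made up and special tokens are not excluded), BERT-style masked language modeling selects about 15% of positions, replaces 80% of those with the `[MASK]` token and 10% with a random token, and leaves the remaining 10% unchanged:

```python
import torch

def mask_tokens(input_ids, mask_token_id=103, vocab_size=30_522, mlm_prob=0.15):
    labels = input_ids.clone()
    # Choose which positions the model must predict
    mask = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~mask] = -100  # ignore unmasked positions in the loss

    # 80% of chosen positions -> [MASK]
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & mask
    input_ids[replaced] = mask_token_id

    # Half of the rest (10% overall) -> a random token; the remainder stays as-is
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & mask & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    return input_ids, labels

ids = torch.randint(1000, 2000, (1, 12))
masked_ids, labels = mask_tokens(ids.clone())
```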