Recently, I implemented the SimCLR framework (Chen et al., 2020) for contrastive learning using the Food-101 dataset. This post summarizes the project structure, the implementation details from the actual code, and the results I observed. The full implementation is on Gist.

Background: What is SimCLR?

SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) is a self-supervised learning method that learns image representations without requiring labeled data. Instead, it pulls together different augmented views of the same image (positive pairs) and pushes apart views of different images (negative pairs) using a contrastive loss (NT-Xent).

Key components:

  • Data augmentation to create positive pairs
  • A base encoder network (e.g., ResNet-50)
  • A projection head (MLP) after the encoder
  • NT-Xent contrastive loss

After pretraining, the encoder is used to extract features, and a linear classifier can be trained on top using labeled data (linear evaluation).

Dataset: Food-101

Food-101 consists of 101 food categories, each with 1,000 images, for a total of 101,000 images. I chose it for its manageable size and diversity, which makes it a decent testbed for representation learning. I used the raw dataset without manual cleaning.

Implementation

Data Augmentation

The key insight in SimCLR is that the choice of augmentation matters a lot. The model learns to be invariant to exactly those transformations, so augmentations that are too weak make the contrastive task trivial and yield poor representations.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torchvision.models import resnet50

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

image_size = 256

blur_kernel = int(0.1 * image_size)
if blur_kernel % 2 == 0:
    blur_kernel += 1

simclr_transform = transforms.Compose([
    transforms.RandomResizedCrop(image_size, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)  # brightness, contrast, saturation, hue
    ], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=blur_kernel),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

The Gaussian blur kernel size is computed as 10% of the image size and forced to be odd (required by PyTorch). The paper applies blur with p=0.5 via RandomApply; here it is applied unconditionally, which is a minor deviation.
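The kernel-size rule is plain arithmetic and can be checked independently of PyTorch (a quick sketch; the helper name is mine, not from the original code):

```python
# Kernel-size rule used in the transform above: 10% of the image size, forced odd.
def blur_kernel_size(image_size: int) -> int:
    k = int(0.1 * image_size)
    if k % 2 == 0:
        k += 1
    return k

print(blur_kernel_size(256))  # 25 (already odd)
print(blur_kernel_size(224))  # 23 (22 rounded up to odd)
print(blur_kernel_size(96))   # 9
```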

Dataset Wrapper

SimCLR requires two independently augmented views of the same image. The cleanest way to implement this is a thin wrapper around the base dataset that applies the stochastic transform twice:

class SimCLRDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, transform):
        self.dataset = dataset
        self.transform = transform

    def __getitem__(self, index):
        x, _ = self.dataset[index]  # discard labels during pretraining
        x1 = self.transform(x)      # first random augmentation
        x2 = self.transform(x)      # second random augmentation (different result)
        return x1, x2

    def __len__(self):
        return len(self.dataset)

base_dataset = ImageFolder(data_path)
dataset = SimCLRDataset(base_dataset, simclr_transform)
train_dataloader = DataLoader(dataset, batch_size=BS, shuffle=True)

Because transform contains stochastic operations (random crop, random flip, random color jitter), calling it twice on the same image produces two different views. The labels are discarded; during pretraining, the only supervision signal comes from the augmentation pairing.
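The two-view pattern is easy to see with a toy stand-in for the dataset and a stochastic "transform" (plain Python, no torch; all names here are illustrative, not from the original code):

```python
import random

class TwoViewDataset:
    """Minimal stand-in for SimCLRDataset: applies a stochastic transform twice."""
    def __init__(self, dataset, transform):
        self.dataset = dataset
        self.transform = transform

    def __getitem__(self, index):
        x = self.dataset[index]
        return self.transform(x), self.transform(x)  # two independent random draws

    def __len__(self):
        return len(self.dataset)

# A "transform" that adds random noise, standing in for random crop/jitter.
noisy = lambda x: x + random.random()

ds = TwoViewDataset([0.0, 1.0, 2.0], noisy)
v1, v2 = ds[0]
print(v1 != v2)  # almost surely True: each call draws fresh randomness
```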

Model Architecture

ResNet-50 is used as the base encoder. The default fully connected layer is replaced with a 2-layer MLP projection head:

hidden_dim = 128

model = resnet50(weights=None)
model.fc = nn.Sequential(
    nn.Linear(2048, 4 * hidden_dim),   # 2048 → 512
    nn.ReLU(inplace=True),
    nn.Linear(4 * hidden_dim, hidden_dim)  # 512 → 128
)

The intermediate dimension here is 4 * 128 = 512. The original paper uses 2048 as the intermediate dimension, but this smaller version reduces memory usage at the cost of some representational capacity.

After pretraining, only the ResNet-50 trunk (before model.fc) is used for feature extraction. The projection head is discarded, which is a key finding of the SimCLR paper: representations are better before the projection head than after.

NT-Xent Loss

The Normalized Temperature-scaled Cross Entropy loss is the core of SimCLR. Given a batch of N images, we have 2N augmented views. For each view, the positive sample is its counterpart augmentation, and all other 2N - 2 views in the batch are treated as negatives.
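Written out, the per-pair loss from the paper is:

```latex
\ell_{i,j} = -\log \frac{\exp\bigl(\operatorname{sim}(z_i, z_j)/\tau\bigr)}
                        {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\bigl(\operatorname{sim}(z_i, z_k)/\tau\bigr)}
```

where sim is cosine similarity and τ is the temperature; the full NT-Xent loss averages ℓ over all 2N (ordered) positive pairs in the batch.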

def nt_xent_loss(z1, z2, temperature=0.5):
    """
    z1, z2: (batch_size, dim) -- output embeddings from two augmented views
    Returns: scalar loss
    """
    batch_size = z1.shape[0]

    # Concatenate both views and L2-normalize
    z = torch.cat([z1, z2], dim=0)       # (2*BS, D)
    z = F.normalize(z, dim=1)

    # All-pairs cosine similarity, scaled by temperature
    sim = torch.matmul(z, z.T) / temperature  # (2*BS, 2*BS)

    # Mask out self-similarity (diagonal)
    mask = torch.eye(2 * batch_size, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, -float('inf'))

    # Positive pair: z1[i] pairs with z2[i], i.e., index i and i + batch_size
    targets = torch.arange(batch_size, device=z.device)
    targets = torch.cat([targets + batch_size, targets])

    return F.cross_entropy(sim, targets)

A few things to note about this implementation:

  • L2 normalization happens inside the loss function, so the model outputs raw embeddings. After normalization, the dot products are cosine similarities in [-1, 1]; the temperature scaling then maps the logits to [-1/τ, 1/τ].
  • The target construction is the key: for the first half of the batch (views from z1), the positives are at positions [batch_size, ..., 2*batch_size-1]. For the second half (z2), positives are at [0, ..., batch_size-1]. This is achieved by cat([targets + batch_size, targets]).
  • Self-similarity (a view paired with itself) is masked to -inf so it contributes zero to the softmax denominator.
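To sanity-check the target construction and masking, here is the same loss re-derived in NumPy on a tiny hand-checkable batch (B = 2, orthogonal unit embeddings, identical views). With τ = 0.5, each row's loss works out to log(1 + 2e⁻²) ≈ 0.2395. This is an independent verification sketch, not the training code:

```python
import numpy as np

def nt_xent_np(z1, z2, temperature=0.5):
    """NumPy re-implementation of NT-Xent, for verification only."""
    b = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / temperature                        # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                     # mask self-similarity
    targets = np.concatenate([np.arange(b) + b, np.arange(b)])
    # Row-wise log-softmax cross-entropy (exp(-inf) contributes 0 to the sum)
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(2 * b), targets].mean()

z1 = np.array([[1.0, 0.0], [0.0, 1.0]])
z2 = z1.copy()  # identical views: positives align perfectly
loss = nt_xent_np(z1, z2, temperature=0.5)
print(round(float(loss), 4))  # 0.2395, matches log(1 + 2 * exp(-2))
```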

Training Setup

BS = 64
tau = 0.5
epoch = 100

optimizer = optim.SGD(model.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epoch)

The original SimCLR paper uses LARS (Layer-wise Adaptive Rate Scaling), which is designed for very large batch sizes (4096+). For smaller batches like 64, standard SGD with cosine annealing is a reasonable substitute and avoids the need for an external LARS implementation.
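For reference, the schedule CosineAnnealingLR applies (with the default eta_min = 0) has the closed form lr_e = base_lr · (1 + cos(π · e / T_max)) / 2, which can be checked directly without torch:

```python
import math

def cosine_lr(base_lr, e, t_max):
    """Closed form of cosine annealing with eta_min = 0."""
    return base_lr * (1 + math.cos(math.pi * e / t_max)) / 2

base_lr, t_max = 0.03, 100
print(cosine_lr(base_lr, 0, t_max))    # 0.03   (start of training)
print(cosine_lr(base_lr, 50, t_max))   # ≈ 0.015 (halfway)
print(cosine_lr(base_lr, 100, t_max))  # ≈ 0.0   (fully annealed)
```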

Training loop:

writer = SummaryWriter(log_dir='runs/simclr_food101')

for e in range(epoch):
    running_loss = 0.0
    for i, (aug_images1, aug_images2) in enumerate(train_dataloader):
        aug_images1 = aug_images1.to(device)
        aug_images2 = aug_images2.to(device)

        z1 = model(aug_images1)
        z2 = model(aug_images2)

        loss = nt_xent_loss(z1, z2, temperature=tau)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 10 == 9:
            avg_loss = running_loss / 10
            print(f"[Epoch {e+1}, Batch {i+1}] loss: {avg_loss:.4f}")
            writer.add_scalar('training loss', avg_loss, e * len(train_dataloader) + i)
            running_loss = 0.0

    scheduler.step()

torch.save(model.state_dict(), 'simclr_food101.pth')
writer.close()

Linear Evaluation

After pretraining, the encoder is frozen and a linear classifier is trained on top using Food-101 labels (75,750 train / 25,250 test):

# Extract the trunk (discard projection head)
encoder = nn.Sequential(*list(model.children())[:-1])  # up to and including avgpool
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

# Train a linear head on the frozen features.
# Note: the trunk outputs (B, 2048, 1, 1), so flatten before the linear layer:
#   feats = encoder(images).flatten(start_dim=1)  # shape (B, 2048)
linear_head = nn.Linear(2048, 101).to(device)

Results

SimCLR performed well on Food-101 without any labels during pretraining:

  • Top-1 accuracy: ~72%
  • Top-5 accuracy: ~91%

These are competitive with supervised baselines trained from scratch on this dataset. t-SNE visualizations of the learned embeddings showed clear clustering by food category, even though the model had never seen labels during pretraining.

Implementation Notes

Batch size and negatives. SimCLR is known to be sensitive to batch size because a larger batch provides more negatives per update. With BS=64, you only have 126 negatives per sample; the paper uses 4096+ (8190 negatives). Performance improves noticeably with larger batches if you have the GPU memory.

Temperature. The paper ablates several values of τ and uses 0.1 for its main ImageNet experiments; 0.5 (used here) is a common default at smaller scales. Lower temperatures sharpen the softmax and up-weight hard negatives, but the best value needs tuning for a given batch size and dataset.

Projection head vs. encoder output. The paper’s key finding is that the projection head improves downstream linear evaluation. Intuitively, the projection head “absorbs” the augmentation-specific information, leaving the encoder output cleaner for general-purpose features. Always evaluate at the encoder output, not the projection head output.

LARS vs. SGD. The paper’s results use LARS, which adapts the learning rate per layer based on the ratio of weight norm to gradient norm. For large batches (1024+), LARS stabilizes training significantly. For smaller batches with SGD, the standard linear scaling rule (lr = 0.03 * BS / 256) and cosine annealing work fine.
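The scaling rule above is a one-liner, shown here for a few batch sizes (0.03 is the base rate per 256 samples, following the convention cited in the text; the helper name is mine):

```python
def scaled_lr(batch_size, base_lr=0.03, base_batch=256):
    """Linear learning-rate scaling: lr grows proportionally with batch size."""
    return base_lr * batch_size / base_batch

print(scaled_lr(64))    # 0.0075
print(scaled_lr(256))   # 0.03
print(scaled_lr(4096))  # 0.48
```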

pretrained=False is deprecated. Recent torchvision versions use the weights argument instead; weights=None (as in the model definition above) gives an untrained network.

Takeaways

  • SimCLR is straightforward to implement and works well even on noisy datasets like Food-101.
  • Augmentation is the most important design choice; removing color jitter or Gaussian blur hurts performance meaningfully.
  • The projection head matters for downstream performance but should be discarded after pretraining.
  • Batch size is a key hyperparameter; more negatives generally help.

Next Steps

  • Experiment with other self-supervised methods (BYOL, VICReg, MoCo v3)
  • Use SimCLR pretraining before fine-tuning on smaller food-related datasets
  • Try a ViT backbone instead of ResNet-50