Fashionpedia is a fine-grained fashion image dataset released by Google in 2020. What sets it apart from other fashion datasets is the combination of two annotation layers: pixel-level segmentation masks for each garment item and fine-grained attribute labels (material, texture, color, etc.) attached to those masks. I’ve been using it as an evaluation benchmark for outfit compatibility retrieval, and this note documents the format, the quirks, and how to work with it in Python.

What Is Fashionpedia?

Fashionpedia contains around 48,000 images (train: ~45,000, val: 1,158) from the iMaterialist Fashion challenge, annotated with:

  • Segmentation masks (polygon or RLE) for each garment item in the image
  • Category labels: 46 categories ranging from coarse garments (dress, jacket, pants) to fine-grained parts (neckline, sleeve, zipper, pocket)
  • Attribute labels: 294 attributes per annotation (e.g., solid, floral, long sleeve, v-neck)

The key design choice is hierarchical: a single outfit image gets multiple overlapping annotations, one per visible item or part. A typical image in the val set has 7–8 annotations; the full range is 1 to 27.


COCO Format Primer

Fashionpedia follows the MS COCO annotation format, which is the de facto standard for instance segmentation datasets. If you’re new to COCO, here’s the structure.

A COCO annotation file is a single JSON with these top-level keys:

{
  "info":        { ... },          # dataset metadata
  "licenses":    [ ... ],          # image license info
  "images":      [ ... ],          # list of image metadata objects
  "categories":  [ ... ],          # label taxonomy
  "annotations": [ ... ]           # the actual annotation records
}

Fashionpedia adds one more key:

  "attributes":  [ ... ]           # 294 fine-grained attribute definitions

Images

Each entry in images looks like:

{
  "id": 123456,
  "file_name": "123456.jpg",
  "width": 800,
  "height": 1200,
  "license": 1
}

The id field is what you use to look up annotations for a given image.
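If you want to skip pycocotools, that lookup is a single dictionary build. A minimal sketch over a made-up two-annotation payload (field values here are illustrative, not from the real dataset):

```python
from collections import defaultdict

# Toy dict with the same top-level shape as Fashionpedia's JSON
dataset = {
    "images": [{"id": 101, "file_name": "101.jpg", "width": 800, "height": 1200}],
    "annotations": [
        {"id": 1, "image_id": 101, "category_id": 5},
        {"id": 2, "image_id": 101, "category_id": 23},
    ],
}

# Index annotations by image id so per-image lookup is O(1)
anns_by_image = defaultdict(list)
for ann in dataset["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

print(len(anns_by_image[101]))  # 2
```

pycocotools builds this kind of index for you (see below), but the hand-rolled version is handy for quick scripts over the raw JSON.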

Annotations

Each entry in annotations is one annotated instance:

{
  "id": 999,
  "image_id": 123456,
  "category_id": 5,
  "bbox": [120.0, 45.0, 300.0, 400.0],
  "segmentation": [[120.0, 45.0, 280.0, 45.0, ...]],
  "area": 95000.0,
  "iscrowd": 0,
  "attribute_ids": [12, 47, 103]
}

A few things to note:

  • bbox is [x, y, w, h] — the top-left corner plus width and height, not [x1, y1, x2, y2]. Easy to mix up.
  • segmentation is either a list of polygons or an RLE dict (more on this below).
  • attribute_ids is Fashionpedia-specific: a list of attribute IDs from the attributes array.

Categories

Fashionpedia has 46 categories, split into two groups (a representative selection of each is listed below):

Coarse garment categories (what you’d call an outfit item):

  • shirt, blouse
  • top, t-shirt, sweatshirt
  • sweater
  • cardigan
  • jacket
  • vest
  • pants
  • shorts
  • skirt
  • coat
  • dress (top + bottom combined)
  • jumpsuit
  • cape
  • shoe
  • bag, wallet
  • hat
  • glasses
  • glove
  • belt
  • watch

Fine-grained part categories (sub-regions of garments):

  • sleeve (the most common part annotation in the val set)
  • neckline
  • collar
  • lapel
  • pocket
  • zipper
  • buckle
  • hood
  • epaulette
  • bow

The part categories are heavily represented. In the val set, shoe (1566), sleeve (1442), and neckline (929) are the top three by annotation count — and two of those are parts, not garments.

This matters a lot for downstream use. If you’re building outfit compatibility models, you probably don’t want sleeve as a unit of comparison. Filtering to coarse garment categories (top/bottom/shoes) is a common preprocessing step.


Setup

pip install pycocotools matplotlib opencv-python

Download Fashionpedia from the official site or the Kaggle mirror. You’ll get:

fashionpedia/
  instances_attributes_train2020.json
  instances_attributes_val2020.json
  train/   (images)
  test/    (images, no public annotations)

Loading with pycocotools

pycocotools is the official COCO API. It handles the index-building so you don’t have to manually build lookup tables.

from pycocotools.coco import COCO

data_root = "/path/to/fashionpedia/"
coco = COCO(data_root + "instances_attributes_val2020.json")

The constructor prints something like loading annotations into memory... and builds an internal index. Key methods:

# Get all image IDs
img_ids = coco.getImgIds()

# Load image metadata
img_id = img_ids[0]
img_info = coco.loadImgs(img_id)[0]
# → {"id": 123456, "file_name": "123456.jpg", "width": 800, ...}

# Get annotation IDs for a given image
ann_ids = coco.getAnnIds(imgIds=img_id)

# Load annotations
anns = coco.loadAnns(ann_ids)

# Get category info
cat = coco.loadCats(anns[0]["category_id"])[0]
# → {"id": 5, "name": "shoe", "supercategory": "apparel"}

You can also filter by category:

# Get all annotation IDs for a specific category
shoe_cat_id = [c["id"] for c in coco.dataset["categories"] if c["name"] == "shoe"][0]
ann_ids = coco.getAnnIds(catIds=[shoe_cat_id])

Segmentation: Polygons and RLE

This is the part that trips people up most often. The segmentation field can be one of two formats depending on the annotator.

Polygon Format

The most common case in Fashionpedia. The value is a list of lists:

seg = ann["segmentation"]
# Example: [[x1, y1, x2, y2, x3, y3, ...], [...]]

Each inner list is a closed polygon, given as a flat sequence of [x, y, x, y, ...] coordinates. One annotation can have multiple polygons (for non-contiguous regions — think paired shoes).

To plot polygon outlines on a matplotlib axis:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for poly in seg:
    pts = np.array(poly).reshape(-1, 2)
    ax.plot(pts[:, 0], pts[:, 1], linewidth=2)

RLE Format

Run-Length Encoding is used for more complex shapes. The value is a dict (not a list):

seg = ann["segmentation"]
# Example: {"counts": "...", "size": [height, width]}

To decode to a binary mask, use pycocotools.mask:

from pycocotools import mask as maskUtils

if isinstance(seg, list):
    # polygon
    pass
else:
    # RLE
    mask = maskUtils.decode(seg)  # → numpy array of shape (H, W), dtype uint8

The check isinstance(seg, list) is the standard way to branch between the two. In Fashionpedia, most annotations are polygons, but you’ll encounter RLE for crowd regions and occasionally for complex shapes.


Visualizing Annotations

Here’s a complete snippet that handles both segmentation types:

import os
import random
import json
import numpy as np
import cv2
import matplotlib.pyplot as plt
from pycocotools.coco import COCO
from pycocotools import mask as maskUtils

data_root = "/path/to/fashionpedia/"
IMAGE_DIR = os.path.join(data_root, "train")

coco = COCO(os.path.join(data_root, "instances_attributes_train2020.json"))

# Pick a random image
img_id = random.choice(coco.getImgIds())
img_info = coco.loadImgs(img_id)[0]
image = cv2.cvtColor(
    cv2.imread(os.path.join(IMAGE_DIR, img_info["file_name"])),
    cv2.COLOR_BGR2RGB
)

anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))

fig, ax = plt.subplots(1, figsize=(10, 10))
ax.imshow(image)

for ann in anns:
    x, y, w, h = ann["bbox"]
    color = np.random.rand(3)

    # Bounding box
    ax.add_patch(plt.Rectangle((x, y), w, h,
                               fill=False, edgecolor=color, linewidth=2))

    # Segmentation mask
    seg = ann["segmentation"]
    if isinstance(seg, list):
        for poly in seg:
            pts = np.array(poly).reshape(-1, 2)
            ax.fill(pts[:, 0], pts[:, 1], alpha=0.3, color=color)
            ax.plot(pts[:, 0], pts[:, 1], linewidth=1.5, color=color)
    else:
        mask = maskUtils.decode(seg)
        colored = np.zeros((*mask.shape, 4))
        colored[mask == 1] = [*color, 0.4]
        ax.imshow(colored)

    # Category label
    cat_name = coco.loadCats(ann["category_id"])[0]["name"]
    ax.text(x, y - 5, cat_name, color='white', fontsize=9,
            bbox=dict(facecolor=color, alpha=0.8, pad=2))

plt.axis("off")
plt.tight_layout()
plt.show()

Attribute Annotations

The attribute_ids field is Fashionpedia’s unique contribution beyond standard COCO. Each annotation carries a list of attribute IDs from the attributes array:

attrs = {a["id"]: a["name"] for a in coco.dataset["attributes"]}

for ann in anns:
    cat_name = coco.loadCats(ann["category_id"])[0]["name"]
    ann_attrs = [attrs[a] for a in ann.get("attribute_ids", [])]
    print(f"{cat_name}: {ann_attrs[:5]}")

Example output:

jacket: ['solid', 'long sleeve', 'button', 'lapel collar', 'fitted']
pants: ['solid', 'straight leg', 'mid-rise']
shoe: ['leather', 'block heel', 'pointed toe']

The 294 attributes are a mix of:

  • Color/pattern: solid, floral, striped, plaid, color-blocking
  • Silhouette: slim fit, oversized, fitted, straight leg, flared
  • Detail: button, zipper, lace, ruffle, bow, sequin
  • Neckline/collar: v-neck, round neck, turtleneck, lapel collar
  • Sleeve: long sleeve, short sleeve, sleeveless, cap sleeve
  • Material: denim, leather, knit, chiffon, velvet

Not all attributes are applicable to all categories, and the annotations are not exhaustive — an item may have solid but not have long sleeve even if it’s a long-sleeved top. Think of it as a weak multi-label setup.
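For modeling, this weak multi-label structure is usually encoded as a multi-hot vector over the attribute vocabulary. A minimal sketch with made-up ids, assuming attribute ids index directly into a 0–293 range (if the ids in your copy aren't contiguous, route them through an id→index dict first):

```python
import numpy as np

NUM_ATTRIBUTES = 294  # size of Fashionpedia's attribute vocabulary

def attribute_vector(ann, num_attributes=NUM_ATTRIBUTES):
    """Multi-hot encoding of an annotation's attribute_ids."""
    vec = np.zeros(num_attributes, dtype=np.float32)
    for a in ann.get("attribute_ids", []):
        if 0 <= a < num_attributes:  # guard against out-of-range ids
            vec[a] = 1.0
    return vec

v = attribute_vector({"attribute_ids": [12, 47, 103]})
print(int(v.sum()))  # 3
```

Because the labels are not exhaustive, a zero in this vector means "not annotated", not "absent" — worth remembering if you train with a plain binary cross-entropy loss.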


Dataset Structure Summary

Running a quick analysis on the val set (1,158 images):

with open(os.path.join(data_root, "instances_attributes_val2020.json")) as f:
    data = json.load(f)

from collections import Counter
img_ann_count = Counter(ann["image_id"] for ann in data["annotations"])
counts = list(img_ann_count.values())
print(f"Images: {len(img_ann_count)}")
print(f"Annotations per image: min={min(counts)}, mean={sum(counts)/len(counts):.1f}, max={max(counts)}")
Images: 1158
Annotations per image: min=1, mean=7.6, max=27

Top categories by annotation count:

shoe:                       1566
sleeve:                     1442
neckline:                    929
pocket:                      541
dress:                       508
top, t-shirt, sweatshirt:    477
pants:                       314
collar:                      218
bag, wallet:                 214
zipper:                      194
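The tally above comes from the same kind of Counter pass, joined against the category table. A self-contained sketch over a made-up three-annotation payload:

```python
from collections import Counter

# Toy payload with the same shape as the real JSON
data = {
    "categories": [{"id": 23, "name": "shoe"}, {"id": 31, "name": "sleeve"}],
    "annotations": [{"category_id": 23}, {"category_id": 23},
                    {"category_id": 31}],
}

# Map ids to names, then count annotations per category name
cat_id_to_name = {c["id"]: c["name"] for c in data["categories"]}
cat_counts = Counter(cat_id_to_name[a["category_id"]]
                     for a in data["annotations"])

for name, n in cat_counts.most_common():
    print(f"{name}: {n}")
```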

The dominance of parts (sleeve, neckline, pocket, collar, zipper) over garments in the annotation count reflects the design intent — Fashionpedia was built with fine-grained attribute grounding in mind, not just item detection.


Practical: Filtering to Outfit Items

For outfit-level tasks, you typically want to work with coarse garments only and map the 46 categories to a smaller set. Here’s the mapping I use:

CATEGORY_MAPPING = {
    # top
    "shirt, blouse": "top",
    "top, t-shirt, sweatshirt": "top",
    "sweater": "top",
    "cardigan": "top",
    "jacket": "top",
    "vest": "top",
    # bottom
    "pants": "bottom",
    "shorts": "bottom",
    "skirt": "bottom",
    # shoes
    "shoe": "shoes",
    # everything else → None (skip)
}

def get_outfit_items(anns, coco, mapping=CATEGORY_MAPPING):
    items = []
    cat_id_to_name = {c["id"]: c["name"] for c in coco.dataset["categories"]}
    for ann in anns:
        cat_name = cat_id_to_name.get(ann["category_id"], "")
        norm = mapping.get(cat_name)
        if norm is None:
            continue
        items.append({
            "category": norm,
            "bbox": ann["bbox"],
            "segmentation": ann["segmentation"],
        })
    return items

A note on dress: it spans both top and bottom, so it doesn’t fit cleanly into either bucket. I exclude it from the outfit-item pipeline and handle it separately depending on the task.

Checking coverage in the val set — how many images have a full outfit (top + bottom + shoes)?

from collections import defaultdict

cat_id_to_name = {c["id"]: c["name"] for c in data["categories"]}
img_cats = defaultdict(set)
for ann in data["annotations"]:
    cat_name = cat_id_to_name[ann["category_id"]]
    norm = CATEGORY_MAPPING.get(cat_name)
    if norm:
        img_cats[ann["image_id"]].add(norm)

full_outfit = [img_id for img_id, cats in img_cats.items()
               if {"top", "bottom", "shoes"}.issubset(cats)]
print(f"Images with top+bottom+shoes: {len(full_outfit)} / {len(data['images'])}")
Images with top+bottom+shoes: 342 / 1158

About 30% of val images have all three item types. For outfit compatibility evaluation, this is the subset you’d typically use.


Notes and Gotchas

bbox is [x, y, w, h], not [x1, y1, x2, y2]. This catches people used to other formats. To convert: x2 = x + w, y2 = y + h.
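A small converter (plus the corresponding numpy crop) keeps this from biting; bbox values are floats, so round before indexing:

```python
import numpy as np

def xywh_to_xyxy(bbox):
    """COCO [x, y, w, h] → [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

def crop_bbox(image, bbox):
    """Crop a numpy image of shape (H, W, C) with a COCO-style bbox."""
    x1, y1, x2, y2 = (int(round(v)) for v in xywh_to_xyxy(bbox))
    return image[y1:y2, x1:x2]

# 1200x800 dummy image, bbox from the annotation example above
img = np.zeros((1200, 800, 3), dtype=np.uint8)
print(crop_bbox(img, [120.0, 45.0, 300.0, 400.0]).shape)  # (400, 300, 3)
```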

Shoes almost always come in pairs. A single image with two shoes gets two shoe annotations, each with its own polygon. If you’re building a per-item embedding, expect 2 shoe crops per image for most outfit photos.

Part annotations overlap with garment annotations. A sleeve annotation overlaps spatially with the jacket or top annotation it belongs to. Fashionpedia does not encode the parent-child relationship explicitly in the annotation JSON, so you’d need to infer it from spatial overlap if needed.
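One way to infer the parent is bbox containment: assign each part to the garment whose box covers the largest fraction of the part's box. This is my own heuristic sketch, not anything Fashionpedia ships:

```python
def overlap_fraction(part_bbox, garment_bbox):
    """Fraction of the part's bbox area that lies inside the garment's bbox."""
    px, py, pw, ph = part_bbox
    gx, gy, gw, gh = garment_bbox
    ix = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    iy = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    return (ix * iy) / (pw * ph) if pw * ph > 0 else 0.0

def assign_parent(part_ann, garment_anns, min_overlap=0.5):
    """Pick the garment annotation that best contains the part, if any."""
    best = max(garment_anns,
               key=lambda g: overlap_fraction(part_ann["bbox"], g["bbox"]),
               default=None)
    if best and overlap_fraction(part_ann["bbox"], best["bbox"]) >= min_overlap:
        return best
    return None

# A sleeve box sitting entirely inside a jacket box (made-up coordinates)
sleeve = {"bbox": [10.0, 10.0, 20.0, 40.0]}
jacket = {"bbox": [0.0, 0.0, 100.0, 100.0]}
print(assign_parent(sleeve, [jacket]) is jacket)  # True
```

Bbox containment is coarse (a sleeve box can overlap both a jacket and the top underneath it); mask-level intersection via the decoded binary masks is the more reliable, if slower, variant.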

RLE segmentations use (height, width) order in the size field, which is the opposite of the usual (width, height) convention. maskUtils.decode() handles this correctly, but keep it in mind if you’re doing any manual RLE manipulation.

The attribute_ids field can be absent or empty. Use .get("attribute_ids", []) rather than direct key access to avoid KeyError.

Train and val images are in separate directories, but annotations reference only filenames. If you’re combining train and val, make sure your image lookup searches both directories.
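A small resolver that checks each split directory in turn covers the combined case; the subdirectory names here are assumptions, so adjust them to your local layout:

```python
import os

def resolve_image_path(data_root, file_name, subdirs=("train", "test")):
    """Return the first existing path for file_name across the split dirs."""
    for sub in subdirs:
        path = os.path.join(data_root, sub, file_name)
        if os.path.exists(path):
            return path
    raise FileNotFoundError(file_name)
```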


Next Steps

  • Crop items from images using bbox + polygon mask and feed into a vision encoder
  • Explore attribute prediction as an auxiliary task alongside compatibility scoring
  • Compare polygon-based crops vs. bbox-only crops on downstream retrieval metrics