Fashionpedia: A Practical Guide to the Dataset and COCO Format
Fashionpedia is a fine-grained fashion image dataset released by Google in 2020. What sets it apart from other fashion datasets is the combination of two annotation layers: pixel-level segmentation masks for each garment item and fine-grained attribute labels (material, texture, color, etc.) attached to those masks. I’ve been using it as an evaluation benchmark for outfit compatibility retrieval, and this note documents the format, the quirks, and how to work with it in Python.
What Is Fashionpedia?
Fashionpedia contains around 48,000 images (train: ~45,000, val: ~1,200, plus an unannotated test split) from the iMaterialist Fashion challenge, annotated with:
- Segmentation masks (polygon or RLE) for each garment item in the image
- Category labels: 46 categories, ranging from coarse garments (`dress`, `jacket`, `pants`) to fine-grained parts (`neckline`, `sleeve`, `zipper`, `pocket`)
- Attribute labels: per-annotation labels drawn from a vocabulary of 294 attributes (e.g., `solid`, `floral`, `long sleeve`, `v-neck`)
The key design choice is hierarchical annotation: a single outfit image gets multiple overlapping annotations, one per visible item or part. A typical image in the val set has 7–8 annotations; the count ranges from 1 to 27.
COCO Format Primer
Fashionpedia follows the MS COCO annotation format, which is the de facto standard for instance segmentation datasets. If you’re new to COCO, here’s the structure.
A COCO annotation file is a single JSON with these top-level keys:
```
{
  "info": { ... },         # dataset metadata
  "licenses": [ ... ],     # image license info
  "images": [ ... ],       # list of image metadata objects
  "categories": [ ... ],   # label taxonomy
  "annotations": [ ... ]   # the actual annotation records
}
```
Fashionpedia adds one more key:
"attributes": [ ... ] # 294 fine-grained attribute definitions
Images
Each entry in `images` looks like:

```
{
  "id": 123456,
  "file_name": "123456.jpg",
  "width": 800,
  "height": 1200,
  "license": 1
}
```
The `id` field is what you use to look up annotations for a given image.
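If you are working from the raw JSON without pycocotools (the pycocotools section below automates this), the image-id lookup is a few lines of plain Python. `build_ann_index` is a name I'm introducing here, and the records are fabricated for illustration:

```python
from collections import defaultdict

def build_ann_index(coco_json):
    """Map image_id -> list of annotation dicts from a raw COCO-style dict."""
    index = defaultdict(list)
    for ann in coco_json["annotations"]:
        index[ann["image_id"]].append(ann)
    return index

# Tiny fabricated example (not real Fashionpedia records):
coco_json = {
    "images": [{"id": 1, "file_name": "1.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 5},
        {"id": 11, "image_id": 1, "category_id": 23},
    ],
}
index = build_ann_index(coco_json)
# index[1] holds both annotations for image 1
```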
Annotations
Each entry in `annotations` is one annotated instance:

```
{
  "id": 999,
  "image_id": 123456,
  "category_id": 5,
  "bbox": [120.0, 45.0, 300.0, 400.0],
  "segmentation": [[120.0, 45.0, 280.0, 45.0, ...]],
  "area": 95000.0,
  "iscrowd": 0,
  "attribute_ids": [12, 47, 103]
}
```
A few things to note:
- `bbox` is `[x, y, w, h]`: the top-left corner plus width and height, not `[x1, y1, x2, y2]`. Easy to mix up.
- `segmentation` is either a list of polygons or an RLE dict (more on this below).
- `attribute_ids` is Fashionpedia-specific: a list of attribute IDs from the `attributes` array.
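Since the bbox convention is the most common mix-up, a tiny converter helps keep it straight. `xywh_to_xyxy` is a hypothetical helper name, not part of any API:

```python
def xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, w, h] bbox to [x1, y1, x2, y2] corner format."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

xywh_to_xyxy([120.0, 45.0, 300.0, 400.0])
# → [120.0, 45.0, 420.0, 445.0]
```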
Categories
Fashionpedia has 46 categories. They split into two groups:
Coarse garment categories (what you’d call an outfit item):
| Category | Notes |
|---|---|
| `shirt, blouse` | |
| `top, t-shirt, sweatshirt` | |
| `sweater` | |
| `cardigan` | |
| `jacket` | |
| `vest` | |
| `pants` | |
| `shorts` | |
| `skirt` | |
| `coat` | |
| `dress` | top + bottom combined |
| `jumpsuit` | |
| `cape` | |
| `shoe` | |
| `bag, wallet` | |
| `hat` | |
| `glasses` | |
| `glove` | |
| `belt` | |
| `watch` | |
| … | |
Fine-grained part categories (sub-regions of garments):
| Category | Notes |
|---|---|
| `sleeve` | most common part annotation in the val set |
| `neckline` | |
| `collar` | |
| `lapel` | |
| `pocket` | |
| `zipper` | |
| `buckle` | |
| `hood` | |
| `epaulette` | |
| `bow` | |
| … | |
The part categories are heavily represented. In the val set, `shoe` (1566), `sleeve` (1442), and `neckline` (929) are the top three by annotation count, and two of those are parts, not garments.

This matters a lot for downstream use. If you're building outfit compatibility models, you probably don't want `sleeve` as a unit of comparison. Filtering to coarse garment categories (top/bottom/shoes) is a common preprocessing step.
Setup
```shell
pip install pycocotools matplotlib opencv-python
```
Download Fashionpedia from the official site or the Kaggle mirror. You’ll get:
```
fashionpedia/
  instances_attributes_train2020.json
  instances_attributes_val2020.json
  train/    (images)
  test/     (images, no public annotations)
```
Loading with pycocotools
`pycocotools` is the official COCO API. It handles the index-building so you don't have to maintain lookup tables by hand.
```python
from pycocotools.coco import COCO

data_root = "/path/to/fashionpedia/"
coco = COCO(data_root + "instances_attributes_val2020.json")
```
The constructor prints something like `loading annotations into memory...` and builds an internal index. Key methods:
```python
# Get all image IDs
img_ids = coco.getImgIds()

# Load image metadata for one of them
img_id = img_ids[0]
img_info = coco.loadImgs(img_id)[0]
# → {"id": 123456, "file_name": "123456.jpg", "width": 800, ...}

# Get annotation IDs for a given image
ann_ids = coco.getAnnIds(imgIds=img_id)

# Load annotations
anns = coco.loadAnns(ann_ids)

# Get category info
cat = coco.loadCats(anns[0]["category_id"])[0]
# → {"id": 5, "name": "shoe", "supercategory": "apparel"}
```
You can also filter by category:
```python
# Get all annotation IDs for a specific category
shoe_cat_id = [c["id"] for c in coco.dataset["categories"] if c["name"] == "shoe"][0]
ann_ids = coco.getAnnIds(catIds=[shoe_cat_id])
```
Segmentation: Polygons and RLE
This is the part that trips people up most often. The `segmentation` field can arrive in one of two formats, so your code has to handle both.
Polygon Format
The most common case in Fashionpedia. The value is a list of lists:
```python
seg = ann["segmentation"]
# Example: [[x1, y1, x2, y2, x3, y3, ...], [...]]
```
Each inner list is a closed polygon, given as a flat `[x, y, x, y, ...]` coordinate sequence. One annotation can have multiple polygons (for non-contiguous regions, think paired shoes).
To plot:
```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for poly in seg:
    pts = np.array(poly).reshape(-1, 2)
    ax.plot(pts[:, 0], pts[:, 1], linewidth=2)
```
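As a sanity check on polygon coordinates, the shoelace formula recovers an area comparable to the annotation's `area` field. This is a sketch (`polygon_area` is a name I'm introducing), and small differences from mask rasterization are expected:

```python
def polygon_area(flat_coords):
    """Shoelace area of a flat [x1, y1, x2, y2, ...] polygon."""
    xs = flat_coords[0::2]
    ys = flat_coords[1::2]
    n = len(xs)
    s = sum(xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n))
    return abs(s) / 2.0

polygon_area([0, 0, 10, 0, 10, 20, 0, 20])  # a 10x20 rectangle
# → 200.0
```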
RLE Format
Run-Length Encoding is used for more complex shapes. The value is a dict (not a list):
```python
seg = ann["segmentation"]
# Example: {"counts": "...", "size": [height, width]}
```
To decode to a binary mask, use `pycocotools.mask`:

```python
from pycocotools import mask as maskUtils

if isinstance(seg, list):
    # polygon
    pass
else:
    # RLE
    mask = maskUtils.decode(seg)  # → numpy array of shape (H, W), dtype uint8
```
The check `isinstance(seg, list)` is the standard way to branch between the two. In Fashionpedia, most annotations are polygons, but you'll encounter RLE for crowd regions and occasionally for complex shapes.
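Under the hood, uncompressed RLE is just alternating run lengths of background (0) and foreground (1) pixels, stored in column-major order. A toy decoder for the uncompressed list form makes the layout concrete; this is for illustration only (use `maskUtils.decode` in practice, which also handles the compressed string form):

```python
import numpy as np

def decode_uncompressed_rle(counts, size):
    """Toy decoder: alternating 0/1 runs, column-major. Illustration only."""
    h, w = size  # note: (height, width) order
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0
    for run in counts:
        flat[pos:pos + run] = val
        pos += run
        val = 1 - val
    return flat.reshape((w, h)).T  # undo the column-major flattening

m = decode_uncompressed_rle([2, 4, 3], [3, 3])
# column 0 = [0, 0, 1], column 1 = [1, 1, 1], column 2 = [0, 0, 0]
```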
Visualizing Annotations
Here’s a complete snippet that handles both segmentation types:
```python
import os
import random

import numpy as np
import cv2
import matplotlib.pyplot as plt
from pycocotools.coco import COCO
from pycocotools import mask as maskUtils

data_root = "/path/to/fashionpedia/"
IMAGE_DIR = os.path.join(data_root, "train")
coco = COCO(os.path.join(data_root, "instances_attributes_train2020.json"))

# Pick a random image
img_id = random.choice(coco.getImgIds())
img_info = coco.loadImgs(img_id)[0]
image = cv2.cvtColor(
    cv2.imread(os.path.join(IMAGE_DIR, img_info["file_name"])),
    cv2.COLOR_BGR2RGB,
)
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))

fig, ax = plt.subplots(1, figsize=(10, 10))
ax.imshow(image)

for ann in anns:
    x, y, w, h = ann["bbox"]
    color = np.random.rand(3)

    # Bounding box
    ax.add_patch(plt.Rectangle((x, y), w, h,
                               fill=False, edgecolor=color, linewidth=2))

    # Segmentation mask
    seg = ann["segmentation"]
    if isinstance(seg, list):
        for poly in seg:
            pts = np.array(poly).reshape(-1, 2)
            ax.fill(pts[:, 0], pts[:, 1], alpha=0.3, color=color)
            ax.plot(pts[:, 0], pts[:, 1], linewidth=1.5, color=color)
    else:
        mask = maskUtils.decode(seg)
        colored = np.zeros((*mask.shape, 4))
        colored[mask == 1] = [*color, 0.4]
        ax.imshow(colored)

    # Category label
    cat_name = coco.loadCats(ann["category_id"])[0]["name"]
    ax.text(x, y - 5, cat_name, color="white", fontsize=9,
            bbox=dict(facecolor=color, alpha=0.8, pad=2))

plt.axis("off")
plt.tight_layout()
plt.show()
```
Attribute Annotations
The `attribute_ids` field is Fashionpedia's unique contribution beyond standard COCO. Each annotation carries a list of attribute IDs from the `attributes` array:

```python
attrs = {a["id"]: a["name"] for a in coco.dataset["attributes"]}

for ann in anns:
    cat_name = coco.loadCats(ann["category_id"])[0]["name"]
    ann_attrs = [attrs[a] for a in ann.get("attribute_ids", [])]
    print(f"{cat_name}: {ann_attrs[:5]}")
```
Example output:
```
jacket: ['solid', 'long sleeve', 'button', 'lapel collar', 'fitted']
pants: ['solid', 'straight leg', 'mid-rise']
shoe: ['leather', 'block heel', 'pointed toe']
```
The 294 attributes are a mix of:
- Color/pattern: `solid`, `floral`, `striped`, `plaid`, `color-blocking`
- Silhouette: `slim fit`, `oversized`, `fitted`, `straight leg`, `flared`
- Detail: `button`, `zipper`, `lace`, `ruffle`, `bow`, `sequin`
- Neckline/collar: `v-neck`, `round neck`, `turtleneck`, `lapel collar`
- Sleeve: `long sleeve`, `short sleeve`, `sleeveless`, `cap sleeve`
- Material: `denim`, `leather`, `knit`, `chiffon`, `velvet`
Not all attributes are applicable to all categories, and the annotations are not exhaustive: an item may have `solid` but lack `long sleeve` even if it's a long-sleeved top. Think of it as a weak multi-label setup.
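For training against this weak multi-label signal, a common encoding is a multi-hot vector over all 294 attributes. A sketch, with a hypothetical `attributes_to_multihot` helper; the `id_to_index` remapping is an assumption, useful because raw attribute IDs need not be a contiguous 0..293 range:

```python
import numpy as np

def attributes_to_multihot(attribute_ids, id_to_index, num_attributes=294):
    """Multi-hot encode an annotation's attribute_ids.

    id_to_index remaps raw attribute IDs to dense vector positions.
    """
    vec = np.zeros(num_attributes, dtype=np.float32)
    for a in attribute_ids:
        vec[id_to_index[a]] = 1.0
    return vec

# Hypothetical remapping, in practice built from the "attributes" array:
id_to_index = {12: 0, 47: 1, 103: 2, 200: 3}
vec = attributes_to_multihot([12, 103], id_to_index, num_attributes=4)
# → positions 0 and 2 set, others zero
```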
Dataset Structure Summary
Running a quick analysis on the val set (1,158 images):
```python
import json
import os
from collections import Counter

with open(os.path.join(data_root, "instances_attributes_val2020.json")) as f:
    data = json.load(f)

img_ann_count = Counter(ann["image_id"] for ann in data["annotations"])
counts = list(img_ann_count.values())
print(f"Images: {len(img_ann_count)}")
print(f"Annotations per image: min={min(counts)}, "
      f"mean={sum(counts)/len(counts):.1f}, max={max(counts)}")
```
```
Images: 1158
Annotations per image: min=1, mean=7.6, max=27
```
Top categories by annotation count:
```
shoe: 1566
sleeve: 1442
neckline: 929
pocket: 541
dress: 508
top, t-shirt, sweatshirt: 477
pants: 314
collar: 218
bag, wallet: 214
zipper: 194
```
The dominance of parts (`sleeve`, `neckline`, `pocket`, `collar`, `zipper`) over garments in the annotation counts reflects the design intent: Fashionpedia was built with fine-grained attribute grounding in mind, not just item detection.
Practical: Filtering to Outfit Items
For outfit-level tasks, you typically want to work with coarse garments only and map the 46 categories to a smaller set. Here’s the mapping I use:
```python
CATEGORY_MAPPING = {
    # top
    "shirt, blouse": "top",
    "top, t-shirt, sweatshirt": "top",
    "sweater": "top",
    "cardigan": "top",
    "jacket": "top",
    "vest": "top",
    # bottom
    "pants": "bottom",
    "shorts": "bottom",
    "skirt": "bottom",
    # shoes
    "shoe": "shoes",
    # everything else → None (skip)
}
```
```python
def get_outfit_items(anns, coco, mapping=CATEGORY_MAPPING):
    items = []
    cat_id_to_name = {c["id"]: c["name"] for c in coco.dataset["categories"]}
    for ann in anns:
        cat_name = cat_id_to_name.get(ann["category_id"], "")
        norm = mapping.get(cat_name)
        if norm is None:
            continue
        items.append({
            "category": norm,
            "bbox": ann["bbox"],
            "segmentation": ann["segmentation"],
        })
    return items
```
A note on `dress`: it spans both top and bottom, so it doesn't fit cleanly into either bucket. I exclude it from the outfit-item pipeline and handle it separately depending on the task.
Checking coverage in the val set: how many images have a full outfit (top + bottom + shoes)?
```python
from collections import defaultdict

cat_id_to_name = {c["id"]: c["name"] for c in data["categories"]}

img_cats = defaultdict(set)
for ann in data["annotations"]:
    cat_name = cat_id_to_name[ann["category_id"]]
    norm = CATEGORY_MAPPING.get(cat_name)
    if norm:
        img_cats[ann["image_id"]].add(norm)

full_outfit = [img_id for img_id, cats in img_cats.items()
               if {"top", "bottom", "shoes"}.issubset(cats)]
print(f"Images with top+bottom+shoes: {len(full_outfit)} / {len(data['images'])}")
```
```
Images with top+bottom+shoes: 342 / 1158
```
About 30% of val images have all three item types. For outfit compatibility evaluation, this is the subset you’d typically use.
Notes and Gotchas
`bbox` is `[x, y, w, h]`, not `[x1, y1, x2, y2]`. This catches people used to other formats. To convert: `x2 = x + w`, `y2 = y + h`.
Shoes almost always come in pairs. A single image with two shoes gets two `shoe` annotations, each with its own polygon. If you're building a per-item embedding, expect two shoe crops per image for most outfit photos.
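Extracting those per-item crops from the bbox is a one-liner with numpy slicing. A minimal sketch (`crop_bbox` is a name I'm introducing) that clamps to image bounds and assumes an HxWxC array:

```python
import numpy as np

def crop_bbox(image, bbox):
    """Crop an HxWxC image array using a COCO [x, y, w, h] bbox."""
    x, y, w, h = (int(round(v)) for v in bbox)
    H, W = image.shape[:2]
    x1, y1 = max(x, 0), max(y, 0)
    x2, y2 = min(x + w, W), min(y + h, H)
    return image[y1:y2, x1:x2]

img = np.zeros((1200, 800, 3), dtype=np.uint8)
crop = crop_bbox(img, [120.0, 45.0, 300.0, 400.0])
# crop.shape → (400, 300, 3)
```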
Part annotations overlap with garment annotations. A `sleeve` annotation overlaps spatially with the `jacket` or `top` annotation it belongs to. Fashionpedia does not encode the parent-child relationship explicitly in the annotation JSON, so you'd need to infer it from spatial overlap if needed.
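One simple heuristic for that inference is the fraction of the part's bbox contained in each candidate garment's bbox. A sketch with a hypothetical `containment` helper; masks would give a more precise signal than bboxes:

```python
def containment(part_bbox, garment_bbox):
    """Fraction of the part's bbox area inside the garment's bbox ([x, y, w, h])."""
    px, py, pw, ph = part_bbox
    gx, gy, gw, gh = garment_bbox
    ix = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    iy = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    part_area = pw * ph
    return (ix * iy) / part_area if part_area > 0 else 0.0

containment([10, 10, 20, 20], [0, 0, 100, 100])
# → 1.0 (part fully inside the garment bbox)
```

Assigning each part to the garment with the highest containment score is a reasonable default.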
RLE segmentations use `(height, width)` order in the `size` field, which is the opposite of the usual `(width, height)` convention for image sizes. `maskUtils.decode()` handles this correctly, but keep it in mind if you're doing any manual RLE manipulation.
The `attribute_ids` field can be absent or empty. Use `ann.get("attribute_ids", [])` rather than direct key access to avoid a `KeyError`.
Train and val images are in separate directories, but annotations reference only filenames. If you’re combining train and val, make sure your image lookup searches both directories.
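One way to handle that is a small lookup helper that tries each candidate directory in order. `resolve_image_path` is a hypothetical name:

```python
import os

def resolve_image_path(file_name, dirs):
    """Return the first existing path for file_name across candidate dirs."""
    for d in dirs:
        path = os.path.join(d, file_name)
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"{file_name} not found in {dirs}")
```

Usage would look like `resolve_image_path(img_info["file_name"], [train_dir, val_dir])`.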
Next Steps
- Crop items from images using bbox + polygon mask and feed into a vision encoder
- Explore attribute prediction as an auxiliary task alongside compatibility scoring
- Compare polygon-based crops vs. bbox-only crops on downstream retrieval metrics