Affordance Prediction Analysis

Qualitative Comparison Against Baselines

We compare our affordance model against two baselines that use CLIP features, CLIPSeg and CLIPort, on the RAVENS dataset. Spatula and Sauce Pan are seen at training time, while Hammer is unseen for CLIPort. For our method, we annotate one exemplar from each category.

[Figure: qualitative comparison for three objects: Hammer, Sauce Pan, and Spatula. Panels for each object, left to right: Ground Truth, CLIPSeg, CLIPort, Ours.]

Using CLIP Features for Affordance Prediction

While CLIP features are great at identifying objects with open-vocabulary generalization, they struggle to localize specific object parts or affordances. Below is an example prediction from CLIPSeg for an image of a hammer with the prompt "hammer handle".
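
A part-level query of this kind can be sketched as follows. This is a minimal illustration, not the setup used for the figure. It assumes the Hugging Face transformers port of CLIPSeg and the public checkpoint CIDAS/clipseg-rd64-refined. The function names logits_to_mask and predict_part_mask and the 0.5 probability threshold are hypothetical choices made for this example.

```python
import numpy as np


def logits_to_mask(logits, threshold=0.5):
    """Turn a 2-D array of segmentation logits into a boolean mask.

    The 0.5 probability threshold is an assumption made for illustration.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    return probs > threshold


def predict_part_mask(image, prompt="hammer handle",
                      checkpoint="CIDAS/clipseg-rd64-refined"):
    """Query CLIPSeg with a text prompt for a PIL image.

    Returns a boolean mask at the model's output resolution, not the
    resolution of the input image. The checkpoint name is an assumption.
    """
    # Imported lazily so logits_to_mask works without torch or transformers.
    import torch
    from transformers import CLIPSegForImageSegmentation, CLIPSegProcessor

    processor = CLIPSegProcessor.from_pretrained(checkpoint)
    model = CLIPSegForImageSegmentation.from_pretrained(checkpoint)
    inputs = processor(text=[prompt], images=[image],
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # A single prompt may yield (H, W) or (1, H, W); normalise to (H, W).
    logits = logits.reshape(logits.shape[-2], logits.shape[-1])
    return logits_to_mask(logits.numpy())
```

Calling predict_part_mask(Image.open(path).convert("RGB")) with a PIL image of a hammer would produce a mask that can be overlaid on the resized image to check whether the highlighted region covers only the handle or the whole object.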