jellybean's braindump

Yes, We Can Verify That Too! (Vision Edition)

GRPO (Group Relative Policy Optimization) is moving beyond language tasks, with early investigations into how you can reinforce a vision language model's sight. Arguably, vision tasks are more diverse, especially in their representational and computational demands, which makes online RL for vision even more difficult and inconsistent. For "verifiable rewards", many design choices (task formulations) are still unexplored, so we can construct a suite of toy tasks to exercise our environment engineering skills. The tasks should still maintain two properties: 1) scalability to arbitrary complexity, and 2) use of existing data to expose deeper generalization skills while maintaining perfect accuracy (as pointed out by kalo).

Here's my list of tasks to experiment with (the verifier given for each one is only a suggestion at the moment):

1. Rotation Prediction

Can be framed as simple classification over {0°, 90°, 180°, 270°} or as regression over any θ ∈ [0, 2π] (can even predict a quaternion for synthetically tilted 3-D renderings).

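Verifier: reward = 1 − |wrap(θ̂ − θ)| / 180, where θ̂ is the predicted angle, θ is the ground truth (both in degrees), and wrap maps the difference into [−180, 180). For angles in radians, divide by π instead of 180.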
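
A minimal sketch of the rotation verifier, assuming angles are given in degrees and reading the formula as the wrapped difference between the predicted and ground-truth angle. The function names `wrap_degrees` and `rotation_reward` are made up for illustration.

```python
def wrap_degrees(delta: float) -> float:
    """Map an angle difference (degrees) into the interval [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0


def rotation_reward(theta_pred: float, theta_gt: float) -> float:
    """Reward in [0, 1]: 1 for an exact match, 0 for a 180-degree error."""
    error = abs(wrap_degrees(theta_pred - theta_gt))
    return 1.0 - error / 180.0
```

Wrapping matters because 350° and 10° are only 20° apart; without it the reward would treat them as almost opposite. The same function covers the four-way classification case if the predicted class is mapped to its angle first.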


2. Jigsaw

From simple 3 × 3 shuffles to multi-resolution pyramids (e.g. solve a big 4×4 puzzle whose pieces are themselves 2×2 micro-puzzles).

Verifier: exact permutation match
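
A rough sketch of the exact-match check, assuming the model's answer has already been parsed into a list of piece indices (the parsing step and the name `jigsaw_reward` are assumptions, not part of the original list). Answers that are not a valid permutation get zero reward.

```python
from typing import Sequence


def jigsaw_reward(pred: Sequence[int], target: Sequence[int]) -> float:
    """Return 1.0 only if pred is a valid permutation identical to target."""
    n = len(target)
    # Reject answers with the wrong length, repeated pieces, or unknown indices.
    if len(pred) != n or sorted(pred) != list(range(n)):
        return 0.0
    return 1.0 if list(pred) == list(target) else 0.0
```

Exact match is a sparse signal on larger grids; a softer variant (fraction of correctly placed pieces) would be a different design choice from the one listed here.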


3. Counting

Complexity can scale from synthetic (CLEVR-count) to real-world object counting.

Verifier: reward = exp(-|count – count_gt|), or a clipped MAE.
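
Both options can be sketched in a few lines. The exponential version follows the formula directly; the "clipped MAE" version is one possible reading (linear decay in the absolute error, clipped at zero), and the `max_err` parameter is an assumption introduced for illustration.

```python
import math


def count_reward_exp(count: int, count_gt: int) -> float:
    """reward = exp(-|count - count_gt|): 1.0 when exact, decaying quickly."""
    return math.exp(-abs(count - count_gt))


def count_reward_clipped(count: int, count_gt: int, max_err: float = 5.0) -> float:
    """Linear penalty on absolute error, clipped to 0 beyond max_err (assumed)."""
    return max(0.0, 1.0 - abs(count - count_gt) / max_err)
```

The exponential reward drops to about 0.37 for an off-by-one answer, while the clipped version keeps a gentler slope, which may matter when ground-truth counts are large.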


4. Segmentation

From semantic segmentation to panoptic segmentation.

Verifier: mean-IoU
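
A small pure-Python sketch of mean-IoU for semantic label maps, assuming predictions and ground truth are same-sized 2-D grids of integer class ids in [0, num_classes). Classes absent from both maps are skipped when averaging, which is one common convention and an assumption here. Panoptic segmentation would need an instance-aware metric that this sketch does not cover.

```python
from typing import List


def mean_iou(pred: List[List[int]], gt: List[List[int]], num_classes: int) -> float:
    """Average per-class intersection-over-union over classes that appear."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # Pixel belongs to both masks of class g.
                inter[g] += 1
                union[g] += 1
            else:
                # Pixel is in the predicted mask of p and the true mask of g.
                union[p] += 1
                union[g] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```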


5. Transformation Program Induction

Present a before-image and an after-image, and ask the model to output the sequence of transformations (crop, rotate, colour-shift) that regenerates the after-image from the before-image. Complexity scales with the number of transforms.

Verifier: apply predicted program with OpenCV; reward = IoU(result, target).
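
A toy sketch of this verifier on tiny grayscale grids. To stay self-contained it uses plain Python lists instead of OpenCV, and it makes several assumptions that are not in the original: the program format (tuples such as `("rotate", 1)`), the three op names, and the definition of image IoU as the overlap between sets of `(row, col, value)` triples.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]


def apply_program(img: Image, program: Sequence[Tuple]) -> Image:
    """Apply each (op, *args) step in order and return the resulting image."""
    for op, *args in program:
        if op == "rotate":  # args: number of clockwise quarter turns
            for _ in range(args[0] % 4):
                img = [list(row) for row in zip(*img[::-1])]
        elif op == "crop":  # args: top, left, height, width
            top, left, height, width = args
            img = [row[left:left + width] for row in img[top:top + height]]
        elif op == "colour_shift":  # args: intensity offset, wraps at 256
            img = [[(v + args[0]) % 256 for v in row] for row in img]
        else:
            raise ValueError(f"unknown op: {op}")
    return img


def pixel_iou(a: Image, b: Image) -> float:
    """IoU over (row, col, value) triples; shape mismatches lower the score."""
    sa = {(r, c, v) for r, row in enumerate(a) for c, v in enumerate(row)}
    sb = {(r, c, v) for r, row in enumerate(b) for c, v in enumerate(row)}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def program_reward(before: Image, after: Image, program: Sequence[Tuple]) -> float:
    """Run the predicted program; malformed programs earn zero reward."""
    try:
        result = apply_program(before, program)
    except (ValueError, TypeError, IndexError):
        return 0.0
    return pixel_iou(result, after)
```

Scoring the rendered result rather than the program text means any program that reproduces the target earns full reward, even if it differs from the one used to generate the data.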


6. Document Layout Parsing

From detecting title and paragraph boxes in clean PDF prints to full category labelling (title, subtitle, table, figure, caption) on noisy scans.

Verifier: IoU + class accuracy per box against annotations.
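
One way this could be sketched, assuming boxes are `(x1, y1, x2, y2)` tuples paired with a label. The greedy matching, the 0.5 IoU threshold, the equal weighting of the two terms, and the normalisation by the larger of the two box counts (so spurious predictions are penalised) are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Labeled = Tuple[Box, str]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def layout_reward(preds: List[Labeled], gts: List[Labeled],
                  iou_threshold: float = 0.5) -> float:
    """Greedily match each annotation to its best unused prediction."""
    if not gts:
        return 1.0 if not preds else 0.0
    unused = list(range(len(preds)))
    iou_sum, class_hits = 0.0, 0
    for gt_box, gt_label in gts:
        best_idx, best_iou = None, 0.0
        for i in unused:
            iou = box_iou(preds[i][0], gt_box)
            if iou > best_iou:
                best_idx, best_iou = i, iou
        if best_idx is not None and best_iou >= iou_threshold:
            unused.remove(best_idx)
            iou_sum += best_iou
            class_hits += int(preds[best_idx][1] == gt_label)
    denom = max(len(preds), len(gts))
    # Equal weighting of localisation and classification (assumed).
    return 0.5 * iou_sum / denom + 0.5 * class_hits / denom
```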

#ml