Project
Autoregressive Video Models
Driving large video models with next token prediction In-context learning for vision data has been underexplored compared with that in natural language. Previous works studied image in-context learning, urging models to generate a single image…
Publication
Video In-context Learning
Project
A Dynamic Benchmark for Image Understanding
We have created a procedurally generatable, synthetic dataset for testing spatial reasoning, visual prompting, object recognition and detection. A key question for understanding multimodal model performance is how well is can understand images, in particular…