An agent that plans, calls visual tools, and uses a vision-based critic to boost multimodal VQA

Overview

Decision Snapshot: Needs Validation

Results show consistent zero-shot gains across multiple datasets and a new video QA set; evidence is solid but depends on tool availability and compute, and critic failure modes require mitigation.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Somnath Kumar, Yash Gadhia, Tanuja Ganu, Akshay Nambi

Why It Matters For Business

MMCTAgent improves accuracy on hard visual QA tasks by combining planning, specialist vision tools, and an automated visual verifier — useful for analytics, media search, and QA over long videos, but it adds compute and tool dependencies.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

MMCTAgent is a modular agent-style pipeline that combines an LLM planner/reasoner, a suite of visual/audio/text tools, and a novel vision-based critic to iteratively solve hard image and long-form video question answering (VQA) tasks. In zero-shot tests across several image benchmarks and long videos, MMCTAgent outperforms strong multimodal models. Example results: MMVET 74.24% (MMCT w/ critic) vs GPT-4V 60.2%; EgoSchema 71.2% (MMCT w/ critic) vs GPT-4V 63.5%. The critic adds ~3–5 percentage points but can also introduce errors when it shares weaknesses with the base vision model.

Problem Statement

Modern multimodal LLMs still struggle with detailed visual reasoning, long video context, and verifying chain-of-thought. MMCTAgent tackles this by (1) decomposing queries with an LLM planner, (2) calling specialized vision/audio/text tools to gather grounded evidence, and (3) using an automated vision-based critic to check and refine answers.
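The three steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `solve`, `plan`, `answer_from`, `critic`, and the toy stubs are hypothetical names standing in for the LLM planner, the vision/audio/text tools, and the vision-based critic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentState:
    question: str
    evidence: List[str] = field(default_factory=list)
    answer: str = ""
    rounds: int = 0


def solve(
    question: str,
    plan: Callable[[str, List[str]], List[Tuple[str, str]]],
    tools: Dict[str, Callable[[str], str]],
    answer_from: Callable[[str, List[str]], str],
    critic: Callable[[str, str, List[str]], Tuple[bool, str]],
    max_rounds: int = 3,
) -> AgentState:
    """Hypothetical plan -> tools -> critic loop; every callable is a stub."""
    state = AgentState(question)
    feedback = ""
    for _ in range(max_rounds):
        state.rounds += 1
        # (1) Planner decomposes the question into (tool_name, tool_input) steps.
        steps = plan(question + feedback, state.evidence)
        # (2) Each step calls a specialist tool and records grounded evidence.
        for tool_name, tool_input in steps:
            state.evidence.append(f"{tool_name}: {tools[tool_name](tool_input)}")
        state.answer = answer_from(question, state.evidence)
        # (3) Critic checks the answer against the evidence; stop once accepted.
        accepted, critique = critic(question, state.answer, state.evidence)
        if accepted:
            break
        feedback = f" [critic feedback: {critique}]"
    return state


# Toy stubs so the loop runs end to end (all hypothetical).
def toy_plan(q: str, evidence: List[str]) -> List[Tuple[str, str]]:
    return [("ocr", "sign.png")] if not evidence else [("detect", "sign.png")]


toy_tools = {"ocr": lambda path: "STOP", "detect": lambda path: "red octagon"}


def toy_answer(q: str, evidence: List[str]) -> str:
    return "a stop sign" if len(evidence) > 1 else "a sign"


def toy_critic(q: str, answer: str, evidence: List[str]) -> Tuple[bool, str]:
    ok = "stop" in answer
    return ok, "" if ok else "answer ignores the OCR text"


result = solve("What is in the image?", toy_plan, toy_tools, toy_answer, toy_critic)
print(result.answer, result.rounds)
```

In this toy run the critic rejects the first answer, the planner requests one more tool call, and the second answer is accepted. A real system would replace each stub with a model or tool call and would likely cap rounds to bound cost.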

Main Contribution

MMCTAgent: a modular agent that iteratively plans, queries tools, and refines answers for image and long-form video VQA.

Vision-based critic: an automated verifier that derives task-specific criteria and evaluates the full multimodal reasoning chain.

Key Findings

MMCTAgent yields higher accuracy than evaluated SOTA multimodal models on image benchmarks.

Numbers: MMVET 74.24% (MMCT w/ critic) vs GPT-4V 60.2% (Table 1)

Practical Use: Use a tool-augmented agent pipeline for complex images; it delivers large accuracy gains over single-pass MLLMs (e.g., +14 pts on MMVET).

Evidence Ref: Table 1

Vision-based critic improves accuracy on average.

Numbers: Images: ~+5% average uplift; Videos: ~+3–4% uplift (Sections 6.2, 7.2)

Practical Use: Add a critic step to validate and often improve answers, but monitor for critic-induced errors.

Evidence Ref: Sections 6.2 and 7.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	74.24% (MMCT w/ critic)	GPT-4V 60.2%	+14.04 pp	MMVET	Table 1 shows MMCTAgent 74.24% vs GPT-4V 60.2%	Table 1
Accuracy	71.2% (MMCT w/ critic)	GPT-4V 63.5%	+7.7 pp	EgoSchema (500 Q subset)	Table 2 reports 71.2% vs GPT-4V 63.5%	Table 2
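The Delta column is an absolute difference in percentage points (pp), which is not the same as a relative improvement. The short sketch below recomputes both from the values in the table; the helper names `delta_pp` and `relative_gain` are illustrative, and the relative figures are derived here rather than reported in the source.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute gain in percentage points."""
    return round(value - baseline, 2)


def relative_gain(value: float, baseline: float) -> float:
    """Gain relative to the baseline, in percent."""
    return round(100 * (value - baseline) / baseline, 1)


# (MMCT w/ critic, GPT-4V) accuracies taken from the table above.
rows = {"MMVET": (74.24, 60.2), "EgoSchema": (71.2, 63.5)}

for name, (value, baseline) in rows.items():
    print(f"{name}: {delta_pp(value, baseline)} pp, "
          f"{relative_gain(value, baseline)}% relative")
```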

What To Try In 7 Days

Prototype a simple planner + OCR + object-detection loop on a small image VQA set to measure uplift vs single-pass MLLM.

Add a lightweight verifier step (separate model) that checks the answer against extracted visual facts.

Run MMCTAgent-style retrieval+indexing on one long video to test targeted frame retrieval vs naive chunking.
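As a starting point for the verifier step, the sketch below checks whether every content word in an answer is supported by facts already extracted by tools such as OCR or object detection. It is a deliberately simple rule-based stand-in for the separate verifier model suggested above; `verify`, `content_words`, the stopword list, and the example facts are all hypothetical.

```python
import re
from typing import Iterable, Set, Tuple

# Small illustrative stopword list; a real verifier would need a better one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "it", "says"}


def content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def verify(
    answer: str, facts: Iterable[str], min_support: float = 1.0
) -> Tuple[bool, Set[str]]:
    """Return (accepted, unsupported_words) for an answer given extracted facts."""
    answer_words = content_words(answer)
    fact_words: Set[str] = set()
    for fact in facts:
        fact_words |= content_words(fact)
    unsupported = answer_words - fact_words
    if not answer_words:
        return False, unsupported
    support = 1 - len(unsupported) / len(answer_words)
    return support >= min_support, unsupported


facts = ["OCR: SPEED LIMIT 40", "detector: traffic sign, pole"]
print(verify("The sign says speed limit 40", facts))
print(verify("The sign says speed limit 60", facts))
```

Word overlap catches only surface mismatches (here, the wrong number), so it is best treated as a baseline to compare a model-based verifier against.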

Agent Features

Memory

short-term retrieval/indexing for videos

no long-term persistent memory reported

Planning

iterative task decomposition with LLM

dynamic plan adaptation based on new evidence

Tool Use

VIT (vision interpreter), OCR, object detection, ASR, Azure Video Retriever, GPT-4V for localized visual analysis

Frameworks

ReAct, LLama_Index, GPT-4 (planner) + GPT-4V (critic/tool)

Is Agentic

Yes

Architectures

planner-reasoner agent

modular tool-augmented pipeline

vision-based critic module

Collaboration

single-agent orchestration of many tools (not multi-agent)

Optimization Features

Token Efficiency

selective frame retrieval reduces LLM context usage
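One common way to realise selective frame retrieval is to rank frames by embedding similarity to the query and pass only the top few to the LLM. The sketch below assumes frame and query embeddings already exist; it is a generic cosine-similarity stand-in, not the Azure Video Retriever used in the paper, and `top_k_frames` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_frames(
    query_vec: Sequence[float], frame_vecs: Sequence[Sequence[float]], k: int = 2
) -> List[int]:
    """Indices of the k frames most similar to the query, in chronological order."""
    ranked = sorted(
        range(len(frame_vecs)),
        key=lambda i: cosine(query_vec, frame_vecs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy 2-D embeddings: frames 2 and 3 point the same way as the query.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [0.0, 1.0]
print(top_k_frames(query, frames, k=2))
```

Only the selected frames are sent downstream, so context usage scales with `k` rather than with video length.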

Infra Optimization

GPU required for local tools; recommended 1xA100 80GB in experiments

System Optimization

distribute video frames into photo-grid images for critic API limits
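The packing step can be sketched as follows, assuming the 10-image-per-call limit noted under Limitations. The code only decides which frames share a grid and what grid shape to use; actual image tiling is left out. `pack_into_grids` and `grid_shape` are hypothetical helpers, not the authors' code.

```python
import math
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pack_into_grids(frames: Sequence[T], max_images: int = 10) -> List[List[T]]:
    """Split frames into at most `max_images` consecutive groups, one per grid."""
    if max_images < 1:
        raise ValueError("max_images must be at least 1")
    if not frames:
        return []
    # Smallest tiles-per-grid count that keeps the number of grids within the limit.
    per_grid = math.ceil(len(frames) / max_images)
    return [list(frames[i:i + per_grid]) for i in range(0, len(frames), per_grid)]


def grid_shape(n_tiles: int) -> Tuple[int, int]:
    """(rows, cols) of a near-square grid that holds n_tiles frames."""
    cols = math.ceil(math.sqrt(n_tiles))
    rows = math.ceil(n_tiles / cols)
    return rows, cols


frame_ids = list(range(45))          # e.g. 45 sampled frames
grids = pack_into_grids(frame_ids)   # at most 10 grid images
print(len(grids), [len(g) for g in grids], grid_shape(len(grids[0])))
```

More frames per grid means lower resolution per frame, so the grid size trades temporal coverage against visual detail.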

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Critic shares failure modes with the vision model when the same MLLM is used for both, causing acceptance of wrong answers (Appendix 16).

API limits for GPT-4V (10 frames per call) force photo-grid packing and limit full-video inspection.

When Not To Use

Low-latency, real-time systems where GPU-backed tool calls are infeasible.

Environments that cannot tolerate stochastic hallucinations or require strict verification guarantees.

Failure Modes

Base pipeline wrong and critic accepts wrong answer: 20.41% of samples (Appendix 16).

Critic can flip a correct base answer to wrong: 5.35% of samples (Appendix 16).
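To monitor these two failure modes in your own evaluation, it helps to log the pre-critic correctness, the critic's verdict, and the final correctness for each sample. The sketch below tallies both rates from such a log; the `Sample` record and the eight entries are made up for illustration and are unrelated to the percentages reported above.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    base_correct: bool     # was the pre-critic answer right?
    critic_accepted: bool  # did the critic accept the pre-critic answer?
    final_correct: bool    # was the answer right after the critic step?


def critic_failure_rates(samples: Iterable[Sample]) -> Dict[str, float]:
    """Percentage of samples falling into each critic failure mode."""
    counts: Counter = Counter()
    total = 0
    for s in samples:
        total += 1
        if not s.base_correct and s.critic_accepted:
            counts["wrong_accepted"] += 1
        elif s.base_correct and not s.final_correct:
            counts["correct_flipped"] += 1
    keys = ("wrong_accepted", "correct_flipped")
    if total == 0:
        return {k: 0.0 for k in keys}
    return {k: round(100 * counts[k] / total, 2) for k in keys}


# Made-up log of 8 samples, for illustration only.
log = [
    Sample(True, True, True),
    Sample(True, True, True),
    Sample(True, False, False),   # critic flipped a correct answer
    Sample(False, True, False),   # critic accepted a wrong answer
    Sample(False, True, False),   # critic accepted a wrong answer
    Sample(False, False, True),   # critic caught and fixed an error
    Sample(True, True, True),
    Sample(False, False, False),
]
rates = critic_failure_rates(log)
print(rates)
```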

Core Entities

Models

GPT-4, GPT-4V, Gemini, Claude 3, LLaVA, InstructBLIP, CLIP

Metrics

Accuracy

Datasets

MMVET, MMMU, MMBench, OKVQA, MathVista, EgoSchema, MMCT-QA, Youtube-8M

Benchmarks

MMVET, MMMU, MMBench, OKVQA, MathVista, EgoSchema, MMCT-QA
