An agent that plans, calls visual tools, and uses a vision-based critic to boost multimodal VQA

May 28, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Somnath Kumar, Yash Gadhia, Tanuja Ganu, Akshay Nambi

Links

Abstract / PDF

Why It Matters For Business

MMCTAgent improves accuracy on hard visual QA tasks by combining planning, specialist vision tools, and an automated visual verifier — useful for analytics, media search, and QA over long videos, but it adds compute and tool dependencies.

Summary TLDR

MMCTAgent is a modular agent-style pipeline that combines an LLM planner/reasoner, a suite of visual/audio/text tools, and a novel vision-based critic to iteratively solve hard image and long-form video question answering (VQA) tasks. In zero-shot tests across several image benchmarks and long videos, MMCTAgent outperforms strong multimodal models. Example results: MMVET 74.24% (MMCT w/ critic) vs GPT-4V 60.2%; EgoSchema 71.2% (MMCT w/ critic) vs GPT-4V 63.5%. The critic adds ~3–5 percentage points but can also introduce errors when it shares weaknesses with the base vision model.

Problem Statement

Modern multimodal LLMs still struggle with detailed visual reasoning, long video context, and verifying chain-of-thought. MMCTAgent tackles this by (1) decomposing queries with an LLM planner, (2) calling specialized vision/audio/text tools to gather grounded evidence, and (3) using an automated vision-based critic to check and refine answers.
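The three stages above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `run_agent`, `Trace`, and every `demo_*` stub are hypothetical names, and the planner, tools, reasoner, and critic are plain callables standing in for the LLM and vision models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    """Everything the critic inspects: the question, gathered evidence, and the answer."""
    question: str
    steps: List[Tuple[str, str]] = field(default_factory=list)  # (tool name, observation)
    answer: str = ""


def run_agent(
    question: str,
    planner: Callable[[str, str], List[Tuple[str, str]]],
    tools: Dict[str, Callable[[str], str]],
    reasoner: Callable[[Trace], str],
    critic: Callable[[Trace], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Trace:
    """Plan, call tools, answer, then let the critic accept or request another round."""
    trace = Trace(question)
    feedback = ""
    for _ in range(max_rounds):
        # (1) The planner turns the question (plus critic feedback) into tool calls.
        for tool_name, tool_input in planner(question, feedback):
            # (2) Each tool call adds grounded evidence to the trace.
            trace.steps.append((tool_name, tools[tool_name](tool_input)))
        trace.answer = reasoner(trace)
        # (3) The critic reviews the whole chain, not just the final answer.
        accepted, feedback = critic(trace)
        if accepted:
            break
    return trace


# Hypothetical stubs, used only to exercise the loop.
def demo_planner(question: str, feedback: str) -> List[Tuple[str, str]]:
    return [("detect", question)] if feedback else [("ocr", question)]


demo_tools = {
    "ocr": lambda q: "sign text: STOP",
    "detect": lambda q: "objects: stop sign, car",
}


def demo_reasoner(trace: Trace) -> str:
    return "; ".join(observation for _, observation in trace.steps)


def demo_critic(trace: Trace) -> Tuple[bool, str]:
    has_objects = any(name == "detect" for name, _ in trace.steps)
    return has_objects, ("" if has_objects else "no object evidence gathered")


result = run_agent("What does the sign say?", demo_planner, demo_tools,
                   demo_reasoner, demo_critic)
```

In this toy run the critic rejects the first round because only OCR evidence exists, so the planner adds a detection call in the second round before the answer is accepted.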

Main Contribution

MMCTAgent: a modular agent that iteratively plans, queries tools, and refines answers for image and long-form video VQA.

Vision-based critic: an automated verifier that derives task-specific criteria and evaluates the full multimodal reasoning chain.

Comprehensive zero-shot evaluation across image benchmarks and long-form video (EgoSchema) plus a new MMCT-QA dataset (129 QAs).

Key Findings

MMCTAgent yields higher accuracy than evaluated SOTA multimodal models on image benchmarks.

Numbers: MMVET 74.24% (MMCT w/ critic) vs GPT-4V 60.2% (Table 1)

Vision-based critic improves accuracy on average.

Numbers: Images ~+5% average uplift; Videos ~+3–4% uplift (Sections 6.2, 7.2)

MMCTAgent achieves strong long-form video results versus baselines.

Numbers: EgoSchema 71.2% (MMCT w/ critic) vs GPT-4V 63.5% (Table 2)

Critic sometimes accepts or causes wrong answers when sharing model weaknesses.

Numbers: Base-wrong & critic-accept 20.41%; critic-causes-wrong 5.35% (Appendix 16)

Results

Accuracy (MMVET)

Value: 74.24% (MMCT w/ critic)

Baseline: GPT-4V 60.2%

Accuracy (EgoSchema)

Value: 71.2% (MMCT w/ critic)

Baseline: GPT-4V 63.5%

Accuracy

Value: 71.3% (MMCT w/ critic)

Baseline: Baseline2 51.2%

Critic uplift

Value: Images ~+5% avg; Videos ~+3–4%

Baseline: MMCT w/o critic
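Accuracy gaps like these can be read either as percentage points or as relative improvement, and the two differ noticeably. The short sketch below computes both for the MMVET and EgoSchema figures reported in Key Findings; the `uplift` helper is a hypothetical name introduced only for this illustration.

```python
from typing import Tuple


def uplift(system_acc: float, baseline_acc: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 2 places."""
    gain = system_acc - baseline_acc
    return round(gain, 2), round(100 * gain / baseline_acc, 2)


# Figures taken from the summary: MMCT w/ critic vs GPT-4V.
mmvet = uplift(74.24, 60.2)
egoschema = uplift(71.2, 63.5)
print("MMVET:", mmvet)
print("EgoSchema:", egoschema)
```

On MMVET the gap is about 14 percentage points, which is roughly a 23% relative improvement over the GPT-4V baseline; on EgoSchema it is 7.7 points, or about 12% relative.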

Who Should Care

What To Try In 7 Days

Prototype a simple planner + OCR + object-detection loop on a small image VQA set to measure uplift vs single-pass MLLM.

Add a lightweight verifier step (separate model) that checks the answer against extracted visual facts.

Run MMCTAgent-style retrieval+indexing on one long video to test targeted frame retrieval vs naive chunking.
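For the verifier step, a first prototype does not need a second model at all. The sketch below is a rule-based stand-in, assuming facts arrive as plain strings from OCR or a detector: it flags answer words that no extracted fact supports. The names `verify`, `content_words`, and `STOPWORDS`, and the `min_support` threshold, are all hypothetical; a real setup would likely swap in a separate model as the item above suggests.

```python
import re
from typing import List, Set, Tuple

# Small illustrative stopword list; an assumption, not a standard resource.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to", "it"}


def content_words(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def verify(answer: str, facts: List[str], min_support: float = 0.5) -> Tuple[bool, List[str]]:
    """Accept the answer if enough of its content words appear in the extracted facts.

    Returns (accepted, unsupported_words) so the caller can feed the gaps back
    to the planner.
    """
    words = content_words(answer)
    if not words:
        return False, []
    evidence = set().union(*(content_words(f) for f in facts))
    unsupported = sorted(words - evidence)
    support = 1 - len(unsupported) / len(words)
    return support >= min_support, unsupported


facts = ["OCR: SPEED LIMIT 40", "detector: road sign, truck"]
supported = verify("The speed limit is 40", facts)
unsupported = verify("A red bicycle", facts)
```

Word overlap is a crude proxy for grounding, but it gives a baseline to measure a model-based verifier against.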

Agent Features

Memory

  • short-term retrieval/indexing for videos
  • no long-term persistent memory reported

Planning

  • iterative task decomposition with LLM
  • dynamic plan adaptation based on new evidence

Tool Use

  • VIT (vision interpreter)
  • OCR
  • object detection
  • ASR
  • Azure Video Retriever
  • GPT-4V for localized visual analysis

Frameworks

  • ReAct
  • LlamaIndex
  • GPT-4 (planner) + GPT-4V (critic/tool)

Is Agentic

true

Architectures

  • planner-reasoner agent
  • modular tool-augmented pipeline
  • vision-based critic module

Collaboration

  • single-agent orchestration of many tools (not multi-agent)

Optimization Features

Token Efficiency

  • selective frame retrieval reduces LLM context usage
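Selective retrieval can be illustrated with a toy top-k search over frame embeddings. This is a sketch under stated assumptions, not the Azure Video Retriever API or the paper's indexing code: embeddings are plain Python lists, and `cosine` and `top_k_frames` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_frames(query_vec: Sequence[float],
                 frame_vecs: List[Sequence[float]],
                 k: int = 3) -> List[int]:
    """Return indices of the k frames most similar to the query, in temporal order."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(frame_vecs)),
        reverse=True,
    )
    return sorted(idx for _, idx in scored[:k])


# Toy 2-D embeddings: frames 0 and 2 point roughly the same way as the query.
frame_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = top_k_frames([1.0, 0.0], frame_vecs, k=2)
```

Only the selected frames would then be passed to the LLM, which is where the context savings come from.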

Infra Optimization

  • GPU required for local tools; recommended 1xA100 80GB in experiments

System Optimization

  • distribute video frames into photo-grid images for critic API limits
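One way to plan such photo grids is to split the frame indices into at most as many groups as the API accepts images, then choose a near-square layout for each group. The sketch below assumes the 10-images-per-call limit noted under Limitations; the function name `plan_photo_grids` and the near-square layout are assumptions, since the summary does not describe the actual packing scheme.

```python
import math
from typing import List, Tuple


def plan_photo_grids(num_frames: int, max_images: int = 10) -> Tuple[int, int, List[List[int]]]:
    """Group frame indices into at most `max_images` grids.

    Returns (rows, cols, groups): every grid uses the same rows x cols layout,
    and each group lists the frame indices to tile into one grid image.
    """
    if num_frames <= 0:
        return 0, 0, []
    # Fewest frames per grid that still fits everything into max_images grids.
    per_grid = math.ceil(num_frames / max_images)
    # Near-square layout that holds per_grid tiles.
    cols = math.ceil(math.sqrt(per_grid))
    rows = math.ceil(per_grid / cols)
    groups = [
        list(range(start, min(start + per_grid, num_frames)))
        for start in range(0, num_frames, per_grid)
    ]
    return rows, cols, groups


rows, cols, groups = plan_photo_grids(64)
```

For 64 frames this yields ten grids of up to 7 frames each in a 3x3 layout, with the last grid only partly filled. Actual image tiling (for example with an imaging library) is left out to keep the sketch dependency-free.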

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Critic shares failure modes with the vision model when the same MLLM is used for both, causing acceptance of wrong answers (Appendix 16).
  • API limits for GPT-4V (10 frames per call) force photo-grid packing and limit full-video inspection.
  • High compute footprint (GPU, large LLMs and many tools) harms real-time or resource-constrained use.
  • Pipeline depends on many external tools; failures or poor tool outputs degrade results.

When Not To Use

  • Low-latency, real-time systems where GPU-backed tool calls are infeasible.
  • Environments that cannot tolerate stochastic hallucinations or require strict verification guarantees.
  • Cases where tool APIs are unavailable or cost-prohibitive.

Failure Modes

  • Base pipeline wrong and critic accepts wrong answer: 20.41% of samples (Appendix 16).
  • Critic can flip a correct base answer to wrong: 5.35% of samples (Appendix 16).
  • Hallucinations in OCR/spatial reading lead to incorrect evidence and downstream errors (Appendix 16).
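When reproducing this kind of analysis on your own data, the first two failure modes can be tallied from per-sample logs. The sketch below assumes each record holds three booleans (base answer correct, critic accepted, final answer correct) and reports each mode as a share of all samples; the record layout and the name `failure_rates` are hypothetical, and the exact definitions used in Appendix 16 may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (base_correct, critic_accepted, final_correct) for one sample.
Record = Tuple[bool, bool, bool]


def failure_rates(records: List[Record]) -> Dict[str, float]:
    """Percentage of all samples falling into each critic-related failure mode."""
    counts: Counter = Counter()
    for base_correct, critic_accepted, final_correct in records:
        if not base_correct and critic_accepted:
            # The critic let a wrong base answer through.
            counts["base_wrong_critic_accept"] += 1
        elif base_correct and not final_correct:
            # The critic's intervention flipped a correct answer to wrong.
            counts["critic_causes_wrong"] += 1
    total = len(records)
    modes = ("base_wrong_critic_accept", "critic_causes_wrong")
    return {mode: (100 * counts[mode] / total if total else 0.0) for mode in modes}


# Toy log with four samples, one in each failure mode.
toy_log = [
    (True, True, True),     # correct and accepted
    (False, True, False),   # wrong answer accepted by the critic
    (True, False, False),   # correct answer flipped to wrong
    (False, False, True),   # wrong answer repaired after rejection
]
rates = failure_rates(toy_log)
```

Tracking both rates separately shows whether the critic's main weakness is missed errors or harmful rewrites, which call for different fixes.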

Core Entities

Models

  • GPT-4
  • GPT-4V
  • Gemini
  • Claude 3
  • LLaVA
  • InstructBLIP
  • CLIP

Metrics

  • Accuracy

Datasets

  • MMVET
  • MMMU
  • MMBench
  • OKVQA
  • MathVista
  • EgoSchema
  • MMCT-QA
  • YouTube-8M

Benchmarks

  • MMVET
  • MMMU
  • MMBench
  • OKVQA
  • MathVista
  • EgoSchema
  • MMCT-QA