Agentic AI falls short when key signals hide inside images

December 24, 2025 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

An Luo, Jin Du, Fangqiao Tian, Xun Xian, Robert Specht, Ganghua Wang, Xuan Bi, Charles Fleming, Jayanth Srinivasa, Ashish Kundu, Mingyi Hong, Jie Ding

Why It Matters For Business

Automated agentic pipelines that ignore images or other non-tabular signals can miss high-value predictive cues. For insurance, property, and other fields where visuals matter, adding simple image-processing steps can substantially improve forecasts and pricing.

Summary TLDR

A synthetic property-insurance task hides a crucial variable (roof condition) inside aerial images. Generic agentic AI pipelines that use only tabular data score far worse (normalized Gini 0.3823) than methods that extract image-based domain signals (normalized Gini of 0.7719 with CLIP features and 0.8310 with perfect labels). The paper argues that agentic pipelines must be extended to read multimodal, domain-specific cues to reach human-like performance.

Problem Statement

Can current agentic AI (LLM-driven systems that auto-generate analytics code) match human data scientists when key predictive information is encoded outside the provided table — specifically inside images that require domain knowledge to interpret?

Main Contribution

Design of a controlled synthetic property-insurance dataset where a latent variable (RoofHealth: Good/Fair/Bad) is hidden from tabular features but encoded in overhead roof images.

A focused empirical comparison showing agentic AI using only tabular data performs much worse than approaches that extract image-based domain signals.

Quantified performance gap using normalized Gini and a clear oracle upper bound to show room for improvement in agentic AI.

Demonstration that standard image embeddings (CLIP) and vision-language extraction (gpt-4o-mini) partially recover the hidden signal, but performance still depends on how image information is used.

Key Findings

Generic agentic AI that uses only tabular features achieves low predictive ranking performance.

Numbers: Normalized Gini = 0.3823 (Agentic AI; Random Forest on tabular only)

Pretrained image embeddings (CLIP features), used directly as predictors, close most of the gap to human-expert performance.

Numbers: Normalized Gini = 0.7719 (RF + CLIP features)

Vision-language extraction (gpt-4o-mini) can recover the latent RoofHealth but yields lower final performance than CLIP features in this setup.

Numbers: Normalized Gini = 0.7271; Corr. with true RoofHealth = 0.8062

Perfect labeling of the latent image variable nearly matches the Bayes-optimal upper bound.

Numbers: Normalized Gini = 0.8310 (true RoofHealth) vs Oracle = 0.8379
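
For readers unfamiliar with the metric, the sketch below shows one common way to compute a normalized Gini coefficient for ranked predictions. The source does not give its exact formula, so this definition (the model's Gini divided by the Gini of a perfect ranking), the function names, and the toy data are assumptions for illustration only.

```python
def gini(actual, predicted):
    """Unnormalized Gini: how well `predicted` ranks the `actual` values."""
    n = len(actual)
    total = sum(actual)
    if n == 0 or total == 0:
        raise ValueError("need at least one nonzero actual value")
    # Rank rows from highest to lowest prediction; ties keep input order.
    order = sorted(range(n), key=lambda i: (-predicted[i], i))
    running = 0.0
    area = 0.0
    for i in order:
        running += actual[i]
        area += running / total  # cumulative share of actuals captured so far
    # Subtract the area expected from a random ordering, then average.
    return (area - (n + 1) / 2.0) / n


def normalized_gini(actual, predicted):
    """Scale so that a perfect ranking scores 1.0 (assumed definition)."""
    return gini(actual, predicted) / gini(actual, actual)


# Hypothetical toy data: claim amounts and two sets of model scores.
claims = [0.0, 0.0, 1.0, 2.0]
perfect_scores = [0.1, 0.2, 0.5, 0.9]
reversed_scores = [0.9, 0.5, 0.2, 0.1]

print(normalized_gini(claims, perfect_scores))
print(normalized_gini(claims, reversed_scores))
```

Because ties are broken by input order in this sketch, heavily tied predictions can make the score depend on row order; shuffling rows first is one way to reduce that effect.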

Results

Normalized Gini by method:

  • Agentic AI (tabular only): 0.3823
  • RF + CLIP clustered as 3 labels: 0.5042
  • RF + CLIP features: 0.7719
  • RF + RoofHealth from gpt-4o-mini: 0.7271
  • RF + true RoofHealth: 0.8310
  • Oracle (Bayes-optimal expected loss): 0.8379
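
The reported scores can be put on a common scale by asking what share of the distance between the tabular-only pipeline and the oracle each method recovers. The short calculation below uses only the numbers listed above; the "gap closed" framing and the function name are our own illustration, not a metric from the paper.

```python
ORACLE = 0.8379        # Bayes-optimal upper bound reported above
TABULAR_ONLY = 0.3823  # generic agentic pipeline, tabular features only

reported = {
    "RF + CLIP clustered as 3 labels": 0.5042,
    "RF + RoofHealth from gpt-4o-mini": 0.7271,
    "RF + CLIP features": 0.7719,
    "RF + true RoofHealth": 0.8310,
}


def gap_closed(score, floor=TABULAR_ONLY, ceiling=ORACLE):
    """Fraction of the floor-to-ceiling distance recovered by `score`."""
    return (score - floor) / (ceiling - floor)


for name, score in reported.items():
    print(f"{name}: {gap_closed(score):.1%} of the gap closed")
```

By this measure the CLIP-feature model recovers roughly 86% of the gap and the gpt-4o-mini labels roughly 76%, while clustering CLIP features into three labels recovers only about a quarter.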

Who Should Care

What To Try In 7 Days

Run a quick proof-of-concept: extract CLIP embeddings from available images and add them to your tabular model to measure lift.

Compare two simple pipelines: (a) tabular-only agentic pipeline and (b) tabular+image embeddings; report normalized Gini or business metric.

If images are available, try a vision-LM to produce human-readable labels and compare label quality vs embeddings.
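
To check the quality of labels produced by a vision-LM, one option is to hand-label a small audit sample and compare. The sketch below is a minimal version of that check, assuming an ordinal encoding of Good/Fair/Bad that the source does not specify; the function names and the sample labels are hypothetical.

```python
from collections import Counter
from math import sqrt

# Assumed ordinal encoding; the source does not specify one.
ORDINAL = {"Good": 0, "Fair": 1, "Bad": 2}


def pearson(xs, ys):
    """Pearson correlation; undefined if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def label_quality(true_labels, predicted_labels):
    """Return accuracy, ordinal correlation, and confusion counts."""
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    corr = pearson([ORDINAL[t] for t in true_labels],
                   [ORDINAL[p] for p in predicted_labels])
    confusion = Counter(pairs)  # (true, predicted) -> count
    return accuracy, corr, confusion


# Hypothetical audit sample, not data from the paper.
hand_labels = ["Good", "Fair", "Bad", "Good", "Bad", "Fair"]
model_labels = ["Good", "Fair", "Bad", "Fair", "Bad", "Fair"]

accuracy, corr, confusion = label_quality(hand_labels, model_labels)
print(f"accuracy={accuracy:.3f}, correlation={corr:.3f}")
print(confusion.most_common())
```

The confusion counts show which classes the labeler mixes up, which a single correlation figure hides.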

Agent Features

Planning

  • generic pipeline generation

Tool Use

  • Uses image embedding tool (CLIP)
  • Uses vision-language tool (gpt-4o-mini)
  • Uses text-to-image tool for dataset construction (gpt-image-1)

Is Agentic

true

Architectures

  • LLM-driven code generation (generic analytics pipeline)
  • single-agent pipeline (tabular-focused)

Collaboration

  • human-AI teaming discussed (humans supply domain labels)

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • The synthetic dataset may idealize images: they were generated with text-to-image prompts and are likely cleaner than real aerial imagery.
  • Study uses one domain (property insurance) and one latent variable design; results may not generalize to all tasks.
  • The agentic AI baseline is a generic tabular pipeline; other agentic designs that inspect image files were not tested.
  • No public code or data links provided to verify or extend experiments.

When Not To Use

  • Don't assume the numeric gaps hold for real-world data without testing; validate on your own datasets.
  • If your task is purely tabular with no external modalities, the conclusions about image signals do not apply.

Failure Modes

  • Agentic pipelines ignore non-tabular files and miss key variables encoded in images or text.
  • Vision-LM or automated labeling can be noisy and introduce label errors that reduce model performance.
  • Synthetic image styles may bias methods that work well on generated images but fail on raw real-world photos.
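
One inexpensive guard against the first failure mode is to inventory the files that accompany a table before modeling begins, so that unused images are at least flagged for review. The sketch below is one guess at how such a check might look; the function name, suffix lists, and file names are hypothetical and do not come from the paper.

```python
from pathlib import Path
import tempfile

# Assumed suffix lists; extend them for the formats in your own data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
TABULAR_SUFFIXES = {".csv", ".tsv", ".parquet"}


def inventory_modalities(data_dir):
    """Group every file under `data_dir` by coarse modality."""
    found = {"tabular": [], "image": [], "other": []}
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in TABULAR_SUFFIXES:
            found["tabular"].append(path.name)
        elif suffix in IMAGE_SUFFIXES:
            found["image"].append(path.name)
        else:
            found["other"].append(path.name)
    return found


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["policies.csv", "roof_0001.png", "roof_0002.png", "notes.txt"]:
        (Path(tmp) / name).touch()
    inventory = inventory_modalities(tmp)

if inventory["image"]:
    print(f"{len(inventory['image'])} image files found; consider using them.")
```

A pipeline could surface this inventory to the agent or to a human reviewer before it commits to a tabular-only model.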

Core Entities

Models

  • CLIP
  • gpt-4o-mini
  • gpt-image-1
  • Random Forest
  • Bayes-optimal (oracle)

Metrics

  • Normalized Gini

Datasets

  • Synthetic property-insurance dataset (tabular + 1024x1024 roof images, 2000 policies)

Context Entities

Benchmarks

  • MLE-bench (related work)
  • DSBench (related work)