Agentic AI falls short when key signals hide inside images

December 24, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

Clear controlled experiment with concrete numeric gaps. Results are strong for the setup but limited to one synthetic dataset and one agentic pipeline variant.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

An Luo, Jin Du, Fangqiao Tian, Xun Xian, Robert Specht, Ganghua Wang, Xuan Bi, Charles Fleming, Jayanth Srinivasa, Ashish Kundu, Mingyi Hong, Jie Ding

Why It Matters For Business

Automated agentic pipelines that ignore images or other non-tabular signals can miss high-value predictive cues. For insurance, property, and other fields where visuals matter, adding simple image-processing steps can substantially improve forecasts and pricing.

Who Should Care

Summary TLDR

A synthetic property-insurance task hides a crucial variable (roof condition) inside aerial images. Generic agentic AI pipelines that use only tabular data score far worse (normalized Gini 0.3823) than methods that extract image-based domain signals (up to Gini 0.7719 with CLIP features, 0.8310 with perfect labels). The paper shows that agentic pipelines must be augmented to read multimodal, domain cues to reach human-like performance.
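The three scores quoted above can be put on a common scale with a little arithmetic. The snippet below is a minimal sketch that uses only those three reported numbers. The helper name `gap_closed` and the choice to treat the perfect-label score as the ceiling are illustrative assumptions, not something taken from the paper.

```python
# Normalized Gini values quoted in the summary above.
TABULAR_ONLY = 0.3823    # generic agentic pipeline, tabular features only
CLIP_FEATURES = 0.7719   # predictor given CLIP image features
PERFECT_LABELS = 0.8310  # predictor given the true roof-condition labels


def gap_closed(baseline: float, candidate: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by the candidate.

    Hypothetical helper: treating the perfect-label score as the ceiling
    is an assumption made for illustration.
    """
    return (candidate - baseline) / (ceiling - baseline)


lift = CLIP_FEATURES - TABULAR_ONLY
share = gap_closed(TABULAR_ONLY, CLIP_FEATURES, PERFECT_LABELS)

print(f"Absolute lift from CLIP features: {lift:.4f}")
print(f"Share of the gap to perfect labels recovered: {share:.1%}")
```

On these figures, CLIP features add about 0.39 normalized Gini over the tabular-only pipeline and recover roughly 87% of the distance to the perfect-label result.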

Problem Statement

Can current agentic AI (LLM-driven systems that auto-generate analytics code) match human data scientists when key predictive information is encoded outside the provided table — specifically inside images that require domain knowledge to interpret?

Main Contribution

Design of a controlled synthetic property-insurance dataset where a latent variable (RoofHealth: Good/Fair/Bad) is hidden from tabular features but encoded in overhead roof images.

A focused empirical comparison showing agentic AI using only tabular data performs much worse than approaches that extract image-based domain signals.

Key Findings

Generic agentic AI that uses only tabular features achieves low predictive ranking performance.

Numbers: Normalized Gini = 0.3823 (Agentic AI; Random Forest on tabular only)

Practical Use: Do not trust generic, tabular-only agentic pipelines when important signals may live in other modalities; add multimodal steps or human review.

Evidence Ref: Table II

Pretrained image embeddings (CLIP features), used as predictors, come close to human-expert performance.

Numbers: Normalized Gini = 0.7719 (RF + CLIP features)

Practical Use: Extract image embeddings with a pretrained model and feed them to your predictor. It is an efficient, high-impact way to recover hidden visual signals.

Evidence Ref: Table II

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Normalized Gini | 0.3823 | Agentic AI (tabular only) | - | Synthetic test set (n=1000) | Agentic AI random forest using only tabular features | Table II
Normalized Gini | 0.5042 | RF + CLIP clustered as 3 labels | vs tabular: +0.1219 | Synthetic test set (n=1000) | RF using CLIP embeddings clustered into 3 categories | Table II
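The second result row depends on clustering CLIP embeddings into three categories. The summary does not say which clustering algorithm was used. The sketch below assumes plain k-means with a deterministic farthest-point initialization, written in pure Python for readability. The function name `kmeans_labels` and the toy 2-D points are hypothetical; real CLIP embeddings have hundreds of dimensions.

```python
def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k=3, iters=50):
    """Assign each point to one of k clusters with basic k-means.

    Assumption: k-means stands in for the unspecified clustering step.
    Initialization is farthest-point, so results are deterministic.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        # Pick the point farthest from all centers chosen so far.
        far = max(points, key=lambda p: min(_sq_dist(p, c) for c in centers))
        centers.append(list(far))

    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: _sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-in for image embeddings: three well-separated groups.
toy_points = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0], [10, 0.2]]
toy_labels = kmeans_labels(toy_points, k=3)
print(toy_labels)
```

The resulting cluster ids are arbitrary integers, so they can serve as a categorical feature but carry no built-in Good/Fair/Bad meaning. Compressing embeddings to three ids discards information, which is consistent with this row scoring lower (0.5042) than the full CLIP features (0.7719).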

What To Try In 7 Days

Run a quick proof-of-concept: extract CLIP embeddings from available images and add them to your tabular model to measure lift.

Compare two simple pipelines: (a) tabular-only agentic pipeline and (b) tabular+image embeddings; report normalized Gini or business metric.

If images are available, try a vision-LM to produce human-readable labels and compare label quality vs embeddings.
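The first two steps above can be sketched as follows. This is a rough outline, not the paper's pipeline. It assumes the Hugging Face `transformers` CLIP classes, PyTorch, and Pillow are installed, and it uses the public `openai/clip-vit-base-patch32` checkpoint as a stand-in, since the summary does not say which CLIP variant was used. The function names `embed_images` and `add_image_features` are hypothetical.

```python
from typing import List, Sequence


def embed_images(
    image_paths: Sequence[str],
    model_name: str = "openai/clip-vit-base-patch32",  # assumed checkpoint
) -> List[List[float]]:
    """Return one L2-normalized CLIP image embedding per path.

    Heavy dependencies are imported lazily so the feature-joining helper
    below can be used without them.
    """
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Normalize so every embedding has unit length.
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.tolist()


def add_image_features(
    tabular_rows: Sequence[Sequence[float]],
    embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Append each image embedding to its matching tabular row."""
    if len(tabular_rows) != len(embeddings):
        raise ValueError("need exactly one embedding per tabular row")
    return [list(row) + list(emb) for row, emb in zip(tabular_rows, embeddings)]


# Joining step shown with made-up numbers instead of real embeddings.
combined = add_image_features(
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
)
print(combined)
```

For the comparison, fit the same model class twice on the same train/test split, once on the tabular rows alone and once on the combined rows, then score both with the same metric. Rows and image paths must be in the same order for the join to be valid.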

Agent Features

Planning
generic pipeline generation
Tool Use
Uses image embedding tool (CLIP)
Uses vision-language tool (gpt-4o-mini)
Uses text-to-image tool for dataset construction (gpt-image-1)
Is Agentic

Yes

Architectures
LLM-driven code generation (generic analytics pipeline)
single-agent pipeline (tabular-focused)
Collaboration
human-AI teaming discussed (humans supply domain labels)

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Synthetic dataset may idealize images: generated with text-to-image prompts and thus cleaner than real aerial imagery.

Study uses one domain (property insurance) and one latent variable design; results may not generalize to all tasks.

When Not To Use

Don't assume the numeric gaps hold for real-world data without testing; validate on your own datasets.

If your task is purely tabular with no external modalities, the conclusions about image signals do not apply.

Failure Modes

Agentic pipelines ignore non-tabular files and miss key variables encoded in images or text.

Vision-LM or automated labeling can be noisy and introduce label errors that reduce model performance.

Core Entities

Models

CLIP, gpt-4o-mini, gpt-image-1, Random Forest, Bayes-optimal (oracle)

Metrics

Normalized Gini
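The summary does not spell out how normalized Gini is computed. A minimal sketch follows, assuming the definition commonly used in insurance modeling: rank rows by predicted score, accumulate the actual outcomes in that order, and divide the resulting Gini by the Gini of a perfect ranking. The function names are hypothetical, and the paper's exact formula may differ.

```python
def _gini(actual, scores):
    """Unnormalized Gini of `actual` when rows are ranked by `scores`."""
    n = len(actual)
    total = float(sum(actual))
    if n == 0 or total == 0:
        raise ValueError("need at least one row and a non-zero total outcome")
    # Highest score first; ties keep their original order.
    order = sorted(range(n), key=lambda i: (-scores[i], i))
    running = 0.0
    area = 0.0
    for i in order:
        running += actual[i]
        area += running / total  # cumulative share of outcomes captured
    # Subtract the area expected from a random ordering, then scale by n.
    return (area - (n + 1) / 2.0) / n


def normalized_gini(actual, scores):
    """Gini of the model ranking divided by the Gini of a perfect ranking."""
    best = _gini(actual, actual)
    if best == 0:
        raise ValueError("outcomes are constant, so ranking quality is undefined")
    return _gini(actual, scores) / best


losses = [1, 2, 3, 4]
print(normalized_gini(losses, [1, 2, 3, 4]))  # ranking matches the outcomes
print(normalized_gini(losses, [4, 3, 2, 1]))  # ranking is fully reversed
```

Under this definition a perfect ranking scores 1, a random ranking scores about 0, and a fully reversed ranking scores -1. The metric depends only on the order of the predictions, not on their scale.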

Datasets

Synthetic property-insurance dataset (tabular + 1024x1024 roof images, 2000 policies)

Context Entities

Benchmarks

MLE-bench (related work), DSBench (related work)