Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Automated agentic pipelines that ignore images or other non-tabular signals can miss high-value predictive cues. For insurance, property, and other fields where visuals matter, adding simple image-processing steps can substantially improve forecasts and pricing.
Summary TLDR
A synthetic property-insurance task hides a crucial variable (roof condition) inside aerial images. Generic agentic AI pipelines that use only tabular data score far worse (normalized Gini 0.3823) than methods that extract image-based domain signals (up to Gini 0.7719 with CLIP features, 0.8310 with perfect labels). The paper shows that agentic pipelines must be augmented to read multimodal, domain cues to reach human-like performance.
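To put the three figures above on a common scale, the short calculation below estimates how much of the distance between the tabular-only baseline and the perfect-label result is recovered by CLIP features. This "share of gap closed" is an illustrative derived quantity, not a metric reported by the paper, and the helper name `share_of_gap_closed` is hypothetical. The ceiling used here is the perfect-label score, since the Bayes-optimal value is not quoted in this summary.

```python
# Normalized Gini values quoted in the summary above.
gini_tabular_only = 0.3823    # generic agentic pipeline, tabular features only
gini_clip = 0.7719            # best result with CLIP image features
gini_perfect_labels = 0.8310  # model given the true RoofHealth label


def share_of_gap_closed(baseline: float, method: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling distance recovered by `method`.

    Hypothetical helper for illustration; not a metric from the paper.
    """
    if ceiling <= baseline:
        raise ValueError("ceiling must be greater than baseline")
    return (method - baseline) / (ceiling - baseline)


clip_share = share_of_gap_closed(gini_tabular_only, gini_clip, gini_perfect_labels)
print(f"CLIP features recover {clip_share:.1%} of the gap")  # prints about 86.8%
```

Under these assumptions, CLIP features recover most, but not all, of the lift available from knowing the hidden variable exactly.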
Problem Statement
Can current agentic AI (LLM-driven systems that auto-generate analytics code) match human data scientists when key predictive information is encoded outside the provided table — specifically inside images that require domain knowledge to interpret?
Main Contribution
Design of a controlled synthetic property-insurance dataset where a latent variable (RoofHealth: Good/Fair/Bad) is hidden from tabular features but encoded in overhead roof images.
A focused empirical comparison showing agentic AI using only tabular data performs much worse than approaches that extract image-based domain signals.
Quantified performance gap using normalized Gini and a clear oracle upper bound to show room for improvement in agentic AI.
Demonstration that standard image embeddings (CLIP) and vision-language extraction (gpt-4o-mini) partially recover the hidden signal, but performance still depends on how image information is used.
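The normalized Gini used to quantify the gap can be sketched as follows. This is one common definition (the cumulative-loss form widely used in insurance pricing competitions), written in plain Python. The paper's exact formula is not given in this summary, so treat the implementation and the function names `gini` and `normalized_gini` as assumptions.

```python
from typing import Sequence


def gini(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Unnormalized Gini: how well `predicted` ranks the mass in `actual`."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = float(sum(actual))
    if total == 0:
        raise ValueError("sum of actual values must be non-zero")
    # Sort by predicted value, highest first; ties keep original order.
    order = sorted(range(n), key=lambda i: (-predicted[i], i))
    running = 0.0
    cumulative_sum = 0.0
    for i in order:
        running += actual[i]
        cumulative_sum += running
    return (cumulative_sum / total - (n + 1) / 2.0) / n


def normalized_gini(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Gini of the predictions divided by the Gini of a perfect ranking."""
    return gini(actual, predicted) / gini(actual, actual)


# Toy example: four policies with observed losses and two candidate rankings.
losses = [0.0, 0.0, 1.0, 3.0]
perfect_ranking = [0.1, 0.2, 0.5, 0.9]   # orders policies exactly by loss
reversed_ranking = [4.0, 3.0, 2.0, 1.0]  # orders policies worst-first

print(normalized_gini(losses, perfect_ranking))   # 1.0
print(normalized_gini(losses, reversed_ranking))  # -1.0
```

A score of 1 means the model ranks policies exactly as the observed losses do, and a score near 0 means the ranking is no better than random.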
Key Findings
Generic agentic AI that uses only tabular features achieves low ranking performance (normalized Gini 0.3823).
Pretrained image embeddings (CLIP features), used as predictors, nearly close the gap to human-expert performance (normalized Gini up to 0.7719).
Vision-language extraction (gpt-4o-mini) can recover the latent RoofHealth but yields lower final performance than CLIP features in this setup.
Perfect labeling of the latent image variable nearly matches the Bayes-optimal upper bound.
Results
Normalized Gini (per-method values not preserved here)
Who Should Care
What To Try In 7 Days
Run a quick proof-of-concept: extract CLIP embeddings from available images and add them to your tabular model to measure lift.
Compare two simple pipelines: (a) tabular-only agentic pipeline and (b) tabular+image embeddings; report normalized Gini or business metric.
If images are available, try a vision-LM to produce human-readable labels and compare label quality vs embeddings.
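The first two steps above come down to joining per-image embeddings onto the tabular rows before fitting a model. Below is a minimal sketch of that join in plain Python. It assumes embeddings have already been computed (for example with CLIP) and stored by policy ID. The function name `join_features`, the policy IDs, and the tiny 2-dimensional vectors are all hypothetical placeholders, not values from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def join_features(
    tabular: Dict[str, Sequence[float]],
    embeddings: Dict[str, Sequence[float]],
    embedding_dim: int,
) -> Tuple[List[str], List[List[float]]]:
    """Concatenate tabular features with image embeddings per policy.

    Policies without an image get a zero vector plus a has_image flag of 0.0,
    so a downstream model can tell "no image" apart from a real embedding.
    """
    policy_ids: List[str] = []
    rows: List[List[float]] = []
    for policy_id in sorted(tabular):
        embedding = embeddings.get(policy_id)
        has_image = embedding is not None
        if has_image and len(embedding) != embedding_dim:
            raise ValueError(f"unexpected embedding size for {policy_id}")
        vector = list(embedding) if has_image else [0.0] * embedding_dim
        flag = 1.0 if has_image else 0.0
        rows.append(list(tabular[policy_id]) + vector + [flag])
        policy_ids.append(policy_id)
    return policy_ids, rows


# Toy inputs: two tabular features per policy, 2-dimensional embeddings.
toy_tabular = {"P2": [1.0, 2.0], "P1": [3.0, 4.0]}
toy_embeddings = {"P1": [0.5, 0.25]}  # P2 has no image

ids, features = join_features(toy_tabular, toy_embeddings, embedding_dim=2)
for policy_id, row in zip(ids, features):
    print(policy_id, row)
```

Fitting the same model once on the tabular columns alone and once on the joined rows, then scoring both with the same metric on held-out policies, gives the lift comparison described in the list.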
Agent Features
Planning
- generic pipeline generation
Tool Use
- Uses image embedding tool (CLIP)
- Uses vision-language tool (gpt-4o-mini)
- Uses text-to-image tool for dataset construction (gpt-image-1)
Is Agentic
true
Architectures
- LLM-driven code generation (generic analytics pipeline)
- single-agent pipeline (tabular-focused)
Collaboration
- human-AI teaming discussed (humans supply domain labels)
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Synthetic dataset may idealize images: generated with text-to-image prompts and thus cleaner than real aerial imagery.
- Study uses one domain (property insurance) and one latent variable design; results may not generalize to all tasks.
- Agentic AI baseline is a generic tabular pipeline — other agentic designs that inspect image files were not tested.
- No public code or data links provided to verify or extend experiments.
When Not To Use
- Don't assume the numeric gaps hold for real-world data without testing; validate on your own datasets.
- If your task is purely tabular with no external modalities, the conclusions about image signals do not apply.
Failure Modes
- Agentic pipelines ignore non-tabular files and miss key variables encoded in images or text.
- Vision-LM or automated labeling can be noisy and introduce label errors that reduce model performance.
- Synthetic image styles may bias methods that work well on generated images but fail on raw real-world photos.
Core Entities
Models
- CLIP
- gpt-4o-mini
- gpt-image-1
- Random Forest
- Bayes-optimal (oracle)
Metrics
- Normalized Gini
Datasets
- Synthetic property-insurance dataset (tabular + 1024x1024 roof images, 2000 policies)
Context Entities
Benchmarks
- MLE-bench (related work)
- DSBench (related work)

