SynEval: a compact framework to measure fidelity, utility and privacy of LLM-generated tabular reviews

Overview

Decision Snapshot: Ready For Pilot

SynEval is a practical, open framework with clear metrics and code; experiments use small samples and one domain, so validate on your full data before trusting results.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/8

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Yefeng Yuan, Yuhong Liu, Liang Cheng

Links

Abstract / PDF / Code

Why It Matters For Business

SynEval helps teams judge if synthetic data is usable: it flags fidelity gaps, estimates downstream model performance, and surfaces privacy risk before data sharing.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

This paper introduces SynEval, an open-source toolkit to score synthetic tabular data (including review text) along three axes: fidelity (structure, column distributions, text features), utility (train-on-synthetic, test-on-real for sentiment), and privacy (membership inference attacks). The authors run SynEval on synthetic product reviews from Claude 3 Opus, ChatGPT 3.5, and Llama 2 13B. Main takeaways: Claude yields the highest fidelity and more realistic text; all synthetic sets give comparable sentiment-model accuracy to real data on small samples; but membership inference success is high (83–91%), so privacy risk is real. Code: github.com/SCU-TrustworthyAI/SynEval.
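The utility axis (train on synthetic, test on real) can be illustrated with a small sketch. This is not SynEval's implementation: the tiny Naive Bayes classifier, the function names, and the toy reviews are all assumptions chosen to keep the example self-contained. The idea is to train the same model once on real data and once on synthetic data, score both on held-out real data, and compare.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(rows):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in rows:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}

def predict(model, text):
    """Return the label with the highest Laplace-smoothed log-probability."""
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["label_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n / total)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, rows):
    return sum(predict(model, text) == label for text, label in rows) / len(rows)

def tstr_gap(real_train, synthetic_train, real_test):
    """Accuracy when trained on real vs. synthetic data, both tested on real data."""
    acc_real = accuracy(train_naive_bayes(real_train), real_test)
    acc_syn = accuracy(train_naive_bayes(synthetic_train), real_test)
    return acc_real, acc_syn, acc_real - acc_syn

# Hypothetical toy reviews, for illustration only.
real_train = [("great app works well", "pos"), ("terrible buggy crashes", "neg")]
synthetic_train = [("great software works perfectly", "pos"),
                   ("buggy and terrible install", "neg")]
real_test = [("works great", "pos"), ("crashes and buggy", "neg")]

acc_real, acc_syn, gap = tstr_gap(real_train, synthetic_train, real_test)
print(f"train-on-real={acc_real:.2f} train-on-synthetic={acc_syn:.2f} gap={gap:.2f}")
```

A gap near zero suggests the synthetic set carries similar task signal to the real one. On samples this small the number is noisy, so treat it as a smoke test rather than evidence.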

Problem Statement

Practitioners lack a single, practical framework to measure how well LLMs generate synthetic tabular data: does generated data match real data statistics, remain useful for ML tasks, and preserve privacy? Existing metrics are fragmented and rarely target mixed tabular+text review data.

Main Contribution

SynEval: a unified, open-source framework that measures fidelity, utility, and privacy for synthetic tabular data with text.

Concrete fidelity metrics: structure preservation, integrity, and column shape (Kolmogorov-Smirnov statistic for numeric columns, total variation distance for categorical columns), plus text tests (sentiment, keywords, length).
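To make the column-shape metrics concrete, here is a minimal sketch of the two distances named above: the two-sample Kolmogorov-Smirnov (KS) statistic for a numeric column and the total variation distance (TVD) for a categorical one. The function names and sample values are assumptions for illustration, and SynEval may normalize or aggregate these differently.

```python
from collections import Counter

def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    real_sorted, syn_sorted = sorted(real), sorted(synthetic)
    n, m = len(real_sorted), len(syn_sorted)
    i = j = 0
    d = 0.0
    for x in sorted(set(real_sorted) | set(syn_sorted)):
        # Advance each pointer past every value <= x to get the CDF at x.
        while i < n and real_sorted[i] <= x:
            i += 1
        while j < m and syn_sorted[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def total_variation_distance(real, synthetic):
    """Half the summed absolute difference between category frequencies."""
    p, q = Counter(real), Counter(synthetic)
    n, m = len(real), len(synthetic)
    return 0.5 * sum(abs(p[c] / n - q[c] / m) for c in set(p) | set(q))

# Hypothetical columns, for illustration only.
real_ratings = [5, 4, 4, 3, 5, 1]
synthetic_ratings = [5, 5, 4, 4, 5, 2]
real_verified = ["yes", "yes", "no", "no"]
synthetic_verified = ["yes", "yes", "yes", "no"]

print("KS :", ks_statistic(real_ratings, synthetic_ratings))
print("TVD:", total_variation_distance(real_verified, synthetic_verified))
```

Both distances lie in [0, 1], with 0 meaning identical distributions. Some toolkits report one minus the distance so that higher is better.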

Key Findings

All three models preserved table columns and ordering exactly.

Numbers: Structure Preserving Score = 100% (Table 1)

Practical Use: You can rely on prompts to keep schema/column names; downstream pipelines that expect fixed columns will work without reformatting.

Evidence Ref: Table 1
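A schema check of this kind is cheap to run on your own output before trusting a pipeline. The sketch below assumes that "structure preserved" means the synthetic header matches the real header in both names and order. That definition, the helper names, and the column names are assumptions, not SynEval's code.

```python
import csv
import io

def read_header(csv_text):
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))

def schema_diff(real_header, synthetic_header):
    """Report missing columns, extra columns, and whether names and order match."""
    return {
        "missing": [c for c in real_header if c not in synthetic_header],
        "extra": [c for c in synthetic_header if c not in real_header],
        "same_order": real_header == synthetic_header,
    }

# Hypothetical review tables, for illustration only.
real_csv = "rating,title,text\n5,Good,Works fine\n"
good_csv = "rating,title,text\n4,Okay,Installs quickly\n"
bad_csv = "title,rating,review\nOkay,4,Installs quickly\n"

print(schema_diff(read_header(real_csv), read_header(good_csv)))
print(schema_diff(read_header(real_csv), read_header(bad_csv)))
```

An empty `missing` and `extra` list with `same_order` set to true corresponds to a fully preserved structure under this assumed definition.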

Claude produced the most faithful non-text data and text style among tested models.

Numbers: Data Integrity Claude 98.4% vs ChatGPT 93.9% vs Llama 87.59% (Table 1)

Practical Use: Prefer Claude-style prompts/models when fidelity matters; expect fewer out-of-range categories and fewer title duplicates.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Structure Preserving Score	100% for all models	—	—	software review subset	Table 1: all models 100% structure preservation	Table 1
Data Integrity Score	Claude 98.4%, ChatGPT 93.9%, Llama 87.59%	—	—	non-text tabular columns	Table 1 data integrity numbers	Table 1

What To Try In 7 Days

Run SynEval on a small synthetic set to check schema and column shape.

Do a TSTR test for your key ML task to measure real-world utility.

Run a membership-inference test and deduplicate IDs before sharing data.
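For the last step, a rough first pass can be done without any ML tooling. The sketch below is a simple similarity-threshold attack, which is an assumption on my part and not necessarily the attack SynEval runs: a candidate record is guessed to be a training member if some synthetic record is close to it in Jaccard word overlap. The `review_id` field, the threshold, and the toy texts are likewise hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two texts, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def predicted_member(candidate, synthetic_texts, threshold):
    """Guess 'member' if any synthetic text is at least `threshold` similar."""
    return max(jaccard(candidate, s) for s in synthetic_texts) >= threshold

def mia_success_rate(members, non_members, synthetic_texts, threshold=0.5):
    """Attack accuracy over known members and known non-members."""
    correct = sum(predicted_member(t, synthetic_texts, threshold) for t in members)
    correct += sum(not predicted_member(t, synthetic_texts, threshold)
                   for t in non_members)
    return correct / (len(members) + len(non_members))

def deduplicate_by_id(rows, id_key="review_id"):
    """Keep the first row for each ID and drop later repeats."""
    seen, unique = set(), []
    for row in rows:
        if row[id_key] not in seen:
            seen.add(row[id_key])
            unique.append(row)
    return unique

# Hypothetical data, for illustration only.
members = ["fast install and easy setup", "crashes on startup every time"]
non_members = ["license key never arrived", "refund took three weeks"]
synthetic = ["fast install and easy setup guide", "crashes on startup every morning"]

rate = mia_success_rate(members, non_members, synthetic)
rows = deduplicate_by_id([{"review_id": "r1"}, {"review_id": "r1"}, {"review_id": "r2"}])
print(f"attack success rate: {rate:.2f}, rows after dedup: {len(rows)}")
```

With equal numbers of members and non-members, 0.5 is chance level, so a rate well above that indicates the synthetic set is leaking information about its source records.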

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/SCU-TrustworthyAI/SynEval

Risks & Boundaries

Limitations

Small-scale generation: experiments used 300 entries per model, limiting generality.

Single domain: evaluation focuses on software product reviews from Amazon.

When Not To Use

When you require mathematical privacy guarantees (use differential privacy (DP) frameworks instead).

When your domain differs heavily from product reviews and you have not re-validated SynEval on it.

Failure Modes

Duplicate outputs reduce uniqueness and raise re-identification risk.

Numeric fields can be out-of-range unless prompts tightly constrain them.
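Both failure modes can be screened with a few lines before any deeper evaluation. The sketch below is illustrative only: the field names, the 1 to 5 rating bounds, and the helper names are assumptions, and it is not how SynEval computes its integrity score.

```python
def duplicate_rate(values):
    """Fraction of entries that repeat an earlier entry (0.0 means all unique)."""
    if not values:
        return 0.0
    return 1 - len(set(values)) / len(values)

def out_of_range_rows(rows, bounds):
    """Return indices of rows with a missing, non-numeric, or out-of-bounds field."""
    bad = []
    for i, row in enumerate(rows):
        for field, (low, high) in bounds.items():
            try:
                value = float(row[field])
            except (KeyError, TypeError, ValueError):
                bad.append(i)
                break
            if not low <= value <= high:
                bad.append(i)
                break
    return bad

# Hypothetical synthetic rows, for illustration only.
rows = [
    {"rating": 5, "title": "Great"},
    {"rating": 7, "title": "Great"},    # rating above the assumed 1-5 scale
    {"rating": "n/a", "title": "Bad"},  # rating is not numeric
]

print("duplicate title rate:", duplicate_rate([r["title"] for r in rows]))
print("rows failing bounds :", out_of_range_rows(rows, {"rating": (1, 5)}))
```

Rows flagged here are candidates for regeneration with a tighter prompt or for filtering before the data is used downstream.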

Core Entities

Models

Claude 3 Opus

ChatGPT 3.5

Llama 2 13B

Metrics

Structure Preserving Score (SPS)

Integrity Score (IS)

Column Shape (KS statistic, TVD)

Sentiment distribution

Top keywords

Average review length

TSTR (Train-Synthetic-Test-Real)

Accuracy

Mean Absolute Error (MAE)

Membership Inference Attack (MIA) success rate

Datasets

Amazon product review dataset (software subset)
