SynEval: a compact framework to measure fidelity, utility and privacy of LLM-generated tabular reviews

April 20, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

SynEval is a practical, open framework with clear metrics and code; experiments use small samples and one domain, so validate on your full data before trusting results.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/8

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Yefeng Yuan, Yuhong Liu, Liang Cheng

Links

Abstract / PDF / Code

Why It Matters For Business

SynEval helps teams judge if synthetic data is usable: it flags fidelity gaps, estimates downstream model performance, and surfaces privacy risk before data sharing.

Summary TLDR

This paper introduces SynEval, an open-source toolkit to score synthetic tabular data (including review text) along three axes: fidelity (structure, column distributions, text features), utility (train-on-synthetic, test-on-real for sentiment), and privacy (membership inference attacks). The authors run SynEval on synthetic product reviews from Claude 3 Opus, ChatGPT 3.5, and Llama 2 13B. Main takeaways: Claude yields the highest fidelity and more realistic text; all synthetic sets give comparable sentiment-model accuracy to real data on small samples; but membership inference success is high (83–91%), so privacy risk is real. Code: github.com/SCU-TrustworthyAI/SynEval.
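The utility axis described above (train on synthetic, test on real) can be sketched in a few lines. This is a minimal illustration, not SynEval's implementation: the Naive Bayes classifier, the function names, and the toy reviews are all assumptions chosen to keep the example dependency-free.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized text."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for one review."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen during training
            # Laplace smoothing avoids log(0) for unseen (label, word) pairs.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def tstr_accuracy(synthetic, real):
    """Train on synthetic (text, label) pairs, report accuracy on real pairs."""
    texts, labels = zip(*synthetic)
    model = train_naive_bayes(texts, labels)
    correct = sum(predict(model, text) == label for text, label in real)
    return correct / len(real)


# Hypothetical toy data; a real run would use the full review tables.
synthetic_reviews = [
    ("great app works well", "pos"),
    ("love this software", "pos"),
    ("terrible crashes constantly", "neg"),
    ("awful waste of money", "neg"),
]
real_reviews = [
    ("works great love it", "pos"),
    ("crashes and awful support", "neg"),
]

print(tstr_accuracy(synthetic_reviews, real_reviews))
```

To judge utility, compare this score against the same classifier trained on real data and tested on the same real hold-out set; the gap between the two is the number that matters.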

Problem Statement

Practitioners lack a single, practical framework to measure how well LLMs generate synthetic tabular data: does generated data match real data statistics, remain useful for ML tasks, and preserve privacy? Existing metrics are fragmented and rarely target mixed tabular+text review data.

Main Contribution

SynEval: a unified, open-source framework that measures fidelity, utility, and privacy for synthetic tabular data with text.

Concrete fidelity metrics: structure preservation, integrity, column-shape (KS/TVD), plus text tests (sentiment, keywords, length).
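The two column-shape statistics named above have standard definitions, sketched here in plain Python. The function names, column values, and the "one minus the statistic" similarity convention are assumptions for illustration; production code would more likely call a library routine such as scipy.stats.ks_2samp.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted, syn_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    for x in sorted(set(real_sorted) | set(syn_sorted)):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_syn = bisect_right(syn_sorted, x) / len(syn_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_syn))
    return max_gap


def total_variation_distance(real, synthetic):
    """TVD between category frequencies: half the L1 distance."""
    real_freq, syn_freq = Counter(real), Counter(synthetic)
    categories = set(real_freq) | set(syn_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - syn_freq[c] / len(synthetic))
        for c in categories
    )


# Hypothetical columns: star ratings (numeric) and a verified flag (categorical).
real_ratings = [1, 2, 3, 4, 5, 5, 5, 4]
syn_ratings = [3, 4, 4, 5, 5, 5, 5, 5]
real_verified = ["Y", "Y", "N", "N"]
syn_verified = ["Y", "Y", "Y", "N"]

# Assumed convention: similarity = 1 - statistic, so 1.0 means identical shape.
print(1 - ks_statistic(real_ratings, syn_ratings))
print(1 - total_variation_distance(real_verified, syn_verified))
```

KS suits numeric columns because it compares cumulative distributions; TVD suits categorical columns because it only needs per-category frequencies.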

Key Findings

All three models preserved table columns and ordering exactly.

Numbers: Structure Preserving Score = 100% (Table 1)

Practical Use: In these experiments, prompting alone kept the schema and column names intact, so downstream pipelines that expect fixed columns should work without reformatting.

Evidence Ref: Table 1
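A schema check of this kind is easy to reproduce before feeding synthetic rows into a pipeline. The sketch below is one plausible reading of a structure-preservation score (share of columns that match by name and position); the exact formula and the column names are assumptions, not taken from the paper.

```python
def structure_preserving_score(real_columns, synthetic_columns):
    """Fraction of columns that match by name at the same position."""
    if not real_columns:
        raise ValueError("real_columns must not be empty")
    matches = sum(r == s for r, s in zip(real_columns, synthetic_columns))
    # Dividing by the longer schema penalizes both missing and extra columns.
    return matches / max(len(real_columns), len(synthetic_columns))


# Hypothetical review schema.
real_cols = ["rating", "title", "text", "verified_purchase"]
swapped_cols = ["rating", "text", "title", "verified_purchase"]

print(structure_preserving_score(real_cols, real_cols))
print(structure_preserving_score(real_cols, swapped_cols))
```

A score below 1.0 is a signal to fix the prompt or add a reordering step before any other metric is computed, since column-level comparisons assume aligned schemas.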

Claude produced the most faithful non-text data and text style among tested models.

Numbers: Data Integrity Claude 98.4% vs ChatGPT 93.9% vs Llama 87.59% (Table 1)

Practical Use: Prefer Claude-style prompts/models when fidelity matters; expect fewer out-of-range categories and fewer title duplicates.

Evidence Ref: Table 1
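The two integrity problems mentioned here, out-of-range values and duplicate titles, can be counted with a simple row-level check. This is a hedged sketch: the field names, the allowed value sets, and the "share of rows passing all checks" formula are assumptions, and the paper's Integrity Score may be computed differently.

```python
def integrity_score(rows, allowed_ratings=range(1, 6), allowed_flags=("Y", "N")):
    """Share of rows with in-range values and a title not seen earlier."""
    seen_titles = set()
    valid = 0
    for row in rows:
        title = row["title"].strip().lower()  # normalize before comparing
        ok = (
            row["rating"] in allowed_ratings
            and row["verified"] in allowed_flags
            and title not in seen_titles
        )
        seen_titles.add(title)
        valid += ok
    return valid / len(rows)


# Hypothetical synthetic rows: one duplicate title, one out-of-range rating.
synthetic_rows = [
    {"title": "Great editor", "rating": 5, "verified": "Y"},
    {"title": "Crashes often", "rating": 1, "verified": "N"},
    {"title": "great editor", "rating": 4, "verified": "Y"},
    {"title": "Too expensive", "rating": 7, "verified": "Y"},
]

print(integrity_score(synthetic_rows))
```

Logging which check failed for each row, rather than only the aggregate, makes it easier to tell whether the prompt needs tighter value constraints or a deduplication pass.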

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Structure Preserving Score | 100% for all models | - | - | software review subset | Table 1: all models 100% structure preservation | Table 1
Data Integrity Score | Claude 98.4%, ChatGPT 93.9%, Llama 87.59% | - | - | non-text tabular columns | Table 1 data integrity numbers | Table 1

What To Try In 7 Days

Run SynEval on a small synthetic set to check schema and column shape.

Do a TSTR test for your key ML task to measure real-world utility.

Run a membership-inference test and deduplicate IDs before sharing data.
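For the membership-inference step, a quick first pass is a distance-based attack: guess that a real record was used for generation if some synthetic record sits very close to it. The sketch below is an assumed stand-in, not the attack used in the paper; the Jaccard similarity, the 0.5 threshold, and the toy reviews are all illustrative choices.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def max_similarity(record, synthetic):
    """Similarity between a record and its closest synthetic neighbor."""
    return max(jaccard(record, s) for s in synthetic)


def mia_success_rate(members, non_members, synthetic, threshold=0.5):
    """Accuracy of the rule 'a close synthetic neighbor implies membership'."""
    correct = sum(max_similarity(r, synthetic) >= threshold for r in members)
    correct += sum(max_similarity(r, synthetic) < threshold for r in non_members)
    return correct / (len(members) + len(non_members))


# Hypothetical data: members were shown to the generator, non-members were not.
synthetic_texts = [
    "great photo editor easy to use",
    "crashes on startup every time",
]
member_texts = [
    "great photo editor very easy to use",
    "it crashes on startup every time",
]
non_member_texts = [
    "installer would not accept my license key",
    "decent tax software but pricey",
]

print(mia_success_rate(member_texts, non_member_texts, synthetic_texts))
```

With equal numbers of members and non-members, 50% is chance level, so a success rate well above that suggests the synthetic set leaks information about its source records.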

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small-scale generation: experiments used 300 entries per model, limiting generality.

Single domain: evaluation focuses on software product reviews from Amazon.

When Not To Use

When you require mathematical privacy guarantees (use differential privacy (DP) frameworks instead).

When your domain differs heavily from product reviews and you have not re-validated SynEval on it.

Failure Modes

Duplicate outputs reduce uniqueness and raise re-identification risk.

Numeric fields can be out-of-range unless prompts tightly constrain them.

Core Entities

Models

Claude 3 Opus, ChatGPT 3.5, Llama 2 13B

Metrics

Structure Preserving Score (SPS); Integrity Score (IS); Column Shape (KS statistic, TVD); Sentiment distribution; Top keywords; Average review length; TSTR (Train-Synthetic-Test-Real); Accuracy; Mean Absolute Error (MAE); Membership Inference Attack (MIA) success rate

Datasets

Amazon product review dataset (software subset)