Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
SynEval helps teams judge whether synthetic data is usable: it flags fidelity gaps, estimates downstream model performance, and surfaces privacy risk before data is shared.
Summary TLDR
This paper introduces SynEval, an open-source toolkit to score synthetic tabular data (including review text) along three axes: fidelity (structure, column distributions, text features), utility (train-on-synthetic, test-on-real for sentiment), and privacy (membership inference attacks). The authors run SynEval on synthetic product reviews from Claude 3 Opus, ChatGPT 3.5, and Llama 2 13B. Main takeaways: Claude yields the highest fidelity and the most realistic text; on small samples, sentiment models trained on any of the synthetic sets reach accuracy comparable to models trained on real data; membership inference attacks succeed 83–91% of the time, so the privacy risk is real. Code: github.com/SCU-TrustworthyAI/SynEval.
Problem Statement
Practitioners lack a single, practical framework to measure how well LLMs generate synthetic tabular data. Three questions need answering together: does the generated data match the statistics of the real data, does it remain useful for ML tasks, and does it preserve privacy? Existing metrics are fragmented and rarely target mixed tabular+text review data.
Main Contribution
SynEval: a unified, open-source framework that measures fidelity, utility, and privacy for synthetic tabular data with text.
Concrete fidelity metrics: structure preservation, integrity, column-shape (KS/TVD), plus text tests (sentiment, keywords, length).
Empirical evaluation on Amazon review slices comparing Claude, ChatGPT, and Llama; includes practical recommendations.
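The column-shape metrics named above (KS for numeric columns, TVD for categorical ones) can be illustrated with a minimal sketch. It assumes the common convention of reporting each score as 1 minus the distance, so that 1.0 means identical marginal distributions; whether SynEval aggregates columns exactly this way is an assumption, and the function names here are hypothetical, not SynEval's API.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # Both CDFs are step functions, so the gap can only peak at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def total_variation_distance(real, synthetic):
    """Half the L1 distance between the two category frequency tables."""
    freq_real = Counter(real)
    freq_synth = Counter(synthetic)
    categories = set(freq_real) | set(freq_synth)
    return 0.5 * sum(
        abs(freq_real[c] / len(real) - freq_synth[c] / len(synthetic))
        for c in categories
    )


def column_shape_score(real, synthetic, numeric):
    """Assumed convention: 1 - distance, so higher means closer to the real column."""
    if numeric:
        return 1.0 - ks_statistic(real, synthetic)
    return 1.0 - total_variation_distance(real, synthetic)


if __name__ == "__main__":
    # Toy star ratings, invented for illustration only.
    real_ratings = [5, 4, 5, 3, 1, 5, 4, 2]
    synthetic_ratings = [5, 5, 4, 5, 4, 5, 4, 3]
    print(column_shape_score(real_ratings, synthetic_ratings, numeric=True))
```

A per-table score would then typically be the mean of the per-column scores, which hides individual bad columns, so inspecting the per-column values is usually worthwhile.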
Key Findings
All three models preserved table columns and ordering exactly.
Claude produced the most faithful non-text data and text style among tested models.
Generated review lengths were shorter than real reviews; Claude was closest.
Models trained on synthetic data gave sentiment accuracy close to models trained on real data.
Membership inference attacks succeed at high rates (83–91%) on all synthetic sets, indicating real privacy risk.
Llama produced many duplicates and wrong numeric/text formats without careful prompting.
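To make the membership-inference finding concrete, here is a minimal sketch of one generic attack, the distance-to-closest-record heuristic. This summary does not specify which attack SynEval implements, so treat this as a stand-in and not the paper's method: the Jaccard word-set distance, the 0.5 threshold, and all function names are assumptions chosen for illustration.

```python
def jaccard_distance(text_a, text_b):
    """1 - (shared words / all words) over lowercase word sets."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def mia_success_rate(members, non_members, synthetic, threshold=0.5):
    """Fraction of candidate records whose membership the attacker guesses correctly.

    members:     real records that were used to produce the synthetic data
    non_members: real records that were held out
    synthetic:   the released synthetic records
    """
    def guess_member(record):
        # The attacker guesses "member" when some synthetic record is very close.
        nearest = min(jaccard_distance(record, s) for s in synthetic)
        return nearest <= threshold

    correct = sum(guess_member(r) for r in members)
    correct += sum(not guess_member(r) for r in non_members)
    return correct / (len(members) + len(non_members))


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    members = ["great app works well"]
    non_members = ["terrible crashes constantly"]
    synthetic = ["great app works really well"]
    print(mia_success_rate(members, non_members, synthetic))
```

With balanced member and non-member sets, 50% corresponds to random guessing, which is the baseline against which any reported success rate should be read.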
Results
Structure Preserving Score
Data Integrity Score
Column Shapes Score
Average review length (words)
Accuracy
Mean Absolute Error (MAE) for sentiment model
Membership Inference Attack (MIA) success rate
Unique synthetic entries produced (requested 300)
Who Should Care
What To Try In 7 Days
Run SynEval on a small synthetic set to check schema and column shape.
Do a TSTR test for your key ML task to measure real-world utility.
Run a membership-inference test and deduplicate IDs before sharing data.
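The TSTR step above can be sketched as a small harness: train only on synthetic rows, then score on real rows with accuracy and MAE. The toy word-average rating model below is a placeholder assumed purely for illustration (it is not the sentiment model used in the paper), and the (text, rating) row layout is hypothetical.

```python
from collections import defaultdict


def train_word_rating_model(rows):
    """rows: iterable of (review_text, star_rating) pairs.

    Returns (word -> mean rating of reviews containing it, global mean rating).
    """
    totals = defaultdict(lambda: [0.0, 0])
    rating_sum = 0.0
    count = 0
    for text, rating in rows:
        rating_sum += rating
        count += 1
        for word in set(text.lower().split()):
            totals[word][0] += rating
            totals[word][1] += 1
    word_means = {word: s / c for word, (s, c) in totals.items()}
    return word_means, rating_sum / count


def predict_rating(model, text):
    """Average the per-word means, falling back to the global mean for unseen text."""
    word_means, fallback = model
    known = [word_means[w] for w in set(text.lower().split()) if w in word_means]
    estimate = sum(known) / len(known) if known else fallback
    return min(5, max(1, round(estimate)))


def tstr_scores(synthetic_train, real_test):
    """Train on synthetic rows, test on real rows; returns (accuracy, MAE)."""
    model = train_word_rating_model(synthetic_train)
    predictions = [predict_rating(model, text) for text, _ in real_test]
    truths = [rating for _, rating in real_test]
    accuracy = sum(p == t for p, t in zip(predictions, truths)) / len(truths)
    mae = sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)
    return accuracy, mae


if __name__ == "__main__":
    # Toy rows, invented for illustration only.
    synthetic_train = [("great tool", 5), ("awful tool", 1)]
    real_test = [("great", 5), ("awful", 1)]
    print(tstr_scores(synthetic_train, real_test))
```

The number is only meaningful next to a baseline: run the same harness with real training rows in place of `synthetic_train` and compare the two results on the same real test set.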
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small-scale generation: experiments used 300 entries per model, limiting generality.
- Single domain: evaluation focuses on software product reviews from Amazon.
- Privacy tests limited to MIA; no differential privacy guarantees measured.
When Not To Use
- When you require mathematical privacy guarantees (use DP frameworks instead).
- When your domain differs heavily from product reviews without re-validating SynEval.
Failure Modes
- Duplicate outputs reduce uniqueness and raise re-identification risk.
- Numeric fields can be out-of-range unless prompts tightly constrain them.
- Generated text is often shorter and less detailed than the original reviews.
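The first two failure modes can be caught with a quick audit before any scoring. The sketch below assumes each synthetic row is a dict with hypothetical `text` and `rating` fields and a 1 to 5 star scale; adjust the field names and bounds to the actual schema.

```python
def audit_synthetic_rows(rows, rating_field="rating", text_field="text", low=1, high=5):
    """Count near-verbatim duplicate texts and ratings that are non-numeric or out of range."""
    seen = set()
    duplicates = 0
    invalid_ratings = 0
    for row in rows:
        # Normalise case and whitespace so trivially reformatted copies still match.
        key = " ".join(str(row[text_field]).lower().split())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        value = row[rating_field]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not low <= value <= high:
            invalid_ratings += 1
    return {
        "total": len(rows),
        "unique": len(seen),
        "duplicates": duplicates,
        "invalid_ratings": invalid_ratings,
    }


if __name__ == "__main__":
    # Toy rows, invented for illustration only.
    rows = [
        {"text": "Good app", "rating": 5},
        {"text": "good  app", "rating": 7},
        {"text": "Bad", "rating": "four"},
    ]
    print(audit_synthetic_rows(rows))
```

Comparing `unique` against the number of rows requested gives the same kind of figure as the "unique synthetic entries produced" result reported for each model.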
Core Entities
Models
- Claude 3 Opus
- ChatGPT 3.5
- Llama 2 13B
Metrics
- Structure Preserving Score (SPS)
- Integrity Score (IS)
- Column Shape (KS statistic, TVD)
- Sentiment distribution
- Top keywords
- Average review length
- TSTR (Train-Synthetic-Test-Real)
- Accuracy
- Mean Absolute Error (MAE)
- Membership Inference Attack (MIA) success rate
Datasets
- Amazon product review dataset (software subset)

