Overview
SynEval is a practical, open framework with clear metrics and code; experiments use small samples and one domain, so validate on your full data before trusting results.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/8
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
SynEval helps teams judge if synthetic data is usable: it flags fidelity gaps, estimates downstream model performance, and surfaces privacy risk before data sharing.
Who Should Care
Summary TLDR
This paper introduces SynEval, an open-source toolkit that scores synthetic tabular data (including review text) along three axes: fidelity (structure, column distributions, text features), utility (train-on-synthetic, test-on-real for sentiment), and privacy (membership inference attacks). The authors run SynEval on synthetic product reviews generated by Claude 3 Opus, ChatGPT 3.5, and Llama 2 13B. Main takeaways: Claude yields the highest fidelity and the most realistic text; on small samples, all synthetic sets give sentiment-model accuracy comparable to real data; and membership inference success is high (83–91%), so privacy risk is real. Code: github.com/SCU-TrustworthyAI/SynEval.
Problem Statement
Practitioners lack a single, practical framework to measure how well LLMs generate synthetic tabular data: does generated data match real data statistics, remain useful for ML tasks, and preserve privacy? Existing metrics are fragmented and rarely target mixed tabular+text review data.
Main Contribution
SynEval: a unified, open-source framework that measures fidelity, utility, and privacy for synthetic tabular data with text.
Concrete fidelity metrics: structure preservation, integrity, column-shape (KS/TVD), plus text tests (sentiment, keywords, length).
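To make the column-shape idea concrete, here is a minimal sketch of the two statistics named above: the two-sample Kolmogorov-Smirnov (KS) statistic for a numeric column and the total variation distance (TVD) for a categorical column. The function names, the toy columns, and the choice to report 1 minus the statistic (so that higher means closer) are assumptions for illustration, not SynEval's actual API.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    points = sorted(set(real_sorted) | set(synth_sorted))
    gaps = []
    for x in points:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gaps.append(abs(cdf_real - cdf_synth))
    return max(gaps)


def tvd(real, synth):
    """Total variation distance between two categorical frequency tables."""
    p, q = Counter(real), Counter(synth)
    categories = set(p) | set(q)
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synth)) for c in categories
    )


# Hypothetical toy columns (not taken from the paper).
real_ratings = [1, 2, 3, 4, 5, 5]
synth_ratings = [3, 4, 4, 5, 5, 5]
real_verified = ["Y", "Y", "Y", "N"]
synth_verified = ["Y", "Y", "N", "N"]

ks = ks_statistic(real_ratings, synth_ratings)
tv = tvd(real_verified, synth_verified)

# Assumed convention: similarity score = 1 - distance, so 1.0 is a perfect match.
ks_score = 1.0 - ks
tv_score = 1.0 - tv
print(f"KS score: {ks_score:.3f}, TVD score: {tv_score:.3f}")
```

Both statistics lie in [0, 1], which makes per-column scores easy to average into a single column-shape number; whether SynEval aggregates them this way is not stated here.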
Key Findings
All three models preserved table columns and ordering exactly.
Claude produced the most faithful non-text data and text style among tested models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Structure Preserving Score | 100% for all models | — | — | software review subset | Table 1: all models 100% structure preservation | Table 1 |
| Data Integrity Score | Claude 98.4%, ChatGPT 93.9%, Llama 87.59% | — | — | non-text tabular columns | Table 1 data integrity numbers | Table 1 |
What To Try In 7 Days
Run SynEval on a small synthetic set to check schema and column shape.
Do a TSTR test for your key ML task to measure real-world utility.
Run a membership-inference test and deduplicate IDs before sharing data.
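A train-on-synthetic, test-on-real (TSTR) check can be sketched in a few lines. The example below trains a tiny bag-of-words Naive Bayes sentiment classifier on synthetic reviews and scores it on held-out real reviews, next to a train-on-real baseline. The classifier, the function names, and the toy reviews are all stand-ins chosen for illustration; the sentiment model used in the paper may differ.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Fit a multinomial Naive Bayes model on lowercase whitespace tokens."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest Laplace-smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, texts, labels):
    hits = sum(predict_nb(model, t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Hypothetical toy data (not from the paper).
synthetic_texts = ["great app works well", "love this software",
                   "terrible crashes often", "awful buggy waste"]
synthetic_labels = ["pos", "pos", "neg", "neg"]
real_train_texts = ["works great love it", "crashes and buggy"]
real_train_labels = ["pos", "neg"]
real_test_texts = ["love this great software", "awful app crashes"]
real_test_labels = ["pos", "neg"]

# TSTR: fit on synthetic data, evaluate on real held-out data.
tstr_acc = accuracy(train_nb(synthetic_texts, synthetic_labels),
                    real_test_texts, real_test_labels)
# Baseline: fit on real training data, evaluate on the same real test set.
trtr_acc = accuracy(train_nb(real_train_texts, real_train_labels),
                    real_test_texts, real_test_labels)
print(f"TSTR accuracy: {tstr_acc:.2f}, real-data baseline: {trtr_acc:.2f}")
```

The gap between the two accuracies is the quantity of interest: a small gap suggests the synthetic data carries enough signal for the task, though on samples this small the comparison is only indicative.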
Risks & Boundaries
Limitations
Small-scale generation: experiments used 300 entries per model, limiting generality.
Single domain: evaluation focuses on software product reviews from Amazon.
When Not To Use
When you require mathematical privacy guarantees (use differential privacy (DP) frameworks instead).
When your domain differs heavily from product reviews and you have not re-validated SynEval on it.
Failure Modes
Duplicate outputs reduce uniqueness and raise re-identification risk.
Numeric fields can be out-of-range unless prompts tightly constrain them.
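Both failure modes are cheap to screen for before running heavier metrics. The sketch below computes a uniqueness rate over chosen key fields and lists rows whose numeric field falls outside an allowed range. The field names (`rating`, `text`), the 1 to 5 range, and the helper names are hypothetical, chosen to resemble product-review data rather than taken from SynEval.

```python
def uniqueness_rate(rows, key_fields):
    """Share of rows whose key-field combination is distinct."""
    keys = [tuple(row[f] for f in key_fields) for row in rows]
    return len(set(keys)) / len(keys)


def out_of_range(rows, field, low, high):
    """Indices of rows whose numeric field lies outside [low, high]."""
    return [i for i, row in enumerate(rows) if not (low <= row[field] <= high)]


# Hypothetical synthetic rows (not from the paper).
rows = [
    {"rating": 5, "text": "Great tool"},
    {"rating": 1, "text": "Crashes on launch"},
    {"rating": 5, "text": "Great tool"},   # exact duplicate of row 0
    {"rating": 7, "text": "Pretty good"},  # rating outside the assumed 1-5 scale
]

unique_share = uniqueness_rate(rows, ["rating", "text"])
bad_rows = out_of_range(rows, "rating", 1, 5)
print(f"Uniqueness: {unique_share:.2f}, out-of-range rows: {bad_rows}")
```

Rows flagged here can be dropped or regenerated with tighter prompt constraints before the data is scored or shared.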

