SynEval: a compact framework to measure fidelity, utility and privacy of LLM-generated tabular reviews

April 20, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Yefeng Yuan, Yuhong Liu, Liang Cheng

Why It Matters For Business

SynEval helps teams judge if synthetic data is usable: it flags fidelity gaps, estimates downstream model performance, and surfaces privacy risk before data sharing.

Summary TLDR

This paper introduces SynEval, an open-source toolkit that scores synthetic tabular data (including review text) along three axes: fidelity (structure, column distributions, text features), utility (train on synthetic, test on real, for sentiment), and privacy (membership inference attacks). The authors run SynEval on synthetic product reviews from Claude 3 Opus, ChatGPT 3.5, and Llama 2 13B. Main takeaways: Claude yields the highest data integrity and the most realistic text; models trained on Claude and ChatGPT data come within about 0.6 points of real-data sentiment accuracy on small samples, while Llama trails by about 5.7 points; and membership inference success is high (83–91%), so privacy risk is real. Code: github.com/SCU-TrustworthyAI/SynEval.

Problem Statement

Practitioners lack a single, practical framework to measure how well LLMs generate synthetic tabular data: does generated data match real data statistics, remain useful for ML tasks, and preserve privacy? Existing metrics are fragmented and rarely target mixed tabular+text review data.

Main Contribution

SynEval: a unified, open-source framework that measures fidelity, utility, and privacy for synthetic tabular data with text.

Concrete fidelity metrics: structure preservation, integrity, column-shape (KS/TVD), plus text tests (sentiment, keywords, length).

Empirical evaluation on Amazon review slices comparing Claude, ChatGPT, and Llama; includes practical recommendations.
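The column-shape metrics named above (KS for numeric columns, TVD for categorical ones) can be sketched in a few lines. This is a minimal illustration and not SynEval's own code: the function names are hypothetical, and reporting the score as one minus the distance (so 1.0 means identical distributions) is an assumption about how the percentages are derived.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def tvd(real, synth):
    """Total variation distance between two categorical columns."""
    p, q = Counter(real), Counter(synth)
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synth)) for c in set(p) | set(q)
    )


def column_shape_score(distance):
    # Assumption: the score is reported as 1 - distance (higher is better).
    return 1.0 - distance


# Hypothetical columns: star ratings (numeric) and a verified flag (categorical).
real_ratings, synth_ratings = [5, 4, 4, 3, 1], [5, 5, 4, 4, 4]
real_flags, synth_flags = ["Y", "Y", "N", "N"], ["Y", "N", "N", "N"]

print(column_shape_score(ks_statistic(real_ratings, synth_ratings)))
print(column_shape_score(tvd(real_flags, synth_flags)))
```

Averaging the per-column scores would give a single table-level number, though whether SynEval aggregates this way is not stated here.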

Key Findings

All three models preserved table columns and ordering exactly.

Numbers: Structure Preserving Score = 100% (Table 1)

Claude scored highest on data integrity and produced the most realistic text style among the tested models.

Numbers: Data Integrity: Claude 98.4% vs ChatGPT 93.9% vs Llama 87.59% (Table 1)

Generated review lengths were shorter than real reviews; Claude was closest.

Numbers: Avg words: Real 59.26, Claude 40.48, ChatGPT 16.55, Llama 18.69 (Table 2)

Models trained on Claude and ChatGPT synthetic data reached sentiment accuracy close to that of models trained on real data; Llama trailed.

Numbers: Accuracy: Claude 67.68%, ChatGPT 67.35%, Llama 62.26%, Real 67.92% (Table 3)

Membership inference attacks succeed at high rates on all synthetic sets, indicating real privacy risk.

Numbers: MIA success: Claude 91%, ChatGPT 90%, Llama 83% (Table 4)

Llama produced many duplicates and wrong numeric/text formats without careful prompting.

Numbers: Unique samples: Claude 300, ChatGPT 292, Llama 115 (Section 4.1)
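Counting unique samples, as in the last finding, can be approximated with a short deduplication pass. The sketch below is an illustration only: how the authors define a duplicate is not stated here, so treating rows as equal after lowercasing and whitespace normalization is an assumption, and the row fields are hypothetical.

```python
def normalize(value):
    """Lowercase strings and collapse whitespace so trivial variants compare equal."""
    if isinstance(value, str):
        return " ".join(value.lower().split())
    return value


def unique_rows(rows):
    """Drop duplicate rows, keeping the first occurrence and the original order."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted((col, normalize(val)) for col, val in row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical generated rows; the second differs from the first only in case and spacing.
rows = [
    {"rating": 5, "text": "Great app"},
    {"rating": 5, "text": "great  app"},
    {"rating": 1, "text": "Crashes on launch"},
]
deduped = unique_rows(rows)
print(f"{len(deduped)} unique of {len(rows)} requested")
```

Comparing the unique count with the number of rows requested gives a quick signal of how much prompting or post-filtering a model needs.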

Results

Structure Preserving Score

Value: 100% for all models

Data Integrity Score

Value: Claude 98.4%, ChatGPT 93.9%, Llama 87.59%

Column Shapes Score

Value: Claude 80.92%, ChatGPT 80.97%, Llama 62.29%

Average review length (words)

Value: Real 59.26, Claude 40.48, ChatGPT 16.55, Llama 18.69

Baseline: Real 59.26

Accuracy (sentiment model)

Value: Claude 67.68%, ChatGPT 67.35%, Llama 62.26%, Real 67.92%

Baseline: Real 67.92%

Mean Absolute Error (MAE) for sentiment model

Value: Claude 1.2929, ChatGPT 1.2041, Llama 1.4151, Real 1.3019

Baseline: Real 1.3019

Membership Inference Attack (MIA) success rate

Value: Claude 91%, ChatGPT 90%, Llama 83%

Unique synthetic entries produced (requested 300)

Value: Claude 300, ChatGPT 292, Llama 115

Baseline: requested 300
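For readers who want to reproduce the accuracy and MAE rows on their own predictions, both metrics reduce to a few lines. This is a generic sketch rather than the authors' evaluation code, and it assumes the sentiment labels are ordinal integers such as star ratings, which the summary does not confirm.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between the true and predicted labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical star ratings for four real test reviews.
y_true = [5, 4, 1, 3]
y_pred = [5, 2, 1, 4]
print(accuracy(y_true, y_pred))             # 0.5
print(mean_absolute_error(y_true, y_pred))  # 0.75
```

Reporting both is useful on ordinal labels: accuracy treats every miss alike, while MAE shows how far off the misses are.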

Who Should Care

What To Try In 7 Days

Run SynEval on a small synthetic set to check schema and column shape.

Do a TSTR test for your key ML task to measure real-world utility.

Run a membership-inference test and deduplicate IDs before sharing data.
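The TSTR step can be prototyped without any framework: fit a model on synthetic reviews only, then score it on held-out real reviews. The sketch below is a minimal illustration under stated assumptions. The Naive Bayes classifier is a stand-in chosen for brevity, not the sentiment model used in the paper, and all names and sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Fit word counts per label for a multinomial Naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def tstr_accuracy(synth_texts, synth_labels, real_texts, real_labels):
    """Train on synthetic data only, then measure accuracy on real data."""
    model = train_naive_bayes(synth_texts, synth_labels)
    hits = sum(predict(model, t) == y for t, y in zip(real_texts, real_labels))
    return hits / len(real_labels)


synth_texts = ["great app love it", "terrible buggy crash"]
synth_labels = ["pos", "neg"]
real_texts = ["love this great tool", "buggy and terrible"]
real_labels = ["pos", "neg"]
print(tstr_accuracy(synth_texts, synth_labels, real_texts, real_labels))
```

Running the same function with real training data in place of the synthetic set gives the baseline to compare against.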

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small-scale generation: experiments used 300 entries per model, limiting generality.
  • Single domain: evaluation focuses on software product reviews from Amazon.
  • Privacy tests limited to MIA; no differential privacy guarantees measured.
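To make the MIA limitation concrete, here is one simple way such an attack can be framed. The summary does not describe how the authors build their attack, so this distance-to-closest-record heuristic is a swapped-in illustration and not their method. The record layout, threshold, and function names are all hypothetical.

```python
def hamming_distance(a, b):
    """Number of fields on which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def infer_membership(candidate, synthetic, threshold):
    """Guess 'member' if some synthetic record is within `threshold` fields of the candidate."""
    return min(hamming_distance(candidate, s) for s in synthetic) <= threshold


def attack_success_rate(members, non_members, synthetic, threshold=1):
    """Fraction of candidates whose membership status the attack guesses correctly."""
    correct = sum(infer_membership(r, synthetic, threshold) for r in members)
    correct += sum(not infer_membership(r, synthetic, threshold) for r in non_members)
    return correct / (len(members) + len(non_members))


# Hypothetical records: (rating, product id, review text).
synthetic = [(5, "A", "good"), (1, "B", "poor")]
members = [(5, "A", "good"), (1, "B", "bad")]      # used to prompt the generator
non_members = [(3, "C", "ok"), (2, "D", "meh")]    # held out from generation
print(attack_success_rate(members, non_members, synthetic, threshold=1))
```

A success rate near 50% on a balanced candidate set would mean the attacker does no better than guessing; rates well above that suggest the synthetic rows stay close to their source records. Unlike differential privacy, this gives an empirical signal only, not a guarantee.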

When Not To Use

  • When you require mathematical privacy guarantees (use DP frameworks instead).
  • When your domain differs heavily from product reviews and you have not re-validated SynEval on it.

Failure Modes

  • Duplicate outputs reduce uniqueness and raise re-identification risk.
  • Numeric fields can be out-of-range unless prompts tightly constrain them.
  • Generated text is often shorter and less detailed than the original reviews.
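Out-of-range and malformed fields like those listed above are cheap to catch with per-column rules before any deeper evaluation. The following is a hedged sketch: the column names and rules are hypothetical, and the cell-level pass rate it computes is an illustration, not the paper's Integrity Score definition.

```python
def integrity_violations(rows, rules):
    """Return (row_index, column) pairs whose value is missing or fails the column's rule."""
    bad = []
    for i, row in enumerate(rows):
        for column, is_valid in rules.items():
            if column not in row or not is_valid(row[column]):
                bad.append((i, column))
    return bad


def integrity_pass_rate(rows, rules):
    """Share of checked cells that pass their rule."""
    checked = len(rows) * len(rules)
    return 1.0 - len(integrity_violations(rows, rules)) / checked


# Hypothetical rules for a review table.
rules = {
    "rating": lambda v: isinstance(v, int) and 1 <= v <= 5,
    "text": lambda v: isinstance(v, str) and v.strip() != "",
}
rows = [
    {"rating": 5, "text": "Works well"},
    {"rating": 7, "text": "Fine"},   # rating out of range
    {"rating": "4", "text": ""},     # wrong type and empty text
]
print(integrity_violations(rows, rules))
print(integrity_pass_rate(rows, rules))
```

The violation list doubles as feedback for prompt tuning: a column that fails often is a candidate for a tighter constraint in the generation prompt.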

Core Entities

Models

  • Claude 3 Opus
  • ChatGPT 3.5
  • Llama 2 13B

Metrics

  • Structure Preserving Score (SPS)
  • Integrity Score (IS)
  • Column Shape (KS statistic, TVD)
  • Sentiment distribution
  • Top keywords
  • Average review length
  • TSTR (Train-Synthetic-Test-Real)
  • Accuracy
  • Mean Absolute Error (MAE)
  • Membership Inference Attack (MIA) success rate
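As an illustration of the first metric in this list, a structure check can compare column names position by position. The exact SPS formula is not given in this summary, so the definition below (matching positions divided by the longer column count) is an assumption, and the column names are hypothetical.

```python
def structure_preserving_score(real_columns, synth_columns):
    """Fraction of column positions where the synthetic table matches the real one."""
    if not real_columns and not synth_columns:
        return 1.0
    matches = sum(r == s for r, s in zip(real_columns, synth_columns))
    # Dividing by the longer length penalizes both missing and extra columns.
    return matches / max(len(real_columns), len(synth_columns))


real_cols = ["rating", "title", "text", "verified"]
print(structure_preserving_score(real_cols, ["rating", "title", "text", "verified"]))  # 1.0
print(structure_preserving_score(real_cols, ["rating", "text", "title", "verified"]))  # 0.5
```

Under this definition a score of 100% means every column is present, correctly named, and in the original order.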

Datasets

  • Amazon product review dataset (software subset)