Synthetic data, integers and floats. #2349

CarlangaUC · 2025-01-16T19:01:58Z

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

SDV version: 1.17.3
Python version: 3.10.11
Operating System: Windows

Problem description

I'm trying to compare different synthesizers, specifically GaussianCopula, TVAE, CTGAN, and CopulaGAN. With the last three, I encountered the problem of the synthetic data containing floats instead of integers in the same columns as the original data. I was wondering if there is any hyperparameter in these synthesizers (I couldn't find it in the documentation) to round these values during training or sampling.

What I already tried

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.


npatki · 2025-01-16T22:15:20Z

Hi @CarlangaUC, nice to meet you. I've seen this issue once before with a different user, but we never really got to the bottom of it 100%. It would be great if you're able to help us debug this!

My hunch is that it is related to how you are loading and storing your real data in Python (the data that is used for fitting). The problem appears to be unrelated to the actual ML modeling, which means that updating hyperparameters or other settings won't have any effect.

To get to the bottom of this, would you be able to share your code from loading the data into Python up until sampling from the synthesizer? In particular, I'm curious what format your original data is in. Are you updating or modifying your data in any way after loading it into Python?

We've tested the SDV explicitly with the case of reading a CSV file into Python and passing it directly (unmodified) into a synthesizer. As a result, the data table should be stored as floats, ints, objects, etc. (dtypes of the pandas DataFrame).

import pandas as pd

from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

# Load the real data and its metadata without modifying either
data = pd.read_csv('my_data_file.csv')
metadata = Metadata.load_from_json('my_metadata_file.json')

# Fit the synthesizer, then sample as many rows as the real table has
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=len(data))

Does anything about your usage jump out to you as being different from this?
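
If it helps with debugging, here is a minimal sketch of one way to check which columns change dtype between the real and synthetic tables. It uses only pandas; the helper name `dtype_mismatches` and the toy tables are hypothetical stand-ins for illustration, not part of the SDV API, so you would swap in your own `data` and `synthetic_data`.

```python
import io

import pandas as pd


def dtype_mismatches(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Return the shared columns whose dtype differs between the two tables."""
    shared = [col for col in real.columns if col in synthetic.columns]
    report = pd.DataFrame({
        'real_dtype': real[shared].dtypes.astype(str),
        'synthetic_dtype': synthetic[shared].dtypes.astype(str),
    })
    return report[report['real_dtype'] != report['synthetic_dtype']]


# Toy stand-ins for the real and synthetic tables (hypothetical data).
real = pd.read_csv(io.StringIO('age,income\n31,1200.5\n45,980.0\n'))
synthetic = pd.DataFrame({'age': [30.7, 44.2], 'income': [1100.0, 1010.3]})

# Inspect how pandas stored the real data right after loading it.
print(real.dtypes)

# List only the columns that came back with a different dtype.
print(dtype_mismatches(real, synthetic))
```

Printing `data.dtypes` right after `pd.read_csv` should also show whether the affected columns were already stored as floats before fitting.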

CarlangaUC added the new and question labels on Jan 16, 2025

npatki added the under discussion label and removed the new label on Jan 16, 2025