Synthetic data, integers and floats. #2349

Open
CarlangaUC opened this issue Jan 16, 2025 · 1 comment
Labels: question, under discussion

Comments

@CarlangaUC

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.17.3
  • Python version: 3.10.11
  • Operating System: Windows

Problem description

I'm trying to compare different synthesizers, specifically GaussianCopula, TVAE, CTGAN, and CopulaGAN. With the last three, I encountered the problem of the synthetic data containing floats instead of integers in the same columns as the original data. I was wondering if there is any hyperparameter in these synthesizers (I couldn't find it in the documentation) to round these values during training or sampling.

What I already tried

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

CarlangaUC added the "new" and "question" labels on Jan 16, 2025

npatki (Contributor) commented Jan 16, 2025

Hi @CarlangaUC, nice to meet you. I've seen this issue once before with a different user, but we never fully got to the bottom of it. It would be great if you're able to help us debug this!

My hunch is that it is related to how you are loading and storing your real data in Python (the data that is used for fitting). The problem appears to be unrelated to the actual ML modeling, which means that updating hyperparameters or other settings won't have any effect.
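
As an illustration of how loading alone can change a column's type, here is a minimal pandas-only sketch. The CSV content and column names are made up, and this is just one common way integers end up stored as floats, not a confirmed cause of this issue: when an integer column has a missing value, `pd.read_csv` stores it as `float64`.

```python
import io

import pandas as pd

# Hypothetical CSV content: the 'age' column has one missing value.
csv_text = "id,age\n1,34\n2,\n3,51\n"

# Default loading: 'id' stays integer, 'age' becomes float64 because of the NaN.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# Asking pandas for its nullable integer dtype keeps 'age' integral.
df_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"age": "Int64"})
print(df_nullable.dtypes)
```

Printing `data.dtypes` right after loading the real data would show whether something similar is happening in your case.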

To get to the bottom of this, would you be able to share your code from loading the data into Python up until sampling from the synthesizer? In particular, I'm curious what format your original data is in. Are you updating or modifying your data in any way after loading it into Python?

We've tested the SDV explicitly with the case of reading a CSV file into Python and passing it directly (unmodified) into a synthesizer. As a result, the data table should be stored as floats, ints, objects, etc. (dtypes of the pandas DataFrame).

import pandas as pd

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import Metadata

data = pd.read_csv('my_data_file.csv')
metadata = Metadata.load_from_json('my_metadata_file.json')

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=len(data))

Does anything about your usage jump out to you as being different from this?
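
If it helps with debugging, a small helper like the one below can list the columns whose dtypes differ between the real and synthetic tables. This is only a sketch: `compare_dtypes` is a hypothetical helper written with plain pandas (it is not part of the SDV), and the two tiny DataFrames are made-up stand-ins for `data` and `synthetic_data`.

```python
import pandas as pd


def compare_dtypes(real, synthetic):
    """Return a table of the columns whose dtypes differ between two DataFrames."""
    report = pd.DataFrame({
        "real": real.dtypes.astype(str),
        "synthetic": synthetic.dtypes.astype(str),
    })
    return report[report["real"] != report["synthetic"]]


# Made-up stand-ins for the real and synthetic tables.
real = pd.DataFrame({
    "count": pd.Series([1, 2, 3], dtype="int64"),
    "price": pd.Series([1.5, 2.0, 3.25], dtype="float64"),
})
synthetic = pd.DataFrame({
    "count": pd.Series([1.2, 2.7, 3.1], dtype="float64"),
    "price": pd.Series([1.4, 2.2, 3.0], dtype="float64"),
})

mismatches = compare_dtypes(real, synthetic)
print(mismatches)
```

Sharing the output of `compare_dtypes(data, synthetic_data)` along with `data.dtypes` would make it easier to see where the integer columns turn into floats.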

npatki added the "under discussion" label and removed the "new" label on Jan 16, 2025