How do you generate missing values that have correlations? #2310

Closed
npatki opened this issue Nov 25, 2024 · 3 comments
Labels
question General question about the software resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@npatki
Contributor

npatki commented Nov 25, 2024

I am filing this issue on behalf of @wilcovanvorstenbosch, first asked in #2288

From @wilcovanvorstenbosch:

The values are not missing at random. Often, the variable was not relevant for a specific row because of a certain value for another variable. I was hoping that the synthesizer would be able to pick up on these correlations. It should, right?
In the original data, whether a column has a NaN is not random.

From @npatki:

[By default] SDV will assume your data is missing completely at random. I can walk you through how to update this if you'd like.

We can use this issue to further discuss the topic.

@npatki npatki added question General question about the software new Automatic label applied to new issues labels Nov 25, 2024
@npatki
Contributor Author

npatki commented Nov 25, 2024

Hello, expanding upon this a bit:

By default, SDV assumes that data is missing completely at random, meaning that there are no correlations between whether something is missing and any other variable. So by default, the only thing we expect is for the overall % of missing values to be learned but not any correlations.
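
One quick way to see whether missingness in the real data depends on another column is to compare missing-value rates across groups. The following is a minimal, dependency-free sketch of that check. The column names `employed` and `salary` and the helper `missing_rate_by_group` are hypothetical and are not part of SDV.

```python
import math
from collections import defaultdict


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_rate_by_group(rows, group_col, target_col):
    """Return {group value: fraction of rows where target_col is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_col]
        totals[group] += 1
        if is_missing(row[target_col]):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical data: salary tends to be missing when the person is not employed.
rows = [
    {"employed": True, "salary": 52000.0},
    {"employed": True, "salary": 61000.0},
    {"employed": True, "salary": None},
    {"employed": True, "salary": 48000.0},
    {"employed": False, "salary": float("nan")},
    {"employed": False, "salary": None},
]

rates = missing_rate_by_group(rows, "employed", "salary")
print(rates)
```

If the rates are roughly equal across groups, the missing-completely-at-random assumption is plausible for that pair of columns. A large gap, like the one in this toy data, suggests it is not.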

Telling SDV to learn missing value correlations

SDV handles missing values via data preprocessing. So you would need to update the transformers.

Use the update_transformers method. Assign any relevant numerical columns to a FloatFormatter that is set to have missing_value_generation='from_column'.

from rdt.transformers import FloatFormatter
from sdv.single_table import GaussianCopulaSynthesizer

# your_metadata and your_data are your own metadata object and data table
synthesizer = GaussianCopulaSynthesizer(your_metadata)
synthesizer.auto_assign_transformers(your_data)

# Reassign each numerical column whose missingness should be learned
synthesizer.update_transformers(column_name_to_transformer={
    'column_A': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    'column_B': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    'column_C': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    # ...
})

synthesizer.fit(your_data)
synthetic_data = synthesizer.sample(num_rows=100)

As a result,

  • SDV will continue to learn the % of missing values and create synthetic data with roughly the same proportion
  • SDV should also pick up on correlations between the "missingness" of a value and other columns.
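
The idea behind the `'from_column'` setting, as far as I understand it, is that the transformer records missingness in a separate indicator column, so the synthesizer can model that indicator alongside every other column. The sketch below only illustrates the concept in plain Python. The helper `add_missing_indicator` and the choice to fill with the mean are assumptions for illustration, not RDT's actual implementation.

```python
import math


def add_missing_indicator(values):
    """Split a numeric column into (filled values, 0/1 missing flags).

    Missing entries are replaced by the mean of the observed values so the
    filled column stays fully numeric.
    """
    observed = [v for v in values if v is not None and not math.isnan(v)]
    fill = sum(observed) / len(observed) if observed else 0.0
    filled, flags = [], []
    for v in values:
        missing = v is None or math.isnan(v)
        filled.append(fill if missing else v)
        flags.append(1.0 if missing else 0.0)
    return filled, flags


column_a = [1.0, None, 3.0, float("nan")]
filled, is_null = add_missing_indicator(column_a)
print(filled)   # [1.0, 2.0, 3.0, 2.0]
print(is_null)  # [0.0, 1.0, 0.0, 1.0]
```

Because the flag is just another numeric column, any relationship it has with the other columns can be learned like any other correlation, and sampled flags can then decide where to put missing values back.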

Correlations vs. constraints

Note that the correlations SDV learns are probabilistic. If you have a deterministic rule (e.g. if column A = True, then column B is always missing), you will need to use a constraint instead.
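
To decide which case applies, it can help to check whether the rule holds without exception in the real data. This is a small sketch using plain Python rows. The helper `rule_always_holds` is hypothetical and is not an SDV constraint, only a diagnostic you could run beforehand.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def rule_always_holds(rows, condition_col, condition_value, target_col):
    """True if target_col is missing in every row where condition_col equals condition_value."""
    matching = [row for row in rows if row[condition_col] == condition_value]
    return bool(matching) and all(is_missing(row[target_col]) for row in matching)


# Hypothetical data: column_B is always missing when column_A is True,
# and only sometimes missing when column_A is False.
rows = [
    {"column_A": True, "column_B": None},
    {"column_A": True, "column_B": float("nan")},
    {"column_A": False, "column_B": 4.2},
    {"column_A": False, "column_B": None},
]

print(rule_always_holds(rows, "column_A", True, "column_B"))   # True
print(rule_always_holds(rows, "column_A", False, "column_B"))  # False
```

A result of `True` points to a deterministic rule, which is a candidate for a constraint. A result of `False` means the relationship is at most a tendency, which the transformer-based approach is meant to capture.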

Do let me know if this works and if you have any follow-ups.

@npatki npatki added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Nov 25, 2024
@wilcovanvorstenbosch

Thank you very much for your explanation.
I followed the steps, and it works like a charm.
This is exactly what I was looking for!

Additionally, I found out that you can update the numerical_distributions as a parameter.
I used the Fitter package to find the best distribution out of the available options.
Question: Why is this not included by default in the SDV package? It seems like 'beta' is not always the best guess.

Are there other parameters that I can tweak for the GaussianCopula synthesizer?

Also, I imagine it is possible to update the transformers for CTGAN as well. Do you think this will fix my issue there?

Kind regards,
Wilco

@npatki
Contributor Author

npatki commented Dec 2, 2024

Hi @wilcovanvorstenbosch, great! Very happy that we could help.

You can certainly use these transformers on CTGAN. If you are referring to your issue from #2288, I'm not really sure whether it would help but you can try. Issue #2288 is likely due to some bug in the FloatFormatter but we have not been able to replicate it.

I'm going to close this particular issue since we have answered the question of generating missing values with correlations and it seems to be working for you.

For any other questions, please file a new issue. To keep our GitHub space clean, we would appreciate it if we could stick to one issue per topic. You can file a new issue if you'd like to discuss the numerical distributions parameter, the Fitter package, etc. Thanks.

@npatki npatki closed this as completed Dec 2, 2024
@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Dec 2, 2024