How do you generate missing values that have correlations? #2310

npatki · 2024-11-25T19:11:31Z

I am filing this issue on behalf of @wilcovanvorstenbosch, first asked in #2288

The values are not missing at random. Often, the variable was not relevant for a specific row because of a certain value for another variable. I was hoping that the synthesizer would be able to pick up on these correlations. It should, right?
In the original data, whether a column has a NaN is not random.

From @npatki:

[By default] SDV will assume your data is missing completely at random. I can walk you through how to update this if you'd like.

We can use this issue to further discuss the topic.

npatki · 2024-11-25T19:58:24Z

Hello, expanding upon this a bit:

By default, SDV assumes that data is missing completely at random, meaning that there are no correlations between whether something is missing and any other variable. So by default, the only thing we expect is for the overall % of missing values to be learned but not any correlations.
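Before changing any settings, it can help to check whether the missingness in your real data actually depends on another column. The snippet below is a minimal, dependency-free sketch of that check. The helper `missing_rate_by_group` and the columns `employed` and `salary` are hypothetical, and the rows are plain dicts of the kind you would get from `your_data.to_dict('records')`.

```python
import math
from collections import defaultdict


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_rate_by_group(rows, group_column, target_column):
    """Return {group value: fraction of rows where target_column is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_column]
        totals[group] += 1
        if is_missing(row[target_column]):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical data: salary tends to be missing when employed is False.
rows = [
    {'employed': True, 'salary': 52000.0},
    {'employed': True, 'salary': 61000.0},
    {'employed': True, 'salary': float('nan')},
    {'employed': False, 'salary': float('nan')},
    {'employed': False, 'salary': None},
]

rates = missing_rate_by_group(rows, 'employed', 'salary')
print(rates)
```

If the missing rate differs a lot between groups, the data is probably not missing completely at random, and the default setting would only reproduce the overall rate.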

Telling SDV to learn missing value correlations

SDV handles missing values via data preprocessing. So you would need to update the transformers.

Use the update_transformers method. Assign any relevant numerical columns to a FloatFormatter that is set to have missing_value_generation='from_column'.

from rdt.transformers import FloatFormatter
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(your_metadata)

# Let SDV assign its default transformers first, then override the numerical columns
synthesizer.auto_assign_transformers(your_data)

# missing_value_generation='from_column' is the setting that matters here
synthesizer.update_transformers(column_name_to_transformer={
    'column_A': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    'column_B': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    'column_C': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    # ...
})

# Fit on the real data, then sample synthetic rows
synthesizer.fit(your_data)
synthetic_data = synthesizer.sample(num_rows=100)

As a result,

SDV will continue to learn the % of missing values and create synthetic data with roughly the same proportion
SDV should also pick up on correlations between the "missingness" of a value and other columns.
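For intuition on why 'from_column' makes this possible: the idea is that missingness is stored as its own 0/1 column, which the synthesizer can then model alongside every other column. The snippet below is only a conceptual sketch of that idea, not RDT's actual implementation. The function names are hypothetical, and filling with the mean and thresholding at 0.5 are assumptions made for illustration.

```python
import math


def add_is_null_column(values):
    """Split a numeric column into (filled values, 0/1 missingness flags)."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    # Assumption: fill gaps with the mean of the observed values.
    fill = sum(observed) / len(observed) if observed else 0.0
    filled, is_null = [], []
    for v in values:
        missing = v is None or math.isnan(v)
        filled.append(fill if missing else v)
        is_null.append(1.0 if missing else 0.0)
    return filled, is_null


def restore_missing(filled, is_null, threshold=0.5):
    """Put NaN back wherever the (possibly synthetic) flag exceeds the threshold."""
    return [float('nan') if flag > threshold else v
            for v, flag in zip(filled, is_null)]


values = [10.0, None, 30.0, float('nan')]
filled, is_null = add_is_null_column(values)
restored = restore_missing(filled, is_null)
print(filled, is_null, restored)
```

Because the flag is an ordinary numeric column during modeling, any relationship between it and the other columns can be learned like any other correlation.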

Correlations vs. constraints

Note that the correlations SDV learns will be probabilistic in nature. If you have a deterministic rule (e.g., if column A = True, then column B = missing), then you will need to use a constraint instead.
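One way to tell which case applies is to check whether the rule holds without exceptions in the real data. The snippet below is a minimal sketch of such a check. The helper `rule_violations` is hypothetical, and the rows are plain dicts (for example from `your_data.to_dict('records')`) that reuse the column_A / column_B rule above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def rule_violations(rows, condition_column, condition_value, target_column):
    """Return rows where the condition holds but the target is NOT missing."""
    return [
        row for row in rows
        if row[condition_column] == condition_value
        and not is_missing(row[target_column])
    ]


# Hypothetical data for the rule "if column_A is True, column_B is missing".
rows = [
    {'column_A': True, 'column_B': None},
    {'column_A': True, 'column_B': float('nan')},
    {'column_A': False, 'column_B': 3.5},
    {'column_A': True, 'column_B': 1.2},  # breaks the rule
]

violations = rule_violations(rows, 'column_A', True, 'column_B')
print(len(violations), 'violating row(s)')
```

Zero violations suggests the rule is deterministic and a constraint is the better fit. If there are exceptions, the relationship is probabilistic and the transformer setting above is the more natural choice.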

Do let me know if this works and if you have any follow ups.

wilcovanvorstenbosch · 2024-12-02T16:38:07Z

Thank you very much for your explanation.
I followed the steps, and it works like a charm.
This is exactly what I was looking for!

Additionally, I found out that you can update the numerical_distributions as a parameter.
I used the Fitter package to find the best distribution out of the available options.
Question: Why is this not included by default in the SDV package? It seems like 'beta' is not always the best guess.

Are there other parameters that I can tweak for the GaussianCopula synthesizer?

Also, I imagine it is possible to update the transformers for CTGAN as well. Do you think this will fix my issue there?

Kind regards,
Wilco

npatki · 2024-12-02T19:56:18Z

Hi @wilcovanvorstenbosch, great, very happy that we could help.

You can certainly use these transformers on CTGAN. If you are referring to your issue from #2288, I'm not really sure whether it would help but you can try. Issue #2288 is likely due to some bug in the FloatFormatter but we have not been able to replicate it.

I'm going to close this particular issue since we have answered the question of generating missing values with correlations and it seems to be working for you.

For any other questions, please file a new issue. To keep our GitHub space clean, we would appreciate it if we can stick to using a new issue for each new topic. You can file a new issue if you'd like to discuss the numerical distributions parameter, Fitter package, etc. Thanks.

npatki added the labels "question" (General question about the software) and "new" (Automatic label applied to new issues) on Nov 25, 2024

npatki mentioned this issue on Nov 25, 2024 in: NaN values for numerical variables DISAPPEAR when using CTGANSynthesizer #2288 (Open)

npatki added the label "under discussion" (Issue is currently being discussed) and removed the label "new" on Nov 25, 2024

npatki closed this as completed on Dec 2, 2024

npatki added the label "resolution:resolved" (The issue was fixed, the question was answered, etc.) and removed the label "under discussion" on Dec 2, 2024
