How do you generate missing values that have correlations? #2310
Comments
Hello, expanding upon this a bit: By default, SDV assumes that data is missing completely at random, meaning that there are no correlations between whether a value is missing and any other variable. So by default, the only thing we expect SDV to learn is the overall % of missing values, not any correlations.
Telling SDV to learn missing value correlations
SDV handles missing values via data preprocessing, so you would need to update the transformers. Use the update_transformers method to assign any relevant numerical columns to a FloatFormatter that is set to have missing_value_generation='from_column'.
from rdt.transformers import FloatFormatter
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(your_metadata)
synthesizer.auto_assign_transformers(your_data)
synthesizer.update_transformers(column_name_to_transformer={
    'column_A': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    'column_B': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    'column_C': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True, missing_value_generation='from_column'),
    # ... (any other relevant numerical columns)
})
synthesizer.fit(your_data)
synthetic_data = synthesizer.sample(num_rows=100)
As a result, the synthesizer should learn how missing values in these columns correlate with the other columns.
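One way to check whether the missing-value correlations were picked up is to compare, in the real and the synthetic data, how often a column is missing within each group of another column. Below is a minimal sketch in plain Python, assuming the rows are available as dictionaries (for a pandas DataFrame, `df.to_dict('records')` gives that shape). The helper `missing_rate_by_group`, the column name `group`, and the toy rows are all made up for illustration and are not part of SDV.

```python
import math
from collections import defaultdict


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_rate_by_group(rows, target, group_by):
    """Return {group value: fraction of rows where `target` is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        key = row[group_by]
        totals[key] += 1
        if is_missing(row[target]):
            missing[key] += 1
    return {key: missing[key] / totals[key] for key in totals}


# Toy data (hypothetical): column_A is missing more often when group == 'x'.
real_rows = [
    {'group': 'x', 'column_A': None},
    {'group': 'x', 'column_A': float('nan')},
    {'group': 'x', 'column_A': 1.5},
    {'group': 'x', 'column_A': None},
    {'group': 'y', 'column_A': 2.0},
    {'group': 'y', 'column_A': 3.1},
    {'group': 'y', 'column_A': None},
    {'group': 'y', 'column_A': 0.4},
]

rates = missing_rate_by_group(real_rows, target='column_A', group_by='group')
print(rates)
```

Running the same function on the synthetic rows and comparing the per-group rates gives a rough idea of whether the dependence was learned. Under the default missing-completely-at-random assumption, the synthetic rates would be expected to look similar across groups.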
Correlations vs. constraints
Note that the correlations SDV learns will be probabilistic in nature. If you have a deterministic rule (e.g. if column A = True, then column B = missing), then you will need to use a constraint instead.
Do let me know if this works and if you have any follow-ups.
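To make the distinction concrete: a deterministic rule has to hold in every row, whereas a learned correlation only shifts how often it holds. The sketch below checks synthetic rows against the example rule mentioned above (column A = True implies column B is missing). `violates_rule` and the sample rows are hypothetical and not an SDV API. A check like this only detects violations; it does not enforce the rule the way a constraint would.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def violates_rule(row):
    """True when column_A is True (a plain Python bool) but column_B has a value."""
    return row['column_A'] is True and not is_missing(row['column_B'])


# Hypothetical synthetic rows, as dictionaries.
synthetic_rows = [
    {'column_A': True, 'column_B': None},
    {'column_A': True, 'column_B': 4.2},    # breaks the rule
    {'column_A': False, 'column_B': 1.0},
    {'column_A': False, 'column_B': None},
    {'column_A': True, 'column_B': float('nan')},
]

# Indices of rows that break the deterministic rule.
violations = [i for i, row in enumerate(synthetic_rows) if violates_rule(row)]
print(violations)
```

If a purely probabilistic model produces any violations here, that is a sign the rule needs to be expressed as a constraint.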
Thank you very much for your explanation. Additionally, I found out that you can update the [...]
Are there other parameters that I can tweak for the GaussianCopula synthesizer? Also, I imagine it is possible to update the transformers for CTGAN as well. Do you think this will fix my issue there?
Kind regards,
Hi @wilcovanvorstenbosch, great, very happy that we could help. You can certainly use these transformers on CTGAN. If you are referring to your issue from #2288, I'm not really sure whether it would help, but you can try. Issue #2288 is likely due to some bug in the FloatFormatter, but we have not been able to replicate it.
I'm going to close this particular issue since we have answered the question of generating missing values with correlations and it seems to be working for you. For any other questions, please file a new issue. To keep our GitHub space clean, we would appreciate it if we can stick to using a new issue for each new topic. You can file a new issue if you'd like to discuss the numerical distributions parameter, Fitter package, etc. Thanks.
I am filing this issue on behalf of @wilcovanvorstenbosch, first asked in #2288
From @wilcovanvorstenbosch:
From @npatki:
We can use this issue to further discuss the topic.