
Loading of numpy types fails #7

Open
jtbraun opened this issue May 1, 2019 · 13 comments
@jtbraun

jtbraun commented May 1, 2019

Attached is a small example.
PyYAML itself is capable of loading/dumping numpy data. When called through serialize, though, RepresenterErrors are thrown.

It's possible that this is a problem with numpy itself, but given that numpy works through the yaml.Loader, I figure it's something to do with serialize.

I used python 3.6.8 to reproduce these.
pyyaml_numpy_failure.zip
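
For reference, a minimal sketch of the vanilla round-trip that works (plain pyyaml plus numpy; this is not the attached script):

```python
import numpy as np
import yaml

# PyYAML's full Dumper/Loader pair round-trips numpy objects by
# falling back to the pickle __reduce__ protocol for types it does
# not know natively.
arr = np.ones(10, dtype=np.int64)
text = yaml.dump(arr)                           # default yaml.Dumper
restored = yaml.load(text, Loader=yaml.Loader)  # full (unsafe) loader
assert (restored == arr).all()
assert restored.dtype == arr.dtype
```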

@hgrecco
Owner

hgrecco commented May 2, 2019

Thanks for the very detailed summary. I am getting this when I run your script with Python 3.6.5, pyyaml 3.12, numpy 1.14.3:

ERROR:root:ERROR: Failed dtype=int64     /<class 'int'>                  serialize-yaml                 to load(foo1.bin  ) 1: TypeError('scalar() argument 1 must be numpy.dtype, not dict',)
ERROR:root:ERROR: Failed dtype=int64     /<class 'int'>                  vanilla "Safe" yaml            to dump(foo7.bin  ) array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) : RepresenterError('cannot represent an object: [1 1 1 1 1 1 1 1 1 1]',)
ERROR:root:ERROR: Failed dtype=int64     /<class 'int'>                  vanilla "Safe" yaml            to dump(foo7.bin  ) 1 : RepresenterError('cannot represent an object: 1',)
ERROR:root:ERROR: Failed dtype=int64     /<class 'int'>                  serialize-yaml                 to load(foo10.bin ) 1: TypeError('scalar() argument 1 must be numpy.dtype, not dict',)
ERROR:root:ERROR: Failed dtype=int64     /<class 'int'>                  vanilla "Safe" yaml            to dump(foo16.bin ) array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) : RepresenterError('cannot represent an object: [1 1 1 1 1 1 1 1 1 1]',)
ERROR:root:ERROR: Failed dtype=int64     /<class 'int'>                  vanilla "Safe" yaml            to dump(foo16.bin ) 1 : RepresenterError('cannot represent an object: 1',)
ERROR:root:ERROR: Failed dtype=int32     /<class 'numpy.int32'>          serialize-yaml                 to load(foo19.bin ) 1: TypeError('scalar() argument 1 must be numpy.dtype, not dict',)
ERROR:root:ERROR: Failed dtype=int32     /<class 'numpy.int32'>          vanilla "Safe" yaml            to dump(foo25.bin ) array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32) : RepresenterError('cannot represent an object: [1 1 1 1 1 1 1 1 1 1]',)
ERROR:root:ERROR: Failed dtype=int32     /<class 'numpy.int32'>          vanilla "Safe" yaml            to dump(foo25.bin ) 1 : RepresenterError('cannot represent an object: 1',)
ERROR:root:ERROR: Failed dtype=int64     /<class 'numpy.int64'>          serialize-yaml                 to load(foo28.bin ) 1: TypeError('scalar() argument 1 must be numpy.dtype, not dict',)
ERROR:root:ERROR: Failed dtype=int64     /<class 'numpy.int64'>          vanilla "Safe" yaml            to dump(foo34.bin ) array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) : RepresenterError('cannot represent an object: [1 1 1 1 1 1 1 1 1 1]',)
ERROR:root:ERROR: Failed dtype=int64     /<class 'numpy.int64'>          vanilla "Safe" yaml            to dump(foo34.bin ) 1 : RepresenterError('cannot represent an object: 1',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'float'>                serialize-yaml                 to load(foo37.bin ) 1.0: TypeError('scalar() argument 1 must be numpy.dtype, not dict',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'float'>                vanilla "Safe" yaml            to dump(foo43.bin ) array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]) : RepresenterError('cannot represent an object: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'float'>                vanilla "Safe" yaml            to dump(foo43.bin ) 1.0 : RepresenterError('cannot represent an object: 1.0',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'float'>                serialize-yaml                 to load(foo46.bin ) 1.0: TypeError('scalar() argument 1 must be numpy.dtype, not dict',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'float'>                vanilla "Safe" yaml            to dump(foo52.bin ) array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]) : RepresenterError('cannot represent an object: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'float'>                vanilla "Safe" yaml            to dump(foo52.bin ) 1.0 : RepresenterError('cannot represent an object: 1.0',)
ERROR:root:ERROR: Failed dtype=float32   /<class 'numpy.float32'>        serialize-yaml                 to load(foo55.bin ) 1.0: TypeError('scalar() argument 1 must be numpy.dtype, not dict',)
ERROR:root:ERROR: Failed dtype=float32   /<class 'numpy.float32'>        vanilla "Safe" yaml            to dump(foo61.bin ) array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32) : RepresenterError('cannot represent an object: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]',)
ERROR:root:ERROR: Failed dtype=float32   /<class 'numpy.float32'>        vanilla "Safe" yaml            to dump(foo61.bin ) 1.0 : RepresenterError('cannot represent an object: 1.0',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'numpy.float64'>        serialize-yaml                 to load(foo64.bin ) 1.0: TypeError('scalar() argument 1 must be numpy.dtype, not dict',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'numpy.float64'>        vanilla "Safe" yaml            to dump(foo70.bin ) array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]) : RepresenterError('cannot represent an object: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]',)
ERROR:root:ERROR: Failed dtype=float64   /<class 'numpy.float64'>        vanilla "Safe" yaml            to dump(foo70.bin ) 1.0 : RepresenterError('cannot represent an object: 1.0',)

So both the vanilla and serialize versions are failing, right?

@jtbraun
Author

jtbraun commented May 2, 2019

Ignore the vanilla "Safe" yaml failures. They are purely pyyaml, and are expected because the yaml.SafeDumper:

SafeDumper(stream) produces only standard YAML tags and thus cannot represent class instances and probably more compatible with other YAML processors.

But I do expect the "serialize-yaml" versions to work. They do serialize to SOMETHING (look at foo46.bin, for instance), and that something looks like what I'd expect to come from serializing a numpy object, but loading that object back fails with a TypeError.

serialize uses its own custom Dumper/Loader, I think, and it's probably not round-tripping numpy.array objects because they reduce differently than other objects do?
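
To illustrate the expected SafeDumper failure (a sketch using plain pyyaml, not serialize):

```python
import numpy as np
import yaml

# SafeDumper emits only standard YAML tags, so refusing a numpy
# scalar with RepresenterError is documented behavior, not a
# serialize bug.
try:
    yaml.dump(np.int64(1), Dumper=yaml.SafeDumper)
    safe_dump_ok = True
except yaml.representer.RepresenterError:
    safe_dump_ok = False
assert not safe_dump_ok
```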

@hgrecco
Owner

hgrecco commented May 2, 2019

We basically subclass Dumper and Loader. Take a look at yaml.py

@hgrecco
Owner

hgrecco commented May 2, 2019

I basically removed the modified Loader in yaml.py and everything works fine. I wonder if this is a common problem that we should address.

@jtbraun
Author

jtbraun commented May 2, 2019

I would say "yes, it needs to be addressed"... numpy is in common use, and the underlying pyyaml correctly handles it. It's a surprising result that using serialize prevents that from working. If serialize is meant to make the job of using the various serialization formats easier, it shouldn't break anything.

@hgrecco
Owner

hgrecco commented May 2, 2019

I did a small change that seems to fix this issue in a way that does not destroy the other functionality of serialize (namely the ability to dump custom objects). I would appreciate it if you could install it from GitHub and test it. If it works, then I will make a new release.

By the way, while it is great that pyyaml can deal with numpy arrays natively, this is not the usual case. That is why serialize provides a way to create custom functions to dump/load per class. They are also sometimes more efficient.

@jtbraun
Author

jtbraun commented May 2, 2019

I'll give it a shot, probably tomorrow.

By the way while it is great that pyyaml can deal with numpy array natively this is not the usual case.

I think this is an artifact of pyyaml implementing the pickle reduce API, and each of the individual numpy types supporting pickling.

That is why serialize provides a way to create custom functions to dump/load per class. They are also sometimes more efficient.

That's the entire reason I'm using serialize. It's very handy! Thank you.

@jtbraun
Author

jtbraun commented May 3, 2019

I got around to trying this. It works as written for the test case I gave you, but it causes errors in some other stuff in my larger application. I don't know if that's incompatibilities between the files on disk serialized with the older version (likely), or a bigger problem.

I think there's a better solution that's guaranteed to be right, but it's more work. I'll describe it below, and later I'll see about putting together an example. Ultimately you're trying to determine whether a MappingNode has been put through one of serialize's encoders or not. Right now you just key off of whether the node is a MappingNode, but the problem is that pyyaml encodes objects in three different ways, depending on what sort of serialization API they support:

    def represent_object(self, data):
        # (abridged from pyyaml's representer.py; args, state, listitems,
        # dictitems, newobj, function_name, tag and value come from
        # unpacking data.__reduce_ex__(2) earlier in the method)
        if not args and not listitems and not dictitems \
                and isinstance(state, dict) and newobj:
            return self.represent_mapping(
                    'tag:yaml.org,2002:python/object:'+function_name, state)
        if not listitems and not dictitems  \
                and isinstance(state, dict) and not state:
            return self.represent_sequence(tag+function_name, args)
        return self.represent_mapping(tag+function_name, value)

If a class's __reduce__ method returns only the cls/dictionary portions, pyyaml encodes it as a MappingNode with a tag that indicates it's an object, plus the name of the function to call with the MappingNode's kwargs to reinstantiate the object. Numpy arrays are one such object that encodes to only-dicts. I've attached another (contrived!) example here. MyLoaderA is the same as the default serialize.yaml loader, and both fail to properly reconstruct numpy arrays and the toy AttributeDict. The hacky MyLoaderB checks the node.tag first, and lets yaml do the object decode if it's not really a Mapping (the serialize-yamlB test cases all pass).
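
Roughly, the tag-first check in MyLoaderB looks like this (a simplified sketch with an identity decode standing in for serialize's hook; this is not the code in the zip):

```python
import numpy as np
import yaml

def decode(mapping):
    # Hypothetical stand-in for serialize's per-class decode hook;
    # here it just passes the mapping through.
    return mapping

class MyLoaderB(yaml.Loader):
    def construct_mapping(self, node, deep=False):
        mapping = super().construct_mapping(node, deep=deep)
        # Only plain YAML maps go through serialize's decode; nodes
        # tagged python/object* are pyyaml's own object encodings
        # and must pass through untouched.
        if node.tag == 'tag:yaml.org,2002:map':
            return decode(mapping)
        return mapping

# A numpy array dumped by the stock Dumper still round-trips,
# because its tagged nodes bypass the decode hook.
arr = np.ones(4)
restored = yaml.load(yaml.dump(arr), Loader=MyLoaderB)
assert (restored == arr).all()
```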

I think the proper fix is to modify the "representer" behavior of a Dumper. Take a look at [py]yaml.representer.py.

You'll want to make your own representer class, and every time someone calls serialize.register_class(cls, to, from), you make a corresponding call to your MyYamlRepresenterClass.add_representer(cls, MyYamlRepresenterClass.represent_serialized_decode).

Then MyYamlRepresenterClass.represent_serialized_decode does whatever you want it to (e.g. all.encode/all.decode), and then returns the node: self.represent_mapping(tag, encoded_value)

Where tag is a unique string that allows you to identify that it's "serialize-encoded for class XYZ". On the load side, you'd sort of do what you're doing in the released version today, but instead of just checking isinstance(node, MappingNode), you'd also check whether the node.tag value matches your unique string. Then your custom decoder knows that it was the thing that encoded it, decodes it and returns, and you avoid all of these sorts of scenarios.
pyyaml_numpy_failure.zip
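
That scheme might look roughly like this (a self-contained sketch; the SERIALIZE_TAG prefix, the register_class mirror, and the Point class are hypothetical illustrations, not serialize's actual API):

```python
import yaml

SERIALIZE_TAG = '!serialize/class:'  # hypothetical unique tag prefix

class MyDumper(yaml.SafeDumper):
    pass

class MyLoader(yaml.SafeLoader):
    pass

def register_class(cls, to_builtin, from_builtin):
    # Mirror of the proposed serialize.register_class hook: tag the
    # encoded mapping uniquely on dump, and key the decoder on that
    # same tag on load.
    tag = SERIALIZE_TAG + cls.__name__

    def represent(dumper, obj):
        return dumper.represent_mapping(tag, to_builtin(obj))

    def construct(loader, node):
        return from_builtin(loader.construct_mapping(node, deep=True))

    MyDumper.add_representer(cls, represent)
    MyLoader.add_constructor(tag, construct)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

register_class(Point, lambda p: {'x': p.x, 'y': p.y}, lambda d: Point(**d))
text = yaml.dump(Point(1, 2), Dumper=MyDumper)
p = yaml.load(text, Loader=MyLoader)
assert (p.x, p.y) == (1, 2)
```

Because the constructor is keyed on the exact tag, plain MappingNodes and pyyaml's own python/object encodings never hit the custom decoder.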

@hgrecco
Owner

hgrecco commented May 3, 2019

Sounds like a really good plan. We might need to add a hook to register_class that all implementors can bind to.

jtbraun added a commit to jtbraun/serialize that referenced this issue May 14, 2019
Documented in hgrecco#7

serialize was overloading the object encoding/decoding, but didn't
properly differentiate itself from the built-in object handlers in
pyyaml. pyyaml has a tagging facility to deal with this sort of thing, so
now serialize-handled classes are handled explicitly with their own
unique tag.

This is a breaking change, previously-serialized data will no longer
decode correctly.

This adds a register_class callback for formats, and yaml implements
that to register those classes with the built-in
serialize.yaml.Dumper/Loader for tagging/encoding.
@jtbraun
Author

jtbraun commented May 14, 2019

I mocked this up in jtbraun@abe54f4

There are a couple of outstanding things that need to be fixed in serialize's yaml.py: the tag needs to be confirmed/changed, and the deep= flag needs to be set correctly. Both are marked with #BUGBUG.

This is a breaking change: previously dumped data may not load, and newly dumped data won't load with old versions.

@hgrecco
Owner

hgrecco commented May 15, 2019

I like the code and I think it is the way to go. My suggestions:

  1. Add tests to the test suite
  2. Copy all of yaml.py to yaml_legacy.py and register it as yaml:legacy
  3. Register your version as yaml (the default)
  4. Add this change to CHANGES

and then I'll merge one other change that I have and we can publish a new version.

@jtbraun
Author

jtbraun commented May 15, 2019

Tests added, and yaml_legacy is there (I backed out your other changes to it that were never released; they broke some of the tests anyway).
CHANGES modified.

Pushed to that same fix_pickle branch for your review.

@hgrecco
Owner

hgrecco commented May 16, 2019

Looks good. Make a PR against my repo and I will merge it.
