UMLS. init_from_nlm_zip can't decode charmap #8

Open
DavidLikesLearning opened this issue Dec 22, 2022 · 7 comments

DavidLikesLearning commented Dec 22, 2022

Describe the bug

I can't install the UMLS as directed by the tutorial notebooks. The UMLS object can't be initialized.

Steps to reproduce the bug

I downloaded the relevant zip file from the provided link (https://download.nlm.nih.gov/umls/kss/2020AB/umls-2020AB-metathesaurus.zip) and placed the file in the same directory as the 1_Installing_the_UMLS.ipynb notebook in the tutorials folder. Then I ran the notebook as given in the github.

Sample code to reproduce the bug

Expected results

A clear and concise description of the expected results.

Actual results

Specify the actual results or traceback.

The libraries and python version are all on the pdf attached

Environment info

troveDecodeError

  • Python version:

The libraries and python version are all on the pdf attached

  • PyArrow version:

The libraries and python version are all on the pdf attached
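
For context, a "charmap" decode error in Python usually means a text file was decoded with a legacy single-byte code page (for example cp1252, a common default on Windows) instead of UTF-8. Whether that is what happens inside UMLS.init_from_nlm_zip is an assumption here, not something confirmed in this thread. The helper name read_text_utf8 below is hypothetical; the sketch only shows how an explicit encoding avoids the error:

```python
def read_text_utf8(path):
    """Read a text file as UTF-8, ignoring the platform's default encoding.

    Hypothetical helper for illustration; it is not part of trove.
    """
    # Passing encoding explicitly keeps open() from falling back to the
    # locale code page (e.g. cp1252), which cannot decode every UTF-8 byte.
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()
```

Decoding the same bytes with encoding="cp1252" can raise UnicodeDecodeError whenever the file contains a byte that cp1252 leaves undefined, such as 0x81.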

@jason-fries
Contributor

Hi @elsirdavid
Thanks for the detailed debugging information! Let's test a few things first (using the dev branch)

1. Can you confirm that the UMLS zip file isn't corrupted?

Test this via the command line md5 umls-2020AB-metathesaurus.zip --> 69d2929e0902e7e42af0b2cb74d5005a
or using the use_checksum flag in UMLS.init_from_nlm_zip(NLM_ZIPFILE_PATH, use_checksum=True)

2. Try creating a new conda env using the environment.yml file

You can init from scratch using conda env create -f environment.yml

If neither of these fix the UMLS issue we can dive deeper into debugging.

@DavidLikesLearning
Author

Hi @jason-fries (and Happy New Year!!)

Thank you for your help.

I couldn't use the md5 command from the command line. I did use the checksum suggested and used other code to get a md5 hash of the file.

The checksum was added inline; the hash is below the list of Python libraries in the environment. The UMLS code seems to have a problem with the declaration of the 'release' variable.

1_Installing_the_UMLS_md5_checksum.pdf

For the creation of a new environment, I used the 'requirements.txt' file as directed by the README. This manages to install some libraries but crashes when collecting scipy (error in preparing metadata regarding pyproject.toml).

troveDistUtilsFail

I installed msgpack, pandas by hand. The results were the same and are below:

1_Installing_the_UMLS-Copy-trove_env_md5_checksum.pdf

@jason-fries
Contributor

jason-fries commented Jan 3, 2023

Hi @elsirdavid

Two issues: (1) For your MD5 hash check, your provided code

import hashlib
md5 = hashlib.md5(b'umls-2020AB-metathesaurus.zip')
print(md5, '\n',md5.digest()) 

generates a hash of the string literal, not the contents of the UMLS zip file. You'll want to use

hashlib.md5(open("umls-2020AB-metathesaurus.zip", "rb").read()).hexdigest()

to generate a hash of the contents of the zip file. The above code snippet should return 69d2929e0902e7e42af0b2cb74d5005a for the 2020AB release. If you get a different number your file is corrupted and should be redownloaded from the NLM.
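
The one-liner above reads the whole zip into memory at once. A minimal sketch of the same check that reads the file in chunks is shown here; the helper names file_md5 and matches_expected are hypothetical, and the expected value is the 2020AB hash quoted above:

```python
import hashlib

# Hash quoted in this thread for the 2020AB release.
EXPECTED_MD5_2020AB = "69d2929e0902e7e42af0b2cb74d5005a"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5_2020AB):
    """Compare the file's MD5 with an expected hex digest, ignoring case."""
    return file_md5(path) == expected.lower()
```

Calling matches_expected("umls-2020AB-metathesaurus.zip") should then return True for an intact download, assuming the file sits in the working directory.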

(2) Trove is only tested with Python 3.7.x. From your PDF it looks like your environment is 3.9.7. If you create a fresh env using conda env create -f environment.yml it should install the correct Python version.
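
A quick way to confirm which interpreter a notebook kernel is actually running is to compare sys.version_info against the tested series. This is only a sketch; the function name is_tested_python is hypothetical, and the (3, 7) default comes from the statement above that Trove is tested with Python 3.7.x:

```python
import sys


def is_tested_python(version_info=None, tested=(3, 7)):
    """Return True if the major.minor version matches the tested series."""
    if version_info is None:
        # Default to the interpreter running this code.
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(tested)


if not is_tested_python():
    print("Running Python %d.%d; Trove is reported as tested on 3.7.x"
          % sys.version_info[:2])
```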

On my machine installing from the latest trove dev branch commit using a fresh conda env works, so let's see if any of the above are the source of your issues.

Also make certain to wipe your temp directory (~/.trove/umls2022AB in your code) if the installation of the UMLS bombs out.
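
Clearing that directory between attempts can be scripted. The sketch below assumes the cache lives at the path mentioned above; the function name wipe_umls_cache is hypothetical and not part of trove:

```python
import shutil
from pathlib import Path


def wipe_umls_cache(cache_dir="~/.trove/umls2022AB"):
    """Delete a partially built UMLS cache directory if it exists.

    Returns True if a directory was removed, False if there was nothing to do.
    """
    path = Path(cache_dir).expanduser()  # resolve the leading "~"
    if path.is_dir():
        shutil.rmtree(path)  # removes the directory and everything under it
        return True
    return False
```

Because shutil.rmtree deletes recursively and without confirmation, it is worth double-checking the path before running it.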

@DavidLikesLearning
Author

Could you point me to that environment.yml file? I can't find it in the GitHub repo or any of the folders I've searched. The README from trove suggests using requirements.txt but, as I mentioned earlier, that fails too. I'm not certain how to make this environment, then.

@DavidLikesLearning
Author

Also, thanks for fixing my hash code. The file is indeed not corrupted; I do get the right hash, thankfully.

@DavidLikesLearning
Author

Thank you for the idea of changing branches. I have now tried to use the relevant yml file. The creation fails with the output in the included txt file. I am going to try to install the relevant libraries and Python version by hand.
create_env.txt

@DavidLikesLearning
Author

I ended up installing Python 3.7, msgpack, and pandas as the yml file directed, and the resulting notebook is here:
1_Installing_the_UMLS_013123.pdf
