Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread cr3 file metadata ETL #1436

Merged
merged 13 commits into from
May 2, 2024
Merged

Thread cr3 file metadata ETL #1436

merged 13 commits into from
May 2, 2024

Conversation

frankhereford
Copy link
Member

@frankhereford frankhereford commented Apr 15, 2024

Associated issues

This PR aims to close cityofaustin/atd-data-tech#16746.

Testing

Local testing for this one.

  • Spin up local DB and replicate production data down onto it.
  • Create a env file or get one from me
  • docker compose build
  • docker compose run metadata
  • ./populate_cr3_file_metadata.py -s 100 for example. Try -h to explore the options.

Ship list

  • Check migrations for any conflicts with latest migrations in master branch
  • Confirm Hasura role permissions for necessary access
  • Code reviewed
  • Product manager approved

Copy link
Member Author

@frankhereford frankhereford Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tag-along change. We need to, non-urgently, remove the deprecated version designation from our compose files. The same goes for the other two compose files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's where you would get these values: https://github.com/cityofaustin/atd-airflow/blob/production/dags/vz_populate_cr3_metadata.py#L20-L45. Also, you can point it at your local stack with something like this for the endpoint:

HASURA_ENDPOINT=http://host.docker.internal:8084/v1/graphql

NB: It takes a little extra fiddling to work around that magic DNS name on linux. I have no idea why Docker desktop provides it but good ol' normal docker does not. 🤯

Comment on lines +20 to +24
parser = argparse.ArgumentParser(description='Process some CR3s for metadata')
parser.add_argument('-t', '--threads', type=int, default=25, help='Number of threads')
parser.add_argument('-v', '--verbose', action='store_true', help='Verbose output')
parser.add_argument('-s', '--batch-size', type=int, default=os.getenv('PDF_MAX_RECORDS', 100), help='Batch size')
parser.add_argument('-a', '--all', action='store_true', help='Process all records')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming arguments and picking the right ones is always a challenge, but I like these OK. In airflow, we'll run this like:

populate_cr3_file_metadata.py -v 

And while we're catching up manually on our local machines, something like:

./populate_cr3_file_metadata.py -a

futures.append(executor.submit(process_record, crash_id=crash_id, args=args))

if not args.verbose:
for future in tqdm(futures, total=todo_count):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✨ fancy progress bar. this library is a gem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL!

Copy link
Contributor

@mddilley mddilley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome to see the threading spread to another script! 🧵 I took a quick look and added so small comments about docs. I know this process is screaming along right now so I'll take a closer look at the code this afternoon.

atd-etl/populate_cr3_file_metadata/.gitignore Show resolved Hide resolved
futures.append(executor.submit(process_record, crash_id=crash_id, args=args))

if not args.verbose:
for future in tqdm(futures, total=todo_count):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL!

atd-etl/populate_cr3_file_metadata/docker-compose.yaml Outdated Show resolved Hide resolved
@frankhereford
Copy link
Member Author

@mddilley wrote:

Could you add the tiniest of READMEs with this command for the next person who needs to run this locally?

Great idea, you bet! How's this look? https://github.com/cityofaustin/atd-vz-data/tree/thread-cr3-file-metadata/atd-etl/populate_cr3_file_metadata

0510864

Copy link
Contributor

@mddilley mddilley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this written up so fast! 🏎️ The readme looks great, too. I didn't test locally since this has been run in production and proven itself. 🚢 🙏

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +193 to +194
if args.verbose:
print(f"Processing crash_id: {crash_id}")
Copy link
Contributor

@mddilley mddilley Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a showstopper but I like john's comment on the other threading PR about introducing logging levels to control verbosity. I think it would be a good goal for us to do that when possible and could totally be a spin-off since this PR was aimed at getting this up and running to get caught up.

Copy link
Member Author

@frankhereford frankhereford Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea i totally like the idea of returning to these scripts and giving them a once-over in terms of logging. It'd be good, I think, to chat as a group about what we want to see in our logging and then, with that shared goal, come back through all the VZ etls and make them similar where we can. Adopt argparse, logging, threading, and docker compose for local dev 🤔 . @johnclary - I'm curious how you feel about that - maybe something we can put on the slow cooker in the backlog for when we're caught up?

Copy link
Member

@johnclary johnclary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frankhereford looks great to me. would you PR the needed Airflow changes against this same issue: cityofaustin/atd-data-tech#16746?

🚢 🚢 🚢 🚢 🚢

@frankhereford frankhereford merged commit 4b689f5 into master May 2, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enhancement] Increase the speed of the CR3 file metadata populator
3 participants