-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thread cr3 file metadata ETL #1436
Conversation
docker-compose-docker-volume.yml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tag-along change. We need to, non-urgently, remove the deprecated version
designation from our compose files. The same goes for the other two compose files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's where you would get these values: https://github.com/cityofaustin/atd-airflow/blob/production/dags/vz_populate_cr3_metadata.py#L20-L45. Also, you can point it at your local stack with something like this for the endpoint:
HASURA_ENDPOINT=http://host.docker.internal:8084/v1/graphql
NB: It takes a little extra fiddling to work around that magic DNS name on linux. I have no idea why Docker desktop provides it but good ol' normal docker does not. 🤯
parser = argparse.ArgumentParser(description='Process some CR3s for metadata') | ||
parser.add_argument('-t', '--threads', type=int, default=25, help='Number of threads') | ||
parser.add_argument('-v', '--verbose', action='store_true', help='Verbose output') | ||
parser.add_argument('-s', '--batch-size', type=int, default=os.getenv('PDF_MAX_RECORDS', 100), help='Batch size') | ||
parser.add_argument('-a', '--all', action='store_true', help='Process all records') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming arguments and picking the right ones is always a challenge, but I like these OK. In airflow, we'll run this like:
populate_cr3_file_metadata.py -v
And while we're catching up manually on our local machines, something like:
./populate_cr3_file_metadata.py -a
futures.append(executor.submit(process_record, crash_id=crash_id, args=args)) | ||
|
||
if not args.verbose: | ||
for future in tqdm(futures, total=todo_count): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✨ fancy progress bar. this library is a gem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome to see the threading spread to another script! 🧵 I took a quick look and added so small comments about docs. I know this process is screaming along right now so I'll take a closer look at the code this afternoon.
futures.append(executor.submit(process_record, crash_id=crash_id, args=args)) | ||
|
||
if not args.verbose: | ||
for future in tqdm(futures, total=todo_count): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL!
@mddilley wrote:
Great idea, you bet! How's this look? https://github.com/cityofaustin/atd-vz-data/tree/thread-cr3-file-metadata/atd-etl/populate_cr3_file_metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this written up so fast! 🏎️ The readme looks great, too. I didn't test locally since this has been run in production and proven itself. 🚢 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✨
if args.verbose: | ||
print(f"Processing crash_id: {crash_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a showstopper but I like john's comment on the other threading PR about introducing logging levels to control verbosity. I think it would be a good goal for us to do that when possible and could totally be a spin-off since this PR was aimed at getting this up and running to get caught up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea i totally like the idea of returning to these scripts and giving them a once-over in terms of logging. It'd be good, I think, to chat as a group about what we want to see in our logging and then, with that shared goal, come back through all the VZ etls and make them similar where we can. Adopt argparse
, logging
, threading, and docker compose
for local dev 🤔 . @johnclary - I'm curious how you feel about that - maybe something we can put on the slow cooker in the backlog for when we're caught up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@frankhereford looks great to me. would you PR the needed Airflow changes against this same issue: cityofaustin/atd-data-tech#16746?
🚢 🚢 🚢 🚢 🚢
Associated issues
This PR aims to close cityofaustin/atd-data-tech#16746.
Testing
Local testing for this one.
env
file or get one from medocker compose build
docker compose run metadata
./populate_cr3_file_metadata.py -s 100
for example. Try-h
to explore the options.Ship list
Check migrations for any conflicts with latest migrations in master branchConfirm Hasura role permissions for necessary access