Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CRIS import script to use S3 and log activities #1445

Merged
merged 52 commits into from
May 14, 2024

Conversation

frankhereford
Copy link
Member

@frankhereford frankhereford commented May 6, 2024

Associated issues

This PR aims to advance the following issue: cityofaustin/atd-data-tech#16953.

It completes the second bullet point, and on merger, allows the final two points to become possible.

Mechanics

There is a S3 bucket named vision-zero-cris-exports. It has versioning turned on, so we can always recover a lost or changed file, if we were to need to -- at least for 180 days. After 180 days, things start to get removed due to lifecycle rules in place.

In the bucket, you'll find 3 folders, one for development, staging and production. Within each of those, you'll see an inbox, where CRIS will upload extracts, and a processed folder, where we will keep our, well, processed extracts.

Additionally, you'll see a new table in the DB called cris_import_log, which will keep some logs of the import events, one record per processing attempt per file.

Testing

This command has been real useful in development, and I imagine it may be useful in testing. It assumes you have the aws command set up with your credentials.

aws s3 mv s3://vision-zero-cris-exports/development/processed/extract_2023_20240417155400400_99726_20240401-20240405_HAYSTRAVISWILLIAMSON.zip s3://vision-zero-cris-exports/development/inbox/                                                                                

Local testing for this one. To get going, spin up your local stack to where you have at least a DB and a graphql-engine endpoint running. ⚠️ Be sure to apply the new migration that comes along with this PR.

cd atd-etl/cris_import;
docker compose build --no-cache;

There's a file in the processed folder for development, and you can use the above to move it into the inbox or via the S3 console works just fine too.

docker compose run cris-import;
  • Did it run?

  • ⚠️ Don't forget to rebuild your cris import docker image after you switch back to master if you're intending to use it for something other than testing this work. 🐳


Ship list

  • Check migrations for any conflicts with latest migrations in master branch
  • Confirm Hasura role permissions for necessary access
  • Code reviewed
  • Product manager approved

@@ -727,13 +777,12 @@ def group_csvs_into_logical_groups(extracted_archives, dry_run):
files = os.listdir(str(extracted_archives))
logical_groups = []
for file in files:
if file.endswith(".xml"):
if file.endswith(".xml") or file == "crashReports":
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracts can have a folder in them of PDFs now, and we don't want to treat the folder with the code for CSVs in this block, so this skips over that folder.

Comment on lines -1 to -2
version: '3'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tag along change; docker compose not docker-compose, no longer need version designation. Same as many other PRs. LMK if you have any questions on this.

@@ -5,7 +5,7 @@ RUN ln -snf /usr/share/zoneinfo/America/Chicago /etc/localtime && echo America/C
RUN apt-get update && apt-get install -y tzdata

RUN apt-get -y upgrade
RUN apt-get install -y aptitude magic-wormhole vim black python3-pip
RUN apt-get install -y aptitude magic-wormhole vim black python3-pip libmagic1 file
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using libmagic to check that the zips are zips through inspection of the binary file.

Comment on lines -238 to +309
key_directory + "/id_ed25519", SFTP_ENDPOINT_SSH_PRIVATE_KEY + "\n"
key_directory + "/id_ed25519", DB_BASTION_HOST_SSH_PRIVATE_KEY + "\n"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a real confusing part of this diff. It's managed to intermingle two parts of the program that did wildly different things:

  • Connecting to the sftp host to rsync files down
  • Connect to the bastion to tunnel into the DB server

It's bonkers how it diff's like this.

object_name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ DEFAULT NULL,
records_processed JSONB DEFAULT '{}',
Copy link
Member Author

@frankhereford frankhereford May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is really Frank writing this. I think this is a good place for a JSON blob in lieu of a one-to-many table FK relationship because I want the log to fit neatly in one table. I'm seeing this field getting populated with a dict that is keyed by CRIS schema year and the value being the number of crashes which we processed, or, /at absolute most/ an array of crash_ids that were processed. I think if we develop that metadata to be any more complex, we should bail out of the JSONB field and switch to proper DB tables with normal, normal relationships.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 i think this is a cool idea and i'm fine w/ leaving it here as a placeholder. like we've talked about a bunch of times now, i like your motivation to leave these breadcrumbs for us to help with debugging 🕵️

created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ DEFAULT NULL,
records_processed JSONB DEFAULT '{}',
outcome_status TEXT NULL
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This column is intended to hold a short, textual message that indicates how the program finished. It will default to something like "nominal completion," and for all error states, each one can have a text string that goes into here that's a breadcrumb for the dev who is trying to figure out what went wrong and why. The breadcrumb strings can come from the string argument of the raise() function if we stick in some sort of error handling. I'm not 100% i know exactly how this goes down in python, but I'm pretty hopeful that it's possible.

@frankhereford frankhereford requested a review from johnclary May 9, 2024 04:27
@frankhereford
Copy link
Member Author

@johnclary, thank you /so much/ for all of your thoughtful feedback; I appreciate your ability to step back and see the big picture.

No rush on any sort of review here. I am going to baby any CRIS extracts we get (or have to get via downloading) though for the time being until we're ready to roll this out. It's really no trouble at all, esp after working on this -- for a hot second, I have all of the moving parts under control.

Copy link
Member

@johnclary johnclary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frankhereford i was able to run the script following your instructions, with just a few minor things:

  • don't forget to apply migrations after you start the db ;)
  • one line of code needs to be de-indented

besides that—it would be nice to have a few comments in the db migrations. awesome work—thank you!

@@ -40,6 +40,7 @@ def insert_crash_change_template(new_record_dict, differences, crash_id):
new_record_escaped = json.dumps(json.dumps(new_record_dict), default=str)
# Build the template and inject required values
output = (
# ! fixme: there is a misspelling of uuid in the following mutation & in the DB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record_uqid - funny! indeed this will be fixed soon 😅

object_name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ DEFAULT NULL,
records_processed JSONB DEFAULT '{}',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 i think this is a cool idea and i'm fine w/ leaving it here as a placeholder. like we've talked about a bunch of times now, i like your motivation to leave these breadcrumbs for us to help with debugging 🕵️

atd-etl/cris_import/cris_import.py Outdated Show resolved Hide resolved
@frankhereford frankhereford requested a review from johnclary May 9, 2024 16:37
@@ -77,7 +77,7 @@ def no_override_columns():
"prsn_nbr", # key column
"prsn_type_id", # key column (but why?)
"prsn_occpnt_pos_id", # key column (but why?)
"injury_severity",
"prsn_injry_sev_id",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was an issue we identified which is unrelated to this PR. i'm going to start spin out a separate task of investing whether or not we've overwritten the prsn_injry_sev_id in cases where we had not intended to 😵

Copy link
Member

@johnclary johnclary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great stuff—ship it! 🙏 🚢

Copy link
Contributor

@mddilley mddilley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this locally and see the log, S3 objects, and import all working just like the code says it does. The local dev experience is so slick now!!! I really like the bucket path structure and how straightforward it is to see the flow. It is going to be so nice when we sunset the SFTP server 👋 - thanks for getting us there! 🙌 🚀

@frankhereford frankhereford merged commit a0a09d3 into master May 14, 2024
10 checks passed
@frankhereford frankhereford deleted the use-s3-for-cris-delivery branch May 14, 2024 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants