Refactor CRIS import script to use S3 and log activities #1445
Conversation
@@ -727,13 +777,12 @@ def group_csvs_into_logical_groups(extracted_archives, dry_run):
     files = os.listdir(str(extracted_archives))
     logical_groups = []
     for file in files:
-        if file.endswith(".xml"):
+        if file.endswith(".xml") or file == "crashReports":
Extracts can now have a folder of PDFs in them, and we don't want to run the CSV-handling code in this block on that folder, so this skips over it.
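The skip logic being discussed might be sketched like this (a simplified, hypothetical version of the function in the diff; the real one also handles dry_run and grouping):

```python
import os

def group_files_into_logical_groups(extracted_archives):
    # Hypothetical sketch: extracts can now contain a "crashReports" folder
    # of PDFs, and the CSV-grouping code must not process that folder, so it
    # gets skipped along with the XML manifest files.
    logical_groups = []
    for file in os.listdir(str(extracted_archives)):
        if file.endswith(".xml") or file == "crashReports":
            continue  # skip non-CSV entries
        logical_groups.append(file)
    return sorted(logical_groups)
```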
-version: '3'
Tag-along change: docker compose (not docker-compose) no longer needs the version designation. Same as many other PRs. LMK if you have any questions on this.
@@ -5,7 +5,7 @@ RUN ln -snf /usr/share/zoneinfo/America/Chicago /etc/localtime && echo America/C
 RUN apt-get update && apt-get install -y tzdata

 RUN apt-get -y upgrade
-RUN apt-get install -y aptitude magic-wormhole vim black python3-pip
+RUN apt-get install -y aptitude magic-wormhole vim black python3-pip libmagic1 file
We're using libmagic to check that the zips are actually zips through inspection of the binary file.
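A stdlib-only sketch of the same idea libmagic applies to ZIP archives (this is an illustration of the check, not the PR's actual implementation):

```python
def looks_like_zip(data: bytes) -> bool:
    # Real ZIP files start with the magic bytes "PK" followed by a header
    # marker: "\x03\x04" for a local file header, "\x05\x06" for an empty
    # archive's end-of-central-directory record, "\x07\x08" for spanned archives.
    return data[:4] in (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")
```

libmagic does a more thorough inspection than this, but the first four bytes are the core of the fingerprint.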
-    key_directory + "/id_ed25519", SFTP_ENDPOINT_SSH_PRIVATE_KEY + "\n"
+    key_directory + "/id_ed25519", DB_BASTION_HOST_SSH_PRIVATE_KEY + "\n"
This is a really confusing part of this diff. It's managed to intermingle two parts of the program that did wildly different things:
- connecting to the SFTP host to rsync files down
- connecting to the bastion to tunnel into the DB server

It's bonkers how it diffs like this.
    object_name TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ DEFAULT NULL,
    records_processed JSONB DEFAULT '{}',
Yes, this is really Frank writing this. I think this is a good place for a JSON blob in lieu of a one-to-many table FK relationship, because I want the log to fit neatly in one table. I'm seeing this field getting populated with a dict that is keyed by CRIS schema year, with the value being the number of crashes we processed, or, /at absolute most/, an array of the crash_ids that were processed. I think if we develop that metadata to be any more complex, we should bail out of the JSONB field and switch to proper DB tables with normal, normal relationships.
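The two shapes described for records_processed might look like this (counts and ids here are illustrative, not real data):

```python
# Keyed by CRIS schema year, value = number of crashes processed:
records_processed = {"2023": 1042, "2024": 517}

# Or, at absolute most, the crash_ids that were processed:
records_processed_max = {"2023": [19110001, 19110002]}
```

Either shape stores cleanly in a JSONB column with the `'{}'` default shown in the migration.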
👍 i think this is a cool idea and i'm fine w/ leaving it here as a placeholder. like we've talked about a bunch of times now, i like your motivation to leave these breadcrumbs for us to help with debugging 🕵️
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ DEFAULT NULL,
    records_processed JSONB DEFAULT '{}',
    outcome_status TEXT NULL
This column is intended to hold a short, textual message that indicates how the program finished. It will default to something like "nominal completion," and each error state can have a text string that goes in here as a breadcrumb for the dev who is trying to figure out what went wrong and why. The breadcrumb strings can come from the string argument of the raised exception if we stick in some sort of error handling. I'm not 100% sure I know exactly how this goes down in Python, but I'm pretty hopeful that it's possible.
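It is possible: the string passed to an exception when it is raised can be recovered with str(e) in a handler. A minimal sketch (the function and message here are hypothetical, not from the PR):

```python
def run_import_step():
    # Hypothetical failing step; the message given to raise becomes the breadcrumb.
    raise RuntimeError("extract failed zip inspection")

outcome_status = "nominal completion"  # the default described above
try:
    run_import_step()
except Exception as e:
    # str(e) returns the string argument the exception was raised with,
    # ready to be written to the outcome_status column.
    outcome_status = str(e)
```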
@johnclary, thank you /so much/ for all of your thoughtful feedback; I appreciate your ability to step back and see the big picture. No rush on any sort of review here. I am going to baby any CRIS extracts we get (or have to get via
@frankhereford i was able to run the script following your instructions, with just a few minor things:
- don't forget to apply migrations after you start the db ;)
- one line of code needs to be de-indented
besides that—it would be nice to have a few comments in the db migrations. awesome work—thank you!
@@ -40,6 +40,7 @@ def insert_crash_change_template(new_record_dict, differences, crash_id):
     new_record_escaped = json.dumps(json.dumps(new_record_dict), default=str)
     # Build the template and inject required values
     output = (
+        # ! fixme: there is a misspelling of uuid in the following mutation & in the DB
record_uqid - funny! indeed this will be fixed soon 😅
@@ -77,7 +77,7 @@ def no_override_columns():
     "prsn_nbr", # key column
     "prsn_type_id", # key column (but why?)
     "prsn_occpnt_pos_id", # key column (but why?)
-    "injury_severity",
+    "prsn_injry_sev_id",
this was an issue we identified which is unrelated to this PR. i'm going to spin out a separate task of investigating whether or not we've overwritten the prsn_injry_sev_id in cases where we had not intended to 😵
great stuff—ship it! 🙏 🚢
I tested this locally and see the log, S3 objects, and import all working just like the code says it does. The local dev experience is so slick now!!! I really like the bucket path structure and how straightforward it is to see the flow. It is going to be so nice when we sunset the SFTP server 👋 - thanks for getting us there! 🙌 🚀
Associated issues
This PR aims to advance the following issue: cityofaustin/atd-data-tech#16953.
It completes the second bullet point and, once merged, makes the final two points possible.
Mechanics
There is an S3 bucket named vision-zero-cris-exports. It has versioning turned on, so we can always recover a lost or changed file if we need to, at least for 180 days. After 180 days, things start to get removed due to the lifecycle rules in place.

In the bucket, you'll find three folders, one each for development, staging, and production. Within each of those, you'll see an inbox, where CRIS will upload extracts, and a processed folder, where we will keep our, well, processed extracts.

Additionally, you'll see a new table in the DB called cris_import_log, which will keep some logs of the import events, one record per processing attempt per file.

Testing
This command has been real useful in development, and I imagine it may be useful in testing. It assumes you have the aws command set up with your credentials.

Local testing for this one. To get going, spin up your local stack to where you have at least a DB and a graphql-engine endpoint running. ⚠️ Be sure to apply the new migration that comes along with this PR.

There's a file in the processed folder for development, and you can use the above to move it into the inbox, or the S3 console works just fine too.
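The file move might look something like this (the bucket and folder layout come from the Mechanics section; the extract file name is illustrative, not a real key):

```shell
# Illustrative only: move a previously processed extract back into the
# development inbox so the import script will pick it up again.
aws s3 mv \
  "s3://vision-zero-cris-exports/development/processed/extract.zip" \
  "s3://vision-zero-cris-exports/development/inbox/extract.zip"
```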
docker compose run cris-import;
Did it run?

Check out master if you're intending to use it for something other than testing this work. 🐳

Ship list