Preserve old CR3 metadata when CRIS updates a crash #1427

Closed
2 changes: 1 addition & 1 deletion atd-etl/cris_import/lib/sql.py
@@ -25,7 +25,7 @@ def get_pgfutter_path():

def invalidate_cr3(pg, crash_id):
invalidate_cr3_sql = f"""UPDATE public.atd_txdot_crashes
- SET cr3_stored_flag = 'N', cr3_file_metadata = null, cr3_ocr_extraction_date = null
+ SET cr3_stored_flag = 'N'
Contributor

@mddilley, Apr 8, 2024

I'm going to pick this back up later, so I wanted to drop some notes. Right now I'm finding it hard to trace the dependencies between the CRIS import, CR3 download, narrative extract, and populate CR3 metadata scripts, and to picture the entire flow.

I'm looking at the query that the CR3 metadata script uses and the one that the OCR extract script uses. I'll take a closer look later, but I wanted to share in case anyone else is looking at this.

Member

@mddilley - thanks for these great questions.

Here's a little bit of info that might be helpful, at least in terms of the OCR script. Back when the OCR script was first put into place, I was behind a little bit of questionable database design. The cr3_ocr_extraction_date field is a nullable timestamp that does double duty: when set, it records the most recent time that OCR was completed on the CR3, and when null, it means that OCR is required / requested for the CR3.

It's the implied meaning of the null value that doesn't jump off the page for me, and I wish now that I had not made that choice. It's not entirely wrong, but it's not great either.

I certainly don't think that I've answered Mike's questions here, but it was a little bit of history that might be helpful.
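
To make the double meaning of the null value concrete, here is a minimal sketch of how a script might pick the CR3s that still need OCR. It is illustrative only: the actual query used by the OCR script is not shown in this thread, the demo uses an in-memory SQLite table rather than the real Postgres database, and the helper names `make_demo_db` and `crashes_needing_ocr` are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for atd_txdot_crashes (assumed columns only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE atd_txdot_crashes (
               crash_id INTEGER PRIMARY KEY,
               cr3_stored_flag TEXT,
               cr3_ocr_extraction_date TEXT
           )"""
    )
    conn.executemany(
        "INSERT INTO atd_txdot_crashes VALUES (?, ?, ?)",
        [
            (1, "Y", "2024-04-01T12:00:00"),  # stored and already OCR'd
            (2, "Y", None),                   # stored, null date: OCR requested
            (3, "N", None),                   # not stored yet, nothing to OCR
        ],
    )
    return conn


def crashes_needing_ocr(conn):
    """Return crash IDs whose CR3 is stored but whose OCR date is null."""
    rows = conn.execute(
        """SELECT crash_id
           FROM atd_txdot_crashes
           WHERE cr3_stored_flag = 'Y'
             AND cr3_ocr_extraction_date IS NULL
           ORDER BY crash_id"""
    ).fetchall()
    return [row[0] for row in rows]
```

Under these assumptions, only the crash with a stored CR3 and a null extraction date is selected, which is why clearing the date acts as a request to run OCR again.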

Member Author

ugh, thanks @mddilley for mentioning those other scripts. i had only looked at the CR3 download query and overlooked the obvious way that these other ETLs depend on this field. so this PR as written would cause CR3s to never be re-OCR'd after they're downloaded anew.

i am going to cancel my review request and probably move this into the backlog. it feels like something that we can pick up again in the data model work.
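
The regression described above can be sketched with plain dictionaries standing in for a crash row. This is a simplified model, not the real ETL code: it assumes that the download step only flips cr3_stored_flag back to 'Y' and that the OCR step selects rows with a null cr3_ocr_extraction_date. The functions `download_cr3` and `needs_ocr` and the `clear_ocr_fields` parameter are hypothetical names used for illustration.

```python
def invalidate_cr3(crash, clear_ocr_fields):
    """Mark the CR3 as not stored; optionally also clear metadata and OCR date.

    clear_ocr_fields=True models the SQL before this PR,
    clear_ocr_fields=False models the SQL as changed by this PR.
    """
    updated = dict(crash, cr3_stored_flag="N")
    if clear_ocr_fields:
        updated["cr3_file_metadata"] = None
        updated["cr3_ocr_extraction_date"] = None
    return updated


def download_cr3(crash):
    """Assumed download step: only sets the stored flag back to 'Y'."""
    return dict(crash, cr3_stored_flag="Y")


def needs_ocr(crash):
    """Assumed OCR selection rule: stored CR3 with a null extraction date."""
    return (
        crash["cr3_stored_flag"] == "Y"
        and crash["cr3_ocr_extraction_date"] is None
    )


crash = {
    "crash_id": 1,
    "cr3_stored_flag": "Y",
    "cr3_file_metadata": '{"pages": 3}',
    "cr3_ocr_extraction_date": "2024-04-01T12:00:00",
}

# Before the PR: invalidation clears the date, so the new CR3 is re-OCR'd.
before_pr = download_cr3(invalidate_cr3(crash, clear_ocr_fields=True))

# With the PR: the stale date survives, so the new CR3 is never picked up.
with_pr = download_cr3(invalidate_cr3(crash, clear_ocr_fields=False))
```

In this model `needs_ocr(before_pr)` is true and `needs_ocr(with_pr)` is false, which matches the concern that preserving the old fields would stop freshly downloaded CR3s from being re-OCR'd.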

Contributor

nah, i don't think that it is obvious at all! thanks for thinking through this, y'all! (i was way too distracted w/ the eclipse yesterday). I do think that we could get the switches right if we put all our heads together, but, yep, agreed that it will be nice to take another swing at this in the near future!

WHERE crash_id = {crash_id}"""
cursor = pg.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cursor.execute(invalidate_cr3_sql)