Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve old CR3 metadata when CRIS updates a crash #1427

Closed
wants to merge 1 commit into from

Conversation

johnclary
Copy link
Member

@johnclary johnclary commented Apr 8, 2024

Associated issues

Testing

I'm not sure what the best way is to test the CRIS import ETL. Looking for guidance here :)


Ship list

  • Check migrations for any conflicts with latest migrations in master branch
  • Confirm Hasura role permissions for necessary access
  • Code reviewed
  • Product manager approved

Copy link
Member

@chiaberry chiaberry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't test but 🚢

@@ -25,7 +25,7 @@ def get_pgfutter_path():

def invalidate_cr3(pg, crash_id):
invalidate_cr3_sql = f"""UPDATE public.atd_txdot_crashes
SET cr3_stored_flag = 'N', cr3_file_metadata = null, cr3_ocr_extraction_date = null
SET cr3_stored_flag = 'N'
Copy link
Contributor

@mddilley mddilley Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm going to pick back up on this later so I wanted to drop some notes because I'm finding it hard to trace the dependencies between the cris import, cr3 download, narrative extract, and populate CR3 metadata scripts right now in addition to picturing the entire flow.

I'm looking at the query that the CR3 metadata script uses and the one that the OCR extract one uses. I'll take a closer look at this later but wanted to share in case anyone else is looking at this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mddilley - thanks for these great questions.

Here's a little bit of info that might be helpful, at least in terms of the OCR script. Back when the OCR script was first put into place, I was behind a little bit of questionable database design. The cr3_ocr_extraction_date field is a nullable timestamp, and it's intended to be both the most recent time that OCR was completed on the CR3 and, if null, to mean that OCR is required / requested for the CR3.

It's the implied meaning of the null value that doesn't jump off the page for me, and I wish I had not opted for that choice now, but it's not entirely wrong either ... it's not great.

I certainly don't think that I've answered Mike's questions here, but it was a little bit of history that might be helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, thanks @mddilley for mentioned those other scripts. i had only looked at the CR3 download query, but overlooked the obvious way that these other ETLs depend on this field. so this PR as written will have the effect of causing CR3s to never be re-OCR'd after they're downloaded anew.

i am going to cancel my review request and probably move this into the backlog. it feels like something that we can pick up again in the data model work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, i don't think that it is obvious at all! thanks for thinking through this, y'all! (i was way too distracted w/ the eclipse yesterday). I do think that we could get the switches right if we put all our heads together, but, yep, agreed that it will be nice to take another swing at this in the near future!

@frankhereford
Copy link
Member

frankhereford commented Apr 9, 2024

Hi John, thank you for looking into how we can limit the number of CR3s we have to download. It's a bunch, no question.

How does this know when a crash has been meaningfully updated? I totally get that a huge number of the crashes that we're receiving right now are due to reprocessing, but there are still the normal volume of crashes which are getting normal updates.

Never mind! I had the wrong idea about the intent of this PR and my question above was 100% ill-informed. When I reviewed the issue itself, this all came together for me. Thank you for working to keep the CR3s available to the end user, which is 100% a better behavior.

Here's the query that the downloader uses, and all that really matters is that the cr3_stored_flag is set to N which is happening just like it always has. This is 👍 in terms of this edit to this script.

Also, thank you to Mike for asking good questions to help ensure all the parts of the stack stay in concert with each other. 🙏

Thanks!

@johnclary johnclary closed this Apr 11, 2024
@johnclary johnclary deleted the jc-preserve-cr3 branch June 7, 2024 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants