Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm going to pick back up on this later so I wanted to drop some notes because I'm finding it hard to trace the dependencies between the cris import, cr3 download, narrative extract, and populate CR3 metadata scripts right now in addition to picturing the entire flow.
I'm looking at the query that the CR3 metadata script uses and the one that the OCR extract one uses. I'll take a closer look at this later but wanted to share in case anyone else is looking at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mddilley - thanks for these great questions.
Here's a little bit of info that might be helpful, at least in terms of the OCR script. Back when the OCR script was first put into place, I was behind a little bit of questionable database design. The
cr3_ocr_extraction_date
field is a nullable timestamp, and it's intended to be both the most recent time that OCR was completed on the CR3 and, if null, to mean that OCR is required / requested for the CR3.It's the implied meaning of the
null
value that doesn't jump off the page for me, and I wish I had not opted for that choice now, but it's not entirely wrong either ... it's not great.I certainly don't think that I've answered Mike's questions here, but it was a little bit of history that might be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh, thanks @mddilley for mentioned those other scripts. i had only looked at the CR3 download query, but overlooked the obvious way that these other ETLs depend on this field. so this PR as written will have the effect of causing CR3s to never be re-OCR'd after they're downloaded anew.
i am going to cancel my review request and probably move this into the backlog. it feels like something that we can pick up again in the data model work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, i don't think that it is obvious at all! thanks for thinking through this, y'all! (i was way too distracted w/ the eclipse yesterday). I do think that we could get the switches right if we put all our heads together, but, yep, agreed that it will be nice to take another swing at this in the near future!