-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(deposits): Reduce deposit fetch retry timout #2406
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2406 +/- ##
=======================================
Coverage 27.08% 27.08%
=======================================
Files 351 351
Lines 15543 15543
Branches 20 20
=======================================
Hits 4210 4210
Misses 11130 11130
Partials 203 203
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
@@ -31,7 +31,7 @@ import ( | |||
) | |||
|
|||
// defaultRetryInterval processes a deposit event. | |||
const defaultRetryInterval = 20 * time.Second | |||
const defaultRetryInterval = 2 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shotes q: should we include some metrics so understand how long this can take?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens on a separate independent goroutine in depositCatchupFetcher()
. That goroutine will just loop over and over every defaultRetryInterval
seconds and check for failed blocks. If there are any, then it will attempt to fetch them.
What metric should we measure that might impact this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just wondering what happen if the time it takes to the EL to return deposits grows to be comparable with the new defaultRetryInterval.
Not sure if this is realistic but in that case wouldn't we clog the EL with continuous retries?
I was thinking a metric around the time it takes to fetch the deposits could help understanding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. That's a good point to bring up, my gut feeling is that this should not be a problem.
- We are only sending retries if we are unable to fetch it the first time.
- For each request, the EL client fetches the corresponding payload for the blocknum, filters through the
events
that are emitted during that payload, and then returns them. - The fetcher isn't sending async requests every
defaultRetryInterval
seconds. It is synchronously requesting to fetch the deposits, so there should never be more than 1 request at a time.
Under expected circumstances, that should take very little amount of time. Even now, we expect it to take a small enough time that we can run it at the very end of FinalizeBlock
.
Will have to investigate more how this could impact the EL client while it is under heavy load, but I don't think the impact is high.
If a deposit is missed for whatever reason, the node needs to attempt to recover them as quickly as possible. IMO, the 20 second timeout for retrying deposit fetch is too long. This is reduced to check every 2 seconds. This could be made even shorter if needed.