fix(deposits): Reduce deposit fetch retry timeout #2406

Open
shotes wants to merge 1 commit into main
Conversation

@shotes (Contributor) commented Jan 24, 2025

If a deposit is missed for whatever reason, the node needs to attempt to recover it as quickly as possible. IMO, the 20-second timeout for retrying the deposit fetch is too long, so this PR reduces it to retry every 2 seconds. This could be made even shorter if needed.

@shotes requested a review from a team as a code owner on January 24, 2025 at 07:47

codecov bot commented Jan 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 27.08%. Comparing base (419cd66) to head (576c695).

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #2406   +/-   ##
=======================================
  Coverage   27.08%   27.08%           
=======================================
  Files         351      351           
  Lines       15543    15543           
  Branches       20       20           
=======================================
  Hits         4210     4210           
  Misses      11130    11130           
  Partials      203      203           
Files with missing lines       Coverage Δ
beacon/blockchain/deposit.go   0.00% <ø> (ø)

@nidhi-singh02 (Contributor) left a comment

lgtm!

@@ -31,7 +31,7 @@ import (
 )
 
 // defaultRetryInterval processes a deposit event.
-const defaultRetryInterval = 20 * time.Second
+const defaultRetryInterval = 2 * time.Second
Collaborator left a comment

@shotes q: should we include some metrics so we understand how long this can take?

Author (@shotes) replied

This happens on a separate, independent goroutine in depositCatchupFetcher(). That goroutine simply loops every defaultRetryInterval and checks for failed blocks; if there are any, it attempts to fetch them.

What metric should we measure that might inform this value?
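For reference, a minimal sketch of that loop shape, to make the discussion concrete. Only depositCatchupFetcher and defaultRetryInterval appear in the actual code; the fetcher struct, its callbacks, and the use of a ticker are assumptions for illustration.

```go
package deposit // illustrative package name, not the repo's

import (
	"context"
	"log/slog"
	"time"
)

// Matches the value introduced in this PR.
const defaultRetryInterval = 2 * time.Second

// fetcher is a placeholder for the service that owns the catch-up loop;
// the struct and its fields are assumptions, not the repo's API.
type fetcher struct {
	logger       *slog.Logger
	failedBlocks func() []uint64                             // block heights whose deposit fetch failed
	fetchBlock   func(ctx context.Context, num uint64) error // blocking fetch from the EL
}

// depositCatchupFetcher re-attempts missed deposit fetches every
// defaultRetryInterval until the context is cancelled.
func (f *fetcher) depositCatchupFetcher(ctx context.Context) {
	ticker := time.NewTicker(defaultRetryInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, num := range f.failedBlocks() {
				if err := f.fetchBlock(ctx, num); err != nil {
					f.logger.Warn("deposit catch-up fetch failed", "block", num, "err", err)
				}
			}
		}
	}
}
```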

Collaborator replied

I am just wondering what happens if the time it takes for the EL to return deposits grows to be comparable with the new defaultRetryInterval. Not sure if this is realistic, but in that case wouldn't we clog the EL with continuous retries? I was thinking a metric around the time it takes to fetch the deposits could help us understand this.
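For concreteness, such a metric could look roughly like the sketch below, reusing the placeholder names from the earlier sketch. recordDuration stands in for whatever metrics sink the node already has (e.g. a histogram); none of these names are the repo's actual API.

```go
// Sketch only: recordDuration is a stand-in for the node's metrics sink.
func (f *fetcher) fetchBlockTimed(
	ctx context.Context,
	num uint64,
	recordDuration func(time.Duration),
) error {
	start := time.Now()
	err := f.fetchBlock(ctx, num)
	elapsed := time.Since(start)

	// Feed e.g. a "deposit_fetch_duration" histogram.
	recordDuration(elapsed)

	// Surface the exact scenario raised above: a fetch whose duration
	// approaches the retry interval.
	if elapsed > defaultRetryInterval/2 {
		f.logger.Warn("deposit fetch slow relative to retry interval",
			"block", num, "elapsed", elapsed, "retry_interval", defaultRetryInterval)
	}
	return err
}
```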

Author (@shotes) replied

Ah I see. That's a good point to bring up; my gut feeling is that this should not be a problem.

  1. We only send retries if we were unable to fetch the deposits the first time.
  2. For each request, the EL client fetches the payload corresponding to the block number, filters the events emitted during that payload, and returns them.
  3. The fetcher isn't sending async requests every defaultRetryInterval; it synchronously requests the deposits, so there should never be more than one request in flight at a time.

Under expected circumstances, that should take very little time. Even now, we expect it to be fast enough that we can run it at the very end of FinalizeBlock.

I will have to investigate further how this could affect the EL client while it is under heavy load, but I don't think the impact is high.
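One related note: if the loop is driven by a time.Ticker (an assumption about the implementation), Go drops ticks while the receiver is busy, so a slow synchronous fetch throttles the retry rate rather than letting requests pile up. A standalone, illustrative example of that behavior:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const retryInterval = 100 * time.Millisecond // stand-in for defaultRetryInterval
	ticker := time.NewTicker(retryInterval)
	defer ticker.Stop()

	deadline := time.After(1 * time.Second)
	iterations := 0

	for {
		select {
		case <-deadline:
			// With a 350ms body and a 100ms interval this prints roughly 3-4
			// iterations, not 10: missed ticks are dropped, so slow synchronous
			// work never causes retries to pile up behind each other.
			fmt.Println("iterations:", iterations)
			return
		case <-ticker.C:
			iterations++
			time.Sleep(350 * time.Millisecond) // stand-in for a slow EL deposit fetch
		}
	}
}
```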

3 participants