Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix custodial peer assumption on lookup custody requests #6815

Merged
merged 2 commits into from
Jan 20, 2025

Conversation

dapplion
Copy link
Collaborator

Issue Addressed

Current unstable assumes that all peers to which we request columns by root, MUST have the columns in custody

// true = enforce max_requests are returned data_columns_by_root. We only issue requests
// for blocks after we know the block has data, and only request peers after they claim to
// have imported the block+columns and claim to be custodians
true,

This statement is incorrect, as we pick a random peer from the global set of peers. The result, is a lot of downscore events like

VerifyError(NotEnoughResponsesReturned { actual: 0 })

for innocent peers.

Proposed Changes

Provide the custody request with an up-to-date list of peers that claim to have imported the lookup block. If peers give us that signal we can expect and enforce that they MUST have the columns in custody. If we don't have any such peers, we draw from the global set, but don't enforce then to have the columns in custody.

Since we need share state between the custody request and the overall lookup request I introduced a Mutex. I don't see another way around it if we want the custody request to have the most up-to-date view of the peer set. With @michaelsproul we discussed using an immutable HashSet and clone it, but that would result in the custody request having a quickly stale view. Only when the overall custody request fails (after 3 retries) the peer set would update.

Consider the following sequence of events, if we don't use a Mutex:

  • Receive attestation for block A
  • Lookup for block A is created with a single peer
  • Custody request is created with a single peer
  • Receive attestations for block A from 50 other peers
  • (..) wait for custody request to fail 3 times (could take many seconds)
  • Re-create custody request with 51 peers

During a significant duration the custody request has a stale view, rendering this feature not that useful.

@dapplion dapplion added Networking das Data Availability Sampling labels Jan 17, 2025
@dapplion dapplion requested a review from jxs as a code owner January 17, 2025 05:18
@dapplion dapplion force-pushed the custody-lookup-peers branch from 4278402 to 4ab0678 Compare January 17, 2025 05:22
@jimmygchen jimmygchen self-requested a review January 17, 2025 06:05
@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Jan 17, 2025
jimmygchen added a commit to jimmygchen/lighthouse that referenced this pull request Jan 17, 2025
Squashed commit of the following:

commit 4ab0678
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Fri Jan 17 12:09:35 2025 +0700

    Fix custodial peer assumption on lookup custody requests
Copy link
Member

@jimmygchen jimmygchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice solution, logic is pretty clean 👍

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jan 20, 2025
@jimmygchen
Copy link
Member

@mergify queue

Copy link

mergify bot commented Jan 20, 2025

queue

🛑 The pull request has been removed from the queue default

Pull request #6815 has been dequeued by a dequeue command.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@jimmygchen
Copy link
Member

@mergify dequeue

Copy link

mergify bot commented Jan 20, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #6815 has been dequeued by a dequeue command

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

Copy link

mergify bot commented Jan 20, 2025

dequeue

✅ The pull request has been removed from the queue default

@jimmygchen
Copy link
Member

@mergify queue

Copy link

mergify bot commented Jan 20, 2025

queue

🛑 The pull request has been removed from the queue default

Pull request #6815 has been dequeued by a dequeue command.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@jimmygchen
Copy link
Member

@mergify requeue

Copy link

mergify bot commented Jan 20, 2025

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

Copy link

mergify bot commented Jan 20, 2025

queue

🛑 The pull request has been removed from the queue default

The merge conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

mergify bot added a commit that referenced this pull request Jan 20, 2025
Copy link

mergify bot commented Jan 20, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

@jimmygchen
Copy link
Member

@mergify requeue

Copy link

mergify bot commented Jan 20, 2025

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

Copy link

mergify bot commented Jan 20, 2025

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 7a0388e

mergify bot added a commit that referenced this pull request Jan 20, 2025
@mergify mergify bot merged commit 7a0388e into sigp:unstable Jan 20, 2025
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
das Data Availability Sampling Networking ready-for-merge This PR is ready to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants