
optimize tagging controller workqueue handling #1091

Merged Jan 23, 2025 (1 commit)

Conversation

@kmala (Member) commented Jan 16, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The tagging controller can add the same node to the workqueue multiple times, because the kubelet updates the node frequently and each update is enqueued even when the add event for that node has not been processed yet. The root cause is that the object we add to the workqueue is different for each event, so the workqueue can never deduplicate. When a large number of nodes is added at once, the queue can grow so large that it takes hours or days for all the events to be handled.
This PR removes the enqueue time and retry count from the workqueue item, since both are provided by default by the workqueue library, and makes the item a simple comparable struct so that multiple adds/updates of the same node result in only one item in the workqueue.
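
A minimal sketch (not the PR's exact code) of the deduplication this relies on: client-go's workqueue backs Add() with a set, so enqueueing an identical comparable value twice leaves a single item in the queue. The workItem fields below are illustrative.

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// workItem keeps only comparable fields (no func values, no timestamps),
// so two Adds for the same node and action collapse into one queue entry.
type workItem struct {
	name       string
	providerID string
	action     string
}

func main() {
	q := workqueue.New()
	defer q.ShutDown()

	item := workItem{name: "node-1", providerID: "aws:///us-west-2a/i-0abc", action: "add-tag"}
	q.Add(item)
	q.Add(item) // identical value: deduplicated by the queue's internal set

	fmt.Println(q.Len()) // 1
}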

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

cloudprovider_aws_tagging_controller_work_item_duration_seconds is no longer published by the controller; the same details can be fetched from the default workqueue metrics (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/metrics/prometheus/workqueue/metrics.go#L40).

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 16, 2025
@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 16, 2025
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 16, 2025
@dims (Member) commented Jan 16, 2025

/hold

give me a bit please to review this!

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2025
@dims (Member) commented Jan 16, 2025

/hold cancel

Looks good to me @kmala, but some questions inline.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2025
)

var register sync.Once

var (
	workItemDuration = metrics.NewHistogramVec(

Member: are we ok dropping this metric?

Contributor: We should have an equivalent metric from the workqueue package itself via:

_ "k8s.io/component-base/metrics/prometheus/workqueue" // enable prometheus provider for workqueue metrics

but please make sure we actually scrape that one @kmala !

Member Author: Workqueue metrics are available by default (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/metrics/prometheus/workqueue/metrics.go#L40), and I verified that the metrics endpoint of the CCM provides the tagging controller metrics.

Contributor: Yes we've had those metrics for a while because of the import above (you have to import it for these to be registered: https://github.com/kubernetes/kubernetes/blob/d06398aac3e378da0c95472cd39713998206e9ff/staging/src/k8s.io/component-base/metrics/prometheus/workqueue/metrics.go#L26)

But do we actually scrape those? I'm pretty sure we've been scraping this other metric that we're removing

Member Author: Yes, these metrics are available and I verified it by querying the metrics endpoint.

What do you mean by "scrape those"?
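
For reference, a rough sketch (assumed wiring, not this repo's code) of what makes those default workqueue metrics appear: the blank import registers the Prometheus provider, and the queue's name becomes the "name" label on metrics such as workqueue_depth and workqueue_queue_duration_seconds.

package tagging

import (
	"k8s.io/client-go/util/workqueue"

	// Registers the Prometheus provider for workqueue metrics; without
	// this blank import the queues record no metrics at all.
	_ "k8s.io/component-base/metrics/prometheus/workqueue"
)

// newTaggingQueue returns a rate-limited queue whose metrics are labeled
// name="tagging" (the queue name here is illustrative).
func newTaggingQueue() workqueue.RateLimitingInterface {
	return workqueue.NewNamedRateLimitingQueue(
		workqueue.DefaultControllerRateLimiter(), "tagging")
}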

for _, resource := range tc.resources {
	switch resource {
	case opt.Instance:
		err := tc.tagEc2Instance(node)
		v1node, err := tc.nodeInformer.Lister().Get(node.name)

Member: Do we want to try to capture all necessary information in taggingControllerNode itself to avoid this lookup?

Member: (in a follow up!)

Member Author: we need the lookup if we want to avoid having multiple items in the workqueue, since the labels can be different on different updates.

Contributor: The *Node should point to the same struct though, right? the watcher/lister should only have one copy of the Node in memory
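
A sketch of the pattern under discussion, with hypothetical names: the queue item carries only the node's identity, and the full object is resolved from the informer cache at processing time. The lister returns a pointer into the shared cache rather than a copy, so the returned node must be treated as read-only.

package tagging

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	v1listers "k8s.io/client-go/listers/core/v1"
)

// Controller and workItem stand in for the PR's actual types.
type Controller struct {
	nodeLister v1listers.NodeLister
}

type workItem struct {
	name   string
	action string
}

// process looks the node up at work time, so the queue item stays small
// and comparable while the labels read here are always current.
func (tc *Controller) process(item workItem) error {
	node, err := tc.nodeLister.Get(item.name)
	if err != nil {
		if apierrors.IsNotFound(err) {
			return nil // node was deleted while the item sat in the queue
		}
		return err
	}
	_ = node // ... read labels/providerID from node without mutating it ...
	return nil
}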

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 16, 2025
Comment on lines +384 to +389

	// if the struct's fields are all comparable, the workqueue add will make sure
	// multiple adds of the same object result in only one item in the workqueue.
	item := workItem{
		name:       node.GetName(),
		providerID: node.Spec.ProviderID,
		action:     action,
	}

Contributor: The action func wasn't comparable, so this change allows multiple workitems for the same node, with different actions. Is that desirable? Why change action to a string?

Contributor (@ndbaker1, Jan 17, 2025): I'm thinking the same. Seems we're relying on queuing-order to determine the final state which boils down to a last-write-wins situation in the workqueue.

Member Author: "so this change allows multiple workitems for the same node, with different actions. Is that desirable?"

Before, we were adding a work item on every update if the add hadn't been processed yet, and there could be a work item for the delete as well. With this change we would have at most 2 work items in the queue, which matches the previous behavior, since one add-tag work item and multiple add-tag work items amount to the same thing in the end.

The action was changed to a string so that it is comparable, which prevents duplicate updates from adding another item to the workqueue.
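
A standalone illustration of the comparability point (hypothetical types, not the PR's): the workqueue stores items in a map-backed set, so item types must be hashable. A func field makes a struct non-comparable, while a string action hashes fine.

package main

import "fmt"

type withFunc struct {
	name   string
	action func() error // func fields make the struct non-comparable
}

type withString struct {
	name   string
	action string // comparable: safe to use as a set/map key
}

func main() {
	seen := map[withString]bool{}
	a := withString{name: "node-1", action: "add-tag"}
	seen[a] = true
	seen[a] = true         // same key, still one entry
	fmt.Println(len(seen)) // 1

	// A struct with a func field can only sneak into a map behind an
	// interface, and then it panics: "hash of unhashable type".
	defer func() { fmt.Println("recovered:", recover()) }()
	unsafeSet := map[interface{}]bool{}
	unsafeSet[withFunc{name: "node-1"}] = true // panics at runtime
}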

Comment on lines -49 to -50

	requeuingCount int
	enqueueTime    time.Time

Contributor: So these 2 fields were the crux of the issue, right? If a workItem already existed in the queue, its enqueueTime would always be different, even if its requeuingCount happened to match (0).

Member Author: yes, mostly

Correct me if I'm wrong: requeuingCount looks like it enforces a maximum number of retries. Are we good to remove it? Is there an option to keep those fields and only compare the node identity fields on enqueue?

Member Author: We can get the requeue count from the workqueue itself, and I changed the code to use that, so the functionality remains the same as before.
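
A sketch of bounded retries using the queue's own bookkeeping instead of a requeuingCount field on the item (names and the limit are illustrative, not the PR's exact code):

package tagging

import "k8s.io/client-go/util/workqueue"

const maxRetries = 5 // illustrative cap

// handle shows bounded retries: the queue tracks per-item requeue counts,
// so the item itself needs no requeuingCount field.
func handle(q workqueue.RateLimitingInterface, item interface{}, work func(interface{}) error) {
	defer q.Done(item)

	if err := work(item); err == nil {
		q.Forget(item) // success: reset this item's retry counter
	} else if q.NumRequeues(item) < maxRetries {
		q.AddRateLimited(item) // retry with backoff; increments NumRequeues
	} else {
		q.Forget(item) // give up after maxRetries failures
	}
}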

@cartermckinnon (Contributor)

@ndbaker1 can you review this one?

@cartermckinnon (Contributor)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cartermckinnon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added a commit that referenced this pull request Jan 23, 2025
…stream-release-1.32

Automated cherry pick of #1091: optimize tagging controller workqueue handling