Update controller and agent to kube-rs client 0.91.0
#702
Did a first pass here, didn't look at the tests for now.
/// This attempts to remove nodes from the nodes list and deviceUsage
/// map in an Instance. An attempt is made to update
/// the instance in etcd, any failure is returned.
async fn try_remove_nodes_from_instance(
Does this play well with the Server-Side Apply used by the agent? I fear this would completely mess with field ownership and prevent agents from removing themselves from an Instance afterwards (on device disappearance).
We would also need to remove the finalizer of the vanished node's agent.
Server-side apply is fairly new to me. It sounds like the idea is that only one owner should manage an object, or there could be conflicts. The owner of the Instance objects should be the agent. Should I therefore switch this back to patching as we used to?
If we remove one node at a time, then I think we should impersonate that node's agent to do so, and do what the missing node's agent would do if it considered the device to have disappeared.
I don't like the idea of impersonation, but I don't see removing a finalizer as impersonation. Does d9689ae address your concern?
let owner_references: Vec<OwnerReference> = vec![OwnerReference {
    api_version: ownership.get_api_version(),
    kind: ownership.get_kind(),
    controller: ownership.get_controller(),
    block_owner_deletion: ownership.get_block_owner_deletion(),
    name: ownership.get_name(),
    uid: ownership.get_uid(),
}];
It would be easier to implement From<OwnershipInfo> for OwnerReference than repeating this block every time.
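A minimal sketch of that suggestion, for illustration only. Both structs below are hypothetical stand-ins so the example compiles on its own: in the PR, `OwnerReference` comes from the Kubernetes API types and `OwnershipInfo` from Akri, and their real fields and constructors may differ. Only the getter names are taken from the block quoted above; the `new` constructor and its default values are assumptions.

```rust
/// Hypothetical stand-in for Akri's `OwnershipInfo`; the real type may differ.
#[derive(Clone, Debug)]
pub struct OwnershipInfo {
    api_version: String,
    kind: String,
    name: String,
    uid: String,
    controller: Option<bool>,
    block_owner_deletion: Option<bool>,
}

impl OwnershipInfo {
    /// Illustrative constructor (an assumption, not Akri's API).
    pub fn new(api_version: &str, kind: &str, name: &str, uid: &str) -> Self {
        OwnershipInfo {
            api_version: api_version.to_string(),
            kind: kind.to_string(),
            name: name.to_string(),
            uid: uid.to_string(),
            controller: Some(true),
            block_owner_deletion: Some(true),
        }
    }
    pub fn get_api_version(&self) -> String { self.api_version.clone() }
    pub fn get_kind(&self) -> String { self.kind.clone() }
    pub fn get_name(&self) -> String { self.name.clone() }
    pub fn get_uid(&self) -> String { self.uid.clone() }
    pub fn get_controller(&self) -> Option<bool> { self.controller }
    pub fn get_block_owner_deletion(&self) -> Option<bool> { self.block_owner_deletion }
}

/// Stand-in with the same fields as the block quoted above.
#[derive(Clone, Debug, PartialEq)]
pub struct OwnerReference {
    pub api_version: String,
    pub kind: String,
    pub controller: Option<bool>,
    pub block_owner_deletion: Option<bool>,
    pub name: String,
    pub uid: String,
}

/// The conversion is written once, so call sites no longer repeat the fields.
impl From<OwnershipInfo> for OwnerReference {
    fn from(ownership: OwnershipInfo) -> Self {
        OwnerReference {
            api_version: ownership.get_api_version(),
            kind: ownership.get_kind(),
            controller: ownership.get_controller(),
            block_owner_deletion: ownership.get_block_owner_deletion(),
            name: ownership.get_name(),
            uid: ownership.get_uid(),
        }
    }
}
```

With such an impl in place, each call site could shrink to something like `let owner_references: Vec<OwnerReference> = vec![ownership.into()];`. Because `OwnerReference` is defined in another crate, the impl would have to live in the crate that defines `OwnershipInfo`.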
Thinking about this more, I think we want to avoid using the finalizer API for the instance_watcher, since we don't do anything on instance deletes, so a finalizer shouldn't be necessary. For node_watcher, we could also remove the finalizer, given that we don't care whether the node is fully deleted before reconciling it.
@kate-goldenring the Controller system doesn't have any "delete" event, partly because these can be missed easily. For example, for the node watcher, if the controller was on the now-missing node, it will never get a delete event for that node. Overall, the Akri controller currently uses an event-based imperative mechanism (a node got deleted, so I remove it from Instances). The kube-rs "Controller" system is a bit different in that it works as: I have a resource (in our case the Instance), and I should ensure everything is set up correctly for it. So if we were to follow the full kube-rs flow here, we would have an Instance controller that ensures all broker Pods/Jobs/... are here and that all referenced nodes are healthy. It would trigger on an Instance change to reconcile that Instance, on an owned Pod/Job/... change (to ensure it always matches the spec), and (for all Instances) on a node state change. We could also have a reflector for the Configuration to maintain a cache of them and avoid fetching the Configuration from the API every time we reconcile an Instance. I didn't comment on this as your goal was to make the minimal changes necessary to upgrade to the latest kube-rs.
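To make the "ensure everything is set up correctly for this resource" idea concrete, here is a rough, library-free sketch. It does not use the kube-rs API: `Instance`, `reconcile_instance`, and the shape of `device_usage` (slot name mapped to the node using it, with an empty string meaning free) are simplified assumptions made for illustration. The point is that the function looks only at current state, so a missed delete event does not matter.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified, hypothetical stand-in for an Akri Instance: only the nodes
/// list and deviceUsage map mentioned in this thread.
#[derive(Clone, Debug, PartialEq)]
pub struct Instance {
    pub nodes: Vec<String>,
    /// Assumed shape: slot name -> node using the slot ("" means free).
    pub device_usage: BTreeMap<String, String>,
}

/// State-based reconcile: given the Instance as it is now and the set of
/// currently healthy nodes, compute the Instance as it should be.
/// Returns `None` when nothing needs to change, so calling it again after
/// an update is a no-op.
pub fn reconcile_instance(
    instance: &Instance,
    healthy_nodes: &BTreeSet<String>,
) -> Option<Instance> {
    let mut desired = instance.clone();

    // Drop nodes that are no longer healthy from the nodes list.
    desired.nodes.retain(|node| healthy_nodes.contains(node));

    // Free every slot still held by a node that is gone.
    for holder in desired.device_usage.values_mut() {
        if !holder.is_empty() && !healthy_nodes.contains(holder.as_str()) {
            holder.clear();
        }
    }

    if desired == *instance {
        None
    } else {
        Some(desired)
    }
}
```

In a real controller the returned value would be written back to the API server, and the open question from the earlier comments (field ownership under Server-Side Apply, and the vanished agent's finalizer) would still apply to that write.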
@diconico07 that vision makes sense. I started down the path of minimal changes, and obviously it grew so much that a full rewrite with just an Instance controller would have been easier. At this point, I don't have much more time to devote to this, so could we look at this through the lens of what needs to change here for parity and an upgraded kube-rs, and then track an issue on rewriting the controller?
Because we are not reconciling instance deletes anymore, the |
I think we need to bump the Rust version in a separate PR, given that the failing e2e tests are probably due to too low a Rust version in the cross containers.
@diconico07 clarifying here: are you saying the controller should still watch Instance, Pod, and Node resources? Or just Instance resources, verifying the existence of the other resources based on the Instance change?
It should reconcile Instances (Controller mode), watch Nodes (IIRC there is a way to get a trigger only when a specific field gets updated), and watch Pods, Services, and Jobs (if we want to always keep these in line with the spec in the Configuration). The last two would be light watches (or probably a reflector for nodes) that just trigger an Instance reconciliation.
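A small, library-free sketch of that trigger fan-out, under stated assumptions: `Trigger` and `instances_to_reconcile` are hypothetical names, and this is plain Rust rather than the kube-rs watcher or Controller API. It only shows which Instances each kind of change would queue for reconciliation: an Instance change queues that Instance, a change to an owned Pod/Service/Job queues its owner, and a node change queues every Instance.

```rust
use std::collections::BTreeSet;

/// Hypothetical description of what changed in the cluster.
#[derive(Debug)]
pub enum Trigger {
    /// The Instance itself changed.
    InstanceChanged { name: String },
    /// A Pod, Service, or Job owned by an Instance changed.
    OwnedResourceChanged { owner_instance: String },
    /// Some node changed state; any Instance may reference it.
    NodeChanged,
}

/// Map a trigger to the set of Instance names that should be reconciled.
pub fn instances_to_reconcile(
    trigger: &Trigger,
    all_instances: &[String],
) -> BTreeSet<String> {
    match trigger {
        Trigger::InstanceChanged { name } => {
            std::iter::once(name.clone()).collect()
        }
        Trigger::OwnedResourceChanged { owner_instance } => {
            std::iter::once(owner_instance.clone()).collect()
        }
        // Light watch: no node-specific logic, just re-check every Instance.
        Trigger::NodeChanged => all_instances.iter().cloned().collect(),
    }
}
```

Keeping the watches this thin means all of the actual logic stays in the single Instance reconcile function.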
Signed-off-by: Kate Goldenring <kate.goldenring@fermyon.com>
…version to 1.75.0 Signed-off-by: Kate Goldenring <kate.goldenring@fermyon.com>
What this PR does / why we need it:
Our controller uses a really out-of-date Kubernetes client (kube-rs) API. This updates it to a newer version. Another PR can bump us to the latest, which will cause more changes to the agent; that is why I avoided that big of a jump. Much of the controller code was written years ago, when kube-rs did a lot less of the heavy lifting. Switching to a newer version with a better reconciliation model means I was able to delete a lot of helper code.
All the tests had to be rewritten for the new API.
Special notes for your reviewer:
This is a lot. I am hoping the tests can confirm the functionality but we may also need to do a good bug bash.
If applicable:
cargo fmt
cargo build
cargo clippy
cargo test
cargo doc