East/west connectivity monitoring tool #5514
Comments
@zm1990s Thanks for the proposal. Regarding Pod-to-Service and Pod-to-External monitoring, I'm not sure it would be really helpful or practical to proactively generate traffic. It's not easy to even know whether a particular access is supposed to succeed, given different policy, firewall, and network topology configurations, not to mention that traffic generated on behalf of, or towards, the user's application may be unwanted by many users. I think in practice most such tools are implemented as scripts/playbooks executed out-of-band, using the user's own application, according to what they want to monitor.
However, it seems that monitoring EW connectivity via memberlist would just be a faster way to get notified of a Node unreachable event, compared with K8s's native Node status. If users just want such status to be reported faster, they can also just update
A tool (like an antctl subcommand) for smoke testing may be the most practical way in the end.
I think that from an Antrea perspective, it would be good to monitor the health of the overlay network (in encap mode) by running some ping-mesh across all gateways. Being able to report latency across Nodes would also be quite nice, but I don't think we can do that with memberlist (IIRC, we discussed that in the past). With latency data available, we could even display a heat map in Antrea UI and update it in real-time.
Agree with what @antoninbas said.
If latency data is not needed, I think the health of the overlay network matches the health status of memberlist in practice. Apart from a misconfiguration in which the memberlist port is whitelisted but the overlay port is not, which could only occur when deploying a cluster and not during routine operation, I can't think of a situation in which memberlist reports a Node as healthy but its overlay doesn't work. But if we want to add latency data, I agree memberlist may not achieve it (however, I don't quite remember us discussing this; could you share a link if there is one?).
I think overlay (ping between gateway) is a bit more "end-to-end". In addition to port whitelisting, we could potentially detect issues like a missing route on the host (granted, that has not happened in a while, but we used to have such issues). I was thinking that with the right "probe" (e.g. a TCP data exchange), the health check would also fail in case of checksum issue (basically any issue with the NIC configuration that is specific to double encapsulation).
The latency heat map is something that has been on my mind for a while. I remember someone telling me that Weave had something like this, but I can't find a reference to it.
@tnqn I think this tool should be decoupled from Antrea Controller/Agent, just like nsx-interworking. Users can decide whether they need to use it or not.
Assigning to @tushartathgur who said he would look into this.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days
We have submitted this issue as a project idea for the LFX mentorship program: cncf/mentoring#1129. So ideally no one should work on this issue until we know whether the proposal is accepted and whether we can match a mentee to work on it.
@antoninbas, I am greatly interested in the project. How do I reach you guys on Slack, or are there other options? Though I have an intermediate knowledge of k8s and golang, I am curious to know how much frontend work is involved here. Thank you.
@prakrit55 you can reach out to us on Slack (we have the but for this specific issue, please see comment above (#5514 (comment)). If you are interested, you could consider applying for the LFX mentorship program.
Hey thank you @antoninbas, I got your channel. I would really like to apply for lfx mentorship for it, in the term March-May.
@antoninbas The prospect of working collectively on a comprehensive project like this is truly exciting, and I am keen on contributing my skills and enthusiasm to its success. The outlined sub-projects align perfectly with my interests, and they present a great opportunity for learning, growth, and industrial exposure.
Hello, everyone. I'm pleased to see how many folks are interested in participating in the LFX Mentorship Program. Upstream issues like this are an excellent place to discuss specific technical topics or provide ideas about how you may tackle a problem; however, please post any questions about the LFX program and how to apply on the mentorship discussion forums (and indeed, some of these questions may have already been answered there, or on the Program Guidelines page).
For all the folks who have applied or are considering applying to one of the Antrea projects for the LFX mentorship program, we have published instructions to complete test tasks: #5976. We will review your submissions for these tasks alongside other material (resume, cover letter) when selecting mentees. The deadline for submitting is February 20th 5PM PST.
Signed-off-by: Md Sahil <[email protected]>
@IRONICBo will work on this as part of the LFX mentorship program
Monitoring tool API design proposal
Monitoring tool needs a uniform config
Users and administrators need a way to measure and monitor network performance, specifically the latency between nodes, to ensure optimal cluster performance and troubleshoot potential issues.
Watch a singleton CRD
The proposed solution is to introduce a new Custom Resource Definition (CRD). The Antrea agents will listen for changes to this CRD and adjust their monitoring behavior accordingly. When the monitoring feature is enabled via the feature gate and config, each agent will watch creation/update/deletion events for this CRD and start, stop, or reconfigure the monitoring tool in real time. Additionally, a singleton pattern will be enforced using a validation webhook to ensure that only one instance of the CRD exists in the cluster.
Use Feature Gate & Config & CRD to start monitoring tool
The solution introduces a new user-facing feature that allows users to enable and configure the ping monitoring tool via a YAML config file. Once users apply this YAML file, the changes will be automatically picked up by the Antrea agents, and the monitoring behavior will be updated accordingly. This feature provides users with a structured and easy-to-consume API for enabling and configuring the ping mesh feature.
Main design/architecture
The main design involves the following components:
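// PingMonitoringToolConfigSpec holds the tunable parameters of the ping monitor.
type PingMonitoringToolConfigSpec struct {
	// PingInterval and PingTimeout are duration strings such as "10s".
	PingInterval string `json:"pingInterval,omitempty"`
	PingTimeout  string `json:"pingTimeout,omitempty"`
	// PingConcurrentLimit caps the number of pings in flight at once.
	PingConcurrentLimit int `json:"pingConcurrentLimit,omitempty"`
}

// PingMonitoringToolConfig is the singleton custom resource watched by the agents.
type PingMonitoringToolConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              PingMonitoringToolConfigSpec `json:"spec,omitempty"`
}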
apiVersion: networking.antrea.io/v1alpha1
kind: PingMonitoringToolConfig
metadata:
  name: default
spec:
  pingInterval: "10s"
  pingTimeout: "5s"
  pingConcurrentLimit: 10
In this example, the ping monitoring tool is enabled with a ping interval of 10 seconds, a ping timeout of 5 seconds, and a concurrency limit of 10.
Alternative solutions
This proposal aims to provide a flexible and user-friendly way to monitor node-to-node latency in a Kubernetes cluster, enhancing the observability and manageability of the network performance in Antrea-managed clusters.
A validation webhook won't be necessary if we simply add an OpenAPI validation rule which constrains the name of the CRD object created. See https://github.com/kubernetes-sigs/network-policy-api/blob/main/apis/v1alpha1/baselineadminnetworkpolicy_types.go#L29 as an example
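As a rough sketch of that suggestion, the name constraint can be attached to the Go type as a kubebuilder `XValidation` marker, which generates a CEL rule in the CRD schema. The type below reuses the name from the proposal; `objectMeta` is a local stand-in for `metav1.ObjectMeta` so the snippet compiles with the standard library only, and `nameAllowed` is a hypothetical helper that merely mirrors the rule in plain Go for illustration (the API server, not Go code, would enforce it).

```go
package main

import "fmt"

// singletonName is the only object name the rule admits.
const singletonName = "default"

// objectMeta is a local stand-in for metav1.ObjectMeta (assumption, to
// keep the sketch free of external dependencies).
type objectMeta struct {
	Name string `json:"name,omitempty"`
}

// PingMonitoringToolConfig carries a marker that controller-gen would
// turn into an x-kubernetes-validations rule on the generated CRD.
// +kubebuilder:validation:XValidation:rule="self.metadata.name == 'default'",message="only one PingMonitoringToolConfig named 'default' may be created"
type PingMonitoringToolConfig struct {
	Metadata objectMeta `json:"metadata,omitempty"`
}

// nameAllowed mirrors the CEL rule in plain Go, for illustration only.
func nameAllowed(obj PingMonitoringToolConfig) bool {
	return obj.Metadata.Name == singletonName
}

func main() {
	for _, name := range []string{"default", "other"} {
		obj := PingMonitoringToolConfig{Metadata: objectMeta{Name: name}}
		fmt.Printf("%s allowed: %v\n", name, nameAllowed(obj))
	}
}
```

Since object names are unique within a cluster-scoped kind, admitting a single name is enough to guarantee at most one instance.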
We introduce a new feature to measure inter-Node latency in a K8s cluster running Antrea. The feature is currently Alpha and uses the NodeLatencyMonitor FeatureGate. In addition to the FeatureGate, enablement of the feature is controlled by a new CRD, called NodeLatencyMonitor. This CRD supports at most one CR instance, which must be named "default". When the CR exists, Antrea Agents will start "pinging" each other to take latency measurements. Each Agent only stores the latest measured value (at least at the moment); we do not store time series data. We support both IPv4 and IPv6. When an overlay is used by Antrea, the ping is sent over the tunnel (by using the gateway IP as the destination). This change does not add any functionality besides collecting latency data at each Agent. A follow-up change will take care of reporting the latency data to the Antrea Controller, so it can be consumed via an APIService.
For #5514
Signed-off-by: IRONICBo <[email protected]>
Signed-off-by: Asklv <[email protected]>
Follow up to #6120 See #5514 Signed-off-by: Asklv <[email protected]>
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days
Implement REST server for NodeLatencyStats in v1alpha1.stats.antrea.io With this change the feature is now usable. `kubectl get nodelatencystats` will display the latest latency information. For #5514 Signed-off-by: Asklv <[email protected]> Signed-off-by: Antonin Bas <[email protected]> Co-authored-by: Antonin Bas <[email protected]>
With the addition of the NodeLatencyMonitor feature in Antrea v2.1 (thanks @IRONICBo!), I will now close this issue. Additional capabilities for this feature can be added over time. An issue has been created for the addition of a latency visualization dashboard in the Antrea UI: antrea-io/antrea-ui#455
Description
Antrea only monitors Controller/Agent status at the moment, but a healthy Controller/Agent status doesn't mean that East-West connectivity is good, and the metrics provided by Antrea do not reflect Pod-to-Pod connectivity either.
From an application perspective, we need a tool that can detect and report Pod-to-Pod connectivity issues.
Core feature required
A tool (maybe a DaemonSet) that can generate East/West traffic periodically and check whether E/W connectivity is good. If some of the checks fail, alerts or logs should be sent out to external monitoring tools.
The detection interval should be adjustable, as with traditional load balancers: for example, send a probe every 1 second and, when 3 consecutive probes fail, send out an alert.
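The "3 consecutive failures" rule above can be sketched in Go as a small counter that resets on any success. `failureDetector` and `Observe` are hypothetical names for this illustration, and the choice to fire the alert only once per failure streak is an assumption rather than something the issue specifies.

```go
package main

import "fmt"

// failureDetector raises an alert once a configurable number of
// consecutive probe failures has been observed.
type failureDetector struct {
	threshold   int // consecutive failures needed to alert, e.g. 3
	consecutive int // current length of the failure streak
}

// Observe records one probe result and reports whether an alert should
// be sent now. It returns true exactly when the streak reaches the
// threshold, so a long outage produces a single alert.
func (d *failureDetector) Observe(ok bool) bool {
	if ok {
		d.consecutive = 0
		return false
	}
	d.consecutive++
	return d.consecutive == d.threshold
}

func main() {
	d := &failureDetector{threshold: 3}
	// Simulated probe results; in practice these would come from a
	// probe run on a ticker at the configured interval (e.g. 1 second).
	results := []bool{true, false, false, false, false, true}
	for i, ok := range results {
		if d.Observe(ok) {
			fmt.Printf("probe %d: alert after %d consecutive failures\n", i, d.threshold)
		}
	}
}
```

With the simulated results above, the alert fires once, at probe index 3, and the later success resets the streak.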
Other related features
Since we're building an E/W monitoring tool, other related Antrea features can be monitored too. For example: