
spark-operator v2.0.2 - listen tcp :443: bind: permission denied #2331

Open
1 task
karanalang opened this issue Nov 21, 2024 · 7 comments
Labels
kind/bug Something isn't working

Comments

@karanalang

What happened?

  • ✋ I have searched the open/closed issues and my issue is not listed.
    I'm trying to install spark-operator on k8s (v1.28) and running into issues.

Command -

helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --set image.tag=2.0.2 \
  --create-namespace \
  --set webhook.enable=true \
  --set webhook.port=443 \
  --set webhook.namespaceSelector="spark-webhook-enabled=true" \
  --set webhook.containerSecurityContext.privileged=true \
  --set webhook.containerSecurityContext.capabilities.add[0]=NET_BIND_SERVICE \
  --set logLevel=debug \
  --set enableResourceQuotaEnforcement=true \
  --set webhook.failOnError=true \
  --set controller.resources.limits.cpu=100m \
  --set controller.resources.limits.memory=200Mi \
  --set controller.resources.requests.cpu=50m \
  --set controller.resources.requests.memory=100Mi \
  --set webhook.resources.limits.cpu=100m \
  --set webhook.resources.limits.memory=200Mi \
  --set webhook.resources.requests.cpu=50m \
  --set webhook.resources.requests.memory=100Mi \
  --set "sparkJobNamespaces={spark-apps}" \
  --set webhook.containerSecurityContext.runAsUser=0

The spark-operator-controller pod starts, but the webhook pod is failing -

NAME                                             READY   STATUS    RESTARTS      AGE
pod/spark-operator-controller-688c7c9955-tkdpf   1/1     Running   0             3m15s
pod/spark-operator-webhook-567bd94f66-tg567      0/1     Error     5 (94s ago)   3m15s

NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/spark-operator-webhook-svc   ClusterIP   10.108.242.219   <none>        443/TCP   3m15s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/spark-operator-controller   1/1     1            1           3m15s
deployment.apps/spark-operator-webhook      0/1     1            0           3m15s

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/spark-operator-controller-688c7c9955   1         1         1       3m15s
replicaset.apps/spark-operator-webhook-567bd94f66      1         1         0       3m15s

Logs from webhook pod -

(base) Karans-MacBook-Pro:~ karanalang$ kc logs -f pod/spark-operator-webhook-567bd94f66-tg567  -n so350
++ id -u
+ uid=185
++ id -g
+ gid=185
+ set +e
++ getent passwd 185
+ uidentry=spark:x:185:185::/home/spark:/bin/sh
+ set -e
+ [[ -z spark:x:185:185::/home/spark:/bin/sh ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator webhook start --zap-log-level=info --namespaces=default --webhook-secret-name=spark-operator-webhook-certs --webhook-secret-namespace=so350 --webhook-svc-name=spark-operator-webhook-svc --webhook-svc-namespace=so350 --webhook-port=443 --mutating-webhook-name=spark-operator-webhook --validating-webhook-name=spark-operator-webhook --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-webhook-lock --leader-election-lock-namespace=so350
Spark Operator Version: 2.0.2+HEAD+unknown
Build Date: 2024-10-11T01:46:23+00:00
Git Commit ID: 
Git Tree State: clean
Go Version: go1.23.1
Compiler: gc
Platform: linux/amd64
2024-11-21T20:56:37.838Z	INFO	webhook/start.go:244	Syncing webhook secret	{"name": "spark-operator-webhook-certs", "namespace": "so350"}
2024-11-21T20:56:37.936Z	INFO	webhook/start.go:258	Writing certificates	{"path": "/etc/k8s-webhook-server/serving-certs", "certificate name": "tls.crt", "key name": "tls.key"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.builder	builder/webhook.go:158	Registering a mutating webhook	{"GVK": "sparkoperator.k8s.io/v1beta2, Kind=SparkApplication", "path": "/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.builder	builder/webhook.go:189	Registering a validating webhook	{"GVK": "sparkoperator.k8s.io/v1beta2, Kind=SparkApplication", "path": "/validate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/validate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.builder	builder/webhook.go:158	Registering a mutating webhook	{"GVK": "sparkoperator.k8s.io/v1beta2, Kind=ScheduledSparkApplication", "path": "/mutate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/mutate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.builder	builder/webhook.go:189	Registering a validating webhook	{"GVK": "sparkoperator.k8s.io/v1beta2, Kind=ScheduledSparkApplication", "path": "/validate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/validate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.builder	builder/webhook.go:158	Registering a mutating webhook	{"GVK": "/v1, Kind=Pod", "path": "/mutate--v1-pod"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/mutate--v1-pod"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.builder	builder/webhook.go:204	skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called	{"GVK": "/v1, Kind=Pod"}
2024-11-21T20:56:38.037Z	INFO	webhook/start.go:320	Starting manager
2024-11-21T20:56:38.038Z	INFO	controller-runtime.metrics	server/server.go:205	Starting metrics server
2024-11-21T20:56:38.038Z	INFO	controller-runtime.metrics	server/server.go:244	Serving metrics server	{"bindAddress": ":8080", "secure": false}
2024-11-21T20:56:38.039Z	INFO	manager/server.go:50	starting server	{"kind": "health probe", "addr": "[::]:8081"}
2024-11-21T20:56:38.039Z	INFO	controller-runtime.webhook	webhook/server.go:191	Starting webhook server
2024-11-21T20:56:38.039Z	INFO	webhook/start.go:358	disabling http/2
2024-11-21T20:56:38.039Z	INFO	controller-runtime.certwatcher	certwatcher/certwatcher.go:161	Updated current TLS certificate
2024-11-21T20:56:38.040Z	INFO	controller-runtime.certwatcher	certwatcher/certwatcher.go:115	Starting certificate watcher
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:534	Stopping and waiting for non leader election runnables
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:538	Stopping and waiting for leader election runnables
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:546	Stopping and waiting for caches
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:550	Stopping and waiting for webhooks
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:553	Stopping and waiting for HTTP servers
I1121 20:56:38.040581      10 leaderelection.go:250] attempting to acquire leader lease so350/spark-operator-webhook-lock...
2024-11-21T20:56:38.041Z	INFO	manager/server.go:43	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-11-21T20:56:38.041Z	INFO	controller-runtime.metrics	server/server.go:251	Shutting down metrics server with timeout of 1 minute
2024-11-21T20:56:38.041Z	INFO	manager/internal.go:557	Wait completed, proceeding to shutdown the manager
E1121 20:56:38.041688      10 leaderelection.go:332] error retrieving resource lock so350/spark-operator-webhook-lock: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/so350/leases/spark-operator-webhook-lock": context canceled
2024-11-21T20:56:38.041Z	ERROR	webhook/start.go:322	Failed to start manager	{"error": "listen tcp :443: bind: permission denied"}
github.com/kubeflow/spark-operator/cmd/operator/webhook.start
	/workspace/cmd/operator/webhook/start.go:322
github.com/kubeflow/spark-operator/cmd/operator/webhook.NewStartCommand.func2
	/workspace/cmd/operator/webhook/start.go:128
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:989
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041
main.main
	/workspace/cmd/main.go:27
runtime.main
	/usr/local/go/src/runtime/proc.go:272

Please note: I'd installed v2.0.0-rc.0 and it was working fine; however, I'm running into issues with v2.0.2.

Please help with this.

thanks!

Reproduction Code

No response

Expected behavior

No response

Actual behavior

No response

Environment & Versions

  • Kubernetes Version: 1.28
  • Spark Operator Version: 2.0.2
  • Apache Spark Version: 3.5

Additional context

No response

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@karanalang karanalang added the kind/bug Something isn't working label Nov 21, 2024
@ChenYi015
Contributor

ChenYi015 commented Nov 22, 2024

@karanalang Please use a non-privileged webhook port (the default is 9443) if possible. Otherwise you will need to run as root or modify the security context, since we have removed all capabilities to harden the container.
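To illustrate the suggestion, here is a minimal sketch of the install command with webhook.port omitted, so the chart's non-privileged default (9443) is used and no root user or added capabilities are needed. The namespace and image tag mirror the reporter's command; other values are trimmed for brevity and would need to be re-added for a real deployment.

```shell
# Sketch: rely on the chart's default non-privileged webhook port (9443)
# instead of binding to 443, which requires root or NET_BIND_SERVICE.
helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --create-namespace \
  --set image.tag=2.0.2 \
  --set webhook.enable=true
```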

@jacobsalway
Member

Worth noting I think you want webhook.securityContext rather than webhook.containerSecurityContext. I was able to successfully run on Kind with your Helm values once I changed that.

https://github.com/kubeflow/spark-operator/blob/master/charts/spark-operator-chart/templates/webhook/deployment.yaml#L113-L116

@karanalang
Author

Hi @jacobsalway,

This command worked, and I'm able to install spark-operator with webhook port 443:

(base) Karans-MacBook-Pro:onPrem karanalang$ helm upgrade --install spark-operator spark-operator/spark-operator \
>   --namespace so350 \
>   --set image.tag=2.0.2 \
>   --create-namespace \
>   --set webhook.enable=true \
>   --set webhook.port=443 \
>   --set webhook.namespaceSelector="spark-webhook-enabled=true" \
>   --set webhook.containerSecurityContext.privileged=true \
>   --set webhook.containerSecurityContext.capabilities.add[0]=NET_BIND_SERVICE \
>   --set logLevel=debug \
>   --set enableResourceQuotaEnforcement=true \
>   --set webhook.failOnError=true \
>   --set controller.resources.limits.cpu=100m \
>   --set controller.resources.limits.memory=200Mi \
>   --set controller.resources.requests.cpu=50m \
>   --set controller.resources.requests.memory=100Mi \
>   --set webhook.resources.limits.cpu=100m \
>   --set webhook.resources.limits.memory=200Mi \
>   --set webhook.resources.requests.cpu=50m \
>   --set webhook.resources.requests.memory=100Mi \
>   --set "sparkJobNamespaces={spark-apps}" \
>   --set webhook.securityContext.runAsUser=0

However, I see one more issue: I'm running my Spark jobs in namespace spark-apps, but the Spark Operator is not recognizing them; instead it seems to be monitoring the default namespace.

Here is the log of the Spark controller:

(base) Karans-MacBook-Pro:onPrem karanalang$ kc logs -f pod/spark-operator-controller-688c7c9955-l9l6p -n so350
++ id -u
+ uid=185
++ id -g
+ gid=185
+ set +e
++ getent passwd 185
+ uidentry=spark:x:185:185::/home/spark:/bin/sh
+ set -e
+ [[ -z spark:x:185:185::/home/spark:/bin/sh ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator controller start --zap-log-level=info --namespaces=default --controller-threads=10 --enable-ui-service=true --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-controller-lock --leader-election-lock-namespace=so350 --workqueue-ratelimiter-bucket-qps=50 --workqueue-ratelimiter-bucket-size=500 --workqueue-ratelimiter-max-delay=6h
Spark Operator Version: 2.0.2+HEAD+unknown
Build Date: 2024-10-11T01:46:23+00:00
Git Commit ID: 
Git Tree State: clean
Go Version: go1.23.1
Compiler: gc
Platform: linux/amd64
2024-11-25T19:54:30.847Z	INFO	controller/start.go:298	Starting manager
2024-11-25T19:54:30.935Z	INFO	controller-runtime.metrics	server/server.go:205	Starting metrics server
2024-11-25T19:54:30.935Z	INFO	manager/server.go:50	starting server	{"kind": "health probe", "addr": "[::]:8081"}
2024-11-25T19:54:30.936Z	INFO	controller-runtime.metrics	server/server.go:244	Serving metrics server	{"bindAddress": ":8080", "secure": false}
I1125 19:54:31.035664      10 leaderelection.go:250] attempting to acquire leader lease so350/spark-operator-controller-lock...
I1125 19:54:47.461700      10 leaderelection.go:260] successfully acquired lease so350/spark-operator-controller-lock
2024-11-25T19:54:47.462Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "spark-application-controller", "source": "kind source: *v1.Pod"}
2024-11-25T19:54:47.462Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "spark-application-controller", "source": "kind source: *v1beta2.SparkApplication"}
2024-11-25T19:54:47.462Z	INFO	controller/controller.go:186	Starting Controller	{"controller": "spark-application-controller"}
2024-11-25T19:54:47.462Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "scheduled-spark-application-controller", "source": "kind source: *v1beta2.ScheduledSparkApplication"}
2024-11-25T19:54:47.462Z	INFO	controller/controller.go:186	Starting Controller	{"controller": "scheduled-spark-application-controller"}
2024-11-25T19:54:47.634Z	INFO	controller/controller.go:220	Starting workers	{"controller": "scheduled-spark-application-controller", "worker count": 10}
2024-11-25T19:54:47.634Z	INFO	controller/controller.go:220	Starting workers	{"controller": "spark-application-controller", "worker count": 10}


How do I fix this?

@karanalang
Author

Hi @ChenYi015, would running the webhook on a non-privileged port in production have any security impact? BTW, it does work if I set the webhook port to 9443 or 8443.

Also, please check my response to @jacobsalway; I'm facing another issue where spark-operator is not recognizing jobs in the spark-apps namespace.

@karanalang
Author

Please note: sparkJobNamespaces was working in v2.0.0. @jacobsalway, per your comment in Slack, how do I get the Helm chart version for v2.0.0? Ideally, IMO, we should not have the Helm chart and image versions tied together.

@jacobsalway
Member

  • Use spark.jobNamespaces instead of sparkJobNamespaces. This was changed in the 2.0 rewrite, and the docs have now been updated (changed sparkJobNamespaces to spark.JobNamespaces website#3924).
  • Using a non-privileged port would actually improve the security posture as your webhook doesn't need to run as root. Unless your environment has a specific need for running on port 443, I'd default to using the non-privileged default port in the chart.
  • Use the --version 2.0.0 flag in your Helm installation, e.g. helm upgrade --install spark-operator spark-operator/spark-operator --version 2.0.0.
  • The chart and image version are tied together as arguments to the container are templated from the Helm values. For example, if we removed an argument in a newer version but you tried to use an older image, the controller or webhook may panic on startup due to extra or missing flags.
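Putting the points above together, a sketch of a corrected install command might look like the following. It pins the chart version to match the image (per the last two bullets) and uses the 2.x spark.jobNamespaces key in place of the pre-2.0 sparkJobNamespaces; the namespace is carried over from the reporter's command.

```shell
# Sketch: pin the chart version so chart and image stay in lockstep,
# and use the 2.x key spark.jobNamespaces for the job namespace list.
helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --create-namespace \
  --version 2.0.2 \
  --set webhook.enable=true \
  --set "spark.jobNamespaces={spark-apps}"
```

With spark.jobNamespaces set, the controller's --namespaces argument should list spark-apps rather than default, which is what the earlier controller log showed going wrong.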

@karanalang
Author

Thanks, @jacobsalway, let me check this.
