Learn about troubleshooting steps that you might find helpful if you run into problems using Google Kubernetes Engine (GKE).
If you need additional assistance, reach out to Cloud Customer Care.
Debugging Kubernetes resources
If you are experiencing an issue related to your cluster, refer to Troubleshooting Clusters in the Kubernetes documentation.
If you are having an issue with your application, its Pods, or its controller object, refer to Troubleshooting Applications.
If you are having an issue related to connectivity between Compute Engine VMs that are in the same Virtual Private Cloud (VPC) network or two VPC networks connected with VPC Network Peering, refer to Troubleshooting connectivity between virtual machine (VM) instances with internal IP addresses.
If you are experiencing packet loss when sending traffic from a cluster to an external IP address using Cloud NAT, VPC-native clusters, or IP masquerade agent, see Troubleshooting Cloud NAT packet loss from a GKE cluster.
Troubleshooting issues with the kubectl command
The kubectl command isn't found
Install the kubectl binary by running the following command:
gcloud components update kubectl
Answer "yes" when the installer prompts you to modify your $PATH environment variable. Modifying this variable enables you to use kubectl commands without typing their full file path.
Alternatively, add the following line to ~/.bashrc (or ~/.bash_profile in macOS, or wherever your shell stores environment variables):
export PATH=$PATH:/usr/local/share/google/google-cloud-sdk/bin/
Run the following command to load your updated .bashrc (or .bash_profile) file:
source ~/.bashrc
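To confirm that the PATH change took effect, the lookup can be sketched as a small shell helper. The function name have_cmd is hypothetical and not part of any Google tool; it only wraps the standard command -v builtin.

```shell
# Hypothetical helper: succeeds if the named command resolves on the current PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Example use after reloading your shell configuration:
#   have_cmd kubectl && kubectl version --client
```

If the helper fails for kubectl, the directory that contains the binary is probably still missing from PATH in the current shell session.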
kubectl commands return "connection refused" error
Set the cluster context with the following command:
gcloud container clusters get-credentials CLUSTER_NAME
If you are unsure of what to enter for CLUSTER_NAME, use the following command to list your clusters:
gcloud container clusters list
kubectl command times out
After creating a cluster, attempting to run the kubectl command against the cluster returns an error, such as Unable to connect to the server: dial tcp IP_ADDRESS: connect: connection timed out or Unable to connect to the server: dial tcp IP_ADDRESS: i/o timeout.
This can occur when kubectl is unable to communicate with the cluster control plane.
To resolve this issue, verify the context where the cluster is set:
Go to $HOME/.kube/config or run the command kubectl config view to verify that the config file contains the cluster context and the external IP address of the control plane.
Set the cluster credentials:
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --project=PROJECT_ID
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_LOCATION: the Compute Engine location.
- PROJECT_ID: ID of the project in which the GKE cluster was created.
If the cluster is a private GKE cluster, then ensure that the outgoing IP of the machine you are attempting to connect from is included in the list of existing authorized networks. You can find your existing authorized networks in the console or by running the following command:
gcloud container clusters describe CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --project=PROJECT_ID \
    --format "flattened(masterAuthorizedNetworksConfig.cidrBlocks[])"
If the outgoing IP of the machine is not included in the list of authorized networks from the output of the preceding command, then follow the steps in Can't reach control plane of a private cluster, or Using Cloud Shell to access a private cluster if connecting from Cloud Shell.
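Checking whether an outgoing IPv4 address falls inside one of the authorized CIDR blocks can be sketched in plain shell arithmetic. The helper names ip_to_int and ip_in_cidr are hypothetical, the sketch covers IPv4 only, and it assumes octets without leading zeros.

```shell
# Hypothetical helpers for comparing an IPv4 address against a CIDR block.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change stays local.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if address $1 falls inside CIDR block $2 (for example 203.0.113.0/24).
ip_in_cidr() {
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # Build the network mask from the prefix length.
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $addr & $mask )) -eq $(( $net & $mask )) ]
}

# Example use with one block from the describe output:
#   ip_in_cidr 203.0.113.5 203.0.113.0/24 && echo "authorized"
```

Run the check once per cidrBlocks entry; the address is authorized if any block matches.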
kubectl commands return "failed to negotiate an api version" error
Ensure kubectl has authentication credentials:
gcloud auth application-default login
The kubectl logs, attach, exec, and port-forward commands stop responding
These commands rely on the cluster's control plane being able to talk to the nodes in the cluster. However, because the control plane isn't in the same Compute Engine network as your cluster's nodes, we rely on either SSH or Konnectivity proxy tunnels to enable secure communication.
GKE saves an SSH public key file in your Compute Engine project metadata. All Compute Engine VMs using Google-provided images regularly check their project's common metadata and their instance's metadata for SSH keys to add to the VM's list of authorized users. GKE also adds a firewall rule to your Compute Engine network allowing SSH access from the control plane's IP address to each node in the cluster.
If any of the preceding kubectl commands don't run, it's likely that the API server is unable to communicate with the nodes. Check for these potential causes:
The cluster doesn't have any nodes.
If you've scaled down the number of nodes in your cluster to zero, the commands won't work.
To fix it, resize your cluster to have at least one node.
SSH
Your network's firewall rules don't allow for SSH access from the control plane.
All Compute Engine networks are created with a firewall rule called default-allow-ssh that allows SSH access from all IP addresses (requiring a valid private key, of course). GKE also inserts an SSH rule for each public cluster of the form gke-CLUSTER_NAME-RANDOM_CHARACTERS-ssh that allows SSH access specifically from the cluster's control plane to the cluster's nodes. If neither of these rules exists, then the control plane can't open SSH tunnels.
To fix it, re-add a firewall rule allowing access to VMs with the tag that's on all the cluster's nodes from the control plane's IP address.
Your project's common metadata entry for "ssh-keys" is full.
If the project's metadata entry named "ssh-keys" is close to its maximum size limit, then GKE isn't able to add its own SSH key to enable it to open SSH tunnels. You can see your project's metadata by running the following command:
gcloud compute project-info describe [--project=PROJECT_ID]
And then check the length of the list of ssh-keys.
To fix it, delete some of the SSH keys that are no longer needed.
You have set a metadata field with the key "ssh-keys" on the VMs in the cluster.
The node agent on VMs prefers per-instance ssh-keys to project-wide SSH keys, so if you've set any SSH keys specifically on the cluster's nodes, then the control plane's SSH key in the project metadata won't be respected by the nodes.
To check, run gcloud compute instances describe VM_NAME and look for an ssh-keys field in the metadata.
To fix it, delete the per-instance SSH keys from the instance metadata.
Konnectivity proxy
Determine if your cluster uses the Konnectivity proxy by checking for the following system Deployment:
kubectl get deployments konnectivity-agent --namespace kube-system
Your network's firewall rules don't allow for Konnectivity agent access to the control plane.
On cluster creation, Konnectivity agent Pods establish and maintain a connection to the control plane on port 8132. When one of the kubectl commands is run, the API server uses this connection to communicate with the cluster.
If your network's firewall rules contain Egress Deny rules, they can prevent the agent from connecting. You must allow Egress traffic to the cluster control plane on port 8132. (For comparison, the API server uses 443.)
Your cluster's network policy blocks ingress from the kube-system namespace to the workload namespace.
To find network policies in the affected namespace, run the following command:
kubectl get networkpolicy --namespace AFFECTED_NAMESPACE
To resolve the issue, add the following to the network policy's spec.ingress field:
- from:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: kube-system
    podSelector:
      matchLabels:
        k8s-app: konnectivity-agent
These features are not required for the correct functioning of the cluster. If you prefer to keep your cluster's network locked down from all outside access, be aware that features like these won't work.
Troubleshooting error 4xx issues
Authentication and authorization errors when connecting to GKE clusters
This issue might occur when you try to run a kubectl command in your GKE cluster from a local environment. The command fails and displays an error message, usually with HTTP status code 401 (Unauthorized).
The cause of this issue might be one of the following:
- The gke-gcloud-auth-plugin authentication plugin is not correctly installed or configured.
- You lack the permissions to connect to the cluster API server and run kubectl commands.
To diagnose the cause, do the following:
Connect to the cluster using curl
Using curl bypasses the kubectl CLI and the gke-gcloud-auth-plugin plugin.
Set environment variables:
APISERVER=https://$(gcloud container clusters describe CLUSTER_NAME --location=COMPUTE_LOCATION --format "value(endpoint)")
TOKEN=$(gcloud auth print-access-token)
Verify that your access token is valid:
curl "https://oauth2.googleapis.com/tokeninfo?access_token=$TOKEN"
Check that you can connect to the core API endpoint in the API server:
gcloud container clusters describe CLUSTER_NAME --location=COMPUTE_LOCATION --format "value(masterAuth.clusterCaCertificate)" | base64 -d > /tmp/ca.crt
curl -s -X GET "${APISERVER}/api/v1/namespaces" --header "Authorization: Bearer $TOKEN" --cacert /tmp/ca.crt
If the curl command fails with an output that is similar to the following, check that you have the correct permissions to access the cluster:
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}
If the curl command succeeds, check whether the plugin is the cause.
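When scripting this diagnosis, the HTTP status can be pulled out of the Status response shown above with a small filter. This is a minimal sketch: the function name status_code is hypothetical, and it relies on sed pattern matching rather than a real JSON parser.

```shell
# Hypothetical helper: read a Kubernetes Status JSON document on stdin and
# print the numeric value of its "code" field, if present.
status_code() {
  sed -n 's/.*"code": *\([0-9][0-9]*\).*/\1/p'
}

# Example use with the curl command from this section:
#   curl -s "${APISERVER}/api/v1/namespaces" \
#       --header "Authorization: Bearer $TOKEN" --cacert /tmp/ca.crt | status_code
```

A successful namespace listing has no "code" field, so the filter prints nothing in that case.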
Configure the plugin in kubeconfig
The following steps configure your local environment to ignore the gke-gcloud-auth-plugin binary when authenticating to the cluster. In Kubernetes clients running version 1.25 and later, the gke-gcloud-auth-plugin binary is required, so use these steps if you want to access your cluster without needing the plugin.
Install kubectl CLI version 1.24 using curl:
curl -LO https://dl.k8s.io/release/v1.24.0/bin/linux/amd64/kubectl
You can use any kubectl CLI version 1.24 or earlier.
Open your shell startup script file, such as .bashrc for the Bash shell, in a text editor:
vi ~/.bashrc
Add the following line to the file and save it:
export USE_GKE_GCLOUD_AUTH_PLUGIN=False
Run the startup script:
source ~/.bashrc
Get credentials for your cluster, which sets up your .kube/config file:
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_LOCATION
Replace the following:
- CLUSTER_NAME: the name of the cluster.
- COMPUTE_LOCATION: the Compute Engine location.
Run a kubectl command:
kubectl cluster-info
If you get a 401 error or a similar authorization error, ensure that you have the correct permissions to perform the operation.
Error 400: Node pool requires recreation
The following issue occurs when you try to perform an action that recreates your control plane and nodes, such as when you complete an ongoing credential rotation.
The operation fails because GKE has not recreated one or more node pools in your cluster. On the backend, node pools are marked for recreation, but the actual recreation operation might take some time to begin.
The error message is similar to the following:
ERROR: (gcloud.container.clusters.update) ResponseError: code=400, message=Node pool "test-pool-1" requires recreation.
To resolve this issue, do one of the following:
- Wait for the recreation to happen. This might take hours, days, or weeks depending on factors such as existing maintenance windows and exclusions.
- Manually start a recreation of the affected node pools by starting a version upgrade to the same version as the control plane. To start a recreation, run the following command:
gcloud container clusters upgrade CLUSTER_NAME \
    --node-pool=POOL_NAME
After the upgrade completes, try the operation again.
Error 403: Insufficient permissions
The following error occurs when you try to connect to a GKE cluster using gcloud container clusters get-credentials, but the account doesn't have permission to access the Kubernetes API server.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=403, message=Required "container.clusters.get" permission(s) for "projects/<your-project>/locations/<region>/clusters/<your-cluster>".
To resolve this issue, do the following:
Identify the account that has the access issue:
gcloud auth list
Grant the required access to the account using the instructions in Authenticating to the Kubernetes API server.
Error 404: Resource "not found" when calling gcloud container commands
Re-authenticate to the Google Cloud CLI:
gcloud auth login
Error 400/403: Missing edit permissions on account
Your Compute Engine default service account, the Google APIs Service Agent, or the service account associated with GKE has been deleted or edited manually.
When you enable the Compute Engine or Kubernetes Engine API, Google Cloud creates the following service accounts and agents:
- Compute Engine default service account with edit permissions on your project.
- Google APIs Service Agent with edit permissions on your project.
- Google Kubernetes Engine service account with the Kubernetes Engine Service Agent role on your project.
If at any point you edit those permissions, remove the role bindings on the project, remove the service account entirely, or disable the API, cluster creation and all management functionality will fail.
The name of your Google Kubernetes Engine service account is as follows, where PROJECT_NUMBER is your project number:
service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
The following command can be used to verify that the Google Kubernetes Engine service account has the Kubernetes Engine Service Agent role assigned on the project:
gcloud projects get-iam-policy PROJECT_ID
Replace PROJECT_ID with your project ID.
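The full IAM policy can be long, so it may help to narrow the output to the one binding of interest. The following sketch builds the expected member string from a project number; the function name gke_agent_member is hypothetical, and the commented gcloud call uses the general --flatten, --filter, and --format flags as one possible way to list members of the role.

```shell
# Hypothetical helper: print the IAM member string for the GKE service
# account that belongs to the given project number.
gke_agent_member() {
  echo "serviceAccount:service-${1}@container-engine-robot.iam.gserviceaccount.com"
}

# Example use (an assumption about how you might filter the policy):
#   gcloud projects get-iam-policy PROJECT_ID \
#       --flatten="bindings[].members" \
#       --filter="bindings.role:roles/container.serviceAgent" \
#       --format="value(bindings.members)" \
#     | grep -Fx "$(gke_agent_member PROJECT_NUMBER)"
```

If grep prints nothing, the service account does not hold the role on the project.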
To resolve the issue, if you have removed the Kubernetes Engine Service Agent role from your Google Kubernetes Engine service account, add it back. Otherwise, you can re-enable the Kubernetes Engine API, which will correctly restore your service accounts and permissions.
Console
Go to the APIs & Services page in the Google Cloud console.
Select your project.
Click Enable APIs and Services.
Search for Kubernetes, then select the API from the search results.
Click Enable. If you have previously enabled the API, you must first disable it and then enable it again. It can take several minutes for the API and related services to be enabled.
gcloud
Run the following command in the gcloud CLI to add back the service account:
PROJECT_NUMBER=$(gcloud projects describe "PROJECT_ID" --format 'get(projectNumber)')
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:service-${PROJECT_NUMBER?}@container-engine-robot.iam.gserviceaccount.com" \
    --role roles/container.serviceAgent
Troubleshooting issues with GKE cluster creation
Error CONDITION_NOT_MET: Constraint constraints/compute.vmExternalIpAccess violated
You have the organization policy constraint constraints/compute.vmExternalIpAccess configured to Deny All or to restrict external IPs to specific VM instances at the organization, folder, or project level in which you are trying to create a public GKE cluster.
When you create public GKE clusters, the underlying Compute Engine VMs, which make up the worker nodes of this cluster, have external IP addresses assigned. If you configure the organization policy constraint constraints/compute.vmExternalIpAccess to Deny All or to restrict external IPs to specific VM instances, then the policy prevents the GKE worker nodes from obtaining external IP addresses, which results in cluster creation failure.
To find the logs of the cluster creation operation, you can review the GKE Cluster Operations Audit Logs using Logs Explorer with a search query similar to the following:
resource.type="gke_cluster"
logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName="google.container.v1beta1.ClusterManager.CreateCluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.project_id="PROJECT_ID"
To resolve this issue, ensure that the effective policy for the constraint constraints/compute.vmExternalIpAccess is Allow All on the project where you are trying to create a GKE public cluster. See Restricting external IP addresses to specific VM instances for information on working with this constraint. After setting the constraint to Allow All, delete the failed cluster and create a new cluster. This is required because the failed cluster cannot be repaired.
Troubleshooting issues with deployed workloads
GKE returns an error if there are issues with a workload's Pods. You can check the status of a Pod using the kubectl command-line tool or the Google Cloud console.
kubectl
To see all Pods running in your cluster, run the following command:
kubectl get pods
Output:
NAME       READY   STATUS             RESTARTS   AGE
POD_NAME   0/1     CrashLoopBackOff   23         8d
To get more detailed information about a specific Pod, run the following command:
kubectl describe pod POD_NAME
Replace POD_NAME with the name of the desired Pod.
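In a namespace with many Pods, it can be easier to list only the ones that are not healthy. The following sketch filters the tabular output of kubectl get pods; the function name unhealthy_pods is hypothetical, and it assumes the default column layout for a single namespace (NAME, READY, STATUS, and so on).

```shell
# Hypothetical filter: read `kubectl get pods` output on stdin and print the
# name and status of every Pod whose STATUS is not Running or Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example use:
#   kubectl get pods | unhealthy_pods
```

Each printed name can then be passed to kubectl describe pod for details.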
Console
Perform the following steps:
Go to the Workloads page in the Google Cloud console.
Select the desired workload. The Overview tab displays the status of the workload.
From the Managed Pods section, click the error status message.
The following sections explain some common errors returned by workloads andhow to resolve them.
CrashLoopBackOff
CrashLoopBackOff indicates that a container is repeatedly crashing after restarting. A container might crash for many reasons, and checking a Pod's logs might aid in troubleshooting the root cause.
By default, crashed containers restart with an exponential delay limited to five minutes. You can change this behavior by setting the restartPolicy field in the Deployment's Pod specification under spec: restartPolicy. The field's default value is Always.
You can troubleshoot CrashLoopBackOff errors using the Google Cloud console:
Go to the Crashlooping Pods Interactive Playbook:
For Cluster, enter the name of the cluster you want to troubleshoot.
For Namespace, enter the namespace you want to troubleshoot.
(Optional) Create an alert to notify you of future CrashLoopBackOff errors:
- In the Future Mitigation Tips section, select Create an Alert.
Inspect logs
You can find out why your Pod's container is crashing using the kubectl command-line tool or the Google Cloud console.
kubectl
To see all Pods running in your cluster, run the following command:
kubectl get pods
Look for the Pod with the CrashLoopBackOff error.
To get the Pod's logs, run the following command:
kubectl logs POD_NAME
Replace POD_NAME with the name of the problematic Pod.
You can also pass in the -p flag to get the logs for the previous instance of a Pod's container, if it exists.
Console
Perform the following steps:
Go to the Workloads page in the Google Cloud console.
Select the desired workload. The Overview tab displays the status of the workload.
From the Managed Pods section, click the problematic Pod.
From the Pod's menu, click the Logs tab.
Check "Exit Code" of the crashed container
You can find the exit code by performing the following tasks:
Run the following command:
kubectl describe pod POD_NAME
Replace POD_NAME with the name of the Pod.
Review the value in the containers: CONTAINER_NAME: last state: exit code field:
- If the exit code is 1, the container crashed because the application crashed.
- If the exit code is 0, verify for how long your app was running.
Containers exit when your application's main process exits. If your app finishes execution very quickly, the container might continue to restart.
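The exit code cases above can be sketched as a small triage helper. The function name describe_exit_code is hypothetical. The branch for codes above 128 is not covered in this guide: it relies on the common convention that a process terminated by signal N reports exit code 128 + N, so 137 corresponds to signal 9 (SIGKILL).

```shell
# Hypothetical helper: print a rough interpretation of a container exit code.
describe_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    # The main process finished on its own.
    echo "exited normally; check how long the main process ran"
  elif [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
    # Convention (assumption, not from this guide): 128 + signal number.
    echo "terminated by signal $(( $code - 128 ))"
  else
    echo "application error (exit code $code)"
  fi
}

# Example use with the value from `kubectl describe pod POD_NAME`:
#   describe_exit_code 137
```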
Connect to a running container
Open a shell to the Pod:
kubectl exec -it POD_NAME -- /bin/bash
If there is more than one container in your Pod, add -c CONTAINER_NAME.
Now, you can run bash commands from the container: you can test the network or check if you have access to files or databases used by your application.
ImagePullBackOff and ErrImagePull
ImagePullBackOff and ErrImagePull indicate that the image used by a container cannot be loaded from the image registry.
You can verify this issue using the Google Cloud console or the kubectl command-line tool.
kubectl
To get more information about a Pod's container image, run the following command:
kubectl describe pod POD_NAME
Console
Perform the following steps:
Go to the Workloads page in the Google Cloud console.
Select the desired workload. The Overview tab displays the status of the workload.
From the Managed Pods section, click the problematic Pod.
From the Pod's menu, click the Events tab.
If the image is not found
If your image is not found:
- Verify that the image's name is correct.
- Verify that the image's tag is correct. (Try :latest or no tag to pull the latest image).
- If the image has a full registry path, verify that it exists in the Docker registry you are using. If you provide only the image name, check the Docker Hub registry.
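To see which registry an image reference points at, the usual Docker naming rule can be sketched in shell: the first path component is treated as a registry host only if it contains a dot or a colon, or is localhost; otherwise the image resolves to Docker Hub. The function name image_registry is hypothetical, and the sketch does not validate the reference.

```shell
# Hypothetical helper: print the registry host implied by an image reference.
image_registry() {
  first=${1%%/*}
  case "$1" in
    */*)
      # A first component with a dot, a colon, or "localhost" names a registry.
      case "$first" in
        *.*|*:*|localhost) echo "$first"; return ;;
      esac
      ;;
  esac
  # Bare names such as "nginx" or "library/nginx" resolve to Docker Hub.
  echo "docker.io"
}

# Example use:
#   image_registry gcr.io/PROJECT_ID/IMAGE:TAG
```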
Try to pull the Docker image manually:
SSH into the node:
For example, to SSH into a VM:
gcloud compute ssh VM_NAME --zone=ZONE_NAME
Replace the following:
- VM_NAME: the name of the VM.
- ZONE_NAME: a Compute Engine zone.
Run docker-credential-gcr configure-docker. This command generates a config file at /home/[USER]/.docker/config.json. Ensure that this file includes the registry of the image in the credHelpers field. For example, the following file includes authentication information for images hosted at asia.gcr.io, eu.gcr.io, gcr.io, marketplace.gcr.io, and us.gcr.io:
{
  "auths": {},
  "credHelpers": {
    "asia.gcr.io": "gcr",
    "eu.gcr.io": "gcr",
    "gcr.io": "gcr",
    "marketplace.gcr.io": "gcr",
    "us.gcr.io": "gcr"
  }
}
Run docker pull IMAGE_NAME.
If this option works, you probably need to specify ImagePullSecrets on a Pod. Pods can only reference image pull secrets in their own namespace, so this process needs to be done one time per namespace.
Permission denied error
If you encounter a "permission denied" or "no pull access" error, verify that you are logged in and have access to the image. Try one of the following methods depending on the registry in which you host your images.
Artifact Registry
If your image is in Artifact Registry, your node pool's service account needs read access to the repository that contains the image.
Grant the artifactregistry.reader role to the service account:
gcloud artifacts repositories add-iam-policy-binding REPOSITORY_NAME \
    --location=REPOSITORY_LOCATION \
    --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
    --role="roles/artifactregistry.reader"
Replace the following:
- REPOSITORY_NAME: the name of your Artifact Registry repository.
- REPOSITORY_LOCATION: the region of your Artifact Registry repository.
- SERVICE_ACCOUNT_EMAIL: the email address of the IAM service account associated with your node pool.
Container Registry
If your image is in Container Registry, your node pool's service account needs read access to the Cloud Storage bucket that contains the image.
Grant the roles/storage.objectViewer role to the service account so that it can read from the bucket:
gsutil iam ch \
    serviceAccount:SERVICE_ACCOUNT_EMAIL:roles/storage.objectViewer \
    gs://BUCKET_NAME
Replace the following:
- SERVICE_ACCOUNT_EMAIL: the email of the service account associated with your node pool. You can list all the service accounts in your project using gcloud iam service-accounts list.
- BUCKET_NAME: the name of the Cloud Storage bucket that contains your images. You can list all the buckets in your project using gsutil ls.
If your registry administrator set up gcr.io repositories in Artifact Registry to store images for the gcr.io domain instead of Container Registry, you must grant read access to Artifact Registry instead of Container Registry.
Private registry
If your image is in a private registry, you might require keys to access the images. See Using private registries for more information.
401 Unauthorized: Cannot pull images from private container registry repository
An error similar to the following might occur when you pull an image from a private Container Registry repository:
gcr.io/PROJECT_ID/IMAGE:TAG: rpc error: code = Unknown desc = failed to pull and unpack image gcr.io/PROJECT_ID/IMAGE:TAG: failed to resolve reference gcr.io/PROJECT_ID/IMAGE:TAG: unexpected status code [manifests 1.0]: 401 Unauthorized
Warning  Failed   3m39s (x4 over 5m12s)  kubelet  Error: ErrImagePull
Warning  Failed   3m9s (x6 over 5m12s)   kubelet  Error: ImagePullBackOff
Normal   BackOff  2s (x18 over 5m12s)    kubelet  Back-off pulling image
Identify the node running the pod:
kubectl describe pod POD_NAME | grep "Node:"
Verify the node has the storage scope:
gcloud compute instances describe NODE_NAME \
    --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)"
The node's access scope should contain at least one of the following:
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/devstorage.read_only
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/cloud-platform
Recreate the node pool that the node belongs to with sufficient scope. You cannot modify existing nodes; you must recreate the node with the correct scope.
Recommended: create a new node pool with the gke-default scope:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --scopes="gke-default"
Create a new node pool with only storage scope:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --scopes="https://www.googleapis.com/auth/devstorage.read_only"
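The scope check described above can be automated with a small filter over the flattened scope list. This is a sketch under the assumption that the input has one scope URL per line, as printed by the describe command in this section; the function name has_pull_scope is hypothetical, and it tests only for the two scopes named in this section.

```shell
# Hypothetical helper: read flattened scope lines on stdin and succeed if
# either the read-only storage scope or the cloud-platform scope is present.
has_pull_scope() {
  grep -Eq 'googleapis\.com/auth/(devstorage\.read_only|cloud-platform)[[:space:]]*$'
}

# Example use:
#   gcloud compute instances describe NODE_NAME \
#       --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)" \
#     | has_pull_scope || echo "node pool needs to be recreated with a storage scope"
```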
Pod unschedulable
PodUnschedulable indicates that your Pod cannot be scheduled because of insufficient resources or some configuration error.
If you have configured your GKE cluster to send Kubernetes API server and Kubernetes scheduler metrics to Cloud Monitoring, you can find more information about these errors in scheduler metrics and API server metrics.
You can troubleshoot PodUnschedulable errors using the Google Cloud console:
Go to the Unschedulable Pods Interactive Playbook:
For Cluster, enter the name of the cluster you want to troubleshoot.
For Namespace, enter the namespace you want to troubleshoot.
(Optional) Create an alert to notify you of future PodUnschedulable errors:
- In the Future Mitigation Tips section, select Create an Alert.
Insufficient resources
You might encounter an error indicating a lack of CPU, memory, or another resource. For example: "No nodes are available that match all of the predicates: Insufficient cpu (2)" which indicates that on two nodes there isn't enough CPU available to fulfill a Pod's requests.
If your Pod's resource requests exceed the capacity of a single node in every eligible node pool, GKE does not schedule the Pod and also does not trigger a scale-up to add a new node. For GKE to schedule the Pod, you must either request fewer resources for the Pod, or create a new node pool with sufficient resources.
You can also enable node auto-provisioning so that GKE can automatically create node pools with nodes where the unscheduled Pods can run.
The default CPU request is 100m, which is 10% of one CPU core. If you want to request more or fewer resources, specify the value in the Pod specification under spec: containers: resources: requests.
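CPU quantities appear either in millicores (100m) or in whole or fractional cores (2, 0.5), which makes requests awkward to compare against node capacity. The following sketch normalizes both forms to millicores; the function name cpu_to_millicores is hypothetical, and it does not handle other Kubernetes quantity suffixes.

```shell
# Hypothetical helper: convert a Kubernetes CPU quantity to millicores.
cpu_to_millicores() {
  case "$1" in
    # Already in millicores: strip the trailing "m".
    *m) echo "${1%m}" ;;
    # Whole or fractional cores: multiply by 1000 and round to nearest.
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Example use:
#   cpu_to_millicores 100m
#   cpu_to_millicores 0.5
```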
MatchNodeSelector
MatchNodeSelector indicates that there are no nodes that match the Pod's label selector.
To verify this, check the labels specified in the Pod specification's nodeSelector field, under spec: nodeSelector.
To see how nodes in your cluster are labelled, run the following command:
kubectl get nodes --show-labels
To attach a label to a node, run the following command:
kubectl label nodes NODE_NAME LABEL_KEY=LABEL_VALUE
Replace the following:
- NODE_NAME: the desired node.
- LABEL_KEY: the label's key.
- LABEL_VALUE: the label's value.
For more information, refer to Assigning Pods to Nodes.
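The --show-labels output prints each node's labels as one comma-separated string, which is hard to scan by eye. A minimal sketch of an exact-match check follows; the function name has_label is hypothetical.

```shell
# Hypothetical helper: succeed if the comma-separated label string $1
# contains the exact KEY=VALUE pair $2.
has_label() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Example use (the LABELS column is the last field of the output):
#   labels=$(kubectl get node NODE_NAME --show-labels --no-headers | awk '{ print $NF }')
#   has_label "$labels" "LABEL_KEY=LABEL_VALUE" || echo "label missing on NODE_NAME"
```

Wrapping both strings in commas keeps a pair such as disktype=ssd from matching inside a longer key like xdisktype=ssd.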
PodToleratesNodeTaints
PodToleratesNodeTaints indicates that the Pod can't be scheduled to any node because the Pod doesn't have tolerations that match the taints on the available nodes.
To verify that this is the case, run the following command:
kubectl describe nodes NODE_NAME
In the output, check the Taints field, which lists key-value pairs and scheduling effects.
If the effect listed is NoSchedule, then no Pod can be scheduled on that node unless it has a matching toleration.
One way to resolve this issue is to remove the taint. For example, to remove a NoSchedule taint, run the following command:
kubectl taint nodes NODE_NAME key:NoSchedule-
PodFitsHostPorts
PodFitsHostPorts indicates that a port that a node is attempting to use is already in use.
To resolve this issue, check the Pod specification's hostPort value under spec: containers: ports: hostPort. You might need to change this value to another port.
Does not have minimum availability
If a node has adequate resources but you still see the Does not have minimum availability message, check the node's status. If the status is SchedulingDisabled or Cordoned, the node cannot schedule new Pods. You can check the status of a node using the Google Cloud console or the kubectl command-line tool.
kubectl
To get statuses of your nodes, run the following command:
kubectl get nodes
To enable scheduling on the node, run:
kubectl uncordon NODE_NAME
Console
Perform the following steps:
Go to the Google Kubernetes Engine page in the Google Cloud console.
Select the desired cluster. The Nodes tab displays the Nodes and their status.
To enable scheduling on the Node, perform the following steps:
From the list, click the desired Node.
From the Node Details, click the Uncordon button.
Maximum pods per node limit reached
If the Maximum pods per node limit is reached by all nodes in the cluster, the Pods will be stuck in Unschedulable state. Under the Pod Events tab, you will see a message including the phrase Too many pods.
Check the Maximum pods per node configuration from the Nodes tab in GKE cluster details in the Google Cloud console.
Get a list of nodes:
kubectl get nodes
For each node, verify the number of Pods running on the node:
kubectl get pods -o wide | grep NODE_NAME | wc -l
If the limit is reached, add a new node pool or add additional nodes to the existing node pool.
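Counting Pods one node at a time is slow on larger clusters, and the grep-based count above covers only the current namespace. As an alternative sketch, the node name of every Pod can be tallied in one pass. The function name count_by_node is hypothetical; the commented command assumes the standard custom-columns output format with the .spec.nodeName field.

```shell
# Hypothetical helper: read one node name per line on stdin and print
# "NODE COUNT" for each distinct node.
count_by_node() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Example use across all namespaces:
#   kubectl get pods --all-namespaces --no-headers \
#       -o custom-columns=NODE:.spec.nodeName | count_by_node
```

Compare each count with the Maximum pods per node value for the node's pool.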
Maximum node pool size reached with cluster autoscaler enabled
If the node pool has reached its maximum size according to its cluster autoscaler configuration, GKE does not trigger scale up for the Pod that would otherwise be scheduled with this node pool. If you want the Pod to be scheduled with this node pool, change the cluster autoscaler configuration.
Maximum node pool size reached with cluster autoscaler disabled
If the node pool has reached its maximum number of nodes, and cluster autoscaler is disabled, GKE cannot schedule the Pod with the node pool. Increase the size of your node pool or enable cluster autoscaler for GKE to resize your cluster automatically.
Unbound PersistentVolumeClaims
Unbound PersistentVolumeClaims indicates that the Pod references a PersistentVolumeClaim that is not bound. This error might happen if your PersistentVolume failed to provision. You can verify that provisioning failed by getting the events for your PersistentVolumeClaim and examining them for failures.
To get events, run the following command:
kubectl describe pvc STATEFULSET_NAME-PVC_NAME-0
Replace the following:
- STATEFULSET_NAME: the name of the StatefulSet object.
- PVC_NAME: the name of the PersistentVolumeClaim object.
This may also happen if there was a configuration error during your manual pre-provisioning of a PersistentVolume and its binding to a PersistentVolumeClaim. You can try to pre-provision the volume again.
Insufficient quota
Verify that your project has sufficient Compute Engine quota for GKE to scale up your cluster. If GKE attempts to add a node to your cluster to schedule the Pod, and scaling up would exceed your project's available quota, you receive the scale.up.error.quota.exceeded error message.
To learn more, see ScaleUp errors.
Deprecated APIs
Ensure that you are not using deprecated APIs that are removed with your cluster's minor version. To learn more, see GKE deprecations.
Connectivity issues
As mentioned in the Network Overview discussion, it is important to understand how Pods are wired from their network namespaces to the root namespace on the node in order to troubleshoot effectively. For the following discussion, unless otherwise stated, assume that the cluster uses GKE's native CNI rather than Calico's. That is, no network policy has been applied.
Pods on select nodes have no availability
If Pods on select nodes have no network connectivity, ensure that the Linux bridge is up:
ip address show cbr0
If the Linux bridge is down, raise it:
sudo ip link set cbr0 up
Ensure that the node is learning Pod MAC addresses attached to cbr0:
arp -an
Pods on select nodes have minimal connectivity
If Pods on select nodes have minimal connectivity, first confirm whether any packets are being lost by running tcpdump in the toolbox container:
sudo toolbox bash
Install tcpdump in the toolbox if you have not done so already:
apt install -y tcpdump
Run tcpdump against cbr0:
tcpdump -ni cbr0 host HOSTNAME and port PORT_NUMBER and [TCP|UDP|ICMP]
If it appears that large packets are being dropped downstream from the bridge (for example, the TCP handshake completes, but no SSL hellos are received), ensure that the MTU for each Linux Pod interface is correctly set to the MTU of the cluster's VPC network:
ip address show cbr0
When overlays are used (for example, Weave or Flannel), this MTU must be further reduced to accommodate encapsulation overhead on the overlay.
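The reduction is simple arithmetic: the overlay MTU is the underlying MTU minus the encapsulation overhead. The sketch below illustrates this. The function name overlay_mtu is hypothetical, and the 50-byte overhead in the usage comment is an assumption that corresponds to VXLAN over IPv4; check your overlay's documentation for its actual overhead.

```shell
# overlay_mtu BASE_MTU OVERHEAD_BYTES
# Hypothetical helper: prints the MTU left for Pod traffic after the
# overlay's encapsulation headers are subtracted from the underlying MTU.
overlay_mtu() {
  echo $(( $1 - $2 ))
}

# Example: a 1460-byte VPC MTU with an assumed 50-byte overhead.
#   overlay_mtu 1460 50
```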
GKE MTU
The MTU selected for a Pod interface depends on the Container Network Interface (CNI) used by the cluster Nodes and the underlying VPC MTU setting. For more information, see Pods.
The Pod interface MTU value is either 1460 or inherited from the primary interface of the Node.
CNI | MTU | GKE Standard |
---|---|---|
kubenet | 1460 | Default |
kubenet (GKE version 1.26.1 and later) | Inherited | Default |
Calico | 1460 | Enabled by using For details, see Control communication between Pods and Services using network policies. |
netd | Inherited | Enabled by using any of the following: |
GKE Dataplane V2 | Inherited | Enabled by using For details, see Using GKE Dataplane V2. |
Intermittent failed connections
Connections to and from the Pods are forwarded by iptables. Flows are tracked as entries in the conntrack table and, where there are many workloads per node, conntrack table exhaustion may manifest as a failure. These failures can be logged in the serial console of the node, for example:
nf_conntrack: table full, dropping packet
If you are able to determine that intermittent issues are driven by conntrack exhaustion, you may increase the size of the cluster (thus reducing the number of workloads and flows per node), or increase nf_conntrack_max:
new_ct_max=$(awk '$1 == "MemTotal:" { printf "%d\n", $2/32; exit; }' /proc/meminfo)
sysctl -w net.netfilter.nf_conntrack_max="${new_ct_max:?}" \
  && echo "net.netfilter.nf_conntrack_max=${new_ct_max:?}" >> /etc/sysctl.conf
You can also use NodeLocal DNSCache to reduce connection tracking entries.
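Before changing nf_conntrack_max, it can be useful to see how full the table is. The following minimal sketch computes the usage percentage. The function name conntrack_usage_pct is hypothetical, and the /proc paths in the usage comment assume the nf_conntrack kernel module is loaded on the node.

```shell
# conntrack_usage_pct COUNT MAX
# Hypothetical helper: prints the conntrack table usage as a whole-number
# percentage (integer division, rounded down).
conntrack_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Example usage on a node:
#   conntrack_usage_pct \
#     "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#     "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

A value that stays close to 100 is consistent with the "table full, dropping packet" message shown earlier.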
"bind: Address already in use" reported for a container
A container in a Pod is unable to start because, according to the container logs, the port that the application is trying to bind to is already reserved. The container is crash looping. For example, in Cloud Logging:
resource.type="container"
textPayload:"bind: Address already in use"
resource.labels.container_name="redis"
2018-10-16 07:06:47.000 CEST 16 Oct 05:06:47.533 # Creating Server TCP listening socket *:60250: bind: Address already in use
2018-10-16 07:07:35.000 CEST 16 Oct 05:07:35.753 # Creating Server TCP listening socket *:60250: bind: Address already in use
When Docker crashes, sometimes a running container gets left behind and is stale. The process is still running in the network namespace allocated for the Pod, and listening on its port. Because Docker and the kubelet don't know about the stale container, they try to start a new container with a new process, which is unable to bind on the port as it gets added to the network namespace already associated with the Pod.
To diagnose this problem:
You need the UUID of the Pod in the .metadata.uid field:
kubectl get pod -o custom-columns="name:.metadata.name,UUID:.metadata.uid" ubuntu-6948dd5657-4gsgg
name UUID
ubuntu-6948dd5657-4gsgg db9ed086-edba-11e8-bdd6-42010a800164
Get the output of the following commands from the node:
docker ps -a
ps -eo pid,ppid,stat,wchan:20,netns,comm,args:50,cgroup --cumulative -H | grep [Pod UUID]
Check running processes from this Pod. Because the UUID of the cgroup namespaces contains the UUID of the Pod, you can grep for the Pod UUID in ps output. Also grep the line before, so that you get the docker-containerd-shim processes, which have the container ID in the argument as well. Cut the rest of the cgroup column to get simpler output:
# ps -eo pid,ppid,stat,wchan:20,netns,comm,args:50,cgroup --cumulative -H | grep -B 1 db9ed086-edba-11e8-bdd6-42010a800164 | sed s/'blkio:.*'/''/
1283089 959 Sl futex_wait_queue_me 4026531993 docker-co docker-containerd-shim 276e173b0846e24b704d4 12:
1283107 1283089 Ss sys_pause 4026532393 pause /pause 12:
1283150 959 Sl futex_wait_queue_me 4026531993 docker-co docker-containerd-shim ab4c7762f5abf40951770 12:
1283169 1283150 Ss do_wait 4026532393 sh /bin/sh -c echo hello && sleep 6000000 12:
1283185 1283169 S hrtimer_nanosleep 4026532393 sleep sleep 6000000 12:
1283244 959 Sl futex_wait_queue_me 4026531993 docker-co docker-containerd-shim 44e76e50e5ef4156fd5d3 12:
1283263 1283244 Ss sigsuspend 4026532393 nginx nginx: master process nginx -g daemon off; 12:
1283282 1283263 S ep_poll 4026532393 nginx nginx: worker process
From this list, you can see the container IDs, which should be visible in docker ps as well. In this case:
- docker-containerd-shim 276e173b0846e24b704d4 for pause
- docker-containerd-shim ab4c7762f5abf40951770 for sh with sleep (sleep-ctr)
- docker-containerd-shim 44e76e50e5ef4156fd5d3 for nginx (echoserver-ctr)
Check those in the docker ps output:
# docker ps --no-trunc | egrep '276e173b0846e24b704d4|ab4c7762f5abf40951770|44e76e50e5ef4156fd5d3'
44e76e50e5ef4156fd5d383744fa6a5f14460582d0b16855177cbed89a3cbd1f gcr.io/google_containers/echoserver@sha256:3e7b182372b398d97b747bbe6cb7595e5ffaaae9a62506c725656966d36643cc "nginx -g 'daemon off;'" 14 hours ago Up 14 hours k8s_echoserver-cnt_ubuntu-6948dd5657-4gsgg_default_db9ed086-edba-11e8-bdd6-42010a800164_0
ab4c7762f5abf40951770d3e247fa2559a2d1f8c8834e5412bdcec7df37f8475 ubuntu@sha256:acd85db6e4b18aafa7fcde5480872909bd8e6d5fbd4e5e790ecc09acc06a8b78 "/bin/sh -c 'echo hello && sleep 6000000'" 14 hours ago Up 14 hours k8s_sleep-cnt_ubuntu-6948dd5657-4gsgg_default_db9ed086-edba-11e8-bdd6-42010a800164_0
276e173b0846e24b704d41cf4fbb950bfa5d0f59c304827349f4cf5091be3327 registry.k8s.io/pause-amd64:3.1
In normal cases, you see all container IDs from ps showing up in docker ps. If there is one you don't see, it's a stale container, and you will probably see a child process of the docker-containerd-shim process listening on the TCP port that is reported as already in use.
To verify this, execute netstat in the container's network namespace. Get the PID of any container process (so NOT docker-containerd-shim) for the Pod.
From the above example:
- 1283107 - pause
- 1283169 - sh
- 1283185 - sleep
- 1283263 - nginx master
- 1283282 - nginx worker
# nsenter -t 1283107 --net netstat -anp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 1283263/nginx: mast
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 3 [ ] STREAM CONNECTED 3097406 1283263/nginx: mast
unix 3 [ ] STREAM CONNECTED 3097405 1283263/nginx: mast
gke-zonal-110-default-pool-fe00befa-n2hx ~ # nsenter -t 1283169 --net netstat -anp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 1283263/nginx: mast
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 3 [ ] STREAM CONNECTED 3097406 1283263/nginx: mast
unix 3 [ ] STREAM CONNECTED 3097405 1283263/nginx: mast
You can also execute netstat using ip netns, but you need to link the network namespace of the process manually, because Docker does not create the link:
# ln -s /proc/1283169/ns/net /var/run/netns/1283169
gke-zonal-110-default-pool-fe00befa-n2hx ~ # ip netns list
1283169 (id: 2)
gke-zonal-110-default-pool-fe00befa-n2hx ~ # ip netns exec 1283169 netstat -anp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 1283263/nginx: mast
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 3 [ ] STREAM CONNECTED 3097406 1283263/nginx: mast
unix 3 [ ] STREAM CONNECTED 3097405 1283263/nginx: mast
gke-zonal-110-default-pool-fe00befa-n2hx ~ # rm /var/run/netns/1283169
Mitigation:
The short-term mitigation is to identify stale processes by the method outlined above, and end the processes using the kill [PID] command.
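The comparison between the shim container IDs and the docker ps container IDs can be scripted. The following is a minimal sketch under stated assumptions: the function name stale_shim_ids and both input files are hypothetical, the first file holds one (possibly truncated) container ID per line taken from the docker-containerd-shim arguments, and the second holds full IDs such as those printed by docker ps -q --no-trunc.

```shell
# stale_shim_ids SHIM_IDS_FILE DOCKER_IDS_FILE
# Hypothetical helper: prints every ID from SHIM_IDS_FILE that is not a
# prefix of any ID in DOCKER_IDS_FILE. Such IDs belong to containers that
# Docker no longer tracks, which makes them candidates for stale containers.
stale_shim_ids() {
  while IFS= read -r shim_id; do
    [ -n "$shim_id" ] || continue
    # Container IDs are hexadecimal, so they are safe to use in a regex.
    if ! grep -q "^$shim_id" "$2"; then
      echo "$shim_id"
    fi
  done < "$1"
}

# Example usage on a node (file names are placeholders):
#   docker ps -q --no-trunc > docker-ids.txt
#   stale_shim_ids shim-ids.txt docker-ids.txt
```

Any ID that is printed should still be verified with netstat, as described above, before you end its processes.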
Long-term mitigation involves identifying why Docker is crashing and fixing that. Possible reasons include:
- Zombie processes piling up, so running out of PID namespaces
- Bug in docker
- Resource pressure / OOM
Error: "failed to allocate for range 0: no IP addresses in range set"
GKE version 1.18.17 and later fixed an issue where out-of-memory (OOM) events would result in incorrect Pod eviction if the Pod was deleted before its containers were started. This incorrect eviction could result in orphaned pods that continued to have reserved IP addresses from the allocated node range. Over time, GKE ran out of IP addresses to allocate to new pods because of the build-up of orphaned pods. This led to the error message failed to allocate for range 0: no IP addresses in range set, because the allocated node range didn't have available IPs to assign to new pods.
To resolve this issue, upgrade your cluster and node pools to GKE version 1.18.17 or later.
To prevent this issue and resolve it on clusters with GKE versions prior to 1.18.17, increase your resource limits to avoid OOM events in the future, and then reclaim the IP addresses by removing the orphaned pods.
You can also view GKE IP address utilization insights.
Remove the orphaned pods from affected nodes
You can remove the orphaned pods by draining the node, upgrading the node pool, or moving the affected directories.
Draining the node (recommended)
Cordon the node to prevent new pods from scheduling on it:
kubectl cordon NODE
Replace NODE with the name of the node you want to drain.
Drain the node. GKE automatically reschedules pods managed by deployments onto other nodes. Use the --force flag to drain orphaned pods that don't have a managing resource.
kubectl drain NODE --force
Uncordon the node to allow GKE to schedule new pods on it:
kubectl uncordon NODE
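The three steps above can be chained so that a later step runs only if the earlier one succeeded. This is a minimal sketch: the function name recycle_node and the KUBECTL variable are hypothetical conveniences, and the flags are exactly those shown above.

```shell
# KUBECTL lets you substitute another command (for example "echo") for a
# dry run. It defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# recycle_node NODE
# Hypothetical helper: cordons, drains, and uncordons NODE, stopping at the
# first step that fails so that a node is not uncordoned after a failed drain.
recycle_node() {
  node="$1"
  "$KUBECTL" cordon "$node" &&
    "$KUBECTL" drain "$node" --force &&
    "$KUBECTL" uncordon "$node"
}

# Example usage (requires kubectl access to the cluster):
#   recycle_node NODE
```

Note that kubectl drain can refuse to proceed when DaemonSet-managed pods are present unless the --ignore-daemonsets flag is also passed; the sketch leaves that decision to you.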
Moving affected directories
You can identify orphaned Pod directories in /var/lib/kubelet/pods and move them out of the main directory to allow GKE to terminate the pods.
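One way to identify candidates is to compare the directory names with the Pod UIDs that the API server still reports for the node. The following minimal sketch assumes that each directory under /var/lib/kubelet/pods is named after a Pod UID. The function name orphaned_pod_dirs and both input files are hypothetical.

```shell
# orphaned_pod_dirs DIR_UIDS_FILE API_UIDS_FILE
# Hypothetical helper: prints every UID listed in DIR_UIDS_FILE (one per
# line) that does not appear in API_UIDS_FILE. Prints nothing when every
# directory matches a known Pod.
orphaned_pod_dirs() {
  grep -vxF -f "$2" "$1" || true
}

# Example usage (file names are placeholders):
#   On the node:
#     ls /var/lib/kubelet/pods > dir-uids.txt
#   From a machine with kubectl access:
#     kubectl get pods --all-namespaces \
#       --field-selector spec.nodeName=NODE \
#       -o custom-columns=UID:.metadata.uid --no-headers > api-uids.txt
#   Then:
#     orphaned_pod_dirs dir-uids.txt api-uids.txt
```

Review the printed UIDs before moving any directory, because a Pod that is starting up can briefly appear in only one of the two lists.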
Troubleshooting issues with terminating resources
Namespace stuck in Terminating state
Namespaces use Kubernetes finalizers to prevent deletion when one or more resources within a namespace still exist. When you delete a namespace using the kubectl delete command, the namespace enters the Terminating state until Kubernetes deletes its dependent resources and clears all finalizers. The namespace lifecycle controller first lists all resources in the namespace that GKE needs to delete. If GKE can't delete a dependent resource, or if the namespace lifecycle controller can't verify that the namespace is empty, the namespace remains in the Terminating state until you resolve the issue.
To resolve a namespace stuck in the Terminating state, you need to identify and remove the unhealthy components blocking the deletion. Try one of the following solutions.
Find and remove unavailable API services
List unavailable API services:
kubectl get apiservice | grep False
Troubleshoot any unresponsive services:
kubectl describe apiservice API_SERVICE
Replace API_SERVICE with the name of the unresponsive service.
Check if the namespace is still terminating:
kubectl get ns | grep Terminating
Find and remove remaining resources
List all the resources remaining in the terminating namespace:
kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get -n NAMESPACE
Replace NAMESPACE with the name of the namespace you want to delete.
Remove any resources displayed in the output.
Check if the namespace is still terminating:
kubectl get ns | grep Terminating
Force delete the namespace
You can remove the finalizers blocking namespace deletion to force the namespace to terminate.
Save the namespace manifest as a YAML file:
kubectl get ns NAMESPACE -o yaml > ns-terminating.yml
Open the manifest in a text editor and remove all values in the spec.finalizers field:
vi ns-terminating.yml
Verify that the finalizers field is empty:
cat ns-terminating.yml
The output should look like the following:
apiVersion: v1
kind: Namespace
metadata:
  annotations:
  name: NAMESPACE
spec:
  finalizers:
status:
  phase: Terminating
Start an HTTP proxy to access the Kubernetes API:
kubectl proxy
Replace the namespace manifest using curl:
curl -H "Content-Type: application/yaml" -X PUT --data-binary @ns-terminating.yml http://127.0.0.1:8001/api/v1/namespaces/NAMESPACE/finalize
Check if the namespace is still terminating:
kubectl get ns | grep Terminating
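If you prefer not to edit the manifest by hand, the values under spec.finalizers can be removed with a small filter. This is a minimal sketch, not a general YAML editor: the function name strip_spec_finalizers is hypothetical, and it assumes the block-style layout that kubectl get ns -o yaml typically emits (top-level keys at column 0, list items introduced by "- ").

```shell
# strip_spec_finalizers
# Hypothetical helper: reads a Namespace manifest on stdin and writes it to
# stdout with the list items under spec.finalizers removed. Finalizers under
# metadata are left untouched.
strip_spec_finalizers() {
  awk '
    /^[^ ]/ { section = $1 }      # remember the current top-level key
    section == "spec:" && /^ +finalizers:/ { print; skipping = 1; next }
    skipping && /^ +- / { next }  # drop list items under spec.finalizers
    { skipping = 0; print }
  '
}

# Example usage (NAMESPACE is a placeholder):
#   kubectl get ns NAMESPACE -o yaml | strip_spec_finalizers > ns-terminating.yml
```

Inspect the resulting file before you send it to the finalize endpoint, because the filter does not validate the manifest.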
Troubleshooting Cloud NAT packet loss from a GKE cluster
Node VMs in VPC-native GKE private clusters don't have external IP addresses and can't connect to the internet by themselves. You can use Cloud NAT to allocate the external IP addresses and ports that allow private clusters to make public connections.
If a node VM runs out of its allocation of external ports and IP addresses from Cloud NAT, packets will drop. To avoid this, you can reduce the outbound packet rate or increase the allocation of available Cloud NAT source IP addresses and ports. The following sections describe how to diagnose and troubleshoot packet loss from Cloud NAT in the context of GKE private clusters.
Diagnosing packet loss
This section explains how to log dropped packets using Cloud Logging, and diagnose the cause of dropped packets using Cloud Monitoring.
Logging dropped packets
You can log dropped packets with the following query in Cloud Logging:
resource.type="nat_gateway"
resource.labels.region=REGION
resource.labels.gateway_name=GATEWAY_NAME
jsonPayload.allocation_status="DROPPED"
Replace the following:
- REGION: the name of the region that the cluster is in.
- GATEWAY_NAME: the name of the Cloud NAT gateway.
This query returns a list of all packets dropped by a Cloud NAT gateway, but does not identify the cause.
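The same filter can be run from the command line with gcloud logging read. The sketch below builds the filter as a single string joined with AND. The function name nat_drop_filter is hypothetical, and quoting the region and gateway name as string values is an assumption about how you want to pass them.

```shell
# nat_drop_filter REGION GATEWAY_NAME
# Hypothetical helper: prints a Cloud Logging filter that selects dropped
# packets for one Cloud NAT gateway.
nat_drop_filter() {
  printf 'resource.type="nat_gateway" AND resource.labels.region="%s" AND resource.labels.gateway_name="%s" AND jsonPayload.allocation_status="DROPPED"\n' "$1" "$2"
}

# Example usage (requires the gcloud CLI and access to the project):
#   gcloud logging read "$(nat_drop_filter REGION GATEWAY_NAME)" --limit=20
```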
Monitoring causes for packet loss
To identify causes for dropped packets, query the Metrics observer in Cloud Monitoring. Packets drop for one of three reasons:
- OUT_OF_RESOURCES
- ENDPOINT_INDEPENDENT_CONFLICT
- NAT_ALLOCATION_FAILED
To identify packets dropped due to OUT_OF_RESOURCES or ENDPOINT_INDEPENDENT_CONFLICT error codes, use the following query:
fetch nat_gateway
  metric 'router.googleapis.com/nat/dropped_sent_packets_count'
  filter (resource.gateway_name == NAT_NAME)
  align rate(1m)
  every 1m
  group_by [metric.reason],
    [value_dropped_sent_packets_count_aggregate:
       aggregate(value.dropped_sent_packets_count)]
To identify packets dropped due to the NAT_ALLOCATION_FAILED error code, use the following query:
fetch nat_gateway
  metric 'router.googleapis.com/nat/nat_allocation_failed'
  group_by 1m,
    [value_nat_allocation_failed_count_true:
       count_true(value.nat_allocation_failed)]
  every 1m
Troubleshooting Cloud NAT with GKE IP masquerading
If the previous queries return empty results, and GKE Pods are unable to communicate with external IP addresses, troubleshoot your configuration:
Configuration | Troubleshooting |
---|---|
Cloud NAT configured to apply only to the subnet's primary IP address range. | When Cloud NAT is configured only for the subnet's primary IP address range, packets sent from the cluster to external IP addresses must have a source node IP address. In this Cloud NAT configuration: |
Cloud NAT configured to apply only to the subnet's secondary IP address range used for Pod IPs. | When Cloud NAT is configured only for the subnet's secondary IP address range used by the cluster's Pod IPs, packets sent from the cluster to external IP addresses must have a source Pod IP address. In this Cloud NAT configuration: |
Optimizations to avoid packet loss
You can avoid packet loss by doing one of the following:
- Configuring the Cloud NAT gateway to use dynamic port allocation and increasing the maximum number of ports per VM.
- Increasing the minimum number of ports per VM if you use static port allocation.
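Raising the number of ports per VM reduces how many VMs a single NAT IP address can serve, so it helps to check the arithmetic before changing the setting. The sketch below assumes that each Cloud NAT IP address provides 64,512 usable source ports per protocol, as described in the Cloud NAT Ports and connections documentation. The function name vms_per_nat_ip is hypothetical.

```shell
# Usable source ports per NAT IP address (assumption taken from the
# Cloud NAT documentation: 65,536 ports minus the first 1,024).
NAT_PORTS_PER_IP=64512

# vms_per_nat_ip PORTS_PER_VM
# Hypothetical helper: prints how many VMs one NAT IP address can serve
# when every VM is allocated PORTS_PER_VM ports (rounded down).
vms_per_nat_ip() {
  echo $(( NAT_PORTS_PER_IP / $1 ))
}

# Example:
#   vms_per_nat_ip 64
#   vms_per_nat_ip 1024
```

If the result is lower than the number of nodes in the cluster, the gateway needs additional NAT IP addresses to cover every node.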
Optimizing your application
When an application makes multiple outbound connections to the same destination IP address and port, it can quickly consume all connections Cloud NAT can make to that destination using the number of allocated NAT source addresses and source port tuples. In this scenario, reducing the application's outbound packet rate helps to reduce packet loss.
For details about how Cloud NAT uses NAT source addresses and source ports to make connections, including limits on the number of simultaneous connections to a destination, refer to Ports and connections.
Reducing the rate of outbound connections from the application can help to mitigate packet loss. You can accomplish this by reusing open connections. Common methods of reusing connections include connection pooling, multiplexing connections using protocols such as HTTP/2, or establishing persistent connections reused for multiple requests. For more information, see Ports and Connections.
Node version not compatible with control plane version
Check what version of Kubernetes your cluster's control plane is running, and then check what version of Kubernetes your cluster's node pools are running. If any of the cluster's node pools are more than two minor versions older than the control plane, this might be causing issues with your cluster.
Periodically, the GKE team performs upgrades of the cluster control plane on your behalf. Control planes are upgraded to newer stable versions of Kubernetes. By default, a cluster's nodes have auto-upgrade enabled, and it is recommended that you do not disable it.
If auto-upgrade is disabled for a cluster's nodes, and you do not manually upgrade your node pool version to a version that is compatible with the control plane, your control plane will eventually become incompatible with your nodes as the control plane is automatically upgraded over time. Incompatibility between your cluster's control plane and the nodes can cause unexpected issues.
The Kubernetes version and version skew support policy guarantees that control planes are compatible with nodes up to two minor versions older than the control plane. For example, Kubernetes 1.19 control planes are compatible with Kubernetes 1.19, 1.18, and 1.17 nodes. To resolve this issue, manually upgrade the node pool version to a version that is compatible with the control plane.
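The two-minor-version rule can be checked with a short comparison. This is a minimal sketch that encodes only the rule as stated above. The function names minor_of and node_skew_ok are hypothetical, and the sketch assumes that both versions share the same major version.

```shell
# minor_of VERSION
# Hypothetical helper: prints the minor component of a version string,
# for example 19 for 1.19.8-gke.100.
minor_of() {
  echo "$1" | cut -d. -f2
}

# node_skew_ok CONTROL_PLANE_VERSION NODE_VERSION
# Hypothetical helper: succeeds when the node is at most two minor versions
# older than the control plane and not newer than it.
node_skew_ok() {
  cp_minor=$(minor_of "$1")
  node_minor=$(minor_of "$2")
  skew=$(( cp_minor - node_minor ))
  [ "$skew" -ge 0 ] && [ "$skew" -le 2 ]
}

# Example:
#   node_skew_ok 1.19.8 1.17.3 && echo compatible
#   node_skew_ok 1.19.8 1.16.1 || echo "upgrade the node pool"
```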
If you are concerned about the upgrade process causing disruption to workloads running on the affected nodes, do the following steps to migrate your workloads to a new node pool:
- Create a new node pool with a compatible version.
- Cordon the nodes of the existing node pool.
- Optionally, update your workloads running on the existing node pool to add a nodeSelector for the label cloud.google.com/gke-nodepool:NEW_NODE_POOL_NAME, where NEW_NODE_POOL_NAME is the name of the new node pool. This ensures that GKE places those workloads on nodes in the new node pool.
- Drain the existing node pool.
- Check that the workloads are running successfully in the new node pool. If they are, you can delete the old node pool. If you notice workload disruptions, reschedule the workloads on the existing nodes by uncordoning the nodes in the existing node pool and draining the new nodes. Troubleshoot the issue and try again.
Metrics from your cluster aren't showing up in Cloud Monitoring
Ensure that you have activated the Cloud Monitoring API and the Cloud Logging API on your project, and that you are able to view your project in Cloud Monitoring.
If the issue persists, check the following potential causes:
Ensure that you have enabled monitoring on your cluster.
Monitoring is enabled by default for clusters created from the Google Cloud console and from the Google Cloud CLI, but you can verify by running the following command or clicking into the cluster's details in the Google Cloud console:
gcloud container clusters describe CLUSTER_NAME
The output from this command should include SYSTEM_COMPONENTS in the list of enableComponents in the monitoringConfig section, similar to this:
monitoringConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
If monitoring is not enabled, run the following command to enable it:
gcloud container clusters update CLUSTER_NAME --monitoring=SYSTEM
How long has it been since your cluster was created or had monitoring enabled?
It can take up to an hour for a new cluster's metrics to start appearing in Cloud Monitoring.
Is a heapster or gke-metrics-agent (the OpenTelemetry Collector) running in your cluster in the "kube-system" namespace?
This pod might be failing to schedule because your cluster is running low on resources. Check whether Heapster or OpenTelemetry is running by calling kubectl get pods --namespace=kube-system and checking for pods with heapster or gke-metrics-agent in the name.
Is your cluster's control plane able to communicate with the nodes?
Cloud Monitoring relies on that. You can check whether this is the case by running the following command:
kubectl logs POD_NAME
If this command returns an error, then the SSH tunnels may be causing the issue. See this section for further information.
If you are having an issue related to the Cloud Logging agent, see its troubleshooting documentation.
For more information, refer to the Logging documentation.
Missing permissions on account for Shared VPC clusters
For Shared VPC clusters, ensure that the service project's GKE service account has a binding for the Host Service Agent User role on the host project. You can do this using the gcloud CLI.
To check if the role binding exists, run the following command in your host project:
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --format='table(bindings.role)' \
  --filter="bindings.members:SERVICE_ACCOUNT_NAME"
Replace the following:
- PROJECT_ID: your host project ID.
- SERVICE_ACCOUNT_NAME: the GKE service account name.
In the output, look for the roles/container.hostServiceAgentUser role:
ROLE
...
roles/container.hostServiceAgentUser
...
If the hostServiceAgentUser role isn't in the list, follow the instructions in Granting the Host Service Agent User role to add the binding to the service account.
Restore default service account to your Google Cloud project
GKE's default service account, container-engine-robot, can accidentally become unbound from a project. GKE Service Agent is an Identity and Access Management (IAM) role that grants the service account the permissions to manage cluster resources. If you remove this role binding from the service account, the default service account becomes unbound from the project, which can prevent you from deploying applications and performing other cluster operations.
You can check whether the service account has been removed from your project using the gcloud CLI or the Google Cloud console.
gcloud
Run the following command:
gcloud projects get-iam-policy PROJECT_ID
Replace PROJECT_ID with your project ID.
Console
Visit the page in the Google Cloud console.
If the command or the dashboard does not display container-engine-robot among your service accounts, the service account has become unbound.
If you removed the GKE Service Agent role binding, run the following commands to restore the role binding:
PROJECT_NUMBER=$(gcloud projects describe "PROJECT_ID" --format 'get(projectNumber)')
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member "serviceAccount:service-${PROJECT_NUMBER?}@container-engine-robot.iam.gserviceaccount.com" \
  --role roles/container.serviceAgent
To confirm that the role binding was granted:
gcloud projects get-iam-policy PROJECT_ID
If you see the service account name along with the container.serviceAgent role, the role binding has been granted. For example:
- members:
  - serviceAccount:service-1234567890@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent
Enable Compute Engine default service account
Your nodes might fail to register with the cluster if the service account used for the node pool, which is usually the Compute Engine default service account, is disabled.
You can verify whether the service account has been disabled in your project using the gcloud CLI or the Google Cloud console.
gcloud
Run the following command:
gcloud iam service-accounts list --filter="NAME~'compute' AND disabled=true"
Console
Go to the page in the Google Cloud console.
If the command or the dashboard shows that the service account is disabled, run the following command to enable the service account:
gcloud iam service-accounts enable PROJECT_NUMBER-compute@developer.gserviceaccount.com
Replace PROJECT_NUMBER with your project number.
If this doesn't solve your node registration issues, refer to Troubleshoot node registration for further troubleshooting instructions.
Pods stuck in pending state after enabling Node Allocatable
If you are experiencing an issue with Pods stuck in pending state after enabling Node Allocatable, please note the following:
Starting with version 1.7.6, GKE reserves CPU and memory for Kubernetes overhead, including Docker and the operating system. See Cluster architecture for information on how much of each machine type can be scheduled by Pods.
If Pods are pending after an upgrade, we suggest the following:
- Ensure CPU and memory requests for your Pods do not exceed their peak usage. With GKE reserving CPU and memory for overhead, Pods cannot request these resources. Pods that request more CPU or memory than they use prevent other Pods from requesting these resources, and might leave the cluster underutilized. For more information, see How Pods with resource requests are scheduled.
- Consider resizing your cluster. For instructions, see Resizing a cluster.
- Revert this change by downgrading your cluster. For instructions, see Manually upgrading a cluster or node pool.
- Configure your cluster to send Kubernetes scheduler metrics to Cloud Monitoring and view scheduler metrics.
Cluster's root Certificate Authority is expiring soon
Your cluster's root Certificate Authority is expiring soon. To prevent normal cluster operations from being interrupted, you must perform a credential rotation.
Seeing error "Instance 'Foo' does not contain 'instance-template' metadata"
You may see an error "Instance 'Foo' does not contain 'instance-template' metadata" as a status of a node pool that fails to upgrade, scale, or perform automatic node repair.
This message indicates that the metadata of VM instances, allocated by GKE, was corrupted. This typically happens when custom-authored automation or scripts attempt to add new instance metadata (like block-project-ssh-keys) and, instead of just adding or updating values, also delete existing metadata. You can read about VM instance metadata in Setting custom metadata.
If any of the critical metadata values (among others: instance-template, kube-labels, kubelet-config, kubeconfig, cluster-name, configure-sh, cluster-uid) were deleted, the node or entire node pool might enter an unstable state, because these values are crucial for GKE operations.
If the instance metadata was corrupted, the best way to recover the metadata is to re-create the node pool that contains the corrupted VM instances. You will need to add a node pool to your cluster and increase the node count on the new node pool, while cordoning and removing nodes on another. See the instructions to migrate workloads between node pools.
To find out who edited instance metadata and when, you can review Compute Engine audit logging information or find logs using Logs Explorer with a search query similar to this:
resource.type="gce_instance_group_manager"
protoPayload.methodName="v1.compute.instanceGroupManagers.setInstanceTemplate"
In the logs you may find the request originator IP address and user agent:
requestMetadata: {
  callerIp: "REDACTED"
  callerSuppliedUserAgent: "google-api-go-client/0.5 GoogleContainerEngine/v1"
}
Cloud KMS key is disabled.
The following error message occurs if GKE's default service account cannot access the Cloud KMS key.
Cluster problem detected (Kubernetes Engine Service Agent account unable to use CloudKMS key configured for Application Level encryption).
To resolve this issue, re-enable the disabled key.
For more information about secrets in GKE, see Encrypt secrets at the application layer.
Secrets Encryption Update Failed
If the operation to enable, disable, or update the Cloud KMS key fails, see Troubleshoot application-layer secrets encryption.