Kubernetes
Misc
- An open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It schedules containerized workloads across a cluster of machines and scales them with demand, improving availability, scalability, and resource efficiency.
- Docs
- Resources
- Kubernetes Roadmap - Breakdown of the different areas needed to master k8s, with descriptions and links to learning resources.
- Video: Kubernetes Course - Full Beginners Tutorial - 3 hr tutorial by freeCodeCamp
- Microsoft’s Kubernetes Learning Path course
- Tools
- K9s - A powerful CLI tool that makes managing your Kubernetes clusters easy
- Packer - Tool for building identical machine images for multiple platforms. These can – for example – then be pushed to Azure Container Registry for use in Azure Kubernetes Service, or used on VMs in Azure.
- Ansible - An open source IT automation engine that automates provisioning, configuration management, application deployment, orchestration, and many other IT processes.
- Project Components
- Your code (python/R/julia/matlab)
- A Dockerfile that packages up your code
- A configuration file (deployment.yaml, job.yaml) (sometimes someone else will do this for you)
- Write a process script for each of these steps:
- Run all the tests
- Build the docker image
- Check the image
- Push the image to the registry
- Deploy the new version of your code to the production cluster
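A minimal sketch of such a process script, assuming bash and hypothetical names (the registry, image, and deployment names below are made up, and the R test command assumes {testthat}). The DRY_RUN guard, which defaults to on here, prints each command instead of executing it, so the sequence can be inspected without Docker or a cluster:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names -- adjust to your registry, image, and deployment.
IMAGE="myregistry.azurecr.io/my-model"
TAG="$(git rev-parse --short HEAD 2>/dev/null || echo dev)"

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

main() {
  run Rscript -e 'testthat::test_dir("tests")'          # 1. run all the tests
  run docker build -t "$IMAGE:$TAG" .                   # 2. build the docker image
  run docker run --rm "$IMAGE:$TAG" --version           # 3. check the image
  run docker push "$IMAGE:$TAG"                         # 4. push the image to the registry
  run kubectl set image "deployment/my-model" my-model="$IMAGE:$TAG"  # 5. deploy
}

main "$@"
```

Running it with DRY_RUN unset or set to 1 prints the plan; setting DRY_RUN=0 executes the steps for real.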
- General Process
- Build your machine learning code.
- Package it up into a docker container and push that image into a registry.
- Using the configuration file, you tell the Kubernetes cluster where to find that image and how to build your service out of it (“make two copies and give them lots of ram”).
- Kubernetes now has new instructions, and it continuously makes sure the cluster state matches them.
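As an illustration, a minimal deployment.yaml along the lines of “make two copies and give them lots of ram” might look like this (the name, image, and resource numbers are all made up):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model                # hypothetical name
spec:
  replicas: 2                   # "make two copies"
  selector:
    matchLabels:
      app: my-model
  template:
    metadata:
      labels:
        app: my-model
    spec:
      containers:
        - name: my-model
          image: myregistry.azurecr.io/my-model:1.0.0  # where to find the image
          resources:
            requests:
              memory: "4Gi"     # "give them lots of ram"
              cpu: "1"
            limits:
              memory: "8Gi"
```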
- Data Scientist responsibilities
- Make sure your container actually runs; test that extensively!
- Write R tests that check if you can handle the expected inputs.
- Check that you are logging errors when they occur.
- Check how you handle unexpected inputs (a person with an age of 200, a car with no weight, etc)
- Test the container: pass it expected input, pass it unexpected input, and check that it fails and protests loudly when required environment variables are not found.
- Decide what your API should look like:
- When you use {plumber}: what endpoint will be called, which port should be exposed, and what will the data look like? Make sure you write some tests for that! When you use {shiny}: what port does it live on, and what are the memory and CPU requirements of your application? In all cases: what secrets must be supplied to the container?
- Decide where logs should go and what they should look like. My favorite R logging package is {logger}, and it supports many forms of logging. If something goes wrong, you want the logs to tell you what happened and where you should investigate.
- Use {renv} to install specific versions of R packages, and to record them in a lockfile.
- Again: make sure your container actually runs; test that extensively!
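A sketch of how the Dockerfile piece might fit together for a {plumber} API with an {renv} lockfile. The base image tag, file names, port, and the API_KEY secret are all assumptions for illustration, not from these notes:

```dockerfile
# Pin an R version via the rocker image (tag is illustrative)
FROM rocker/r-ver:4.3.2

# Restore the exact package versions recorded in the {renv} lockfile
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"

# Copy the API code; plumber will listen on this port
COPY plumber.R plumber.R
EXPOSE 8000

# Protest loudly at startup if a required secret is missing (API_KEY is hypothetical)
CMD ["R", "-e", "stopifnot(nzchar(Sys.getenv('API_KEY'))); plumber::plumb('plumber.R')$run(host='0.0.0.0', port=8000)"]
```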
- Cost Benefits
- Autoscaling: Kubernetes includes open-source autoscaling tools (like the Horizontal Pod Autoscaler and the Cluster Autoscaler) that dynamically adjust the number of running containers based on demand, preventing companies from overprovisioning and paying for resources that are not being used.
- Improved developer efficiency: Kubernetes streamlines deployments, rollbacks, and scaling, letting developers focus on building apps rather than managing cloud infrastructure.
- Horizontal scaling: Workloads can be distributed across multiple nodes, optimizing resource usage and reducing costs.
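For example, a HorizontalPodAutoscaler can be declared against a Deployment so the replica count tracks CPU usage (the names and thresholds here are made up):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-model-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-model            # the Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods above 70% average CPU
```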
Terms
- etcd - Consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data. Docs
- If your Kubernetes cluster uses etcd as its backing store, make sure you have a back up plan for those data.
- Time-to-Live (TTL) - A Kubernetes mechanism for cleaning up finished resources: the TTL-after-finished controller (stable since Kubernetes v1.23) “provides a TTL (time to live) mechanism to limit the lifetime of resource objects that have finished execution. TTL controller only handles Jobs.” Docs
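For instance, setting ttlSecondsAfterFinished on a Job asks the controller to delete the Job (and its pods) after it completes; the name, image, and value below are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-scoring            # hypothetical name
spec:
  ttlSecondsAfterFinished: 300     # delete the Job 5 minutes after it finishes
  template:
    spec:
      containers:
        - name: scorer
          image: myregistry.azurecr.io/my-model:1.0.0
      restartPolicy: Never
```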
Debugging
Notes from Beyond the AKS Basics: Practical Tips for Your Kubernetes Journey
Resources
Experiences
- “The reverse proxy (more specifically an Azure Application Gateway), sitting innocently in front of our VMs and the rest of our system had a default two-minute connection timeout. If allocating a job to a node took longer than that, the proxy would prematurely close the connection, resulting in the dreaded 504 error.”
- Underscores the importance of considering the entire system architecture when debugging and not just Kubernetes — e.g. load balancers, proxies, firewalls, and any other piece of infrastructure that might be interacting with your cluster.
Spin up a temporary container based on your image and get a shell inside
kubectl run -it testme --image=${AZURE_CONTAINER_REPOSITORY_NAME}.azurecr.io/${IMAGE_NAME} -- /bin/bash
- If a pod refuses to start, this can help you figure out what’s happening
Open shell inside a node
kubectl debug node/<your_node_name> -it --image=ubuntu
View logs filtered by service unit (your application), time range, and specific keywords
sudo journalctl -u $SERVICE_UNIT_NAME --since "$TIME_RANGE" -g "$SEARCH_TERM"
- Many “always-on” Linux-based applications run as systemd services, which lets you inspect them with systemctl and journalctl