Kubernetes (K8s): An open-source container orchestration platform that automates deployment, scaling, and management of containerized applications.
Docker: A platform for building, shipping, and running applications in lightweight, portable containers.
CI/CD: Continuous Integration and Continuous Delivery, practices that automate building, testing, and deploying code changes.
Infrastructure as Code (IaC): Managing and provisioning infrastructure through machine-readable configuration files rather than manual processes. Tools: Terraform, Ansible.
GitOps: An operational framework where Git repositories serve as the single source of truth for infrastructure and application deployments.
SRE (Site Reliability Engineering): A discipline that applies software engineering principles to IT operations, focusing on reliability, scalability, and incident response.
SLA (Service Level Agreement): A contract defining the expected level of service, including uptime guarantees and response times.
Helm: A package manager for Kubernetes that simplifies deploying and managing applications using reusable charts.
ArgoCD: A declarative GitOps continuous delivery tool for Kubernetes that syncs application state from Git repositories.
Terraform: An IaC tool by HashiCorp for provisioning and managing cloud infrastructure across multiple providers.
RKE2: A Kubernetes distribution by Rancher focused on security and compliance, used for production-grade bare-metal deployments.
Prometheus: An open-source monitoring and alerting toolkit designed for reliability, widely used with Kubernetes.
Grafana: A visualization and analytics platform for monitoring metrics, logs, and traces from multiple data sources.
Observability: The ability to understand the internal state of a system through its external outputs: metrics, logs, and traces.
ETL / ELT: Extract, Transform, Load: processes for moving data from source systems into a data warehouse. ETL transforms data before loading it; ELT loads raw data first, then transforms it inside the warehouse.
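The uptime guarantee in an SLA implies a concrete downtime "budget" per period. A minimal sketch of that arithmetic (the function name and the 30-day period are illustrative assumptions, not part of any standard):

```python
def allowed_downtime_minutes(sla_percent: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Downtime budget (in minutes) implied by an uptime target over one period.

    Illustrative helper: assumes a 30-day month by default.
    """
    return period_minutes * (1 - sla_percent / 100)

# A 99.9% monthly uptime target leaves roughly 43.2 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))
```

This is why each extra "nine" in an SLA is so costly: 99.99% shrinks the same budget to about 4.3 minutes per month.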
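The ETL/ELT distinction is purely about ordering. A toy in-memory sketch (all names and data here are illustrative; a real pipeline would use a database or warehouse, not Python lists):

```python
# Toy illustration of ETL vs. ELT ordering; "warehouse" is just a list here.
raw_rows = [{"amount": "10.5"}, {"amount": "7.25"}]

def transform(rows):
    # Normalize string amounts into numeric types for analytics.
    return [{"amount": float(r["amount"])} for r in rows]

def etl(source, warehouse):
    # ETL: transform inside the pipeline, then load clean data.
    warehouse.extend(transform(source))

def elt(source, warehouse):
    # ELT: load raw data first, then transform it inside the warehouse.
    warehouse.extend(source)
    warehouse[:] = transform(warehouse)

dwh_a, dwh_b = [], []
etl(raw_rows, dwh_a)
elt(raw_rows, dwh_b)
assert dwh_a == dwh_b  # same end state, different place where the work happens
```

ELT has become common with modern warehouses because the transform step can run as SQL on the warehouse's own compute.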
Data Warehouse (DWH): A centralized repository for structured data, optimized for analytics and reporting queries.
Apache Kafka: A distributed event streaming platform for building real-time data pipelines and streaming applications.
Apache Airflow: A workflow orchestration platform for scheduling, monitoring, and managing complex data pipelines.
Feature Store: A centralized repository for storing, managing, and serving ML features consistently across training and inference.
LLM (Large Language Model): A deep learning model trained on large text datasets, capable of understanding and generating human language. Examples: LLaMA, Mistral, GPT.
MLOps: Practices for deploying, monitoring, and maintaining machine learning models in production reliably and efficiently.
GPU (Graphics Processing Unit): Specialized hardware for parallel computation, essential for training and running AI/ML models at scale.
Inference: The process of running a trained ML model to generate predictions or outputs from new input data.
Fine-tuning: Adapting a pre-trained model to a specific task or domain by training it further on specialized data.
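The point of a feature store is that training and inference read features through one shared path, so they cannot drift apart. A minimal in-memory sketch of that idea (this is an illustrative toy, not the API of any real feature store product):

```python
# Illustrative in-memory feature store: one read path for training and serving.
class FeatureStore:
    def __init__(self):
        # (entity_id, feature_name) -> value
        self._features = {}

    def put(self, entity_id, name, value):
        self._features[(entity_id, name)] = value

    def get(self, entity_id, names):
        # Both the training pipeline and the inference service call this
        # same method, so they see identical feature values.
        return [self._features[(entity_id, n)] for n in names]

store = FeatureStore()
store.put("user_1", "age", 30)
store.put("user_1", "country", "NL")
training_row = store.get("user_1", ["age", "country"])
serving_row = store.get("user_1", ["age", "country"])
assert training_row == serving_row
```

Real systems (e.g. Feast) add offline/online storage tiers and point-in-time correctness on top of this basic contract.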
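Inference in production typically means feeding new inputs through a trained model in batches. A hedged sketch of that loop, with a trivial function standing in for a real trained model (all names here are illustrative):

```python
from typing import Callable, Iterable, List

def batch_inference(model: Callable[[str], str],
                    inputs: Iterable[str],
                    batch_size: int = 2) -> List[str]:
    """Run a model over new inputs in small batches (illustrative sketch)."""
    items = list(inputs)
    outputs: List[str] = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        # In a real service this call would hit a GPU-backed model.
        outputs.extend(model(x) for x in batch)
    return outputs

def toy_model(text: str) -> str:
    # Stand-in for a trained model's prediction.
    return text.upper()

print(batch_inference(toy_model, ["hi", "ok", "go"]))
```

Batching matters in practice because GPUs amortize their fixed per-call overhead across many inputs at once.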