
DevOps interview questions

DevOps and SRE interview questions covering CI/CD, infrastructure as code, containerization, monitoring, incident management, and reliability.

12 questions
4 easy, 7 medium, 1 hard

1. Explain the difference between blue-green deployment and canary deployment.

medium
How to approach this: Blue-green: maintain two identical environments. Deploy to the inactive one, test, then switch traffic. Instant rollback by switching back. Canary: gradually roll out to a small percentage of users, monitor metrics, then increase traffic. Canary catches issues with real traffic patterns; blue-green is simpler but uses more resources.
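
The canary loop above can be sketched in a few lines. This is a minimal illustration, not a real traffic router: the function names, percentages, and the error-rate threshold are all invented for the example.

```python
import random

# Hypothetical canary router: send roughly `canary_percent` of requests to the
# new version, the rest to stable. Names and thresholds are illustrative.
def choose_version(canary_percent, rand=random.random):
    """Return 'canary' for about canary_percent% of calls, else 'stable'."""
    return "canary" if rand() * 100 < canary_percent else "stable"

def next_canary_percent(current, error_rate, threshold=0.01, step=20):
    """Widen the canary while metrics stay healthy; roll back to 0 otherwise."""
    if error_rate > threshold:
        return 0  # abort the rollout: all traffic back to stable
    return min(100, current + step)
```

For example, a healthy canary at 20% widens to 40%, while one breaching the error threshold drops straight back to 0%, which is the "instant rollback" property that makes canaries safe.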

2. How would you design a CI/CD pipeline for a microservices architecture?

medium
How to approach this: Each service gets its own pipeline: lint, unit test, build container image, push to registry, deploy to staging, integration test, deploy to production. Use a monorepo with path-based triggers or separate repos with contract tests. Implement feature flags for decoupling deploys from releases. Store infrastructure as code alongside application code.
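
The path-based-trigger idea can be sketched as a function that maps changed files to the services whose pipelines should run. The directory layout (`services/`, `libs/`, `infra/`) and service names are assumed for illustration only.

```python
# Illustrative monorepo trigger logic: decide which service pipelines to run
# from the list of changed file paths. Layout and names are hypothetical.
def services_to_build(changed_files, service_dirs=("auth", "billing", "search")):
    triggered = set()
    for path in changed_files:
        parts = path.split("/")
        if parts[0] == "services" and len(parts) > 1 and parts[1] in service_dirs:
            triggered.add(parts[1])  # change scoped to one service
        elif parts[0] in ("libs", "infra"):
            triggered.update(service_dirs)  # shared code: rebuild everything
    return sorted(triggered)
```

A change under one service's directory builds only that service, while a change to shared libraries or infrastructure code triggers every pipeline, which is the usual safe default.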

3. What is Infrastructure as Code (IaC), and why is it important?

easy
How to approach this: IaC defines infrastructure (servers, networks, databases) in version-controlled configuration files (Terraform, Pulumi, CloudFormation). Benefits: reproducibility (spin up identical environments), auditability (git history shows who changed what), speed (provision in minutes not days), and drift detection. Treat infrastructure changes like code changes: review, test, merge.
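
The drift-detection benefit mentioned above boils down to comparing declared state with live state. A minimal sketch, with invented resource shapes (real tools like `terraform plan` do this against provider APIs):

```python
# Minimal drift-detection sketch: compare the declared state (from version-
# controlled config) with the actual state reported by the provider.
# Resource names and attributes here are invented for illustration.
def detect_drift(declared, actual):
    """Return {resource: (declared_spec, actual_spec)} for every mismatch."""
    drift = {}
    for name, spec in declared.items():
        live = actual.get(name)
        if live != spec:
            drift[name] = (spec, live)  # changed out-of-band, or missing
    return drift
```

If someone resizes an instance by hand in the console, the declared `t3.small` no longer matches the live `t3.large`, and the next plan surfaces the drift instead of letting it silently accumulate.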

4. How do you handle secrets management in a production environment?

medium
How to approach this: Never store secrets in source code, committed config files, or container images. Use a dedicated secrets manager (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager). Rotate secrets automatically. Inject secrets at runtime via environment variables or mounted files. Audit access. For local development, use .env files that are gitignored.
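
The runtime-injection pattern can be sketched as a loader that checks a mounted secrets file first and falls back to an environment variable for local development. The mount path and secret names are illustrative assumptions.

```python
import os

# Sketch of runtime secret resolution: prefer a file mounted by the platform
# (how secrets managers typically inject values), fall back to an environment
# variable for local development. Path and names are hypothetical.
def load_secret(name, mount_dir="/run/secrets", env=os.environ):
    path = os.path.join(mount_dir, name)
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        pass  # no mounted secret; try the environment (local dev)
    value = env.get(name.upper())
    if value is None:
        raise RuntimeError(f"secret {name!r} not found; never hardcode a fallback")
    return value
```

Failing loudly when the secret is absent matters: a silent default is how placeholder credentials end up in production.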

5. Explain the concept of SLOs, SLIs, and SLAs. How do they relate?

medium
How to approach this: SLI (Service Level Indicator) is a measurable metric (latency p99, error rate, uptime). SLO (Service Level Objective) is the target for an SLI (p99 latency < 200ms, 99.9% uptime). SLA (Service Level Agreement) is a contractual commitment with consequences for missing the SLO. Set SLOs based on user expectations, and set SLAs slightly looser than SLOs to provide a buffer.
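
An SLO implies an error budget: the fraction of the period you are allowed to be out of objective. The arithmetic is worth knowing cold in an interview:

```python
# Error budget from an availability SLO: a 99.9% SLO leaves 0.1% of the
# period as allowable downtime. A 30-day month has 43,200 minutes.
def error_budget_minutes(slo_percent, days=30):
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)
```

So 99.9% over 30 days allows about 43.2 minutes of downtime, and each extra nine (99.99%) shrinks the budget tenfold, to about 4.3 minutes.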

6. How would you troubleshoot a container that keeps crashing in production?

medium
How to approach this: Check container logs (kubectl logs, docker logs). Look at exit codes (OOMKilled = memory, non-zero = application error). Check resource limits (CPU/memory constraints). Verify environment variables and secrets are correctly injected. Check health check configuration (is the probe timing out?). If needed, exec into a running instance or deploy a debug sidecar container.
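
The exit-code step has a simple rule behind it: codes above 128 mean the process was killed by a signal, where code = 128 + signal number. A small triage helper makes the mapping concrete (the wording of the messages is my own):

```python
# Container exit-code triage: codes > 128 mean death by signal
# (code = 128 + signal number). 137 = SIGKILL (on Kubernetes, usually the
# OOM killer); 143 = SIGTERM (normal shutdown request).
def explain_exit_code(code):
    signals = {9: "SIGKILL (likely OOMKilled - check memory limits)",
               15: "SIGTERM (graceful shutdown requested)"}
    if code == 0:
        return "clean exit"
    if code > 128:
        sig = code - 128
        return signals.get(sig, f"killed by signal {sig}")
    return f"application error (exit code {code})"
```

Seeing 137 in `kubectl describe pod` is the classic OOMKilled signature, which points you at memory limits rather than application logs.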

7. What is the difference between horizontal and vertical scaling? When would you choose each?

easy
How to approach this: Vertical scaling (bigger machine) is simpler but has a ceiling and requires downtime. Horizontal scaling (more machines) handles more traffic but requires stateless application design, load balancing, and distributed data management. Use vertical for databases (up to a point) and horizontal for stateless web/API servers. Most production systems use both.
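
For horizontal scaling, the capacity math is back-of-envelope: divide target load by per-instance capacity and round up, plus headroom so losing one instance doesn't cause an outage (the "n+1" rule). The numbers below are illustrative.

```python
import math

# Back-of-envelope sizing for stateless horizontal scaling: replicas needed
# to serve a target load, plus redundancy so one instance can fail ("n+1").
def replicas_needed(target_rps, rps_per_instance, redundancy=1):
    return math.ceil(target_rps / rps_per_instance) + redundancy
```

For example, 5,000 requests/second at 800 RPS per instance needs ceil(6.25) = 7 replicas, so run 8 to survive a single-instance failure.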

8. How do you implement effective monitoring and alerting?

medium
How to approach this: Collect the four golden signals: latency, traffic, errors, and saturation. Use structured logging, metrics (Prometheus/Datadog), and distributed tracing (Jaeger/OpenTelemetry). Alert on symptoms (high error rate) not causes (high CPU). Use runbooks linked from alerts. Page only for user-facing impact. Use dashboards for investigation, not alerting.
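
"Alert on symptoms" can be made concrete with a sliding-window error-rate check: track recent request outcomes and fire when the rate a user would feel crosses a threshold. The window size and threshold here are arbitrary illustrations (real systems express this as a metrics query, e.g. a Prometheus alerting rule).

```python
from collections import deque

# Sketch of symptom-based alerting: record request outcomes in a sliding
# window and fire when the error rate exceeds a threshold. Window and
# threshold values are illustrative.
class ErrorRateAlert:
    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold

    def record(self, ok):
        self.outcomes.append(ok)

    def firing(self):
        if not self.outcomes:
            return False
        errors = sum(1 for ok in self.outcomes if not ok)
        return errors / len(self.outcomes) > self.threshold
```

This alerts on what users experience (failed requests) regardless of cause, whereas a CPU alert fires on many conditions users never notice.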

9. Describe how you would run a blameless post-mortem.

easy
How to approach this: Within 48 hours of the incident: gather a timeline of events, identify contributing factors (not root cause, because there are usually multiple), document what worked well and what did not, and produce actionable follow-up items with owners and deadlines. Focus on system improvements, not individual mistakes. Publish the document widely to spread learnings.

10. How does a load balancer work, and what algorithms can it use?

easy
How to approach this: A load balancer distributes incoming traffic across multiple backend servers. Algorithms: round-robin (equal distribution), weighted round-robin (more traffic to stronger servers), least connections (send to the least busy server), IP hash (sticky sessions by client IP). Layer 4 (TCP) is faster; Layer 7 (HTTP) can route based on URL path, headers, or cookies.
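
Two of the algorithms above fit in a few lines each; the server names and connection counts are made up for the example.

```python
import itertools

# Round-robin: hand out servers in a repeating cycle.
def round_robin(servers):
    return itertools.cycle(servers)

# Least connections: pick the server with the fewest active connections,
# given a mapping of server -> current connection count.
def least_connections(active):
    return min(active, key=active.get)
```

Round-robin assumes all requests cost about the same; least connections adapts when some requests are long-lived (e.g. streaming or slow queries), which is why it is a common default for Layer 4 balancers.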

11. What is GitOps, and how does it differ from traditional CI/CD?

medium
How to approach this: GitOps uses a Git repository as the single source of truth for infrastructure and application state. A controller (ArgoCD, Flux) watches the repo and reconciles the actual state with the desired state. Unlike traditional CI/CD where a pipeline pushes changes, GitOps pulls desired state. Benefits: auditability, easy rollback (git revert), and declarative configuration.
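
The reconcile step at the heart of a GitOps controller can be sketched as a diff between desired state (from the repo) and actual state (from the cluster). The state shapes and resource names are invented for illustration; real controllers like ArgoCD operate on Kubernetes manifests.

```python
# Minimal sketch of a GitOps reconcile step: diff desired state against
# actual state and emit the actions that would converge them.
# Resource names and specs are hypothetical.
def reconcile(desired, actual):
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))  # prune orphaned resources
    return actions
```

Because the loop always converges toward what is in Git, a rollback is just `git revert`: the controller sees the old desired state and emits the updates to restore it.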

12. How would you handle a major production incident that is affecting all users?

hard
How to approach this: Immediate: communicate to stakeholders ('we are aware and investigating'). Triage: identify the blast radius and impact. Mitigate: roll back, toggle feature flags, scale up, or redirect traffic. Fix: once mitigated, investigate root cause. Communicate: provide regular status updates. After: run a post-mortem, implement preventive measures, and test them.
