
DevOps interview questions

DevOps and SRE interview questions covering CI/CD, infrastructure as code, containerization, monitoring, incident management, and reliability.

12 questions
4 easy, 7 medium, 1 hard

1. Explain the difference between blue-green deployment and canary deployment.

medium
How to approach this: Blue-green: maintain two identical environments. Deploy to the inactive one, test, then switch traffic. Instant rollback by switching back. Canary: gradually roll out to a small percentage of users, monitor metrics, then increase traffic. Canary catches issues with real traffic patterns; blue-green is simpler but uses more resources.
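
The canary loop above can be sketched in a few lines. This is a minimal illustration, not a real traffic router: the function names, percentages, and the error-rate threshold are all invented for the example.

```python
import random

# Hypothetical canary router: send roughly `canary_percent` of requests to the
# new version, the rest to stable. Names and thresholds are illustrative.
def choose_version(canary_percent, rand=random.random):
    """Return 'canary' for about canary_percent% of calls, else 'stable'."""
    return "canary" if rand() * 100 < canary_percent else "stable"

def next_canary_percent(current, error_rate, threshold=0.01, step=20):
    """Widen the canary while metrics stay healthy; roll back to 0 otherwise."""
    if error_rate > threshold:
        return 0  # abort the rollout: all traffic back to stable
    return min(100, current + step)
```

For example, a healthy canary at 20% widens to 40%, while one breaching the error threshold drops straight back to 0%, which is the "instant rollback" property that makes canaries safe.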

2. How would you design a CI/CD pipeline for a microservices architecture?

medium
How to approach this: Each service gets its own pipeline: lint, unit test, build container image, push to registry, deploy to staging, integration test, deploy to production. Use a monorepo with path-based triggers or separate repos with contract tests. Implement feature flags for decoupling deploys from releases. Store infrastructure as code alongside application code.
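
The path-based-trigger idea can be sketched as a function that maps changed files to the services whose pipelines should run. The directory layout (`services/`, `libs/`, `infra/`) and service names are assumed for illustration only.

```python
# Illustrative monorepo trigger logic: decide which service pipelines to run
# from the list of changed file paths. Layout and names are hypothetical.
def services_to_build(changed_files, service_dirs=("auth", "billing", "search")):
    triggered = set()
    for path in changed_files:
        parts = path.split("/")
        if parts[0] == "services" and len(parts) > 1 and parts[1] in service_dirs:
            triggered.add(parts[1])  # change scoped to one service
        elif parts[0] in ("libs", "infra"):
            triggered.update(service_dirs)  # shared code: rebuild everything
    return sorted(triggered)
```

A change under one service's directory builds only that service, while a change to shared libraries or infrastructure code triggers every pipeline, which is the usual safe default.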

3. What is Infrastructure as Code (IaC), and why is it important?

easy
How to approach this: IaC defines infrastructure (servers, networks, databases) in version-controlled configuration files (Terraform, Pulumi, CloudFormation). Benefits: reproducibility (spin up identical environments), auditability (git history shows who changed what), speed (provision in minutes not days), and drift detection. Treat infrastructure changes like code changes: review, test, merge.
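
The drift-detection benefit mentioned above boils down to comparing declared state with live state. A minimal sketch, with invented resource shapes (real tools like `terraform plan` do this against provider APIs):

```python
# Minimal drift-detection sketch: compare the declared state (from version-
# controlled config) with the actual state reported by the provider.
# Resource names and attributes here are invented for illustration.
def detect_drift(declared, actual):
    """Return {resource: (declared_spec, actual_spec)} for every mismatch."""
    drift = {}
    for name, spec in declared.items():
        live = actual.get(name)
        if live != spec:
            drift[name] = (spec, live)  # changed out-of-band, or missing
    return drift
```

If someone resizes an instance by hand in the console, the declared `t3.small` no longer matches the live `t3.large`, and the next plan surfaces the drift instead of letting it silently accumulate.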

4. How do you handle secrets management in a production environment?

medium
How to approach this: Never store secrets in source code, committed config files, or container images. Use a dedicated secrets manager (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager). Rotate secrets automatically. Inject secrets at runtime via environment variables or mounted files. Audit access. For local development, use .env files that are gitignored.
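
The runtime-injection pattern can be sketched as a loader that checks a mounted secrets file first and falls back to an environment variable for local development. The mount path and secret names are illustrative assumptions.

```python
import os

# Sketch of runtime secret resolution: prefer a file mounted by the platform
# (how secrets managers typically inject values), fall back to an environment
# variable for local development. Path and names are hypothetical.
def load_secret(name, mount_dir="/run/secrets", env=os.environ):
    path = os.path.join(mount_dir, name)
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        pass  # no mounted secret; try the environment (local dev)
    value = env.get(name.upper())
    if value is None:
        raise RuntimeError(f"secret {name!r} not found; never hardcode a fallback")
    return value
```

Failing loudly when the secret is absent matters: a silent default is how placeholder credentials end up in production.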

5. Explain the concept of SLOs, SLIs, and SLAs. How do they relate?

medium
How to approach this: SLI (Service Level Indicator) is a measurable metric (latency p99, error rate, uptime). SLO (Service Level Objective) is the target for an SLI (p99 latency < 200ms, 99.9% uptime). SLA (Service Level Agreement) is a contractual commitment with consequences for missing the SLO. Set SLOs based on user expectations, and set SLAs slightly looser than SLOs to provide a buffer.
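
An SLO implies an error budget: the fraction of the period you are allowed to be out of objective. The arithmetic is worth knowing cold in an interview:

```python
# Error budget from an availability SLO: a 99.9% SLO leaves 0.1% of the
# period as allowable downtime. A 30-day month has 43,200 minutes.
def error_budget_minutes(slo_percent, days=30):
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)
```

So 99.9% over 30 days allows about 43.2 minutes of downtime, and each extra nine (99.99%) shrinks the budget tenfold, to about 4.3 minutes.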

6. How would you troubleshoot a container that keeps crashing in production?

medium
How to approach this: Check container logs (kubectl logs, docker logs). Look at exit codes (OOMKilled = memory, non-zero = application error). Check resource limits (CPU/memory constraints). Verify environment variables and secrets are correctly injected. Check health check configuration (is the probe timing out?). If needed, exec into a running instance or deploy a debug sidecar container.
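
The exit-code step has a simple rule behind it: codes above 128 mean the process was killed by a signal, where code = 128 + signal number. A small triage helper makes the mapping concrete (the wording of the messages is my own):

```python
# Container exit-code triage: codes > 128 mean death by signal
# (code = 128 + signal number). 137 = SIGKILL (on Kubernetes, usually the
# OOM killer); 143 = SIGTERM (normal shutdown request).
def explain_exit_code(code):
    signals = {9: "SIGKILL (likely OOMKilled - check memory limits)",
               15: "SIGTERM (graceful shutdown requested)"}
    if code == 0:
        return "clean exit"
    if code > 128:
        sig = code - 128
        return signals.get(sig, f"killed by signal {sig}")
    return f"application error (exit code {code})"
```

Seeing 137 in `kubectl describe pod` is the classic OOMKilled signature, which points you at memory limits rather than application logs.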

7. What is the difference between horizontal and vertical scaling? When would you choose each?

easy
How to approach this: Vertical scaling (bigger machine) is simpler but has a ceiling and requires downtime. Horizontal scaling (more machines) handles more traffic but requires stateless application design, load balancing, and distributed data management. Use vertical for databases (up to a point) and horizontal for stateless web/API servers. Most production systems use both.
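
For horizontal scaling, the capacity math is back-of-envelope: divide target load by per-instance capacity and round up, plus headroom so losing one instance doesn't cause an outage (the "n+1" rule). The numbers below are illustrative.

```python
import math

# Back-of-envelope sizing for stateless horizontal scaling: replicas needed
# to serve a target load, plus redundancy so one instance can fail ("n+1").
def replicas_needed(target_rps, rps_per_instance, redundancy=1):
    return math.ceil(target_rps / rps_per_instance) + redundancy
```

For example, 5,000 requests/second at 800 RPS per instance needs ceil(6.25) = 7 replicas, so run 8 to survive a single-instance failure.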

8. How do you implement effective monitoring and alerting?

medium
How to approach this: Collect the four golden signals: latency, traffic, errors, and saturation. Use structured logging, metrics (Prometheus/Datadog), and distributed tracing (Jaeger/OpenTelemetry). Alert on symptoms (high error rate) not causes (high CPU). Use runbooks linked from alerts. Page only for user-facing impact. Use dashboards for investigation, not alerting.
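
"Alert on symptoms" can be made concrete with a sliding-window error-rate check: track recent request outcomes and fire when the rate a user would feel crosses a threshold. The window size and threshold here are arbitrary illustrations (real systems express this as a metrics query, e.g. a Prometheus alerting rule).

```python
from collections import deque

# Sketch of symptom-based alerting: record request outcomes in a sliding
# window and fire when the error rate exceeds a threshold. Window and
# threshold values are illustrative.
class ErrorRateAlert:
    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold

    def record(self, ok):
        self.outcomes.append(ok)

    def firing(self):
        if not self.outcomes:
            return False
        errors = sum(1 for ok in self.outcomes if not ok)
        return errors / len(self.outcomes) > self.threshold
```

This alerts on what users experience (failed requests) regardless of cause, whereas a CPU alert fires on many conditions users never notice.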

9. Describe how you would run a blameless post-mortem.

easy
How to approach this: Within 48 hours of the incident: gather a timeline of events, identify contributing factors (not root cause, because there are usually multiple), document what worked well and what did not, and produce actionable follow-up items with owners and deadlines. Focus on system improvements, not individual mistakes. Publish the document widely to spread learnings.

10. How does a load balancer work, and what algorithms can it use?

easy
How to approach this: A load balancer distributes incoming traffic across multiple backend servers. Algorithms: round-robin (equal distribution), weighted round-robin (more traffic to stronger servers), least connections (send to the least busy server), IP hash (sticky sessions by client IP). Layer 4 (TCP) is faster; Layer 7 (HTTP) can route based on URL path, headers, or cookies.
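
Two of the algorithms above fit in a few lines each; the server names and connection counts are made up for the example.

```python
import itertools

# Round-robin: hand out servers in a repeating cycle.
def round_robin(servers):
    return itertools.cycle(servers)

# Least connections: pick the server with the fewest active connections,
# given a mapping of server -> current connection count.
def least_connections(active):
    return min(active, key=active.get)
```

Round-robin assumes all requests cost about the same; least connections adapts when some requests are long-lived (e.g. streaming or slow queries), which is why it is a common default for Layer 4 balancers.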

11. What is GitOps, and how does it differ from traditional CI/CD?

medium
How to approach this: GitOps uses a Git repository as the single source of truth for infrastructure and application state. A controller (ArgoCD, Flux) watches the repo and reconciles the actual state with the desired state. Unlike traditional CI/CD where a pipeline pushes changes, GitOps pulls desired state. Benefits: auditability, easy rollback (git revert), and declarative configuration.
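
The reconcile step at the heart of a GitOps controller can be sketched as a diff between desired state (from the repo) and actual state (from the cluster). The state shapes and resource names are invented for illustration; real controllers like ArgoCD operate on Kubernetes manifests.

```python
# Minimal sketch of a GitOps reconcile step: diff desired state against
# actual state and emit the actions that would converge them.
# Resource names and specs are hypothetical.
def reconcile(desired, actual):
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))  # prune orphaned resources
    return actions
```

Because the loop always converges toward what is in Git, a rollback is just `git revert`: the controller sees the old desired state and emits the updates to restore it.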

12. How would you handle a major production incident that is affecting all users?

hard
How to approach this: Immediate: communicate to stakeholders ('we are aware and investigating'). Triage: identify the blast radius and impact. Mitigate: roll back, toggle feature flags, scale up, or redirect traffic. Fix: once mitigated, investigate root cause. Communicate: provide regular status updates. After: run a post-mortem, implement preventive measures, and test them.
