
Job Readiness

Resume with engineering impact, GitHub portfolio, interview story building, debugging interviews, and system design basics

Resume with Engineering Impact

The biggest mistake engineers make on resumes: listing tools instead of impact.

Wrong vs Right

WRONG:
• Used Docker, Kubernetes, Terraform, Jenkins, AWS

RIGHT:
• Reduced deployment time from 45 minutes to 8 minutes by migrating from
  manual deployments to a containerized CI/CD pipeline (Docker, GitHub Actions, EKS)

• Decreased infrastructure costs by 35% ($12k/month) by right-sizing EC2 instances
  using CloudWatch metrics and implementing auto-scaling policies

• Improved mean time to recovery from 2 hours to 18 minutes by building
  centralized observability with Prometheus, Grafana, and structured logging

The Formula

[Action verb] + [what you did] + [how you did it] + [measurable result]

"Automated database backup verification, eliminating 3 hours/week of manual work
and catching 2 silent backup failures before they became incidents"

Impact numbers to track:

Deployment frequency before/after
MTTR before/after
Cost savings ($)
Time saved (hours/week)
Reliability improvement (99.x% uptime)
Performance improvement (latency reduction, throughput increase)
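A small Python helper can turn the before/after numbers above into percentages you can quote. This is a minimal sketch. `percent_change` and `describe` are hypothetical helper names, and the inputs are the example figures from the bullets earlier in this section, not real measurements.

```python
def percent_change(before: float, after: float) -> float:
    """Signed percent change from `before` to `after` (negative means a decrease)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before * 100


def describe(metric: str, before: float, after: float, unit: str) -> str:
    """Render a before/after measurement as a resume-style phrase."""
    change = percent_change(before, after)
    verb, label = ("Reduced", "reduction") if change < 0 else ("Increased", "increase")
    return (f"{verb} {metric} from {before:g} {unit} to {after:g} {unit} "
            f"({abs(change):.0f}% {label})")


if __name__ == "__main__":
    # Example figures from the bullets above (MTTR of 2 hours converted to 120 minutes).
    print(describe("deployment time", 45, 8, "minutes"))
    print(describe("MTTR", 120, 18, "minutes"))
```

For the example bullets, the output is an 82% reduction in deployment time and an 85% reduction in MTTR. Use the percentage alongside the raw numbers, not instead of them.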

Structure

[Name] | [LinkedIn] | [GitHub] | [Location]

SUMMARY (2-3 lines)
DevOps/Platform engineer with X years of experience building CI/CD pipelines
and cloud infrastructure. Focus on reliability, automation, and developer experience.

EXPERIENCE
  Company Name — Job Title (dates)
  • Impact bullet 1
  • Impact bullet 2
  • Impact bullet 3

SKILLS
  Cloud: AWS (EC2, EKS, RDS, S3, IAM), Azure
  Containers: Docker, Kubernetes, Helm
  IaC: Terraform, CloudFormation
  CI/CD: GitHub Actions, Azure Pipelines, Argo CD
  Observability: Prometheus, Grafana, OpenTelemetry
  Languages: Bash, Python, Go (basic)

GitHub as Portfolio

Your GitHub is your portfolio. Recruiters and hiring managers often check it before or during the interview process.

What Makes a Good Portfolio

Real projects with READMEs — not tutorial repos
Consistent commits, which show ongoing hands-on work rather than only coursework
DevOps-specific repos (examples below)
Pinned repos — pin your 6 best repos

Project Ideas for DevOps Portfolio

1. Home Lab Infra (Terraform + Kubernetes)
   - Terraform modules for AWS infrastructure
   - K8s cluster with Helm charts for your apps
   - CI/CD pipeline to deploy on git push
   - README explaining architecture

2. Automated Blog/Portfolio CI/CD
   - Your own blog (this one!) deployed with a pipeline
   - Show: GitHub Actions, Docker, deployment

3. Observability Stack
   - docker-compose with Prometheus + Grafana + sample app
   - Pre-built dashboards with useful alerts

4. Incident Response Toolkit
   - Bash/Python scripts for common diagnostics
   - Document your runbook template

GitHub Profile README

Create a public repository named exactly [yourusername] (so its path is [yourusername]/[yourusername]) and add a README.md to it. The README then appears on your profile page.

# Hi, I'm [Name]

DevOps engineer focused on reliability and automation.

## What I'm working on
- Building a multi-region Kubernetes setup with GitOps (Argo CD)
- Learning SRE practices through hands-on labs

## Tech Stack
Docker · Kubernetes · Terraform · AWS · GitHub Actions · Prometheus

## Recent Projects
- [infra-homelab](link) — Full AWS infrastructure with Terraform
- [devops-notes](link) — My learning notes (this site!)

Interview Story Building

Interviewers ask behavioral questions such as “Tell me about a time when…”. Structure your answers with the STAR method.

STAR Method

Component	Question it answers
Situation	What was the context?
Task	What were you responsible for?
Action	What did you specifically do?
Result	What was the measurable outcome?

Build Your Story Bank

Prepare stories for these common categories:

Incident/Problem Solving:

“Tell me about a time you resolved a production incident.”

S: Our checkout service was throwing 500 errors at 2am, affecting 20% of users.
T: I was on-call and responsible for restoring service.
A: I checked deployment history (a deploy had gone out 30 minutes earlier), reviewed the error logs
   (null pointer error when loading the payment config), traced it to a missing environment
   variable, and rolled back the deployment.
R: Service restored in 18 minutes. I wrote the postmortem and added environment variable validation to CI/CD.

Automation:

“Tell me about a time you improved a process.”

Collaboration/Communication:

“Tell me about a time you had a conflict with a teammate.”

Failure/Learning:

“Tell me about a mistake you made and what you learned.”

Impact:

“Tell me about your biggest technical accomplishment.”

Debugging Interviews

Technical interviews for DevOps often include debugging scenarios. These are less about knowing the answer and more about demonstrating your thought process.

The Framework

1. CLARIFY — "Before I dive in, can I ask a few questions?"
   - What does the user/customer actually experience?
   - When did this start? Any recent changes?
   - What environment (dev/staging/prod)?

2. GATHER DATA — don't guess, look at evidence
   - Metrics (is there an alert? what does the graph show?)
   - Logs (recent errors in logs?)
   - Recent changes (deployment, config change, infra change?)

3. HYPOTHESIZE — form a hypothesis before testing
   - "Based on this error message, I think it could be X because..."
   - "I'd first check Y because it's the most likely cause"

4. TEST ONE THING AT A TIME — systematic, not random
   - If you change two things at once, you don't know which fixed it

5. COMMUNICATE — narrate what you're doing and why
   - "I'm checking the service logs first because the error suggests..."

Common Interview Scenarios

“A service is returning 503 errors — what do you do?”

1. Check if the service is running (kubectl get pods, ps aux)
2. Check if the port is listening (ss -tlnp | grep :8080)
3. Check the service logs (kubectl logs, journalctl)
4. Check upstream dependencies (database, cache, downstream services)
5. Check load balancer health checks
6. Check recent deployments
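The first checks in this list can be scripted: is anything accepting connections, and what does the health endpoint return? This is a minimal Python sketch using only the standard library. It assumes the service exposes an HTTP health endpoint. The address `127.0.0.1:8080` and the `/healthz` path are assumptions, and not every service provides such an endpoint.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable
        return False


def health_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status of a health check, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 from the app or from a proxy in front of it
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    host, port = "127.0.0.1", 8080  # assumed service address
    if not port_is_listening(host, port):
        print("Nothing listening: check the process/pod and its logs")
    else:
        status = health_status(f"http://{host}:{port}/healthz")  # assumed path
        print(f"Health endpoint returned {status}: check upstream dependencies next")
```

A refused connection points back at step 1 (the process or pod is down). If the port is listening but the health endpoint returns 503, move on to steps 3 and 4: the app is up, but something it depends on is not.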

“Disk is full on a server — what do you do?”

1. df -h — which filesystem?
2. df -i — could be inode exhaustion
3. Find large directories: du -xsh /* 2>/dev/null | sort -h
4. Find large files: find / -xdev -type f -size +500M 2>/dev/null
5. Common culprits: /var/log, Docker images, core dumps, tmp files
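6. Quick fix: truncate large logs (deleting a log that a process still holds open does not free the space), prune unused Docker data (docker system prune), clear tmp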
7. Long-term: log rotation, monitoring alert, increase disk
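Steps 1, 2 and 4 map onto standard-library calls, which helps when you need a quick diagnostic script on the box. This is a minimal Python sketch. It assumes a POSIX system, because `os.statvfs` is not available on Windows. `largest_files` is a hypothetical helper, not a standard tool.

```python
import heapq
import os
import shutil
import stat


def space_usage_pct(path: str) -> float:
    """Percent of bytes used on the filesystem containing `path` (like df -h)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def inode_usage_pct(path: str) -> float:
    """Percent of inodes used on the filesystem containing `path` (like df -i)."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems do not report inode counts
        return 0.0
    return (st.f_files - st.f_ffree) / st.f_files * 100


def _same_dev(path: str, dev: int) -> bool:
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def largest_files(root: str, n: int = 10) -> list[tuple[int, str]]:
    """Return the n largest regular files under root as (size, path),
    staying on one filesystem like `find -xdev`."""
    root_dev = os.stat(root).st_dev
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories on other filesystems (/proc, mounted volumes, ...).
        dirnames[:] = [d for d in dirnames
                       if _same_dev(os.path.join(dirpath, d), root_dev)]
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file vanished or permission denied
            if stat.S_ISREG(st.st_mode):
                sizes.append((st.st_size, full))
    return heapq.nlargest(n, sizes)


if __name__ == "__main__":
    print(f"/ bytes used:  {space_usage_pct('/'):.1f}%")
    print(f"/ inodes used: {inode_usage_pct('/'):.1f}%")
    for size, path in largest_files("/var/log", n=5):
        print(f"{size / 1e6:10.1f} MB  {path}")
```

If bytes look fine but inode usage is near 100%, look for directories containing millions of tiny files rather than a few large ones.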

System Design Basics (Entry Level)

Expect basic system design questions even in junior and mid-level DevOps interviews. Know the fundamentals.

The Framework

1. Clarify requirements
   - Scale: how many users? requests/sec? data size?
   - Availability requirements? (99.9%? 99.99%?)
   - Read-heavy or write-heavy?

2. Estimate scale
   - 1M users, 10% daily active, 10 requests each = 1M req/day = ~12 req/sec
   - 1TB of data, growing 1% daily

3. High-level design
   - Draw the components (LB, app servers, DB, cache, CDN)
   - Show data flow

4. Deep dive
   - Interviewer will pick a component to go deeper on

5. Identify failure points and mitigations
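The estimates in step 2 are plain arithmetic, and the availability targets in step 1 translate directly into a downtime budget. This is a small Python sketch of both. The function names are illustrative, the inputs are the example assumptions from the list, and a 365-day year is assumed.

```python
SECONDS_PER_DAY = 86_400


def requests_per_second(users: int, daily_active_ratio: float,
                        requests_per_user: int) -> float:
    """Average request rate; peak traffic is often several times higher."""
    return users * daily_active_ratio * requests_per_user / SECONDS_PER_DAY


def data_after(days: int, start_tb: float, daily_growth: float) -> float:
    """Data size in TB after `days` of compound daily growth."""
    return start_tb * (1 + daily_growth) ** days


def downtime_budget_minutes(availability_pct: float, days: int = 365) -> float:
    """Allowed downtime in minutes for an availability target over `days`."""
    return (1 - availability_pct / 100) * days * 24 * 60


if __name__ == "__main__":
    print(f"{requests_per_second(1_000_000, 0.10, 10):.1f} req/sec")  # 11.6
    print(f"{data_after(365, 1.0, 0.01):.1f} TB after a year")        # 37.8
    for target in (99.9, 99.99):
        print(f"{target}% -> {downtime_budget_minutes(target):.0f} min/year")
```

Note how quickly "1% daily" compounds: about 38 TB after a year, which affects the storage design more than the modest request rate does. The downtime budget works out to roughly 8.8 hours a year at 99.9% and about 53 minutes a year at 99.99%.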

Building Blocks

Load Balancer → App Servers → Cache → Database
                     ↓
               Message Queue → Worker Servers
                     ↓
                Object Storage (S3)

Component	When to use
Load balancer	Multiple app servers, horizontal scaling
Cache (Redis)	Repeated reads, session storage, rate limiting
CDN	Static assets, geographically distributed users
Message queue	Async tasks, decouple producers/consumers
Database read replica	Read-heavy workloads
Database sharding	Write-heavy, very large datasets
Object storage	Images, videos, backups, logs
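The message-queue row is easiest to see in code. This is a minimal in-process sketch that uses Python's `queue.Queue` as a stand-in for a real broker; it does not model delivery guarantees, retries, or persistence. The producer enqueues jobs and moves on without waiting, and workers drain the queue at their own pace. `process_job` is a hypothetical placeholder for slow work such as resizing an uploaded image.

```python
import queue
import threading


def process_job(job: str) -> str:
    """Hypothetical placeholder for slow work (thumbnailing, sending email, ...)."""
    return job.upper()


def worker(jobs: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Consume jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work for this worker
            jobs.task_done()
            break
        out = process_job(job)
        with lock:
            results.append(out)
        jobs.task_done()


def run(jobs_to_submit: list, num_workers: int = 3) -> list:
    jobs: queue.Queue = queue.Queue()
    results: list = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(jobs, results, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    # Producer side: put() returns immediately; it does not wait for processing.
    for job in jobs_to_submit:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(sorted(run(["resize-1.png", "resize-2.png", "email-welcome"])))
```

The producer and the workers share only the queue. You can add workers or restart them without changing the producing code, which is the decoupling the table refers to.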

Example: “Design a URL shortener”

Requirements:
- 100M URLs shortened per day
- Redirect latency < 10ms
- Links don't expire (or have configurable TTL)

Scale:
- 100M writes/day ≈ 1,160 writes/sec (100,000,000 / 86,400)
- Reads likely 10-100x writes ≈ 11,600-116,000 req/sec

Design:
[Client] → [CDN] → [Load Balancer] → [App Servers] → [Redis Cache]
                                              ↓
                                          [Database]

Short URL generation: base62 encode a counter or random ID
Redirect: look up in Redis cache first → DB if not cached
Database: simple key-value (short_code → original_url, metadata)
Cache: hot URLs cached in Redis with 24h TTL
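The generation and redirect steps can be sketched under a few assumptions: a monotonically increasing counter is the ID source, and plain dicts stand in for Redis and the database. The alphabet order and the `Shortener`/`shorten`/`resolve` names are illustrative choices, and TTL handling is omitted.

```python
import itertools
from typing import Optional

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_encode(n: int) -> str:
    """Encode a non-negative integer in base62 using ALPHABET."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class Shortener:
    def __init__(self) -> None:
        self._counter = itertools.count(1)  # stand-in for a DB sequence or ID service
        self.db: dict = {}                  # stand-in for the database
        self.cache: dict = {}               # stand-in for Redis (no TTL here)
        self.db_reads = 0

    def shorten(self, url: str) -> str:
        code = base62_encode(next(self._counter))
        self.db[code] = url
        return code

    def resolve(self, code: str) -> Optional[str]:
        # Cache-aside: try the cache, fall back to the DB, then populate the cache.
        if code in self.cache:
            return self.cache[code]
        self.db_reads += 1
        url = self.db.get(code)
        if url is not None:
            self.cache[code] = url
        return url


if __name__ == "__main__":
    s = Shortener()
    code = s.shorten("https://example.com/some/long/path")
    print(code, s.resolve(code), s.resolve(code), s.db_reads)
```

Seven base62 characters give 62^7 ≈ 3.5 trillion codes, which lasts roughly 96 years at 100M new URLs per day. Sequential counters make codes guessable. Random IDs avoid that, but each new code must be checked for collisions.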

Scale considerations:
- App servers: stateless, scale horizontally
- Redis: cluster mode for high availability
- Database: read replicas for redirect lookups
- Global: deploy in multiple regions, use Route 53 latency-based routing

The STAR Summary

Everything comes back to this: engineers who get hired are the ones who can demonstrate impact, not just tool knowledge.

Before any interview:

List 5-7 technical accomplishments with measurable results
Prepare STAR stories for each
Practice explaining your architecture decisions out loud
Be ready to say “I don’t know, but here’s how I’d find out”

Interviewers hire people they’d want to debug a production incident with at 3am. Be that person — calm, systematic, communicative, and honest about what you don’t know.