RGR SRE Interview Question¶

🎯 Questions and Answers for Junior SRE (Support-focused)¶

🔹 Basic OS and Networking Concepts¶

What is the difference between a process and a thread?¶

A process is a running program with its own memory and resources.
A thread is an execution path inside a process that shares the process’s resources with other threads.

What does it mean when a program in Linux becomes zombie or defunct, and what should be done?¶

A zombie process means it has finished execution but its entry remains in the process table because the parent hasn’t collected the exit status.
Solution: Usually solved by killing/restarting the parent (kill -9 <parent_pid>).

What’s the difference between TCP and UDP and their use cases?¶

TCP is connection-oriented, reliable, guarantees ordered delivery (used for web, email).
UDP is connectionless, faster, less reliable (used for DNS, video streaming, online games).

When a website doesn’t load in the browser, what troubleshooting steps would you take?¶

Check internet connectivity (ping 8.8.8.8)
Verify DNS resolution (dig, nslookup)
Check routing (traceroute)
Test port availability (telnet <host> 80, curl -v)
Review firewall or network policies

What are the uses of `top`, `htop`, `netstat`, `ss`, `curl`, `ping`, `traceroute`?¶

top/htop → Monitor CPU, memory, and processes
netstat/ss → View network connections and ports
curl → Test HTTP endpoints
ping → Check IP reachability
traceroute → Trace the route packets take

🔹 Services and Monitoring¶

What is SRE, and how does it differ from DevOps or SysAdmin?¶

SRE focuses on reliability, availability, and performance of services.
SysAdmins manage servers and infrastructure.
DevOps focuses on CI/CD and bridging development and operations.

What’s the difference between monitoring and observability?¶

Monitoring: Collecting and viewing metrics/logs (e.g., CPU usage, error rates).
Observability: The ability to understand the internal state of a system based on its outputs (logs, metrics, traces).

What are SLI, SLO, and SLA?¶

SLI (Service Level Indicator): A measurable metric (e.g., percentage of successful requests).
SLO (Service Level Objective): The target for SLI (e.g., 99.9% success rate).
SLA (Service Level Agreement): A contractual commitment with customers, including penalties if targets aren’t met.

What would you do if a service’s CPU or memory usage suddenly spikes?¶

Check resource usage (top, htop)
Identify heavy processes
Check logs for memory leaks or infinite loops
Apply limits using cgroups/Kubernetes resource limits
Scale the service if needed

You see HTTP 500 errors in logs — how do you investigate?¶

Identify timestamps of the errors
Check dependencies (databases, external services)
Verify system resources
Test endpoint manually with curl
Review recent changes (deployments, configs)

🔹 Containers and Tools¶

Do you have Docker experience? How does a container interact with the host OS?¶

A container is an isolated process running on the host kernel.
It uses namespaces (for isolation of processes, network, mounts, etc.) and cgroups (to limit resources).

What’s the difference between an image and a container?¶

Image: Immutable, read-only snapshot.
Container: A running instance of an image with a writable layer.

Do you know Kubernetes? What is a Pod, and what does CrashLoopBackOff mean?¶

Pod: The smallest deployable unit in K8s (one or more containers).
CrashLoopBackOff: The container repeatedly crashes and Kubernetes keeps restarting it, often due to misconfiguration, dependency failure, or code bugs.

🔹 Support and Incident Handling¶

If you get an alert at 3 AM that a database is down, what’s your first action?¶

Validate the alert
Check service status (systemctl status, docker logs, health checks)
Review system resources (CPU, RAM, disk)
Attempt quick restore (restart or failover)
Notify the team if escalation is required

What’s the difference between issue, incident, problem, and outage?¶

Issue: Any reported problem
Incident: An event affecting service reliability
Problem: Root cause behind one or more incidents
Outage: Severe or complete service downtime

Have you written a postmortem? What should it include?¶

A postmortem analyzes an incident and should include:
Description of the event
Timeline
Impact on users
Root cause
Actions taken
Action items for prevention

If you notice a critical service is having issues but no alert has fired, what would you do?¶

Investigate and confirm the issue
Document and inform the team
Stabilize or mitigate the issue
Add or fix missing alerts afterward

🔹 Soft Skills¶

How do you collaborate with developers when troubleshooting?¶

Provide logs and metrics clearly
Explain issues factually without blame
Suggest reliability improvements

Can you give an example of automating a repetitive issue?¶

Automated restarting a service after crashes using a script or systemd unit instead of doing it manually.

When multiple requests come at once, how do you prioritize?¶

Based on user and business impact.
Issues affecting critical services or SLA violations come first.

🎯 SRE Interview Questions with Answers (Mid-level+)¶

🖥️ OS & Performance¶

What’s the difference between load average and CPU usage, and how do you interpret them?¶

CPU usage shows the percentage of CPU time used.
Load average shows the average number of processes waiting for CPU, disk, or I/O over 1, 5, and 15 minutes.
High load with low CPU usage usually indicates I/O bottlenecks.

How do you detect memory leaks in a service?¶

Monitor memory usage over time (top, htop, Prometheus metrics).
Use tools like valgrind, pmap, gperftools, or heapdump depending on the language.
Check GC (garbage collection) logs in languages like Java/Python.

What does high IOwait indicate, and how would you troubleshoot it?¶

IOwait means CPU is idle but waiting for disk I/O.
Causes: slow disks, too many concurrent requests, database contention.
Troubleshooting: check iostat, iotop, disk latency, optimize queries, use faster storage (SSD/NVMe).

Difference between vertical scaling and horizontal scaling?¶

Vertical: Adding more resources (CPU/RAM) to a single machine. Easy but limited.
Horizontal: Adding more machines/instances. More scalable, but adds complexity (load balancing, consistency).

🌐 Networking & Distributed Systems¶

How does DNS caching work at different levels?¶

App-level cache (e.g., browser cache).
OS resolver cache (nscd, systemd-resolved).
Recursive resolver cache (like 8.8.8.8).
Authoritative DNS has TTL that controls cache duration.

If latency increases between two microservices, how do you find the root cause?¶

Check metrics (latency, error rate) from monitoring.
Trace requests with distributed tracing (Jaeger, OpenTelemetry).
Use network tools (mtr, tcpdump).
Verify downstream dependencies (DB, API).

What are circuit breaker and retry patterns, and why are they important?¶

Retry: Automatically retry failed requests (with backoff).
Circuit breaker: Stops sending requests when failure rate is high, allowing system to recover.
Important in distributed systems to prevent cascading failures.

Explain CAP theorem with examples¶

C (Consistency): All nodes see the same data at the same time.
A (Availability): Every request receives a response.
P (Partition tolerance): System works despite network splits.
Example CP: HBase, ZooKeeper.
Example AP: Cassandra, DynamoDB.

📊 Observability & Monitoring¶

Difference between metrics, logs, and traces?¶

Metrics: Aggregated numerical data (CPU, QPS, latency).
Logs: Event-based text records.
Traces: Track requests across services (useful in microservices).

What is cardinality in metrics, and why is it dangerous in Prometheus?¶

Cardinality = number of unique label combinations.
High cardinality (e.g., per-user metrics) causes high memory usage and slow queries.
Rule: Avoid unbounded labels (like user IDs, request IDs).

Difference between blackbox and whitebox monitoring?¶

Blackbox: Tests from outside (ping, HTTP check). Focus on availability.
Whitebox: Monitors internal metrics/logs. Focus on internal health.

How do you deal with alert fatigue?¶

Deduplicate and group alerts.
Prioritize by severity and business impact.
Use SLO-based alerting instead of raw metrics.
Regularly review and tune alert rules.

🛠️ Containers / Kubernetes¶

Difference between Deployment, StatefulSet, and DaemonSet in K8s?¶

Deployment: Stateless workloads, easy scaling.
StatefulSet: Stateful apps (DBs, Kafka), keeps identity & stable storage.
DaemonSet: Ensures one pod runs on every node (e.g., logging/monitoring agents).

Why might a Pod stay in Pending state?¶

Not enough resources (CPU/RAM).
Node selectors/affinity rules not satisfied.
Missing PersistentVolume.
Scheduler can’t place it due to taints/tolerations.

When do you use HPA (Horizontal Pod Autoscaler), and what are its limitations?¶

Use: Auto-scale pods based on metrics (CPU, custom metrics).
Limitations:
Scaling delay (reactive, not predictive).
Doesn’t handle queue-based workloads well.
Needs accurate metrics.

What is the role of kube-proxy?¶

Handles Service networking in K8s.
Maintains iptables/IPVS rules for load balancing traffic to pods.

⚡ Incident Response & Reliability¶

Scenario: primary DB is down, replicas are healthy. What do you do?¶

Promote a replica to primary (manual or via failover tool like Patroni, Orchestrator).
Redirect traffic to new primary.
Investigate and fix the original primary before reintroducing it.

What metrics are critical for DB monitoring?¶

Connection count, query latency, slow queries.
Replication lag.
Buffer cache hit ratio.
Disk I/O and locks.

What are MTTR, MTTD, and MTBF, and why are they important?¶

MTTR (Mean Time to Recovery): Avg time to recover from failure.
MTTD (Mean Time to Detect): Avg time to detect a failure.
MTBF (Mean Time Between Failures): Avg time between failures.
They measure reliability and efficiency of incident response.

If rollback isn’t possible during a critical outage, what fallback strategies exist?¶

Feature flagging (disable problematic feature).
Graceful degradation (disable non-critical services).
Rate limiting or traffic shedding.
Serving cached/stale data.

🧩 Automation & DevOps Practices¶

What is Infrastructure as Code (IaC), and why is it important? and it's tools?¶

Managing infra with code instead of manual config.
Benefits: repeatability, version control, automation.
Tools: Terraform, Ansible, Pulumi, CloudFormation.

What are the stages of a CI/CD pipeline, and how do they relate to SRE?¶

Stages: build → test → deploy → verify.
SRE ensures reliability: automated rollbacks, monitoring after deploys, canary releases.

What’s the difference between blue-green and canary deployment?¶

Blue-green: Two environments, switch all traffic at once.
Canary: Gradually shift traffic to new version, monitor errors, roll back if needed.

What is chaos engineering, and why is it useful?¶

Definition: Practice of intentionally injecting failures into systems.
Goal: test resilience and validate recovery processes before real outages happen.

🔹 Senior/Lead SRE Interview Questions (Advanced Level)¶

🏗️ Architecture & Scalability¶

How would you design a highly available global system (e.g., an e-commerce site) that serves millions of users?¶

A (key points):

Multi-region deployment (active-active or active-passive).
Load balancing with global traffic managers (GSLB, Anycast DNS).
Database replication across regions (read replicas, multi-master with conflict resolution).
CDN for static assets.
Caching layer (Redis, Memcached).
Graceful degradation under heavy load.

What’s the difference between consistency models (strong, eventual, causal)? When would you use each?¶

Strong consistency: Clients always read the latest write (banking).
Eventual consistency: Updates propagate over time (social feeds).
Causal consistency: Reads respect cause-effect order (chat systems).

How do you handle a “thundering herd” problem?¶

Add caching with randomized TTLs.
Use request coalescing (deduplicate identical requests).
Apply rate limiting/backpressure.
Queue requests with message brokers.

🌐 Reliability & Trade-offs¶

How do you decide between active-active vs. active-passive architecture for a critical system?¶

Active-active: Higher availability, load distribution, complex conflict resolution.
Active-passive: Easier to manage, cheaper, but slower failover.
Decision depends on SLA requirements and budget.

How do you approach error budgets?¶

Error budget = SLO target − actual reliability.
If error budget is burned quickly → slow down releases, focus on reliability.
If error budget is underused → increase release velocity.

In a system with 5 nines availability target (99.999%), what does that mean in downtime per year?¶

99.999% = ~5 minutes/year downtime allowed.
Implies extreme redundancy, automation, and very fast failover.

📊 Observability at Scale¶

How do you design an observability stack for thousands of microservices?¶

Centralized logging (ELK/Loki/Cloud-native log storage).
Metrics with high retention and downsampling (Prometheus + Cortex/Thanos).
Distributed tracing (OpenTelemetry/Jaeger).
SLO-driven alerting (reduce noise).

How would you debug a performance regression in production when metrics look normal?¶

Use tracing to track request latency across services.
Compare baseline vs regression requests.
Check for hidden bottlenecks (thread pools, DB indexes, GC pauses).
Test with load replay (synthetic traffic).

🔄 Incident Response at Scale¶

Imagine a global outage is caused by a bad config push. How do you manage the incident?¶

Declare incident and assign roles (commander, communicator, investigator).
Rollback config immediately.
Communicate status to stakeholders/customers.
Run postmortem: root cause, why rollback didn’t happen faster, fix processes.

How do you balance speed of incident response with risk of making it worse?¶

Use predefined runbooks for known failure modes.
Prefer safe rollbacks before risky fixes.
Separate mitigation (quick fix) from remediation (long-term fix).

⚡ Cost & Efficiency¶

How do you optimize cloud infrastructure cost while maintaining reliability?¶

Rightsizing instances.
Auto-scaling down at off-peak.
Spot/preemptible instances for non-critical workloads.
Storage lifecycle policies (tiering).
Monitoring unused resources.

If management asks to cut infra costs by 20% without lowering SLOs, what strategies do you apply?¶

Identify overprovisioned clusters.
Consolidate workloads with bin-packing.
Review error budgets → maybe slightly relax SLOs.
Optimize DB queries and caching.

🧩 Culture & Leadership¶

How do you enforce a blameless postmortem culture?¶

Focus on system/process failures, not individuals.
Encourage sharing failures openly.
Reward proactive prevention instead of heroics.

How do you mentor junior SREs?¶

Pair on incident handling.
Walk through troubleshooting thought processes.
Encourage writing runbooks and automation scripts.
Teach prioritization (impact vs urgency).

This level is typically for Lead SRE or Senior¶

Must be able to design complex systems
Explain trade-offs between cost, scale, and reliability
Take a leadership role in critical incidents
Help the team grow and develop the right culture.

🔹 Principal / Head of SRE Interview Questions (Executive-Level)¶

🏗️ System Design & Strategy¶

How would you define an SRE strategy for a company moving from monolith to microservices?¶

(key points):

Establish SLIs/SLOs for critical services early.
Introduce observability stack (metrics, logs, tracing).
Gradual migration with service ownership models.
Create platform teams for shared infra.
Prioritize automation to reduce operational toil.

How do you decide what availability target (SLO) a service should have?¶

Based on business impact of downtime (lost revenue, user trust).
Balance cost vs reliability (99.9% may cost ×3 more than 99.5%).
Use error budgets to align product/development with reliability.

How would you architect a system to handle a Black Friday scale event?¶

Capacity planning + load testing well in advance.
Auto-scaling + burstable capacity in cloud.
Queueing and rate limiting to protect downstream services.
Graceful degradation (recommendations off, checkout prioritized).
“Dark launch” new features to test infra capacity.

📊 Organization-wide Reliability¶

How do you scale SRE practices across multiple teams?¶

Define clear service ownership.
Provide SRE platform/tools (CI/CD, monitoring, alerting).
Create reliability standards (SLI/SLO frameworks).
Run reliability reviews regularly.

How do you measure the success of an SRE organization?¶

Reduction in MTTR/MTTD.
Error budget burn rates.
% of toil vs engineering work (goal: <50% toil).
Deployment frequency & rollback safety.
Team satisfaction & burnout levels.

How do you balance innovation (fast feature delivery) with reliability?¶

Error budget policy: innovation allowed until budget is burned.
Shared accountability with product teams.
Canary releases and progressive rollouts.
Strong CI/CD + automated testing.

🔄 Incident Management & Crisis Leadership¶

How do you run an incident management process for a large org?¶

Define clear roles: incident commander, scribe, comms lead.
Standardize incident severity levels.
Ensure fast comms with stakeholders and customers.
After-action reviews with blameless culture.
Automate common mitigations (auto-remediation).

How would you prepare an org for “unknown unknowns”?¶

Chaos engineering & game days.
Cross-team incident simulations.
Invest in observability and fast detection.
Build culture of learning from small failures.

⚡ Cost, Risk, and Business Alignment¶

How do you explain the value of SRE to non-technical executives?¶

Link downtime to lost revenue and reputation.
Show how SRE reduces risk and accelerates safe delivery.
Use metrics: incidents reduced, faster recovery, customer satisfaction.

If the CTO asks to cut infra costs by 30% without hurting customer experience, how do you respond?¶

Review SLOs → maybe adjust slightly if business agrees.
Optimize infra usage (rightsizing, auto-scaling, spot instances).
Rationalize monitoring/alerting costs.
Propose phased approach, not drastic cuts.

🧩 Culture & Leadership¶

How do you foster a blameless culture while still ensuring accountability?¶

Focus on “what” and “how” not “who”.
Encourage transparency in failures.
Accountability via ownership, not punishment.
Recognition for proactive reliability improvements.

How do you grow and retain top SRE talent?¶

Provide clear career paths (IC vs Manager tracks).
Give engineers time for automation and innovation, not just firefighting.
Invest in learning (chaos days, conferences, training).
Build psychologically safe environments.

This level of interview becomes much less command-line / debugging and much more focused on¶

Organizational vision and strategy
Business and product communication
Team management and SRE culture Focuses.

📌 Summary¶

Junior / Mid → Implementation and troubleshooting
Senior → Distributed systems design and architecture
Principal / Head → Team and organization strategy, scaling culture and SRE tools
Director / VP / CTO → Leading the organization at the business level, budget, risk and global strategy

RGR SRE Interview Question¶

🎯 Questions and Answers for Junior SRE (Support-focused)¶

🔹 Basic OS and Networking Concepts¶

What is the difference between a process and a thread?¶

What does it mean when a program in Linux becomes zombie or defunct, and what should be done?¶

What’s the difference between TCP and UDP and their use cases?¶

When a website doesn’t load in the browser, what troubleshooting steps would you take?¶

What are the uses of top, htop, netstat, ss, curl, ping, traceroute?¶

🔹 Services and Monitoring¶

What is SRE, and how does it differ from DevOps or SysAdmin?¶

What’s the difference between monitoring and observability?¶

What are SLI, SLO, and SLA?¶

What would you do if a service’s CPU or memory usage suddenly spikes?¶

You see HTTP 500 errors in logs — how do you investigate?¶

🔹 Containers and Tools¶

Do you have Docker experience? How does a container interact with the host OS?¶

What’s the difference between an image and a container?¶

Do you know Kubernetes? What is a Pod, and what does CrashLoopBackOff mean?¶

🔹 Support and Incident Handling¶

If you get an alert at 3 AM that a database is down, what’s your first action?¶

What’s the difference between issue, incident, problem, and outage?¶

Have you written a postmortem? What should it include?¶

If you notice a critical service is having issues but no alert has fired, what would you do?¶

🔹 Soft Skills¶

How do you collaborate with developers when troubleshooting?¶

Can you give an example of automating a repetitive issue?¶

When multiple requests come at once, how do you prioritize?¶

🎯 SRE Interview Questions with Answers (Mid-level+)¶

🖥️ OS & Performance¶

What’s the difference between load average and CPU usage, and how do you interpret them?¶

How do you detect memory leaks in a service?¶

What does high IOwait indicate, and how would you troubleshoot it?¶

Difference between vertical scaling and horizontal scaling?¶

🌐 Networking & Distributed Systems¶

How does DNS caching work at different levels?¶

If latency increases between two microservices, how do you find the root cause?¶

What are circuit breaker and retry patterns, and why are they important?¶

Explain CAP theorem with examples¶

📊 Observability & Monitoring¶

Difference between metrics, logs, and traces?¶

What is cardinality in metrics, and why is it dangerous in Prometheus?¶

Difference between blackbox and whitebox monitoring?¶

How do you deal with alert fatigue?¶

🛠️ Containers / Kubernetes¶

Difference between Deployment, StatefulSet, and DaemonSet in K8s?¶

Why might a Pod stay in Pending state?¶

When do you use HPA (Horizontal Pod Autoscaler), and what are its limitations?¶

What is the role of kube-proxy?¶

⚡ Incident Response & Reliability¶

Scenario: primary DB is down, replicas are healthy. What do you do?¶

What metrics are critical for DB monitoring?¶

What are MTTR, MTTD, and MTBF, and why are they important?¶

If rollback isn’t possible during a critical outage, what fallback strategies exist?¶

🧩 Automation & DevOps Practices¶

What is Infrastructure as Code (IaC), and why is it important? and it's tools?¶

What are the stages of a CI/CD pipeline, and how do they relate to SRE?¶

What’s the difference between blue-green and canary deployment?¶

What is chaos engineering, and why is it useful?¶

🔹 Senior/Lead SRE Interview Questions (Advanced Level)¶

🏗️ Architecture & Scalability¶

How would you design a highly available global system (e.g., an e-commerce site) that serves millions of users?¶

What’s the difference between consistency models (strong, eventual, causal)? When would you use each?¶

How do you handle a “thundering herd” problem?¶

🌐 Reliability & Trade-offs¶

How do you decide between active-active vs. active-passive architecture for a critical system?¶

How do you approach error budgets?¶

In a system with 5 nines availability target (99.999%), what does that mean in downtime per year?¶

📊 Observability at Scale¶

How do you design an observability stack for thousands of microservices?¶

How would you debug a performance regression in production when metrics look normal?¶

🔄 Incident Response at Scale¶

Imagine a global outage is caused by a bad config push. How do you manage the incident?¶

How do you balance speed of incident response with risk of making it worse?¶

⚡ Cost & Efficiency¶

How do you optimize cloud infrastructure cost while maintaining reliability?¶

If management asks to cut infra costs by 20% without lowering SLOs, what strategies do you apply?¶

🧩 Culture & Leadership¶

How do you enforce a blameless postmortem culture?¶

How do you mentor junior SREs?¶

This level is typically for Lead SRE or Senior¶

What are the uses of `top`, `htop`, `netstat`, `ss`, `curl`, `ping`, `traceroute`?¶