RGR SRE Interview Question¶
🎯 Questions and Answers for Junior SRE (Support-focused)¶
🔹 Basic OS and Networking Concepts¶
What is the difference between a process and a thread?¶
- A process is a running program with its own memory and resources.
- A thread is an execution path inside a process that shares the process’s resources with other threads.
What does it mean when a program in Linux becomes zombie or defunct, and what should be done?¶
- A zombie process means it has finished execution but its entry remains in the process table because the parent hasn’t collected the exit status.
- Solution: Usually solved by killing/restarting the parent (
kill -9 <parent_pid>).
What’s the difference between TCP and UDP and their use cases?¶
- TCP is connection-oriented, reliable, guarantees ordered delivery (used for web, email).
- UDP is connectionless, faster, less reliable (used for DNS, video streaming, online games).
When a website doesn’t load in the browser, what troubleshooting steps would you take?¶
- Check internet connectivity (
ping 8.8.8.8) - Verify DNS resolution (
dig,nslookup) - Check routing (
traceroute) - Test port availability (
telnet <host> 80,curl -v) - Review firewall or network policies
What are the uses of top, htop, netstat, ss, curl, ping, traceroute?¶
top/htop→ Monitor CPU, memory, and processesnetstat/ss→ View network connections and portscurl→ Test HTTP endpointsping→ Check IP reachabilitytraceroute→ Trace the route packets take
🔹 Services and Monitoring¶
What is SRE, and how does it differ from DevOps or SysAdmin?¶
- SRE focuses on reliability, availability, and performance of services.
- SysAdmins manage servers and infrastructure.
- DevOps focuses on CI/CD and bridging development and operations.
What’s the difference between monitoring and observability?¶
- Monitoring: Collecting and viewing metrics/logs (e.g., CPU usage, error rates).
- Observability: The ability to understand the internal state of a system based on its outputs (logs, metrics, traces).
What are SLI, SLO, and SLA?¶
- SLI (Service Level Indicator): A measurable metric (e.g., percentage of successful requests).
- SLO (Service Level Objective): The target for SLI (e.g., 99.9% success rate).
- SLA (Service Level Agreement): A contractual commitment with customers, including penalties if targets aren’t met.
What would you do if a service’s CPU or memory usage suddenly spikes?¶
- Check resource usage (
top,htop) - Identify heavy processes
- Check logs for memory leaks or infinite loops
- Apply limits using cgroups/Kubernetes resource limits
- Scale the service if needed
You see HTTP 500 errors in logs — how do you investigate?¶
- Identify timestamps of the errors
- Check dependencies (databases, external services)
- Verify system resources
- Test endpoint manually with
curl - Review recent changes (deployments, configs)
🔹 Containers and Tools¶
Do you have Docker experience? How does a container interact with the host OS?¶
- A container is an isolated process running on the host kernel.
- It uses namespaces (for isolation of processes, network, mounts, etc.) and cgroups (to limit resources).
What’s the difference between an image and a container?¶
- Image: Immutable, read-only snapshot.
- Container: A running instance of an image with a writable layer.
Do you know Kubernetes? What is a Pod, and what does CrashLoopBackOff mean?¶
- Pod: The smallest deployable unit in K8s (one or more containers).
- CrashLoopBackOff: The container repeatedly crashes and Kubernetes keeps restarting it, often due to misconfiguration, dependency failure, or code bugs.
🔹 Support and Incident Handling¶
If you get an alert at 3 AM that a database is down, what’s your first action?¶
- Validate the alert
- Check service status (
systemctl status,docker logs, health checks) - Review system resources (CPU, RAM, disk)
- Attempt quick restore (restart or failover)
- Notify the team if escalation is required
What’s the difference between issue, incident, problem, and outage?¶
- Issue: Any reported problem
- Incident: An event affecting service reliability
- Problem: Root cause behind one or more incidents
- Outage: Severe or complete service downtime
Have you written a postmortem? What should it include?¶
-
A postmortem analyzes an incident and should include:
-
Description of the event
- Timeline
- Impact on users
- Root cause
- Actions taken
- Action items for prevention
If you notice a critical service is having issues but no alert has fired, what would you do?¶
- Investigate and confirm the issue
- Document and inform the team
- Stabilize or mitigate the issue
- Add or fix missing alerts afterward
🔹 Soft Skills¶
How do you collaborate with developers when troubleshooting?¶
- Provide logs and metrics clearly
- Explain issues factually without blame
- Suggest reliability improvements
Can you give an example of automating a repetitive issue?¶
- Automated restarting a service after crashes using a script or systemd unit instead of doing it manually.
When multiple requests come at once, how do you prioritize?¶
- Based on user and business impact.
- Issues affecting critical services or SLA violations come first.
🎯 SRE Interview Questions with Answers (Mid-level+)¶
🖥️ OS & Performance¶
What’s the difference between load average and CPU usage, and how do you interpret them?¶
- CPU usage shows the percentage of CPU time used.
- Load average shows the average number of processes waiting for CPU, disk, or I/O over 1, 5, and 15 minutes.
- High load with low CPU usage usually indicates I/O bottlenecks.
How do you detect memory leaks in a service?¶
- Monitor memory usage over time (
top,htop, Prometheus metrics). - Use tools like
valgrind,pmap,gperftools, orheapdumpdepending on the language. - Check GC (garbage collection) logs in languages like Java/Python.
What does high IOwait indicate, and how would you troubleshoot it?¶
- IOwait means CPU is idle but waiting for disk I/O.
- Causes: slow disks, too many concurrent requests, database contention.
- Troubleshooting: check
iostat,iotop, disk latency, optimize queries, use faster storage (SSD/NVMe).
Difference between vertical scaling and horizontal scaling?¶
- Vertical: Adding more resources (CPU/RAM) to a single machine. Easy but limited.
- Horizontal: Adding more machines/instances. More scalable, but adds complexity (load balancing, consistency).
🌐 Networking & Distributed Systems¶
How does DNS caching work at different levels?¶
- App-level cache (e.g., browser cache).
- OS resolver cache (
nscd,systemd-resolved). - Recursive resolver cache (like 8.8.8.8).
- Authoritative DNS has TTL that controls cache duration.
If latency increases between two microservices, how do you find the root cause?¶
- Check metrics (latency, error rate) from monitoring.
- Trace requests with distributed tracing (Jaeger, OpenTelemetry).
- Use network tools (
mtr,tcpdump). - Verify downstream dependencies (DB, API).
What are circuit breaker and retry patterns, and why are they important?¶
- Retry: Automatically retry failed requests (with backoff).
- Circuit breaker: Stops sending requests when failure rate is high, allowing system to recover.
- Important in distributed systems to prevent cascading failures.
Explain CAP theorem with examples¶
- C (Consistency): All nodes see the same data at the same time.
- A (Availability): Every request receives a response.
- P (Partition tolerance): System works despite network splits.
- Example CP: HBase, ZooKeeper.
- Example AP: Cassandra, DynamoDB.
📊 Observability & Monitoring¶
Difference between metrics, logs, and traces?¶
- Metrics: Aggregated numerical data (CPU, QPS, latency).
- Logs: Event-based text records.
- Traces: Track requests across services (useful in microservices).
What is cardinality in metrics, and why is it dangerous in Prometheus?¶
- Cardinality = number of unique label combinations.
- High cardinality (e.g., per-user metrics) causes high memory usage and slow queries.
- Rule: Avoid unbounded labels (like user IDs, request IDs).
Difference between blackbox and whitebox monitoring?¶
- Blackbox: Tests from outside (ping, HTTP check). Focus on availability.
- Whitebox: Monitors internal metrics/logs. Focus on internal health.
How do you deal with alert fatigue?¶
- Deduplicate and group alerts.
- Prioritize by severity and business impact.
- Use SLO-based alerting instead of raw metrics.
- Regularly review and tune alert rules.
🛠️ Containers / Kubernetes¶
Difference between Deployment, StatefulSet, and DaemonSet in K8s?¶
- Deployment: Stateless workloads, easy scaling.
- StatefulSet: Stateful apps (DBs, Kafka), keeps identity & stable storage.
- DaemonSet: Ensures one pod runs on every node (e.g., logging/monitoring agents).
Why might a Pod stay in Pending state?¶
- Not enough resources (CPU/RAM).
- Node selectors/affinity rules not satisfied.
- Missing PersistentVolume.
- Scheduler can’t place it due to taints/tolerations.
When do you use HPA (Horizontal Pod Autoscaler), and what are its limitations?¶
-
Use: Auto-scale pods based on metrics (CPU, custom metrics).
-
Limitations:
- Scaling delay (reactive, not predictive).
- Doesn’t handle queue-based workloads well.
- Needs accurate metrics.
What is the role of kube-proxy?¶
- Handles Service networking in K8s.
- Maintains iptables/IPVS rules for load balancing traffic to pods.
⚡ Incident Response & Reliability¶
Scenario: primary DB is down, replicas are healthy. What do you do?¶
- Promote a replica to primary (manual or via failover tool like Patroni, Orchestrator).
- Redirect traffic to new primary.
- Investigate and fix the original primary before reintroducing it.
What metrics are critical for DB monitoring?¶
- Connection count, query latency, slow queries.
- Replication lag.
- Buffer cache hit ratio.
- Disk I/O and locks.
What are MTTR, MTTD, and MTBF, and why are they important?¶
- MTTR (Mean Time to Recovery): Avg time to recover from failure.
- MTTD (Mean Time to Detect): Avg time to detect a failure.
- MTBF (Mean Time Between Failures): Avg time between failures.
- They measure reliability and efficiency of incident response.
If rollback isn’t possible during a critical outage, what fallback strategies exist?¶
- Feature flagging (disable problematic feature).
- Graceful degradation (disable non-critical services).
- Rate limiting or traffic shedding.
- Serving cached/stale data.
🧩 Automation & DevOps Practices¶
What is Infrastructure as Code (IaC), and why is it important? and it's tools?¶
- Managing infra with code instead of manual config.
- Benefits: repeatability, version control, automation.
- Tools: Terraform, Ansible, Pulumi, CloudFormation.
What are the stages of a CI/CD pipeline, and how do they relate to SRE?¶
- Stages: build → test → deploy → verify.
- SRE ensures reliability: automated rollbacks, monitoring after deploys, canary releases.
What’s the difference between blue-green and canary deployment?¶
- Blue-green: Two environments, switch all traffic at once.
- Canary: Gradually shift traffic to new version, monitor errors, roll back if needed.
What is chaos engineering, and why is it useful?¶
- Definition: Practice of intentionally injecting failures into systems.
- Goal: test resilience and validate recovery processes before real outages happen.
🔹 Senior/Lead SRE Interview Questions (Advanced Level)¶
🏗️ Architecture & Scalability¶
How would you design a highly available global system (e.g., an e-commerce site) that serves millions of users?¶
A (key points):
- Multi-region deployment (active-active or active-passive).
- Load balancing with global traffic managers (GSLB, Anycast DNS).
- Database replication across regions (read replicas, multi-master with conflict resolution).
- CDN for static assets.
- Caching layer (Redis, Memcached).
- Graceful degradation under heavy load.
What’s the difference between consistency models (strong, eventual, causal)? When would you use each?¶
- Strong consistency: Clients always read the latest write (banking).
- Eventual consistency: Updates propagate over time (social feeds).
- Causal consistency: Reads respect cause-effect order (chat systems).
How do you handle a “thundering herd” problem?¶
- Add caching with randomized TTLs.
- Use request coalescing (deduplicate identical requests).
- Apply rate limiting/backpressure.
- Queue requests with message brokers.
🌐 Reliability & Trade-offs¶
How do you decide between active-active vs. active-passive architecture for a critical system?¶
- Active-active: Higher availability, load distribution, complex conflict resolution.
- Active-passive: Easier to manage, cheaper, but slower failover.
- Decision depends on SLA requirements and budget.
How do you approach error budgets?¶
- Error budget = SLO target − actual reliability.
- If error budget is burned quickly → slow down releases, focus on reliability.
- If error budget is underused → increase release velocity.
In a system with 5 nines availability target (99.999%), what does that mean in downtime per year?¶
- 99.999% = ~5 minutes/year downtime allowed.
- Implies extreme redundancy, automation, and very fast failover.
📊 Observability at Scale¶
How do you design an observability stack for thousands of microservices?¶
- Centralized logging (ELK/Loki/Cloud-native log storage).
- Metrics with high retention and downsampling (Prometheus + Cortex/Thanos).
- Distributed tracing (OpenTelemetry/Jaeger).
- SLO-driven alerting (reduce noise).
How would you debug a performance regression in production when metrics look normal?¶
- Use tracing to track request latency across services.
- Compare baseline vs regression requests.
- Check for hidden bottlenecks (thread pools, DB indexes, GC pauses).
- Test with load replay (synthetic traffic).
🔄 Incident Response at Scale¶
Imagine a global outage is caused by a bad config push. How do you manage the incident?¶
- Declare incident and assign roles (commander, communicator, investigator).
- Rollback config immediately.
- Communicate status to stakeholders/customers.
- Run postmortem: root cause, why rollback didn’t happen faster, fix processes.
How do you balance speed of incident response with risk of making it worse?¶
- Use predefined runbooks for known failure modes.
- Prefer safe rollbacks before risky fixes.
- Separate mitigation (quick fix) from remediation (long-term fix).
⚡ Cost & Efficiency¶
How do you optimize cloud infrastructure cost while maintaining reliability?¶
- Rightsizing instances.
- Auto-scaling down at off-peak.
- Spot/preemptible instances for non-critical workloads.
- Storage lifecycle policies (tiering).
- Monitoring unused resources.
If management asks to cut infra costs by 20% without lowering SLOs, what strategies do you apply?¶
- Identify overprovisioned clusters.
- Consolidate workloads with bin-packing.
- Review error budgets → maybe slightly relax SLOs.
- Optimize DB queries and caching.
🧩 Culture & Leadership¶
How do you enforce a blameless postmortem culture?¶
- Focus on system/process failures, not individuals.
- Encourage sharing failures openly.
- Reward proactive prevention instead of heroics.
How do you mentor junior SREs?¶
- Pair on incident handling.
- Walk through troubleshooting thought processes.
- Encourage writing runbooks and automation scripts.
- Teach prioritization (impact vs urgency).
This level is typically for Lead SRE or Senior¶
- Must be able to design complex systems
- Explain trade-offs between cost, scale, and reliability
- Take a leadership role in critical incidents
- Help the team grow and develop the right culture.
🔹 Principal / Head of SRE Interview Questions (Executive-Level)¶
🏗️ System Design & Strategy¶
How would you define an SRE strategy for a company moving from monolith to microservices?¶
(key points):
- Establish SLIs/SLOs for critical services early.
- Introduce observability stack (metrics, logs, tracing).
- Gradual migration with service ownership models.
- Create platform teams for shared infra.
- Prioritize automation to reduce operational toil.
How do you decide what availability target (SLO) a service should have?¶
- Based on business impact of downtime (lost revenue, user trust).
- Balance cost vs reliability (99.9% may cost ×3 more than 99.5%).
- Use error budgets to align product/development with reliability.
How would you architect a system to handle a Black Friday scale event?¶
- Capacity planning + load testing well in advance.
- Auto-scaling + burstable capacity in cloud.
- Queueing and rate limiting to protect downstream services.
- Graceful degradation (recommendations off, checkout prioritized).
- “Dark launch” new features to test infra capacity.
📊 Organization-wide Reliability¶
How do you scale SRE practices across multiple teams?¶
- Define clear service ownership.
- Provide SRE platform/tools (CI/CD, monitoring, alerting).
- Create reliability standards (SLI/SLO frameworks).
- Run reliability reviews regularly.
How do you measure the success of an SRE organization?¶
- Reduction in MTTR/MTTD.
- Error budget burn rates.
- % of toil vs engineering work (goal: <50% toil).
- Deployment frequency & rollback safety.
- Team satisfaction & burnout levels.
How do you balance innovation (fast feature delivery) with reliability?¶
- Error budget policy: innovation allowed until budget is burned.
- Shared accountability with product teams.
- Canary releases and progressive rollouts.
- Strong CI/CD + automated testing.
🔄 Incident Management & Crisis Leadership¶
How do you run an incident management process for a large org?¶
- Define clear roles: incident commander, scribe, comms lead.
- Standardize incident severity levels.
- Ensure fast comms with stakeholders and customers.
- After-action reviews with blameless culture.
- Automate common mitigations (auto-remediation).
How would you prepare an org for “unknown unknowns”?¶
- Chaos engineering & game days.
- Cross-team incident simulations.
- Invest in observability and fast detection.
- Build culture of learning from small failures.
⚡ Cost, Risk, and Business Alignment¶
How do you explain the value of SRE to non-technical executives?¶
- Link downtime to lost revenue and reputation.
- Show how SRE reduces risk and accelerates safe delivery.
- Use metrics: incidents reduced, faster recovery, customer satisfaction.
If the CTO asks to cut infra costs by 30% without hurting customer experience, how do you respond?¶
- Review SLOs → maybe adjust slightly if business agrees.
- Optimize infra usage (rightsizing, auto-scaling, spot instances).
- Rationalize monitoring/alerting costs.
- Propose phased approach, not drastic cuts.
🧩 Culture & Leadership¶
How do you foster a blameless culture while still ensuring accountability?¶
- Focus on “what” and “how” not “who”.
- Encourage transparency in failures.
- Accountability via ownership, not punishment.
- Recognition for proactive reliability improvements.
How do you grow and retain top SRE talent?¶
- Provide clear career paths (IC vs Manager tracks).
- Give engineers time for automation and innovation, not just firefighting.
- Invest in learning (chaos days, conferences, training).
- Build psychologically safe environments.
This level of interview becomes much less command-line / debugging and much more focused on¶
- Organizational vision and strategy
- Business and product communication
- Team management and SRE culture Focuses.
📌 Summary¶
- Junior / Mid → Implementation and troubleshooting
- Senior → Distributed systems design and architecture
- Principal / Head → Team and organization strategy, scaling culture and SRE tools
- Director / VP / CTO → Leading the organization at the business level, budget, risk and global strategy