Staff Site Reliability Engineer
Company: Synthesis Health
Location: Houston
Posted on: February 18, 2026
Job Description:

Who We Are

We're a mission- and values-driven company with tremendous
dedication to our customers. Our 100% remote team is dedicated to a
common goal – to revolutionize healthcare through innovation,
collaboration, and commitment to our core values and behaviors.

About the Opportunity

We are looking for a Staff Site Reliability Engineer (SRE) to serve
as the guardian of our platform's availability and the architect of
our operational maturity. In this high-impact role, you will own the
strategy and execution required to achieve and maintain a 99.99%
availability SLA for our critical healthcare services. You will not
just respond to incidents; you will build the automated systems that
prevent them. You will design the auto-scaling architectures and
disaster recovery protocols that allow us to handle bursty medical
imaging traffic and catastrophic failures without flinching.

This is a hands-on leadership role. You will define the standards
for reliability engineering across the organization, mentor Senior
(L4) engineers, and embed SRE principles into our development
culture. You will serve as the technical face of reliability to our
enterprise customers, providing the architectural assurances they
need to trust us with their most critical workflows.

If you are obsessed with automation, intolerant of manual toil, and
ready to lead the reliability strategy for a life-critical platform,
we want to hear from you.

Key Responsibilities

Uptime & Reliability Strategy

Own the 99.99% Target: You will define the Service Level Objectives
(SLOs) and Service Level Indicators (SLIs) for our critical user
journeys. You will be accountable for tracking our Error Budgets and
governing the release velocity based on platform stability.

Incident Management & Forensics: You will own the incident response
process, serving as the ultimate escalation point for complex
production outages. You will lead blameless post-mortems (RCAs) to
identify root causes and ensure systemic fixes are implemented to
prevent recurrence.

Eliminate Toil: You will ruthlessly identify and automate manual
operational tasks. Your goal is to engineer yourself out of
operations work so you can focus on high-value reliability
architecture.

Business Continuity & Disaster Recovery (BC/DR)

Architect for Catastrophe: You will design and implement our
Business Continuity and Disaster Recovery strategy. You will
orchestrate our regional failover capabilities, ensuring we meet
aggressive Recovery Time Objectives (RTO) and Recovery Point
Objectives (RPO).

Enterprise-Grade Resilience: You will build the technical
credibility required to win grueling enterprise audits. You will
demonstrate that our platform is robust, stable, and resistant to
unexpected failures through rigorous documentation and
proof-of-concept demonstrations.

"Game Day" Simulations: You will lead regular disaster recovery
drills and chaos engineering experiments to validate our failover
mechanisms, ensuring our team is practically prepared for real-world
scenarios.

Scalability & Performance

Intelligent Auto-Scaling: You will design and implement
sophisticated auto-scaling strategies (HPA/VPA/Cluster Autoscaler)
on Kubernetes (GKE) to handle unpredictable spikes in medical data
ingestion.

Capacity Planning: You will lead capacity planning and cost
optimization initiatives, ensuring our infrastructure scales
efficiently with our business growth.

Architectural Leadership

Resilience Patterns: You will work with the Architecture Review
Board (ARB) to enforce resilience patterns (circuit breakers,
retries, fallbacks, bulkheads) in our application code and service
mesh.

Mentorship & Culture: You will advocate for SRE culture across the
engineering organization, mentoring feature teams on how to build
operable, observable, and reliable software.

What We're Looking For

Deep SRE Experience: 8 years of engineering experience, with a
significant focus on Site Reliability Engineering or DevOps in a
high-scale, 24/7 production environment.

BC/DR Orchestration: Proven experience designing active-passive or
active-active multi-region architectures. You have successfully
executed regional failovers and managed the complexities of data
replication and consistency during outages.

Kubernetes Mastery: Deep, hands-on expertise with Kubernetes (GKE
preferred). You understand the internals of scheduling, networking
(CNI), and storage (CSI).

Infrastructure as Code: You treat infrastructure as software. You
have expert-level proficiency with Terraform or similar IaC tools.

Observability Expert: You have deep experience implementing and
tuning observability stacks (Prometheus, Grafana, Datadog, or
similar). You know how to extract meaningful signals from noise.

Coding Proficiency: You are a capable coder in Go, Python, or
TypeScript. You can dive into application code to debug production
issues or build complex automation tooling.

Cloud Native: Deep experience with public cloud providers (GCP
preferred) and their managed services.

Preferred Qualifications

Healthcare Experience: Experience supporting HIPAA-compliant
environments or handling PHI (Protected Health Information).

Global Traffic Management: Experience with multi-region
architectures, global load balancing, and CDN tuning.

Chaos Engineering: Experience designing and running chaos
experiments to validate system resilience.

Why You Should Join Us

Solve Our Toughest Puzzles: This is a high-leverage role. You will
be working on the most impactful technical challenges that are
critical to the company's success.

Define the Architecture: You won't just be maintaining a system; you
will be a primary author of its future state, with the autonomy to
make it happen.

Lead from the Front: This is a chance to establish yourself as a key
technical voice in a rapidly growing company.

Competitive Compensation & Benefits: We offer a strong salary, a
100% remote culture, and significant opportunities for growth.

We are a values-driven company.

Our values: Clinical service first. Collaborate with our customers.
Listen, respect, learn. Innovate to excel.

The behaviors we look for: Be nice. Be creative. Be honest. Be
helpful.

Compensation and Benefits

The typical salary range for this position is $145,000 - $180,000.
However, Synthesis participates in location-based hiring, and salary
ranges can be adjusted based on the candidate's residence. Other
benefits include, but are not limited to: Medical, Dental, Vision,
"Use as needed" vacation policy, and participation in our employee
option program.

Synthesis Health is an Equal Employment/Affirmative Action employer.
We do not discriminate in hiring on the basis of sex, gender
identity, sexual orientation, race, color, religious creed, national
origin, physical or mental disability, protected veteran status, or
any other characteristic protected by federal, state, or local law.

Keywords: Synthesis Health, The Woodlands, Staff Site Reliability Engineer, IT / Software / Systems, Houston, Texas