Roche

Principal Site Reliability Engineer

South San Francisco Full time

A healthier future. It’s what drives us to innovate. To continuously advance science and ensure everyone has access to the healthcare they need today and for generations to come. Creating a world where we all have more time with the people we love. That’s what makes us Roche.

Advances in AI, data, and computational sciences are transforming drug discovery and development. Roche’s Research and Early Development organisations at Genentech (gRED) and Pharma (pRED) have demonstrated how these technologies accelerate R&D, leveraging data and novel computational models to drive impact. Seamless data sharing and access to models across gRED and pRED are essential to maximising these opportunities. The new Computational Sciences Center of Excellence (CoE) is a strategic, unified group whose goal is to harness this transformative power of data and Artificial Intelligence (AI) to assist our scientists in both pRED and gRED to deliver more innovative and transformative medicines for patients worldwide.

Within the CS CoE, the Data and Digital Catalyst (DDC) organization leads the modernization of our computational and data ecosystems by integrating digital technologies across Research and Early Development to empower stakeholders, advance data-driven science, and accelerate decision-making.

The Solutions team within the DDC organization develops modernized and interconnected computational and data ecosystems. As a Site Reliability Engineer in the Solutions Engineering capability, you will work closely with our engineering colleagues and play a pivotal role in designing, implementing, and maintaining scalable, resilient, and supportable cloud-based platform solutions.

The Opportunity:

Infrastructure as Code (IaC) Design and Implementation:

  • Architect and implement IaC solutions using tools like Terraform, Pulumi, or CloudFormation to provision and manage cloud infrastructure for MLOps and HPC workloads across global regions.

Global Availability and Resiliency:

  • Implement disaster recovery (DR) and resiliency plans for critical systems, including autoscaling, load balancing, and failover mechanisms, to ensure global operational integrity.

  • Conduct chaos engineering experiments to validate system reliability and identify potential weaknesses.

  • Implement robust monitoring, logging, and alerting frameworks using tools like Prometheus, Grafana, Datadog, or ELK Stack to provide deep insights into system health and performance.

Collaboration and Leadership:

  • Provide technical leadership to a team of engineers, fostering a culture of collaboration, innovation, and continuous improvement.

  • Partner with cross-functional teams to align infrastructure solutions with business objectives and ML/HPC workload requirements.

  • Ensure compliance with organizational security, governance, and regulatory policies in all IaC and cloud implementations.

Who You Are:

  • Bachelor’s or Master’s degree in Computer Science or a similar technical field, or equivalent experience, and 7+ years of experience in software engineering or Site Reliability Engineering (SRE).

  • Proven expertise in supporting and deploying IaC solutions in cloud environments (AWS, Azure, or GCP) for MLOps and HPC workloads.

  • Deep understanding of cloud-native architectures and on-premises solutions, including autoscaling, serverless, and multi-region deployments.

  • Technical Skills:

    • Advanced proficiency with IaC tools: Terraform, Pulumi, or CloudFormation.

    • Hands-on experience with containerization and orchestration: Docker, Kubernetes, and Kubeflow.

    • Expertise in scripting and automation: Python, Bash, or Go.

    • Strong understanding of GPU-accelerated computing (e.g., NVIDIA CUDA, TensorFlow) and HPC workload scaling.

    • Knowledge of distributed systems, storage solutions, and data pipelines.

    • Familiarity with monitoring and observability tools: Prometheus, Grafana, Datadog, or similar.

  • Soft Skills:

    • Strong problem-solving skills, with a methodical approach to troubleshooting.

    • Excellent communication, leadership, and mentoring abilities.

    • Ability to work collaboratively across teams in a fast-paced, dynamic environment.

Preferred Qualifications:

  • Experience with distributed ML frameworks (e.g., Horovod, TensorFlow Distributed).

  • Familiarity with data engineering pipelines (e.g., Apache Airflow, Apache Spark).

  • Knowledge of chaos engineering tools (e.g., Chaos Monkey, Litmus).

  • Experience with compliance frameworks (e.g., GDPR, SOC 2, ISO 27001).

Onsite presence, on our South San Francisco campus, is expected for at least 3 days a week.

Relocation benefits are available for this job posting.

The expected salary range for this position, based on the primary location of California, is $162,600 - $302,000. Actual pay will be determined based on experience, qualifications, geographic location, and other job-related factors permitted by law. A discretionary annual bonus may be available based on individual and Company performance. This position also qualifies for the benefits detailed at the link provided below.

Benefits

Genentech is an equal opportunity employer. It is our policy and practice to employ, promote, and otherwise treat any and all employees and applicants on the basis of merit, qualifications, and competence. The company's policy prohibits unlawful discrimination, including, but not limited to, discrimination on the basis of Protected Veteran status or disability status, consistent with all federal, state, or local laws.

If you have a disability and need an accommodation in relation to the online application process, please contact us by completing the Accommodations for Applicants form.