Site Reliability Engineering Lead

Are you passionate about building resilient systems and empowering teams to deliver reliable cloud solutions?

Do you thrive in designing and managing scalable platforms that keep services running smoothly?

About our team:

The LexisNexis Intellectual Property (IP) division (https://www.lexisnexisip.com) provides international patent content and a suite of online and analytic tools that meet the evolving needs of the intellectual property market. We deliver data to support LexisNexis IP search and analytics applications, empowering our customers with actionable insights and metrics for critical business decisions.

Our corporate culture thrives on excellence, innovation, and a strong dedication to our customers, employees, and communities. Working here means joining a vibrant, diverse, and collaborative team where you are free to grow and contribute actively.

About the role:

We are seeking a highly skilled and motivated Site Reliability Engineering Lead to lead a team responsible for ensuring the reliability, scalability, and resilience of mission-critical systems. This role is pivotal in managing senior engineers, driving operational excellence, and fostering a culture of continuous improvement.

You will collaborate closely with product, development, architecture, and security teams to implement best practices in site reliability engineering, cloud platform management, and environment support for internal development and customer systems. You will also lead initiatives around incident response, disaster recovery, automation, monitoring, FinOps cost optimisation, and customer support escalations.

Skills & Experience:

Cloud Platforms & Services: Azure and AWS (EKS, EC2, S3, RDS, Lambda, Azure VMs, Functions).
Infrastructure as Code: Terraform, ARM/Bicep.
Containerization & Orchestration: Docker, Kubernetes (EKS/AKS), Helm, ArgoCD.
Monitoring & Observability: Datadog, Splunk, Coralogix, CloudWatch, Azure Monitor, along with an understanding of baseline metrics.
Scripting & Automation: Python, Bash, PowerShell, TypeScript, JavaScript.
Programming Knowledge: Java, .NET/C#, SQL, React (for integration with supported products).
Systems & Networking: Linux/UNIX/Windows administration, networking, and security best practices.
Specialised Knowledge: Databricks, FinOps cost management, disaster recovery planning.
Core Competencies: Incident management, troubleshooting, IT service management frameworks, and GitOps/DevOps practices.

Soft Skills:

Solid understanding of Site Reliability Engineering (SRE) principles and practices.
Strong understanding of incident management, monitoring tools, IT service management frameworks and automation processes.
Previous experience in customer-facing roles or managing customer support escalations.
Excellent technical problem-solving and troubleshooting abilities.
Strong communication and interpersonal skills, with the ability to collaborate across teams.
Leadership skills with a track record of mentoring and guiding technical teams.
Strong collaboration and advanced communication skills at the peer and senior management level.
Strong skills in setting, communicating, implementing, and achieving business objectives and goals through indirect leadership of and collaboration with others.
Strong organisation/project planning, time management, and change management skills across multiple functional groups and departments, and strong delegation skills involving prioritising and reprioritising projects and managing projects of various sizes and complexity.
Advanced problem-solving experience involving leading teams in identifying, researching, and coordinating the resources necessary to effectively troubleshoot/diagnose complex project issues; prior success extracting/translating findings into alternatives/solutions; and identifying risks/impacts and schedule adjustments to facilitate management decision-making.
Ability to manage multiple priorities and work effectively in a fast-paced environment.
Passion for continuous learning and staying up-to-date with industry trends and best practices.

Responsibilities:

Building & Leading the SRE Organisation

Hire, mentor, and lead a team of SRE and platform engineers to ensure the timely and accurate performance of all team activities.
Foster a culture of reliability, blameless post-mortems, and proactive incident prevention.
Define and implement SRE best practices for reliability, scalability, and performance.

Customer & Incident Management

Manage intake, prioritisation, and resolution of critical customer-reported issues.
Act as an escalation point for high-severity incidents and outages.
In collaboration with Product Support and Development Managers, ensure SLAs, performance benchmarks, and response protocols are met.

Live System Monitoring & Support

Design and maintain robust monitoring, alerting, and incident response systems.
In collaboration with the Product Support Manager, lead incident management from detection to resolution and post-incident analysis.
Ensure system high availability goals are met.
Oversee disaster recovery and business continuity planning within the IP Technology organisation.
Provide support for cloud resources management and workload capacity planning.
Drive automation to reduce manual intervention and improve efficiency.

Platform & Cloud Engineering

Support product development teams with infrastructure, non-functional requirements, and environment stability.
Manage Kubernetes deployments, Databricks environments, and other critical platforms.
Collaborate with cross-functional teams to deliver secure, reliable, and cost-effective platform and cloud solutions.
Ensure all systems comply with security patching and vulnerability management tools.
In collaboration with architects, provide support for FinOps practices to monitor, optimise, and control cloud costs.

Leadership & Continuous Improvement

Provide clear direction, performance evaluations, and career growth for team members.
Ensure proper documentation, reporting, and compliance with security and regulatory standards.
Promote continuous learning, knowledge sharing, and operational excellence.
Write and review documentation for the management, improvement, and support of platforms/assets.
Complete complex bug fixes and root-cause investigations.
Work closely with development and platform teams to understand requirements and translate them into high-quality solutions.
Implement infrastructure management and deployment best practices, including code/solution reviews.
Operate in various development environments (Agile, Waterfall, etc.) while collaborating with key stakeholders.

Why Join Us?

Join our team and contribute to a culture of innovation, collaboration, and excellence. If you are ready to advance your career and make a significant impact, we encourage you to apply.

Work in a way that works for you

We promote a healthy work/life balance across the organisation and aim to offer our people an appealing way of working. With numerous wellbeing initiatives, shared parental leave, study assistance and sabbaticals, we will help you meet your immediate responsibilities and your long-term goals.

Working flexible hours - flexing the times when you work in the day to help you fit everything in and work when you are the most productive.

Working for you

We know that your well-being and happiness are key to a long and successful career. These are some of the benefits we are delighted to offer:

Dutch Share Purchase Plan
Annual Profit Share Bonus
Comprehensive Pension Plan
Home, office or commuting allowance
Generous vacation entitlement and option for sabbatical leave
Maternity, Paternity, Adoption and Family Care leave
Flexible working hours
Personal Choice budget
A variety of online training courses and career roadshows
Well-being programs and a gym facility in the office
Internal communities and networks
Various employee discounts
Recruitment introduction reward
Work from anywhere
Employee Assistance Program (global)
Annual Event

About the Business

A global leader in information and analytics, we help researchers and healthcare professionals advance science and improve health outcomes for the benefit of society. Building on our publishing heritage, we combine quality information and vast data sets with analytics to support visionary science and research, health education and interactive learning, as well as exceptional healthcare and clinical practice. At Elsevier, your work contributes to the world’s grand challenges and a more sustainable future. We harness innovative technologies to support science and healthcare to partner for a better world.

We are committed to providing a fair and accessible hiring process. If you have a disability or other need that requires accommodation or adjustment, please let us know by completing our Applicant Request Support Form or by calling 1-855-833-5120.

Criminals may pose as recruiters asking for money or personal information. We never request money or banking details from job applicants. Learn more about spotting and avoiding scams here.

Please read our Candidate Privacy Policy.

We are an equal opportunity employer: qualified applicants are considered for and treated during employment without regard to race, color, creed, religion, sex, national origin, citizenship status, disability status, protected veteran status, age, marital status, sexual orientation, gender identity, genetic information, or any other characteristic protected by law.

USA Job Seekers:

EEO Know Your Rights.
