Service Reliability Eng - London, N1C 4AG, United Kingdom
Job Summary:
We are UMG, the Universal Music Group. We are the world’s leading music company. In everything we do, we are committed to artistry, innovation and entrepreneurship. We own and operate a broad array of businesses engaged in recorded music, music publishing, merchandising, and audiovisual content in more than 60 countries. We identify and develop recording artists and songwriters, and we produce, distribute and promote the most critically acclaimed and commercially successful music to delight and entertain fans around the world.
As a key member of our Global Technical Operations team, you will be responsible for the reliability, scalability, and performance of the critical systems that power a global enterprise. By blending a software engineering mindset with operational expertise, you will engineer solutions that improve system reliability, automate complex processes, and reduce manual toil. You will be an essential partner to our development, infrastructure, and security teams, driving a culture of resilience and continuous improvement across the organization.
As a Site Reliability Engineer, you won't just be supporting systems; you'll be ensuring the services that connect artists and fans around the globe are always on.
Job Functions:
Key Responsibilities:
System Reliability & Performance:
Design, build, and maintain critical services with a focus on availability, scalability, and performance.
Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.
Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.
Automation & Efficiency:
Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling.
Create and maintain scripts and custom code to support and enhance our operational toolset.
Support and optimize CI/CD pipelines to improve deployment speed and reliability.
Incident Management & Collaboration:
Participate in an on-call rotation to troubleshoot and mitigate production incidents.
Lead post-incident reviews and root cause analyses to implement lasting solutions.
Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.
Job Requirements:
Required Experience & Skills:
A strong background in systems administration (Linux/Windows) in a large-scale environment.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Hands-on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS.
Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible).
Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace).
Proven analytical and problem-solving abilities with experience in a high-pressure environment.
Excellent communication skills and the ability to foster a collaborative team environment.
Preferred Experience & Skills:
Bachelor's degree in an IT-related field.
Experience managing large-scale, distributed systems for a global organization.
Familiarity with IT governance standards like ITIL.
Direct experience with ServiceNow for IT service management.
Knowledge of chaos engineering, resilience testing, and advanced capacity planning.