Service Reliability Eng - London, N1C 4AG, United Kingdom
Job Summary:
We are UMG, the Universal Music Group. We are the world’s leading music company. In everything we do, we are committed to artistry, innovation and entrepreneurship. We own and operate a broad array of businesses engaged in recorded music, music publishing, merchandising, and audiovisual content in more than 60 countries. We identify and develop recording artists and songwriters, and we produce, distribute and promote the most critically acclaimed and commercially successful music to delight and entertain fans around the world.
As a key member of our Global Technical Operations team, you will be responsible for the reliability, scalability, and performance of the critical systems that power a global enterprise. By blending a software engineering mindset with operational expertise, you will engineer solutions that improve system reliability, automate complex processes, and reduce manual toil. You will be an essential partner to our development, infrastructure, and security teams, driving a culture of resilience and continuous improvement across the organization.
As a Site Reliability Engineer, you won't just be supporting systems; you'll be ensuring the services that connect artists and fans around the globe are always on.
Job Functions:
Key Responsibilities:
System Reliability & Performance:
Design, build, and maintain critical services with a focus on availability, scalability, and performance.
Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.
Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.
Automation & Efficiency:
Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling.
Create and maintain scripts and custom code to support and enhance our operational toolset.
Support and optimize CI/CD pipelines to improve deployment speed and reliability.
Incident Management & Collaboration:
Participate in an on-call rotation to troubleshoot and mitigate production incidents.
Lead post-incident reviews and root cause analyses to implement lasting solutions.
Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.
Job Requirements:
Required Experience & Skills:
A strong background in systems administration (Linux/Windows) in a large-scale environment.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Hands-on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS.
Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible).
Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace).
Proven analytical and problem-solving abilities with experience in a high-pressure environment.
Excellent communication skills and the ability to foster a collaborative team environment.
Preferred Experience & Skills:
Bachelor's degree in an IT-related field.
Experience managing large-scale, distributed systems for a global organization.
Familiarity with IT governance standards like ITIL.
Direct experience with ServiceNow for IT service management.
Knowledge of chaos engineering, resilience testing, and advanced capacity planning.