Senior ML Systems Engineer - Simulations
We are looking for a Senior ML Systems Engineer to build and validate simulation infrastructure for large-scale machine learning systems. This role focuses on modelling the compute and communication behaviour of systems used for ML training and inference, and using simulation to guide architecture, performance optimization, and capacity planning.
The ideal candidate combines strong systems experience with hands-on experience in measurement, benchmarking, and performance analysis of modern ML systems.
What You’ll Do:
Build simulation models for compute, memory, interconnect, and communication behavior in ML systems.
Develop tools to simulate performance for training and inference workloads.
Model distributed execution across accelerators, hosts, and network fabrics, including collectives, synchronization, and communication bottlenecks.
Use simulation and analytical modelling to evaluate tradeoffs, identify bottlenecks, and guide system design.
Run performance experiments and benchmarks on real ML systems to calibrate and validate simulation models.
Analyze end-to-end performance, including throughput, latency, scaling efficiency, utilization, and cost/performance tradeoffs.
Partner with hardware, software, networking, and ML teams to align simulation with real workloads and constraints.
Create reproducible benchmarking methodologies across models and system configurations, and compare simulation results against real system measurements to establish validity.
Communicate findings through technical reports and design recommendations.
Qualifications
Required:
Master’s or PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field.
Strong experience in ML systems, distributed systems, performance engineering, computer architecture, or simulation.
Understanding of systems used for machine learning training and inference.
Experience analyzing compute, communication, and memory behavior in large-scale ML systems.
Hands-on experience with performance benchmarking, profiling, and measurement of ML systems.
Experience with distributed training concepts such as data parallelism, tensor/model parallelism, pipeline parallelism, collectives, and synchronization overheads.
Proficiency in one of the following: Python, C++, or Rust.
Strong analytical skills and the ability to connect simulation results to real system behavior.
Preferred:
Experience with system performance modelling, network simulation, or architecture evaluation tools (this background is ideal).
Familiarity with accelerator-based systems such as GPUs, TPUs, or custom ML hardware.
Experience with PyTorch, JAX, TensorFlow, NCCL, XLA, CUDA, or similar tools.
Knowledge of interconnect and networking technologies such as InfiniBand, Ethernet/RDMA, NVLink, PCIe, or equivalent.
Experience evaluating both training throughput and inference latency/serving efficiency.
Background in workload characterization, trace-driven simulation, or model calibration.
Ability to work across hardware and software boundaries in a cross-functional environment.
What Success Looks Like:
Build simulation models that accurately predict performance trends and inform architectural decisions.
Identify compute and communication bottlenecks in ML training and inference systems.
Correlate simulation outputs with real-world benchmark data.
Improve system efficiency, scalability, and cost effectiveness through data-driven insights.