Senior AI Workload Platform Engineer - Radian Arc
Location & work modality: EMEA (remote)
Start: ASAP
Type of Contract: Permanent, full-time
About Radian Arc
Radian Arc is now part of InferX, Submer's AI cloud and GPU infrastructure platform. We provide an infrastructure-as-a-service (IaaS) platform for running cloud gaming, artificial intelligence, and machine learning applications inside telecommunication carrier networks. Our teams across the USA, Australia, Central Europe, Malaysia, Singapore, and Japan offer telecom operators a GPU-based edge computing platform without the need for capital expenditure, delivering low latency, improved economics for value-added services, and monetization of 5G investments.
What impact you will have
Mission: Design, build, and operate the compute orchestration layer powering a GPU-native cloud platform for AI and high-performance workloads, built on CloudStack, Kubernetes, Slurm, and Argo.
The platform orchestrates GPU clusters supporting large-scale AI training and inference workloads across distributed compute infrastructure. This role bridges the current production platform, based on CloudStack, with the next-generation orchestration architecture built around Kubernetes, modern batch scheduling frameworks, and workflow orchestration systems.
You will be responsible for maintaining and evolving the existing CloudStack-based deployments while actively contributing to the design and implementation of the next-generation compute platform supporting distributed AI workloads. The role combines deep hands-on engineering with ownership of critical orchestration components, including Kubernetes-based compute orchestration, Slurm-based distributed training and batch scheduling, and workflow automation through Argo.
Working closely with networking, storage, and platform engineers, you will help implement the platform primitives that expose GPU infrastructure as a scalable, multi-tenant compute platform.
What you’ll do
CloudStack Platform Maintenance
Maintain the existing CloudStack code base used in current production deployments.
Integrate new upstream CloudStack releases into the internal platform fork.
Perform upgrades of existing customer environments to newer CloudStack versions.
Design and execute safe upgrade paths for running production environments.
Troubleshoot orchestration and provisioning issues in existing deployments.
CloudStack Networking & VPC Infrastructure
Maintain and troubleshoot CloudStack VPC networking
Operate and troubleshoot CloudStack's Debian-based VPC routers
Manage networking implementations based on:
Open vSwitch (OVS)
OVN
Improve reliability of network orchestration components
Manage hypervisor implementations based on:
KVM
QEMU
Maintain and evolve the code responsible for QEMU GPU passthrough, including PCI mapping and exposure of L40S, RTX 6000 Pro, and H200 GPUs to virtual machines.
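On KVM hosts, CloudStack drives passthrough through libvirt. As a minimal sketch of the kind of device definition this code manages, a hostdev fragment assigning one host GPU to a guest (the PCI address is hypothetical; the real one comes from the host's PCI mapping, e.g. via lspci):

```xml
<!-- Sketch only: PCI address 0000:3b:00.0 is a placeholder for one
     L40S/RTX 6000 Pro/H200 function discovered on the host. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```

With managed='yes', libvirt detaches the device from its host driver and binds it to VFIO before the VM starts, which is the mechanism QEMU PCI passthrough relies on.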
Next-Generation Compute Orchestration
Design orchestration and scheduling primitives for the next-generation platform based on:
Kubernetes
Slurm
Argo Workflows
Build orchestration workflows that expose GPU and CPU compute resources to platform users.
Integrate compute orchestration with storage and networking services.
Work closely with networking, storage engineers, and platform software engineers to integrate platform primitives.
Kubernetes GPU Scheduling & Cluster Orchestration
Design and implement Kubernetes-based GPU/CPU scheduling infrastructure for multi-tenant AI workloads.
Configure and maintain GPU device plugins and resource allocation mechanisms.
Implement GPU scheduling strategies including:
GPU partitioning, such as MIG where supported
Multi-GPU job placement
Topology-aware scheduling for distributed training and inference.
Design node lifecycle automation for GPU clusters including:
Node provisioning
Node draining
Workload migration
Implement Kubernetes scheduling extensions where necessary such as custom schedulers or batch schedulers.
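To illustrate the GPU scheduling primitives above, a hedged sketch of a pod requesting a MIG slice through the NVIDIA device plugin (the resource name, image, and pod name are assumptions; actual MIG profile names depend on the GPU model and the plugin's MIG strategy):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-example        # hypothetical name
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/pytorch:24.01-py3   # example image
    resources:
      limits:
        # MIG-backed extended resource exposed by the NVIDIA device
        # plugin in "mixed" strategy; profile availability varies by GPU.
        nvidia.com/mig-1g.18gb: 1
```

The scheduler treats the MIG profile as an opaque extended resource, so partitioning, placement, and quota all compose with standard Kubernetes scheduling.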
Slurm Integration and HPC Scheduling
Design and operate Slurm-based HPC scheduling environments integrated with Kubernetes clusters
Implement Slurm compute partitions mapped to Kubernetes-managed GPU/CPU nodes
Develop mechanisms to submit distributed training, fine-tuning, or batch workloads from platform APIs into Slurm clusters
Implement support for:
Multi-node distributed GPU training
Gang scheduling
GPU topology-aware scheduling
Build automation for:
Dynamic Slurm node registration
Elastic compute capacity
Node health monitoring and recovery
Integrate Slurm job lifecycle events with platform orchestration services
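The multi-node training responsibilities above can be sketched as a Slurm job script (partition name, GPU counts, and the training script are assumptions for illustration):

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train          # hypothetical job name
#SBATCH --nodes=4                     # 4-node gang-scheduled allocation
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --partition=gpu-train         # assumes a GPU partition of this name

# Launch one torchrun launcher per node; Slurm supplies the node list,
# and the first node acts as the rendezvous endpoint.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1):29500" \
  train.py
```

Because Slurm allocates all nodes of the job atomically, the gang-scheduling requirement for distributed training is satisfied by the scheduler itself.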
Argo Workflow Orchestration
Design and implement workflow orchestration using Argo Workflows
Develop reusable workflow templates for common platform workloads including:
AI training pipelines
Data preprocessing pipelines
Batch inference workloads
Platform operational workflows
Implement DAG-based execution pipelines coordinating compute workloads across Kubernetes and Slurm clusters
Build workflow primitives that expose platform capabilities to users such as:
Distributed training workflows
Model evaluation pipelines
Batch GPU compute workflows
Integrate workflow execution with platform APIs and platform user interfaces
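A hedged sketch of the DAG-based templates described above, with placeholder steps standing in for real training, preprocessing, and evaluation containers (all names and images are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: train-eval-pipeline          # hypothetical template name
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    dag:
      tasks:
      - name: preprocess
        template: step
      - name: train
        template: step
        dependencies: [preprocess]   # DAG edge: train runs after preprocess
      - name: evaluate
        template: step
        dependencies: [train]
  - name: step                       # placeholder for real workload containers
    container:
      image: alpine:3.19
      command: [sh, -c, "echo running step"]
```

Publishing such templates as WorkflowTemplates is one way to expose reusable pipeline primitives to platform users through the API.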
Distributed AI Workload Orchestration
Implement orchestration support for distributed AI workloads including:
Multi-node training
Distributed inference
Large model fine-tuning workloads
Support execution environments such as:
PyTorch distributed training
MPI-based workloads
Containerized training jobs
Implement mechanisms to coordinate GPU workloads across nodes with low-latency networking
Platform Multi-Tenancy & Resource Isolation
Design and maintain mechanisms for multi-tenant GPU resource allocation
Implement quota and fairness policies for compute workloads
Develop resource isolation strategies across tenants including:
Namespace isolation
Compute quotas
GPU allocation limits
Integrate compute orchestration with platform billing and metering systems
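A minimal sketch of namespace-level GPU quota of the kind described above, using Kubernetes ResourceQuota on the extended GPU resource (tenant name and limits are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-gpu-quota             # hypothetical name
  namespace: tenant-a                # assumes one namespace per tenant
spec:
  hard:
    requests.nvidia.com/gpu: "8"     # cap total GPUs requested by this tenant
    limits.cpu: "256"
    limits.memory: 1Ti
```

Quota usage reported by the API server for these objects can also feed the metering and billing integration.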
Technical Stack
Programming languages
Java, Python, Bash, and SQL for CloudStack-related work
Go and Python for Kubernetes-related components
Orchestration
CloudStack
Kubernetes
KubeVirt
Slurm/SUNK
Argo Workflows
Kubernetes CRDs and controllers
Batch scheduling frameworks
Networking
OVS
OVN
Linux networking
VPC networking
BlueField networking
Infrastructure
GPU infrastructure
Distributed compute clusters
High-performance networking for distributed AI workloads
What you’ll need
Platform & Distributed Systems
Proven experience working with large-scale distributed compute environments at a neo-cloud, hyperscaler, or HPC provider.
Strong experience with CloudStack internals, including extending and maintaining platform functionality.
Experience operating cloud orchestration platforms in production environments.
Experience running GPU-heavy infrastructure for AI training, inference, or HPC workloads.
Software Engineering
Experience maintaining or extending large Java codebases, ideally within infrastructure platforms.
Strong programming skills in Go and Python, with experience building cloud-native platform components.
Experience designing and maintaining control-plane services for infrastructure platforms.
Compute Orchestration
Deep practical knowledge of Kubernetes internals and Slurm scheduling systems.
Experience building or operating compute orchestration layers for large-scale clusters.
Familiarity with workflow orchestration systems such as Argo Workflows.
Networking & Infrastructure
Familiar with virtual networking and distributed networking technologies such as OVS, OVN, VPC networking, RDMA, RoCE, ECMP, EVPN/VXLAN, and leaf-spine fabrics.
Understanding of GPU virtualization and passthrough mechanisms such as QEMU PCI passthrough and NVIDIA MIG.
Experience working with GPU infrastructure, including passthrough, NVIDIA MIG, scheduling, and lifecycle management of GPUs in distributed clusters.
Leadership & Architecture
Able to independently own major compute-orchestration initiatives from design through rollout and operational stabilization.
Comfortable solving difficult implementation and operational problems across CloudStack, Kubernetes, Slurm, and workflow orchestration.
Able to improve orchestration quality through code, automation, and practical design decisions.
Collaborates effectively across compute, networking, storage, and platform teams, and influences engineering practices through expertise and delivery.
Comfortable mentoring peers and improving implementation quality, documentation, operational workflows, and platform reliability within the compute orchestration domain.
What we offer
Attractive compensation package reflecting your expertise and experience.
A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
You'll be part of a fast-growing scale-up with a mission to make a positive impact, with strong opportunities for career growth.
Our job titles may span more than one job level. The actual base pay is dependent on a number of factors, such as transferable skills, work experience, business needs and market demands.
Our Inclusive Responsibility
Radian Arc is committed to creating a diverse and inclusive environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, veteran status, or any other protected category under applicable law.