GPU Cloud Platform Engineer

Location: Remote (Global)

Type: Full-time

Company: Yotta Labs

Apply: careers@yottalabs.ai

🧠 About Yotta Labs

Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

🛠️ Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads—ranging from LLMs to generative models—deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

🎯 Responsibilities

Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot production issues.
Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems. Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and failover. Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alerting.
Coordinate with IDC (Internet data center) providers to plan and deploy large-scale GPU clusters, networks, and storage infrastructure that support internal cloud platforms and external customer needs.

✅ Qualifications

Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm; expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
Proficient in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
Understanding of high-performance interconnects and communication protocols such as InfiniBand (IB), RoCE, NVLink, and PCIe.
Strong communication skills, self-motivation, and team collaboration.

🌟 Preferred Experience

Experience in developing and operating Model-as-a-Service (MaaS) platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.
Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs such as the NVIDIA H100.
Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

🌐 Why Join Yotta Labs?

Be part of a visionary team aiming to redefine AI infrastructure.
Work on cutting-edge technologies that bridge AI and decentralized computing.
Collaborate with experts from leading institutions and tech companies.
Enjoy a flexible, remote work environment that values innovation and autonomy.

📩 How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter to careers@yottalabs.ai. Please include links to any relevant projects or contributions.

Tags: GPU, Kubernetes, Cloud-native, Containerization, Multi-cluster, AI infrastructure, Docker, Prometheus, Grafana, AWS, GCP, Azure, CUDA, MaaS, Performance optimization

Average salary estimate

$125,000 / year (est.)

Min: $90,000

Max: $160,000

