Job Description
About HUD

HUD is building infrastructure to create RL training data and evals for frontier AI agents, as well as the HUD marketplace, where these are sold to frontier labs. Our platform is used by frontier labs, Fortune 500 companies, and startups. We’ve raised $16M from top VCs and were YC W25.

About the role

We’re looking for a platform engineer who can own the reliability, scale, performance, and developer experience of HUD’s core infrastructure and backend systems.

This is not a pure infrastructure role. The right person has strong production infra experience, but also thinks like a backend engineer: they can reason about service architecture, queues, databases, APIs, deployment safety, performance bottlenecks, and how product requirements translate into resilient systems.

You’ll work across AWS, Kubernetes, Terraform, CI/CD, observability, and backend services to make HUD faster, more reliable, cheaper to run, and easier for engineers to build on.

Responsibilities

Own production uptime, latency, provisioning speed, infrastructure cost, and incident response for core platform services
Build and maintain AWS infrastructure with Terraform, Kubernetes/EKS, Helm, Docker, EC2, CodeBuild, ECR, S3, IAM, networking, and secrets management
Design and improve backend and platform systems for scale, including capacity planning, autoscaling, queueing, backpressure, cleanup jobs, retries, and rollback paths
Define and improve dashboards, alerts, logs, traces, SLOs, runbooks, and on-call workflows so failures are detected, debugged, and resolved quickly
Build reliable CI/CD, release automation, environment management, and deployment workflows that improve developer productivity and reduce production risk
Write clean, maintainable code where needed to automate systems, improve backend services, and create internal tooling

Experience

You may be a good fit if you:

Have owned production cloud infrastructure for a high-availability, user-facing platform, with respons