Job Description
Staff Platform Engineer, AI/ML Infrastructure

Department: AI Software & Operations

Role Summary

The Staff Platform Engineer, AI/ML Infrastructure will provide technical leadership for the cloud platforms, deployment systems, and operational foundations that power enterprise-scale generative AI applications.

This role will define and evolve the infrastructure architecture for AI/ML platforms running across AWS, Kubernetes, serverless, and containerized environments. The engineer will lead platform standards for reliability, scalability, observability, CI/CD, security, and developer enablement, while partnering closely with software engineering, AI engineering, security, and operations teams.

The ideal candidate combines deep hands-on cloud engineering experience with staff-level technical influence. They are comfortable designing infrastructure patterns, writing infrastructure-as-code, improving delivery pipelines, mentoring engineers, and making architectural decisions that raise the operational maturity of AI platforms across multiple teams.

Key Responsibilities

- Define and drive the technical strategy for AI/ML platform infrastructure supporting generative AI applications, LLM integrations, model routing, and enterprise AI services.
- Architect, build, and operate scalable cloud platforms using AWS services such as EKS, ECS Fargate, Lambda, DynamoDB, S3, OpenSearch, Secrets Manager, CloudWatch, ALB, and MWAA.
- Establish reusable infrastructure patterns using CloudFormation, Helm, and Terraform to support reliable multi-environment and multi-region deployments.
- Lead CI/CD architecture using GitHub Actions, reusable workflows, OIDC-based AWS authentication, automated quality gates, deployment promotion, and environment approvals.
- Design and improve observability across AI platforms, including CloudWatch dashboards, logs, alarms, Prometheus/Grafana, OpenSearch, Langfuse, and LLM-specific operational metrics.
- Build platform capabilities for GenAI workloads, including model availabilit