KubeCraftJobs

DevOps & Cloud Job Board

ML Ops Lead

Jobs via Dice

Location not specified

Remote
Senior
Full Time
Posted January 13, 2026

Tech Stack

amazon-web-services azure-arc microsoft-azure google-cloud-platform docker kubernetes terraform drift kubeflow ml-flow mlflow airflow apache-airflow tensorflow prometheus grafana azure-monitor appcast


Job Description

Dice is the leading career destination for tech experts at every stage of their careers. Our client, Aptino, is seeking the following. Apply via Dice today!

**Job Summary**

The ML Ops Lead is responsible for architecting, implementing, and operating scalable and reliable machine learning infrastructure and workflows that take AI/ML models from experimentation into robust production environments. This role balances *hands-on engineering excellence* with *team leadership* and *strategic ownership* of machine learning operations practices across the organization.

**Key Responsibilities**

- ML Infrastructure & Operations
  - Architect and maintain scalable ML infrastructure, including compute, storage, orchestration, and monitoring, in cloud (AWS, Azure, Google Cloud Platform) and/or hybrid environments.
  - Build and manage end-to-end ML pipelines for data ingestion, model training, validation, deployment, monitoring, and retraining.
  - Containerize and orchestrate workloads using Docker and Kubernetes (EKS/AKS/GKE), with infrastructure managed through Terraform or similar IaC tools.
- CI/CD & Automation
  - Design and operate CI/CD pipelines for ML workflows (model retraining, version control, deployment, rollback).
  - Automate testing, validation, and release processes for production ML systems.
- Production Reliability & Monitoring
  - Establish monitoring, logging, observability, drift detection, and alerting for deployed models.
  - Troubleshoot operational issues, optimize performance, and ensure high availability and scalability.
- Leadership & Strategic Ownership
  - Lead, mentor, and grow a team of ML Ops engineers and platform specialists.
  - Drive ML Ops strategy and roadmap, aligned with business goals and regulatory standards.
  - Collaborate closely with data scientists, software engineers, product owners, and DevOps teams to deliver production-ready models.
- Governance & Best Practices
  - Implement governance, security, auditability, and compliance practices across ML operations.
  - Define and promote ML lifecycle best practices, documentation standards, and performance metrics.

**Skills & Qualifications**

**Technical Expertise**

- Strong experience with ML Ops tools/frameworks: Kubeflow, MLflow, Airflow, TensorFlow Serving, TorchServe, SageMaker, Azure ML, etc.
- Proficiency in cloud platforms (AWS, Google Cloud Platform, Azure) and container technologies (Docker and Kubernetes).
- Solid background in Infrastructure as Code (Terraform, CloudFormation, Bicep, CDK).
- Deep understanding of CI/CD pipelines, automation tooling, and version control systems.
- Experience with monitoring and observability tooling (Prometheus, Grafana, Azure Monitor, etc.).

**Soft & Leadership Skills**

- Demonstrated ability to lead technical teams and mentor engineering talent.
- Excellent communication and cross-functional collaboration skills.
- Strategic mindset with a focus on reliability, scalability, and operational excellence.

**Education & Experience**

- Bachelor’s or Master’s degree in Computer Science, Software Engineering, Data Science, or a related field.
- Typically 7+ years of experience in cloud, DevOps, or ML Ops roles; senior-level experience preferred, depending on the scale of operations.