Remote
Senior
Full Time
Posted January 13, 2026
Tech Stack
amazon-web-services
azure-arc
microsoft-azure
google-cloud-platform
docker
kubernetes
terraform
drift
kubeflow
ml-flow
mlflow
airflow
apache-airflow
tensorflow
prometheus
grafana
azure-monitor
Job Description
Dice is the leading career destination for tech experts at every stage of their careers. Our client, Aptino, is seeking the following. Apply via Dice today!
**Job Summary**
The ML Ops Lead is responsible for architecting, implementing, and operating scalable and reliable machine learning infrastructure and workflows that take AI/ML models from experimentation into robust production environments. This role balances *hands-on engineering excellence* with *team leadership* and *strategic ownership* of machine learning operations practices across the organization.
**Key Responsibilities**
- ML Infrastructure & Operations
  - Architect and maintain scalable ML infrastructure, including compute, storage, orchestration, and monitoring, in cloud (AWS, Azure, Google Cloud Platform) and/or hybrid environments.
  - Build and manage end-to-end ML pipelines for data ingestion, model training, validation, deployment, monitoring, and retraining.
  - Containerize and orchestrate workloads using Docker and Kubernetes (EKS/AKS/GKE), and provision infrastructure with Terraform or similar IaC tools.
- CI/CD & Automation
  - Design and operate CI/CD pipelines for ML workflows (model retraining, version control, deployment, rollback).
  - Automate testing, validation, and release processes for production ML systems.
- Production Reliability & Monitoring
  - Establish monitoring, logging, observability, drift detection, and alerting for deployed models.
  - Troubleshoot operational issues, optimize performance, and ensure high availability and scalability.
- Leadership & Strategic Ownership
  - Lead, mentor, and grow a team of ML Ops engineers and platform specialists.
  - Drive ML Ops strategy and roadmap, aligned with business goals and regulatory standards.
  - Collaborate closely with data scientists, software engineers, product owners, and DevOps teams to deliver production-ready models.
- Governance & Best Practices
  - Implement governance, security, auditability, and compliance practices across ML operations.
  - Define and promote ML lifecycle best practices, documentation standards, and performance metrics.
**Skills & Qualifications**
**Technical Expertise**
- Strong experience with ML Ops tools/frameworks: Kubeflow, MLflow, Airflow, TensorFlow Serving, TorchServe, SageMaker, Azure ML, etc.
- Proficiency in cloud platforms (AWS, Google Cloud Platform, Azure) and in containerization and orchestration technologies (Docker and Kubernetes).
- Solid background in Infrastructure as Code (Terraform, CloudFormation, Bicep, CDK).
- Deep understanding of CI/CD pipelines, automation tooling, and version control systems.
- Experience with monitoring and observability tooling (Prometheus, Grafana, Azure Monitor, etc.).
**Soft & Leadership Skills**
- Demonstrated ability to lead technical teams and mentor engineering talent.
- Excellent communication and cross-functional collaboration skills.
- Strategic mindset with a focus on reliability, scalability, and operational excellence.
**Education & Experience**
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, Data Science, or a related field.
- Typically 7+ years of experience in cloud, DevOps, or ML Ops roles; senior-level experience preferred, depending on the scale of operations.