KubeCraftJobs

DevOps & Cloud Job Board

Agentic Platform Engineer

Jobs via Dice

Location not specified

Remote
Mid Level
Full Time
Posted January 03, 2026

Tech Stack

guardrails microsoft-graph docker kubernetes microsoft-azure langchain langgraph llamaindex python typescript java golang github github-actions argo argo-cd qdrant prometheus grafana elk opentelemetry kafka aws-sdks

Job Description

**Agentic Platform Engineer - Orchestration, Memory & Evaluation**

**Title:** Agentic Platform Engineer - Agentic Foundations
**Location:** Global (India / Czech Republic / LATAM, remote/hybrid)
**Reports to:** Agentic Foundations Lead - Principal GenAI Architect

**Role Summary**

As Agentic Platform Engineer in the Agentic Foundations squad, you will build the core platform capabilities that every agent in Cloud Studio uses.

You'll be responsible for implementing and evolving the orchestration runtime, shared memory and context layer, evaluation and guardrails, and the Agent SDK used by the Cloud Migration, Application Modernization, and Data Transformation squads.

Your work turns the agentic architecture into production-ready, reusable components that make building new agents faster, safer, and more consistent.

**Key Responsibilities**

**Orchestration & Runtime**

- Implement the agent orchestration engine used by all squads:
  - Graph / state machine / DAG of agents and tools
  - Routing, retries, timeouts, parallel execution, fallbacks
- Build and maintain the internal Agent SDK:
  - Base classes and interfaces for agents, tools, and workflows
  - Standard patterns for error handling, logging, and configuration
- Integrate the agent runtime with cloud-native infrastructure:
  - Docker images, Kubernetes/OpenShift/Fargate workloads
  - CI/CD pipelines (build, test, deploy) for agent services
- Ensure the platform supports multi-LLM and multi-cloud configurations.

**Memory, Context & Retrieval**

- Implement the shared memory layer:
  - Integrate vector stores, search, and other storage mechanisms
  - Define APIs for agents to read/write context safely and consistently
- Build reusable retriever components:
  - Chunking, embedding, metadata strategies
  - Domain-specific retrievers (code, configs, schemas, documents) in collaboration with squads
- Work closely with the Foundations Lead to refine context schemas and best practices.

**Evaluation, Guardrails & Observability**

- Implement evaluation tooling for agents:
  - Harnesses to run test sets, capture outputs, compare against expected behaviors
  - Metrics and dashboards for quality, reliability, and performance
- Implement basic guardrails:
  - Output validation, constraints on actions, policy checks, and sanity checks
- Build observability hooks:
  - Structured logging, traces, metrics (latency, cost, error types, memory hit rates, etc.)
  - Dashboards for monitoring agent workflows in dev/test and pilot environments.

**Collaboration & Enablement**

- Partner with Cloud Migration, App Modernization, and Data Modernization engineers to:
  - Understand their use cases and friction points
  - Evolve the platform to remove friction and improve reuse
- Help domain squads adopt and correctly use the Agent SDK, memory, and evaluation tools.
- Contribute to internal documentation, examples, and best-practice guides ("how to build an agent in Cloud Studio").

**Required Skills & Experience**

**GenAI & Agentic Development**

- Hands-on experience integrating LLMs (OpenAI / Azure OpenAI / others) into applications.
- Experience with at least one LLM/agent framework (LangChain, LangGraph, LlamaIndex, custom).
- Solid understanding of prompt design and context construction.
- Familiarity with tool-using agents and basic multi-agent flows (task decomposition, handoff, etc.) is a strong plus.

**Platform & Backend Engineering**

- 5-8+ years of software engineering experience, ideally in backend/platform roles.
- Strong proficiency in Python and at least one of: TypeScript, Java, Go, or .NET.
- Experience building APIs, services, and libraries consumed by other teams.
- Solid understanding of cloud-native environments:
  - Docker, Kubernetes/OpenShift
  - CI/CD pipelines (GitHub Actions, Argo CD, etc.)
- Comfortable with distributed systems concepts (latency, retries, backpressure, idempotency).

**Memory, Retrieval & Data Handling**

- Experience working with databases and search/retrieval technologies.
- Familiarity with vector databases (e.g., Pinecone, Qdrant, pgvector, Weaviate) or search engines.
- Understanding of RAG patterns: chunking, embedding, metadata, retrieval tuning.

**Evaluation, Guardrails & Observability**

- Experience building testing or evaluation harnesses for complex systems (not necessarily only AI).
- Familiarity with metrics/logging stacks (e.g., Prometheus, Grafana, ELK, OpenTelemetry).
- Understanding of basic safety and validation patterns for AI output (schema checks, constraints, domain rules).

**Ways of Working**

- Enjoys building platforms and tools that other engineers rely on daily.
- Comfortable in a fast-paced R&D environment with iteration and ambiguity.
- Strong communication skills; can explain platform features clearly and help others adopt them.
- Collaborative mindset; happy to jump on calls with domain squads and debug issues end-to-end.

**Nice-to-Have**

- Experience with one of the domain areas (cloud migration, app modernization, data engineering).
- Exposure to event-driven architectures and message buses (Kafka, etc.).
- Prior work on developer platforms, SDKs, or internal frameworks.