Senior Site Reliability Engineer - HashiCorp Network, Infrastructure Services at IBM

**Introduction**

A career in IBM Software means you'll be part of a team that transforms our customers' challenges into industry-leading solutions. We are an infinitely curious team, always seeking new possibilities, and dedicated to creating the world's leading AI-powered, cloud-native software solutions. Our renowned legacy creates endless global opportunities for our network of IBMers. We are a team of deep product experts, ensuring exceptional client experiences, with a focus on delivery, excellence, and obsession over customer outcomes.

This position involves contributing to HashiCorp's offerings, now part of IBM, which empower organizations to automate and secure multi-cloud and hybrid environments. You will join a team managing the lifecycle of infrastructure and security, enhancing IBM's cloud solutions to ensure enterprises achieve efficiency, security, and scalability in their cloud journey.

**Your Role And Responsibilities**

**Our Team**

The Vault Radar Infrastructure team builds and maintains the core systems that power our cloud and on-prem platforms. We focus on reliability, scalability, and security so the product team can ship features confidently. Our core stack includes Nomad, Consul, Vault, Terraform, Postgres, RabbitMQ, and AWS services.

**About The Role**

As a Site Reliability Engineer focusing on network, infrastructure, and test operations, you’ll help design, build, and support the networking foundations that connect our cloud and on-prem products. You’ll work with senior engineers to ensure reliable, secure connectivity between services and environments, and to automate routine tasks for faster, safer delivery.

**In This Role, You Will**

- Infrastructure as Code (IaC): Design and deploy AWS cloud infrastructure using Terraform.
- Container Management: Orchestrate workloads with Nomad and Kubernetes.
- Automation: Develop tools in Python, Go, and TypeScript to automate deployments and maintenance.
- Observability: Use Datadog for comprehensive monitoring, logging, and alerting.
- Testing: Maintain automated testing frameworks for infrastructure and pipelines.
- Reliability & Response: Manage capacity planning, participate in on-call rotations, conduct post-mortems, and collaborate with development teams to ensure system resilience and scalability.

**Preferred Education**

Master's Degree

**Required Technical And Professional Expertise**

- Experience: Proven experience in an SRE/DevOps role managing production environments.
- AWS Expertise: Deep knowledge of core AWS services (EC2, S3, VPC, RDS, IAM, EKS, etc.).
- IaC & Automation: Hands-on experience with Terraform, Nomad or Kubernetes orchestration, and scripting in Python/Go/TypeScript.
- Monitoring: Experience implementing monitoring/logging systems (Datadog, Prometheus, etc.).
- Fundamentals: Strong understanding of Linux and networking fundamentals.
- Methodologies: Familiarity with CI/CD pipelines and methodologies.
- Soft Skills: Strong problem-solving, analytical, and communication skills.

**Preferred Technical And Professional Experience**

- Education in Computer Science or a related technical field.
- Relevant certifications (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), Terraform Associate, or similar).
- Experience with software such as Terraform, Vault, Nomad, Consul, Postgres, and RabbitMQ.
- Experience in defining and tracking Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
