Senior Site Reliability Engineer - HashiCorp Network at IBM

**Our Team**

The Vault Radar Infrastructure team builds and maintains the core systems that power our cloud and on-prem platforms. We focus on reliability, scalability, and security so the product team can ship features confidently. Our core stack includes Nomad, Consul, Vault, Terraform, Postgres, RabbitMQ, and AWS services.

**About the Role**

As a Site Reliability Engineer focusing on network, infrastructure, and test operations, you'll help design, build, and support the networking foundations that connect our cloud and on-prem products. You'll work with senior engineers to ensure reliable, secure connectivity between services and environments, and to automate routine tasks for faster, safer delivery.

**In this role, you will:**

- **Infrastructure as Code (IaC):** Design and deploy AWS cloud infrastructure using Terraform.
- **Container Management:** Orchestrate workloads with Nomad and Kubernetes.
- **Automation:** Develop tools in Python, Go, and TypeScript to automate deployments and maintenance.
- **Observability:** Use Datadog for comprehensive monitoring, logging, and alerting.
- **Testing:** Maintain automated testing frameworks for infrastructure and pipelines.
- **Reliability & Response:** Manage capacity planning, participate in on-call rotations, conduct post-mortems, and collaborate with development teams to ensure system resilience and scalability.

**Required education**

- Bachelor's Degree

**Preferred education**

- Master's Degree

**Required technical and professional expertise**

- **Experience:** Proven experience in an SRE/DevOps role managing production environments.
- **AWS Expertise:** Deep knowledge of core AWS services (EC2, S3, VPC, RDS, IAM, EKS, etc.).
- **IaC & Automation:** Hands-on experience with Terraform, Nomad or Kubernetes orchestration, and scripting in Python/Go/TypeScript.
- **Monitoring:** Experience implementing monitoring and logging systems (Datadog, Prometheus, etc.).
- **Fundamentals:** Strong understanding of Linux and networking fundamentals.
- **Methodologies:** Familiarity with CI/CD pipelines and methodologies.
- **Soft Skills:** Strong problem-solving, analytical, and communication skills.

**Preferred technical and professional experience**

- Education in Computer Science or a related technical field.
- Relevant certifications (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), Terraform Associate, or similar).
- Experience with software such as Terraform, Vault, Nomad, Consul, Postgres, and RabbitMQ.
- Experience defining and tracking Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

