Site Reliability Engineering at IBM

**Introduction**

A career in IBM Software means you’ll be part of a team that transforms our customers’ challenges into solutions. Seeking new possibilities and always staying curious, we are a team dedicated to creating the world’s leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career. IBM’s product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.

**Your Role And Responsibilities**

As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting to deploying the latest software updates and fixes.

**Your Primary Responsibilities Include**

- Troubleshoot, monitor, and support critical production systems.
- Perform root cause analysis and manage incidents to ensure timely resolution.
- Provision and deploy environments in a cloud infrastructure.
- Handle initial intake for customer ticket requests for configuration changes, ensuring SLA commitments are met.
- Provide on-call support, sharing rotation duties with global teams to minimize MTTR (Mean Time to Recovery).
- Perform regular patching and upgrades, and collaborate with product support to resolve issues.
- Execute multiple tasks in an interrupt-driven environment without losing sight of customer requirements.

**Preferred Education**

Master's Degree

**Required Technical And Professional Expertise**

- Hands-on experience as a DevOps engineer or SRE.
- Experience with at least one major public cloud provider, or with a large-scale private/hybrid cloud using container orchestration.
- Proven experience in providing on-call support for critical production systems, with a focus on root cause analysis (RCA).
- Familiarity with Kubernetes, EKS, ROSA, AKS, GKE, and OpenShift.
- Strong problem-solving skills and attention to detail.
- Proficiency in scripting languages like Python and related tools.
- Good understanding of CI/CD processes and tools (e.g., Jenkins).
- Hands-on experience with Linux systems administration.

**Preferred Technical And Professional Experience**

- Familiarity with customer case management software and processes.
- Experience with monitoring tools and incident management platforms.
- Ability to work efficiently in a global, distributed team environment.

KubeCraftJobs
