Site Reliability Engineer - SCI (f/m/d)
We help the world run better
At SAP, we keep it simple: you bring your best to us, and we'll bring out the best in you. We're builders touching over 20 industries and 80% of global commerce, and we need your unique talents to help shape what's next. The work is challenging – but it matters. You'll find a place where you can be yourself, prioritize your wellbeing, and truly belong. What's in it for you? Constant learning, skill growth, great benefits, and a team that wants you to grow and succeed.
What You'll Build
- Build enterprise cloud infrastructure that provides European data sovereignty and hyperscaler-grade capabilities. You'll work on SAP Cloud Infrastructure and help solve complex distributed systems challenges at scale: multi-region networking, container orchestration, storage systems, and the APIs that connect them.
- We develop solutions using Go, OpenStack, and Kubernetes, tackling problems like: How do you auto-scale thousands of containers across regions? How do you build resilient storage systems? How do you design APIs that handle massive traffic spikes? Incidents happen, so how do you keep their number low and make life good for customers and engineers alike?
- Your work will power SAP's production systems and thousands of customer environments. You'll contribute to infrastructure that enables organizations to run mission-critical applications with the performance and reliability they expect from leading cloud platforms.
- In your role as Site Reliability Engineer, you'll ensure the operational excellence of SAP Cloud Infrastructure and improve monitoring, preventive alerting, and reliability engineering practices for enterprise cloud services that provide European data sovereignty and hyperscaler-grade capabilities. You'll help maintain high availability and performance standards across distributed systems serving thousands of customer environments.
- Your focus will be on challenging robust observability solutions, implementing chaos engineering practices, probing for gaps, finding weak spots, and establishing and questioning SLOs/SLIs for complex infrastructure challenges such as multi-region networking, container orchestration at scale, and storage systems that handle massive traffic spikes. You'll tackle operational challenges including automated remediation, performance optimization, and incident response for mission-critical systems.
- Contributing to production systems that serve SAP's global customer base, you'll establish reliability standards and contribute to operational tooling that enables organizations to run mission-critical applications with enterprise-grade performance and reliability. Your work will ensure that European organizations can maintain data sovereignty while accessing world-class cloud capabilities that remain highly available and performant under demanding production workloads.
What You Bring
- SRE Foundation: 5+ years of Site Reliability Engineering or operations experience with a deep understanding of SLI/SLO/SLA concepts and error budget implementation. Relevant experience in Data Engineering or Data Analysis is a plus.
- Cloud & Infrastructure: Strong knowledge of virtualized infrastructure, ideally including OpenStack, Kubernetes, and multi-cloud environments, along with experience managing hyperscaler-grade platforms.
- Automation & Monitoring: Proficiency in Python, Go, and Bash for reporting automation. Expertise in Prometheus, Grafana, the ELK Stack, and distributed tracing would be a benefit.
- Reliability Engineering: Proven experience in high availability design, fault tolerance, chaos engineering, and performance optimization.
- Incident Management: Hands-on experience with on-call duties, post-mortem analysis, and systematic toil reduction through automation.
- Data Sources & Data Analysis: Handling diverse datasets of varying data quality should come naturally to you. We use OpenSearch, Prometheus, OpenTelemetry, PostgreSQL, Redis, etc. You are familiar with Jupyter Notebooks.
Where You Belong
You'll join PlusOne Central Engineering, where innovation meets collaboration. Our culture values engineering excellence, continuous learning, and diverse perspectives. We embrace modern practices, support career growth, and maintain work-life balance while building enterprise-scale solutions that drive digital transformation.
Berlin, DE, 10557