Join us as we pursue our disruptive new vision to make machine data accessible, usable and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we’re committed to our work, customers, having fun and most importantly to each other’s success. Learn more about Splunk careers and how you can become a part of our journey!

The Splunk Observability Suite is a new generation of cloud applications for microservices and distributed applications. We work on new, world-class tools to monitor and observe microservice-based applications. Site Reliability Engineers at Splunk are hybrid Software/Systems Engineers whose overarching goal is to ensure that production services are always up and running reliably.

As a Site Reliability Engineer, you will help us run one of the largest and most sophisticated cloud-scale, big data systems in the world. You will be responsible for improving operational efficiency, optimal utilization and system resiliency for a real-time streaming analytics platform. You are passionate about automation, infrastructure-as-code, and getting rid of tedious, manual tasks.


Automate and operationalize cloud provider infrastructure using Terraform, Kubernetes, Helm, and Istio.

Monitor capacity & utilization and work closely with the infrastructure team to orchestrate scale-up/down of backend services.

Own & operate critical back-end open-source services like Cassandra, Kafka, Elasticsearch, MongoDB, and ZooKeeper.

Build tools and design processes that help improve observability and system resiliency.

Triage site availability incidents and proactively work towards reducing MTTR for customer-impacting incidents.

Implement service-level metrics & service-level objectives that act as indicators of service health.

Establish design patterns for monitoring, benchmarking and deploying new features for the backend services.


Coding experience in one or more of Python, Go or Java.

Infrastructure as code experience within one or more of Terraform, Ansible, Puppet or Salt.

Experience with modern application development workflows and version control platforms like GitHub, GitLab, or Bitbucket.

Working knowledge of Docker containers and cloud platforms.

Working knowledge of orchestration engines, package managers, and service meshes, including Kubernetes, Helm, and Istio.

Experience operating one or more OSS technologies like Kafka, Cassandra, or ZooKeeper; experience with other backends and streaming systems is a plus.

Experience with Unix/Linux systems from kernel to shell and beyond.

3+ years of experience as a Site Reliability Engineer, Production Engineer or Backend Software Engineer for web-scale or similar platforms.

BS degree in Computer Science or a related technical field, or equivalent practical experience.

We value diversity at our company. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.

For job positions in San Francisco, CA, and other locations where required, we will consider for employment qualified applicants with arrest and conviction records.

Minimum base salary of $95,000.00. You may also be eligible for incentive pay, equity, and benefits. *Note: Disclosure per SB19-085.


Job Details:

Posted Date: 2022-05-07

Job Type: Full Time
