Being a Head of Reliability at balena
As a Head of Reliability, you will work with a team of SREs to ensure our services are available, resilient, and efficient. You will take an “Infrastructure as Product” approach towards enabling self-service for our developers and optimizing the experience for our end-users.
You will learn how our complex interdependent systems are built and run. You will review architecture for new features, refine designs, facilitate frictionless deployments to production, monitor availability, manage outages, and hold retrospectives. As you grow in the role, you will be empowered to implement innovative solutions for automating and streamlining the operation of the infrastructure powering the “balena fleet” and influence strategic decisions impacting the direction of our platform and company.
What you will do
- Identify bottlenecks in services and failure patterns in production, and develop automated solutions to streamline operations
- Define high-quality metrics for our infrastructure and continuously drive their improvement
- Implement monitoring systems to collect health data, set error alerts, and increase visibility into application behavior
- Own the incident response process and apply postmortem learnings to prevent similar issues from recurring
- Support balena developers with seamless, fault-tolerant deployments and production debugging
- Conduct load tests to ensure applications are ready to handle projected traffic
- Participate in the on-call rotation and be a key resource for peers on support
What you will bring
- Strong technical background in software development, infrastructure, or platform operations
- Experience working with Docker containers and running production-grade Kubernetes clusters
- Knowledge of modern software practices, such as instrumentation of applications for observability
- Ability to manage ambiguity, push through friction, and independently make critical trade-off decisions
- Drive to make yourself and others more effective through documentation and automation
- Willingness to constantly build on your knowledge of the balena platform and new technologies
- Excellent communication skills and fluency in English
Nice to have
- Familiarity with distributed systems, server load balancing, and high-availability architectures
- Experience with cloud automation, application performance monitoring (APM), and log management (we use Grafana, Prometheus, and Loki)
- Good understanding of networking protocols (TCP/IP, HTTP, TLS), common failures, and mitigations
- Background in leading teams and working across functions to build robust products
- Experience with IoT, embedded software, developer tools, or the balena platform as a user or contributor
- Contributions to open-source projects and community involvement
Make sure to let us know if any of these items apply to you! If possible, please also share a sample of your work or examples of projects (URL or attachment).