Site Reliability Engineer (Europe)
About ArangoDB
ArangoDB is an innovative, open-source database company that empowers organizations to build scalable and efficient data architectures. Founded in Germany and now headquartered in San Francisco, ArangoDB is the most highly scalable open-source Graph Database with AI/ML capabilities available on the market. In addition to graphs, it natively supports several data models, including Document and Key-Value, as well as Full-Text Search and Retrieval. It serves as the scalable backbone for Graph Analytics and complex data architectures across many different industries. Developers can build high-performance applications using a convenient SQL-like query language or JavaScript extensions. Find out more at the Company page and follow us on LinkedIn.
ArangoDB's culture thrives on collaboration, innovation, and continuous learning, offering employees the opportunity to work on state-of-the-art technology and contribute to the development of solutions that address real-world data challenges. Whether it's advancing distributed systems or enhancing cloud-native capabilities, ArangoDB is at the forefront of database technology, making it an exciting and impactful place to work.
Location: Europe
Job Overview
At ArangoDB, we are building a robust, cloud-native infrastructure to support our distributed database systems, which power mission-critical applications for a wide range of industries. We are searching for a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our infrastructure and applications, with a focus on automation, monitoring, and optimizing cloud environments.
As a Site Reliability Engineer (SRE), you will be responsible for maintaining and improving the reliability of our distributed database systems running on Kubernetes and Cloud environments (AWS, Google Cloud). You will design, implement, and maintain scalable infrastructure solutions, improve and expand observability into these solutions, and troubleshoot complex system issues. You will also be expected to write clean, efficient code in Golang, working closely with development teams.
Your goal is to ensure the high availability and performance of our cloud-based systems, automate repetitive tasks, and enhance our CI/CD pipelines. If you're passionate about building resilient systems, managing Cloud infrastructure, and using Golang to create scalable solutions, we want to hear from you!
About the Role
- Design, implement, and maintain Cloud infrastructure on AWS and Google Cloud platforms.
- Ensure the scalability, performance, and reliability of our Kubernetes-based distributed database systems.
- Collaborate with developers to write efficient, production-grade code in Golang to automate infrastructure management and improve system operations.
- Optimize and automate CI/CD pipelines, deployment processes, and monitoring systems to support our production environment.
- Develop strategies for disaster recovery, high availability, and fault tolerance.
- Proactively identify system bottlenecks, troubleshoot, and resolve issues across the stack (network, OS, cloud infrastructure).
- Implement monitoring, logging, and alerting systems to ensure visibility into system health and performance.
- Participate in on-call rotations to support critical production systems and respond to incidents.
- Collaborate with cross-functional teams to improve overall system reliability and scalability.
- Collaborate with the Customer Success team to resolve customer issues.
Required Skills
- 5+ years of proven experience as an SRE or DevOps Engineer in a Cloud-native environment (AWS or GCP).
- Proficiency with Kubernetes in managing large-scale, distributed systems.
- Solid understanding of networking, security practices, and troubleshooting methods.
- Understanding of Linux internals (processes, environment variables, etc.).
- Familiarity with containerization technologies (e.g., Docker).
- Knowledge of CI/CD practices and tools (Jenkins, CircleCI, etc.).
- Familiarity with alerting, monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack).
- Strong troubleshooting and problem-solving skills, with the ability to address complex infrastructure issues.
- Excellent communication and collaboration skills with a focus on continuous improvement and operational excellence.
- Strong ability to self-organize and to work independently as part of a remote team.
- Knowledge of version control systems, particularly Git.
- Familiarity with programming languages such as Golang or Python.
Nice-to-Have Skills
- Experience managing distributed databases or large-scale data storage systems.
- Knowledge of security best practices.
- Experience with Infrastructure-as-Code (IaC) tools like Terraform.
- Experience working with GitOps.
- Strong programming skills in Golang, with experience in developing automation tools, scripts, or services.
Why Join ArangoDB
Our headquarters is in San Francisco (US) and we have an office in Cologne (Germany), but most of our diverse team works remotely worldwide. So, do you prefer your desk at home or do you want to join us at one of our locations? Your choice.
The ArangoDB team comes from 5 different continents and more than 20 countries. Diverse backgrounds enable us to see new solutions. We invite people from every culture, national origin, religion, sexual orientation, gender identity or expression, and of every age to apply to our positions. All employment decisions are based on business needs, job requirements, and individual qualifications. ArangoDB is committed to a workplace free of discrimination and harassment based on any of these characteristics. We love this diversity and encourage everyone curious and visionary to join the multi-model movement.