SRE Engineer

  • Mistral AI
  • Paris
  • Full-Time
  • Posted 9 days ago

Job Description

Mistral AI is looking for an SRE Engineer to shape the reliability, scalability, and performance of our platform and customer-facing applications.
You will work closely with our software engineers to ensure our systems meet and exceed our customers' expectations.


Responsibilities
- Make sure our inference and platform resources are always available and in good shape
- Ensure our products are reliable and meet their SLAs
- Design, build, and maintain scalable, highly available, and fault-tolerant standard and AI infrastructure to support our machine learning workloads and services
- Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
- Develop and maintain comprehensive documentation for infrastructure designs, processes, and best practices
- Participate in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences
- Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform, …
- Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements
- Evaluate and implement new tools, technologies, and processes to enhance our AI infrastructure's efficiency, reliability, and scalability


About you:
- 3+ years of experience in software engineering
- Key technical skills: observability/alerting/operational maintenance
- Familiar with bare Kubernetes/Grafana/Prometheus
- Experience building cross datacenter & highly available distributed systems
- Experience profiling & optimizing stacks to the millisecond
- Good programming skills in one language (Python/Go/C++/Rust)
- Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role, ideally in an AI/ML-focused environment.
- Strong understanding of AI/ML infrastructure requirements
- Experience with containerization and orchestration technologies like Docker and Kubernetes.
- Familiarity with infrastructure-as-code tools such as Terraform
- Solid understanding of cloud computing platforms like AWS, GCP, or Azure.
- Experience with monitoring, logging, and alerting tools like Prometheus, Grafana, ELK Stack, …
- Strong problem-solving skills and the ability to work independently and collaboratively in a fast-paced environment.
- Excellent communication skills, both written and verbal.


What We Offer: 
- Ability to shape the exciting journey of AI and be part of the very early days of one of Europe’s hottest startups
- A fun, young, multicultural team and collaborative work environment, based in Paris and London
- Competitive salary and bonus structure 
- Comprehensive benefits package 
- Opportunities for professional growth and development