Dev-Ops HPC Engineer
Institution Overview
EPFL, the Swiss Federal Institute of Technology in Lausanne, is a dynamic university ranked among the top 20 worldwide. With over 6,500 employees, EPFL supports education, research, and innovation within a vibrant campus community of over 17,000 people from 120+ countries.
SCITAS and Swiss Twins Project
The SCITAS platform (Scientific IT & Application Support) provides EPFL researchers and partners with access to infrastructure and expertise in High-Performance Computing (HPC). SCITAS also contributes to research and development activities, including the scientific activities of the Swiss Twins project, to maintain EPFL's reputation as a leading research facility.
The Swiss Twins project, a collaboration between ETH Zurich, PSI, EPFL, and CSCS, aims to advance HPC and cloud technologies with a focus on cloud abstractions and geo-redundancy.
SCITAS infrastructure includes:
- Over 2,000 compute nodes (CPU + GPU)
- Large-scale storage systems
- Automatic deployment and configuration management tools
Mission
As a Dev-Ops HPC Engineer, you will join the SCITAS Systems team to manage the deployment, operations, and evolution of geo-redundant scientific HPC solutions within the Swiss Twins project. This role focuses on automation, infrastructure optimization, cloud technologies, and modern infrastructure practices.
Main duties and responsibilities
- Design, build, and deploy portable HPC environments for on-premises and cloud.
- Implement provisioning layers using Terraform and manage container orchestration.
- Troubleshoot across hardware, operating systems, and cloud services.
- Develop automated tests to ensure system stability and reliability.
- Lead cloud abstraction and geo-redundancy initiatives.
- Train and support users in adopting new technologies.
Requirements
Must-Have Qualifications:
- Proven systems/DevOps experience.
- Familiarity with container technologies like Docker or Singularity.
- Programming skills in languages such as Bash and Python, with a solid understanding of algorithms and data structures.
- Proficiency with configuration management tools, CI/CD pipelines, Git, and provisioning tools.
- Strong networking fundamentals (HTTPS, DNS, TCP/IP).
- Extensive GNU/Linux systems experience.
- Ability to document procedures and share knowledge effectively.
- Proficiency in English or French.
Preferred Qualifications:
- Bachelor’s or Master’s degree in a relevant field.
- Experience in HPC or HTC environments, including batch systems.
- Deep knowledge of cloud technologies (AWS, Azure, GCP).
- Experience with Infrastructure as Code (IaC) practices, particularly with Terraform.
- Familiarity with workload management systems like Slurm.
- Knowledge of parallel file systems.
- Security-focused mindset.
- Proficiency with testing practices, including automated test development.
- Experience with distributed systems and monitoring/alerting systems.
Profile
- Enjoys thorough documentation.
- Actively shares knowledge and supports team members.
- Provides technical expertise and proposes innovative solutions.
Additional Information
- Languages: EPFL operates in English and French; non-bilingual applicants are encouraged to learn the other language.
- Application: Only applications submitted through EPFL’s internal website will be considered.
- Equality Commitment: EPFL actively promotes gender equality in its workforce.
- Start Date: 1.1.2025 or to be agreed upon
- Employment Term: Fixed-term (CDD)
- Work Rate: 100%
- Contract Duration: 1 year, renewable
- EPFL offers the possibility to work remotely up to 2 days a week.
- Reference: 1183