Today
Public Trust
Unspecified
Unspecified
IT - Hardware
Remote/Hybrid (Off-Site/Hybrid)
At GDIT, people are our differentiator. Our work depends on an HPC Systems Admin joining our team to support the National Oceanic and Atmospheric Administration (NOAA) Weather and Climate Operational Supercomputer System (WCOSS). This position is primarily remote, with working hours aligned to the Eastern time zone.
WCOSS2 provides NOAA the operational High Performance Computing (HPC) resources essential to process sophisticated numerical models used to predict and understand atmospheric and oceanic phenomena for weather prediction operations. Operating 24/7, the 10-year WCOSS program will deliver significant computational capability that will evolve over time to keep pace with NOAA's growing environmental modeling needs.
We are looking for individuals to join GDIT's team to deploy, operate and support leading-edge technology for WCOSS. Specific technology training will be provided.
***Active Clearance is a plus***
We think. We act. We deliver. There is no challenge we can't turn into opportunity.
In this role, a typical day will include:
• Applying current HPC systems administration skills, and learning and deploying new technologies.
• Developing and deploying monitoring capabilities.
• Developing and implementing tools for cluster administration.
• Providing technical support with a team of HPC System & Storage Administrators to resolve operational issues.
• Providing off-hour on-call support on a rotating basis.
• Contributing to planning for software and hardware upgrades along with future installations.
REQUIRED QUALIFICATIONS
• Bachelor's degree or equivalent and 10+ years of experience with Linux-based HPC systems operations.
• Experience working in a 24X7 operational environment.
DESIRED QUALIFICATIONS
• Demonstrated experience deploying and managing large-scale HPC systems using OS provisioning tools (e.g., xCAT, HPCM, BCM).
• Demonstrated experience using configuration management tools (e.g., Ansible, Puppet).
• Linux system administration experience (e.g., SLES, Red Hat or CentOS).
• Batch management/scheduling systems experience (e.g., Slurm, PBS Pro, LSF); PBS Pro preferred.
• Parallel filesystem configuration and monitoring experience (e.g., Lustre, NFS); Lustre preferred.
• High Speed Network interconnect configuration and monitoring experience (e.g., InfiniBand, OPA, Ethernet, Slingshot).
• Programming or scripting in at least two languages (e.g., Bash, Perl, Python, C).
• Strong writing skills for technical documents, system procedures, user wikis and FAQs.
• Ability to work both independently and as part of a team.
• Knowledge/experience managing computer systems under Service Level Agreements (SLAs).
• Demonstrated expertise in at least one of these areas: Batch Schedulers, High Speed Networks, Parallel Filesystems.
• Experience running and optimizing HPC performance benchmarks or MPI codes would be a plus.
• Experience using and configuring monitoring solutions such as Nagios and Grafana would be a plus.
Work Requirements