Senior HPC Systems Engineer



Full-time employee

United States

CSRA has an opportunity for a talented and innovative senior-level High Performance Computing (HPC) Linux Systems Administrator within our High Performance Computing Center of Excellence. The role provides continuing support for our NOAA Research and Development High Performance Computing Systems customer at NOAA's Earth System Research Laboratory in Boulder, Colorado.

The qualified candidate will bring hands-on technical and project management leadership skills to ensure HPC environment stability, plan for growth, and manage and support new technology insertions, as well as provide remote technical support and consultation to our other supported NOAA sites in Fairmont, West Virginia, and Princeton, New Jersey.

Responsibilities and Duties:

  • Experience in daily management and oversight of medium to large HPC cluster environments is essential for maintaining overall situational awareness of the HPC environment and identifying support or operational issues before they impact operations;
  • Independent problem solving and troubleshooting skills will be leveraged to quickly advance towards viable resolutions;

  • Provide leadership and team coordination for successful planning and execution of scheduled maintenance periods;
  • Hands-on experience with Lustre, NFS, and other NAS and parallel file systems will be heavily leveraged to ensure optimal performance tuning and planning for growth/expansion;
  • Experience with provisioning tools, such as xCAT, will be used to manage the building and/or rebuilding of HPC cluster front ends and compute and administration nodes in both diskful and diskless environments; Puppet will also be used to deliver consistent configuration management within this HPC environment;
  • Demonstrated ability to install open source software packages;
  • Experience with the installation of commercial software and its license managers;
  • Demonstrated ability to install and tune compiler and utility software;
  • Programming experience and familiarity with scripting languages such as bash, sh, Perl, and Python will be used to manage, extend, and develop customized scripts to support the HPC user and system administration environments;
  • Formal change and configuration management practices are ingrained into the daily operations to ensure changes and configuration control implementations are properly documented and approved;
  • Excellent communication skills, both verbal and written, are required in order to ensure the customer, support team, and HPC users and other stakeholders are properly engaged and informed throughout any support or project management effort;
  • Experienced leadership in project management will be leveraged to ensure that projects are properly planned and developed, that stakeholders and the project team are coordinated, that project execution and monitoring are properly performed, and that projects are brought to closure as scheduled;
  • Knowledge of network technologies, such as InfiniBand and GigE, and their associated tools will be leveraged to troubleshoot and tune existing network fabrics and plan for future expansion;
  • Documentation skills will be applied to develop, improve, and enhance user and system administration online documentation repositories; extensive experience with Microsoft Suite (Project, Visio, Word, and Excel) will be leveraged considerably to support the various documentation efforts;
  • Coordination of troubleshooting and repairs with OEM vendors will be essential to identify the root cause of issues, resolve them in a timely manner, and restore SLA-related services;
  • System administration responsibilities will include timely coordination of security patching for any identified IT security-related vulnerabilities;
  • Ability to work and contribute as a member of a small support team is an essential component within this effort;
  • Extended knowledge of batch scheduling and queuing systems (such as Moab/Torque or Slurm) is a plus.


Must have:

* 10 or more years of experience in Systems Administration.
* Bachelor's degree, or equivalent, in a computer-related field, CS preferred.
* Hands-on experience with Linux, Red Hat and CentOS in particular.

* Hands-on experience with computer hardware maintenance, such as replacing DIMMs, disk drives, and PCIe cards.

* Experience with Lustre, NFS, and other NAS and parallel file systems.

* Understanding of basic networking components and tools, with a solid understanding of routing concepts.

* Experience installing and removing software, both from prebuilt packages and compiling from source.

* Experience in Linux/Unix programming or scripting (including Perl and Bash), and interest in task automation.

* Ability to work in both local and remote technical support environments.

* Strong, creative problem-solving skills to tackle highly complex, large-scale technical problems.

* Disciplined troubleshooting skills.

* Experience in project and technical management.

* Attention to detail, with skills in areas such as time management, organization, analytical thinking, observation, and active listening.

* Exceptional verbal and written communication skills.


Nice to have:

* Experience in developing and maintaining software stacks.

* Experience with InfiniBand.

* Experience in writing C programs.

* Working knowledge of batch scheduling and queuing systems (such as Moab/Torque or Slurm).



