Job Title: HPC Systems Engineer III
Location: Boulder, CO
Type: Full-time, Exempt
Relocation assistance is available for this position.
U.S. Citizenship, Permanent Residency, or other protected status under 8 U.S.C. 1324b(a)(3) is required for this position.
UCAR/NCAR will not sponsor a work visa (e.g., J-1, H-1B, etc.) for this position.
Who We Are:
Located in Boulder, Colorado, the National Center for Atmospheric Research (NCAR) is one of the world’s premier scientific institutions, with an internationally recognized staff and research program dedicated to advancing knowledge, providing community-based resources, and building human capacity in the atmospheric and related sciences. NCAR is sponsored by the National Science Foundation (NSF) and managed by the University Corporation for Atmospheric Research (UCAR).
What You Will Do:
As part of the Supercomputer Services Group (SSG) at the National Center for Atmospheric Research (NCAR), the High Performance Computing (HPC) Systems Engineer provides leadership and performs advanced systems programming, administration, and technical support for the Computational & Information Systems Laboratory’s (CISL) HPC systems. SSG is part of the High-End Services Section (HSS), which also includes the HPC Data Infrastructure Group (HDIG). The environment is composed of multi-vendor resources with numerous specialized hardware and software components.
The HPC Systems Engineer performs installation, hardware and software integration, maintenance, administration, troubleshooting, and operation of both software and hardware systems, including petascale computational systems, scheduling environments, and high-speed networks. The HPC Systems Engineer leads system architecture design and implementation projects and works closely with team members from other groups, such as those responsible for the high-performance file systems, mass storage, operations, and user services, to ensure that systems remain highly available and usable. The HPC Systems Engineer mentors and trains junior team members.
The HPC Systems Engineer develops web-based documentation of system procedures and applications. Develops programs to support automated system maintenance, monitor system resources and usage, and implement system security policies. Participates in and leads research and evaluation projects on new technological solutions for NCAR’s high-performance computing environment. Periodic work at the NCAR Wyoming Supercomputer Center (NWSC) may be required during periods of system installation, upgrade, or troubleshooting.
Software Engineering and Development
As part of the team, leads, develops, implements, and documents new features or capabilities in system administration and system monitoring software. Develops and maintains systems software as necessary for the deployment and management of all aspects of high-performance supercomputers and clusters. Develops and maintains security monitoring and analysis software. Performs installation and necessary hardware and software integration as part of supercomputer deployments and upgrades. Writes code and scripts to enhance the system management capabilities of the supercomputers and to automate repetitive system administration tasks.
Research and Evaluation
Researches new and emerging technology in the High-Performance Computing Futures Lab (HPCFL), evaluates the potential impact of the new hardware and software technology on our workflow, plans, and makes recommendations to the High-End Services Section and CISL management for future procurement of hardware and software products, configurations, and functional enhancements or upgrades in support of the high-performance computing environment. Conducts evaluation and benchmarking efforts and compiles reports on new hardware and software systems related to high-performance computing.
Participates in projects relating to high-performance computing and may have direct responsibility for design and procurement decisions. This may include development of systems-level code to support the various aspects of supercomputer software and hardware. Participates in the RFP process by contributing to the technical specification, requirements definition, review, and decision making for future HPC procurements.
Operational Monitoring and Troubleshooting
Operates and monitors the behavior of the group-managed supercomputers and associated peripherals on a routine, daily basis to ensure proper and efficient operations. Alerts other Supercomputer Services Group staff, vendor representatives, and/or CASG staff of abnormal conditions or behaviors, as appropriate, and takes remedial actions as necessary. Diagnoses and may repair failed software and/or hardware components, or may mentor or assist other staff in doing so.
Provides service on a 24x7 on-call basis, troubleshooting and resolving system-related problems presented by users, other sections in CISL, and vendor-employed engineers and analysts. Refers and escalates problems to senior members of the Supercomputer Services Group or appropriate staff as necessary. Documents troubleshooting and operational techniques and best practices, and mentors other team members when necessary.
Provides systems support for diverse hardware and software architectures. Installs and upgrades system hardware and software, including computational systems, clusters, standalone machines, KVM systems and a variety of network fabrics including Ethernet, InfiniBand and OPA. Helps define standards and guidelines for operation and maintenance and produces systems operation and procedural documentation. Compiles, installs and maintains commercial and free application software. Documents system administration tasks and mentors other team members when necessary.
Organizational Representation and Reporting
Provides regular reports on Supercomputer Services Group activities to management and contributes to the NCAR Annual Scientific Report and development plans. Attends group, section, and divisional meetings and represents the Supercomputer Services Group and its activities at such meetings.
What You Need:
Education and Years of Experience:
Bachelor’s degree and eight to twelve years of progressive experience or equivalent combination of education and experience in one or more of the following fields: Computer Science, Mathematics, Computer Engineering, Information Sciences, Software Engineering, or equivalent.
Experience with the Linux operating system in a multi-vendor environment.
Experience with job scheduling systems.
Experience with installation and management in a Linux-based supercomputer environment.
Experience administering a production supercomputer/cluster environment.
Experience with disaster recovery of single image systems and clusters.
Knowledge, Skills, and Abilities:
Demonstrated skill in common scripting and programming languages such as ANSI/GNU C, Python, Perl, or others.
Experience with high-performance computing, management of computer clusters and related technologies.
Demonstrated skill in the installation and administration of large-scale Unix-based clusters.
Demonstrated skill in networking and advanced knowledge of computer security, specifically in a supercomputer/cluster environment.
Demonstrated skill in the installation and integration of hardware and software for deploying supercomputers.
Demonstrated skill in the configuration and tuning of 10/40/100 GigE, InfiniBand, and OPA networks.
Demonstrated skill installing and maintaining supercomputer software and hardware.
Demonstrated skill in performing tasks requiring organization and attention to detail.
Excellent written and verbal English communication skills and the ability to write and interpret systems documentation.
Good project management skills and ability to provide leadership to other team members and mentor other staff.
Desired Knowledge, Skills, and Abilities:
Experience in ANSI C++.
Experience in supporting HPC systems for scientific and research computing.
Experience with a programming language.
Experience with cluster management tools, both proprietary and open source.
Experience with multiple UNIX/Linux based operating systems.
Experience with virtualization.
Experience with cloud computing, especially HPC in the cloud.
Experience with parallel or distributed file systems.
Experience with hardware troubleshooting and component replacement.
Occasional travel to the NCAR Wyoming Supercomputer Center, which is approximately 90 miles north of Boulder, is required.
Periodic on-call support in rotation with other staff is required for this position.
Providing assessment and feedback on vendor technology roadmaps and RFIs/RFPs to the SSG group head and the HSS section head is also required.
What’s In It for You:
Benefits (Medical, Dental, Vision)
The University Corporation for Atmospheric Research (UCAR) is an equal opportunity/equal access/affirmative action employer that strives to develop and maintain a diverse workforce. UCAR is committed to providing equal opportunity for all employees and applicants for employment and does not discriminate on the basis of race, age, creed, color, religion, national origin or ancestry, sex, gender, disability, veteran status, genetic information, sexual orientation, gender identity or expression, or pregnancy.
Whatever your intersection of identities, you are welcome at the University Corporation for Atmospheric Research (UCAR). We are committed to inclusivity and promoting an equitable environment that values and respects the uniqueness of all members of our organization.