Are you the kind of person who is passionate about server hardware? Do you love working in a visionary environment where people are constantly solving challenging technical problems?
We are looking for a talented engineer to join our cloud infrastructure team to own server definition, validation and configuration supporting both game streaming and AI services.
NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform that runs everywhere. Data scientists and researchers can now rapidly build, train, and deploy neural network models to address some of the most complicated AI challenges. GeForce NOW is the world’s first cloud-gaming service capable of streaming PC games at up to 1080p resolution and 60 frames per second.
What you will be doing:
Lead the fleet of servers that runs our cloud services
Define server SKUs for specific functions within our environment, including each server’s CPU, memory, NIC and RAID configurations
Work with server vendors to integrate NVIDIA’s requirements at the factory (BIOS and RAID configuration, firmware versions, asset tags, and custom testing and validation)
Create an accurate and up-to-date inventory of systems, including system configuration, firmware and BIOS versions, location, etc.
Automate the population of inventory data into our DCIM system
Document all SKUs in use in our environment and provide on-demand reporting of system configuration
Create, track and report system reliability metrics for all systems under management
Work with server vendors and internal teams to lead the creation of system diagnostic tools that validate hardware health
Develop a roadmap for servers, aligned with business requirements, performance, capabilities and technology trends
Work with the cloud infrastructure team to implement correct hardware health monitoring and remediation states in our data center automation system.
Define standards and inputs for the automated regression system to perform full-stack testing of various hardware, firmware, OS and key application combinations.
Create and maintain a comprehensive validation test plan for new revisions of SBIOS, VBIOS, BMC and NIC firmware
Develop, test and execute plans for fleet-wide BIOS and firmware updates
Recommend improvements to scripts and processes for server verification, burn-in and decommissioning of hyperscale servers. Participate in optimizing the technology refresh program, including data center migrations.
Own server configuration methods and management to ensure consistency across the deployed infrastructure.
What we need to see:
10+ years of server engineering experience supporting highly available, large-scale cloud service environments, and a Bachelor's degree
Solid understanding of datacenter power and cooling design
Deep knowledge of DCIM tools and integration
General understanding of IT infrastructure systems: server, network, storage, data center and colocation
Knowledge of industry data center standards, policies and methodologies
Strong communication and interpersonal skills, with the ability to present to senior management in a clear and persuasive manner.
Enthusiasm for influencing and building relationships with other software and IT functional groups such as development, server, storage and security teams.
With highly competitive salaries and a comprehensive benefits package, NVIDIA is widely considered to be one of the technology industry's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working with us, and our product lines are growing fast in some of the hottest state-of-the-art fields such as Virtual Reality, Artificial Intelligence, Deep Learning and Autonomous Vehicles.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.