Site Reliability Engineer

NVIDIA - Santa Clara, CA (30+ days ago)


Are you an expert Software Engineer with a real passion for reliability? Are you looking to take the next step in your career with one of the most cutting-edge technology companies? Do you pride yourself on building fault-tolerant systems? Are you interested in bringing next-generation AI technologies to the world of healthcare?

If so, join our team at NVIDIA, where we are building cutting-edge healthcare solutions using the power of AI. Our team is charged with building highly available customer-facing systems, and we need your help to continue to grow and mature our platform.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most brilliant and talented people on the planet working for us. If you're creative and autonomous, we want to hear from you.

Most of your focus will be on creating new DevOps tools and applications and working on CI/CD pipelines, along with operational duties such as monitoring deployed applications for reliability and performance.

What we'll be doing:
Deploying and managing Kubernetes clusters and services deployed on these clusters
Delivering automation as a service
We understand that failure is part of any system, so we build countermeasures for the things we can predict will fail and monitors for everything else
Work with people from all facets of development and the business
Build and deploy APIs to accelerate automation and deployment-centric tasks of many different systems
Build a multi-cloud middle layer for deploying, monitoring, and testing services and containers
Create infrastructure as code on public clouds such as Azure
Drive for fault tolerance and reliability with a strong focus on velocity through clean contract design and implementation
Aid in architectural and design reviews of new and existing systems across different teams of developers
Engage in blameless post-mortems, incident reviews, and root cause analysis
Tackle administrative responsibilities of deployed Kubernetes clusters and applications running on those clusters
Review analytical data in Kibana dashboards and create alerts related to thresholds in uploaded data
Learn from our mistakes as we continue to take calculated risks
Participate in the on-call rotation

What we need to see:
BS plus 5 years of related experience
3+ years of experience in testing, deploying, and supporting large-scale services on Azure, AWS, or similar environments
Good working knowledge of Kubernetes and Docker
Proficiency in at least one language, such as Python, Java, or Go
Strong customer focus and interpersonal skills
Expert skills with Linux, networking, storage, and virtualization
Experience enabling automation with tools like Jenkins/Ansible/Chef/Puppet
Experience with setting up monitoring and alerting using Kibana, Logstash, Zabbix
Deep understanding of Service Oriented Architecture and RESTful APIs
Experience with build systems

Ways to stand out from the crowd:
Experience or interest in distributed systems
Machine learning or AI experience
A track record of working independently and in conjunction with team members and product groups

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.