Location: Bethesda, MD
Category: Software Engineering
Travel Required: No
Remote Type: Hybrid Remote
Clearance: Top Secret/SCI
Sunayu LLC is seeking a skilled and motivated Linux Server/NVIDIA GPU Administrator/Engineer to support the National Media Exploitation Center (NMEC). You will play a key role in maintaining and optimizing a high-performance computing environment built on NVIDIA DGX-1 and A100 servers, both physical and virtual. This is an excellent opportunity to work with cutting-edge technology and contribute to critical national security missions.
Responsibilities:
System Administration:
- Manage and maintain Linux servers (Red Hat/CentOS, Ubuntu) in a multi-enclave enterprise environment.
- Provide technical support, administration, and monitoring of Linux systems and NVIDIA DGX-1 and A100 servers in physical and virtual environments.
- Troubleshoot hardware and software issues, including server failures, network connectivity problems, and application errors.
- Implement security updates, patches, and configurations to harden systems and protect against vulnerabilities.
- Monitor system performance and resource utilization, identifying and resolving bottlenecks.
- Automate system administration tasks using scripting languages like Bash and Python.
DevOps and Configuration Management:
- Use DevOps tools (Ansible, Salt, GitLab) to automate configuration management, software updates, and system maintenance.
- Maintain and improve system availability through proactive monitoring and automation.
- Collaborate with developers and hardware architects to debug issues, define new requirements, and optimize workflows.
Resource Management:
- Monitor the resource management system (Slurm) to keep resource allocation efficient and aligned with organizational priorities.
- Work directly with users and management to plan and allocate resources effectively.
- Communicate clearly and proactively regarding resource availability and scheduling.
Incident Response and Support:
- Provide technical support to users, troubleshooting issues and resolving incidents in a timely manner.
- Analyze recurring problems and implement solutions to prevent recurrence.
- Document incident resolution steps and contribute to root cause analysis efforts.
- Participate in on-call rotation to provide 24/7/365 support during outages and emergencies.
Qualifications:
Required:
- Bachelor's degree in Computer Science or a related field and 6+ years of relevant experience (additional experience may be considered in lieu of a degree).
- 2+ years of experience administering Linux servers (Red Hat/CentOS, Ubuntu).
- Hands-on experience troubleshooting server hardware failures.
- Proficiency with configuration management tools (Ansible, Salt).
- Strong understanding of networking services (DNS, NFS, LDAP, DHCP).
- Experience with shell scripting and/or Python for automation.
- Knowledge of Linux security best practices.
- Excellent troubleshooting and problem-solving skills.
- Strong communication and interpersonal skills.
- Ability to work independently and as part of a team.
- DoD 8570.01-M IAT Level II certification (Security+ CE, CCNA-Security, GSEC, or SSCP) and an appropriate computing environment (CE) certification.
Preferred:
- Experience with container technologies (Docker, Kubernetes).
- Familiarity with monitoring tools (Prometheus/Grafana).
- Knowledge of distributed resource scheduling systems (Slurm, LSF).
- Experience with CUDA and GPU-accelerated computing systems.
- Basic understanding of deep learning frameworks and algorithms.
Clearance:
- Due to the nature of the government contract we support, U.S. citizenship is required.
- TS/SCI with polygraph is required for this position.