Top Secret/SCI

Polygraph

IT - Hardware

Bethesda, MD (On-Site/Office)

Location: Bethesda, MD

Category: Software Engineering

Travel Required: No

Remote Type: Hybrid Remote

Clearance: Top Secret/SCI

Sunayu LLC is seeking a skilled and motivated Linux Server/NVIDIA GPU Administrator/Engineer to support the National Media Exploitation Center (NMEC). You will play a key role in maintaining and optimizing a high-performance computing environment built on NVIDIA DGX-1 and A100 servers, both physical and virtual. This is an excellent opportunity to work with cutting-edge technology and contribute to critical national security missions.

Responsibilities:

System Administration:

Manage and maintain Linux servers (Red Hat/CentOS, Ubuntu) in a multi-enclave enterprise environment.
Provide technical support, administration, and monitoring for Linux systems and NVIDIA DGX-1 and A100 servers in physical and virtual environments.
Troubleshoot hardware and software issues, including server failures, network connectivity problems, and application errors.
Implement security updates, patches, and configurations to harden systems and protect against vulnerabilities.
Monitor system performance and resource utilization, identifying and resolving bottlenecks.
Automate system administration tasks using scripting languages like Bash and Python.

DevOps and Configuration Management:

Use DevOps tools (Ansible, Salt, GitLab) to automate configuration management, software updates, and system maintenance.
Maintain and improve system availability through proactive monitoring and automation.
Collaborate with developers and hardware architects to debug issues, define new requirements, and optimize workflows.

Resource Management:

Monitor the resource management system (Slurm) to keep resource allocation efficient and aligned with organizational priorities.
Work directly with users and management to plan and allocate resources effectively.
Communicate clearly and proactively regarding resource availability and scheduling.

Incident Response and Support:

Provide technical support to users, troubleshooting issues and resolving incidents in a timely manner.
Analyze recurring problems and implement solutions to prevent recurrence.
Document incident resolution steps and contribute to root cause analysis efforts.
Participate in an on-call rotation to provide 24/7/365 support during outages and emergencies.

Qualifications:

Required:

Bachelor's degree in Computer Science or a related field and 6+ years of relevant experience (additional experience may be considered in lieu of a degree).
2+ years of experience administering Linux servers (Red Hat/CentOS, Ubuntu).
Hands-on experience troubleshooting server hardware failures.
Proficiency with configuration management tools (Ansible, Salt).
Strong understanding of networking services (DNS, NFS, LDAP, DHCP).
Experience with shell scripting and/or Python for automation.
Knowledge of Linux security best practices.
Excellent troubleshooting and problem-solving skills.
Strong communication and interpersonal skills.
Ability to work independently and as part of a team.
DoD 8570.01-M IAT Level II certification (Security+ CE, CCNA-Security, GSEC, or SSCP) and an appropriate computing environment (CE) certification.

Preferred:

Experience with container technologies (Docker, Kubernetes).
Familiarity with monitoring tools (Prometheus/Grafana).
Knowledge of distributed resource scheduling systems (Slurm, LSF).
Experience with CUDA and GPU-accelerated computing systems.
Basic understanding of deep learning frameworks and algorithms.

Clearance:

Due to the nature of the government contract we support, U.S. citizenship is required.
TS/SCI with polygraph is required for this position.

group id: 90958040

Linux Engineer (GPU) - Top Secret/SCI

Sunayu, LLC
