Linux Engineer (GPU) - Top Secret/SCI

Sunayu, LLC

Today
Top Secret/SCI
Unspecified
Polygraph
IT - Hardware
Bethesda, MD (On-Site/Office)

Location: Bethesda, MD

Category: Software Engineering

Travel Required: No

Remote Type: Hybrid Remote

Clearance: Top Secret/SCI

Sunayu, LLC is seeking a skilled and motivated Linux Server/NVIDIA GPU Administrator/Engineer to support the National Media Exploitation Center (NMEC). You will play a key role in maintaining and optimizing a high-performance computing environment built on NVIDIA DGX-1 and A100 servers, both physical and virtual. This is an excellent opportunity to work with cutting-edge technology and contribute to critical national security missions.

Responsibilities:

System Administration:
  • Manage and maintain Linux servers (Red Hat/CentOS, Ubuntu) in a multi-enclave enterprise environment.
  • Provide technical support, administration, and monitoring for Linux systems and NVIDIA DGX-1 and A100 servers in physical and virtual environments.
  • Troubleshoot hardware and software issues, including server failures, network connectivity problems, and application errors.
  • Implement security updates, patches, and configurations to harden systems and protect against vulnerabilities.
  • Monitor system performance and resource utilization, identifying and resolving bottlenecks.
  • Automate system administration tasks using scripting languages like Bash and Python.

DevOps and Configuration Management:
  • Use DevOps tools (Ansible, Salt, GitLab) to automate configuration management, software updates, and system maintenance.
  • Maintain and improve system availability through proactive monitoring and automation.
  • Collaborate with developers and hardware architects to debug issues, define new requirements, and optimize workflows.

Resource Management:
  • Monitor the resource management system (Slurm) to keep resource allocation efficient and aligned with organizational priorities.
  • Work directly with users and management to plan and allocate resources effectively.
  • Communicate clearly and proactively regarding resource availability and scheduling.

Incident Response and Support:
  • Provide technical support to users, troubleshooting issues and resolving incidents in a timely manner.
  • Analyze recurring problems and implement solutions to prevent recurrence.
  • Document incident resolution steps and contribute to root cause analysis efforts.
  • Participate in on-call rotation to provide 24/7/365 support during outages and emergencies.

Qualifications:

Required:
  • Bachelor's degree in Computer Science or a related field and 6+ years of relevant experience (additional experience may be considered in lieu of a degree).
  • 2+ years of experience administering Linux servers (Red Hat/CentOS, Ubuntu).
  • Hands-on experience troubleshooting server hardware failures.
  • Proficiency with configuration management tools (Ansible, Salt).
  • Strong understanding of networking services (DNS, NFS, LDAP, DHCP).
  • Experience with shell scripting and/or Python for automation.
  • Knowledge of Linux security best practices.
  • Excellent troubleshooting and problem-solving skills.
  • Strong communication and interpersonal skills.
  • Ability to work independently and as part of a team.
  • DoD 8570.01-M IAT Level II certification (Security+ CE, CCNA-Security, GSEC, or SSCP) and an appropriate computing environment (CE) certification.

Preferred:
  • Experience with container technologies (Docker, Kubernetes).
  • Familiarity with monitoring tools (Prometheus/Grafana).
  • Knowledge of distributed resource scheduling systems (Slurm, LSF).
  • Experience with CUDA and GPU-accelerated computing systems.
  • Basic understanding of deep learning frameworks and algorithms.

Clearance:
  • Due to the nature of the government contract we support, U.S. citizenship is required.
  • TS/SCI with polygraph is required for this position.
About Us
Inspired engineering | Scalable solutions | Rapid deployment. We provide advanced DevOps, integration solutions, big data analytics, and cyber security.
