Linux Engineer (GPU)

CCS Global Tech

Today
Top Secret/SCI
IT - Hardware
Bethesda, MD (On-Site/Office)

Responsibilities:

System Administration:
  • Manage and maintain Linux servers (Red Hat/CentOS, Ubuntu) in a multi-enclave enterprise environment.
  • Provide technical support, administration, and monitoring for Linux systems and NVIDIA DGX-1 and
    A100 servers in physical and virtual environments.
  • Troubleshoot hardware and software issues, including server failures, network connectivity problems, and application errors.
  • Implement security updates, patches, and configurations to harden systems and protect against vulnerabilities.
  • Monitor system performance and resource utilization, identifying and resolving bottlenecks.
  • Automate system administration tasks using scripting languages like Bash and Python.

DevOps and Configuration Management:
  • Use DevOps tools (Ansible, Salt, GitLab) to automate configuration management, software updates, and system maintenance.
  • Maintain and improve system availability through proactive monitoring and automation.
  • Collaborate with developers and hardware architects to debug issues, define new requirements, and optimize workflows.

Resource Management:
  • Monitor the resource management system (Slurm) to keep resource allocation efficient and
    aligned with organizational priorities.
  • Work directly with users and management to plan and allocate resources effectively.
  • Communicate clearly and proactively regarding resource availability and scheduling.

Incident Response and Support:
  • Provide technical support to users, troubleshooting issues and resolving incidents in a timely manner.
  • Analyze recurring problems and implement solutions to prevent recurrence.
  • Document incident resolution steps and contribute to root cause analysis efforts.
  • Participate in an on-call rotation to provide 24/7/365 support during outages and emergencies.

Qualifications:

Required:
  • Bachelor's degree in Computer Science or a related field and 6+ years of relevant experience (additional experience may be considered in lieu of a degree).
  • 2+ years of experience administering Linux servers (Red Hat/CentOS, Ubuntu).
  • Hands-on experience troubleshooting server hardware failures.
  • Proficiency with configuration management tools (Ansible, Salt).
  • Strong understanding of networking services (DNS, NFS, LDAP, DHCP).
  • Experience with shell scripting and/or Python for automation.
  • Knowledge of Linux security best practices.
  • Excellent troubleshooting and problem-solving skills.
  • Strong communication and interpersonal skills.
  • Ability to work independently and as part of a team.
  • DoD 8570.11 IAT Level II certification (Security+ CE, CCNA Security, GSEC, or SSCP) and an appropriate computing environment (CE) certification.

Preferred:
  • Experience with container technologies (Docker, Kubernetes).
  • Familiarity with monitoring tools (Prometheus/Grafana).
  • Knowledge of distributed resource scheduling systems (Slurm, LSF).
  • Experience with CUDA and GPU-accelerated computing systems.
  • Basic understanding of deep learning frameworks and algorithms.
group id: 10290999

Job Category
IT - Hardware
Clearance Level
Top Secret/SCI