RELIABILITY ENGINEER

Blue Obsidian Solutions

Today
Top Secret/SCI
Unspecified
Unspecified
IT - Software
Remote/Hybrid (Off-Site/Hybrid)

We are seeking a skilled and proactive Reliability Engineer to join our team. Reliability engineers are responsible for identifying potential issues or areas for improvement by analyzing data and recognizing patterns within it. Once problems are detected, the reliability engineer develops and implements solutions to prevent them, ultimately enhancing the reliability of systems, equipment, and processes.

Responsibilities:
  • Analyzing equipment failure data to detect patterns and trends.
  • Conducting root cause analysis to identify the underlying causes of issues.
  • Creating and implementing new maintenance procedures.
  • Designing and establishing new protocols for monitoring and testing equipment.
  • Exploring new technologies and processes to enhance equipment performance and reliability.
  • Developing and executing training programs for employees.
  • Collaborating with other departments to ensure reliability is incorporated into all areas of the organization.
  • System Reliability: Design and implement strategies to improve the availability, reliability, and performance of critical systems and applications.
  • Incident Management: Lead root cause analysis for major incidents, identify systemic issues, and implement long-term solutions to prevent recurrences.
  • Monitoring and Alerting: Develop and maintain robust monitoring systems to detect issues proactively and optimize alerting mechanisms to ensure timely response.
  • Capacity Planning: Analyze system usage patterns to predict future growth, optimize capacity, and ensure scalability.
  • Failure Analysis: Conduct thorough failure analysis and implement fault-tolerant systems to minimize the impact of potential failures.
  • Collaboration: Work closely with software engineering, DevOps, and infrastructure teams to design reliable architecture and improve operational workflows.
  • Documentation: Create and maintain comprehensive documentation of reliability practices, system designs, and incident reports.
  • Continuous Improvement: Regularly evaluate current processes and systems, identifying areas for improvement and implementing enhancements.

Required Skills/Qualifications/Duty Experience

Essential:
  • The ability to think critically and logically, and to work with large datasets to draw meaningful conclusions.
  • A solid understanding of the systems, equipment, and processes involved, including knowledge of engineering principles and specific organizational systems.
  • The capacity to think creatively and develop innovative solutions for complex issues.
  • The ability to clearly explain technical concepts and collaborate effectively with others.
  • The ability to recognize potential hazards and take appropriate actions to mitigate risks.
  • Hands-on experience with cloud platforms (AWS, Azure, GCP) and securing cloud environments.
  • Strong understanding of containerization technologies (Docker, Kubernetes) and their security.
  • Knowledge of security tools like SAST, DAST, vulnerability scanners, and SIEM solutions.
  • Strong experience in system reliability, site reliability engineering (SRE), or a similar role.
  • Proficiency in cloud platforms (AWS, Azure, GCP) and associated reliability tools.
  • Hands-on experience with monitoring and logging tools such as Prometheus, Grafana, Datadog, Splunk, or ELK stack.
  • Proficiency in scripting languages like Python, Bash, or Go for automation.
  • Familiarity with containerization and orchestration tools (Docker, Kubernetes).
  • Strong understanding of distributed systems, fault-tolerant design, and high-availability architectures.
  • Experience in root cause analysis and implementing systemic improvements.


Preferred:
  • Certifications in cloud platforms (e.g., AWS Certified Solutions Architect, Google Cloud Engineer).
  • Certifications such as Security+, CCNA/CCNP, or Linux+.
  • Experience in capacity planning and performance tuning of large-scale systems.
  • Familiarity with chaos engineering practices.
  • Strong communication skills and the ability to work collaboratively with cross-functional teams.


Security Requirements
  • Must possess and maintain a TS/SCI clearance at time of hire


Education/Certification Requirements
  • Bachelor's degree in program or project management, information technology, or a related technical discipline; or the equivalent


Travel:
  • Ability to travel as needed, estimated at less than 25% of the time


What We Offer:
  • Competitive salary and performance-based incentives
  • Comprehensive health, dental, and vision benefits
  • Opportunities for professional development and certifications
  • Flexible work environment with hybrid or remote options
  • A supportive, innovative, and growth-oriented culture


How to Apply:

If you are passionate about building and maintaining reliable systems and thrive in a fast-paced environment, we want to hear from you! Please submit your resume and a cover letter detailing your experience and enthusiasm for the role to hr@blueobsidian.solutions.

Job Category
IT - Software
Clearance Level
Top Secret/SCI