Site Reliability Engineer
Company: Intrepid USA
Location: Holmdel
Posted on: October 31, 2025
Job Description:
We are seeking a skilled Site Reliability Engineer (SRE) to
contribute to the reliability, scalability, and performance of our
multi-cloud SaaS platform serving thousands of customers worldwide.
This role involves hands-on technical work in incident response,
system monitoring, automation, and continuous improvement of our
platform reliability. The successful candidate will work within a
global SRE team to ensure optimal system performance and customer
satisfaction.

About Us
When you join iCIMS, you join the team
helping global companies transform business and the world through
the power of talent. Our customers do amazing things: design rocket
ships, create vaccines, deliver consumer goods globally, overnight,
with a smile. As the Talent Cloud company, we empower these
organizations to attract, engage, hire, and advance the right
talent. We’re passionate about helping companies build a diverse,
winning workforce and about building our home team. We’re dedicated
to fostering an inclusive, purpose-driven, and innovative work
environment where everyone belongs.

Responsibilities

System Monitoring & Reliability:
Monitor multi-cloud infrastructure (AWS, Azure, GCP) using New Relic, Grafana, and Sumo Logic
Maintain reliability of AWS resources, Auth0/Okta authentication, databases, and legacy applications
Implement monitoring, alerting, and dashboards for assigned systems

Incident Management & Response:
Respond to alerts and incidents within SLA timeframes
Perform root cause analysis and document findings
Create and maintain runbooks and troubleshooting procedures
Participate in 24/7 on-call rotation

Automation & Improvement:
Develop scripts to reduce manual operational overhead
Build monitoring and alerting solutions
Support infrastructure-as-code initiatives
Implement automated remediation where possible

Success Metrics:
Customer Impact: Reduced MTTR and improved customer satisfaction scores
Reliability: Achievement of 99.9% uptime SLAs across all products and regions
Proactive Prevention: Reduction in incident frequency through automated detection and prevention
Cross-functional Collaboration: Improved partnership metrics with Product, Engineering, and Customer Success teams
Automation Delivery: Complete assigned automation projects to reduce manual tasks
Knowledge Sharing: Contribute to team knowledge base and mentor junior engineers

Qualifications
4 years of experience in SRE, DevOps, or Infrastructure Engineering
Hands-on experience with AWS (required) and Azure (preferred)
Strong Linux system administration skills
Experience with monitoring tools (New Relic, Grafana, Prometheus)
Scripting skills in Python, Bash, or similar
Knowledge of databases (SQL Server, PostgreSQL, MongoDB)

Preferred Technical Experience:
SaaS experience in a global environment
Authentication and identity management systems knowledge
Cloud certifications (AWS, Azure, or Google Cloud)
Infrastructure-as-code tools (Terraform, CloudFormation)

Education/Certifications/Licenses:
Bachelor’s degree in Computer Science, Engineering, Information Systems, or related technical field
Equivalent combination of education and experience will be considered

Working Conditions:
Global role requiring flexibility for incident response and team coordination across time zones
Occasional client-facing responsibilities during critical incidents
Travel may be required for team building
Hybrid work environment with team members distributed globally

EEO Statement
iCIMS is a place where everyone belongs. We celebrate
diversity and are committed to creating an inclusive environment
for all employees. Our approach helps us to build a winning team
that represents a variety of backgrounds, perspectives, and
abilities. So, regardless of how your diversity expresses itself,
you can find a home here at iCIMS. We are proud to be an equal
opportunity and affirmative action employer. We prohibit
discrimination and harassment of any kind based on race, color,
religion, national origin, sex (including pregnancy), sexual
orientation, gender identity, gender expression, age, veteran
status, genetic information, disability, or other applicable
legally protected characteristics.

Compensation and Benefits
We
accept applications for this position on an ongoing basis until the
position is filled. Applications will be reviewed as they are
received, and qualified candidates may be contacted throughout the
posting period. The anticipated base pay range for this position is
$100,000-$140,000 annually. Final compensation will be based on
factors such as relevant experience, skills, education, internal
equity, and market data. This range aligns with our commitment to
equitable and transparent compensation practices, as required by
applicable law. Competitive health and wellness benefits include
medical, dental, vision, 401(k), dependent care, short-term and
long-term disability, life and AD&D insurance, bonding and
parental leave, mindfulness resources, an open vacation policy,
sick days, paid holidays, quiet hours each workday, and tuition
reimbursement. Benefits and eligibility may vary by location, role,
and tenure.
Keywords: Intrepid USA, East Brunswick, Site Reliability Engineer, IT / Software / Systems, Holmdel, New Jersey