SRE - Monitoring & Observability (M&O): W2 role

Remote, USA | Full-time | Posted Recently

Job Description

Remote, with occasional travel to Knoxville, TN required.

As a Senior Specialist in Monitoring & Observability, you will design, implement, and standardize enterprise-grade monitoring and alerting solutions across complex, cloud-based environments. This role sits at the intersection of Observability, SRE, and Incident Management, with a focus on ensuring systems are reliable, measurable, and proactively monitored. You'll collaborate with Cloud Operations, Architecture, and Platform Engineering teams to define best practices and build resilient, insight-driven infrastructure that supports business-critical services.

Your Impact

• Implement and standardize monitoring and alerting tools across multiple cloud platforms to ensure consistent observability practices.
• Architect observability solutions with Splunk, OpenTelemetry, AWS CloudWatch, GuardDuty, Wiz, and other modern monitoring stacks.
• Design and build incident response workflows, playbooks, and dashboards for actionable insights and faster recovery.
• Define and operationalize SLOs, SLIs, and error budgets to align with reliability goals.
• Integrate observability tools with ServiceNow ITOM and CMDB for automated incident management and asset tracking.
• Collaborate with Cloud Operations and Architecture teams to ensure observability is embedded in design, build, and run phases.
• Automate monitoring configurations and embed observability into CI/CD pipelines.
• Optimize performance and reliability through log analysis, metrics correlation, and distributed tracing.
• Drive initiatives to improve MTTR, incident detection, and proactive issue prevention.
• Provide technical leadership and mentorship, sharing best practices across engineering and operations teams.

Skills & Experience

• 5-10 years of experience in infrastructure engineering, with a significant focus on monitoring and observability.
• Proven expertise with observability platforms such as Splunk, OpenTelemetry, AWS CloudWatch, GuardDuty, and Wiz.
• Strong knowledge of logging, metrics, tracing, and open standards for observability.
• Experience designing and managing incident response workflows and escalation processes.
• Hands-on experience with ServiceNow ITOM and CMDB integrations.
• Proficiency in cloud-native monitoring (AWS, Azure, GCP) and container observability (Docker, Kubernetes).
• Familiarity with SRE principles: defining SLOs, SLIs, and error budgets.
• Knowledge of automation practices and Infrastructure as Code (Terraform, CloudFormation, ARM templates).
• Strong problem-solving skills with the ability to troubleshoot complex distributed systems.
• Excellent communication, presentation, and leadership skills.

Set Yourself Apart With

• Cloud certifications such as AWS DevOps Engineer, Azure DevOps Engineer Expert, or Google Professional Cloud DevOps Engineer.
• Experience in AIOps, predictive analytics, and security-driven observability.
• Exposure to chaos engineering or performance engineering practices.
• Experience in multi-cloud and hybrid environments with advanced observability patterns.
