Journey with us! Combine your career goals and sense of adventure by joining our exciting team of employees. Royal Caribbean Group is pleased to offer a competitive compensation and benefits package and excellent career development opportunities, each offering unique ways to explore the world.
We are proud to be the vacation-industry leader with global brands (including Royal Caribbean International, Celebrity Cruises, and Silversea Cruises), the most innovative fleet and private destinations, and the best people. Together, we are dedicated to turning the vacation of a lifetime into a lifetime of vacations for our guests.
The Royal Caribbean Group’s Site Reliability Team has an exciting career opportunity for a full-time Senior Engineer, Site Reliability, reporting to the Senior Manager, Site Reliability.
This position is onsite and based in Miramar, Florida.
This position is also not eligible for work authorization sponsorship.
Position Summary:
We are seeking a highly skilled Senior Site Reliability Engineer to own, operate, and continuously mature our enterprise observability platform across one of the most complex hospitality and maritime technology environments in the world. This role is the engineering backbone of RCG’s observability practice, responsible for ensuring deep, reliable system visibility across 950+ applications serving 100,000+ users across Royal Caribbean International, Celebrity Cruises, and Silversea.
You will operate at the intersection of infrastructure, application performance, network intelligence, and AIOps, driving measurable improvements in mean-time-to-detect (MTTD), mean-time-to-resolve (MTTR), and overall service reliability. This is a platform engineering and standards leadership role, not a tool administration position.
Key Responsibilities:
Platform Ownership & Architecture
- Own and evolve the enterprise observability platform spanning Cisco AppDynamics, Splunk, ThousandEyes, and PagerDuty AIOps across AWS and Azure environments
- Architect and enforce a unified telemetry strategy (metrics, logs, traces, and events) standardized via OpenTelemetry across all application tiers
- Design and govern telemetry data pipelines, including ingestion, filtering, routing, and retention, to optimize signal quality and platform cost at enterprise scale
- Drive full-stack observability coverage across ship and shore environments, including maritime network paths, contact center platforms, and revenue-critical booking systems
SLIs, SLOs & Reliability Engineering
- Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for all critical services across RCG’s three brands
- Build alerting frameworks that minimize noise, surface actionable signals, and integrate cleanly with PagerDuty AIOps on-call workflows
- Partner with SRE teams to drive MTTR reduction, post-incident observability improvements, and proactive reliability practices
- Instrument and publish DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) to support engineering productivity and release confidence
AIOps & Intelligent Detection
- Drive AI-assisted incident detection, anomaly correlation, and root cause analysis using PagerDuty AIOps and Splunk IT Service Intelligence (ITSI)
- Tune and mature ML-based alert grouping and noise suppression models to reduce alert fatigue and accelerate triage
- Integrate observability signals with ServiceNow ITSM for automated incident creation, enrichment, and closed-loop resolution workflows
Kubernetes & Cloud-Native Observability
- Enable and govern Kubernetes observability for EKS and AKS workloads: container health, resource utilization, pod-level tracing, and cluster performance
- Integrate observability instrumentation into CI/CD pipelines (GitHub Actions) to enable deployment-correlated performance analysis
- Maintain and extend AWS CloudWatch and Azure Monitor integrations to ensure cloud infrastructure is fully represented in the observability estate
Standards, Enablement & Technical Leadership
- Define observability standards, instrumentation best practices, and onboarding frameworks for product and platform engineering teams
- Mentor junior engineers and serve as the technical authority for observability discipline across SRE and Platform Engineering
- Lead post-incident reviews (PIRs) and translate findings into observability platform improvements
- Govern observability cost optimization: telemetry volume management, retention tiering, and platform licensing efficiency
Required Qualifications
- 6–9+ years in Observability, SRE, or Platform Engineering in enterprise-scale environments
- Deep hands-on expertise with Cisco AppDynamics: APM configuration, business transaction mapping, code-level diagnostics, and baseline management
- Strong proficiency with Splunk: SPL query development, ITSI service health trees, KPI configuration, alert policy management, and log pipeline design
- Experience with Cisco ThousandEyes for network path monitoring, ISP/WAN intelligence, and BGP-level visibility
- Proficiency with PagerDuty AIOps: intelligent alert grouping, noise suppression, event orchestration, and on-call workflow design
- Strong command of OpenTelemetry: collector configuration, SDK instrumentation, semantic conventions, and multi-backend exporting
- Hands-on Kubernetes experience (EKS/AKS): container observability, resource metrics, and pod-level distributed tracing
- Experience with AWS CloudWatch and/or Azure Monitor for cloud infrastructure observability
- Scripting and automation proficiency: Python, Bash, Terraform, and/or Ansible for observability tooling deployment and configuration
- Experience defining SLIs/SLOs, error budgets, and actionable alerting strategies tied to business service reliability
- ServiceNow ITSM integration experience: event management, incident auto-creation, and CMDB-enriched alerting
- Experience with CI/CD observability integration (GitHub Actions or equivalent)
Preferred Qualifications
- Experience with Prometheus, Grafana, Loki, or Tempo for supplemental or hybrid observability architectures
- Familiarity with eBPF-based observability tooling (e.g., Pixie, Cilium) for deep kernel-level and network-layer visibility
- Experience with synthetic monitoring and real user monitoring (RUM) to capture end-user experience across digital channels
- Familiarity with Cribl or equivalent telemetry pipeline tooling for data routing, enrichment, and cost governance
- Exposure to DORA metrics instrumentation and developer experience observability frameworks
- Experience in large-scale hospitality, travel, maritime, or consumer digital platforms
- Certifications: Cisco AppDynamics Certified Associate, Splunk Core Certified Power User, AWS Solutions Architect, Kubernetes (CKA/CKAD), or OpenTelemetry Certified Associate (OTCA/CNCF)
Agency and Third-Party Submissions: Please note this is a direct search by the Company, and applications through agencies and other third parties will not be accepted, nor will fees be paid for unsolicited resumes. Any unsolicited resumes will be considered the Company's property.
We know there's a lot to consider. As you go through the application process, our recruiters will be glad to provide guidance and more relevant details to answer any additional questions. Thank you again for your interest in Royal Caribbean Group. We hope to see you onboard soon!
It is the policy of the Company to ensure equal employment and promotion opportunity to qualified candidates without discrimination or harassment on the basis of race, color, religion, sex, age, national origin, disability, sexual orientation, sexuality, gender identity or expression, marital status, or any other characteristic protected by law. Royal Caribbean Group and each of its subsidiaries prohibit and will not tolerate discrimination or harassment.