Site Reliability Engineer

Skills and Experience you will bring:

  • 3 years of experience managing critical production infrastructure and maintaining reliability and uptime of serverless applications running in the cloud.
  • 2 years of experience with monitoring, log-aggregation, and observability services like Datadog, CloudWatch, Honeycomb, Splunk, and New Relic.
  • 2 years of experience implementing and managing production CI/CD pipelines using modern deployment mechanisms such as blue/green deployment.
  • 2 years of experience translating SLOs and SLIs into actionable improvements. Reliability, monitoring, and observability are not just words to you.
  • Solid foundation in Linux systems administration, networking, and security.

Additional skills and experience that will be useful:

  • Experience with security frameworks such as OWASP, ISO, CSA and PCI.
  • Experience conducting threat assessments and creating remediation plans based on the results of threat assessments.
  • Experience with penetration testing, threat modelling, and open-source and commercial security tools.
  • Experience developing new deployment mechanisms for webapp infrastructure, such as canary, A/B, blue/green, red-line, and other deployment patterns.
  • Deep knowledge of performance tuning of core AWS services like Lambda, DynamoDB, API Gateway, SQS, EventBus, and EC2.
  • Experience with chaos engineering that pushes systems and products to their limits to see how they will respond to unexpected events.

About the Role:

The Engineering Team builds next-generation systems for content management and distribution in the Media and Entertainment industry. Disney, NBCUniversal, Discovery, BBC, and many other content producers and publishers use our products and services to make the most of their file-based and live content for the least effort.

We work with high-quality video in real-time and non-real-time scenarios across a wide range of cutting-edge tech. Specializations within the group span from low-level video manipulation and analysis, through back-end management and orchestration services, to web-delivered UIs. Working in parallel with these teams is the Scientific Computing Group, who work in computer vision, data science, and machine learning, taking experiments in Jupyter notebooks through to deployment in production. This makes for a challenging and rewarding engineering experience of continual learning, with plenty of opportunity to explore different parts of the stack.

Our technology stack includes a Serverless microservice architecture that capitalizes on the full breadth of AWS services, with code written in Python, Rust, and Java. Our UI uses the latest versions of Angular, TypeScript, and NgRx. Our CI/CD pipelines leverage AWS, Jenkins, Nexus, and Bazel, in addition to our in-house release-management application, to build and release hundreds of software components.

As a Site Reliability Engineer, you will join our talented and passionate team building a collection of services that will be used by the biggest names in the exciting broadcast and media industry. Our services are hosted in AWS, with a Serverless First mindset.

“Work is a thing you do, not a place you go”

We work in agile, low-bureaucracy, high-creativity, cross-functional teams spread across the world. We support your growth with opportunities for career progression, mentoring others, and third-party education. The team is built on trust and is relaxed, open, and welcoming to all, and there’s fun to be had with regular social events and sports teams.

As part of this role, you will be expected to:

  • Establish and measure reliability goals such as uptime, downtime, mean time between failures (MTBF), and mean time to resolution (MTTR).
  • Define operational maturity by defining and implementing SLIs and SLOs, enable faster detection and isolation of failures, and proactively work to mitigate them.
  • Participate in an on-call rotation.
  • Participate in daily scrum standups, sprint planning, and other team rituals including retrospectives.
  • Implement and maintain CI/CD pipelines on AWS using CodeCommit, CodePipeline, and CodeDeploy.
  • Evaluate, implement, and use monitoring, log-aggregation, and observability services such as AWS CloudWatch and Honeycomb to troubleshoot and resolve issues rapidly.
  • Conduct and document root cause analyses (RCAs) and post-incident reviews.
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding.


This role allows you to work with “Full Flexibility”: for any work where being physically close to fixed equipment is not a requirement, you have the option to work remotely.

Remote working is not the same as working from home; WFH is just one very common option. You can work from wherever gets the creative juices flowing: coffee shops, co-working spaces, the park, a different country even! Anywhere with Internet access.

Of course, working from an office is an option too, especially if you’re craving some ad hoc in-person interaction! Evertz has offices in Canada, England, Scotland, India, Singapore, Hong Kong, Virginia, California, Arizona, Ohio, Hungary, Belgium, Poland, and Australia. Many have great spaces for meet-ups as well as permanent or floating desk space.

Working Hours

This role allows you to work asynchronously, meaning you can contribute at the times when you do your best work. Some people are early birds, some are night owls, and maybe Saturday works better for you than Wednesday. Whilst some overlap for core meetings is needed, you don’t have to do your deep work between 9 and 5.

Salary & Benefits

We offer a competitive salary with an annual performance-based bonus and stock option schemes. Benefits include a pension plan, an employer-funded health and medical plan, a life insurance plan, long-term disability coverage, paid time off, an employee assistance program, and a discount platform. The availability and specifics of these benefits vary by location; details will be provided during the hiring process.