Site Reliability Engineer

We’re looking for highly motivated, passionate site reliability engineers to join our growing team. At evertz.io, our teams are building services that are used by the biggest names in the exciting broadcast and media industry. Our services are hosted in AWS, with a Serverless First mindset.

As part of this role, you will work with our talented teams to help harden our multi-tenant SaaS platform. Using best-in-class observability tooling, you will debug incidents while also identifying and implementing improvements to the platform to ensure its continued reliability. Your drive to eliminate toil will see you automating processes and building the tools to do so.

We offer flexible working hours, great benefits, and the freedom to experiment with new technologies and tools to build better products.

Skills and Experience you will bring:

  • 2 years of experience managing critical production infrastructure and maintaining reliability and uptime of applications.
  • 2 years of experience with monitoring, log-aggregation, and observability services like Datadog, CloudWatch, Honeycomb, Splunk, and New Relic.
  • 2 years of experience implementing and managing production CI/CD pipelines using modern deployment mechanisms such as blue/green deployment.
  • 2 years of experience translating SLOs and SLIs into actionable improvements. Reliability, monitoring, and observability are not just words to you.
  • Solid foundation in Linux systems administration, networking, and security.

Additional skills and experience that will be useful:

  • Experience with serverless applications running in the cloud.
  • Experience with security frameworks such as OWASP, ISO, CSA and PCI.
  • Experience conducting threat assessments and creating remediation plans based on the results of threat assessments.
  • Experience with penetration testing, threat modelling, and both open-source and commercial security tools.
  • Experience developing new deployment mechanisms for webapp infrastructure, such as canary, A/B, blue/green, red-line, and other deployment patterns.
  • Deep knowledge of performance tuning of core AWS services like Lambda, DynamoDB, API Gateway, SQS, EventBridge, and EC2.
  • Experience with chaos engineering that pushes systems and products to their limits to see how they will respond to unexpected events.

About the Role:

The evertz.io Engineering Team builds next-generation systems for content management and distribution in the Media and Entertainment industry. Disney, NBCUniversal, Discovery, BBC, and many other content producers and publishers use our products and services to make the most of their file-based and live content for the least effort.

We work with high-quality video in real-time and non-real-time scenarios across a wide range of cutting-edge tech. Specializations within the group span from low-level video manipulation and analysis, through back-end management and orchestration services, to web-delivered UIs. Working in parallel with these teams is the Scientific Computing Group, who work in computer vision, data science and machine learning, taking experiments in Jupyter notebooks through to deployment in production. This makes for a challenging and rewarding engineering experience of continual learning, with plenty of opportunity to explore different parts of the stack.

Our technology stack includes a Serverless microservice architecture that capitalizes on the full breadth of AWS services, with code written in Python, Rust and Java. Our UI uses the latest versions of Angular, TypeScript and NgRx. Our CI/CD pipelines leverage AWS, Jenkins, Nexus, and Bazel, in addition to our in-house release-management application, to build and release hundreds of software components.

As a Site Reliability Engineer, you will join our talented and passionate team building evertz.io: a collection of services that will be used by the biggest names in the exciting broadcast and media industry.

“Work is a thing you do, not a place you go”

We work in agile, low-bureaucracy, high-creativity, cross-functional teams spread across the world. It’s a highly creative work environment where we support your growth with opportunities for career progression, mentoring others and third-party education. The team is built on trust and is relaxed, open and welcoming to all, and there’s fun to be had with regular social events and sports teams.

As part of this role, you will be expected to:

  • Use various monitoring, log-aggregation, and observability services like AWS CloudWatch and Honeycomb to troubleshoot and resolve issues rapidly
  • Implement and maintain CI/CD pipelines on AWS using CodeCommit, CodePipeline and CodeDeploy
  • Foster a culture of reliability best practices across the evertz.io teams through the use of SLIs and SLOs and by implementing changes directly in the codebase
  • Establish and measure reliability goals such as uptime, downtime, mean time between failures, mean time to resolution, etc.
  • Conduct and document root cause analyses (RCAs) and post-incident reviews
  • Participate in an on-call rotation

Location

This role allows you to work with “Full Flexibility”: for any work where being physically close to fixed equipment is not a requirement, you have the option to work remotely.

Remote working is not the same as working from home; WFH is just one very common option. You can work from wherever gets the creative juices flowing: coffee shops, co-working places, the park, a different country even! Anywhere with Internet access.

Of course, working from an office is an option too, especially if you’re craving some ad hoc in-person interaction! Evertz has offices in Canada, England, Scotland, India, Singapore, Hong Kong, Virginia, California, Arizona, Ohio, Hungary, Belgium, Poland and Australia. Many have great spaces for meet-ups as well as permanent or floating desk space.

Working Hours

This role allows you to work asynchronously, meaning you can contribute at the times when you do your best work. Some people are early birds, some are night owls, maybe Saturday is better than Wednesday? Whilst some overlap for core meetings is needed, you don’t have to do your deep work between 9 and 5.

Salary & Benefits

We offer a competitive salary with an annual performance-based bonus and stock option schemes. Benefits include a pension plan; an employer-funded health and medical plan; a life insurance plan; long-term disability coverage; paid time off; an employee assistance program; and a discount platform. The availability and specifics of these benefits vary by location; details will be provided during the hiring process.