Building OpenStatus: A Deep Dive into Our Infrastructure Architecture

Dec 29, 2024 • 4 min read

engineering

Infrastructure Overview

OpenStatus is a synthetic monitoring platform designed with resilience, scalability, and efficiency in mind. Our users rely on us to provide real-time insights into their service health, making it essential to maintain a robust and performant infrastructure.

In this post, we'll take a deep dive into our infrastructure architecture, exploring the key components, managed services, and design principles that power OpenStatus.

Application Landscape

Our platform consists of several interconnected applications, each designed for a specific purpose:

Frontend Ecosystem:
- A Next.js application, hosted on Vercel, that powers our marketing site, user dashboard, and status page.
- An Astro + Starlight documentation site hosted on Cloudflare Pages.

We chose Vercel for the Next.js application because Next.js performs exceptionally well there and the developer experience (DX) is great. We selected Cloudflare Pages for the documentation because it is a static site and hosting it there is very cheap.

Backend Infrastructure: All our backend services are hosted on Fly.io.
- API server: our public API and our alerting engine.
- Probes/Checker: a Go application deployed globally to monitor your services.
- Screenshot app: a Playwright-based service that takes a screenshot of your website when we detect downtime.
- Workflow engine: a server that handles alerting workflows and our internal workflows (email automation).

We chose Fly.io for our backend services because it's a great platform for deploying globally distributed services, and it is very easy to deploy to and manage. We are planning to run our probes on additional providers (e.g. Koyeb) to make the system more resilient.

Managed Services

We also rely heavily on managed services so that we don't have to run that infrastructure ourselves. Here are the services we use:

Scheduling

Because monitoring is time-critical, we rely heavily on cron jobs to ensure checks run on time:

Cron Jobs: We currently use Vercel Cron and plan to migrate to Google Cron for a better user experience (a better UI where, for example, we can see when the cron ran, plus a retry policy).

Queue Architecture

Due to the critical nature of checks, we are using a queue to handle task processing and retry logic:

Every check is pushed to a queue and processed by our probes. If the probe fails to process the check, it is retried 3 times before being marked as failed.
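
The retry rule above can be sketched in a few lines. This is only an illustration, not our actual implementation: in practice the retries are presumably driven by the queue's retry policy rather than by in-process code. The names `CheckResult` and `runWithRetries` are hypothetical, and the sketch assumes "retried 3 times" means one initial attempt plus up to three retries.

```typescript
// Hypothetical types and names, for illustration only.
type CheckResult = { status: "ok" | "failed"; attempts: number };

async function runWithRetries(
  check: () => Promise<void>,
  maxRetries = 3, // from the post: retried 3 times before being marked as failed
): Promise<CheckResult> {
  // Assumption: one initial attempt, then up to `maxRetries` retries.
  const maxAttempts = maxRetries + 1;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await check();
      return { status: "ok", attempts: attempt };
    } catch {
      // The check failed; fall through and try again if attempts remain.
    }
  }
  // Every attempt failed, so the check is marked as failed.
  return { status: "failed", attempts: maxAttempts };
}
```

A check that fails twice and then succeeds would come back as `ok` on the third attempt, while a check that always fails would be marked `failed` after the fourth.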

Job Queue: Google Task Queues provides our distributed task management, with separate queues for different check frequencies.

We've implemented a granular queue system to ensure efficient task processing: each queue is dedicated to a specific check frequency (e.g. every minute, every 10 minutes).
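
As a rough sketch of what "one queue per frequency" could look like, the snippet below groups monitors by the queue that matches their check frequency. The frequency values, the queue names, and the `Monitor` shape are all assumptions made for this example; only the "every minute" and "every 10 minutes" frequencies come from the text above.

```typescript
// Hypothetical frequencies and queue names, for illustration only.
type Frequency = "1m" | "10m";

const QUEUE_BY_FREQUENCY: Record<Frequency, string> = {
  "1m": "checks-every-1m",
  "10m": "checks-every-10m",
};

interface Monitor {
  id: string;
  url: string;
  frequency: Frequency;
}

// Group monitors by the queue dedicated to their check frequency.
function groupByQueue(monitors: Monitor[]): Map<string, Monitor[]> {
  const byQueue = new Map<string, Monitor[]>();
  for (const monitor of monitors) {
    const queue = QUEUE_BY_FREQUENCY[monitor.frequency];
    const bucket = byQueue.get(queue) ?? [];
    bucket.push(monitor);
    byQueue.set(queue, bucket);
  }
  return byQueue;
}
```

Keeping the mapping in a single lookup table means a slow or backed-up queue for one frequency does not delay checks that run on another.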

Data Infrastructure

We also don't want to manage the data infrastructure ourselves, so we rely on managed services for that as well:

Primary Database: Turso, providing cost-efficient data storage. We love the fact that it's a hosted SQLite database: it's just a file that we can embed in our services and sync periodically.
Analytics Database: Tinybird, enabling complex analytical queries and insights.

Design Philosophy

Our infrastructure design is driven by several key principles:

Resilience: Ensuring high availability and fault tolerance
Scalability: Architectural choices that allow seamless growth
Cost-Efficiency: Leveraging managed services and cloud credits
Performance: Optimizing each component for maximum efficiency

How much does it cost us?

Our current monthly cost is around $328. This includes:

Vercel: $40. There are two of us on the team, so we had to upgrade to the team plan.
Fly.io: $154 = 36 × $4 (all our probes, at an average of $4 each, since not all regions cost the same) + $10 (for the API server)
Google Cloud Platform: $0 (We are still using the free credits, but we expect to pay around $50 for the queue)
Tinybird: $100
Turso: $29
Cloudflare: $5
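
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the total from the line items above. The variable names are made up for this example, and the projected figure simply adds the roughly $50 we expect to pay for the queue once the free credits run out.

```typescript
// Figures taken from the breakdown above; names are illustrative.
const probeCount = 36;
const averageProbeCost = 4; // USD per probe, averaged across regions
const apiServerCost = 10;
const flyCost = probeCount * averageProbeCost + apiServerCost; // 144 + 10 = 154

const monthlyCosts: { provider: string; usd: number }[] = [
  { provider: "Vercel", usd: 40 },
  { provider: "Fly.io", usd: flyCost },
  { provider: "Google Cloud Platform", usd: 0 }, // still covered by free credits
  { provider: "Tinybird", usd: 100 },
  { provider: "Turso", usd: 29 },
  { provider: "Cloudflare", usd: 5 },
];

const total = monthlyCosts.reduce((sum, item) => sum + item.usd, 0); // 328

// Derived estimate: the current total plus the ~$50 expected for the queue.
const projectedTotal = total + 50; // 378

console.log(`Current: $${total}, projected: $${projectedTotal}`);
```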

Conclusion

Building a resilient synthetic monitoring platform is hard. It's not just a $5 VPS that you can deploy and forget. It requires a more complex infrastructure to be able to provide a reliable service.

The drawback of this approach is that it makes it hard to offer an easily self-hostable service. That is frustrating because we are an open-source project and want to provide a self-hostable version of OpenStatus. We are working on a community edition that will be easier to deploy.

Want to start monitoring your services with OpenStatus? Sign up for free and get started today!