Grokking System Design Interview, Volume II

Designing a Notification System
On this page

Step 1: System Definition

Step 2. Clarify and Define Requirements

Functional Requirements (Features):

Non-Functional Requirements (Scale, Performance, Constraints):

Step 3. Back-of-the-Envelope Capacity Estimation

Step 4. API Specifications

Step 5: High-Level System Design

Step 6: Detailed Component Design and Data Management

Step 7: Scalability and Performance Strategies

Let's design a notification system that works seamlessly for various applications, whether it's a social media platform, an e-commerce site, or a productivity tool.

Step 1: System Definition

A notification system is a centralized platform that delivers alerts and messages (notifications) to users on behalf of multiple separate applications (tenants). It acts as a shared service that any application – be it a social network, an e-commerce site, or a productivity tool – can use to send notifications to its users. For example, think of a system like Amazon Simple Notification Service (SNS) or Firebase Cloud Messaging, which many different apps leverage to inform users about events (like a new message, an order shipped, or a calendar reminder).

Real-world analogy: Consider how Facebook notifies you of a friend’s post, Amazon notifies you about your package delivery, and Slack alerts you about a new mention. A multi-tenant notification service is a single system capable of handling all those scenarios for many companies at once, ensuring each tenant’s data and users stay isolated from each other.

Key Entities/Terms:

  • Tenant: An independent application or client using the notification service. Each tenant has its own users and notification events (e.g., one tenant is a social media platform, another is an e-commerce site).
  • Notification: A message or alert about an event, intended for a user or set of users. For instance, “Alice liked your photo” or “Your order #1234 has shipped.”
  • User: The end recipient of notifications (each user is associated with a specific tenant/application).
  • Channel: The medium for delivering notifications. Common channels include in-app (within the application’s UI), push (mobile push notifications), email, and SMS. This system should support all major channels so tenants can reach users via their preferred method.
  • In-App Notification: Notifications shown inside an application (e.g., a badge or feed within the app). These should appear in real-time while the user is active.
  • Push Notification: Notifications delivered by mobile OS services (like Apple Push Notification service or Firebase Cloud Messaging) to a user’s device, even if the app is not open.
  • User Preferences: Settings that allow users to control notifications – for example, opting out of certain types, setting “quiet hours” (no notifications at night), or marking some alerts as high priority.
  • Multi-Tenancy: Architecture where one system serves multiple client applications in isolation. In our design, it means data and traffic for each tenant are logically separated (e.g., tenant A’s notifications and users are distinct from tenant B’s), even though they share the underlying infrastructure.
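
The entities above can be sketched as a small data model. This is a minimal illustration in Python, not the schema of any particular service: the class and field names (`UserRef`, `UserPreferences`, `enabled_channels`, `opted_out_types`, `quiet_start`, `quiet_end`, and so on) are assumptions introduced here for clarity.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timezone
from enum import Enum
from typing import List, Optional, Set


class Channel(Enum):
    """The four delivery channels named in the text."""
    IN_APP = "in_app"
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"


@dataclass(frozen=True)
class UserRef:
    """A user only exists within a tenant, so the pair is the identity."""
    tenant_id: str
    user_id: str


@dataclass
class UserPreferences:
    user: UserRef
    # Hypothetical default: every channel is enabled until the user opts out.
    enabled_channels: Set[Channel] = field(default_factory=lambda: set(Channel))
    opted_out_types: Set[str] = field(default_factory=set)
    quiet_start: Optional[time] = None  # e.g. time(22, 0)
    quiet_end: Optional[time] = None    # e.g. time(7, 0)


@dataclass
class Notification:
    notification_id: str
    recipient: UserRef
    notification_type: str  # e.g. "photo_liked" or "order_shipped"
    body: str
    channels: List[Channel]
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

Keying every user by the `(tenant_id, user_id)` pair means two tenants can reuse the same user ID without their records colliding, which is one simple way to express the logical separation described under Multi-Tenancy.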

Step 2. Clarify and Define Requirements

Before diving into design, let’s clarify what this notification service must do and the constraints it must meet.

Functional Requirements (Features):

  • Multi-Channel Delivery: Support delivering notifications via email, SMS, mobile push, and in-app channels. Every notification event can be routed to one or multiple channels as appropriate.
  • Real-Time & Batched Notifications: Handle both immediate event-driven notifications and bulk or scheduled notifications. Real-time notifications (e.g. a chat message or alert) should be delivered within ~1–2 seconds to the user. The system also supports batching use cases – for instance, an app might schedule a daily digest or a marketing campaign to millions of users at once. This requires efficient bulk dispatch (possibly by scheduling or trigger-based batch jobs) in addition to one-off, on-demand sends.
  • Message Scheduling: Allow notifications to be scheduled for future delivery (e.g., a reminder set to go out at a specific time). Instead of immediate dispatch, such requests are stored and sent at the correct time via a scheduler component. For example, a meeting reminder can be queued to send 10 minutes before the event.
  • Retry on Failure: Implement retry mechanisms for failed deliveries. If a notification fails to send on a channel (due to transient errors such as provider downtime or network issues), the system should retry sending after a short delay, possibly with exponential backoff. This ensures at-least-once delivery – the system will make multiple attempts so that temporary issues don’t cause message loss.
  • User Preferences & Opt-Out: Respect user notification preferences. The system should check if the user has unsubscribed or opted out of certain notification types or channels. For example, if a user disabled SMS notifications or opted out of promotional emails, the system must honor that and skip those channels. It should also support user-specific settings like preferred channels for certain alert types (e.g. critical alerts via SMS, newsletters via email).
  • Multi-Tenant Support: The service must isolate tenants’ data. Each tenant can only access and send notifications to its own users.
  • Notification Content & Templates: Support customizable content for notifications. Tenants (or their product teams) should be able to define message templates for each notification type, possibly with placeholders (e.g., “Hello {name}, your order {order_id} shipped.”). The service will fill in event-specific data.
  • No Channel Prioritization: All supported channels are treated equally by the system. If an event triggers notifications on multiple channels, each is delivered independently without prioritizing one over the other.
  • In-App Notification Management: Support in-app messaging – notifications that appear within the application’s UI (e.g., the notification bell 🔔 in a social app). This implies the system will store notifications in a database so that a user can fetch their notification list. It can also push in-app alerts in real-time to active user sessions (via WebSocket or similar).
  • API for Notification Producers: Provide clean APIs or endpoints that various applications (social media, e-commerce, productivity tools, etc.) can call to create notification requests. Each request includes details like the user(s) to notify, the content (possibly a template ID and data), and which channel(s) to use. The API should handle single-recipient notifications as well as bulk sends (e.g., specifying a list of users or a user segment for broadcast).
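
Three of the requirements above (templates with placeholders, honoring opt-outs, and retry with exponential backoff) are small enough to sketch directly. The following is a minimal illustration under stated assumptions, not a definitive implementation: `TransientError`, `render`, `allowed_channels`, and `send_with_retry` are hypothetical names, channels are plain strings, and the retry limits are arbitrary defaults.

```python
import time
from typing import Callable, Dict, List, Set


class TransientError(Exception):
    """Hypothetical error for provider downtime or network failures."""


def render(template: str, data: Dict[str, str]) -> str:
    # str.format fills {name}-style placeholders like the template example above.
    return template.format(**data)


def allowed_channels(requested: List[str], opted_out: Set[str]) -> List[str]:
    # Skip channels the user has disabled; keep the rest in request order.
    return [channel for channel in requested if channel not in opted_out]


def send_with_retry(
    send: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds and return the number of attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts; the caller decides what happens next
            # Delay doubles after each failure: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

Because a retry can follow a send that actually succeeded but whose acknowledgment was lost, at-least-once delivery implies that a recipient may occasionally see a duplicate. The `sleep` parameter is injectable only so the backoff schedule can be tested without waiting.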

Non-Functional Requirements (Scale, Performance, Constraints):

  • Scalability: The system should handle web-scale load. This means potentially millions of users and high notification volumes:

    • For a social media tenant, there could be hundreds of millions of notifications per day.
    • The design should scale horizontally (add more machines to handle more load) without any bottlenecks.
  • High Throughput: Support very high write and read rates. For example, the system might ingest tens of thousands of notification events per second during peak (imagine a viral post generating notifications or a major sale event).

  • Low Latency: Especially for real-time channels (in-app and push), end-to-end latency from event to user device should be low: ideally under a second, and no more than ~1–2 seconds. Email and SMS can tolerate longer delays (e.g., < 1 minute).

  • High Availability: The service should be reliable, aiming for minimal downtime (e.g., 99.99% uptime, which allows roughly 52 minutes of downtime per year). Notifications can be critical (password resets, security alerts, order confirmations), so the system must be redundant and fault-tolerant across data centers or AZs (Availability Zones).

  • Durability: No lost notifications. Once an event is accepted, it should be queued/stored so that even if servers crash, the notification will eventually be delivered.

  • Ordering: Within a single user’s notification feed, the system should maintain chronological order (at least eventual consistency in order). If two events for the same user occur, they should not appear out-of-order in the in-app feed. (Exact global ordering across the whole system is not required.)

  • Multi-Tenant Isolation & Security: One tenant’s heavy load or failure should not cascade to others. We may need to enforce per-tenant rate limits or partitioning. Data must be partitioned or tagged by tenant so that it’s impossible for tenant A to access tenant B’s notifications or user info. Also, ensure secure authentication for tenants calling the API.

  • Flexibility & Extensibility: It should be easy to add new channels (say, WhatsApp or push for a new platform) without redesigning the whole system. Similarly, extending features such as notification scheduling (send at a specific time) or batching (daily digests) should not require a redesign.
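
The per-tenant rate limits mentioned under Multi-Tenant Isolation are often implemented with a token bucket. The sketch below is one such approach under stated assumptions, not the mechanism this design prescribes: `TokenBucket` and `TenantRateLimiter` are hypothetical names, all tenants share the same rate and burst capacity, and the state lives in a single process with no locking.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at rate_per_sec tokens per second, up to capacity tokens."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so a short burst is allowed
        self.now = now
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        current = self.now()
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = current - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = current
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Keeps one bucket per tenant so one tenant cannot drain another's quota."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._now = now
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._now)
            self._buckets[tenant_id] = bucket
        return bucket.allow()
```

With a rate of 1 per second and a capacity of 2, a tenant can send two requests back to back, is refused on the third, and regains one request after a second has passed, while a different tenant's bucket is unaffected throughout.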

With these requirements, we have a clear target: a fast, reliable notification service that can handle all tenants’ events and deliver via the appropriate channels, respecting user preferences, even at a massive scale.
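
Two of the requirements above, per-user ordering and tenant isolation, both depend on how events are keyed. One common approach, shown here as a hedged sketch rather than the design this lesson settles on, is to route every event by a stable hash of the tenant and user IDs so that all of one user's events land on the same partition and can be consumed in order. The function name `partition_for`, the null-byte separator, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def partition_for(tenant_id: str, user_id: str, num_partitions: int) -> int:
    """Map a (tenant, user) pair to a partition index in [0, num_partitions)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # The null byte keeps ("ab", "c") and ("a", "bc") from producing one key.
    key = f"{tenant_id}\x00{user_id}".encode("utf-8")
    # hashlib gives the same result in every process, unlike built-in hash(),
    # which is randomized per interpreter run for strings.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Including the tenant ID in the key also means that the same user ID under two tenants is treated as two unrelated keys.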

Step 3. Back-of-the-Envelope Capacity Estimation

Let’s estimate the expected scale to ensure our design can handle the load. (These are rough numbers to guide our choices.)

  • Number of Tenants & Users: Suppose we serve 10–50 tenant applications. Some might be large (tens of millions of users) and others smaller. In total, imagine 100 million registered users across tenants, with 20–30 million daily active users (DAUs) generating or receiving notifications.

  • Notification Volume: Assume on average each active user triggers or receives ~5 notifications per day (varies by app: social apps might be higher, e-commerce lower). That’s on the order of 100–150 million notifications per day system-wide. In peak scenarios (like a viral event or big sale), this could spike higher.

    • Peak Throughput: 150 million/day is ~1.7k notifications per second on average. Ten times that average is ~17k per second, so we should design for 20–50k notification events per second at peak to leave headroom. For example, if one tenant is a social network and a celebrity with 10 million followers posts something at noon, that single event could fan out to 10 million notifications over a few minutes (about 33k per second if spread over 5 minutes). That requires very high burst throughput.
  • Read vs Write:

    • Writes: Every notification event is a write (to databases/queues). If we have ~150M/day, that’s ~1.7k writes/sec on average, with bursts as described (tens of thousands/sec).
    • Reads: Users reading notifications (in-app) also generate load. If 20M users check their in-app notifications roughly once a day, that could be 20M read operations per day (~230 reads/sec average, but likely spiky around morning/evening). Many reads will fetch a small list (the user’s recent notifications).
    • Many users receive notifications by push in real time without pulling, but opening the app or loading the inbox triggers reads. Write volume (sending notifications) will likely exceed read volume if push delivery is common.
  • Storage:

    • Notification Storage: If we store notifications for in-app history, assume we keep the last 100 notifications per user. With 100M users, that’s up to 10 billion notification records. However, not all users are active; focusing on 20M active users with, say, 50 notifications each gives around 1 billion records. If each stored entry is ~500 bytes (message text plus metadata), 1 billion records is ~500 GB. With overhead and replication, we need a few TB of storage. This is large but manageable with distributed databases.
    • Preferences Storage: One record per user for preferences (which channels are enabled, quiet hours, etc.). 100M users * a few hundred bytes each = tens of GB. This fits easily in a partitioned SQL database or a NoSQL store.
    • Other Data: Possibly device tokens for push and email addresses, if stored here (user data might instead live in the tenant’s DB, in which case we receive those details with each send request).
  • Bandwidth: Pushing text-based notifications is relatively lightweight. For example, 100M notifications * ~1KB each on average = ~100 GB of data transferred per day just in notification payloads. We need network capacity for this, but it’s feasible across a distributed system.
  • Latency Expectation:

    • In-app/push: ideally <1–2 seconds from event to device notification. So our pipeline (ingestion -> processing -> push) should introduce minimal delay (tens or low hundreds of milliseconds at each step).
    • Email/SMS: Delivery within ~1 minute is fine. Handing an email to the gateway should be a sub-second operation on our side; external email gateways may then add seconds or tens of seconds before it arrives in the inbox.
    • We’ll have to buffer or queue events, but the queue should not add too much delay except during enormous spikes.
  • Throughput per Channel:

    • Push notifications: If a major event happens, we might send tens of thousands per second to APNs/FCM. We need to maintain connections and possibly throttle per provider guidelines.
    • Emails: If using an email service or server, sending thousands per second is possible but might need multiple servers or providers for large scale. (150M/day emails would be extremely high; realistically, not all notifications go via email – many are in-app or push. Email might be a smaller fraction of total notifications.)
    • SMS: Likely lower volume due to cost and use-case (used only for urgent things like 2FA or order updates). Could be tens of thousands a day at most, which is trivial in comparison – but SMS has costs and external gateway limits that we’d account for.
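
The arithmetic behind these estimates can be reproduced in a few lines. The inputs are the figures stated above; the only added assumption is 300 bytes per preference record, standing in for "a few hundred bytes", and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Notification volume: upper end of the 20-30M DAU range, ~5 per user per day.
daily_active_users = 30_000_000
notifications_per_user_per_day = 5
notifications_per_day = daily_active_users * notifications_per_user_per_day

# Write throughput: the daily average and a 10x peak multiplier.
avg_writes_per_sec = notifications_per_day / SECONDS_PER_DAY  # about 1,736
peak_writes_per_sec = avg_writes_per_sec * 10                 # about 17,361

# Read throughput: 20M users opening their in-app list about once a day.
avg_reads_per_sec = 20_000_000 / SECONDS_PER_DAY              # about 231

# Notification storage: 20M active users x 50 records x ~500 bytes.
stored_records = 20_000_000 * 50
notification_storage_gb = stored_records * 500 / 1e9          # 500 GB

# Preference storage: 100M users x 300 bytes (assumed record size).
preferences_storage_gb = 100_000_000 * 300 / 1e9              # 30 GB

# Bandwidth: 100M notifications x ~1 KB payload.
bandwidth_gb_per_day = 100_000_000 * 1_000 / 1e9              # 100 GB

print(f"avg writes/sec:  {avg_writes_per_sec:,.0f}")
print(f"peak writes/sec: {peak_writes_per_sec:,.0f}")
print(f"avg reads/sec:   {avg_reads_per_sec:,.0f}")
```

The 10x peak comes out near 17k writes per second, so the 20–50k design target leaves headroom above it for fan-out bursts.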

These estimations highlight the need for aggressive horizontal scaling (lots of parallel processing), efficient data partitioning, and asynchronous processing. The system must be distributed to handle peak loads and large storage.
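
To make "lots of parallel processing" concrete, the celebrity fan-out case can be sized roughly as follows. This is a back-of-the-envelope sketch, not a capacity plan: the 5-minute window, the 2,000 sends per second per worker, and the helper names `required_throughput`, `workers_needed`, and `batches` are all assumptions chosen for illustration.

```python
import math
from typing import Iterator, Sequence


def required_throughput(recipients: int, window_seconds: float) -> float:
    """Sends per second needed to reach every recipient within the window."""
    return recipients / window_seconds


def workers_needed(target_per_sec: float, per_worker_per_sec: float) -> int:
    """Parallel workers needed, rounded up to a whole worker."""
    return math.ceil(target_per_sec / per_worker_per_sec)


def batches(recipient_ids: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Split a recipient list into fixed-size chunks, one queue message each."""
    for start in range(0, len(recipient_ids), size):
        yield recipient_ids[start:start + size]


# 10 million followers notified within an assumed 5-minute window.
target = required_throughput(10_000_000, 5 * 60)  # about 33,333 per second
workers = workers_needed(target, 2_000)           # 17 at the assumed rate
```

Splitting the recipient list into batches lets the fan-out be queued as many small independent jobs, so adding workers shortens the delivery window roughly in proportion.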

Now let’s design the architecture with these numbers in mind.
