
Amazon Simple Queue Service (Amazon SQS) — Practical Wiki Documentation (Detailed)

Goal of this page: A complete guide to Amazon SQS—from “what it is” to “how to use it well” with examples, analogies, and diagrams.


1. Introduction

Amazon SQS is a fully managed message queue service used to move data between software components asynchronously.

In a typical system:

  • A producer (API, service, cron job) sends a message to a queue.
  • One or more consumers/workers receive messages and process them.
  • On success, the consumer deletes the message (this acts like an “ack”).

What problems does this solve?

  • Your API doesn’t need to wait for slow tasks (image processing, emails, billing).
  • Your downstream systems don’t get overloaded during traffic spikes.
  • Failures are isolated: if workers go down, the queue keeps messages until they return.

Example (high level)

User uploads image → API returns immediately → workers generate thumbnails in background


2. What SQS Is

SQS = a durable, scalable “mailbox” for messages between services.

Key properties (practical view)

  • Durable storage: messages are stored redundantly.
  • Pull-based consumption: consumers poll (or AWS integrations poll for you).
  • No broker management: you don’t manage servers, partitions, or clusters.

What SQS is NOT

  • Not a streaming platform (like Kinesis/Kafka) for ordered event logs over long periods.
  • Not a database for querying messages; you usually process and delete quickly.
  • Not a direct RPC replacement: it’s for async workflows.

3. Why Use SQS

3.1 Decoupling

Instead of Service A calling Service B directly (tight dependency), Service A publishes to SQS and Service B processes independently.

Benefit: deployments, failures, and scaling are independent.

3.2 Buffering / Smoothing

When traffic spikes, the queue absorbs load; workers drain at their own pace.

Benefit: prevents downstream timeouts and cascading failures.

3.3 Reliability (retries built-in)

If a worker fails or crashes mid-processing, the message becomes visible again after the visibility timeout.

Benefit: at-least-once delivery gives resilience without custom retry queues.

3.4 Cost and scaling control

You scale by:

  • increasing worker count
  • batching receives/deletes
  • using auto scaling based on backlog metrics

Detailed Example: “Email Sending”

Without SQS

  • API calls SMTP/provider directly
  • slow, fails under spikes, hard to retry reliably

With SQS

  • API enqueues {"type":"SEND_EMAIL","to":"...","templateId":"..."}
  • workers consume and send
  • failures retry; poison messages go to DLQ
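The enqueue step above can be sketched as building the kwargs for an SQS SendMessage call. This is a minimal illustration: the queue URL is a hypothetical placeholder, and the actual API call (shown in a comment) would go through a boto3 client.

```python
import json

# Hypothetical queue URL; replace with your own queue's URL.
EMAIL_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/email-jobs"

def build_send_email_request(to, template_id):
    """Build the kwargs for an SQS SendMessage call enqueuing an email job."""
    return {
        "QueueUrl": EMAIL_QUEUE_URL,
        "MessageBody": json.dumps(
            {"type": "SEND_EMAIL", "to": to, "templateId": template_id}
        ),
    }

# With boto3, the enqueue itself would be:
#   sqs = boto3.client("sqs")
#   sqs.send_message(**build_send_email_request("user@example.com", "welcome-v2"))
```

Because the API only enqueues and returns, the slow SMTP work happens later in the workers.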

4. Where SQS Is Used

4.1 Work queues (background jobs)

  • Image/video transcoding
  • PDF generation
  • Webhook delivery
  • Data enrichment pipelines

4.2 Event-driven microservices (async communication)

  • OrderPlaced event triggers billing and fulfillment separately

4.3 Ingestion buffering

  • Clickstream or log events buffered before ETL jobs

4.4 Fanout with SNS

  • One event published to SNS → delivered to multiple SQS queues → multiple services react independently

4.5 Rate limiting & backpressure

  • Put requests into SQS and process at a controlled rate so downstream systems are safe

5. Core Concepts

5.1 Queue

A named endpoint that stores messages.

Common configuration attributes:

  • Visibility Timeout
  • Message Retention Period
  • Receive Message Wait Time (long polling)
  • Delay Seconds
  • Redrive Policy (DLQ)
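The attributes above map directly onto the Attributes dict accepted by the CreateQueue API (all values are strings, durations in seconds). A minimal sketch, with a hypothetical DLQ ARN:

```python
import json

# Attribute names match the SQS CreateQueue API; values are strings.
# The DLQ ARN below is a hypothetical placeholder.
queue_attributes = {
    "VisibilityTimeout": "60",              # seconds a received message stays hidden
    "MessageRetentionPeriod": "345600",     # 4 days, in seconds
    "ReceiveMessageWaitTimeSeconds": "20",  # long polling: wait up to 20s per receive
    "DelaySeconds": "0",                    # default delivery delay for new messages
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:jobs-dlq",
        "maxReceiveCount": "5",
    }),
}

# With boto3:
#   sqs.create_queue(QueueName="jobs", Attributes=queue_attributes)
```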

5.2 Message

A unit of work/data.

Typically:

  • MessageBody: JSON string
  • MessageAttributes: metadata (trace IDs, tenant ID, type)

Example MessageBody

{
  "jobId": "job_123",
  "jobType": "thumbnail",
  "imageKey": "uploads/u1/img_77.png",
  "requestedAt": "2026-02-26T10:00:00Z"
}
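A full SendMessage request combining that body with message attributes might look like the sketch below. The queue URL, trace ID, and tenant ID are hypothetical; the MessageAttributes shape (each attribute needs a DataType plus a typed value field) is the SQS API's.

```python
import json

request = {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/jobs",  # hypothetical
    "MessageBody": json.dumps({
        "jobId": "job_123",
        "jobType": "thumbnail",
        "imageKey": "uploads/u1/img_77.png",
        "requestedAt": "2026-02-26T10:00:00Z",
    }),
    # Metadata travels outside the payload, so consumers and routing logic
    # can read it without parsing the body.
    "MessageAttributes": {
        "traceId": {"DataType": "String", "StringValue": "trace-abc-123"},
        "tenantId": {"DataType": "String", "StringValue": "tenant-42"},
    },
}
```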

6. Queue Types: Standard vs FIFO

SQS provides two major queue types.

Standard Queue

Best for maximum throughput and scale.

  • Delivery: at-least-once (duplicates can happen)
  • Ordering: best-effort (not guaranteed)
  • Use when: you can handle occasional duplicates and don’t require strict ordering.

FIFO Queue

Best for strict ordering and exactly-once processing within FIFO semantics.

  • Delivery: designed for exactly-once processing (with deduplication features)
  • Ordering: preserved within a message group
  • Use when: order matters (e.g., per-customer operations), or duplicates are unacceptable.

Practical rule:

  • If you can design idempotent consumers (recommended), Standard is usually simpler and cheaper.
  • If you truly need ordering guarantees, choose FIFO.
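For FIFO queues, a send must carry a MessageGroupId, and a MessageDeduplicationId unless content-based deduplication is enabled on the queue. A minimal sketch (hypothetical queue URL; FIFO queue names must end in `.fifo`):

```python
import hashlib
import json

def build_fifo_message(queue_url, body, group_id, dedup_id=None):
    """Build SendMessage kwargs for a FIFO queue.

    Ordering is preserved within each MessageGroupId. If the queue does not
    use content-based deduplication, MessageDeduplicationId is required.
    """
    msg = {"QueueUrl": queue_url, "MessageBody": body, "MessageGroupId": group_id}
    if dedup_id is not None:
        msg["MessageDeduplicationId"] = dedup_id
    return msg

# Content-based deduplication derives the dedup ID from a SHA-256 of the
# body, so identical sends within the 5-minute dedup window collapse to one.
body = json.dumps({"customerId": "c1", "op": "charge", "amount": 500})
msg = build_fifo_message(
    "https://sqs.us-east-1.amazonaws.com/123456789012/billing.fifo",
    body,
    group_id="c1",  # per-customer ordering
    dedup_id=hashlib.sha256(body.encode()).hexdigest(),
)
```

Using the customer ID as the group ID keeps each customer's operations ordered while still letting different customers be processed in parallel.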

7. How SQS Works (Message Lifecycle)

Lifecycle in steps

  1. Producer sends message to SQS.
  2. Consumer receives the message (message becomes invisible for a period).
  3. Consumer processes it.
  4. Consumer deletes it (completion).
  5. If consumer fails to delete in time, message becomes visible again → retry.

Important: Visibility Timeout

When a consumer receives a message, it is hidden from other consumers for the visibility timeout.

  • If you process longer than the visibility timeout, extend it.
  • If processing fails, do not delete—allow retry.
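The lifecycle and retry behavior above can be sketched with a tiny in-memory simulation (no AWS calls; in real SQS these steps correspond to the ReceiveMessage, DeleteMessage, and ChangeMessageVisibility operations):

```python
import time

class TinyQueue:
    """In-memory stand-in for one SQS queue, enough to show the lifecycle."""

    def __init__(self, visibility_timeout=2.0):
        self.visibility_timeout = visibility_timeout
        self.messages = {}   # id -> body
        self.invisible = {}  # id -> time the message becomes visible again

    def send(self, msg_id, body):
        self.messages[msg_id] = body

    def receive(self):
        """Return one visible message and hide it for the visibility timeout."""
        now = time.monotonic()
        for msg_id, body in self.messages.items():
            if self.invisible.get(msg_id, 0) <= now:
                self.invisible[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        """The 'ack': only an explicit delete removes a message for good."""
        self.messages.pop(msg_id, None)
        self.invisible.pop(msg_id, None)

q = TinyQueue(visibility_timeout=0.05)
q.send("m1", "resize image 77")

msg_id, body = q.receive()     # worker takes the message; it is now hidden
assert q.receive() is None     # no other worker can see it meanwhile

# Simulate a crash: the worker never deletes. After the visibility timeout
# expires, the message is visible again and gets retried (step 5 above).
time.sleep(0.06)
retry = q.receive()
assert retry is not None

q.delete(retry[0])             # successful processing: delete completes it
```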

8. Key Features

8.1 Long Polling

Reduces empty responses and cost:

  • Waits for messages up to a configured time (maximum 20 seconds per ReceiveMessage call).
  • Recommended for most consumers.
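A long-polling ReceiveMessage call is configured with two parameters (queue URL hypothetical):

```python
# WaitTimeSeconds can be at most 20; MaxNumberOfMessages at most 10 per call.
receive_kwargs = {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/jobs",  # hypothetical
    "WaitTimeSeconds": 20,      # long poll: hold the connection open up to 20s
    "MaxNumberOfMessages": 10,  # return up to 10 messages per response
}
# With boto3:
#   messages = sqs.receive_message(**receive_kwargs).get("Messages", [])
```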

8.2 Delay Queues & Per-Message Delay

Deliver messages after a delay (e.g., retry after 30 seconds).
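A per-message delay is set with the DelaySeconds parameter on SendMessage. SQS allows 0 to 900 seconds (15 minutes); FIFO queues only support the queue-level delay, not per-message delays. A small sketch with a hypothetical queue URL:

```python
def build_delayed_message(queue_url, body, delay_seconds):
    """Build SendMessage kwargs with a per-message delay (0-900 seconds)."""
    if not 0 <= delay_seconds <= 900:
        raise ValueError("SQS DelaySeconds must be between 0 and 900 (15 minutes)")
    return {"QueueUrl": queue_url, "MessageBody": body, "DelaySeconds": delay_seconds}

retry_msg = build_delayed_message(
    "https://sqs.us-east-1.amazonaws.com/123456789012/jobs",  # hypothetical
    "retry payment job_123",
    delay_seconds=30,
)
```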

8.3 Message Retention

Messages remain in the queue until deleted, up to the configured retention period (default 4 days; configurable from 1 minute to 14 days). After that, SQS discards them.

8.4 Dead-Letter Queue (DLQ)

After a message fails processing N times, move it to a DLQ for inspection.

8.5 Message Attributes

Attach metadata without embedding it into the payload.

8.6 Batching

Send, receive, or delete up to 10 messages per API call to improve efficiency and reduce per-message cost.
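For SendMessageBatch, each entry needs an Id (unique within the batch) and a MessageBody, and a batch holds at most 10 entries. A minimal chunking sketch:

```python
import json

def to_batch_entries(bodies):
    """Split message bodies into SendMessageBatch entry lists (max 10 per call)."""
    batches = []
    for start in range(0, len(bodies), 10):
        chunk = bodies[start:start + 10]
        batches.append([
            {"Id": str(start + i), "MessageBody": json.dumps(b)}
            for i, b in enumerate(chunk)
        ])
    return batches

batches = to_batch_entries([{"jobId": f"job_{n}"} for n in range(23)])
# With boto3, each batch becomes one API call:
#   for entries in batches:
#       sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```

Here 23 messages become three API calls instead of 23, which matters for both cost and throughput.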

8.7 FIFO-Specific Features

  • Message Group ID: ordering is guaranteed within a group.
  • Deduplication ID: prevents duplicate delivery within the 5-minute deduplication window.
  • Content-based deduplication: derive dedup ID from payload (optional).

9. Common Architecture Patterns

This section covers the most-used SQS architecture patterns, what problem each solves, when to use it, and a concrete example payload + flow. Use these as “blueprints” when designing systems around SQS.


9.1 Work Queue (Competing Consumers)

What it is:
A single SQS queue holds tasks. Multiple workers (consumers) poll the queue; each message is processed by one worker at a time.

Why it’s used:

  • Horizontal scaling: add more workers to increase throughput
  • Fault tolerance: if a worker dies, the message becomes visible again
  • Backpressure: the queue buffers spikes

When to use:

  • Background jobs (emails, thumbnails, ETL steps)
  • Any “task list” workload where ordering is not critical

Key design notes:

  • Use Standard queue if you can handle duplicates (recommended)
  • Make workers idempotent
  • Set visibility timeout >= max processing time (or extend dynamically)
  • Use long polling (e.g., 20 seconds)
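The idempotency note above can be sketched as follows: remember processed job IDs so a duplicate delivery (possible on Standard queues) becomes a harmless no-op. In production the processed-ID set would live in a shared store (a database or cache, with a TTL), not in process memory.

```python
processed = set()  # in production: a shared, durable store
results = []

def handle(message):
    """Process a job exactly once per jobId, even if delivered twice."""
    job_id = message["jobId"]
    if job_id in processed:
        return  # duplicate delivery: skip the side effect
    results.append(f"thumbnail for {job_id}")  # the real side effect
    processed.add(job_id)

# The same message delivered twice produces only one side effect.
handle({"jobId": "job_123"})
handle({"jobId": "job_123"})
```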

Diagram

                +----------------------+
Producer(s) --->|      SQS Queue       |<--- Worker A
                +----------------------+<--- Worker B
                         ^              <--- Worker C
                         |
                    (buffers load)

12. Dead-Letter Queues (DLQ) & Redrive

Dead-Letter Queues (DLQs) are one of the most important reliability tools in SQS. They prevent poison messages (messages that keep failing) from being retried forever and blocking healthy traffic.


12.1 What is a Dead-Letter Queue (DLQ)?

A Dead-Letter Queue is a separate SQS queue that receives messages that could not be processed successfully after a configured number of attempts.

  • Source queue (main queue): where messages are initially sent.
  • DLQ: where repeatedly failing messages are moved for investigation and recovery.

Key outcome: your main queue stays healthy, while failures are isolated for debugging.


12.2 Why a DLQ is Necessary

Without a DLQ:

  • A malformed payload can fail forever and keep reappearing.
  • Backlog increases and hides real throughput issues.
  • Workers waste CPU time repeatedly retrying permanent failures.
  • On-call debugging becomes harder because failures are mixed with healthy work.

With a DLQ:

  • Failures become visible and measurable.
  • You can set alarms and create a clear operational playbook.
  • You can pause, inspect, fix, and optionally reprocess safely.

12.3 Terminology (Important)

  • Receive count: how many times a message has been received from the main queue.
  • maxReceiveCount: threshold; after this many receives, the message moves to DLQ.
  • Redrive policy: configuration on the main queue that sends failed messages to the DLQ.
  • Redrive (re-drive): moving messages from DLQ back to the main queue (after fixing root cause).

12.4 How Messages End Up in the DLQ

A message is sent to the DLQ when:

  1. Worker receives the message.
  2. Worker fails to process it (throws error, times out, crashes).
  3. Worker does not delete the message.
  4. Message becomes visible again after visibility timeout.
  5. Steps repeat until ReceiveCount > maxReceiveCount.
  6. SQS moves it to the DLQ.
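The six steps above can be sketched as a pure simulation (no AWS calls), with maxReceiveCount = 3:

```python
MAX_RECEIVE_COUNT = 3  # the redrive policy's maxReceiveCount

main_queue = [{"body": "malformed payload", "receive_count": 0}]
dlq = []

def receive_and_fail():
    """One failed processing attempt: receive, crash, never delete."""
    msg = main_queue[0]
    msg["receive_count"] += 1
    # Processing raises, so the message is never deleted; after the
    # visibility timeout it becomes visible again -- unless the receive
    # count now exceeds maxReceiveCount, in which case SQS moves it
    # to the DLQ instead of redelivering it:
    if msg["receive_count"] > MAX_RECEIVE_COUNT:
        dlq.append(main_queue.pop(0))

for _ in range(4):  # attempts 1-3 retry; attempt 4 exceeds the threshold
    receive_and_fail()
```

After the fourth receive the message sits in the DLQ with its receive count preserved, and the main queue is healthy again.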

Diagram

Main Queue -> Worker receives -> processing fails -> message retries -> (maxReceiveCount exceeded) -> DLQ


Analogy (Simple + Sticky)

“Restaurant Order Tickets”

Think of SQS as the ticket rail in a busy restaurant:

  • Waiters (producers) place order tickets onto the rail (queue).
  • Chefs (consumers) pick up tickets and cook dishes (process messages).
  • When a chef finishes a dish, they remove the ticket (delete message).
  • If a chef drops a ticket or gets interrupted, the ticket reappears so someone else can cook it (visibility timeout + retry).
  • If a ticket keeps failing, it goes to the manager’s clipboard (DLQ).

Drawings / Diagrams

Diagram 1 — Basic Producer/Consumer

+-----------+        SendMessage         +-----------+
| Producer  | -------------------------> |   SQS     |
+-----------+                            |   Queue   |
                                         +-----------+
                                              |
                                              | ReceiveMessage (Long Poll)
                                              v
                                         +-----------+
                                         | Consumer  |
                                         |  Worker   |
                                         +-----------+
                                              |
                                              | DeleteMessage (on success)
                                              v
                                          (removed)

Diagram 2 — Visibility Timeout & Retry

Time -->
Message visible  ->  Consumer receives  ->  Message invisible  ->  (delete?) -> done
                                   |                         |
                                   | (no delete / timeout)   |
                                   +-------------------------+
                                             message visible again (retry)

Diagram 3 — DLQ Redrive Flow

                (fails N times)
+----------+      receive/process      +-----------+
| Workers  | <-----------------------> | Main Queue|
+----------+                           +-----------+
                                           |
                                           | redrive to DLQ
                                           v
                                      +-----------+
                                      |    DLQ    |
                                      +-----------+