Xortrix AI

Case Study — E-Commerce Infrastructure

Serverless E-Commerce Backend

How we replaced a rigid Shopify-plus-plugin stack with a fully serverless, event-driven commerce platform that handles 15-minute grocery delivery across 12 dark stores in Pune — and processed 2.3 million orders in its first six months of production.

2.3M

Orders in first 6 months

99.99%

Checkout availability

<200ms

Product search p99

40%

Lower infra cost vs prior stack

Client Overview

The client is a Pune-based direct-to-consumer grocery and daily essentials startup founded in 2022 with a single, aggressive promise: delivery in 15 minutes or less, to any pin code they serve, at prices competitive with the neighbourhood kirana store. The company aimed to be the digital-native successor to the ubiquitous family-run provision shops that have served Indian households for generations.

By the time the company engaged Xortrix AI in early 2023, they had already validated the concept with a lean operations team, a WhatsApp-based ordering flow, and three dark stores in the PCMC corridor. Monthly order volume was growing at roughly 30% MoM, and their improvised technical stack — a Shopify store stitched together with third-party plugins and a manually managed Google Sheet for stock levels — was visibly buckling under the load.

The business model is hyperlocal and inventory-intensive. The company stocks approximately 4,200 SKUs across categories including fresh produce, dairy, packaged foods, household cleaning supplies, and personal care. Each dark store carries a subset of that catalogue calibrated to the demand profile of its catchment area. Pricing adjusts dynamically based on expiry, overstock, time of day, and vendor-negotiated promotions — a complexity that Shopify's native pricing model was never designed to handle.

Business Snapshot

Founded: 2022, Pune, Maharashtra
Model: Hyperlocal D2C grocery delivery
Delivery Promise: 15 minutes or less
Dark Stores (at launch): 3 (scaling to 12)
Active SKUs: ~4,200
Monthly Orders (at engagement): ~45,000
Target Markets: Pune, Pimpri-Chinchwad, Wakad

Why They Came to Xortrix AI

Needed infrastructure that scales without ops overhead
Required custom dynamic pricing unavailable in Shopify
Inventory sync across dark stores was a manual nightmare
UPI Autopay (recurring orders) not supported in their stack
Wanted a team with both cloud architecture and India payments expertise

The Challenge

When the founders first described their stack, the conversation was illuminating. They had a Shopify Plus subscription, seven third-party apps bolted onto it (covering inventory, loyalty, discounts, delivery slots, and order routing), a Razorpay standard checkout integration that had been set up by a freelancer, and a Python script running on a DigitalOcean droplet that polled their WMS spreadsheet every five minutes and tried to push inventory updates to Shopify via the REST API.

It had worked at 1,000 orders a month. At 45,000 and accelerating, it was a liability.

Shopify's Structural Limitations for This Use Case

Shopify is an excellent platform for standard e-commerce. The client was not standard e-commerce. The first structural problem was inventory location granularity: Shopify's inventory model allows stock to be associated with a "location," but the pricing, product availability, and fulfilment routing logic the client needed was far more nuanced than Shopify's location system could express. A SKU might be available at the Baner dark store but not Wakad; it might be priced differently at each; and its visibility to a customer should depend on their real-time delivery pin code, not a static location assignment.

The second structural problem was Shopify's API rate limits. The platform caps REST API calls at 2 requests per second for most endpoints, with a burst allowance that drains quickly under load. The inventory sync script frequently hit rate limits during peak hours (7–9 AM and 6–9 PM), causing inventory updates to lag by up to 25 minutes. Customers were placing orders for items that were physically out of stock at their assigned dark store, leading to cancellations, refund processing delays, and mounting negative reviews.

The third problem was the checkout and payment layer. Shopify Payments is not available in India. The Razorpay integration in place worked for standard UPI and card payments but had no support for UPI Autopay, NPCI's mandate-based recurring payment system, which the company wanted to use for their "Weekly Essentials Box" subscription product. Building recurring commerce on top of Shopify required a level of customisation that would have meant forking their checkout experience entirely.

Custom Pricing Engine Requirements

The client's pricing rules were sophisticated in ways that no off-the-shelf engine supports out of the box. Prices needed to reflect: base vendor cost plus margin target per SKU category; time-of-day adjustments (fresh produce priced lower after 6 PM to clear stock before the next morning delivery); proximity to best-before dates (items within 48 hours of expiry marked down by a configurable percentage); competitor price parity checks running every 30 minutes against two local competitors; and vendor-push promotional prices that arrived via a webhook from their procurement system. The result was a pricing matrix that could change dozens of times per hour per SKU, across multiple dark stores simultaneously.

UPI and Payment Gateway Complexity

India's payment landscape is unique. UPI accounts for over 80% of digital transactions in the grocery segment, and within UPI, there are multiple flows: collect requests, intent flows, QR-based payments, and the newer UPI Autopay (e-mandate) system. The company wanted to offer customers the ability to set up a standing instruction for weekly grocery boxes — deducted automatically every Sunday at 6 AM — without requiring any action on the customer's part beyond the initial mandate setup. This is technically non-trivial: Razorpay's UPI Autopay requires specific mandate creation flows, NPCI's e-mandate standards, and debit execution timing windows that are distinct from the standard Razorpay checkout. None of this was reachable within the existing Shopify integration.

Summary of Constraints Going In

Technical

Shopify API rate limits causing 25-min inventory lag at peak
No multi-store dynamic pricing in any existing SaaS
UPI Autopay mandate flow not supported in current setup
Order routing to dark stores was entirely manual
No event-driven stock deduction on order confirmation

Operational

Stock cancellations running at 8.4% — unacceptable for a 15-min promise
Pricing team manually editing Shopify variants for promotions
No real-time visibility into per-dark-store stock levels
Refund processing taking 4–6 hours due to manual intervention
Growing infrastructure cost with no meaningful scale ceiling

Solution Architecture

After two weeks of discovery — including on-site visits to two of the client's dark stores and detailed mapping of their order lifecycle from customer tap to rider pickup — we concluded that the right architecture was fully serverless, event-driven, and built entirely on AWS. The reasoning was straightforward: the company's traffic pattern is highly spiky (sharp peaks at breakfast and dinner, relatively quiet in between), their team is lean and cannot carry operational overhead for managing servers or clusters, and the product requires extreme reliability at checkout — because a failed checkout in a 15-minute delivery context means the customer simply orders from a competitor.

The architecture we designed and built has five primary layers: the product and catalogue layer, the inventory layer, the order orchestration layer, the pricing layer, and the payments layer. Each layer is independently deployable, independently scalable, and communicates with the others through a combination of DynamoDB Streams, EventBridge events, and direct Lambda invocations where latency demands it.

Core AWS Services Used

AWS Lambda

All compute — 34 functions covering catalogue, inventory, pricing, orders, payments, notifications, and admin operations. Node.js 20 runtime with Graviton2 ARM architecture for cost efficiency.

Amazon DynamoDB

Primary datastore. Single-table design with composite keys and 6 GSIs serving the catalogue, inventory, orders, customer, and session access patterns.

AWS Step Functions

Order orchestration state machine. Manages the full lifecycle from order placement through payment confirmation, inventory deduction, dark store routing, packing, dispatch, and delivery confirmation.

Amazon EventBridge

Event bus for asynchronous domain events: inventory.updated, order.placed, order.dispatched, payment.captured, mandate.created. Decouples producers from consumers across all layers.

Amazon S3 + CloudFront

Product image CDN. 4,200 SKUs with multiple image variants (thumbnail, grid, PDP hero, zoom). CloudFront distributions with fine-grained cache control by image type.

Amazon SQS

Dead-letter queues for all Lambda-triggered event consumers. FIFO queues for stock deduction to guarantee exactly-once processing and ordering integrity.

AWS API Gateway

HTTP API (not REST API — lower latency, lower cost) for customer-facing endpoints. Separate REST API for internal admin and WMS webhook receivers.

Amazon ElastiCache (Redis)

Hot path caching for product search results, category listings, and the dynamic price cache. TTLs tuned per data type — price cache at 60 seconds, category listings at 5 minutes.

The Frontend Decoupling Decision

The customer-facing app (a React Native mobile application and a Next.js web storefront) communicates exclusively with our API layer — there is no direct dependency on any commerce SaaS platform. This was a deliberate architecture decision: we wanted the client to own their commerce logic completely, with no third-party platform able to throttle, rate-limit, or alter the behaviour of their checkout. The API layer is the only integration surface, and it is entirely under the company's control.

Inventory Management: DynamoDB Single-Table Design

The most consequential architectural decision in the entire project was how to model inventory in DynamoDB. Inventory in a hyperlocal multi-store grocery context is not a simple integer counter against a SKU. Each SKU has a stock level per dark store, a reserved quantity (items in active orders not yet confirmed), an available quantity (stock minus reserved), a reorder threshold, and a series of batch records tracking expiry dates and supplier lot numbers. Getting this wrong at the data model level would cascade into overselling, incorrect availability signals, and operational chaos.

We chose DynamoDB single-table design — one table for the entire commerce domain — following the access-pattern-first methodology described by Rick Houlihan. The primary key structure uses a composite partition key and sort key, where the partition key encodes the entity type and identifier, and the sort key encodes the sub-entity or facet being stored.
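As a rough illustration of this key structure, the sketch below builds composite keys for a few entity facets. The prefixes (`STORE#`, `INV#`, `BATCH#`, `ORDER#`, `LINE#`) and the helper names are hypothetical; the case study does not state the production key format.

```typescript
// Hypothetical key layout for a single-table design. The prefixes are
// illustrative only and are not the production schema.
type Key = { PK: string; SK: string };

// Inventory facet: one item per SKU per dark store.
function inventoryKey(storeId: string, skuId: string): Key {
  return { PK: `STORE#${storeId}`, SK: `INV#${skuId}` };
}

// Batch facet: expiry and lot tracking stored in the same store partition.
function batchKey(storeId: string, skuId: string, lotNumber: string): Key {
  return { PK: `STORE#${storeId}`, SK: `BATCH#${skuId}#${lotNumber}` };
}

// Order header and its line items share a partition, so a single Query
// on the partition key can return the whole order.
function orderKey(orderId: string): Key {
  return { PK: `ORDER#${orderId}`, SK: "META" };
}

function orderLineKey(orderId: string, lineNo: number): Key {
  // Zero-pad so that line items sort in numeric order as strings.
  const padded = ("000" + lineNo).slice(-3);
  return { PK: `ORDER#${orderId}`, SK: `LINE#${padded}` };
}
```

Keeping the entity type in the key prefix is what lets unrelated entities coexist in one table while a sort-key prefix match still selects a single facet.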

Key Access Patterns and GSI Design

The table serves 18 distinct access patterns across five entity types. We designed six Global Secondary Indexes (GSIs) to serve these patterns without requiring table scans. The GSI designs required careful thought because DynamoDB GSIs are eventually consistent and carry a write amplification cost — every write to the base table that touches a GSI-projected attribute incurs an additional write unit per GSI. With high-frequency inventory updates, every unnecessary GSI adds cost at scale.

GSI-1: StoreInventory

PK: storeId  |  SK: skuId

Serves the primary picker screen in the dark store WMS app: list all SKUs in a specific store with current stock and reserved quantities. Queries the partition for a given storeId to render the full store catalogue.

GSI-2: CategorySearch

PK: categoryId  |  SK: GSI2SK (sortKey for name/price/popularity)

Powers category browse pages with server-side sorting by name, price, or popularity rank. Allows efficient pagination via LastEvaluatedKey without loading the full catalogue.

GSI-3: CustomerOrders

PK: customerId  |  SK: createdAt (ISO-8601)

Serves the customer order history screen sorted by recency. Supports filtering by status via a filter expression on the projected orderStatus attribute.

GSI-4: OrdersByStore

PK: storeId  |  SK: createdAt

Feeds the dark store operations dashboard — active orders assigned to a specific store, sorted by placement time, with picker assignment status projected.

GSI-5: ExpiryTracking

PK: storeId  |  SK: bestBefore (ISO-8601 date)

Drives the expiry-aware pricing engine. A scheduled Lambda queries this GSI nightly to identify batches expiring within 48 hours and triggers the markdown pricing flow.

GSI-6: MandateIndex

PK: customerId  |  SK: mandateStatus

Supports UPI Autopay mandate management — list all active mandates for a customer, query by status for the weekly debit scheduler, and identify mandates approaching renewal.
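To make the GSI-5 access pattern concrete, here is a minimal sketch of the Query input the nightly expiry job might build. The table name, index name, and function name are assumptions; only the storeId and bestBefore key attributes come from the index description above. The values are plain JavaScript types, as a DynamoDB document client would accept them.

```typescript
// Field names follow the DynamoDB Query API.
interface ExpiryQueryInput {
  TableName: string;
  IndexName: string;
  KeyConditionExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, string>;
}

// Builds a query for batches at one store whose bestBefore date falls
// within the next `windowHours` hours.
function buildExpiryQuery(storeId: string, now: Date, windowHours = 48): ExpiryQueryInput {
  const cutoff = new Date(now.getTime() + windowHours * 3600 * 1000);
  // ISO-8601 dates sort lexicographically, so BETWEEN works on strings.
  // Dates are taken in UTC here, which is a simplifying assumption.
  const day = (d: Date): string => d.toISOString().slice(0, 10);
  return {
    TableName: "commerce", // hypothetical table name
    IndexName: "GSI5-ExpiryTracking", // hypothetical index name
    KeyConditionExpression: "#store = :store AND #bb BETWEEN :from AND :to",
    ExpressionAttributeNames: { "#store": "storeId", "#bb": "bestBefore" },
    ExpressionAttributeValues: { ":store": storeId, ":from": day(now), ":to": day(cutoff) },
  };
}
```

Because the range condition sits on the index sort key, the query reads only the batches inside the expiry window rather than the whole store partition.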

Real-Time Stock Sync Across 12 Dark Stores via DynamoDB Streams

The mechanism that replaced the five-minute polling script is a DynamoDB Streams consumer. Every write to the inventory records in the base table, whether triggered by an order deduction, a WMS receiving flow, or a manual stock correction, emits a stream event carrying the old and new images of the record. A Lambda function subscribed to this stream processes the event in near-real-time (typically within 200–500ms) and fans out two downstream actions: it updates the Redis price cache if the stock-level change crosses any threshold that triggers a pricing rule, and it publishes an inventory.updated event to EventBridge so that any downstream consumer (the search index refresher, the recommendation engine, the dark store operations dashboard) can react independently.

Stock deduction on order placement uses a conditional write pattern: DynamoDB's ConditionExpression ensures that the deduction only succeeds if the available quantity is greater than or equal to the ordered quantity. If the condition fails, meaning stock was concurrently depleted by another order, the write fails with a ConditionalCheckFailedException, which our Lambda catches and handles by returning an out-of-stock response to the checkout flow rather than proceeding to payment. This eliminates overselling entirely without requiring distributed locks.
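A minimal sketch of the conditional write follows, expressed as the plain UpdateItem input a document client would send. The table name, key format, and function names are hypothetical; availableQty is the attribute named in this section.

```typescript
interface StockDeductionInput {
  TableName: string;
  Key: { PK: string; SK: string };
  UpdateExpression: string;
  ConditionExpression: string;
  ExpressionAttributeValues: Record<string, number>;
}

// Builds an update that decrements stock only while enough stock remains.
function buildStockDeduction(storeId: string, skuId: string, qty: number): StockDeductionInput {
  if (!Number.isInteger(qty) || qty <= 0) {
    throw new RangeError("qty must be a positive integer");
  }
  return {
    TableName: "commerce", // hypothetical table name
    Key: { PK: `STORE#${storeId}`, SK: `INV#${skuId}` }, // hypothetical key format
    UpdateExpression: "SET availableQty = availableQty - :qty",
    // The write is rejected atomically if stock has dropped below :qty.
    ConditionExpression: "availableQty >= :qty",
    ExpressionAttributeValues: { ":qty": qty },
  };
}

// Distinguishes "out of stock" from genuine failures when the write throws.
function isOutOfStock(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: string }).name === "ConditionalCheckFailedException"
  );
}
```

Because the condition and the decrement are evaluated in one atomic operation on the item, two concurrent orders for the last unit cannot both succeed.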

Reserved quantity tracking, which prevents double-selling during the window between order placement and payment confirmation, uses a two-phase commit pattern. On order placement, we atomically increment the reservedQty field and decrement availableQty using a DynamoDB transactional write. If payment fails or times out, the Step Functions state machine triggers a compensation step that reverses the reservation. If payment succeeds, reservedQty is decremented and the item is considered physically allocated to the order.
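The reservation lifecycle can be modelled in memory to show how the two counters move. This is an illustrative model only, with hypothetical function names; in production each transition would be a DynamoDB transactional write rather than a pure function.

```typescript
interface StockRecord {
  availableQty: number;
  reservedQty: number;
}

// Phase 1, order placed: move units from available to reserved.
function reserve(s: StockRecord, qty: number): StockRecord {
  if (s.availableQty < qty) throw new Error("OUT_OF_STOCK");
  return { availableQty: s.availableQty - qty, reservedQty: s.reservedQty + qty };
}

// Phase 2a, payment succeeded: units leave the reserved pool because
// they are now allocated to the order.
function confirm(s: StockRecord, qty: number): StockRecord {
  if (s.reservedQty < qty) throw new Error("NOT_RESERVED");
  return { availableQty: s.availableQty, reservedQty: s.reservedQty - qty };
}

// Phase 2b, payment failed or timed out: compensation returns the
// units to the available pool.
function release(s: StockRecord, qty: number): StockRecord {
  if (s.reservedQty < qty) throw new Error("NOT_RESERVED");
  return { availableQty: s.availableQty + qty, reservedQty: s.reservedQty - qty };
}
```

For example, starting from 10 available and 0 reserved, reserving 3 gives 7 and 3; a release restores 10 and 0, while a confirm leaves 7 and 0.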

Dynamic Pricing Engine

The pricing engine is a Lambda-based microservice that computes the effective selling price for any SKU at any dark store at any point in time. It is invoked synchronously during catalogue API calls and at checkout — and its output is cached in Redis to avoid repeated computation under load.

The engine resolves prices through a deterministic priority chain. At the highest priority: active vendor-push promotions, which arrive via webhook from the client's procurement system and are stored as time-bounded records in DynamoDB. Next: expiry-aware markdowns, calculated by the nightly batch process against GSI-5. Then: time-of-day rules (fresh produce discounts after 18:00, configured per category). Then: competitor-parity overrides, sourced from a price-scraping Lambda that runs every 30 minutes and writes comparison signals to DynamoDB. Finally: the base price from the master catalogue record.

Each pricing layer writes a pricingContext object alongside the effective price: a structured record that explains which rule applied and why. This context is stored with every order line item, which gives the client's finance team a complete audit trail of exactly what price was charged and what rule drove it. This proved invaluable during a promotion misconfiguration incident in month three, when the team could trace every mispriced order in under ten minutes.
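The priority chain and its audit record can be sketched as a pure function. The input shape, rule names, and the fields inside pricingContext are assumptions made for illustration; only the rule ordering is taken from the description above.

```typescript
type RuleName = "VENDOR_PROMO" | "EXPIRY_MARKDOWN" | "TIME_OF_DAY" | "COMPETITOR_PARITY" | "BASE";

interface PricingContext { rule: RuleName; reason: string }
interface ResolvedPrice { price: number; pricingContext: PricingContext }

// Hypothetical inputs: each optional field is present only when that rule is active.
interface PriceInputs {
  basePrice: number;
  vendorPromoPrice?: number;
  expiryMarkdownPct?: number; // 0 to 100
  timeOfDayDiscountPct?: number; // 0 to 100
  competitorParityPrice?: number;
}

const round2 = (n: number): number => Math.round(n * 100) / 100;
const discounted = (base: number, pct: number): number => round2((base * (100 - pct)) / 100);

// The first active rule in priority order wins; later rules are not consulted.
function resolvePrice(i: PriceInputs): ResolvedPrice {
  if (i.vendorPromoPrice !== undefined) {
    return { price: round2(i.vendorPromoPrice), pricingContext: { rule: "VENDOR_PROMO", reason: "active vendor-push promotion" } };
  }
  if (i.expiryMarkdownPct !== undefined) {
    return { price: discounted(i.basePrice, i.expiryMarkdownPct), pricingContext: { rule: "EXPIRY_MARKDOWN", reason: `${i.expiryMarkdownPct}% markdown near best-before` } };
  }
  if (i.timeOfDayDiscountPct !== undefined) {
    return { price: discounted(i.basePrice, i.timeOfDayDiscountPct), pricingContext: { rule: "TIME_OF_DAY", reason: `${i.timeOfDayDiscountPct}% time-of-day discount` } };
  }
  if (i.competitorParityPrice !== undefined) {
    return { price: round2(i.competitorParityPrice), pricingContext: { rule: "COMPETITOR_PARITY", reason: "competitor parity override" } };
  }
  return { price: round2(i.basePrice), pricingContext: { rule: "BASE", reason: "master catalogue price" } };
}
```

Returning the context together with the price means the caller can persist both on the order line item without re-deriving which rule fired.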

Redis Caching Strategy for the Price Layer

The pricing Lambda can be invoked thousands of times per minute during peak traffic. Running the full priority chain for every invocation against DynamoDB would be unnecessarily expensive and slow. We cache the resolved price (and the pricingContext) in ElastiCache Redis under a key of price:{storeId}:{skuId} with a TTL of 60 seconds for standard SKUs and 10 seconds for any SKU currently under an active time-of-day or expiry promotion.

When a pricing rule changes (because a vendor webhook fires, a markdown batch runs, or a competitor parity update triggers), the pricing engine publishes a pricing.updated EventBridge event, and a separate Lambda subscriber performs a targeted Redis key invalidation for the affected SKU-store combinations. This proactive invalidation means the effective staleness window for actively promoted items is the latency of the EventBridge delivery (typically sub-second), not the wall-clock TTL of the cache entry.
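A small sketch of the cache-key and invalidation logic described in this section follows. The key format and the two TTL values come from the text; the event detail shape and function names are hypothetical.

```typescript
const STANDARD_TTL_SECONDS = 60;
const PROMO_TTL_SECONDS = 10;

// Key format as described in the text: price:{storeId}:{skuId}
function priceCacheKey(storeId: string, skuId: string): string {
  return `price:${storeId}:${skuId}`;
}

// Shorter TTL while a time-of-day or expiry promotion is active.
function priceCacheTtl(hasActivePromotion: boolean): number {
  return hasActivePromotion ? PROMO_TTL_SECONDS : STANDARD_TTL_SECONDS;
}

// Hypothetical detail payload of the pricing.updated event.
interface PricingUpdatedDetail {
  storeIds: string[];
  skuIds: string[];
}

// Expands the event into the exact keys to delete, so that unaffected
// SKU-store combinations keep their cached prices.
function keysToInvalidate(detail: PricingUpdatedDetail): string[] {
  const keys: string[] = [];
  detail.storeIds.forEach((storeId) => {
    detail.skuIds.forEach((skuId) => keys.push(priceCacheKey(storeId, skuId)));
  });
  return keys;
}
```

The resulting key list would typically be passed to a single Redis DEL call by the invalidation subscriber.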

Order Orchestration: The Step Functions State Machine

The centrepiece of the backend is the order orchestration state machine built in AWS Step Functions (Express Workflows — chosen for their high throughput and per-execution pricing model, which is significantly cheaper than Standard Workflows for short-lived, high-volume executions like order processing).

Every order placement triggers a new state machine execution. The machine manages the order from placement through to delivery confirmation, with explicit states for each logical step and well-defined transition conditions, retry policies, and compensation paths for failures. Using Step Functions rather than a chain of Lambda functions was a deliberate choice: it gives us durable execution state, built-in retry and backoff for transient failures, and a visual execution history that makes debugging failed orders trivially easy — the ops team can inspect any failed order and see exactly which state it failed in, without diving into CloudWatch logs.

State Machine Design

1

ValidateCart

Confirms all items in the cart are still available at the customer's assigned dark store. Re-runs the availability check at order placement time (not just add-to-cart) to catch concurrent depletions. Returns a cleaned cart if any items have become unavailable.

2

ReserveInventory

Executes the DynamoDB transactional write to reserve stock for all line items simultaneously. Uses a single TransactWriteItems call for atomicity — either all reservations succeed or none do. On ConditionalCheckFailedException, transitions to the CartConflict state which guides the customer to update their cart.

3

InitiatePayment

Creates a Razorpay order via the Razorpay Orders API and returns the order ID and payment options to the client. The state machine then waits in a callback pattern (using .waitForTaskToken) for the payment webhook to resume execution.

4

WaitForPayment

Heartbeat state using Step Functions' callback pattern. The execution is suspended here until Razorpay fires a payment.captured or payment.failed webhook. A separate Lambda receives the webhook, validates the Razorpay signature, and resumes the execution with the payment result using SendTaskSuccess or SendTaskFailure.

5

ConfirmPayment

Validates the payment amount matches the order total (guards against partial-amount exploits). Updates the order status to PAYMENT_CONFIRMED in DynamoDB and publishes the order.paid EventBridge event. Converts reserved inventory to allocated inventory.

6

RouteToStore

Runs the store routing algorithm — for the early multi-store era this was straightforward pin code to store mapping, but it now supports split orders across stores for items not available at the primary store. Assigns a picker and creates the pick list in the WMS.

7

AwaitPacking

Waits for the WMS to emit an order.packed event confirming all items have been picked and the package is ready. Monitors for SLA breach: if packing takes more than 8 minutes (for a 15-minute total promise), it fires an alert to the store manager.

8

AssignRider

Calls the rider assignment Lambda, which interfaces with the client's delivery management system to assign the nearest available rider and compute the estimated delivery time.

9

AwaitDelivery

Long-running wait state (up to 60 minutes) for the delivery completion event from the rider app. On delivery confirmation, updates order status, releases any over-reserved inventory, and triggers the post-delivery flow (review request, loyalty points accrual).

10

HandleFailure

Catch-all failure state reached via any uncaught exception. Releases inventory reservations, initiates refund if payment was captured, sends customer notification, and creates an ops ticket for manual review if the failure is in the post-payment flow.
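The HandleFailure behaviour can be sketched as a function that turns the failure context into an ordered list of compensation actions. The type and action names are hypothetical, and the sketch assumes that "post-payment flow" means a failure after payment capture.

```typescript
// Hypothetical summary of how far the order got before failing.
interface FailureContext {
  inventoryReserved: boolean;
  paymentCaptured: boolean;
}

type CompensationAction =
  | "RELEASE_INVENTORY"
  | "INITIATE_REFUND"
  | "NOTIFY_CUSTOMER"
  | "CREATE_OPS_TICKET";

// Returns the compensation steps in the order they should run.
function planCompensation(ctx: FailureContext): CompensationAction[] {
  const actions: CompensationAction[] = [];
  if (ctx.inventoryReserved) actions.push("RELEASE_INVENTORY");
  if (ctx.paymentCaptured) actions.push("INITIATE_REFUND");
  // The customer is always told what happened.
  actions.push("NOTIFY_CUSTOMER");
  // Assumption: failures after capture are flagged for manual review.
  if (ctx.paymentCaptured) actions.push("CREATE_OPS_TICKET");
  return actions;
}
```

Keeping the decision logic pure makes it easy to unit-test every failure point without running the state machine.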

Razorpay Integration: UPI, Cards, and UPI Autopay

Payments was the most India-specific and therefore most detail-intensive part of the integration. Indian consumers expect UPI to be the primary payment method, expect it to be fast (sub-5 seconds for a standard UPI collect), and are acutely sensitive to payment failures — a failed payment is often interpreted as the app being unreliable rather than the bank being slow, which means every payment failure is a potential customer churn event.

Standard UPI and Card Payments

For standard payments we use Razorpay's Orders API to create a server-side order record before rendering the payment UI on the client. This server-side order creation is important: it means the payment amount is set and validated server-side before any client interaction, preventing amount tampering. The Razorpay order ID is passed to the client, which invokes the Razorpay checkout SDK. Payment completion fires a webhook to our /webhooks/razorpay Lambda endpoint, which validates the signature using HMAC-SHA256 (Razorpay's webhook signature scheme), then resumes the Step Functions execution via SendTaskSuccess with the payment details.
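A minimal sketch of the signature check, using only Node's built-in crypto module. It assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body keyed with the webhook secret, delivered in the X-Razorpay-Signature header; the function name is hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies a webhook body against its signature.
// rawBody must be the exact bytes received, not re-serialised JSON.
function verifyWebhookSignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The constant-time comparison avoids leaking how many leading bytes of a forged signature happened to match. Only after this check passes should the handler parse the body and resume the state machine.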

UPI payment timeouts are handled explicitly: if the Step Functions WaitForPayment state reaches its heartbeat timeout (90 seconds from payment initiation) without receiving a webhook, the machine transitions to a PaymentTimeout state that releases inventory reservations and allows the customer to retry. This is critical for UPI flows where the customer may have approved the payment in their UPI app but the bank's confirmation was delayed: a Razorpay payment fetch is performed at timeout to check the actual payment status before releasing inventory, catching late-arriving bank confirmations.
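The timeout decision can be sketched as a pure function over the fetched payment status. The status values are the payment states Razorpay reports; the outcome names are hypothetical, and re-checking an authorized payment once before releasing stock is an assumption rather than something the case study states.

```typescript
type FetchedStatus = "created" | "authorized" | "captured" | "failed" | "refunded";

type TimeoutOutcome = "RESUME_AS_PAID" | "RECHECK_ONCE" | "RELEASE_AND_ALLOW_RETRY";

// Decides what the PaymentTimeout state does after fetching the real status.
function resolveTimeout(status: FetchedStatus): TimeoutOutcome {
  switch (status) {
    case "captured":
      // The bank confirmed late: keep the reservation and continue the order.
      return "RESUME_AS_PAID";
    case "authorized":
      // Assumption: capture may still be in flight, so look again before releasing.
      return "RECHECK_ONCE";
    default:
      // created, failed, refunded: no money is held, so free the stock.
      return "RELEASE_AND_ALLOW_RETRY";
  }
}
```

Checking the status before releasing inventory is what prevents a paid order from losing its stock to the next customer.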

UPI Autopay (e-Mandate) for Subscription Orders

The "Weekly Essentials Box" subscription is powered by Razorpay's UPI Autopay product, which implements the NPCI e-mandate standard. The mandate setup flow is a one-time process: the customer selects their subscription box, is redirected to a Razorpay-hosted mandate creation page, authenticates their UPI ID, and authorises a standing debit of up to the specified amount every week. Razorpay returns a mandate token, which we store against the customer record in DynamoDB under GSI-6.

Weekly debit execution runs as a scheduled Lambda triggered by EventBridge Scheduler at 5:45 AM every Sunday. The Lambda queries GSI-6 for all active mandates due for execution, creates Razorpay subscription orders for each, and initiates the debit via Razorpay's POST /v1/payments/create/recurring endpoint. NPCI requires that recurring UPI debits be executed between 00:00 and 23:00 and that the customer receive a pre-debit notification at least 24 hours before execution; our Saturday 6 AM notification Lambda satisfies this requirement.

Failed mandate debits (insufficient funds, bank downtime, expired UPI ID) are handled with a two-retry policy: a first retry after 4 hours, then a final retry the next morning. If both retries fail, the mandate is suspended and the customer receives a push notification with a deep link to reactivate their subscription.

Product Catalogue, Search, and the Image CDN

With 4,200 active SKUs and catalogue updates arriving multiple times daily from the procurement system, product catalogue management needed to be both reliable and fast to update. Catalogue records live in the primary DynamoDB table with a partition key structure that isolates them from the inventory and order entities — enabling efficient batch operations on the catalogue without competing with transactional reads for order processing.

Product Search: ElastiCache + DynamoDB GSI Hybrid

Full-text product search in DynamoDB requires a thoughtful approach because DynamoDB is not a search engine — it does exact key lookups and range queries, not fuzzy text matching. We implemented a hybrid approach: for category browse and filter (which is the dominant customer behaviour in a grocery context), we serve directly from GSI-2 with results cached in Redis. For keyword search, we pre-index a lightweight inverted index of product names and common alternative spellings (Hindi transliterations are important here — "aloo" must find "potato," "dhaniya" must find "coriander") into Redis Sorted Sets, updated whenever the catalogue is modified.

The search Lambda resolves a query by looking up matching SKU IDs from the inverted index in Redis (typically <5ms), then performing a DynamoDB BatchGetItem for the matched SKUs (up to 100 items per batch call) to retrieve full product details. The entire search resolution path, from API Gateway to response, runs in under 80ms cold and under 20ms warm at median latency. P99 latency is under 200ms even at peak load, satisfying the SLA set for search.
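The lookup path can be illustrated with an in-memory inverted index standing in for the Redis Sorted Sets. The class, the transliteration table, and the helper names are hypothetical; the sketch shows only token normalisation, posting-list intersection, and splitting matches into batches of at most 100 IDs for BatchGetItem.

```typescript
// Hypothetical transliteration table; the production list would be far larger.
const TRANSLITERATIONS: Record<string, string> = { aloo: "potato", dhaniya: "coriander" };

// Lowercase, split on non-alphanumerics, and map known transliterations.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0)
    .map((t) => TRANSLITERATIONS[t] ?? t);
}

class InvertedIndex {
  private postings = new Map<string, Set<string>>();

  // Called whenever a catalogue record is created or modified.
  add(skuId: string, productName: string): void {
    tokenize(productName).forEach((token) => {
      const ids = this.postings.get(token) ?? new Set<string>();
      ids.add(skuId);
      this.postings.set(token, ids);
    });
  }

  // Returns SKU IDs whose names contain every query token.
  search(query: string): string[] {
    const tokens = tokenize(query);
    if (tokens.length === 0) return [];
    const sets = tokens.map((t) => this.postings.get(t) ?? new Set<string>());
    return Array.from(sets[0]).filter((id) => sets.every((s) => s.has(id))).sort();
  }
}

// Splits matched IDs into groups that fit the BatchGetItem limit.
function chunk<T>(items: T[], size = 100): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}
```

Normalising the query and the indexed names through the same tokenizer is what lets a search for "aloo" match a product listed as "Potato".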

S3 and CloudFront Image Delivery

Product images are stored in S3 with a structured prefix scheme: products/{skuId}/{variant}.webp where variant is one of: thumb (80×80), grid (240×240), pdp (800×800), and zoom (1600×1600). Images are uploaded by the client's content team through an internal admin tool; a Lambda triggered by the S3 upload event runs Sharp-based image processing to generate all four variants from the original high-resolution image.

CloudFront sits in front of S3 with cache behaviours split by image variant. Thumbnail and grid images, which are served on listing pages and the search results grid, are cached at edge with a 7-day TTL and a long-lived Cache-Control: public, max-age=604800, immutable header. PDP and zoom images are cached for 24 hours with conditional revalidation. When a product image is updated, a Lambda calls CloudFront's CreateInvalidation API with the specific path, rather than blanket invalidating the entire distribution, preserving the cache hit rate for unaffected products.
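The naming and caching rules above can be captured in a few small helpers. The object key scheme, variant sizes, and the 7-day header come from the text; the exact header used for the 24-hour variants and the function names are assumptions.

```typescript
type Variant = "thumb" | "grid" | "pdp" | "zoom";

// Square edge length in pixels for each generated variant.
const VARIANT_SIZE: Record<Variant, number> = { thumb: 80, grid: 240, pdp: 800, zoom: 1600 };

const ALL_VARIANTS: Variant[] = ["thumb", "grid", "pdp", "zoom"];

// S3 object key: products/{skuId}/{variant}.webp
function imageKey(skuId: string, variant: Variant): string {
  return `products/${skuId}/${variant}.webp`;
}

// Listing images get the long-lived header; detail images revalidate daily.
function cacheControl(variant: Variant): string {
  if (variant === "thumb" || variant === "grid") {
    return "public, max-age=604800, immutable"; // 7 days
  }
  return "public, max-age=86400, must-revalidate"; // 24 hours, assumed header
}

// CloudFront invalidation paths start with a slash; one path per variant
// keeps the invalidation scoped to a single product.
function invalidationPaths(skuId: string): string[] {
  return ALL_VARIANTS.map((v) => `/${imageKey(skuId, v)}`);
}
```

The path list is what a targeted CreateInvalidation request would carry when one product's images change.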

Push Notifications via Firebase Cloud Messaging

Customer notifications are handled by a dedicated notification microservice backed by Firebase Cloud Messaging (FCM) for push delivery and AWS SNS for SMS fallback. The notification service consumes EventBridge events: it subscribes to the order.*, payment.*, and mandate.* event patterns and maps each event to a notification template and delivery channel.

FCM device tokens are stored in DynamoDB against the customer record and refreshed on each app open. Multi-device support (a customer using both a phone and a tablet) is handled by storing multiple tokens per customer and fanning out the FCM send across all registered tokens. Token invalidation errors from FCM are handled by deleting the invalid token from DynamoDB — a common requirement that is often overlooked, leading to silently failing notifications and token table bloat.
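The token clean-up step can be sketched as a filter over per-token send results. The result shape and function name are hypothetical, and the two error codes are the ones the Firebase Admin SDK is understood to report for unusable tokens; treat them as an assumption to verify against the SDK version in use.

```typescript
// Hypothetical per-token outcome collected after a multi-device send.
type SendResult =
  | { token: string; ok: true }
  | { token: string; ok: false; errorCode: string };

// Error codes that indicate the token will never work again (assumed list).
const INVALID_TOKEN_CODES: string[] = [
  "messaging/registration-token-not-registered",
  "messaging/invalid-registration-token",
];

// Returns the tokens that should be deleted from the customer record.
// Transient errors are left alone so the token can be retried later.
function tokensToDelete(results: SendResult[]): string[] {
  const stale: string[] = [];
  for (const r of results) {
    if (r.ok === false && INVALID_TOKEN_CODES.indexOf(r.errorCode) !== -1) {
      stale.push(r.token);
    }
  }
  return stale;
}
```

Separating permanent from transient failures matters here: deleting a token after a temporary FCM outage would silently stop notifications to a working device.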

Notification delivery is intentionally fire-and-forget — we publish to an SQS queue that the notification Lambda processes, rather than calling FCM synchronously from the event consumer. This ensures that a transient FCM API error or timeout does not propagate back into the order processing flow. The notification SQS queue has a DLQ; failed notification sends are retried up to three times before landing in the DLQ for investigation.

Lambda Cold Start Optimisation

Cold starts are the classic concern with a serverless architecture, and in a checkout flow they are unacceptable — a 2-second cold start on the payment initiation Lambda would meaningfully hurt conversion. We applied a layered cold start mitigation strategy across all latency-sensitive functions.

Graviton2 (ARM64) Architecture

All Lambda functions run on arm64 (Graviton2) rather than x86_64. AWS Graviton2 Lambdas typically initialise 10–20% faster and cost 20% less per GB-second. For a high-throughput system processing millions of invocations monthly, this compounds to meaningful cost savings.

Provisioned Concurrency for Hot-Path Functions

The five most latency-critical functions (catalogue list, product search, cart validate, checkout initiate, and payment webhook receiver) run with provisioned concurrency (PC). PC pre-warms the Lambda execution environment, eliminating cold starts entirely for these functions. PC is scheduled to scale up 15 minutes before the two daily peak windows (06:45 and 17:45) using Application Auto Scaling, and scales back down after the peak subsides to avoid paying for idle PC capacity 24/7.

Minimal Dependency Bundles

Each Lambda function is bundled with esbuild, tree-shaken to include only the code it actually uses. We do not use the monolithic AWS SDK v2; we import only the specific DynamoDB and S3 client classes from the modular AWS SDK v3 (the version that ships with the Node.js 20 runtime), which dramatically reduces the initialised module footprint. The cart validation Lambda's bundle is 48KB gzipped, compared to the 12MB+ bundles we commonly see in projects that import the full SDK or use heavyweight ORMs.

Connection Reuse via Execution Context

DynamoDB and Redis client initialisation happens outside the handler function, at module load time. This means connections are initialised once per cold start and reused across warm invocations. The Redis client uses a connection pool of 5 connections, configured to respect ElastiCache's maximum connection count per node.

Lambda Layers for Shared Utilities

Shared business logic — price resolver utilities, validation schemas, DynamoDB expression builders — is packaged as Lambda Layers. This reduces the surface area that changes per deploy (most functions only need to deploy their own function code, not the shared utilities layer), improving deployment speed and reducing the risk of a utilities layer regression affecting all functions simultaneously.

Results: Six Months in Production

The new backend went live in September 2023 with a phased traffic migration — 10% of orders routing through the new system on day one, scaling to 100% over three weeks. The migration was seamless enough that most customers had no awareness a backend change had occurred. By the end of the first month, the metrics told a clear story.

2.3 Million Orders

Processed in the first six months of production

Zero data loss incidents, zero stuck orders requiring manual DynamoDB intervention. The Step Functions visual debugger was used twice: once to identify a pricing Lambda regression and once to trace a Razorpay webhook replay event. Both were resolved in under 20 minutes.

99.99% Checkout Availability

Measured across the 6-month production window

The previous Shopify-based checkout had experienced three multi-hour outages in the six months prior to migration — twice due to Shopify platform incidents, once due to the Razorpay plugin failing to handle a Razorpay API change. The serverless architecture has had zero unplanned checkout outages.

Sub-200ms Product Search (p99)

Down from 1.2 seconds median on Shopify storefront

The hybrid Redis + DynamoDB search approach delivers median search latency of 18ms and p99 of 160ms. The Shopify storefront (with a third-party search app) had a median of 1.2 seconds — search was a known pain point in the old customer experience.

40% Infrastructure Cost Reduction

Versus Shopify Plus + plugins + DigitalOcean droplet

The serverless architecture scales to zero during off-peak hours, eliminating the baseline cost of always-on servers. Combined with Graviton2 pricing and the elimination of Shopify Plus and seven third-party app subscriptions, the monthly infrastructure cost dropped by 40% at the same order volume.

Stock Cancellation Rate: 0.3%

Down from 8.4% pre-migration

The DynamoDB conditional write inventory model with real-time Stream-based sync across all 12 stores eliminated the overselling that was driving cancellations. The 0.3% residual rate is primarily due to physical stock discrepancies discovered during picking.
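
A minimal sketch of the conditional write idea, assuming a hypothetical table name, key layout, and an `availableQty` attribute (none of which are confirmed by this case study). The object mirrors the shape of a DynamoDB UpdateItem request: the decrement is applied only if enough stock remains, otherwise DynamoDB rejects the write with a ConditionalCheckFailedException.

```typescript
// Shape mirrors a DynamoDB UpdateItem request. Table, key, and attribute
// names are hypothetical.
interface StockDeductionRequest {
  TableName: string;
  Key: Record<string, { S: string }>;
  UpdateExpression: string;
  ConditionExpression: string;
  ExpressionAttributeValues: Record<string, { N: string }>;
}

function buildStockDeduction(
  storeId: string,
  sku: string,
  quantity: number,
): StockDeductionRequest {
  if (!Number.isInteger(quantity) || quantity <= 0) {
    throw new Error("quantity must be a positive integer");
  }
  return {
    TableName: "commerce", // hypothetical table name
    Key: { PK: { S: `STORE#${storeId}` }, SK: { S: `SKU#${sku}` } },
    // Decrement and guard are evaluated atomically on the item.
    UpdateExpression: "SET availableQty = availableQty - :qty",
    ConditionExpression: "availableQty >= :qty",
    ExpressionAttributeValues: { ":qty": { N: String(quantity) } },
  };
}
```

Because the check and the decrement happen in one request, two concurrent orders for the last unit cannot both succeed.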

12 Dark Stores

Scaled from 3 at engagement to 12 at 6 months

Adding a new dark store requires a configuration record in DynamoDB and a new entry in the store routing table. No code changes, no infra provisioning. The first new store after launch went live in 4 hours from decision to first order.

Tech Stack

Every component in this stack was chosen for a specific reason against this specific use case. The decisions below are documented with their rationale because understanding the "why" is as important as knowing the "what" — especially for a team that will own this system long-term.

Compute

  • AWS Lambda (Node.js 20, ARM64/Graviton2)
  • AWS Step Functions Express Workflows
  • EventBridge Scheduler (cron jobs)

Data

  • Amazon DynamoDB (single-table, 6 GSIs)
  • Amazon ElastiCache for Redis 7.x
  • DynamoDB Streams (inventory events)

API & Routing

  • Amazon API Gateway HTTP API
  • Amazon API Gateway REST API (admin/webhooks)
  • AWS WAF (rate limiting, bot protection)

Messaging

  • Amazon EventBridge (domain event bus)
  • Amazon SQS FIFO (inventory deduction queue)
  • Amazon SQS Standard (notification queue + DLQ)

Storage & CDN

  • Amazon S3 (product images, exports)
  • Amazon CloudFront (image CDN, API cache)
  • AWS Lambda@Edge (request normalisation)

Payments

  • Razorpay Orders API
  • Razorpay UPI Autopay / e-Mandate
  • Razorpay Webhook (HMAC-SHA256 signature validation)

Notifications

  • Firebase Cloud Messaging (push)
  • Amazon SNS (SMS fallback)
  • AWS SES (transactional email receipts)

Observability

  • Amazon CloudWatch Logs + Metrics
  • AWS X-Ray (distributed tracing)
  • CloudWatch Dashboards (ops + business KPIs)

DevOps

  • AWS CDK (TypeScript) — all infra as code
  • GitHub Actions (CI/CD pipelines)
  • esbuild (Lambda bundle optimisation)

Lessons Learned: Serverless E-Commerce in India

After six months of running this system at scale, several lessons emerged that we now apply as defaults on any India-market e-commerce backend engagement.

1. UPI Timing Windows Are Not Optional Constraints

NPCI imposes strict rules on UPI transaction timing, mandate execution windows, and pre-debit notification timelines. These are not soft guidelines: non-compliant debits are rejected by the bank, and repeated violations can result in the Razorpay integration being flagged. Build the compliance logic into the core scheduling system, not as an afterthought. We maintain a dedicated npciComplianceCheck utility that validates any scheduled payment operation against current NPCI rules before execution.
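
To give a feel for what such a pre-execution check could look like, here is a hedged sketch. Only the name npciComplianceCheck comes from this case study. The signature, the rule fields, and the two checks are assumptions, and the rule values must come from current NPCI and Razorpay documentation rather than from this example.

```typescript
// Rule values are supplied by the caller. This sketch deliberately hard-codes
// no regulatory numbers.
interface ComplianceRules {
  minNoticeHours: number; // minimum gap between pre-debit notification and debit
  maxAmountPaise: number; // upper bound approved in the mandate
}

interface ScheduledDebit {
  amountPaise: number;
  notifiedAt: Date;
  debitAt: Date;
}

type ComplianceResult = { ok: true } | { ok: false; reason: string };

// Hypothetical signature for the utility named in the text.
function npciComplianceCheck(
  debit: ScheduledDebit,
  rules: ComplianceRules,
): ComplianceResult {
  const msPerHour = 60 * 60 * 1000;
  const noticeHours =
    (debit.debitAt.getTime() - debit.notifiedAt.getTime()) / msPerHour;
  if (noticeHours < rules.minNoticeHours) {
    return { ok: false, reason: "pre-debit notice too short" };
  }
  if (debit.amountPaise > rules.maxAmountPaise) {
    return { ok: false, reason: "amount exceeds mandate limit" };
  }
  return { ok: true };
}
```

Returning a reason string rather than throwing lets the scheduler log why an operation was held back and reschedule it.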

2. DynamoDB Single-Table Design Requires Discipline, Not Dogmatism

Single-table design is powerful but it demands rigorous upfront access pattern analysis. Every time we were tempted to add a new GSI reactively — to serve an access pattern that had not been anticipated — we pushed back and re-examined whether the pattern could be served by existing GSIs with a filter expression, or whether it was actually a reporting access pattern that belongs in a read replica or data warehouse, not the operational table. We ended up with six GSIs against an initial estimate of four. Both additions were justified, but the discipline of questioning each one avoided unnecessary write amplification.

3. Step Functions Callback Patterns Require Careful Timeout Design

The WaitForPayment state's heartbeat timeout took several iterations to calibrate correctly. Too short (under 60 seconds) and you generate false payment timeouts for customers on slow UPI apps or during bank backend slowdowns — which are common during the first few days of every month in India, when salary credits trigger high UPI transaction volumes. Too long and you hold inventory reservations for abandoned checkouts, degrading availability for other customers. We settled on 90 seconds with a 60-second heartbeat check — and implemented the Razorpay payment status fetch at timeout as a safeguard against releasing inventory for payments that actually succeeded.
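
The timeout safeguard can be sketched as a small decision function. The two durations come from the text above. The status values are Razorpay's payment states, but the mapping from status to action, including how an authorised but uncaptured payment is treated, is an assumption made for illustration.

```typescript
// Durations taken from the text above.
const PAYMENT_TIMEOUT_SECONDS = 90;
const HEARTBEAT_SECONDS = 60;

type RazorpayPaymentStatus =
  | "created"
  | "authorized"
  | "captured"
  | "refunded"
  | "failed";

type TimeoutAction = "CONFIRM_ORDER" | "RELEASE_INVENTORY";

// Called when the wait state times out, with the status fetched from the
// gateway (null when no payment attempt exists for the order).
function resolvePaymentTimeout(
  status: RazorpayPaymentStatus | null,
): TimeoutAction {
  // Assumption: money that has been captured or authorised means the customer
  // did pay, so the reservation is kept and the order continues.
  if (status === "captured" || status === "authorized") {
    return "CONFIRM_ORDER";
  }
  return "RELEASE_INVENTORY";
}
```

The key point is that a timeout alone never releases stock. The release happens only after the gateway confirms there is no successful payment.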

4. Webhook Idempotency Is Non-Negotiable

Razorpay (and most payment gateways) may deliver webhooks more than once, either due to retry logic on their end or due to network conditions causing duplicate delivery. Our webhook receiver Lambda implements idempotency by storing the razorpay_payment_id in a DynamoDB table with a TTL of 24 hours before acting on the event. If the payment ID is already present, the webhook is acknowledged (returning HTTP 200) but no state transition is triggered. Without this, duplicate webhooks would attempt to call SendTaskSuccess on an already-completed Step Functions execution, causing unnecessary errors and noise in the monitoring dashboards.
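
A minimal sketch of the claim-then-act flow, using an in-memory map as a stand-in for the DynamoDB table. In DynamoDB the equivalent would typically be a conditional put that fails when the key already exists, which is an assumption here since the case study does not show the actual write. All names below are hypothetical.

```typescript
// In-memory stand-in for the idempotency table with a TTL per key.
class IdempotencyStore {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true only for the first caller to claim a key within the TTL.
  claim(key: string, nowMs: number): boolean {
    const expiry = this.expiresAt.get(key);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.expiresAt.set(key, nowMs + this.ttlMs);
    return true;
  }
}

interface WebhookResult {
  statusCode: 200;
  processed: boolean;
}

function handlePaymentWebhook(
  razorpayPaymentId: string,
  store: IdempotencyStore,
  onFirstDelivery: (paymentId: string) => void,
  nowMs: number,
): WebhookResult {
  if (!store.claim(razorpayPaymentId, nowMs)) {
    // Duplicate delivery: acknowledge so the gateway stops retrying.
    return { statusCode: 200, processed: false };
  }
  onFirstDelivery(razorpayPaymentId); // e.g. resume the waiting workflow
  return { statusCode: 200, processed: true };
}
```

Both the first and the duplicate delivery return HTTP 200, but only the first one triggers the state transition.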

5. EventBridge vs SQS: Choose Based on Fan-Out vs Ordering

We use EventBridge where multiple independent consumers need to react to the same event (an order.placed event triggers the notification service, the analytics ingestion Lambda, and the loyalty points accrual function independently). We use SQS where ordering and exactly-once processing are required (the inventory deduction queue uses FIFO to ensure that concurrent orders for the same SKU are processed in sequence, preventing race conditions in the conditional write). Mixing up these two tools, using EventBridge where you need ordering or SQS where you need fan-out, is a common serverless architecture mistake that becomes painful under load.
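
The per-SKU ordering on the FIFO queue can be sketched as a message builder. The field names match the SQS SendMessage request. The choice of group key (store plus SKU) and deduplication key (order plus SKU) is an assumption about how such a queue could be keyed, not a description of the production code.

```typescript
interface InventoryDeduction {
  orderId: string;
  storeId: string;
  sku: string;
  quantity: number;
}

// Field names follow the SQS SendMessage request for FIFO queues.
interface FifoMessage {
  QueueUrl: string;
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

function toFifoMessage(
  queueUrl: string,
  deduction: InventoryDeduction,
): FifoMessage {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(deduction),
    // Messages sharing a group ID are delivered in order, one at a time,
    // so deductions for the same SKU in the same store are serialised.
    MessageGroupId: `${deduction.storeId}#${deduction.sku}`,
    // A retried send for the same order line is dropped as a duplicate.
    MessageDeduplicationId: `${deduction.orderId}#${deduction.sku}`,
  };
}
```

Grouping by store and SKU rather than by store alone keeps unrelated SKUs flowing in parallel while still serialising contention on a single item.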

6. CloudFront Cache Invalidation Costs at Scale

AWS provides 1,000 free CloudFront invalidation paths per month; beyond that, each path costs $0.005. A naive implementation that invalidates the full distribution on any product update would have driven significant charges given the frequency of product image and price changes. We addressed this by using path-specific invalidations (only the changed SKU's image paths) and by implementing a batch invalidation pattern: collecting invalidation requests over a 60-second window and submitting them as a single CreateInvalidation API call with multiple paths, rather than one API call per changed product.
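
The batching step can be sketched as a pure function that turns the paths collected during one window into a single request. The object mirrors the shape of a CloudFront CreateInvalidation request. The distribution ID, the path layout, and the caller reference format are hypothetical.

```typescript
// Shape follows the CloudFront CreateInvalidation request.
interface InvalidationRequest {
  DistributionId: string;
  InvalidationBatch: {
    CallerReference: string;
    Paths: { Quantity: number; Items: string[] };
  };
}

// Collapses everything queued during one window into a single request.
// Returns null when nothing changed, so no API call is made at all.
function buildBatchInvalidation(
  distributionId: string,
  pendingPaths: string[],
  windowStartMs: number,
): InvalidationRequest | null {
  // The same SKU may change several times within a window, so deduplicate.
  const unique = Array.from(new Set(pendingPaths)).sort();
  if (unique.length === 0) return null;
  return {
    DistributionId: distributionId,
    InvalidationBatch: {
      // Must be unique per request; the window start is a convenient choice.
      CallerReference: `batch-${windowStartMs}`,
      Paths: { Quantity: unique.length, Items: unique },
    },
  };
}
```

Deduplicating before submission matters for cost as well as call volume, because billing is per path rather than per request.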

Work with Xortrix AI

Building a commerce product that needs to work at India scale?

We bring deep experience in serverless backend architecture, India payments integration (UPI, Razorpay, Cashfree, PhonePe), DynamoDB data modelling, and the specific operational constraints of hyperlocal and D2C commerce. If you're outgrowing your current stack or building from scratch, let's talk.