Case Study — E-Commerce Infrastructure
Serverless E-Commerce Backend
How we replaced a rigid Shopify-plus-plugin stack with a fully serverless, event-driven commerce platform that handles 15-minute grocery delivery across 12 dark stores in Pune — and processed 2.3 million orders in its first six months of production.
2.3M
Orders in first 6 months
99.99%
Checkout availability
<200ms
Product search p99
40%
Lower infra cost vs prior stack
Client Overview
The client is a Pune-based direct-to-consumer grocery and daily essentials startup founded in 2022 with a single, aggressive promise: delivery in 15 minutes or less, to any pin code they serve, at prices competitive with the neighbourhood kirana store. The company aimed to be the digital-native successor to the ubiquitous family-run provision shops that have served Indian households for generations.
By the time the company engaged Xortrix AI in early 2023, they had already validated the concept with a lean operations team, a WhatsApp-based ordering flow, and three dark stores in the PCMC corridor. Monthly order volume was growing at roughly 30% MoM, and their improvised technical stack — a Shopify store stitched together with third-party plugins and a manually managed Google Sheet for stock levels — was visibly buckling under the load.
The business model is hyperlocal and inventory-intensive. The company stocks approximately 4,200 SKUs across categories including fresh produce, dairy, packaged foods, household cleaning supplies, and personal care. Each dark store carries a subset of that catalogue calibrated to the demand profile of its catchment area. Pricing adjusts dynamically based on expiry, overstock, time of day, and vendor-negotiated promotions — a complexity that Shopify's native pricing model was never designed to handle.
Business Snapshot
Why They Came to Xortrix AI
The Challenge
When the founders first described their stack, the conversation was illuminating. They had a Shopify Plus subscription, seven third-party apps bolted onto it (covering inventory, loyalty, discounts, delivery slots, and order routing), a Razorpay standard checkout integration that had been set up by a freelancer, and a Python script running on a DigitalOcean droplet that polled their WMS spreadsheet every five minutes and tried to push inventory updates to Shopify via the REST API.
It had worked at 1,000 orders a month. At 45,000 and accelerating, it was a liability.
Shopify's Structural Limitations for This Use Case
Shopify is an excellent platform for standard e-commerce. The client was not standard e-commerce. The first structural problem was inventory location granularity: Shopify's inventory model allows stock to be associated with a "location," but the pricing, product availability, and fulfilment routing logic the client needed was far more nuanced than Shopify's location system could express. A SKU might be available at the Baner dark store but not Wakad; it might be priced differently at each; and its visibility to a customer should depend on their real-time delivery pin code, not a static location assignment.
The second structural problem was Shopify's API rate limits. The platform caps REST API calls at 2 requests per second for most endpoints, with a burst allowance that drains quickly under load. The inventory sync script frequently hit rate limits during peak hours (7–9 AM and 6–9 PM), causing inventory updates to lag by up to 25 minutes. Customers were placing orders for items that were physically out of stock at their assigned dark store, leading to cancellations, refund processing delays, and mounting negative reviews.
The third problem was the checkout and payment layer. Shopify Payments is not available in India. The Razorpay integration in place worked for standard UPI and card payments but had no support for UPI Autopay — the NPCI-mandated mandate-based recurring payment system that the company wanted to use for their "Weekly Essentials Box" subscription product. Building recurring commerce on top of Shopify required a level of customisation that would have meant forking their checkout experience entirely.
Custom Pricing Engine Requirements
The client's pricing rules were sophisticated in ways that no off-the-shelf engine supports out of the box. Prices needed to reflect: base vendor cost plus margin target per SKU category; time-of-day adjustments (fresh produce priced lower after 6 PM to clear stock before the next morning delivery); proximity to best-before dates (items within 48 hours of expiry marked down by a configurable percentage); competitor price parity checks running every 30 minutes against two local competitors; and vendor-push promotional prices that arrived via a webhook from their procurement system. The result was a pricing matrix that could change dozens of times per hour per SKU, across multiple dark stores simultaneously.
UPI and Payment Gateway Complexity
India's payment landscape is unique. UPI accounts for over 80% of digital transactions in the grocery segment, and within UPI, there are multiple flows: collect requests, intent flows, QR-based payments, and the newer UPI Autopay (e-mandate) system. The company wanted to offer customers the ability to set up a standing instruction for weekly grocery boxes — deducted automatically every Sunday at 6 AM — without requiring any action on the customer's part beyond the initial mandate setup. This is technically non-trivial: Razorpay's UPI Autopay requires specific mandate creation flows, NPCI's e-mandate standards, and debit execution timing windows that are distinct from the standard Razorpay checkout. None of this was reachable within the existing Shopify integration.
Summary of Constraints Going In
Technical
Operational
Solution Architecture
After two weeks of discovery — including on-site visits to two of the client's dark stores and detailed mapping of their order lifecycle from customer tap to rider pickup — we concluded that the right architecture was fully serverless, event-driven, and built entirely on AWS. The reasoning was straightforward: the company's traffic pattern is highly spiky (sharp peaks at breakfast and dinner, relatively quiet in between), their team is lean and cannot carry operational overhead for managing servers or clusters, and the product requires extreme reliability at checkout — because a failed checkout in a 15-minute delivery context means the customer simply orders from a competitor.
The architecture we designed and built has five primary layers: the product and catalogue layer, the inventory layer, the order orchestration layer, the pricing layer, and the payments layer. Each layer is independently deployable, independently scalable, and communicates with the others through a combination of DynamoDB Streams, EventBridge events, and direct Lambda invocations where latency demands it.
Core AWS Services Used
AWS Lambda
All compute — 34 functions covering catalogue, inventory, pricing, orders, payments, notifications, and admin operations. Node.js 20 runtime with Graviton2 ARM architecture for cost efficiency.
Amazon DynamoDB
Primary datastore. Single-table design with composite keys and 6 GSIs serving the catalogue, inventory, orders, customer, and session access patterns.
AWS Step Functions
Order orchestration state machine. Manages the full lifecycle from order placement through payment confirmation, inventory deduction, dark store routing, packing, dispatch, and delivery confirmation.
Amazon EventBridge
Event bus for asynchronous domain events: inventory.updated, order.placed, order.dispatched, payment.captured, mandate.created. Decouples producers from consumers across all layers.
Amazon S3 + CloudFront
Product image CDN. 4,200 SKUs with multiple image variants (thumbnail, grid, PDP hero, zoom). CloudFront distributions with fine-grained cache control by image type.
Amazon SQS
Dead-letter queues for all Lambda-triggered event consumers. FIFO queues for stock deduction to guarantee exactly-once processing and ordering integrity.
AWS API Gateway
HTTP API (not REST API — lower latency, lower cost) for customer-facing endpoints. Separate REST API for internal admin and WMS webhook receivers.
Amazon ElastiCache (Redis)
Hot path caching for product search results, category listings, and the dynamic price cache. TTLs tuned per data type — price cache at 60 seconds, category listings at 5 minutes.
The Frontend Decoupling Decision
The customer-facing app (a React Native mobile application and a Next.js web storefront) communicates exclusively with our API layer — there is no direct dependency on any commerce SaaS platform. This was a deliberate architecture decision: we wanted the client to own their commerce logic completely, with no third-party platform able to throttle, rate-limit, or alter the behaviour of their checkout. The API layer is the only integration surface, and it is entirely under the company's control.
Inventory Management: DynamoDB Single-Table Design
The most consequential architectural decision in the entire project was how to model inventory in DynamoDB. Inventory in a hyperlocal multi-store grocery context is not a simple integer counter against a SKU. Each SKU has a stock level per dark store, a reserved quantity (items in active orders not yet confirmed), an available quantity (stock minus reserved), a reorder threshold, and a series of batch records tracking expiry dates and supplier lot numbers. Getting this wrong at the data model level would cascade into overselling, incorrect availability signals, and operational chaos.
We chose DynamoDB single-table design — one table for the entire commerce domain — following the access-pattern-first methodology described by Rick Houlihan. The primary key structure uses a composite partition key and sort key, where the partition key encodes the entity type and identifier, and the sort key encodes the sub-entity or facet being stored.
Key Access Patterns and GSI Design
The table serves 18 distinct access patterns across five entity types. We designed six Global Secondary Indexes (GSIs) to serve these patterns without requiring table scans. The GSI designs required careful thought because DynamoDB GSIs are eventually consistent and carry a write amplification cost — every write to the base table that touches a GSI-projected attribute incurs an additional write unit per GSI. With high-frequency inventory updates, every unnecessary GSI adds cost at scale.
GSI-1: StoreInventory
PK: storeId | SK: skuId
Serves the primary picker screen in the dark store WMS app: list all SKUs in a specific store with current stock and reserved quantities. Queries the partition for a given storeId (a Query against the index, not a table Scan) to render the full store catalogue.
GSI-2: CategorySearch
PK: categoryId | SK: GSI2SK (sortKey for name/price/popularity)
Powers category browse pages with server-side sorting by name, price, or popularity rank. Allows efficient pagination via LastEvaluatedKey without loading the full catalogue.
GSI-3: CustomerOrders
PK: customerId | SK: createdAt (ISO-8601)
Serves the customer order history screen sorted by recency. Supports filtering by status via a filter expression on the projected orderStatus attribute.
GSI-4: OrdersByStore
PK: storeId | SK: createdAt
Feeds the dark store operations dashboard — active orders assigned to a specific store, sorted by placement time, with picker assignment status projected.
GSI-5: ExpiryTracking
PK: storeId | SK: bestBefore (ISO-8601 date)
Drives the expiry-aware pricing engine. A scheduled Lambda queries this GSI nightly to identify batches expiring within 48 hours and triggers the markdown pricing flow.
GSI-6: MandateIndex
PK: customerId | SK: mandateStatus
Supports UPI Autopay mandate management — list all active mandates for a customer, query by status for the weekly debit scheduler, and identify mandates approaching renewal.
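To make the GSI-5 access pattern concrete, here is a minimal sketch of the Query input the nightly expiry job might build. It assumes the document-client form of the API (plain JavaScript values), and the table name, index name, and the `buildExpiryQuery` helper are hypothetical; only the `storeId` and `bestBefore` key attributes come from the index definition above.

```typescript
// Shape mirrors a DynamoDB Query input in document-client form (plain values).
interface ExpiryQueryInput {
  TableName: string;
  IndexName: string;
  KeyConditionExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, string>;
}

// Builds a query for batches in one store whose bestBefore date falls on or
// before (now + windowHours). ISO-8601 dates sort lexicographically, so a
// plain string comparison on the sort key is enough.
function buildExpiryQuery(storeId: string, now: Date, windowHours = 48): ExpiryQueryInput {
  const cutoff = new Date(now.getTime() + windowHours * 60 * 60 * 1000);
  return {
    TableName: "commerce", // hypothetical table name
    IndexName: "GSI5-ExpiryTracking", // hypothetical index name
    KeyConditionExpression: "#pk = :storeId AND #sk <= :cutoff",
    ExpressionAttributeNames: { "#pk": "storeId", "#sk": "bestBefore" },
    ExpressionAttributeValues: {
      ":storeId": storeId,
      ":cutoff": cutoff.toISOString().slice(0, 10), // YYYY-MM-DD in UTC
    },
  };
}
```

Because the partition key is the store, the job would issue one such query per dark store rather than scanning the table.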
Real-Time Stock Sync Across 12 Dark Stores via DynamoDB Streams
The mechanism that replaced the five-minute polling script is a DynamoDB Streams consumer. Every write to the inventory records in the base table, whether triggered by an order deduction, a WMS receiving flow, or a manual stock correction, emits a stream event carrying the old and new images of the record. A Lambda function subscribed to this stream processes the event in near-real-time (typically within 200–500ms) and fans out two downstream actions: it updates the Redis price cache if the stock-level change crosses any threshold that triggers a pricing rule, and it publishes an inventory.updated event to EventBridge so that any downstream consumer (the search index refresher, the recommendation engine, the dark store operations dashboard) can react independently.
Stock deduction on order placement uses a conditional write pattern: DynamoDB's ConditionExpression ensures that the deduction only succeeds if the available quantity is greater than or equal to the ordered quantity. If the condition fails, meaning stock was concurrently depleted by another order, the write fails with a ConditionalCheckFailedException, which our Lambda catches and handles by returning an out-of-stock response to the checkout flow rather than proceeding to payment. This prevents overselling at the data layer without requiring distributed locks.
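The conditional write can be sketched as follows. This is an illustration under assumptions, not the production code: the table name, the `STORE#`/`SKU#` key format, and both helper names are hypothetical, and the input is shown in document-client form so the block needs no SDK import.

```typescript
// Input shape for a DynamoDB UpdateItem call in document-client form.
interface StockDeductionInput {
  TableName: string;
  Key: Record<string, string>;
  UpdateExpression: string;
  ConditionExpression: string;
  ExpressionAttributeValues: Record<string, number>;
}

// Builds an update that subtracts qty from availableQty, guarded so that it
// only applies when enough stock is still available.
function buildStockDeduction(storeId: string, skuId: string, qty: number): StockDeductionInput {
  if (!Number.isInteger(qty) || qty <= 0) {
    throw new RangeError("qty must be a positive integer");
  }
  return {
    TableName: "commerce", // hypothetical table name
    Key: { PK: `STORE#${storeId}`, SK: `SKU#${skuId}` }, // hypothetical key format
    UpdateExpression: "SET availableQty = availableQty - :qty",
    ConditionExpression: "availableQty >= :qty",
    ExpressionAttributeValues: { ":qty": qty },
  };
}

// True when an error thrown by the write signals a failed condition, which
// the checkout flow maps to an out-of-stock response.
function isOutOfStock(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "ConditionalCheckFailedException"
  );
}
```

The check and the decrement happen in one atomic request, which is why no separate read or lock is needed.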
Reserved quantity tracking, which prevents double-selling during the window between order placement and payment confirmation, uses a two-phase commit pattern. On order placement, we atomically increment the reservedQty field and decrement availableQty using a DynamoDB transactional write. If payment fails or times out, the Step Functions state machine triggers a compensation step that reverses the reservation. If payment succeeds, reservedQty is decremented and the item is considered physically allocated to the order.
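The three transitions described above (reserve, release on failure, confirm on success) can be modelled as pure functions over a stock record. This is an in-memory sketch of the bookkeeping only; the function names are hypothetical, and in the real system each transition would be a conditional or transactional DynamoDB write.

```typescript
interface StockRecord {
  readonly availableQty: number;
  readonly reservedQty: number;
}

// Order placed: move qty from available to reserved, or fail if short.
function reserve(stock: StockRecord, qty: number): StockRecord {
  if (stock.availableQty < qty) throw new Error("OUT_OF_STOCK");
  return { availableQty: stock.availableQty - qty, reservedQty: stock.reservedQty + qty };
}

// Payment failed or timed out: undo the reservation (the compensation step).
function releaseReservation(stock: StockRecord, qty: number): StockRecord {
  if (stock.reservedQty < qty) throw new Error("NOTHING_TO_RELEASE");
  return { availableQty: stock.availableQty + qty, reservedQty: stock.reservedQty - qty };
}

// Payment succeeded: the reserved units are now allocated to the order.
function confirmAllocation(stock: StockRecord, qty: number): StockRecord {
  if (stock.reservedQty < qty) throw new Error("NOTHING_TO_CONFIRM");
  return { availableQty: stock.availableQty, reservedQty: stock.reservedQty - qty };
}
```

Starting from 10 available and 0 reserved, reserving 3 gives 7 and 3; a release returns to 10 and 0, while a confirmation leaves 7 and 0.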
Dynamic Pricing Engine
The pricing engine is a Lambda-based microservice that computes the effective selling price for any SKU at any dark store at any point in time. It is invoked synchronously during catalogue API calls and at checkout — and its output is cached in Redis to avoid repeated computation under load.
The engine resolves prices through a deterministic priority chain. At the highest priority: active vendor-push promotions, which arrive via webhook from the client's procurement system and are stored as time-bounded records in DynamoDB. Next: expiry-aware markdowns, calculated by the nightly batch process against GSI-5. Then: time-of-day rules (fresh produce discounts after 18:00, configured per category). Then: competitor-parity overrides, sourced from a price-scraping Lambda that runs every 30 minutes and writes comparison signals to DynamoDB. Finally: the base price from the master catalogue record.
Each pricing layer writes a pricingContext object alongside the effective price: a structured record that explains which rule applied and why. This context is stored with every order line item, which gives the client's finance team a complete audit trail of exactly what price was charged and what rule drove it. This proved invaluable during a promotion misconfiguration incident in month three, when the team could trace every mispriced order in under ten minutes.
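The priority chain and the accompanying context record can be sketched as below. The rule order follows the description above; the type names, the field names inside the context, and the use of integer paise for prices are assumptions made for illustration.

```typescript
type RuleSource =
  | "VENDOR_PROMO"
  | "EXPIRY_MARKDOWN"
  | "TIME_OF_DAY"
  | "COMPETITOR_PARITY"
  | "BASE";

interface PriceCandidate {
  source: Exclude<RuleSource, "BASE">;
  pricePaise: number; // integer paise avoids floating-point rounding
  reason: string;
}

interface ResolvedPrice {
  pricePaise: number;
  pricingContext: { appliedRule: RuleSource; reason: string };
}

// Highest priority first, matching the chain described in the text.
const PRIORITY: Array<PriceCandidate["source"]> = [
  "VENDOR_PROMO",
  "EXPIRY_MARKDOWN",
  "TIME_OF_DAY",
  "COMPETITOR_PARITY",
];

// Returns the first candidate found in priority order, else the base price.
function resolvePrice(basePricePaise: number, candidates: PriceCandidate[]): ResolvedPrice {
  for (const source of PRIORITY) {
    const hit = candidates.find((c) => c.source === source);
    if (hit) {
      return {
        pricePaise: hit.pricePaise,
        pricingContext: { appliedRule: hit.source, reason: hit.reason },
      };
    }
  }
  return {
    pricePaise: basePricePaise,
    pricingContext: { appliedRule: "BASE", reason: "master catalogue price" },
  };
}
```

Note that the chain picks the highest-priority rule rather than the lowest price, so an expiry markdown wins over a time-of-day rule even when the latter would be cheaper.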
Redis Caching Strategy for the Price Layer
The pricing Lambda can be invoked thousands of times per minute during peak traffic. Running the full priority chain for every invocation against DynamoDB would be unnecessarily expensive and slow. We cache the resolved price (and the pricingContext) in ElastiCache Redis under a key of price:{storeId}:{skuId} with a TTL of 60 seconds for standard SKUs and 10 seconds for any SKU currently under an active time-of-day or expiry promotion.
When a pricing rule changes, because a vendor webhook fires, a markdown batch runs, or a competitor parity update triggers, the pricing engine publishes a pricing.updated EventBridge event, and a separate Lambda subscriber performs a targeted Redis key invalidation for the affected SKU-store combinations. This proactive invalidation means that staleness after a rule change is bounded by the latency of the EventBridge delivery (typically sub-second) rather than by the wall-clock TTL of 10 or 60 seconds.
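A small sketch of the cache key scheme, the TTL choice, and the key list a targeted invalidation would delete. The key format and TTL values come from the text; the function names and the shape of the affected-pair input are assumptions.

```typescript
interface StoreSku {
  storeId: string;
  skuId: string;
}

// Key format as described: price:{storeId}:{skuId}
function priceCacheKey(storeId: string, skuId: string): string {
  return `price:${storeId}:${skuId}`;
}

// 10 seconds for SKUs under an active time-of-day or expiry promotion,
// 60 seconds otherwise.
function priceCacheTtlSeconds(underActivePromotion: boolean): number {
  return underActivePromotion ? 10 : 60;
}

// De-duplicated list of Redis keys to delete for a pricing.updated event.
function keysToInvalidate(affected: StoreSku[]): string[] {
  const seen = new Set<string>();
  for (const { storeId, skuId } of affected) {
    seen.add(priceCacheKey(storeId, skuId));
  }
  return Array.from(seen);
}
```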
Order Orchestration: The Step Functions State Machine
The centrepiece of the backend is the order orchestration state machine built in AWS Step Functions (Express Workflows — chosen for their high throughput and per-execution pricing model, which is significantly cheaper than Standard Workflows for short-lived, high-volume executions like order processing).
Every order placement triggers a new state machine execution. The machine manages the order from placement through to delivery confirmation, with explicit states for each logical step and well-defined transition conditions, retry policies, and compensation paths for failures. Using Step Functions rather than a chain of Lambda functions was a deliberate choice: it gives us durable execution state, built-in retry and backoff for transient failures, and a visual execution history that makes debugging failed orders trivially easy — the ops team can inspect any failed order and see exactly which state it failed in, without diving into CloudWatch logs.
State Machine Design
ValidateCart
Confirms all items in the cart are still available at the customer's assigned dark store. Re-runs the availability check at order placement time (not just add-to-cart) to catch concurrent depletions. Returns a cleaned cart if any items have become unavailable.
ReserveInventory
Executes the DynamoDB transactional write to reserve stock for all line items simultaneously. Uses a single TransactWriteItems call for atomicity: either all reservations succeed or none do. If any item's condition check fails (TransactWriteItems surfaces this as a TransactionCanceledException carrying a ConditionalCheckFailed cancellation reason), transitions to the CartConflict state, which guides the customer to update their cart.
InitiatePayment
Creates a Razorpay order via the Razorpay Orders API and returns the order ID and payment options to the client. The state machine then waits in a callback pattern (using .waitForTaskToken) for the payment webhook to resume execution.
WaitForPayment
Heartbeat state using Step Functions' callback pattern. The execution is suspended here until Razorpay fires a payment.captured or payment.failed webhook. A separate Lambda receives the webhook, validates the Razorpay signature, and resumes the execution with the payment result using SendTaskSuccess or SendTaskFailure.
ConfirmPayment
Validates the payment amount matches the order total (guards against partial-amount exploits). Updates the order status to PAYMENT_CONFIRMED in DynamoDB and publishes the order.paid EventBridge event. Converts reserved inventory to allocated inventory.
RouteToStore
Runs the store routing algorithm — for the early multi-store era this was straightforward pin code to store mapping, but it now supports split orders across stores for items not available at the primary store. Assigns a picker and creates the pick list in the WMS.
AwaitPacking
Waits for the WMS to emit an order.packed event confirming all items have been picked and the package is ready. Monitors for SLA breach: if packing takes more than 8 minutes (for a 15-minute total promise), it fires an alert to the store manager.
AssignRider
Calls the rider assignment Lambda, which interfaces with the client's delivery management system to assign the nearest available rider and compute the estimated delivery time.
AwaitDelivery
Long-running wait state (up to 60 minutes) for the delivery completion event from the rider app. On delivery confirmation, updates order status, releases any over-reserved inventory, and triggers the post-delivery flow (review request, loyalty points accrual).
HandleFailure
Catch-all failure state reached via any uncaught exception. Releases inventory reservations, initiates refund if payment was captured, sends customer notification, and creates an ops ticket for manual review if the failure is in the post-payment flow.
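The HandleFailure behaviour described above can be expressed as a small planning function. This is a sketch: the action names are hypothetical, and treating ConfirmPayment and every later state as the "post-payment flow" is an assumption about where that boundary sits.

```typescript
type OrderState =
  | "ValidateCart"
  | "ReserveInventory"
  | "InitiatePayment"
  | "WaitForPayment"
  | "ConfirmPayment"
  | "RouteToStore"
  | "AwaitPacking"
  | "AssignRider"
  | "AwaitDelivery";

type Compensation =
  | "RELEASE_RESERVATIONS"
  | "INITIATE_REFUND"
  | "NOTIFY_CUSTOMER"
  | "CREATE_OPS_TICKET";

// Assumption: these states make up the post-payment flow.
const POST_PAYMENT_STATES: OrderState[] = [
  "ConfirmPayment",
  "RouteToStore",
  "AwaitPacking",
  "AssignRider",
  "AwaitDelivery",
];

// Lists the compensation steps for a failure, in execution order.
function planCompensation(failedIn: OrderState, paymentCaptured: boolean): Compensation[] {
  const steps: Compensation[] = ["RELEASE_RESERVATIONS"];
  if (paymentCaptured) steps.push("INITIATE_REFUND");
  steps.push("NOTIFY_CUSTOMER");
  if (POST_PAYMENT_STATES.indexOf(failedIn) !== -1) steps.push("CREATE_OPS_TICKET");
  return steps;
}
```

Keeping the plan as data makes each step easy to run idempotently and to log against the order for later review.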
Razorpay Integration: UPI, Cards, and UPI Autopay
Payments was the most India-specific and therefore most detail-intensive part of the integration. Indian consumers expect UPI to be the primary payment method, expect it to be fast (sub-5 seconds for a standard UPI collect), and are acutely sensitive to payment failures — a failed payment is often interpreted as the app being unreliable rather than the bank being slow, which means every payment failure is a potential customer churn event.
Standard UPI and Card Payments
For standard payments we use Razorpay's Orders API to create a server-side order record before rendering the payment UI on the client. This server-side order creation is important: it means the payment amount is set and validated server-side before any client interaction, preventing amount tampering. The Razorpay order ID is passed to the client, which invokes the Razorpay checkout SDK. Payment completion fires a webhook to our /webhooks/razorpay Lambda endpoint, which validates the signature using HMAC-SHA256 (Razorpay's webhook signature scheme), then resumes the Step Functions execution via SendTaskSuccess with the payment details.
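The signature check can be sketched with Node's built-in crypto module. It assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body keyed with the webhook secret, delivered in the X-Razorpay-Signature header; the function name is hypothetical, and the handler must pass the body exactly as received, before any JSON parsing.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true only when the header value matches the HMAC of the raw body.
function verifyRazorpaySignature(
  rawBody: string,
  signatureHeader: string,
  webhookSecret: string,
): boolean {
  const expected = createHmac("sha256", webhookSecret).update(rawBody).digest("hex");
  const expectedBytes = Buffer.from(expected, "utf8");
  const receivedBytes = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    expectedBytes.length === receivedBytes.length &&
    timingSafeEqual(expectedBytes, receivedBytes)
  );
}
```

The constant-time comparison avoids leaking how many leading characters of a forged signature were correct.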
UPI payment timeouts are handled explicitly: if the Step Functions WaitForPayment state reaches its heartbeat timeout (90 seconds from payment initiation) without receiving a webhook, the machine transitions to a PaymentTimeout state. Because the customer may have approved the payment in their UPI app while the bank's confirmation was delayed, this state first performs a Razorpay payment fetch to check the actual payment status, which catches late-arriving bank confirmations. Only if the payment has not gone through does it release the inventory reservations and allow the customer to retry.
UPI Autopay (e-Mandate) for Subscription Orders
The "Weekly Essentials Box" subscription is powered by Razorpay's UPI Autopay product, which implements the NPCI e-mandate standard. The mandate setup flow is a one-time process: the customer selects their subscription box, is redirected to a Razorpay-hosted mandate creation page, authenticates their UPI ID, and authorises a standing debit of up to the specified amount every week. Razorpay returns a mandate token, which we store against the customer record in DynamoDB under GSI-6.
Weekly debit execution runs as a scheduled Lambda triggered by EventBridge Scheduler at 5:45 AM every Sunday. The Lambda queries GSI-6 for all active mandates due for execution, creates Razorpay subscription orders for each, and initiates the debit via Razorpay's POST /v1/payments/create/recurring endpoint. NPCI requires that recurring UPI debits be executed between 00:00 and 23:00 and that the customer receive a pre-debit notification at least 24 hours before execution; our Saturday 6 AM notification Lambda satisfies this requirement.
Failed mandate debits (insufficient funds, bank downtime, expired UPI ID) are handled with a two-retry policy: a first retry 4 hours after the failure, then a final retry the next morning. If both retries fail, the mandate is suspended and the customer receives a push notification with a deep link to reactivate their subscription.
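The retry policy can be sketched as a function from the number of failed attempts to the next action. Two things here are assumptions rather than facts from the text: that suspension happens once the original debit and both retries have failed, and that "next morning" means the next 06:00 IST (00:30 UTC). The names are hypothetical.

```typescript
type DebitAction = { kind: "RETRY"; at: Date } | { kind: "SUSPEND_MANDATE" };

const FOUR_HOURS_MS = 4 * 60 * 60 * 1000;

// failedAttempts counts the original debit plus any retries that have failed.
function nextDebitAction(failedAttempts: number, lastFailureAt: Date): DebitAction {
  if (failedAttempts < 1) throw new RangeError("failedAttempts must be at least 1");
  if (failedAttempts === 1) {
    // First retry: 4 hours after the original failure.
    return { kind: "RETRY", at: new Date(lastFailureAt.getTime() + FOUR_HOURS_MS) };
  }
  if (failedAttempts === 2) {
    // Final retry: next occurrence of 06:00 IST, which is 00:30 UTC (assumed hour).
    const at = new Date(
      Date.UTC(
        lastFailureAt.getUTCFullYear(),
        lastFailureAt.getUTCMonth(),
        lastFailureAt.getUTCDate(),
        0,
        30,
      ),
    );
    if (at.getTime() <= lastFailureAt.getTime()) at.setUTCDate(at.getUTCDate() + 1);
    return { kind: "RETRY", at };
  }
  return { kind: "SUSPEND_MANDATE" };
}
```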
Product Catalogue, Search, and the Image CDN
With 4,200 active SKUs and catalogue updates arriving multiple times daily from the procurement system, product catalogue management needed to be both reliable and fast to update. Catalogue records live in the primary DynamoDB table with a partition key structure that isolates them from the inventory and order entities — enabling efficient batch operations on the catalogue without competing with transactional reads for order processing.
Product Search: ElastiCache + DynamoDB GSI Hybrid
Full-text product search in DynamoDB requires a thoughtful approach because DynamoDB is not a search engine — it does exact key lookups and range queries, not fuzzy text matching. We implemented a hybrid approach: for category browse and filter (which is the dominant customer behaviour in a grocery context), we serve directly from GSI-2 with results cached in Redis. For keyword search, we pre-index a lightweight inverted index of product names and common alternative spellings (Hindi transliterations are important here — "aloo" must find "potato," "dhaniya" must find "coriander") into Redis Sorted Sets, updated whenever the catalogue is modified.
The search Lambda resolves a query by looking up matching SKU IDs from the inverted index in Redis (typically <5ms), then performing a DynamoDB BatchGetItem for the matched SKUs (up to 100 items per batch call) to retrieve full product details. The entire search resolution path, from API Gateway to response, runs in under 80ms cold and under 20ms warm at median latency. P99 latency is under 200ms even at peak load, satisfying the SLA set for search.
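The two-step lookup can be illustrated with an in-memory stand-in for the inverted index. In this sketch a `Map` of token to SKU IDs replaces the Redis structures, multi-word queries are treated as an intersection (an assumption), and `chunk` shows how matched IDs would be split to respect the 100-item limit of a BatchGetItem call. All names are hypothetical.

```typescript
// token -> set of SKU IDs whose name or alias contains that token
const invertedIndex = new Map<string, Set<string>>();

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Indexes a product under its name and any alternative spellings.
function indexProduct(skuId: string, name: string, aliases: string[] = []): void {
  for (const term of [name, ...aliases]) {
    for (const token of tokenize(term)) {
      const skus = invertedIndex.get(token) ?? new Set<string>();
      skus.add(skuId);
      invertedIndex.set(token, skus);
    }
  }
}

// Returns SKU IDs that match every token in the query.
function searchSkuIds(query: string): string[] {
  let result: string[] | undefined;
  for (const token of tokenize(query)) {
    const hits = invertedIndex.get(token) ?? new Set<string>();
    result = result === undefined ? Array.from(hits) : result.filter((id) => hits.has(id));
  }
  return result ?? [];
}

// Splits IDs into groups no larger than the batch limit.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}
```

Indexing "Potato" with the alias "aloo" means a search for either term resolves to the same SKU, which is the transliteration behaviour the text calls out.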
S3 and CloudFront Image Delivery
Product images are stored in S3 with a structured prefix scheme: products/{skuId}/{variant}.webp, where variant is one of: thumb (80×80), grid (240×240), pdp (800×800), and zoom (1600×1600). Images are uploaded by the client's content team through an internal admin tool; a Lambda triggered by the S3 upload event runs Sharp-based image processing to generate all four variants from the original high-resolution image.
CloudFront sits in front of S3 with cache behaviours split by image variant. Thumbnail and grid images, which are served on listing pages and the search results grid, are cached at edge with a 7-day TTL and a long-lived Cache-Control: public, max-age=604800, immutable header. PDP and zoom images are cached for 24 hours with conditional revalidation. When a product image is updated, a Lambda calls CloudFront's CreateInvalidation API with the specific path, rather than blanket invalidating the entire distribution, preserving the cache hit rate for unaffected products.
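The key scheme, per-variant cache headers, and targeted invalidation paths can be sketched together. The variant names, sizes, key format, and the 7-day header come from the text; the exact 24-hour header string for PDP and zoom images and all function names are assumptions.

```typescript
// Variant name -> square edge length in pixels.
const VARIANT_SIZES = { thumb: 80, grid: 240, pdp: 800, zoom: 1600 } as const;
type Variant = keyof typeof VARIANT_SIZES;
const ALL_VARIANTS = Object.keys(VARIANT_SIZES) as Variant[];

// S3 object key: products/{skuId}/{variant}.webp
function imageKey(skuId: string, variant: Variant): string {
  return `products/${skuId}/${variant}.webp`;
}

function cacheControlFor(variant: Variant): string {
  return variant === "thumb" || variant === "grid"
    ? "public, max-age=604800, immutable" // 7 days, as described
    : "public, max-age=86400, must-revalidate"; // 24 hours; exact directives assumed
}

// CloudFront invalidation paths for one SKU; each path starts with "/".
function invalidationPaths(skuId: string): string[] {
  return ALL_VARIANTS.map((variant) => `/${imageKey(skuId, variant)}`);
}
```

Updating one product therefore invalidates four paths and leaves every other cached object untouched.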
Push Notifications via Firebase Cloud Messaging
Customer notifications are handled by a dedicated notification microservice backed by Firebase Cloud Messaging (FCM) for push delivery and AWS SNS for SMS fallback. The notification service consumes EventBridge events: it subscribes to the order.*, payment.*, and mandate.* event patterns and maps each event to a notification template and delivery channel.
FCM device tokens are stored in DynamoDB against the customer record and refreshed on each app open. Multi-device support (a customer using both a phone and a tablet) is handled by storing multiple tokens per customer and fanning out the FCM send across all registered tokens. Token invalidation errors from FCM are handled by deleting the invalid token from DynamoDB — a common requirement that is often overlooked, leading to silently failing notifications and token table bloat.
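Token cleanup after a multi-device send can be sketched as a filter over per-token results. The result shape and function name are hypothetical; the two error codes are the ones the Firebase Admin SDK reports for tokens that are malformed or no longer registered.

```typescript
interface FcmSendResult {
  token: string;
  errorCode?: string; // undefined when the send succeeded
}

// Error codes that mean the token will never work again.
const INVALID_TOKEN_CODES = new Set<string>([
  "messaging/registration-token-not-registered",
  "messaging/invalid-registration-token",
]);

// Returns the tokens that should be deleted from the customer record.
function tokensToDelete(results: FcmSendResult[]): string[] {
  return results
    .filter((r) => r.errorCode !== undefined && INVALID_TOKEN_CODES.has(r.errorCode))
    .map((r) => r.token);
}
```

Transient errors such as a server-unavailable response are deliberately left out of the set, so a temporary FCM problem does not wipe valid tokens.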
Notification delivery is intentionally fire-and-forget — we publish to an SQS queue that the notification Lambda processes, rather than calling FCM synchronously from the event consumer. This ensures that a transient FCM API error or timeout does not propagate back into the order processing flow. The notification SQS queue has a DLQ; failed notification sends are retried up to three times before landing in the DLQ for investigation.
Lambda Cold Start Optimisation
Cold starts are the classic concern with a serverless architecture, and in a checkout flow they are unacceptable — a 2-second cold start on the payment initiation Lambda would meaningfully hurt conversion. We applied a layered cold start mitigation strategy across all latency-sensitive functions.
Graviton2 (ARM64) Architecture
All Lambda functions run on arm64 (Graviton2) rather than x86_64. AWS Graviton2 Lambdas typically initialise 10–20% faster and cost 20% less per GB-second. For a high-throughput system processing millions of invocations monthly, this compounds to meaningful cost savings.
Provisioned Concurrency for Hot-Path Functions
The five most latency-critical functions (catalogue list, product search, cart validate, checkout initiate, and payment webhook receiver) run with provisioned concurrency (PC). PC pre-warms the Lambda execution environment, eliminating cold starts for requests served within the provisioned capacity. PC is scheduled to scale up 15 minutes before the two daily peak windows (06:45 and 17:45) using Application Auto Scaling, and scales back down after the peak subsides to avoid paying for idle PC capacity 24/7.
Minimal Dependency Bundles
Each Lambda function is bundled with esbuild, tree-shaken to include only the code it actually uses. We do not use the monolithic AWS SDK v2 (the Node.js 18 and later Lambda runtimes no longer include it and ship SDK v3 instead). We import only the specific DynamoDB and S3 client classes from AWS SDK v3, which dramatically reduces the initialised module footprint. The cart validation Lambda's bundle is 48KB gzipped, compared to the 12MB+ bundles we commonly see in projects that import the full SDK or use heavyweight ORMs.
Connection Reuse via Execution Context
DynamoDB and Redis client initialisation happens outside the handler function, at module load time. This means connections are initialised once per cold start and reused across warm invocations. The Redis client uses a connection pool of 5 connections, configured to respect ElastiCache's maximum connection count per node.
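The pattern looks like this in outline. The sketch uses a fake in-memory client so it runs without any dependency; `RedisLike`, `connect`, and `connectCount` are hypothetical stand-ins for a real Redis client and its connection setup.

```typescript
interface RedisLike {
  get(key: string): Promise<string | null>;
}

let connectCount = 0;

// Stand-in for an expensive connection setup (TLS handshake, auth, pool warm-up).
function connect(): RedisLike {
  connectCount += 1;
  const store = new Map<string, string>();
  return { get: async (key) => store.get(key) ?? null };
}

// Module scope: runs once per execution environment, i.e. once per cold start.
const redis = connect();

// Handler scope: runs on every invocation and reuses the client above.
async function handler(event: { key: string }): Promise<string | null> {
  return redis.get(event.key);
}
```

However many times `handler` runs in a warm environment, `connect` has been called exactly once.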
Lambda Layers for Shared Utilities
Shared business logic — price resolver utilities, validation schemas, DynamoDB expression builders — is packaged as Lambda Layers. This reduces the surface area that changes per deploy (most functions only need to deploy their own function code, not the shared utilities layer), improving deployment speed and reducing the risk of a utilities layer regression affecting all functions simultaneously.
Results: Six Months in Production
The new backend went live in September 2023 with a phased traffic migration — 10% of orders routing through the new system on day one, scaling to 100% over three weeks. The migration was seamless enough that most customers had no awareness a backend change had occurred. By the end of the first month, the metrics told a clear story.
2.3 Million Orders
Processed in the first six months of production
Zero data loss incidents, zero stuck orders requiring manual DynamoDB intervention. The Step Functions visual debugger was used twice — once to identify a pricing Lambda regression and once to trace a Razorpay webhook replay event. Both resolved in under 20 minutes.
99.99% Checkout Availability
Measured across the 6-month production window
The previous Shopify-based checkout had experienced three multi-hour outages in the six months prior to migration — twice due to Shopify platform incidents, once due to the Razorpay plugin failing to handle a Razorpay API change. The serverless architecture has had zero unplanned checkout outages.
Sub-200ms Product Search (p99)
Down from 1.2 seconds median on Shopify storefront
The hybrid Redis + DynamoDB search approach delivers median search latency of 18ms and p99 of 160ms. The Shopify storefront (with a third-party search app) had a median of 1.2 seconds — search was a known pain point in the old customer experience.
40% Infrastructure Cost Reduction
Versus Shopify Plus + plugins + DigitalOcean droplet
The serverless architecture scales to zero during off-peak hours, eliminating the baseline cost of always-on servers. Combined with Graviton2 pricing and the elimination of Shopify Plus and seven third-party app subscriptions, the monthly infrastructure cost dropped by 40% at the same order volume.
Stock Cancellation Rate: 0.3%
Down from 8.4% pre-migration
The DynamoDB conditional write inventory model with real-time Stream-based sync across all 12 stores eliminated the overselling that was driving cancellations. The 0.3% residual rate is primarily due to physical stock discrepancies discovered during picking.
12 Dark Stores
Scaled from 3 at engagement to 12 at 6 months
Adding a new dark store requires a configuration record in DynamoDB and a new entry in the store routing table. No code changes, no infra provisioning. The first new store after launch went live in 4 hours from decision to first order.
Tech Stack
Every component in this stack was chosen for a specific reason against this specific use case. The decisions below are documented with their rationale because understanding the "why" is as important as knowing the "what" — especially for a team that will own this system long-term.
Compute
- AWS Lambda (Node.js 20, ARM64/Graviton2)
- AWS Step Functions Express Workflows
- EventBridge Scheduler (cron jobs)
Data
- Amazon DynamoDB (single-table, 6 GSIs)
- Amazon ElastiCache for Redis 7.x
- DynamoDB Streams (inventory events)
API & Routing
- Amazon API Gateway HTTP API
- Amazon API Gateway REST API (admin/webhooks)
- AWS WAF (rate limiting, bot protection)
Messaging
- Amazon EventBridge (domain event bus)
- Amazon SQS FIFO (inventory deduction queue)
- Amazon SQS Standard (notification queue + DLQ)
Storage & CDN
- Amazon S3 (product images, exports)
- Amazon CloudFront (image CDN, API cache)
- AWS Lambda@Edge (request normalisation)
Payments
- Razorpay Orders API
- Razorpay UPI Autopay / e-Mandate
- Razorpay Webhook (HMAC-SHA256 signature validation)
Notifications
- Firebase Cloud Messaging (push)
- Amazon SNS (SMS fallback)
- Amazon SES (transactional email receipts)
Observability
- Amazon CloudWatch Logs + Metrics
- AWS X-Ray (distributed tracing)
- CloudWatch Dashboards (ops + business KPIs)
DevOps
- AWS CDK (TypeScript) — all infra as code
- GitHub Actions (CI/CD pipelines)
- esbuild (Lambda bundle optimisation)
Lessons Learned: Serverless E-Commerce in India
After six months of running this system at scale, several lessons emerged that we now apply as defaults on any India-market e-commerce backend engagement.
1. UPI Timing Windows Are Not Optional Constraints
NPCI imposes strict rules on UPI transaction timing, mandate execution windows, and pre-debit notification timelines. These are not soft guidelines: non-compliant debits are rejected by the bank, and repeated violations can result in the Razorpay integration being flagged. Build the compliance logic into the core scheduling system, not as an afterthought. We maintain a dedicated npciComplianceCheck utility that validates any scheduled payment operation against current NPCI rules before execution.
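A minimal sketch of what such a check might look like follows. Only the name npciComplianceCheck comes from the text above; the signature, the rule fields, and the result type are assumptions made for illustration. The rule values are passed in by the caller and should be taken from the current NPCI and bank guidance, not from this example.

```typescript
// All types and field names here are hypothetical.
interface ComplianceRules {
  minPreDebitNoticeMs: number; // minimum gap between notification and debit
  windowStartHourIst: number;  // earliest permitted hour in IST (inclusive)
  windowEndHourIst: number;    // latest permitted hour in IST (exclusive)
}

interface ScheduledDebit {
  debitAt: Date;
  preDebitNotifiedAt: Date | null;
}

type ComplianceResult = { ok: true } | { ok: false; reasons: string[] };

// IST is UTC+05:30 year-round, so a fixed offset is sufficient.
const IST_OFFSET_MS = 5.5 * 60 * 60 * 1000;

function istHour(d: Date): number {
  return new Date(d.getTime() + IST_OFFSET_MS).getUTCHours();
}

function npciComplianceCheck(op: ScheduledDebit, rules: ComplianceRules): ComplianceResult {
  const reasons: string[] = [];
  if (op.preDebitNotifiedAt === null) {
    reasons.push("pre-debit notification not sent");
  } else if (op.debitAt.getTime() - op.preDebitNotifiedAt.getTime() < rules.minPreDebitNoticeMs) {
    reasons.push("pre-debit notice too short");
  }
  const hour = istHour(op.debitAt);
  if (hour < rules.windowStartHourIst || hour >= rules.windowEndHourIst) {
    reasons.push("debit outside execution window");
  }
  return reasons.length === 0 ? { ok: true } : { ok: false, reasons };
}
```

Returning every failed rule, rather than stopping at the first, makes it easier to log why a scheduled operation was held back.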
2. DynamoDB Single-Table Design Requires Discipline, Not Dogmatism
Single-table design is powerful but it demands rigorous upfront access pattern analysis. Every time we were tempted to add a new GSI reactively — to serve an access pattern that had not been anticipated — we pushed back and re-examined whether the pattern could be served by existing GSIs with a filter expression, or whether it was actually a reporting access pattern that belongs in a read replica or data warehouse, not the operational table. We ended up with six GSIs against an initial estimate of four. Both additions were justified, but the discipline of questioning each one avoided unnecessary write amplification.
3. Step Functions Callback Patterns Require Careful Timeout Design
The WaitForPayment state's heartbeat timeout took several iterations to calibrate correctly. Too short (under 60 seconds) and you generate false payment timeouts for customers on slow UPI apps or during bank backend slowdowns, which are common during the first few days of every month in India, when salary credits trigger high UPI transaction volumes. Too long and you hold inventory reservations for abandoned checkouts, degrading availability for other customers. We settled on a 90-second timeout with a 60-second heartbeat check. At timeout, we fetch the payment status from Razorpay before releasing anything, as a safeguard against releasing inventory for payments that actually succeeded.
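The timeout design can be sketched in two parts: a callback-style state definition and the decision taken when the timeout fires. Both are illustrative. Mapping the figures above to `TimeoutSeconds: 90` and `HeartbeatSeconds: 60` is one reading of them. Apart from WaitForPayment, the state and Lambda names are hypothetical, as is the choice to hold (rather than release) when a payment is authorized but not yet captured.

```typescript
// A wait state in Amazon States Language shape, written as a plain object.
const waitForPaymentState = {
  Type: "Task",
  // The .waitForTaskToken suffix pauses the execution until a callback arrives.
  Resource: "arn:aws:states:::lambda:invoke.waitForTaskToken",
  Parameters: {
    FunctionName: "register-payment-callback", // hypothetical Lambda name
    Payload: {
      "orderId.$": "$.orderId",
      "taskToken.$": "$$.Task.Token",
    },
  },
  TimeoutSeconds: 90,
  HeartbeatSeconds: 60,
  // States.Timeout covers both the overall timeout and a missed heartbeat.
  Catch: [{ ErrorEquals: ["States.Timeout"], Next: "CheckPaymentAtTimeout" }],
  Next: "ConfirmOrder",
};

// Payment statuses as reported by Razorpay's payment entity.
type RazorpayPaymentStatus = "created" | "authorized" | "captured" | "refunded" | "failed";
type TimeoutAction = "CONFIRM_ORDER" | "HOLD_AND_RECHECK" | "RELEASE_INVENTORY";

// Decides what to do with the reservation once the wait state has timed out.
// `status` is null when no payment attempt was found for the order.
function resolvePaymentTimeout(status: RazorpayPaymentStatus | null): TimeoutAction {
  switch (status) {
    case "captured":
      return "CONFIRM_ORDER"; // money moved: never release the stock
    case "authorized":
      return "HOLD_AND_RECHECK"; // assumed policy: wait for capture to settle
    default:
      return "RELEASE_INVENTORY"; // created, failed, refunded, or nothing found
  }
}
```

Keeping the decision in a pure function makes the "payment succeeded but the callback was late" path easy to unit-test without running a workflow.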
4. Webhook Idempotency Is Non-Negotiable
Razorpay (and most payment gateways) may deliver webhooks more than once, either due to retry logic on their end or due to network conditions causing duplicate delivery. Our webhook receiver Lambda implements idempotency by storing the razorpay_payment_id in a DynamoDB table with a TTL of 24 hours before acting on the event. If the payment ID is already present, the webhook is acknowledged (returning HTTP 200) but no state transition is triggered. Without this, duplicate webhooks would attempt to call SendTaskSuccess on an already-completed Step Functions execution, causing unnecessary errors and noise in the monitoring dashboards.
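The idempotency check can be sketched as below. This is an illustration under stated assumptions: the table name, the attribute names, and the use of a conditional put (`attribute_not_exists`) to make "check and store" a single atomic step are all hypothetical. The in-memory store stands in for DynamoDB and is synchronous only to keep the example self-contained. A real store would be asynchronous.

```typescript
const DEDUP_TTL_SECONDS = 24 * 60 * 60;

interface DedupPut {
  TableName: string;
  Item: { paymentId: string; expiresAt: number };
  ConditionExpression: string;
}

// Builds a put that succeeds only the first time a payment ID is seen.
function buildDedupPut(paymentId: string, nowMs: number): DedupPut {
  return {
    TableName: "WebhookDedup", // assumed table name
    // DynamoDB TTL attributes hold an epoch timestamp in seconds.
    Item: { paymentId, expiresAt: Math.floor(nowMs / 1000) + DEDUP_TTL_SECONDS },
    ConditionExpression: "attribute_not_exists(paymentId)",
  };
}

interface DedupStore {
  // Returns true if the record was written, false if it already existed.
  putIfAbsent(put: DedupPut): boolean;
}

// Stand-in for the DynamoDB table; it does not expire entries.
class InMemoryDedupStore implements DedupStore {
  private seen = new Map<string, number>();
  putIfAbsent(put: DedupPut): boolean {
    if (this.seen.has(put.Item.paymentId)) return false;
    this.seen.set(put.Item.paymentId, put.Item.expiresAt);
    return true;
  }
}

interface WebhookResponse {
  statusCode: number;
  duplicate: boolean;
}

function handlePaymentWebhook(
  paymentId: string,
  nowMs: number,
  store: DedupStore,
  onFirstDelivery: (paymentId: string) => void
): WebhookResponse {
  const isFirst = store.putIfAbsent(buildDedupPut(paymentId, nowMs));
  if (isFirst) onFirstDelivery(paymentId); // e.g. resume the waiting workflow
  // Duplicates are still acknowledged so the gateway stops retrying.
  return { statusCode: 200, duplicate: !isFirst };
}
```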
5. EventBridge vs SQS: Choose Based on Fan-Out vs Ordering
We use EventBridge where multiple independent consumers need to react to the same event (an order.placed event triggers the notification service, the analytics ingestion Lambda, and the loyalty points accrual function independently). We use SQS where ordering and exactly-once processing are required (the inventory deduction queue uses FIFO to ensure that concurrent orders for the same SKU are processed in sequence, preventing race conditions in the conditional write). Mixing up these two tools (using EventBridge where you need ordering, or SQS where you need fan-out) is a common serverless architecture mistake that becomes painful under load.
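Per-SKU ordering on a FIFO queue depends on how messages are grouped. The sketch below shows one way to do it, assuming a message group per store and SKU and a deduplication ID per order line. The field names and the ID formats are hypothetical. The object follows the parameter shape of an SQS `SendMessage` call.

```typescript
interface InventoryDeduction {
  orderId: string;
  storeId: string;
  sku: string;
  qty: number;
}

interface FifoMessage {
  QueueUrl: string;
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

function buildInventoryDeductionMessage(queueUrl: string, d: InventoryDeduction): FifoMessage {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(d),
    // SQS FIFO preserves order within a message group, so grouping by
    // store and SKU serialises deductions for the same shelf item while
    // letting unrelated SKUs proceed in parallel.
    MessageGroupId: `${d.storeId}#${d.sku}`,
    // A repeat send of the same order line inside the deduplication
    // window is dropped by SQS instead of being delivered twice.
    MessageDeduplicationId: `${d.orderId}#${d.sku}`,
  };
}
```

Grouping by SKU alone would also work, but it would serialise deductions across stores that do not share stock.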
6. CloudFront Cache Invalidation Costs at Scale
AWS provides 1,000 free CloudFront invalidation paths per month; beyond that, each path costs $0.005. A naive implementation that invalidates the full distribution on any product update would have driven significant charges given the frequency of product image and price changes. We addressed this by using path-specific invalidations (only the changed SKU's image paths) and by implementing a batch invalidation pattern: collecting invalidation requests over a 60-second window and submitting them as a single CreateInvalidation API call with multiple paths, rather than one API call per changed product.
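The batching pattern and the pricing arithmetic can be sketched as follows. The class and function names are hypothetical, and the 60-second trigger is left to the caller (for example, a scheduled invocation). The returned object follows the input shape of CloudFront's `CreateInvalidation` call.

```typescript
interface CreateInvalidationInput {
  DistributionId: string;
  InvalidationBatch: {
    CallerReference: string;
    Paths: { Quantity: number; Items: string[] };
  };
}

// Collects changed paths and emits one invalidation request per flush.
class InvalidationBatcher {
  private pending = new Set<string>();
  private distributionId: string;

  constructor(distributionId: string) {
    this.distributionId = distributionId;
  }

  add(path: string): void {
    if (!path.startsWith("/")) throw new Error("invalidation paths must start with '/'");
    this.pending.add(path); // the Set removes repeats within the window
  }

  // Returns null when nothing changed, so the caller can skip the API call.
  flush(nowMs: number): CreateInvalidationInput | null {
    if (this.pending.size === 0) return null;
    const items = Array.from(this.pending).sort();
    this.pending.clear();
    return {
      DistributionId: this.distributionId,
      InvalidationBatch: {
        CallerReference: `batch-${nowMs}`, // must be unique per request
        Paths: { Quantity: items.length, Items: items },
      },
    };
  }
}

// Monthly cost using the figures quoted above: 1,000 free paths, then $0.005 each.
function invalidationCostUsd(pathsThisMonth: number): number {
  return Math.max(0, pathsThisMonth - 1000) * 0.005;
}
```

Because pricing is per path rather than per request, the saving comes from deduplicating and narrowing paths. Batching mainly reduces the number of API calls and in-flight invalidations.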
Work with Xortrix AI
Building a commerce product that needs to work at India scale?
We bring deep experience in serverless backend architecture, India payments integration (UPI, Razorpay, Cashfree, PhonePe), DynamoDB data modelling, and the specific operational constraints of hyperlocal and D2C commerce. If you're outgrowing your current stack or building from scratch, let's talk.