Case Study — Logistics & Infrastructure
Real-Time Shipment Tracking & Logistics Optimization Platform
How we replaced a crumbling monolith with a fully serverless, event-driven platform that processes 4 million GPS events per day, serves 2,000+ delivery partners in real time, and held its ground through Diwali peak traffic — at 65% lower infrastructure cost.
Client
Mid-size Logistics Company
Engagement
14 months
Team size
6 engineers
Launched
Q1 2024
Client Overview
The client is a Pune-based, mid-size logistics company specializing in last-mile delivery for e-commerce platforms, pharmaceutical distributors, and FMCG brands across India. Founded in 2014, the company grew from a regional courier service in Maharashtra into a national operation covering more than 50 cities, with a network of over 2,000 active delivery partners handling upwards of 80,000 shipments per day at peak.
The company operates a hub-and-spoke model: goods are consolidated at regional sorting hubs, then dispatched to micro-fulfilment centres and finally assigned to individual delivery partners (DPs) for last-mile execution. The client serves a mix of B2B clients — including one of India's top-5 e-commerce platforms — and a growing B2C same-day segment in Tier 1 cities.
Their competitive positioning relies entirely on delivery speed and reliability. On-Time Delivery (OTD) percentage and real-time customer visibility are the two KPIs that determine whether large e-commerce clients renew annual contracts. By late 2022, both were in serious jeopardy.
2,000+
Active delivery partners
50+
Cities across India
80,000+
Shipments per day (peak)
The Challenge
A Legacy Monolith at Its Breaking Point
When Xortrix AI first engaged with the client, their core operations ran on a six-year-old Java monolith deployed on a fleet of manually provisioned EC2 instances behind a classic Elastic Load Balancer. The application handled everything: order ingestion, DP assignment, route planning, customer notifications, invoice generation, and reporting — all in a single deployable WAR file. The database layer was a heavily over-provisioned RDS MySQL instance (db.r5.4xlarge) that had accumulated years of schema debt, with no read replicas and no connection pooling outside the application itself.
The monolith had grown to over 340,000 lines of code with minimal test coverage (estimated at 11% by automated analysis). Deployment cycles required full application restarts, meaning even a minor bug fix in customer notification logic triggered a 12-to-18-minute downtime window. The team was deploying roughly once every three weeks — not because features were ready in three-week cycles, but because each deployment was a ritual risk event that required a Saturday-night maintenance window and two senior engineers standing by.
No Real-Time Visibility
Delivery partners carried Android handsets running a proprietary DP app that sent GPS location updates to the backend every 90 seconds — when it worked. In practice, the GPS polling endpoint was part of the monolith's REST API, and under moderate load (typically mid-afternoon on weekdays) the endpoint response times ballooned beyond 8 seconds. The DP app had a 10-second timeout, leading to silent location update failures. Ops teams discovered this only through downstream complaints: customers calling the support line because the tracking page showed their shipment as "Out for Delivery" with a last-known location from four hours ago.
The tracking dashboard used by operations managers was a polling web page that refreshed every 60 seconds and rendered a static map image generated server-side via Google Static Maps API. There were no live updates, no geofencing alerts, and no automated escalation when a delivery partner went off-route. Supervisors caught exceptions reactively — when a shipment missed its SLA window and a client emailed to ask what happened.
Festive Season as the Forcing Function
The immediate trigger for the project was the Diwali 2022 season. The client's largest e-commerce partner ran a 10-day sale event that pushed daily shipment volumes to 3.1x the typical daily average. The monolith began exhibiting cascading failures on day two of the sale: the GPS update queue backed up, the database hit its connection limit (max_connections was set to 500; the pool exhausted at 487), and the customer-facing tracking page returned 502 errors for roughly 40% of requests during a four-hour window on the third day. The company received a formal penalty notice from their e-commerce client citing SLA breach and was placed on a performance improvement plan with a six-month deadline.
Summary of Pain Points
Solution Architecture
After a four-week discovery phase — which included codebase archaeology, load testing the existing system, and running event storming sessions with the client's ops and product teams — we arrived at a target architecture built entirely on AWS managed and serverless services. The guiding principle was operational simplicity: no EC2 instances for application logic, no manually provisioned databases at capacity, and no deployment ceremonies.
The architecture is fundamentally event-driven, built around an immutable event log with Command Query Responsibility Segregation (CQRS) and event sourcing patterns to separate write throughput from read performance — a critical consideration when the same underlying domain data (a shipment's state) needs to be consumed in wildly different shapes: raw GPS stream for the ops dashboard, aggregated delivery metrics for the client portal, and structured events for downstream billing and compliance systems.
Core AWS Services
AWS Lambda
All application logic — GPS ingestion, geofence evaluation, dispatch optimization, notification dispatch, and API handlers — runs as discrete Lambda functions. No servers to patch, no capacity planning.
Amazon API Gateway (HTTP API)
Low-latency HTTP API for the DP mobile app and client-facing REST endpoints. WebSocket API for the live ops dashboard. Throttling and quota management configured per-API-key for B2B clients.
Amazon DynamoDB
Primary data store for shipments, delivery events, and DP state. Single-table design with composite keys enables sub-10ms point reads. On-demand capacity mode auto-scales through any traffic spike without pre-provisioning.
AWS IoT Core
MQTT broker for the DP Android app. Each delivery partner device connects as an authenticated IoT Thing. GPS payloads are published to structured topics and routed via IoT Rules to Kinesis Data Streams.
Amazon Kinesis Data Streams
Ingest layer for the high-throughput GPS event stream (up to 50,000 events/minute at peak). Provides ordered, durable, replayable event ingestion with configurable shard capacity.
Amazon EventBridge
Domain event bus. Every state change (ShipmentCreated, PickupConfirmed, OutForDelivery, DeliveryAttempted, DeliveryConfirmed, GeofenceBreached) is published as a structured event and routed to appropriate downstream consumers via rules.
Amazon SQS
Dead-letter queues for failed Lambda invocations. FIFO queues for ordered processing of dispatch commands. Standard queues for async notification delivery to decouple latency-sensitive paths.
Amazon ElastiCache (Redis 7)
Read-through cache for shipment state used by the customer tracking API. Pub/Sub used to fan out WebSocket broadcast messages across multiple API Gateway connections without polling DynamoDB.
Amazon S3 + CloudFront
Static assets, client portal frontend, and geofence polygon data (GeoJSON). Lambda@Edge for auth token validation at the CDN layer without origin round-trips.
AWS Step Functions
Orchestration of complex, stateful business processes: failed delivery retry workflows, SLA escalation sequences, and end-of-day reconciliation jobs that must run to completion across multiple Lambda invocations.
Architectural Patterns Applied
The platform is built around three interlocking architectural patterns that address different failure modes of the original system.
CQRS (Command Query Responsibility Segregation)
Write paths (DP location updates, delivery confirmations, pickup scans) flow through dedicated command handlers that validate, persist to DynamoDB, and emit domain events to EventBridge. Read paths (customer tracking API, ops dashboard, client SLA reports) are served from pre-materialized read models: DynamoDB projections optimized for each access pattern, with Redis caching in front of the highest-traffic endpoints. This separation means a spike in GPS write throughput has zero impact on the customer tracking read path — they do not share a database connection pool or a compute resource.
Event Sourcing
Every shipment's history is stored as an immutable ordered sequence of domain events in DynamoDB (partition key: shipmentId, sort key: eventTimestamp#eventType). The current state of a shipment is never stored directly — it is derived by replaying the event log. This gives the client complete audit trails for every shipment out of the box, enables time-travel debugging when clients dispute delivery records, and allows new downstream consumers to be added without modifying existing write-path logic — they simply subscribe to EventBridge and replay historical events from the stream if needed.
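To make the replay idea concrete, here is a minimal sketch of deriving a shipment's current state from its event log. The event names come from the list above; the `ShipmentState` shape, the field names, and the `asOf` parameter are assumptions made for illustration, not the platform's actual code.

```typescript
// Event names are taken from the case study; everything else is hypothetical.
type EventType =
  | "ShipmentCreated"
  | "PickupConfirmed"
  | "OutForDelivery"
  | "DeliveryAttempted"
  | "DeliveryConfirmed";

interface ShipmentEvent {
  shipmentId: string;
  eventTimestamp: string; // ISO 8601 in UTC, so string order equals time order
  eventType: EventType;
}

interface ShipmentState {
  status: EventType | "Unknown";
  attempts: number;
  lastUpdated: string | null;
}

const initialState: ShipmentState = { status: "Unknown", attempts: 0, lastUpdated: null };

// Mirrors the sort key layout eventTimestamp#eventType.
const sortKey = (e: ShipmentEvent): string => `${e.eventTimestamp}#${e.eventType}`;

// Pure reducer: applies one event to the previous state.
function applyEvent(state: ShipmentState, event: ShipmentEvent): ShipmentState {
  return {
    status: event.eventType,
    attempts: state.attempts + (event.eventType === "DeliveryAttempted" ? 1 : 0),
    lastUpdated: event.eventTimestamp,
  };
}

// Replays the log in sort-key order. Passing `asOf` stops the replay at that
// instant, which is the "time-travel" view used when a record is disputed.
function replay(events: ShipmentEvent[], asOf?: string): ShipmentState {
  return events
    .filter((e) => asOf === undefined || e.eventTimestamp <= asOf)
    .sort((a, b) => (sortKey(a) < sortKey(b) ? -1 : sortKey(a) > sortKey(b) ? 1 : 0))
    .reduce<ShipmentState>(applyEvent, initialState);
}
```

Because the reducer is pure, the same function can serve both the live read model and a dispute investigation: only the `asOf` cut-off differs.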
Saga Pattern for Distributed Workflows
Multi-step business processes — such as the failed-delivery retry workflow (attempt 1 → wait 2 hours → attempt 2 → wait 4 hours → escalate to supervisor → notify customer) — are implemented as choreography-based sagas using EventBridge rules and Step Functions. Each step is compensatable: if a downstream service fails, a compensating event is published to reverse the side effect. This replaces dozens of ad-hoc retry cron jobs in the original monolith with a single, auditable, observable workflow engine.
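The retry sequence described above can be modelled as plain data plus a small interpreter. This is only a sketch of the step ordering: in the platform the waits would be Step Functions wait states, and compensation is omitted here. The `Step` type and the `run` function are hypothetical names introduced for this example.

```typescript
// Hypothetical model of the failed-delivery retry workflow.
type Step =
  | { kind: "attempt"; n: number }
  | { kind: "wait"; hours: number }
  | { kind: "escalate" }
  | { kind: "notifyCustomer" };

const failedDeliveryRetry: Step[] = [
  { kind: "attempt", n: 1 },
  { kind: "wait", hours: 2 },
  { kind: "attempt", n: 2 },
  { kind: "wait", hours: 4 },
  { kind: "escalate" },
  { kind: "notifyCustomer" },
];

interface Executed {
  action: string;
  atHour: number; // hours elapsed since the workflow started
}

// Walks the workflow in order and stops as soon as an attempt succeeds.
function run(steps: Step[], attemptSucceeds: (n: number) => boolean): Executed[] {
  const executed: Executed[] = [];
  let clock = 0;
  for (const step of steps) {
    switch (step.kind) {
      case "wait":
        clock += step.hours;
        break;
      case "attempt":
        executed.push({ action: `attempt#${step.n}`, atHour: clock });
        if (attemptSucceeds(step.n)) return executed;
        break;
      default:
        executed.push({ action: step.kind, atHour: clock });
    }
  }
  return executed;
}
```

Keeping the sequence as data is what makes the workflow auditable: the executed list is a complete record of what ran and when.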
Implementation Details
Real-Time GPS Ingestion Pipeline
The GPS ingestion pipeline is the highest-throughput component of the platform, and the one that most directly failed in the previous system. We rebuilt it ground-up using AWS IoT Core as the connectivity layer and Kinesis Data Streams as the ingest backbone.
Each DP handset runs the updated delivery partner app (React Native, rewritten as part of this engagement), which establishes a persistent MQTT connection to AWS IoT Core over TLS 1.3. Each device is provisioned as an IoT Thing with a unique X.509 client certificate generated during the onboarding flow, eliminating the shared API key that the original system used for all 2,000+ devices. MQTT Quality of Service is set to QoS 1 (at-least-once delivery), so location updates are not silently dropped even if the handset briefly loses connectivity: the MQTT client on the handset redelivers unacknowledged messages on reconnect.
GPS payloads are published to the topic pattern logistics/dp/{dpId}/location with a compact JSON body containing latitude, longitude, accuracy radius, speed, bearing, battery level, and a client-side timestamp. An IoT Rule evaluates every message matching this topic pattern and writes it to a dedicated Kinesis Data Stream (10 shards, supporting up to 10,000 records/second aggregate at the standard write limit of 1,000 records/second per shard). The IoT Rule also applies a lightweight SQL-like filter to discard records where the GPS accuracy radius exceeds 50 meters, a common condition in dense urban areas where reflected signals produce spurious location readings that would otherwise pollute the delivery trail.
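As an illustration of the payload and the accuracy filter, the sketch below mirrors the rule's logic in application code. The JSON field names (`accuracyM`, `clientTs`, and so on) are assumptions, since the case study lists the fields but not their keys. In AWS IoT SQL the equivalent rule would look roughly like `SELECT * FROM 'logistics/dp/+/location' WHERE accuracyM <= 50`, again with a hypothetical field name.

```typescript
// Hypothetical field names for the compact GPS payload.
interface GpsPayload {
  lat: number;
  lon: number;
  accuracyM: number; // accuracy radius in meters
  speedMps: number;
  bearingDeg: number;
  batteryPct: number;
  clientTs: number; // client-side timestamp, epoch milliseconds
}

const LOCATION_TOPIC = /^logistics\/dp\/([^/]+)\/location$/;

// Extracts the DP ID from a topic such as logistics/dp/DP-17/location.
function dpIdFromTopic(topic: string): string | null {
  const match = LOCATION_TOPIC.exec(topic);
  return match ? match[1] ?? null : null;
}

// Keeps a fix only when its accuracy radius does not exceed the threshold.
function isUsableFix(payload: GpsPayload, maxAccuracyM: number = 50): boolean {
  return payload.accuracyM <= maxAccuracyM;
}
```

Filtering inside the rule rather than in the consumer means discarded fixes never reach Kinesis, so they cost nothing downstream.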
A Kinesis consumer Lambda function — configured with a batch size of 200 records, a parallelization factor of 3 per shard, and a bisect-on-error retry policy — processes the stream. For each batch, it groups records by DP ID, deduplicates by timestamp within a sliding 5-second window (using a Redis sorted set as the dedup store), and writes the cleaned location events to DynamoDB in batched writes. DynamoDB TTL is set to 90 days for raw GPS events; aggregated daily route polylines are retained indefinitely.
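The group-and-deduplicate step could look like the sketch below. It uses an in-memory `Map` as a stand-in for the Redis sorted set, and it takes one reading of "deduplicates by timestamp within a sliding 5-second window": a record is dropped when the same DP already has a record with an identical timestamp inside the window. The record shape and function name are assumptions.

```typescript
interface LocationRecord {
  dpId: string;
  clientTs: number; // epoch milliseconds from the handset
  lat: number;
  lon: number;
}

const WINDOW_MS = 5000;

// `seen` stands in for the Redis sorted set: per DP, the timestamps accepted
// within the last WINDOW_MS. Older entries are evicted as newer ones arrive.
function dedupeAndGroup(
  records: LocationRecord[],
  seen: Map<string, number[]> = new Map<string, number[]>(),
): Map<string, LocationRecord[]> {
  const grouped = new Map<string, LocationRecord[]>();
  for (const rec of records) {
    const recent = (seen.get(rec.dpId) ?? []).filter((ts) => ts > rec.clientTs - WINDOW_MS);
    const isDuplicate = recent.indexOf(rec.clientTs) !== -1;
    if (!isDuplicate) {
      recent.push(rec.clientTs);
      const bucket = grouped.get(rec.dpId) ?? [];
      bucket.push(rec);
      grouped.set(rec.dpId, bucket);
    }
    seen.set(rec.dpId, recent);
  }
  return grouped;
}
```

Deduplication matters here because QoS 1 is at-least-once: a handset that reconnects may redeliver a message the broker already accepted.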
The update frequency moved from 90 seconds to 10 seconds per DP without any polling — a 9x improvement in location freshness — at a fraction of the previous infrastructure cost, because MQTT connections over IoT Core are priced per message, not per persistent compute resource.
Geofencing for Delivery Zones
The client divides each city into a set of named delivery zones: polygons defined by a GeoJSON FeatureCollection stored in S3 and cached in Lambda memory on cold start (typical file size: 380 KB for a metro with 120 delivery zones). Zone definitions are versioned in S3 and Lambda functions are notified of updates via an S3 event notification → EventBridge rule, triggering a cache invalidation across all warm Lambda instances.
After each GPS event batch is committed to DynamoDB, a geofencing Lambda evaluates whether each DP has crossed a zone boundary since their last recorded position. The point-in-polygon test uses the ray-casting algorithm, implemented in TypeScript with no external dependencies and benchmarked at under 0.4ms per polygon evaluation on a Lambda with 512 MB memory. For DPs assigned to more than one zone (common during cross-zone deliveries near boundaries), all candidate polygons are tested in parallel using Promise.all.
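A minimal sketch of the ray-casting test and the boundary-crossing check follows. It handles a single outer ring per zone (no holes), and the `Zone` shape and function names are assumptions for illustration rather than the production implementation.

```typescript
type Point = [number, number]; // [lon, lat], GeoJSON coordinate order
type Ring = Point[];

interface Zone {
  name: string;
  ring: Ring; // outer boundary only; holes are ignored in this sketch
}

// Ray casting: count how many polygon edges a horizontal ray from the point
// crosses. An odd count means the point is inside.
function pointInRing(point: Point, ring: Ring): boolean {
  const [x, y] = point;
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i]!;
    const [xj, yj] = ring[j]!;
    const straddles = yi > y !== yj > y;
    if (straddles && x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

// Returns the first zone containing the point, or null if none does.
function zoneFor(point: Point, zones: Zone[]): string | null {
  for (const zone of zones) {
    if (pointInRing(point, zone.ring)) return zone.name;
  }
  return null;
}

// Compares the zone of the previous and current positions.
function detectCrossing(
  prev: Point,
  curr: Point,
  zones: Zone[],
): { from: string | null; to: string | null } | null {
  const from = zoneFor(prev, zones);
  const to = zoneFor(curr, zones);
  return from === to ? null : { from, to };
}
```

The division by `yj - yi` is safe because it is only evaluated when the edge straddles the ray, which rules out horizontal edges.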
When a boundary crossing is detected, a GeofenceBreached event is published to EventBridge with the DP ID, zone names (from and to), timestamp, and coordinates. Downstream consumers include:
Amazon Location Service was evaluated as an alternative for geofencing but rejected: its update latency (designed for lower frequency batch evaluation) was incompatible with our 10-second GPS cadence, and its pricing at the client's volume was approximately 4x more expensive than the custom Lambda implementation.
Automated Dispatch Optimization
The original dispatch process was manual: a zone supervisor would receive a WhatsApp message from the hub scanner listing the day's shipments for their zone and then assign them to available DPs based on personal judgment — typically proximity to the first stop and subjective assessment of workload. This produced highly variable route efficiency: some DPs would cover 45 km in a day while an adjacent DP covered 18 km with an equal number of stops.
We built an automated dispatch optimization engine implemented as a Lambda function invoked by a Step Functions state machine that runs at configurable dispatch windows (6:00 AM, 9:00 AM, and 12:00 PM IST for same-day orders). The algorithm is a Clarke-Wright savings heuristic implementation — a classical VRP (Vehicle Routing Problem) solver that runs in polynomial time and is well-suited to the size of the client's typical dispatch batch (50–400 stops per zone per window).
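For readers unfamiliar with the savings heuristic, here is a compact sketch of its parallel variant with a capacity constraint. It uses straight-line distance on plain x/y coordinates and does not model time windows, shift end times, or vehicle types, all of which the real optimizer takes as inputs. Type and function names are assumptions.

```typescript
interface Location2D {
  x: number;
  y: number;
}

interface Stop extends Location2D {
  id: string;
  demand: number; // units of vehicle capacity consumed by this stop
}

const dist = (a: Location2D, b: Location2D): number => Math.hypot(a.x - b.x, a.y - b.y);
const load = (route: Stop[]): number => route.reduce((sum, s) => sum + s.demand, 0);

// Clarke-Wright savings: start with one route per stop, then repeatedly join
// the two routes whose end points give the largest saving
// s(a, b) = d(depot, a) + d(depot, b) - d(a, b), subject to capacity.
function clarkeWright(depot: Location2D, stops: Stop[], capacity: number): Stop[][] {
  let routes: Stop[][] = stops.map((s) => [s]);

  const savings: { a: Stop; b: Stop; saving: number }[] = [];
  for (let i = 0; i < stops.length; i++) {
    for (let j = i + 1; j < stops.length; j++) {
      const a = stops[i]!;
      const b = stops[j]!;
      savings.push({ a, b, saving: dist(depot, a) + dist(depot, b) - dist(a, b) });
    }
  }
  savings.sort((p, q) => q.saving - p.saving);

  for (const { a, b, saving } of savings) {
    if (saving <= 0) break; // no remaining merge shortens the total distance
    const ra = routes.find((r) => r.indexOf(a) !== -1)!;
    const rb = routes.find((r) => r.indexOf(b) !== -1)!;
    if (ra === rb || load(ra) + load(rb) > capacity) continue;

    // Only stops at the end of a route can be joined to another route.
    const aAtEnd = ra[0] === a || ra[ra.length - 1] === a;
    const bAtEnd = rb[0] === b || rb[rb.length - 1] === b;
    if (!aAtEnd || !bAtEnd) continue;

    // Orient both routes so that a is last on the left and b first on the right.
    const left = ra[ra.length - 1] === a ? ra : [...ra].reverse();
    const right = rb[0] === b ? rb : [...rb].reverse();
    routes = routes.filter((r) => r !== ra && r !== rb);
    routes.push([...left, ...right]);
  }
  return routes;
}
```

Each returned route is a suggested stop sequence for one vehicle, starting and ending at the depot. Building the savings list is quadratic in the number of stops, which is comfortable for batches of a few hundred.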
Inputs to the optimizer: the set of unassigned shipments with delivery addresses geocoded to lat/lon (geocoding is done at order ingestion time via the Google Maps Geocoding API with results cached in DynamoDB), the set of currently available DPs with their last known position and their scheduled shift end time, vehicle capacity constraints (two-wheeler vs. four-wheeler), and time windows for attempted deliveries (customer-specified preferred slots stored as ISO 8601 intervals). The optimizer produces an assignment of shipments to DPs with a suggested stop sequence for each DP, which is pushed to the DP app via the MQTT topic logistics/dp/{dpId}/manifest.
Supervisors retain override capability through the ops dashboard: any auto-assigned manifest can be edited or reassigned with a justification note, which is recorded as a ManifestOverridden event. Override data feeds a feedback loop: a weekly analytics job aggregates override patterns to identify route constraints not captured in the optimization model (e.g., a gated community that requires a specific entry time window).
WebSocket-Based Live Tracking Dashboard
The operations dashboard is a Next.js application (deployed on Vercel) backed by API Gateway WebSocket API. When a supervisor opens the dashboard, their browser establishes a persistent WebSocket connection to the $connect route, which triggers a Lambda that records the connectionId in DynamoDB alongside the supervisor's assigned zone IDs and their IAM session context.
Location updates and geofence events flow through a Redis Pub/Sub channel keyed by zone ID. A broadcaster Lambda is subscribed to all zone channels and, when an update arrives, queries DynamoDB for all active WebSocket connection IDs associated with that zone, then fans out the update to each connection via the API Gateway Management API. This architecture supports hundreds of simultaneous dashboard sessions without any of them polling the backend — all updates are server-pushed.
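The fan-out step can be sketched with the AWS calls abstracted behind an interface, so the logic is visible without SDK wiring. `FanOutDeps` and its three methods are hypothetical: in practice `listConnections` would be a DynamoDB query and `post` a call to the API Gateway Management API, which reports a disconnected client with a 410 Gone response.

```typescript
// Hypothetical seams around DynamoDB and the API Gateway Management API.
interface FanOutDeps {
  listConnections(zoneId: string): Promise<string[]>;
  // Resolves to false when the connection no longer exists.
  post(connectionId: string, payload: string): Promise<boolean>;
  forget(connectionId: string): Promise<void>;
}

interface FanOutResult {
  delivered: number;
  stale: number;
}

// Sends one update to every dashboard connection subscribed to the zone and
// removes connection records that turn out to be stale.
async function broadcast(zoneId: string, update: object, deps: FanOutDeps): Promise<FanOutResult> {
  const payload = JSON.stringify({ zoneId, update });
  const connectionIds = await deps.listConnections(zoneId);
  const outcomes = await Promise.all(
    connectionIds.map(async (id) => {
      const ok = await deps.post(id, payload);
      if (!ok) await deps.forget(id);
      return ok;
    }),
  );
  const delivered = outcomes.filter((ok) => ok).length;
  return { delivered, stale: outcomes.length - delivered };
}
```

Cleaning up stale connection records during fan-out keeps the connection table accurate even when a browser closes without triggering the disconnect route.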
The map rendering layer uses Mapbox GL JS with a custom tile layer overlaying the client's delivery zone polygons. DP positions are rendered as moving markers that smoothly interpolate between GPS updates using CSS transitions on the marker translate transform — this avoids visual jumping on the 10-second update cadence. Each marker is colour-coded by DP status: green (active, on-route), amber (idle >8 minutes), red (off-route or SLA at risk), and grey (offline).
The dashboard also surfaces three operational metrics in real time: current completion rate for the day's manifests, estimated SLA breach count (shipments predicted to miss their delivery window based on current pace), and zone-level fleet utilization. These are computed by a Lambda that runs every 5 minutes and caches results in Redis with a 4-minute TTL — serving the initial dashboard load instantly while keeping the metric freshness acceptable for operational decision-making.
Database Design & Caching Strategy
The primary data store is a single DynamoDB table following Rick Houlihan's single-table design principles. All entity types — Shipment, DeliveryEvent, DeliveryPartner, Zone, CustomerNotification, ManifestAssignment — live in one table, differentiated by their key schema:
| Entity | PK | SK | GSI1-PK |
|---|---|---|---|
| Shipment | SHP#{id} | METADATA | CLIENT#{clientId} |
| DeliveryEvent | SHP#{id} | EVT#{ts}#{type} | DP#{dpId} |
| GPSLocation | DP#{dpId} | LOC#{ts} | ZONE#{zoneId} |
| ManifestAssignment | DP#{dpId} | MFT#{date} | HUB#{hubId} |
Global Secondary Indexes (GSIs) cover the main access patterns that cannot be served by the base table keys: all shipments for a given client (GSI1 with CLIENT# partition key), all deliveries by a given DP on a given date (GSI2 with DP# + date sort key), and all active DPs in a zone (GSI3 with ZONE# partition key). DynamoDB on-demand capacity mode means we never pre-provision read/write capacity units — the table scales instantly in response to load without any intervention.
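The key schema in the table above translates into a handful of key-builder functions. The attribute names `PK`, `SK`, and `GSI1PK` are assumptions based on the table's column headers, and `eventsForShipment` is a hypothetical helper that shows the shape of a query for one shipment's event history.

```typescript
interface ItemKeys {
  PK: string;
  SK: string;
  GSI1PK: string;
}

// One builder per entity type, following the key schema table.
const keys = {
  shipment: (id: string, clientId: string): ItemKeys => ({
    PK: `SHP#${id}`,
    SK: "METADATA",
    GSI1PK: `CLIENT#${clientId}`,
  }),
  deliveryEvent: (id: string, ts: string, type: string, dpId: string): ItemKeys => ({
    PK: `SHP#${id}`,
    SK: `EVT#${ts}#${type}`,
    GSI1PK: `DP#${dpId}`,
  }),
  gpsLocation: (dpId: string, ts: string, zoneId: string): ItemKeys => ({
    PK: `DP#${dpId}`,
    SK: `LOC#${ts}`,
    GSI1PK: `ZONE#${zoneId}`,
  }),
  manifestAssignment: (dpId: string, date: string, hubId: string): ItemKeys => ({
    PK: `DP#${dpId}`,
    SK: `MFT#${date}`,
    GSI1PK: `HUB#${hubId}`,
  }),
};

// Query parameters for "all events of one shipment, in time order".
// The shipment's METADATA item shares the partition but is excluded by the prefix.
function eventsForShipment(id: string) {
  return {
    KeyConditionExpression: "PK = :pk AND begins_with(SK, :prefix)",
    ExpressionAttributeValues: { ":pk": `SHP#${id}`, ":prefix": "EVT#" },
    ScanIndexForward: true,
  };
}
```

Putting the timestamp first in the event sort key is what makes the partition read back in chronological order, provided timestamps use a fixed-width format such as ISO 8601 in UTC.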
ElastiCache for Redis (cluster mode, 3 shards × 2 nodes each, r6g.large) serves multiple caching roles:
Results
The new platform went live in January 2024 after a 14-month engagement that included discovery, architecture design, phased implementation, load testing, and a parallel-run migration period. The first major stress test was the Republic Day sale period (January 26, 2024), followed by Holi weekend. Both events were handled without incident. The platform then served its first Diwali season in October 2024 — the same event that had caused the 2022 crisis — at 3.4x normal volume with no SLA penalties.
99.97%
Platform uptime (trailing 12 months)
Across all production services. The only downtime events were two 4-minute Lambda cold-start spikes during unexpected morning demand surges in Q2, both self-resolved.
340ms
Median API latency (p50)
Customer tracking endpoint. p99 is 780ms. Previous system averaged 4,200ms at moderate load, degrading to 8,000ms+ at peak.
65%
Reduction in infrastructure cost
Monthly AWS spend went from ₹11.4L (EC2 + RDS + misc.) to ₹3.9L on the serverless stack at higher traffic volume.
3.4×
Peak traffic multiplier handled (Diwali 2024)
No manual intervention, no pre-scaling required. Lambda concurrency auto-scaled to 2,800 concurrent executions across all functions.
10 sec
GPS update frequency (was 90 sec)
9x improvement in location freshness. Silent GPS failure rate dropped from ~40% to 0.08% (measured as IoT Core rejected publishes).
4M+
GPS events ingested per day (peak)
Kinesis Data Streams handled a peak of 58,000 events/minute during the Diwali afternoon delivery rush without shard saturation.
23%
Improvement in On-Time Delivery rate
OTD improved from a baseline of 81.4%, driven primarily by automated dispatch optimization, which reduced average route distance by 18%.
0
Deployment downtime events since go-live
Blue/green Lambda deployments via CodeDeploy with traffic shifting. Deploys happen multiple times per week with zero user-facing impact.
On the business outcome
The client renewed their contract with their largest e-commerce partner (the one that had issued the SLA penalty notice) with a 15% volume increase commitment for 2025. The ops team reduced their zone supervisor headcount by 30% through redeployment rather than redundancy — supervisors shifted from manual dispatch to exception management, reviewing the roughly 3% of manifests where the algorithm flags an uncertainty rather than handling every assignment manually. The company is now onboarding two new enterprise logistics clients on the strength of the platform's real-time tracking capability, which is now a formal differentiator in their sales collateral.
Full Technology Stack
Cloud Infrastructure (AWS)
Application & Mobile
Lessons Learned
1. Event sourcing is worth the upfront complexity
The event sourcing model added approximately 3 weeks of additional design and implementation time compared to a conventional CRUD approach. It paid back this investment within the first month of go-live: two separate client disputes about delivery records were resolved in under 20 minutes by replaying the event log for the disputed shipment and producing a timestamped audit trail with GPS coordinates for each state transition. In the legacy system, these disputes required manually cross-referencing call center logs, DP WhatsApp messages, and partially overwritten database records — a process that took 2–3 business days per case.
2. Single-table DynamoDB design requires deep access pattern analysis upfront
We ran four access pattern workshops before committing to the key schema. Two access patterns that emerged late in implementation — "all shipments pending delivery confirmation older than 6 hours" and "all DPs who have not sent a GPS update in the last 15 minutes" — required adding GSIs that were not in the original design. Adding a GSI to an existing DynamoDB table is non-disruptive (it backfills asynchronously), but it costs capacity and forces application changes. The lesson: spend more time on access pattern modelling before writing any infrastructure code.
3. Lambda cold starts matter more in B2B contexts than most benchmarks suggest
The DP app is used by delivery partners who start their shift in a tight time window (6:00–7:30 AM IST). This creates a sharp demand spike in which 400+ Lambda execution environments are requested nearly simultaneously for functions that have been idle overnight. On Node.js 20.x with AWS SDK v3 and our middleware stack, cold start duration averaged 1,100ms, which is acceptable for most endpoints but unacceptable for the manifest delivery endpoint that DPs hit the moment they open the app. We addressed this by enabling Lambda SnapStart for the manifest function (reducing cold start to ~180ms) and using Provisioned Concurrency for the IoT Rule Lambda consumer during the 5:45–7:15 AM IST window via a scheduled EventBridge rule.
4. Observability is not optional in event-driven systems
In a monolith, a bug produces a stack trace in one place. In an event-driven system with 22 Lambda functions, 4 Kinesis streams, 2 EventBridge buses, and a Step Functions workflow, a bug produces a symptom 6 hops downstream from its cause. We invested heavily in distributed tracing from day one: every Lambda function emits structured JSON logs with a correlation ID that propagates through the entire event chain (carried in EventBridge detail, SQS message attributes, and DynamoDB item metadata). AWS X-Ray traces are aggregated in Datadog APM with service maps. This investment paid for itself in the first production incident, where a geofence event processing failure was traced from a missed customer SMS back to a malformed GeoJSON polygon in a zone update — in 11 minutes, with a single Datadog trace ID.
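Correlation ID propagation comes down to two small habits: read the ID from the incoming event (or mint one at the edge), and stamp it on every log line and outgoing event. The sketch below shows that pattern with hypothetical names; the actual log schema and middleware are not described in this case study.

```typescript
interface LogContext {
  correlationId: string;
  service: string;
}

// Reuses the upstream correlation ID when present, otherwise mints one.
function contextFromEvent(
  detail: { correlationId?: string },
  service: string,
  newId: () => string,
): LogContext {
  return { correlationId: detail.correlationId ?? newId(), service };
}

// Builds one structured JSON log line carrying the correlation ID.
function logLine(
  ctx: LogContext,
  level: "info" | "warn" | "error",
  message: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    ...fields,
    level,
    message,
    service: ctx.service,
    correlationId: ctx.correlationId,
    ts: new Date().toISOString(),
  });
}

// Stamps the correlation ID onto an outgoing event detail.
function withCorrelation<T extends object>(detail: T, ctx: LogContext): T & { correlationId: string } {
  return { ...detail, correlationId: ctx.correlationId };
}
```

With every hop following the same two steps, a single ID search returns the full path of a request across functions, queues, and buses.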
5. Migration strategy matters as much as architecture
We ran the old system and the new system in parallel for 8 weeks before switching traffic. During this period, every shipment was processed by both systems and the outputs were compared by a reconciliation Lambda that published discrepancies to a Slack channel. This approach caught 14 edge cases in the new system before they affected production traffic, including a timezone handling bug in the SLA calculation that would have incorrectly flagged on-time deliveries in IST (UTC+5:30) as late. Parallel running is expensive (you pay for both systems), but the cost of a broken migration in an active logistics operation, where physical deliveries are in flight, is far higher.
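The exact cause of the timezone bug is not described here, but this class of error is usually avoided by converting every wall-clock deadline to a UTC instant once and comparing instants only. A minimal sketch, with hypothetical function names:

```typescript
const IST_OFFSET_MINUTES = 330; // IST is UTC+5:30 and has no daylight saving

// Converts an IST wall-clock time into a UTC epoch in milliseconds.
function istToEpochMs(year: number, month: number, day: number, hour: number, minute: number): number {
  return Date.UTC(year, month - 1, day, hour, minute) - IST_OFFSET_MINUTES * 60000;
}

// Compares instants, so the offset in the delivery timestamp does not matter.
function isOnTime(deliveredAtIso: string, deadlineEpochMs: number): boolean {
  return Date.parse(deliveredAtIso) <= deadlineEpochMs;
}
```

A deadline of 18:00 IST becomes 12:30 UTC, so a delivery stamped 17:45 with a +05:30 offset is on time, whereas code that read 18:00 as UTC or dropped the offset would shift the comparison by five and a half hours.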