Case Study — Insurance & Document Automation

Automated Claims Verification at Scale for a General Insurance Company

How Xortrix AI designed and delivered a fully serverless document processing pipeline that reduced average claim processing time from 3.5 days to 4.2 hours — auto-verifying 87% of claims without a human touching them.

AWS Lambda, Amazon Textract, Step Functions, DynamoDB, Amazon Comprehend, IRDAI Compliance

Client Overview

A General Insurance Company

The client is a mid-tier general insurance company headquartered in Pune, Maharashtra, with regional offices in Mumbai, Delhi, Bengaluru, Chennai, and Hyderabad. Founded in 2004, the insurer operates under the regulatory oversight of the Insurance Regulatory and Development Authority of India (IRDAI) and holds licences for motor, health, travel, and property insurance products.

With over 800,000 active policies across its motor and health books, the client processes an average of 6,200 claims per month — a figure that surges to roughly 11,000 claims per month during the June–September monsoon season, when vehicle accidents, flood-damaged vehicles, and water-ingress health events drive a dramatic spike in inbound volume.

The insurer's digital transformation mandate, communicated to the board in early 2023, identified claims processing as the single highest-impact area for automation. The claims function employed 74 full-time verification officers across three offices, yet average cycle times remained above industry benchmarks — and customer satisfaction scores on the post-claim NPS survey consistently pointed to speed and communication as the primary pain points.

Client at a glance

Industry: General Insurance (Motor + Health)
Headquarters: Pune, Maharashtra, India
Active Policies: 800,000+
Average Monthly Claims: 6,200 (peak: 11,000)
Regulator: IRDAI
Engagement Type: Design, Build, Operate
Timeline: 14 weeks (design to production)

Document types handled

Motor insurance: FIR copies, repair estimates, RC book, driving licence, Spot Survey Report

Health insurance: Hospital discharge summaries, prescription scans, diagnostic reports, pharmacy invoices

Common: Policy schedule, KYC documents (Aadhaar, PAN), NEFT mandates

The Challenge

Manual verification at the edge of what humans can absorb

A 3–5 day processing cycle

When a claimant submitted a motor claim, the file entered a physical and digital paper chase. Documents uploaded through the customer portal were downloaded by a junior officer, renamed, moved to a shared network drive folder, and then manually reviewed against the policy record in the client's legacy claims management system (CMS). A senior officer then independently verified critical fields before the file moved to the approvals queue. Average turnaround from claim registration to a decision letter was 3.5 days — rising to 5.2 days during monsoon peak.

Unacceptable error rates in data entry

Manual transcription of data from scanned documents into the CMS introduced significant error rates. An internal audit conducted by the client in Q3 2022 found that 18.4% of claim records had at least one data field incorrectly transcribed — most commonly policy number digit transpositions, incorrect accident dates, and miskeyed vehicle registration numbers. These errors caused downstream payment failures, policyholder escalations, and in four documented cases, incorrect claim settlements.

IRDAI compliance pressure

IRDAI's Motor Claims Service Standards mandate that surveyor reports be acknowledged within 24 hours and that settlement decisions be communicated within 30 days of the final survey — with penalties applicable for non-compliance. IRDAI's 2023 circular on digital document processing further required insurers to maintain tamper-evident audit trails for all claim-related document transactions. The insurer's then-current manual process provided no immutable audit trail, creating regulatory exposure.

Monsoon season spikes overwhelmed capacity

Between June and September each year, monsoon-related claims nearly doubled monthly inbound volume. The insurer historically responded by hiring contractual verification officers on short-term engagements — a slow, expensive, and error-prone approach. Training a new verification officer to a productive standard took approximately three weeks, meaning any surge response was structurally delayed. The 2022 monsoon season resulted in a backlog of over 4,100 unprocessed claims at its peak, with some policyholders waiting over three weeks for a decision on straightforward repairs.

Document quality variability

The insurer received documents in an extremely wide variety of formats and qualities: mobile phone photographs of physical FIR copies (often taken at an angle under fluorescent light), scanned PDFs from hospital administrative desks, native digital PDFs from large hospital chains, and WhatsApp-compressed JPEGs from rural surveyors. Any automated system had to handle all of these robustly — rejecting a document because it was a low-resolution JPEG was not an acceptable outcome given the client's rural policyholder base.

Problem summary

3.5-day average processing time (peak: 5.2 days)

18.4% data transcription error rate

No immutable audit trail for IRDAI compliance

Capacity collapses during monsoon season

High and growing operational headcount cost

Solution Architecture

A fully serverless, event-driven processing pipeline

Xortrix AI designed the pipeline from first principles as a serverless, event-driven system. The core design constraint was zero infrastructure to manage — the client's IT team is small and focused on maintaining the core policy administration system. The solution had to scale from 200 to 1,500 documents per hour without any manual intervention, and it had to be provably auditable to satisfy IRDAI requirements.

Document Ingestion via S3

Claimants and surveyors upload documents through a web portal and a REST API. Files land in a dedicated S3 ingestion bucket partitioned by date, claim ID, and document type (e.g., s3://claims-raw/{year}/{month}/{claimId}/{docType}/). S3 Event Notifications are delivered to an SQS queue rather than invoking Lambda directly; this decouples ingestion throughput from processing capacity and provides natural backpressure. A dead-letter queue (DLQ) captures events that fail processing after three attempts, allowing operations teams to replay them without data loss.

Lambda Trigger and Pre-processing

An SQS-triggered Lambda function (ingestion-router) reads batches of up to ten messages at a time. It validates the S3 object key structure, checks the file's MIME type via magic-byte inspection (not just the extension), rejects corrupt or zero-byte files, and normalises filenames to a canonical format. All validation outcomes are written to DynamoDB with a status of RECEIVED or REJECTED. The function is packaged with a shared Lambda Layer (claims-processing-core) containing the DynamoDB client, logging utilities, and schema validators — keeping each function's deployment package under 5 MB.
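The key-structure check can be sketched as a single pattern match. This is a minimal illustration, not the production validator: the function name parseIngestionKey, the claim ID shape (inferred from the example MTR-2024-a3f8c12b), and the allowed docType characters are all assumptions.

```typescript
// Parsed components of an ingestion key such as
// "2024/08/MTR-2024-a3f8c12b/fir/scan-01.jpg".
interface IngestionKey {
  year: number;
  month: number;
  claimId: string;
  docType: string;
  fileName: string;
}

// Assumed layout: {year}/{month}/{claimId}/{docType}/{fileName}.
// The claimId shape (product code, year, hex suffix) is a guess based on
// one example; the real format may differ.
const INGESTION_KEY_PATTERN =
  /^(\d{4})\/(0[1-9]|1[0-2])\/([A-Z]{3}-\d{4}-[0-9a-f]{8,32})\/([a-z][a-z0-9_-]*)\/([^\/]+)$/;

// Returns the parsed key, or null when the key does not match the layout.
// A null result is where a caller would record the document as REJECTED.
function parseIngestionKey(key: string): IngestionKey | null {
  const match = INGESTION_KEY_PATTERN.exec(key);
  if (match === null) {
    return null;
  }
  return {
    year: Number(match[1]),
    month: Number(match[2]),
    claimId: match[3],
    docType: match[4],
    fileName: match[5],
  };
}
```

Returning null rather than throwing keeps a malformed key from failing the whole SQS batch; the caller can write a REJECTED record and move on to the next message.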

Step Functions Workflow Orchestration

Validated documents start an AWS Step Functions Express Workflow. Express Workflows are used (rather than Standard) because claim processing is time-sensitive and high-volume, and each workflow execution runs in under five minutes for the auto-verification path. The main path of the state machine has thirteen named states: DocumentClassification → TextExtractionDispatch → TextExtractionPoller → EntityExtraction → PolicyLookup → FraudSignalCheck → CoverageValidation → ClaimDecision → [AutoApprove | RouteToHuman] → NotificationDispatch → AuditLogWrite → WorkflowComplete. Each task state invokes a dedicated Lambda function with a single, well-defined responsibility.

OCR with Amazon Textract

The TextExtractionDispatch state determines whether to use the Textract synchronous API (for single images and single-page documents, the only inputs the synchronous operations accept) or the asynchronous API (for multi-page files such as hospital discharge summaries or multi-page survey reports). Asynchronous jobs are tracked by a Textract Job ID stored in DynamoDB. The TextExtractionPoller state uses a Wait state with a 15-second interval between polls for job completion, avoiding tight polling loops that would waste Lambda compute time. Textract returns raw block-level data (LINE, WORD, KEY_VALUE_SET, TABLE blocks) which is then assembled into a structured JSON document by a post-processing Lambda.

Entity Extraction with Amazon Comprehend

Raw text output from Textract is passed to Amazon Comprehend for named entity recognition — extracting PERSON (policyholder names, doctors), DATE (accident date, hospital admission/discharge), QUANTITY (invoice amounts, repair estimates), LOCATION (accident site, hospital address), and ORGANIZATION (workshop name, hospital name). Custom entity recognisers trained on a corpus of 4,200 annotated Indian insurance documents handle domain-specific entities: policy numbers in the insurer's proprietary format, vehicle registration numbers in the RTO format, and ICD-10 diagnosis codes for health claims. Comprehend confidence scores below 0.75 flag fields for human review rather than failing the claim outright.

Policy Matching and Coverage Validation

Extracted policy numbers and insured names are matched against the core policy administration system via a private REST API exposed through a VPC endpoint. The PolicyLookup Lambda holds a short-lived (60-second TTL) in-memory cache of recently accessed policies to reduce latency on high-concurrency spikes. Coverage validation checks whether the claimed event (accident date, diagnosis date) falls within the policy period, whether the claimed amount is within the sum insured, and whether the specific peril is covered. For motor claims, the vehicle registration number is cross-referenced against the Vahan database via an integration layer to confirm ownership and insurance linkage.
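A short-lived in-memory cache of this kind can be sketched in a few lines. The class name TtlCache and the injectable clock are illustrative assumptions; only the 60-second TTL comes from the description above. An instance held in module scope survives between warm invocations of the same Lambda container.

```typescript
// Minimal TTL cache. Entries expire ttlMs after they are written.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be exercised without real waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if absent or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) {
      return undefined;
    }
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  // Stores a value and stamps its expiry time.
  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Hypothetical usage: one cache per Lambda container, 60-second TTL.
const policyCache = new TtlCache<{ policyNumber: string; sumInsured: number }>(60000);
```

A short TTL limits how long a policy endorsement or cancellation can go unnoticed, which is the main risk of caching policy records at all.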

Fraud Signal Detection

A dedicated FraudSignalCheck Lambda runs a rule-based scoring model that aggregates signals across the claim record. Signals include: claim submitted within 48 hours of policy inception (high-risk indicator), multiple claims from the same IP address or device fingerprint within 30 days, document metadata inconsistencies (PDF creation timestamp post-dating the claimed event), implausible repair estimates relative to the vehicle's insured declared value (IDV), and hospital name mismatches between the discharge summary and the treating doctor's registration. Each signal adds a weighted score to a composite risk index. Claims scoring above 65 are automatically routed to the Special Investigations Unit (SIU) queue rather than the standard human review queue.
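The weighted scoring described here can be illustrated with a small pure function. The signal names follow the list above and the SIU threshold of 65 is taken from the text, but every weight below, and the rule used for an implausible estimate, is a hypothetical placeholder rather than the insurer's calibrated value.

```typescript
// Inputs for the rule-based score. Field names are illustrative.
interface FraudSignals {
  hoursSincePolicyInception: number;
  otherClaimsFromSameDeviceLast30Days: number;
  documentMetadataInconsistent: boolean;
  repairEstimate: number;
  insuredDeclaredValue: number;
  hospitalNameMismatch: boolean;
}

// Hypothetical weights; the real values are not given in the case study.
const FRAUD_WEIGHTS = {
  earlyClaim: 35,
  repeatDevice: 25,
  metadataInconsistency: 30,
  implausibleEstimate: 20,
  hospitalMismatch: 15,
};

const SIU_THRESHOLD = 65; // claims scoring above this go to the SIU queue

// Adds a weight for each signal that fires and caps the index at 100.
function fraudRiskIndex(s: FraudSignals): number {
  let score = 0;
  if (s.hoursSincePolicyInception <= 48) score += FRAUD_WEIGHTS.earlyClaim;
  if (s.otherClaimsFromSameDeviceLast30Days > 0) score += FRAUD_WEIGHTS.repeatDevice;
  if (s.documentMetadataInconsistent) score += FRAUD_WEIGHTS.metadataInconsistency;
  // Assumed rule: an estimate above the IDV counts as implausible.
  if (s.repairEstimate > s.insuredDeclaredValue) score += FRAUD_WEIGHTS.implausibleEstimate;
  if (s.hospitalNameMismatch) score += FRAUD_WEIGHTS.hospitalMismatch;
  return Math.min(score, 100);
}

function routesToSiu(riskIndex: number): boolean {
  return riskIndex > SIU_THRESHOLD;
}
```

Keeping the weights in one table makes them easy to review with the SIU team and to adjust as confirmed-fraud data accumulates.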

Claim Decision and Human-in-the-Loop

The ClaimDecision state evaluates the accumulated confidence scores, coverage validation result, fraud risk index, and completeness of extracted data. Claims meeting all auto-approval thresholds proceed directly to settlement initiation. Claims failing one or more checks (but not flagged for SIU) are routed to the human review queue, where a claims processor sees a pre-populated review form with all extracted data, confidence indicators, and the specific fields that failed validation. Processors can approve, reject, or request additional documents with a single click. Their decision is written back to DynamoDB and the workflow resumes from the ClaimDecision state via a callback token pattern (Step Functions .waitForTaskToken). Because Step Functions offers the callback pattern only in Standard Workflows, the human-review wait runs in a Standard Workflow rather than inside the five-minute Express execution.
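The routing logic above can be sketched as a pure function that returns both the route and the reasons a reviewer would see. The type and reason-code names are illustrative, and the 0.82 minimum field confidence is an assumption borrowed from the Textract threshold used elsewhere in the pipeline; only the SIU cut-off of 65 is stated for this step.

```typescript
type ClaimRoute = "AUTO_APPROVE" | "ROUTE_TO_HUMAN" | "SIU_REFERRAL";

interface DecisionInput {
  fieldConfidences: { [field: string]: number }; // 0..1 per extracted field
  missingFields: string[];
  coverageValid: boolean;
  fraudRiskIndex: number; // 0..100
}

interface DecisionResult {
  route: ClaimRoute;
  reasons: string[]; // empty on auto-approval
}

const SIU_RISK_THRESHOLD = 65;     // from the fraud step
const MIN_FIELD_CONFIDENCE = 0.82; // assumed auto-approval threshold

function decideClaimRoute(input: DecisionInput): DecisionResult {
  // SIU referral takes precedence over the standard review queue.
  if (input.fraudRiskIndex > SIU_RISK_THRESHOLD) {
    return { route: "SIU_REFERRAL", reasons: ["FRAUD_RISK_ABOVE_THRESHOLD"] };
  }
  const reasons: string[] = [];
  if (!input.coverageValid) {
    reasons.push("COVERAGE_VALIDATION_FAILED");
  }
  for (const field of input.missingFields) {
    reasons.push("MISSING_FIELD:" + field);
  }
  for (const field of Object.keys(input.fieldConfidences)) {
    if (input.fieldConfidences[field] < MIN_FIELD_CONFIDENCE) {
      reasons.push("LOW_CONFIDENCE:" + field);
    }
  }
  // Any failed check sends the claim to a human, never to auto-rejection.
  return { route: reasons.length === 0 ? "AUTO_APPROVE" : "ROUTE_TO_HUMAN", reasons };
}
```

Carrying the reason codes alongside the route is what allows the review form to highlight exactly which fields failed instead of asking the processor to re-check the whole file.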

Implementation Details

Engineering decisions that shaped the system

Multi-format document ingestion

One of the earliest design decisions was to accept any document format at the ingestion boundary and normalise it downstream, rather than enforcing format constraints at upload time. Rejecting a policyholder's WhatsApp-compressed JPEG at the upload stage would generate a support ticket, a callback, and a multi-day delay — worse than the manual process. Instead, the ingestion-router Lambda runs a format detection pass using the file's magic bytes (the first 4–8 bytes that identify file type regardless of extension). It classifies inputs into four categories:

Native PDF

Passed to Textract document analysis. Single-page files can use the synchronous AnalyzeDocument API; multi-page PDFs go through the asynchronous StartDocumentAnalysis API, which handles up to 3,000 pages.

Scanned PDF

Detected by absence of embedded text layers. Passed to Textract with FORMS and TABLES feature flags enabled.

JPEG / PNG / WEBP

High-resolution images processed synchronously. Low-resolution images (below 100 DPI equivalent) run through an image-enhancement Lambda using Sharp before Textract.

HEIC / TIFF

Converted to JPEG via a Sharp-based conversion Lambda before extraction. The original file is preserved in S3; only a converted copy is processed.
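Magic-byte detection for these formats can be sketched as follows. The byte signatures are the standard ones for each container; the function and type names are illustrative, and telling a native PDF from a scanned one needs a separate text-layer check that is not shown here.

```typescript
type DetectedFormat = "pdf" | "jpeg" | "png" | "webp" | "tiff" | "heic" | "unknown";

// True when `bytes` contains the ASCII string `text` starting at `offset`.
function hasAscii(bytes: Uint8Array, offset: number, text: string): boolean {
  if (bytes.length < offset + text.length) return false;
  for (let i = 0; i < text.length; i++) {
    if (bytes[offset + i] !== text.charCodeAt(i)) return false;
  }
  return true;
}

// True when `bytes` begins with the given byte values.
function startsWithBytes(bytes: Uint8Array, prefix: number[]): boolean {
  if (bytes.length < prefix.length) return false;
  for (let i = 0; i < prefix.length; i++) {
    if (bytes[i] !== prefix[i]) return false;
  }
  return true;
}

// Classifies a file by its leading bytes, ignoring the file extension.
function detectFormat(bytes: Uint8Array): DetectedFormat {
  if (hasAscii(bytes, 0, "%PDF")) return "pdf";
  if (startsWithBytes(bytes, [0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWithBytes(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  // WEBP is a RIFF container with "WEBP" at byte offset 8.
  if (hasAscii(bytes, 0, "RIFF") && hasAscii(bytes, 8, "WEBP")) return "webp";
  // TIFF: little-endian "II*\0" or big-endian "MM\0*".
  if (startsWithBytes(bytes, [0x49, 0x49, 0x2a, 0x00])) return "tiff";
  if (startsWithBytes(bytes, [0x4d, 0x4d, 0x00, 0x2a])) return "tiff";
  // HEIC: ISO base media "ftyp" box at offset 4 with a HEIC brand.
  if (hasAscii(bytes, 4, "ftyp") && (hasAscii(bytes, 8, "heic") || hasAscii(bytes, 8, "heix"))) {
    return "heic";
  }
  return "unknown";
}
```

WEBP and HEIC need the first 12 bytes rather than 4 to 8, so a ranged read of the first 16 bytes of the S3 object is enough to classify every format listed here.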

All original files are stored immutably in the S3 ingestion bucket with Object Lock enabled in Compliance mode and a 7-year retention policy aligned to IRDAI record-keeping requirements. Versioning is enabled; no object version can be deleted or overwritten during the retention period by any user or application path. Once retention expires, disposal happens only through an explicit, separately authenticated IRDAI-mandated disposal workflow.

Textract output assembly and post-processing

Textract returns a flat list of Block objects — each block representing a unit of detected text at varying granularity (PAGE, LINE, WORD, KEY_VALUE_SET, TABLE, CELL). For insurance documents, the raw block output is not immediately useful; it must be reassembled into a structured representation. Xortrix AI built a dedicated textract-assembler Lambda that performs three assembly passes:

Pass 1 — Spatial assembly

Blocks are sorted by their Geometry.BoundingBox coordinates (top-to-bottom, left-to-right within each page) to reconstruct reading order. This is necessary because Textract sometimes returns blocks out of reading order for multi-column documents like hospital discharge summaries.

Pass 2 — Key-value pair extraction

KEY_VALUE_SET blocks are linked via their Relationships array to reconstruct form field pairs. Fields matching known insurance vocabulary (e.g., "Date of Accident", "Sum Insured", "Policy Number", "Hospital Name") are mapped to canonical field names in the claim schema using a fuzzy matcher tolerant to OCR errors (e.g., "Pol1cy Number" → policy_number).

Pass 3 — Table extraction

TABLE blocks are reconstructed into row/column structures. For health claims, invoice tables containing line items, quantities, unit costs, and totals are extracted and individually cross-referenced against the CGHS rate schedule to flag anomalous pricing.
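The three passes can be illustrated on simplified inputs. This is a sketch under stated assumptions: LineBlock and CellBlock are flattened stand-ins for Textract blocks, the row tolerance and the 20% edit-distance allowance are hypothetical tuning values, and every canonical field name other than policy_number is invented for illustration.

```typescript
// Simplified stand-ins for Textract LINE and CELL blocks. Real blocks carry
// Geometry.BoundingBox and Relationships instead of these flat fields.
interface LineBlock { text: string; page: number; top: number; left: number; }
interface CellBlock { rowIndex: number; columnIndex: number; text: string; }

// Pass 1: group lines into rows by vertical position, then order each row
// left to right. This row-wise pass does not detect true column layouts.
function toReadingOrder(blocks: LineBlock[], rowTolerance: number = 0.01): LineBlock[] {
  const sorted = blocks.slice().sort((a, b) => a.page - b.page || a.top - b.top);
  const ordered: LineBlock[] = [];
  let row: LineBlock[] = [];
  const flushRow = (): void => {
    row.sort((a, b) => a.left - b.left);
    for (const block of row) ordered.push(block);
    row = [];
  };
  for (const block of sorted) {
    if (row.length > 0 && (block.page !== row[0].page || block.top - row[0].top > rowTolerance)) {
      flushRow();
    }
    row.push(block);
  }
  flushRow();
  return ordered;
}

// Pass 2: Levenshtein edit distance, computed one row at a time.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

const FIELD_VOCABULARY: { [label: string]: string } = {
  "date of accident": "date_of_accident",
  "sum insured": "sum_insured",
  "policy number": "policy_number",
  "hospital name": "hospital_name",
};

// Maps an OCR'd form label to a canonical field name, or null if no
// vocabulary entry is within the allowed edit distance.
function canonicalFieldName(rawLabel: string): string | null {
  const label = rawLabel.toLowerCase().replace(/[:\s]+/g, " ").trim();
  let best: string | null = null;
  let bestDistance = Infinity;
  for (const known of Object.keys(FIELD_VOCABULARY)) {
    const distance = editDistance(label, known);
    const allowed = Math.floor(known.length * 0.2); // hypothetical tolerance
    if (distance <= allowed && distance < bestDistance) {
      best = FIELD_VOCABULARY[known];
      bestDistance = distance;
    }
  }
  return best;
}

// Pass 3: place cells on a grid using their 1-based row and column indices.
function buildTable(cells: CellBlock[]): string[][] {
  let rows = 0;
  let cols = 0;
  for (const cell of cells) {
    rows = Math.max(rows, cell.rowIndex);
    cols = Math.max(cols, cell.columnIndex);
  }
  const table: string[][] = [];
  for (let r = 0; r < rows; r++) {
    const line: string[] = [];
    for (let c = 0; c < cols; c++) line.push("");
    table.push(line);
  }
  for (const cell of cells) table[cell.rowIndex - 1][cell.columnIndex - 1] = cell.text;
  return table;
}
```

With these definitions, canonicalFieldName("Pol1cy Number") resolves to policy_number because the label is one substitution away from the vocabulary entry, while an unrelated label such as "Remarks" returns null and is left unmapped.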

The assembled structured document is stored in a separate S3 processed bucket in JSON format, alongside a confidence manifest listing every extracted field with its Textract confidence score. Fields below a 0.82 confidence threshold are flagged in the manifest with a REVIEW_REQUIRED marker, which is propagated into the claim's DynamoDB record and surfaced in the human review UI as amber-highlighted fields.
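Building the confidence manifest amounts to tagging each field against the threshold. A minimal sketch, assuming confidences have already been normalised to a 0..1 scale (Textract itself reports 0 to 100); the type and function names are illustrative.

```typescript
interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // normalised to 0..1
}

interface ManifestEntry {
  name: string;
  confidence: number;
  status: "OK" | "REVIEW_REQUIRED";
}

// Fields strictly below the threshold are marked for human review.
function buildConfidenceManifest(fields: ExtractedField[], threshold: number = 0.82): ManifestEntry[] {
  return fields.map((field): ManifestEntry => ({
    name: field.name,
    confidence: field.confidence,
    status: field.confidence < threshold ? "REVIEW_REQUIRED" : "OK",
  }));
}

// Names of the fields the review UI would highlight in amber.
function fieldsNeedingReview(manifest: ManifestEntry[]): string[] {
  return manifest
    .filter((entry) => entry.status === "REVIEW_REQUIRED")
    .map((entry) => entry.name);
}
```

Keeping the manifest separate from the extracted values lets the threshold be changed and the manifest rebuilt without re-running extraction.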

Step Functions error handling and retry strategies

Distributed systems fail. The Step Functions state machine is designed with explicit Retry and Catch configurations on every state that calls an external service. The retry strategy is intentionally conservative — a misconfigured aggressive retry policy can DDoS an internal API or exhaust a Textract per-account quota.

Retry configuration (representative)

TextExtractionDispatch

MaxAttempts: 3, IntervalSeconds: 2, BackoffRate: 2.0. Catches ProvisionedThroughputExceededException and ThrottlingException from Textract. After three failed retries, transitions to the DocumentEscalation state, which routes the claim to human review with reason TEXTRACT_UNAVAILABLE.

PolicyLookup

MaxAttempts: 4, IntervalSeconds: 1, BackoffRate: 1.5. Catches connection timeouts from the VPC-internal policy API. Circuit-breaker pattern: if a policy lookup Lambda invocation returns HTTP 503 three times in 60 seconds, an EventBridge rule fires to alert the on-call engineer via PagerDuty.

FraudSignalCheck

MaxAttempts: 2, IntervalSeconds: 5, BackoffRate: 1.0. Non-retriable errors (invalid claim schema, missing required fields) immediately transition to DocumentEscalation. Fraud check failures do not block the workflow — a failed check defaults to routing the claim for human review rather than auto-rejection.

NotificationDispatch

MaxAttempts: 5, IntervalSeconds: 10, BackoffRate: 2.0. Catches SNS ThrottlingException. Notification failures are non-blocking — the workflow completes successfully even if the notification cannot be delivered; a separate CloudWatch Alarm monitors notification failure rates above 1% over 5 minutes.
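To see what these settings mean in wall-clock terms, the wait before each retry can be computed directly: in a Step Functions Retry block the first retry waits IntervalSeconds and each later retry multiplies the previous wait by BackoffRate. The helper names below are illustrative; the policy values are the ones listed above.

```typescript
interface RetryPolicy {
  intervalSeconds: number;
  backoffRate: number;
  maxAttempts: number; // maximum number of retries
}

// Seconds waited before retry 1, retry 2, ... up to maxAttempts.
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  let delay = policy.intervalSeconds;
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    delays.push(delay);
    delay *= policy.backoffRate;
  }
  return delays;
}

// Worst-case time spent waiting between retries, excluding execution time.
function totalRetryWaitSeconds(policy: RetryPolicy): number {
  return retryDelays(policy).reduce((sum, delay) => sum + delay, 0);
}

const RETRY_POLICIES: { [state: string]: RetryPolicy } = {
  TextExtractionDispatch: { intervalSeconds: 2, backoffRate: 2.0, maxAttempts: 3 },
  PolicyLookup: { intervalSeconds: 1, backoffRate: 1.5, maxAttempts: 4 },
  FraudSignalCheck: { intervalSeconds: 5, backoffRate: 1.0, maxAttempts: 2 },
  NotificationDispatch: { intervalSeconds: 10, backoffRate: 2.0, maxAttempts: 5 },
};
```

For TextExtractionDispatch this yields waits of 2, 4, and 8 seconds, while NotificationDispatch can spend up to 310 seconds waiting (10 + 20 + 40 + 80 + 160), which is worth checking against any execution time limit.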

Every terminal failure state, whether from retries exhausted or a non-retriable error, writes a structured failure record to DynamoDB with the state name, error code, cause, and the input that triggered the failure. This enables exactly-once replay via EventBridge Scheduler: a nightly schedule scans for claims in FAILED status older than 2 hours and replays them from the last successful state, not from the beginning.

DynamoDB data model and TTL-based retention

The DynamoDB table uses a composite primary key: partition key claimId (a UUID4 prefixed with the product code, e.g., MTR-2024-a3f8c12b) and sort key eventTimestamp (ISO 8601 millisecond precision). This key design enables two critical access patterns simultaneously: fetching the current state of a specific claim (GSI on claimId + status attribute) and retrieving the full chronological event history for a claim (base table query on claimId, sort by eventTimestamp).

Claim records carry the following attributes: claimId, productType, policyNumber, insuredName, status (one of RECEIVED, PROCESSING, EXTRACTION_COMPLETE, POLICY_MATCHED, FRAUD_CHECKED, AUTO_APPROVED, PENDING_HUMAN_REVIEW, SIU_REFERRED, APPROVED, REJECTED, FAILED), workflowExecutionArn (linking back to the Step Functions execution for traceability), extractedData (a Map attribute containing all Textract/Comprehend outputs), confidenceManifest (field-level confidence scores), fraudRiskIndex (numeric, 0–100), and processingDurationMs.

Hot claim records (status not in {APPROVED, REJECTED}) have no TTL set and persist indefinitely. Completed claim records have a TTL attribute set to 90 days post-settlement — after which DynamoDB automatically deletes the item. The underlying source documents and structured JSON extracts in S3 are governed by the 7-year Object Lock retention, satisfying IRDAI's 5-year minimum record retention requirement with margin. DynamoDB Streams are enabled on the table; a stream processor Lambda forwards all MODIFY and INSERT events to an S3 audit sink (Parquet format, partitioned by date), enabling historical analytics without querying the operational table.
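DynamoDB's TTL feature expects a Number attribute holding a Unix epoch time in seconds, so the 90-day rule reduces to a small calculation. A minimal sketch: the function name is illustrative, and the set of statuses treated as completed follows the definition of hot records above.

```typescript
const RETENTION_DAYS = 90;
const SECONDS_PER_DAY = 86400;
const COMPLETED_STATUSES = ["APPROVED", "REJECTED"];

// Returns the TTL value (epoch seconds) for a completed claim, or undefined
// for a hot claim, which should carry no TTL attribute at all.
function ttlEpochSeconds(status: string, settledAt: Date): number | undefined {
  if (COMPLETED_STATUSES.indexOf(status) === -1) {
    return undefined;
  }
  return Math.floor(settledAt.getTime() / 1000) + RETENTION_DAYS * SECONDS_PER_DAY;
}
```

TTL deletion is a background process, so items can remain readable for some time after the timestamp passes; queries that must exclude expired claims should filter on the attribute explicitly.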

IRDAI-compliant immutable audit trail

IRDAI's 2023 guidelines on digital insurance document processing require that every action taken on a claim document — upload, access, modification, approval, rejection — be recorded in a tamper-evident audit log. The system achieves this through three complementary mechanisms:

AWS CloudTrail — API-level audit

All AWS API calls (S3 GetObject, PutObject, DynamoDB UpdateItem, Lambda Invoke, Step Functions StartExecution) are captured by CloudTrail and delivered to a dedicated S3 audit bucket with Object Lock. CloudTrail log files are validated with SHA-256 digest files; tampering with a log file invalidates its digest, providing cryptographic tamper evidence. Log delivery is configured with SNS notifications so any delivery failure triggers an immediate alert.

Application-level audit events in DynamoDB

Every state transition in the Step Functions workflow writes a discrete audit event item to DynamoDB with an AUDIT# sort key prefix. These items capture: actor (system Lambda ARN or human reviewer IAM identity), action (e.g., STATUS_CHANGED, FIELD_OVERRIDDEN, CLAIM_APPROVED), previous value, new value, timestamp, and the claimId. These items are append-only: no update operations are permitted on audit items, and a separate IAM policy explicitly denies UpdateItem on items whose sort key begins with AUDIT#.

Human reviewer action logging

When a claims processor approves, rejects, or modifies a claim in the human review UI, the action is submitted through an API Gateway endpoint that requires Cognito authentication. The processor's Cognito sub (unique user identifier) is captured in the audit event alongside their decision and any free-text notes. Bulk operations are explicitly disallowed — each claim decision is a separate API call, ensuring granular accountability.
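An append-only audit write can be sketched as a conditional put that refuses to overwrite an existing item. The request below is a plain object in the shape accepted by a DynamoDB document-client PutItem call. The function name, the table name passed in, and the choice to place the AUDIT# prefix directly in front of an ISO 8601 timestamp in the eventTimestamp sort key are assumptions for illustration.

```typescript
interface AuditEventInput {
  claimId: string;
  actor: string; // system Lambda ARN or the reviewer's Cognito sub
  action: string; // e.g. STATUS_CHANGED, FIELD_OVERRIDDEN, CLAIM_APPROVED
  previousValue: string;
  newValue: string;
  occurredAt: Date;
}

// Builds a PutItem request for one audit event.
function buildAuditPut(tableName: string, event: AuditEventInput) {
  return {
    TableName: tableName,
    Item: {
      claimId: event.claimId,
      // toISOString() yields millisecond precision, e.g. 2024-08-14T09:00:00.000Z
      eventTimestamp: "AUDIT#" + event.occurredAt.toISOString(),
      actor: event.actor,
      action: event.action,
      previousValue: event.previousValue,
      newValue: event.newValue,
    },
    // Fails the write if an item with this primary key already exists,
    // so an audit event can never silently replace an earlier one.
    ConditionExpression: "attribute_not_exists(claimId)",
  };
}
```

The condition guards against overwrites through PutItem, which complements an IAM deny on UpdateItem: together they leave no write path that changes an existing audit item.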

Observability: CloudWatch metrics, dashboards, and alarms

Every Lambda function publishes custom metrics to CloudWatch using Lambda Powertools' Metrics utility, which batches metric data and emits it as Embedded Metric Format (EMF) — a structured JSON format that CloudWatch parses into metrics without a PutMetricData API call per invocation. Key business-level metrics published include: documents_received_count, extraction_success_rate, extraction_confidence_p50 / extraction_confidence_p95, policy_match_rate, fraud_flag_rate, auto_approval_rate, and human_review_queue_depth.
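For readers unfamiliar with EMF, the log line that such a utility emits can be built by hand: a JSON object whose _aws member declares the namespace, dimension set, and metric definitions, with the dimension and metric values as top-level keys. The sketch below constructs that document directly rather than calling Powertools; the function name and the namespace used in the example are assumptions.

```typescript
interface MetricDatum {
  name: string;
  unit: "Count" | "Percent" | "Milliseconds";
  value: number;
}

// Serialises one Embedded Metric Format record. Writing this string to
// stdout in Lambda is enough for CloudWatch to extract the metrics.
function toEmfLogLine(
  namespace: string,
  dimensions: { [name: string]: string },
  metrics: MetricDatum[],
  timestampMs: number,
): string {
  const record: { [key: string]: unknown } = {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)],
          Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
        },
      ],
    },
  };
  // Dimension values and metric values live at the top level of the record.
  for (const key of Object.keys(dimensions)) {
    record[key] = dimensions[key];
  }
  for (const metric of metrics) {
    record[metric.name] = metric.value;
  }
  return JSON.stringify(record);
}

// Example: one record carrying a single business metric.
const exampleEmfLine = toEmfLogLine(
  "ClaimsPipeline", // hypothetical namespace
  { service: "ingestion-router" },
  [{ name: "documents_received_count", unit: "Count", value: 42 }],
  1723630200000,
);
```

Because the metric travels inside an ordinary log line, publishing adds no network call to the invocation path, which is the property that matters at high invocation rates.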

CloudWatch Alarms are configured at three severity levels. A P1 alarm (immediate PagerDuty page) fires when: the SQS ingestion queue depth exceeds 500 messages and is not draining (indicating a processing stall), the Step Functions error rate exceeds 5% over 10 minutes, or the DynamoDB write throttle count exceeds zero (the table uses on-demand billing mode, making throttling unexpected and indicative of a bug). A P2 alarm (Slack notification to the operations channel) fires when: the human review queue depth exceeds 200 items (indicating a staffing gap or an unusual fraud flag spike), or the Textract asynchronous job success rate falls below 98% over 30 minutes.

AWS X-Ray is enabled on all Lambda functions and Step Functions state machines, providing end-to-end distributed traces for every claim. During initial load testing, X-Ray traces identified a cold-start latency spike in the textract-assembler Lambda (1.8 seconds p95 cold start) caused by loading a large vocabulary file on initialisation. The fix — moving the vocabulary file to a Lambda Layer and lazy-loading it into module scope — reduced cold start to 340 ms p95.
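The lazy-loading half of that fix follows a common pattern: wrap the expensive load in a function that runs at most once and keep the result in module scope so later invocations in the same container reuse it. A minimal sketch; the lazy helper and the stand-in loader are illustrative, not the project's code.

```typescript
// Wraps a loader so it runs on first use only; later calls return the
// cached result.
function lazy<T>(load: () => T): () => T {
  let loaded = false;
  let value: T | undefined;
  return () => {
    if (!loaded) {
      value = load();
      loaded = true;
    }
    return value as T;
  };
}

// Stand-in for reading and parsing a large vocabulary file from a layer.
let vocabularyLoadCount = 0;
const getVocabulary = lazy((): string[] => {
  vocabularyLoadCount++;
  return ["policy number", "sum insured", "date of accident"];
});
```

Nothing is loaded at import time, so a cold start pays only for module evaluation; the first request that needs the vocabulary pays the load cost once, and warm invocations pay nothing.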

Results

Measurable impact, six months after go-live

The system went live in a phased rollout — motor claims first (week 10), health claims second (week 14). The following metrics are drawn from the client's internal operations report covering the first full six months of production operation, compared against the same six months in the prior year.

87%

Claims auto-verified

Without any human intervention, up from ~12% baseline

4.2 hrs

Average processing time

Down from 3.5 days — a 20x reduction

94%

Data extraction accuracy

Across PDF, image, and scanned document formats

60%

Cost reduction

In end-to-end claims processing operational cost

Monsoon season stress test

The 2024 monsoon season, the first after go-live, saw inbound claim volume peak at 13,200 claims in August, 20% above the historical maximum. The pipeline handled the volume without incident. The SQS queue depth peaked at 847 messages at 2:30 PM on 14 August and drained to zero within 38 minutes. No manual intervention was required. Lambda concurrency peaked at 312 concurrent executions, well within the account-level limit of 3,000. No throttling events were recorded in the production account throughout the monsoon period.

Human reviewer productivity

With 87% of claims auto-verified, the 74-person verification team was restructured. 41 officers were redeployed to customer service and relationship management roles. The remaining 33 officers handle the 13% of claims requiring human review — but with a dramatically better toolset. The pre-populated review UI surfaces all extracted data alongside confidence indicators, so a reviewer spends an average of 6.2 minutes per claim rather than the previous 22 minutes. Monthly throughput per reviewer increased from 84 claims per month to 310 claims per month.

Fraud detection impact

The fraud signal detection layer identified 312 high-risk claims in the first six months (SIU referral rate: 0.84%). Of these, 89 were confirmed as fraudulent attempts following SIU investigation — a 28.5% confirmation rate. Prior to the system, the insurer's SIU received referrals entirely from human intuition during manual review; the referral rate was approximately 0.2% with a 9% confirmation rate. The systematic, signal-based approach increased the referral rate and dramatically improved the quality of referrals.

Regulatory compliance posture

The client underwent an IRDAI supervisory inspection in March 2025 that specifically examined digital document processing practices. The inspection team reviewed the CloudTrail audit logs, sampled 50 claim records for audit trail completeness, and examined the S3 Object Lock configuration. The inspection resulted in zero findings related to document processing, the first such result for the insurer in three consecutive inspection cycles.

Tech Stack

Services and tools used

Document Ingestion

Amazon S3 (ap-south-1 buckets with Object Lock)
S3 Event Notifications
Amazon SQS (ingestion queue with DLQ)

Compute

AWS Lambda (Node.js 20.x runtime)
Lambda Layers for shared utilities
Lambda Powertools for structured logging

AI / ML

Amazon Textract (Sync + Async APIs)
Amazon Comprehend (entity recognition)
Custom post-processing logic (Node.js)

Orchestration

AWS Step Functions (Express Workflows)
Error handling with Catch/Retry states
EventBridge for scheduled reprocessing

Data

Amazon DynamoDB (claim state store)
DynamoDB Streams for change propagation
DynamoDB TTL for data retention policy

Notifications

Amazon SNS (multi-channel fan-out)
Amazon SES (policyholder emails)
Amazon Pinpoint (SMS notifications)

Observability

Amazon CloudWatch Metrics + Alarms
AWS X-Ray distributed tracing
CloudWatch Logs Insights for ad-hoc queries

Security & Compliance

AWS KMS (envelope encryption)
VPC endpoints for private AWS API access
AWS CloudTrail (immutable audit log)
IAM least-privilege roles per Lambda function

Compliance & Security

Security posture and regulatory alignment

Encryption at rest and in transit

All S3 buckets use SSE-KMS with customer-managed KMS keys (one key per data classification tier: raw documents, processed extracts, and audit logs). KMS key rotation is enabled on an annual schedule. DynamoDB uses AWS-managed encryption at rest. All data in transit uses TLS 1.3; legacy TLS 1.1/1.2 is disabled at the API Gateway and VPC endpoint level. Internal Lambda-to-Lambda calls do not traverse the public internet — all inter-service communication uses VPC endpoints for S3, DynamoDB, Textract, Comprehend, Step Functions, and SNS.

IAM least-privilege per function

Each Lambda function has a dedicated IAM execution role with the minimum permissions required for its specific task. The ingestion-router role has s3:GetObject on the ingestion bucket and dynamodb:PutItem on the claims table — nothing more. The textract-assembler role adds textract:GetDocumentAnalysis and s3:PutObject on the processed bucket. No role has Administrator or PowerUser permissions. All IAM policies are defined in CDK (TypeScript) and validated in CI using IAM Access Analyzer policy validation before deployment.

VPC network isolation

All Lambda functions run within a dedicated processing VPC with no internet gateway. Outbound traffic to AWS service endpoints uses VPC Gateway Endpoints (S3, DynamoDB) and Interface Endpoints (all other services). The policy administration API is accessed via a PrivateLink connection to the client's on-premises data centre, eliminating public internet exposure for policy data. Security groups permit only the specific protocol and port combinations required by each service-to-service communication path.

CloudTrail and immutable logging

A dedicated multi-region CloudTrail trail captures management events and S3 data events (GetObject, PutObject, DeleteObject) across all processing accounts. Trail logs are delivered to a separate, isolated audit AWS account where neither application team members nor operations engineers have write access. Log file integrity validation is enabled; validation (for example with the aws cloudtrail validate-logs command) confirms that each log file still matches the SHA-256 hash recorded in its digest file. Any break in the validation chain triggers an SNS alert to the CISO team.

IRDAI data localisation compliance

IRDAI requires that policyholder data pertaining to Indian nationals be stored within Indian jurisdiction. All S3 buckets and DynamoDB tables are provisioned in ap-south-1 (Mumbai). S3 bucket replication rules explicitly prohibit cross-region replication to non-India regions. A Service Control Policy (SCP) at the AWS Organizations level denies all resource creation actions outside ap-south-1 for the client processing account, providing a hard guardrail that cannot be overridden by any IAM user or role within the account.

Secrets management

All credentials — database connection strings for the policy API integration, third-party service keys, and internal signing secrets — are stored in AWS Secrets Manager with automatic rotation enabled where supported. Lambda functions retrieve secrets at cold-start initialisation and cache them in module scope; secrets are never written to environment variables, CloudWatch logs, or DynamoDB records. Secrets Manager access is logged via CloudTrail, enabling detection of any unexpected secret retrieval pattern.

Lessons Learned

What we learned building this system

Design for the worst document, not the average document

The temptation when building OCR pipelines is to optimise for the clean, native PDF case. Real-world insurance documents in India are overwhelmingly low-quality scans and mobile photographs. We spent roughly 30% of the build time on image pre-processing and quality handling — an investment that directly explains the 94% extraction accuracy in production.

Express Workflows are the right choice for high-volume, time-sensitive pipelines

Step Functions Standard Workflows are billed per state transition and support a start rate on the order of 2,000 new executions per second. Express Workflows support on the order of 100,000 new executions per second and are billed per execution and by duration and memory consumed rather than per state transition. During monsoon peaks we need sustained throughput of 800+ new executions per minute, each passing through a dozen or so state transitions, so duration-based pricing and the higher throughput ceiling make Express a significantly better fit for this workload profile.

Callback tokens are the correct pattern for human-in-the-loop

We initially prototyped the human review integration using polling — a Lambda function querying DynamoDB every 60 seconds to check if a human had made a decision. This consumed Lambda compute for idle waiting and introduced 0–60 second decision lag. Switching to Step Functions callback tokens (.waitForTaskToken) eliminated both problems: the workflow pauses at zero cost until the human review API delivers the token back, at which point execution resumes within milliseconds.

Invest in CloudWatch dashboards before go-live, not after

Operational observability is often treated as post-launch work. On this project we required production-ready CloudWatch dashboards — showing queue depth, extraction rates, auto-approval rate, and error rates — as a go-live gate. This paid off immediately: on day three of production, a dashboard showed a sudden drop in the policy match rate to 61% (normal: 94%). The root cause — a schema change to the policy API response that broke our mapping layer — was identified and patched in 42 minutes, before a single claim was incorrectly rejected.

Related Case Studies

Real-Time Logistics Platform

Event-driven shipment tracking with IoT GPS ingestion, geofencing, and automated dispatch on AWS.

Lambda, IoT Core, DynamoDB

Serverless E-Commerce Backend

Hyperlocal grocery delivery with Step Functions order orchestration and Razorpay UPI integration.

Step Functions, DynamoDB, Razorpay

Multi-Tenant SaaS Analytics

B2B analytics dashboard with tenant isolation, Kinesis ingestion, and Aurora Serverless.

Aurora Serverless, Kinesis, Cognito

Work with Xortrix AI

Building a document processing or claims automation system?

We've designed and shipped serverless document pipelines for regulated industries where accuracy, auditability, and scale are non-negotiable. If you're working on a similar problem — whether in insurance, banking, healthcare, or logistics — we'd like to hear about it.
