
Why Taktile runs dlt on AWS Lambda to process millions of daily tracking events

  • Simon Bumm,
    Data and Analytics Lead at Taktile
TL;DR: Combining dlt and AWS Lambda creates a secure, scalable, lightweight, and powerful instrumentation engine that Taktile uses for its low-code, high-volume data processing platform. I explain why dlt and AWS Lambda work together so well and how to get everything set up in less than one hour. If you want to jump to the code right away, you can find the accompanying GitHub repo here.

An important aspect of being a data person today is being able to navigate and choose from among the many tools available when setting up your company’s infrastructure (and there are many tools out there!). While there is no one-size-fits-all when it comes to the right tooling, choosing tools that are powerful, flexible, and easily compatible with each other empowers you to tailor your setup to your specific use case.

I lead Data and Analytics at Taktile, a low-code platform used by global credit and risk teams to design, build, and evaluate automated decision flows at scale. It is the leading decision intelligence platform for the financial services industry today. To run our business effectively, we need an instrumentation mechanism that can anonymize and load millions of events and user actions into our Snowflake data warehouse each day. Inside the warehouse, business users rely on the data to run product analytics, build financial reports, set up automations, and more.

Choosing the right instrumentation engine is non-trivial

Setting up the infrastructure to instrument a secure, high-volume data processing platform like Taktile is complicated, and there are essential considerations to weigh:

  1. Data security: Each day, Taktile processes millions of high-stakes financial decisions for banks and Fintechs around the world. In such an environment, keeping sensitive data safe is crucial. Hence, Taktile loads only a subset of non-sensitive events into its warehouse and cannot have external vendors access decision data.
  2. Handling irregular traffic volumes: Taktile’s platform is used for both batch and real-time decision-making, which means traffic spikes are common and hard to anticipate. Such irregular traffic demands an instrumentation engine that can scale out quickly and guarantee timely event ingestion into the warehouse, even under high load.
  3. Maintenance: A fast-growing company like Taktile needs to focus on its core product and avoid tools that create additional overhead.

dlt and AWS Lambda as the secure, scalable, and lightweight solution

AWS Lambda is Amazon’s serverless compute service. dlt is a lightweight Python ETL library that runs on any infrastructure. dlt fits neatly into the AWS Lambda paradigm: add a simple REST API in front and a few lines of Python, and your Lambda function becomes a powerful and scalable event ingestion engine (a minimal handler sketch follows the list below).

  • Security: Lambda functions and dlt run within the perimeter of your own AWS infrastructure, hence there are no dependencies on external vendors.
  • Scalability: serverless compute services like AWS Lambda are great at handling traffic volatility through built-in horizontal scaling.
  • Maintenance: not only does AWS Lambda take care of provisioning and managing servers, but adding dlt to the mix also brings production-ready capabilities such as:
    • Automatic schema detection and evolution
    • Automatic normalization of unstructured data
    • Easy provisioning of staging destinations
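
To make this concrete, here is a minimal sketch of what such a Lambda handler could look like (see the accompanying GitHub repo linked in the TL;DR for the full code). The pipeline, dataset, and table names are illustrative, and the sketch assumes a Snowflake destination whose credentials dlt resolves from its regular configuration, for example environment variables:

import json

import dlt


def handler(event, context):
    # with the API Gateway proxy integration, the POST body arrives as a JSON string
    payload = json.loads(event.get("body") or "[]")
    if isinstance(payload, dict):
        payload = [payload]

    # DLT_PROJECT_DIR, DLT_DATA_DIR and DLT_PIPELINE_DIR point to /tmp (see the
    # template below), the only writeable path inside a Lambda environment
    pipeline = dlt.pipeline(
        pipeline_name="collect",    # illustrative name
        destination="snowflake",    # credentials resolved via dlt's configuration
        dataset_name="tracking",    # illustrative name
    )

    # dlt normalizes the nested JSON, detects and evolves the schema, and loads the rows
    load_info = pipeline.run(payload, table_name="events")
    print(load_info)  # ends up in CloudWatch logs

    return {"statusCode": 200, "body": json.dumps({"status": "ok"})}

Nothing here is Lambda-specific beyond reading the API Gateway event, which keeps the function easy to develop and test locally.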

Get started with dlt on AWS Lambda using SAM (AWS Serverless Application Model)

SAM is a lightweight Infrastructure-as-Code framework provided by AWS. Using SAM, you simply declare serverless resources such as Lambda functions and API Gateways in a template.yml file and deploy them to your AWS account with a lightweight CLI.

  1. Install the SAM CLI:

pip install aws-sam-cli

  2. Define your resources in a template.yml file:

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Resources:
  ApiGateway:
    Type: AWS::Serverless::Api
    Properties:
      Name: DLT Api Gateway
      StageName: v1
  DltFunction:
    Type: AWS::Serverless::Function
    Properties:
      PackageType: Image
      Timeout: 30 # default is 3 seconds, which is usually too little
      MemorySize: 512 # default is 128 MB, which is too little
      Events:
        HelloWorldApi:
          Type: Api
          Properties:
            RestApiId: !Ref ApiGateway
            Path: /collect
            Method: POST
      Environment:
        Variables:
          DLT_PROJECT_DIR: "/tmp" # the only writeable directory on a Lambda
          DLT_DATA_DIR: "/tmp" # the only writeable directory on a Lambda
          DLT_PIPELINE_DIR: "/tmp" # the only writeable directory on a Lambda
      Policies:
        - Statement:
            - Sid: AllowDLTSecretAccess
              Effect: Allow
              Action:
                - secretsmanager:GetSecretValue
              Resource: !Sub "arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:DLT_*"
    Metadata:
      DockerTag: dlt-aws
      DockerContext: .
      Dockerfile: Dockerfile
Outputs:
  ApiGateway:
    Description: "API Gateway endpoint URL for Staging stage for Hello World function"
    Value: !Sub "https://${ApiGateway}.execute-api.${AWS::Region}.amazonaws.com/v1/collect/"
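
PackageType: Image together with the Metadata block above tells SAM to build and package the function as a container image. A minimal Dockerfile for that could look like the sketch below; the module name app.py and the Snowflake extra are assumptions, not prescriptions:

FROM public.ecr.aws/lambda/python:3.11

# install dlt with the Snowflake extra (adjust to your destination)
RUN pip install "dlt[snowflake]"

# copy the handler module into the Lambda task root
COPY app.py ${LAMBDA_TASK_ROOT}

# tell the Lambda runtime which function to invoke
CMD ["app.handler"]

With the template and Dockerfile in place, building the image and deploying the stack comes down to two SAM CLI commands; on the first run, sam deploy --guided prompts you for settings such as the stack name and AWS region:

sam build
sam deploy --guided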