
This article discusses a modern data pipeline architecture that uses S3 storage and AWS serverless services to lower storage and operational costs while leveraging auto scaling, fault tolerance and high availability.

What is Serverless?

Serverless is the native architecture of the cloud that enables you to shift more of your operational responsibilities to AWS, increasing your agility and innovation. Serverless allows you to build and run applications and services without thinking about servers. It eliminates infrastructure management tasks such as server or cluster provisioning, patching, operating system maintenance, and capacity provisioning.

Why use serverless?

Serverless enables you to build a modern application with increased agility and lower total cost of ownership. Benefits of serverless are:

  • FLEXIBLE SCALING: Applications can be scaled automatically, or their capacity can be adjusted by toggling the units of consumption (e.g. throughput, memory) rather than units of individual servers.
  • EXTREMELY LOW COST: Pay for consistent throughput or execution duration rather than by server unit.
    For example, with AWS Lambda, the first 1M requests per month are free, then $0.20 per 1M requests ($0.0000002 per request) thereafter.
  • AUTOMATED HIGH AVAILABILITY: Serverless provides built-in availability and fault tolerance. You don’t need to architect for these capabilities since the services running the application provide them by default.
  • NO SERVER MANAGEMENT: There is no need to provision or maintain any servers. There is no software or runtime to install, maintain, or administer.

AWS Real-time Serverless Data Pipeline

In this article, I propose a low-cost, real-time data pipeline using AWS serverless services: Lambda, Athena and AWS Glue.

  • Lambda is a managed compute service that allows you to run a custom function in response to an event (e.g., writes to a specific AWS S3 bucket) without having to set up a server. Compute resources, capacity provisioning, automatic scaling, code monitoring, logging, and code and security patch deployment are all managed by AWS. It supports several programming languages, including Java, Python and Node.js.
  • AWS S3 storage with lifecycle management is the only storage in the solution, providing low cost, durability, high availability and security. Older, infrequently accessed objects in S3 are automatically moved to lower-cost storage tiers after a configured number of days to reduce storage cost.
  • S3 is organized into three buckets: a Landing bucket, a Validated Data bucket and a Transformed Data bucket. The Landing bucket is where raw data arrives. Arrival of raw data triggers an event that sends a message to Simple Queue Service (SQS) – discussed below – which in turn invokes a Lambda function to validate the data and store the validated data in the Validated Data bucket. When data is written to this bucket, another event sends messages to the next Lambda function, which transforms the data and stores the result in the Transformed Data bucket.
  • Simple Queue Service (SQS) is used for decoupling and to enhance fault tolerance. Once Lambda has processed a message from SQS, it deletes the message from the queue. If a Lambda invocation times out or errors occur, the unprocessed (or unsuccessfully processed) messages become visible again and are picked up by new Lambda invocations.
  • AWS Glue Crawler is used to discover and register the metadata in the Glue Data Catalog, which Athena uses to query the Transformed Data bucket.
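As a sketch of the first step in this chain, the validation Lambda might look like the following in Python. The bucket name and the validation rule are placeholders, not part of the article's design; the handler shape follows the standard SQS-triggered Lambda event, where each SQS message body wraps an S3 event notification.

```python
import json

VALIDATED_BUCKET = "validated-data-bucket"  # hypothetical bucket name


def is_valid(record: dict) -> bool:
    # Hypothetical validation rule: every record must carry a non-empty "id".
    return bool(record.get("id"))


def handler(event, context):
    """SQS-triggered Lambda: each SQS message wraps an S3 event notification."""
    import boto3  # imported lazily; the AWS SDK is available in the Lambda runtime
    s3 = boto3.client("s3")
    for sqs_record in event["Records"]:
        s3_event = json.loads(sqs_record["body"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
            valid = [r for r in rows if is_valid(r)]
            s3.put_object(
                Bucket=VALIDATED_BUCKET,
                Key=key,
                Body="\n".join(json.dumps(r) for r in valid).encode(),
            )
    # Returning normally lets Lambda delete the SQS messages; raising an
    # exception leaves them visible in the queue for retry.
```

Returning without an exception is what allows Lambda to delete the SQS messages, which is the retry behaviour described above.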
Low-cost, Fault-tolerant, HA, Real-time Serverless Data Pipeline Architecture


Databricks for optimizing the process

For simple transformations that finish within the Lambda timeout, Lambda handles the work. For complex transformations, if Lambda keeps timing out because the transformation takes longer than 15 minutes, Lambda instead triggers a Databricks (Apache Spark) job and lets Databricks handle these complex transformations in the pipeline. Apache Spark is a fast, in-memory distributed computing engine that can process large volumes of data more effectively.
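As a minimal sketch of this routing decision (the size threshold is an assumption introduced here, not part of the article; in practice you would pick a signal that predicts whether the transform fits in the 15-minute limit):

```python
# Hypothetical routing: small objects are transformed in-process in Lambda,
# large ones are handed off to a pre-configured Databricks job.
SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024  # assumed cut-off: 100 MB


def route(object_size_bytes: int) -> str:
    """Decide where a transformation should run."""
    if object_size_bytes <= SIZE_THRESHOLD_BYTES:
        return "lambda"       # fits comfortably inside the 15-minute limit
    return "databricks"       # trigger the Spark job instead
```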

How to call DataBricks in Lambda?

The Databricks REST API enables programmatic access to Databricks instead of going through the web UI. It can automatically create and run jobs, productionize a data flow, and much more. For more information on how the API works, read the documentation or this blog.

So, in the Lambda function, write code in Scala or Python to call the REST APIs that create a Databricks cluster, run a Spark job, and delete the cluster (if no longer needed). A few basic examples using curl in a Linux terminal are below:

Create a new cluster

curl -u user:pwd -H "Content-Type: application/json" -X POST \
   https://<databricks-instance>/api/2.0/clusters/create -d \
   '{ "cluster_name": "flights", "spark_version": "1.6.x-ubuntu15.10",
   "spark_conf": { "spark.speculation": true },
   "aws_attributes": { "availability": "SPOT", "zone_id": "us-west-2c" },
   "num_workers": 2 }'

Delete a cluster

curl -u user:pwd -H "Content-Type: application/json" -X POST \
   https://<databricks-instance>/api/2.0/clusters/delete -d \
   '{ "cluster_id": "<cluster-id>" }'

Run a job

curl -u user:pwd -H "Content-Type: application/json" -X POST \
   https://<databricks-instance>/api/2.0/jobs/run-now -d \
   '{ "job_id": 2, "jar_params": ["param1", "param2"]}'
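Inside the Lambda function itself, the same "run a job" call can be made in Python with the standard library. The workspace URL and token below are placeholders; the endpoint is the Databricks Jobs API run-now call from the curl example above.

```python
import json
import urllib.request

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder credential


def build_run_now_request(job_id: int, jar_params: list) -> urllib.request.Request:
    """Build the POST request for the Jobs API run-now endpoint."""
    body = json.dumps({"job_id": job_id, "jar_params": jar_params}).encode()
    return urllib.request.Request(
        url=f"{DATABRICKS_HOST}/api/2.0/jobs/run-now",
        data=body,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_job(job_id: int, jar_params: list) -> dict:
    """Send the request and return the API response (contains the run_id)."""
    with urllib.request.urlopen(build_run_now_request(job_id, jar_params)) as resp:
        return json.loads(resp.read())
```

Keeping the request-building separate from the network call makes the Lambda handler easy to test without hitting the Databricks workspace.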

Cost estimate

The estimate below excludes the cost of Databricks and of S3 Glacier for archiving, and assumes:

  • 5 million standard SQS requests per month
  • 5 million Lambda requests per month
  • 1 TB of storage in S3 Standard
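To make the estimate concrete, here is a back-of-the-envelope calculation. The per-unit rates are assumptions based on published us-east-1 pay-as-you-go pricing at the time of writing and change over time; Lambda duration charges (GB-seconds) and S3 request charges are ignored.

```python
# Assumed us-east-1 rates (check the AWS pricing pages; these change over time).
LAMBDA_PER_MILLION = 0.20    # USD per 1M requests after the 1M free tier
SQS_PER_MILLION = 0.40       # USD per 1M standard-queue requests after 1M free
S3_STANDARD_PER_GB = 0.023   # USD per GB-month in S3 Standard

lambda_cost = max(0, 5 - 1) * LAMBDA_PER_MILLION   # 5M requests, first 1M free
sqs_cost = max(0, 5 - 1) * SQS_PER_MILLION         # 5M requests, first 1M free
s3_cost = 1024 * S3_STANDARD_PER_GB                # 1 TB in S3 Standard

total = lambda_cost + sqs_cost + s3_cost
print(f"Lambda ${lambda_cost:.2f} + SQS ${sqs_cost:.2f} + "
      f"S3 ${s3_cost:.2f} = ${total:.2f}/month")
```

Under these assumptions the compute and queuing portion is a few dollars a month; storage dominates the bill.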


There are many technologies and approaches that help us achieve the same goals. The architecture above is one of many ways to leverage serverless technologies to lower cost, increase operational effectiveness, improve fault tolerance and scale capacity automatically.

Post Author: Tony Nguyen
