How to Build a Lambda Architecture in AWS for Big Data

Laith Al-Saadoon, an AWS Solutions Architect in our big data practice, discusses building a lambda architecture in Amazon Web Services that can be used to analyze big data. Watch our video to learn more about how we are helping our customers grow their business in the cloud with AWS.

What is Lambda in AWS?

AWS Lambda is a serverless compute service that makes it easy for you to build highly responsive applications. In response to events, AWS Lambda runs your code on a high-availability compute infrastructure and automatically performs all administration of the underlying compute resources for you. The lambda architecture, despite the shared name, is a design pattern rather than an AWS service. You can use it as a framework for designing enterprise big data architectures that are resilient, highly available, distributed, and pluggable.

What is AWS Lambda good for?

The lambda architecture is pluggable, adaptable, and scalable, which makes it excellent for processing big data. In any big data architecture, you have anywhere from one to hundreds of data sources coming into your systems, and you need to bring them in using a repeatable process. You may also have different ways of ingesting data, through data streams or batch pushes and pulls. The lambda architecture accounts for all of those things, so whether you have a brand-new data source or one similar to the last one you plugged in, you know that the system will be able to take that data source and run with it. Where you use the AWS Lambda service within this architecture, it runs your code on AWS compute infrastructure and performs all of the server and operating system maintenance, capacity provisioning and automatic scaling, code and security patch deployment, and code monitoring and logging. All you need to do is supply the code.
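
To make the "pluggable" idea concrete, here is a minimal sketch in Python of one way a system might route new data sources by ingestion mode. The registry, the mode names ("stream" and "batch"), and the record fields are all hypothetical and chosen only for illustration; they are not part of any AWS API.

```python
from typing import Callable, Dict, Iterable, List

Record = dict
Ingestor = Callable[[Iterable[Record]], List[Record]]

# Hypothetical registry: maps an ingestion mode to the function that handles it.
INGESTORS: Dict[str, Ingestor] = {}


def register(mode: str) -> Callable[[Ingestor], Ingestor]:
    """Decorator that plugs an ingestion function into the registry."""
    def decorator(func: Ingestor) -> Ingestor:
        INGESTORS[mode] = func
        return func
    return decorator


@register("stream")
def ingest_stream(records: Iterable[Record]) -> List[Record]:
    # Streaming sources are tagged for the speed layer.
    return [{**r, "layer": "speed"} for r in records]


@register("batch")
def ingest_batch(records: Iterable[Record]) -> List[Record]:
    # Batch pushes and pulls are tagged for the batch layer.
    return [{**r, "layer": "batch"} for r in records]


def ingest(source: dict) -> List[Record]:
    """Route a source description to whichever ingestor matches its mode."""
    mode = source.get("mode")
    if mode not in INGESTORS:
        raise ValueError(f"no ingestor registered for mode {mode!r}")
    return INGESTORS[mode](source["records"])
```

Adding a new kind of source then means registering one more function, while the rest of the pipeline stays unchanged.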

How to build a lambda architecture in AWS

You’ll want to start with a VPC. A VPC is a Virtual Private Cloud, an isolated network in the AWS public cloud environment. Let’s use a 10.0.0.0/16 network for this example.

There are a couple of components to a lambda architecture: a speed layer and a batch layer. The speed layer handles data coming at you in real time; in AWS this translates to a Kinesis Stream. This could be a generic Kinesis Stream, Kinesis Firehose, or even a DynamoDB Stream, if you are leveraging AWS services out of the box. The speed layer is great for monitoring data. For instance, if you are an ad tech company, you may want to monitor your clickstream in real time to understand a customer’s behavior.

The batch layer is for data sources like SFTP or other channels where your customers upload data or you receive data every hour or so. The batch layer is also where you want to query your batch and streaming sources so that you have a holistic view of how things are running in your business.
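
As a rough illustration of the speed layer, the sketch below shows a Python function shaped like an AWS Lambda handler that receives a batch of Kinesis records and counts clicks per page. Kinesis delivers each record's payload base64-encoded under `Records[i].kinesis.data`; the JSON payload format and the `page` field are assumptions made for this example, as is the `make_kinesis_event` helper, which exists only for local experimentation.

```python
import base64
import json
from collections import Counter


def handler(event, context):
    """Count clicks per page from one batch of Kinesis records."""
    clicks = Counter()
    for record in event.get("Records", []):
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        click = json.loads(payload)
        clicks[click["page"]] += 1  # "page" is a hypothetical field name
    return dict(clicks)


def make_kinesis_event(clicks):
    """Hypothetical helper: wrap sample clicks in a Kinesis-shaped event."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(json.dumps(c).encode()).decode()}}
            for c in clicks
        ]
    }
```

Calling `handler(make_kinesis_event([{"page": "/home"}]), None)` locally exercises the same decoding path without any AWS resources.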

There are different ways you can process your speed and batch layer data. You can use services like Amazon EMR (Elastic MapReduce) for real-time stream processing, for example an EMR Hadoop cluster or Spark Streaming. You can also just sink it directly to Amazon S3 (Amazon Simple Storage Service) as raw data. For your batch layer, you could use EMR again to periodically convert your data from its raw format to CSV (comma-separated values). Finally, you can load this into Amazon Redshift. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence (BI) tools. This is where you would run your complex BI or segmentation queries and data science operations to analyze your data. Once you have come up with an algorithm or your segmentation logic, you can then apply it to something like Spark Streaming so it can complete the operation in real time.
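
The raw-to-CSV step in the batch layer can be sketched in plain Python. This is not EMR or Spark code; it only illustrates the transformation, assuming the raw data is newline-delimited JSON. The function name `raw_to_csv` and the column names in the usage note are hypothetical.

```python
import csv
import io
import json


def raw_to_csv(raw_lines, columns):
    """Convert newline-delimited JSON records into CSV text.

    Fields not listed in `columns` are dropped, and missing fields
    are written as empty strings.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the raw input
        writer.writerow(json.loads(line))
    return buffer.getvalue()
```

For example, `raw_to_csv(['{"user": "a", "amount": 5}'], ["user", "amount"])` produces a header row followed by one data row, which is the kind of file a warehouse load step would then pick up.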

A great example of using the lambda architecture would be an e-commerce shop where a customer puts items into their online cart and then leaves the website. You can set up a real-time alert that triggers an email to the customer who abandoned their cart, with a coupon code or some other incentive to get them to come back and complete their purchase.
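
The cart-abandonment logic might look something like the following Python sketch. It assumes events arrive as `(timestamp, customer, event_type)` tuples with the hypothetical event types `add_to_cart` and `purchase`, and it treats a cart as abandoned once a configurable timeout has passed without a purchase. A production speed layer would apply the same idea over a stream rather than an in-memory list.

```python
from typing import Dict, Iterable, List, Tuple

# (timestamp in seconds, customer id, event type); all names are hypothetical.
Event = Tuple[float, str, str]


def abandoned_carts(
    events: Iterable[Event], now: float, timeout_seconds: float = 1800.0
) -> List[str]:
    """Return customers whose last cart activity is older than the timeout
    and who have not purchased since."""
    last_cart_activity: Dict[str, float] = {}
    for timestamp, customer, kind in sorted(events):
        if kind == "add_to_cart":
            last_cart_activity[customer] = timestamp
        elif kind == "purchase":
            # A purchase clears the cart, so no alert is needed.
            last_cart_activity.pop(customer, None)
    return sorted(
        customer
        for customer, timestamp in last_cart_activity.items()
        if now - timestamp >= timeout_seconds
    )
```

Each customer returned by this function would be a candidate for the coupon email described above.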

In summary, the lambda architecture is a decoupled system with different strategies for processing both fast and slow data, brought together in a unified layer. This provides companies with a holistic view into how their business is running. It is an agnostic, well-defined architecture that is a great fit for big data workloads in AWS.
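
One way to picture the unified layer is as a merge of a precomputed batch view with the most recent results from the speed layer. The Python sketch below assumes both views are simple per-key counts and that the speed view only covers events the batch view has not yet absorbed; the function name and the sample keys are hypothetical.

```python
from collections import Counter
from typing import Dict


def unified_view(batch_view: Dict[str, int], speed_view: Dict[str, int]) -> Dict[str, int]:
    """Combine batch and real-time counts into a single queryable view."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # adds the real-time counts to the batch totals
    return dict(merged)
```

With a batch view of `{"home": 100, "cart": 40}` and a speed view of `{"home": 3, "checkout": 1}`, the merged view reports 103 hits on `home`, 40 on `cart`, and 1 on `checkout`.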

