Architecting for Big Data Processing on AWS

Amazon Web Services helps you build and deploy big data analytics applications, so you can rapidly scale workloads such as data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and Internet of Things processing. The interview below goes deeper into architecting big data on AWS.

Let’s talk a little bit about architecting Big Data processing on AWS. What are my options?
AWS big data services give you a lot of options. It all starts with Amazon Simple Storage Service (S3), where you can build a data lake that pulls all of your data sources into one place to be analyzed by a larger processing system. From there, you can use Amazon Elastic MapReduce (EMR) to analyze and process all of that data, aggregate it, and apply it to your data models. You can also feed the data lake with streams through Amazon Kinesis: Kinesis Data Firehose delivers streaming data into Amazon S3, or you can do true real-time streaming by consuming a Kinesis stream with EMR. With the tools available in AWS, you can stand up a stream-processing pipeline in a matter of minutes.
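As a rough illustration, here is a minimal boto3 sketch of the ingestion side of such a pipeline. The stream name, region, and event shape are hypothetical assumptions, and the Firehose delivery stream that batches records into S3 would be configured separately.

```python
# A minimal sketch of the ingestion side of this pipeline using boto3.
# Stream name, region, and record shape are hypothetical placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_clickstream_event(event: dict) -> None:
    """Push a single event onto a Kinesis stream for real-time consumers."""
    kinesis.put_record(
        StreamName="clickstream-events",        # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],          # spreads load across shards
    )

# A Kinesis Data Firehose delivery stream (configured separately) can read
# from the same source and batch records into S3, building the data lake
# that EMR later processes.
send_clickstream_event({"user_id": "u-123", "page": "/checkout", "ts": 1700000000})
```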

Amazon seems to be making a tremendous investment in the Big Data space with EMR, Kinesis, Lambda, etc. As we look to the future of what AWS is going to do with Big Data, what would be your prediction?
First, it makes sense to look back and figure out why AWS is making that investment. I would say it all starts with the cloud adoption story, where people first move, shut down data centers, and go into AWS. Once in AWS, they have all of these powerful new tools at their fingertips that they either could never afford or that would have taken a ton of time and resources to stand up in their existing data center. Now that a lot of people have adopted AWS and are going all in, it's time for them to start thinking about leveraging virtually unlimited storage, where they can keep data forever and extract predictions and competitive business insights from it. Amazon Machine Learning lowers the barrier to entry for that, so I would predict that AWS will continue to advance it and add algorithms that weren't available before. There is also Amazon QuickSight, a Big Data visualization tool priced at one-tenth the cost of the other big players out there. I predict that Amazon is going to keep moving up the stack to simplify things for the end user.

Amazon just introduced the Big Data Competency, which we were very eager to earn, and I know you were the Lead Senior Solutions Architect who led the process of helping us achieve it.
Over the last few months we earned the Big Data consulting competency from AWS, and we are very honored and proud of that. It puts us in a league of partners who have delivered genuinely large Big Data solutions for customers. We have verifiable case studies and solution analyses that say: this is where the customer was coming from, and this is the outcome they were able to achieve because of our involvement in architecting their Big Data initiatives.

What are some of your favorite things that you’ve done within the realm of Big Data for our clients?

You have to design something that is essentially pluggable, where adding one data source is no different from adding the next, with minimal friction. The Lambda Architecture is a Big Data processing framework that gives you that kind of scalability and pluggability. It has a speed layer for streaming, fast-moving data where you need real-time updates, and a batch layer for huge volumes of data. The batch layer is better suited to business reports and complex analytics, things that don't have to be done in a matter of seconds. The two come together in a query layer, where you can combine results from both, even across different points in time, as sketched below.
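To make the three layers concrete, here is a toy, in-memory sketch of how they combine. In a real deployment the batch layer would be an EMR/Spark job over the S3 data lake and the speed layer a Kinesis consumer; the dicts and counters below are stand-ins to show the merge logic only.

```python
# A toy illustration of the Lambda Architecture's three layers. The dicts
# are stand-ins for the batch view (EMR/Spark over S3) and the speed view
# (a Kinesis stream consumer); only the layering logic is the point here.
from collections import defaultdict

batch_view = {}                 # recomputed periodically over all history
speed_view = defaultdict(int)   # incremental counts since the last batch run

def batch_recompute(all_events):
    """Batch layer: rebuild the view from the full, immutable event log."""
    counts = defaultdict(int)
    for e in all_events:
        counts[e["page"]] += 1
    batch_view.clear()
    batch_view.update(counts)
    speed_view.clear()          # speed layer now only covers post-batch data

def on_stream_event(event):
    """Speed layer: apply each new event as it arrives."""
    speed_view[event["page"]] += 1

def query(page):
    """Query layer: merge batch and speed views for an up-to-date answer."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)
```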

When you're creating a Big Data architecture, pulling in data from hundreds of different places and leveraging third-party tools, you're opening yourself up to potential security risks. What are we doing to give our clients confidence that they will actually be secure, even though we're enabling a massive amount of data to flow into their business?
When we're building a Big Data architecture in AWS, it definitely helps to think about things holistically. You have to start with the business process, with things like data governance and master data management (MDM), to understand what data is coming in, why it is coming in, what it is used for downstream, whether it contains private data, and whether it needs to be anonymized. That groundwork needs to be in place before you can talk about data security. Technologically, we're utilizing AWS services that sit higher in the stack, where AWS is responsible for a lot of the security, so we don't have to think about it as much as we would if we were building our own data centers and hosting our own services. Most AWS services are audited by third parties for things like PCI and HIPAA compliance, so we can focus on making sure our own controls are in place while AWS maintains the rest. One concrete customer-side control is shown below.
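As one example of a customer-side control under the shared responsibility model, a short boto3 sketch can enforce default server-side encryption on the data lake bucket. The bucket name is a hypothetical placeholder, and KMS key management and IAM policies are out of scope here.

```python
# A minimal sketch: enforce default server-side encryption (SSE-KMS) on
# the data lake bucket. Bucket name is a hypothetical placeholder.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="my-data-lake-bucket",  # hypothetical bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)
```

With this in place, objects landed by Kinesis Data Firehose or written by EMR jobs are encrypted at rest by default, without each producer having to set encryption headers itself.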

Contact us to learn more about enabling your business with AWS big data services, or read our related blog posts on what Big Data on AWS looks like.
