Serverless data lake using AWS

The power of data in determining operational agility and enterprise business value cannot be overlooked. Analytics performed over data acquired from click-streams, social media, internet-connected devices, and log files shortens the time to insight and supports business growth, higher production, customer retention, and making the right calls at the right time.

A serverless data lake is a popular way of storing and analyzing data in a single repository. It features autonomous maintenance and architectural flexibility for diverse kinds of data. The purpose of a data lake is to democratize access to data across the organization.

Enterprises are now migrating to the public cloud, particularly AWS, to build their data lakes. The reasons include cost optimization, minimal operational maintenance, large and inexpensive storage, faster time to market, mature serverless components on AWS, disaster recovery (DR) and business continuity planning (BCP) support, faster scalability, and better security.

The following architecture describes a cloud-native, serverless data lake built from AWS native services such as S3, Athena, and Glue.

Serverless datalake architecture

The steps to create a quick data lake in AWS are as follows.

Create an S3 bucket to store the data.

S3 bucket
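The same bucket can also be created from code. The following is a minimal sketch, assuming a boto3 S3 client is passed in; the function names and the bucket name in the usage note are hypothetical and not part of the original walkthrough.

```python
def bucket_request(bucket_name: str, region: str) -> dict:
    """Build the keyword arguments for the S3 CreateBucket call."""
    params = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a
    # LocationConstraint; every other region has to be named explicitly.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params


def create_bucket(s3_client, bucket_name: str, region: str) -> None:
    """Create the data lake bucket with a boto3 S3 client supplied by the caller."""
    s3_client.create_bucket(**bucket_request(bucket_name, region))
```

With boto3 installed and credentials configured, this could be called as `create_bucket(boto3.client("s3", region_name="eu-west-1"), "my-data-lake-bucket", "eu-west-1")`, where the bucket name is a placeholder that must be globally unique.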

Let’s try to query a sample data set that is currently in CSV format.

dataset

Create a new folder called CSV and upload the CSV to that folder in the S3 bucket that was created earlier.

CSV File
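S3 has no real folders; a "folder" is just a key prefix. As a rough sketch of the upload step, assuming a boto3 S3 client is passed in and using hypothetical function names, the file can be placed under the `CSV/` prefix like this:

```python
import os


def csv_object_key(local_path: str, prefix: str = "CSV") -> str:
    """Return the S3 object key that places the file inside the given folder."""
    return f"{prefix}/{os.path.basename(local_path)}"


def upload_csv(s3_client, local_path: str, bucket_name: str) -> str:
    """Upload a local CSV file under the CSV/ prefix and return the key used."""
    key = csv_object_key(local_path)
    # upload_file(Filename, Bucket, Key) is the boto3 managed-transfer call.
    s3_client.upload_file(local_path, bucket_name, key)
    return key
```

Keeping each data set under its own prefix matters later, because the crawler is pointed at a prefix rather than at a single file.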

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

We will use AWS Glue to crawl the data to form the schema.

First, let’s add a new database.

database
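For readers who prefer code to the console, the database can be added to the Glue Data Catalog with a single call. This is a minimal sketch, assuming a boto3 Glue client is passed in; the function name is hypothetical.

```python
def create_database(glue_client, database_name: str, description: str = "") -> str:
    """Create a Glue Data Catalog database and return the name that was sent."""
    # Glue stores database names in lowercase, so normalise up front.
    database_input = {"Name": database_name.lower()}
    if description:
        database_input["Description"] = description
    glue_client.create_database(DatabaseInput=database_input)
    return database_input["Name"]
```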

Create a new crawler and use it to crawl the data that is stored in S3.

crawler

Add the S3 bucket as a data source.

data source

Create an IAM role that gives the crawler permission to read the S3 data.

IAM role
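The role needs two things: a trust policy that lets the Glue service assume it, and permissions to read the bucket. The sketch below shows one way this might look with a boto3 IAM client passed in. The function names and the inline policy name `crawler-s3-read` are hypothetical; `AWSGlueServiceRole` is the AWS managed policy for Glue.

```python
import json

GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy() -> dict:
    """Trust policy that allows the Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def s3_read_policy(bucket_name: str) -> dict:
    """Inline policy granting list and read access to the data lake bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }],
    }


def create_crawler_role(iam_client, role_name: str, bucket_name: str) -> str:
    """Create the role, attach the Glue managed policy, add S3 read access."""
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_SERVICE_POLICY_ARN)
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="crawler-s3-read",  # hypothetical inline policy name
        PolicyDocument=json.dumps(s3_read_policy(bucket_name)),
    )
    return response["Role"]["Arn"]
```

Scoping the S3 statement to one bucket keeps the crawler from reading unrelated data in the account.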

Select the database that was created earlier as the target.

target database

Click on Review and Create.
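The crawler settings chosen in the console (data source, IAM role, and target database) map onto one Glue API request. A minimal sketch, assuming a boto3 Glue client is passed in and that the data sits under the `CSV/` prefix; the function names are hypothetical.

```python
def crawler_request(crawler_name: str, role_arn: str, database_name: str,
                    bucket_name: str, prefix: str = "CSV") -> dict:
    """Build the keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": crawler_name,
        "Role": role_arn,                # IAM role the crawler assumes
        "DatabaseName": database_name,   # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket_name}/{prefix}/"}]},
    }


def create_crawler(glue_client, crawler_name: str, role_arn: str,
                   database_name: str, bucket_name: str) -> None:
    """Register the crawler; it does not run until it is started."""
    glue_client.create_crawler(
        **crawler_request(crawler_name, role_arn, database_name, bucket_name)
    )
```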

Once the crawler is created, click on “Run” to run it.

Run the crawlers
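Running the crawler is asynchronous, so a script has to start it and then poll until it returns to the `READY` state. The following is a rough sketch under that assumption, with a boto3 Glue client passed in; the function name, poll interval, and timeout are illustrative choices.

```python
import time


def run_crawler_and_wait(glue_client, crawler_name: str,
                         poll_seconds: float = 15.0,
                         timeout_seconds: float = 900.0,
                         sleep=time.sleep) -> float:
    """Start the crawler and block until it is READY again.

    Returns the approximate number of seconds waited.
    """
    glue_client.start_crawler(Name=crawler_name)
    waited = 0.0
    while waited < timeout_seconds:
        # Sleep before the first check so the crawler has time to leave READY.
        sleep(poll_seconds)
        waited += poll_seconds
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # other states are RUNNING and STOPPING
            return waited
    raise TimeoutError(f"crawler {crawler_name} did not finish in {timeout_seconds}s")
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use the default is fine.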

Once the crawler has run, it reports the number of tables it created.

created tables using crawlers

Check the schema of the table that was created.

schema
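The inferred schema can also be read back from the Glue Data Catalog. A minimal sketch, assuming a boto3 Glue client is passed in; the function names are hypothetical, and only the first page of tables is listed.

```python
def list_table_names(glue_client, database_name: str) -> list:
    """Return the names of the tables in the database (first page only)."""
    response = glue_client.get_tables(DatabaseName=database_name)
    return [table["Name"] for table in response["TableList"]]


def table_columns(glue_client, database_name: str, table_name: str) -> list:
    """Return (column name, column type) pairs as inferred by the crawler."""
    table = glue_client.get_table(DatabaseName=database_name, Name=table_name)["Table"]
    return [(col["Name"], col["Type"]) for col in table["StorageDescriptor"]["Columns"]]
```

Checking the column types here is worthwhile before querying, since a crawler infers types from the data it samples and may pick `string` where a numeric type was expected.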

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

We will use Amazon Athena to query the data in S3.

In Athena, set the query result location, which is the S3 path where Athena writes the output of each query.

query result location

Using Athena, we can query the data in S3 with standard SQL, as shown below.

Athena

Athena result
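The same query can be submitted programmatically. The sketch below assumes a boto3 Athena client is passed in and that the result location is supplied per query through `ResultConfiguration`; the function name is hypothetical, and only the first page of results is returned.

```python
import time


def run_query(athena_client, sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0, sleep=time.sleep) -> list:
    """Run a SQL statement in Athena and return the first page of rows.

    Each row is a list of strings; for SELECT statements the first row
    holds the column headers.
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # S3 path where Athena writes the result files.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {query_id} ended as {status['State']}: {reason}")

    rows = athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

A call might look like `run_query(boto3.client("athena"), 'SELECT * FROM "csv" LIMIT 10', "datalake_db", "s3://my-data-lake-bucket/athena-results/")`. The table name `csv`, the database name, and the bucket path are placeholders; a crawler typically names the table after the folder it crawled, so check the catalog for the actual name.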

To adapt and succeed, a technologically advanced organization must take advantage of every opportunity available to it. Today, no organization can afford to ignore the massive amount of data at its disposal. A data lake provides the flexibility needed to unlock that data’s analytics potential.
