Serverless Reference Architecture for Real-time File Processing

Overview

Serverless Reference Architecture: Real-time File Processing

The Real-time File Processing reference architecture is a general-purpose, event-driven, parallel data processing architecture that uses AWS Lambda. This architecture is ideal for workloads that need more than one data derivative of an object.

In this example application, we deliver notes from an interview in Markdown format to S3. S3 Events are used to trigger multiple processing flows - one to convert and persist Markdown files to HTML and another to detect and persist sentiment.

Architectural Diagram

Reference Architecture - Real-time File Processing

Application Components

Event Trigger

In this architecture, individual files are processed as they arrive. To achive this, we utilize AWS S3 Events and Amazon Simple Notification Service. When an object is created in S3, an event is emitted to a SNS topic. We deliver our event to 2 seperate SQS Queues, representing 2 different workflows. Refer to What is Amazon Simple Notification Service? for more information about eligible targets.

Conversion Workflow

Our function will take Markdown files stored in our InputBucket, convert them to HTML, and store them in our OutputBucket. The ConversionQueue SQS queue captures the S3 Event JSON payload, allowing for more control of our ConversionFunction and better error handling. Refer to Using AWS Lambda with Amazon SQS for more details.

If our ConversionFunction cannot remove the messages from the ConversionQueue, they are sent to ConversionDlq, a dead-letter queue (DLQ), for inspection. A CloudWatch Alarm is configured to send notification to an email address when there are any messages in the ConversionDlq.

Sentiment Analysis Workflow

Our function will take Markdown files stored in our InputBucket, detect the overall sentiment for each file, and store the result in our SentimentTable.

We are using Amazon Comprehend to detect overall interview sentiment. Amazon Comprehend is a machine learning powered service that makes it easy to find insights and relationships in text. We use the Sentiment Analysis API to understand whether interview responses are positive or negative.

The Sentiment workflow uses the same SQS-to-Lambda Function pattern as the Coversion workflow.

If our SentimentFunction cannot remove the messages from the SentimentQueue, they are sent to SentimentDlq, a dead-letter queue (DLQ), for inspection. A CloudWatch Alarm is configured to send notification to an email address when there are any messages in the SentimentDlq.

Building and Deploying the Application with the AWS Serverless Application Model (AWS SAM)

This application is deployed using the AWS Serverless Application Model (AWS SAM). AWS SAM is an open-source framework that enables you to build serverless applications on AWS. It provides you with a template specification to define your serverless application, and a command line interface (CLI) tool.

Pre-requisites

Clone the Repository

Clone with SSH

git clone [email protected]:aws-samples/lambda-refarch-fileprocessing.git

Clone with HTTPS

git clone https://github.com/aws-samples/lambda-refarch-fileprocessing.git

Build

The AWS SAM CLI comes with abstractions for a number of Lambda runtimes to build your dependencies, and copies the source code into staging folders so that everything is ready to be packaged and deployed. The sam build command builds any dependencies that your application has, and copies your application source code to folders under .aws-sam/build to be zipped and uploaded to Lambda.

sam build --use-container

Note

Be sure to use v0.41.0 of the AWS SAM CLI or newer. Failure to use the proper version of the AWS SAM CLI will result in a InvalidDocumentException exception. The EventInvokeConfig property is not recognized in earlier versions of the AWS SAM CLI. To confirm your version of AWS SAM, run the command sam --version.

Deploy

For the first deployment, please run the following command and save the generated configuration file samconfig.toml. Please use lambda-file-refarch for the stack name.

sam deploy --guided

You will be prompted to enter data for ConversionLogLevel and SentimentLogLevel. The default value for each is INFO but you can also enter DEBUG. You will also be prompted for AlarmRecipientEmailAddress.

Subsequent deployments can use the simplified sam deploy. The command will use the generated configuration file samconfig.toml.

You will receive an email asking you to confirm subscription to the lambda-file-refarch-AlarmTopic SNS topic that will receive alerts should either the ConversionDlq SQS queue or SentimentDlq SQS queue receive messages.

Testing the Example

After you have created the stack using the CloudFormation template, you can manually test the system by uploading a Markdown file to the InputBucket that was created in the stack.

Alternatively you test it by utilising the pipeline tests.sh script, however the test script removes the resources it creates, so if you wish to explore the solution and see the output files and DynamoDB tables manually uploading is the better option.

Manually testing

You can use the any of the sample-xx.md files in the repository /tests directory as example files. After the files have been uploaded, you can see the resulting HTML file in the output bucket of your stack. You can also view the CloudWatch logs for each of the functions in order to see the details of their execution.

You can use the following commands to copy a sample file from the provided S3 bucket into the input bucket of your stack.

INPUT_BUCKET=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id InputBucket --query "StackResourceDetail.PhysicalResourceId" --output text)
aws s3 cp ./tests/sample-01.md s3://${INPUT_BUCKET}/sample-01.md
aws s3 cp ./tests/sample-02.md s3://${INPUT_BUCKET}/sample-02.md

Once the input files has been uploaded to the input bucket, a series of events are put into motion.

  1. The input Markdown files are converted and stored in a separate S3 bucket.
OUTPUT_BUCKET=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id ConversionTargetBucket --query "StackResourceDetail.PhysicalResourceId" --output text)
aws s3 ls s3://${OUTPUT_BUCKET}
  1. The input Markdown files are analyzed and their sentiment published to a DynamoDB table.
DYNAMO_TABLE=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id SentimentTable --query "StackResourceDetail.PhysicalResourceId" --output text)
aws dynamodb scan --table-name ${DYNAMO_TABLE} --query "Items[*]"

You can also view the CloudWatch logs generated by the Lambda functions.

Using the test script

The pipeline end to end test script can be manually executed, you will need to ensure you have adequate permissions to perform the test script actions.

  • Describing stack resources
  • Uploading and deleting files from the S3 input bucket
  • Deleting files from the S3 output bucket
  • Reading and deleting entries from the DynamoDB table
bash ./tests.sh lambda-file-refarch

While the script is executing you will see all the stages output to the command line. The samples are uploaded to the InputBucket, the script will then wait for files to appear in the OutputBucket before checking they have all been processed and the matching html file exists in the OutputBucket. It will also check that the sentiment for each of the files has been recorded in the SentimentTable. Once complete the script will remove all the files created and the entries from the SentimentTable.

Extra credit testing

Try uploading (or adding to ./tests if you are using the script) an oversized (>100MB) or invalid file type to the input bucket. You can check in X-ray to explore how you can trace these kind of errors within the solution.

  • Linux command
fallocate -l 110M ./tests/sample-oversize.md
  • Mac OS X command
mkfile 110m ./tests/sample-oversize.md

X-Ray Error Tracing - Real-time File Processing

Viewing the CloudWatch dashboard

A dashboard is created as a part of the stack creation process. Metrics are published for the conversion and sentiment analysis processes. In addition, the alarms and alarm states are published.

CloudWatch Dashboard - Real-time File Processing

Cleaning Up the Example Resources

To remove all resources created by this example, run the following command:

bash cleanup.sh

What Is Happening in the Script?

Objects are cleared out from the InputBucket and ConversionTargetBucket.

for bucket in InputBucket ConversionTargetBucket; do
  echo "Clearing out ${bucket}..."
  BUCKET=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id ${bucket} --query "StackResourceDetail.PhysicalResourceId" --output text)
  aws s3 rm s3://${BUCKET} --recursive
  echo
done

The CloudFormation stack is deleted.

aws cloudformation delete-stack \
--stack-name lambda-file-refarch

The CloudWatch Logs Groups associated with the Lambda functions are deleted.

for log_group in $(aws logs describe-log-groups --log-group-name-prefix '/aws/lambda/lambda-file-refarch-' --query "logGroups[*].logGroupName" --output text); do
  echo "Removing log group ${log_group}..."
  aws logs delete-log-group --log-group-name ${log_group}
  echo
done

SAM Template Resources

Resources

The provided template creates the following resources:

  • InputBucket - A S3 bucket that holds the raw Markdown files. Uploading a file to this bucket will trigger processing functions.

  • NotificationTopic - A SNS topic that receives S3 events from the InputBucket.

  • NotificationTopicPolicy - A SNS topic policy that allows the InputBucket to publish events to the NotificationTopic.

  • NotificationQueuePolicy - A SQS queue policy that allows the NotificationTopic to publish events to the ConversionQueue and SentimentQueue.

  • ApplyS3NotificationLambdaFunction - A Lambda function that adds a S3 bucket notification when objects are created in the InputBucket. The function is called by ApplyInputBucketTrigger.

  • ApplyInputBucketTrigger - A CloudFormation Custom Resource that invokes the ApplyS3NotificationLambdaFunction when a CloudFormation stack is created.

  • ConversionSubscription - A SNS subscription that allows the ConversionQueue to receive messages from NotificationTopic.

  • ConversionQueue - A SQS queue that is used to store events for conversion from Markdown to HTML.

  • ConversionDlq - A SQS queue that is used to capture messages that cannot be processed by the ConversionFunction. The RedrivePolicy on the ConversionQueue is used to manage how traffic makes it to this queue.

  • ConversionFunction - A Lambda function that takes the input file, converts it to HTML, and stores the resulting file to ConversionTargetBucket.

  • ConversionTargetBucket - A S3 bucket that stores the converted HTML.

  • SentimentSubscription - A SNS subscription that allows the SentimentQueue to receive messages from NotificationTopic.

  • SentimentQueue - A SQS queue that is used to store events for sentiment analysis processing.

  • SentimentDlq - A SQS queue that is used to capture messages that cannot be processed by the SentimentFunction. The RedrivePolicy on the SentimentQueue is used to manage how traffic makes it to this queue.

  • SentimentFunction - A Lambda function that takes the input file, performs sentiment analysis, and stores the output to the SentimentTable.

  • SentimentTable - A DynamoDB table that stores the input file along with the sentiment.

  • AlarmTopic - A SNS topic that has an email as a subscriber. This topic is used to receive alarms from the ConversionDlqAlarm, SentimentDlqAlarm, ConversionQueueAlarm, SentimentQueueAlarm, ConversionFunctionErrorRateAlarm, SentimentFunctionErrorRateAlarm, ConversionFunctionThrottleRateAlarm, and SentimentFunctionThrottleRateAlarm.

  • ConversionDlqAlarm - A CloudWatch Alarm that detects when there there are any messages sent to the ConvesionDlq within a 1 minute period and sends a notification to the AlarmTopic.

  • SentimentDlqAlarm - A CloudWatch Alarm that detects when there there are any messages sent to the SentimentDlq within a 1 minute period and sends a notification to the AlarmTopic.

  • ConversionQueueAlarm - A CloudWatch Alarm that detects when there are 20 or more messages in the ConversionQueue within a 1 minute period and sends a notification to the AlarmTopic.

  • SentimentQueueAlarm - A CloudWatch Alarm that detects when there are 20 or more messages in the SentimentQueue within a 1 minute period and sends a notification to the AlarmTopic.

  • ConversionFunctionErrorRateAlarm - A CloudWatch Alarm that detects when there is an error rate of 5% over a 5 minute period for the ConversionFunction and sends a notification to the AlarmTopic.

  • SentimentFunctionErrorRateAlarm - A CloudWatch Alarm that detects when there is an error rate of 5% over a 5 minute period for the SentimentFunction and sends a notification to the AlarmTopic.

  • ConversionFunctionThrottleRateAlarm - A CloudWatch Alarm that detects when ther is a throttle rate of 1% over a 5 minute period for the ConversionFunction and sends a notification to the AlarmTopic.

  • SentimentFunctionThrottleRateAlarm - A CloudWatch Alarm that detects when ther is a throttle rate of 1% over a 5 minute period for the SentimentFunction and sends a notification to the AlarmTopic.

  • ApplicationDashboard - A CloudWatch Dashboard that displays Conversion Function Invocations, Conversion Function Error Rate, Conversion Function Throttle Rate, Conversion DLQ Length, Sentiment Function Invocations, Sentiment Function Error Rate, Sentiment Function Throttle Rate, and Sentiment DLQ Length.

License

This reference architecture sample is licensed under Apache 2.0.

Comments
  • Update repo link

    Update repo link

    Issue #, if available:

    Description of changes: Update repository cloning instructions to use aws-samples.

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mcnamarabrian 0
  • Changes to reflect updated best practices

    Changes to reflect updated best practices

    Issue #, if available:

    Description of changes: This is a really large change. The updates reflect best practices, including observability and monitoring.

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mcnamarabrian 0
  • Mcnamarabrian/internal feedback

    Mcnamarabrian/internal feedback

    Issue #, if available:

    Description of changes:

    Addressing feedback provided by internal reviewers.

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mcnamarabrian 0
  • Mcnamarabrian/sentiment improvements

    Mcnamarabrian/sentiment improvements

    Issue #, if available:

    Description of changes:

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mcnamarabrian 0
  • Mcnamarabrian/sentiment

    Mcnamarabrian/sentiment

    Issue #, if available:

    Description of changes:

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mcnamarabrian 0
  • Mcnamarabrian/updated version

    Mcnamarabrian/updated version

    Issue #, if available:

    Description of changes:

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mcnamarabrian 0
  • Architectural Diagram barely visible

    Architectural Diagram barely visible

    Hello. I am using GH with dark mode activated. Due to Architectural Diagram being transparent some of the labels are barely visible. Would appreciate if you align this diagram with the rest in the readme file (white background). Thanks!

    opened by ddosia 0
  • Architecture Diagram Visualization

    Architecture Diagram Visualization

    Hi @mcnamarabrian @bmcnamar-amazon @georgmao

    I’m Afonso and I came across your repo and I took the liberty to scan it with the tool that I am currently developing. If I caught your attention, feel free to take a look at the following diagram and let me know your thoughts on this. I am really interested to get your feedback and understand if the diagram we presented matches the architecture you have in mind.

    opened by afonsoppedro 0
  • broken diagram link

    broken diagram link

    https://s3.amazonaws.com/awslambda-reference-architectures/file-processing/lambda-refarch-fileprocessing.pdf

    <Error>
    <Code>AccessDenied</Code>
    <Message>Access Denied</Message>
    <RequestId>61DAD719BE77F4B9</RequestId>
    <HostId>
    cQEm7tUVxgVoPY/xdZiIOuB99+clSjHL8f301cA4PEoE0fXnCZ2s6DT6ec+/r/HUZTU6Wce5mYc=
    </HostId>
    </Error>
    
    opened by gelpenaddict 2
  • Upgrade deprecated runtime nodejs6.10

    Upgrade deprecated runtime nodejs6.10

    CloudFormation templates in lambda-refarch-fileprocessing have been found to include a deprecated Lambda function runtime (nodejs6.10). The affected templates have been updated to a supported runtime (nodejs10.x).

    Please note, this pull request has been generated by a bot; please check the base branch and files changed before merging.

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by chriscoombs 0
  • node6 invalid, updated cloudformation to node10.x

    node6 invalid, updated cloudformation to node10.x

    *Issue #8 *

    Description of changes:

    node6 invalid, updated cloudformation to node10.x

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by abjoerne 0
  • NodeJs Version not supported

    NodeJs Version not supported

    The NodeJs version mentioned in the CFN scripts is nodejs6.10 whereas the currently supported version of AWS is nodejs10.x. Can you update it, please?

    opened by jaanifly 0
Owner
AWS Samples
AWS Samples
Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text

Welcome to Apache OpenNLP! The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. This toolkit is

The Apache Software Foundation 1.2k Dec 29, 2022
This is a basic Java project focused on applying object-oriented design in a real world application

Email-Administration-System-in-Java This is a basic Java project focused on applying object-oriented design in a real world application. Email Adminis

Muhammad Asad 5 Feb 4, 2022
Mentoria: Abstraindo situações do mundo real com Orientação a Objetos

Mentoria: Abstraindo situações do mundo real com Orientação a Objetos

Camila Cavalcante 12 Aug 5, 2022
An in-memory file system for Java 7+

Jimfs Jimfs is an in-memory file system for Java 7 and above, implementing the java.nio.file abstract file system APIs. Getting started The latest rel

Google 2.2k Jan 3, 2023
A simple file sharing program

FileSharing A simple file sharing program How to use Place all the files to be shared in /html/files (symbolic links work).

AK 3 May 13, 2021
A super simple system for easily creating messages and putting them in in a file, whilst also being able to add replacements without struggle.

A super simple system for easily creating messages and putting them in in a file, whilst also being able to add replacement values without struggle. Please remember: Give constructive feedback, not negative feedback. There are probably a million things to improve, and I am aware of that.

Solyze 2 Sep 21, 2021
ThirdEye is an integrated tool for realtime monitoring of time series and interactive root-cause analysis.

ThirdEye is an integrated tool for realtime monitoring of time series and interactive root-cause analysis. It enables anyone inside an organization to collaborate on effective identification and analysis of deviations in business and system metrics. ThirdEye supports the entire workflow from anomaly detection, over root-cause analysis, to issue resolution and post-mortem reporting.

null 87 Oct 17, 2022
Function allowing you to find duplicate in an array in linear time and constant space

Floyd's tortoise and hare Function allowing you to find duplicate in an array in linear time and constant space ! Floyd's method is a pretty amazing,

null 7 Mar 29, 2022
Time-Based One-Time Password (RFC 6238) and HMAC-Based One-Time Password (RFC 4226) reference implementations and more.

Crypto Time-Based One-Time Password (RFC 6238) and HMAC-Based One-Time Password (RFC 4226) reference implementations and more. Getting Started TOTP ge

Oliver Yasuna 1 May 12, 2022
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine l

Oryx Project 1.8k Dec 28, 2022
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine l

Oryx Project 1.7k Mar 12, 2021
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine l

Oryx Project 1.8k Dec 28, 2022
Dagger is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data.

Dagger Dagger or Data Aggregator is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processi

Open DataOps Foundation 238 Dec 22, 2022
SWE5003 - Achitecting Real Time Systems for Data Processing - Code Base

ARTS2022 SWE5003 - Achitecting Real Time Systems for Data Processing (ISS NUS Offering) - Code Base This module is part of the ISS MTech Graduate Cert

Suria R Asai 5 Apr 2, 2022
A Java architecture test library, to specify and assert architecture rules in plain Java

ArchUnit is a free, simple and extensible library for checking the architecture of your Java code. That is, ArchUnit can check dependencies between pa

TNG Technology Consulting GmbH 2.5k Jan 2, 2023
Clean-architecture-guide - Guia sobre Clean Architecture criado a partir dos meus estudos sobre o tema.

Clean Architecture Guide Arquitetura são as práticas e fundamentos de como organizamos um sistema. Tem relação e como os componentes estão relacionado

Jean Jacques Nascimento Barros 3 Apr 23, 2022
Archinsight project tends to implement architecture-as-code definition of a standard c4 architecture model

Archinsight project tends to implement architecture-as-code definition of a standard c4 architecture model. This project offers a new Insight language designed in such way that an Architect can focus on architecture definition, not visualization. Compared to UML, the Insight language is more specific and is unable to describe an arbitrary entity, but shorter and probably easier to use.

null 25 Nov 24, 2022
Stream Processing and Complex Event Processing Engine

Siddhi Core Libraries Siddhi is a cloud native Streaming and Complex Event Processing engine that understands Streaming SQL queries in order to captur

Siddhi - Cloud Native Stream Processor 1.4k Jan 6, 2023
null 4 Oct 21, 2022