Revature 200413

Data Engineering with Java & Apache Spark

Project 2

A data pipeline with several modular components, including but not limited to:

The pipeline is intended to be a series of applications that run in succession: read a data source from an S3 bucket, deploy a Spark job on a cluster for analysis, and save the results in a SQL database. Some recommendations for organizing and extending the project:

Use a Maven multi-module project in a single Git repository
Create an automation script or tool to kick off the entire pipeline
Create a program to pull or create data and store it in AWS S3 before the pipeline runs
Create a simple CLI- or HTTP-based analysis tool to query the SQL database
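
One way the "run in succession" idea and the kick-off tool could be sketched is a small Java runner that executes named stages in order and stops at the first failure. This is only an illustrative sketch, not a prescribed design: the class name PipelineRunner, the stage names, and the commandStage helper are all hypothetical. The stages in main are placeholders that only print; in a real pipeline each could wrap an external command (for example a spark-submit invocation) through commandStage.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical runner: executes named stages in order, stopping at the first failure. */
public class PipelineRunner {
    // LinkedHashMap keeps stages in the order they were added.
    private final Map<String, Supplier<Boolean>> stages = new LinkedHashMap<>();

    /** Registers a stage; the supplier returns true on success, false on failure. */
    public PipelineRunner addStage(String name, Supplier<Boolean> stage) {
        stages.put(name, stage);
        return this;
    }

    /** Runs stages in insertion order and returns the names of those that succeeded. */
    public List<String> run() {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Supplier<Boolean>> entry : stages.entrySet()) {
            boolean ok;
            try {
                ok = entry.getValue().get();
            } catch (RuntimeException ex) {
                ok = false; // treat an unexpected exception as a failed stage
            }
            if (!ok) {
                System.err.println("Stage failed, stopping pipeline: " + entry.getKey());
                break;
            }
            completed.add(entry.getKey());
        }
        return completed;
    }

    /** Assumed helper: wraps an external command as a stage; succeeds on exit code 0. */
    public static Supplier<Boolean> commandStage(String... command) {
        return () -> {
            try {
                Process process = new ProcessBuilder(command).inheritIO().start();
                return process.waitFor() == 0;
            } catch (IOException e) {
                return false; // command could not be started
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        };
    }

    public static void main(String[] args) {
        // Placeholder stages standing in for the real applications.
        List<String> completed = new PipelineRunner()
            .addStage("load-to-s3", () -> { System.out.println("uploading data"); return true; })
            .addStage("spark-job", () -> { System.out.println("running analysis"); return true; })
            .addStage("save-to-sql", () -> { System.out.println("saving results"); return true; })
            .run();
        System.out.println("Completed stages: " + completed);
    }
}
```

Stopping at the first failed stage keeps later stages from running against missing or partial output, which matters when each stage depends on the one before it.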

Features

Tech Stack

Presentation

3-5 minute slide deck
5-10 minute live demonstration