Data Engineering with Java & Apache Spark
A data pipeline with several modular components, including but not limited to:
The pipeline is intended to run as several applications in succession: it reads a data source from an S3 bucket, deploys a Spark job on a cluster to analyze the data, and saves the results to a SQL database. Some recommendations for organizing and extending the project: