In this series of posts, I take a fresh look at Apache Spark and investigate its applicability to a smaller problem (which in time may grow into a “true” big data problem). The companion GitHub project contains the sample code and installation instructions.
The series starts by introducing Spark and the bus timetable case study.
This post describes how Spark is run on a cluster, first locally and then on Amazon AWS.
January 25, 2016
This post describes how Spark can be used to extract and process data from the bus timetable and weather data sources.
January 24, 2016
This post starts by showing how Spark is installed and set up, and then develops a simple test application; a rough sketch of such an application is shown below this entry.
January 23, 2016
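As a hedged illustration of what a minimal Spark test application of this kind might look like (the object name and input path are placeholders of my own, not taken from the post), here is a sketch using the classic word-count pattern with the Scala API:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal self-contained Spark test application (illustrative sketch only).
// It counts word occurrences in a text file and prints a few results.
object WordCountTest {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark in-process on all available cores,
    // which is enough to verify that the installation works.
    val conf = new SparkConf().setAppName("WordCountTest").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("data/sample.txt")      // placeholder input path
      .flatMap(line => line.split("\\s+"))           // split lines into words
      .map(word => (word, 1))                        // pair each word with a count of 1
      .reduceByKey(_ + _)                            // sum counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```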
This post describes what Spark is and why one might use it, and introduces the case study to which Spark is applied.
January 22, 2016