Using Apache Spark to read data from Cassandra tables

By Prasad Khode

Apache Cassandra is an open-source distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure (SPOF): the system keeps serving requests even if part of it fails. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications. Spark is well suited to machine learning algorithms, as it allows user programs to load data into a cluster’s memory and query it repeatedly.

Suppose we have a table named “table_user” in our Cassandra database with the columns “user_first_name, user_last_name, user_email, date_of_birth”. We create a POJO class for it as below:
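No class listing survives here, so the following is a minimal sketch of such a POJO. The class name `User` and the field types (e.g. `date_of_birth` held as a `String`) are assumptions; the camelCase field names follow the spark-cassandra-connector convention of mapping snake_case column names to JavaBean properties.

```java
import java.io.Serializable;

// POJO mapping the columns of "table_user". Field types are assumptions,
// since the table schema is not shown in the article.
public class User implements Serializable {

    private String userFirstName;  // maps to user_first_name
    private String userLastName;   // maps to user_last_name
    private String userEmail;      // maps to user_email
    private String dateOfBirth;    // maps to date_of_birth

    public User() { }

    public String getUserFirstName() { return userFirstName; }
    public void setUserFirstName(String userFirstName) { this.userFirstName = userFirstName; }

    public String getUserLastName() { return userLastName; }
    public void setUserLastName(String userLastName) { this.userLastName = userLastName; }

    public String getUserEmail() { return userEmail; }
    public void setUserEmail(String userEmail) { this.userEmail = userEmail; }

    public String getDateOfBirth() { return dateOfBirth; }
    public void setDateOfBirth(String dateOfBirth) { this.dateOfBirth = dateOfBirth; }
}
```

The class implements `Serializable` because Spark ships objects of this type between the driver and the executors.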

To read data from the Cassandra tables using Apache Spark:

  1. Add the Apache Spark and Cassandra dependencies to pom.xml
  2. Configure a SparkConf object with the Cassandra database details
  3. Create a JavaSparkContext object from the SparkConf object
  4. Use the JavaSparkContext object to read the data from the Cassandra table

Step-1: Integrate Apache Spark and the Cassandra database using the following Maven dependencies
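The original dependency listing is not reproduced here, so the fragment below is a representative sketch. The artifact versions and the Scala suffix (`_2.11`) are assumptions; align them with the Spark and Cassandra versions actually in use.

```xml
<!-- Versions shown are illustrative; match them to your Spark/Cassandra setup. -->
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.8</version>
    </dependency>
    <dependency>
        <groupId>com.datastax.spark</groupId>
        <artifactId>spark-cassandra-connector_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
</dependencies>
```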

Step-2: Configure SparkConf object with the Cassandra database details
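A minimal sketch of this configuration, assuming a single local Cassandra node; the application name, master URL, host, and port are placeholders to replace with your own values.

```java
import org.apache.spark.SparkConf;

public class SparkCassandraConfig {

    public static SparkConf buildConf() {
        // Host and port are placeholders; point them at your Cassandra node(s).
        return new SparkConf()
                .setAppName("CassandraReadExample")
                .setMaster("local[*]")  // or your cluster's master URL
                .set("spark.cassandra.connection.host", "127.0.0.1")
                .set("spark.cassandra.connection.port", "9042");
    }
}
```

`spark.cassandra.connection.host` is the property the spark-cassandra-connector reads to locate the cluster; 9042 is Cassandra’s default native-protocol port.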

Step-3: Create JavaSparkContext object using SparkConf object
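A sketch of this step, reusing the same placeholder configuration values as above:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Build the SparkConf from Step-2 (placeholder values),
// then create the context from it.
SparkConf conf = new SparkConf()
        .setAppName("CassandraReadExample")
        .setMaster("local[*]")
        .set("spark.cassandra.connection.host", "127.0.0.1");

JavaSparkContext sc = new JavaSparkContext(conf);
```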

Step-4: Now that we have a JavaSparkContext object, read data from the Cassandra table by providing the keyspace and table name using the following code snippet:
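A sketch of the read itself, assuming the `sc` context from the previous step and the `User` POJO described earlier; the keyspace name `user_keyspace` is a placeholder.

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;

import org.apache.spark.api.java.JavaRDD;

// Read every row of table_user into User objects.
// "user_keyspace" is a placeholder for the actual keyspace name.
JavaRDD<User> userRDD = javaFunctions(sc)
        .cassandraTable("user_keyspace", "table_user", mapRowTo(User.class));
```

`mapRowTo(User.class)` tells the connector to map each Cassandra row onto the POJO’s JavaBean properties by column name.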

Now userRDD holds all the records from the table in the form of a Spark RDD. We can perform any filter, aggregation, or other Spark operation on top of this RDD.
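For illustration, a few operations one might run on `userRDD`; the Gmail predicate is just an example, not part of the original article.

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Count all users read from Cassandra.
long totalUsers = userRDD.count();

// Keep only users with a Gmail address (illustrative predicate).
JavaRDD<User> gmailUsers = userRDD.filter(
        u -> u.getUserEmail() != null && u.getUserEmail().endsWith("@gmail.com"));

// Collect their first names back to the driver.
List<String> firstNames = gmailUsers.map(User::getUserFirstName).collect();
```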