How and when to index data in Cassandra for fast and efficient retrieval? – A simple explanation

Cassandra is a multi-node, peer-to-peer cluster/distributed system that distributes/stores data across all nodes in the cluster.

Every table in Cassandra is physically stored in multiple SSTable files spread across one or multiple nodes. Rows are spread around the cluster based on a hash of the partition key, which is the first part of the primary key.

So with a known partition key, Cassandra can identify the node and retrieve the record from it – so it is a one node, exact location lookup – fastest way to query in Cassandra. Thus partition keys can be thought of as global indexes.

Business requirements often require querying data from non partition key columns. There are two ways we can do this in Cassandra efficiently 1) secondary indexes and 2) materialized view.

Secondary indexes are local to the node where indexed data is stored. So if a query includes a partition key and indexed column, Cassandra can pin point the node to query and then use index on that node to get the result. However if a query does not include a partition key then Cassandra needs to contact all nodes hosting the table data, use index on each node to get the result, and then combine results from all the nodes to form the final result. So secondary index-based queries without partition data are multi-node lookups – expensive way to query in Cassandra. To add to this if data is of high cardinality – distinctness of partition keys (hash of partition keys) spreads data evenly across multiple nodes – number of nodes to lookup would be higher. For data of low cardinality – lack of distinctness of partition keys (hash of partition keys) spreads data to fewer nodes in the cluster – number of nodes to lookup would be smaller and secondary index-based query without partition key might be justifiable.

Materialized views are global indexes and enable querying data of high cardinality by non partition keys in an efficient way. The cost of materialized view is expensive writes.

 

 

2 thoughts on “How and when to index data in Cassandra for fast and efficient retrieval? – A simple explanation

  1. Very nice explanation. I have secondary index on column “email” for table”employee”. Very high cardinality. I have written query like “select * from employee where email = ’emp@test.com’;”
    I have 30 node cluster with RF=3. Query is not using partition key in where clause. My question , is there any way or formula to identify as to how many nodes would get contacted in order to fetch data for above query?

    Like

    1. As you know, secondary index is useful only when used along with a partitioning key. In your case, just top of my head I am thinking it could take anywhere from 1 to 30 trips – as it will have to go through each and every node until it finds the email… You may want to create a materialized view to lookup by email…

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s