How and when to index data in Cassandra for fast and efficient retrieval? – A simple explanation

Cassandra is a multi-node, peer-to-peer cluster/distributed system that distributes/stores data across all nodes in the cluster.

Every table in Cassandra is physically stored in multiple SSTable files spread across one or multiple nodes. Rows are spread around the cluster based on a hash of the partition key, which is the first part of the primary key.

So with a known partition key, Cassandra can identify the node and retrieve the record from it – so it is a one node, exact location lookup – fastest way to query in Cassandra. Thus partition keys can be thought of as global indexes.

Business requirements often require querying data from non partition key columns. There are two ways we can do this in Cassandra efficiently 1) secondary indexes and 2) materialized view.

Secondary indexes are local to the node where indexed data is stored. So if a query includes a partition key and indexed column, Cassandra can pin point the node to query and then use index on that node to get the result. However if a query does not include a partition key then Cassandra needs to contact all nodes hosting the table data, use index on each node to get the result, and then combine results from all the nodes to form the final result. So secondary index-based queries without partition data are multi-node lookups – expensive way to query in Cassandra. To add to this if data is of high cardinality – distinctness of partition keys (hash of partition keys) spreads data evenly across multiple nodes – number of nodes to lookup would be higher. For data of low cardinality – lack of distinctness of partition keys (hash of partition keys) spreads data to fewer nodes in the cluster – number of nodes to lookup would be smaller and secondary index-based query without partition key might be justifiable.

Materialized views are global indexes and enable querying data of high cardinality by non partition keys in an efficient way. The cost of materialized view is expensive writes.

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s