Cassandra is a multi-node, peer-to-peer cluster/distributed system that distributes/stores data across all nodes in the cluster.
Every table in Cassandra is physically stored in multiple SSTable files spread across one or multiple nodes. Rows are spread around the cluster based on a hash of the partition key, which is the first part of the primary key.
So with a known partition key, Cassandra can identify the node and retrieve the record from it – so it is a one node, exact location lookup – fastest way to query in Cassandra. Thus partition keys can be thought of as global indexes.
Business requirements often require querying data from non partition key columns. There are two ways we can do this in Cassandra efficiently 1) secondary indexes and 2) materialized view.
Secondary indexes are local to the node where indexed data is stored. So if a query includes a partition key and indexed column, Cassandra can pin point the node to query and then use index on that node to get the result. However if a query does not include a partition key then Cassandra needs to contact all nodes hosting the table data, use index on each node to get the result, and then combine results from all the nodes to form the final result. So secondary index-based queries without partition data are multi-node lookups – expensive way to query in Cassandra. To add to this if data is of high cardinality – distinctness of partition keys (hash of partition keys) spreads data evenly across multiple nodes – number of nodes to lookup would be higher. For data of low cardinality – lack of distinctness of partition keys (hash of partition keys) spreads data to fewer nodes in the cluster – number of nodes to lookup would be smaller and secondary index-based query without partition key might be justifiable.
Materialized views are global indexes and enable querying data of high cardinality by non partition keys in an efficient way. The cost of materialized view is expensive writes.