Cassandra Migration Tips

#1. Cassandra allows creating a composite primary key (primary key consisting of more than one columns – first part being partition key and rest being clustering keys)

For any queries using a composite pk column without including all of its partitioning key columns will require an index for lookup. For example:

CREATE TABLE test.groupmechanism (
groupid int,
mechanismid int,
createdby int,
createdon timestamp,
description text,
lastmodifiedby int,
lastmodifiedon timestamp,
name text,
PRIMARY KEY (groupid, mechanismid)
)

To query by mechanismID such as, select * from test.groupmechanism where mechanismid = 123; will result in

Cassandra cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

without an index on mechanismid;

Creating an index such as, create custom index groupmechanism_c1 on test.groupmechanism(mechanismid) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'case_sensitive': 'false'};

will result in successful execution of the query. See How and when to index data in Cassandra for fast and efficient retrieval? A simple explanation

#2. Cassandra does not support multi-column index.

Let’s say we have a table:

CREATE TABLE test.groupmechanism2 (
groupmechanismid int,
groupid int,
mechanismid int,
createdby int,
createdon timestamp,
description text,
lastmodifiedby int,
lastmodifiedon timestamp,
name text,
PRIMARY KEY (groupmechanismid)
)

and (groupid, mechanismid) together is a unique key. Now we want to check if a record exists by a unique key based query: select groupmechanismid from groupmechanism2 where groupid = 1 and mechanismid = 2;

For this query to work per #1 you would think that since Cassandra does not support a multi-column index, adding two separate constraints one on groupid and another on mechanismid as below would make the query run without requiring allow filtering:

create custom index groupmechanism2_c1 on test.groupmechanism2(groupid) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'case_sensitive': 'false'}; and create custom index groupmechanism2_c2 on test.groupmechanism2(mechanismid) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'case_sensitive': 'false'};

However even with two separate indexes, the query still requires allow filtering: select groupmechanismid from groupmechanism2 where groupid = 1 and mechanismid = 2 ALLOW FILTERING;

The reason for this is Cassandra executes this query by first figuring out which index out of the two returns least number of records. It then gets those rows (using that index) and then filters out rows that don’t match the other column value. This is the reason why whenever ALLOW FILTERING is used there could be performance impact if least number of record count is in millions (or some unreasonable number) Cassandra has to iterate through the entire set to filter out unmatched rows! Thus every use of ALLOW FILTERING must be intentional at the design time and monitored at runtime. See How and when to index data in Cassandra for fast and efficient retrieval? A simple explanation

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s