Cassandra Primary Key = ((Partitioning Key), Clustering Key): A simple explanation

Cassandra primary key (a unique identifier for a row) is made up of two parts – 1) one or more partitioning columns and 2) zero or more clustering columns.

To search a table without any indexes, all partitioning columns must be provided to avoid error message: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

What this error means is that searching without all partitioning key columns, the query cannot locate one-and-only-one node where the result exists. Let’s create a sample table with a primary key composed of two-column partitioning key and zero-column clustering key to understand this better:

cqlsh:test> create table if not exists test.testCompositePartitionKey (
k1 text,
k2 text,
c1 text,
c2 text,
primary key ((k1, k2)) );

cqlsh:test> insert into test.testCompositePartitionKey (k1, k2, c1, c2) values (‘k1-1’, ‘k2-1’, ‘c1-1’, ‘c2-1’);

cqlsh:test> select * from test.testCompositePartitionKey where k1 = ‘k1-1’;

InvalidRequest: Error from server: code=2200 [Invalid query] message=”Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING”

Now let’s allow Cassandra to do filtering, i.e. allow it to search entire cluster of nodes if necessary. Obviously your query will be a lot more expensive.

cqlsh:test> select * from test.testCompositePartitionKey where k1 = ‘k1-1’ allow filtering;

k1   | k2   | c1   | c2
——+——+——+——
k1-1 | k2-1 | c1-1 | c2-1

The example above used a multi-column partitioning key. Note the double parentheses around the first two columns defined in the PRIMARY KEY primary key ((k1, k2)).

Now let’s consider another sample table with a primary key composed of one-column partitioning key and one-column clustering key to understand that the primary key is still the same – just the partitioning key has changed (thus the hash used by Cassandra to distribute records across cluster):

cqlsh:test> create table if not exists test.testSingleColumnPartitionKey (
k1 text,
k2 text,
c1 text,
c2 text,
primary key ((k1), k2) );

cqlsh:test> insert into test.testSingleColumnPartitionKey (k1, k2, c1, c2) values (‘k1-1’, ‘k2-1’, ‘c1-1’, ‘c2-1’);

cqlsh:test> select * from test.testSingleColumnPartitionKey where k1 = ‘k1-1’;

k1   | k2   | c1   | c2
——+——+——+——
k1-1 | k2-1 | c1-1 | c2-1

As you see, this time we did not get the allow filtering error because we included all columns (one in this example) of partitioning key in the query.

Now let’s try to insert another record by same primary key ((k1), k2):

cqlsh:test> insert into test.testSingleColumnPartitionKey (k1, k2, c1, c2) values (‘k1-1’, ‘k2-1’, ‘c1-1u1’, ‘c2-1u1’);

cqlsh:test> select * from test.testSingleColumnPartitionKey where k1 = ‘k1-1’;

k1   | k2   | c1     | c2
——+——+——–+——–
k1-1 | k2-1 | c1-1u1 | c2-1u1

(1 rows)

Notice that there is still one-and-only-one record (updated with new c1 and c2 values) in Cassandra by the primary key k1=k1-1 and k2=k2-1.

To summarize, all columns of primary key, including columns of partitioning key and clustering key make a primary key. Partitioning key columns are used by Cassandra to spread the records across the cluster.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s