Utilize RethinkDB Index - Part 2 - Secondary Index

14 October 2017
rethinkdb

This post is the second part of the first post here. This post focuses on how to utilize RethinkDB Secondary Index in different use cases efficiently.

Some rules when using RethinkDB Indexes

RethinkDB Indexes, similar to Indexes in other database, are the trade-off between read and write performance. Therefore, the basic rules for RethinkDB Indexes are similar to other database.

Don’t create index if not necessary.
Don’t create index if the data set is small enough so that you can use the filter query.
Indexes require memory to process, be careful with tables that have a lot of indexes.
Indexes can slow down write operations significantly.

RethinkDB Simple Secondary Index

Continue from last post, sometimes, what you need is not simply querying by primary key. More often, you may want to get all the records belong to one specific category. In case the number of records is too big for one simple filter command, you will need to create one secondary index for faster querying.

For example, you are organizing all the products in your warehouse into several categories. Each category is stored as one record in the categories table as below

{
  "id":             "category id",
  "categoryName":   "category name"
}

Each product is stored as one record in the products table with one prop categoryId for referencing back to the category table.

{
  "id":             "product id",
  "productName":    "product name",
  "categoryId":     "category id",
  "categoryId":     "0-10",
  "categoryId":     "product status"
}

If you want to query all products belong to one category, the simplest solution is to create a simple secondary index on the field categoryId of the product table

function* createIndex() {
  yield r.table('products').indexCreate('categoryId');
}

To query all the products by category, simply use RethinkDB getAll command

function* getProductsByCategoryId(categoryId) {
  return yield r.table('products').getAll(categoryId, { index: 'categoryId' });
}

Joining 2 tables with the help of secondary indexes is also very fast. In case you want to get all the products belong to one category with the category name, simply do

function* getProductsWithCategoryName(categoryId) {
  return yield r
    .table('products')
    .getAll(categoryId, { index: 'categoryId' })
    .eqJoin('categoryId', r.table('categories'))
    .zip();
}

Or if you want to do the reverse way, find the category by id and all its related products

function* getCategoryByIdAndProducts(categoryId) {
  return yield r.table('categories')
    .get(categoryId)
    .merge({
      products: r.table('products')
        .getAll(categoryId, {index: 'categoryId'})
        .coerceTo('array')
    });
}

RethinkDB Compound Index

As your application grows bigger, your requirements may be extended to filtering products based on more properties, not just categoryId. That is where a compound index would be useful. If you design it carefully, you can even use one compound index to serve many different kinds of queries.

Back to the above example, if we change the requirements to find all the products belong to one category and has the rating is 5, the index would look like this

function* createIndex() {
  yield r.table('products').indexCreate(
    'categoryId_rating',
    [r.row('categoryId'), r.row('rating')]
  );
}

and the getAll query will receive an array param like this

function* getProductByCategoryIdAndRating(categoryId, rating) {
  return yield r.table('products')
    .getAll([categoryId, rating], { index: 'categoryId_rating' });
}

A compound index is also helpful when you want to do a range query, too. If you want to find all the products belong to one category with rating from 1 to 5, use between in combination with that index

function* getProductsByCategoryInRatingRange(categoryId, fromRating, toRating) {
  return yield r.table('products')
    .between(
      [categoryId, fromRating],
      [categoryId, toRating],
      { index: 'categoryId_rating' }
    );
}

getProductsByCategoryInRatingRange('sample-cat-id', 1, 5);

And hey, you can also take advantage of the new index to eliminate the previous one categoryId. If you want to search for all the products belong to one category, simply use between with all the possible rating values

function* getProductsByCategoryId(categoryId) {
  return yield r.table('products')
    .between(
      [categoryId, r.minval],
      [categoryId, r.maxval],
      { index: 'categoryId_rating' }
    );
}

This would be a big win when you only to one compound index to serve multiple purposes.

Extending the Compound Index

To extend the above example, if your application requires querying all the products belong to one category, but also care about the rating and status of those products, the index will look like this

function* createIndex() {
  yield r.table('products').indexCreate(
    'categoryId_rating_status',
    [r.row('categoryId'), r.row('rating'), r.row('status')]
  );
}

The above index be used for several use cases, from exact value to range querying.

Get all products belong to one category

function* getProductsByCategoryId(categoryId) {
  return yield r
    .table('products')
    .between([realmId, r.minval, r.minval], [realmId, r.maxval, r.maxval], {
      index: 'categoryId_rating_status'
    });
}

Get all products belong to one category, which have specific rating score

function* getProductsByCategoryAndRating(categoryId, rating) {
  return yield r
    .table('products')
    .between([categoryId, rating, r.minval], [categoryId, rating, r.maxval], {
      index: 'categoryId_rating_status'
    });
}

Get all products by 3 props

function* getProducts(categoryId, rating, status) {
  return yield r.table().getAll([categoryId, rating, status], {
    index: 'categoryId_rating_status'
  });
}

You can even order the results retrieved from that index

Get all products belong to one category, order by rating first and then status

function* getProductsByCategoryId(categoryId) {
  return yield r
    .table('products')
    .between([realmId, r.minval, r.minval], [realmId, r.maxval, r.maxval], {
      index: 'categoryId_rating_status'
    })
    .orderBy({
      index: 'categoryId_rating_status'
    });
}

Get all products belong to one category, with a specific rating score, order by the status

function* getProductsByCategoryAndRating(categoryId, rating) {
  return yield r
    .table('products')
    .between([categoryId, rating, r.minval], [categoryId, rating, r.maxval], {
      index: 'categoryId_rating_status'
    })
    .orderBy({
      index: 'categoryId_rating_status'
    });
}

A Note on Compound Index

One important thing to notice for RethinkDB Compound Index is the order of the fields.

You may not need to care about the order when you just need to query by the exact values of the compound index. For example, you can change the above index to whatever order you want (rating_status_categoryId, status_categoryId_rating) and these queries will always produce the same result with the same speed

// filter all products by 3 props
function* getProducts1(categoryId, rating, status) {
  return yield r.table().getAll([rating, status, categoryId], {
    index: 'rating_status_categoryId'
  });
}

function* getProducts2(categoryId, rating, status) {
  return yield r.table().getAll([status, categoryId, rating], {
    index: 'status_categoryId_rating'
  });
}

However, when it comes to range querying, it’s another story. If you take a closer look at the above queries, you will notice that the left-most values are always fixed/determined values while the right-most values can be vary. If you want to do a range query with the vary left-most values, it is similar to a whole table sequence scan.

For example, to get all the products belong to one category, this will work efficiently

function* getProductsByCategoryId(categoryId) {
  return yield r
    .table('products')
    .between(
      [categoryId, r.minval],
      [categoryId, r.maxval],
      { index: 'categoryId_rating' }
    );
}

This one also works but it’s similar to a sequence scan and is not efficient

function* getProductsByCategoryId(categoryId) {
  return yield r
    .table('products')
    .between(
      [r.minval, categoryId],
      [r.maxval, categoryId],
      { index: 'rating_categoryId' }
    );
}