Re-implementing the CKAN API for performance

CKAN, an open source data portal platform, provides an API for fetching everything from datasets to individual records (using the Datastore extension). Here we look at how CKAN's architecture allows developers to transparently re-implement the datastore API, and how this was used to improve performance by switching all searches to use a Solr backend.

The issue arose while working on the Natural History Museum's Data Portal: with 2.8M rows of over 70 fields each, and a user interface that allows users to search on any combination of fields, we felt that PostgreSQL was providing poor performance. At this scale, adding more hardware was an option, but we felt this was not the right solution when Solr could run the same searches 20 times faster.

CKAN's architecture

CKAN implements an RPC style API which exposes all of CKAN's core features. What is particularly useful is that internal calls are also routed via the same API: CKAN's get_action is used to return the functions that can be used to perform various actions, such as creating a dataset or performing a datastore query.

This approach has numerous advantages:

  • Decouples interface and implementation;
  • Enables plugins to override actions;
  • Provides a consistent interface, whether developing an extension or a client;
  • Server-side extensions can use the same API without going through a de/serialization process.

Of course there are some disadvantages - one of them is the absence of an ORM style interface: all data is provided simply as a dictionary, and is manipulated by invoking functions.
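The lookup-by-name pattern described above can be illustrated with a toy registry. This is a minimal sketch and not CKAN's own code: register_action, the module-level _actions dictionary and the returned 'backend' key are all hypothetical names introduced here. Only the get_action name and the (context, data_dict) calling convention mirror CKAN.

```python
# Toy action registry: NOT CKAN's code, only an illustration of the pattern.
_actions = {}


def register_action(name, func):
    """Register (or override) the function that implements an action."""
    _actions[name] = func


def get_action(name):
    """Return the function currently registered under an action name."""
    return _actions[name]


def datastore_search(context, data_dict):
    # Default implementation: stands in for a PostgreSQL-backed search.
    return {'backend': 'postgresql', 'resource_id': data_dict['resource_id']}


def datastore_solr_search(context, data_dict):
    # Replacement with the same signature and the same result shape.
    return {'backend': 'solr', 'resource_id': data_dict['resource_id']}


register_action('datastore_search', datastore_search)

# Callers only ever look the action up by name and pass plain dictionaries...
before = get_action('datastore_search')({}, {'resource_id': 'abc'})

# ...so the implementation can be swapped without touching any caller.
register_action('datastore_search', datastore_solr_search)
after = get_action('datastore_search')({}, {'resource_id': 'abc'})
```

Because callers depend only on the action name and on dictionaries going in and out, the two implementations are interchangeable from their point of view.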

Re-implementing the datastore API

Thanks to CKAN's architecture, we were able to re-implement the API completely and provide a compatible API that uses Solr, rather than PostgreSQL, for datastore searches: ckanext-datasolr.

To override calls to the datastore_search API endpoint, we created a plugin that implements the IRoutes interface so we could change where calls to datastore_search would be routed. This is done simply as:

import ckan.plugins as p

class DataSolrPlugin(p.SingletonPlugin):
    p.implements(p.IRoutes, inherit=True)

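    def before_map(self, map):
        # Route the standard datastore_search endpoint to our own action.
        map.connect(
            'datasolr',
            '/api/3/action/datastore_search',
            controller='api',
            action='action',
            logic_function='datastore_solr_search',
            ver=u'/3'
        )
        # IRoutes expects the (possibly modified) map to be returned.
        return map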

We need to declare our logic function datastore_solr_search by implementing the IActions interface in our plugin:

# ...
from ckanext.datasolr.logic.action import datastore_solr_search

class DataSolrPlugin(p.SingletonPlugin):
    # ...
    p.implements(p.interfaces.IActions)

    # ...
    def get_actions(self):
        return {
            'datastore_solr_search': datastore_solr_search
        }

And this is it - all calls, internal or external, to datastore_search will be routed to ckanext.datasolr.logic.action.datastore_solr_search - we are now free to re-implement the API as we wish (do check the implementation for details).
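To give a rough idea of what such a re-implementation has to do, here is a minimal sketch of translating datastore_search style parameters into Solr query parameters. It is not the ckanext-datasolr implementation: build_solr_params is a hypothetical helper, the mapping and the default limit of 100 are assumptions made for illustration, and q is assumed to be a plain string.

```python
def _quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    text = str(value).replace('\\', '\\\\').replace('"', '\\"')
    return '"{}"'.format(text)


def build_solr_params(data_dict):
    """Translate datastore_search style parameters into Solr parameters.

    Hypothetical helper; the mapping below is an assumption, not the
    actual ckanext-datasolr behaviour.
    """
    params = {
        # Match everything unless a full-text query string is given.
        'q': data_dict.get('q') or '*:*',
        # limit/offset become Solr's rows/start paging parameters.
        'rows': data_dict.get('limit', 100),
        'start': data_dict.get('offset', 0),
    }
    # Each exact-match filter becomes one Solr filter query (fq).
    filters = data_dict.get('filters', {})
    params['fq'] = [
        '{}:{}'.format(field, _quote(value))
        for field, value in sorted(filters.items())
    ]
    if data_dict.get('sort'):
        params['sort'] = data_dict['sort']
    return params


params = build_solr_params({
    'resource_id': 'abc',
    'filters': {'genus': 'Rosa', 'country': 'UK'},
    'limit': 10,
})
```

The resulting dictionary could then be handed to whichever Solr client is in use; the result would still need to be reshaped into the structure that datastore_search callers expect.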

Making the new plugin extensible

CKAN's interface architecture, which allows plugins to easily add functionality to other parts of the system, is another element that allowed us to implement this plugin. As such, and given that the datastore extension has its own interface, it seemed like a good idea to implement one for our plugin.

This is done simply by creating a class that inherits from ckan.plugins.interfaces.Interface:

from ckan.plugins import interfaces

class IDataSolr(interfaces.Interface):
    def datasolr_validate(self, context, data_dict, field_types):
        return data_dict

    def datasolr_search(self, context, data_dict, field_types, query_dict):
        return query_dict

Plugins that want to add their own validation and/or to modify the search expression just need to implement this interface. They need to declare this by including ckan.plugins.implements(IDataSolr).

The ckanext-datasolr code can now invoke all plugins that extend it by doing, for example:

from ckan.plugins import PluginImplementations

for plugin in PluginImplementations(IDataSolr):
    data_dict = plugin.datasolr_validate(
        self.context, data_dict, self.fields
    )
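The same chaining applies to the search hook: each plugin receives the query built so far and returns a possibly modified version. The sketch below shows this without requiring a CKAN installation, so a plain list stands in for PluginImplementations and BaseDataSolr stands in for IDataSolr. PublicOnlyPlugin, run_search_hooks and the 'public' field are hypothetical names used only for illustration.

```python
class BaseDataSolr(object):
    """Stand-in for IDataSolr: default hooks return their input unchanged."""

    def datasolr_validate(self, context, data_dict, field_types):
        return data_dict

    def datasolr_search(self, context, data_dict, field_types, query_dict):
        return query_dict


class PublicOnlyPlugin(BaseDataSolr):
    """Hypothetical plugin that restricts every search to public records."""

    def datasolr_search(self, context, data_dict, field_types, query_dict):
        # Work on a copy so the caller's dictionary is left untouched.
        query_dict = dict(query_dict)
        query_dict['fq'] = list(query_dict.get('fq', [])) + ['public:true']
        return query_dict


def run_search_hooks(plugins, context, data_dict, field_types, query_dict):
    """Pass the query through every plugin in turn, chaining the results."""
    for plugin in plugins:
        query_dict = plugin.datasolr_search(
            context, data_dict, field_types, query_dict
        )
    return query_dict


original = {'q': '*:*', 'fq': ['genus:"Rosa"']}
result = run_search_hooks(
    [BaseDataSolr(), PublicOnlyPlugin()], {}, {}, {}, original
)
```

Returning the dictionary from every hook, rather than mutating shared state, keeps the order of plugins explicit and makes each one easy to test on its own.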

Extensibility is important for any sort of framework - and finding the right balance between that and architectural complexity is often tricky. CKAN's approach is in some respects rigid, but the fact we were able to re-implement this API is testament to its effectiveness.