Drupal coder

Three things we learned from indexing a Drupal site with millions of nodes in Apache SOLR

For one of our clients, we are running a Drupal site with millions of nodes. Before launch, those nodes are imported from another database and then indexed into Apache SOLR. Indexing all of these nodes into an empty SOLR instance takes days rather than hours or minutes.

That is a bit too long to run this import regularly. So my (Xdebug) profiler and I delved into the Apache SOLR module code to see where we could shave a few hours or days off the execution time.

In our case, three components turned out to be responsible for a large share of the execution time. Let's have a look.

By the way, we are using the latest dev build of version 2 of the Apache SOLR module.

Tip 1: Don't index $document->body

When indexing nodes, the SOLR module needs to construct an Apache_Solr_Document object for each node. It copies all fields and metadata of the node into that document. The heaviest part of constructing this document is assembling the $document->body field. The module uses the node_build_content and drupal_render($node->content) functions to generate the body of the node.

In our case, we didn't really use the body, since we were indexing companies with fields like name, address, manager, and so on. So we decided to remove the code that calculates the body from apachesolr_node_to_document. This gave us a major performance boost, but it only works if you don't need the body of a node, so it might not be applicable in your case.

Also keep in mind that the body is where all other fields and metadata are assembled too (depending on your search build mode configuration), so dropping the body drops that rendered text from the index as well.
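As a rough illustration of the idea (not the module's actual code), the expensive rendering step can be put behind a flag so that it is skipped when the body is not needed. The function name example_build_document_body, the $index_body flag and the $render callback are hypothetical stand-ins; in the real module this logic sits inside apachesolr_node_to_document and relies on node_build_content and drupal_render.

```php
<?php
/**
 * Returns the text to index as the document body, or '' when it is skipped.
 *
 * Hypothetical helper: $render stands in for the expensive
 * node_build_content() + drupal_render($node->content) step.
 */
function example_build_document_body($node, $render, $index_body) {
  if (!$index_body) {
    // Skip the render pipeline entirely; only the other fields get indexed.
    return '';
  }
  // Strip markup so that only plain text ends up in the body.
  return trim(strip_tags(call_user_func($render, $node)));
}

// Usage with a stub renderer that counts how often it is called.
$render_calls = 0;
$render = function ($node) use (&$render_calls) {
  $render_calls++;
  return '<p>' . $node->title . '</p>';
};

$node = new stdClass();
$node->title = 'Acme Inc.';

$skipped = example_build_document_body($node, $render, FALSE);
$rendered = example_build_document_body($node, $render, TRUE);
```

With the flag off, the stub renderer is never called, which is where the time saving would come from. A flag like this could be driven by a site variable or by the content type, but that choice is outside what the post describes.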

Tip 2: Add static caching to apachesolr_add_taxonomy_to_document

Another expensive step while generating the Apache_Solr_Document object is fetching the taxonomy terms in apachesolr_add_taxonomy_to_document. For each term, the ancestors are calculated. If your vocabulary is not hierarchical, you could remove that code altogether. If it is hierarchical, you could benefit a lot from static caching: you might have millions of nodes, but you probably have only a few hundred terms, so without a cache the ancestors of the same term are calculated over and over again.
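The static cache described above might look roughly like this. It is a minimal sketch, not the patch we applied: example_load_ancestors is a stub with a made-up three-term hierarchy standing in for a database-backed lookup (in Drupal 6 something like taxonomy_get_parents_all, which is an assumption about what the module calls), and all example_* names are hypothetical.

```php
<?php
/**
 * Counts the simulated database lookups so the cache effect is visible.
 */
function example_lookup_counter($increment = FALSE) {
  static $count = 0;
  if ($increment) {
    $count++;
  }
  return $count;
}

/**
 * Stub ancestor lookup standing in for an expensive database-backed call.
 */
function example_load_ancestors($tid) {
  example_lookup_counter(TRUE);
  // Hypothetical hierarchy: term id => parent term id (0 means no parent).
  $parents = array(3 => 2, 2 => 1, 1 => 0);
  $ancestors = array();
  while (!empty($parents[$tid])) {
    $tid = $parents[$tid];
    $ancestors[] = $tid;
  }
  return $ancestors;
}

/**
 * Returns the ancestors of a term, computing them at most once per process.
 */
function example_get_ancestors_cached($tid) {
  // A static variable keeps its value between calls within one PHP process.
  static $cache = array();
  if (!isset($cache[$tid])) {
    $cache[$tid] = example_load_ancestors($tid);
  }
  return $cache[$tid];
}

// Simulate many nodes that all carry the same term.
for ($i = 0; $i < 1000; $i++) {
  $ancestors = example_get_ancestors_cached(3);
}
$lookups = example_lookup_counter();
```

Here a thousand requests for the same term trigger a single lookup. The cache only lives as long as the PHP process, which is why the way the indexing is run matters.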

Keep in mind, though, that you won't benefit much from static caching if you index through batch processing with small batches, since the static cache is rebuilt for each batch step. So we wrote a Drush command to do the indexing. This way the static cache is kept for the entire indexing run.

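/**
 * Drush command callback: indexes all pending nodes in a single PHP process.
 */
function drush_slimkopen_solr_index() {
  // Reuse the cron limit as the chunk size for each round.
  $cron_limit = variable_get('apachesolr_cron_limit', 50);

  // Keep fetching chunks until nothing is left to index. Static caches
  // survive between chunks because everything runs in one process.
  while ($rows = apachesolr_get_nodes_to_index('apachesolr_search', $cron_limit)) {
    apachesolr_index_nodes($rows, 'apachesolr_search');
  }
}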
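A callback like the one above also has to be registered with Drush. A minimal sketch of that registration follows, assuming the command file is called slimkopen.drush.inc and the command is named solr-index. Both names are inferred from the callback name and are not confirmed here, and the description text is made up.

```php
<?php
/**
 * Implements hook_drush_command().
 *
 * Assumed to live in slimkopen.drush.inc, so that Drush can map the
 * "solr-index" command to the drush_slimkopen_solr_index() callback.
 */
function slimkopen_drush_command() {
  $items = array();
  $items['solr-index'] = array(
    'description' => 'Index all pending nodes into Apache SOLR in one run.',
  );
  return $items;
}
```

Under these assumptions the indexing would be started from the command line with something like "drush solr-index".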

Tip 3: Don't check excluded content types

The SOLR module has a nice feature that allows you to exclude certain content types from being indexed. It turns out that the check for excluded content types is pretty expensive. It happens in the apachesolr_get_nodes_to_index('apachesolr_search', $limit) call, where the apachesolr_search_node table is joined with the node table. For the initial import, we removed the check for excluded types (the join with node) and indexed all nodes. We removed the excluded ones from the index afterwards.

This was possible in our case since the bulk of nodes (99.9% of them) needed to be indexed.
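To make the difference concrete, here is a sketch of a query builder that only joins with the node table when there are types to exclude. It is not the module's query: the function name is hypothetical, and the column names (nid, changed, status, type) and the ordering are assumptions. Only the two table names come from the module as described above.

```php
<?php
/**
 * Builds the SELECT that picks the next nodes to index.
 *
 * Hypothetical helper; the column names used here are assumptions.
 */
function example_nodes_to_index_sql($excluded_types = array()) {
  $sql = 'SELECT asn.nid, asn.changed FROM {apachesolr_search_node} asn';
  $where = array('asn.status = 1');
  if (!empty($excluded_types)) {
    // Only pay for the join with {node} when there is something to exclude.
    $sql .= ' INNER JOIN {node} n ON n.nid = asn.nid';
    // One Drupal 6 style string placeholder per excluded type.
    $placeholders = implode(', ', array_fill(0, count($excluded_types), "'%s'"));
    $where[] = 'n.type NOT IN (' . $placeholders . ')';
  }
  return $sql . ' WHERE ' . implode(' AND ', $where) . ' ORDER BY asn.changed ASC, asn.nid ASC';
}

// Bulk import: no exclusion check, so no join.
$bulk_sql = example_nodes_to_index_sql();
// Regular operation: exclude two example content types.
$filtered_sql = example_nodes_to_index_sql(array('page', 'story'));
```

The bulk variant reads from a single table, which is the shortcut we took for the initial import. The filtered variant corresponds to the normal behaviour, with the join that made the call expensive for us.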

Conclusion

Drupal and its modules are developed to work in a lot of environments and situations. So next to the implementation of what they're designed to do, they also contain a lot of code that checks whether a certain condition or context applies. But when you are using or deploying a module, you know what the context is, so you may be able to remove some of that code. Keep in mind, though, that tampering with core and module code is bad practice, but there are a few practices that can help here!

For those curious about what kind of performance gain you might get from these tricks: in our case it was about 50%, but it depends heavily on the site's implementation.

September 06, 2010

Comments

You can also increase the cron limit if you are using drush; that will help you index more documents per cron run. Currently at most 200 documents can be indexed in one cron job. So if you increase it to 1000, it will index more documents per cron job and indexing will become faster.

I checked out apachesolr_node_to_document. I'm dealing with a content type that has the body disabled via /admin/content/node-type/NODETYPE and then blanking out the field in submission form settings. The end result of this is that the node_type table gets a value of "0" for has_body.

In that apachesolr_node_to_document function, how about wrapping the $node->body = drupal_render($node->content) line inside of an if statement that checks to see if the body is globally disabled for that content type? Would that globally avoid this for all content types that have no body?

@Robert: Most of these are only applicable in certain situations, as mentioned in my conclusion. But I see how patches can be created for this. Maybe add this as an option like "pass exclusion test, no body".

@Jason: It doesn't matter how many terms per node you have. Only how many terms you have.

Patches, patches, patches! =)

The static caching sounds like a good idea. As does anything you can add to the drush integration. Please share.

-Robert

Thanks very much for this! I'm also indexing millions of nodes. These tips are very helpful.

Although I have millions of terms as well, I don't have a hierarchy in them. So I removed that bit completely - great stuff!

Thanks again,

Thanks for the tips. I was wondering if you think the taxonomy cache would help with a large number of terms per node? We have around 50 terms per node and around 100k nodes. We actually have several categories as well to add different point boosts.

Also, do you run your drush from the command line or do you call that function from another script in the background? I ask because it seems like the CLI might time out.