Drupal script to automatically add terms to nodes

I had a client recently who had a vocabulary of all car brands and thousands of news items on cars. He wanted to tag all news items (implemented as nodes) with the card brand(s) they were about. No Drupal module available (yet) to do that, so I've written a script myself.

I think the script (which you will find further on) is very easy to understand. I iterate over all terms in the vocabulary and for each term I do a Drupal search. I then iterate over all search results an see if the score of the result is high enough to mark it as a good search result for this term.

Easy! Right? Well, yes. But there were a few things to take in account.

I had to modify the default Drupal do_search function because it was written for paged search result display for one search operation per page request.

To make the function display all search results was fairly easy. I just had to remove the pager_query function line and replace it with a db_query one.

  // Do actual search query
  //$result = pager_query("SELECT * FROM temp_search_results", 10, 0, $count_query); // This is the line that got removed from the original function implementation.
  $result = db_query("SELECT * FROM temp_search_results"); // This is what we'll use now. We don't want any paging. We want it all!

Allowing the function to make more search operations in one script run was a bit trickier. By default, the function makes use of a temporary table to build its results. This temporary table is automatically cleared after the request. But we want to clear it earlier. So I added a few lines at the end of the function.

  // Drop the temporary table to allow multiple searches in one run
  if (($count = db_result(db_query('SELECT COUNT(*) FROM temp_search_results'))) > 0) {
    db_query_temporary("DROP table temp_search_results");    
  }
  if (($count = db_result(db_query('SELECT COUNT(*) FROM temp_search_sids'))) > 0) {
    db_query_temporary("DROP table temp_search_sids");    
  }

Below you find the complete script. You can also download it. The script defines a few parameters. You can specify

  • the vocabulary from which the terms should be taken
  • the minimum score a result has to have to be taken into account
  • a run mode (verbose/run)
  • the node type you want to run the script upon

To run the script, put it in the root of your Drupal installation. To get the best results, play with the MIN_SCORE parameter.

Please make sure all your nodes are indexed by Drupal's search before you run this script. Otherwise you won't get any results.

<?php

define('VOCABULARY', 1); // Vocabulary from which the terms should be taken
define('MIN_SCORE', 0.3); // Minimum score for the search index. Results lower then this are not taken into account

define('VERBOSE', TRUE); // TRUE = Verbose run : Display what nodes will get what terms
define('RUN', TRUE); // TRUE = Fill in the terms in Drupal / FALSE = Don't fill them in, we want to see the verbose information first

define('NODE_TYPE', 'news'); // Only nodes of this node type will be taken into account

// Bootstrapping Drupal
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Logging in as admin user
global $user;
$user->uid = 1;

// Getting all terms from our vocabulary
$sql_terms = "SELECT tid, name FROM {term_data} WHERE vid = %d";
$result_terms = db_query($sql_terms, VOCABULARY);

// Iterating over all terms
while($term = db_fetch_object($result_terms)) {
  
  // Do a search in Drupal
  $search_results = custom_do_search($term->name, 'node');
  
  // Iterating over all results
  foreach($search_results as $item) {
    
    // We only want results with a decent score
    if($item->score > MIN_SCORE) {
      
      // Fetch the node info
      $sql_node = "SELECT nid, type, title FROM {node} WHERE nid = %d";
      $result_node = db_query($sql_node, $item->sid);
      $node = db_fetch_object($result_node);
      
      // Is the node of the correct type
      if($node->type == NODE_TYPE) {
        
        // If we're in verbose mode, print term, node title and search score
        if(VERBOSE) {
          print $term->name . " | " . $node->title . " | " . $item->score . "\n";
        }
        
        // Fill in the terms in Drupal if we need to
        if(RUN) {
          db_query("INSERT INTO {term_node} (nid, tid) VALUES (%d, %d)", $item->sid, $term->tid);
        }
      }
    }
  }
}

function custom_do_search($keywords, $type, $join1 = '', $where1 = '1', $arguments1 = array(), $select2 = 'i.relevance AS score', $join2 = '', $arguments2 = array(), $sort_parameters = 'ORDER BY score DESC') {
  $query = search_parse_query($keywords);

  if ($query[2] == '') {
    form_set_error('keys', t('You must include at least one positive keyword with @count characters or more.', array('@count' => variable_get('minimum_word_size', 3))));
  }
  if ($query === NULL || $query[0] == '' || $query[2] == '') {
    return array();
  }

  // First pass: select all possible matching sids, doing a simple index-based OR matching on the keywords.
  // 'matches' is used to reject those items that cannot possibly match the query.
  $conditions = $where1 .' AND ('. $query[2] .") AND i.type = '%s'";
  $arguments = array_merge($arguments1, $query[3], array($type, $query[4]));
  $result = db_query_temporary("SELECT i.type, i.sid, SUM(i.score * t.count) AS relevance, COUNT(*) AS matches FROM {search_index} i INNER JOIN {search_total} t ON i.word = t.word $join1 WHERE $conditions GROUP BY i.type, i.sid HAVING COUNT(*) >= %d", $arguments, 'temp_search_sids');

  // Calculate maximum relevance, to normalize it
  $normalize = db_result(db_query('SELECT MAX(relevance) FROM temp_search_sids'));
  if (!$normalize) {
    return array();
  }
  $select2 = str_replace('i.relevance', '('. (1.0 / $normalize) .' * i.relevance)', $select2);

  // Second pass: only keep items that match the complicated keywords conditions (phrase search, negative keywords, ...)
  $conditions = '('. $query[0] .')';
  $arguments = array_merge($arguments2, $query[1]);
  $result = db_query_temporary("SELECT i.type, i.sid, $select2 FROM temp_search_sids i INNER JOIN {search_dataset} d ON i.sid = d.sid AND i.type = d.type $join2 WHERE $conditions $sort_parameters", $arguments, 'temp_search_results');
  if (($count = db_result(db_query('SELECT COUNT(*) FROM temp_search_results'))) == 0) {
    return array();
  }
  $count_query = "SELECT $count";

  // Do actual search query
  //$result = pager_query("SELECT * FROM temp_search_results", 10, 0, $count_query); // This is the line that got removed from the original function implementation.
  $result = db_query("SELECT * FROM temp_search_results"); // This is what we'll use now. We don't want any paging. We want it all!
  $results = array();
  while ($item = db_fetch_object($result)) {
    $results[] = $item;
  }

  // Drop the temporary table to allow multiple searches in one run
  if (($count = db_result(db_query('SELECT COUNT(*) FROM temp_search_results'))) > 0) {
    db_query_temporary("DROP table temp_search_results");    
  }
  if (($count = db_result(db_query('SELECT COUNT(*) FROM temp_search_sids'))) > 0) {
    db_query_temporary("DROP table temp_search_sids");    
  }

  return $results;
}

?>
AttachmentSize
drupalcoder-autotag.php.zip2.46 KB
Written on September 29, 2008 at 10:58, tagged as auto tagging, Drupal, scripts, search, taxonomy

Comments

no more information ?

Thanks.

Just re-read his project description - this no longer runs on cron in drupal 6
This script may be the best place to start in the end.

My colleague has developed the autotagger module which does this on cron as well as at node submit time:
http://drupal.org/project/autotag

Sorry I forgot to ask if it works with drupal 5 and 6 ?

Hello,

Sorry but I don't understand what your script is doing exactly, can you explain me better ?

Thanks a lot.

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.

More information about formatting options

About

drupalcoder.com is a blog on all things Drupal in specific and LAMP on OS X in general. It is maintained by Davy Van Den Bremt, a Belgian (Drupal) web developer and designer living in Ghent. The goal of this blog is to log all interesting things that have crossed the writer's path while developing Drupal sites. You can read all about Davy's professional activities on his LinkedIn profile. If you want to get in touch, use the contact form.