Creating a custom search engine from RSS feeds

Monday, March 26th, 2007

Update: The CSE team has included this functionality directly in Coop-search, so this code isn’t necessary anymore.

I’ve been playing around with the Google Custom Search Engine (CSE) feature recently. CSE lets anyone create a search engine that indexes a user-defined set of URL patterns. I thought it would be useful to create a CSE based on links in RSS feeds. The motivating example was programming.reddit.com. Pages linked there are submitted and voted on by the users, and the top-ranked items are consistently interesting and relevant. A CSE derived from sites like this is useful for a few reasons:

  1. Answering the question, “What was that page I saw on reddit 2 weeks ago about that thing?” I’m often trying to track down a page I remember seeing on reddit a long time ago, and a standard web search isn’t useful in this case.
  2. Building search engines that take advantage of the filtering done by human editors. CSE can search a set of URL patterns exclusively, or can augment a full web search by emphasizing selected URLs.

I’ve written a set of Python scripts that turn an RSS feed into a CSE annotations file and then upload the annotation list automatically. Here is an example of the resulting search engine. At the time of writing, this CSE searches 1,323 web pages collected from the programming.reddit.com RSS feed over the last month. The set of links is updated once a day by a cron job.
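
To give a rough idea of the conversion step, here is a minimal sketch. It is not the actual scripts, only an illustration under a few assumptions: the feed is plain RSS 2.0, the annotations file uses `Annotations`/`Annotation`/`Label` elements as I understand the CSE format, and the label `_cse_example` is a made-up placeholder for whatever label your own search engine uses. The function names are hypothetical too, and the upload step is left out entirely.

```python
import xml.etree.ElementTree as ET


def extract_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document.

    Order is preserved and duplicate links are dropped.
    """
    root = ET.fromstring(rss_xml)
    seen = set()
    links = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links


def build_annotations(links, label):
    """Build an annotations XML string with one <Annotation> per link.

    The element and attribute names are an assumption about the CSE
    annotations format; `label` is a placeholder supplied by the caller.
    """
    root = ET.Element("Annotations")
    for link in links:
        annotation = ET.SubElement(root, "Annotation", {"about": link})
        ET.SubElement(annotation, "Label", {"name": label})
    return ET.tostring(root, encoding="unicode")


# A tiny inline feed so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>sample</title>
<item><title>A</title><link>http://example.com/a</link></item>
<item><title>B</title><link>http://example.org/b?x=1&amp;y=2</link></item>
<item><title>A again</title><link>http://example.com/a</link></item>
</channel></rss>"""

if __name__ == "__main__":
    print(build_annotations(extract_links(SAMPLE_RSS), "_cse_example"))
```

A daily cron job would fetch the real feed, merge the new links with those already collected, and regenerate the file before uploading it.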

More information on how to run the scripts yourself is in the README file. The code is distributed under the PSF licence and can be retrieved with a Subversion client from here. Any feedback is, of course, appreciated.

Note: According to my reading of the CSE TOS, this is a legitimate use. I asked on the group and got no response. If anyone knows otherwise, please let me know.