del.icio.us driven Google custom search

This is an account of how and why I wrote a Google custom search engine to search sites that I had bookmarked on del.icio.us.

I’ve liked the Google custom search engine since I first played with it shortly after it came out. If you don’t know about Google CSE, it allows an individual or group to create a search form that will perform a full text search using the Google search engine but limited to sites which they choose. This search form, and the results page, can be embedded in any website. I think it is the obvious way to build a cross search across all the centres in an organization like the HE Academy (this was one of my first custom search engines). Better, for teaching and learning you can set up a reading list of recommended sites for a course and let students do a full Google search that prioritizes those sites (for a sort of generic variation on this see Tony Hirst’s Open Educational Resources search). Better still, let the students as a group decide which sites they want on their course reading list.

Building and editing a Google custom search using the interface on the Google site is by no means difficult, but over the past year or so some tools to make it even easier have come out. The Google Marker allows you to add sites to a CSE by pressing a button on your browser. Or you can make CSEs on the fly that search across the links in a web page (the example from Tony Hirst was based on links from this page from ZaidLearn). Scott Leslie enhanced this idea by putting the links into a wiki so that it was possible to add sites.

That’s great, showing the potential for Google CSEs to fit in with other tools that are already being used. Now the tool that I use for marking interesting sites is del.icio.us, so what I really wanted was a CSE that searched over sites I have tagged in del.icio.us. I found a couple of examples of this. One wanted my delicious username and password. No thanks. The other, deligoo, looks good, but wanted me to install a Firefox plugin, which didn’t seem quite what I was after. Anyway, I couldn’t get it working (it wasn’t a good day).

I knew that you could specify the sites for a CSE to search in an XML file and thought it shouldn’t be too difficult to take a feed from del.icio.us and convert it into that XML file. There’s even a RESTy API thing from Google that promises to do that sort of thing. But I couldn’t get it working (it wasn’t a good day). So I thought why not take a tool like RSS2HTML to make a page with the links I wanted to search and cut’n’paste the code for an on-the-fly CSE which would search the links on that page. But I couldn’t get it working (it wasn’t a good day).

So then I thought I probably needed to start from scratch and find out what these XML files that define custom searches actually looked like and how they worked, and perhaps then I would start to get somewhere. The Google developer guide for this isn’t great, but with its help and by exporting the context and annotation files for some custom searches I had created through the Google web interface, I eventually worked out what I was doing. Along the way I found out that something somewhere was caching my files so that updates weren’t getting to Google. Even when I used the Google refresh and test page for CSEs the updates didn’t get through: I would make a change to the XML file, try to test it and find it didn’t work, make some modification, test that and find that suddenly the CSE was working with the previous version. I have a bruised forehead now, but once I worked out where the problem was the day started getting better.

At the end of the day, I had a Perl script that would read the RSS feed for pages bookmarked with a specified user/tag combination on del.icio.us and create the CSE XML description of a hosted Google custom search engine that searches those sites tagged on del.icio.us. The del.icio.us user/tag combination that specifies the sites to be searched is hard coded into the script. It would be pretty trivial to change this so that they could be given dynamically, but I think the caching might cause problems.
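For anyone wanting to try something similar, here is a rough sketch of that kind of conversion. It is written in Python, not the Perl of the script described above, and it is not that script: the function names, the `_cse_example` label and the `scope` option are all made up for illustration. It assumes the feed contains `<item>` elements that each have a `<link>` child, and that the annotations file uses the `<Annotations>` / `<Annotation about="…">` / `<Label name="…">` layout you see when exporting annotations from the CSE control panel. Check both against your own feed and exported files before relying on it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def local_name(tag):
    """Strip any '{namespace}' prefix so RSS 1.0 and 2.0 feeds look alike."""
    return tag.rsplit("}", 1)[-1]


def item_links(feed_xml):
    """Return the <link> text of every <item> in the feed, in feed order."""
    links = []
    for elem in ET.fromstring(feed_xml).iter():
        if local_name(elem.tag) != "item":
            continue
        for child in elem:
            if local_name(child.tag) == "link" and child.text:
                links.append(child.text.strip())
    return links


def url_pattern(url, scope="directory"):
    """Turn a bookmarked URL into a URL pattern for an annotation.

    scope is a hypothetical option with three values:
      "page"      - just the bookmarked page (query string dropped)
      "directory" - the bookmark's directory and everything below it
      "host"      - everything on the same host
    """
    parts = urlsplit(url)
    if scope == "host":
        return parts.netloc + "/*"
    if scope == "directory":
        directory = parts.path.rsplit("/", 1)[0]
        return parts.netloc + directory + "/*"
    if scope == "page":
        return parts.netloc + (parts.path or "/")
    raise ValueError("unknown scope: " + scope)


def annotations_xml(feed_xml, label, scope="directory"):
    """Build an annotations document with one entry per distinct pattern."""
    root = ET.Element("Annotations")
    seen = set()
    for link in item_links(feed_xml):
        pattern = url_pattern(link, scope)
        if pattern in seen:  # several bookmarks can share one directory
            continue
        seen.add(pattern)
        annotation = ET.SubElement(root, "Annotation", about=pattern)
        ET.SubElement(annotation, "Label", name=label)
    return ET.tostring(root, encoding="unicode")
```

With the default `scope`, a bookmark for http://ocw.mit.edu/OcwWeb/Mechanical-Engineering/ becomes the pattern `ocw.mit.edu/OcwWeb/Mechanical-Engineering/*`. Fetching the feed is left out on purpose so that the conversion can be tested on a saved copy, which also helps when you suspect something is caching the live files.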

8 Responses to “del.icio.us driven Google custom search”

  1. Martin Poulter Says:

    Thanks for slogging through all that adversity so we don’t have to, Phil.

  2. Tony Hirst Says:

    One issue with taking the RSS feed from delicious is that if you have a lot of links tagged a particular way, they won’t all appear in the RSS feed… which is why you need a password and the delicious api…

    I’m not sure if the Google CSE can be used to scrape links from a delicious page (eg set to display 100 links?)

    There are also a couple of things to bear in mind when searching over links – do you want to search just the bookmarked pages, or are you happy to search over the whole domain that the bookmarked page sits on? I explored several variations on this theme under the notion of ‘search hubs’ ( http://blogs.open.ac.uk/Maths/ajh59/010686.html ). My demonstrator site for this (in desperate need of a rewrite) is searchfeedr ( http://blogs.open.ac.uk/Maths/ajh59/010000.html )

    If you’re happy with only searching over a couple of dozen links at a time, this delisearch pipe should work: http://blogs.open.ac.uk/Maths/ajh59/009301.html

    I thought I’d done a demo of how to use this pipe in a grazr widget that let you select a delicious user name and tag ( http://blogs.open.ac.uk/Maths/ajh59/010044.html ) but it looks as if I only ever did a cut’n’paste OPML generator ( http://blogs.open.ac.uk/Maths/ajh59/deli2opml2.html – the delicious search pipe code is at the bottom of the generated OPML)

    tony

  3. Phil Barker Says:

    Thanks Tony, that’s very useful.
    On the point of whether the search is limited to the exact page or the whole domain (there’s a third option: everything in directories below where the bookmarked page is), the Perl script has options to modify the links in the RSS feed for each of these (I think; I’ve only tested the third option, which is the one I’m using). So if I bookmark http://ocw.mit.edu/OcwWeb/Mechanical-Engineering/ I search everything in that subdirectory and below it, which is all the OCW Mechanical-Engineering resources. It’s nice of MIT to arrange their courses like that.

  4. Phil Barker Says:

    I had a play with the API as a way round the limit on the number of entries in the RSS feed, but I don’t want to have to ask for a password every time I update the search engine and I don’t want to leave it lying around in a text file somewhere. Besides, I’d be limited to my own tags. The best option seems to be the JSON feed, which will give up to 100 posts.

  5. jermallie Says:

    Why don’t you just use the “extract links” option in the Sites tab of the CSE, with the URL that displays 100 bookmarks, and then add the other URLs that display the next 100, etc.?

    regards,jeroen

  6. Phil Barker Says:

    Hi Jeroen, Short answer: because I hadn’t seen that option. :-} Yes, I think that would work if you weren’t generating the CSE on the fly (which, once I started doing I kind of found to be useful), also if you didn’t mind searching the links on that page that weren’t bookmarks (which I guess would not be a problem if you added del.icio.us to the list of sites not to search?). If I get the chance I’ll check it out–it would be better for people who don’t want to / can’t get involved with any scripting. Thanks.
    edit: here’s one that works like that. [Later: well it did work]

  7. jermallie Says:

    hi Phil

    We did it that way in scoofers, http://www.scoofers.com. We use del.icio.us, StumbleUpon etc. and have now filled it with approx. 5000 “extract links” pages, each with on average 10 links, so approx. 50000 sites have been labelled now, using social bookmark sites. But we are also looking at using the APIs.

    regards, Jeroen

  8. Phil Barker Says:

    The one that I created using the extract links option, mentioned in my previous comment, worked once or twice and then stopped working. Looking at the advanced tab, it’s equivalent to using the CSE REST interface to make an annotation file, which I had problems with originally. When I look at the output of that interface, where I would expect a list of URLs based on my del.icio.us bookmarks, what I actually get is a set of URLs for Yahoo’s terms and conditions (Yahoo own del.icio.us). I have managed to fiddle with the options for what del.icio.us page I extract URLs from and get it working briefly, but pretty soon the Yahoo Ts&Cs come back. It’s almost like Yahoo doesn’t want to play nicely with Google. 🙂

