Ahoy mateys, so that “moose fever” turned into pneumonia for me! On top of which my entire family got sick too. But we’re finally over that now, so it’s time to break the silence and set sail on the seas of end-user mashups.
As much as I felt some small discouragement with the NV mashups workshop (certain technologies blew up during the session, and we didn’t stick with a more hands-on format), I have not given up on the dream of exploring mashups for non-programmers, and I have continued on, scratching a few of my personal itches.
Job Postings Want to be Free!
The first one, which I mentioned in that session, is around HR departments that don’t provide RSS feeds for their job postings. I’m sure some smart HR professional out there will clue me in to how this is intentional and keeps the riff-raff out, but from where I’m sitting I would love to be able to help people who are already interested in my organization to easily monitor new openings, especially in this tight job market with more jobs than potential employees.
But if you are in higher ed in B.C., my particular niche, you are out of luck unless you want a job at UBC, the only institution that so far seems to have grokked this. I’ve found ways, over the years, to get this information pushed to my email (the WatchThisPage service has been particularly useful in this regard), but I mean, email, ick, that’s so 1994 (whoops, a real pirate would never have said ‘ick’). So my goal was to see if I could build an aggregated page of BC post-secondary job postings, one with an RSS feed too.
There are basically four steps in the process:
1. Identify the Pages
This one’s easy – go to the various institution sites in the province, and locate their job postings pages. In this experiment I used the job postings pages from schools local to me, Royal Roads University, Camosun College, University of Victoria (and UBC’s because they were already RSS).
2. Scrape the pages
So the problem with all these pages – no RSS feeds. That’s where a service like Dapper comes in. Dapper offers a fairly simple way, through the use of a ‘virtual browser,’ to look at a web page and tell it which elements on the page you would like to scrape out as data. It then allows you to access this scraped information as XML, HTML, RSS, CSV, JSON, as a Netvibes module or Google gadget, and more. An example is this dapp that scrapes the UVic jobs postings page.
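If you’re curious what a service like Dapper is doing under the hood, here’s a minimal sketch in Python of the same idea: pull structured (title, link) pairs out of a page’s markup. The sample page and its `class="posting"` structure are entirely invented for illustration; real institutional pages are much messier, which is exactly why Dapper’s visual approach is handy.

```python
from html.parser import HTMLParser

# A toy jobs page. The markup here (list items with class "posting")
# is invented for this sketch; real pages vary wildly.
SAMPLE_PAGE = """
<ul>
  <li class="posting"><a href="/jobs/101">Instructor, Biology</a></li>
  <li class="posting"><a href="/jobs/102">Library Technician</a></li>
</ul>
"""

class JobScraper(HTMLParser):
    """Collect (title, link) pairs from <li class="posting"><a href=...> items."""
    def __init__(self):
        super().__init__()
        self.in_posting = False
        self.current_href = None
        self.jobs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "posting":
            self.in_posting = True
        elif tag == "a" and self.in_posting:
            self.current_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_posting = False

    def handle_data(self, data):
        # Record the link text as the job title.
        if self.in_posting and self.current_href and data.strip():
            self.jobs.append((data.strip(), self.current_href))
            self.current_href = None

scraper = JobScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.jobs)
# → [('Instructor, Biology', '/jobs/101'), ('Library Technician', '/jobs/102')]
```

The point of a tool like Dapper is that you get this result by clicking on page elements instead of writing parser code, and it republishes the result as RSS for you.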
3. Clean up the feeds
Now you’ll notice in the UVic job posting example, there is all sorts of cruft in the feed. Dapper is ultimately only as good as the page it is scraping. It does its best to identify logical groupings based on the page markup, and the more logically the XHTML has been used the better it does, but HTML itself isn’t a logical markup language. Dapper does offer you the ability to tweak the scraper with constraints, but this is one aspect of Dapper I don’t find overly intuitive.
So instead of trying to clean the feeds up in Dapper, I take the crufty feeds from Dapper into Yahoo Pipes, which offers a much easier way to clean up the feeds. In the case of the UVic feed, that means creating a filter to permit only those items containing the text “Comp,” which turns out to be the common element in all of their postings. Here’s the pipe, which if you clone it you can see the various feeds being cleaned up.
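The Pipes filter step amounts to something very simple, which a short sketch can make concrete. The item titles below are made up, but the “Comp” marker mirrors the UVic example:

```python
# A stand-in for the Pipes "Filter" module: permit only feed items
# that contain a marker string. Item contents are invented examples.
raw_items = [
    {"title": "Comp #07-123: Sessional Instructor", "link": "/jobs/123"},
    {"title": "About the Careers Office", "link": "/about"},  # cruft
    {"title": "Comp #07-124: Co-op Coordinator", "link": "/jobs/124"},
]

def keep_postings(items, marker="Comp"):
    """Keep only items whose title contains the marker text."""
    return [item for item in items if marker in item["title"]]

clean = keep_postings(raw_items)
print([item["title"] for item in clean])
# → ['Comp #07-123: Sessional Instructor', 'Comp #07-124: Co-op Coordinator']
```

Each institution’s feed just needs its own marker (or rule), which is exactly what the separate filter modules in the pipe provide.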
4. Aggregate all of the new feeds
This turns out to be simple once the other steps have been taken. There are lots of feed aggregation services out there, but since we already brought all of the feeds into Pipes to clean them up, it’s easy to just use the ‘union’ function there to join them into one master feed of job postings.
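The ‘union’ step is just concatenation, though it’s worth tagging each item with its source institution so the combined feed stays legible. A toy sketch (the feed contents are invented):

```python
# Approximates the Pipes "Union" module: merge several cleaned feeds
# into one master list, stamping each item with its source.
uvic    = [{"title": "Comp #07-123: Sessional Instructor"}]
camosun = [{"title": "Faculty, Nursing"}]
rru     = [{"title": "Program Assistant"}]

def union(feeds):
    """Merge named feeds into one list, adding a 'source' field to each item."""
    master = []
    for name, items in feeds.items():
        for item in items:
            master.append({**item, "source": name})
    return master

master_feed = union({"UVic": uvic, "Camosun": camosun, "RRU": rru})
print(len(master_feed))  # → 3, one item per feed in this toy example
```

In Pipes the same thing happens visually: each cleaned feed is wired into the Union module, and the output is republished as a single RSS feed.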
About the result, and why am I talking like a pirate?
So obviously the resulting feed above only contains 4 of the 26 institutions in BC. It’s really just rinse and repeat to get the rest, plus some formatting cleanup, which I purposely didn’t do (at least not publicly, hehe).
My intent in documenting this exercise (and the next one) was not to provide a production-ready feed of all BC post-secondary job postings, handy as that might be. It was instead to
- illustrate how YOU can use tools like Dapper and Yahoo Pipes to create feeds and aggregations for data on almost any webpage (made seemingly even easier now with Dapper’s release of the DapperFox plugin)
- spur information providers on to doing it right the first time – there is NO reason (as we will see in the next example too) to ever provide another list, another calendar, another set of links, etc., in a way that by default traps the content in a single presentation, only ever editable by a single author. NO REASON, and lots of GOOD reasons not to. The separation of content and presentation should have already become one of the default criteria you use to select any technology. If the tools you are using don’t support RSS or some other means to do this, use one of the HUNDREDS of FREE ones that do. And at the very least, please adopt tools that produce proper XHTML – accessibility means providing access, and if you won’t do it to cater to web wonks like me, do it at least to serve people who have no other choice but to consume your page through a text reader or other assistive device. If you don’t, someday someone may make you.
And the pirate metaphors? Well, certain, shall we say, ‘issues’ around intellectual property were pointed out to me during the NV mashups workshop, and I guess this is kind of my reply – if you aren’t going to provide the data for users in a way that enables them to use it how THEY want to, don’t be surprised when they go and do it themselves, arrgghhh.
So until our next swashbuckling adventure, I remain yours truly, Cabin Boy Nessman of the good ship Syndication.