OLNet Fellowship Week 2 – Initial Thoughts on Tracking Downloaded OERs

As I mentioned when I first posted that I was coming to the UK for this fellowship, my main focus is how to generate some data on OER usage after content has been downloaded from a repository. In looking at the issue, it became clear that the primary mechanism for doing so is actually the same one used to track content use on sites themselves: a “web bug,” in the same sort of way that many web analytics apps work. But instead of the tracking code being inserted into the repository software/site itself, it needs to be inserted into each piece of content. The trick then becomes:

  • how do we get authors to insert these as part of their regular workflow?
  • how do we make sure they are all unique, and at what level do they need to be unique?
  • how do we easily give the tracking data back to the authors?
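To make the “web bug” mechanism concrete: it is just a tiny image whose every request gets logged by the server that hosts it. Here is a minimal Python sketch of such an endpoint – not the Piwik implementation, and the port and log format are arbitrary choices of mine – that shows the core principle:

```python
# Minimal sketch of a "web bug": a 1x1 transparent GIF endpoint that logs
# the referring page for every request. Piwik does far more, but this is
# the underlying mechanism. Port and log format are arbitrary assumptions.
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# The smallest transparent GIF, base64-encoded
PIXEL = base64.b64decode(
    b"R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)

class PixelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The Referer header reveals which page (and which server) embedded
        # the bug -- exactly the "where does this content live now?" data
        referrer = self.headers.get("Referer", "unknown")
        print(f"hit from {self.client_address[0]} via {referrer}")
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.send_header("Content-Length", str(len(PIXEL)))
        self.end_headers()
        self.wfile.write(PIXEL)

# To serve: HTTPServer(("", 8000), PixelHandler).serve_forever()
```

Because content embeds this with an ordinary image tag, nothing on the reusing site strictly needs JavaScript at all.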

My goal was to do all this without really altering the current workflow in SOL*R or requiring any additional user accounts.

The solution I’ve struck upon (in conversation with folks here at the OU) is to use Piwik, an open source analytics package with an extensive API, to do the majority of the work, and then to work out how to insert this into the existing SOL*R workflow. So the scenario looks like this:

1a. Content owners are encouraged (as we do now) to use the BC Commons license generator to insert a license tag into their content. As part of the revised license generator, we insert an additional question – “Do you wish to enable tracking for this resource?”

1b. If they answer yes, the license code is amended with a small HTML comment –

<!--insert tracking code here-->

1c. The content owner then pastes the license code and tracking placeholder into their content as they normally would. We let them know that the more places they put it in their content, the more detailed the tracking data will be. We can also note that this is *only* for web-based (e.g. HTML) content.

2. The content owner then uploads the finished product as they normally would.

3a. Each night a script (that I am writing now) runs on the server. It goes through the filesystem, and every time it finds the tracking placeholder:

  • based on the file’s location in the filesystem, it deconstructs the UUID assigned to it in SOL*R
  • uses the UUID to get the resource name from SOL*R through the Equella web services
  • re-constructs the resource’s home URL from its UUID
  • sends both of these to the Piwik web service, which in return creates a new tracking site as well as the javascript to insert in the resource
  • finally, writes this javascript where the tracking placeholder was.
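In outline, that nightly script could look something like the Python sketch below. The filesystem layout, placeholder text, Piwik host, token, and SOL*R URL pattern are all assumptions for illustration; `SitesManager.addSite` is Piwik’s documented API method for registering a new tracked site, and the Equella name lookup is stubbed out:

```python
# Hedged sketch of the nightly placeholder-replacement script. Paths, the
# placeholder marker, and the Piwik install below are illustrative only.
import json
import os
import urllib.parse
import urllib.request

PIWIK_URL = "https://piwik.example.org/"      # assumed Piwik install
TOKEN_AUTH = "changeme"                       # a real Piwik API token
PLACEHOLDER = "<!--insert tracking code here-->"
CONTENT_ROOT = "/var/solr/content"            # assumed content filesystem

def uuid_from_path(path):
    # Assumption: each resource lives in a directory named after its UUID,
    # e.g. /var/solr/content/<uuid>/index.html
    return os.path.relpath(path, CONTENT_ROOT).split(os.sep)[0]

def piwik_api(method, **params):
    """Call Piwik's HTTP API (module=API&method=...) and decode the JSON."""
    query = urllib.parse.urlencode(dict(
        module="API", method=method, format="json",
        token_auth=TOKEN_AUTH, **params))
    with urllib.request.urlopen(PIWIK_URL + "?" + query) as resp:
        return json.load(resp)

def tracking_snippet(site_id):
    # The standard Piwik JavaScript tag, parameterized by the new site id
    return (
        f'<script type="text/javascript" src="{PIWIK_URL}piwik.js"></script>\n'
        f'<script type="text/javascript">\n'
        f'var piwikTracker = Piwik.getTracker("{PIWIK_URL}piwik.php", {site_id});\n'
        f'piwikTracker.trackPageView();\n'
        f'</script>'
    )

def process_file(path):
    with open(path, encoding="utf-8") as f:
        html = f.read()
    if PLACEHOLDER not in html:
        return
    uuid = uuid_from_path(path)
    # In the real script the resource name would come from the Equella web
    # services; here the UUID stands in for it
    site = piwik_api("SitesManager.addSite",
                     siteName=f"SOL*R resource {uuid}",
                     urls=f"https://solr.example.ca/access/{uuid}")  # assumed URL pattern
    with open(path, "w", encoding="utf-8") as f:
        f.write(html.replace(PLACEHOLDER, tracking_snippet(site["value"])))

def run_nightly():
    # Walk the content tree and process every HTML file found
    for dirpath, _, filenames in os.walk(CONTENT_ROOT):
        for name in filenames:
            if name.endswith((".html", ".htm")):
                process_file(os.path.join(dirpath, name))
```

The nice property of deriving the Piwik site from the SOL*R UUID is that the mapping is deterministic: re-running the script never depends on remembering state between nights.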

4a. Finally, in modifying the SOL*R records, we also include a link to the new tracking results for each record that has it enabled.

4b. For tracking data, the main things we will get are:

  • which new servers this content lives on
  • how many times each page of content in the resource has been viewed (depending on how extensively they have pasted the tracking code), both total and unique views
  • other details about the end users of the content, for instance their location and other client details

I ran a test last week. This resource has a tracking code in it. The “stock” reports for this resource are at http://u.nu/3q66d It should be noted that we are fully able to customize a dashboard that shows only *useful* reports (without all the cruft), as well as potentially incorporate the data from inside Equella on resource views / license acceptances. One of the HUGE benefits of using the SOL*R UUID in the tracking is that it is consistent both inside and outside of SOL*R.
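As a sketch of how such a trimmed-down dashboard could pull its numbers, Piwik exposes its reports over plain HTTP (the same module=API mechanism as above); `VisitsSummary` is one of its documented report modules. The host, site id, and token below are assumptions:

```python
# Sketch of fetching just the useful numbers (total and unique views) from
# Piwik's Reporting API for one tracked resource. Host/token are assumptions.
import json
import urllib.parse
import urllib.request

def report_url(piwik_url, site_id, token_auth, period="month", date="today"):
    # Build the Reporting API request for the VisitsSummary module
    return piwik_url + "?" + urllib.parse.urlencode({
        "module": "API",
        "method": "VisitsSummary.get",
        "idSite": site_id,
        "period": period,
        "date": date,
        "format": "json",
        "token_auth": token_auth,
    })

def visits_summary(piwik_url, site_id, token_auth):
    """Return (total visits, unique visitors) for the given tracked site."""
    with urllib.request.urlopen(report_url(piwik_url, site_id, token_auth)) as resp:
        report = json.load(resp)
    # nb_visits / nb_uniq_visitors are standard VisitsSummary fields
    return report.get("nb_visits", 0), report.get("nb_uniq_visitors", 0)
```

Because this is just JSON over HTTP, the same numbers could be merged with the view/license-acceptance counts Equella keeps internally, keyed on the shared UUID.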

I am pretty happy with how this is working so far; while I have expressed numerous times that I think the repository model is flawed for a host of reasons, to the extent to which it can be improved, this starts to provide content owners (and funders) details on how often resources are being used after they are downloaded, and (much like links and trackbacks in blogs) offer content owners a way to follow up with re-users, to start conversations that are currently absent.

But… I can hear the objections already. Some are easy to deal with: we plan to implement this in such a way that it will not be totally dependent on javascript. Others are much more sticky – does this infringe on the idea of “openness”? What level of disclosure is required? (This last especially given that potentially 2nd and 3rd generation re-users will be sending data back to the original server if the license remains intact.)

I do want to respect these concerns, but at the same time, I wonder how valid they are. You are reading this content right now, and it has a number of “web bugs” inserted in it to track usage, yet it is shared under a license that permits reuse. Even if it is seen as a “cost,” it seems like a small one to pay, with a large potential benefit in terms of reinforcing the motivations of people who have shared. But what do you think – setting aside for a second arguments about “what is OER?” and “the content’s not important,” does this seem like a problem to you? Would you be less likely to use content like this if you knew it sent usage data back? Would anonymizing the data (something Piwik can easily do) ease your mind about this?

16 thoughts on “OLNet Fellowship Week 2 – Initial Thoughts on Tracking Downloaded OERs”

  1. Thanks for sharing this! I concur with you that this “cost” of tracking is “a small one to pay with a large potential benefit.” If it can be demonstrated that OER is used the potential for it to grow dramatically increases.

  2. Tony, that’s in essence what the tracking code does; indeed, the workaround we have planned does just that to avoid the javascript dependence. Either way, though, this model does involve “injecting” something into the existing content; obviously not without permission, and only to replace content the user has already pasted in themselves.
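    For what it’s worth, that workaround can be sketched simply: Piwik’s piwik.php endpoint accepts plain image requests (its documented idsite and rec parameters), so a noscript image tag keeps counting views even when scripts never run. The host below is an assumption:

```python
# Sketch of the no-JavaScript fallback: Piwik's piwik.php endpoint can be
# requested as a plain tracking image, so a <noscript> tag still records a
# view when scripts are stripped or disabled. The host is an assumption.
import urllib.parse

def image_tracker_tag(piwik_url, site_id):
    # idsite identifies the tracked site; rec=1 tells Piwik to record the hit
    query = urllib.parse.urlencode({"idsite": site_id, "rec": 1})
    return (f'<noscript><img src="{piwik_url}piwik.php?{query}" '
            f'style="border:0" alt="" /></noscript>')
```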

  3. This doesn’t seem much different to Connexions (cnx.org) using Google analytics when you upload content. You have the same choice and it would seem that Google have already done the work needed to track. Am I missing something?

  4. Hi Allyn, it’s mostly the same, with the exceptions that
    – as far as I know (and I am no Connexions expert), it allows you to insert tracking code on the web pages that are displayed on the site; does it also embed it in the downloadable content? This is aimed more at explicitly tracking the content once it’s downloaded out of the repository
    – it doesn’t require the end user to mess around with the tracking code or get a Google account
    – it does not use Google, so the data is on our servers, available for us to display the way we need (lots of the reporting you get back from analytics packages is bumpf; we can cull that out and combine it with local data we gather through our app)
    – the data will be both public (something I’m not aware you can do with Google) and aggregatable across the whole collection, again, not something I think you can do with Google, though I do notice they also seem to be inserting a Hewlett tracking code across the entire site, so they do likely get some aggregate data

    So a bit different I think though not radically. Thanks for bringing Connexions’ model to light, I hadn’t known about it; they are, as usual, ahead of the field.

  5. Thanks for the clarification. Sorry I missed that distinction. What I did think about after I completed the previous comment was that I would be interested to know the approach you plan to take with regard to derivative works. Counting the usage of an item is valid whether it is reuse of the original or a derivative, but somewhere down the line I guess we would all like to get some sense of whether collections and individual modules (in CNX speak) are adopted ‘as is’ or adapted for use. I am sure we all have intuitive takes on that.

    As for anyone getting irked about the usage being tracked, well, that would be completely out of keeping with the spirit of OER to my way of thinking. The more data we get about reuse the more case there is to support it and the greater the benefit to everyone.

    Best of luck with the work.

  6. Hey Allyn, I don’t have a perfect response to how to distinguish between “as-is” and derivative-work reuse, but here’s my take: I am considering (though not 100% certain) actually embedding the tracking code (or image, for Tony’s sake) within the existing CC/BC Commons license declaration code, in the hopes that it persists through reuses and people are less likely to simply rip it out. So hopefully we’ll get data from both these types of reuse. How to differentiate this data, though? I don’t know that there is a programmatic way; the one good thing about this approach is that (given the license/tracking code is left intact) it does send back the new URL from which the resource is being launched. This means at least the content owner has the chance to click through and see the new context in which it is being reused – not quantitative data, per se, but still valuable, I think, for seeing improvements/derivations on one’s work. Very open to ideas and suggestions on how one could accomplish this in a non-invasive/intrusive way (which I think is key here). Cheers, Scott

  7. Hi,

    Connexions here. We have three forms of tracking on Connexions through Google Analytics (GA) and one locally hosted Web Analytics tracking which is available at http://cnx.org/stats for the overall site, and also available for content by choosing the “statistics” view for any search, or by looking at any author’s profile.

    We chose Google Analytics because of robust support, comparability with other sites using analytics, and to avoid a lot of “analytics” analysis which is not our core focus.

    The three GA codes we track are 1. A Hewlett sponsored GA code they use to compare all the OER sites they have helped launch, 2. A Connexions wide code that tracks views and artifact downloads (but doesn’t travel with those downloads) and 3. Authors can insert their own GA code so they get reporting on their modules.

    Kathi

  8. Scott,
    This is a great idea. I submit my .mp4s both to sol*r and DSpace at UCalgary.

    I know that there are lots of DSpace downloads but I am curious if they take the source code and modify it to brand it for their site or just use it as is.
    If I read your post right, I could see if any derivative works were being made… That would be very cool.

  9. Richard, that’s the idea, but the caveat here is we’re talking about web content, typically HTML pages. Some day, we can start to crack the nut of other types of content, but unless it is “wrapped” in an HTML page, we won’t get numbers back on .mp4 files specifically.

  10. Pingback: Tracking OER «
  11. I have been giving some further thought to this, but without knowing what the actual objectives are, I can only offer some high-level suggestions. (Your earlier post about the fellowship doesn’t state particular objectives other than identifying reuse.) Not a problem; as anyone who has given any thought to this will know, it is not an easy problem, and doing something that provides a leg up for future work is valuable indeed.

    May I suggest (probably too late of course, but…) some perspectives that may help, probably more in thinking about how future effort might build on this work but some may be useful now. My understanding is that you are making an early approach to answering the question, “Have OERs been reused on the web in other contexts?” and not the follow-on questions related to derivatives, methods, observance of attribution etc etc. On that basis the approach you are taking is consistent with other similar work and seems just fine. Until it’s done and some data are available it is only guesswork as to what method is most effective but I suspect that the different methods people are trying will yield slightly different insights, which in itself is good.

    Thought 1: Reuse is closely aligned with the same problems associated with federation and original discovery. Reasonable?

    Thought 2: It is in the interests of all OER initiatives to establish the case that OERs bring benefit, and being able to point to good data on reuse is exactly what is needed. Given that, it should be possible to coordinate common effort on minimal changes to all content to support that. Difficulties will arise due to “non-friendly” content formats, even for textual materials, and also in communication between content stores; however, these factors will probably be part of the long term solution, assuming that funding allows anyone to take a long term view.

    Thought 3: The only way to _really_ track reuse, especially in relation to derivative works, is to include a few bits of metadata to achieve that outcome.

    These three thoughts considered together may contribute to a reliable solution for harvesting the data required. Some initiatives can actually do a reasonable job of working with existing metadata within their own space to answer some of those questions because their content formats allow it. Others do not and will not be able to unless they revise their Content Strategy (assuming they have been allowed to have one). Good data on reuse (adoption and/or adaptation) will not be available until repositories in OER initiatives and institutional OER instances actually talk to each other.

    Not sure if any of this is of use to you and it may be history rather than news but it might help someone somewhere…

    Thanks

  12. I’m a little late to this post, but I wanted to give my thoughts. I think most people wouldn’t mind having some tracking data on their OER usage, especially if they understand how often they are tracked by for-profit web sites. Anonymizing helps, but it really comes down to confidence in the institution (really, if Facebook says they anonymize something, how much does that mean to you?). That’s why I would feel confident sharing my usage data with your institution. I trust respectable, non-profit higher education institutions. I trust Canadians to be sensitive to personal rights. Lastly, I’m a little biased, but I trust you.

    A possible solution might be to have the standard legalese jargon “Privacy Policy” page followed by a succinct, human-readable version (similar to Creative Commons). I think the gesture towards transparency would be welcome.

    The fact that you are concerned about student privacy tells me that you are on the right track.

Comments are closed.