In creating a multimedia travelogue of my recent trip to Europe, I checked in to the most interesting places I visited using the location-based service Foursquare [1]. Launched in March 2009 and serving more than 50 million users, Foursquare is a social networking site that lets you bookmark (i.e., “check in” at) venues based on your geographic location.
As with all such services, the more information about you available to the system, the better the interactions. However, much as I love it, Foursquare suffers from data sparsity and the closed world problem for me—I’m an irregular user, so the service really knows very little about me. For example, in Cambridge I checked in to the River Cam, a river I rowed on frequently once upon a time, and Foursquare excitedly exclaimed, “Your first river!”
Not so, Foursquare. Not my first river. I concede it is perhaps the first river we have shared together.
While this error is charming and amusing, information-poor user models can be dangerous. More trivially, they are a waste of our attentional resources, distracting us with irrelevant content. In the world of product recommendation, this manifests most irritatingly in recommendations for things we already own or would never purchase. Enough experiences like these with a service and one is likely to feel bemused at best, frustrated at worst. Users have a low threshold for how many poor experiences they are willing to endure before a service loses its allure. Such reduced engagement negatively impacts service viability from a business perspective. Thus, users and services have the same goal: to improve inference and recommendation quality. And that requires data: not just “what I did” data (behavioral and transaction logs) but also “why I did it” (motivation) data and “other things I’d like to do or explore” (aspiration) data.
Internet services that offer recommendations usually rely on “big data” and on machine learning algorithms that crunch the data to find patterns and make inferences and predictions. These are in some sense “user models.” But many such “user models” behind information targeting are not really focused on us as individual users, as people or persons. This is their power (they produce generic user models and scale well) and also their weakness (none of us is entirely generic). These techniques fail in the face of data sparsity: without enough data to crunch on, there are no conclusions (or poor ones), no recommendations (or bad ones). This leads service providers to break down data silos by buying and selling data, entering into data-sharing agreements, and tracking users beyond their virtual walls using cookies and device identifiers [2, 3]. Even then, the algorithms are not able to get at the “why I did it” part, and they seldom ask the more important questions about the recommendations that result, questions like: Did we get it wrong? And if so, how wrong? They are neither conversational nor collaborative. They aren’t “listening” carefully.
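To make the sparsity problem concrete, here is a minimal sketch in Python of a co-occurrence recommender. It is an illustration only, not how Foursquare or any real service works; the function names, the `min_support` threshold, and the sample check-in histories are all assumptions invented for this example.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_counts(histories):
    """Count how often each ordered pair of venues shares one user's history."""
    counts = Counter()
    for venues in histories.values():
        for a, b in combinations(sorted(set(venues)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(user, histories, min_support=2):
    """Suggest unseen venues that co-occur with the user's venues
    in at least `min_support` histories (a hypothetical threshold)."""
    counts = co_occurrence_counts(histories)
    seen = set(histories.get(user, []))
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen and n >= min_support:
            scores[b] += n
    return [venue for venue, _ in scores.most_common()]


# Hypothetical check-in histories: three regular users, one irregular one.
histories = {
    "regular_1": ["river", "museum", "cafe"],
    "regular_2": ["river", "museum", "pub"],
    "regular_3": ["river", "museum", "cafe"],
    "irregular": ["bridge"],
}

print(recommend("regular_2", histories))  # prints ['cafe']
print(recommend("irregular", histories))  # prints []
```

The regular user gets a suggestion because her history overlaps with others; the irregular user, with a single check-in that nobody shares, gets nothing at all. Nothing in the code can ask her why she was there.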
While such practices could be construed as benign when it comes to generating targeted recommendations for products like soap powder, in more politically sensitive or safety-critical situations, our hackles are rightly raised. It isn’t just about what may be found out; it is also that errors based on partial, incomplete, or erroneous data can lead to unwarranted conclusions with potentially serious negative consequences [4].
Yet we, as consumers who could provide the necessary information for higher quality inferences and predictions, are neither inclined nor invited to work on creating truly personal data models. We don’t get to curate the data about us, to manage our data. There are limited efforts in this regard in narrow domains (movie ratings being the most well known). And there are user profiles that we are often asked to fill out. However, personally, I feel disinclined to fill out profiles and comply with providing more data about myself. Why? Because when I have done so, I have not noticed a significant improvement in service provision, certainly not enough to have made my efforts worthwhile. I don’t know how that information is used. I also don’t know what is potentially at stake from handing over more information about myself—companies have not, to date, established a great track record for scrupulous behavior when it comes to personal data.
I am not alone in my reticence. Many people go further than I do: They actively resist data capture. They engage in data play to intentionally withhold personal information. They leverage the seams within and between services. They create multiple accounts and engage in data scrubbing. They disallow browser cookies [2] and flush browser caches. They avoid importing data from one social network to another. They are concerned about location tracking.
Consensus is building that transparency around personal-data collection and how the data is used would help build trust between corporations and users. Consumer-service corporations have a great opportunity to engage their users more effectively as protected and engaged personal-data curators. This will require legal and infrastructure changes, but it will undoubtedly also be mutually beneficial. Changes in the ecosystem of personal data capture, data management, and data use would lead to more willing participation in the emerging economies around personal data.
There’s plenty of research and practice to draw on, should the willingness be there to more effectively investigate this space. Designers, human-computer-interaction specialists, recommender-systems developers and researchers, and experts in user modeling all have deep expertise. There’s a gathering focus on “scrutable” user models—models that are transparent and inspectable, intelligible, customer-centered, use-oriented, and built for long-term curation [5]. Issues like privacy, visibility, error correction, “wasted data,” and shared control of model building are key. Researchers Judith Kay and Bob Kummerfeld, who are focused on the design of student models in tutoring systems, say that “scrutable user models are designed and implemented so that the user can study, or scrutinize, the way she works, to determine what information the user model holds, the processes used to capture it, and the ways that it is used.” Clearly, to import these ideas from tutoring systems to the world of Internet services is ambitious, but location-based services like Foursquare and fitness applications that allow us to export and explore our activity data are showing us one way forward.
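As a rough illustration of what scrutability might look like in practice, here is a minimal sketch in Python. It is not Kay and Kummerfeld’s design; the class names, fields, and methods are hypothetical, chosen only to mirror the three things their definition says a user should be able to see (what the model holds, how it was captured, and how it is used) plus a way to correct or remove it.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    claim: str             # what the model holds about the user
    source: str            # the process used to capture it
    used_for: list         # the ways it is used
    notes: list = field(default_factory=list)  # the user's own annotations
    refuted: bool = False  # set when the user says "not so"


class ScrutableUserModel:
    """A toy user model that the user can inspect, annotate, refute, and prune."""

    def __init__(self):
        self._beliefs = {}

    def record(self, key, claim, source, used_for):
        self._beliefs[key] = Belief(claim, source, list(used_for))

    def inspect(self):
        """Return a copy of everything held, so the user can scrutinize it."""
        return dict(self._beliefs)

    def annotate(self, key, note):
        self._beliefs[key].notes.append(note)

    def refute(self, key, note):
        belief = self._beliefs[key]
        belief.refuted = True
        belief.notes.append(note)

    def delete(self, key):
        del self._beliefs[key]

    def active_claims(self):
        """Claims the service may still act on (refuted ones are excluded)."""
        return [b.claim for b in self._beliefs.values() if not b.refuted]


model = ScrutableUserModel()
model.record("first_river", "The River Cam is this user's first river",
             source="check-in history", used_for=["badges"])
model.refute("first_river", "I rowed here often before I ever checked in")
print(model.active_claims())  # prints []
```

The refuted belief stays in the model along with the user’s note, so the correction itself becomes data; deleting it outright remains a separate choice left to the user.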
So, how about that for a future? Co-ownership and the possibility to curate our own data…. The ability to delete and to aggregate. The ability to refute, augment, expand, annotate. The ability to narcissistically, as well as defensively, manage our own digital selves. A scrupulous (ethical, principled, meticulous), scrutable (let me check it), and sumptuous world of personal data? I’ll sign up for that.
1. Foursquare launched a lightweight check-in app called Swarm earlier this year, which tracks your location automatically: http://blog.foursquare.com/post/85826325458/swarm-is-ready-for-you-download-it-now. At the time of writing, I am still using the old Foursquare app; I have not upgraded to Swarm.
2. This is the rationale behind Facebook’s recent announcement in June 2014 that it was going to start tracking users’ behavior off-site. See, for example, http://gigaom.com/2014/06/12/facebook-will-track-your-web-history-for-ads-but-now-you-can-complain-about-them/
3. “Cookies” (small pieces of data that websites store in our browsers) let sites and advertisers track our progress over the Internet, and our personal devices have unique numbers that allow them to be tracked (and us with them) wherever we go.
4. See, for example, Zeynep Tufekci’s writings, e.g., Tufekci, Z. (2014) Big data, surveillance and computational politics. First Monday, Vol. 19, No. 7, 7 July 2014, http://firstmonday.org.
5. See Kay, J. and Kummerfeld, B. Creating personalized systems that people can scrutinize and control: Drivers, principles and experience. ACM Trans. on Interactive Intelligent Systems 2, 4 (Dec. 2012).