Whose Data Is It Anyway? | Susan Hall · IP/ICT Lawyer


Sadness at the Beach

There are also unknown unknowns – the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones - Donald Rumsfeld

There is a long-running debate raging across social media about the legal and ethical issues surrounding user-generated content (UGC). Without UGC, social media would grind to a halt, but who owns it? Who has the right to mine it for data? What safeguards need to be imposed to protect people who may be unwittingly committing more of themselves than is safe, wise or intended to a publicly viewable forum?

At bottom, it comes down to a tension between two distinct worldviews: one motivated by profit ("profit" including not just financial gain but enhanced social and professional capital) and the other by community, whereby people participate in social media to find others like them, share experiences and contribute to common, if ill-defined, social enterprises.

The clash between these worldviews is a running battle: it dies down in one area, only to flare up in another. From the Samaritans Radar debacle to Steven A. Walker's shameless plundering of online fan reviews of Torchwood episodes to include in his books or the notorious 2009 Ogi Ogas and Sai Gaddam survey of female attitudes to slash fanfiction, there are numerous episodes of unwarned newcomers entering an online community with an eye to profit, only to find the community banding together and using every means at their disposal to repel the unwelcome invader.

When Anika Mandla, Jo Billings & Joanna Moncrieff of UCL described the topic of their 2017 paper as a thematic analysis of Internet “blogs” by self-identified bipolar sufferers,[1] the use of inverted commas around the word "blogs" immediately raised the suspicion that social media is a foreign country to them. This suspicion is reinforced on reading the paper as a whole, especially the moment where a blogger's reference to a supportive physician as "Dr Awesomesauce" leads the researchers to conclude that the blogger in question:

perceived doctors as being principally a source of medication, maybe implying an overreadiness to prescribe

and that the use of this and similar nicknames for doctors[2]

may also indicate that bloggers valued the direct, mind-altering effects of medication in the same way that people seek out effects of drugs like Valium or amphetamines.

"Awesomesauce" entered the Oxford Online Dictionaries in 2015, as a synonym for "excellent", having been used online in that sense for at least a decade. The authors' evident lack of familiarity with online slang, and their willingness to supply their own interpretation without troubling to look the word up, sound a further warning bell.

In fact, one theme of the paper seems to be its authors' conviction that Bipolar Disorder may be over-described and over-prescribed, which potentially brings them into conflict with many of the bloggers whose work they mine for applicable quotations. Indeed, a curious mix of scepticism and credulity about the entire blogging process emerges from the paper, as can be seen from the two paragraphs extracted below:

Internet data have the advantage that people are free to express their concerns in their own language, without the constraints of a formal research context. On the other hand, it is not possible to verify what people say about who they are and the experiences they have had. There is evidence that people sometimes masquerade as the patients on the Internet (Kleeman, 2011). Financial inducements may be implicated. We cannot confirm the veracity of the blogs in our sample, but most of them were derived from large blogging sites with an established reputation. We do not know whether any of the bloggers in the included sample received payment for their blogs, but some of the blogging sites on which the blogs were posted featured advertising, although none included advertisements for medicines.

Our search results illustrate the vast amount of Internet activity that relates to BD and it would be impossible to represent all the views being expressed in a single study. Since we selected the most accessible blogs, our sample is likely to overrepresent sites associated with large organizations, capable of ensuring high visibility. There are doubtless many other representations of BD available on the Internet in different locations. Nevertheless, since the included blogs were easily accessed, whether or not they were genuine or representative, their content is important because it is likely to exert a disproportional influence over public views.

The highlighted portions (all emphasis mine) reflect areas where the authors appear unaware that the points they make have been the subject of research, debate and, in some cases, legislative intervention going back many years.

The claim "It is not possible to verify what people say about who they are and the experiences they have had" can be traced back at least to Peter Steiner's 1993 New Yorker cartoon, captioned, "On the Internet, nobody knows you're a dog." However, in the context of scientific research it is certainly possible to verify one's sources to some degree, beginning by contacting the blogger in question. The choice neither to seek consent nor to supplement the online material with off-line interviews or questionnaires is a puzzling one.

"Financial inducements may be implicated." It does not seem a particularly difficult task to work out whether a blog is overtly intended to make money, whether to cover the costs of running it or to support the blogger, because bloggers wanting to monetise their blogging wish to make it as easy as possible for their readers to pay them, and include appropriate links to facilitate this. The researchers do not mention whether the sites they visited have Patreons or other crowdfunding links, although they do mention advertising. Suggesting that a specific blogger is passing off advertorial as editorial in return for covert payments is a rather different matter, since that is specifically an automatically unfair trading practice under the Consumer Protection from Unfair Trading Regulations 2008 ("the CPRs").

Finally, the admitted lack of care in selecting the blogs reviewed ("Since we selected the most accessible blogs, our sample is likely to overrepresent sites associated with large organizations, capable of ensuring high visibility" and "since the included blogs were easily accessed, whether or not they were genuine or representative, their content is important") casts considerable doubt on any conclusions they seek to draw. They seem unaware of the "silo" or "echo chamber" effect on the internet, where specific platforms develop their own political and cultural flavour, and where popularity and ease of access do not necessarily translate into representation of a wide spread of voices. The lack of any substantive analysis of the methodology chosen to select blogs makes it impossible to draw any conclusions about whether unconscious biases may have restricted the range of voices heard.

However, the most disturbing sentence in the whole paper is the researchers' conclusion:

Ethical approval was considered to be unnecessary given the blogs are publicly available, but all quotations have been anonymized.

In that one sentence, Mandla et al betray not simply a lack of knowledge of online etiquette, but a cavalier approach to data protection law. Furthermore, they appear to have contravened UCL's own internal policy on data protection and the use of sensitive personal data in research.[3]

The UCL guidance policy notes, accurately, that the legal framework for research using personal data requires researchers to demonstrate the following:

  • You are using the data only for research purposes. This includes statistical and historical research.
  • You do not use the information to support decisions about the research subject or any other living person.
  • You do not use the data in such a way that it causes substantial damage or substantial distress to the subject.
  • You do not make the results of the research available in a way that identifies any of the research subjects (except if identification is part of the explicit consent condition - see Data Protection Principle 1). Students and supervisors should be particularly careful about the potential for identifying individuals in theses containing interview transcripts.

The policy warns that even where these conditions are met, the requirements of applicable data protection law still need to be complied with. Researchers need to show a legitimate basis for processing personal data at all, and where, as here, the data in question is sensitive personal data, one of the additional grounds for processing sensitive data must also be shown. The guidance also advises either effective anonymisation or express consent.

Especially given how little attention appears to have been paid to the UCL policy in this area, Mandla et al's stated basis for avoiding ethical oversight fails on two main points.

First, the quotations have not been anonymised. The researchers know which blogs they selected for quotations about the lived experience of sufferers from bipolar disorder, and could reproduce that raw data in the event of any criticism of their research methodology. What they have attempted is, instead, pseudonymisation, and they have done it very poorly.

The blogs remain online, associated with all the identifying information which the researchers stripped out. The bloggers, should they ever read the paper, may well recognise their own words.

The quotations themselves are sufficiently distinctive that, with the aid only of Google and with less than two minutes' effort, it was possible to take one of them, and from it discover the blogger's age, gender, the US urban area in which they were living and their favourite sports team. Similar exercises were also carried out with other quotations. Studies in k-anonymity have posited that 88% of Americans could be uniquely identified from age, gender and nine-digit zip code. It seems probable that many of the individual bloggers cited could be linked to what is referred to online as "wallet names".
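For readers who want to see why a handful of innocuous-looking attributes can single a person out, the idea behind k-anonymity can be sketched in a few lines of Python. This is a minimal illustration only: the records, the attribute names and the helper functions `k_anonymity` and `singled_out` are all hypothetical, invented for this example, and are not drawn from the paper or from any real blog.

```python
from collections import Counter

# Entirely hypothetical records, invented for illustration only.
bloggers = [
    {"age": 34, "gender": "F", "area": "Metro A", "team": "Team X"},
    {"age": 34, "gender": "F", "area": "Metro A", "team": "Team Y"},
    {"age": 51, "gender": "M", "area": "Metro B", "team": "Team X"},
    {"age": 51, "gender": "M", "area": "Metro B", "team": "Team X"},
]


def _key(record, quasi_identifiers):
    """Build the tuple of attribute values that an outsider could search on."""
    return tuple(record[q] for q in quasi_identifiers)


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values for every listed quasi-identifier."""
    groups = Counter(_key(r, quasi_identifiers) for r in records)
    return min(groups.values())


def singled_out(records, quasi_identifiers):
    """Return the records that are unique on the listed quasi-identifiers."""
    groups = Counter(_key(r, quasi_identifiers) for r in records)
    return [r for r in records if groups[_key(r, quasi_identifiers)] == 1]


# On three attributes every record hides in a group of two (k = 2).
print(k_anonymity(bloggers, ["age", "gender", "area"]))

# Adding one more detail drops k to 1: two records are now unique.
print(k_anonymity(bloggers, ["age", "gender", "area", "team"]))
print(len(singled_out(bloggers, ["age", "gender", "area", "team"])))
```

The point of the sketch is that stripping names does nothing to raise k. Each additional detail left in a quotation, or discoverable from it, narrows the group in which the author can hide.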

Apart from anonymity, the reason cited for not seeking ethics approval is that "the blogs are publicly available." Linda A. Eastham's 2011 paper, cited as support for this proposition,[4] is an interesting and nuanced look at the issues, but brings the researchers into immediate conflict with one of those very "unknown unknowns." Eastham is based at University of Virginia, Charlottesville and Virginia Commonwealth University, School of Nursing, Charlottesville, VA, and accordingly approached the issue from a US-centric perspective.

She concludes:

Blogs offer an alternative avenue to examine illness experiences. When research designs include blogs in addition to interview or survey data from those blog authors, the need for informed consent is self-evident. However, when the research design uses blog data as the only data from a particular source, ethical clarity blurs. The public/private tension inherent in blogging presents a challenge to the researcher to design studies with appropriate privacy protections. Combining knowledge of blogs with an assessment of the blogger’s intended privacy level, researchers will be better able to design studies that entail minimal risk to blog authors.

This is an excellent starting point, but needs to be refined so that the "appropriate privacy protections" take into account the relevant data protection environment.

Unlike the US, the European Economic Area protects personal data (a very broadly defined class). Data relating to health, including mental health, fall within the further category of "sensitive personal data" (or "special categories of data", under GDPR), which are subject to additional safeguards. To process sensitive personal data one needs to show, first, a legitimate basis for processing any personal data and, secondly, one of the additional bases for processing sensitive personal data.

The exceptions built into the legislation, which are intended to protect uses of personal data for scientific research, must be

proportionate to the aim pursued, respect the essence of the right to data protection and provide for suitable and specific measures to safeguard the fundamental rights and the interests of the data subject.[5]

The relevant parts of section 33 DPA 1998 (which applied at the date of the paper, as did The Data Protection (Processing of Sensitive Personal Data) Order 2000) read as follows:

Research, history and statistics.

(1)In this section— “research purposes” includes statistical or historical purposes; “the relevant conditions”, in relation to any processing of personal data, means the conditions—

(a)that the data are not processed to support measures or decisions with respect to particular individuals, and

(b)that the data are not processed in such a way that substantial damage or substantial distress is, or is likely to be, caused to any data subject.

(2)For the purposes of the second data protection principle, the further processing of personal data only for research purposes in compliance with the relevant conditions is not to be regarded as incompatible with the purposes for which they were obtained.

The "substantial damage or substantial distress" test reappears in s.19 Data Protection Act 2018, which enshrines into English law the principles of GDPR Recital 153 and Article 85 relating to use of personal data for research.

Whether under the old or new law, the principles are substantially the same. What the researchers should have done is to carry out a specific analysis of their proposal with regard to the UCL policy and to the relevant law. The question as to whether material which has been voluntarily posted to the internet remains protectable personal data has been conclusively answered in the affirmative (the Samaritans Radar article referenced above makes the legal basis for this reasoning clear). Not only does it remain personal data, but people repurposing data derived from social media cannot rely on the fact that the original posters made it public themselves, where the data has been repurposed in a way which the original posters would not reasonably have expected.

Finally, and crucially, the researchers should have addressed their minds to whether what they were doing was capable of causing substantial damage or distress, and if there was a risk of such damage or distress, what steps could have been taken to reduce that risk.

In short, people venturing into unfamiliar territory should take the precaution of acquiring local knowledge, a proposition equally true whether the territory is real or virtual. Otherwise they risk being bitten by those unknown unknowns.


  1. Anika Mandla, Jo Billings & Joanna Moncrieff (2017) “Being Bipolar”: A Qualitative Analysis of the Experience of Bipolar Disorder as Described in Internet Blogs, Issues in Mental Health Nursing, 38:10, 858-864, DOI: 10.1080/01612840.2017.1355947. My appreciation to @cattebear who made this and the Eastham paper available to me. 

  2. In fairness, the researchers’ point may be somewhat stronger with respect to another blogger who nicknamed their physician “Dr Candyman”, though since this seems to be the same blogger who also complains of being “stoned out of my mind, a constant general malaise, and barely able to function” the suggestion that the nickname should be taken as a positive one appears questionable. 

  3. In particular, this guidance states that ethical approval should be sought for "research involving sensitive topics – for example participants’ sexual behaviour, their illegal or political behaviour, their experience of violence, their abuse or exploitation, their mental health or their gender or ethnic status." --UCL Library Services: Handling Sensitive and Personal Data 

  4. Eastham, L. A. (2011). Research using blogs for data: Public documents or private musings? Research in Nursing & Health, 34, 353–361 

  5. General Data Protection Regulation, Article 9 - Processing of special categories of personal data: 2 (j) 

The main image used for this article is: 'Sadness at the Beach' and was used under the terms detailed at the above link on the date this article was first published.
