This is the first in a series of blog posts excerpting my paper, Ethical and Legal Considerations of reCAPTCHA, to be presented at PST 2012. The paper’s primary purpose is to provoke thought and discussion. I’ve signed a document prohibiting me from publishing the final copy of the paper, but I am allowed to post the paper as originally submitted for consideration, so here it is…
Abstract
reCAPTCHA is a technology that aims to stop computers from abusing automated services (e.g., to stop spamming) while harnessing a large amount of brainpower to complete tasks that can be broken into small quanta. The technology, currently owned by Google, is being used to make old documents searchable as digital text through optical character recognition (OCR). It has proven to be accurate and effective. In this paper, the ethics and legality of reCAPTCHA as it is currently used are discussed. Solutions for improving reCAPTCHA in these two contexts are proposed.
Introduction
Automated agents (“bots”) have abused services, for example to post spam, resulting in the use of completely automated public Turing tests to tell computers and humans apart (CAPTCHAs). These tests employ different methods for differentiating humans from bots, including picking cats from a grid of cats and dogs and optical character recognition (OCR) tasks (J. Elson, J. R. Douceur, J. Howell, and J. Saul, “Asirra: A captcha that exploits interest-aligned manual image categorization,” in Computer and Communications Security, 2007). CAPTCHA systems generate a challenge to which they have the solution; for example, in an OCR task, distorted characters are presented and the solver must correctly identify them, on the premise that only humans can perform the task accurately.
In a reCAPTCHA system, a pair of challenges is presented. Unlike a regular CAPTCHA, the reCAPTCHA system has a correct solution for only one of the challenges. In the case of the OCR reCAPTCHA employed by Google, the second image contains text that Google would like to have decoded; the solver is not told which image contains the known text. If the known challenge is successfully solved, it is assumed that the unknown challenge was also solved correctly, because the system has determined that a human is solving the reCAPTCHA; the answer provided by the (supposed) human is recorded. Here, humans serve as an optical character recognition system implemented in wetware rather than in soft- and hardware.
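The paired-challenge logic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Google's implementation: the names `PairedChallenge`, `normalize`, and `submit` are hypothetical, and the case- and whitespace-insensitive comparison and the vote counter for the unknown word are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PairedChallenge:
    """Hypothetical paired challenge: one known word, one unknown word."""
    known_answer: str  # solution the system already holds
    # Tally of human answers recorded for the unknown word (assumed design).
    unknown_votes: Counter = field(default_factory=Counter)


def normalize(text: str) -> str:
    """Assumed matching rule: ignore case and surrounding/repeated whitespace."""
    return " ".join(text.lower().split())


def submit(challenge: PairedChallenge, known_guess: str, unknown_guess: str) -> bool:
    """Return True if the solver is treated as human.

    Only the known word is actually checked. If it matches, the answer to
    the unknown word is assumed to be correct as well and is recorded.
    """
    if normalize(known_guess) != normalize(challenge.known_answer):
        return False  # failed the control word: record nothing
    challenge.unknown_votes[normalize(unknown_guess)] += 1
    return True


challenge = PairedChallenge(known_answer="morning")
submit(challenge, "Morning", "upon")   # passes; "upon" is recorded
submit(challenge, "mourning", "spam")  # fails; nothing is recorded
```

Note that the check never inspects the unknown answer: passing the known word is the sole basis for trusting the solver's reading of the unknown one.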
Minimal CAPTCHA-reCAPTCHA
With current implementations of reCAPTCHA, there are two sets of work to be completed: the CAPTCHA portion, Wc, used to determine whether the solver is human, and the reCAPTCHA portion, Wr, which is harnessed for productivity. Under existing approaches, reCAPTCHAs double the amount of work done by humans, even though the tests would be equally effective at differentiating between humans and bots without the extra work thrust upon users.
The additional text to be decoded in a reCAPTCHA system seems, on the surface, to differ only quantitatively, in the amount of work performed by humans. However, in this paper, it will be argued that there are underlying qualitative differences between the unknown and known images that make reCAPTCHAs ethically and legally problematic. To do this, a pair of situations that fall into an ethical grey area will first be presented; these highlight three important criteria that distinguish them from reCAPTCHAs. This will be followed by a deconstruction of reCAPTCHA under various ethical frameworks. Next, three legal domains in which reCAPTCHA is in violation of laws in various countries will be discussed. Lastly, solutions for making reCAPTCHAs ethical and reducing the legal issues surrounding them will be proposed.