DOKK Library

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

Authors Angelina McMillan-Major, Emily M. Bender, Shmargaret Shmitchell, Timnit Gebru

License CC-BY-4.0

Plaintext
On the Dangers of Stochastic Parrots:
Can Language Models Be Too Big?

Emily M. Bender∗ (ebender@uw.edu), University of Washington, Seattle, WA, USA
Timnit Gebru∗ (timnit@blackinai.org), Black in AI, Palo Alto, CA, USA
Angelina McMillan-Major (aymm@uw.edu), University of Washington, Seattle, WA, USA
Shmargaret Shmitchell (shmargaret.shmitchell@gmail.com), The Aether

∗ Joint first authors
ABSTRACT

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

CCS CONCEPTS

• Computing methodologies → Natural language processing.

ACM Reference Format:
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Conference on Fairness, Accountability, and Transparency (FAccT ’21), March 3–10, 2021, Virtual Event, Canada. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3442188.3445922

This work is licensed under a Creative Commons Attribution International 4.0 License.
FAccT ’21, March 3–10, 2021, Virtual Event, Canada
ACM ISBN 978-1-4503-8309-7/21/03.
https://doi.org/10.1145/3442188.3445922

1 INTRODUCTION

One of the biggest trends in natural language processing (NLP) has been the increasing size of language models (LMs) as measured by the number of parameters and size of training data. Since 2018 alone, we have seen the emergence of BERT and its variants [39, 70, 74, 113, 146], GPT-2 [106], T-NLG [112], GPT-3 [25], and most recently Switch-C [43], with institutions seemingly competing to produce ever larger LMs. While investigating properties of LMs and how they change with size holds scientific interest, and large LMs have shown improvements on various tasks (§2), we ask whether enough thought has been put into the potential risks associated with developing them and strategies to mitigate these risks.

We first consider environmental risks. Echoing a line of recent work outlining the environmental and financial costs of deep learning systems [129], we encourage the research community to prioritize these impacts. One way this can be done is by reporting costs and evaluating works based on the amount of resources they consume [57]. As we outline in §3, increasing the environmental and financial costs of these models doubly punishes marginalized communities that are least likely to benefit from the progress achieved by large LMs and most likely to be harmed by negative environmental consequences of its resource consumption. At the scale we are discussing (outlined in §2), the first consideration should be the environmental cost.

Just as environmental impact scales with model size, so does the difficulty of understanding what is in the training data. In §4, we discuss how large datasets based on texts from the Internet overrepresent hegemonic viewpoints and encode biases potentially damaging to marginalized populations. In collecting ever larger datasets we risk incurring documentation debt. We recommend mitigating these risks by budgeting for curation and documentation at the start of a project and only creating datasets as large as can be sufficiently documented.

As argued by Bender and Koller [14], it is important to understand the limitations of LMs and put their success in context. This not only helps reduce hype which can mislead the public and researchers themselves regarding the capabilities of these LMs, but might encourage new research directions that do not necessarily depend on having larger LMs. As we discuss in §5, LMs are not performing natural language understanding (NLU), and only have success in tasks that can be approached by manipulating linguistic form [14]. Focusing on state-of-the-art results on leaderboards without encouraging deeper understanding of the mechanism by which they are achieved can cause misleading results as shown
in [21, 93] and direct resources away from efforts that would facilitate long-term progress towards natural language understanding, without using unfathomable training data.

Furthermore, the tendency of human interlocutors to impute meaning where there is none can mislead both NLP researchers and the general public into taking synthetic text as meaningful. Combined with the ability of LMs to pick up on both subtle biases and overtly abusive language patterns in training data, this leads to risks of harms, including encountering derogatory language and experiencing discrimination at the hands of others who reproduce racist, sexist, ableist, extremist or other harmful ideologies reinforced through interactions with synthetic language. We explore these potential harms in §6 and potential paths forward in §7.

We hope that a critical overview of the risks of relying on ever-increasing size of LMs as the primary driver of increased performance of language technology can facilitate a reallocation of efforts towards approaches that avoid some of these risks while still reaping the benefits of improvements to language technology.

Year   Model                     # of Parameters   Dataset Size
2019   BERT [39]                 3.4E+08           16GB
2019   DistilBERT [113]          6.60E+07          16GB
2019   ALBERT [70]               2.23E+08          16GB
2019   XLNet (Large) [150]       3.40E+08          126GB
2020   ERNIE-Gen (Large) [145]   3.40E+08          16GB
2019   RoBERTa (Large) [74]      3.55E+08          161GB
2019   MegatronLM [122]          8.30E+09          174GB
2020   T5-11B [107]              1.10E+10          745GB
2020   T-NLG [112]               1.70E+10          174GB
2020   GPT-3 [25]                1.75E+11          570GB
2020   GShard [73]               6.00E+11          –
2021   Switch-C [43]             1.57E+12          745GB

Table 1: Overview of recent large language models

2 BACKGROUND

Similar to [14], we understand the term language model (LM) to refer to systems which are trained on string prediction tasks: that is, predicting the likelihood of a token (character, word or string) given either its preceding context or (in bidirectional and masked LMs) its surrounding context. Such systems are unsupervised and when deployed, take a text as input, commonly outputting scores or string predictions. Initially proposed by Shannon in 1949 [117], some of the earliest implemented LMs date to the early 1980s and were used as components in systems for automatic speech recognition (ASR), machine translation (MT), document classification, and more [111]. In this section, we provide a brief overview of the general trend of language modeling in recent years. For a more in-depth survey of pretrained LMs, see [105].

Before neural models, n-gram models also used large amounts of data [20, 87]. In addition to ASR, these large n-gram models of English were developed in the context of machine translation from another source language with far fewer direct translation examples. For example, [20] developed an n-gram model for English with a total of 1.8T n-grams and noted steady improvements in BLEU score on the test set of 1797 Arabic translations as the training data was increased from 13M tokens.

The next big step was the move towards using pretrained representations of the distribution of words (called word embeddings) in other (supervised) NLP tasks. These word vectors came from systems such as word2vec [85] and GloVe [98] and later LSTM models such as context2vec [82] and ELMo [99] and supported state of the art performance on question answering, textual entailment, semantic role labeling (SRL), coreference resolution, named entity recognition (NER), and sentiment analysis, at first in English and later for other languages as well. While training the word embeddings required a (relatively) large amount of data, it reduced the amount of labeled data necessary for training on the various supervised tasks. For example, [99] showed that a model trained with ELMo reduced the necessary amount of training data needed to achieve similar results on SRL compared to models without, as shown in one instance where a model trained with ELMo reached the maximum development F1 score in 10 epochs as opposed to 486 without ELMo. This model furthermore achieved the same F1 score with 1% of the data as the baseline model achieved with 10% of the training data. Increasing the number of model parameters, however, did not yield noticeable increases for LSTMs [e.g. 82].

Transformer models, on the other hand, have been able to continuously benefit from larger architectures and larger quantities of data. Devlin et al. [39] in particular noted that training on a large dataset and fine-tuning for specific tasks leads to strictly increasing results on the GLUE tasks [138] for English as the hyperparameters of the model were increased. Initially developed as Chinese LMs, the ERNIE family [130, 131, 145] produced ERNIE-Gen, which was also trained on the original (English) BERT dataset, joining the ranks of very large LMs. NVIDIA released the MegatronLM which has 8.3B parameters and was trained on 174GB of text from the English Wikipedia, OpenWebText, RealNews and CC-Stories datasets [122]. Trained on the same dataset, Microsoft released T-NLG,1 an LM with 17B parameters. OpenAI’s GPT-3 [25] and Google’s GShard [73] and Switch-C [43] have increased the definition of large LM by orders of magnitude in terms of parameters at 175B, 600B, and 1.6T parameters, respectively. Table 1 summarizes a selection of these LMs in terms of training data size and parameters. As increasingly large amounts of text are collected from the web in datasets such as the Colossal Clean Crawled Corpus [107] and the Pile [51], this trend of increasingly large LMs can be expected to continue as long as they correlate with an increase in performance.

1 https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
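To make the string prediction task defined at the start of this section concrete, the following is a minimal illustrative sketch (not from the original paper): a toy bigram model that estimates the likelihood of the next word given its preceding context. The corpus and all names here are invented for illustration; the models in Table 1 optimize essentially the same kind of conditional prediction objective, only over subword tokens, with billions of parameters and vastly larger training data.

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for web-scale training data (illustrative only).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """P(next word | preceding word), estimated from raw counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# An LM in the sense of §2: given preceding context, it outputs scores over
# possible continuations, having only ever seen linguistic form.
print(next_word_distribution("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(next_word_distribution("sat"))  # {'on': 1.0}
```

Larger LMs replace the count table with learned parameters and condition on much longer contexts, but the deployed interface is the same described above: text in, scores or string predictions out.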
A number of these models also have multilingual variants such as mBERT [39] and mT5 [148] or are trained with some amount of multilingual data such as GPT-3 where 7% of the training data was not in English [25]. The performance of these multilingual models across languages is an active area of research. Wu and Dredze [144] found that while mBERT does not perform equally well across all 104 languages in its training data, it performed better at NER, POS tagging, and dependency parsing than monolingual models trained with comparable amounts of data for four low-resource languages. Conversely, [95] surveyed monolingual BERT models developed with more specific architecture considerations or additional monolingual data and found that they generally outperform mBERT across 29 tasks. Either way, these models do not address the inclusion problems raised by [65], who note that over 90% of the world’s languages used by more than a billion people currently have little to no support in terms of language technology.

Alongside work investigating what information the models retain from the data, we see a trend in reducing the size of these models using various techniques such as knowledge distillation [26, 58], quantization [118, 153], factorized embedding parameterization and cross-layer parameter sharing [70], and progressive module replacing [146]. Rogers et al. [110] provide a comprehensive comparison of models derived from BERT using these techniques, such as DistilBERT [113] and ALBERT [70]. While these models maintain and sometimes exceed the performance of the original BERT model, despite their much smaller size, they ultimately still rely on large quantities of data and significant processing and storage capabilities to both hold and reduce the model.

We note that the change from n-gram LMs to word vectors distilled from neural LMs to pretrained Transformer LMs is paralleled by an expansion and change in the types of tasks they are useful for: n-gram LMs were initially typically deployed in selecting among the outputs of e.g. acoustical or translation models; the LSTM-derived word vectors were quickly picked up as more effective representations of words (in place of bag of words features) in a variety of NLP tasks involving labeling and classification; and the pretrained Transformer models can be retrained on very small datasets (few-shot, one-shot or even zero-shot learning) to perform apparently meaning-manipulating tasks such as summarization, question answering and the like. Nonetheless, all of these systems share the property of being LMs in the sense we give above, that is, systems trained to predict sequences of words (or characters or sentences). Where they differ is in the size of the training datasets they leverage and the spheres of influence they can possibly affect. By scaling up in these two ways, modern very large LMs incur new kinds of risk, which we turn to in the following sections.

3 ENVIRONMENTAL AND FINANCIAL COST

Strubell et al. recently benchmarked model training and development costs in terms of dollars and estimated CO₂ emissions [129]. While the average human is responsible for an estimated 5t CO₂e per year,2 the authors trained a Transformer (big) model [136] with neural architecture search and estimated that the training procedure emitted 284t of CO₂. Training a single BERT base model (without hyperparameter tuning) on GPUs was estimated to require as much energy as a trans-American flight.

While some of this energy comes from renewable sources, or cloud compute companies’ use of carbon credit-offset sources, the authors note that the majority of cloud compute providers’ energy is not sourced from renewable sources and many energy sources in the world are not carbon neutral. In addition, renewable energy sources are still costly to the environment,3 and data centers with increasing computation requirements take away from other potential uses of green energy,4 underscoring the need for energy efficient model architectures and training paradigms.

Strubell et al. also examine the cost of these models vs. their accuracy gains. For the task of machine translation where large LMs have resulted in performance gains, they estimate that an increase in 0.1 BLEU score using neural architecture search for English to German translation results in an increase of $150,000 compute cost in addition to the carbon emissions. To encourage more equitable access to NLP research and reduce carbon footprint, the authors give recommendations to report training time and sensitivity to hyperparameters when the released model is meant to be re-trained for downstream use. They also urge governments to invest in compute clouds to provide equitable access to researchers.

Initiatives such as the SustainNLP workshop5 have since taken up the goal of prioritizing computationally efficient hardware and algorithms. Schwartz et al. [115] also call for the development of green AI, similar to other environmentally friendly scientific developments such as green chemistry or sustainable computing. As shown in [5], the amount of compute used to train the largest deep learning models (for NLP and other applications) has increased 300,000x in 6 years, increasing at a far higher pace than Moore’s Law. To promote green AI, Schwartz et al. argue for promoting efficiency as an evaluation metric and show that most sampled papers from ACL 2018, NeurIPS 2018, and CVPR 2019 claim accuracy improvements alone as primary contributions to the field, and none focused on measures of efficiency as primary contributions. Since then, works such as [57, 75] have released online tools to help researchers benchmark their energy usage. Among their recommendations are to run experiments in carbon friendly regions, consistently report energy and carbon metrics, and consider energy-performance trade-offs before deploying energy hungry models. In addition to these calls for documentation and technical fixes, Bietti and Vatanparast underscore the need for social and political engagement in shaping a future where data driven systems have minimal negative impact on the environment [16].

2 Data for 2017, from https://ourworldindata.org/co2-emissions, accessed Jan 21, 2021
3 https://www.heraldscotland.com/news/18270734.14m-trees-cut-scotland-make-way-wind-farms/
4 https://news.microsoft.com/2017/11/02/microsoft-announces-one-of-the-largest-wind-deals-in-the-netherlands-with-vattenfall/
5 https://sites.google.com/view/sustainlp2020/organization
6 https://www.un.org/sustainabledevelopment/blog/2016/10/report-inequalities-exacerbate-climate-impacts-on-poor/

While [129] benchmarks the training process in a research setting, many LMs are deployed in industrial or other settings where the cost of inference might greatly outweigh that of training in the long run. In this scenario, it may be more appropriate to deploy models with lower energy costs during inference even if their training costs are high. In addition to benchmarking tools, works estimating the cost increase associated with the introduction of LMs for particular applications, and how they compare to alternative NLP methods, will be important for understanding the trade-offs.
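As an illustration of the kind of energy and carbon reporting recommended above, the sketch below does a back-of-the-envelope estimate. It is not the methodology of [129] or the tools released in [57, 75]; the function name and all constants (power draw, PUE, grid carbon intensity) are hypothetical placeholders that would have to be measured or looked up for a real training run.

```python
def estimate_training_footprint(num_gpus, avg_gpu_power_kw, hours,
                                pue=1.5, grid_kg_co2e_per_kwh=0.4):
    """Rough CO2e estimate for a training run.

    num_gpus: number of accelerators used
    avg_gpu_power_kw: measured average draw per accelerator, in kW
    hours: wall-clock training time
    pue: data-center power usage effectiveness (overhead multiplier)
    grid_kg_co2e_per_kwh: carbon intensity of the local grid
    All defaults are illustrative assumptions, not measured values.
    """
    energy_kwh = num_gpus * avg_gpu_power_kw * hours * pue
    co2e_kg = energy_kwh * grid_kg_co2e_per_kwh
    return energy_kwh, co2e_kg

# Hypothetical example: 64 GPUs drawing ~0.3 kW each for two weeks.
energy, co2e = estimate_training_footprint(64, 0.3, 24 * 14)
print(f"~{energy:,.0f} kWh, ~{co2e / 1000:.1f} t CO2e")  # ~9,677 kWh, ~3.9 t CO2e
```

Running the same experiment in a region with lower grid carbon intensity reduces the second number even when the energy use is unchanged, which is one of the recommendations cited above.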
When we perform risk/benefit analyses of language technology, we must keep in mind how the risks and benefits are distributed, because they do not accrue to the same people. On the one hand, it is well documented in the literature on environmental racism that the negative effects of climate change are reaching and impacting the world’s most marginalized communities first [1, 27].6 Is it fair or just to ask, for example, that the residents of the Maldives (likely to be underwater by 2100 [6]) or the 800,000 people in Sudan affected by drastic floods7 pay the environmental price of training and deploying ever larger English LMs, when similar large-scale models aren’t being produced for Dhivehi or Sudanese Arabic?8

And, while some language technology is genuinely designed to benefit marginalized communities [17, 101], most language technology is built to serve the needs of those who already have the most privilege in society. Consider, for example, who is likely to both have the financial resources to purchase a Google Home, Amazon Alexa or an Apple device with Siri installed and comfortably speak a variety of a language which they are prepared to handle. Furthermore, when large LMs encode and reinforce hegemonic biases (see §§4 and 6), the harms that follow are most likely to fall on marginalized populations who, even in rich nations, are most likely to experience environmental racism [10, 104].

These models are being developed at a time when unprecedented environmental changes are being witnessed around the world. From monsoons caused by changes in rainfall patterns due to climate change affecting more than 8 million people in India,9 to the worst fire season on record in Australia killing or displacing nearly three billion animals and at least 400 people,10 the effect of climate change continues to set new records every year. It is past time for researchers to prioritize energy efficiency and cost to reduce negative environmental impact and inequitable access to resources — both of which disproportionately affect people who are already in marginalized positions.

4 UNFATHOMABLE TRAINING DATA

The size of data available on the web has enabled deep learning models to achieve high accuracy on specific benchmarks in NLP and computer vision applications. However, in both application areas, the training data has been shown to have problematic characteristics [18, 38, 42, 47, 61] resulting in models that encode stereotypical and derogatory associations along gender, race, ethnicity, and disability status [11, 12, 69, 132, 157]. In this section, we discuss how large, uncurated, Internet-based datasets encode the dominant/hegemonic view, which further harms people at the margins, and recommend significant resource allocation towards dataset curation and documentation practices.

4.1 Size Doesn’t Guarantee Diversity

The Internet is a large and diverse virtual space, and accordingly, it is easy to imagine that very large datasets, such as Common Crawl (“petabytes of data collected over 8 years of web crawling”,11 a filtered version of which is included in the GPT-3 training data) must therefore be broadly representative of the ways in which different people view the world. However, on closer examination, we find that there are several factors which narrow Internet participation, the discussions which will be included via the crawling methodology, and finally the texts likely to be contained after the crawled data are filtered. In all cases, the voices of people most likely to hew to a hegemonic viewpoint are also more likely to be retained. In the case of US and UK English, this means that white supremacist and misogynistic, ageist, etc. views are overrepresented in the training data, not only exceeding their prevalence in the general population but also setting up models trained on these datasets to further amplify biases and harms.

Starting with who is contributing to these Internet text collections, we see that Internet access itself is not evenly distributed, resulting in Internet data overrepresenting younger users and those from developed countries [100, 143].12 However, it’s not just the Internet as a whole that is in question, but rather specific subsamples of it. For instance, GPT-2’s training data is sourced by scraping outbound links from Reddit, and Pew Internet Research’s 2016 survey reveals 67% of Reddit users in the United States are men, and 64% between ages 18 and 29.13 Similarly, recent surveys of Wikipedians find that only 8.8–15% are women or girls [9].

Furthermore, while user-generated content sites like Reddit, Twitter, and Wikipedia present themselves as open and accessible to anyone, there are structural factors including moderation practices which make them less welcoming to marginalized populations. Jones [64] documents (using digital ethnography techniques [63]) multiple cases where people on the receiving end of death threats on Twitter have had their accounts suspended while the accounts issuing the death threats persist. She further reports that harassment on Twitter is experienced by “a wide range of overlapping groups including domestic abuse victims, sex workers, trans people, queer people, immigrants, medical patients (by their providers), neurodivergent people, and visibly or vocally disabled people.” The net result is that a limited set of subpopulations can continue to easily add data, sharing their thoughts and developing platforms that are inclusive of their worldviews; this systemic pattern in turn worsens diversity and inclusion within Internet-based communication, creating a feedback loop that lessens the impact of data from underrepresented populations.

Even if populations who feel unwelcome in mainstream sites set up different fora for communication, these may be less likely to be included in training data for language models. Take, for example, older adults in the US and UK. Lazar et al. outline how they both individually and collectively articulate anti-ageist frames specifically through blogging [71], which some older adults prefer over more popular social media sites for discussing sensitive topics [24]. These fora contain rich discussions about what constitutes age discrimination and the impacts thereof. However, a blogging community such as the one described by Lazar et al. is less likely to be found than other blogs that have more incoming and outgoing links.

7 https://www.aljazeera.com/news/2020/9/25/over-800000-affected-in-sudan-flooding-un
8 By this comment, we do not intend to erase existing work on low-resource languages. One particularly exciting example is the Masakhane project [91], which explores participatory research techniques for developing MT for African languages. These promising directions do not involve amassing terabytes of data.
9 https://www.voanews.com/south-central-asia/monsoons-cause-havoc-india-climate-change-alters-rainfall-patterns
10 https://www.cnn.com/2020/07/28/asia/australia-fires-wildlife-report-scli-intl-scn/index.html
11 http://commoncrawl.org/
12 This point is also mentioned in the model card for GPT-3: https://github.com/openai/gpt-3/blob/master/model-card.md
13 https://www.journalism.org/2016/02/25/reddit-news-users-more-likely-to-be-male-young-and-digital-in-their-news-preferences/

Finally, the current practice of filtering datasets can further attenuate the voices of people from marginalized identities. The training set for GPT-3 was a filtered version of the Common Crawl dataset, developed by training a classifier to pick out those documents
most similar to the ones used in GPT-2’s training data, i.e. documents linked to from Reddit [25], plus Wikipedia and a collection of books. While this was reportedly effective at filtering out documents that previous work characterized as “unintelligible” [134], what is unmeasured (and thus unknown) is what else it filtered out. The Colossal Clean Crawled Corpus [107], used to train a trillion parameter LM in [43], is cleaned, inter alia, by discarding any page containing one of a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words” [p.6].14 This list is overwhelmingly words related to sex, with a handful of racial slurs and words related to white supremacy (e.g. swastika, white power) included. While possibly effective at removing documents containing pornography (and the associated problematic stereotypes encoded in the language of such sites [125]) and certain kinds of hate speech, this approach will also undoubtedly attenuate, by suppressing such words as twink, the influence of online spaces built by and for LGBTQ people.15 If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light.

Thus at each step, from initial participation in Internet fora, to continued presence there, to the collection and finally the filtering of training data, current practice privileges the hegemonic viewpoint. In accepting large amounts of web text as ‘representative’ of ‘all’ of humanity we risk perpetuating dominant viewpoints, increasing power imbalances, and further reifying inequality. We instead propose practices that actively seek to include communities underrepresented on the Internet. For instance, one can take inspiration from movements to decolonize education by moving towards oral histories due to the overrepresentation of colonial views in text [35, 76, 127], and curate training datasets through a thoughtful process of deciding what to put in, rather than aiming solely for scale and trying haphazardly to weed out, post-hoc, flotsam deemed ‘dangerous’, ‘unintelligible’, or ‘otherwise bad’.

4.2 Static Data/Changing Social Views

A central aspect of social movement formation involves using language strategically to destabilize dominant narratives and call attention to underrepresented social perspectives. Social movements produce new norms, language, and ways of communicating. This adds challenges to the deployment of LMs, as methodologies reliant on LMs run the risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive understandings.

For instance, the Black Lives Matter movement (BLM) influenced Wikipedia article generation and editing such that, as the BLM movement grew, articles covering shootings of Black people increased in coverage and were generated with reduced latency [135]. Importantly, articles describing past shootings and incidents of police brutality were created and updated as articles for new events were created, reflecting how social movements make connections between events in time to form cohesive narratives [102]. More generally, Twyman et al. [135] highlight how social movements actively influence framings and reframings of minority narratives in the type of online discourse that potentially forms the data that underpins LMs.

An important caveat is that social movements which are poorly documented and which do not receive significant media attention will not be captured at all. Media coverage can fail to cover protest events and social movements [41, 96] and can distort events that challenge state power [36]. This is exemplified by media outlets that tend to ignore peaceful protest activity and instead focus on dramatic or violent events that make for good television but nearly always result in critical coverage [81]. As a result, the data underpinning LMs stands to misrepresent social movements and disproportionately align with existing regimes of power.

Developing and shifting frames stand to be learned in incomplete ways or lost in the big-ness of data used to train large LMs — particularly if the training data isn’t continually updated. Given the compute costs alone of training large LMs, it likely isn’t feasible for even large corporations to fully retrain them frequently enough to keep up with the kind of language change discussed here. Perhaps fine-tuning approaches could be used to retrain LMs, but here again, what would be required is thoughtful curation practices to find appropriate data to capture reframings and techniques for evaluating whether such fine-tuning appropriately captures the ways in which new framings contest hegemonic representations.

4.3 Encoding Bias

It is well established by now that large LMs exhibit various kinds of bias, including stereotypical associations [11, 12, 69, 119, 156, 157], or negative sentiment towards specific groups [61]. Furthermore, we see the effects of intersectionality [34], where BERT, ELMo, GPT and GPT-2 encode more bias against identities marginalized along more than one dimension than would be expected based on just the combination of the bias along each of the axes [54, 132]. Many of these works conclude that these issues are a reflection of training data characteristics. For instance, Hutchinson et al. find that BERT associates phrases referencing persons with disabilities with more negative sentiment words, and that gun violence, homelessness, and drug addiction are overrepresented in texts discussing mental illness [61]. Similarly, Gehman et al. show that models like GPT-3 trained with at least 570GB of data derived mostly from Common Crawl16 can generate sentences with high toxicity scores even when prompted with non-toxic sentences [53]. Their investigation of GPT-2’s training data17 also finds 272K documents from unreliable news sites and 63K from banned subreddits.

14 Available at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en, accessed Jan 18, 2021
15 This observation is due to William Agnew.
16 https://commoncrawl.org/the-data/
17 GPT-3’s training data is not openly available, but GPT-2’s training data was used indirectly to construct GPT-3’s [53].

These demonstrations of biases learned by LMs are extremely valuable in pointing out the potential for harm when such models are deployed, either in generating text or as components of classification systems, as explored further in §6. However, they do not represent a methodology that can be used to exhaustively discover all such risks, for several reasons.

First, model auditing techniques typically rely on automated systems for measuring sentiment, toxicity, or novel metrics such as ‘regard’ to measure attitudes towards a specific demographic group [119]. But these systems themselves may not be reliable
means of measuring the toxicity of text generated by LMs. For                                     documentation as part of the planned costs of dataset creation, and
example, the Perspective API model has been found to associate                                    only collect as much data as can be thoroughly documented within
higher levels of toxicity with sentences containing identity markers                              that budget.
for marginalized groups or even specific names [61, 103].
   Second, auditing an LM for biases requires an a priori under-                                  5     DOWN THE GARDEN PATH
standing of what social categories might be salient. The works cited
                                                                                                  In §4 above, we discussed the ways in which different types of
above generally start from US protected attributes such as race and
                                                                                                  biases can be encoded in the corpora used to train large LMs. In
gender (as understood within the US). But, of course, protected
                                                                                                  §6 below we explore some of the risks and harms that can follow
attributes aren’t the only identity characteristics that can be subject
                                                                                                  from deploying technology that has learned those biases. In the
to bias or discrimination, and the salient identity characteristics
                                                                                                  present section, however, we focus on a different kind of risk: that
and expressions of bias are also culture-bound [46, 116]. Thus, com-
                                                                                                  of misdirected research effort, specifically around the application
ponents like toxicity classifiers would need culturally appropriate
                                                                                                  of LMs to tasks intended to test for natural language understanding
training data for each context of audit, and even still we may miss
                                                                                                  (NLU). As the very large Transformer LMs posted striking gains
marginalized identities if we don’t know what to audit for.
                                                                                                  in the state of the art on various benchmarks intended to model
   Finally, we note that moving beyond demonstrating the exis-
                                                                                                  meaning-sensitive tasks, and as initiatives like [142] made the mod-
tence of bias to building systems that verify the ‘safety’ of some
                                                                                                  els broadly accessible to researchers seeking to apply them, large
LM (even for a given protected class) requires engaging with the
                                                                                                  quantities of research effort turned towards measuring how well
systems of power that lead to the harmful outcomes such a system
                                                                                                  BERT and its kin do on both existing and new benchmarks.19 This
would seek to prevent [19]. For example, the #MeToo movement has
                                                                                                  allocation of research effort brings with it an opportunity cost, on
spurred broad-reaching conversations about inappropriate sexual
                                                                                                  the one hand in terms of time not spent applying meaning captur-
behavior from men in power, as well as men more generally [84].
                                                                                                  ing approaches to meaning sensitive tasks, and on the other hand in
These conversations challenge behaviors that have been historically
                                                                                                  terms of time not spent exploring more effective ways of building
considered appropriate or even the fault of women, shifting notions
                                                                                                  technology with datasets of a size that can be carefully curated and
of sexually inappropriate behavior. Any product development that
                                                                                                  available for a broader set of languages [65, 91].
involves operationalizing definitions around such shifting topics
                                                                                                      The original BERT paper [39] showed the effectiveness of the
into algorithms is necessarily political (whether or not developers
                                                                                                  architecture and the pretraining technique by evaluating on the
choose the path of maintaining the status quo ante). For example,
                                                                                                  General Language Understanding Evaluation (GLUE) benchmark
men and women make significantly different assessments of sexual
                                                                                                  [138], the Stanford Question Answering Datasets (SQuAD 1.1 and
harassment online [40]. An algorithmic definition of what con-
                                                                                                  2.0) [108], and the Situations With Adversarial Generations bench-
stitutes inappropriately sexual communication will inherently be
concordant with some views and discordant with others. Thus, an attempt to measure the appropriateness of text generated by LMs, or the biases encoded by a system, always needs to be done in relation to particular social contexts and marginalized perspectives [19].

4.4     Curation, Documentation & Accountability
In summary, LMs trained on large, uncurated, static datasets from the Web encode hegemonic views that are harmful to marginalized populations. We thus emphasize the need to invest significant resources into curating and documenting LM training data. In this, we follow Jo et al. [62], who cite archival history data collection methods as an example of the amount of resources that should be dedicated to this process, and Birhane and Prabhu [18], who call for a more justice-oriented data collection methodology. Birhane and Prabhu note, echoing Ruha Benjamin [15], “Feeding AI systems on the world’s beauty, ugliness, and cruelty, but expecting it to reflect only the beauty is a fantasy.” [p.1541]
   When we rely on ever larger datasets we risk incurring documentation debt,18 i.e. putting ourselves in a situation where the datasets are both undocumented and too large to document post hoc. While documentation allows for potential accountability [13, 52, 86], undocumented training data perpetuates harm without recourse. Without documentation, one cannot try to understand training data characteristics in order to mitigate some of these attested issues or even unknown ones. The solution, we propose, is to budget for documentation as part of the planned costs of dataset creation, and only collect as much data as can be thoroughly documented within that budget.

mark (SWAG) [155], all datasets designed to test language understanding and/or commonsense reasoning. BERT posted state of the art results on all of these tasks, and the authors conclude by saying that “unsupervised pre-training is an integral part of many language understanding systems.” [39, p.4179]. Even before [39] was published, BERT was picked up by the NLP community and applied with great success to a wide variety of tasks [e.g. 2, 149].
   However, no actual language understanding is taking place in LM-driven approaches to these tasks, as can be shown by careful manipulation of the test data to remove spurious cues the systems are leveraging [21, 93]. Furthermore, as Bender and Koller [14] argue from a theoretical perspective, languages are systems of signs [37], i.e. pairings of form and meaning. But the training data for LMs is only form; they do not have access to meaning. Therefore, claims about model abilities must be carefully characterized.
   As the late Karen Spärck Jones pointed out: the use of LMs ties us to certain (usually unstated) epistemological and methodological commitments [124]. Either i) we commit ourselves to a noisy-channel interpretation of the task (which rarely makes sense outside of ASR), ii) we abandon any goals of theoretical insight into tasks and treat LMs as “just some convenient technology” [p.7], or iii) we implicitly assume a certain statistical relationship — known to be invalid — between inputs, outputs and meanings.20

18 On the notion of documentation debt as applied to code, rather than data, see [154].
19 ~26% of papers sampled from ACL, NAACL and EMNLP since 2018 cite [39].
20 Specifically, that the mutual information between the input and the meaning given the output is zero — what Spärck Jones calls “the model of ignorance”.
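Footnote 20’s condition can be stated compactly; the notation below is our restatement (X for the input form, Y for the output, M for the meaning), not Spärck Jones’s own.

    % The "model of ignorance": once the output is fixed, the input carries
    % no further information about the meaning.
    I(X; M \mid Y) = 0
    % Equivalently, p(m | x, y) = p(m | y) for all x, y, m, which is precisely
    % the statistical relationship described above as known to be invalid.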






Although she primarily had n-gram models in mind, the conclusions remain apt and relevant.
   There are interesting linguistic questions to ask about what exactly BERT, GPT-3 and their kin are learning about linguistic structure from the unsupervised language modeling task, as studied in the emerging field of ‘BERTology’ [e.g. 110, 133]. However, from the perspective of work on language technology, it is far from clear that all of the effort being put into using large LMs to ‘beat’ tasks designed to test natural language understanding, and all of the effort to create new such tasks, once the existing ones have been bulldozed by the LMs, brings us any closer to long-term goals of general language understanding systems. If a large LM, endowed with hundreds of billions of parameters and trained on a very large dataset, can manipulate linguistic form well enough to cheat its way through tests meant to require language understanding, have we learned anything of value about how to build machine language understanding or have we been led down the garden path?

6     STOCHASTIC PARROTS
In this section, we explore the ways in which the factors laid out in §4 and §5 — the tendency of training data ingested from the Internet to encode hegemonic worldviews, the tendency of LMs to amplify biases and other issues in the training data, and the tendency of researchers and other people to mistake LM-driven performance gains for actual natural language understanding — present real-world risks of harm, as these technologies are deployed. After exploring some reasons why humans mistake LM output for meaningful text, we turn to the risks and harms from deploying such a model at scale. We find that the mix of human biases and seemingly coherent language heightens the potential for automation bias, deliberate misuse, and amplification of a hegemonic worldview. We focus primarily on cases where LMs are used in generating text, but we will also touch on risks that arise when LMs or word embeddings derived from them are components of systems for classification, query expansion, or other tasks, or when users can query LMs for information memorized from their training data.

6.1     Coherence in the Eye of the Beholder
Where traditional n-gram LMs [117] can only model relatively local dependencies, predicting each word given the preceding sequence of N words (usually 5 or fewer), the Transformer LMs capture much larger windows and can produce text that is seemingly not only fluent but also coherent even over paragraphs. For example, McGuffie and Newhouse [80] prompted GPT-3 with the text in bold in Figure 1, and it produced the rest of the text, including the Q&A format.21 This example illustrates GPT-3’s ability to produce coherent and on-topic text; the topic is connected to McGuffie and Newhouse’s study of GPT-3 in the context of extremism, discussed below.
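The difference in conditioning window can be written out; the factorization below uses standard notation (ours, not the authors’), with w_1, ..., w_t a token sequence.

    % An N-gram LM predicts each word from only the previous N words:
    P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-N}, \ldots, w_{t-1})
    % A Transformer LM instead conditions on the full preceding context, up to
    % its context window, which is what lets its output seem coherent over
    % paragraphs; neither factorization involves meaning or communicative intent.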
Figure 1: GPT-3’s response to the prompt (in bold), from [80]
    Question: What is the name of the Russian mercenary group?
    Answer: Wagner group.
    Question: Where is the Wagner group?
    Answer: In Syria.
    Question: Who is the leader of the Wagner group?
    Answer: A Russian citizen, Dmitry Utkin. He was born in 1971 and lives in Moscow. He has been a member of the Russian parliament since 2016. He is a former officer of the Main Intelligence Directorate (GRU) of Russia’s General Staff. He was also a commander of the special forces unit “Vostok” (East). The unit was deployed to Chechnya and fought against separatists there in 1994-1996. After that, he served as an adviser to the head of Chechnya, Ramzan Kadyrov, and later became his deputy. In 2004-2005, he commanded a battalion in Chechnya and participated in fighting against insurgents there. In 2008-2009, he commanded another battalion in Abkhazia and South Ossetia during Russia’s war with Georgia over these two breakaway regions. Since 2014, he has been leading military units fighting against Ukrainian government forces on behalf of Russia-backed separatists in eastern Ukraine. According to open sources, Utkin is currently living in Moscow but travels frequently to Syria where his troops are fighting on behalf of Bashar al-Assad’s regime against anti-government forces there.

   We say seemingly coherent because coherence is in fact in the eye of the beholder. Our human understanding of coherence derives from our ability to recognize interlocutors’ beliefs [30, 31] and intentions [23, 33] within context [32]. That is, human language use takes place between individuals who share common ground and are mutually aware of that sharing (and its extent), who have communicative intents which they use language to convey, and who model each others’ mental states as they communicate. As such, human communication relies on the interpretation of implicit meaning conveyed between individuals. The fact that human-human communication is a jointly constructed activity [29, 128] is most clearly true in co-situated spoken or signed communication, but we use the same facilities for producing language that is intended for audiences not co-present with us (readers, listeners, watchers at a distance in time or space) and in interpreting such language when we encounter it. It must follow that even when we don’t know the person who generated the language we are interpreting, we build a partial model of who they are and what common ground we think they share with us, and use this in interpreting their words.
   Text generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t have been, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that. This can seem counter-intuitive given the increasingly fluent qualities of automatically generated text, but we have to account for the fact that our perception of natural language text, regardless of how it was generated, is mediated by our own linguistic competence and our predisposition to interpret communicative acts as conveying coherent meaning and intent, whether or not they do [89, 140]. The problem is, if one side of the communication does not have meaning, then the comprehension of the implicit meaning is an illusion arising from our singular human understanding of language (independent of the model).22

21 This is just the first part of the response that McGuffie and Newhouse show. GPT-3 continues for two more question answer pairs with similar coherence. McGuffie and Newhouse report that all examples given in their paper are from either the first or second attempt at running a prompt.
22 Controlled generation, where an LM is deployed within a larger system that guides its generation of output to certain styles or topics [e.g. 147, 151, 158], is not the same thing as communicative intent. One clear way to distinguish the two is to ask whether the system (or the organization deploying the system) has accountability for the truth of the utterances produced.






Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.

6.2     Risks and Harms
The ersatz fluency and coherence of LMs raises several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said. We now turn to examples, laying out the potential follow-on harms.
   The first risks we consider are the risks that follow from the LMs absorbing the hegemonic worldview from their training data. When humans produce language, our utterances reflect our worldviews, including our biases [78, 79]. As people in positions of privilege with respect to a society’s racism, misogyny, ableism, etc., tend to be overrepresented in training data for LMs (as discussed in §4 above), this training data thus includes encoded biases, many already recognized as harmful.
   Biases can be encoded in ways that form a continuum from subtle patterns like referring to women doctors as if doctor itself entails not-woman or referring to both genders excluding the possibility of non-binary gender identities, through directly contested framings (e.g. undocumented immigrants vs. illegal immigrants or illegals), to language that is widely recognized to be derogatory (e.g. racial slurs) yet still used by some. While some of the most overtly derogatory words could be filtered out, not all forms of online abuse are easily detectable using such taboo words, as evidenced by the growing body of research on online abuse detection [45, 109]. Furthermore, in addition to abusive language [139] and hate speech [67], there are subtler forms of negativity such as gender bias [137], microaggressions [22], dehumanization [83], and various socio-political framing biases [44, 114] that are prevalent in language data. For example, describing a woman’s account of her experience of sexism with the word tantrum both reflects a worldview where the sexist actions are normative and foregrounds a stereotype of women as childish and not in control of their emotions.
   An LM that has been trained on such data will pick up these kinds of problematic associations. If such an LM produces text that is put into the world for people to interpret (flagged as produced by an ‘AI’ or otherwise), what risks follow? In the first instance, we foresee that LMs producing text will reproduce and even amplify the biases in their input [53]. Thus the risk is that people disseminate text generated by LMs, meaning more text in the world that reinforces and propagates stereotypes and problematic associations, both to humans who encounter the text and to future LMs trained on training sets that ingested the previous generation LM’s output. Humans who encounter this text may themselves be subjects of those stereotypes and associations or not. Either way, harms ensue: readers subject to the stereotypes may experience the psychological harms of microaggressions [88, 141] and stereotype threat [97, 126]. Other readers may be introduced to stereotypes or have ones they already carry reinforced, leading them to engage in discrimination (consciously or not) [55], which in turn leads to harms of subjugation, denigration, belittlement, loss of opportunity [3, 4, 56] and others on the part of those discriminated against.
   If the LM outputs overtly abusive language (as Gehman et al. [53] show that they can and do), then a similar set of risks arises. These include: propagating or proliferating overtly abusive views and associations, amplifying abusive language, and producing more (synthetic) abusive language that may be included in the next iteration of large-scale training data collection. The harms that could follow from these risks are again similar to those identified above for more subtly biased language, but perhaps more acute to the extent that the language in question is overtly violent or defamatory. They include the psychological harm experienced by those who identify with the categories being denigrated if they encounter the text; the reinforcement of sexist, racist, ableist, etc. ideology; follow-on effects of such reinforced ideologies (including violence); and harms to the reputation of any individual or organization perceived to be the source of the text.
   If the LM or word embeddings derived from it are used as components in a text classification system, these biases can lead to allocational and/or reputational harms, as biases in the representations affect system decisions [125]. This case is especially pernicious for being largely invisible to both the direct user of the system and any indirect stakeholders about whom decisions are being made. Similarly, biases in an LM used in query expansion could influence search results, further exacerbating the risk of harms of the type documented by Noble in [94], where the juxtaposition of search queries and search results, when connected by negative stereotypes, reinforces those stereotypes and causes psychological harm.
   The above cases involve risks that could arise when LMs are deployed without malicious intent. A third category of risk involves bad actors taking advantage of the ability of large LMs to produce large quantities of seemingly coherent texts on specific topics on demand in cases where those deploying the LM have no investment in the truth of the generated text. These include prosaic cases, such as services set up to ‘automatically’ write term papers or interact on social media,23 as well as use cases connected to promoting extremism. For example, McGuffie and Newhouse [80] show how GPT-3 could be used to generate text in the persona of a conspiracy theorist, which in turn could be used to populate extremist recruitment message boards. This would give such groups a cheap way to boost recruitment by making human targets feel like they were among many like-minded people. If the LMs are deployed in this way to recruit more people to extremist causes, then harms, in the first instance, befall the people so recruited and (likely more severely) to others as a result of violence carried out by the extremists.
   Yet another risk connected to seeming coherence and fluency involves machine translation (MT) and the way that increased fluency of MT output changes the perceived adequacy of that output [77]. This differs somewhat from the cases above in that there was an initial human communicative intent, by the author of the source language text.

23 Such as the GPT-3 powered bot let loose on Reddit; see https://thenextweb.com/neural/2020/10/07/someone-let-a-gpt-3-bot-loose-on-reddit-it-didnt-end-well/amp/.






However, MT systems can (and frequently do) produce output that is inaccurate yet both fluent and (again, seemingly) coherent in its own right to a consumer who either doesn’t see the source text or cannot understand the source text on their own. When such consumers therefore mistake the meaning attributed to the MT output as the actual communicative intent of the original text’s author, real-world harm can ensue. A case in point is the story of a Palestinian man, arrested by Israeli police, after MT translated his Facebook post which said “good morning” (in Arabic) to “hurt them” (in English) and “attack them” (in Hebrew).24 This case involves a short phrase, but it is easy to imagine how the ability of large LMs to produce seemingly coherent text over larger passages could erase cues that might tip users off to translation errors in longer passages as well [77].
   Finally, we note that there are risks associated with the fact that LMs with extremely large numbers of parameters model their training data very closely and can be prompted to output specific information from that training data. For example, [28] demonstrate a methodology for extracting personally identifiable information (PII) from an LM and find that larger LMs are more susceptible to this style of attack than smaller ones. Building training data out of publicly available documents doesn’t fully mitigate this risk: just because the PII was already available in the open on the Internet doesn’t mean there isn’t additional harm in collecting it and providing another avenue to its discovery. This type of risk differs from those noted above because it doesn’t hinge on seeming coherence of synthetic text, but the possibility of a sufficiently motivated user gaining access to training data via the LM. In a similar vein, users might query LMs for ‘dangerous knowledge’ (e.g. tax avoidance advice), knowing that what they were getting was synthetic and therefore not credible but nonetheless representing clues to what is in the training data in order to refine their own search queries.

6.3     Summary
In this section, we have discussed how the human tendency to attribute meaning to text, in combination with large LMs’ ability to learn patterns of forms that humans associate with various biases and other harmful attitudes, leads to risks of real-world harm, should LM-generated text be disseminated. We have also reviewed risks connected to using LMs as components in classification systems and the risks of LMs memorizing training data. We note that the risks associated with synthetic but seemingly coherent text are deeply connected to the fact that such synthetic text can enter into conversations without any person or entity being accountable for it. This accountability both involves responsibility for truthfulness and is important in situating meaning. As Maggie Nelson [92] writes: “Words change depending on who speaks them; there is no cure.”
   In §7, we consider directions the field could take to pursue goals of creating language technology while avoiding some of the risks and harms identified here and above.

7     PATHS FORWARD
In order to mitigate the risks that come with the creation of increasingly large LMs, we urge researchers to shift to a mindset of careful planning, along many dimensions, before starting to build either datasets or systems trained on datasets. We should consider our research time and effort a valuable resource, to be spent to the extent possible on research projects that build towards a technological ecosystem whose benefits are at least evenly distributed or better accrue to those historically most marginalized. This means considering how research contributions shape the overall direction of the field and keeping alert to directions that limit access. Likewise, it means considering the financial and environmental costs of model development up front, before deciding on a course of investigation. The resources needed to train and tune state-of-the-art models stand to increase economic inequities unless researchers incorporate energy and compute efficiency in their model evaluations. Furthermore, the goals of energy and compute efficient model building and of creating datasets and models where the incorporated biases can be understood both point to careful curation of data. Significant time should be spent on assembling datasets suited for the tasks at hand rather than ingesting massive amounts of data from convenient or easily-scraped Internet sources. As discussed in §4.1, simply turning to massive dataset size as a strategy for being inclusive of diverse viewpoints is doomed to failure. We recall again Birhane and Prabhu’s [18] words (inspired by Ruha Benjamin [15]): “Feeding AI systems on the world’s beauty, ugliness, and cruelty, but expecting it to reflect only the beauty is a fantasy.”
   As a part of careful data collection practices, researchers must adopt frameworks such as [13, 52, 86] to describe the uses for which their models are suited and benchmark evaluations for a variety of conditions. This involves providing thorough documentation on the data used in model building, including the motivations underlying data selection and collection processes. This documentation should reflect and indicate researchers’ goals, values, and motivations in assembling data and creating a given model. It should also make note of potential users and stakeholders, particularly those that stand to be negatively impacted by model errors or misuse. We note that just because a model might have many different applications doesn’t mean that its developers don’t need to consider stakeholders. An exploration of stakeholders for likely use cases can still be informative around potential risks, even when there is no way to guarantee that all use cases can be explored.
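One way to keep such documentation alongside the data itself is as a small, machine-readable record. The sketch below is ours; the field names loosely echo data statements [13] and datasheets for datasets [52] but are not a schema prescribed by either framework.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetDocumentation:
    """Minimal record of the curation decisions behind a training corpus.
    Field names are illustrative assumptions, not a standardized schema."""
    name: str
    motivation: str                    # why the dataset was assembled
    curation_rationale: str            # what was included/excluded and why
    language_varieties: List[str]      # dialects and registers covered
    speaker_demographics: str          # whose voices are represented
    known_biases_and_gaps: List[str]   # documented limitations
    intended_uses: List[str]
    stakeholders_at_risk: List[str]    # who could be harmed by errors or misuse
    collection_budget_notes: str = ""  # ties documentation to planned costs (cf. 4.4)
    annotations: List[str] = field(default_factory=list)

# Hypothetical usage: documenting a small curated corpus.
doc = DatasetDocumentation(
    name="example-curated-corpus",
    motivation="Study register variation in public health guidance.",
    curation_rationale="Only documents with known provenance and clear licenses.",
    language_varieties=["English (US, formal register)"],
    speaker_demographics="Government and NGO authors; largely institutional voices.",
    known_biases_and_gaps=["Little informal or community-authored text."],
    intended_uses=["Fine-tuning a domain classifier"],
    stakeholders_at_risk=["Readers relying on automated summaries of guidance"],
)
print(doc.name, "-", len(doc.known_biases_and_gaps), "documented gap(s)")

Keeping the record with the data, rather than reconstructing it post hoc, is the point: it makes the documentation debt visible while it can still be paid down.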
   We also advocate for a re-alignment of research goals: Where much effort has been allocated to making models (and their training data) bigger and to achieving ever higher scores on leaderboards often featuring artificial tasks, we believe there is more to be gained by focusing on understanding how machines are achieving the tasks in question and how they will form part of socio-technical systems. To that end, LM development may benefit from guided evaluation exercises such as pre-mortems [68]. Frequently used in business settings before the deployment of new products or projects, pre-mortem analyses center hypothetical failures and ask team members to reverse engineer previously unanticipated causes.25 Critically, pre-mortem analyses prompt team members to consider not only a range of potential known and unknown project risks, but also alternatives to current project plans.

24 https://www.theguardian.com/technology/2017/oct/24/facebook-palestine-israel-translates-good-morning-attack-them-arrest
25 This would be one way to build an evaluation culture that considers not only average-case performance (as measured by metrics) and best-case performance (cherry-picked examples), but also worst-case performance.






In this way, researchers can consider the risks and limitations of their LMs in a guided way while also considering fixes to current designs or alternative methods of achieving a task-oriented goal in relation to specific pitfalls.
   Value sensitive design [49, 50] provides a range of methodologies for identifying stakeholders (both direct stakeholders who will use a technology and indirect stakeholders who will be affected through others’ use of it), working with them to identify their values, and designing systems that support those values. These include such techniques as envisioning cards [48], the development of value scenarios [90], and working with panels of experiential experts [152]. These approaches help surface not only stakeholder values, but also values expressed by systems and enacted through interactions between systems and society [120]. For researchers working with LMs, value sensitive design is poised to help throughout the development process in identifying whose values are expressed and supported through a technology and, subsequently, how a lack of support might result in harm.
   All of these approaches take time and are most valuable when applied early in the development process as part of a conceptual investigation of values and harms rather than as a post-hoc discovery of risks [72]. These conceptual investigations should come before researchers become deeply committed to their ideas and therefore less likely to change course when confronted with evidence of possible harms. This brings us again to the idea we began this section with: that research and development of language technology, at once concerned with deeply human data (language) and creating systems which humans interact with in immediate and vivid ways, should be done with forethought and care.
   Finally, we would like to consider use cases of large LMs that have specifically served marginalized populations. If, as we advocate, the field backs off from the path of ever larger LMs, are we thus sacrificing benefits that would accrue to these populations? As a case in point, consider automatic speech recognition, which has seen some improvements thanks to advances in LMs, including both in size and in architecture [e.g. 8, 59, 121], though the largest LMs typically are too large and too slow for the near real-time needs of ASR systems [60]. Improved ASR has many beneficial applications, including automatic captioning which has the potential to be beneficial for Deaf and hard of hearing people, providing access to otherwise inaccessible audio content.26 We see two beneficial paths forward here: The first is a broader search for means of improving ASR systems, as indeed is underway, since the contexts of application of the technology aren’t conducive to using ever larger LMs [60]. But even if larger LMs could be used, just because we’ve seen that large LMs can help doesn’t mean that this is the only effective path to stronger ASR technology. (And we note that if we want to build strong ASR technology across most of the world’s languages, we can’t rely on having terabytes of data in all cases.) The second, should we determine that large LMs are critical (when available), is to recognize this as an instance of a dual use problem and consider how to mitigate the harms of LMs used as stochastic parrots while still preserving them for use in ASR systems. Could LMs be built in such a way that synthetic text generated with them would be watermarked and thus detectable [7, 66, 123]? Are there policy approaches that could effectively regulate their use?
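As a rough illustration of what “watermarked and thus detectable” could mean, the toy sketch below biases generation toward a pseudo-random, key-dependent subset of a vocabulary and then tests for that bias. It is our simplification of the general idea of statistical text watermarking, not the specific schemes of [7, 66, 123]; a real system would operate on an actual LM’s sampling distribution.

import hashlib
import random

# Toy sketch of a statistical text watermark (our illustration only).
VOCAB = ["the", "of", "group", "in", "a", "is", "was", "to",
         "he", "and", "forces", "there", "said", "later", "unit"]  # stand-in vocabulary
SECRET_KEY = "demo-key"                      # assumed shared by generator and detector
GREEN_SIZE = len(VOCAB) // 2

def green_list(prev_token):
    """Pseudo-randomly pick a 'green' subset of the vocabulary, seeded by the
    previous token plus the secret key, so a detector can recompute the same split."""
    seed = hashlib.sha256((SECRET_KEY + prev_token).encode()).hexdigest()
    return set(random.Random(seed).sample(VOCAB, GREEN_SIZE))

def generate(length=200, bias=0.9, seed=0):
    """Stand-in 'LM': samples uniformly, but with probability `bias` restricts the
    choice to the green list; that restriction is what embeds the watermark."""
    rng = random.Random(seed)
    out = ["the"]
    for _ in range(length):
        pool = sorted(green_list(out[-1])) if rng.random() < bias else VOCAB
        out.append(rng.choice(pool))
    return out

def detection_z_score(tokens):
    """z-score for the green-token rate; unwatermarked text stays near 0,
    watermarked text scores far above it."""
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    p = GREEN_SIZE / len(VOCAB)
    return (hits - n * p) / (n * p * (1 - p)) ** 0.5

if __name__ == "__main__":
    print(f"watermarked text z-score: {detection_z_score(generate()):.1f}")

Even in this toy form, the trade-off the question gestures at is visible: detection requires access to the key and enough text, and it degrades once generated text is edited or paraphrased.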
                                                                                           and modeled in order to block foreseeable harm to society and
   In summary, we advocate for research that centers the people who stand to be adversely affected by the resulting technology, with a broad view on the possible ways that technology can affect people. This, in turn, means making time in the research process for considering environmental impacts, for doing careful data curation and documentation, for engaging with stakeholders early in the design process, for exploring multiple possible paths towards long-term goals, for keeping alert to dual-use scenarios, and finally for allocating research effort to harm mitigation in such cases.

8     CONCLUSION
The past few years, ever since processing capacity caught up with neural models, have been heady times in the world of NLP. Neural approaches in general, and large, Transformer LMs in particular, have rapidly overtaken the leaderboards on a wide variety of benchmarks and once again the adage “there’s no data like more data” seems to be true. It may seem like progress in the field, in fact, depends on the creation of ever larger language models (and research into how to deploy them to various ends).
   In this paper, we have invited readers to take a step back and ask: Are ever larger LMs inevitable or necessary? What costs are associated with this research direction and what should we consider before pursuing it? Do the field of NLP or the public that it serves in fact need larger LMs? If so, how can we pursue this research direction while mitigating its associated risks? If not, what do we need instead?
   We have identified a wide variety of costs and risks associated with the rush for ever larger LMs, including: environmental costs (borne typically by those not benefiting from the resulting technology); financial costs, which in turn erect barriers to entry, limiting who can contribute to this research area and which languages can benefit from the most advanced techniques; opportunity cost, as researchers pour effort away from directions requiring less resources; and the risk of substantial harms, including stereotyping, denigration, increases in extremist ideology, and wrongful arrest, should humans encounter seemingly coherent LM output and take it for the words of some person or organization who has accountability for what is said.
   Thus, we call on NLP researchers to carefully weigh these risks while pursuing this research direction, consider whether the benefits outweigh the risks, and investigate dual use scenarios utilizing the many techniques (e.g. those from value sensitive design) that have been put forth. We hope these considerations encourage NLP researchers to direct resources and effort into techniques for approaching NLP tasks that are effective without being endlessly data hungry. But beyond that, we call on the field to recognize that applications that aim to believably mimic humans bring risk of extreme harms. Work on synthetic human behavior is a bright line in ethical AI development, where downstream effects need to be understood and modeled in order to block foreseeable harm to society and different social groups. Thus what is also needed is scholarship on the benefits, harms, and risks of mimicking humans and thoughtful design of target tasks grounded in use cases sufficiently concrete to allow collaborative design with affected communities.

26 Note however, that automatic captioning is not yet and likely may never be good enough to replace human-generated captions. Furthermore, in some contexts, what Deaf communities prefer is human captioning plus interpretation to the appropriate signed language. We do not wish to suggest that automatic systems are sufficient replacements for these key accessibility requirements.






REFERENCES
[1] Hussein M Adam, Robert D Bullard, and Elizabeth Bell. 2001. Faces of environmental racism: Confronting issues of global justice. Rowman & Littlefield.
[2] Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT Baseline for the Natural Questions. arXiv:1901.08634 [cs.CL]
[3] Larry Alexander. 1992. What makes wrongful discrimination wrong? Biases, preferences, stereotypes, and proxies. University of Pennsylvania Law Review 141, 1 (1992), 149–219.
[4] American Psychological Association. 2019. Discrimination: What it is, and how to cope. https://www.apa.org/topics/discrimination (2019).
[5] Dario Amodei and Daniel Hernandez. 2018. AI and Compute. https://openai.com/blog/ai-and-compute/
[6] David Anthoff, Robert J Nicholls, and Richard SJ Tol. 2010. The economic impact of substantial sea-level rise. Mitigation and Adaptation Strategies for Global Change 15, 4 (2010), 321–335.
[7] Mikhail J Atallah, Victor Raskin, Christian F Hempelmann, Mercan Karahan, Radu Sion, Umut Topkara, and Katrina E Triezenberg. 2002. Natural Language Watermarking and Tamperproofing. In International Workshop on Information Hiding. Springer, 196–212.
[8] Alexei Baevski and Abdelrahman Mohamed. 2020. Effectiveness of Self-Supervised Pre-Training for ASR. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7694–7698.
[9] Michael Barera. 2020. Mind the Gap: Addressing Structural Equity and Inclusion on Wikipedia. (2020). Accessible at http://hdl.handle.net/10106/29572.
[10] Russel Barsh. 1990. Indigenous peoples, racism and the environment. Meanjin 49, 4 (1990), 723.
[11] Christine Basta, Marta R Costa-jussà, and Noe Casas. 2019. Evaluating the Underlying Gender Bias in Contextualized Word Embeddings. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. 33–39.
[12] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3615–3620. https://doi.org/10.18653/v1/D19-1371
[13] Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6 (2018), 587–604.
[14] Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5185–5198. https://doi.org/10.18653/v1/2020.acl-main.463
[15] Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Polity Press, Cambridge, UK.
[16] Elettra Bietti and Roxana Vatanparast. 2020. Data Waste. Harvard International Law Journal 61 (2020).
[17] Steven Bird. 2016. Social Mobile Technologies for Reconnecting Indigenous and Immigrant Communities. In People.Policy.Place Seminar. Northern Institute, Charles Darwin University, Darwin, Australia. https://www.cdu.edu.au/sites/default/files/the-northern-institute/ppp-bird-20160128-4up.pdf
[18] Abeba Birhane and Vinay Uday Prabhu. 2021. Large Image Datasets: A Pyrrhic Win for Computer Vision?. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1537–1547.
[19] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5454–5476. https://doi.org/10.18653/v1/2020.acl-main.485
[20] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Association for Computational Linguistics, Prague, Czech Republic, 858–867. https://www.aclweb.org/anthology/D07-1090
[21] Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial Filters of Dataset Biases. In Proceedings of the 37th International Conference on Machine Learning.
[22] Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 1664–1674. https://doi.org/10.18653/v1/D19-1176
[23] Susan E Brennan and Herbert H Clark. 1996. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition 22, 6 (1996), 1482.
[24] Robin Brewer and Anne Marie Piper. 2016. “Tell It Like It Really Is” A Case of Online Content Creation and Sharing Among Older Adult Bloggers. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 5529–5542.
[25] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[26] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model Compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Philadelphia, PA, USA) (KDD ’06). Association for Computing Machinery, New York, NY, USA, 535–541. https://doi.org/10.1145/1150402.1150464
[27] Robert D Bullard. 1993. Confronting environmental racism: Voices from the grassroots. South End Press.
[28] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting Training Data from Large Language Models. arXiv:2012.07805 [cs.CR]
[29] Herbert H. Clark. 1996. Using Language. Cambridge University Press, Cambridge.
[30] Herbert H. Clark and Adrian Bangerter. 2004. Changing ideas about reference. In Experimental Pragmatics. Springer, 25–49.
[31] Herbert H. Clark and Meredyth A Krych. 2004. Speaking while monitoring addressees for understanding. Journal of Memory and Language 50, 1 (2004), 62–81.
[32] Herbert H. Clark, Robert Schreuder, and Samuel Buttrick. 1983. Common ground at the understanding of demonstrative reference. Journal of Verbal Learning and Verbal Behavior 22, 2 (1983), 245–258. https://doi.org/10.1016/S0022-5371(83)90189-5
[33] Herbert H. Clark and Deanna Wilkes-Gibbs. 1986. Referring as a collaborative process. Cognition 22, 1 (1986), 1–39. https://doi.org/10.1016/0010-0277(86)90010-7
[34] Kimberlé Crenshaw. 1989. Demarginalizing the intersection of race and sex: A Black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. The University of Chicago Legal Forum (1989), 139.
[35] Benjamin Dangl. 2019. The Five Hundred Year Rebellion: Indigenous Movements and the Decolonization of History in Bolivia. AK Press.
[36] Christian Davenport. 2009. Media bias, perspective, and state repression: The Black Panther Party. Cambridge University Press.
[37] Ferdinand de Saussure. 1959. Course in General Linguistics. The Philosophical Society, New York. Translated by Wade Baskin.
[38] Terrance de Vries, Ishan Misra, Changhan Wang, and Laurens van der Maaten. 2019. Does object recognition work for everyone?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 52–59.
[39] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[40] Maeve Duggan. 2017. Online Harassment 2017. Pew Research Center.
[41] Jennifer Earl, Andrew Martin, John D. McCarthy, and Sarah A. Soule. 2004. The use of newspaper data in the study of collective action. Annual Review of Sociology 30 (2004), 65–80.
[42] Ethan Fast, Tina Vachovsky, and Michael Bernstein. 2016. Shirtless and Dangerous: Quantifying Linguistic Signals of Gender Bias in an Online Fiction Writing Community. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 10.
[43] William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961 [cs.LG]
[44] Anjalie Field, Doron Kliger, Shuly Wintner, Jennifer Pan, Dan Jurafsky, and Yulia Tsvetkov. 2018. Framing and Agenda-setting in Russian News: a Computational Analysis of Intricate Political Strategies. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 3570–3580. https://doi.org/10.18653/v1/D18-1393






[45] Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, and Jacqueline Wernimont (Eds.). 2018. Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Association for Computational Linguistics, Brussels, Belgium. https://www.aclweb.org/anthology/W18-5100
[46] Susan T Fiske. 2017. Prejudices in cultural contexts: shared stereotypes (gender, age) versus variable stereotypes (race, ethnicity, religion). Perspectives on psychological science 12, 5 (2017), 791–799.
[47] Antigoni Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 12.
[48] Batya Friedman and David Hendry. 2012. The Envisioning Cards: A Toolkit for Catalyzing Humanistic and Technical Imaginations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Austin, Texas, USA) (CHI ’12). Association for Computing Machinery, New York, NY, USA, 1145–1148. https://doi.org/10.1145/2207676.2208562
[49] Batya Friedman and David G. Hendry. 2019. Value Sensitive Design: Shaping Technology with Moral Imagination. MIT Press.
[50] Batya Friedman, Peter H. Kahn, Jr., and Alan Borning. 2006. Value sensitive design and information systems. In Human–Computer Interaction in Management Information Systems: Foundations, P Zhang and D Galletta (Eds.). M. E. Sharpe, Armonk NY, 348–372.
[51] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs.CL]
[52] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for Datasets. arXiv:1803.09010 [cs.DB]
[53] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3356–3369. https://doi.org/10.18653/v1/2020.findings-emnlp.301
[54] Wei Guo and Aylin Caliskan. 2020. Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases. arXiv preprint arXiv:2006.03955 (2020).
[55] Melissa Hart. 2004. Subjective decisionmaking and unconscious discrimination. Alabama Law Review 56 (2004), 741.
[56] Deborah Hellman. 2008. When is Discrimination Wrong? Harvard University Press.
[57] Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. 2020. Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. Journal of Machine Learning Research 21, 248 (2020), 1–43. http://jmlr.org/papers/v21/20-312.html
[58] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[59] Chao-Wei Huang and Yun-Nung Chen. 2019. Adapting Pretrained Transformer
[67] Brendan Kennedy, Drew Kogon, Kris Coombs, Joseph Hoover, Christina Park, Gwenyth Portillo-Wightman, Aida Mostafazadeh Davani, Mohammad Atari, and Morteza Dehghani. 2018. A typology and coding manual for the study of hate-based rhetoric. PsyArXiv. July 18 (2018).
[68] Gary Klein. 2007. Performing a project premortem. Harvard business review 85, 9 (2007), 18–19.
[69] Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019. Measuring Bias in Contextualized Word Representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. 166–172.
[70] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942 (2019).
[71] Amanda Lazar, Mark Diaz, Robin Brewer, Chelsea Kim, and Anne Marie Piper. 2017. Going gray, failure to hire, and the ick factor: Analyzing how older bloggers talk about ageism. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. 655–668.
[72] Christopher A Le Dantec, Erika Shehan Poole, and Susan P Wyche. 2009. Values as lived experience: evolving value sensitive design in support of value discovery. In Proceedings of the SIGCHI conference on human factors in computing systems. 1141–1150.
[73] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv:2006.16668 [cs.CL]
[74] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[75] Kadan Lottick, Silvia Susai, Sorelle A. Friedler, and Jonathan P. Wilson. 2019. Energy Usage Reports: Environmental awareness as part of algorithmic accountability. arXiv:1911.08354 [cs.LG]
[76] Mette Edith Lundsfryd. 2017. Speaking Back to a World of Checkpoints: Oral History as a Decolonizing Tool in the Study of Palestinian Refugees from Syria in Lebanon. Middle East Journal of Refugee Studies 2, 1 (2017), 73–95.
[77] Marianna Martindale and Marine Carpuat. 2018. Fluency Over Adequacy: A Pilot Study in Measuring User Trust in Imperfect MT. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track). Association for Machine Translation in the Americas, Boston, MA, 13–25. https://www.aclweb.org/anthology/W18-1803
[78] Sally McConnell-Ginet. 1984. The Origins of Sexist Language in Discourse. Annals of the New York Academy of Sciences 433, 1 (1984), 123–135.
[79] Sally McConnell-Ginet. 2020. Words Matter: Meaning and Power. Cambridge University Press.
[80] Kris McGuffie and Alex Newhouse. 2020. The Radicalization Risks of GPT-3 and Advanced Neural Language Models. Technical Report. Center on Terrorism, Extremism, and Counterterrorism, Middlebury Institute of International Studies at Monterrey. https://www.middlebury.edu/institute/sites/www.middlebury.edu.institute/files/2020-09/gpt3-article.pdf.
[81] Douglas M McLeod. 2007. News coverage and social protest: How the media's
      to Lattices for Spoken Language Understanding. In Proceedings of 2019 IEEE                       protect paradigm exacerbates social conflict. Journal of Dispute Resolution (2007),
      Workshop on Automatic Speech Recognition and Understanding (ASRU 2019).                          185.
      Sentosa, Singapore, 845–852.                                                                [82] Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning
 [60] Hongzhao Huang and Fuchun Peng. 2019. An Empirical Study of Efficient ASR                        Generic Context Embedding with Bidirectional LSTM. In Proceedings of The 20th
      Rescoring with Transformers. arXiv:1910.11450 [cs.CL]                                            SIGNLL Conference on Computational Natural Language Learning. Association
 [61] Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu                         for Computational Linguistics, Berlin, Germany, 51–61. https://doi.org/10.
      Zhong, and Stephen Denuyl. 2020. Social Biases in NLP Models as Barriers for                     18653/v1/K16-1006
      Persons with Disabilities. In Proceedings of the 58th Annual Meeting of the Associ-         [83] Julia Mendelsohn, Yulia Tsvetkov, and Dan Jurafsky. 2020. A Framework for the
      ation for Computational Linguistics. Association for Computational Linguistics,                  Computational Linguistic Analysis of Dehumanization. Frontiers in Artificial
      Online, 5491–5501. https://doi.org/10.18653/v1/2020.acl-main.487                                 Intelligence 3 (2020), 55. https://doi.org/10.3389/frai.2020.00055
 [62] Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: strategies for                    [84] Kaitlynn Mendes, Jessica Ringrose, and Jessalynn Keller. 2018. # MeToo and
      collecting sociocultural data in machine learning. In Proceedings of the 2020                    the promise and pitfalls of challenging rape culture through digital feminist
      Conference on Fairness, Accountability, and Transparency. 306–316.                               activism. European Journal of Women’s Studies 25, 2 (2018), 236–246.
 [63] Leslie Kay Jones. 2020. #BlackLivesMatter: An Analysis of the Movement as                   [85] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013.
      Social Drama. Humanity & Society 44, 1 (2020), 92–110.                                           Distributed Representations of Words and Phrases and Their Compositionality.
 [64] Leslie Kay Jones. 2020. Twitter wants you to know that you’re still                              In Proceedings of the 26th International Conference on Neural Information Pro-
      SOL if you get a death threat — unless you’re President Donald Trump.                            cessing Systems - Volume 2 (Lake Tahoe, Nevada) (NIPS’13). Curran Associates
      (2020). https://medium.com/@agua.carbonica/twitter-wants-you-to-know-                            Inc., Red Hook, NY, USA, 3111–3119.
      that-youre-still-sol-if-you-get-a-death-threat-unless-you-re-a5cce316b706.                  [86] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasser-
 [65] Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choud-                    man, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru.
      hury. 2020. The State and Fate of Linguistic Diversity and Inclusion in the                      2019. Model cards for model reporting. In Proceedings of the conference on
      NLP World. In Proceedings of the 58th Annual Meeting of the Association for                      fairness, accountability, and transparency. 220–229.
      Computational Linguistics. Association for Computational Linguistics, Online,               [87] Robert C. Moore and William Lewis. 2010. Intelligent Selection of Language
      6282–6293. https://doi.org/10.18653/v1/2020.acl-main.560                                         Model Training Data. In Proceedings of the ACL 2010 Conference Short Papers.
 [66] Nurul Shamimi Kamaruddin, Amirrudin Kamsin, Lip Yee Por, and Hameedur                            Association for Computational Linguistics, Uppsala, Sweden, 220–224. https:
      Rahman. 2018. A Review of Text Watermarking: Theory, Methods, and Applica-                       //www.aclweb.org/anthology/P10-2041
      tions. IEEE Access 6 (2018), 8011–8028. https://doi.org/10.1109/ACCESS.2018.                [88] Kevin L. Nadal. 2018. Microaggressions and Traumatic Stress: Theory, Research,
      2796585                                                                                          and Clinical Treatment. American Psychological Association. https://books.
                                                                                                       google.com/books?id=ogzhswEACAAJ




 [89] Clifford Nass, Jonathan Steuer, and Ellen R Tauber. 1994. Computers are social actors. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 72–78.
 [90] Lisa P. Nathan, Predrag V. Klasnja, and Batya Friedman. 2007. Value Scenarios: A Technique for Envisioning Systemic Effects of New Technologies. In CHI’07 Extended Abstracts on Human Factors in Computing Systems. ACM, 2585–2590.
 [91] Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. 2020. Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 2144–2160. https://doi.org/10.18653/v1/2020.findings-emnlp.195
 [92] Maggie Nelson. 2015. The Argonauts. Graywolf Press, Minneapolis.
 [93] Timothy Niven and Hung-Yu Kao. 2019. Probing Neural Network Comprehension of Natural Language Arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4658–4664. https://doi.org/10.18653/v1/P19-1459
 [94] Safiya Umoja Noble. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.
 [95] Debora Nozza, Federico Bianchi, and Dirk Hovy. 2020. What the [MASK]? Making Sense of Language-Specific BERT Models. arXiv:2003.02912 [cs.CL]
 [96] David Ortiz, Daniel Myers, Eugene Walls, and Maria-Elena Diaz. 2005. Where do we stand with newspaper data? Mobilization: An International Quarterly 10, 3 (2005), 397–419.
 [97] Charlotte Pennington, Derek Heim, Andrew Levy, and Derek Larkin. 2016. Twenty Years of Stereotype Threat Research: A Review of Psychological Mediators. PLoS ONE 11 (01 2016), e0146487. https://doi.org/10.1371/journal.pone.0146487
 [98] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1532–1543. https://doi.org/10.3115/v1/D14-1162
 [99] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 2227–2237. https://doi.org/10.18653/v1/N18-1202
[100] Pew. 2018. Internet/Broadband Fact Sheet. (2 2018). https://www.pewinternet.org/fact-sheet/internet-broadband/
[101] Aidan Pine and Mark Turin. 2017. Language Revitalization. Oxford Research Encyclopedia of Linguistics.
[102] Francesca Polletta. 1998. Contending stories: Narrative in social movements. Qualitative Sociology 21, 4 (1998), 419–446.
[103] Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation Sensitivity Analysis to Detect Unintended Model Biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5740–5745. https://doi.org/10.18653/v1/D19-1578
[104] Laura Pulido. 2016. Flint, environmental racism, and racial capitalism. Capitalism Nature Socialism 27, 3 (2016), 1–16.
[105] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained Models for Natural Language Processing: A Survey. arXiv:2003.08271 [cs.CL]
[106] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[107] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
[108] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264
[109] Sarah T. Roberts, Joel Tetreault, Vinodkumar Prabhakaran, and Zeerak Waseem (Eds.). 2019. Proceedings of the Third Workshop on Abusive Language Online. Association for Computational Linguistics, Florence, Italy. https://www.aclweb.org/anthology/W19-3500
[110] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics 8 (2021), 842–866.
[111] Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proc. IEEE 88, 8 (2000), 1270–1278.
[112] Corby Rosset. 2020. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Blog (2020).
[113] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[114] Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020. Social Bias Frames: Reasoning about Social and Power Implications of Language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5477–5490. https://doi.org/10.18653/v1/2020.acl-main.486
[115] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM 63, 12 (Nov. 2020), 54–63. https://doi.org/10.1145/3381831
[116] Sabine Sczesny, Janine Bosak, Daniel Neff, and Birgit Schyns. 2004. Gender stereotypes and the attribution of leadership traits: A cross-cultural comparison. Sex Roles 51, 11-12 (2004), 631–645.
[117] Claude Elwood Shannon. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana.
[118] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2019. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. arXiv:1909.05840 [cs.CL]
[119] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Babysitter: On Biases in Language Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3407–3412. https://doi.org/10.18653/v1/D19-1339
[120] Katie Shilton, Jes A Koepfler, and Kenneth R Fleischmann. 2014. How to see values in social computing: methods for studying values dimensions. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. 426–435.
[121] Joonbo Shin, Yoonhyung Lee, and Kyomin Jung. 2019. Effective Sentence Scoring Method Using BERT for Speech Recognition. In Asian Conference on Machine Learning. 1081–1093.
[122] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053 (2019).
[123] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. 2019. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203 (2019).
[124] Karen Spärck Jones. 2004. Language modelling’s generative model: Is it rational? Technical Report. Computer Laboratory, University of Cambridge.
[125] Robyn Speer. 2017. ConceptNet Numberbatch 17.04: better, less-stereotyped word vectors. (2017). Blog post, https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/.
[126] Steven J. Spencer, Christine Logel, and Paul G. Davies. 2016. Stereotype Threat. Annual Review of Psychology 67, 1 (2016), 415–437. https://doi.org/10.1146/annurev-psych-073115-103235 PMID: 26361054.
[127] Katrina Srigley and Lorraine Sutherland. 2019. Decolonizing, Indigenizing, and Learning Biskaaybiiyang in the Field: Our Oral History Journey. The Oral History Review (2019).
[128] Greg J. Stephens, Lauren J. Silbert, and Uri Hasson. 2010. Speaker–listener neural coupling underlies successful communication. Proceedings of the National Academy of Sciences 107, 32 (2010), 14425–14430. https://doi.org/10.1073/pnas.1008662107
[129] Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3645–3650.
[130] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced Representation through Knowledge Integration. arXiv:1904.09223 [cs.CL]
[131] Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational
      Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 8968–8975. https://aaai.org/ojs/index.php/AAAI/article/view/6428
[132] Yi Chern Tan and L Elisa Celis. 2019. Assessing social and intersectional biases in contextualized word representations. In Advances in Neural Information Processing Systems. 13230–13241.
[133] Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4593–4601. https://doi.org/10.18653/v1/P19-1452
[134] Trieu H. Trinh and Quoc V. Le. 2019. A Simple Method for Commonsense Reasoning. arXiv:1806.02847 [cs.AI]
[135] Marlon Twyman, Brian C Keegan, and Aaron Shaw. 2017. Black Lives Matter in Wikipedia: Collective memory and collaboration around online social movements. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. 1400–1412.
[136] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[137] Rob Voigt, David Jurgens, Vinodkumar Prabhakaran, Dan Jurafsky, and Yulia Tsvetkov. 2018. RtGender: A Corpus for Studying Differential Responses to Gender. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1445
[138] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Brussels, Belgium, 353–355. https://doi.org/10.18653/v1/W18-5446
[139] Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. In Proceedings of the First Workshop on Abusive Language Online. Association for Computational Linguistics, Vancouver, BC, Canada, 78–84. https://doi.org/10.18653/v1/W17-3012
[140] Joseph Weizenbaum. 1976. Computer Power and Human Reason: From Judgment to Calculation. WH Freeman & Co.
[141] Monnica T Williams. 2019. Psychology Cannot Afford to Ignore the Many Harms Caused by Microaggressions. Perspectives on Psychological Science 15 (2019), 38–43.
[142] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
[143] World Bank. 2018. Individuals Using the Internet. (2018). https://data.worldbank.org/indicator/IT.NET.USER.ZS?end=2017&locations=US&start=2015
[144] Shijie Wu and Mark Dredze. 2020. Are All Languages Created Equal in Multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP. Association for Computational Linguistics, Online, 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16
[145] Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation. arXiv preprint arXiv:2001.11314 (2020).
[146] Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020. BERT-of-Theseus: Compressing BERT by Progressive Module Replacing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 7859–7869. https://doi.org/10.18653/v1/2020.emnlp-main.633
[147] Peng Xu, Chien-Sheng Wu, Andrea Madotto, and Pascale Fung. 2019. Clickbait? Sensational Headline Generation with Auto-tuned Reinforcement Learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3065–3075. https://doi.org/10.18653/v1/D19-1303
[148] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv:2010.11934 [cs.CL]
[149] Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-End Open-Domain Question Answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Association for Computational Linguistics, Minneapolis, Minnesota, 72–77. https://doi.org/10.18653/v1/N19-4013
[150] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems. 5753–5763.
[151] Ze Yang, Can Xu, Wei Wu, and Zhoujun Li. 2019. Read, Attend and Comment: A Deep Architecture for Automatic News Comment Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5077–5089. https://doi.org/10.18653/v1/D19-1512
[152] Meg Young, Lassana Magassa, and Batya Friedman. 2019. Toward Inclusive Tech Policy Design: A Method for Underrepresented Voices to Strengthen Tech Policy Documents. Ethics and Information Technology (2019), 1–15.
[153] Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. arXiv:1910.06188 [cs.CL]
[154] Nico Zazworka, Rodrigo O. Spínola, Antonio Vetrò, Forrest Shull, and Carolyn Seaman. 2013. A Case Study on Effectively Identifying Technical Debt. In Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering (Porto de Galinhas, Brazil) (EASE ’13). Association for Computing Machinery, New York, NY, USA, 42–47. https://doi.org/10.1145/2460999.2461005
[155] Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 93–104. https://doi.org/10.18653/v1/D18-1009
[156] Haoran Zhang, Amy X Lu, Mohamed Abdalla, Matthew McDermott, and Marzyeh Ghassemi. 2020. Hurtful words: quantifying biases in clinical contextual word embeddings. In Proceedings of the ACM Conference on Health, Inference, and Learning. 110–120.
[157] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender Bias in Contextualized Word Embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 629–634. https://doi.org/10.18653/v1/N19-1064
[158] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. Computational Linguistics 46, 1 (March 2020), 53–93. https://doi.org/10.1162/coli_a_00368


ACKNOWLEDGMENTS
This paper represents the work of seven authors, but some were required by their employer to remove their names. The remaining listed authors are extremely grateful to our colleagues for the effort and wisdom they contributed to this paper.
   In addition, in drafting and revising this paper, we benefited from thoughtful comments and discussion from many people: Alex Hanna, Amandalynne Paullada, Ben Hutchinson, Ben Packer, Brendan O’Connor, Dan Jurafsky, Ehud Reiter, Emma Strubell, Emily Denton, Gina-Anne Levow, Iason Gabriel, Jack Clark, Kristen Howell, Lucy Vasserman, Maarten Sap, Mark Díaz, Miles Brundage, Nick Doiron, Rob Munro, Roel Dobbe, Samy Bengio, Suchin Gururangan, Vinodkumar Prabhakaran, William Agnew, William Isaac, and Yejin Choi and our anonymous reviewers.