Auditing for Discrimination in Algorithms Delivering Job Ads

Basileal Imana, University of Southern California, Los Angeles, CA, USA
Aleksandra Korolova, University of Southern California, Los Angeles, CA, USA
John Heidemann, USC/Information Sciences Institute, Los Angeles, CA, USA

ABSTRACT

Ad platforms such as Facebook, Google and LinkedIn promise value for advertisers through their targeted advertising. However, multiple studies have shown that ad delivery on such platforms can be skewed by gender or race due to hidden algorithmic optimization by the platforms, even when not requested by the advertisers. Building on prior work measuring skew in ad delivery, we develop a new methodology for black-box auditing of algorithms for discrimination in the delivery of job advertisements. Our first contribution is to identify the distinction between skew in ad delivery due to protected categories such as gender or race and skew due to differences in qualification among people in the targeted audience. This distinction is important in U.S. law, where ads may be targeted based on qualifications, but not on protected categories. Second, we develop an auditing methodology that distinguishes skew explainable by differences in qualifications from skew due to other factors, such as the ad platform's optimization for engagement or training its algorithms on biased data. Our method controls for job qualification by comparing ad delivery of two concurrent ads for similar jobs, but for a pair of companies with different de facto gender distributions of employees. We describe the careful statistical tests that establish evidence of non-qualification skew in the results. Third, we apply our proposed methodology to two prominent targeted advertising platforms for job ads: Facebook and LinkedIn. We confirm skew by gender in ad delivery on Facebook, and show that it cannot be justified by differences in qualifications. We fail to find skew in ad delivery on LinkedIn. Finally, we suggest improvements to ad platform practices that could make external auditing of their algorithms in the public interest more feasible and accurate.

CCS CONCEPTS

• Social and professional topics → Technology audits; Employment issues; Socio-technical systems; Systems analysis and design.

ACM Reference Format:
Basileal Imana, Aleksandra Korolova, and John Heidemann. 2021. Auditing for Discrimination in Algorithms Delivering Job Ads. In Proceedings of the Web Conference 2021 (WWW '21), April 19-23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442381.3450077

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19-23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3450077

1 INTRODUCTION

Digital platforms and social networks have become popular means for advertising to users. These platforms provide many mechanisms that enable advertisers to target a specific audience, i.e., specify the criteria that the member to whom an ad is shown should satisfy. Based on the advertiser's chosen parameters, the platforms employ optimization algorithms to decide who sees which ad and the advertiser's payments.

Ad platforms such as Facebook and LinkedIn use an automated algorithm to deliver ads to a subset of the targeted audience. Every time a member visits their site or app, the platforms run an ad auction among advertisers who are targeting that member. In addition to the advertiser's chosen parameters, such as a bid or budget, the auction takes into account an ad relevance score, which is based on the ad's predicted engagement level and value to the user. For example, from LinkedIn's documentation [36]: "scores are calculated ... based on your predicted campaign performance and the predicted performance of top campaigns competing for the same audience." Relevance scores are computed by ad platforms using algorithms; both the algorithms and the inputs they consider are proprietary. We refer to the algorithmic process run by platforms to determine who sees which ad as ad delivery optimization.

Prior work has hypothesized that ad delivery optimization plays a role in skewing recipient distribution by gender or race even when the advertiser targets their ad inclusively [15, 30, 52, 54]. This hypothesis was confirmed, at least for Facebook, in a recent study [2], which showed that for jobs such as lumberjack and taxi driver, Facebook delivered ads to audiences skewed along gender and racial lines, even when the advertiser was targeting a gender- and race-balanced audience. The Facebook study [2] established that the skew is not due to advertiser targeting or competition from other advertisers, and hypothesized that it could stem from the proprietary ad delivery algorithms trained on biased data optimizing for the platform's objectives (§2.1).

Our work focuses on developing an auditing methodology for measuring skew in the delivery of job ads, an area where U.S. law prohibits discrimination based on certain attributes [57, 59]. We focus on expanding the prior auditing methodology of [2] to bridge the gap between audit studies that demonstrate that a platform's ad delivery algorithm results in skewed delivery and studies that provide evidence that the skewed delivery is discriminatory, thus bringing the set of audit studies one step closer to potential use by regulators to enforce the law in practice [14]. We identify one such gap in the context of job advertisements, controlling for bona fide occupational qualifications [59], and develop a methodology to address it.
We focus on designing a methodology that assumes no special access beyond what a regular advertiser sees, because we believe that auditing of ad platforms in the public interest needs to be possible by third parties, and society should not depend solely on the limited capabilities of federal commissions or self-policing by the platforms.

Our first contribution is to examine how the occupational qualification of an ad's audience affects the legal liability an ad platform might incur with respect to discriminatory advertising (§2). Building upon legal analysis in prior work [14], we make an additional distinction between skew that is due to a difference in occupational qualifications among the members of the targeted ad audience, and skew that is due to (implicit or explicit use of) protected categories such as gender or race by the platform's algorithms. This distinction is relevant because U.S. law allows differential delivery that is justified by differences in qualifications [59], an argument that platforms are likely to use to defend themselves against legal liability when presented with evidence from audit studies such as [2, 15, 30, 52, 54].

Our second contribution is to propose a novel auditing methodology (§4) that distinguishes a delivery skew that could result from the ad delivery algorithm merely incorporating job qualifications of the members of the targeted ad audience from skew due to other algorithmic choices that correlate with gender or racial factors but are not related to qualifications. Like the prior study of Facebook [2], to isolate the role of the platform's algorithms we control for factors extraneous to the platform's ad delivery choices, such as the demographics of people online during an ad campaign's run, advertisers' targeting, and competition from other advertisers. Unlike prior work, our methodology relies on simultaneously running paired ads for several jobs that have similar qualification requirements but have a skewed de facto (gender) distribution. By "skewed de facto distribution", we refer to existing societal circumstances that are reflected in the skewed (gender) distribution of employees. An example of such a pair of ads is a delivery driver job at Domino's (a pizza chain) and at Instacart (a grocery delivery service). Both jobs have similar qualification requirements, but one is de facto skewed male (pizza delivery) and the other skewed female (grocery delivery) [17, 50]. Comparing the delivery of ads for such pairs of jobs ensures that any skew we may observe cannot be attributed to differences in qualification among the underlying audience.

Our third contribution is to show that our proposed methodology distinguishes between the behavior of ad delivery algorithms of different real-world ad platforms, and identifies those whose delivery skew may be going beyond what is justifiable on the basis of qualifications, and thus may be discriminatory (§5). We demonstrate this by registering as advertisers and running job ads for real employment opportunities on two platforms, Facebook and LinkedIn. We apply the same auditing methodology to both platforms and observe contrasting results that show statistically significant gender skew in the case of Facebook, but not LinkedIn.

We conclude by providing recommendations for changes that could make auditing of ad platforms more accessible, efficient and accurate for public interest researchers (§6.2).

2 PROBLEM STATEMENT

Our goal is to develop a novel methodology that measures skew in ad delivery that is not justifiable on the basis of differences in job qualification requirements in the targeted audience. Before we focus on qualification, we first enumerate the different potential sources of skew that need to be taken into consideration when measuring the role of the ad delivery algorithms. We then discuss how U.S. law may treat qualification as a legitimate cause for skewed ad delivery.

We refer to algorithmic decisions by ad platforms that result in members of one group being over- or under-represented among the ad recipients as "skew in ad delivery". We consider groups that have been identified as legally protected (such as gender, age, race). We set the baseline population for measuring skew as the qualified and available ad platform members targeted by the campaign (see §4.4 for a quantitative definition).

2.1 Potential Sources of Skew

Our main challenge is to isolate the role of the platform's algorithms in creating skew from other factors that affect ad delivery and may be used to explain away any observed skew. This is a challenge for a third-party auditor because they investigate the platform's algorithms as a black box, without access to the code or inputs of the algorithm, or access to the data or behavior of platform members or advertisers. We assume that the auditor has access only to ad statistics provided by the platform.

Targeted advertising consists of two high-level steps. The advertiser creates an ad, specifies its target audience, campaign budget, and the advertiser's objective. The platform then delivers the ad to its users after running an auction among advertisers targeting those users. We identify four categories of factors that may introduce skew into this process:

First, an advertiser can select targeting parameters and an audience that induce skew. Prior work [5, 6, 52, 55, 62] has shown that platforms expose targeting options that advertisers can use to create discriminatory ad targeting. Recent changes in platforms have tried to disable such options [22, 47, 53].

Second, an ad platform can make choices in its ad delivery optimization algorithm to maximize ad relevance, engagement, advertiser satisfaction, revenue, or other business objectives, which can implicitly or explicitly result in a skew. As one example, if an image used in an ad receives better engagement from a certain demographic, the platform's algorithm may learn this association and preferentially show the ad with that image to the subset of the targeted audience belonging to that demographic [2]. As another example, for a job ad, the algorithm may aim to show the ad to users whose professional backgrounds better match the job ad's qualification requirements. If the targeted population of qualified individuals is skewed along demographic characteristics, the platform's algorithm may propagate this skew in its delivery.

Third, an advertiser's choice of objective can cause a skew. Ad platforms such as LinkedIn and Facebook support advertiser objectives such as reach and conversion.
Reach indicates the advertiser wants their ad to be shown to as many people as possible in their target audience, while for conversion the advertiser wants as many ad recipients as possible to take some action, such as clicking through to their site [20, 38]. Different demographic groups may have different propensities to take specific actions, so a conversion objective can implicitly cause skewed delivery. When the platform's implementation of the advertiser's objective results in a discriminatory skew, the responsibility for it can be a matter of dispute (see §2.2).

Finally, there may be other confounding factors that are not under direct control of a particular advertiser or the platform leading to skew, such as differing sign-on rates across demographics, time-of-day effects, and differing rates of advertiser competition for users from different demographics. For example, delivery of an ad may be skewed towards men because more men were online during the run of the ad campaign, or because competing advertisers were bidding higher to reach the women in the audience than to reach the men [2, 18, 30].

In our work, we focus on isolating skew that results from an ad delivery algorithm's optimization (the second factor). Since we are studying job ads, we are interested in further distinguishing skew due to an algorithm that incorporates qualification in its optimization from skew that is due to an algorithm that perpetuates societal biases without a justification grounded in qualifications. We are also interested in how job ad delivery is affected by the objective chosen by the advertiser (the third factor). We discuss our methodology for achieving these goals in §4.

2.2 Discriminatory Job Ads and Liability

Building on a legal analysis in prior work [14], we next discuss how U.S. anti-discrimination law may treat job qualification requirements, optimization objectives, and other factors that can cause skew, and discuss how the applicability of the law informs our methodology design.

Our work is unique in underscoring the implications of qualification when evaluating potential legal liability ad platforms may incur due to skewed job ad delivery. We also draw attention to the nuances in analyzing the implications of the optimization objective an advertiser chooses. We focus on Title VII, a U.S. law which prohibits preferential or discriminatory employment advertising practices using attributes such as gender or race [59].

Title VII allows entities who advertise job opportunities to legally show preference based on bona fide occupational qualifications [59], which are requirements necessary to carry out a job function. Thus, it is conceivable that a platform such as Facebook can use this exception to legally argue that the skew arising from its ad delivery optimization established by [2] does not violate the law, since its ad delivery algorithm merely takes into account qualifications. Therefore, our goal is to design an auditing methodology that can distinguish between ad delivery optimization resulting in skew due to the ad platform's use of qualifications and skew due to other algorithmic choices by the platform. Through making this distinction we can eliminate the possibility of the platform using qualification as a legal argument against being held liable for discriminatory outcomes when this argument does not apply. This contribution is what primarily sets us apart from prior work. It also brings findings from audit studies such as ours a step closer to having the potential to be used by regulators to enforce the law in practice.

As discussed in §2.1, the objective an advertiser chooses can also be a source of skew, particularly for the conversion objective, if different demographic groups tend to engage differently. When the advertiser-chosen objective, as implemented by the platform, results in discriminatory delivery, who bears the legal responsibility may be unclear. On one hand, the advertiser (perhaps unknowingly or implicitly) requested the outcome, and if that choice created a discriminatory outcome, some prior analysis [14] suggests the platform may be protected under Section 230 of the Communications Decency Act, a U.S. law that provides ad platforms with immunity from content published by advertisers [60]. On the other hand, one may argue that the platform should be aware of the risk of skew for job ads when optimizing for conversions, and therefore should prevent it, or deny advertisers the option to select this objective, just as they must prevent explicit discriminatory targeting by the advertiser. Our work does not advocate a position on the legal question, but provides data (§5.2) about outcomes that shows the implications of the choice of objective.

In addition to the optimization objective, other confounding sources of skew (§2.1) may have implications for legal liability. The prior legal analysis of Google's ad platform evaluated the applicability of Section 230 to different sources of skew, and argued Google may not be protected by Section 230 if a skew is fully a product of Google's algorithms [14]. Similarly, our goal is to design a methodology that controls for confounding factors and isolates skew that arises solely due to choices made by the platform's ad delivery algorithms.

3 BACKGROUND

We next highlight relevant details about the ad platforms to which we apply our methodology and discuss related work.

3.1 LinkedIn and Facebook Ad Platforms

We give details about LinkedIn's and Facebook's advertising platforms that are relevant to our methodology.

Ad objective: LinkedIn and Facebook advertisers purchase ads to meet different marketing objectives. As of February 2021, both LinkedIn and Facebook have three types of objectives: awareness, consideration and conversion, and each type has multiple additional options [20, 38]. For both platforms, the chosen objective constrains the ad format, bidding strategy and payment options available to the advertiser.

Ad audience: On both platforms, advertisers can target an audience using targeting attributes such as geographic location, age and gender. But if the advertiser discloses they are running a job ad, the platforms disable or limit targeting by age and gender [47]. LinkedIn, being a professional network, also provides targeting by job title, education, and job experience.

In addition, advertisers on both platforms can upload a list of known contacts to create a custom audience (called "Matched Audience" on LinkedIn and "Custom Audience" on Facebook).
On LinkedIn, contacts can be specified by first and last name or e-mail address. Facebook allows specification by many more fields, such as zip code and phone number. The ad platforms then match the uploaded list with profile information from LinkedIn or Facebook accounts.

Ad performance report: Both LinkedIn and Facebook provide ad performance reports through their website interface and via their marketing APIs [21, 37]. These reports reflect near real-time campaign performance results such as the number of clicks and impressions the ad received, broken down along different axes. The categories of information along which aggregate breakdowns are available differ among platforms. Facebook's reports break down performance data by location, age, and gender, while LinkedIn gives breakdowns by location, job title, industry and company, but not by age or gender.

3.2 Related Work

Targeted advertising has become ubiquitous, playing a significant role in shaping information and access to opportunities for hundreds of millions of users. Because the domains of employment, housing, and credit have legal anti-discrimination protections in the U.S. [11, 12, 58], the study of ad platforms' role in shaping access and exposure to those opportunities has been of particular interest in civil rights discourse [31, 32] and research. We discuss such work next.

Discriminatory ad targeting: Several recent studies consider discrimination in ad targeting: journalists at ProPublica were among the first to show that Facebook's targeting options enabled job and housing advertisers to discriminate by age [6], race [5] and gender [55]. In response to these findings, and as part of a settlement agreement to a legal challenge [1], Facebook has made changes to restrict the targeting capabilities offered to advertisers for ads in legally protected domains [22, 47]. Other ad platforms, e.g. Google, have announced similar restrictions [53]. The question of whether these restrictions are sufficient to stop an ill-intentioned advertiser from discrimination remains open, as studies have shown that advanced features of ad platforms, such as custom and lookalike audiences, can be used to run discriminatory ads [23, 49, 52, 62]. Our work assumes a well-intentioned advertiser and performs an audit study using gender-balanced targeting.

Discriminatory ad delivery: In addition to the targeting choices by advertisers, researchers have hypothesized that discriminatory outcomes can be a result of platform-driven choices. In 2013, Sweeney's empirical study found a statistically significant difference between the likelihood of seeing an ad suggestive of an arrest record on Google when searching for people's names assigned primarily to black babies compared to white babies [54]. Datta et al. [15] found that the gender of a Google account influences the number of ads one sees related to high-paying jobs, with female accounts seeing fewer such ads. Both studies could not examine the causes of such outcomes, as their methodology did not have an ability to isolate the role of the platform's algorithm from other possibly contributing factors, such as competition from advertisers and user activity. Gelauff et al. [24] provide an empirical study of the challenges of advertising to a demographically balanced ad audience without using micro-targeting and in the presence of ad delivery optimization. Lambrecht et al. [30] perform a field test promoting job opportunities in STEM using targeting that was intended to be gender-neutral, find that their ads were shown to more men than women, and explore potential explanations for this outcome. Finally, recent work by Ali and Sapiezynski et al. [2] has demonstrated that their job and housing ads placed on Facebook are delivered skewed by gender and race, even when the advertiser targets a gender- and race-balanced audience, and that this skew results from choices of Facebook's ad delivery algorithm, and is not due to market or user interaction effects. AlgorithmWatch [27] replicates these findings with European user audiences, and adds an investigation of Google's ad delivery for jobs. Our work is motivated by these studies, confirming results on Facebook and performing the first study we are aware of for LinkedIn. Going a step further to distinguish between skewed and discriminatory delivery, we propose a new methodology to control for user qualifications, a factor not accounted for in prior work but that is critical for evaluating whether skewed delivery of job ads is, in fact, discriminatory. We build on prior work exploring ways in which discrimination may arise in job-related advertising and assessing the legal liability of ad platforms [14], to establish that the job ad delivery algorithms of Facebook may be violating U.S. anti-discrimination law.

Auditing algorithms: The proprietary nature of ad platforms, algorithms, and their underlying data makes it difficult to definitively establish the role platforms and their algorithms play in the creation of discriminatory outcomes [4, 8-10, 46]. For advertising, in addition to the previously described studies, recent efforts have explored the possibility of auditing with data provided by Facebook through its public Ad Library [51] (created in response to a legal settlement [1]). Other works have focused on approaches that rely on sock-puppet account creation [7, 33]. Our work uses only ad delivery statistics that platforms provide to regular advertisers. This approach makes us less reliant on the platform's willingness to be audited. We do not rely on transparency data from platforms, since it is often limited and insufficient for answering questions about the platform's role in discrimination [40]. We also do not rely on an ability to create user accounts on the platform, since experimental accounts are labor-intensive to create and disallowed by most platforms' policies. We build on prior work of external auditing [2, 3, 14, 15, 48, 64]. We show that auditing for discrimination in ad delivery of job ads is possible, even when limited to capabilities available to a regular advertiser, and that one can carefully control for confounding factors.

Auditing LinkedIn: To our knowledge, the only work that has studied LinkedIn's ad system's potential for discrimination is that of Venkatadri and Mislove [62]. Their work demonstrates that compositions of multiple targeting options together can result in targeting that is skewed by age and gender, without explicitly targeting using those attributes. They suggest mitigations should be based not on disallowing individual targeting parameters, but on evaluating the overall outcome of the composition of targetings specified by an advertiser.
We agree with this goal, and go beyond this prior work by basing our evaluation on the outcome of ad delivery, measuring delivery of real-world ads, and contrasting outcomes on LinkedIn with Facebook's.

4 AUDITING METHODOLOGY

We next describe the methodology we propose to audit ad delivery algorithms for potential discrimination.

Our approach consists of three steps. First, we use the advertising platform's custom audience feature (§4.1) to build an audience that allows us to infer gender of the ad recipients for platforms that do not provide ad delivery statistics along gender lines. Second, we develop a novel methodology that controls for job qualifications by carefully selecting job categories (§4.2) for which everyone in the audience is equally qualified (or not qualified), yet for which there are distinctions in the real-world gender distributions of employees in the companies. We then run paired ads concurrently for each job category and use statistical tests to evaluate whether the ad delivery results are skewed (§4.3).

Our lack of access to users' profile data, interest or browsing activity prevents us from directly testing whether ad delivery satisfies metrics of fairness commonly used in the literature, such as equality of opportunity [25], or those recently proposed for ad allocation tasks where users have diverse preferences over outcomes, such as preference-informed individual fairness [28]. In our context of job ads, equality of opportunity means that an individual in a demographic group that is qualified for a job should get a positive outcome (in our case: see an ad) at equal rates compared to an equally qualified individual in another demographic group. While our methodology does not test for this metric, we indirectly account for qualification in the way we select which job categories we run ads for.

We only describe a methodology for studying discrimination in ad delivery along gender lines, but we believe our methodology can be generalized to audit along other attributes such as race and age by an auditor with access to the auxiliary data that is needed for picking appropriate job categories.

4.1 Targeted Audience Creation

Unlike Facebook, LinkedIn does not give a gender breakdown of ad impressions, but reports recipients' locations at the county level. As a workaround, we rely on an approach introduced in prior work [2, 3] that uses ad recipients' location to infer gender.

To construct our ad audience, we use North Carolina's voter record dataset [44], which among other fields includes each voter's name, zip code, county, gender, race and age. We divide all the counties in North Carolina into two halves. We construct our audience by including only male voters from counties in the first half, and only female voters from counties in the second half (this data is limited to a gender binary, so our research follows). If a person from the first half of the counties is reported as having seen an ad, we can infer that the person is a male, and vice versa. Furthermore, we include a roughly equal number of people from each gender in the targeting because we are interested in measuring skew that results from the delivery algorithm, not the advertiser's targeting choices.
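To make this construction concrete, the sketch below shows one way to derive such a county-split audience from a voter file. It is a minimal illustration, assuming a hypothetical CSV schema (county, gender, first_name, last_name) and an alphabetical county split; the exact procedure, file format, and split we used are not specified at this level of detail.

```python
import csv
from collections import defaultdict

# Sketch: build a county-split audience from a voter-file CSV.
# Column names and the alphabetical split are illustrative assumptions.

def build_audience(voter_file, per_gender_cap=450_000):
    by_county = defaultdict(list)
    with open(voter_file, newline="") as f:
        for row in csv.DictReader(f):
            by_county[row["county"]].append(row)

    counties = sorted(by_county)              # split counties into two halves
    half = len(counties) // 2
    male_counties = set(counties[:half])      # keep only male voters here
    female_counties = set(counties[half:])    # keep only female voters here

    males, females = [], []
    for county, voters in by_county.items():
        for v in voters:
            if county in male_counties and v["gender"] == "M":
                males.append((v["first_name"], v["last_name"]))
            elif county in female_counties and v["gender"] == "F":
                females.append((v["first_name"], v["last_name"]))

    # Include a roughly equal number of names of each gender, so any
    # delivery skew cannot come from our own targeting choices.
    cap = min(len(males), len(females), per_gender_cap)
    return males[:cap], females[:cap], male_counties
```

Uploading the resulting name lists as a single Matched Audience (LinkedIn) or Custom Audience (Facebook) then lets a per-county delivery report stand in for a gender breakdown; swapping the roles of the two county halves yields the additional "f" partitions described next.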
Table 1: Audiences used in our study.

ID      | Size    | Males   | Females | Match Rate
Aud #0  | 954,714 | 477,129 | 477,585 | 11.83%
Aud #1  | 900,000 | 450,000 | 450,000 | 11.6%
Aud #2  | 950,000 | 450,000 | 500,000 | 11.8%
Aud #0f | 850,000 | 450,000 | 400,000 | 11.88%
Aud #1f | 800,000 | 400,000 | 400,000 | 12.51%
Aud #2f | 790,768 | 390,768 | 400,000 | 12.39%

To evaluate experimental reproducibility without introducing test-retest bias, we repeat our experiments across different, but equivalent, audience partitions. Table 1 gives a summary of the partitions we used. Aud #0, Aud #1 and Aud #2 are partitions whose size is approximately a quarter of the full audience, while Aud #0f, Aud #1f and Aud #2f are constructed by swapping the choice of gender by county. Swapping genders this way doubles the number of partitions we can use.

On both LinkedIn and Facebook, the information we upload is used to find exact matches with information on user profiles. We upload our audience partitions to LinkedIn in the form of first and last names. For Facebook, we also include zip codes, because their tool for uploading audiences notified us that the match rate would be too low when building audiences only on the basis of first and last names. The final targeted ad audience is a subset of the audience we upload, because not all the names will be matched, i.e., will correspond to an actual user of a platform. As shown in Table 1, for each audience partition, close to 12% of the uploaded names were matched with accounts on LinkedIn. Facebook does not report the final match rates for our audiences, in order to protect user privacy.

To avoid self-interference between our ads over the same audience, we run paired ads concurrently, but ads for different job categories or for different objectives sequentially. In addition, to avoid test-retest bias, where a platform learns from prior experiments who is likely to respond and applies that to subsequent experiments, we generally use different (but equivalent) target audiences.

4.2 Controlling for Qualification

The main goal of our methodology is to distinguish skew resulting from algorithmic choices that are not related to qualifications from skew that can be justified by differences in user qualifications for the jobs advertised. A novel aspect of our methodology is to control for qualifications by running paired ads for jobs with similar qualification requirements but skewed de facto gender distributions. We measure skew by comparing the relative difference between the delivery of a pair of ads that run concurrently, targeting the same audience. Each test uses paired jobs that meet two criteria: First, they must have similar qualification requirements, thus ensuring that the people that we target our ads with are equally qualified (or not qualified) for both job ads. Second, the jobs must exhibit a skewed, de facto gender distribution in the real world, as shown through auxiliary data.
Since both jobs require similar qualifications, our assumption is that on a platform whose ad delivery algorithms are non-discriminatory, the distribution of genders among the recipients of the two ads will be roughly equal. On the other hand, in order to optimize for engagement or business objectives, platforms may incorporate other factors into ad delivery optimization, such as training or historical data. This data may reflect the de facto skew and thus influence machine-learning-based algorithmic predictions of engagement. Since such factors do not reflect differences in job qualifications, they may be disallowed (§2.2) and therefore represent platform-induced discrimination (even if they benefit engagement or the platform's business interests). We will look for evidence of such factors in a difference in gender distribution between the paired ads (see §4.4 for how we quantify the difference).

In §5.1, we use the above criteria to select three job categories (delivery driver, sales associate and software engineer), run a pair of ads for each category, and compare the gender make-up of the people to whom LinkedIn and Facebook show our ads. An example of such a pair of ads is a delivery driver job at Domino's (a pizza chain) and at Instacart (a grocery delivery service). The de facto gender distribution among drivers of these services is skewed male for Domino's and skewed female for Instacart [17, 50]. If a platform shows the Instacart ad to relatively more women than a Domino's ad, we conclude that the platform's algorithm is discriminatory, since both jobs have similar qualification requirements and thus a gender skew cannot be attributed to differences in qualifications across genders represented in the audience.

Using paired, concurrent ads that target the same audience also ensures that other confounding factors, such as timing or competition from other advertisers, affect both ads equally [2].

To avoid bias due to the audience's willingness to move for a job, we select jobs in the same physical location. When possible (for the delivery driver and sales job categories, but not software engineering), we select jobs in the location of our target audience.

4.3 Placing Ads and Collecting Results

We next describe the mechanics of placing ads on Facebook and LinkedIn, and collecting the ad delivery statistics which we use to calculate the gender breakdown of the audiences our ads were shown to. We also discuss the content and parameters we use for running our ads.

4.3.1 Ad Content. In creating our ads, we aim to use gender-neutral text and images so as to minimize any possible skew due to the input of an advertiser (us). The ad headline and description for each pair of ads is customized to each job category as described in §5.1. Each ad we run links to a real-world job opportunity that is listed on a job search site, pointing to a job posting on a company's careers page (for delivery driver) or to a job posting on LinkedIn.com (in other cases). Figure 1 shows screenshots of two ads from our experiments.

[Figure 1 (screenshots omitted): Example delivery driver job ads for Domino's and Instacart.]

4.3.2 Ad Optimization Objective. We begin by using the conversion objective because searching for people who are likely to take an action on the job ad is a likely choice for advertisers seeking users who will apply for their job (§5.1). For LinkedIn ads, we use the "Job Applicants" option, a conversion objective with the goal: "Your ads will be shown to those most likely to view or click on your job ads, getting more applicants." [38]. For Facebook ads, we use the "Conversions" option with the following optimization goal: "Encourage people to take a specific action on your business's site" [20], such as register on the site or submit a job application.

In §5.2, we run some of our Facebook ads using the awareness objective. By comparing the outcomes across the two objectives we can evaluate whether an advertiser's objective choice plays a role in the skew (§2.2). We use the "Reach" option that Facebook provides within the awareness objective, with the stated goal of: "Show your ad to as many people as possible in your target audience" [20].

4.3.3 Other Campaign Parameters. We next list other parameters we use for running ads and our reasons for picking them.

From the ad format options available for the objectives we selected, we choose single image ads, which show up in a prominent part of LinkedIn and Facebook users' newsfeeds.

We run all Facebook and LinkedIn ads with a total budget of $50 per ad campaign and schedule them to run for a full day or until the full budget is exhausted. This price point ensures a reasonable sample size for statistical evaluations, with all of our ads receiving at least 340 impressions.

For both platforms, we request automated bidding to maximize the number of clicks (for the conversion objective) or impressions (for the awareness objective) our ads can get within the budget. We configure our campaigns on both platforms to pay per impression shown. On LinkedIn, this is the only available option for our chosen parameters. We use the same option on Facebook for consistency. On both platforms we disable audience expansion and off-site delivery options. While these options might show our ad to more users, they are not relevant or may interfere with our methodology.

Since our methodology for LinkedIn relies on using North Carolina county names as proxies for gender, we add "North Carolina" as the location for our target audience. We do the same for Facebook for consistency across experiments, but we do not need to use location as a proxy to infer gender in Facebook's case.
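As a compact summary of the parameters above, the settings common to our campaigns on both platforms could be written down as a small configuration record. This is a sketch in our own shorthand; the keys are not LinkedIn or Facebook API field names.

```python
# Shorthand summary of the campaign parameters described in §4.3.3.
# Keys are our own labels, not platform API fields.
CAMPAIGN_SETTINGS = {
    "ad_format": "single_image",          # prominent newsfeed placement
    "total_budget_usd": 50,               # per ad campaign
    "schedule": "1 day, or until the budget is exhausted",
    "bidding": "automated",               # maximize clicks (conversion) / impressions (awareness)
    "billing": "pay_per_impression",      # only option on LinkedIn; mirrored on Facebook
    "audience_expansion": False,          # disabled: stay within our uploaded audience
    "offsite_delivery": False,            # disabled: not relevant to our measurement
    "target_location": "North Carolina",  # county names double as a gender proxy on LinkedIn
}
```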
[Figure 2 (diagram omitted): nested audience subsets with A1 = A2 ⊇ Q1 = Q2 ⊇ O1 = O2 ⊇ S1, S2. Relation between subsets of audiences involved in running two ads targeting the same audience. The subscripts indicate sets for the first and second ad.]

4.3.4 Launching Ads and Collecting Delivery Statistics. For LinkedIn, we use its Marketing Developer Platform API to create the ads and, once the ads run, to get the final count of impressions per county, which we use to infer gender. For Facebook, we create ads via its Ads Manager portal. The portal gives a breakdown of ad impressions by gender, so we do not rely on using county names as a proxy. We export the final gender breakdown after the ad completes running.
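Because LinkedIn reports impressions per county rather than per gender, the last step is to fold the per-county counts back through the county split from §4.1. A minimal sketch follows, assuming the report has already been parsed into a county-to-impressions mapping (the real API response format differs):

```python
def infer_gender_counts(county_impressions, male_counties):
    """Collapse per-county impression counts into inferred gender counts.

    county_impressions: dict mapping county name -> impressions;
    male_counties: the county half where only male voters were included.
    """
    male = sum(n for c, n in county_impressions.items() if c in male_counties)
    female = sum(n for c, n in county_impressions.items() if c not in male_counties)
    return male, female

# Made-up example: {"Wake": 120, "Durham": 80} with male half {"Wake"}
# yields (120, 80), i.e., 120 male and 80 female impressions inferred.
```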
4.4 Skew Metric

We now describe the metric we apply to the outcome of advertising, i.e., the demographic make-up of the audience that saw our ads, to establish whether the platform's ad delivery algorithm leads to discriminatory outcomes.

4.4.1 Metric: As discussed in the beginning of this section, our methodology works by running two ads simultaneously and looking at the relative difference in how they are delivered. In order to be able to effectively compare delivery of the two ads, we need to ensure the baseline audience that we use to measure skew is the same for both ads. The baseline we use is people who are qualified for the job we are advertising and are browsing the platform during the ad campaigns. However, we must consider several audience subsets, shown in Figure 2: A, the audience targeted by the advertiser (us); Q, the subset of A that the ad platform's algorithm considers qualified for the job being advertised; and O, the subset of Q that are online when the ads are run.

Our experiment design should ensure that these sets are the same for both ads, so that a possible skewed delivery cannot be merely explained by a difference in the underlying factors these sets represent. We ensure A, Q, and O match for our jobs by targeting the same audience (same A), ensuring both jobs have similar qualification requirements (same Q) as discussed in §4.2, and by running the two ads at the same time (same O).

To measure gender skew, we compare what fraction of the people in O that saw our two ads are members of a specific gender. A possibly unequal distribution of gender in the audience does not affect our comparison because it affects both ads equally (because O is the same for both ads). Let S_1 and S_2 denote the subsets of people in O who saw the first and second ad, respectively; S_1 and S_2 are not necessarily disjoint. To measure gender skew, we compare the fraction of females in S_1 that saw the first ad (s_{1,f}) and the fraction of females in S_2 that saw the second ad (s_{2,f}) with the fraction of females in O that were online during the ad campaign (o_f).

In the absence of discriminatory delivery, we expect, for both ads, that the gender make-up of the audience the ad is shown to is representative of the gender make-up of people that were online and participated in ad auctions. Mathematically, we expect s_{1,f} = o_f and s_{2,f} = o_f. As an external auditor that does not have access to users' browsing activities, we do not have a handle on o_f, but we can directly compare s_{1,f} and s_{2,f}. Because we ensure other factors that may affect ad delivery are either controlled or affect both ads equally, we can attribute any difference we might observe between s_{1,f} and s_{2,f} to choices made by the platform's ad delivery algorithm based on factors unrelated to qualification of users, such as revenue or engagement goals of the platform.

4.4.2 Statistical Significance: We use the Z-test to measure the statistical significance of a difference in proportions we observe between s_{1,f} and s_{2,f}. Our null hypothesis is that there is no gender-wise difference between the audiences that saw the two ads, i.e., s_{1,f} = s_{2,f}, evaluated as:

\[ Z = \frac{s_{1,f} - s_{2,f}}{\sqrt{\hat{s}_f (1 - \hat{s}_f) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \]

where \hat{s}_f is the fraction of females in S_1 and S_2 combined (S_1 ∪ S_2), and n_1 and n_2 are the sizes of S_1 and S_2, respectively. At significance level α, if Z > Z_α, we reject the null hypothesis and conclude that there is a statistically significant gender skew in the ad delivery. We use a 95% confidence level (Z_α = 1.96) for all of our statistical tests. This test assumes the samples are independent and n is large. Only the platform knows whom it delivers the ad to, so only it can verify independence. Sample sizes vary by experiment, as shown in figures, but they always exceed 340 and often are several thousands.
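To illustrate, the test reduces to a few lines given the four delivery counts; the function and the numbers below are our own example, not data from the study.

```python
from math import sqrt

def gender_skew_z(f1, n1, f2, n2):
    """Two-proportion Z-test on the fraction of female ad recipients.

    f1, n1: female impressions and total impressions for ad 1 (S_1);
    f2, n2: the same counts for ad 2 (S_2).
    """
    s1, s2 = f1 / n1, f2 / n2
    s_hat = (f1 + f2) / (n1 + n2)  # pooled fraction of females
    return (s1 - s2) / sqrt(s_hat * (1 - s_hat) * (1 / n1 + 1 / n2))

# Made-up example: ad 1 reaches 300 women of 1000 recipients,
# ad 2 reaches 400 of 1000; |Z| > 1.96 indicates significant skew.
z = gender_skew_z(300, 1000, 400, 1000)
print(round(z, 2), abs(z) > 1.96)
```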
4.5 Ethics

Our experiments are designed to consider ethical implications, minimizing harm both to the platforms and the individuals that interact with our ads. We minimize harm to the platforms by registering as an advertiser and interacting with the platform just like any other regular advertiser would. We follow their terms of service, use standard APIs available to any advertiser, and do not collect any user data. We minimize harm to individuals using the platform and seeing our ads by having all our ads link to a real job opportunity as described. Finally, our ad audiences aim to include an approximately equal number of males and females and so aim not to discriminate. Our study was classified as exempt by our Institutional Review Board.

5 EXPERIMENTS

We next present the results from applying our methodology to real-world ads on Facebook and LinkedIn. We find contrasting results that show statistically significant evidence of skew that is not justifiable on the basis of qualification in the case of Facebook, but not in the case of LinkedIn. We make the data for the ads we used in our experiments and their delivery statistics publicly available at [26].

5.1 Measuring Skew in Real-world Ads

We follow the criteria discussed in §4.2 to pick and compare jobs which have similar qualification requirements but for which there is data that shows the de facto gender distribution is skewed.
algorithms reproduce these de facto skews, even though they                                                                                                               Q "    6NHZ"
are not justifiable on the basis of differences in qualification.        )%$XGI                                                                           =
                                                                                                                                                                        
We pick three job categories: a low-skilled job (delivery driver), a high-skilled job (software engineer), and a low-skilled but popular job among our ad audience (sales associate). Since our methodology compares two ads for each category, we select two job openings at companies for which we have evidence of de facto gender distribution differences, and use our metric (§4.4) to measure whether there is a statistically significant gender skew in ad delivery. In each job category, we select pairs of jobs in the same state to avoid skew (§4.2).

For each experiment, we run the same pair of ads on both Facebook and LinkedIn and compare their delivery. For both platforms, we repeat the experiments on three different audience partitions for reproducibility. We run the ads for each job category at different times to avoid self-competition (§4.1). We run this first set of ads using the conversion objective (§4.3.2).

As discussed in §4.3.1, we build our ad creatives (text and image) using gender-neutral content to minimize any skew due to an advertiser's (our) input. For the delivery driver and sales associate categories, Facebook ad text uses modified snippets of the real job descriptions they link to (for example, "Become a driver at Domino's and deliver pizza"). Images use a company's logo or a picture of its office. To ensure any potential skew is not due to keywords in the job descriptions that could appeal differently to different audiences, we ran the software engineering Facebook ads using generic headlines with a format similar to the ones shown in Figure 1, and found results similar to the runs that used modified snippets. All LinkedIn ads were run using generic ad headlines similar to those in Figure 1.

5.1.1 Delivery Drivers. We choose delivery driver as a job category to study because we were able to identify two companies – Domino's and Instacart – with significantly different de facto gender distributions among drivers, even though their job requirements are similar. 98% of delivery drivers for Domino's are male [17], whereas more than 50% of Instacart drivers are female [50]. We run ads for driver positions in North Carolina for both companies, and expect a platform whose ad delivery optimization goes beyond what is justifiable by qualification and reproduces de facto skews to show the Domino's ad to relatively more males than the Instacart ad.

Figure 3a shows gender skews in the results of ad runs for delivery drivers, giving the gender ratios of ad impressions with 95% confidence intervals. These results show evidence of a statistically significant gender skew on Facebook, and show no gender skew on LinkedIn. The skew we observe on Facebook is in the same direction as the de facto skew, with the Domino's ad delivered to a higher fraction of men than the Instacart ad. We confirm the results across three separate runs on both platforms, each time targeting a different audience partition.

[Figure 3: Skew in delivery of real-world ads on Facebook (FB) and LinkedIn (LI), using the "Conversion" objective. n gives the total number of impressions. We use our metric (§4.4) to test for skew at the 95% confidence level (Z > 1.96). Panels, each plotting the fraction of females in the audience: (a) Delivery Driver at Domino's vs. Instacart; (b) Software Engineer at Nvidia vs. Netflix; (c) Sales Associate at Reeds Jewelry vs. Leith Automotive.]

5.1.2 Software Engineers. We next consider the software engineer (SWE) job category, a high-skilled job which may be a better match for LinkedIn users than delivery driver jobs.

We pick two companies based on employee demographics stated in their diversity reports. Because we are running software engineering ads, we specifically look at the percentage of female employees who work in a tech-related position. We pick Netflix and Nvidia for our paired ad experiments. At Netflix, 35% of employees in tech-related positions are female, according to its 2021 report [42]. At Nvidia, 19% of all employees are female according to [45], and third-party data as of 2020 suggests that the percentage of female employees in tech-related positions is as low as 14% [16]. For both companies, we find job openings in the San Francisco Area and run ads for those positions. We expect that a platform whose algorithm learns and perpetuates the existing difference in employee demographics will show the Netflix ad to more women than the Nvidia ad.
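To make the test concrete, the following is a minimal sketch in Python of how a pair of ads can be checked for skew, assuming the standard pooled two-proportion z-test sketched earlier (§4.4 gives our exact metric). The example counts are the LinkedIn Aud#2 impressions with inferred gender reported later in Table 2, used here purely as an illustration:

```python
from math import sqrt

def skew_test(f1, m1, f2, m2):
    """Two-proportion z-test comparing the fraction of female impressions
    between two concurrently run ads; |z| > 1.96 indicates statistically
    significant skew at the 95% confidence level."""
    n1, n2 = f1 + m1, f2 + m2
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return p1, p2, (p1 - p2) / se

def ci95(f, m):
    """Normal-approximation 95% confidence interval for the fraction of
    female impressions, as plotted in Figure 3."""
    n = f + m
    p = f / n
    half = 1.96 * sqrt(p * (1 - p) / n)
    return p - half, p + half

# Example: LinkedIn Aud#2 counts from Table 2 (gender inferred by county).
# Nvidia: 258 female / 232 male; Netflix: 272 female / 240 male.
p_nvidia, p_netflix, z = skew_test(258, 232, 272, 240)
print(f"Nvidia {p_nvidia:.3f}, Netflix {p_netflix:.3f}, Z = {z:.2f}")
# |Z| is roughly 0.15, well below 1.96: no evidence of skew, matching
# the "no evidence" LinkedIn outcomes in Figure 3b.
```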



Figure 3b shows the results. The Facebook results show skew by gender in all three trials, with a statistically significant difference in gender distribution between the delivery of the two ads. The skew is in the direction that confirms our hypothesis, with a higher fraction of women seeing the Netflix ads than the Nvidia ads. The LinkedIn results show no skew in any of the three trials. These results confirm the presence of delivery skew not justified by qualifications on Facebook for a second, higher-skilled job category.

5.1.3 Sales Associates. We consider sales associate as a third job category. Using LinkedIn's audience estimation feature, we found that many LinkedIn users in the audience we use identified as having sales experience, so we believe people with experience in sales are well-represented in the audience. Bureau of Labor Statistics (BLS) data shows that sales jobs skew by gender in different industries, with women filling 62% of sales associate positions in jewelry stores but only 17.9% in auto dealerships [56]. We pick Reeds Jewelers (a retail jeweler) and Leith Automotive (an auto dealership) to represent these two industries, each with open sales positions in North Carolina. If LinkedIn's or Facebook's delivery mimics skew in the de facto gender distribution, we expect them to deliver the Reeds ad to relatively more women than the Leith ad.

Figure 3c presents the results. All three trials on both platforms confirm our prior results using other job categories, with statistically significant delivery skew for all pairs on Facebook, but not in two of the three cases on LinkedIn. One of the three trials on LinkedIn (Aud#1f) shows skew just above the threshold for statistical significance, and surprisingly it shows bias in the opposite direction from expected (more women for the Reeds ad). We observe that these LinkedIn cases have the smallest response rates (349 to 521 impressions) and their Z-scores (1.54 to 2.15) are close to the threshold (Z = 1.96), while Facebook shows consistently large skew (Z-scores of 11 or more).

5.1.4 Summary: These experiments confirm that the methodology we propose in §4.2 is feasible to implement in practice. Moreover, the observed outcomes differ between the two platforms. Facebook's job ad delivery is skewed by gender, even when the advertiser is targeting a gender-balanced audience, consistent with prior results of [2]. However, because our methodology controls for qualifications, our results imply that the skew cannot be explained by the ad delivery algorithm merely reflecting differences in qualifications. Thus, based on the discussion of legal liability in §2.2, our findings suggest that Facebook's algorithms may be responsible for unlawful discriminatory outcomes.

Our work provides the first analysis of LinkedIn's ad delivery algorithm. With the exception of one experiment, we did not find evidence of skew by gender introduced by LinkedIn's ad delivery, a negative result for our investigation, but perhaps a positive result for society.

5.2 "Reach" vs. "Conversion" Objectives

In §5.1, we used the conversion objective, assuming that this objective would be chosen by most employers running ads and aiming to maximize the number of job applicants. However, both LinkedIn and Facebook also offer advertisers the choice of the reach objective, aiming to increase the number of people reached with (or shown) the ad, rather than the number of people who apply for the job. We next examine how the use of the reach objective affects skew in ad delivery on Facebook, compared to the use of the conversion objective. We focus on Facebook because we observed evidence of skew that cannot be explained by differences in qualifications in its case, and we are interested in exploring whether that skew remains even with a more "neutral" objective. While there may be a debate about allocating responsibility for discrimination between advertiser and platform when using a conversion objective (see §2.2), we believe that the responsibility for any discrimination observed when the advertiser-selected objective is reach rests on the platform.

We follow our prior approach (§5.1) with one change: we use reach as the objective and compare with the prior results that used the conversion objective. The job categories and other parameters remain the same, and we repeat the experiments on different audience partitions for reproducibility.

Figure 4a, Figure 4b and Figure 4c show the delivery of reach ads for the delivery driver, software engineer and sales associate ads, respectively. For comparison, the figures include the prior Facebook experiments run using the conversion objective (from Figure 3). For all three job categories, the results show a statistically significant skew in at least two out of the three experiments using the reach objective. This confirms our result in §5.1 that Facebook's ad delivery algorithm introduces gender skew even when the advertiser targets a gender-balanced audience. Since skewed delivery occurs even when the advertiser chooses the reach objective, the skew is attributable to the platform's algorithmic choices and not to the advertiser's choice.

[Figure 4: Comparison of ad delivery with "reach" (R) and "conversion" (C) objectives on Facebook, plotting the fraction of females in the audience. Panels: (a) Delivery Driver at Domino's vs. Instacart; (b) Software Engineer at Nvidia vs. Netflix; (c) Sales Associate at Reeds Jewelry vs. Leith Automotive.]

On the other hand, we notice two main differences in the delivery of the ads run with the reach objective. For all three job categories (Figure 4a, Figure 4b and Figure 4c), the gap in gender delivery between each pair of ads is smaller for the reach ads than for the conversion ads. And, for two of the job categories (delivery driver and sales associate), one of the three cases does not show statistically significant evidence of skew, while all three showed such evidence with the conversion ads. These observations indicate that the degree of skew may be reduced when using the reach objective and, therefore, that an advertiser's request for the conversion objective may increase the amount of skew because, according to Facebook's algorithmic predictions, conversions may correlate with particular genders for certain jobs.

Revisiting our discussion of the legal responsibility for discrimination (§2.2) depending on the advertiser's objective choice in light of this result: the data finds evidence of discriminatory outcomes that cannot be explained merely by the advertiser's choice. An advertiser's choice of conversion can potentially lead to a more discriminatory outcome than a choice of reach, which may imply the advertiser bears some responsibility for the outcome. However, one could also argue that it is the ad platform that has full control over determining how the optimization algorithm actually works and what its inputs are. Therefore, if an advertiser discloses that they are running job ads (which we did in our experiments), the ad platform may still have the responsibility to ensure its algorithm does not produce a discriminatory outcome regardless of the advertiser objective it is trying to optimize for.
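For concreteness, paired campaigns that differ only in their objective can be created programmatically through Facebook's Marketing API [21]. The sketch below is an illustration under stated assumptions, not the exact tooling behind our experiments: endpoint and parameter details vary by API version, and a complete campaign additionally requires ad sets, creatives, and audience definitions.

```python
import requests

GRAPH = "https://graph.facebook.com/v10.0"  # API version is an assumption; adjust as needed

def create_campaign(token: str, account_id: str, name: str, objective: str) -> str:
    """Create a paused campaign; `objective` is e.g. 'REACH' or 'CONVERSIONS'.

    Declaring the EMPLOYMENT special ad category is required when running
    job ads, which is the disclosure discussed above."""
    resp = requests.post(
        f"{GRAPH}/act_{account_id}/campaigns",
        data={
            "name": name,
            "objective": objective,
            "status": "PAUSED",  # attach ad sets and creatives before launching
            "special_ad_categories": '["EMPLOYMENT"]',
            "access_token": token,
        },
    )
    resp.raise_for_status()
    return resp.json()["id"]  # campaign id to which ad sets/ads are attached

# Paired design: identical creatives, audiences and budgets; only the
# delivery objective differs between the two campaigns.
# reach_id = create_campaign(TOKEN, ACCOUNT_ID, "swe-pair-reach", "REACH")
# conv_id  = create_campaign(TOKEN, ACCOUNT_ID, "swe-pair-conv", "CONVERSIONS")
```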




6 FUTURE WORK

We next discuss the limitations of our study, give some directions for future study and, motivated by the challenges we faced in our work, provide recommendations as to what ad platforms can do to make auditing more feasible and accurate.

6.1 Limitations and Further Directions

Our experiments focus on skew by gender, but we believe our methodology can be used to study other attributes such as age or race. It requires the auditor to have access to data about age and gender distributions among employees of different companies in the same category, so that the auditor can pick job ads that fit the criteria of our methodology. It also requires the ability to create audiences whose age and race distributions are known. The voter dataset we use includes age and race, so it can be adapted to test for discrimination along those attributes.

Like prior studies, we use physical location as a proxy to infer the gender of the ad recipient, an approach which has some limitations. LinkedIn hides location when there are two or fewer ad recipients, so our estimates may be off in those areas. These cases account for 31-43% of our ad recipients, as shown in Table 2. Assuming the gender distribution is uniform by county in North Carolina's population, we reason that these unreported cases do not significantly distort our conclusions.

Table 2: Breakdown of impressions for LinkedIn ads run using Aud#2. "Unreported" shows percentages in unreported counties (whose genders we cannot infer).

  Company     Total   Males   Females   Unreported (%)
  Domino's      806     241       233            41.19
  Instacart     757     194       232            43.73
  Nvidia        859     232       258            42.96
  Netflix       907     240       272            43.55
  Leith         454     145       160            32.82
  Reeds         521     192       166            31.29
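To illustrate the proxy, the sketch below shows how gender counts like those in Table 2 can be tallied, assuming a hypothetical county_gender map fixed by how the audience partition was built from the voter records [44] and a per-county impression breakdown from the ads report [34]. This is an illustrative reconstruction, not our exact pipeline:

```python
def tally_genders(total_impressions, county_report, county_gender):
    """Infer gender counts from a per-county impression breakdown.

    county_report  -- iterable of (county, impressions) pairs from the
                      platform's ads report
    county_gender  -- county -> "M" or "F", known by construction of the
                      audience from voter records

    LinkedIn withholds counties with two or fewer recipients, so any
    impressions the per-county report does not account for are tallied
    as unreported (the last column of Table 2).
    """
    counts = {"M": 0, "F": 0}
    reported = 0
    for county, impressions in county_report:
        counts[county_gender[county]] += impressions
        reported += impressions
    counts["unreported"] = total_impressions - reported
    return counts

# Example shape only (hypothetical counties and counts):
# tally_genders(806, [("Wake", 120), ("Durham", 95)],
#               {"Wake": "M", "Durham": "F"})
```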



We tested three job categories, with three experiment repetitions each. Additional categories and repetitions would improve confidence in our results. Although we found it difficult to select job categories with documented gender bias that we could target, such data is available in private datasets. Another question worth investigating with regard to picking job categories is whether delivery optimization algorithms are the same for all job categories, i.e., whether relatively more optimization happens for high-paying or scarce jobs.

Some advertisers will wish to target their ads by profession or background. We did not evaluate such targeting because our population data is not rich and large enough to support such comparisons with statistical rigor. Evaluation of this question would be future work, especially if the auditor has access to richer population data.

6.2 Recommendations

Prior work has shown that platforms are not consistent when self-policing their algorithms for undesired societal consequences, perhaps because the platforms' business objectives are at stake. Therefore, we believe independent (third-party) auditing fills an important role. We suggest recommendations to make such external auditing of ad delivery algorithms more accessible, accurate and efficient, especially for public-interest researchers and journalists.

Providing more targeting and delivery statistics: First, echoing sentiments from prior academic and activism work [2, 40], we note the value of surfacing additional ad targeting and delivery data in a privacy-preserving way. Public-interest auditors often rely on features that the ad platforms make available to any regular advertiser to conduct their studies, which can make performing certain types of audits challenging. For example, in the case of LinkedIn, the ad performance report does not contain a breakdown of ad impressions by gender or age. To overcome such challenges, prior audit studies and our work rely on finding workarounds, such as proxies, to measure ad delivery along sensitive demographic features. On one hand, providing additional ad delivery statistics could help expand the scope of auditors' investigations. On the other hand, there may be an inherent trade-off between providing additional statistics about ad targeting and delivery and the privacy of users (see e.g. [23, 29]) or the business interests of advertisers. We believe that privacy-preserving techniques, such as differentially private data publishing [19], may be able to strike a balance between auditability and privacy, and could be a fruitful direction for future work and practical implementation in the ad delivery context.

It is also worth asking what additional functionalities or insights about its data or ad delivery optimization algorithms a platform can or should provide which would allow for more accessible auditing without sacrificing the independence of the audits. Recent work has explored finding a balance between independence and addressing the challenges of external auditing by suggesting a cooperative audit framework [63], where the target platform is aware of the audit and gives the auditor special access, but certain protocols are in place to ensure the auditor's independence. In the context of ad platforms, we recognize that providing a special-access option for auditors may open a path for abuse, where advertisers may pretend to be an auditor for their economic or competitive benefit.

Replacing ad-hoc privacy techniques: Our other recommendation is for ad platforms to replace the ad-hoc techniques they use as a privacy enhancement with more rigorous approaches. For example, LinkedIn gives only a rough estimate of audience sizes, and does not give the sizes at all if they are less than 300 [35]. It also does not give the number of impressions by location if the count per county is less than three [34].

Such ad-hoc approaches have two main problems. First, it is not clear based on prior work on the ad platforms how effective they are in terms of protecting the privacy of users [61, 62]. We were also able to circumvent the 300-minimum limit for audience size estimates on LinkedIn with repeated queries, by composing one targeting parameter with another, then repeating a decomposed query and calculating the difference (see the sketch at the end of this section). More generally, numerous studies show that ad-hoc approaches often fail to provide the privacy that they promise [13, 41]. Second, ad-hoc approaches can distort statistical tests that auditors perform [43]. Therefore, we recommend that ad platforms use approaches with rigorous privacy guarantees, and whose impact on statistical validity can be precisely analyzed, such as differentially private algorithms [19], where possible.

Reducing the cost of auditing: Auditing ad platforms via black-box techniques incurs a substantial cost of money, effort, and time. Our work alone required several months of research on data collection and methodology design, and cost close to $5K to perform the experiments by running ads. A prior study of the impact of Facebook's ad delivery algorithms on political discourse cost up to $13K [3]. These costs quickly accumulate if one is to repeat experiments to study trends, increase statistical confidence, or reproduce results. One possible solution is to provide a discount for auditors. They would have the same access to platform features as any other advertiser but would pay less to run ads. However, as with other designs that give auditors special treatment, this approach risks abuse.

Overall, making the auditing of ad delivery systems more feasible for a broader range of interested parties can help ensure that the systems that shape the job opportunities people see operate in a fair manner that does not violate anti-discrimination laws. The platforms may not currently have the incentives to make the changes proposed and, in some cases, may actively block transparency efforts initiated by researchers and journalists [39]; thus, these changes may need to be mandated by law.
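The audience-size differencing mentioned under "Replacing ad-hoc privacy techniques" can be sketched as follows, assuming a hypothetical audience_count(targeting) helper that wraps the platform's audience-counts endpoint [35] and returns its size estimate for a targeting specification. This illustrates the general differencing idea rather than reproducing our exact queries:

```python
def estimate_small_segment(audience_count, base, extra):
    """Estimate the size of a segment the platform refuses to report
    directly because it falls below the minimum-size threshold.

    base  -- targeting spec (dict) whose audience is well above the threshold
    extra -- an additional targeting parameter to compose with `base`

    Both queries below are individually above the reporting threshold,
    yet their difference can reveal a count below it, which is why such
    ad-hoc floors are not a rigorous privacy protection.
    """
    n_base = audience_count(base)                   # count(base)
    n_composed = audience_count({**base, **extra})  # count(base AND extra)
    return n_base - n_composed                      # ~ count(base AND NOT extra)
```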




7 CONCLUSION

We study gender bias in the delivery of job ads due to platforms' optimization choices, extending existing methodology to account for the role of qualifications in addition to the other confounding factors studied in prior work. We are the first to methodologically address the challenge of controlling for qualification, and we also draw attention to how qualification may be used as a legal defense against liability under applicable laws. We apply our methodology to both Facebook and LinkedIn, showing that it is applicable to multiple platforms and can identify distinctions between their ad delivery practices. We also provide the first analysis of LinkedIn for potential skew in ad delivery. We confirm that Facebook's ad delivery can result in skew of job ad delivery by gender beyond what can be legally justified by possible differences in qualifications, thus strengthening the previously raised arguments that Facebook's ad delivery algorithms may be in violation of anti-discrimination laws [2, 14]. We do not find such skew on LinkedIn. Our approach provides a novel example of the feasibility of auditing algorithmic systems in a black-box manner, using only the capabilities available to all users of the system. At the same time, the challenges we encounter lead us to suggest changes that ad platforms could make (or that should be mandated of them) to make external auditing of their performance in societally impactful areas easier.

ACKNOWLEDGMENTS

This work was funded in part by NSF grants CNS-1755992, CNS-1916153, CNS-1943584, CNS-1956435, and CNS-1925737.

REFERENCES

[1] ACLU. Facebook EEOC complaints. https://www.aclu.org/cases/facebook-eeoc-complaints?redirect=node/70165.
[2] Ali, M., Sapiezynski, P., Bogen, M., Korolova, A., Mislove, A., and Rieke, A. Discrimination through optimization: How Facebook's ad delivery can lead to biased outcomes. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (2019).
[3] Ali, M., Sapiezynski, P., Korolova, A., Mislove, A., and Rieke, A. Ad delivery algorithms: The hidden arbiters of political messaging. In 14th ACM International Conference on Web Search and Data Mining (2021).
[4] Andrus, M., Spitzer, E., Brown, J., and Xiang, A. "What we can't measure, we can't understand": Challenges to demographic data procurement in the pursuit of fairness. In ACM Conference on Fairness, Accountability, and Transparency (FAccT) (2021).
[5] Angwin, J., and Parris Jr., T. Facebook lets advertisers exclude users by race – ProPublica. https://www.propublica.org/article/facebook-lets-advertisers-exclude-users-by-race, October 26, 2016.
[6] Angwin, J., Scheiber, N., and Tobin, A. Dozens of companies are using Facebook to exclude older workers from job ads – ProPublica. https://www.propublica.org/article/facebook-ads-age-discrimination-targeting, December 20, 2017.
[7] Asplund, J., Eslami, M., Sundaram, H., Sandvig, C., and Karahalios, K. Auditing race and gender discrimination in online housing markets. In Proceedings of the International AAAI Conference on Web and Social Media (2020).
[8] Barocas, S., and Selbst, A. D. Big data's disparate impact. California Law Review 104, 3 (2016), 671–732.
[9] Bogen, M., and Rieke, A. Help wanted: An examination of hiring algorithms, equity, and bias. Technical report, Upturn (2018).
[10] Bogen, M., Rieke, A., and Ahmed, S. Awareness in practice: Tensions in access to sensitive attribute data for antidiscrimination. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020).
[11] CFR. 12 CFR section 202.4(b)—discouragement. https://www.law.cornell.edu/cfr/text/12/202.4.
[12] CFR. 24 CFR section 100.75—discriminatory advertisements, statements and notices. https://www.law.cornell.edu/cfr/text/24/100.75.
[13] Cohen, A., and Nissim, K. Linear program reconstruction in practice. Journal of Privacy and Confidentiality 10, 1 (2020).
[14] Datta, A., Datta, A., Makagon, J., Mulligan, D. K., and Tschantz, M. C. Discrimination in online personalization: A multidisciplinary inquiry. FAT (2018).
[15] Datta, A., Tschantz, M. C., and Datta, A. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies, 1 (2015).
[16] DiversityReports.org. Diversity reports - Nvidia. https://www.diversityreports.org/company-information/nvidia, 2020. Last accessed on Feb 28, 2021.
[17] Domino's. Gender pay gap report 2018. https://investors.dominos.co.uk/sites/default/files/attachments/dominos-corporate-stores-sheermans-limited-gender-pay-gap-2018-report.pdf, 2018. Last accessed on October 6, 2020.
[18] Dwork, C., and Ilvento, C. Fairness under composition. In 10th Innovations in Theoretical Computer Science Conference (ITCS) (2019).
[19] Dwork, C., and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science (2014).
[20] Facebook. Choose the right objective. https://www.facebook.com/business/help/1438417719786914.
[21] Facebook. Marketing API—Facebook for developers. https://developers.facebook.com/docs/marketing-apis/.
[22] Facebook. Simplifying targeting categories. https://www.facebook.com/business/news/update-to-facebook-ads-targeting-categories/, 2020.
[23] Faizullabhoy, I., and Korolova, A. Facebook's advertising platform: New attack vectors and the need for interventions. In IEEE Workshop on Technology and Consumer Protection (ConPro) (2018).
[24] Gelauff, L., Goel, A., Munagala, K., and Yandamuri, S. Advertising for demographically fair outcomes. arXiv preprint arXiv:2006.03983 (2020).
[25] Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (2016).
[26] Imana, B., Korolova, A., and Heidemann, J. Dataset of content and delivery statistics of ads used in "Auditing for discrimination in algorithms delivering job ads". https://ant.isi.edu/datasets/addelivery/.
[27] Kayser-Bril, N. Automated discrimination: Facebook uses gross stereotypes to optimize ad delivery. https://algorithmwatch.org/en/story/automated-discrimination-facebook-google/, October 18, 2020.
[28] Kim, M. P., Korolova, A., Rothblum, G. N., and Yona, G. Preference-informed fairness. In Innovations in Theoretical Computer Science (2020).
[29] Korolova, A. Privacy violations using microtargeted ads: A case study. Journal of Privacy and Confidentiality 3, 1 (2011), 27–49.
[30] Lambrecht, A., and Tucker, C. Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads. Management Science 65, 7 (2019), 2966–2981.
[31] Laura Murphy and Associates. Facebook's civil rights audit – progress report. https://about.fb.com/wp-content/uploads/2019/06/civilrightaudit_final.pdf, June 30, 2019.
[32] Laura Murphy and Associates. Facebook's civil rights audit – final report. https://about.fb.com/wp-content/uploads/2020/07/Civil-Rights-Audit-Final-Report.pdf, July 8, 2020.
[33] Lecuyer, M., Spahn, R., Spiliopolous, Y., Chaintreau, A., Geambasu, R., and Hsu, D. Sunlight: Fine-grained targeting detection at scale with statistical confidence. In CCS (2015).
[34] LinkedIn. Ads reporting. https://docs.microsoft.com/en-us/linkedin/marketing/integrations/ads-reporting/ads-reporting.
[35] LinkedIn. Audience counts. https://docs.microsoft.com/en-us/linkedin/marketing/integrations/ads/advertising-targeting/audience-counts.
[36] LinkedIn. Campaign quality scores for sponsored content. https://www.linkedin.com/help/lms/answer/85406.
[37] LinkedIn. LinkedIn marketing developer platform. https://docs.microsoft.com/en-us/linkedin/marketing/.
[38] LinkedIn. Select a marketing objective for your ad campaign. https://www.linkedin.com/help/lms/answer/94698/select-a-marketing-objective-for-your-ad-campaign.
[39] Merrill, J. B., and Tobin, A. Facebook moves to block ad transparency tools – including ours. https://www.propublica.org/article/facebook-blocks-ad-transparency-tools, January 28, 2019.
[40] Mozilla. Facebook's ad archive API is inadequate. https://blog.mozilla.org/blog/2019/04/29/facebooks-ad-archive-api-is-inadequate/, 2019.
[41] Narayanan, A., and Shmatikov, V. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (2008).
[42] Netflix. Inclusion takes root at Netflix: Our first report. https://about.netflix.com/en/news/netflix-inclusion-report-2021, 2021.
[43] Nissim, K., Steinke, T., Wood, A., Altman, M., Bembenek, A., Bun, M., Gaboardi, M., O'Brien, D. R., and Vadhan, S. Differential privacy: A primer for a non-technical audience. Vand. J. Ent. & Tech. L. 21 (2018).
[44] North Carolina State Board of Elections. Voter history data. https://dl.ncsbe.gov/index.html. Downloaded on April 23, 2020.
[45] Nvidia. Global diversity and inclusion report. https://www.nvidia.com/en-us/about-nvidia/careers/diversity-and-inclusion/, 2021. Last accessed on Feb 28, 2021.
[46] Reisman, D., Schultz, J., Crawford, K., and Whittaker, M. Algorithmic impact assessments: A practical framework for public agency accountability. AI Now (2018).
[47] Sandberg, S. Doing more to protect against discrimination in housing, employment and credit advertising. https://about.fb.com/news/2019/03/protecting-against-discrimination-in-ads/, March 19, 2019.
[48] Sandvig, C., Hamilton, K., Karahalios, K., and Langbort, C. Auditing algorithms: Research methods for detecting discrimination on internet platforms. Data and Discrimination: Converting Critical Concerns into Productive Inquiry 22 (2014), 4349–4357.
[49] Sapiezynski, P., Ghosh, A., Kaplan, L., Mislove, A., and Rieke, A. Algorithms that "don't see color": Comparing biases in lookalike and special ad audiences. arXiv preprint arXiv:1912.07579 (2019).
[50] Selyukh, A. Why suburban moms are delivering your groceries. NPR, https://www.npr.org/2019/05/25/722811953/why-suburban-moms-are-delivering-your-groceries, May 25, 2019.
[51] Shukla, S. A better way to learn about ads on Facebook. https://about.fb.com/news/2019/03/a-better-way-to-learn-about-ads/, March 28, 2019.
[52] Speicher, T., Ali, M., Venkatadri, G., Ribeiro, F. N., Arvanitakis, G., Benevenuto, F., Gummadi, K. P., Loiseau, P., and Mislove, A. Potential for discrimination in online targeted advertising. In Proceedings of Machine Learning Research (2018), S. A. Friedler and C. Wilson, Eds.
[53] Spencer, S. Upcoming update to housing, employment, and credit advertising policies. https://www.blog.google/technology/ads/upcoming-update-housing-employment-and-credit-advertising-policies/, 2020.
[54] Sweeney, L. Discrimination in online ad delivery: Google ads, black names and white names, racial discrimination, and click advertising. Queue (2013).
[55] Tobin, A., and Merrill, J. B. Facebook is letting job advertisers target only men – ProPublica. https://www.propublica.org/article/facebook-is-letting-job-advertisers-target-only-men, September 18, 2018.
[56] U.S. Bureau of Labor Statistics. Employed persons by detailed industry, sex, race, and Hispanic or Latino ethnicity. https://www.bls.gov/cps/cpsaat18.pdf, 2018.
[57] U.S. Equal Employment Opportunity Commission. Prohibited employment policies/practices. https://www.eeoc.gov/prohibited-employment-policiespractices.
[58] USC. 29 USC section 623—prohibition of age discrimination. https://www.law.cornell.edu/uscode/text/29/623.
[59] USC. 42 USC section 2000e-3—other unlawful employment practices. https://www.law.cornell.edu/uscode/text/42/2000e-3.
[60] USC. 47 USC section 230—protection for private blocking and screening of offensive material. https://www.law.cornell.edu/uscode/text/47/230.
[61] Venkatadri, G., Andreou, A., Liu, Y., Mislove, A., Gummadi, K. P., Loiseau, P., and Goga, O. Privacy risks with Facebook's PII-based targeting: Auditing a data broker's advertising interface. In IEEE Symposium on Security and Privacy (SP) (2018).
[62] Venkatadri, G., and Mislove, A. On the potential for discrimination via composition. In Internet Measurement Conference (IMC '20) (2020).
[63] Wilson, C., Ghosh, A., Jiang, S., Mislove, A., Baker, L., Szary, J., Trindel, K., and Polli, F. Building and auditing fair algorithms: A case study in candidate screening. In ACM Conference on Fairness, Accountability, and Transparency (FAccT) (2021).
[64] Zhang, J., and Bareinboim, E. Fairness in decision-making - the causal explanation formula. In Association for the Advancement of Artificial Intelligence (2018).



