Auditing for Discrimination in Algorithms Delivering Job Ads Basileal Imana Aleksandra Korolova John Heidemann University of Southern California University of Southern California USC/Information Science Institute Los Angeles, CA, USA Los Angeles, CA, USA Los Angeles, CA, USA ABSTRACT 1 INTRODUCTION Ad platforms such as Facebook, Google and LinkedIn promise Digital platforms and social networks have become popular value for advertisers through their targeted advertising. How- means for advertising to users. These platforms provide many ever, multiple studies have shown that ad delivery on such plat- mechanisms that enable advertisers to target a specific au- forms can be skewed by gender or race due to hidden algorith- dience, i.e. specify the criteria that the member to whom an mic optimization by the platforms, even when not requested ad is shown should satisfy. Based on the advertiser’s chosen by the advertisers. Building on prior work measuring skew parameters, the platforms employ optimization algorithms to in ad delivery, we develop a new methodology for black-box decide who sees which ad and the advertiser’s payments. auditing of algorithms for discrimination in the delivery of job Ad platforms such as Facebook and LinkedIn use an au- advertisements. Our first contribution is to identify the distinc- tomated algorithm to deliver ads to a subset of the targeted tion between skew in ad delivery due to protected categories audience. Every time a member visits their site or app, the such as gender or race, from skew due to differences in qualifi- platforms run an ad auction among advertisers who are tar- cation among people in the targeted audience. This distinction geting that member. In addition to the advertiser’s chosen is important in U.S. law, where ads may be targeted based parameters, such as a bid or budget, the auction takes into on qualifications, but not on protected categories. 
Second, we account an ad relevance score, which is based on the ad’s pre- develop an auditing methodology that distinguishes between dicted engagement level and value to the user. For example, skew explainable by differences in qualifications from other from LinkedIn’s documentation : “scores are calculated factors, such as the ad platform’s optimization for engagement ... based on your predicted campaign performance and the or training its algorithms on biased data. Our method con- predicted performance of top campaigns competing for the trols for job qualification by comparing ad delivery of two same audience.” Relevance scores are computed by ad plat- concurrent ads for similar jobs, but for a pair of companies forms using algorithms; both the algorithms and the inputs with different de facto gender distributions of employees. We they consider are proprietary. We refer to the algorithmic pro- describe the careful statistical tests that establish evidence cess run by platforms to determine who sees which ad as ad of non-qualification skew in the results. Third, we apply our delivery optimization. proposed methodology to two prominent targeted advertising Prior work has hypothesized that ad delivery optimiza- platforms for job ads: Facebook and LinkedIn. We confirm tion plays a role in skewing recipient distribution by gen- skew by gender in ad delivery on Facebook, and show that der or race even when the advertiser targets their ad inclu- it cannot be justified by differences in qualifications. We fail sively [15, 30, 52, 54]. This hypothesis was confirmed, at least to find skew in ad delivery on LinkedIn. Finally, we suggest for Facebook, in a recent study , which showed that for jobs improvements to ad platform practices that could make ex- such as lumberjack and taxi driver, Facebook delivered ads to ternal auditing of their algorithms in the public interest more audiences skewed along gender and racial lines, even when the feasible and accurate. 
advertiser was targeting a gender- and race-balanced audience. The Facebook study  established that the skew is not due CCS CONCEPTS to advertiser targeting or competition from other advertisers, • Social and professional topics → Technology audits; and hypothesized that it could stem from the proprietary ad Employment issues; Socio-technical systems; Systems delivery algorithms trained on biased data optimizing for the analysis and design. platform’s objectives (§2.1). Our work focuses on developing an auditing methodol- ACM Reference Format: ogy for measuring skew in the delivery of job ads, an area Basileal Imana, Aleksandra Korolova, and John Heidemann. 2021. Auditing for Discrimination in Algorithms Delivering Job Ads. In where U.S. law prohibits discrimination based on certain at- Proceedings of the Web Conference 2021 (WWW ’21), April 19–23, 2021, tributes [57, 59]. We focus on expanding the prior auditing Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi. methodology of  to bridge the gap between audit studies org/10.1145/3442381.3450077 that demonstrate that a platform’s ad delivery algorithm re- sults in skewed delivery and studies that provide evidence that This paper is published under the Creative Commons Attribution 4.0 Interna- tional (CC-BY 4.0) license. Authors reserve their rights to disseminate the work the skewed delivery is discriminatory, thus bringing the set on their personal and corporate Web sites with the appropriate attribution. of audit studies one step closer to potential use by regulators WWW ’21, April 19–23, 2021, Ljubljana, Slovenia to enforce the law in practice . We identify one such gap © 2021 IW3C2 (International World Wide Web Conference Committee), pub- lished under Creative Commons CC-BY 4.0 License. in the context of job advertisements: controlling for bona fide ACM ISBN 978-1-4503-8312-7/21/04. 
occupational qualifications  and develop a methodology https://doi.org/10.1145/3442381.3450077 3767 WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Basileal Imana, Aleksandra Korolova, and John Heidemann to address it. We focus on designing a methodology that as- 2 PROBLEM STATEMENT sumes no special access beyond what a regular advertiser sees, Our goal is to develop a novel methodology that measures because we believe that auditing of ad platforms in the public skew in ad delivery that is not justifiable on the basis of differ- interest needs to be possible by third-parties — and society ences in job qualification requirements in the targeted audi- should not depend solely on the limited capabilities of federal ence. Before we focus on qualification, we first enumerate the commissions or self-policing by the platforms. different potential sources of skew that need to be taken into Our first contribution is to examine how the occupational consideration when measuring the role of the ad delivery algo- qualification of an ad’s audience affects the legal liability an rithms. We then discuss how U.S. law may treat qualification ad platform might incur with respect to discriminatory adver- as a legitimate cause for skewed ad delivery. tising (§2). Building upon legal analysis in prior work , we We refer to algorithmic decisions by ad platforms that result make an additional distinction between skew that is due to a in members of one group being over- or under-represented difference in occupational qualifications among the members among the ad recipients as “skew in ad delivery”. We con- of the targeted ad audience, and skew that is due to (implicit or sider groups that have been identified as legally protected explicit use of) protected categories such as gender or race by (such as gender, age, race). We set the baseline population for the platform’s algorithms. This distinction is relevant because measuring skew as the qualified and available ad platform U.S. 
law allows differential delivery that is justified by dif- members targeted by the campaign (see §4.4 for a quantitative ferences in qualifications , an argument that platforms definition). are likely to use to defend themselves against legal liabil- ity when presented with evidence from audit studies such as [2, 15, 30, 52, 54]. 2.1 Potential Sources of Skew Our second contribution is to propose a novel auditing Our main challenge is to isolate the role of the platform’s al- methodology (§4) that distinguishes between a delivery skew gorithms in creating skew from other factors that affect ad that could be a result of the ad delivery algorithm merely in- delivery and may be used to explain away any observed skew. corporating job qualifications of the members of the targeted This is a challenge for a third-party auditor because they inves- ad audience from skew due to other algorithmic choices that tigate the platform’s algorithms as a black-box, without access correlate with gender- or racial- factors, but are not related to to the code or inputs of the algorithm, or access to the data qualifications. Like the prior study of Facebook , to isolate or behavior of platform members or advertisers. We assume the role of the platform’s algorithms we control for factors that the auditor has access only to ad statistics provided by extraneous to the platform’s ad delivery choices, such as the the platform. demographics of people on-line during an ad campaign’s run, Targeted advertising consists of two high-level steps. The advertisers’ targeting, and competition from other advertisers. advertiser creates an ad, specifies its target audience, campaign Unlike prior work, our methodology relies on simultaneously budget, and the advertiser’s objective. The platform then de- running paired ads for several jobs that have similar qualifica- livers the ad to its users after running an auction among ad- tion requirements but have skewed de facto (gender) distribution. 
vertisers targeting those users. We identify four categories of By “skewed de facto distribution”, we refer to existing societal factors that may introduce skew into this process: circumstances that are reflected in the skewed (gender) dis- First, an advertiser can select targeting parameters and tribution of employees. An example of such a pair of ads is a an audience that induce skew. Prior work [5, 6, 52, 55, 62] has delivery driver job at Domino’s (a pizza chain) and at Instacart shown that platforms expose targeting options that advertisers (a grocery delivery service). Both jobs have similar qualifi- can use to create discriminatory ad targeting. Recent changes cation requirements but one is de facto skewed male (pizza in platforms have tried to disable such options [22, 47, 53]. delivery) and the other – female (grocery delivery) [17, 50]. Second, an ad platform can make choices in its ad de- Comparing the delivery of ads for such pairs of jobs ensures livery optimization algorithm to maximize ad relevance, skew we may observe can not be attributed to differences in engagement, advertiser satisfaction, revenue, or other busi- qualification among the underlying audience. ness objectives, which can implicitly or explicitly result in Our third contribution is to show that our proposed method- a skew. As one example, if an image used in an ad receives ology distinguishes between the behavior of ad delivery algo- better engagement from a certain demographic, the platform’s rithms of different real-world ad platforms, and identify those algorithm may learn this association and preferentially show whose delivery skew may be going beyond what is justifiable the ad with that image to the subset of the targeted audience on the basis of qualifications, and thus may be discriminatory belonging to that demographic . As another example, for a (§5). 
We demonstrate this by registering as advertisers and job ad, the algorithm may aim to show the ad to users whose running job ads for real employment opportunities on two professional backgrounds better match the job ad’s qualifi- platforms, Facebook and LinkedIn. We apply the same audit- cation requirements. If the targeted population of qualified ing methodology to both platforms and observe contrasting individuals is skewed along demographic characteristics, the results that show statistically significant gender-skew in the platform’s algorithm may propagate this skew in its delivery. case of Facebook, but not LinkedIn. Third, an advertiser’s choice of objective can cause a skew. We conclude by providing recommendations for changes Ad platforms such as LinkedIn and Facebook support adver- that could make auditing of ad platforms more accessible, tiser objectives such as reach and conversion. Reach indicates efficient and accurate for public interest researchers (§6.2). the advertiser wants their ad to be shown to as many people 3768 Auditing for Discrimination in Algorithms Delivering Job Ads WWW ’21, April 19–23, 2021, Ljubljana, Slovenia as possible in their target audience, while for conversion the not apply. This contribution is what primarily sets us apart advertiser wants as many ad recipients as possible to take from prior work. It also brings findings from audit studies some action, such as clicking through to their site [20, 38]. such as ours a step closer to having the potential to be used Different demographic groups may have different propensities by regulators to enforce the law in practice. to take specific actions, so a conversion objective can implicitly As discussed in §2.1, the objective an advertiser chooses can cause skewed delivery. 
When the platform’s implementation also be a source of skew, particularly for the conversion objec- of the advertiser’s objective results in a discriminatory skew, tive, if different demographic groups tend to engage differently. the responsibility for it can be a matter of dispute (see §2.2). When the advertiser chosen objective that is implemented by Finally, there may be other confounding factors that are the platform results in discriminatory delivery, who bears not under direct control of a particular advertiser or the plat- the legal responsibility may be unclear. On one hand, the ad- form leading to skew, such as differing sign-on rates across vertiser (perhaps, unknowingly or implicitly) requested the demographics, time-of-day effects, and differing rates of adver- outcome, and if that choice created a discriminatory outcome, tiser competition for users from different demographics. For some prior analysis  suggests the platform may be pro- example, delivery of an ad may be skewed towards men be- tected under Section 230 of the Communications Decency Act, cause more men were online during the run of the ad campaign, a U.S. law that provides ad platforms with immunity from con- or because competing advertisers were bidding higher to reach tent published by advertisers . On the other hand, one may the women in the audience than to reach the men [2, 18, 30]. argue that the platform should be aware of the risk of skew In our work, we focus on isolating skew that results from for job ads when optimizing for conversions, and therefore an ad delivery algorithm’s optimization (the second factor). should prevent it, or deny advertisers the option to select this Since we are studying job ads, we are interested in further objective, just as they must prevent explicit discriminatory distinguishing skew due to an algorithm that incorporates targeting by the advertiser. 
Our work does not advocate a qualification in its optimization from skew that is due to an position on the legal question, but provides data (§5.2) about algorithm that perpetuates societal biases without a justifi- outcomes that shows implications of the objective’s choice. cation grounded in qualifications. We are also interested in In addition to the optimization objective, other confounding how job ad delivery is affected by the objective chosen by the sources of skew (§2.1) may have implications for legal liability. advertiser (the third factor). We discuss our methodology for The prior legal analysis of the Google’s ad platform evaluated achieving these goals in §4. the applicability of Section 230 to different sources of skew, and argued Google may not be protected by Section 230 if a skew is fully a product of Google’s algorithms . Similarly, our 2.2 Discriminatory Job Ads and Liability goal is to design a methodology that controls for confounding Building on a legal analysis in prior work , we next discuss factors and isolates skew that is enabled solely due to choices how U.S. anti-discrimination law may treat job qualification made by the platform’s ad delivery algorithms. requirements, optimization objectives, and other factors that can cause skew, and discuss how the applicability of the law 3 BACKGROUND informs our methodology design. We next highlight relevant details about the ad platforms to Our work is unique in underscoring the implications of which we apply our methodology and discuss related work. qualification when evaluating potential legal liability ad plat- forms may incur due to skewed job ad delivery. We also draw attention to the nuances in analyzing the implications of the 3.1 LinkedIn and Facebook Ad Platforms optimization objective an advertiser chooses. We focus on We give details about LinkedIn’s and Facebook’s advertising Title VII, a U.S. 
law which prohibits preferential or discrimina- platforms that are relevant to our methodology. tory employment advertising practices using attributes such Ad objective: LinkedIn and Facebook advertisers purchase as gender or race . ads to meet different marketing objectives. As of February 2021, Title VII allows entities who advertise job opportunities both LinkedIn and Facebook have three types of objectives: to legally show preference based on bona fide occupational awareness, consideration and conversion, and each type has qualifications , which are requirements necessary to carry multiple additional options [20, 38]. For both platforms, the out a job function. Thus, it is conceivable that a platform such chosen objective constrains the ad format, bidding strategy as Facebook can use this exception to legally argue that the and payment options available to the advertiser. skew arising from its ad delivery optimization established Ad audience: On both platforms, advertisers can target by  does not violate the law, since its ad delivery algorithm an audience using targeting attributes such as geographic merely takes into account qualifications. Therefore, our goal location, age and gender. But if the advertiser discloses they are is to design an auditing methodology that can distinguish running a job ad, the platforms disable or limit targeting by age between ad delivery optimization resulting in skew due to and gender . LinkedIn, being a professional network, also ad platform’s use of qualifications from skew due to other provides targeting by job title, education, and job experience. algorithmic choices by the platform. 
Through making this In addition, advertisers on both platforms can upload a distinction we can eliminate the possibility of the platform list of known contacts to create a custom audience (called using qualification as a legal argument against being held “Matched Audience” on LinkedIn and “Custom Audience” on liable for discriminatory outcomes when this argument does Facebook). On LinkedIn, contacts can be specified by first and 3769 WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Basileal Imana, Aleksandra Korolova, and John Heidemann last name or e-mail address. Facebook allows specification audience without using micro-targeting and in the presence by many more fields, such as zip code and phone number. of ad delivery optimization. Lambrecht et al.  perform a The ad platforms then match the uploaded list with profile field test promoting job opportunities in STEM using target- information from LinkedIn or Facebook accounts. ing that was intended to be gender-neutral, find that their ads Ad performance report: Both LinkedIn and Facebook pro- were shown to more men than women, and explore potential vide ad performance reports through their website interface and explanations for this outcome. Finally, recent work by Ali and via their marketing APIs [21, 37]. These reports reflect near Sapiezynski et al.  has demonstrated that their job and hous- real-time campaign performance results such as the number ing ads placed on Facebook are delivered skewed by gender of clicks and impressions the ad received, broken down along and race, even when the advertiser targets a gender- and race- different axes. The categories of information along which ag- balanced audience, and that this skew results from choices gregate breakdowns are available differ among platforms. Face- of the Facebook’s ad delivery algorithm, and is not due to book reports breaks down performance data by location, age, market or user interaction effects. 
AlgorithmWatch  repli- and gender, while LinkedIn gives breakdowns by location, job cate these findings with European user audiences, and add an title, industry and company, but not by age or gender. investigation of Google’s ad delivery for jobs. Our work is mo- tivated by these studies, confirming results on Facebook and performing the first study we are aware of for LinkedIn. Going 3.2 Related Work a step further to distinguish between skewed and discrimina- Targeted advertising has become ubiquitous, playing a signifi- tory delivery, we propose a new methodology to control for cant role in shaping information and access to opportunities user qualifications, a factor not accounted for in prior work, for hundreds of millions of users. Because the domains of em- but that is critical for evaluating whether skewed delivery is, ployment, housing, and credit have legal anti-discrimination in fact, discriminatory, for job ads. We build on prior work ex- protections in the U.S. [11, 12, 58], the study of ad platform’s ploring ways in which discrimination may arise in job-related role in shaping access and exposure to those opportunities has advertising and assessing the legal liability of ad platforms , been of particular interest in civil rights discourse [31, 32] and to establish that the job ad delivery algorithms of Facebook research. We discuss such work next. may be violating U.S. anti-discrimination law. Discriminatory ad targeting: Several recent studies con- Auditing algorithms: The proprietary nature of ad plat- sider discrimination in ad targeting: journalists at ProPublica forms, algorithms, and their underlying data makes it difficult were among the first to show that Facebook’s targeting op- to definitively establish the role platforms and their algorithms tions enabled job and housing advertisers to discriminate by play for creation of discriminatory outcomes [4, 8–10, 46]. For age , race  and gender . 
In response to these find- advertising, in addition to the previously described studies, ings and as part of a settlement agreement to a legal chal- recent efforts have explored the possibility of auditing with lenge , Facebook has made changes to restrict the targeting data provided by Facebook through its public Ad Library  capabilities offered to advertisers for ads in legally protected (created in response to a legal settlement ). Other works domains [22, 47]. Other ad platforms, e.g. Google, have an- have focused on approaches that rely on sock-puppet account nounced similar restrictions . The question of whether creation [7, 33]. Our work uses only ad delivery statistics that these restrictions are sufficient to stop an ill-intentioned ad- platforms provide to regular advertisers. This approach makes vertiser from discrimination remains open, as studies have us less reliant on the platform’s willingness to be audited. We shown that advanced features of ad platforms, such as custom do not rely on transparency-data from platforms, since it is and lookalike audiences, can be used to run discriminatory often limited and insufficient for answering questions about ads [23, 49, 52, 62]. Our work assumes a well-intentioned ad- the platform’s role in discrimination . We also do not rely vertiser and performs an audit study using gender-balanced on an ability to create user accounts on the platform, since targeting. experimental accounts are labor-intensive to create and disal- Discriminatory ad delivery: In addition to the targeting lowed by most platform’s policies. We build on prior work of choices by advertisers, researchers have hypothesized that external auditing [2, 3, 14, 15, 48, 64]. We show that auditing discriminatory outcomes can be a result of platform-driven for discrimination in ad delivery of job ads is possible, even choices. 
In 2013, Sweeney’s empirical study found a statisti- when limited to capabilities available to a regular advertiser, cally significant difference between the likelihood of seeing an and that one can carefully control for confounding factors. ad suggestive of an arrest record on Google when searching Auditing LinkedIn: To our knowledge, the only work for people’s names assigned primarily to black babies com- that has studied LinkedIn’s ad system’s potential for discrim- pared to white babies . Datta et al.  found that the ination is that of Venkatadri and Mislove . Their work gender of a Google account influences the number of ads one demonstrates that compositions of multiple targeting options sees related to high-paying jobs, with female accounts seeing together can result in targeting that is skewed by age and gen- fewer such ads. Both studies could not examine the causes of der, without explicitly targeting using those attributes. They such outcomes, as their methodology did not have an ability to suggest mitigations should be based not on disallowing in- isolate the role of the platform’s algorithm from other possibly dividual targeting parameters, but on evaluating the overall contributing factors, such as competition from advertisers and outcome of the composition of targetings specified by an ad- user activity. Gelauff et al.  provide an empirical study of vertiser. We agree with this goal, and go beyond this prior the challenges of advertising to a demographically balanced ad 3770 Auditing for Discrimination in Algorithms Delivering Job Ads WWW ’21, April 19–23, 2021, Ljubljana, Slovenia work by basing our evaluation on the outcome of ad deliv- Table 1: Audiences used in our study. ery, measuring delivery of real-world ads, and contrasting outcomes on LinkedIn with Facebook’s. 
ID Size Males Females Match Rate Aud #0 954,714 477,129 477,585 11.83% Aud #1 900,000 450,000 450,000 11.6% 4 AUDITING METHODOLOGY Aud #2 950,000 450,000 500,000 11.8% We next describe the methodology we propose to audit ad Aud #0f 850,000 450,000 400,000 11.88% delivery algorithms for potential discrimination. Aud #1f 800,000 400,000 400,000 12.51% Our approach consists of three steps. First, we use the ad- Aud #2f 790,768 390,768 400,000 12.39% vertising platform’s custom audience feature (§4.1) to build an audience that allows us to infer gender of the ad recipients for platforms that do not provide ad delivery statistics along gender lines. Second, we develop a novel methodology that To evaluate experimental reproducibility without introduc- controls for job qualifications by carefully selecting job cat- ing test-retest bias, we repeat our experiments across different, egories (§4.2) for which everyone in the audience is equally but equivalent audience partitions. Table 1 gives a summary qualified (or not qualified) for, yet for which there are distinc- of the partitions we used. Aud#0, Aud#1 and Aud#2 are parti- tions in the real-world gender distributions of employees in tions whose size is approximately a quarter of the full audience, the companies. We then run paired ads concurrently for each while Aud#0f, Aud#1f and Aud#2f are constructed by swap- job category and use statistical tests to evaluate whether the ping the choice of gender by county. Swapping genders this ad delivery results are skewed (§4.3). way doubles the number of partitions we can use. Our lack of access to users’ profile data, interest or browsing On both LinkedIn and Facebook, the information we upload activity prevents us from directly testing whether ad delivery is used to find exact matches with information on user profiles. 
satisfies metrics of fairness commonly used in the literature, such as equality of opportunity, or recently proposed for ad allocation tasks where users have diverse preferences over outcomes, such as preference-informed individual fairness. In our context of job ads, equality of opportunity means that an individual in a demographic group that is qualified for a job should get a positive outcome (in our case: see an ad) at equal rates compared to an equally qualified individual in another demographic group. While our methodology does not test for this metric, we indirectly account for qualification in the way we select which job categories we run ads for.

We only describe a methodology for studying discrimination in ad delivery along gender lines, but we believe our methodology can be generalized to audit along other attributes such as race and age by an auditor with access to auxiliary data that is needed for picking appropriate job categories.

4.1 Targeted Audience Creation

Unlike Facebook, LinkedIn does not give a gender breakdown of ad impressions, but reports their location at the county level. As a workaround, we rely on an approach introduced in prior work [2, 3] that uses ad recipients' location to infer gender. To construct our ad audience, we use North Carolina's voter record dataset, which among other fields includes each voter's name, zip code, county, gender, race and age. We divide all the counties in North Carolina into two halves. We construct our audience by including only male voters from counties in the first half, and only female voters from counties in the second half (this data is limited to a gender binary, so our research follows). If a person from the first half of the counties is reported as having seen an ad, we can infer that the person is male, and vice versa. Furthermore, we include a roughly equal number of people from each gender in the targeting because we are interested in measuring skew that results from the delivery algorithm, not the advertiser's targeting choices.

We upload our audience partitions to LinkedIn in the form of first and last names. For Facebook, we also include zip codes, because their tool for uploading audiences notified us that the match rate would be too low when building audiences only on the basis of first and last names. The final targeted ad audience is a subset of the audience we upload, because not all the names will be matched, i.e. will correspond to an actual user of a platform. As shown in Table 1, for each audience partition, close to 12% of the uploaded names were matched with accounts on LinkedIn. Facebook does not report the final match rates for our audiences in order to protect user privacy.

To avoid self-interference between our ads over the same audience, we run paired ads concurrently, but ads for different job categories or for different objectives sequentially. In addition, to avoid test-retest bias, where a platform learns from prior experiments who is likely to respond and applies that to subsequent experiments, we generally use different (but equivalent) target audiences.

4.2 Controlling for Qualification

The main goal of our methodology is to distinguish skew resulting from algorithmic choices that are not related to qualifications from skew that can be justified by differences in user qualifications for the jobs advertised. A novel aspect of our methodology is to control for qualifications by running paired ads for jobs with similar qualification requirements, but skewed de facto gender distributions. We measure skew by comparing the relative difference between the delivery of a pair of ads that run concurrently, targeting the same audience. Each test uses paired jobs that meet two criteria: First, they must have similar qualification requirements, thus ensuring that the people that we target our ads with are equally qualified (or not qualified) for both job ads. Second, the jobs must exhibit a skewed, de facto gender distribution in the real world, as shown through auxiliary data.

Since both jobs require similar qualifications, our assumption is that on a platform whose ad delivery algorithms are non-discriminatory, the distribution of genders among the recipients of the two ads will be roughly equal. On the other hand, in order to optimize for engagement or business objectives, platforms may incorporate other factors into ad delivery optimization, such as training or historical data. This data may reflect the de facto skew and thus influence machine-learning-based algorithmic predictions of engagement. Since such factors do not reflect differences in job qualifications, they may be disallowed (§2.2) and therefore represent platform-induced discrimination (even if they benefit engagement or the platform's business interests). We will look for evidence of such factors in a difference in gender distribution between the paired ads (see §4.4 for how we quantify the difference).

WWW '21, April 19–23, 2021, Ljubljana, Slovenia. Basileal Imana, Aleksandra Korolova, and John Heidemann.

[Figure 1: Example delivery driver job ads for Domino's and Instacart.]

In §5.1, we use the above criteria to select three job categories – delivery driver, sales associate and software engineer – and run a pair of ads for each category and compare the gender make-up of the people to whom LinkedIn and Facebook show our ads. An example of such a pair of ads is a delivery driver job at Domino's (a pizza chain) and at Instacart (a grocery delivery service).
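The county-based audience construction and gender inference of §4.1 can be sketched in code. This is a minimal illustration only: the county split, voter records, and field names below are invented for the example, not the actual partition of North Carolina counties used in the study.

```python
# Sketch of the Sec. 4.1 audience construction (illustrative; hypothetical data).
# "First half" counties contribute only male voters, "second half" only female
# voters, so that a recipient's county later serves as a proxy for gender.

MALE_COUNTIES = {"Wake", "Durham"}        # hypothetical "first half" of counties
FEMALE_COUNTIES = {"Orange", "Guilford"}  # hypothetical "second half"

def build_audience(voters):
    """Keep male voters from the first half of counties and female voters
    from the second half; everyone else is excluded from targeting."""
    audience = []
    for v in voters:
        if v["county"] in MALE_COUNTIES and v["gender"] == "M":
            audience.append(v)
        elif v["county"] in FEMALE_COUNTIES and v["gender"] == "F":
            audience.append(v)
    return audience

def infer_gender(county):
    """Invert the construction: an impression reported in a 'male' county
    must belong to a male audience member, and vice versa."""
    if county in MALE_COUNTIES:
        return "M"
    if county in FEMALE_COUNTIES:
        return "F"
    raise ValueError(f"county {county!r} is not in the targeted audience")

# Toy voter records (hypothetical):
voters = [
    {"name": "A", "county": "Wake", "gender": "M"},
    {"name": "B", "county": "Wake", "gender": "F"},    # excluded by construction
    {"name": "C", "county": "Orange", "gender": "F"},
]
audience = build_audience(voters)
# audience keeps A and C; an impression reported in "Orange" is inferred female
```

Because the platform reports impression counts per county, summing impressions over the two county groups recovers the per-gender delivery counts without any per-user data.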
The de facto gender distribution among drivers of these services is skewed male for Domino's and skewed female for Instacart [17, 50]. If a platform shows the Instacart ad to relatively more women than a Domino's ad, we conclude that the platform's algorithm is discriminatory, since both jobs have similar qualification requirements and thus a gender skew cannot be attributed to differences in qualifications across genders represented in the audience.

Using paired, concurrent ads that target the same audience also ensures that other confounding factors, such as timing or competition from other advertisers, affect both ads equally.

To avoid bias due to the audience's willingness to move for a job, we select jobs in the same physical location. When possible (for the delivery driver and sales job categories, but not software engineering), we select jobs in the location of our target audience.

4.3 Placing Ads and Collecting Results

We next describe the mechanics of placing ads on Facebook and LinkedIn, and collecting the ad delivery statistics which we use to calculate the gender breakdown of the audiences our ads were shown to. We also discuss the content and parameters we use for running our ads.

4.3.1 Ad Content. In creating our ads, we aim to use gender-neutral text and image so as to minimize any possible skew due to the input of an advertiser (us). The ad headline and description for each pair of ads is customized to each job category as described in §5.1. Each ad we run links to a real-world job opportunity that is listed on a job search site, pointing to a job posting on a company's careers page (for delivery driver) or to a job posting on LinkedIn.com (in other cases). Figure 1 shows screenshots of two ads from our experiments.

4.3.2 Ad Optimization Objective. We begin by using the conversion objective because searching for people who are likely to take an action on the job ad is a likely choice for advertisers seeking users who will apply for their job (§5.1). For LinkedIn ads, we use the "Job Applicants" option, a conversion objective with the goal: "Your ads will be shown to those most likely to view or click on your job ads, getting more applicants." For Facebook ads, we use the "Conversions" option with the following optimization goal: "Encourage people to take a specific action on your business's site", such as register on the site or submit a job application.

In §5.2, we run some of our Facebook ads using the awareness objective. By comparing the outcomes across the two objectives we can evaluate whether an advertiser's objective choice plays a role in the skew (§2.2). We use the "Reach" option that Facebook provides within the awareness objective, with the stated goal of: "Show your ad to as many people as possible in your target audience".

4.3.3 Other Campaign Parameters. We next list other parameters we use for running ads and our reasons for picking them. From the ad format options available for the objectives we selected, we choose single image ads, which show up in a prominent part of LinkedIn and Facebook users' newsfeeds.

We run all Facebook and LinkedIn ads with a total budget of $50 per ad campaign and schedule them to run for a full day or until the full budget is exhausted. This price point ensures a reasonable sample size for statistical evaluations, with all of our ads receiving at least 340 impressions.

For both platforms, we request automated bidding to maximize the number of clicks (for the conversion objective) and impressions (for the awareness objective) our ads can get within the budget. We configure our campaigns on both platforms to pay per impression shown. On LinkedIn, this is the only available option for our chosen parameters. We use the same option on Facebook for consistency. On both platforms we disable the audience expansion and off-site delivery options. While these options might show our ad to more users, they are not relevant or may interfere with our methodology.

Since our methodology for LinkedIn relies on using North Carolina county names as proxies for gender, we add "North Carolina" as the location for our target audience. We do the same for Facebook for consistency across experiments, but we do not need to use location as a proxy to infer gender in Facebook's case.

4.3.4 Launching Ads and Collecting Delivery Statistics. For LinkedIn, we use its Marketing Developer Platform API to create the ads and, once the ads run, to get the final count of impressions per county, which we use to infer gender. For Facebook, we create ads via its Ad Manager portal. The portal gives a breakdown of ad impressions by gender, so we do not rely on using county names as a proxy. We export the final gender breakdown after the ad completes running.

4.4 Skew Metric

We now describe the metric we apply to the outcome of advertising, i.e. the demographic make-up of the audience that saw our ads, to establish whether the platform's ad delivery algorithm leads to discriminatory outcomes.

4.4.1 Metric: As discussed in the beginning of this section, our methodology works by running two ads simultaneously and looking at the relative difference in how they are delivered. In order to be able to effectively compare the delivery of the two ads, we need to ensure the baseline audience that we use to measure skew is the same for both ads. The baseline we use is people who are qualified for the job we are advertising and are browsing the platform during the ad campaigns. However, we must consider several audience subsets, shown in Figure 2: A, the audience targeted by the advertiser (us); Q, the subset of A that the ad platform's algorithm considers qualified for the job being advertised; and O, the subset of Q that are online when the ads are run.

[Figure 2: Relation between subsets of audiences involved in running two ads targeting the same audience (A1 = A2, Q1 = Q2, O1 = O2). The subscripts indicate sets for the first and second ad.]

Our experiment design should ensure that these sets are the same for both ads, so that a possible skewed delivery cannot be merely explained by a difference in the underlying factors these sets represent. We ensure A, Q, and O match for our jobs by targeting the same audience (same A), ensuring both jobs have similar qualification requirements (same Q) as discussed in §4.2, and by running the two ads at the same time (same O).

To measure gender skew, we compare what fraction of the people in O that saw our two ads are of a specific gender. A possible unequal distribution of gender in the audience does not affect our comparison because it affects both ads equally (because O is the same for both ads). Let S_1 and S_2 denote the subsets of people in O who saw the first and second ad, respectively. S_1 and S_2 are not necessarily disjoint sets. To measure gender skew, we compare the fraction of females in S_1 that saw the first ad (s_{1,f}) and the fraction of females in S_2 that saw the second ad (s_{2,f}) with the fraction of females in O that were online during the ad campaign (o_f).

In the absence of discriminatory delivery, we expect, for both ads, the gender make-up of the audience to whom the ad is shown to be representative of the gender make-up of the people that were online and participated in ad auctions. Mathematically, we expect s_{1,f} = o_f and s_{2,f} = o_f. As an external auditor that does not have access to users' browsing activities, we do not have a handle on o_f, but we can directly compare s_{1,f} and s_{2,f}. Because we ensure other factors that may affect ad delivery are either controlled or affect both ads equally, we can attribute any difference we might observe between s_{1,f} and s_{2,f} to choices made by the platform's ad delivery algorithm based on factors unrelated to qualification of users, such as revenue or engagement goals of the platform.

4.4.2 Statistical Significance: We use the Z-test to measure the statistical significance of the difference in proportions we observe between s_{1,f} and s_{2,f}. Our null hypothesis is that there is no gender-wise difference between the audiences that saw the two ads, i.e., s_{1,f} = s_{2,f}, evaluated as:

Z = (s_{1,f} − s_{2,f}) / sqrt( ŝ_f (1 − ŝ_f) (1/n_1 + 1/n_2) )

where ŝ_f is the fraction of females in S_1 and S_2 combined (S_1 ∪ S_2), and n_1 and n_2 are the sizes of S_1 and S_2, respectively. At significance level α, if Z > Z_α, we reject the null hypothesis and conclude that there is a statistically significant gender skew in the ad delivery. We use a 95% confidence level (Z_α = 1.96) for all of our statistical tests. This test assumes the samples are independent and n is large. Only the platform knows whom it delivers the ad to, so only it can verify independence. Sample sizes vary by experiment, as shown in figures, but they always exceed 340 and often are several thousands.

4.5 Ethics

Our experiments are designed to consider ethical implications, minimizing harm both to the platforms and the individuals that interact with our ads. We minimize harm to the platforms by registering as an advertiser and interacting with the platform just like any other regular advertiser would. We follow their terms of service, use standard APIs available to any advertiser, and do not collect any user data. We minimize harm to individuals using the platform and seeing our ads by having all our ads link to a real job opportunity as described. Finally, our ad audiences aim to include an approximately equal number of males and females and so aim not to discriminate. Our study was classified as exempt by our Institutional Review Board.

5 EXPERIMENTS

We next present the results from applying our methodology to real-world ads on Facebook and LinkedIn. We find contrasting results that show statistically significant evidence of skew that is not justifiable on the basis of qualification in the case of Facebook, but not in the case of LinkedIn. We make the data for the ads we used in our experiments and their delivery statistics publicly available at .

5.1 Measuring Skew in Real-world Ads

We follow the criteria discussed in §4.2 to pick and compare jobs which have similar qualification requirements but for which there is data that shows the de facto gender distribution is skewed. We study whether ad delivery optimization algorithms reproduce these de facto skews, even though they are not justifiable on the basis of differences in qualification. We pick three job categories: a low-skilled job (delivery driver), a high-skilled job (software engineer), and a low-skilled but popular job among our ad audience (sales associate). Since our methodology compares two ads for each category, we