Authors Bodil Svennblad Bénédicte Delcoigne Karl Michaëlsson Nils Feltelius Peter Wilén Rickard Ljung Rolf Gedeborg Wilmar Igl
License CC-BY-4.0
Received: 1 August 2022 | Revised: 30 November 2022 | Accepted: 12 December 2022
DOI: 10.1002/pds.5587

REVIEW

Federated analyses of multiple data sources in drug safety studies

Rolf Gedeborg 1 | Wilmar Igl 2 | Bodil Svennblad 3 | Peter Wilén 4 | Bénédicte Delcoigne 5 | Karl Michaëlsson 3 | Rickard Ljung 6 | Nils Feltelius 6

1 Department of Efficacy and Safety 1, Division of Licensing, Medical Products Agency, Uppsala, Sweden
2 Statistics Group, Department of Efficacy and Safety 2, Division of Licensing, Medical Products Agency, Uppsala, Sweden
3 Department of Surgical Sciences, Unit of Medical Epidemiology, Uppsala University, Uppsala, Sweden
4 Department of Legal Affairs, Medical Products Agency, Uppsala, Sweden
5 Clinical Epidemiology Division, Department of Medicine Solna, Karolinska Institutet, Sweden
6 Division of Use and Information, Medical Products Agency, Uppsala, Sweden

Correspondence
Rolf Gedeborg, Department of Efficacy and Safety 1, Division of Licensing, Medical Products Agency, Uppsala, Sweden.
Email: rolf.gedeborg@lakemedelsverket.se

Funding information
Lakemedelsverket (Swedish Medical Products Agency)

Abstract

Purpose: Studies of rare side effects of new drugs with limited exposure may require pooling of multiple data sources. Federated Analyses (FA) allow real-time, interactive, centralized statistical processing of individual-level data from different data sets without transfer of sensitive personal data.

Methods: We review IT-architecture, legal considerations, and statistical methods in FA, based on a Swedish Medical Products Agency methodological development project.

Results: In a review of all post-authorisation safety studies assessed by the EMA during 2019, 74% (20/27 studies) reported issues with lack of precision in spite of mean study periods of 9.3 years. FA could potentially improve precision in such studies. Depending on the statistical model, the federated approach can generate identical results to a standard analysis. FA may be particularly attractive for repeated collaborative projects where data is regularly updated. There are also important limitations. Detailed agreements between involved parties are strongly recommended to anticipate potential issues and conflicts, document a shared understanding of the project, and fully comply with legal obligations regarding ethics and data protection. FA do not remove the data harmonisation step, which remains essential and often cumbersome. Reliable support for technical integration with the local server architecture and security solutions is required. Common statistical methods are available, but adaptations may be required.

Conclusions: Federated Analyses require competent and active involvement of all collaborating parties but have the potential to facilitate collaboration across institutional and national borders and improve the precision of postmarketing drug safety studies.

KEYWORDS
adverse drug reactions, decentralised analysis, federated analysis, pooled analysis, post-authorisation safety studies, product surveillance, postmarketing, registers

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2022 The Authors. Pharmacoepidemiology and Drug Safety published by John Wiley & Sons Ltd.
Pharmacoepidemiol Drug Saf. 2023;32:279–286. wileyonlinelibrary.com/journal/pds
10991557, 2023, 3, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/pds.5587 by CochraneItalia, Wiley Online Library on [13/10/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Key Points
• Federated analysis allows real-time, interactive, centralized statistical analysis on individual-level data, without actual transfer of sensitive personal data between institutions and countries.
• The technique may be particularly attractive for situations where repeated collaborative projects are anticipated, and the cohorts are dynamic.
• Common statistical methods are available as mathematical models and software implementations for federated analysis.
• It is strongly recommended that the division of responsibility, including economic undertakings, and limitations on liability are always documented in a contract or other legally binding document.
• Successful implementation of federated analysis requires competent and active involvement of all collaborating parties.

1 | INTRODUCTION

Characterising rare adverse drug reactions is a common and challenging regulatory concern, especially for new drugs and drugs with limited exposure. The need to combine multiple data sources for such analyses is becoming increasingly important, but combining different sources of individual-level data is a complex process, especially when different countries are involved.

The rapid development of the COVID-19 vaccines and the need for fast implementation of national vaccination programs have highlighted the need to generate timely post-approval safety data to detect potential uncommon adverse reactions. Having agreements and technical arrangements in place for such pooling of data from multiple sources and countries would greatly facilitate the generation of timely study results.

Federated analysis (FA) is a technique that may facilitate a centralised combined analysis of multiple decentralised data sources without requiring actual data merging. We review the findings of an FA development project conducted by the Swedish Medical Products Agency covering computer engineering aspects, limitations and opportunities of statistical methods, the legality and regularity of FA involving the processing of personal data, and tools for validating implementation. The focus is on FA that can generate the same results as pooling of individual-level data in comparative epidemiological studies requiring regression analysis to control for confounding.

2 | WHAT ARE FEDERATED ANALYSES?

Federated analysis allows real-time, interactive, central statistical analyses in a system of federated databases, without transferring individual-level data outside the protective security mechanisms of the original host system (Figure 1).1 It is therefore an apparently attractive alternative to physical merging of data, and there are several examples of initiatives implementing different forms of FA for pharmacoepidemiology.2–6 That the federated approach provides identical results compared to analyses of physically merged data has been demonstrated for commonly used statistical models, for example, Generalized Linear Regression models.7–9

In FA all individual-level information remains protected behind the normal security mechanisms of the local host system, and the data owner retains full control over the minimum level of aggregation required to allow data to be viewed by the analyst, with protection against unauthorized use (Figure 1). From a statistical perspective, FA allows bi-directional exchange of information between a statistical model and sensitive individual-level patient data via anonymized group-level summary results, without sharing information on individuals. Thus, the data owners retain full local control over security settings that determine the level of aggregation required to protect sensitive personal information.

Whenever multiple raw datasets are used for a study, there is a need to create common study variables, a “Common Data Model”, with common structure and format for the variables.10–12 This is a prerequisite for FA as well as for any other strategy for combined analysis of individual-level data from different data sources. It is essential that the variables are given similar definitions through harmonisation using transparent and well documented algorithms.

A two-step meta-analysis is another alternative to direct access to combined individual-level data. With harmonised data at each data node, and a distributed code for analysis, the results from each data node can then be aggregated using conventional meta-analysis. In many instances this will generate results similar to a one-step analysis on individual-level data.13–15 One potential limitation is that parameters for covariates may be inconsistently estimated across different data sets, but this is not necessarily a concern.

In principle, FA and two-step meta-analysis are statistically equivalent. The main advantage of FA is that it allows real-time, interactive, centralized, standardized analysis of distributed data, that is, without having to ask data owners to perform specific analyses and provide the results. The main disadvantage is that the required IT infrastructure is considerably more complex for FA than for two-step meta-analyses. While meta-analysis can be performed on standard statistical software and personal computers, FA requires a complex software stack on a federated client–server architecture.
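The statement that a federated fit can reproduce a pooled fit for a generalized linear model can be illustrated with a small sketch. The Python code below is illustrative only and does not reproduce the DataSHIELD implementation: the function names (`site_summaries`, `newton_step`, `federated_logistic`), the toy data, and the two-parameter logistic model are assumptions made for this example. Each site returns only its summed score vector and information matrix for the current coefficients; the analysis client adds these summaries and takes a Newton-Raphson step, repeating for a fixed number of iterations.

```python
import math

def site_summaries(rows, beta):
    """Runs locally at one site; only sums over patients leave the site.

    rows: list of (x, y) pairs with a binary outcome y (hypothetical data).
    beta: current (intercept, slope) sent out by the analysis client.
    """
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in rows:
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        w = p * (1.0 - p)
        g0 += y - p            # score, intercept component
        g1 += (y - p) * x      # score, slope component
        h00 += w               # information matrix entries
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)

def newton_step(beta, score, info):
    """Central update: solve the 2x2 system info * delta = score."""
    g0, g1 = score
    h00, h01, h11 = info
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return (beta[0] + d0, beta[1] + d1)

def federated_logistic(sites, iterations=25):
    """Iterate: broadcast beta, collect site summaries, add them, update."""
    beta = (0.0, 0.0)
    for _ in range(iterations):
        score = [0.0, 0.0]
        info = [0.0, 0.0, 0.0]
        for rows in sites:
            s, h = site_summaries(rows, beta)
            score = [a + b for a, b in zip(score, s)]
            info = [a + b for a, b in zip(info, h)]
        beta = newton_step(beta, score, info)
    return beta

# Made-up toy data for two sites (not from the article).
site_a = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 0), (4.0, 1)]
site_b = [(0.5, 0), (1.5, 1), (2.5, 0), (3.5, 1), (4.5, 1)]

federated = federated_logistic([site_a, site_b])
pooled = federated_logistic([site_a + site_b])  # same fit on merged rows
print(federated, pooled)
```

Because the score and information are plain sums over patients, adding the per-site sums gives the same quantities as computing them on merged data, so the two fits agree up to floating-point rounding. A two-step meta-analysis would instead fit the model separately at each site and combine the resulting estimates.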
[Figure 1 diagram: Data sources 1 to 4, each behind its organisation's normal data protection mechanisms (“Firewalls”), connected to an analysis computer.]

FIGURE 1 A schematic representation of the concept of a Federated analysis, here illustrated by four separate data sources. The red boxes denote each data owner's normal security mechanisms for protection of sensitive personal information. The analysis task is submitted to each data server and executed locally. All summary results are then sent back to the analysis client, where they are integrated to a single result. This process may be iterated, and results optimized until the analysis has converged to a final result.

3 | HOW CAN FEDERATED ANALYSES FACILITATE POST-MARKETING STUDIES FROM A REGULATORY PERSPECTIVE?

Characterisation of drug safety profiles is one of the core missions of regulatory agencies. Data available at the time of a marketing authorization is often not sufficient to fully characterise safety, and non-interventional post-authorisation safety studies (PASS) using existing health-care databases are often required. A current example is the characterisation of myocarditis as an adverse reaction from mRNA COVID-19 vaccines.16

To assess the practical extent of insufficient sample size in PASS, and the potential of FA to address this concern, we reviewed regulatory assessment reports of PASS final results on the agenda for the plenary meetings of the European Medicines Agency Pharmacovigilance Risk Assessment Committee (PRAC) in 2019. Results from inferential studies involving modelling of covariates were selected for review. The review was restricted to 27 non-interventional PASS with an inferential design and study objective motivating regression analysis (Figure 2). The time periods observed in these studies were on average 9.3 years, ranging from 5 to 19 years. Despite this, 74% (20/27) of the assessments reported issues with poor precision of estimates.

A total of 20 studies were performed using a single data source, but in 70% (14/20) of these we still assessed an FA approach to be applicable, based on availability of other similarly structured data sources that could have been used. In one of seven studies based on multiple data sources an FA approach had been used, and four studies were considered potentially suitable since multiple similarly structured data sources were used in the study but without pooling of individual-level data. This suggests that FA could be applicable in the majority of studies involving multiple datasets.

4 | WHICH STATISTICAL ANALYSES ARE AVAILABLE AND WHAT ARE THE LIMITATIONS?

In our review of PASS the most common statistical methods used were Cox proportional hazards regression (59%; 16/27) and logistic regression (15%; 4/27). Negative binomial regression, Poisson regression, generalised estimation equations (GEE), LASSO regression, and regularised regression were applied in only a few studies. Models incorporating random effects were used in 22% (6/27) of these studies. Propensity scores (PS) were used in 44% (12/27) of the studies. The extent and potential consequences of missing data were reviewed in only 52% (14/27) of the assessment reports. In 48% (13/27) of these studies missing data was stated as a potential concern. The methods used to handle missing data were single imputation (4/13), multiple imputation (3/13), missing category (2/13), complete case analysis (1/13), no method (1/13), or not stated (2/13).

The following sections discuss opportunities and limitations regarding statistical methods and related issues in relation to the findings in our review of PASS study results (Supplemental online-only material: Report Federated analyses – Statistical methods).

4.1 | Horizontally or vertically partitioned data

A standard FA requires that the data is horizontally partitioned, meaning that all data sources include different sets of patients, but with the same variables. This is also the key type of data pooling needed to increase the sample size and precision of estimates. In vertically partitioned data, on the other hand, all data nodes include the same set of patients, but with different sets of variables. This type of data pooling is also of interest,17–19 but FA on vertically partitioned data requires the involvement of a trusted third-party facility.20

FIGURE 2 Flow chart describing selection of post-authorisation safety studies (PASS) where final study results were assessed by the PRAC in 2019. Studies with an aim involving inferential analyses with modelling of multiple variables were assessed for their potential suitability for Federated analysis (FA).

4.2 | Generalized linear models

Generalized linear models (GLMs) include many common regression models, for example, linear regression, logistic regression, or Poisson regression, which have a broad range of applications in epidemiology. They can be used in FA in a form that gives identical results to the standard formulation of GLMs.7–9 GLMs are implemented, for example, in the R/DataSHIELD package for FA.21

4.3 | Time-to-event analyses

The equivalence of the Cox proportional hazards model in an FA compared to a standard analysis has been demonstrated.22 This approach requires that distinct event times are shared between sites, which potentially could identify a patient. There are also limitations in handling high-dimensional data.

A federated meta-analysis approach using the Cox model has recently been added to the R/DataSHIELD package, fitting the Cox model separately in each federated data set.23 The results can then be combined using meta-analysis. In case of rare events the result from such an analysis can be biased because of the normal approximation of the likelihood function.24 Other options for the likelihood function approximation exist, but they are not yet implemented as FA and are only available for use in a traditional meta-analysis approach (e.g., the package R/EvidenceSynthesis).24

The only option to analyse time-to-event data federated on individual data, without sharing event times between sites and without using the meta-analysis approach, is to approximate the Cox proportional hazards model with a GLM using Poisson regression.25,26

4.4 | Propensity scores

Propensity scores based on measured baseline covariates are commonly used to control for confounding.27,28 They can be calculated using logistic-binomial regression, which is available as FA. In an FA the propensity scores can either be site-specific, where a model is fitted separately in each site with the possibility of including site-specific measurement covariates to control within-site confounding,29 or fitted using harmonized variables and all observations as if pooled to control between-site confounding. The propensity score can then be used for stratification, reweighting or adjustment, but matching across sites will not be possible.

4.5 | Missing data

When a complete case analysis is inappropriate, trivial methods for imputation of missing values, such as imputation of the mean, can be applied.30 If data is missing-at-random (MAR), maximum likelihood estimation (without imputation) or multiple imputation are recommended.31 Multiple Imputation by Chained Equations (MICE)32 can be used in FA and is under development to be integrated in the DataSHIELD software.33 If data is missing-not-at-random (MNAR), model-based imputation is required, but has not been applied in FA to our knowledge. When a variable is missing completely at one or more data sites a possible solution is to replace the covariate based on a prediction model (including uncertainty) using other correlated variables from other data sources. This is equally applicable for physically merged data and FA.

4.6 | Limitations

Statistical analyses that require the combined or joint distribution of individual-level data from multiple data sources, or are based on non-parametric empirical distributions, which cannot be approximated by parametric distributions, will not be possible or are technically challenging.
Examples of such analyses are the identification of duplicated patient records across datasets, the calculation of the median across multiple datasets, or the analysis of variables with non-parametric statistical distributions whose properties cannot be adequately summarized by a parametric distribution. These issues can be solved by

• the use of alternative statistical measures, for example, by calculating the (weighted) mean instead of the median, or the standard deviation instead of the range,
• the use of alternative distributions, for example, deriving the median from a fitted parametric distribution,
• the transformation of a variable, for example, from a complex, continuous distribution to a categorical distribution, or
• the comparison of encrypted patient-identifying features.20

4.7 | Computation time

In an FA all participating data servers must be up and running simultaneously for the duration of the analysis. Computation time for an FA is determined by the network latency or the lowest-specification hardware in the overall system, and will also depend on the number of steps in iterative statistical analyses.18 The statistical models may have to be re-formulated in a way that requires the computation of additional parameters. For example, a log-Poisson Generalized Linear Model requires additional parameters compared to a Cox proportional hazards model, and therefore increased computation time.34 Additional processing may also be required to apply data disclosure controls.18

5 | WHAT TO CONSIDER REGARDING IT ARCHITECTURE?

Federated analysis requires a complex software stack of applications to coordinate data management and data analysis. One example is the open-source OBiBa software application suite developed in the Maelstrom Research project.8,18,35,36 This aims to facilitate epidemiological research using multiple, physically separate data sources. This involves collecting and harmonising data from different databases, publishing general information about the content of data sources, and creating tools for FA. There is a strong collaboration with the Data to Knowledge Research Group (D2K) at Newcastle University, United Kingdom, which is spearheading the development of DataSHIELD, the key software for FA within the R software for statistical computing. All software developed in Maelstrom Research has a GPL v3 open-source licence.37

The Maelstrom Research project software was used in the Swedish Medical Products Agency development project as an example because it is rooted in a non-profit organisation, uses open-source code, and is used by academic research groups in Sweden. There are several other technical solutions for FA, but it was not within the scope of this project to compare different software.

Federated analysis requires a basic IT setup with a central access server with RStudio software, which connects to several DataSHIELD servers via secure permanent VPN channels. The client (i.e., analyst) computer has a secure connection to this central access server. Once this central node server is set up at one of the locations, the access to all other servers is through this node server. The implementation and maintenance of such a system requires a strong commitment from the IT support teams at all locations. Integration with existing local IT architecture and IT policies may be challenging (Supplemental online-only material: Report Federated analyses - IT architecture).

6 | WHAT LEGAL IMPLICATIONS SHOULD BE CONSIDERED?

The main advantage of FA is that the processing happens at the node level, which gives more control over sensitive personal data to the local research principals at each node. While the General Data Protection Regulation (GDPR) is a common legislation for all EU countries, national legislation for ethical approval of research may still result in differences between EU countries in views on access to sensitive personal information for research. When FA is used, the local research principal at each node performs several specific processing operations on personal data in the stage prior to the actual FA. The statistical analysis performed locally at the node constitutes a personal data processing operation. The numerical estimates delivered to the central server by each node are not to be construed as personal data when the data only consists of aggregated results from statistical calculations.

Responsibility for personal data is based on the real influence an actor has over each processing operation. The term controller means the natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purpose and means of the processing of personal data. The term processor means a natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller. There may be multiple controllers for the various processing operations performed. In FA an actor may have a real influence over a given processing operation even if no individual-level data has been transferred to that actor. It is therefore appropriate to establish early in the planning phase whether there will be one or more research projects and to distribute roles and responsibilities accordingly. All agreements should be in writing and regulate roles and responsibilities, as well as liability for any failure to comply with the terms of the agreement.

In traditional register-based research, the role of the controller for different processing activities tends to be assigned based on where the personal data is being processed. This means the role of controller may be transferred between organisations and research principals together with the personal data. When FA is used, responsibility for processing may be divided among the participating research principals, even if the personal data in question is not transferred. It is crucial to clarify procedures, roles, and relationships to determine who is a controller. All collaborating parties should be aware of the division of responsibilities to avoid any party being able to influence the purpose or means of processing, thus altering the de facto control of processing. The roles of controller and processor are distributed based on how responsibility for the purpose and means of processing is to be allocated. This in turn depends on the intended nature of collaboration and the influence that each of the participating research principals has on the analysis. Although the primary responsibility is to the data subject, there is also a responsibility to the other parties involved in the project.

We propose three different theoretical models for the distribution of responsibility as controller of personal data in FA (Figure 3). They are intended to facilitate an analysis where actors can determine the purpose and means of personal data processing. It is important that policymakers, researchers, technicians, lawyers, and others with key roles in a research project are sufficiently familiar with how FA works, how data protection regulations should be applied, and the intention behind distributing responsibility in the project. If a processing operation serves common scientific interests, the operators jointly determine the purposes of processing, even if they each have their own specific purposes at an earlier or later stage.

[Figure 3 panels]
MODEL 1: Data processing agreements. The research principal for Node 1 is the controller of all personal data processed in the federated analysis. The research principals for Nodes 2 and 3 are processors.
MODEL 2: Sole overall responsibility. A single research principal with overall responsibility, in combination with nodes with independent responsibility.
MODEL 3: Joint controllers. Joint overall responsibility in combination with nodes with independent responsibility.

FIGURE 3 Three theoretical models for the division of controller legal responsibility in a research project that uses Federated analysis. The models illustrate the influence exerted by one or more research principals over operations performed in the central server and in the nodes. The models assume that the operations described can be deemed to fall within the scope of the General Data Protection Regulation (GDPR); for example, in terms of responsibility for implementing security measures or technical measures.

The question of who is to be considered the controller of the personal data being processed is important, given that the controller shall ensure compliance with GDPR in all processing operations for which they are responsible. This implies that the controller shall ensure that data subjects are informed about how their personal data is being processed. Data subjects have the right to obtain the rectification or erasure of personal data concerning them. The controller or processor is also liable to compensate any person who has suffered damage.

There is no requirement pursuant to GDPR for a written agreement between joint controllers, but it is strongly recommended that the division of responsibility, including economic undertakings, and limitations on liability are always documented in a contract or other legally binding document. The division of responsibility and how processing is to be performed within the project must be unambiguous and transparent.

7 | CONCLUSIONS

Federated analysis allows real-time, interactive, centralized statistical analyses on individual-level data, without actual transfer of sensitive personal data between institutions and countries. It has the potential to facilitate collaboration and improve the precision of postmarketing safety studies, by increasing the quantity, variety, and availability of data needed to study rare adverse events. Our review of post-authorisation studies indicates that lack of precision in such studies is a common limitation. The technique may be particularly attractive for situations where repeated collaborative projects are anticipated, and the cohorts are dynamic. A recent example is the need for timely characterisation of the postmarketing safety of COVID-19 vaccines. The Nordic countries, having very similar nationwide health data resources, would be an attractive area for such collaborations. Rheumatic diseases can serve as another example of a therapeutic area that has seen a rapid development of new medicinal products and a need to further characterise their safety profile in clinical practice, and where there are existing disease registers with similar structures in several different countries.38 Such situations would likely benefit from using FA.

Federated analysis does not remove the data harmonisation step, requires reliable support for integrating the FA-specific IT architecture with the respective organisation's general IT architecture and security solutions, and should be based on clear and detailed agreements between involved parties to fully comply with legal obligations. Common statistical methods are available as
mathematical models and software implementations for FA. The implementation of FA requires competent and active involvement of all collaborating parties.

The four full reports from the Swedish Medical Products Agency on IT architecture, legal considerations, statistical methods, and a tutorial are provided in the online Appendix.

AUTHOR CONTRIBUTIONS
Rolf Gedeborg conceptualised and drafted the initial version of the manuscript. All other authors contributed with critical review of the manuscript.

FUNDING INFORMATION
This research was conducted as a development project by the Medical Products Agency, which is a Swedish Government agency. The study did not receive any external funding.

CONFLICT OF INTEREST
Dr Ljung reported receiving grants from Sanofi Aventis paid to his institution outside the submitted work; and receiving personal fees from Pfizer outside the submitted work. All other authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.

ORCID
Rolf Gedeborg https://orcid.org/0000-0002-8850-7863

REFERENCES
1. Azevedo LG, Soares EFdS, Souza R, Moreno MF. Modern Federated Database Systems: An Overview. 2020.
2. Yamaguchi M, Inomata S, Harada S, et al. Establishment of the MID-NET® medical information database network as a reliable and valuable database for drug safety assessments in Japan. Pharmacoepidemiol Drug Saf. 2019;28:1395-1404.
3. Exploring and understanding adverse drug reactions by integrative mining of clinical records and biomedical knowledge. European Commission, 2019. Accessed 2019-12-06, 2019, at https://cordis.europa.eu/project/rcn/85424/factsheet/en
4. Platt R, Brown JS, Robb M, et al. The FDA sentinel initiative - an evolving National Resource. N Engl J Med. 2018;379:2091-2093.
5. Trifiro G, Fourrier-Reglat A, MCJM S, Díaz Acedo C, Van Der Lei J, Group E-A. The EU-ADR project: preliminary results and perspective. Stud Health Technol Inform. 2009;148:43-49.
6. The Book of OHDSI 2021. Accessed 2022-10-10, 2022, at https://ohdsi.github.io/TheBookOfOhdsi/
7. Jones E, Sheehan N, Masca N, Wallace S, Murtagh M, Burton P. DataSHIELD – shared individual-level analysis without sharing data: a biostatistical perspective. Norwegian J Epidemiol. 2012;21:231-239.
8. Wolfson M, Wallace SE, Masca N, et al. DataSHIELD: resolving a conflict in contemporary bioscience--performing a pooled analysis of
10. The Book of OHDSI. Chapter 4: The Common Data Model. 2021. Accessed 2022-10-10, 2022, at https://ohdsi.github.io/TheBookOfOhdsi/
11. Cohen JM, Cesta CE, Kjerpeseth L, et al. A common data model for harmonization in the Nordic pregnancy drug safety studies (NorPreSS). Norsk Epidemiologi. 2021;29:117-123.
12. Gini R, Sturkenboom MCJ, Sultana J, et al. Different strategies to execute multi-database studies for medicines surveillance in real-world setting: a reflection on the European model. Clin Pharmacol Ther. 2020;108:228-235.
13. Scotti L, Rea F, Corrao G. One-stage and two-stage meta-analysis of individual participant data led to consistent summarized evidence: lessons learned from combining multiple databases. J Clin Epidemiol. 2018;95:19-27.
14. Lin DY, Zeng D. Meta-analysis of genome-wide association studies: no efficiency gain in using individual participant data. Genet Epidemiol. 2010;34:60-66.
15. Lin DY, Zeng D. On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika. 2010;97:321-332.
16. Karlstad Ø, Hovi P, Husby A, et al. SARS-CoV-2 vaccination and myocarditis in a Nordic cohort study of 23 million residents. JAMA Cardiol. 2022;7:600-612.
17. Cheung Y-m, Lou J, Yu F. Vertical Federated Principal Component Analysis on Feature-Wise Distributed Data. In Web Information Systems Engineering – WISE 2021: 22nd International Conference on Web Information Systems Engineering, WISE 2021, October 26–29, 2021. Melbourne, VIC, Australia: Springer-Verlag, Berlin, Heidelberg. 2021;173-88.
18. Wilson RC, Butters OW, Avraam D, et al. DataSHIELD – new directions and dimensions. Data Sci J. 2017;16:1-21.
19. Li Y, Jiang X, Wang S, Xiong H, Ohno-Machado L. VERTIcal grid lOgistic regression (VERTIGO). J Am Med Informat Assoc. 2016;23:570-579.
20. Snackerstrom T, Johansen C. De-identified linkage of data across separate registers: a proposal for improved protection of personal information in registry-based clinical research. Ups J Med Sci. 2019;124:29-32.
21. DataSHIELD CRAN - The Comprehensive R Archive Network of DataSHIELD. 2020. Accessed 2020-10-21, at https://cran.datashield.org/
22. Lu CL, Wang S, Ji Z, et al. WebDISCO: a web service for distributed cox model learning without patient-level data sharing. J Am Med Informat Assoc. 2015;22:1212-1219.
23. Banerjee S, Sofack GN, Papakonstantinou T, et al. dsSurvival: privacy preserving survival models for federated individual patient meta-analysis in DataSHIELD. BMC Res Notes. 2022;15:197.
24. Schuemie MJ, Chen Y, Madigan D, Suchard MA. Combining cox regressions across a heterogeneous distributed research network facing small and zero counts. Stat Methods Med Res. 2022;31:438-450.
25. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Chapman & Hall/CRC; 1998.
26. Carstensen B. Who Needs the Cox Model Anyway? (Version 7). In: Steno Diabetes Center C, Denmark, ed. http://bendixcarstensen.com/WntCma.pdf 2019
27. Ali MS, Prieto-Alhambra D, Lopes LC, et al. Propensity score methods in health technology assessment: principles, extended applications, and recent advances. Front Pharmacol. 2019;10:973.
28. Rassen JA, Schneeweiss S.
Using high-dimensional propensity scores individual-level data without sharing the data. Int J Epidemiol. 2010; to automate confounding control in a distributed medical product 39:1372-1382. safety surveillance system. Pharmacoepidemiol Drug Saf. 2012;21- 9. Wu Y, Jiang X, Kim J, Ohno-Machado L. Grid binary LOgistic REgres- (Suppl 1):41-49. sion (GLORE): building shared models without sharing data. J Am Med 29. Rassen JA, Avorn J, Schneeweiss S. Multivariate-adjusted pharmacoe- Informat Assoc. 2012;19:758-764. pidemiologic analyses of confidential information pooled from 10991557, 2023, 3, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/pds.5587 by CochraneItalia, Wiley Online Library on [13/10/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License 286 GEDEBORG ET AL. multiple health care utilization databases. Pharmacoepidemiol Drug 37. GNU General Public License v3 [Internet]. 2007. Accessed Saf. 2010;19:848-857. 2019-11-25, at https://www.gnu.org/licenses/gpl-3.0.html 30. Molenberghs G, Fitzmaurice G, Kenward M, Tsiatis A, Verbeke G, 38. Chatzidionysiou K, Hetland ML, Frisell T, et al. Opportunities and eds. Handbook of Missing Data Methodology. Champman & Hall/CRC; challenges for real-world studies on chronic inflammatory joint dis- 2014. eases through data enrichment and collaboration between national 31. Schafer JL, Graham JW. Missing data: our view of the state of the art. registers: the Nordic example. RMD Open. 2018;4:e000655. Psychol Methods. 2002;7:147-177. 32. van Buuren S. Flexible Imputation of Missing Data. 2nd ed. Chapman & Hall/CRC; 2018. SUPPORTING INF ORMATION 33. Amices/dsMice: DataSHIELD Server-side Functions for the Mice Additional supporting information can be found online in the Support- Package (version 0.2.0). 2020. Accessed 2020-11-03, at https://rdrr. 
ing Information section at the end of this article. io/github/amices/dsMice/ 34. Carstensen B. Who Needs the Cox Model Anyway? (Version 7). C. Steno Diabetes Center; 2019. 35. Budin-Ljosne I, Burton P, Isaeva J, et al. DataSHIELD: an ethically How to cite this article: Gedeborg R, Igl W, Svennblad B, et al. robust solution to multiple-site individual-level data analysis. Public Federated analyses of multiple data sources in drug safety Health Genomics. 2015;18:87-96. studies. Pharmacoepidemiol Drug Saf. 2023;32(3):279‐286. 36. Gaye A, Marcon Y, Isaeva J, et al. DataSHIELD: taking the analysis doi:10.1002/pds.5587 to the data, not the data to the analysis. Int J Epidemiol. 2014;43: 1929-1944.