DOKK Library

European Language Grid - A Language Technology Platform for Multilingual Europe

Authors

License CC-BY-4.0

Plaintext
Cognitive Technologies




Georg Rehm Editor


European
Language
Grid
A Language Technology Platform
for Multilingual Europe
Cognitive Technologies


Editor-in-Chief
Daniel Sonntag, German Research Center for AI, DFKI, Saarbrücken, Saarland,
Germany
Titles in this series now included in the Thomson Reuters Book Citation Index and
Scopus!
The Cognitive Technologies (CT) series is committed to the timely publishing of
high-quality manuscripts that promote the development of cognitive technologies and
systems on the basis of artificial intelligence, image processing and understanding,
natural language processing, machine learning and human-computer interaction.
    It brings together the latest developments in all areas of this multidisciplinary
topic, ranging from theories and algorithms to various important applications. The
intended readership includes research students and researchers in computer science,
computer engineering, cognitive science, electrical engineering, data science and
related fields seeking a convenient way to track the latest findings on the foundations,
methodologies and key applications of cognitive technologies.
    The series provides a publishing and communication platform for all cognitive
technologies topics, including but not limited to these most recent examples:
   Interactive machine learning, interactive deep learning, machine teaching
   Explainability (XAI), transparency, robustness of AI and trustworthy AI
   Knowledge representation, automated reasoning, multiagent systems
   Common sense modelling, context-based interpretation, hybrid cognitive
    technologies
   Human-centered design, socio-technical systems, human-robot interaction,
    cognitive robotics
   Learning with small datasets, never-ending learning, metacognition and
    introspection
   Intelligent decision support systems, prediction systems and warning systems
   Special transfer topics such as CT for computational sustainability, CT in
    business applications and CT in mobile robotic systems
    The series includes monographs, introductory and advanced textbooks, state-of-
the-art collections, and handbooks. In addition, it supports publishing in Open Access
mode.
Georg Rehm
Editor




European Language Grid
A Language Technology Platform
for Multilingual Europe
Editor
Georg Rehm
Deutsches Forschungszentrum
für Künstliche Intelligenz GmbH (DFKI)
Berlin, Germany


The European Language Grid has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement no. 825627.




ISSN 1611-2482                     ISSN 2197-6635 (electronic)
Cognitive Technologies
ISBN 978-3-031-17257-1             ISBN 978-3-031-17258-8 (eBook)
https://doi.org/10.1007/978-3-031-17258-8

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s)
and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this book are included in the book's Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword




I was proud to have the opportunity to present my report on Language Equality in
the Digital Age to the European Parliament in 2018 and even prouder to see the over-
whelming support it received. It was one of my final achievements as a Member of
the European Parliament and I am delighted that it contributed to the groundbreak-
ing work being done on the European Language Grid project. Despite it not being a
legislative report, the level of cross-party support it received meant its recommenda-
tions could not be ignored.
    When I first proposed a report on language equality in the digital age to the Eu-
ropean Parliament’s Culture and Education Committee it provoked a great deal of
interest, as it did in the Industry Committee. This was due to the clear language
inequality in Europe but also to the huge opportunities it presented for the digital in-
dustries. As both committees laid claim to the report, there was some debate before
it was resolved that it would be a Culture Committee report with a written opinion
from the Industry Committee. The latter’s participation strengthened the report and
its impact. It widened the scope to emphasise the importance of the role of private
companies alongside public bodies and of facilitating cross border trade in the Digi-
tal Single Market.
    It was clear from the early days that the European Commission was keen to sup-
port the report and take the proposals forward. As a spokesperson for the Commis-
sioner stated in a conference I organised in parliament in September 2018, “You
are never so wealthy as when you can speak in your own language”. The European
Language Equality project is currently developing a roadmap to achieve language
equality by 2030, which will be presented to the European Institutions later this year.
    Minority languages in particular have most to lose but also most to gain from the
digital age, given the right support. Cultural and linguistic diversity depends largely
on the technological resources available to all languages.
    It was a report by the EU Panel for the Future of Science and Technology, STOA,
that sparked the idea of a parliamentary report. STOA highlighted the social and
economic consequences of language barriers and the widening of the technological
gap. As someone who had long campaigned for equal status for the Welsh language,
I was inspired by the potential that a major EU project could offer.


                                                                                      v
vi                                                                             Foreword

    Even though the dominance of a few well-resourced languages in the digital
world was obvious, the impact of this on other languages had not been adequately
explored. When the discussion began, the interest grew. The increase in technology
presented new threats and new opportunities. This was an issue which literally af-
fected everyone, and most notably children growing up in this digital world. The
role of education is crucial in teaching and understanding language technologies but
also in raising awareness of career opportunities in this industry across Europe.
    The European Union itself, of course, could could play a major role. The institu-
tional framework for the provision of language technology could be improved con-
siderably. I believed that this was such a crucial issue that it deserved the specific
allocation of the portfolio to a European Commissioner. This did not materialise in
the appointment of a new Commission following the European elections in 2019,
but I believe the proposal should be maintained and could be adopted in future.
    The strong support given to my report by the European Parliament was an indica-
tion of support for the exciting language equality work that is taking place now. The
report proposed a dedicated funding programme for research, development and inno-
vation in language technologies with the aim of closing the gap between European
languages.
    This suggestion was a direct result of seeing the existing research being done in
many countries. Identifying the problem went hand in hand with discovering that
there were many individuals and organisations already addressing it and working to
overcome it. They had the information and expertise but needed far more support
and a higher profile. It was clear that the EU could become a trailblazer in research
on digital language technology, given the political will.
    As a politician, the rights of minority languages like my own, Welsh, were at the
heart of my work for justice and equality. For me, language was not merely a means
of communication but central to our culture and identity. The EU claims equality
in diversity but when it came to language equality it fell far short. So in my role as
a Member of the European Parliament I saw an opportunity to help correct this. I
could play my role in parliament but to ensure the report was effective in achieving
its aim it needed the input of the experts, the practitioners and the pioneers in this
field of work to ensure that it was accurate and informed.
    I never fail to be inspired by their work and their dedication and I repeat my thanks
to all those who contributed to the success of the Language Equality in the Digital
Age report and to the remarkable European Language Grid project which established
the primary platform for “language technologies for Europe built in Europe”.


Rhondda Valley, April 2022                                                   Jill Evans
Preface




The origins of this book date back to 2012. Back then, under the umbrella of the EU
Network of Excellence META-NET, we prepared the recommendations and priority
research themes specified in the first Strategic Research Agenda (SRA) for the Eu-
ropean Language Technology (LT) field in a complex, community-driven process.1
While the European LT community is quite extensive, with hundreds of commercial
and academic organisations working on a large and heterogeneous set of technolo-
gies, it is also extremely fragmented with many community members operating only
in narrow niches and limited regions, on very specific topics and quite often only
taking into account one or two regionally confined languages. Through the META-
NET SRA process, we have been able to identify the community’s need for a joint
technology platform that brings the European LT community together, that fosters
collaboration and synergies, that acts as a marketplace and deployment platform, that
functions like the “yellow pages” of the European LT community and through which
essentially all European resources, corpora, datasets and grammars as well as tools,
services and source code can be discovered and actually used, straight from the plat-
form itself. Back in 2013, in the published META-NET SRA, we called this concept
the European Service Platform for Language Technologies. The SRA document only
contained a rather coarse-grained description of this ambitious technology vision,
which has been demanded, for a number of different reasons, by an overwhelming
majority of the members of the LT community.
   Later on, in the three Strategic Research and Innovation Agendas prepared under
the umbrella of the EU project CRACKER (Cracking the Language Barrier; 2015-
2017), we refined the notion of the European LT Service Platform and we extended
the possible use cases and a large number of LT-driven applications, primarily fo-
cusing the multilingual digital single market. Further boosted by the scientific break-
throughs produced in the area of Artificial Intelligence, Machine Learning and Deep
Learning, early on applied to LT applications such as Machine Translation, not only
the topic of Language Technology but also the vision of a joint European Language
Technology Platform became more and more relevant. The topic was mentioned in

1   http://www.meta-net.eu/sra


                                                                                    vii
viii                                                                            Preface

a prominent way in the STOA study Language equality in the digital age2 , commis-
sioned by the European Parliament and also in a European Parliament resolution3
with the same title, adopted by the European Parliament in a landslide vote in 2018
(cf. Jill Evans’ foreword).
    Roughly at the same time, in late 2017, we started preparing a project proposal
for the Horizon 2020 ICT call, topic ICT-29 a), European Language Grid, which
fortunately reflected the vast interest within the community in such a platform. After
various unsuccessful attempts at coming up with a good title for the proposal, we
decided to use the title of the actual call because it fit perfectly. Having passed the
evaluation with a positive result, the project started in January 2019. We had an
enthusiastic kick-off meeting, exciting hackathons and developed the first prototype
of the platform in a fast and agile way. It was first presented to the public at META-
FORUM 2019, which took place in Brussels in October that year.
    Only a few weeks later, the global SARS-CoV-2 pandemic hit. The whole world
was affected and so was our project plan. We were unable to have face-to-face project
meetings or additional hackathons, we were unable to organise any on-site work-
shops with our 32 ELG National Competence Centres as part of the “ELG Euro-
pean Roadshow”. All meetings, including our annual META-FORUM conferences
in 2020 and 2021, had to go virtual, which was new to us at first and quickly became
the new normal. Recently, in early June 2022, we had our last META-FORUM con-
ference under the umbrella of the ELG EU project. META-FORUM 2022 went back,
at least partially, to the old normal with approx. 100 participants in the conference
centre in Brussels and hundreds more participating remotely.
    It was nothing but a pleasure to act as Coordinator of the European Language Grid
project and to work together with such a strong and dedicated team. Our original plan
in this Innovation Action was already quite ambitious yet we managed to exceed our
joint expectations in terms of the technology platform and its features, in terms of the
services and resources developed, collected and ingested into the platform, in terms
of the acceptance and feedback by the community and also in terms of the various
collaborations we conducted with other projects. Many of the features envisioned
for the European Service Platform for Language Technologies in 2013 are in fact
now finally available in the European Language Grid, which is, by a large margin,
the biggest all-purpose Language Technology platform on the planet covering the
whole breadth and technology spectrum of the field.
    All of the activities and results produced by the nine partners of the ELG con-
sortium during the project’s runtime are described in this book in detail. I would
like to thank all consortium partners and team members for their extremely hard
and dedicated work towards our common goal of developing and establishing the
ELG platform, community and marketplace. In addition, I would like to thank the
15 selected pilot projects for their innovative proposals and the more than 200 organ-
isations who applied for funding through one of our pilot projects. Thanks are also
due to the projects ELG collaborated with, especially, in 2021/2022, the European

2   https://www.europarl.europa.eu/stoa/en/document/EPRS_STU(2017)598621
3   https://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.html
Preface                                                                                        ix

Language Equality project, the results of which will also be documented in the form
of a book in the same series, but also others such as Bergamot, COMPRISE, ELITR,
EMBEDDIA, Gourmet, Prêt-à-LLOD, AI4EU, HumanE AI Net, VISION, TAILOR,
WeVerify, NTEU, Microservices at your Service, MAPA, QURATOR, PANQURA,
SPEAKER and many others.
   This book is the definitive documentation of the EU project European Language
Grid.4 I would like to thank all colleagues from the ELG consortium and also from
the ELG pilot projects wholeheartedly for the chapters they contributed, without
which this book would not have been possible.
   While this book can only cover the results achieved during the project’s runtime
(January 2019 until June 2022), the ELG initiative will continue. In the second half of
2022 we will establish a legal entity that will take over maintenance and operation of
the platform. We hope that ELG will serve its many purposes and, among others, ad-
dress the stark community fragmentation and contribute to digital language equality
in Europe, functioning indeed as one joint umbrella platform for the whole European
LT community. Furthermore, while none of these can be considered a direct follow-
up just yet, in a few projects (including OpenGPT-X, NFDI4DataScience and AI
as well as the EU projects DataBri-X and SciLake) we will have the opportunity to
continue our work with and on the ELG platform.


Berlin, July 2022                                                                 Georg Rehm

Acknowledgements The European Language Grid EU project has received funding from the Euro-
pean Union’s Horizon 2020 research and innovation programme under grant agreement no. 825627.




4 https://european-language-grid.readthedocs.io provides more details with regard to technical
aspects of the ELG platform. The online documentation is actively maintained and kept up to date.
Contents




1   European Language Grid: Introduction . . . . . . . . . . . . . . . . . . . . . . . . .                                  1
    Georg Rehm
    1     Overview and Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    1
    2     The European Language Grid EU Project . . . . . . . . . . . . . . . . . . . . .                                   4
    3     Beyond the ELG EU Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         6
    4     Summary of this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    7
          4.1          Part I: ELG Cloud Platform . . . . . . . . . . . . . . . . . . . . . . . . .                         7
          4.2          Part II: ELG Inventory of Technologies and Resources . .                                             8
          4.3          Part III: ELG Community and Initiative . . . . . . . . . . . . . . .                                 8
          4.4          Part IV: ELG Open Calls and Pilot Projects . . . . . . . . . . . .                                   8
    References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    9

Part I ELG Cloud Platform

2   The European Language Grid Platform: Basic Concepts . . . . . . . . . .                                                13
    Stelios Piperidis, Penny Labropoulou, Dimitris Galanis, Miltos
    Deligiannis, and Georg Rehm
    1      Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        13
    2      Overview of the ELG Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                        14
           2.1         Catalogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           15
           2.2         Repository of Language Resources and Technologies . . .                                             16
           2.3         Running Language Technology Cloud Services . . . . . . . .                                          16
    3      User Types and User Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     17
    4      Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        18
    5      Catalogue Contents and Metadata Model . . . . . . . . . . . . . . . . . . . . . .                               19
    6      Publication Life Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                24
    7      ELG and the FAIR Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                       25
    8      Related Platforms and Infrastructures . . . . . . . . . . . . . . . . . . . . . . . . .                         27
    9      Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         30
    References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   30
    Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   34


                                                                                                                           xi
xii                                                                                                                  Contents

3     Using the European Language Grid as a Consumer . . . . . . . . . . . . . . .                                           37
      Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano,
      Athanasia Kolovou, Dimitris Gkoumas, Andis Lagzdiņš, and Stelios
      Piperidis
      1      Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        37
      2      Web-based Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               39
             2.1         Viewing the Catalogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     39
             2.2         Searching the Catalogue . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     40
             2.3         Viewing Metadata Records and Resources . . . . . . . . . . . . .                                    43
             2.4         Consumer’s Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 46
             2.5         Try out UIs for Language Technology Services . . . . . . . . .                                      48
      3      Public REST APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                50
             3.1         Accessing and Using the Catalogue . . . . . . . . . . . . . . . . . .                               51
             3.2         Downloading a Resource . . . . . . . . . . . . . . . . . . . . . . . . . . .                        51
             3.3         Language Technology Service Public API . . . . . . . . . . . . .                                    51
      4      Python SDK for Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  55
             4.1         Browsing the Catalogue . . . . . . . . . . . . . . . . . . . . . . . . . . . .                      55
             4.2         Downloading a Resource . . . . . . . . . . . . . . . . . . . . . . . . . . .                        56
             4.3         Obtaining an Access Token . . . . . . . . . . . . . . . . . . . . . . . . .                         56
             4.4         Calling Language Technology Services . . . . . . . . . . . . . . .                                  57
      5      User Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               58
      6      Licensing and Billing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               59
      7      Consumer-Related Functionalities in ELG and other Platforms . . .                                               60
             7.1         Catalogue and Repository Functionalities . . . . . . . . . . . . .                                  60
             7.2         Language Technology Service Execution . . . . . . . . . . . . . .                                   61
      8      Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         64
      References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   65
4     Contributing to the European Language Grid as a Provider . . . . . . . .                                               67
      Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis,
      Leon Voukoutis, Katerina Gkirtzou, Rémi Calizzano, Athanasia
      Kolovou, Dimitris Gkoumas, and Stelios Piperidis
      1      Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        67
      2      Adding Resources to the ELG Platform . . . . . . . . . . . . . . . . . . . . . . .                              68
             2.1      Creating Metadata Records . . . . . . . . . . . . . . . . . . . . . . . . .                            69
             2.2      Uploading and Managing Data Files . . . . . . . . . . . . . . . . . .                                  71
             2.3      Managing Catalogue Entries . . . . . . . . . . . . . . . . . . . . . . . .                             72
      3      Validating and Publishing Metadata Records . . . . . . . . . . . . . . . . . .                                  74
      4      Entity-Type Specific Requirements . . . . . . . . . . . . . . . . . . . . . . . . . .                           74
             4.1      ELG-compatible Services . . . . . . . . . . . . . . . . . . . . . . . . . . .                          75
             4.2      ELG-hosted Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         88
             4.3      Metadata Records for External Resources . . . . . . . . . . . . .                                      89
      5      Provider-Related Functionalities in ELG and other Platforms . . . . .                                           89
             5.1      Metadata Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . .                          89
             5.2      Provider User Interface and Metadata User Interface . . . .                                            90
Contents                                                                                                                    xiii

           5.3          Try Out User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    91
           5.4          Helper Tools for Packaging Resources . . . . . . . . . . . . . . . .                                91
           5.5          Packaging Data Resources . . . . . . . . . . . . . . . . . . . . . . . . . .                        91
     6     Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .          92
     References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   92
5    Cloud Infrastructure of the European Language Grid . . . . . . . . . . . . . 95
     Florian Kintzel, Rémi Calizzano, and Georg Rehm
     1      Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
     2      Cloud Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
            2.1         Kubernetes and Cloud Native . . . . . . . . . . . . . . . . . . . . . . . 97
            2.2         Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
            2.3         Software Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
            2.4         Container Registries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
     3      Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
            3.1         ELG Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
            3.2         Third-Party Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
     4      Scalability of LT Tools and Services . . . . . . . . . . . . . . . . . . . . . . . . . 104
            4.1         Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
            4.2         Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
     5      Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
     References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6    Interoperable Metadata Bridges to the wider Language Technology
     Ecosystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
     Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon
     Voukoutis, Maria Giagkou, Ondřej Košarko, Jan Hajič, and Georg
     Rehm
     1     Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
     2     Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
     3     Establishing Interoperable Connections: Four Use Cases . . . . . . . . 110
           3.1          Use Case 1: OAI-PMH (CLARIN, ELRC-SHARE) . . . . . 110
           3.2          Use Case 2: Custom API (Hugging Face) . . . . . . . . . . . . . 112
           3.3          Use Case 3: General Catalogues (Zenodo) . . . . . . . . . . . . . 113
           3.4          Use Case 4: Collaborative Initiatives (ELE, ELG) . . . . . . 115
           3.5          Summary of Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
     4     Implementing Metadata Interoperability . . . . . . . . . . . . . . . . . . . . . . 117
           4.1          ELG Metadata Schema – Relaxed Version . . . . . . . . . . . . . 118
           4.2          Publication Policies for Imported Metadata Records . . . . 118
     5     Interoperability across Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . 119
           5.1          Technical Interoperability across Repositories . . . . . . . . . . 120
           5.2          Semantic Interoperability across Repositories . . . . . . . . . . 120
           5.3          Minimal Metadata Requirements . . . . . . . . . . . . . . . . . . . . . 121
           5.4          Duplicate Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
     6     Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
     References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
xiv                                                                                                               Contents

Part II ELG Inventory of Technologies and Resources

7     Language Technology Tools and Services . . . . . . . . . . . . . . . . . . . . . . . . 131
      Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca, Jose Manuel
      Gómez-Pérez, Miroslav Jánošík, Dimitris Galanis, Rémi Calizzano,
      Andis Lagzdiņš, Milan Straka, and Ulrich Germann
      1     Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
      2     Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
      3     Automatic Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
            3.1          Case Study: Speech Tools from HENSOLDT . . . . . . . . . . 137
      4     Text Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
            4.1          Case Study: Cogito Discover from Expert.AI . . . . . . . . . . 139
            4.2          Case Study: GATE from University of Sheffield . . . . . . . . 140
            4.3          Case Study: Microservices At Your Service . . . . . . . . . . . . 141
      5     Other Service Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
            5.1          Pilot Project: Text2TCS, University of Vienna . . . . . . . . . 144
            5.2          Pilot Project: MKS as LLOD, Coreon . . . . . . . . . . . . . . . . . 145
      6     Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
      References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8     Datasets, Corpora and other Language Resources . . . . . . . . . . . . . . . . 151
      Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault,
      Penny Labropoulou, Miltos Deligiannis, Leon Voukoutis, and Stelios
      Piperidis
      1      Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
      2      Identification of Language Resources and Repositories . . . . . . . . . . 152
             2.1         Identification by the Consortium . . . . . . . . . . . . . . . . . . . . . 153
             2.2         Identification by the National Competence Centres . . . . . 153
             2.3         Collaboratively Filling the Gaps . . . . . . . . . . . . . . . . . . . . . 153
      3      Integrating Repositories into ELG . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
             3.1         Priorities in the Ingestion Work . . . . . . . . . . . . . . . . . . . . . . 155
             3.2         Contributing Language Resources . . . . . . . . . . . . . . . . . . . . 156
      4      Procedures to Ingest Language Resources . . . . . . . . . . . . . . . . . . . . . 157
             4.1         Metadata Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
             4.2         Metadata Extraction and Completion . . . . . . . . . . . . . . . . . 158
             4.3         Metadata Harvesting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
      5      Language Resources in the ELG Catalogue . . . . . . . . . . . . . . . . . . . . 162
      6      Language Resources and Legal Issues . . . . . . . . . . . . . . . . . . . . . . . . 164
      7      Language Resources and Data Management . . . . . . . . . . . . . . . . . . . 166
      8      Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
      References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Contents                                                                                                                   xv

9      Language Technology Companies, Research Organisations and
       Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
       Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny
       Labropoulou
       1     Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
       2     The European Language Technology Landscape . . . . . . . . . . . . . . . 174
       3     Organisations in the European Language Grid . . . . . . . . . . . . . . . . . 177
             3.1            Collecting the Members of the European LT Community 178
             3.2            Preparation and Integration of Metadata Records . . . . . . . 178
             3.3            Claiming and Enriching Organisation Pages . . . . . . . . . . . 179
             3.4            Organisation Pages in the European Language Grid . . . . . 180
       4     Projects in the European Language Grid . . . . . . . . . . . . . . . . . . . . . . 180
       5     Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
       References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

Part III ELG Community and Initiative

10 European Language Technology Landscape:
   Communication and Collaborations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
   Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens
   1     Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
   2     Stakeholders of the European Language Grid . . . . . . . . . . . . . . . . . . 191
         2.1          Language Technology Providers . . . . . . . . . . . . . . . . . . . . . 191
         2.2          Language Technology Users . . . . . . . . . . . . . . . . . . . . . . . . 192
         2.3          Additional Horizon 2020 EU Projects . . . . . . . . . . . . . . . . . 194
         2.4          Major European Projects and Initiatives . . . . . . . . . . . . . . . 194
         2.5          National Competence Centres . . . . . . . . . . . . . . . . . . . . . . . 194
         2.6          Public at Large . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
   3     Communication and Outreach Activities . . . . . . . . . . . . . . . . . . . . . . 195
         3.1          Communication Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
         3.2          Communication Campaign . . . . . . . . . . . . . . . . . . . . . . . . . . 196
   4     Collaborations with other Projects and Initiatives . . . . . . . . . . . . . . . 199
   5     Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
11     ELG National Competence Centres and Events . . . . . . . . . . . . . . . . . . 205
       Katrin Marheinecke, Annika Grützner-Zahn, and Georg Rehm
       1      Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
       2      National Competence Centres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
              2.1      Tasks and Responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . . 206
              2.2      Role and Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
              2.3      Visibility and Promotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
              2.4      Operational Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
       3      Conferences and Workshops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
              3.1      META-FORUM Conference Series . . . . . . . . . . . . . . . . . . 210
              3.2      ELG Workshops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
xvi                                                                                                                Contents

             3.3          Additional Conferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
       4     Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
       References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
12 Innovation and Marketplace:
   A Vision for the European Language Grid . . . . . . . . . . . . . . . . . . . . . . . 219
   Katja Prinz and Gerhard Backfried
   1      Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
   2      Innovation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
          2.1         Significance of Innovation . . . . . . . . . . . . . . . . . . . . . . . . . . 220
          2.2         Types of Innovation and Innovation Strategies . . . . . . . . . 221
          2.3         Open Innovation in the ELG Platform and Marketplace . . 222
   3      Multi-sided Marketplace Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 226
          3.1         Foundations for a Successful Marketplace . . . . . . . . . . . . . 228
          3.2         ELG Ecosystem of Participants . . . . . . . . . . . . . . . . . . . . . . 230
          3.3         Technical and Practical Aspects . . . . . . . . . . . . . . . . . . . . . . 231
   4      Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
13 Sustaining the European Language Grid:
   Towards the ELG Legal Entity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
   Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis,
   Kalina Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs,
   Gerhard Backfried, Katja Prinz, Jose Manuel Gómez-Pérez, and Ulrich
   Germann
   1      Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
   2      Long-term Vision and Mission of ELG . . . . . . . . . . . . . . . . . . . . . . . 234
          2.1         Mission of the European Language Grid . . . . . . . . . . . . . . 235
          2.2         Added Value for Stakeholders . . . . . . . . . . . . . . . . . . . . . . . 236
   3      Main Pillars of the Business and Operational Model . . . . . . . . . . . . 236
          3.1         Expectations by the ELG Consortium’s SME Partners . . . 237
          3.2         Key Aspects of the ELG Legal Entity . . . . . . . . . . . . . . . . . 239
          3.3         Assessment of Operational Costs . . . . . . . . . . . . . . . . . . . . 240
          3.4         Business Model Canvas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
          3.5         Product Portfolio and Revenue Streams . . . . . . . . . . . . . . . 245
          3.6         Legal Entity Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
   4      Summary and Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

Part IV ELG Open Calls and Pilot Projects

14 Open Calls and Pilot Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
   Lukáš Kačena, Jana Hamrlová, and Jan Hajič
   1     Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
   2     Organisation of the Open Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
         2.1      Management Structure and Organisation . . . . . . . . . . . . . . 258
Contents                                                                                                                 xvii

             2.2          Timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
             2.3          Communication with Stakeholders . . . . . . . . . . . . . . . . . . . 259
             2.4          Submission Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
             2.5          Evaluation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
       3     Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
             3.1          Open Call 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
             3.2          Open Call 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
       4     Pilot Project Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
       5     Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
       References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
15 Basque-speaking Smart Speaker based on Mycroft AI . . . . . . . . . . . . . 271
   Igor Leturia, Ander Corral, Xabier Sarasola, Beñat Jimenez, Silvia
   Portela, Arkaitz Anza, and Jaione Martinez
   1      Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 271
   2      Mycroft Localisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
   3      Privacy, Gender and Proximity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
   4      Developments in Basque Speech Technology . . . . . . . . . . . . . . . . . . 273
          4.1         ASR Robustness in Noisy Environments . . . . . . . . . . . . . . 273
          4.2         ASR Closed Grammar-based Recognition . . . . . . . . . . . . . 274
          4.3         Neural Network-based Basque TTS . . . . . . . . . . . . . . . . . . 274
          4.4         Gender-neutral Voice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
   5      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 275
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
16 CEFR Labelling and Assessment Services . . . . . . . . . . . . . . . . . . . . . . . 277
   Mark Breuker
   1     Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 277
   2     Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
   3     Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
   4     Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
   5     Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 281
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
17 European Clinical Case Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
   Bernardo Magnini, Begoña Altuna, Alberto Lavelli, Anne-Lyse Minard,
   Manuela Speranza, and Roberto Zanoli
   1     Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 283
   2     Corpus Collection and Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
   3     Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
   4     Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
   5     Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 287
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
xviii                                                                                                          Contents

18 Extracting Terminological Concept Systems from Natural
   Language Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
   Dagmar Gromann, Lennart Wachowiak, Christian Lang, and Barbara
   Heinisch
   1      Overview and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
   2      Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
          2.1         Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
          2.2         Term Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
          2.3         Relation Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
          2.4         Postprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
   3      Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
   4      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 293
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
19 Italian EVALITA Benchmark Linguistic Resources, NLP Services
   and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
   Viviana Patti, Valerio Basile, Andrea Bolioli, Alessio Bosca, Cristina
   Bosco, Michael Fell, and Rossella Varvara
   1      Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 295
   2      Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
          2.1         Surveying the EVALITA Tasks . . . . . . . . . . . . . . . . . . . . . . 296
          2.2         The EVALITA Knowledge Graph . . . . . . . . . . . . . . . . . . . . 297
          2.3         Anonymisation of Resources . . . . . . . . . . . . . . . . . . . . . . . . 298
          2.4         Release of Data and Models through ELG . . . . . . . . . . . . . 299
   3      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 299
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
20 Lingsoft Solutions as Distributable Containers . . . . . . . . . . . . . . . . . . . 301
   Sebastian Andersson and Michael Stormbom
   1      Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 301
   2      Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
   3      Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
   4      Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
   5      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 305
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
21 Motion Capture 3D Sign Language Resources . . . . . . . . . . . . . . . . . . . 307
   Zdeněk Krňoul, Pavel Jedlička, Miloš Železný, and Luděk Müller
   1     Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 307
   2     Methodology and Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
         2.1          Recording Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
         2.2          Data Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
         2.3          Data Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
         2.4          Dataset Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
   3     Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 311
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Contents                                                                                                              xix

22 Multilingual Image Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
   Svetla Koeva
   1      Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 313
   2      Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
          2.1         Ontology of Visual Objects . . . . . . . . . . . . . . . . . . . . . . . . . 314
          2.2         Collection of Images and Metadata . . . . . . . . . . . . . . . . . . . 315
   3      Criteria for the Selection of Images . . . . . . . . . . . . . . . . . . . . . . . . . . 315
          3.1         Generation and Evaluation of Suggestions . . . . . . . . . . . . . 316
          3.2         Annotation Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
   4      Multilingual Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
   5      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 317
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
23 Multilingual Knowledge Systems as Linguistic Linked Open Data . . . 319
   Alena Vasilevich and Michael Wetzel
   1      Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 319
   2      Making Coreon Data Structure LLOD-compatible . . . . . . . . . . . . . . 320
   3      Real-Time Data Access via a SPARQL Endpoint . . . . . . . . . . . . . . . 323
   4      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 323
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
24 Open Translation Models, Tools and Services . . . . . . . . . . . . . . . . . . . . 325
   Jörg Tiedemann, Mikko Aulamo, Sam Hardwick, and Tommi Nieminen
   1      Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 325
   2      Increasing Language Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
   3      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 329
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
25 Sign Language Explanations for Terms in a Text . . . . . . . . . . . . . . . . . 331
   Helmut Ludwar and Julia Schuster
   1     Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 331
   2     Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
   3     Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
   4     Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
   5     Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 335
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
26 Streaming Language Processing in Manufacturing . . . . . . . . . . . . . . . 337
   Patrick Wiener and Steffen Thoma
   1      Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 337
   2      Graphical, Flow-based Modeling with Apache StreamPipes . . . . . . 338
   3      Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
   4      Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
   5      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 341
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
xx                                                                                                             Contents

27 Textual Paraphrase Dataset for Deep Language Modelling . . . . . . . . . 343
   Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Valtteri Skantsi, Jemina
   Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija
   Sevón, and Otto Tarkka
   1      Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 343
   2      Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
   3      Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
   4      Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
   5      Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 347
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
28 Universal Semantic Annotator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
   Roberto Navigli, Riccardo Orlando, Cesare Campagnano, and Simone
   Conia
   1     Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 349
   2     Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
   3     Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
   4     Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
   5     Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 353
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
29 Virtual Personal Assistant Prototype YouTwinDi . . . . . . . . . . . . . . . . . 355
   Franz Weber and Gregor Jarisch
   1     Overview and Objectives of the Pilot Project . . . . . . . . . . . . . . . . . . 355
   2     Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
         2.1          Use Case 1: Automated Translation of local News . . . . . . 357
         2.2          Use Case 2: Secure Communication between Virtual
                     Assistants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
   3     Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
   4     Conclusions and Results of the Pilot Project . . . . . . . . . . . . . . . . . . . 359
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
List of Contributors




European Language Grid EU Project (Parts I, II and III)
Victoria Arranz
ELDA, France, arranz@elda.org
Gerhard Backfried
HENSOLDT Analytics GmbH, Austria, gerhard.backfried@hensoldt.net
Cristian Berrìo Aroca
Expert AI, Spain, cberrio@expert.ai
Kalina Bontcheva
University of Sheffield, UK, k.bontcheva@sheffield.ac.uk
Rémi Calizzano
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany,
remi.calizzano@dfki.de
Khalid Choukri
ELDA, France, choukri@elda.org
Miltos Deligiannis
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
mdel@athenarc.gr
Dimitris Galanis
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
galanisd@athenarc.gr
Andres Garcia Silva
Expert AI, Spain, agarcia@expert.ai
Ulrich Germann
University of Edinburgh, UK, ulrich.germann@ed.ac.uk



                                                                         xxi
xxii                                                            List of Contributors

Maria Giagkou
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
mgiagkou@athenarc.gr
Katerina Gkirtzou
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
katerina.gkirtzou@athenarc.gr
Dimitris Gkoumas
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
dgkoumas@athenarc.gr
Jose Manuel Gómez-Pérez
Expert AI, Spain, jmgomez@expert.ai
Annika Grützner-Zahn
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany,
annika.gruetzner-zahn@dfki.de
Jan Hajič
Charles University, Czech Republic, hajic@ufal.mff.cuni.cz
Jana Hamrlová
Charles University, Czech Republic, hamrlova@ufal.mff.cuni.cz
Stefanie Hegele
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany,
stefanie.hegele@dfki.de
Miroslav Jánošík
HENSOLDT Analytics GmbH, Austria, miroslav.janosik@hensoldt-analytics.com
Lukáš Kačena
Charles University, Czech Republic, kacena@ufal.mff.cuni.cz
Florian Kintzel
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany,
florian.kintzel@dfki.de
Athanasia Kolovou
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
akolovou@athenarc.gr
Ondřej Košarko
Charles University, Czech Republic, kosarko@ufal.mff.cuni.cz
Jens-Peter Kückens
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany,
jens_peter.kueckens@dfki.de
Penny Labropoulou
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
penny@athenarc.gr
List of Contributors                                                        xxiii

Andis Lagzdiņš
Tilde, Latvia, andis.lagzdins@tilde.lv
Valérie Mapelli
ELDA, France, mapelli@elda.org
Katrin Marheinecke
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany,
katrin.marheinecke@dfki.de
Stelios Piperidis
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
spip@athenarc.gr
Katja Prinz
HENSOLDT Analytics GmbH, Austria, katja.prinz@hensoldt.net
Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany,
georg.rehm@dfki.de
Mickaël Rigault
ELDA, France, mickael@elda.org
Ian Roberts
University of Sheffield, UK, i.roberts@sheffield.ac.uk
Milan Straka
Charles University, Czech Republic, straka@ufal.mff.cuni.cz
Andrejs Vasiļjevs
Tilde, Latvia, andrejs@tilde.lv
Leon Voukoutis
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
leon.voukoutis@athenarc.gr


European Language Grid FSTP Pilot Projects (Part IV)
Begoña Altuna
Fondazione Bruno Kessler, Italy, HiTZ Centre, University of the Basque Country,
Spain, begona.altuna@ehu.eus
Sebastian Andersson
Lingsoft, Finland, sebastian.andersson@lingsoft.fi
Arkaitz Anza
Skura Mobile, Spain, arkaitz@skuramobile.com
Mikko Aulamo
University of Helsinki, Finland, mikko.aulamo@helsinki.fi
xxiv                                                            List of Contributors

Valerio Basile
University of Turin, Italy, valerio.basile@unito.it
Andrea Bolioli
CELI, Italy, andrea.bolioli@h-farm.com
Alessio Bosca
CELI, Italy, alessio.bosca@h-farm.com
Cristina Bosco
University of Turin, Italy, cristina.bosco@unito.it
Mark Breuker
EDIA b. v., The Netherlands, mark@edia.nl
Cesare Campagnano
Sapienza University of Rome, Italy, campagnano@di.uniroma1.it
Li-Hsin Chang
University of Turku, Finland, lhchan@utu.fi
Simone Conia
Sapienza University of Rome, Italy, conia@di.uniroma1.it
Ander Corral
Elhuyar Fundazioa, Spain, a.corral@elhuyar.eus
Michael Fell
University of Turin, Italy, michael.fell@unito.it
Filip Ginter
University of Turku, Finland, figint@utu.fi
Dagmar Gromann
University of Vienna, Austria, dagmar.gromann@univie.ac.at
Sam Hardwick
University of Helsinki, Finland, sam.hardwick@helsinki.fi
Barbara Heinisch
University of Vienna, Austria, barbara.heinisch@univie.ac.at
Gregor Jarisch
Labs.ai, Austria, gregor@labs.ai
Pavel Jedlička
University of West Bohemia, Czech Republic, jedlicka@ntis.zcu.cz
Beñat Jimenez
Talaios Koop., Spain, jimakker@talaios.coop
Jenna Kanerva
University of Turku, Finland, jmnybl@utu.fi
List of Contributors                                                         xxv

Jemina Kilpeläinen
University of Turku, Finland, jemina.j.kilpelainen@utu.fi
Svetla Koeva
Institute for Bulgarian Language, Bulgarian Academy of Sciences, Bulgaria,
svetla@dcl.bas.bg
Zdeněk Krňoul
University of West Bohemia, Czech Republic, zdkrnoul@ntis.zcu.cz
Hanna-Mari Kupari
University of Turku, Finland, hmknie@utu.fi
Christian Lang
University of Vienna, Austria, christian.lang@univie.ac.at
Alberto Lavelli
Fondazione Bruno Kessler, Italy, lavelli@fbk.eu
Igor Leturia
Elhuyar Fundazioa, Spain, i.leturia@elhuyar.eus
Helmut Ludwar
Sign Time GmbH, Austria, helmut.ludwar@signtime.media
Bernardo Magnini
Fondazione Bruno Kessler, Italy, magnini@fbk.eu
Jaione Martinez
Skura Mobile, Spain, jaione@skuramobile.com
Anne-Lyse Minard
Université d’Orléans, France, anne-lyse.minard@univ-orleans.fr
Luděk Müller
University of West Bohemia, Czech Republic, muller@kky.zcu.cz
Roberto Navigli
Sapienza University of Rome, Italy, navigli@diag.uniroma1.it
Tommi Nieminen
University of Helsinki, Finland, tommi.nieminen@helsinki.fi
Riccardo Orlando
Sapienza University of Rome, Italy, orlando@diag.uniroma1.it
Viviana Patti
University of Turin, Italy, viviana.patti@unito.it
Aurora Piirto
University of Turku, Finland, aurora.e.piirto@utu.fi
Silvia Portela
Skura Mobile, Spain, silvia@skuramobile.com
xxvi                                                            List of Contributors

Jenna Saarni
University of Turku, Finland, jensaay@utu.fi
Xabier Sarasola
Elhuyar Fundazioa, Spain, x.sarasola@elhuyar.eus
Julia Schuster
Sign Time GmbH, Austria, julia.schuster@signtime.media
Maija Sevón
University of Turku, Finland, maija.suonpaa@gmail.com
Valtteri Skantsi
University of Turku, Finland, valtteri.skantsi@oulu.fi
Manuela Speranza
Fondazione Bruno Kessler, Italy, manspera@fbk.eu
Michael Stormbom
Lingsoft, Finland, michael.stormbom@lingsoft.fi
Otto Tarkka
University of Turku, Finland, ohitar@utu.fi
Steffen Thoma
FZI Research Center for Information Technology, Germany, thoma@fzi.de
Jörg Tiedemann
University of Helsinki, Finland, jorg.tiedemann@helsinki.fi
Rossella Varvara
University of Turin, Italy, rossella.varvara@unito.it
Alena Vasilevich
Coreon GmbH, Germany, alena@coreon.com
Lennart Wachowiak
University of Vienna, Austria, lennart.wachowiak@univie.ac.at
Franz Weber
Labs.ai, Austria, franz@labs.ai
Michael Wetzel
Coreon GmbH, Germany, michael@coreon.com
Patrick Wiener
FZI Research Center for Information Technology, Germany, wiener@fzi.de
Roberto Zanoli
Fondazione Bruno Kessler, Italy, zanoli@fbk.eu
Miloš Železný
University of West Bohemia, Czech Republic, zelezny@ntis.zcu.cz
Acronyms




AI         Artificial Intelligence
AMR        Abstract Meaning Representation
API        Application Programming Interface
ASL        American Sign Language
ASR        Automatic Speech Recognition
ATE        Automated Term Extraction
BMC        Business Model Canvas
CAS        Common Analysis System
CAT        Computer-assisted Translation
CC         Creative Commons
CD         Continuous Deployment
CEF        Connecting Europe Facility
CEFR       Common European Framework of Reference
CI         Continuous Integration
CLAIRE     Confederation of Laboratories for AI Research in Europa
CLARIN     Common Language Resources and Technology Infrastructure
CLI        Command-Line Interface
CMDI       Component Metadata Infrastructure
CMS        Content Management System
COAR       Controlled Vocabularies for Repositories
COMPRISE   Cost-effective, Multilingual, Privacy-driven, Voice-enabled Ser-
           vices
CPU        Central Processing Unit
CRACKER    Cracking the Language Barrier
CSE        Czech Sign Language
CSS        Cascading Style Sheets
CURLICAT   Curated Multilingual Language Resources for CEF AT
DC         Data Controller
DC         Dublin Core
DCAT       Data Catalog Vocabulary
DMP        Data Management Plan


                                                                      xxvii
xxviii                                                      Acronyms

DOI        Digital Object Identifier
DSDE       Development of Slovene in a Digital Environment
EEA        European Economic Area
EEIG       European Economic Interest Grouping
EFNIL      European Federation of National Institutions for Language
ELE        European Language Equality
ELG        European Language Grid
ELG R1     European Language Grid Release 1
ELG R2     European Language Grid Release 2
ELG R3     European Language Grid Release 3
ELITR      European Live Translator
ELRA       European Language Resource Association
ELRC       European Language Resource Coordination
ELT        European Language Technology
EMBEDDIA   Cross-lingual Embeddings for Less-Represented Languages in
           European News Media
EOSC       European Open Science Cloud
EUCPT      EU Council Presidency Translator
FAIR       Findable, Accessible, Interoperable, Reusable
FSTP       Financial Support to Third Parties
GATE       General Architecture for Text Engineering
GDPR       General Data Protection Regulation
GPU        Graphics Processing Unit
GUI        Graphical User Interface
HF         Hugging Face
HLT        Human Language Technology
HMM        Hidden Markov Models
HPA        Horizontal Pod Autoscaler
HTML       Hypertext Markup Language
HTTP       Hypertext Transfer Protocol
IAA        Inter-Annotator Agreement
ICT        Information and Communication Technology
IE         Information Extraction
IRI        Internationalised Resource Identifier
JSON       JavaScript Object Notation
JVM        Java Virtual Machine
KG         Knowledge Graph
KPI        Key Performance Indicator
KWS        Keyword-Spotting
LCR        Lexical or Conceptual Resource
LLOD       Linguistic Linked Open Data
LOD        Linked Open Data
LR         Language Resource
LRT        Language Resources and Language Technologies
LSDISCO    Lingsoft Solutions as Distributable Containers
Acronyms                                                               xxix

LT           Language Technology
MAPA         Multilingual Anonymisation for Public Administrations
MARCELL      Multilingual Resources for CEF.AT in the Legal Domain
META         Multilingual Europe Technology Alliance
META-NET     Network of Excellence forging the Multilingual Europe Technol-
             ogy Alliance
MIME         Multipurpose Internet Mail Extensions
MT           Machine Translation
MVP          Minimum Viable Product
NAP          National Anchor Point
NCC          National Competence Centre
NCP          National Contact Point
NER          Named Entity Recognition
NLP          Natural Language Processing
NLU          Natural Language Understanding
NMT          Neural Machine Translation
NTEU         Neural Translation for the European Union
OA           Open Access
OAI-PMH      Open Archives Initiative Protocol for Metadata Harvesting
OCR          Optical Character Recognition
OLAC         Open Language Archives Community
OWL          Web Ontology Language
PB           Pilot Board
PID          Persistent Identifier
POS          Part of Speech
PRINCIPLE    Providing Resources in Irish, Norwegian, Croatian and Icelandic
             for Purposes of Language Engineering
PROVENANCE   Providing Verification Assistance for New Content
QURATOR      QURATOR – Curation Technologies
RDF          Resource Description Framework
REST         Representational State Transfer
SDK          Software Development Kit
SEO          Search Engine Optimisation
SKOS         Simple Knowledge Organisation System
SL           Sign Language
SME          Small and Medium Size Enterprises
SOA          Service-Oriented Architecture
SPDX         Software Package Data Exchange
SQL          Structured Query Language
SR           Subword Regularisation
SRL          Semantic Role Labeling
TBX          Termbase Exchange
TC           Text Classification
TCS          Terminological Concept System
TMX          Translation Memory Exchange
xxx                                                  Acronyms

TTS     Text To Speech Synthesis
UI      User Interface
UML     Unified Modeling Language
URI     Uniform Resource Identifier
URL     Uniform Resource Locator
USP     Unique Selling Proposition
WFST    Weighted Finite State Transducer
WMT     Workshop/Conference on Machine Translation
WSD     Word Sense Disambiguation
XGAPP   XML GATE Application
XMI     XML Metadata Interchange
XML     Extensible Markup Language
XSD     XML Schema Definition
YAML    YAML Ain’t Markup Language
Chapter 1
European Language Grid: Introduction

Georg Rehm




Abstract Europe is a multilingual society with 24 European Union Member State
languages and dozens of additional languages including regional and minority lan-
guages as well as languages spoken by immigrants, trade partners and tourists. While
languages are an essential part of our cultural heritage, language barriers continue
to be unbreachable in many situations. The only option to enable and to benefit
from multilingualism is through Language Technologies (LTs) including Natural
Language Processing (NLP), Natural Language Understanding (NLU) and Speech
Technologies. The commercial European LT landscape is dominated by hundreds of
SMEs that develop many different kinds of LTs. While the industrial and also the
academic European LT community is world-class, it is also massively fragmented.
This chapter is an introduction to the present volume, which describes the European
Language Grid (ELG) cloud platform, initiative and EU project. The ELG system
is targeted to evolve into the primary platform and marketplace for LT in Europe
by providing one umbrella platform for the entire European LT community, includ-
ing research and industry, enabling all stakeholders to showcase, share and distribute
their services, tools, products, datasets and other resources. At the time of writing, the
ELG platform provides access to more than 13,000 commercial and non-commercial
language resources and technologies covering all official EU languages and many
national, co-official, regional and minority languages.



1 Overview and Context

Europe is a multilingual society with 24 EU Member State languages and dozens of
additional languages including regional and minority languages as well as languages
spoken by immigrants, trade partners and tourists. While languages are an important
part of our cultural heritage, language barriers continue to be unbreachable in many
situations. The only option to enable and to benefit from multilingualism is through
Language Technologies (LTs) including Natural Language Processing (NLP), Nat-

Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de


© The Author(s) 2023                                                                    1
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_1
2                                                                                   Georg Rehm

ural Language Understanding (NLU), and Speech Technologies. The commercial
European LT landscape is dominated by hundreds of SMEs and a few larger enter-
prises (Rehm et al. 2020b). While the European LT community is world class, it is
also very fragmented, significantly holding back its impact (Vasiljevs et al. 2019).
   This book is the definitive documentation1 of the EU project European Language
Grid, which has developed the ELG cloud platform (Figure 1), available online at:
                       https://www.european-language-grid.eu
    The European Language Grid is targeted to evolve into the primary platform for
Language Technology in Europe. We provide one umbrella platform for all LTs and
LRs developed by the whole European LT landscape, including research and indus-
try, addressing a major gap, i. e., the lack of a common LT platform, that has been
repeatedly raised by the whole community for many years (Rehm and Uszkoreit
2013; Rehm et al. 2016; STOA 2018; Rehm and Hegele 2018; European Parliament
2018). The ELG platform is also meant to be a virtual home and marketplace for
all products, services and organisations active in this space in Europe, significantly
boosting the EU Digital Single Market by helping to make it multilingual. ELG is
an initiative from the European LT community for the European LT community. It
provides one platform that can be used by all stakeholders to showcase, share and dis-
tribute their products, services, tools, datasets, corpora and other relevant resources.
At the time of writing, the ELG platform enables access to more than 13,000 com-
mercial and non-commercial language resources and technologies for all official EU
languages and many national, co-official, regional and minority languages.
    The European LT community had been demanding a dedicated LT platform for
years – the ELG cloud platform fills this gap. The ambition of the ELG project
and initiative is to unite a strong and extensive network of European experts and
concentate on commercial as well as non-commercial LTs, both functional (analy-
sis, processing and generation for written and spoken language) and non-functional
(datasets, corpora, lexicons, models etc.). A related goal is to establish the ELG as a
marketplace for the fragmented European LT landscape (Vasiljevs et al. 2019; Rehm
et al. 2020b) to connect demand and supply, strengthening Europe’s position in this
field. The ELG platform enables the whole European LT community to upload their
services and datasets, to deploy them, connect with, and make use of those resources
made available by others (taking into account IPR and licences, as soon as the ELG
legal entity is in place, including payment and billing options, especially with regard
to commercial services and resources).
    ELG is also meant to support digital language equality in Europe (STOA 2018;
European Parliament 2018), i. e., bringing about a situation in which all languages
are supported through technologies equally well. Currently, there is still an extreme
imbalance, characterised by a stark predominance of LRTs for English, while almost
all other languages are only marginally supported (Gaspari et al. 2022; Grützner-
Zahn and Rehm 2022). In fact, many of these languages are in danger of digital
1The ELG cloud platform is actively being used, i. e., new services, tools and resources are made
available on or through ELG on a daily basis. The data, numbers and statistics presented in this
book regarding the use of ELG reflect the respective time of writing.
1 European Language Grid: Introduction                                          3




Fig. 1 The European Language Grid cloud platform



language extinction (Rehm and Uszkoreit 2012; Kornai 2013). With an initial con-
sortium of 52 partners, ELG’s sister project ELE (European Language Equality;
Jan. 2021 – June 2022) and its immediate follow-up project ELE 2 (July 2022 – June
2023) are developing a strategic agenda and roadmap for digital language equality
in Europe by 2030 to address this issue by means of a coordinated, pan-European
research, development and innovation programme (Rehm and Way 2023).2


2   https://european-language-equality.eu
4                                                                                     Georg Rehm

2 The European Language Grid EU Project

The original proposal for the Innovation Action “European Language Grid” (ELG)
was prepared by a consortium of nine partners (Table 1) and submitted on 17 April
2018, responding to the European Commission Horizon 2020 call topic ICT-29-2018
(“A multilingual Next Generation Internet”, sub-topic “European Language Grid”).3
The ELG EU project4 started in January 2019 and finished in June 2022.5


    1 Deutsches Forschungszentrum für Künstliche Intelligenz GmbH DFKI            Germany
      (Coordinator)
    2 Athena Research and Innovation Center in Information, Commu- ILSP           Greece
      nication and Knowledge Technologies, Institute for Language and
      Speech Processing
    3 University of Sheffield                                         USFD        UK
    4 Charles University                                              CUNI        Czech Republic
    5 Evaluations and Language Resources Distribution Agency          ELDA        France
    6 Tilde SIA                                                       TILDE       Latvia
    7 HENSOLDT Analytics GmbH                                         HENS        Austria
    8 Expert System Iberia SL                                         EXPSYS      Spain
    9 University of Edinburgh                                         UEDIN       UK
Table 1 Consortium of the ELG EU project



    The project was structured into three broader areas. The ELG Platform area
(WP 1, WP 2, WP 3) took care of developing the technology platform, which was
built with robust, scalable, reliable and widely used open source technologies, en-
abling it to scale with the growing demand and supply. As an important part of the
platform, the ELG catalogue contains metadata records of all resources (including
services, datasets etc.), service and application types, languages as well as records
of LT companies, research organisations, projects, etc. This is where the first area
overlapped with the second, i. e., ELG Content (WP 4, WP 5), referring to the actual
content of the European Language Grid in terms of processing or generation services,
tools, datasets, corpora, models, language resources etc. We distinguished between
functional content (running services that can be uploaded into and deployed from the
ELG cloud platform and integrated into other systems) and non-functional content
(datasets, corpora, lexicons, etc.). Functional LT services are created by container-
ising and ingesting them into ELG. One of our key goals was to make this process
as easy and efficient as possible for commercial and non-commercial LT providers.
These are two of the main classes of users of the third area, i. e., ELG Community
(WP 6, WP 7), which includes all stakeholders of the ELG. Apart from commercial
or academic developers of LT, these stakeholders also include companies, NGOs or

3 https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/
ict-29-2018
4 https://cordis.europa.eu/project/id/825627

5 The original runtime of 36 months was extended by six months due to the COVID-19 pandemic.
1 European Language Grid: Introduction                                                   5

public administrations interested in purchasing or integrating Language Technolo-
gies into their own systems and applications. The ELG project collaborated – and
still collaborates – with various other EU-supported research and innovation projects
as well as with international networks and associations. Furthermore, ELG estab-
lished a network of 32 National Competence Centres (NCCs) in as many European
countries, who acted as national bridges to the project, generating interest in partici-
pating in the ELG initiative amongst relevant stakeholders from their own regions. In
2020, ELG published two open calls through which a total of 15 pilot projects were
financially supported. These pilot projects extended ELG’s catalogue with relevant
services or datasets and realised innovative applications based on the ELG platform
and available services and resources, demonstrating the usefulness of the platform.
Table 2 shows all work packages of the ELG EU project.


               Area                             Work Package                     Lead
      ELG Platform           WP 1 Base Infrastructure                            DFKI
                             WP 2 Language Grid                                   ILSP
                             WP 3 Interactive Interface and Information System   TILDE
      ELG Content            WP 4 Services, Tools, Components                    USFD
                             WP 5 Language Resources, Data Sets and Models       ELDA
      ELG Community          WP 6 Piloting the ELG                               CUNI
                             WP 7 Communication and Competence Centres           DFKI
                             WP 8 Project Management and Coordination            DFKI
Table 2 Work packages of the ELG EU project



   The ELG project resulted in more than 40 deliverables, the public ones of which
are available online.6 In addition to what had been originally specified in the project
plan in early 2018, the project also worked on a number of activities that were not
foreseen to be executed in the project proposal or grant agreement. For example,
ELG organised the First International Workshop on Language Technology Platforms
(IWLTP 2020).7 Driven by the success of this workshop (Rehm et al. 2020a), a spe-
cial issue of the Language Resources and Evaluation journal focusing on LT Plat-
forms is currently in preparation, scheduled to be published in 2023. Motivated by
the very positive feedback we have received from many different stakeholders since
the beginning of the project, we decided, in 2020, to compile the present book as the
definitive documentation of the project.




6   https://www.european-language-grid.eu/deliverables
7   https://www.european-language-grid.eu/iwltp-2020
6                                                                         Georg Rehm

3 Beyond the ELG EU Project

Throughout the years it has been repeatedly argued that Europe should not outsource
its multilingual communication and digital language infrastructure to other conti-
nents and markets since the European demands are complex, challenging and above
all unique. Instead, Europe should support and make use of its own LT community.
One of the obstacles to overcome along the way has been the development of a
shared technology and community platform for all European stakeholders. Now that
the ELG cloud platform is finally in place, it is able to foster Language Technologies
for Europe built in Europe, tailored to our languages and cultures and to our soci-
etal and economical demands, benefitting European citizens, society, innovation and
industry. ELG plays the role of a shared, scalable cloud platform for the whole Eu-
ropean LT community and it also functions as a joint marketplace and broker for a
broad variety of services, products and datasets.
    The ELG EU project was successfully completed in June 2022, and Release 3
of the ELG platform is ready to be used. At the time of writing, ELG provides ac-
cess to more than 13,000 commercial and non-commercial language resources and
technologies for all official EU languages and many national, co-official, regional
and minority languages. In addition, the ELG project has contributed to validating
and extending the platform with 15 pilot projects, building a pan-European commu-
nity of users and providers, establishing communication and outreach channels and
organising a number of large-scale conferences and smaller workshops.
    Since the start of the project, we have been collaborating with the European AI
on demand platform, especially with the AI4EU project, to ensure compatibility of
our approaches in terms of describing resources semantically. Furthering these col-
laborative efforts will facilitate cross-platform search and discovery enabling ELG
resources and other assets to be visible, discoverable and usable by the wider AI com-
munity. Considering the EU’s plan to deploy the emerging European AI on demand
platform, ELG is ready to act as the central language-related AI hub and marketplace
providing access to and direct use of several thousands of LT services and datasets.
    The ELG legal entity will take over further development and maintenance of ELG
in the second half of 2022. At the same time, the ELG platform plays a role in sev-
eral new funded projects. ELE (Jan. 2021 – June 2022) and ELE 2 (July 2022 –
June 2023) have already been mentioned – ELG’s sister projects are developing a
strategic agenda and roadmap for achieving full digital language equality in Europe
by 2030.8 The ELG platform was and is heavily used in ELE – of special impor-
tance is the ELE dashboard, which provides a number of visualisations of the ELG
catalogue, enabling various comparisons of the technology support of Europe’s lan-
guages.9 The project OpenGPT-X (Jan. 2022 – Dec. 2024), funded by the German
Federal Ministry for Economic Affairs and Climate Action, develops large language
models that will enable new data-driven business solutions, specifically address-


8   https://european-language-equality.eu
9   https://live.european-language-grid.eu/catalogue/dashboard
1 European Language Grid: Introduction                                                 7

ing European needs.10 In this project, many different language resources provided
by ELG are used for research and development purposes. In addition, ELG will
be further extended so that it complies to the specifications of the emerging Gaia-
X11 infrastructure and ecosystem, eventually integrating ELG into Gaia-X, making
available many of the OpenGPT-X results (and all ELG resources) through Gaia-X.
The project NFDI4DataScience and Artificial Intelligence (Oct. 2021 – Sept. 2026)
is part of the initiative Nationale Forschungsdateninfrastruktur (German Research
Data Infrastructure).12 In this project, the ELG platform will be integrated into the
emerging NFDI13 infrastructure. A similar goal will be addressed by the upcoming
EU project SciLake (Jan. 2023 – Dec. 2025), in which we will establish technical
bridges between the ELG platform and the European Open Science Cloud (EOSC).14
Finally, the upcoming EU project DataBri-X (Oct. 2022 – Sept. 2025) will interlink
ELG and the emerging DataBri-X platform.



4 Summary of this Book

This book is structured into four different parts. Parts I, II and III describe the main
results of the ELG project, while Part IV focuses on the ELG open calls and the 15
pilot projects. Below we include short summaries of the four parts.



4.1 Part I: ELG Cloud Platform

Part I provides an in-depth description of the European Language Grid Cloud Plat-
form. First, Chapter 2 (p. 13 ff.) introduces the architecture and setup of the ELG
cloud platform, including fundamental concepts such as the user and provider roles,
the semantic metadata scheme and the different types of technologies currently sup-
ported by the platform. Afterwards, Chapter 3 (p. 37 ff.) concentrates on using ELG
as a consumer. For this purpose, the web-based user interface, the public-facing APIs
and the ELG Python SDK can be used. The complementary Chapter 4 (p. 67 ff.) ex-
amines using ELG as a provider of Language Technologies and Language Resources
including the corresponding dashboard, service integration and various helper tools.
Chapter 5 (p. 95 ff.) goes even deeper and provides a description of the ELG cloud in-
frastructure, e. g., the Kubernetes cluster, the storage solution etc. Finally, Chapter 6
(p. 107 ff.) examines the relation between ELG and other projects and infrastructures
in terms of various technical collaborations (e. g., metadata harvesting).

10 https://opengpt-x.de
11 https://gaia-x.eu
12 https://www.nfdi4datascience.de

13 https://www.nfdi.de

14 http://eosc.eu, https://eosc-portal.eu
8                                                                          Georg Rehm

4.2 Part II: ELG Inventory of Technologies and Resources

Part II focuses on the actual content of the ELG platform, i. e., it examines the ELG
Inventory of Technologies and Resources. First, Chapter 7 (p. 131 ff.) describes the
hundreds of functional Language Technology tools and services available in the
ELG platform, covering machine translation, automatic speech recognition, text-to-
speech synthesis as well as text analysis tools, among others. These tools and services
have been and are being provided by companies as well as academic organisations.
Chapter 8 (p. 151 ff.) then takes a look at the diverse set of Language Resources
covering datasets, corpora, language models and other types of resources for all Eu-
ropean languages. Many of these are hosted in ELG, available for direct download,
while for others metadata records are collected from external repositories, enabling
discovery through ELG as a one-stop-shop platform for the European LT community.
Chapter 9 (p. 171 ff.) concludes Part II and describes the organisations, i. e., compa-
nies and research institutions, as well as projects currently represented in ELG. Our
vision is for ELG to become the primary platform for Language Technology in Eu-
rope and, thus, for all organisations that develop LT to actively maintain their ELG
pages, provide language tools and services as well as language resources, linking
them to their own ELG pages.



4.3 Part III: ELG Community and Initiative

Part III provides an in-depth look at four different dimensions of the ELG Community
and Initiative. First, Chapter 10 (p. 189 ff.) describes the main group of stakeholders
that the EU project ELG collaborated with including various LT providers, different
EU and national research projects as well as several wider initiatives. This chapter
also describes the different ELG communication channels including social media.
Chapter 11 (p. 205 ff.) focuses on the 32 National Competence Centres (NCCs) that
the ELG project set up. The NCCs function as an international network of national
networks, they support the overall mission of the ELG project. On a more abstract
level, Chapter 12 (p. 219 ff.) provides a glimpse at various aspects and processes
that revolve around open innovation and the marketplace concept as one of the main
visions we have for the European Language Grid. Finally, Chapter 13 (p. 233 ff.)
describes the ELG legal entity – including setup, challenges, products etc. – as the
main instrument to sustain the ELG initiative beyond the EU project.



4.4 Part IV: ELG Open Calls and Pilot Projects

Part IV is dedicated to the ELG Open Calls and Pilot Projects. A considerable
amount of the overall budget of the EU project European Language Grid was set
aside to support a number of pilot projects that either make use of the technologies
1 European Language Grid: Introduction                                                         9

and resources provided by ELG or that extend the ELG inventory and portfolio by
contributing additional technologies or resources. First, Chapter 14 (p. 257 ff.) de-
scribes the setup of the ELG open calls including designed and implemented pro-
cedures, boards, evaluation criteria etc. The following 15 chapters – Chapter 15
(p. 271 ff.) to Chapter 29 (p. 355 ff.) – report on the 15 pilot projects, selected from
more than 200 project proposals in an expert-driven evaluation procedure.



References

European Parliament (2018). Language Equality in the Digital Age. European Parliament reso-
   lution of 11 September 2018 on Language Equality in the Digital Age (2018/2028(INI). URL:
   http://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.pdf.
Gaspari, Federico, Owen Gallagher, Georg Rehm, Maria Giagkou, Stelios Piperidis, Jane Dunne,
   and Andy Way (2022). “Introducing the Digital Language Equality Metric: Technological Fac-
   tors”. In: Proceedings of the Workshop Towards Digital Language Equality (TDLE 2022; co-
   located with LREC 2022). Ed. by Itziar Aldabe, Begoña Altuna, Aritz Farwell, and German
   Rigau. Marseille, France, pp. 1–12. URL: http://www.lrec-conf.org/proceedings/lrec2022/wor
   kshops/TDLE/pdf/2022.tdle-1.1.pdf.
Grützner-Zahn, Annika and Georg Rehm (2022). “Introducing the Digital Language Equality Met-
   ric: Contextual Factors”. In: Proceedings of the Workshop Towards Digital Language Equality
   (TDLE 2022; co-located with LREC 2022). Ed. by Itziar Aldabe, Begoña Altuna, Aritz Farwell,
   and German Rigau. Marseille, France, pp. 13–26. URL: http://www.lrec-conf.org/proceedings
   /lrec2022/workshops/TDLE/pdf/2022.tdle-1.2.pdf.
Kornai, Andras (2013). “Digital Language Death”. In: PLoS ONE 8.10. DOI: 10.1371/journal.pon
   e.0077056. URL: https://doi.org/10.1371/journal.pone.0077056.
Rehm, Georg, Kalina Bontcheva, Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasil-
   jevs, eds. (2020a). Proc. of the 1st Int. Workshop on Language Technology Platforms (IWLTP
   2020, co-located with LREC 2020). Marseille, France. URL: https://www.aclweb.org/antholo
   gy/volumes/2020.iwltp-1/.
Rehm, Georg and Stefanie Hegele (2018). “Language Technology for Multilingual Europe: An
   Analysis of a Large-Scale Survey regarding Challenges, Demands, Gaps and Needs”. In: Pro-
   ceedings of the 11th Language Resources and Evaluation Conference (LREC 2018). Ed. by
   Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti
   Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno,
   Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga. Miyazaki, Japan: ELRA, pp. 3282–3289.
   URL: https://aclanthology.org/L18-1519.pdf.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic,
   Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez
   Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek,
   Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Al-
   bina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De
   Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek,
   Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur
   Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás
   Váradi, Kadri Vider, Andy Way, and François Yvon (2020b). “The European Language Technol-
   ogy Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communi-
   cation in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation
   Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christo-
   pher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani,
10                                                                                   Georg Rehm

   Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325.
   URL: https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Lan-
   guages in the Digital Age. 32 volumes on 31 European languages. Heidelberg etc.: Springer.
Rehm, Georg and Hans Uszkoreit, eds. (2013). The META-NET Strategic Research Agenda for
   Multilingual Europe 2020. Heidelberg, New York, Dordrecht, London: Springer. URL: http://w
   ww.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf.
Rehm, Georg, Hans Uszkoreit, Sophia Ananiadou, Núria Bel, Audronė Bielevičienė, Lars Borin,
   António Branco, Gerhard Budin, Nicoletta Calzolari, Walter Daelemans, Radovan Garabík,
   Marko Grobelnik, Carmen García-Mateo, Josef van Genabith, Jan Hajič, Inma Hernáez, John
   Judge, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Joseph
   Mariani, John McNaught, Maite Melero, Monica Monachini, Asunción Moreno, Jan Odjik,
   Maciej Ogrodniczuk, Piotr Pęzik, Stelios Piperidis, Adam Przepiórkowski, Eiríkur Rögnvalds-
   son, Mike Rosner, Bolette Sandford Pedersen, Inguna Skadiņa, Koenraad De Smedt, Marko
   Tadić, Paul Thompson, Dan Tufiş, Tamás Váradi, Andrejs Vasiļjevs, Kadri Vider, and Jolanta
   Zabarskaite (2016). “The Strategic Impact of META-NET on the Regional, National and Inter-
   national Level”. In: Language Resources and Evaluation 50.2, pp. 351–374. DOI: 10.1007/s1
   0579-015-9333-4. URL: http://link.springer.com/article/10.1007/s10579-015-9333-4.
Rehm, Georg and Andy Way, eds. (2023). European Language Equality: A Strategic Agenda for
   Digital Language Equality. Cognitive Technologies. Forthcoming. Springer.
STOA (2018). Language equality in the digital age – Towards a Human Language Project. STOA
   study (PE 598.621), IP/G/STOA/FWC/2013-001/Lot4/C2. URL: https://data.europa.eu/doi/10
   .2861/136527.
Vasiljevs, Andrejs, Khalid Choukri, Luc Meertens, and Stefania Aguzzi (2019). Final study report
   on CEF Automated Translation value proposition in the context of the European LT market/e-
   cosystem. DOI: 10.2759/142151. URL: https://op.europa.eu/de/publication-detail/-/publicatio
   n/8494e56d-ef0b-11e9-a32c-01aa75ed71a1/language-en.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
             Part I
ELG Cloud Platform
Chapter 2
The European Language Grid Platform:
Basic Concepts

Stelios Piperidis, Penny Labropoulou, Dimitris Galanis, Miltos Deligiannis, and
Georg Rehm




Abstract In the fragmented Language Technology (LT) landscape of multilingual
Europe, ELG has set out to bring together language resources and technologies
(LRTs) and boost the LT sector and its activities. The primary goal is to build a scal-
able and comprehensive cloud platform for providers, developers, integrators and
consumers of language resources and technologies. We describe the basic concepts
of the ELG platform in terms of its architecture, the functionalities and services
offered to its types of users and the policies it implements. We present the ELG
repository, its catalogue features, the LT services execution environment as well as
the metadata model underlying the platform operations and the resources life cycle,
from creation to publication. We also discuss the compliance of ELG with the FAIR
principles and the relation to other platforms and infrastructure initiatives which have
inspired certain aspects and with which ELG has been establishing strong links.



1 Introduction

The overarching objective of the European Language Grid (ELG, Rehm et al. 2021)
is to tackle the observed fragmentation in the European Language Technology (LT)
landscape by bringing together Language Resources and Technologies (LRTs), com-
mercial and non-commercial, and through multiple multi-level services support and
boost the LT sector and LT activities in Europe. The primary technological goal is
to build a scalable cloud-based platform through which developers and providers of
language resources and technologies can not only deposit and upload their resources
and technologies into ELG, but also deploy them through the platform and make
use of the services, technologies and resources made available by others. ELG is a
marketplace through which consumers and integrators of LRTs can discover, try out

Stelios Piperidis · Penny Labropoulou · Dimitris Galanis · Miltos Deligiannis
Institute for Language and Speech Processing, R. C. “Athena”, Greece, spip@athenarc.gr,
penny@athenarc.gr, galanisd@athenarc.gr, mdel@athenarc.gr
Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de


© The Author(s) 2023                                                                      13
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_2
14                              Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

and integrate the resources and technologies they require for their own research and
application development.
   The primary services of the platform are dedicated to the deposition, discovery,
distribution and deployment of Language Resources and Technologies. ELG already
offers access to thousands of commercial and non-commercial LTs and ancillary
LRs for all European languages and more. These include processing and generation
services, tools, applications for written and spoken language, as well as datasets,
corpora, lexical resources, language models and computational grammars.
   ELG also supports the promotion and collaboration of LT stakeholders through an
extensive catalogue of organisations (companies, SMEs, academic and research or-
ganisations and groups, etc.) active in the LT community. Organisations can describe,
promote and distribute their services and resources all in one place. Complemented
with an expanding catalogue of European and national projects that have funded the
production of LRTs and related activities, the catalogue of the ELG platform offers
an overview of the European LT landscape. ELG, therefore, also acts as an observa-
tory of LT, consolidating existing and legacy tools, services, LRs, and information
about them, as well as newly emerging ones. This, in turn, enables the identifica-
tion of gaps and imbalances between the LRTs offered for all European languages,
a valuable instrument for the support of digital language equality in Europe.
   ELG is conceived as a platform for the whole LT community. Primarily for Eu-
rope, ELG is a platform built by the European LT community for the European
LT community, including industry, innovation and research. For the population of
the catalogue of its platform, ELG builds bridges to existing initiatives and reaches
agreements for harvesting and importing information (i. e., metadata) and resources
from other infrastructures, platforms and repositories under mutually agreed condi-
tions, business policies, acknowledgement and attribution of the source, and collab-
orates in joint initiatives and crowdsourcing campaigns.
   This chapter introduces the basic concepts of the ELG platform, while the subse-
quent chapters go into more detail with regard to functionalities offered to consumers
(Chapter 3) and providers (Chapter 4), the cloud infrastructure (Chapter 5) and the
synergies with other initiatives (Chapter 6). We first give an overview of the platform
features (Section 2) and its users (Section 3). Section 4 presents the architecture of
ELG. Sections 5 and 6 present the models and policies that influence the design and
operations of the ELG platform, i. e., the metadata model, and the publication life
cycle of catalogue entries. Section 7 positions the ELG platform with regard to the
FAIR principles (Wilkinson et al. 2016).



2 Overview of the ELG Platform

The ELG platform combines the features of a catalogue (Section 2.1), a repository
(Section 2.2), and an execution environment for running services (Section 2.3).
2 The European Language Grid Platform: Basic Concepts                                                                                                           15

2.1 Catalogue

All LRTs are accessed through their metadata records in the catalogue (Figure 1).
Providers can describe and share their LRTs; they can upload them to be hosted
in ELG, or they can only describe them and provide access to them through other
locations, such as institutional or national repositories, or private repositories of com-
mercial organisations. They can also create dedicated pages for their organisations,
describe their offerings and services and interlink all their LRTs through their own
pages.



                                                                                                            Catalogue     Documentation & Media         About

               RELEASE 3




     Search for services, tools, datasets, organizations...                                                                         Search




    Language resources &
    technologies
                                                     AbuseEval                                                                                       41 views
                                                     version: 1.0
    Service functions
                                               Extension of OLID/OffensEval data set with distinction of explicit vs implicit
                                               offensive messages.Annotation of Abusive Language, distinguishing also
                                               between explicit vs implicit offensive messages.
    Languages
                                               Keyword: Corpus Creation/Annotation
                                               Language: English
    Media types
                                               Licence: Creative Commons Attribution Non Commercial Share Alike 4.0 International


    Licences
                                                     Academic Written Catalan in Catalonia [CesCa: El Català
                                                     Escolar Escrit a Catalunya]                                                             for information
    Conditions of use                                version: 1.0.0 (automatically assigned)

                                               It is a reference corpus of the written scholar Catalan in Catalonia. It con-
    Related entities                           tains 2.426 processed texts that have been produced by children between
                                               the last year of childhood education (P5) and the last year of obligator

    ELG integrated services                    Keywords: schoolar · written · obligatory education period
    and data                                   Language: Catalan



Fig. 1 Browse/Search page of the ELG catalogue


   Additionally, the ELG catalogue includes metadata records imported automati-
cally from other sources, through standard harvesting protocols and dedicated con-
verters, thus resulting in an extensive and continuously growing inventory of LRTs
as well as of organisations and projects in the LT domain.
   LRT consumers, i. e., users, and other interested parties can search for and dis-
cover LRTs using free text search and faceted views of the catalogue. Users can
select and view the detailed descriptions of LRTs to see if they fit the users’ needs.
Users can access the resources, either directly if hosted in ELG, or be re-directed
to the URL from where the resources are accessible. Users can also search for or-
ganisations, browse them, and view their activities on their profile pages. If these
organisations have also described the LRTs they developed, users can navigate to
the respective pages for more details. Last, users can also discover the LT-related
16                               Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

projects in which organisations participated and that have helped fund the organ-
isations’ LRT development. Finally, users can export and download the metadata
descriptions or share the pages on social media.



2.2 Repository of Language Resources and Technologies

LRT providers can upload their resources to be hosted in the ELG cloud infrastruc-
ture, and to be made available to consumers for direct download. Providers must
specify the licensing conditions under which the resources can be used. Depending
on the terms, ELG will allow immediate download (for open access resources) or
impose further measures (authentication and authorisation). Commercial LRTs, dis-
tributed for download at a fee, will be available for purchase using a user-friendly
billing service.
    ELG as a repository is committed to making data, services and their metadata
FAIR, i. e., findable, accessible, interoperable and reusable (Wilkinson et al. 2016).
The assignment of persistent identifiers in the form of Digital Object Identifiers
(DOIs)1 for the data and services hosted in ELG is among the main steps towards this
objective; the FAIR principles, detailed in Section 7, form an integral part of the ELG
policies aiming to support the requirements posed by research results reproducibility
objectives and practices.



2.3 Running Language Technology Cloud Services

To benefit from the advanced features of ELG, providers can integrate LT tools as
ready-to-deploy services, following our specifications (Chapter 4). In this case, con-
sumers can test the tools and services using the trial UIs or APIs offered by ELG,
and, ultimately, integrate them in their workflows and systems. For commercial ser-
vices, billing services will be available to allow pay-for-use services with seamless
access and use in the minimum possible number of steps.
    ELG provides a set of standard APIs which cover all principal service types (see
Chapter 3, Section 3, p. 50 ff., for more details): information extraction and annota-
tion services for text and speech, text-to-text services (most notably machine trans-
lation services, but also summarisers, anonymisers, etc.), classification services for
text or image, such as language identifiers, fake news detectors, sentiment analysers,
etc., speech recognition services, text-to-speech synthesis services, and image OCR
(optical character recognition) services.
    The technical specifications give service providers a set of easy-to-implement
integration options from which they can select the one that best fits their needs. All
that is required is that they upload an image of their tool or service using one of these
options in a container registry and provide access to ELG.
1   https://www.doi.org
2 The European Language Grid Platform: Basic Concepts                                17

   ELG maintains a dedicated container registry for LT services.2 As the images of
LT services are partly pulled from registries external to the ELG project, this registry
serves as a point to collect LT service images when they are ingested into the ELG
and to apply versioning. This approach enables us to ensure that older versions of
images remain available even if their original site no longer provides them.
   To provide easy access and interaction with the ELG platform also for program-
mers, a Python SDK has been developed on top of the various ELG programmatic
interfaces providing simple methods to easily interact with the platform and consume
resources in Python (see Chapter 3, Section 4, p. 55 ff., for more details).



3 User Types and User Model

Specified by its mission, ELG targets various types of users, broadly classified into:
Providers of LRTs, both commercial and academic, albeit with different require-
  ments (the former seek to promote and sell their products and activities, while the
  latter wish to make their resources available for research or look for cooperation
  to further develop them in new projects or commercialize them),
Consumers of LRTs, including companies developing LT tools, services and ap-
  plications, integrators, researchers using LRT for their studies, etc.,
LT laypersons interested in finding out more about LT and its uses,
Funding authorities and stakeholders that wish to get an overview of the LT
  field and landscape, trends and prospects with regard to languages, domains etc.
   All users can browse the catalogue and access, view and inspect the detailed de-
scriptions of the assets listed in the catalogue, and download resources available with
open access licences. For further interactions with the ELG platform, registration is
required and can be performed with a simple and user-friendly self-service proce-
dure. The types of permitted actions and access level are determined by the user
role: registered consumers can run integrated services and download resources that
are available for free download to authenticated users; providers can, in addition,
describe all types of assets, upload content files, and integrate services according
to the ELG technical requirements; two specific user roles (validator and adminis-
trator) are reserved for ELG team members responsible for the management of the
catalogue, metadata records and data files, in accordance with the ELG policies (Sec-
tion 6) including the overall platform maintenance and administrative operations.




2   registry.european-language-grid.eu
18                                                                   Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

4 Architecture

The ELG platform uses state-of-the art technologies and is designed to evolve over
time to address new requirements or technological advancements. The choices made
in the architectural design and implementation allow for scaling with the growing
demand and supply for compute resources and lay the foundation for interoperable
data and service spaces.
   All subsystems are built with robust, scalable, reliable, widely used open source
technologies, as described below. Docker containers3 are used for all services and
applications which comprise the ELG platform, while Kubernetes4 is used for con-
tainer orchestration. Conceptually, ELG takes the form of a three-layered platform,
with each layer grouping together the main subsystems responsible for the platform’s
functionalities: base infrastructure, platform back end, platform front end (Figure 2).


                                                                                           Provider UI &
                          Catalogue UI          Test/Trial UI          Admin UI                                   Language Technologies             CMS UI
                                                                             Entity       Metadata Editor
              FRONTEND
              PLATFORM




                                                                                                                  (functional content)
    GATEWAY




                                                                                                                  Language Resources
                                                                                                                  (non-functional content)




      REST API

                       LT Service Execution                      Analytics             User Management           Catalogue
PLATFORM




                           Orchestrator
BACKEND




                                                                  Billing                 Storage Proxy     Database       Elastic Index
                LT          LT                 LT
               SRV 1       SRV 2      …       SRV n
                                                                Monitoring            Metadata Harvesting


                                                                      Docker repository                       File & object storage
BASE INFRA-
STRUCTURE




                                                                                                                                         Language Resources
                                                                                                                                       (non-functional content)

                                                                                        Docker
                                                                                        Images
                              Nodes


Fig. 2 ELG platform architecture


   The base infrastructure is the layer on which all ELG software components are
deployed and run. It includes the supporting tools that facilitate development and
management of the ELG platform software. It is composed, first and foremost, of
the compute nodes running the platform, alongside their respective volume storage
and networking facilities; these are organised in two different clusters, one for de-
velopment and one for production purposes. It also comprises public and private

3 A Docker image of an application contains its actual code and all required dependencies required
to run it; e. g., the operating system, frameworks, settings, configuration files, libraries, etc. Con-
tainers are instantiations of images and can be thought of as lightweight virtual machines.
4 Kubernetes is a framework that enables and simplifies the deployment, scaling and management

of containers, see https://kubernetes.io.
2 The European Language Grid Platform: Basic Concepts                                      19

container registries, which host all images for the ELG platform components and for
the LT services integrated in the platform. In addition, it includes an S3-compatible
file and object storage, through which data resources uploaded by providers as well
as backups of core platform components are persisted. This layer also includes a set
of Git5 repositories for the source code of the platform software apps and for the
individual LT services implementations of specific providers. Chapter 5 (p. 95 ff.)
provides more information on the base infrastructure.
    The platform back end consists of all the components that enable the operation
of the ELG platform, i. e., the catalogue core components, the component for pro-
cessing LT services and platform support as well as management components. The
catalogue component, implemented using Django6 , interfaces with a PostgreSQL7
database for storing the metadata records and an index, which uses ElasticSearch8 .
The LT service execution server offers a common REST API for calling LT ser-
vices integrated in the platform, and handles failures, time-outs, etc. Finally, sepa-
rate modules are used for the user management and authentication module (based on
Keycloak9 , an identity and access management solution), the analytics, monitoring,
metadata harvesting and the proxy for interacting with the S3-compatible storage.
    The platform front end layer consists of the static pages maintained in a Con-
tent Management System (CMS). These provide information on the ELG project
and initiative, and the platform UIs for the different types of users, i. e., consumers,
providers, validators, and administrators. These include the catalogue pages (browse,
search, view), and the dashboard pages customised for the different user types, UIs
for registering (describing and uploading) LRTs and other assets and supporting the
publication life cycle, implemented using React10 , and the trial UIs for services in-
tegrated in ELG. The catalogue UI consumes REST services exposed by the ELG
platform back end (e. g., catalogue application, LT Service execution server).
    Chapters 3 (p. 37 ff.) and 4 (p. 67 ff.) provide more information on the back end
and front end layers of the European Language Grid platform.



5 Catalogue Contents and Metadata Model

All types of LT assets as well as all LT-related meta-information are brought to-
gether, aligned and interlinked. This set of information11 is formally structured and
harmonised in ELG using the ELG-SHARE metadata model12 catering for the full

5 https://git-scm.com
6 https://www.djangoproject.com
7 https://www.postgresql.org

8 https://www.elastic.co

9 https://www.keycloak.org

10 https://reactjs.org

11 https://european-language-grid.readthedocs.io/en/stable/all/A2_Metadata/Metadata.html

12 https://gitlab.com/european-language-grid/platform/ELG-SHARE-schema
20                                           Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.


                                                   Entity
                                                  Entity


              LRT           Licence               Project             Actor           Document

             Tool/Service                                                                   Publication
                                                                 Organization




                               Accompanies
                Dataset                                                                     Annotation
               (Corpus)                                                 Group                Manual
                Lexical/                                                                          User
              Conceptual                                               Person                    Manual
               Resource
                                                                                                   …
              Language                                       Core             Satellite
              Description                                                      Entity
                                                                                          is-a
                                                             Entity

Fig. 3 ELG entities



language data and services life cycle and their related entities (Labropoulou et al.
2020). The ELG model covers the following types of entities (Figure 3).
  • Language resources and technologies (LRTs), further classified into:
      – Corpora, i. e., datasets of mono/bi/multilingual text documents, audio/video
        recordings, multimedia datasets, parallel corpora, translation memories, etc.
      – Lexical/conceptual resources, including lexica, ontologies, gazetteers, term
        lists, computational dictionaries, etc.
      – Language descriptions, which mainly refer to computational grammars, sta-
        tistical and machine learning models
      – Tools/services, i. e., pieces of software offered as locally executable code or
        web services, hosted and running in the ELG cloud platform or remotely
  • Related/satellite entities, such as actors, be it persons or organizations that have
    created or that curate resources, projects that have funded them or in which
    they have been used, as well as licences and accompanying documents (e. g.,
    publications related to the resource, user manuals, technical documents, etc.)
    The ELG model lies at the heart of the platform and supports its key operations.
In particular, it aims to 1. support the discoverability of all catalogue contents; 2. en-
able accessibility by human users and, where possible or required, machines (e. g.,
including links to URLs that offer direct access to a resource or service); 3. address
(at the metadata level) interoperability requirements of resources belonging to the
same types and media, but coming from different sources with different descriptions,
as well as between resources of different types and media (e. g., between datasets
and services to be used for their processing); and, 4. finally, satisfy documentation
needs at different levels of granularity, ranging from the strict enforcement of tech-
nical metadata required for the deployment of ELG-compatible services to rather
loose descriptions of resources imported from general purpose catalogues.
2 The European Language Grid Platform: Basic Concepts                                  21

   The metadata model builds upon previous work from the META-SHARE meta-
data model (Gavrilidou et al. 2012), which caters for the description of language
resources and language-processing technologies, and its application profiles, i. e.,
ELRC-SHARE (Piperidis et al. 2018a), OMTD-SHARE (Labropoulou et al. 2018),
CLARIN-SHARE (Piperidis et al. 2018b), which extend, restrict and adapt the basic
model to specific domains and areas (e. g., public domain resources, text and data
mining domain, etc.), and the MS-OWL ontology13 (McCrae et al. 2015; Khan et al.
2022), which is the RDF/OWL representation of the model.
   The model builds along three key concepts, each of which is associated with a
distinctive set of metadata elements:
     • resource type, with the four subtypes described above;
     • media type, which specifies the form or physical medium of the resource. The
       notion of medium is preferred over the written, spoken or multimodal distinction,
       as it has clearer semantics and allows us to view LRs as a set of modules, each of
       which can be described through a distinctive set of features. Thus, the following
       media type values are foreseen: text, audio, image, video and numerical text
       (referring to numerical data, such as biometrical, geospatial data, etc.). To cater
       for multimedia and multimodal language resources (e. g., a corpus of videos and
       subtitles, or a corpus of audio recordings and transcripts, a sign language corpus
       with videos and texts, etc.), language resources are represented as consisting of
       at least one media part;
     • distribution, which, following the DCAT14 model (Albertoni et al. 2020; Maali
       and Erickson 2014), refers to any physical form of the resource that can be dis-
       tributed and deployed by end-users.
    These elements give rise to a modular structure, in which metadata elements are
attached to the appropriate level (“class”). The “LanguageResource” class includes
properties common to all resource and media types, such as those used for identifi-
cation purposes (title, description, etc.), recording provenance (creation, publication
dates, creators, providers, etc.), contact points, etc. More technical features and clas-
sification elements differ across resource and media types and are, thus, attached
to combinations thereof; for example, a corpus may take elements specific to an-
notation processes, while the description of a computational lexicon encodes, e. g.,
whether it includes lemmas, examples, grammatical information, translation equiva-
lents, etc. Technical features, such as format, size, information on licensing and mode
of access are properties of the distribution. They can also differ across resource type.
For example, corpora can be distributed as PDF files or as simple text files, lexical
resources in tabular form or queried through an interface, while tools may be avail-
able as source code, executable files or web services. Each of these forms can be
licensed under different terms: source code may be available at a price for integra-
tion in other applications, while an API may be offered for research purposes without
any fee. Figure 4 illustrates a subset of the elements for a tool/service.

13   http://w3id.org/meta-share/meta-share
14   https://www.w3.org/TR/vocab-dcat-3/
22                                  Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.




Fig. 4 Excerpt of the minimal schema for tools/services



   The schema allows for the description of the full life cycle of language resources
(see, e. g., Rehm 2016), from conception and creation to integration in applications
and usage. All this information leads to a complex and demanding schema; to ensure
flexibility and uptake by resource providers, the elements are classified into three
levels of optionality:
  • mandatory: elements that are necessary
  • recommended: elements that can help the current or future use of the resource,
    or useful information that providers have not yet standardised
  • optional: all remaining information
   The minimal schema comprises all mandatory elements which must be filled for
a metadata record to be considered ELG-compliant and eligible to be registered in
the platform. Recently, a “relaxed” version of the ELG schema was introduced as
a way of handling metadata records with “lighter” information imported from other
catalogues in ELG, but this version of the schema is allowed only under specific
circumstances. Chapter 6 discusses this in more detail. Below, we summarise the
metadata categories considered mandatory for the description of resources (Figures 6
to 10 in the Appendix provide an overview for each resource type).
2 The European Language Grid Platform: Basic Concepts                                        23

     • Administrative information: these features are important for the identification
       of an LRT (resource name, version, description which includes information on
       the contents, provenance information, any other information deemed useful and
       helpful for consumers, etc.), contact information (landing page with additional
       information or a contact email).
     • Classification information: one or more free text keywords that support the find-
       ability of the resource.
     • Usage information: separate distributions for each distributable form of the re-
       source, with the following elements: the distribution form (i. e., whether it can
       be downloaded, accessed through an interface, deployed as a web service, etc.),
       the licensing terms under which it can be used (licence name and URL); if the
       resource is not uploaded in ELG, an access or download link.
     • Legal/ethical information for data resources: whether personal or sensitive data
       is included and, if applicable, information on anonymisation.
     • Technical information: depending on the resource type
        – for tools/services: the function (i. e., the task it performs, e. g., named en-
          tity recognition, machine translation, speech recognition, etc.), the techni-
          cal specifications of its input (at least the resource type it processes, e. g.,
          corpus, text, etc.), whether it is language independent and, if not, the input
          languages; depending on the function, further information may be required
          (e. g., the languages of the output resource for machine translation services);
        – for all data resources15 : features on the language following the BCP 4716
          guidelines, multilinguality type, resource subtype with different values (e. g.,
          terminological glossary, ontology, etc. for lexical/conceptual resources, raw
          or annotated for corpora); size and format information must also be added
          separately for each distribution and media part;
        – in addition, specifically for models: the intended application (e. g., machine
          translation, named entity recognition, etc.), the model function (e. g., zero-
          shot classification), and model type (e. g., embeddings, Bayesian model, n-
          gram model, etc.);
        – specifically for grammars and lexical/conceptual resources: the encoding
          level of their contents (i. e., whether they contain morphological, syntactic,
          semantic, etc. information).
    For organisations and projects, all that is required is the name (official title). How-
ever, we also recommend a free text description with the activities of the organisation
or the project summary respectively, and the URL of its website. The LT area(s) in
which the organisation/project activities are related to and one or more keywords
increase its visibility and findability. For big organisations with multiple divisions
(e. g., academic institutions with schools, faculties, departments, or multinational

15 A resource can consist of one or more media parts, which must be described separately, for
example, for a corpus of video recordings and their subtitles in various languages, the language
value must be indicated separately for each part.
16 https://www.rfc-editor.org/info/bcp47
24                                                          Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

companies with branches), both the parent organisation and division(s) can be regis-
tered and a link between them added.
   For standardisation purposes, the ELG schema favours controlled vocabularies
over free-text fields, especially when these are associated with internationally ac-
knowledged standards, best practices or widespread vocabularies, e. g., ISO 3166
for region codes (ISO 2020), RFC 5646 for languages17 (Phillips and Davis 2009),
etc. The implementation in the form of an XML Schema Definition (XSD) im-
ports elements from two ontologies, i. e., the MS-OWL ontology, which includes
most elements and controlled vocabularies, and the OMTD-SHARE ontology18
(Labropoulou et al. 2018) reserved for the controlled vocabularies of LT categories
(also referred to as “LT taxonomy”), data formats, annotation types and methods.



6 Publication Life Cycle

ELG considers the quality of metadata records to be of primary importance as it
contributes to the discovery and usage of resources. We defined a set of policies that
take into account the source and the process through which a record has been entered
in the ELG catalogue.



                                                                                    Submitted for                                        Approved and
                                          Draft
     The provider uploads a                             All mandatory
                                                                                     publication            The ELG technical team
                                                                                                                                           published
     metadata file or uses                              metadata elements are                               checks the metadata; if
     the interactive editor   The record is invalid     filled in but the       The provider is satisfied   needed, the metadata      The metadata record is
     to create a new item     (not all mandatory        provider can continue   with the metadata and       returns to the provider   published on the ELG
                              elements are filled in)   editing                 submits the record for      for corrections           catalogue and can no
                              and the provider must                             publication                                           longer be edited
                              continue editing it
                                                            Syntactically
            New item                                                                                                 Under
                                                               valid
                                                                                                                   validation




Fig. 5 ELG publication life cycle


    The ELG publication life cycle consists of a set of states through which an entry
progresses, from its creation in the ELG platform until it is published (Figure 5). A
new item is created each time a provider adds a new metadata record. The record can
remain at the draft status as long as the provider wishes, in which case no validation
checks are made – apart from validation of the data types of the metadata elements
(e. g., that a URL is properly formulated). At the syntactically valid status, a metadata
record must comply with the minimal version of the ELG schema (i. e., all mandatory
elements must be filled in). The provider can still continue to edit it until they are
satisfied with the description and can then submit it for publication; once submitted,
the provider is notified by email. While the record is submitted for publication the

17   https://datatracker.ietf.org/doc/html/rfc5646
18   http://w3id.org/meta-share/omtd-share/
2 The European Language Grid Platform: Basic Concepts                                25

entry is validated at the metadata, technical and legal level. The validation, which
is described in more detail in Chapter 3, aims to check the consistency of the de-
scription and, where required, its technical compliance with the ELG specifications;
it does not include any qualitative evaluation of the resource itself. The validation
is currently performed by the ELG team. When validators identify a problem, they
contact the provider and recommend changes and additions to the metadata; in such
cases, the status is changed to syntactically valid again and the provider is notified
to make the appropriate amendments. When the validators have approved an item, it
is automatically visible via the ELG catalogue. Published metadata records cannot
be edited any more, i. e., they are immutable.
    Metadata records added by individuals go through the whole publication life cy-
cle. Human validation aims at ensuring a minimum level of quality included in the
records, which can be achieved through interactions with the provider. This pro-
cedure cannot be adopted for metadata records automatically imported from other
catalogues. For these, the responsibility for the quality and extent of information lies
with the source catalogue. The same policy, that of accepting records as is, has been
adopted for records added through bulk initiatives, such as the collaborative survey
of LRTs undertaken in the context of the European Language Equality project19 and
described in Chapter 6.



7 ELG and the FAIR Principles

The publication of the FAIR principles (Wilkinson et al. 2016) marked a landmark
for infrastructures that support the sharing and re-use of data resources. The FAIR
principles are guidelines set to enhance re-usability of data by improving their find-
ability, accessibility, interoperability and re-usability. They are intended both for
humans and machines, and put an emphasis on machine actionability, i. e., the ca-
pacity of computational systems to find, access, interoperate, and reuse data with no
or minimal human intervention.20 ELG has implemented mechanisms and policies
to ensure that resources (data and software) included in ELG as well as the metadata
that describe them are FAIR, i. e., adhere to the FAIR principles.21
Findability principles
     •   F1 – (Meta)data are assigned a globally unique and persistent identifier
         Resources hosted in ELG and ELG-compatible services are assigned a DOI
         (Digital Object Identifier)22 provided by DataCite23 . Metadata for resources
         will also have their own unique identifier created on the basis of the resource

19 https://european-language-equality.eu
20 https://www.go-fair.org/fair-principles/
21 https://force11.org/info/the-fair-data-principles/

22 https://www.doi.org

23 https://datacite.org
26                                     Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

          DOI. For metadata records that do not have an accompanying file and hence
          cannot be assigned a DOI, we use their URL as an identifier.
      •   F2 – Data are described with rich metadata
          The ELG metadata schema is rich in information. Providers are encouraged
          to add not only the mandatory but also recommended information. The vali-
          dation process for resources and services aims at improving metadata quality.
      •   F3 – Metadata clearly and explicitly include the identifier of the data they
          describe
          The element “identifier” (with the “identifier scheme” attribute) is included
          in the metadata record.
      •   F4 – (Meta)data are registered or indexed in a searchable resource
          All metadata records are indexed and searchable in the ELG catalogue and
          also accessible to search engines. In addition, we expose the metadata records
          of LRTs to Google’s dedicated search engine for research datasets.24
Accessibility principles
      •   A1 – (Meta)data are retrievable by their identifier using a standardised com-
          munications protocol
          All metadata in ELG are accessible via the ELG catalogue. Resources hosted
          in ELG and ELG-compatible are accessible via their DOI and directly retriev-
          able via a URL. The HTTPS protocol is used.
      •   A1.1 The protocol is open, free, and universally implementable
          HTTPS is used for providing access to metadata and resources.
      •   A1.2 The protocol allows for an authentication and authorisation procedure,
          where necessary
          HTTPS is used for providing access to metadata and resources. ELG uses an
          authentication and authorisation system.
      •   A2 – Metadata are accessible, even when the data are no longer available
          When a resource or a metadata record is deleted, a tombstone page with all
          the required elements following DataCite recommendations is put in place.
Interoperability principles
      •   I1 – (Meta)data use a formal, accessible, shared, and broadly applicable
          language for knowledge representation
          All metadata records are exported in XML format, a subset is available in
          JSON-LD format; work is ongoing for the export into RDF using the MS-
          OWL ontology.
      •   I2 – (Meta)data use vocabularies that follow FAIR principles
          The metadata elements and values are taken from two RDF/OWL ontologies,
          MS-OWL and OMTD-SHARE25 .



24   https://datasetsearch.research.google.com
25   http://w3id.org/meta-share/omtd-share
2 The European Language Grid Platform: Basic Concepts                                   27

      •   I3 – (Meta)data include qualified references to other (meta)data
          Qualified relations are used for linking between versions of the resources and,
          in cases of imported records, for linking with their source metadata records.
Re-usability principles
      •   R1 – (Meta)data are richly described with a plurality of accurate and relevant
          attributes
          Alongside the “description” element where providers are advised to add as
          much information as possible for the benefit of human users, the ELG schema
          includes elements that can be used to identify potential uses of a resource and
          properties that make clear where they can be of use, e. g., “intended applica-
          tion”, “service function”, “domain”, etc.
      •   R1.1 – (Meta)data are released with a clear and accessible data usage license
          All resources must have a licence; the licence value and a link to the licence
          text are included in the metadata. Metadata are also permissively licensed
          with a Creative Commons licence.
      •   R1.2 – (Meta)data are associated with detailed provenance
          The source for the metadata record is explicitly added in the metadata record
          (“metadata creator” or “source repository”). Properties about the creation of
          a resource are included in the metadata.
      •   R1.3 – (Meta)data meet domain-relevant community standards
          With regard to the metadata, the ELG schema is based on META-SHARE,
          a well-established metadata vocabulary in the LT community. For the tools
          and services added in the ELG catalogue, the technical specifications follow
          current best practices (e. g., preparing a Docker image). For data, a set of rec-
          ommendations, taking into account established file formats, standards, and
          de facto best practices, is under construction.



8 Related Platforms and Infrastructures

ELG builds upon previous work of the ELG consortium partners and the wider Eu-
ropean LT community (Rehm et al. 2020b), especially META-NET26 and ELRC27 .
   The ELG platform shares common features and goals with other platforms, repos-
itories, projects or other initiatives: 1. a collection of LT/NLP tools or datasets, 2. a
platform, which harvests metadata records from distributed sources, 3. a platform
for the sharing of tools or datasets, 4. a platform for the deployment of services, 5. a
repository for storing data files. Comparisons can be made along various dimensions.
We include here an overview at the level of the main functionalities provided, while
the respective background and technical details are presented in Chapters 3 and 4. An
alternative and minimally outdated comparison is provided in Rehm et al. (2020a).

26   http://www.meta-net.eu
27   https://www.elrc-share.eu
28                                    Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

    META-SHARE28 is a network of repositories (Piperidis 2012; Piperidis et al.
2014). Each repository, or node, hosts various types of resources (datasets, services,
etc.) described with the META-SHARE metadata schema (Gavrilidou et al. 2012).
Each node is deployed at a different organisation. The nodes periodically harvest
metadata records from each other. Architecture and conceptual design of the ELG
platform have been inspired by the META-SHARE setup but designed and imple-
mented from scratch. ELG adopts a different approach as it operates as a centralised
platform where individuals can directly register, download and run resources and
services. Harvesting is also performed but from external catalogues (e. g., ELRC-
SHARE29 , LINDAT/CLARIAH-CZ30 , etc.), as described in Chapter 6. From an en-
gineering point of view, ELG is a radically improved version of META-SHARE,
e. g., 1. ELG offers REST APIs while META-SHARE does not, 2. the ELG front
end and back end are implemented as different layers that can be developed in par-
allel, 3. the metadata schema has been updated and extended to cover new resource
types and description requirements.
    The OpenMinTeD platform31 was designed as an open, service-oriented e-Infra-
structure for Text and Data Mining of scientific content (Labropoulou et al. 2018).
It includes a catalogue for datasets, NLP and text mining services, worfklows, lex-
ica etc., described with a rich metadata schema, OMTD-SHARE. REST APIs for
searching, metadata and resource upload/download are provided, as in the case of
ELG. OpenMinTeD was a centralised repository, and harvesting was employed as
a one-off procedure for importing metadata records from a few content providers.
It supported the creation of workflows from tools contained in the catalogue, and
their execution on datasets provided through the same platform; the functionality
was based on the Galaxy32 worfklow management system (Afgan et al. 2018).
    ELRC-SHARE33 (Piperidis et al. 2018a) is an infrastructure developed by the
European Language Resource Coordination action34 with the objective to host, doc-
ument, manage and distribute LRs pertinent to MT, with a particular focus on the
needs of the eTranslation35 service of the European Commission. It is a centralised
repository with a catalogue of datasets, which are added and documented by individ-
uals. Metadata records of tools and services are listed as for information only.
    The European AI-on-demand platform, as initiated by the EU project AI4EU
seeks to bring together the European AI community while promoting European val-
ues.36 The platform is a facilitator of knowledge transfer from research to multiple


28   http://www.meta-share.org
29   https://www.elrc-share.eu
30   https://lindat.mff.cuni.cz
31   https://github.com/openminted – the OpenMinTeD platform is not available online any more.
32   https://galaxyproject.org/learn/advanced-workflow/
33   https://www.elrc-share.eu
34   https://lr-coordination.eu
35   https://cor.europa.eu/en/engage/Pages/e-translation.aspx
36   https://www.ai4europe.eu
2 The European Language Grid Platform: Basic Concepts                                 29

business and industry domains. The AI catalogue37 is designed for hosting datasets
and services in the area of AI; for instance, it includes NLP resources, computer vi-
sion services, etc. The capabilities of the metadata schema used are rather limited
compared to the ELG schema. It also provides catalogues for organisations involved
in AI38 , collaborating projects39 and educational resources40 , but the catalogues are
all separate, without any linking between the entities as offered in the ELG catalogue.
    CLARIN41 (Hinrichs and Krauwer 2014; Eskevich et al. 2020) is a European
Research Infrastructure providing access to digital language resources and tools to
researchers in the humanities and social sciences. CLARIN does not host a single
repository; instead, it is organised in the form of a network of centres that operate
their own repositories and catalogues. The individual centres are free in their choice
of repository software and metadata schema (Broeder et al. 2008). The CLARIN
Virtual Language Observatory42 is the central catalogue which harvests metadata
from all centres as well as other catalogues of interest to scholars in the target disci-
plines and displays them in a uniform way, although only a subset of the metadata
elements are common. Processing services are catalogued centrally in the Language
Switchboard 43 , while some CLARIN centres make available processing services
connected to their catalogues or offered separately (e. g., LINDAT/CLARIAH-CZ44 ,
PORTULAN-CLARIN45 , CLARIN:EL46 , etc.). Unlike ELG, there is no central com-
pute infrastructure for deploying and running processing services.
    The Language Application Grid (LAPPS Grid)47 (Ide et al. 2014, 2016) is an open,
interoperable web service platform for NLP research and development. It provides
facilities for selecting and combining NLP tools and services to create workflows,
composite services, and applications, and to evaluate, reproduce, and share them. It
is based largely on the Galaxy48 worfklow management system and does not actually
include a catalogue. Some limited metadata have to be provided in order to create
the files that are required for adding tools used in Galaxy wokflows, e. g., the name
of the tool, a description, input parameters etc. For datasets no metadata are required
since they are not permanently stored in Galaxy.
    Hugging Face49 is an AI/NLP company, offering repository and deployment func-
tionalities for machine learning (Wolf et al. 2020). It hosts a large set of models and

37 https://www.ai4europe.eu/research/ai-catalog
38 https://www.ai4europe.eu/ai-community/organizations
39 https://www.ai4europe.eu/ai-community/projects

40 https://www.ai4europe.eu/education/education-catalog

41 https://www.clarin.eu

42 https://vlo.clarin.eu

43 https://switchboard.clarin.eu

44 https://lindat.mff.cuni.cz

45 https://portulanclarin.net

46 https://inventory.clarin.gr

47 https://www.lappsgrid.org

48 https://galaxyproject.org/learn/advanced-workflow/

49 https://HuggingFace.co
30                                 Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

datasets that can be used for model training. It offers a catalogue with a limited REST
API, e. g., the API does not allow filtering search results, etc. Similar to this, there are
other catalogues and repositories, such as Kaggle50 and Papers With Code51 , which
target the machine learning community. These are also community-driven, i. e., re-
sources are registered by individuals and have their own metadata schemas.
   Finally, we should mention the long lasting initiative of ELRA and the LREC
community in establishing the LREC Map (Calzolari et al. 2010), as well as the
growing popularity of initiatives that include general (e. g., European Open Science
Cloud52 ) or federated catalogues (e. g., Gaia-X53 ) and also general repositories (e. g.,
Zenodo54 ), which bring together a large range of resources from and for various
disciplines. See Chapter 6 for more details.



9 Conclusions

ELG has been designed as the primary platform for the European LT community,
adopting a holistic view of technology development, deployment and use, bringing
together language data, resources and processing services as well as the commer-
cial and non-commercial LT actors and initiatives. ELG has established and imple-
mented a standardised resource life cycle catering for all stages, from creation to
publication and version evolution. The primary services offered are dedicated to the
deposition, discovery, distribution and deployment of language resources and tech-
nologies through appropriate interfaces for technical and non-technical providers,
developers, consumers and integrators. Such interfaces include web GUIs, REST
APIs and a Python Software Development Kit (SDK). Its operations are supported
by a metadata model underlying the description, search, discovery and distribution
of resources and services, conforming to the FAIR principles. On this basis, ELG
has started building bridges to existing initiatives for harvesting and importing in-
formation and resources from other infrastructures, platforms and repositories under
mutually agreed conditions, business policies, acknowledgement and attribution of
the source, and collaborates in joint initiatives and crowdsourcing campaigns.



References

Afgan, Enis, Dannon Baker, Bérénice Batut, Marius van den Beek, Dave Bouvier, Martin Čech,
   John Chilton, Dave Clements, Nate Coraor, Björn A Grüning, Aysam Guerler, Jennifer Hillman-
   Jackson, Saskia Hiltemann, Vahid Jalili, Helena Rasche, Nicola Soranzo, Jeremy Goecks, James

50   https://www.kaggle.com
51   https://paperswithcode.com
52   https://eosc-portal.eu
53   https://www.gaia-x.eu
54   https://zenodo.org
2 The European Language Grid Platform: Basic Concepts                                          31

   Taylor, Anton Nekrutenko, and Daniel Blankenberg (2018). “The Galaxy platform for accessi-
    ble, reproducible and collaborative biomedical analyses: 2018 update”. In: Nucleic Acids Re-
    search 46.W1, W537–W544. DOI: 10.1093/nar/gky379. URL: https://academic.oup.com/nar
   /article/46/W1/W537/5001157.
Albertoni, Riccardo, David Browning, Simon Cox, Alejandra Gonzalez-Beltran, Andrea Perego,
    and Peter Winstanley, eds. (2020). Data Catalog Vocabulary (DCAT) – Version 2. W3C Rec-
    ommendation. URL: https://www.w3.org/TR/vocab-dcat-2/.
Broeder, Daan, Thierry Declerck, Erhard Hinrichs, Stelios Piperidis, Laurent Romary, Nicoletta
    Calzolari, and Peter Wittenburg (2008). “Foundation of a Component-based Flexible Registry
    for Language Resources and Technology”. In: Proceedings of the Sixth International Confer-
    ence on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco: ELRA. URL:
    http://www.lrec-conf.org/proceedings/lrec2008/pdf/364_paper.pdf.
Calzolari, Nicoletta, Claudia Soria, Riccardo Del Gratta, Sara Goggi, Valeria Quochi, Irene Russo,
    Khalid Choukri, Joseph Mariani, and Stelios Piperidis (2010). “The LREC Map of Language
    Resources and Technologies”. In: Proceedings of the Seventh International Conference on Lan-
    guage Resources and Evaluation (LREC 2010). Valletta, Malta: ELRA. URL: http://www.lrec-
    conf.org/proceedings/lrec2010/pdf/370_Paper.pdf.
Eskevich, Maria, Franciska de Jong, Alexander König, Darja Fišer, Dieter Van Uytvanck, Tero
   Aalto, Lars Borin, Olga Gerassimenko, Jan Hajic, Henk van den Heuvel, Neeme Kahusk, Krista
    Liin, Martin Matthiesen, Stelios Piperidis, and Kadri Vider (2020). “CLARIN: Distributed Lan-
    guage Resources and Technology in a European Infrastructure”. In: Proc. of the 1st Int. Work-
    shop on Language Technology Platforms (IWLTP 2020, co-located with LREC 2020). Ed. by
    Georg Rehm, Kalina Bontcheva, Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasil-
    jevs. Marseille, France: ELRA, pp. 28–34. URL: https://aclanthology.org/2020.iwltp-1.5.
Gavrilidou, Maria, Penny Labropoulou, Elina Desipri, Stelios Piperidis, Haris Papageorgiou, Mon-
    ica Monachini, Francesca Frontini, Thierry Declerck, Gil Francopoulo, Victoria Arranz, and Va-
    lerie Mapelli (2012). “The META-SHARE Metadata Schema for the Description of Language
    Resources”. In: Proceedings of the Eighth International Conference on Language Resources
    and Evaluation (LREC 2012). Istanbul, Turkey: ELRA, pp. 1090–1097. URL: http://www.lrec
   -conf.org/proceedings/lrec2012/pdf/998_Paper.pdf.
Hinrichs, Erhard and Steven Krauwer (2014). “The CLARIN Research Infrastructure: Resources
    and Tools for eHumanities Scholars”. In: Proceedings of the Ninth International Conference on
    Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland: ELRA, pp. 1525–1531.
    URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/415_Paper.pdf.
Ide, Nancy, James Pustejovsky, Christopher Cieri, Eric Nyberg, Denise DiPersio, Chunqi Shi, Keith
    Suderman, Marc Verhagen, Di Wang, and Jonathan Wright (2016). “The Language Application
    Grid”. In: Worldwide Language Service Infrastructure. Ed. by Yohei Murakami and Donghui
    Lin. Cham: Springer, pp. 51–70. DOI: 10.1007/978-3-319-31468-6_4.
Ide, Nancy, James Pustejovsky, Christopher Cieri, Eric Nyberg, Di Wang, Keith Suderman, Marc
   Verhagen, and Jonathan Wright (2014). “The Language Application Grid”. In: Proc. of the Ninth
    International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik,
    Iceland: ELRA. URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/926_Paper.pdf.
ISO (2020). ISO 3166 – Country Codes. International Organization for Standardization. URL: htt
    ps://www.iso.org/iso-3166-country-codes.html.
Khan, Anas Fahad, Christian Chiarcos, Thierry Declerck, Daniela Gifu, Elena González-Blanco
    Garcı́a, Jorge Gracia, Maxim Ionov, Penny Labropoulou, Francesco Mambrini, and John P.
    McCrae (2022). “When Linguistics Meets Web Technologies. Recent advances in Modelling
    Linguistic Linked Open Data”. In: Semantic Web Journal. Accepted for publication.
Labropoulou, Penny, Dimitris Galanis, Antonis Lempesis, Mark Greenwood, Petr Knoth, Richard
    Eckart de Castilho, Stavros Sachtouris, Byron Georgantopoulos, Stefania Martziou, Lucas Anas-
    tasiou, Katerina Gkirtzou, Natalia Manola, and Stelios Piperidis (2018). “OpenMinTeD: A
    Platform Facilitating Text Mining of Scholarly Content”. In: Proceedings of WOSP 2018 (co-
    located with LREC 2018). Miyazaki, Japan: ELRA, pp. 7–12. URL: http://lrec-conf.org/works
    hops/lrec2018/W24/pdf/13_W24.pdf.
32                                   Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

Labropoulou, Penny, Katerina Gkirtzou, Maria Gavriilidou, Miltos Deligiannis, Dimitris Galanis,
   Stelios Piperidis, Georg Rehm, Maria Berger, Valérie Mapelli, Michael Rigault, Victoria Ar-
   ranz, Khalid Choukri, Gerhard Backfried, José Manuel Gómez Pérez, and Andres Garcia-Silva
   (2020). “Making Metadata Fit for Next Generation Language Technology Platforms: The Meta-
   data Schema of the European Language Grid”. In: Proceedings of the 12th Language Resources
   and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe
   Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard,
   Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA,
   pp. 3421–3430. URL: https://www.aclweb.org/anthology/2020.lrec-1.420/.
Maali, Fadi and John Erickson, eds. (2014). Data Catalog Vocabulary (DCAT) – Version 1. W3C
   Recommendation. URL: https://www.w3.org/TR/2020/SPSD-vocab-dcat-20200204/.
McCrae, John Philip, Penny Labropoulou, Jorge Gracia, Marta Villegas, Víctor Rodríguez-Doncel,
   and Philipp Cimiano (2015). “One Ontology to Bind Them All: The META-SHARE OWL
   Ontology for the Interoperability of Linguistic Datasets on the Web”. In: The Semantic Web:
   ESWC 2015 Satellite Events. Ed. by Fabien Gandon, Christophe Guéret, Serena Villata, John
   Breslin, Catherine Faron-Zucker, and Antoine Zimmermann. Lecture Notes in Computer Sci-
   ence. Springer International Publishing, pp. 271–282. URL: https://link.springer.com/chapter
   /10.1007/978-3-319-25639-9_42.
Phillips, Addison and Mark Davis (2009). Tags for Identifying Languages. Tech. rep. RFC 5646.
   Internet Engineering Task Force. URL: https://datatracker.ietf.org/doc/rfc5646.
Piperidis, Stelios (2012). “The META-SHARE Language Resources Sharing Infrastructure: Princi-
   ples, Challenges, Solutions”. In: Proceedings of the Eight International Conference on Lan-
   guage Resources and Evaluation (LREC’12). Ed. by Nicoletta Calzolari, Khalid Choukri,
   Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno,
   Jan Odijk, and Stelios Piperidis. Istanbul, Turkey: ELRA.
Piperidis, Stelios, Penny Labropoulou, Miltos Deligiannis, and Maria Giagkou (2018a). “Manag-
   ing Public Sector Data for Multilingual Applications Development”. In: Proceedings of the
   Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Ed.
   by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti
   Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno,
   Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga. Miyazaki, Japan: ELRA. URL: http://w
   ww.lrec-conf.org/proceedings/lrec2018/pdf/648.pdf.
Piperidis, Stelios, Penny Labropoulou, and Maria Gavriilidou (2018b). “clarin:el An infrastructure
   for the documentation, sharing and processing of language data (in Greek)”. In: Proceedings
   of the 12th International Conference on Greek Linguistics (ICGL12). Vol. 2. Berlin, Germany:
   Edition Romiosini/CeMoG, Freie Universität Berlin, pp. 851–869. URL: http://www.cemog.f
   u-berlin.de/en/icgl12/offprints/piperidis-lampropoulou-gavriilidou/icgl12_Piperidis-et-al.pdf.
Piperidis, Stelios, Harris Papageorgiou, Christian Spurk, Georg Rehm, Khalid Choukri, Olivier Ha-
   mon, Nicoletta Calzolari, Riccardo del Gratta, Bernardo Magnini, and Christian Girardi (2014).
   “META-SHARE: One year after”. In: Proceedings of the 9th Language Resources and Evalu-
   ation Conference (LREC 2014). Ed. by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck,
   Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios
   Piperidis. Reykjavik, Iceland: ELRA, pp. 1532–1538. URL: http://www.lrec-conf.org/proceed
   ings/lrec2014/pdf/786_Paper.pdf.
Rehm, Georg (2016). “The Language Resource Life Cycle: Towards a Generic Model for Creat-
   ing, Maintaining, Using and Distributing Language Resources”. In: Proceedings of the 10th
   Language Resources and Evaluation Conference (LREC 2016). Ed. by Nicoletta Calzolari,
   Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asun-
   cion Moreno, Jan Odijk, and Stelios Piperidis. Portorož, Slovenia: ELRA, pp. 2450–2454. URL:
   https://aclanthology.org/L16-1388.pdf.
Rehm, Georg, Maria Berger, Ela Elsholz, Stefanie Hegele, Florian Kintzel, Katrin Marheinecke,
   Stelios Piperidis, Miltos Deligiannis, Dimitris Galanis, Katerina Gkirtzou, Penny Labropoulou,
   Kalina Bontcheva, David Jones, Ian Roberts, Jan Hajic, Jana Hamrlová, Lukáš Kačena, Khalid
   Choukri, Victoria Arranz, Andrejs Vasiļjevs, Orians Anvari, Andis Lagzdiņš, Jūlija Meļņika,
2 The European Language Grid Platform: Basic Concepts                                         33

   Gerhard Backfried, Erinç Dikici, Miroslav Janosik, Katja Prinz, Christoph Prinz, Severin Stam-
   pler, Dorothea Thomas-Aniola, José Manuel Gómez Pérez, Andres Garcia Silva, Christian
   Berrío, Ulrich Germann, Steve Renals, and Ondrej Klejch (2020a). “European Language Grid:
   An Overview”. In: Proceedings of the 12th Language Resources and Evaluation Conference
   (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri,
   Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asun-
   cion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3359–3373. URL:
   https://www.aclweb.org/anthology/2020.lrec-1.413/.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic,
   Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez
   Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek,
   Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Al-
   bina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De
   Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek,
   Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur
   Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás
   Váradi, Kadri Vider, Andy Way, and François Yvon (2020b). “The European Language Technol-
   ogy Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communi-
   cation in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation
   Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christo-
   pher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani,
   Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325.
   URL: https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Ax-
   ton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E.
   Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier
   Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alas-
   dair J.G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A.C ’t Hoen,
   Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert
   Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik,
   Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris
   A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waag-
   meester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons (2016). “The
   FAIR Guiding Principles for Scientific Data Management and Stewardship”. In: Scientific Data
   3. DOI: 10.1038/sdata.2016.18. URL: http://www.nature.com/articles/sdata201618.
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
   Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick
   von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger,
   Mariama Drame, Quentin Lhoest, and Alexander M. Rush (2020). “Transformers: State-of-the-
   art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Meth-
   ods in Natural Language Processing: System Demonstrations. ACL, pp. 38–45. DOI: 10.1865
   3/v1/2020.emnlp-demos.6. URL: https://aclanthology.org/2020.emnlp-demos.6.
34                                                    Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

Appendix



             LANGUAGE RESOURCE /
                                                             TOOL/SERVICE                                DISTRIBUTION                        DATA
                 TECHNOLOGY

                       IDENTITY                              CATEGORIES                                    TECHNICAL                         DATA

         •     Resource name                     •           IDENTITY
                                                       Function                                    •   Software distribution form
         •     Description
                     IDENTITY                                                                      •   Private
         •     Version                                        TECHNICAL                            •   Docker download location *
                                                                                                   •   Download location *
                    CATEGORIES                    •     Language dependent                         •   Access  location *
                                                                                                             IDENTITY
                                                  •     Input content resource                     •   Execution location *
         •         IDENTITY
               Keyword                                    • Resource type                          •   Web service type *
                                                          • IDENTITY
                                                               Language *                          •   Licence
                                                  •     Output resource *
                       CONTACT                            • Resource type
                                                          • Language *
      •       Additional information
                     IDENTITY
                                                             EVALUATION
                 DOCUMENTATION

                  RELATED LRT'S

Fig. 6 ELG minimal schema version for a tool/service




          LANGUAGE RESOURCE /
                                                CORPUS                             PART                          DISTRIBUTION             DATA
              TECHNOLOGY

                   IDENTITY                    TECHNICAL                       TEXT PART *                         TECHNICAL              DATA

      •      Resource name             •   Corpus subclass           •     Language                       •   Dataset distribution form
                                                                                  IDENTITY
      •      Description
                   IDENTITY            •   Personal data             •     Multilinguality type *         •   Download location *
      •      Version                   •         IDENTITY
                                           Sensitive data                                                 •   Access location *
                                       •   Anonymized *                       AUDIO PART *                •   Distribution location *
                 CATEGORIES                                                                               •   Text features *
                                                                     •     Language                              • Size
                                                                     •            IDENTITY
                                                                           Multilinguality type *                • Data format
      •          IDENTITY
             Keyword
                                                                                                          •   Audio features *
                                                                              VIDEO PART *                       • Size
                   CONTACT                                                                                       • Data format
                                                                     •    Language                        •   Video IDENTITY
                                                                                                                     features *
     •       Additional information
                    IDENTITY                                         •    Multilinguality
                                                                                  IDENTITYtype *                 • Size
                                                                     •    Type of content                        • Data format
                                                                                                          •   Image features *
              DOCUMENTATION                                                   IMAGE PART *                       • Size
                                                                                                                 • Data format
                RELATED LRT'S                                        •     Type of content
                                                                                IDENTITY                  •   Numerical text features *
                                                                                                                 • Size
                                                                         NUMERICAL TEXT PART *                   • Data format
                                                                                                          •   Licence
                                                                     •     Type of content
                                                                                IDENTITY


Fig. 7 ELG minimal schema version for a corpus
2 The European Language Grid Platform: Basic Concepts                                                                                    35

             LANGUAGE RESOURCE /
                                                               MODEL                                           PART                              DISTRIBUTION
                 TECHNOLOGY

                       IDENTITY                              TECHNICAL                                UNSPECIFIED PART                             TECHNICAL

         •     Resource name                       •     Model function                           •   Language                           •    Dataset distribution
                                                                                                             IDENTITY
         •     Description
                     IDENTITY                      •     Model type *                             •   Multilinguality type *             •    Download location
         •     Version                             •     N-gramIDENTITY
                                                                 model *                                                                 •    Access location *
                                                           • Base item                                                                   •    Distribution location
                                                                                                                                                     IDENTITY
                     CATEGORIES                            • Order                                                                       •    Unspecified feature
                                                                                                                                                 • Size
                                                                                                                                                 • Data format
         •     Keyword
                     IDENTITY                                                                                                            •    Licence
         •     Intended application

                       CONTACT

      •        Additional information
                      IDENTITY


                 DOCUMENTATION

                   RELATED LRT'S

Fig. 8 ELG minimal schema version for a model




          LANGUAGE RESOURCE /
                                                   LCR                            PART                          DISTRIBUTION                     DATA
              TECHNOLOGY

                   IDENTITY                     TECHNICAL                     TEXT PART *                         TECHNICAL                      DATA

      •      Resource name              •   Encoding level           •   Language                        •   Dataset distribution form
                                                                                IDENTITY
      •      Description
                   IDENTITY             •   Personal data            •   Multilinguality type *          •   Download location *
      •      Version                    •         IDENTITY
                                            Sensitive data                                               •   Access location *
                                        •   Anonymized *                     AUDIO PART *                •   Distribution location *
                 CATEGORIES                                                                              •   Text features *
                                                                     •   Language                               • Size
                                                                     •          IDENTITY
                                                                         Multilinguality type *                 • Data format
      •          IDENTITY
             Keyword
                                                                                                         •   Audio features *
                                                                             VIDEO PART *                           Size
                                                                                                                • IDENTITY
                   CONTACT                                                                                      • Data format
                                                                    •    Language                        •   Video features *
     •       Additional information
                    IDENTITY                                        •    Multilinguality
                                                                                 IDENTITYtype *                 • Size
                                                                    •    Type of content                        • Data format
                                                                                                         •   Image features *
              DOCUMENTATION                                                  IMAGE PART *                       • Size
                                                                                                                • Data format
                RELATED LRT'S                                        •   Type of content
                                                                              IDENTITY                   •   Licence


Fig. 9 ELG minimal schema version for a lexical/conceptual resource
36                                              Stelios Piperidis, Penny Labropoulou, Dimitris Galanis et al.

          LANGUAGE RESOURCE /
                                            GRAMMAR                       PART                   DISTRIBUTION             DATA
              TECHNOLOGY

                 IDENTITY                   TECHNICAL                 TEXT PART *                  TECHNICAL              DATA

      •    Resource name            •   Encoding level
                                             IDENTITY        •   Language                 •   Dataset distribution form
                                                                        IDENTITY
      •    Description
                 IDENTITY                                    •   Multilinguality type *   •   Download location *
      •    Version                                                                        •   Access location *
                                                                     VIDEO PART *         •   Distribution location *
               CATEGORIES                                                                 •   Text features *
                                                            •    Language                        • Size
                                                            •    Multilinguality
                                                                         IDENTITYtype *          • Data format
      •        IDENTITY
           Keyword                                                                                   IDENTITY
                                                            •    Type of content          •   Video features  *
                                                                                                 • Size
                 CONTACT                                            IMAGE PART *                 • Data format
                                                                                          •   Image features *
      •    Additional information                            •   Type IDENTITY
                                                                      of content
                  IDENTITY                                                                       • Size
                                                                                                 • Data format
                                                                                          •   Licence
            DOCUMENTATION

              RELATED LRT'S


Fig. 10 ELG minimal schema version for a grammar




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 3
Using the European Language Grid
as a Consumer

Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano, Athanasia
Kolovou, Dimitris Gkoumas, Andis Lagzdiņš, and Stelios Piperidis




Abstract This chapter describes the European Language Grid cloud platform from
the point of view of a consumer who wishes to access language resources or make use
of language technology tools and services. Three aspects are discussed: 1. the web-
based user interface (UI) for casual and non-technical users, 2. the underlying REST
APIs that drive the UI but can also be called directly by third parties to integrate ELG
functionality in their own tools, and 3. the Python Software Development Kit (SDK)
that we have developed to simplify access to these APIs from Python code. The
chapter concludes with a preview of the upcoming payment module that will enable
the sale of commercial LT services and resources through ELG, and a discussion of
how ELG compares and relates to other similar platforms and initiatives.



1 Introduction

The European Language Grid (ELG) platform (Rehm et al. 2021) provides access
to Language Technology (LT) tools and services, both basic Natural Language Pro-
cessing (NLP) tools and end-to-end applications, as well as data resources, such as
structured and unstructured datasets and corpora, Machine Learning models, lexica,
ontologies, terminologies, etc. Chapters 7 (p. 131 ff.) and 8 (p. 151 ff.) present the
current state of LT services as well as datasets and language resources included in
the ELG platform respectively.

Ian Roberts
University of Sheffield, UK, i.roberts@sheffield.ac.uk
Penny Labropoulou · Dimitris Galanis · Athanasia Kolovou · Dimitris Gkoumas · Stelios Piperidis
Institute for Language and Speech Processing, R. C. “Athena”, Greece, penny@athenarc.gr,
galanisd@athenarc.gr, akolovou@athenarc.gr, dgkoumas@athenarc.gr, spip@athenarc.gr
Rémi Calizzano
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany,
remi.calizzano@dfki.de
Andis Lagzdiņš
Tilde, Latvia, andis.lagzdins@tilde.lv


© The Author(s) 2023                                                                        37
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_3
38                    Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

    ELG enables consumers of Language Technology to browse through the ELG cat-
alogue and have an overview of its contents, search for specific resources and select
as well as view the features of a resource through its formal description (metadata
record). Users can download resources hosted in the ELG cloud infrastructure in
accordance with their licensing conditions, or, in the case of external resources, be
re-directed to the location where they can be downloaded from or accessed. They can
also try out services in order to assess whether they comply with their needs; for this
to happen, the services must comply with the ELG technical interoperability speci-
fications, which are outlined in Chapter 4. Furthermore, ELG includes a catalogue
of commercial companies and academic and research organisations that are active
in the LT domain and of EU and national projects that have funded the development
and maintenance of LRTs (see Chapter 9); LRTs, actors and projects are interlinked
offering a comprehensive image of the LT landscape in Europe.
    Different types of users have different requirements and different levels of techni-
cal expertise, and the ELG platform provides a variety of access methods to address
these; all the principal functionality of the ELG is offered through both web-based
user interfaces (UIs, see Section 2) for interactive use and Application Programming
Interfaces (APIs, see Section 3) for programmatic access. In addition, the ELG team
supports the advanced needs of LT integrators with dedicated tools and helpers; most
notably a Software Development Kit (SDK) for Python (see Section 4), which is cur-
rently the most widely used programming language in the LT community.
    Supporting consumers to easily discover resources is of utmost importance, espe-
cially when a catalogue contains many entries, as in the case of ELG (over 13,000
metadata records for LRTs and 1,800 related entities at the time of writing and con-
stantly increasing). Best practices and recommendations (Wu et al. 2019; Wilkinson
et al. 2016) have been taken into account in the design and implementation of the
ELG catalogue pages and interaction mechanisms with the consumers.
    At present all functionality of the ELG platform is offered free of charge. All
users can view the catalogue and metadata descriptions as well as download open
access resources. In order to download resources with restrictive licences and try
out ELG-compatible services, users must register in the platform, as described in
Section 5. It should be noted that while the ELG platform does not currently charge
fees for access to any resources or services, restrictions may apply with regard to
the intended use(s) of the resource (e. g., available only for non-commercial use),
request for explicit consent to licensing conditions, etc. Resources available with
commercial licences are described in the ELG catalogue but for now re-directed
to the providers for further information. A prototype billing module, described in
Section 6, has been implemented and will be fully launched following the setup
of the ELG legal entity (see Chapter 13). Finally, in Section 7 we compare the ELG
platform to other similar services and initiatives, from the point of view of the service
or resource consumer. A similar comparison from the point of view of the provider
can be found in Chapter 4.
3 Using the European Language Grid as a Consumer                                                                                                                39

2 Web-based Interface

The ELG platform targets a diverse set of user types with different needs and levels
of technical expertise. The primary access route for non-technical users is via the
web user interface (UI), which prioritises user-friendliness and ease of use alongside
raw performance considerations. The catalogue UI includes two main pages: the
catalogue page, which offers access to the catalogue contents, and the view pages
for each metadata record or resource (LT, LR, organisation, project).



2.1 Viewing the Catalogue

After ELG’s homepage, the dedicated catalogue page (Figure 1) is the primary entry
point through which users have access to the ELG platform contents and functions.
Users can browse through the entire catalogue to find entries that might interest them.
They can also look for specific entries, using the free text search bar, filtering the
catalogue with one or more facets, or combining these two modes.



                                                                                                            Catalogue     Documentation & Media         About

               RELEASE 3




     Search for services, tools, datasets, organizations...                                                                         Search




    Language resources &
    technologies
                                                     AbuseEval                                                                                       41 views
                                                     version: 1.0
    Service functions
                                               Extension of OLID/OffensEval data set with distinction of explicit vs implicit
                                               offensive messages.Annotation of Abusive Language, distinguishing also
                                               between explicit vs implicit offensive messages.
    Languages
                                               Keyword: Corpus Creation/Annotation
                                               Language: English
    Media types
                                               Licence: Creative Commons Attribution Non Commercial Share Alike 4.0 International


    Licences
                                                     Academic Written Catalan in Catalonia [CesCa: El Català
                                                     Escolar Escrit a Catalunya]                                                             for information
    Conditions of use                                version: 1.0.0 (automatically assigned)

                                               It is a reference corpus of the written scholar Catalan in Catalonia. It con-
    Related entities                           tains 2.426 processed texts that have been produced by children between
                                               the last year of childhood education (P5) and the last year of obligator

    ELG integrated services                    Keywords: schoolar · written · obligatory education period
    and data                                   Language: Catalan



Fig. 1 Browse/Search page of the ELG catalogue


   The main section of the catalogue page shows all published entries sorted by
name in alphabetical order. Users can also sort the entries according to the update
date of the metadata record, so that they can view the most recently added entries
40                       Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

first. The catalogue shows only the most recent version of each entry if multiple
versions are registered. The snippet informs the users of additional older entries,
which can be viewed and accessed through the view page of the newest version
(see Section 2.3). This allows users to always keep up to date with the most recent
version of a service, but also access older versions when needed, for instance, when
reproducing previously published experiments.
    Each entry is shown with an informative snippet, designed to serve as a preview
of the full metadata record and to help users decide whether they want to explore
the entry further. Following well-established practices in catalogues, each entry is
represented by its name, an excerpt of its description, a set of metadata tags, and
popularity indicators. The set of metadata tags has been carefully selected to accom-
modate consumer requirements, as identified in a user survey conducted during the
ELG design and specification phase (Melnika et al. 2019) and subsequently enriched
based on user feedback. All types of entries include their free-text keywords. Entries
representing LRTs additionally include the resource type (represented with an icon),
language(s), and licence(s). The popularity indicators, displayed at the right hand
side of the snippet, consist of counts of visits of the view page of all versions of
an entry, counts of downloads (for ELG-hosted resources only) and number of calls
(for ELG-compatible services only; again for all versions of the entry). Finally, ded-
icated badges are shown for resources hosted in ELG and ELG-compatible services,
as well as for a subset of the metadata records that have been imported from other
catalogues with minimal metadata (see Chapter 6).



2.2 Searching the Catalogue

Search of the catalogue is supported in two different modes, which can be combined
in order to refine search queries and support users in easily finding entries of interest:
free text search (Section 2.2.1) and faceted search (Section 2.2.2).


2.2.1 Free Text Search

Users enter a word or phrase in the search box at the top of the catalogue page (see
Figure 1) and click the “Search” button to submit the query. By default, the search
functionality matches whole words using the OR operator. Advanced queries, util-
ising the Lucene query syntax1 , are supported, allowing users to search for partial
or exact matches, words or phrases, etc. Only certain metadata elements have been
indexed to make them searchable; these include a resource’s name(s), short name(s),
keywords and a subset of technical elements appropriate for each entry type and
deemed important as a search criterion. For example, for all LRTs, additional in-


1   https://www.lucenetutorial.com/lucene-query-syntax.html
3 Using the European Language Grid as a Consumer                                      41

dexed elements are the “resource type”, “language” and “licence”; for LT tools/ser-
vices, “service function” is also added to the search elements.
   In addition, to improve recall of search results, for those metadata elements that
take values from controlled vocabularies, i. e., “service function”,“intended LT appli-
cation”, and “language”, the query is expanded with the use of synonyms. Synonyms
for the first two elements are derived from a taxonomy of LT activities2 , which pro-
vides the values. For alternative names of languages, besides the official ones in-
cluded in the ISO 639-3 standard for language codes3 (International Organization
for Standardization 2007), we exploit open access vocabularies published as linked
data, i. e., the Glottolog list of languoids (families, languages, dialects)4 , the lexvo
ontology of languages5 , and the WALS list of languages6 ; all these vocabularies are
offered through Glottolog.


2.2.2 Faceted Search

Users can filter the catalogue or previous search results by selecting values from the
list of facets (Figure 2) on the left side of the catalogue page (Figure 1). For facets
with a long list of values, such as languages and licences, the facet values are broken
down into subsections or a search bar is included to refine the list.


                                  Language resources &
                                  technologies


                                  Service functions



                                  Languages


                                   Official EU languages

                                   Type to narrow down Official EU
                                   languages


                                    Bulgarian                  (634)
                                    Croatian                   (512)
                                    Czech                      (790)
                                    Danish                     (576)
                                    Dutch                      (793)
                                  Show more

Fig. 2 Faceted search in the ELG catalogue

2 Part of the OMTD-SHARE ontology, see http://w3id.org/meta-share/omtd-share.
3 https://iso639-3.sil.org/code_tables/639/data
4 https://glottolog.org

5 http://lexvo.org/ontology

6 https://wals.info/languoid
42                        Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

    The facets were selected in the initial phase of the ELG development based on
user preferences collected through a survey conducted for the technical platform
specifications (Melnika et al. 2019). Important criteria for users searching for are
language coverage (62%), licence and access conditions (59%) and availability of
open source code (56%). Later on, more facets have been added to reflect updates
in the metadata schema and improve search capabilities (Wu et al. 2019).
    There are two facets, based on the resource type and entity type elements, that cre-
ate dedicated subsets of the catalogue contents. The values are taken from the respec-
tive elements of the ELG metadata schema, but are tuned to current LT approaches.
Thus, with regard to LRTs, users can view specific catalogues of tools and services,
corpora, lexical/conceptual resources, models, grammars and other language descrip-
tions. In the ELG schema the last three are subclasses of the language description
type, but we opted to treat them as separate resource types primarily to improve the
visibility of models; these are what define the state of the art for many NLP tasks
and are likely to be particularly popular, so need to be easily discoverable. The two
catalogues of organisations and projects are a valuable asset for boosting and acti-
vating interactions within and across the LT community (including match-making in
the ELG marketplace) and eventually also for monitoring funding outcomes.
    LRTs can be further filtered using the facet ELG integrated services and data to
restrict the catalogue view to the ELG-compatible services and resources hosted in
ELG, for users who wish to take advantage of the “try out” functionality offered by
ELG for services or of the direct download of resources uploaded in ELG.
    The facet languages shows the language coverage of the LRTs in the ELG cata-
logue, i. e., the languages of the contents of data resources and the ones that tools/ser-
vices cater for. Given the scope of ELG, the official EU languages are presented in
a separate group shown at the top of the facet. The encoding of language values in
the catalogue follows the BCP 47 recommendations (Phillips and Davis 2009), i. e.,
it allows for users adding a tag consisting of subtags for language, region, script and
language variants, but for simplicity of the UI the facet browser includes only the val-
ues of the language subtag. Moreover, it includes only one of the known names of a
language; e. g., for “Catalan; Valencian”, only the first name is shown. For languages
and language varieties without an ISO 639 code, we show the name associated with
the respective Glottocode7 if it has one.
    The facets intended LT application and service function are used for classifying
LRTs and related entities with concepts specific to the LT community; consumers
can search for services that perform specific functions (e. g., dependency parsers,
Machine Translation tools), but also for corpora or models that have been created
or can be used for a a specific application (e. g., bilingual or multilingual corpora to
be used for building machine translation models), as well as for organisations and
projects active in an LT area; the values of these two elements are both taken from
the taxonomy of LT areas8 , and free text values that have been added by users.


7   https://glottolog.org/meta/glossary
8   http://w3id.org/meta-share/omtd-share/
3 Using the European Language Grid as a Consumer                                        43

    Licensing and access conditions are among the search criteria most requested
by users: licences gives the detailed list of licences used for LRTs in the catalogue.9
The more coarse-grained facet Conditions of use groups licences by the general types
of conditions they impose (e. g., “no commercial use”, “share-alike”), intended for
users with little knowledge of legal terms. Users are still advised to carefully read
the licence specified on the view page of each LRT for all terms and conditions.
    The media types facet was introduced at a later stage when the number of mul-
timodal resources included in the catalogue increased. As for languages, this refers
to the media type of the contents of resources or the media type of the input/output
of tools, and can be used to quickly search not only for text-related applications and
resources, but also for audio, video and image ones.
    The ELG catalogue includes both entries added by individuals and entries ag-
gregated from other catalogues.10 Thus, the facet source refers to the source of the
metadata record. It includes the name of the catalogue from which the record has
been imported or the value “ELG/ELE” for records originating in ELG or added by
the collaborating project European Language Equality (ELE)11 through processes
described in Chapter 6.



2.3 Viewing Metadata Records and Resources

By clicking the title of an entry on the catalogue page, users can view its full descrip-
tion. Figures 3 and 4 show the view page of a tool/service and a corpus respectively.
Specific view pages have been implemented for all LRT types published in ELG.
Their design takes into account user preferences and requirements, design and ac-
cessibility considerations and the ELG metadata schema. They allow users to access
detailed information about an item, test it, if it is a service integrated in the platform,
and, finally, obtain and use it for their purposes.
   Even though the types of information shown on the view pages differ for each
category, we apply a consistent visual look and feel for all of them. The information
on the view page of each item comes from the respective metadata record. Taking
into consideration the specificities and richness of the metadata schema, but also
user-friendliness, the information is layered along specific sections of the page. Thus,
view pages share a common layout that consists of a header, a right-hand sidebar, a
main content area and a bottom content region; the positioning of the elements on
the page and the formatting of the text is carefully thought through to draw users’
attention to the most important information.
   The header shows the name and version of the resource, its resource type and op-
tionally important flags (e. g., to indicate that a certain service is deployed in ELG).


9 Chapter 6 discusses why this element was made mandatory.
10 See Chapters 4 and 6 for more information on the respective modes of population.
11 https://european-language-equality.eu
44                    Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.




Fig. 3 View page of an ELG-compatible service



   At the top of the right-hand sidebar, the button “Claim” may appear for some of
the metadata records; these are records with minimal metadata that have been im-
ported through automatic harvesting and bulk collection procedures (see Chapter 6).
The claiming process enables interested users, i. e., the rightful owners of these LRTs,
to ask to curate and enrich them. The same area provides for all records information
on how they can be cited, according to data and software citation principles (Smith
3 Using the European Language Grid as a Consumer   45




Fig. 4 View page of a corpus
46                         Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

et al. 2016; Data Citation Synthesis Group 2014) and DataCite guidelines12 . They
also have the option to share the URL link of the page by email or through social
media and export the metadata record as an XML file in the ELG-compliant schema.
Statistics of resource usage are shown both for the particular resource version and
for all versions (if there are multiple versions). Links to other versions of the same
resource are also displayed here.
    In the content area, tabs split information into smaller views and enable users to
navigate to offered functionalities of the platform. The first tab provides an overview
of the main features of the entry that help users decide if the resource fits their needs.
In terms of layout it is similar across resource types, but the information types (meta-
data elements) differ. Compare, for instance, Figures 3 and 4 that show the overview
tab for a service and a corpus. The top shows a free text description for all record
types, followed by a section for classification information (keywords, domain, ser-
vice function, etc.) and an area for technical metadata, e. g., the media type(s) and
language(s) of a corpus, the input and output data formats for a service, etc. The
bottom section contains hyperlinks to useful documents, creation details, etc. and is
again specific to resource types.
    Depending on the resource type, the “Download” or “Download/Run” tab presents
information related to the distribution of the resource, such as the licence under
which it can be accessed, a technical description of its content files (e. g., size and
format for data resources), and access to the resource itself – a direct download link
if the resource is uploaded into ELG (see Section 3.2), otherwise a redirect to the
resource on its provider’s site. Figure 5 shows the tab for a corpus hosted in ELG.
    A third tab appears if the item is related to other items, e. g., a project with the
LRTs this project has funded, an organisation with the LRTs it has created and the
projects it is involved in.
    Finally, ELG-compatible services have two more tabs that enable users to try out
the service (see Section 2.5) and inform them how to use it via the command line or
Python SDK (see Section 4).



2.4 Consumer’s Grid

Individuals can browse the catalogue, view detailed metadata cards and download
open access resources without any registration. To access restricted resources and
run ELG-compatible services, they must be registered with an ELG account and
also logged in. For registered users, ELG offers a dashboard (“grid”) for managing
and performing actions on catalogue items depending on their rights (see Chapter 2
for more information on user roles and rights). As for view pages, the grid follows
a similar layout which is customised for each user type.
   The consumer’s grid (Figure 6) allows registered users to monitor their usage of
daily quotas, view details on downloads of LRTs they performed and of the services

12   https://datacite.org/cite-your-data.html
3 Using the European Language Grid as a Consumer   47




Fig. 5 Download tab for a corpus
48                         Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.




Fig. 6 Consumer’s grid (see Figure 4 in Chapter 4, p. 73, for the Provider’s grid)



they have deployed. Additional elements of the “My grid” section that are relevant
only to provider users are discussed in Chapter 4.



2.5 Try out UIs for Language Technology Services

One of the key benefits of having an LT service fully integrated in ELG is that users
have access to a “try out” UI from which they can test the service directly using their
web browser. ELG provides standard trial UIs13 covering all principal service types:
     • Information Extraction (IE) & text analysis services take text input and produce
       standoff annotations over that text.
       In addition to this generic text analysis UI there is also a specific one for depen-
       dency parsers that renders CoNLL-U style annotations as a tree structure.14
     • Text-to-text services (most notably Machine Translation, but also summarisation,
       anonymisation, etc.) take text and return new text that is derived from the input.
     • Text classification services take text input and classify it somehow (e. g., lan-
       guage identification, “fake news” detection, etc.)
     • Speech recognition services accept audio and return a text transcription.


13   Service providers whose tools do not fit one of the above UIs are free to provide their own.
14   https://universaldependencies.org/format.html
3 Using the European Language Grid as a Consumer                                    49




Fig. 7 An example “try out” UI for a named entity service



  • Audio annotation services take audio and return standoff annotations over par-
    ticular time segments of the audio stream.
  • Text-to-speech services take text and return audio.
  • Image OCR (optical character recognition) services take image data and return
    text extracted from the image.
   The trial UIs for services are available to any user who has logged in to the ELG
portal. The UI appears in the “Try out” tab when viewing a service in the catalogue;
Figure 7 shows an example for a simple service that only requires plain text. How-
ever, some services can be much more complex, requiring additional parameters or
providing snippets of sample data that users can test the service with – if a service
declares these kinds of items in its metadata record, then the try out UI will automat-
ically adapt, as shown in Figure 8. This service – also see Chapter 18 – declares two
optional parameters and offers a selection of samples in different languages.
   The UIs have been designed to render all of the main service response types in a
user-friendly way, for example, annotations over text are shown as colour highlights
(Figure 9), translated text is displayed alongside the original, audio can be played
directly in the browser, etc.
50                                           Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.


                                                                                                                                  My grid                    Ian Roberts

                                                                                       Technologies    Resources      Community      Events       Documentation         About ELG

             RELEASE 2




                                                                                                                                                                   Go to catalogue




              Text to Terminological Concept System                                                                      Cite resource
              Text2TCS                                                                                                   Gromann, Dagmar (2021, September 15). Text to
              Version: 1.1.2 (15/09/2021)                                                                                Terminological Concept System. Version 1.1.2.
                   ELG-compatible service
                                                                                                                         [Software (Tool/Service)]. Source: European
                                                                                                                         Language Grid. https://live.european-language-
                                                                                                                         grid.eu/catalogue/tool-service/8122


       Overview                   Download/Run                    Try out                    Code samples




     What is coronavirus?
     There are many different kinds of coronavirus (CoV). Known types of coronavirus include:                                       Annotations
                                                                                                                                         c01: coronavirus
     SARS coronavirus (SARS-CoV), which was first detected in 2003;
     MERS coronavirus (MERS-CoV), which first occurred in humans in 2012.                                                                 c02: types
Fig. 8 A more complex “try out” UI for the Text2TCS service
     In humans, certain types of coronavirus can cause illnesses ranging from a common cold to severe pneumonia. Other types          c03: SARS coronavirus; MERS
     of coronavirus can cause a variety of infectious diseases in animals. Some types of coronavirus can be transmitted from        coronavirus; coronavirus SARS-CoV-2
     animals to humans.
                                                                                                                                         c04: detected; identified

     The coronavirus identified in China in late 2019 was never before detected in humans.                                                c05: humans
                                                                                                                                         c06: illnesses; disease
     On 11 February 2020, WHO assigned the official name COVID-19 (coronavirus disease 2019) to this disease. The designation
                                                                                                                                         c07: common cold
     for the pathogen (germ) was changed from 2019-nCoV to SARS–CoV-2.
                                                                                                                                         c08: severe pneumonia
                                                                         c23: infection with coronavirus SARS–CoV-2
     How dangerous is coronavirus SARS-CoV-2?                                                                                            c09: infectious diseases
                                                                           Name     Value
     Similar to seasonal influenza, it affects in particular elderly persons and persons with a weakened immune system.
                                                                                                                                         c10: animals
     In more severe cases, infection with coronavirus SARS–CoV-2 can, id            c23cause pneumonia or severe breathing
                                                                           for example,
     difficulties.                                                           term     infection with coronavirus SARS–CoV-2                c11: transmitted

                                                                       relations  [ … ]                                                 c12: China
     How does coronavirus SARS-CoV-2 spread?
                                                                                                                                         c13: WHO
     Person-to-person spreading is the most frequent path of infection with coronavirus SARS–CoV-2. Contagion can be caused
     by:                                                                                                                                 c14: COVID-19
                                                                                                                                         c15: designation
     Mucus and saliva
                                                                                                                                         c16: pathogen
     Urine and faeces
     Body fluids like for example blood                                                                                                   c17: 2019-nCoV
                                                                                                                                         c18: SARS–CoV-2
     Features                                                                                                                            c19: dangerous
      Name          Value                                                                                                                c20: seasonal influenza
      Graph Link https://live.european-language-grid.eu/temp-storage/retrieve/01h5dwxa-cn5gqz6vmgra6f4t9adkut4gfs0ef                    c21: elderly persons; persons
      TBX Link      https://live.european-language-grid.eu/temp-storage/retrieve/01h5dwxa-jfkfprgapwsz9i4uea8pxchf6iu5f 
                                                                                                                                         c22: immune system
                                                                                                                                      c23: infection with coronavirus
                                                                                                                                    SARS–CoV-2

Fig. 9 Example result for the Text2TCS service showing rendered text annotations
                                                                       c24: pneumonia
                                                                                                                                         c25: severe breathing difficulties
                                                                                                                                         c26: spread
                                                                                                                                         c27: Contagion
                                                                                                                                         c28: Mucus
3 Public REST APIs                                                                                                                       c29: saliva
                                                                                                                                         c30: Urine
                                                                                                                                         c31: faeces

The web user interfaces described above are built on top of a set of REST APIs, and                                                      c32: Body fluids
                                                                                                                                         c33: blood
the same APIs can also be called directly by third parties, allowing ELG functionality
to be accessed programmatically and embedded into other tools. The current public
       BACK
APIs break down into three principal groups: 1. accessing/using the catalogue (Sec-
tion 3.1), 2. accessing and downloading ELG-hosted data resources (Section 3.2),
3. calling ELG-hosted LT services (Section 3.3).
   All APIs are HTTPS-based and use JSON as the primary data representation for-
mat. Where authentication is required, this is performed using OAuth2 access tokens
     Home         Technologies       Resources      Events       Documentation       About ELG                                                                      Contact us
issued by the ELG user management layer (see Section 5).
                 The European Language Grid has received funding from the European Union’s
                 Horizon 2020 research and innovation programme under grant agreement
                 № 825627 (ELG)

      © 2022 ELG Consortium Terms of Use
3 Using the European Language Grid as a Consumer                                      51

3.1 Accessing and Using the Catalogue

The ELG catalogue is a Python web application based on the Django REST Frame-
work.15 It offers a number of services as REST APIs, including the following ones
which are useful for consumers: 1. searching the catalogue, 2. authorising the down-
load of a resource or access of any resource or page, 3. retrieving the metadata de-
scription of a resource.



3.2 Downloading a Resource

ELG allows providers to upload and store the actual contents of their LRTs within
the platform (data files for corpora, source code for software, etc.), and the catalogue
offers an API to allow consumers to download this data subject to licensing terms.
    LRT data is stored in a storage service compatible with the API of Amazon S3.
Access by consumers is mediated by a Storage Proxy.16 The proxy defers to a data
management module within the catalogue application (see Section 6) to determine,
based on authentication information provided by the user who attempts the down-
load, whether that user has the permission to download the requested resource. Fac-
tors considered in making a decision include whether the resource is open access to
all requesters (authenticated or not), if it requires authentication, or if the user must
explicitly accept the terms of the licence prior to download.



3.3 Language Technology Service Public API

One of the great strengths of ELG is its use of a single harmonised set of APIs for
all ELG-compatible LT services regardless of provider. This differs from other API
aggregator platforms such as RapidAPI17 , where each service provider defines their
own API and the caller must adapt their code for each different service.
    For each LT service the platform provides two endpoints at which the service
can be called, which implement synchronous and asynchronous modes of operation.
These endpoints are implemented in the LT Service Execution Server. The endpoint
URLs can be found in the service_info section of the metadata record JSON
structure returned by the catalogue API.
    The synchronous mode simply consists of a single API call in which the caller
will POST the data to be processed and receive the results via the response to the
same request. The asynchronous mode accepts the same type of request but instead
of blocking the caller until the results are ready it returns a polling URL, which the

15   https://www.django-rest-framework.org
16   https://gitlab.com/european-language-grid/platform/s3proxy
17   https://rapidapi.com
52                        Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

caller must repeatedly poll for status updates. This requires more HTTP requests but
for long-running services (or those that take some time to scale up from idle) the
asynchronous mode is more resilient to connection failures or intermediary proxy
timeouts between the client and the ELG platform.
   Any query parameters appended to the URL will be passed through to the service
and may affect its behaviour – each service declares the parameters that it supports
in its metadata. All available versions of a given service are exposed at the same
endpoint, the ?version=... parameter is used to select between them, with the
latest version used by default if no parameter is given.
   The POST data must have an appropriate Content-Type header for the service
in question; services that take text (such as text analysis or MT services) expect
“text/plain”18 , services that take audio (such as speech recognition) expect “audio/x-
wav” or “audio/mpeg”, and services that take images expect the “image/png”, “im-
age/jpeg”, etc. A few services expect their input to be “structured text” that has been
pre-segmented by the caller, for these the request must be presented in an ELG-
defined JSON format. The response will be in JSON, in one of a variety of formats
depending on the data type:
     • Standoff annotations are represented in a style inspired by the format used by
       Twitter, each type of annotation mapping to a JSON array of objects referenc-
       ing the start and end locations of the annotation (characters for text, fractional
       seconds for audio), and an optional set of features.
     • Classifications of the whole input have their own format giving an ordered list
       of classes, each with an optional score.
     • New texts such as translations of text or transcriptions of audio are returned in a
       structured format referred to as a “texts” response (note texts is plural). This is
       described in more detail below.
     • Audio responses such as text-to-speech are still represented in JSON. Short snip-
       pets of audio can be returned inline in base 64 encoding, longer audio will typi-
       cally be stored at a short-lived temporary URL for the caller to download via a
       separate HTTPS request.
    The full specification of these response types can be found in the ELG documen-
tation.19 The “texts” response type is the most complex one as it is able to encode a
nested tree structure of texts, where each node in the tree can be either a leaf node
containing a single string of content, or a branch node containing another level of
texts. The vast majority of services currently using this response format produce one
of the three basic forms shown in Listing 1: a single text, a flat list of segments or
alternatives, or a two-level list where each segment has a set of alternatives.
    The property role is used to distinguish the cases. Not all services populate this
property but it is encouraged; conventionally a role of “sentence”, “paragraph” or
“segment” denotes segments of text that are all part of the same transcript or trans-
lation, and “alternative” denotes different translations or transcriptions of the same

18   UTF-8 encoding is the default but can be overridden by adding the charset=... parameter.
19   https://european-language-grid.readthedocs.io/en/stable/all/A3_API/LTPublicAPI.html
     3 Using the European Language Grid as a Consumer                                   53

 1   // A single text
 2   {
 3     " response ":{
 4        "type":" texts ",
 5        " texts ":[
 6           {" content ":"This is some text"}
 7        ]
 8     }
 9   }
10

11   // A flat list of segments or alternatives
12   {
13     " response ":{
14        "type":" texts ",
15        " texts ":[
16           {" content ":" First sentence ", "role":" sentence "},
17           {" content ":" Second sentence ", "role":" sentence "},
18        ]
19     }
20   }
21

22   // A two level list of segments that each have a number of alternatives
23   {
24     " response ":{
25        "type":" texts ",
26        " texts ":[
27           {
28              "role":" sentence ",
29              " texts ":[
30                 {" content ":" Translation one", "role":" alternative "},
31                 {" content ":" First translation ", "role":" alternative "}
32              ]
33           },
34           ...
35        ]
36     }
37   }
     Listing 1 The three most common types of “texts” response




     input segment. In the case of alternatives, each entry may also have a “score” repre-
     senting the relative quality of the different options.
         For errors (and also for warning messages), ELG, being a multilingual platform,
     uses a format designed to be amenable to internationalisation (i18n). Each message
     is represented as a JSON object with three properties “code”, “text” and “params”
     (see Listing 2). The property “code” is the primary identifier for the error; there is
     a list of standard message codes provided in the ELG documentation but providers
     are free to create their own codes if the standard messages do not adequately cover
     their needs. The property “text” is a string for the message text in English, and it
     may include numbered placeholders {0}, {1}, etc. If the message has placeholders,
     54                      Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

 1   {
 2        "code":"elg. request .type. unsupported ",
 3        "text":" Request type {0} not supported by this service ",
 4        " params ":[" audio "]
 5   }
     Listing 2 An example “status message” object from the ELG API, designed to be easily translated
     into many languages.



 1   POST https :// live.european -language -grid.eu/i18n/ resolve ?lang=fr
 2   Content -Type: application /json
 3

 4   [
 5        {
 6            "code":"elg. request .type. unsupported ",
 7            "text":" Request type {0} not supported by this service ",
 8            " params ":[" audio "]
 9        }
10   ]
11

12   // response
13   Content -Type: application /json
14

15   ["La demande du type audio n'est pas supportée par ce service "]
     Listing 3 Resolving a status message to a translated string




     the corresponding values are given in the “params” array (as a zero-based index, so
     0 refers to the first item, 1 to the second, etc.). The error message may also include
     an optional “detail” object providing more technical details about the error.
         The standard ELG message codes have translations into a number of different
     languages (twelve at the time of writing, with more in the pipeline), and ELG pro-
     vides a special API endpoint that accepts an array of errors and an ISO 639 language
     code, and returns an array of message strings in the requested language (if available)
     with all placeholders filled in. If the requested message code is not available in that
     language the endpoint falls back to English, and if the message code is not known
     at all then the “text” fallback from the original error is used instead.
         Listing 3 shows an example of calling the “resolver” API; the ?lang=... param-
     eter specifies the desired language. If it is not provided then the resolver will respect
     any Accept-Language HTTP header on the request.20 If no language is requested
     by the parameter or the header then messages will be returned in English by default.
         Some long running services will return more meaningful progress updates as they
     work through their various stages of processing, and these updates will be passed
     back to the caller if they use the asynchronous API mode – requests to the polling

     20For browser-based clients this will typically result in the messages being returned in the user’s
     preferred browsing language.
3 Using the European Language Grid as a Consumer                                      55

URL for a given job will return the latest progress update if the process is not yet
complete. These updates are represented as i18n message objects in the same way as
the errors and warnings described above, and they can be resolved to strings using
the same resolver API endpoint.



4 Python SDK for Users

ELG provides many APIs to access the catalogue and search for specific resources, to
download corpora hosted in ELG, to call services or many other uses (see Section 3).
This provides ELG users with a lot of flexibility in the way they want to interact with
the platform, however, the basic APIs are rather low level. For example, the search
endpoint is paginated and returns only 20 results per call, which means that multiple
API calls are needed to obtain more than 20 results. Similarly, calling a service via
the public LT service API in the asynchronous mode requires multiple API calls to
be made at the correct times and in the correct sequence to perform what is, from
the user’s perspective, a single action.
   In order to simplify interactions with the platform, we developed a Python SDK
that operates on top of the various ELG APIs and provides simple methods to easily
interact with ELG and consume the resources in Python. We chose Python as the
language for this first ELG SDK as it is probably the most widely-used programming
language within the LT community.
   The SDK is included in the ELG Pypi package which can be installed using the
pip command familiar to any Python programmer. The basic SDK for consumer
use is installed using pip install elg. The SDK provides access to most ELG
functions through Python. It provides access to the catalogue with methods that allow
users to search the catalogue and look for corpora, services, and organisations. The
SDK enables users to call ELG-compatible services, and even to combine them using
a simple pipeline mechanism.



4.1 Browsing the Catalogue

   The SDK enables access to the ELG catalogue. It uses the same filters as the
UI, i. e., we can filter for the type of resource or LT service, languages and licence;
free text search can also be used. Listing 4 shows how to search for an English
to French machine translation service. The SDK handles issues such as pagination
automatically and returns the result as a list of entities, where each entity is a Python
object that encapsulates the information about the respective ELG resource.
     56                    Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

 1   from elg import Catalog
 2

 3   catalog = Catalog ()
 4

 5   # Search and get the result as a list of Python objects
 6   results = catalog . search (
 7       resource = "Tool/ Service ", # " Corpus ", " Lexical / Conceptual
 8                                    # resource " or " Language
 9                                    # description "
10       function = " Machine Translation ", # only for "Tool/ Service "
11       languages = ["en", "fr"], # string or list if multiple
12                                    # languages
13   )
     Listing 4 Example code to use the ELG catalogue




     4.2 Downloading a Resource

     The Python SDK has a Corpus class that corresponds to a corpus or data set. It can
     be initialised using the identifier of the resource. If the resource is stored in ELG,
     it can be downloaded using the download method of the Corpus class. Listing 5
     shows the most simple usage and parameters are available to choose the distribution
     or specify the download location for example.


 1   from elg import Corpus
 2

 3   corpus = Corpus . from_id (913) # initialise the Corpus using its ID
 4   corpus . download ()            # download corpus method
     Listing 5 Example code to download an ELG corpus




     4.3 Obtaining an Access Token

     Some functions are restricted to authorised users of ELG (see Section 5). For the
     restricted APIs, an access token must be retrieved to identify the user behind the
     API call. It is possible to obtain a short-lived valid access token through the UI but
     this is not convenient for programmatic use.
         To address this limitation, the Python SDK includes the Authentication class
     that interacts directly with the ELG OpenID Connect authentication service to obtain
     tokens, i. e., the access token to authenticate the API call and the refresh token which
     is used to refresh the access token when it expires.
     3 Using the European Language Grid as a Consumer                                          57

 1   from elg import Authentication
 2

 3   auth = Authentication .init ()
 4   # here the user is asked to authenticate in the browser
 5

 6   auth = Authentication .init(scope=" offline_access ")
 7   # here we are requesting an ``offline'' token that remains valid until
 8   # revoked, as opposed to the usual token that requires re-authentication
 9   # after 6 hours
10

11   auth. to_json (" tokens .json") # export the tokens to a json file
12

13   auth = Authentication . from_json (" tokens .json")
14   # creation of an Authentication object from the tokens in the json file
     Listing 6 Example of code to obtain, store, and retrieve authentication tokens




         Listing 6 shows an example usage of the Authentication class. During the
     process, the user has to authenticate using their browser and paste the resulting au-
     thorisation code back to the Python program. Once the Authentication object is
     initialised, it is possible to save the tokens in a json file and reuse them. Obtained
     tokens are by default valid for only six hours. It is possible to get tokens that are
     valid indefinitely by setting the scope parameter to offline_access.



     4.4 Calling Language Technology Services

     The Service class of the Python SDK corresponds to an ELG LT service, and can
     be initialised using the identifier of the service. As users need to be authenticated to
     use ELG services, a login step is necessary. Alternatively, it is possible to provide
     an Authentication object or a json file containing the tokens during the initial-
     isation of the service, which allows the login step to be skipped. Various ways of
     authenticating during the service initialisation of a service are shown in Listing 7.
        A service that is initialised in Python can be called easily (see Listing 8). The
     Python SDK handles the creation of the input message, any necessary refreshing of
     the access token, the communication with the REST API, etc.
        When calling a service, the input request can be provided in various formats: a
     plain text, a path to a text or an audio file, or a Request object.21 The result is a
     Python object that corresponds to one of the response messages (see Section 3.3).




     21 https://european-language-grid.readthedocs.io/en/stable/all/A1_PythonSDK/notebooks/Service
     .html#Usage
    58                         Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

1   from elg import Service
2

3   lt = Service . from_id (474) # login step necessary ( unless tokens
        are cached ) and the tokens will expire after 6 hours
4   lt = Service . from_id (474 , scope=" offline_access ") # login step
        necessary ( unless tokens are cached ) and the tokens will
        never expire
5   lt = Service . from_id (474 , auth_object =auth) # 'auth ' is an
        Authentication object . No login step and the expiration of
        the tokens depends on the `auth ` object
6   lt = Service . from_id (474 , auth_file =" tokens .json") # file
        containing existing tokens . No login step and the expiration
        of the tokens depends on the scope used to create them
    Listing 7 Different ways of providing authentication during Service initialisation



1   from elg import Service
2

3   lt = Service . from_id (474) # initialise LT service using its ID
4   result = lt(" Nikola Tesla did not live in Berlin .") # run service
5   print ( result )
    Listing 8 Example code for calling an ELG service




    5 User Authentication

    While general exploration and search in the ELG catalogue is open to all, various
    other operations in ELG are restricted to certain users. For example, access to the LT
    service public API (via the Python SDK, curl or the “try out” UIs) requires the caller
    to be logged in so that the platform can enforce API call quotas to limit how much
    data can be processed by each user per day, following the ELG licensing strategy
    (see Section 6). Similarly, the submission of new resources and metadata records is
    limited to users who are registered as providers; administrative tasks are restricted
    to the technical ELG team.
        Registering a regular user account is a simple self-service procedure. The regis-
    tration form is available through the sign up/sign in icon in the top right corner of the
    catalogue page. All registered users are assigned the consumer role by default. To
    get provider status, users can submit a request through their profile page. All other
    roles are assigned internally by the ELG administrators.
        ELG uses Keycloak22 , a user management, authentication and authorisation server
    based on the OAuth2 and OpenID Connect23 standards. Keycloak supports both in-
    teractive authentication of users through the web UI, and programmatic access to
    the REST APIs using JSON Web Tokens. Users sign in to Keycloak, then they (or

    22   https://www.keycloak.org
    23   https://openid.net/connect/
3 Using the European Language Grid as a Consumer                                     59

the client tool they are using, such as the ELG Python SDK) can acquire an access
token, which is a cryptographically signed “permit” that encodes their identity and
permissions. API endpoints can verify the validity of the token by checking its sig-
nature, and then make access decisions based on the “claims” encoded in the token
without needing to check every request directly with the authentication server.
   The adoption of OpenID Connect opens up the possibility for third party applica-
tions to allow their own users to authenticate using ELG accounts, in the same way
as many existing websites and applications support “sign in with Google” or “sign
in with Facebook”. The OpenID Connect specification allows this without compro-
mising the protection of users’ personal information. When a given user attempts to
“log in with ELG” to a particular third party application for the first time, Keycloak
requires the user to grant explicit consent before any of their data is shared with the
provider, and that consent can be revoked at any time. At the time of writing the first
proof of this concept is under development with one of the ELG pilot projects.



6 Licensing and Billing

ELG includes mechanisms that support the consumption of services and resources
that are available without any restrictions in terms of commercial aspects. It sup-
ports the download of resources under the condition that they are offered free of
charge with open access licences or with restrictive licences that require only user
authentication and, optionally, accepting the licensing terms. Technical safeguards
have been implemented to ensure that access to LRTs is granted in accordance with
the above terms, for example, access to LRTs distributed with restricted licences is
made available only to those users that fulfil the criteria specified in the licences.
With regard to LT services, only the “try out” functionality is available and only for
registered users. Each user has two independent daily quotas for the quantity of data
processed, one for plain text and the other for binary (audio or image) data, to reflect
the fact that binary formats generally require much more data than plain text.
   In addition, we also designed and implemented the prototype of a billing module
that will enable ELG to offer resources and services distributed with commercial
licences. The module is based on the commercial platform Chargebee, which was
selected because it fulfilled our requirements: it ensures security and includes vari-
ous services, such as handling subscriptions, payments, pricing, taxes, emails, ensur-
ing customer satisfaction and conformance to all EU and national laws, and offers
several functionalities, such as checkout pages, self-service after the payment, can-
cellation, creating and managing subscription plans, subscription changes, etc. The
integration of the external billing module is based on the interaction between the
two platforms, ELG and Chargebee. Information about the pricing of a resource or
service is formally encoded in the metadata record in ELG; administrative and ex-
ecution costs may also be added and calculated on the ELG side. In the Chargebee
catalogue we maintain a set of all monetised products and plans, and their prices.
60                       Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

    The relationship between the ELG catalogue products and the Chargebee cata-
logue is not necessarily one-to-one; Chargebee can contain paid plans that allow the
use of multiple products from the ELG catalogue, or the download of multiple re-
sources. The relation between the two catalogues depends on the ELG business strat-
egy. All transactions, subscription changes, logs, billing information, subscription
data and similar information are stored on the Chargebee side, i. e., a database that
is external to ELG. Any information needed from Chargebee can be synchronised
through a webhook mechanism. For the ELG platform, this information includes the
identity of the user who has performed an action through a subscription plan and/or
a purchase, the action performed, the billing plan to which the user subscribed, etc.
Chargebee sends this information via HTTPS POST to the ELG back end so that it
can register changes in the ELG platform. The ELG back end monitors the user’s
quota usage and, taking into account the user’s subscription plans from the Charge-
bee platform, decides whether to allow or block a request for running a service. A
similar procedure is used for the download of a purchased resource.



7 Consumer-Related Functionalities in ELG and other Platforms

In this section we present platforms and catalogue-based systems that share features
with ELG, with a special focus on functionalities for consumers.



7.1 Catalogue and Repository Functionalities

With regard to the presentation and organisation of the contents of such a digital
catalogue of artefacts, the users of ELG can see all types of entities on the same
page or go through quick links from the top menu to the subset that interests them.
Offering such resource type-specific filtering functionalities is an approach adopted
by many catalogues, for example, Hugging Face24 has separate pages for models
and datasets, Papers with Code25 for datasets and benchmarks, some CLARIN cen-
tres distinguish between data resources and services (e. g., CLARIN-PL26 , etc.), the
European AI on demand platform27 maintains separate catalogues for AI assets, or-
ganisations, projects and educational resources. This approach is particularly useful
for expert users with clear search objectives. In addition, distinguishing between
separate resource types allows for the selection of different metadata elements and
subgroupings of entries along the parameters most suitable to each type (e. g., group-
ing together services based on the tasks they perform or the degree of complexity

24   https://huggingface.co
25   https://paperswithcode.com
26   https://clarin-pl.eu
27   https://www.ai4europe.eu
3 Using the European Language Grid as a Consumer                                   61

of use, and datasets based on modality or language). On the other hand, the one-
size-fits-all page has the benefit of allowing users to have an overview of resources
and activities using the same set of filters. ELG combines the two approaches by
providing quick links in the top menu and filters for the targeted pages.
   With regard to search functionalities, free text search is the most popular one. In
some cases, an autocomplete function (e. g., Hugging Face) is used while advanced
queries are less used. Faceted search is also common, but in most cases with limited
facets (e. g., European AI on Demand platform, Hugging Face, etc.). Search with
programmatic modes through REST APIs is offered by many platforms on a limited
set of metadata elements in the same way that ELG does.
   With regard to the functionalities offered for hosted data resources, direct down-
load of open access resources is common. A download link that can be used from
outside the platform (e. g., through a command line mode, or as a URL link) is pro-
vided in most cases. The deployment of integrated services on hosted resources is
a feature offered in only a few platforms (e. g., OpenMinTeD, clarin:el28 ). Machine
Learning platforms, like Hugging Face, can feed hosted datasets into applications,
but this is not among the objectives of the ELG platform.



7.2 Language Technology Service Execution

ELG’s LT service execution functionality has been designed and implemented from
scratch. Below, we compare this functionality with similar related infrastructures or
frameworks and highlight the similarities and differences in various aspects, e. g.,
interchange format, trial/visualisation UIs and support of workflows.
   The DKPro29 family of tools and resources (Gurevych et al. 2007) consists of a
growing number of projects addressing different NLP tasks and aspects, such as
pre-processing, machine learning, and lexical resources. It offers a collection of
tools wrapped as UIMA components (Unstructured Information Management Ar-
chitecture)30 , i. e., the components implement the interfaces and specifications of
the UIMA framework. A UIMA reader component should extend the ResourceCol-
lectionReaderBase class and also implement the getNext(CAS aJCas) method.
A processor must extend JCasAnnotator_ImplBase and, furthermore, implement
process(JCas aJCas) and a writer extends JCasFileWriter_ImplBase and im-
plements process(JCas aJCas). A UIMA reader loads data from a text file and
creates a Common Analysis System (CAS) object. A processor gets a CAS object,
runs the wrapped NLP tool and adds the results to the CAS object. A writer gets a
CAS object and serialises its content to a file in a specific format. UIMA is Java-
based but it can be used to wrap non-Java tools as well. UIMA allows to program-
matically define pipelines (workflows), i. e., chain a reader, various processors, a

28   https://inventory.clarin.gr
29   https://dkpro.github.io
30   https://uima.apache.org
62                        Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

writer and run the pipeline locally; it does not run remote services as in the case of
ELG. The DKPro components are interoperable because they all follow the DKPro
typesystem31 , which defines which annotations can be added to a CAS object, which
features an annotation can contain, how these are serialised etc. The typesystem is
actually an ontology for annotations, how they are organised etc. The ELG JSON
format does not follow a typesystem. Another difference with ELG is that a CAS
object is serialised (by default) in XML Metadata Interchange (XMI) format32 , a
standard for exchanging metadata information via XML; other formats are also sup-
ported. If the results of a DKPro pipeline are exported in an appropriate format (e. g.,
XMI) they can be loaded, visualised and even edited with the annotation tool IN-
CEpTION33 (Klie et al. 2018), which is not possible in the ELG trial UIs.
    GATE34 (Cunningham et al. 2013) is an open source toolkit capable of solving
numerous text processing problem. The GATE framework is written in Java and
similar to DKPro/UIMA. As with UIMA there are additional modules to support
integration with non-Java tools. It allows creating, either via a UI builder or progra-
matically, a pipeline of NLP tools for specific tasks. The completed pipeline can be
saved in the XML “recipe” format XGAPP, which can, in turn, be loaded into the
developer UI to process small numbers of documents and visualise the resulting an-
notations, run using a batch processing tool for larger scale processing, or packaged
as a service on either the ELG or GATE’s own GATE Cloud platform (see Chap-
ter 7, Section 4.2, 140 ff.). Each GATE processing component gets as input a GATE
Document which is enriched with annotations. Again, as in DKPro, GATE readers
and writers load the data and write the processing results. A GATE Document is by
default serialised to GATE XML, however, other formats are also supported. The
annotations that are added in GATE Document do not follow a specific typesystem
but follow some generic rules – each document has one or more sets of annotations,
each set can contain annotations of many types, each annotation can have zero or
more features, and while there is no enforced typesystem, all standard GATE com-
ponents share a set of informal conventions for the types and features they use. This
logic is very similar to the one adopted in ELG’s JSON-based format. Contrary to
ELG, the DKPro/UIMA and GATE tools are not dockerized (by default) and run as
command line tools locally. Furthermore, the ELG services always process raw text
while DKPro and UIMA components can also handle other formats such as PDF,
and documents that have already been partially annotated.
    GATE Cloud35 (Tablan et al. 2013) is a platform very similar in spirit to ELG, but
specifically built around the requirements of GATE-based text analysis tools. It was
developed by the same team at the University of Sheffield that was responsible for
the initial design of the ELG LT service execution layer and thus shares many of the
same API design decisions. GATE Cloud offers a REST API accepting documents

31   http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html
32   https://www.omg.org/spec/XMI/2.5.1/About-XMI/
33   https://inception-project.github.io
34   https://gate.ac.uk
35   https://cloud.gate.ac.uk
3 Using the European Language Grid as a Consumer                                   63

via HTTP post and returning annotations in the native JSON or XML formats of the
GATE framework. GATE Cloud services process only text (not audio or other media
types), but can accept formats such as XML, PDF (with machine-readable text) or
Word documents as well as plain text. As well as the single document API, GATE
Cloud also supports batch processing of larger amounts of data using on-demand
processing capacity from Amazon Web Services. GATE Cloud services are defined
as XGAPP “recipes” in the native GATE format, which are wrapped as Docker con-
tainers for the REST API or executed as-is by the batch processing engine. GATE
Cloud has recently added support for other types of APIs such as image OCR (a
service which has itself been integrated into the ELG platform).
    The LAPPS Grid platform, as DKPro, is based on a typesystem, the LAPPS Web
Service Exchange Vocabulary (Ide et al. 2016), “an ontology of terms for a core of
linguistic objects and features exchanged among NLP tools that consume and pro-
duce linguistically annotated data. It is intended to be used for module description
and input/output interchange to support service discovery, composition, and reuse
in the natural language processing domain.” In LAPPS Grid, as in ELG, tools are
wrapped as web services, packaged as Docker images and exchange JSON messages.
However, LAPPS Grid also offers workflows by using Galaxy, a workflow manage-
ment system. Galaxy includes a visual editor for creating and parameterising work-
flows and an engine for executing these workflows. LAPPS Grid does not have a
catalogue and each service is described with a limited set of metadata elements that
are required for adding it to the Galaxy tool inventory. ELG was not designed to
offer workflows, i. e., it does not include a workflow editor or a workflow execution
engine. In addition, all services get as input raw text and they were not designed for
playing the role of components in a workflow. However, some pipelines can be cre-
ated by using external tools, e. g., the Python SDK and some code/adapters (Rehm et
al. 2020; Moreno-Schneider et al. 2022). For example, using the ELG Python SDK,
a Machine Translation service can be called, the result can extracted from the output
JSON message and fed to an ELG NER service.
    The OpenMinTeD execution service (Labropoulou et al. 2018) is also built on
top of Galaxy. A large number of tools from the DKPro and GATE collections were
ingested to OpenMinTeD. Several tools from other providers were also added. All
tools were dockerized and are executed inside the container as command line tools,
i. e., not as web services. An OpenMinTeD workflow is executed by running a series
of Docker images (one after the other) in a cluster managed by Mesos36 , a framework
similar to Kubernetes37 . The workflow itself is created using the Galaxy editor. In
OpenMinTeD no specific interchange format was enforced, the recommendation was
to use the DKPro typesystem and XMI serialization. However, the GATE tools were
using GATE XML format and several others were using their own custom format
(e. g., based on JSON). In order to create a “mixed” workflow the creator had to com-
bine the respective components with corresponding format adapters. If the results of


36   https://mesos.apache.org
37   https://kubernetes.io
64                        Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

the workflow were in XMI format, they could be visualised using WebAnno38 , a
predecessor of INCEpTION.
    The European AI on Demand platform39 covers the whole European AI landscape
rather than being restricted to LT or NLP. For example, computer vision is also in-
cluded. The services are gRPC-based (not REST-based as in ELG) and are packaged
as Docker images. The messages that they consume and produce are based on the
ProtoBuf serialisation format40 and no specific typesystem is used. The platform
does not offer an execution environment. However, the worfklows that are created
with the AI4EU Experiments editor41 , an editor similar to the one offered by Galaxy,
are exported to a format that allows their execution in a Kubernetes cluster.
    Hugging Face offers a large collection of Transformer-based models for computer
vision, language processing, audio processing etc. Transformers are a specific type
of neural networks (Vaswani et al. 2017) that have revolutionised machine learning
since they achieve state of the art results in many tasks. Hugging Face allows training
of Transformer-based models via the AutoNLP API42 , which is not free of charge.
While we have performed initial experiments, ELG does not offer integrated model
training. In Hugging Face, training as well model deployment is based on Amazon
SageMaker, which is built on top of Docker. Hugging Face users can call a model via
the trial UIs/widgets that are embedded in the respective page (as in ELG). For doing
the same in a programmatic way, Hugging Face offers an inference REST API along
with a Python client API43 . Similar inference functionalities are offered through the
ELG REST APIs and the Python SDK. Upon request, Hugging Face also offers an in-
ference solution delivered as a container with the Transformer model for on-premise
usage.44 It can be used via a HTTP API (as in ELG). Finally, Hugging Face has devel-
oped a Python-based library (called “transformer”) that allows to download a model
and either fine-tune it in a specific task or use it for inference. Such functionality is
not offered by the ELG Python SDK.



8 Conclusions

The ELG platform has fully achieved all objectives it had set for serving consumers.
It allows consumers to browse through the whole ELG catalogue, already populated
with more than 13,000 metadata records, apply faceted filtering and exploration,
search for specific resources and services, download them (if hosted in ELG) and
try out more than 800 functional services, both basic processing NLP services and

38   https://webanno.github.io/webanno/
39   https://www.ai4europe.eu
40   https://developers.google.com/protocol-buffers
41   https://aiexp.ai4europe.eu
42   https://huggingface.co/autotrain
43   https://api-inference.HuggingFace.co/docs/python/html/quicktour.html
44   https://HuggingFace.co/infinity
3 Using the European Language Grid as a Consumer                                              65

end-to-end applications. Users can also access the directory of LT-developing com-
panies and academic organisations, find organisations active in a specific LT area,
and initiate collaborations with them. The links between LRTs, organisations and
projects allows users to navigate between them and have an overview of the over-
all European LT landscape. Consumers can access all these functionalities through
user-friendly web user interfaces, or in programmatic ways, using the public REST
APIs and Python SDK.



References

Cunningham, Hamish, Valentin Tablan, Angus Roberts, and Kalina Bontcheva (2013). “Getting
    More Out of Biomedical Documents with GATE’s Full Lifecycle Open Source Text Analytics”.
    In: PLOS Computational Biology 9.2, pp. 1–16. DOI: 10.1371/journal.pcbi.1002854.
Data Citation Synthesis Group (2014). Joint Declaration of Data Citation Principles – FORCE11.
    Ed. by M. Martone. DOI: 10.25490/a97f-egyk. URL: https://doi.org/10.25490/a97f-egyk.
Gurevych, Iryna, Max Mühlhäuser, Christof Müller, Jürgen Steimle, Markus Weimer, and Torsten
    Zesch (2007). “Darmstadt Knowledge Processing Repository based on UIMA”. In: Proc. of
    the First Workshop on Unstructured Information Management Architecture (co-located with
    GLDV 2007). Tübingen, Germany, p. 89.
Ide, Nancy, Keith Suderman, Marc Verhagen, and James Pustejovsky (2016). “The Language Ap-
    plication Grid Web Service Exchange Vocabulary”. In: Worldwide Language Service Infrastruc-
    ture. Lecture Notes in Computer Science. Springer, pp. 18–32.
International Organization for Standardization (2007). Codes for the representation of names of
    languages – Part 3: Alpha-3 code for comprehensive coverage of languages. URL: https://ww
    w.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/03/95/39534.html.
Klie, Jan-Christoph, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho, and Iryna Gure-
    vych (2018). “The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Inter-
    active Annotation”. In: Proceedings of the 27th International Conference on Computational
    Linguistics (COLING 2018): System Demonstrations. Santa Fe, USA: ACL, pp. 5–9. URL: htt
    p://tubiblio.ulb.tu-darmstadt.de/106270/.
Labropoulou, Penny, Dimitris Galanis, Antonis Lempesis, Mark Greenwood, Petr Knoth, Richard
    Eckart de Castilho, Stavros Sachtouris, Byron Georgantopoulos, Stefania Martziou, Lucas Anas-
    tasiou, Katerina Gkirtzou, Natalia Manola, and Stelios Piperidis (2018). “OpenMinTeD: A
    Platform Facilitating Text Mining of Scholarly Content”. In: Proceedings of WOSP 2018 (co-
    located with LREC 2018). Miyazaki, Japan: ELRA, pp. 7–12. URL: http://lrec-conf.org/works
    hops/lrec2018/W24/pdf/13_W24.pdf.
Melnika, Julija, Andis Lagzdiņš, Uldis Siliņš, Raivis Skadins, and Andrejs Vasiļjevs (2019). De-
    liverable D3.1 Requirements and Design Guidelines. Project deliverable; EU project European
    Language Grid (ELG); Grant Agreement no. 825627 ELG. URL: https://www.european-langu
    age-grid.eu/wp-content/uploads/2021/02/ELG-Deliverable-D3.1-final.pdf.
Moreno-Schneider, Julián, Rémi Calizzano, Florian Kintzel, Georg Rehm, Dimitris Galanis, and
    Ian Roberts (2022). “Towards Practical Semantic Interoperability in NLP Platforms”. In: Pro-
    ceedings of the 18th Joint ACL-ISO Workshop on Interoperable Semantic Annotation (ISA 2022;
    co-located with LREC 2022). Ed. by Harry Bunt. Marseille, France, pp. 118–126. URL: http:
    //www.lrec-conf.org/proceedings/lrec2022/workshops/ISA-18/pdf/2022.isa18-1.16.pdf.
Phillips, Addison and Mark Davis (2009). Tags for Identifying Languages. Tech. rep. RFC 5646.
    Internet Engineering Task Force. URL: https://datatracker.ietf.org/doc/rfc5646.
Rehm, Georg, Dimitrios Galanis, Penny Labropoulou, Stelios Piperidis, Martin Welß, Ricardo
    Usbeck, Joachim Köhler, Miltos Deligiannis, Katerina Gkirtzou, Johannes Fischer, Christian
    Chiarcos, Nils Feldhus, Julián Moreno-Schneider, Florian Kintzel, Elena Montiel, Víctor Ro-
66                      Ian Roberts, Penny Labropoulou, Dimitris Galanis, Rémi Calizzano et al.

   dríguez Doncel, John P. McCrae, David Laqua, Irina Patricia Theile, Christian Dittmar, Kalina
   Bontcheva, Ian Roberts, Andrejs Vasiljevs, and Andis Lagzdiņš (2020). “Towards an Interop-
   erable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different
   Levels of Interoperability”. In: Proc. of the 1st Int. Workshop on Language Technology Plat-
   forms (IWLTP 2020, co-located with LREC 2020). Ed. by Georg Rehm, Kalina Bontcheva,
   Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasiljevs. Marseille, France, pp. 96–
   107. URL: https://www.aclweb.org/anthology/2020.iwltp-1.15.pdf.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Smith, Arfon M., Daniel S. Katz, and Kyle E. Niemeyer (2016). “Software citation principles”. In:
   PeerJ Computer Science 2. URL: https://peerj.com/articles/cs-86.
Tablan, Valentin, Ian Roberts, Hamish Cunningham, and Kalina Bontcheva (2013). “GATECloud-
   .net: A Platform for large-scale, Open-Source Text Processing on the Cloud”. In: Philosophical
   Transactions of the Royal Society A: Math., Phys. and Eng. Sciences 371.20120071.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
   Łukasz Kaiser, and Illia Polosukhin (2017). “Attention is all you need”. In: Proceedings of the
   31st International Conference on Neural Information Processing Systems, pp. 6000–6010.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Ax-
   ton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E.
   Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier
   Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alas-
   dair J.G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A.C ’t Hoen,
   Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert
   Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik,
   Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris
   A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waag-
   meester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons (2016). “The
   FAIR Guiding Principles for Scientific Data Management and Stewardship”. In: Scientific Data
   3. DOI: 10.1038/sdata.2016.18. URL: http://www.nature.com/articles/sdata201618.
Wu, Mingfang, Fotis Psomopoulos, Siri Jodha Khalsa, and Anita de Waard (2019). “Data Discovery
   Paradigms: User Requirements and Recommendations for Data Repositories”. In: Data Science
   Journal 18.1. URL: http://datascience.codata.org/articles/10.5334/dsj-2019-003/.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 4
Contributing to the European Language Grid
as a Provider

Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis, Leon
Voukoutis, Katerina Gkirtzou, Rémi Calizzano, Athanasia Kolovou, Dimitris
Gkoumas, and Stelios Piperidis



Abstract The ELG platform enables producers of language resources and language
technology tools and services to upload, describe, share, and distribute their services
and products as well as to describe their companies, academic organisations and
projects. This chapter presents the functionalities offered through web-based user
interfaces for describing LT resources or related entities with metadata and for man-
aging their publication. It gives a detailed description of the options that providers of
LT tools can exploit to integrate them into ELG as ready-to-deploy services and the
tools that ELG offers in their support during the preparation, upload and integration
phases. The tools and packaging recommendations for resources to be uploaded in
ELG are also presented. The chapter concludes with a discussion of functionalities
offered to providers by ELG and other related platforms.



1 Introduction

The European Language Grid platform (Rehm et al. 2021) offers various functional-
ities for providers of Language Resources and Technologies (LRTs) through which
they can share their assets with the Language Technology (LT) community and inter-
ested clients, customers or users of these technologies. The minimum requirement
is that they make them accessible (by uploading them to ELG or through another
website) and describe them with a metadata record that complies with the ELG spec-
ifications (see Chapter 2), where they specify the access location and licensing con-

Dimitris Galanis · Penny Labropoulou · Miltos Deligiannis · Leon Voukoutis · Katerina Gkirtzou ·
Athanasia Kolovou · Dimitris Gkoumas · Stelios Piperidis
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
galanisd@athenarc.gr, penny@athenarc.gr, mdel@athenarc.gr, leon.voukoutis@athenarc.gr,
katerina.gkirtzou@athenarc.gr, akolovou@athenarc.gr, dgkoumas@athenarc.gr, spip@athenarc.gr
Ian Roberts
University of Sheffield, UK, i.roberts@sheffield.ac.uk
Rémi Calizzano
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany,
remi.calizzano@dfki.de


© The Author(s) 2023                                                                         67
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_4
68                 Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

ditions under which they can be used. To take advantage of the advanced features
of ELG, providers can also integrate LT tools as ready-to-deploy services, following
the ELG specifications, or upload the resource itself, in which case it will be stored
and preserved according to the Data Management Plan (see Chapter 8) and made
readily available to LRT consumers. Furthermore, descriptions of organisations that
are active in the LT area can be added in order to promote their activities and prod-
ucts. Descriptions of projects that have been funded in the broader LT area can also
be included in the ELG catalogue. LRTs, organisations that have provided or created
them and projects that have contributed to their funding are linked together.
    Detailed documentation is provided and a suite of helper tools have been devel-
oped aiming to make the contribution and integration of all entities briefly sketched
above as simple as possible, taking into account the technical expertise and prefer-
ences of users. In ELG, the provision and management of catalogue entries is sup-
ported through web user interfaces (UIs) and REST application programming inter-
faces (APIs). Section 2 describes the steps a provider must take to contribute entries
to the catalogue, and the tools provided by ELG to support this process. The ELG cat-
alogue intends to be a reliable source for resources that can be accessed and (re-)used
by commercial and non-commercial, research and public organisations as well as in-
dividuals. For this purpose, management and curation policies and processes for the
metadata, data and services included in ELG have been set up, albeit with variations
depending on the source and type of contribution. Only authorised and authenticated
individuals can add LRTs in ELG; the registration and assignment of the “provider”
user role is a simple process for all interested users (see Chapter 3). In addition, all
entries go through a formal publication life cycle (see Chapter 2). Before being pub-
lished in the catalogue, added metadata records are validated by the ELG core team
(Section 3). Section 4 looks into the requirements for the different types of resources
and entities in ELG, either integrated in ELG or available remotely and added to ELG
as metadata records only. Further technical specifications are set for LT services that
are intended to be deployed through the ELG cloud infrastructure, and for data re-
sources hosted in ELG. Before being published in ELG, these resources go through
a process that aims to ensure their technical validity and, for services, to set up the
required environment for their deployment. Section 5 presents similar platforms and
infrastructures and discusses the approach and tools they offer for providers of LRTs,
in analogy to the comparison made for the platform functionalities from the point of
view of consumers in Chapter 3.



2 Adding Resources to the ELG Platform

LRT providers come from a variety of backgrounds, some within Language Technol-
ogy fields such as NLP or Computational Linguistics, and others from neighbouring
fields such as Digital Humanities. Different providers have different levels of techni-
cal knowledge and familiarity with formal metadata descriptions, so ELG attempts
to offer an integrated environment suitable for both expert and non-expert users. The
4 Contributing to the European Language Grid as a Provider                           69

functions exposed for registering and managing catalogue entries and their accompa-
nying data files are designed to be user-friendly while still offering advanced features
to users with the relevant skills.
    All metadata records must comply with the ELG metadata schema (Labropoulou
et al. 2020). The schema offers a rich set of metadata elements for each type of LRT or
entity (organisation, project) to be added. Individual elements are either mandatory,
recommended or optional, depending on the record type. Providers can add entries
with only the mandatory elements, although they are also encouraged to add the
recommended ones. See Chapter 2 for more details.



2.1 Creating Metadata Records

Providers can add records in one of two ways: either by creating and uploading XML
files compliant to the ELG schema (Section 2.1.1), or by using the interactive editor
offered by ELG (Section 2.1.2). In practice many users will adopt a combination
of the two approaches, for example, a provider who wishes to submit many similar
records (such as MT services based on the same underlying engine but with models
for different language pairs) may create their first record using the editor, export it
as XML, and use this file as a template to generate the remaining records.


2.1.1 Creation and Upload of Metadata Files

This first option is probably more appealing to expert and technical users, especially
those that wish to register multiple related records or produce frequently updated
versions of LRTs registered in ELG. To facilitate the process of adding records, pre-
filled metadata templates and examples (with the mandatory and recommended ele-
ments) are available in the ELG GitLab repository1 . As mentioned above, any exist-
ing metadata record can be exported from ELG as XML to be used as a template.
    A REST endpoint for metadata validation of single files or zipped archives of
XML files is publicly available and offered for providers that want to validate their
metadata files and ensure they comply with the ELG schema before uploading them
to the platform.2 The XSD validator checks that all mandatory elements are filled in
and that filled-in values are consistent with the data type declared for the elements –
for example, if elements take values from controlled vocabularies or should follow
a specific pattern – and returns the results in JSON form.
    Users can upload their metadata records through the provider’s grid (see Sec-
tion 2.3) as single files or in batch mode. The import step includes additional vali-
dation rules, which check the syntactic and, to a certain extent, semantic integrity of
the record. For example, checks are performed for metadata elements that depend

1   https://gitlab.com/european-language-grid/platform/ELG-SHARE-schema
2   https://live.european-language-grid.eu/catalogue/#/validate-xml
70                                         Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

on the presence or value of other elements (e. g., the element “multilinguality type”
which is mandatory for bilingual and multilingual resources), or for duplicate values
(e. g., the same “language” value used twice). Validation errors are reported to the
user for correction. If the file is valid, it is imported to the platform and the provider
can perform further edits with the editor or submit it for publication in accordance
with the publication life cycle (see Chapter 2).


2.1.2 Metadata Editor

The editor can be accessed through the provider’s grid (see Section 2.3). It supports
users in creating new metadata records, as well as editing and updating existing
ones. The editor includes the mandatory and recommended fields of the ELG schema.
Chapter 2 provides a summary of all mandatory metadata elements.
    The editor has been designed with non-expert users in mind, and intends to hide
the richness of the ELG schema. For this reason, we offer a full-fledged UI with
metadata elements grouped into semantically coherent sets and layered along hor-
izontal and vertical tabs, following the ELG conceptual structure. Different editor
forms with the same look and feel have been implemented for each resource or en-
tity type. Figure 1 shows the editor for tools/services; the horizontal tabs correspond
to the main classes of the schema – in this case, LRT, tool/service and distribution –
and the vertical tabs to categories of elements within that main section. The figure
shows the LRT horizontal tab, whose options include “identity” (identification meta-
data such as the resource name, long description, and name of the creator responsible
for the record), “categories” (classification elements such as keywords and subject                                                                               My grid                  Ian R Roberts


domain), and “documentation” (links to publications, user manuals, or other docu-
            RELEASE 2
                                                                                                                                                                       Catalogue   Documentation & Media       About



ments describing the resource).
     My grid             My items                               My validations                    Feedback                   Administration                                                 Go to catalogue
CREATE A NEW SERVICE OR TOOL



                                                                                                                                    Work in progress
 LANGUAGE
                                       TOOL/SERVICE             DISTRIBUTION               DATA                                     ELG-compatible                                     Save draft             Save
 RESOURCE/TECHNOLOGY
                                                                                                                                    service

                          LRT name *                                                                                                          language

     IDENTITY             Example service                                                                                                     English

                          The official name or title of the language resource/technology                                                      select language




     CATEGORIES
                         LRT identifier
                                                                                                                                                                                                    Fill in
                         A string used to uniquely identify the language resource/technology

                          LRT short name                                                                                                      language

                          Example                                                                                                             English
     CONTACT
                          An abbreviation, acronym, etc. used for the language resource/technology                                            select language



                        Description
                                                                                                                                                           language
     DOCUMENTATION        Paragraph
                                                                                                                                                           English

                         This is an example of a metadata record for an ELG-compatible service.                                                            select language




     RELATED LRΤS




                          Version

                          1.0.0

                          Recommended format: major_version.minor_version.patch (see semantic versioning guidelines at http://semver.org)
Fig. 1 ELG metadata editor
                          Version date

                          The date of the LRT version (latest update of the particular version if possible)


                        LRT provider
                        The actor responsible for providing, curating, maintaining and making available (publishing) the language resource/technology


                          Actor type

                          Select type and describe



                         Source of metadata record
                                                                                                                                                                                                    Fill in
                         The entity (repository, catalogue, archive, etc.) from which the metadata record has been imported into the new catalogue
4 Contributing to the European Language Grid as a Provider                                       71

    The editor guides the user to fill in at least all of the mandatory elements with
appropriate values. Help tips and examples are available for metadata elements, and
different editing controls are used for elements depending on their data type. For
instance, the elements of controlled vocabularies are shown using dropdown lists.
For vocabularies with many values (e. g., languages, service functions, etc.), we use
a combination of dropdown lists with suggested values as the user types in the text.
    The combination of dropdown lists and dynamically suggesting values is also
applied to improve normalisation. For example, some elements such as keywords
allow free text entry, however as the user types, a popup suggests matching values
that have previously been used for the same element in other records, “nudging” the
user to choose identical values instead of slight variations. The same lookup mecha-
nism, of suggesting values from those already imported in the catalogue, is used for
reducing the chance for duplicates of related entities such as agents, projects, docu-
ments, licences, and other resources.3 For such entities, the ELG schema requires a
set of minimal information, a name/title, and, optionally, an identifier and metadata
elements that could uniquely distinguish it from similar entities (e. g., email for per-
sons, website for organisations, a URL with the text for licences, etc.). Thus, when
adding related entities through the editor, users type in a name/title, and are shown
matching entries (if any) to select from; if not, they are prompted to fill in the re-
quired elements mentioned above. The same set of metadata elements is also used
at the import of metadata records to uniquely identify the related entities.
    Through the editor, providers have the option of saving incomplete metadata
records (“draft”), for which only the data type of the metadata elements is validated
(e. g., that they have entered a valid URL). When they decide to properly save the
metadata record, we validate the entry using the yup library4 , implementing at least
the same rules used at the import of metadata files. In case of errors, messages de-
scribe the error and location where it occurred (see Figure 2); clicking on the error,
users are forwarded to its location.



2.2 Uploading and Managing Data Files

Data files, i. e., the physical files that contain the contents of a resource, must be
uploaded as a ZIP file. Section 4.2.2 presents recommendations for the packaging of
data resources, especially for those that can be split into subsets.
   Providers can upload data files as a first step when they upload an XML file5 , or
during the editing process with the editor. The editor includes a tab entitled “Data”
(Figure 3) through which users can manage the files (upload, replace and delete).

3 This is a well-known issue across catalogues; the adoption of unique persistent identifiers is rec-
ommended to resolve it, but not all entities are assigned such a unique identifier or it may not be
known to the provider that submits the metadata record.
4 https://github.com/jquense/yup

5 At the time of writing, the upload of data files during the batch import of XML metadata records

is not supported.
                                                                                                                                                                                     My grid                    Ian R Roberts

                                                                                                                                                                                           Catalogue    Documentation & Media         About

                RELEASE 2



72                                                  Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.
     My grid             My items                                            My validations                Feedback                          Administration                                                      Go to catalogue
CREATE A NEW SERVICE OR TOOL

      Correct the following errors in order to proceed

          1.   Language Resource/Technology > Identity > Description is required


          2.   Language Resource/Technology > Identity > Description language is required




                                                                                                                                                  Work in progress
 LANGUAGE                                                                                                                                                                                                                           Save
                                               TOOL/SERVICE                  DISTRIBUTION          DATA                                           ELG-compatible
 RESOURCE/TECHNOLOGY
                                                                                                                                                  service

                                  LRT name *                                                                                                                  language

      IDENTITY                    Example service                                                                                                             English

                                  The official name or title of the language resource/technology                                                              select language




      CATEGORIES                 LRT identifier
                                                                                                                                                                                                                          Fill in
                                 A string used to uniquely identify the language resource/technology

                                  LRT short name                                                                                                              language

                                  Example                                                                                                                     English
      CONTACT
                                  An abbreviation, acronym, etc. used for the language resource/technology                                                    select language




                                  Description
      DOCUMENTATION                                                                                                                                                             language
                                     Paragraph
                                                                                                                                                                                English

                                                                                                                                                                                select language


      RELATED LRΤS




Fig. 2 ELG metadata editor with error messages

                                  Version

                                  1.0.0


   A resource may be available in a range of distributable forms (“distributions”), for
                                  Recommended format: major_version.minor_version.patch (see semantic versioning guidelines at http://semver.org)




example, in different file formats (e. g., as PDF, XML or TXT files). ELG supports
                                  Version date
the upload of multiple data files for the same resource. For this reason, when users
                                  The date of the LRT version (latest update of the particular version if possible)


upload more than one package of data files, they are prompted to associate each
                               LRT provider

package with the respective distribution (i. e., the one that includes the metadata that
                               The actor responsible for providing, curating, maintaining and making available (publishing) the language resource/technology



describe the size and format of the particular set of files). This action is performed
                                  Actor type

                                  Select type and describe

by selecting the specific package on the “distribution” tab.
                                 Source of metadata record
                                                                                                                                                                                                                          Fill in
                                 The entity (repository, catalogue, archive, etc.) from which the metadata record has been imported into the new catalogue



                               LRT creator
                               The actor who created the language resource/technology

2.3 Managing Catalogue Entries
                                  Actor type

                                  Select type and describe



The ELG platform presents users that have the “provider” role set with a “grid” (dash-
                                  Publication date
                                                                                                                                                                                     My grid                    Ian R Roberts

board), through which they can access and manage the catalogue items they have
                                  The date when the language resource/technology became available to the public
                                                                                                                                                                                           Catalogue    Documentation & Media         About


created, as well as create new items (Figure 4). Since every provider is by definition
                RELEASE 2      Funding project
                               The project that funded the creation, enrichment, extension, etc. of the language resource/technology

     My grid                         My items                                My validations               Feedback                           Administration                                                      Go to catalogue
CREATE A NEW CORPUS               Project name

                                  The official title of the project


 LANGUAGE
                                               CORPUS                 PART           DISTRIBUTION           DATA                                 Work in progress                                           Save draft              Save
 RESOURCE/TECHNOLOGY
                                  Logo                                                                                                                                                                 Browse

                                  Upload an image file (e.g., JPG, PNG) or add the full URL for the LRT logo
                                                                                                                               Upload data
 DATA

                                                   Name                                            Upload date                                          Assigned to distribution                                            Actions

                                                                                                                                                                                                                                    Save
                                              AM-News.zip                                        13 April 2022                                                      No

                            *In order to delete a dataset you should first unlink it from the corresponding distribution and save your record.




     Technologies     Resources         Community           Events           Documentation    About ELG                                                                                                                  Contact us

                                                                                                                                                                                                            Save draft              Save
                The European Language Grid has received funding from the European Union’s Horizon 2020 research
                and innovation programme under grant agreement № 825627 (ELG)


       © 2022 ELG Consortium Terms of Use


Fig. 3 ELG metadata editor – “data” tab for uploading data files
     Technologies     Resources        Community            Events           Documentation    About ELG                                                                                                                  Contact us


                The European Language Grid has received funding from the European Union’s Horizon 2020 research
                and innovation programme under grant agreement № 825627 (ELG)


       © 2022 ELG Consortium Terms of Use
4 Contributing to the European Language Grid as a Provider                                    73




Fig. 4 Provider’s grid (see Figure 6 in Chapter 3, p. 48, for the Consumer’s grid)



also a consumer, the provider’s dashboard is an extension of the consumer’s dash-
board shown in Chapter 3, adding a counter of the number of records this user has
created and links to the editor, XML upload, and XML validator tools.
    Users can manage the metadata records they have created through a dedicated
page (“My items”, Figure 5), and, in accordance with their user rights and the pub-
lication status of the record, perform the following actions: edit a metadata record,
submit it for publication, create a new version of a published record, copy a metadata
record (in order to use it as a model and create a similar record), delete a metadata
record that has not yet been published, and request the unpublication of one of their
records.6 The “My items” page is a focused version of the catalogue, this time fil-
tering records according to each user’s role. This page also implements browse and
search functionalities like the main catalogue page.




6 Records cannot be completely deleted after publication except in exceptional circumstances, and
then only by request to the ELG administrators.
74                      Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.




Fig. 5 “My items” page



3 Validating and Publishing Metadata Records

Metadata records added by individuals7 enter a validation process, as specified in the
ELG publication life cycle (see Chapter 2), before they are published in the catalogue:
we perform technical/metadata and legal validation for ELG-compatible services and
resources with uploaded data files, and validation at the metadata level only for all
other metadata records. ELG-compatible services also go through a set of actions
required for the registration of the service in the ELG platform (see Section 4.1.8).
   Validators have access to the metadata records that have been assigned to them
through the “validator’s grid”, and more specifically the “My validations” page (Fig-
ure 6). The validation form includes fields in which the validator can add internal
comments (visible only to the other validators), and in the case of rejected records,
a field for noting the reasons and suggested changes that are communicated to the
provider for corrections. Providers can go through the changes and resubmit the
record, which initiates a new round of validation, until final approval. When the
metadata record has been approved by the responsible validator or validators, it is
automatically made visible in the public catalogue.



4 Entity-Type Specific Requirements

There are several technical requirements that need to be met for LT services (Sec-
tion 4.1) or resources (Section 4.2) to be deployed through or hosted in ELG success-
fully. We also present the requirements for metadata-only resources (Section 4.3).

7   For harvesting and batch import functionalities from other catalogues, see Chapter 6.
4 Contributing to the European Language Grid as a Provider                                                                                               75

 MY ITEMS          VALIDATION TASKS             MY USAGE             MY DOWNLOADS




         Search for services, tools, datasets, organizations...                                                                      Search




                                                                                                                                     15 search results

  Curator
                                                       Resource name                                                                     Status
    test-curator                 (7)
    elg-system                   (6)
    test-admin                   (1)
    test-metadata-validator      (1)                   Select All

  Items

    Corpus                       (6)                  ToolVal
                                                                                                                                             submitted
                                                      1.0.0                                       legal validator
    Model                        (3)
                                                      Tool/Service                                test-legal-validator                 legally valid
    Organization                 (3)
                                                      submitted: 22 July 2021                     metadata validator                   yes
    Lexical/Conceptual
                                 (1)
    resource                                           has data                                   test-metadata-validator              metadata valid
    Project                      (1)                                                              technical validator                  not validated

    Tool/Service                 (1)                                                              test-technical-validator             technically
                                                                                                                                       valid
                                                                                                  curator
  Status                                                                                                                               yes
                                                                                                  test-admin
    submitted                    (9)
    published                    (6)
                                                                                                                                             submitted

  Has data
                                                      KP_Division                                                                      legally valid
    no                           (12)                                                                                                  yes
                                                      Organization                                metadata validator
    yes                          (3)
                                                      submitted: 21 July 2021                     test-metadata-validator              metadata valid
                                                                                                  curator                              not validated
  Resubmitted
                                 (14)                                                             test-curator                         technically
    false
                                                                                                                                       valid
    true                         (1)
                                                                                                                                       yes

  Τechnically valid

    yes                          (15)                 new test data corpus
                                                      1.0.0 (automatically assigned)              legal validator

  Metadata valid                                      Corpus                                      test-legal-validator                       published
                                 (9)                  submitted: 16 July 2021                     metadata validator
    not validated
                                 (6)                  legal validation date: 16 July 2021                                              legally valid
    yes                                                                                           test-metadata-validator
                                                      metadata validation date: 16 July 2021                                           yes
                                                                                                  technical validator
  Legally valid                                                                                                                        metadata valid
                                                       has data
                                                                                                  test-metadata-validator
    yes                          (15)                                                                                                  yes
                                                                                                  curator
                                                                                                                                       technically
                                                                                                  test-curator
                                                                                                                                       valid
                                                       Review comments                                                                 yes

                                                       [16/07/2021] Validation review: legal rejection after metadata/tech approve
                                                       [16/07/2021] Validation review: reject metadata approve technical




Fig. 6 “My validations” page



4.1 ELG-compatible Services

A service is ELG-compatible if it is packaged in a Docker image and follows the
ELG LT internal API, i. e., the service consumes and produces messages in the ELG-
specified format, as defined in Section 4.1.1 below. When a provider adds a tool or
service to ELG either using XML metadata upload or through the metadata editor,
they are asked if the service will actually be integrated in ELG, so that conformance
to our specifications can be monitored.
76                 Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

4.1.1 Internal LT API Specification

The ELG internal LT API is closely related to the public API described in Chapter 3.
The public API is a simplified derivative of the internal API. While both the internal
and public APIs make use of the same JSON messages for input and output, the
internal API is designed strictly around a single HTTP request-response transaction
for each processing task, rather than the multi-step asynchronous mode supported
by the public API.
   For the internal API, services that accept text receive their requests as JSON,
while services that process binary audio or image data receive a MIME “multipart/-
form-data” request with the metadata in JSON and the binary data as the relevant
audio or image MIME type. The endpoint must return the appropriate JSON response
message depending on its function (standoff “annotations”, classifications, audio, or
new “texts” – which could be a single text, a series of sentences, a list of alternative
translations, etc.). Examples include:
  • Information extraction (IE) services for text accept a “text” request and return an
    “annotations” response; i. e., annotations whose position is described in terms of
    zero-based character offsets. Such services include tokenisers, sentence splitters,
    sentiment analysers, named entity recognisers, dependency parsers, etc.
  • Text classification services accept a “text” request and return a “classification”
    response with the classes that have been assigned to the whole input text by the
    service. Examples are language identifiers, text-level sentiment classifiers etc.
  • Machine translation services receive a “text” request and generate a new text or
    list of alternatives returned in a “texts” message. Services such as summarisation
    would use a similar format.
  • Information extraction services from speech take “audio” requests and return
    the same standoff annotations as IE-from-text, but in this case the annotations
    are time segments in the audio stream, e. g., keyword spotting for audio files.
  • Speech recognition services take “audio” requests and return a text transcription
    or a choice of n-best transcriptions, encoded as a “texts” message.
  • Text-to-speech services take “text” messages and return “audio” messages, which
    can either include the returned audio inline as base64-encoded data, or as a URL
    reference to audio which has been uploaded to the temporary storage helper ser-
    vice (see Section 4.1.2).
  • Optical character recognition services take “image” requests and return the ex-
    tracted text as a “texts” response.
  • Image classification services take “image” requests and return “classification”
    responses.
   The formats of the input and output messages are generic and can be easily reused
for integrating new types or classes of services. For example, Speech-to-Text ser-
vices, such as a speech summariser that would consume an “audio” request and re-
turn a “texts” response in the same way as a pure speech recogniser, can easily be
added. Other examples can be found in Chapter 7.
4 Contributing to the European Language Grid as a Provider                                    77

    Detailed, up-to-date guidance on the process of integrating an LT service and
selecting the most appropriate integration option can be found in the ELG documen-
tation8 ; more information is provided in Section 4.1.3.
    As described in Chapter 3, error, warning and progress report messages are rep-
resented as structured objects with a message code, representing a message that can
be localised into many languages. The ELG team provides a set of standard message
codes for common messages, and maintains their translations, but service providers
who use their own custom messages are welcome to contribute their own localisa-
tions for integration into the public message resolver by contacting the ELG team.
    Services that take a long time to process data have the option of returning a se-
ries of “progress” messages prior to generating the final response using the standard
HTTP “server-sent events” format.9


4.1.2 Helper Services

ELG provides certain helper services that can be called at fixed URLs by LT ser-
vice containers if they run within the platform. Notably, ELG provides a temporary
storage helper which LT services can use in order to return data that does not natu-
rally map on to the standard JSON-based response formats. This helper allows an
LT service to store arbitrary blobs of binary data on a short-term basis (for any time
from ten seconds up to 24 hours), and receive a randomly generated URL that can be
included in the response JSON, and which the caller can retrieve up until its expiry
time. Typical uses for this service include text-to-speech services that need to return
larger chunks of audio data, or services that visualise structures such as parse trees
in a binary image format. This is discussed further in the context of the Text2TCS
service in Chapter 7, Section 5.1, p. 144 ff.


4.1.3 Integration Requirements and Options

The requirements for integrating an LT tool or service into ELG are as follows.
Expose an ELG-compatible endpoint: The provider needs to make sure that the
  LT tool or service to be integrated into ELG exposes an HTTP endpoint, i. e.,
  either such an endpoint already exists or it needs to be implemented. The cor-
  responding endpoint application must consume HTTP requests that follow the
  ELG JSON format, call the included or underlying LT tool and produce responses
  again in the ELG JSON format as specified in the the ELG LT internal API (Sec-
  tion 4.1.1). Developers working in Python or Java, Groovy, Kotlin, or other JVM-
  based languages, can make use of helper libraries provided by the ELG team to
  handle much of the boilerplate code for creating the HTTP listener, parsing and


8   https://european-language-grid.readthedocs.io/en/stable/all/3_Contributing/Service.html
9   https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events
78                       Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.




                              LT Service Execution Server




                                        Adapter
                   LT tool
                                       container          Proxy                  LT tool
                 standalone
                                                        container              (own API)
                  container

                Pod                                    Pod
                                        LT tool
                                      (own API)                          Provider's server/cloud
                                     Pod


           ELG Kubernetes cluster



Fig. 7 Integration options



     producing the JSON messages, etc., so that the provider can concentrate on their
     own business logic (see Sections 4.1.6 and 4.1.5 for more details).
Provide the application in the form of a Docker image: The whole application
  must be packaged as a container image using Docker or similar tools, and up-
  loaded to a Docker registry, such as GitLab10 , DockerHub11 or Azure Container
  Registry12 . More than one image might be needed for one service, depending on
  how the service is made available. From the three options described in Fig. 7,
  providers can pick the one that best fits their needs.
     •   LT tool packaged in one standalone image: One image is created that contains
         the application that exposes the ELG-compatible endpoint and the actual LT
         tool. This is the most common approach when wrapping tools that are callable
         as libraries from custom code, such as Python machine learning models.
     •   LT tool running remotely outside the ELG infrastructure: In this case, one
         proxy image is created that exposes one (or more) ELG-compatible endpoints;
         the proxy container communicates with the actual LT service that runs outside
         the ELG infrastructure.
     •   LT tool requiring an adapter: This is a compromise between the standalone
         and remote approaches. A tool that is available as a Docker image but whose
         API is not natively ELG-compatible can be run alongside a separate ELG-
         compatible adapter image as a single pod in the ELG infrastructure. The
         adapter receives ELG API requests, communicates with the tool’s native API
         in the pod, and translates the responses back to ELG format.


10 https://gitlab.com
11 https://hub.docker.com
12 https://azure.microsoft.com/en-us/services/container-registry/
     4 Contributing to the European Language Grid as a Provider                            79

 1   # Base image .
 2   FROM openjdk :8-jdk - alpine
 3

 4   # SET TARGET DIRECTORY
 5   ENV TARGETDIR /elg/
 6   # This is required for wait.sh
 7   RUN apk update && apk add bash
 8

 9   # Install tini and create unprivileged user
10   RUN apk add --no -cache tini && \
11         addgroup --gid 1001 "elg" && \
12         adduser --disabled - password --gecos "ELG User ,,," \
13         --home /elg --ingroup elg --no -create -home --uid 1001 elg
14

15   # Create target directory
16   RUN install -d -o elg -g elg $TARGETDIR
17   # Copy everything to target directory
18   COPY --chown=elg:elg dockerCmd ${ TARGETDIR } dockerCmd
19   # Copy/ Rename server app jar.
20   ADD --chown=elg:elg /elg -ilsp -lt -services -rest -simple -0.0.1 -
         SNAPSHOT -exec.jar ${ TARGETDIR } dockerCmd /app.jar
21

22   # Set working directory
23   USER elg:elg
24   WORKDIR ${ TARGETDIR } dockerCmd
25

26   # Make sure script can be executed
27   RUN chmod +rx ./ wait.sh
28

29   # The command that is run when the container starts
30   ENTRYPOINT ["sh", " runInContainer .sh"]
     Listing 1 Example of a dockerfile for an integrated ELG LT service




     4.1.4 Creation of Docker Images

     The Docker image of an application contains the code of the tool and all dependen-
     cies required to run it, e. g., the operating system, frameworks, settings, configuration
     files and libraries etc. Containers are instantiations of images and can be thought of
     as lightweight virtual machines.
         The process of packaging a service as a Docker image involves creating a dock-
     erfile that describes the build process, running that build, and pushing, i. e., copying
     the resulting image to a Docker registry that is accessible to the ELG infrastructure.
     An example dockerfile is shown in Listing 1. The most important parts are:
       • Line 2 states that an image containing a lightweight Linux-based operating sys-
         tem that includes Java programming language will be used as the base.
       • Line 20 adds the Java-based application (.jar file) that exposes an ELG-compliant
         LT service to the image (see Section 4.1.5 for more details).
    80                     Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

1   # Login to Gitlab container registry
2   $ docker login registry . gitlab .com
3

4   # Build the image and tag it with the name registry . gitlab .com/
        ilsp -nlpli -elg/elg -ilsp -lt - services and a version number
5   $ docker build -t registry . gitlab .com/ilsp -nlpli -elg/elg -ilsp -lt -
        services :1.0.0 .
6

7   # Push the image to the container registry
8   $ docker push registry . gitlab .com/ilsp -nlpli -elg/elg -ilsp -lt -
        services :1.0.0
    Listing 2 Example sequence of commands to build and push a Docker image to a registry




         • Line 30 specifies the script (.sh) that is run when a container is created from this
           image; this script starts the Java application.
        A simple and robust way to build and store the image of a service in a registry
    is to put the service code into a source code repository such as GitHub13 or GitLab,
    and then to use the repository’s continuous integration (CI) mechanism. There are
    various examples of services built like this, i. e., using GitLab CI, in the ELG GitLab
    space.14 Gitlab CI is triggered immediately after a commit on the repository or on
    demand and runs the build process specified in .gitlab-ci.yml.
        An image can also be built and stored by running a set of commands locally. This
    option is helpful because CI services are often restricted, e. g., Gitlab has monthly
    quotas. In this case, users must first download the source code to a local folder (in-
    cluding the dockerfile), and then run a sequence of commands similar to Listing 2.
        Some languages and build systems provide alternatives for building Docker im-
    ages that do not require developers to write their own dockerfile, or to use Docker
    at all. For example, Java services based on the Micronaut15 helper described below
    can use the Micronaut built-in dockerPush or dockerPushNative gradle tasks to
    build and push an image in one step using an automatically generated dockerfile, or
    Google Jib16 , which is designed specifically around the needs of Java applications
    and produces intelligently layered images that make more efficient use of space in
    the container registry. Additional files such as models can also be included.
        To be deployed in ELG, a Docker image must meet the following requirements:
         • It must be built for the amd64 architecture (also known as x86_64); multi-
           architecture images may be appreciated by users who want to run the service
           on their own hardware, but ELG itself runs on amd64.
         • It must be compatible with the Broadwell micro-architecture, which supports
           SSE4.2, AVX and AVX2 but not AVX512 instructions.
    13   https://github.com
    14   https://gitlab.com/european-language-grid
    15   https://micronaut.io
    16   https://github.com/GoogleContainerTools/jib
4 Contributing to the European Language Grid as a Provider                             81

     • The container must run in at most 6GB of RAM, but the smaller its foot-
       print the better. By default, containers are limited to 512MB RAM; if the con-
       tainer requires more memory, this must be specified in the metadata record (us-
       ing additionalHWRequirements). Services requiring more than 6GB are ap-
       proved only in exceptional cases.
     • It must be tagged with an explicit version number such as :1.0.0, not the im-
       plicit :latest tag which typically changes over time.
     • The network socket on which the container listens for HTTP requests must bind
       to all the container’s IP addresses (typically by using 0.0.0.0). Some HTTP
       libraries only listen on the local loopback 127.0.0.1 by default, which will not
       be sufficient in ELG.
     • Ideally the container should run without needing outgoing network connections
       to locations outside the hosting cluster. In particular, any model files must be
       cached within the image at build time, not downloaded at runtime from a repos-
       itory such as Hugging Face. If outgoing network access is required, the target
       IP address ranges must be specified.
    It is recommended for the service to only start listening once it is fully initialised
and ready to start handling requests. If this is not possible (e. g., if the code re-
quires some asynchronous initialisation process and the library used opens its sock-
ets before that process is complete), then a separate “readiness” endpoint should
also be provided at a separate URL path from the main service endpoint (typically
/elg-ready) that returns the response code 503 (“service unavailable”) if the ser-
vice is not yet initialised, and 200 or 204 once it is ready to handle requests.
    Sections 4.1.5 and 4.1.6 present Java- and Python-based libraries for easily cre-
ating an application that offers an ELG-compatible service. Some of these include
utilities for creating the Docker image in which the service will be packaged.


4.1.5 Helper Libraries for Java

For LT service developers working in Java or other Java Virtual Machine (JVM) lan-
guages such as Groovy17 or Kotlin18 , ELG provides helper libraries for two popular
frameworks, Spring Boot19 and Micronaut20 . The programming style is similar in
both cases, though Micronaut is better optimised towards creating smaller, lighter
images with faster startup times, so if the service implementation does not already
have a dependency on Spring, Micronaut is the recommended option. Both libraries
depend on a common bindings library21 of Java model classes that represent the
various JSON message structures in a more Java-native way.


17 https://groovy-lang.org
18 https://kotlinlang.org
19 https://spring.io/projects/spring-boot

20 https://micronaut.io

21 https://javadoc.io/doc/eu.european-language-grid/elg-java-bindings
82                      Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

     An ELG-compatible LT service can be built in three steps22 using Micronaut:
 1. Create a blank Micronaut application using the Micronaut Launch tool.23
 2. Add the ELG helper as a dependency, which is published to the central repository
    – for Gradle this means
       implementation ("eu.european -language -grid:lt -service -
           micronaut :1.0.0 ")

 3. Create a controller that extends LTService (for services that process text-based
    requests) or BinaryLTService (for services that process requests with binary
    content) and implement the relevant handle or handleSync method.
     The process24 is similar for Spring Boot:
 1. Create a blank Spring Boot application using the “Spring Initializr”25 – addi-
    tional dependencies are not needed, unless the specific code requires them.
 2. Add the ELG helper as a dependency, which is published to the central repository
    – for Gradle this means
       implementation ("eu.european -language -grid:elg -spring -boot -
           starter :1.0.0 ")

 3. Create one or more beans annotated @ElgHandler, with one or more public
    methods annotated @ElgMessageHandler. Each method should take an ELG
    request type such as TextRequest as a parameter (and for binary requests a
    second parameter of type Flux<DataBuffer> for the actual data) and return
    an ELG response type such as AnnotationsResponse or a reactive streams
    Publisher producing that type.
   In both cases, Micronaut and Spring Boot, developers must add their code in the
appropriate places to call the actual LT tool and build a response based on the tool’s
results, using the model classes, e. g., an AnnotationsResponse object in the case
that the results are standoff annotations. Once the objects are created, the frameworks
and libraries are able to automatically serialise them into ELG-compliant JSON re-
sponse messages. Similarly, the frameworks automatically translate the received in-
put JSON messages to objects that can be easily handled by the developer, e. g., in
the Spring Boot case a “text” JSON request is deserialised to a TextRequest object.


4.1.6 Helper Tools for Python

Similar to Java, the ELG team provides helper tools to create an ELG-compatible
service from a Python-based LT service. The helper tools are included in the ELG
Pypi package presented in Chapter 3. The package provides two Python classes that
22   https://gitlab.com/european-language-grid/platform/lt-service-micronaut
23   https://micronaut.io/launch
24   https://gitlab.com/european-language-grid/platform/elg-spring-boot-starter
25   https://start.spring.io
     4 Contributing to the European Language Grid as a Provider                                83

 1   from elg import FlaskService
 2   from elg.model import TextRequest , AnnotationsResponse
 3   import langdetect
 4

 5   class ELGService ( FlaskService ):
 6       def process_text (self , request : TextRequest ):
 7           langs = langdetect . detect_langs ( request . content )
 8           ld = {}
 9           for l in langs:
10               ld[l.lang] = l.prob
11           return AnnotationsResponse ( features =ld)
12

13   service = ELGService (" LangDetection ")
14   app = service .app
     Listing 3 Example ELG service created using the FlaskService class of the ELG Python package




     can be extended to create a simple HTTP server that exposes an ELG-compatible
     endpoint of the LT tool. The ELG Python package also comes with a command-line
     interface (CLI) that helps with the creation of the Docker image.
        For the ELG-compatible endpoint, the developer creates a Python class extending
     either FlaskService or QuartService as a base class, and must implement one of
     the four following handler methods: process_text, process_structured_text,
     process_audio or process_image, depending on the required input type for the
     LT service. This method will contain the code of the LT tool, it takes as input an ELG
     request object of the relevant type and should return a valid ELG response object. As
     a simple example, Listing 3 shows an LT tool that detects the language of the input
     text. The ELGService class inherits from the FlaskService class, which already
     contains all the code needed to create the server. This allows the developer to focus
     on the LT tool by only having to define the handler method. The FlaskService
     and QuartService classes work the same way; the first is based on Flask26 , which
     is more suited to CPU-bound synchronous code, the second uses the asyncio-based
     Quart framework27 , which is better for I/O bound code – QuartService is the only
     supported option if the handler method uses async/await28 . Both base classes sup-
     port the progress reporting mechanism and correctly handle exceptions raised by the
     tool, mapping them to ELG-compliant failure responses.
        After having defined the HTTP server compatible with the ELG LT internal API
     using the FlaskService or QuartService class, the next step is to create the
     Docker image. The ELG CLI that comes with the Python package contains the elg
     docker create command to help during this step. The command automatically
     generates the dockerfile based on the arguments. Listing 4 shows an example for the
     language detection service presented in Listing 3. All the available options of the
     26 https://flask.palletsprojects.com/en/2.0.x/
     27 https://pgjones.gitlab.io/quart/
     28 https://www.european-language-grid.eu/2021/10/04/choose-the-right-tool-to-create-your-elg-

     service-in-python/
84                   Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

elg docker create -n ELGService -p elg_service .py -r langdetect
Listing 4 CLI command to generate the dockerfile automatically




command are accessible with elg docker create --help. Once the dockerfile
is generated, the creation and the publication of the Docker image follows the same
process as described in Section 4.1.4.
    The ELG documentation includes a complete tutorial on how to create an ELG-
compatible service using the Python package.29 With these helper tools, we seek to
facilitate as much as possible the creation of an ELG-compatible service from an
LT tool implemented in Python. Using the Python helper ensures that the resulting
service follows best practice in terms of error handling, request parsing, etc. and
the construction of the dockerfile. This makes the services deployed in the ELG
infrastructure efficient and secure.


4.1.7 Metadata Requirements

In addition to the metadata requirements for tools and services (see Chapter 2), the
metadata records of ELG-compatible services must also include a set of technical
metadata that are necessary for their deployment in the platform:
     • dockerDownloadLocation: location of the image with the LT service;
     • serviceAdapterDownloadLocation: location of the adapter image (if any);
     • executionLocation: REST endpoint at which the LT tool is exposed within
       the Docker image (http://localhost:{port}{/path});
     • additionalHWRequirements: can be used to specify hardware requirements
       for this tool beyond the default limits of 512MB RAM and one CPU core;
     • We also recommend providing sample data on which the service produces sen-
       sible results. Sample data help speed up the validation process, and can be used
       through the trial UIs and the “Code samples” tab by consumers who want to
       test the service. Providers can upload a file with samples, add a URL where the
       samples are located, or simply add the data in a dedicated free text element.
   Figure 8 shows the mandatory elements replicating the editor (with sections hori-
zontally and tabs vertically); elements marked with an asterisk are mandatory, given
certain conditions, or required depending on the presence of another value or ele-
ment.




29 https://european-language-grid.readthedocs.io/en/stable/all/A1_PythonSDK/TutoServiceIntegr
ation.html
4 Contributing to the European Language Grid as a Provider                                85

      LANGUAGE RESOURCE /
                                      TOOL/SERVICE                  DISTRIBUTION               DA
          TECHNOLOGY

             IDENTITY                  CATEGORIES                    TECHNICAL                 DA

  •    Resource name            •         IDENTITY
                                    Function                 •   Software distribution form
  •    Description
             IDENTITY                                        •   Private
  •    Version                          TECHNICAL            •   Docker download location
                                                             •   Service adapter download
           CATEGORIES           •   Language dependent           location *
                                                                       IDENTITY
                                •   Input content resource   •   Execution location
  •        IDENTITY
       Keyword                        • Resource type        •   Additional h/w
                                      • IDENTITY
                                           Language *            requirements *
                                •   Output resource *        •   Licence
             CONTACT                  • Resource type
                                      • Language *
 •     Additional information
              IDENTITY
                                       EVALUATION
        DOCUMENTATION

          RELATED LRT'S

Fig. 8 Mandatory metadata for an ELG-compatible service



4.1.8 Technical Validation and Registration of ELG-Compatible Services

When LT providers have completed the packaging of their service, they can add it
to ELG by supplying a metadata record via either the XML upload or editor mech-
anisms described in Section 2.1, specifying that it is an “ELG-compatible service”
when prompted. Submitting the record initiates the validation process, which is per-
formed internally by the ELG team.
   The validation starts with the service registration process: The metadata or tech-
nical validator inspects the metadata record (accessed through the validator’s grid)
and deploys the service in the ELG Kubernetes cluster by creating the respective
entries in the Helm charts that control the cluster. After that, the validator registers
the service using a registration form (Figure 9), which specifies:
  • Kubernetes-specific endpoint to be used by the LT execution server when calling
    the service, derived from the executionLocation metadata element value.
  • ID of the trial UI to be used for rendering the processing results.
  • Type of service (e. g., Speech Recognition, Text-to-Speech, Text Classification,
    etc.), which determines the appearance of the “Code samples” tab.
  • Accessor ID that is used to form the public API endpoint URLs at which the
    service can be called. If the service was created as a new version of an existing
    service then it will share the same accessor ID as the service it replaces, but other
    than this, two distinct services must have different accessor IDs.
86                      Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.




Fig. 9 Registration form for ELG-compatible LT services



   When the registration is completed, the service is visible only to the validator and
the provider. The technical validator and the provider check that the service behaves
as expected using test input, and that the results it returns can be rendered adequately
by the assigned trial UI – this is where good sample data is particularly useful. When
required, the validator may communicate with the provider to recommend changes in
the technical implementation of the service or metadata. When the service is finally
running as it should the technical validator approves it; it will be published once it
also receives approval from the legal validator (see Chapter 2 for more information
on the ELG publication life cycle).


4.1.9 Custom Try Out Interface

The ELG-provided trial UIs30 have been designed to support common service types
in a generic way, but there may be specific services for which the standard UIs either
do not work or do not represent the results in a particularly intuitive way. If this is the
case, it is possible to supply an alternative trial UI that better suits the service to be




30   https://gitlab.com/european-language-grid/usfd/gui-ie
     4 Contributing to the European Language Grid as a Provider                                         87

 1   // set up message listener
 2   window . addEventListener ('message ', (e) => {
 3     if(e. origin ===
 4          'https :// live.european -language -grid.eu ') {
 5       const serviceInfo = JSON.parse(e.data);
 6       // configure UI here - store ServiceUrl and Authorization, fetch
 7       // parameter metadata from ApiRecordUrl, etc.
 8     }
 9   });
10

11   // request configuration from the parent frame
12   setTimeout (() => {
13     // the content of the message is unimportant, any message will trigger
14     // the configuration reply.
15     window . parent . postMessage ("GUI: Ready for config ",
16               " https :// live.european -language -grid.eu");
17   }, 500);
     Listing 5 Typical JavaScript setup code for a trial UI




     added. The standard UIs are open source under the Apache Licence31 , and providers
     are free to use this code as a basis for their own UI.
         A trial UI is a single-page HTML/JavaScript application which is loaded into an
     <iframe> by the catalogue page when the user views an ELG-compatible service.
     Trial UIs run entirely in the browser and must not send user data to anywhere other
     than the ELG service endpoint and the i18n message resolver service. The JavaScript
     inter-frame messaging mechanism is used to supply the UI with the data it needs to
     configure itself for use with this particular service – when the UI <iframe> loads
     it must register a message listener that expects to receive message data that can be
     parsed as JSON, then dispatch a message to the parent frame to trigger the configu-
     ration message in return.32 An example of this mechanism is shown in Listing 5.
         The message event data sent by the parent frame will be JSON containing the
     following properties:
     ServiceUrl The public LT service API URL at which the service can be called.
       The URL may include query string parameters if the service has more than one
       deployed version.
     ApiRecordUrl The catalogue API URL from which the metadata record for this
       service may be retrieved with a GET request. This provides access to service pa-
       rameter declarations, sample data, etc.
     Authorization An HTTP Authorization header value that will authenticate
       calls to the ServiceUrl and ApiRecordUrl as the user who is logged in.


     31    https://www.apache.org/licenses/LICENSE-2.0
     32    To avoid the parent frame sending the configuration data before the UI frame is ready to receive
     it.
88                    Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

Language (optional) ISO code for the preferred language of the user. If present,
  this should be used as the lang parameter when resolving status messages to
  strings using the i18n resolver (see Section 4.1.1)
   The custom UI can be hosted at any HTTPS URL – the ServiceUrl and
ApiRecordUrl return the appropriate CORS headers to support cross-origin re-
quests. Trial UIs run as Docker images in the ELG Kubernetes cluster. UIs can be
created either by the ELG team or by a provider that needs a custom visualisation in-
terface for the tools they contribute. Custom UIs can be integrated into ELG together
with the ELG technical team.



4.2 ELG-hosted Resources

Together with metadata descriptions, providers are encouraged to upload the corre-
sponding data files of their language resources so that they are readily available for
download through ELG. To register their resources, they can select their preferred
option from the ones presented in Section 2.1 and upload the accompanying files
following the instructions in Section 2.2.


4.2.1 Requirements for ELG-hosted Resources

ELG requires data files to be uploaded as compressed ZIP files. There are no other
specific metadata requirements apart from those defined for records of the resource
type to which they belong (i. e., corpora, models, etc.). Chapter 2, Section 5, (p. 19 ff.)
describes the metadata schema in more detail.


4.2.2 Packaging Data and Splitting Metadata Records: Recommendations

Datasets are composed of files that can be organised according to different criteria.
For example, a multilingual corpus of texts from various domains can be described
as a whole (one metadata record) or split into subsets (with corresponding metadata
records) using the language or domain criteria. Depending on their intended use,
different ways of packaging datasets and making them available can be suggested.33
   We prepared a set of recommendations for the packaging of data files to enable
users, especially those accessing ELG through programmatic APIs, to automatically
identify, download and use corpora as is, without having to download them and man-
ually search among them the subsets that interest them.34

33https://www.w3.org/TR/vocab-dcat-3 provides a similar argumentation for data distributions.
34These recommendations can be applied in different contexts, depending on whether the resource
will be uploaded in ELG: when providers upload their corpora into ELG, they can use them to
package the files and register the resource as one or multiple metadata records; if they decide to
4 Contributing to the European Language Grid as a Provider                                       89

   The following cases are foreseen:
Multilingual resources are recommended to be split into bilingual pairs, so that
  users can easily find and use them, for example, in the case of bilingual corpora,
  to train bilingual models.
Resources from shared tasks are usually already split into training, development,
  gold, and test datasets, with a direct link to each of these. This is an established
  practice, and adopted in ELG as is. We recommend to register them as separate
  metadata records.
   In both cases, a parent metadata record, to which the metadata records of all sub-
sets can point is recommended using the “isPartOf” relation.



4.3 Metadata Records for External LRTs, Organisations and Projects

When external LRTs, organisations or projects are added to ELG, the only require-
ment for such metadata records is that they conform to the minimal version of the
ELG metadata schema, i. e., they include the mandatory metadata elements described
in Chapter 2, Section 5 (p. 19 ff.). Providers can use one of the options described in
Section 2.1 (p. 69 ff.). For these records, the validation process aims to ensure that
the metadata description is consistent and informative for users.



5 Provider-Related Functionalities in ELG and other Platforms

In this final section of the chapter we discuss some aspects of the functionalities of-
fered to LT providers in ELG in relation to those available in other similar platforms.
This discussion cannot be exhaustive. It rather attempts to give an overview of their
design and implementation, highlight the main options utilised by the platforms, and
offer explanations of the adopted approaches.



5.1 Metadata Requirements

Although the use of certain metadata schemas (e. g., DC35 , DCAT36 , schema.org37 ,
etc.) is growing, these schemas are usually restricted to the documentation of gen-
grant access to external corpora through hyperlinks, they can follow them for splitting the resource
into one or multiple records and marking the availability through a direct link (element “download-
Location”).
35 https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

36 https://www.w3.org/TR/vocab-dcat-3/

37 https://schema.org
90                      Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

eral properties and do not satisfy domain- or community-specific requirements, es-
pecially with regard to discovery. Thus, most platforms use their own metadata
schemas or ask for a minimum set of elements which are community-, domain-, or re-
source type specific (see Chapter 6 for a discussion of metadata schemas). Technical
metadata are typically mandatory when resources are deployed in a platform. ELG
has a detailed schema with a minimum set of required metadata to allow for flexi-
bility and strictness when this is mandated for operational reasons (i. e., resources
deployed in ELG, added by individuals, harvested from other sources).
    CLARIN has initiated the Component MetaData Infrastructure38 , which provides
a framework to describe and reuse different “metadata profiles” for resource types
and communities. Specific metadata profiles, e. g., those of web services, are “recom-
mended” with an aim to ensure interoperability and operational requirements. How-
ever, these profiles may promote different mandatory elements, depending on the
use of the profile by each CLARIN Centre. Hugging Face39 uses a dataset and model
card, in which part of the required information is specified via YAML40 tags.



5.2 Provider User Interface and Metadata User Interface

User-friendly editors that can cover multiple metadata schemas are difficult to im-
plement, especially when the schemas have a complex structure. Nevertheless, most
platforms include such an option. ELG, like META-SHARE41 (Piperidis 2012;
Piperidis et al. 2014), OpenMinTeD42 (Labropoulou et al. 2018) and the European
AI-on-demand platform43 , offer provider-specific UIs and a metadata editor support-
ing their respective schemas for describing resources. Hugging Face offers a rather
simple UI with limited functionality. LAPPS Grid44 (Ide et al. 2016) does not pro-
vide such UIs, a provider must communicate with the technical team in order to
add services to the Galaxy45 toolbox. Various CLARIN teams have created editors
that support CMDI metadata (e. g., COMEDI46 , ARBIL47 , etc.). For more technical
users, platforms offer APIs through which they can upload metadata records with
JSON being the most widely used format for the records.




38   https://www.clarin.eu/content/component-metadata
39   https://huggingface.co
40   https://huggingface.co/docs/datasets/v1.12.0/dataset_card.html
41   http://www.meta-share.org
42   https://openminted.github.io
43   https://www.ai4europe.eu
44   https://www.lappsgrid.org
45   http://galaxy.lappsgrid.org
46   https://clarino.uib.no/comedi/page
47   https://portal.clarin.nl/node/14320
4 Contributing to the European Language Grid as a Provider                        91

5.3 Try Out User Interface

Hugging Face offers embedded trial UIs to access their public “inference API”.
These are similar in spirit to the ELG “try out” UI mechanism, with a publicly docu-
mented API being called by a generic user interface. In addition, Hugging Face pro-
vides “Spaces”48 which enable users to create and deploy their own UIs for demon-
strating a model. The approach followed by Hugging Face Spaces is different from
ELG; it is based on developers coding their own back-end server code and front-end
UI as a single unit using the Streamlit49 or Gradio50 Python libraries. The developer
adds this source code to a Git repository and Hugging Face then deploys the code
to their infrastructure directly from the source code rather than from a developer-
supplied Docker image. The UI is tightly coupled to the server-side code and the
“API” is an implementation detail that varies from “space” to “space”. ELG does
not offer this kind of option by default, but the documented APIs mean that third
parties could create a similar service on top of the LT services offered by ELG.



5.4 Helper Tools for Packaging Resources

As described in the previous sections, ELG offers command line utilities and SDKs
for creating and submitting metadata for resources, preparing ELG-compatible ser-
vices, etc. OpenMinTeD offered only a metadata validation service, without a corre-
sponding command line tool. The European AI-on-demand platform, however, pro-
vides such utilities through Acumos51 an open source framework, that makes it easy
to build, share, and deploy AI applications.



5.5 Packaging Data Resources

ELG has adopted a lightweight policy for the packaging of uploaded datasets, given
that direct deployment is currently not foreseen. In the CLARIN infrastructure, each
centre has its own processes and recommended formats for uploaded resources, tak-
ing into account preservation or deployment purposes (e. g., submitting the resources
to processing). Hugging Face maintains a detailed set of instructions for the upload
of datasets and models, which is crucial for ensuring that they can be deployed.




48 https://huggingface.co/spaces
49 https://streamlit.io
50 https://gradio.app

51 https://www.acumos.org
92                   Dimitris Galanis, Penny Labropoulou, Ian Roberts, Miltos Deligiannis et al.

6 Conclusions

ELG enables producers of language resources and language technology tools and
services to upload, describe, share, and distribute their services and products as well
as to describe their companies, academic organisations and projects. ELG offers to
providers web-based user interfaces for describing LT resources or related entities
with metadata records and provides them with functionalities for managing the life
cycle of their assets; a billing component for commercial services and resources has
been implemented (see Chapter 3, Section 6, p. 59 f.) and will be activated as soon as
the ELG legal entity is in place (see Chapter 13). Providers of LT tools can exploit
such functionalities to integrate LT tools in the ELG platform as ready-to-deploy
services. LT data and tool providers are requested to follow the specifications and
recommendations for packaging tools and resources to be uploaded in ELG. In the
wider language technology ecosystem, provider-related functionalities are offered
by other platforms, too, respecting their own target groups, objectives and policies.
ELG has built bridges to some of these platforms, see Chapter 6 for more details.



References

Ide, Nancy, James Pustejovsky, Christopher Cieri, Eric Nyberg, Denise DiPersio, Chunqi Shi, Keith
    Suderman, Marc Verhagen, Di Wang, and Jonathan Wright (2016). “The Language Application
    Grid”. In: Worldwide Language Service Infrastructure. Ed. by Yohei Murakami and Donghui
    Lin. Cham: Springer, pp. 51–70. DOI: 10.1007/978-3-319-31468-6_4.
Labropoulou, Penny, Dimitris Galanis, Antonis Lempesis, Mark Greenwood, Petr Knoth, Richard
    Eckart de Castilho, Stavros Sachtouris, Byron Georgantopoulos, Stefania Martziou, Lucas Anas-
    tasiou, Katerina Gkirtzou, Natalia Manola, and Stelios Piperidis (2018). “OpenMinTeD: A
    Platform Facilitating Text Mining of Scholarly Content”. In: Proceedings of WOSP 2018 (co-
    located with LREC 2018). Miyazaki, Japan: ELRA, pp. 7–12. URL: http://lrec-conf.org/works
    hops/lrec2018/W24/pdf/13_W24.pdf.
Labropoulou, Penny, Katerina Gkirtzou, Maria Gavriilidou, Miltos Deligiannis, Dimitris Galanis,
    Stelios Piperidis, Georg Rehm, Maria Berger, Valérie Mapelli, Michael Rigault, Victoria Ar-
    ranz, Khalid Choukri, Gerhard Backfried, José Manuel Gómez Pérez, and Andres Garcia-Silva
   (2020). “Making Metadata Fit for Next Generation Language Technology Platforms: The Meta-
    data Schema of the European Language Grid”. In: Proceedings of the 12th Language Resources
    and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe
    Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard,
   Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA,
    pp. 3421–3430. URL: https://www.aclweb.org/anthology/2020.lrec-1.420/.
Piperidis, Stelios (2012). “The META-SHARE Language Resources Sharing Infrastructure: Princi-
    ples, Challenges, Solutions”. In: Proceedings of the Eight International Conference on Lan-
    guage Resources and Evaluation (LREC’12). Ed. by Nicoletta Calzolari, Khalid Choukri,
   Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno,
   Jan Odijk, and Stelios Piperidis. Istanbul, Turkey: ELRA.
Piperidis, Stelios, Harris Papageorgiou, Christian Spurk, Georg Rehm, Khalid Choukri, Olivier Ha-
    mon, Nicoletta Calzolari, Riccardo del Gratta, Bernardo Magnini, and Christian Girardi (2014).
   “META-SHARE: One year after”. In: Proceedings of the 9th Language Resources and Evalu-
    ation Conference (LREC 2014). Ed. by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck,
4 Contributing to the European Language Grid as a Provider                                      93

   Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios
   Piperidis. Reykjavik, Iceland: ELRA, pp. 1532–1538. URL: http://www.lrec-conf.org/proceed
   ings/lrec2014/pdf/786_Paper.pdf.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 5
Cloud Infrastructure of the
European Language Grid

Florian Kintzel, Rémi Calizzano, and Georg Rehm




Abstract The European Language Grid (ELG) is a cloud-based platform, utilising a
variety of software packages as well as infrastructure components and virtual hard-
ware. The additional software components developed by the ELG project are usually
provided as open source to facilitate re-use by third parties. This chapter provides an
overview of the infrastructural setup used by the ELG cloud platform. The selected
architecture also has implications for providers as well as users of the platform, e. g.,
in terms of the scaling behaviour of individual Language Technology (LT) services.



1 Introduction

One of the key technical goals of the ELG cloud platform is the ability to integrate
functional Language Technology (LT) services from a variety of sources, i. e., to
build a large platform and a corresponding community of providers and users of
these services. The LT tools and services to be continuously integrated into the ELG
platform are, thus, heterogeneous and vary in their technical setup, which is why a
set of common approaches needs to be established to make the integration of the
tools and services possible. One of the most basic joint technical approaches is the
requirement for all functional services to be containerised so that they can run on
the ELG cloud infrastructure. Providers can optionally benefit from utilising addi-
tional support functionality, e. g., source code repositories, container registries and
deployment pipelines offered by the ELG platform.
   Conceptually, the ELG platform consists of three layers, the user interface (UI)
layer, the back end layer and the base infrastructure (see Figure 1). While the UI
and back end are described in more detail in Chapters 2, 3 and 4, the present chapter
focuses on the base infrastructure setup along with supporting functionality. Among
others, this chapter is helpful for providers of functional LT tools and services or
users interested in running parts of the ELG platform on their own hardware.

Florian Kintzel · Rémi Calizzano · Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany,
florian.kintzel@dfki.de, remi.calizzano@dfki.de, georg.rehm@dfki.de


© The Author(s) 2023                                                                  95
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_5
96                                                                                 Florian Kintzel, Rémi Calizzano, and Georg Rehm

   The rest of this chapter is structured as follows. First, Section 2 gives an overview
of the building blocks of the ELG infrastructure. Section 3 provides information
about the deployment side of the ELG platform, while Section 4 describes how the
platform’s scaling profile lends itself to usage in different real-world scenarios. Fi-
nally, Section 5 concludes the chapter with an overview of future work on the ELG
platform infrastructure.


                                                                                           Provider UI &
                          Catalogue UI          Test/Trial UI          Admin UI                                   Language Technologies             CMS UI
                                                                             Entity       Metadata Editor
              FRONTEND
              PLATFORM




                                                                                                                  (functional content)
    GATEWAY




                                                                                                                  Language Resources
                                                                                                                  (non-functional content)




      REST API

                       LT Service Execution                      Analytics             User Management           Catalogue
PLATFORM




                           Orchestrator
BACKEND




                                                                  Billing                 Storage Proxy     Database       Elastic Index
                LT          LT                 LT
               SRV 1       SRV 2      …       SRV n
                                                                Monitoring            Metadata Harvesting


                                                                      Docker repository                       File & object storage
BASE INFRA-
STRUCTURE




                                                                                                                                         Language Resources
                                                                                                                                       (non-functional content)

                                                                                        Docker
                                                                                        Images
                              Nodes


Fig. 1 ELG platform architecture




2 Cloud Infrastructure

The base infrastructure consists, first and foremost, of the compute nodes on which
the European Language Grid runs, alongside their respective volume storage and
networking facilities. On these, the Kubernetes1 core components are installed (Sec-
tion 2.1) including S3-compatible object storage (Section 2.2). We use a managed
approach to Kubernetes, i. e., the installation, update and operation of the Kubernetes
system itself is taken care of by a cloud provider. Together, this forms the hardware
basis of the European Language Grid.
   Conceptually, the base infrastructure also consists of a larger set of Git2 reposito-
ries and container registries which are described in Sections 2.3 and 2.4.



1   https://kubernetes.io
2   https://git-scm.com
5 Cloud Infrastructure of the European Language Grid                                  97

2.1 Kubernetes and Cloud Native

Kubernetes is an open source system for automating deployment, scaling, and man-
agement of containerised applications. It has seen widespread usage in recent years
as the container orchestration tool of choice. Adoption of Kubernetes in a managed
setup was still in a relatively early stage at the time the ELG project was exploring
different cloud providers in early 2019. While various products by the typical hyper-
scalers already existed, European providers had only very recently started offering
comparable solutions.
   Our selection of Kubernetes as the framework of choice for ELG was primarily
based on the following criteria:
    • Kubernetes provides self-healing capabilities that can detect common failure
      situations and restart affected containers automatically.
    • Through the use of a managed approach to Kubernetes, failures of the core Ku-
      bernetes system itself are the responsibility of the cloud provider.
      These first two criteria together allowed the ELG project to have a relatively
      small footprint in terms of operational complexity as failures are either self-
      healed or taken care of by the cloud provider, at least in theory. While exceptions
      do exist, this still has reduced the operational effort considerably.
    • Kubernetes facilitates the usage of OCI-compatible containers.3 As ELG aims
      to integrate different technologies used for the implementation of LT services
      and tools, OCI-compatible containers form a common approach for integration.
    • Kubernetes provides off-the-shelf functionality for scaling up resources based
      on dynamic load. As ELG integrates hundreds of different LT tools and services,
      this functionality was deemed essential.
    • Kubernetes namespaces4 are useful to separate the different platform compo-
      nents from one another.
    • Continuous adoption of Kubernetes within the industry assures continued sup-
      port and development of this technology.
   An ecosystem of compatible technologies has been established around Kuber-
netes with the Cloud Native Computing Foundation (CNCF).5 CNCF promotes the
use of a large set of base technologies for solving, e. g., authentication, monitoring,
deployment and other common challenges. Most supporting technologies used in
ELG (Section 3.2) are part of CNCF. Alongside this, a set of architecture patterns
has emerged that aim to support properties such as Gannon et al. (2017):
    •   Cloud-native applications often operate at the global level.
    •   Cloud-native applications must scale well with thousands of concurrent users.
    •   Built on the assumption that infrastructure is fluid and failure is constant.
    •   Designed so that upgrade and test occur without disrupting production.
    •   Security must be part of the underlying application architecture.
3 https://opencontainers.org – Open Container Initiative
4 https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces
5 https://www.cncf.io
98                                              Florian Kintzel, Rémi Calizzano, and Georg Rehm

2.2 Storage

The various components of the European Language Grid platform utilise persistent
storage differently, as follows:
    • Static Language Resources, i. e., corpora, models etc. available for direct down-
      load on the European Language Grid platform are persisted on S3-compatible
      object storage and can be fetched from there.
    • The major infrastructural part of the ELG platform – the hundreds of LT tools
      services – do not utilise persistent storage at all, as they are designed stateless.
      All application code is shipped within an OCI-compatible container. This in-
      cludes additional resources needed to run the service, e. g., language models
      and additional configuration files.
    • The core ELG platform components (catalogue, authentication, CMS etc.) utilise
      network block storage attached to their running containers for persistence. This
      block storage is in turn backed up to the object storage on a regular basis.
   Therefore, static resources can potentially be available for direct download and
be included in the respective service container image as well. We decided for this
approach to simplify deployment and management of images and resources, e. g.,
for a local installation of a set of LT services, it is only necessary to pull and run
the respective images, i. e., no additional language resources need to be handled.
Though this potentially results in duplication of resource files (within an image and
as an additional separate file for download) it was deemed a necessary trade-off to
keep the deployment model easier.



2.3 Software Repositories

ELG is comprised of various independent software packages for, e. g., platform com-
ponents and individual LT services. The main ELG GitLab project repository6 is set
up as a GitLab group, consisting of various sub-groups and repositories. The differ-
ent repositories in this group can be categorised as follows.
    • The ELG Infrastructure Repository consists of a set of configuration files,
      mostly in the form of Helm7 charts (see Section 3.1). These define which pack-
      ages, i. e., containers, the ELG system consists of, as well as numerous additional
      configuration parameters such as the number of replicas and package-specific
      configurations. It can be used to set up multiple clusters. We maintain differ-
      ent branches within the repository, usually at least one for the development and
      one for the production cluster. The branches are not only used to distinguish
      between specific configurations for each cluster, but present different versions

6   https://gitlab.com/european-language-grid
7   https://helm.sh
5 Cloud Infrastructure of the European Language Grid                                   99

      of the ELG system as it matures during development. This is used to facilitate a
      staged roll-out to the production cluster. The actual source code for these com-
      ponents is not part of this repository. It only includes references to the container
      registries with the specific components. When installing the ELG cluster, these
      images are then downloaded (“pulled”) from these registries.
    • The ELG Cluster Admin repository holds cluster-specific configurations for
      each ELG instance that are applied separately from the settings of the ELG In-
      frastructure Repository. These mostly consist of the list of active administrative
      users for accessing the ELG infrastructure (those needing access to the infras-
      tructure the ELG is running on, not users of the ELG platform), their roles and
      access rights as well as the configuration for build-bot, our continuous integra-
      tion utility of choice. Included are also various utilities to manage the cluster.
      This repository is not needed for local deployment of the ELG, as such a deploy-
      ment is usually only meant for a single user, typically a developer, and does not
      participate in continuous deployment.
    • The main ELG GitLab platform project repository.8 This repository hosts the in-
      dividual components that make up the the core ELG platform and ELG website.
      These are mainly the platform (catalogue back end and front end components
      and the website content management system, along with a larger set of internal
      supporting and utility components.
    • Individual sub-groups with repositories for individual LT services, grouped by
      provider. These consist solely of the LT services provided by members or asso-
      ciates of the ELG project consortium.
   Implementation code for LT services not provided by ELG project consortium
members is not usually held in the ELG GitLab group but rather managed via
provider-specific repositories.



2.4 Container Registries

The images for instantiating containers in the ELG cluster are stored in various con-
tainer registries. The Kubernetes installation powering ELG pulls the images from
these registries on demand. These can be categorised as follows.
    • The ELG GitLab project registry9 is the registry that corresponds to the main
      ELG GitLab group, it hosts all images for all ELG core platform components
      (e. g., UI, back end, utilities) and for several ELG LT services developed by
      ELG project consortium partners. This registry allows public access to facilitate
      download and re-use of ELG components.
    • Public registries for various externally implemented third-party components
      such as database system, identity and access management.

8   https://gitlab.com/european-language-grid/platform
9   registry.gitlab.com/european-language-grid
100                                        Florian Kintzel, Rémi Calizzano, and Georg Rehm

     • Private registries of partners who do not publish their LT services under an open
       source license (proprietary LT services) or need to use their own registries for
       technical reasons.
     • Various other public registries for open source LT services.
     • The dedicated ELG registry.10 As LT service images are partly pulled from reg-
       istries external to the ELG project, this registry was set up to serve as a point
       to collect LT service images when they are ingested into ELG in order to per-
       form versioning. Using this approach, ELG can ensure the availability of older
       versions of certain tools even if their original site is no longer serving them.



3 Installation

ELG utilises a GitOps approach (see, e. g., Beetz and Harrer 2021) to deployment,
i. e., the configuration necessary to set up the compute cluster is managed by version
control. The base artefact for deployment is the Helm chart.11 Helm charts are used
to manage the installation and update the ELG platform. Each chart bundles a set of
components along with their configuration. All custom charts are defined in the ELG
platform repository GitLab group (Section 3.1). Alongside the custom charts, a larger
set of third-party charts is utilised to set up the respective components (Section 3.2).
    We apply the charts to the cluster using a Continuous Integration (CI) approach,
i. e., automatic deployment happens whenever changes to the configuration are de-
tected by the CI (Figure 2).




Fig. 2 ELG continuous integration




10   registry.european-language-grid.eu
11   https://helm.sh
5 Cloud Infrastructure of the European Language Grid                                   101

   If a new version of the infrastructure setup is detected, the CI checks out the
respective changes and applies them to the cluster state. Any new container versions
are then pulled from their distributed container registries. The Kubernetes cluster is
updated with the latest configuration and takes care of gracefully shutting down and
instantiating new containers.
   Continuous integration regarding the ELG infrastructure only deals with updating
the ELG cluster with the latest set of images (as specified by their version number)
and configuration. It does not deal with building the respective images themselves.



3.1 ELG Charts

These charts were specifically developed for ELG and control its setup and installa-
tion. The packages are meant to be installed together, though it is possible to install
only a subset for specific use cases (e. g., custom local installations). The architec-
ture of the ELG is described in Chapter 2 as well as, e. g., Rehm et al. (2021), which
is why we focus only on the software packages themselves.
     • The ELG core package consists of definitions for various supporting function-
       alities of ELG. These are the Ingress12 definitions for routing incoming traffic
       into the ELG cluster, the configuration for the rest server component as well as
       the configuration for the temporary storage component (used for large file op-
       erations). Various smaller configurations can also be found here, e. g., priority
       classes for pod scheduling, support for maintenance operations and others.
     • The ELG back end chart consists of the definitions for the main back end com-
       ponents, the Django13 and React14 powered applications that form the ELG cat-
       alogue and the ELG back end and administrative applications. Included in this
       chart are also a set of utility functions that deal with housekeeping.
     • The ELG LT services chart bundles the whole set of individual LT services in-
       stalled in ELG. It is actually a collection of charts that follow a common struc-
       ture, each sub-chart consisting of the definitions for the LT services of a specific
       LT services provider as well as a common chart for open source LT services by
       providers who only offer a small set of services. A definition for each individual
       LT service consists at the minimum of the reference to its image location, but
       can consist of numerous additional configurations, e. g., specific hardware re-
       quirements, helper images, parameters for scaling the service up and down and
       various other parameters.




12   https://kubernetes.io/docs/concepts/services-networking/ingress
13   https://www.djangoproject.com
14   https://reactjs.org
102                                              Florian Kintzel, Rémi Calizzano, and Georg Rehm

3.2 Third-Party Charts

Apart from the core components, we use a set of third-party components, which
provide their functionality to the ELG cluster. In the following, we briefly describe
the main third-party components.
     • Cert-manager15 is a tool to manage issuing and updating of TLS certificates. It
       is used to install and refresh TLS certificates to allow for the encryption of all
       HTTPS traffic that reaches the cluster via one of the configured ingress-rules.
     • The Horizontal Pod Autoscaler (HPA)16 is a standard Kubernetes component
       used to scale pods based on their load and runtime behaviour. For scalability and
       load monitoring, Kubernetes collects certain metrics, e. g., CPU and memory
       load, from each pod. Therefore, it is necessary to have at least one instance
       of each type of pod to be up and running at all times. Otherwise, no metrics
       can be collected. This setup is useful to scale ELG core components, e. g., the
       portal website and back end. It cannot be utilised as is to scale the hundreds of
       LT services offered by the platform, as these need to be scaled down to zero
       replicas if they are not needed to not exceed the cluster capacity. Therefore, we
       introduced KNative (see below), which is feeding the standard autoscaler with
       a new metric “concurrency”, based on the number of active requests to that LT
       service. Scaling those services still makes use of cluster-autoscaler functionality,
       but with the new metric also being available if no active replica of an LT service
       is instantiated.
     • KNative17 and Kourier18 give ELG the possibility to scale down LT services
       based on the current number of parallel requests to them (concurrency). The
       concurrency metric is available even if there is no active replica of an LT ser-
       vice. KNative buffers HTTP requests to one of the ELG APIs until the specific
       LT service’s container has started and keeps track of the concurrency metric to
       terminate the replica if it is no longer needed. We cannot overstate the impor-
       tance of this functionality for ELG as the platform consists of hundreds of indi-
       vidual LT service components, not all of which need to run all the time, i. e., it
       would not be efficient to have all these services consume resources while in idle
       state. Starting up a container takes a certain amount of time though, while the
       service initialises. Using a service after it has not been used in a while therefore
       requires a certain spin-up time. KNative does not natively provide facilities to
       reduce the spin-up time further, but additional methods might be helpful in the
       ELG context, e. g., predictive auto scaling (Nanayakkara 2021). If frequent traf-
       fic is expected for a particular service, it can easily be configured to have one or
       more instances running at any given time, depending on hardware availability.


15   https://cert-manager.io/docs
16   https://kubernetes.io/de/docs/tasks/run-application/horizontal-pod-autoscale
17   https://knative.dev/docs
18   https://github.com/3scale-archive/kourier
5 Cloud Infrastructure of the European Language Grid                                    103

     • Ingress-Ningx19 is installed to act as ingress-controller, i. e., handling HTTP traf-
       fic received and forwarding them to their respective endpoint within the cluster.
     • Keycloak20 is an open source solution for authentication and authorisation. It
       interfaces with front end, back end and LT services to provide single-sign on.
     • Elasticsearch21 is used to index the catalogue database for fast faceted search.
     • Prometheus22 , Grafana, Loki and AlertManager form the ELG monitoring so-
       lution. They collect and analyse logs and metrics from all running components
       in the cluster (including the hardware) and provide visualisations in the form of
       dashboards and diagrams (Figure 3).




Fig. 3 Monitoring ELG using Prometheus and Grafana


     • The ELG back end database uses PostgreSQL23 , a well-supported open source
       database engine. It holds all relevant data concerning the ELG catalogue, e. g.,
       projects, organisations, LT resources, LT service as well as user information.
     • MariaDB24 is used for persistence of the Drupal CMS that powers the ELG
       portal. We plan to move this over to PostgreSQL for ease of maintenance.
     • Not an off-the-shelf component, but rather specifically adapted for ELG, the
       s3proxy25 facilitates the upload of LT resources (models, corpora, but also
       project and organisation logos etc.) to ELG. It acts as a proxy to the S3-com-
       patible object storage that takes care of validating upload authorisation with the
       ELG back end and streams data to the object storage.
19 https://nginx.org
20 https://github.com/keycloak/keycloak
21 https://github.com/elastic/elasticsearch

22 https://prometheus.io

23 https://www.postgresql.org

24 https://mariadb.org

25 https://gitlab.com/european-language-grid/platform/s3proxy
104                                          Florian Kintzel, Rémi Calizzano, and Georg Rehm

4 Scalability of LT Tools and Services

ELG is optimised for stateless LT tools and services. Its database systems are ex-
clusively used by the platform back end for the metadata catalogue, user data etc.
LT services do not have persistence enabled for them, with the exception of tempo-
rary files used for large file uploads. In the following, we describe our approach for
scaling up individual LT services and describe its impact for service usability.



4.1 Implementation

With the goal of hosting thousands of individual LT tools and services with very
different hardware needs, it is neither feasible nor practical to have all of them in-
stantiated at the same time as this would require hundreds of Gigabytes of RAM even
in idle mode, i. e., even if none of them are actually used. Therefore, ELG leverages
the capabilities of KNative26 which make is possible to automatically scale down
services not currently in use to zero replicas. In this state, an LT service does not
consume any hardware resources.
    Scaling up an LT service happens automatically to an initial number of replicas
once a request has been received for that individual service. Requests are buffered
while new containers are starting up. This setup is especially suitable for services see-
ing little or irregular traffic. Further scale-up happens when a configurable threshold
of concurrent requests for a given service is exceeded.
    LT services deployed on ELG need to be aware that their life-cycle is exclusively
controlled by Kubernetes and they need to expect to be started, stopped and hori-
zontally scaled regularly, e. g., when the scheduler detects low resource situations
on one of the nodes, if a container fails to respond, if high traffic is received to an
LT service and other situations. LT services, therefore, highly benefit from quick
start-up times and this is one of the reasons, why we opted for LT services to include
necessary resources like models into their OCI images directly.



4.2 Use Cases

Given its scalability (Section 4.1), a number of use cases can be solved with ELG.
     • Demonstration of service functionality: providers of LT tools and services can
       freely deploy their services to the platform and can expect to be discoverable
       via the platform’s catalogue. For the try out functionality of services, a certain
       spin-up time from idle mode will not impact its usefulness. More performant in-
       stallations of a given service could, e. g., be offered by the providers themselves.

26   https://knative.dev/docs
5 Cloud Infrastructure of the European Language Grid                              105

  • Batch processing of multiple documents: as the containers of an individual LT
    service will stay instantiated for some time after usage before scale-down hap-
    pens, ELG is a good fit for batch processing as the initial scale-up time will not
    be a major contributing factor to processing time.
  • For services intended to power applications where quick response times are re-
    quired (e. g., mobile apps), however, the time it takes to spin up a container is
    likely too long (some seconds, depending on a service’s implementation). This
    is why services on ELG can be configured to stay instantiated all the time and
    still benefit from dynamic scaling in high load situations. To be feasible, dedi-
    cated hardware is necessary, which service providers will be able to reserve on
    the ELG platform for a fee in the future so their services will show the respon-
    siveness and performance they require.
  • Remote processing is a second alternative for LT service providers who want to
    offer their services to the public. In this setup, the ELG platform uses a proxy
    to forward user requests to an external installation of a service, managed by the
    service providers themselves. This offers a flexible approach for providers to
    tune the hardware setup according to their own requirements.
  • Management of non-functional LT resources, where only bandwidth limits scal-
    ability instead of compute capacity.



5 Conclusions

The ELG platform is growing continuously and the capacity, availability, operational
readiness and tooling support of the base infrastructure need to evolve accordingly.
We foresee a need to evolve in the following areas in particular.
  • Hardware capacity and cost distribution: through the use of cloud technology,
    ELG has the technical capability to grow horizontally as required by the encoun-
    tered load. In practice, though, the available hardware is restricted by budget
    considerations. Batches of utilised compute resources would need to be individ-
    ually matched to the user requesting them or the provider offering them, to allow
    the ELG to calculate operational costs on a per request basis. With this and the
    emerging payment functionality, individual resource usage can be reimbursed.
  • Hardware acceleration: ELG currently runs on CPUs exclusively. Already now,
    a larger number of LT services in ELG would benefit from GPU support. Apart
    from higher costs, GPU support will pose a number of technical challenges,
    among them a need to map LT services to specific compute nodes (with or with-
    out GPU support).
  • Integration and deployment support: the initial integration of a functional LT
    service will need further automation and tooling support to be able to cope with
    increased demand and an increased number of running services.
  • Workflow support: ELG would benefit from a possibility for easy workflow
    composition, spanning multiple LT services. Initial efforts have been started to-
    wards this goal (Moreno-Schneider et al. 2020).
106                                             Florian Kintzel, Rémi Calizzano, and Georg Rehm

     • Gaia-X: in the Gaia-X27 project OpenGPT-X28 the ELG platform is currently
       being integrated into the wider Gaia-X ecosystem, i. e., ELG is further extended
       so that it complies to the technical Gaia-X specifications. This will enable all
       ELG LT services and resources to be discoverable and usable within Gaia-X.
   This list only includes a selection of likely areas of improvement. Many additional
use cases and requirements for ELG can be imagined – the platform infrastructure
will need to grow and evolve as required.



References

Beetz, Florian and Simon Harrer (2021). “GitOps: The Evolution of DevOps?” In: IEEE Software
   39.4, pp. 70–75. DOI: 10.1109/MS.2021.3119106.
Gannon, Dennis, Roger Barga, and Neel Sundaresan (2017). “Cloud-Native Applications”. In:
   IEEE Cloud Computing 4.5, pp. 16–21. DOI: 10.1109/MCC.2017.4250939.
Moreno-Schneider, Julián, Peter Bourgonje, Florian Kintzel, and Georg Rehm (2020). “A Workflow
   Manager for Complex NLP and Content Curation Pipelines”. In: Proc. of the 1st Int. Workshop
   on Language Technology Platforms (IWLTP 2020, co-located with LREC 2020). Ed. by Georg
   Rehm, Kalina Bontcheva, Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasiljevs.
   Marseille, France, pp. 73–80. URL: https://www.aclweb.org/anthology/2020.iwltp-1.12.pdf.
Nanayakkara, Pallage Kamindu (2021). “Serverless Performance Improvement for Knative using
   Predictive Auto Scaling”. PhD thesis. Sri Lanka: Informatics Institute of Technology. URL:
   http://dlib.iit.ac.lk/xmlui/handle/123456789/702.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.




27   https://www.gaia-x.eu, https://www.data-infrastructure.eu
28   https://www.opengpt-x.de
Chapter 6
Interoperable Metadata Bridges to the wider
Language Technology Ecosystem

Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis, Maria
Giagkou, Ondřej Košarko, Jan Hajič, and Georg Rehm




Abstract One of the objectives of the European Language Grid is to help overcome
the fragmentation of the European Language Technology community by bringing
together language resources and technologies, information about them, Language
Technology consumers, providers and the wider public. This chapter describes the
mechanisms ELG has put in place to build interoperable bridges to related initia-
tives, infrastructures, platforms and repositories in the wider Language Technology
landscape. We focus on the different approaches implemented for the exchange of
metadata records about, in a generic sense, resources and exemplify them with the
help of four use cases through which the ELG catalogue has been further populated.
The chapter presents the protocols used for the population processes as well as the
adaptations of the ELG metadata schema and platform policies that proved neces-
sary to be able to ingest these new records. Last, we discuss the challenges emerging
in large-scale metadata aggregation processes and propose a number of alternative
options to address them.



1 Introduction

One of the objectives of the European Language Grid is to help overcome the frag-
mentation of the European Language Technology community by bringing together
language resources and technologies, information about them, Language Technol-
ogy consumers, providers and the wider public.
   Additionally, ELG is meant to support digital language equality in Europe (STOA
2018; European Parliament 2018), i. e., to create a situation in which all European

Penny Labropoulou · Stelios Piperidis · Miltos Deligiannis · Leon Voukoutis · Maria Giagkou
Institute for Language and Speech Processing, R. C. “Athena”, Greece, penny@athenarc.gr,
spip@athenarc.gr, mdel@athenarc.gr, leon.voukoutis@athenarc.gr, mgiagkou@athenarc.gr
Ondřej Košarko · Jan Hajič
Charles University, Czech Republic, kosarko@ufal.mff.cuni.cz, hajic@ufal.mff.cuni.cz
Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de


© The Author(s) 2023                                                                          107
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_6
108               Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

languages are supported through technologies equally well. Technological support
for human languages has been characterised by a stark predominance of LTs for En-
glish, while almost all other languages are only marginally supported and, thus, in
danger of digital extinction (Kornai 2013; Rehm et al. 2014, 2020b; ELRC 2019;
Calzolari et al. 2011; Soria et al. 2012). More than ten years after the initial findings
(Rehm and Uszkoreit 2012), Europe’s languages are still affected by this stark im-
balance in 2022, as attested in the most recent series of Language Reports (Giagkou
et al. 2022) prepared by the European Language Equality1 project, which develops
a strategic research, innovation and implementation agenda as well as a roadmap
for achieving full digital language equality in Europe by 2030. In collaboration with
ELG, one of the first steps towards Digital Language Equality has been the creation
of an inventory of language resources and technologies available for Europe’s lan-
guages and its regular monitoring.
    In tandem with its operation as an integrated LT platform, through a battery of
selection, conversion and ingestion processes described in this chapter, ELG aims
to act as a one-stop shop and single entry point to homogenised descriptions of lan-
guage resources and technologies. Section 2 positions the ELG approach towards
this goal in the broader context of the exchange of metadata between catalogues and
repositories. Section 3 presents four use cases through which the ELG catalogue has
been populated with metadata records from other sources, highlighting the features
that have influenced the different solutions we adopted. Section 4 presents the adap-
tations made in the ELG metadata schema and platform policies to take into account
the outputs of these import procedures. Finally, in Section 5 we discuss, based on
the experience gained in this process, the challenges that need to be addressed in the
aggregation of metadata from multiple sources in order to share and promote the use
and re-use of resources, data and software among community members.



2 Approach

There are a wide range of digital catalogues, repositories and, in general, infras-
tructures2 that support the publication and dissemination of digital artefacts and re-
sources, which can be classified along various dimensions. Institutional catalogues
hosting all types of resources (publications, datasets, tools, etc.) produced by prac-
titioners affiliated with an institution, catalogues that focus on resources produced
by specific communities (e. g., OLAC3 for resources related to language and linguis-
tics, CLARIN4 and ELRA5 for language resources, Europeana6 for cultural works,

1 https://european-language-equality.eu
2 For the sake of brevity, we will use the cover term “catalogue” for all institutions of this kind.
3 http://www.language-archives.org

4 https://www.clarin.eu

5 http://elra.info

6 https://www.europeana.eu
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem             109

ELIXIR7 for bioinformatics, LLOD cloud8 for linguistic linked data, etc.), cata-
logues that collect specific content types (e. g., Hugging Face9 for Machine Learning
models and datasets, ELRC-SHARE10 for Machine Translation-related resources or
portals for open government data).11
    At the same time, we witness a strong movement towards the sharing of resources
from multiple sources and various disciplines through a common point of access, so
that they are easily discoverable, accessible and re-usable by all interested stakehold-
ers, fostering interdisciplinary research and cross-community collaborations as well
as Open Science (e. g., European Commission 2022). Google has implemented its
Dataset Search12 , a service dedicated to facilitating the discovery of datasets stored
across the World Wide Web based on keyword search (Benjelloun et al. 2020). The
European Open Science Cloud (EOSC)13 , initiated by the European Commission, is
conceived as a federated and open multi-disciplinary environment for hosting and
processing research data and all other digital objects produced along the research
life cycle, e. g., methods, software and publications (Abramatic et al. 2021). Some
European countries have launched corresponding national initiatives, including the
National Research Data Infrastructure in Germany (NFDI).14 Gaia-X15 seeks to es-
tablish a federated ecosystem in which data is made available, collated, shared and
processed in trustworthy environments, associated with the concept of data spaces,
a type of data relationship between trusted partners, each of whom apply the same
high policies, standards and technical components to the description, storage and
sharing of their data and other resources.
    All these initiatives offer catalogues, or inventories, employing, in many cases,
different metadata schemas for the description of resources. The differences between
the schemas can be attributed to the varying requirements defined by the relevant ob-
ject of description (e. g., dataset vs. software or publication or geospatial data), the
need to cover a wide range of users (for general catalogues) in contrast to the spe-
cialised practices common among scholars of a discipline, as well as to the different
purposes that catalogues may serve (e. g., preservation, dissemination, or process-
ing). Sharing metadata across catalogues presupposes interoperability, in particular,
semantic interoperability. Initiatives for the adoption of common standards in meta-
data vocabularies, documentation of the vocabularies themselves, and the creation
and publication of mappers between them are among the primary instruments to
achieve such interoperability (Chan and Zeng 2006; Zeng and Chan 2006; Haslhofer
and Klas 2010; Alemu et al. 2012; Broeder et al. 2019).

7 https://elixir-europe.org
8 https://linguistic-lod.org/llod-cloud
9 https://huggingface.co

10 https://www.elrc-share.eu

11 https://www.re3data.org/browse/ provides a registry of research data repositories.

12 https://datasetsearch.research.google.com

13 https://eosc-portal.eu

14 https://www.nfdi.de

15 https://www.gaia-x.eu
110                Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

    Equally important is the establishment of protocols and mechanisms for the shar-
ing of metadata, and subsequently of the resources themselves. The OAI-PMH pro-
tocol16 is one of the most popular mechanisms used for repository interoperability
at the metadata level. The ResourceSync17 specification is a framework for the syn-
chronisation of both metadata and resources. Finally, APIs are frequently offered
nowadays as a solution for downloading dumps of metadata records.
    ELG has established technical bridges with other infrastructures and initiatives
in order to enrich its catalogue with information about data resources and tools from
other catalogues and repositories. The catalogues of interest to ELG are usually
discipline-specific, targeting the LT/NLP and neighbouring areas, such as Machine
Learning, Artificial Intelligence as well as social sciences and humanities. Poten-
tially interesting resources for LT development purposes are also hosted in general
repositories and catalogues, the identification and filtering of which poses challenges
which are briefly discussed in Section 3.



3 Establishing Interoperable Connections: Four Use Cases

Depending on the source repositories’ respective contents, metadata schemas and
vocabularies, and the available export functionalities of their catalogues, we have
adopted different approaches towards establishing interoperable connections, a se-
lection of which is presented in the following use cases. For each use case, we de-
scribe the source repository’s technical and metadata features, explain how these
impact the import of metadata records into ELG and present the methodology and
tools used in the integration process.



3.1 Use Case 1: OAI-PMH (CLARIN Nodes and ELRC-SHARE)

The CLARIN (Common Language Resources and Technology Infrastructure) Re-
search Infrastructure (Hinrichs and Krauwer 2014; Eskevich et al. 2020) supports the
sharing, use and sustainability of digital language resources and tools for research in
the social sciences and humanities. It is established in the form of a networked fed-
eration of centres (Wittenburg et al. 2010), consisting of language data repositories,
service centres and knowledge centres, with single sign-on access for all members
of the academic community in all participating countries.
   As part of the technical interoperability specifications, CLARIN data repositories
are required to expose their metadata records to the Virtual Language Observatory18
using OAI-PMH. With regard to metadata interoperability, CLARIN has designed

16   https://www.openarchives.org/pmh/
17   http://www.openarchives.org/rs/1.1/resourcesync
18   https://vlo.clarin.eu
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem      111

and implemented the Component MetaData Infrastructure (CMDI)19 , a framework
for the description and reuse of metadata “components” (semantic groups of ele-
ments) which can be combined to build “profiles”, i. e., metadata templates for spe-
cific resource types by specific communities or groups (Broeder et al. 2008, 2012).
Both are stored and shared through a dedicated registry, with metadata records being
shared in the form of XML files compatible with one of these profiles.
    The ELG platform implements an OAI-PMH client for harvesting metadata from
external repositories which expose their metadata via OAI-PMH. The process of har-
vesting requires the registration of a third-party provider as an “OAI-PMH Provider”
in the ELG catalogue. As soon as communication is established, the third-party
provider shares their OAI-PMH endpoint, which ELG will call at regular intervals
(currently once a week) in order to harvest the metadata the external repository ex-
poses. Thus, for linking with the CLARIN infrastructure, the OAI-PMH harvesting
protocol is the ideal candidate.
    The metadata schema is a crucial parameter to be taken into account in the
harvesting process. The ELG harvester accepts metadata records compliant with
the minimal version of the ELG metadata schema (see Section 5 in Chapter 2).
LINDAT/CLARIAH-CZ20 , the Czech CLARIN national node, does indeed expose
its metadata records described using the META-SHARE minimal schema through
its OAI-PMH endpoint (Gavrilidou et al. 2012). The fact that the ELG schema
(Labropoulou et al. 2020) builds upon META-SHARE proved valuable in the conver-
sion process of the original LINDAT/CLARIAH-CZ metadata into the ELG schema
(see Chapter 8, Section 4, p. 157 ff., for more technical details).
    CLARIN-DSpace, the repository software21 (forked from DSpace22 ) developed
mainly by the LINDAT/CLARIAH-CZ team, is used by several CLARIN centres
for their repositories (Straňák et al. 2019). After pulling the latest changes, these
repositories are ready-to-import into ELG using the same harvesting mechanism and
procedure. At the time of writing, the mechanism described above is also used for
harvesting CLARIN-PL23 and CLARIN-SI24 .
    The same harvesting approach was followed for the harvesting of metadata
records from the ELRC-SHARE repository, which is used for the storage of and
access to language resources collected through the European Language Resource Co-
ordination25 initiative (Lösch et al. 2018) and for feeding the CEF Automated Trans-
lation (CEF.AT) platform.26 ELRC-SHARE (Piperidis et al. 2018) uses a metadata
schema based on the META-SHARE schema tuned to text resources for Machine


19 https://www.clarin.eu/content/component-metadata
20 https://lindat.mff.cuni.cz
21 https://github.com/ufal/clarin-dspace

22 https://duraspace.org/dspace/

23 https://clarin-pl.eu/dspace/

24 https://www.clarin.si/repository/xmlui/?locale-attribute=en

25 https://lr-coordination.eu

26 https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation
112                 Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

Translation purposes. Again, the mapping of the metadata records from the original
schema to ELG was undertaken by the two teams.



3.2 Use Case 2: Custom API and Proprietary Schema (Hugging Face)

A different procedure is used for catalogues that expose metadata records through
custom APIs and proprietary metadata schemas. This procedure is used only for cat-
alogues that are of high interest to the ELG objectives. The Hugging Face catalogue
(Wolf et al. 2020) is such a case. It is a large collection of machine learning models
and datasets that can be used for training models, with a focus on the Transformer
architecture. Since 2021 ELG and Hugging Face have been collaborating with the
goal of importing metadata records from the Hugging Face catalogue into ELG.


       Collection                         Harmonisation                         Ingestion
Download from API                Treatment of mandatory elements          Loading of resulting
                                    and controlled vocabularies          metadata records to ELG




                           Mapping                             Conversion
                           Mapping of                         Conversion to
                        elements to ELG                       ELG metadata

Fig. 1 Workflow for the import of metadata records from Hugging Face to ELG


   One of the goals of Hugging Face is to enable its users to upload datasets and
models following a set of specifications so that they can be deployed for testing and
building other models or integrating models in their applications. Although they en-
courage users to add descriptions for the resources, this is not enforced. Furthermore,
the suggested metadata elements do not follow a standard schema. Users are asked
to upload a “card” for datasets27 or models28 , with a combination of free text fields
and a set of tags (e. g., language, licence) with values from recommended controlled
vocabularies, which are, however, not strictly validated.
   Hugging Face exposes two APIs with JSON files for datasets and models respec-
tively. These JSON files include a subset of the metadata elements displayed in their
catalogue, however, not all records have values for all of the elements. Since im-
porting into ELG presupposes that the metadata records comply with the ELG meta-
data schema, which means that at least the mandatory elements of the minimal ver-
sion (see Section 5 in Chapter 2) are filled in, the conversion and import of records
from Hugging Face into ELG has so far been limited to datasets with at least the de-
27   https://huggingface.co/docs/datasets/dataset_card.html
28   https://huggingface.co/docs/hub/model-repos
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem         113

scription, language and licence elements filled in as these are deemed the minimum
threshold for findability and usability purposes in the context of ELG.
   A conversion process has been set up based on the mapping of the elements and, in
the case of controlled vocabularies, their values. Further enrichment of the resulting
records has been performed for specific elements. The most prominent case was that
of the licencing information, since ELG requires, besides its name, a URL with the
text of the licence. Hugging Face includes a list of licence identifiers taken from
the SPDX list29 (which are also used in ELG), but it allows users as well to add
a licence name without further information. Thus, in addition to the mappings of
the licence identifiers from Hugging Face into the ones used in ELG, we looked
for the licence URL of unmapped values; if no URL was found, the resource was
not imported into ELG. Finally, where required, default values have been used for
mandatory elements whose values could not be inferred from the original metadata
records (e. g., all datasets have been assigned the text value for media type). Figure 1
shows the workflow that was followed in this process.



3.3 Use Case 3: General Catalogues and Standard Schemas (Zenodo)

Catalogues with heterogeneous resources from multiple sources and disciplines
present various challenges. We use Zenodo30 to discuss these challenges.
    Zenodo31 is a repository for storing and sharing EC-funded research results to
support Open Science established and run by CERN, which was created in response
to the European Commission’s (EC) assignment to the OpenAIRE project.32 . Since
its launch, Zenodo has grown steadily and is currently used for the publication of all
types of resources beyond EC-funded ones by research communities and individu-
als. The constant update of the Zenodo catalogue and its uptake by researchers for
the upload of datasets, and, more recently, software, makes it particularly interesting
for ELG purposes. The size and increasing number, however, of catalogue contents
makes the selection of resources very challenging. During the first phase of the ELG
project, we used a manual process for the identification of resources, which is de-
scribed in Chapter 8. This process, though, does not allow for regular updates and
has been abandoned in favour of an automatic process.




29   https://spdx.org/licenses/
30   https://zenodo.org
31   https://about.zenodo.org
32   https://www.openaire.eu
114               Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

   Zenodo exposes its metadata records through two channels: a REST API33 , which
outputs records as JSON files, and an OAI-PMH API34 in a set of standard metadata
formats, i. e., DC35 , DataCite36 , MARC2137 and DCAT38 .
   With regard to the ELG import mechanism, our preferred solution is OAI-PMH,
a standard protocol for interoperability and exchange of metadata records, which
includes a mechanism for regular harvesting. However, the Zenodo OAI-PMH end-
point does not allow the selection based on resource types, which would allow us to
focus on “datasets” and “software”. The only option is to download the whole set of
metadata records in order to subsequently filter them. Furthermore, harvesting from
the OAI-PMH endpoint is rate limited, hence not appropriate for large numbers of
metadata records. We have, therefore, resorted to a combined solution:
     • We downloaded a full dump of 2,060,674 metadata records included in Zenodo
       up until 31 August 2021. This dump, which is available from Zenodo, contains
       all records in JSON format, was filtered according to resource-type.
     • For records added to Zenodo after this date, we are incrementally harvesting
       from the OAI-PMH endpoint. Through this channel, a set of additional 147,621
       records has been harvested in a three-month period.
   The next step is that of identifying the candidate resources for ELG. From the
2,208,295 metadata records available up until 31 December 2021, those of resource
type “dataset” and “software” amount to 592,509 entries. This number is rather high,
and since the majority of these records are of little or no interest to ELG users39 ,
we are experimenting with automated filtering methods to identify the records of
interest.

     Collection                           Conversion                             Ingestion
Download from                     Conversion from DCAT                   Loading of resulting
OAI-PMH server                       to ELG relaxed                     metadata records to ELG




                       Selection                            Validation
                   Filtering of entries                Targeted inspection and
                                                        metadata enrichment

Fig. 2 Workflow for the import of metadata records from Zenodo to ELG



33 https://developers.zenodo.org/#rest-api
34 https://developers.zenodo.org/#oai-pmh
35 https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

36 https://schema.datacite.org/meta/kernel-4.4/

37 https://www.loc.gov/marc/bibliographic/

38 https://www.w3.org/TR/vocab-dcat-3/

39 As a comparison, the ELG catalogue has approx. 13,000 metadata records at the time of writing.
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem                  115

    The conversion of the metadata records is based on the DCAT metadata schema
(Albertoni et al. 2022), which is in widespread use. We expect that mapping DCAT
to ELG will enable the re-use of these converters as a base for import from other
repositories. Moreover, DCAT is the schema with the richest information among the
ones exposed from Zenodo, and the only one that includes a direct link to the down-
loadable files (“downloadURL” element), an important feature for ELG consumers.
    Mapping from DCAT is, however, not straightforward. DCAT is an RDF vocab-
ulary, and restrictions and extensions are implemented in the form of profiles and
applications. The OAI-PMH endpoint makes the metadata records available in XML
format; the XSD schema used by Zenodo is not publicly available40 . A closer in-
spection of the XML files has revealed discrepancies in the representation of some
elements. For instance, “subject” (defined in DCAT as a SKOS41 Concept) appears
in Zenodo XML files either as a SKOS Concept or as an element with the IRI of the
subject value in the form of an attribute. We have analysed the Zenodo XML files,
to the extent possible, and based our mapping on this analysis. We also had to apply
some modifications in the ELG schema so that we could take into account the DCAT
features (Section 4.1). Finally, a converter for the elements in the JSON files offered
through the REST API for the first batch of files has also been implemented.
    As a result of this endeavour, the procedure for regular updates from Zenodo is
foreseen as a workflow integrating the following steps: harvesting from the Zen-
odo endpoint, offline filtering and conversion of the metadata records, possibly with
some manual targeted inspection, and import into ELG (Figure 2).



3.4 Use Case 4: Collaborative Community Initiatives (ELE, ELG)

We also populated the ELG catalogue using bulk lists of metadata records, poten-
tially containing limited information, that serve as seeds for further enrichment. We
present here two such cases, one set of resources collected collaboratively in ELE
and a second set collected by the ELG consortium.
    The European Language Equality (ELE) project (Rehm and Way 2023)42 , which
collaborates with ELG to promote digital language equality in Europe, launched a
project-internal initiative in 2021 to collect as many LRTs as possible available for
the languages under investigation by the project.43 Operationally, a web form was set
up, which included a subset of the mandatory metadata elements of the ELG schema.
Given the size and breadth of this activity (dozens of respondents throughout Eu-
rope for approx. 80 official, regional, minority languages), we considered requiring
every informant to fill in even the minimal version of the metadata schema for ev-
ery single resource identified too demanding and not paricularly realistic, perhaps

40   The XSD schema included in the OAI-PMH API for DCAT is in fact that of DataCite v4.1.
41   https://www.w3.org/2004/02/skos/
42   https://european-language-equality.eu
43   https://european-language-equality.eu/languages/
116                  Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

even negatively impacting the collection process itself, potentially resulting in fewer
resources being reported by the informants if the process of registering a resource                          4
took too much time. The modifications required to accommodate this collaborative
scenario resulted in a “relaxed” version of the schema (see Section 4.1).
   The results of this collection process were exported in a tabular format. Before
the conversion and final import of the approx. 6,500 records into ELG, a long and
demanding process of curation was undertaken using semi-automatic methods. The
final output was imported into ELG through various scripts (Figure 3).

Input collection                        Curation                         Mapping
Collaborative web-            Completion of missing values    Mappings of values according to
 form completion                of mandatory elements            controlled vocabularies




                   Deduplication                     Harmonisation                       Ingestion
             Surface similarity-based           Transformations according           Loading of resulting
                  deduplication                 to controlled vocabularies         metadata records to ELG

Fig. 3 Workflow for the import of ELE results to ELG


   The curation process included normalising, correcting, and enriching values of
elements that were absent or not used consistently. Despite the effort to control the
input through prompting for the selection of values from recommended vocabularies
and filling in mandatory values, web forms do not allow strict enforcement strategies,
especially for cases of long lists of values or multiple values. For example, although
a set of “language” values was offered for selection in the form, the informants could
also add other values, which resulted in values with alternative, unofficial or simply
unusual names. Therefore, language information had to be normalised and mapped
to the ISO 639 language codes, as required by ELG. Although the tabular format
presents some advantages, given its simplicity and users’ familiarity, it still poses a
number of challenges for validation purposes, especially for elements with patterns,
or with multiple values. For instance, the “email” element was filled in with free text
values, URL links, etc., since no validation pattern was used for the element. For el-
ements with multiple values, such as languages, functions, etc., different delimiters
were used in between values and had to be normalised. Moreover, nested informa-
tion cannot be represented in a flat form; for example, the values of language and
region (where the language is spoken) were split in two complementary columns
so that controlled vocabularies could be used, but there can be no guarantee that
both columns are consistently filled in. For these cases, we had to check and ensure
that the same number of values was consistently used across the two complementary
columns and, moreover, that the values were matched correctly.
   In a similar collaborative population setting, the catalogue was populated with
European organisations that develop or use LTs or LRs, which were collected by the
ELG team and the National Competence Centres (NCCs; see Chapter 11 for more
details), thus enabling ELG to quickly become the “yellow pages” of organisations
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem          117

active in the broader LT community. As described in more detail in Chapter 9, lists of
organisations from various sources have been merged, together with information on
list items – mainly contact data and key terms describing their LT-related activities.
The resulting enriched list, divided into sub-lists by country, was checked again by
the respective NCCs, and, after checking the consistency, more than 1,700 records
were converted into the ELG-compatible XML format and imported into ELG. At the
time of writing, a similar procedure is being followed for LT-related R&D projects
and their funding agencies.



3.5 Summary of Use Cases

Table 1 summarises the technical and the metadata conditions in each of the use cases
presented in this section and the ways these are catered for in ELG. Depending on the
export functionalities offered by the source, the ELG platform can establish a connec-
tion at regular intervals and benefit from continuous updates. Table 1 also shows the
ELG metadata schema version that can be used, depending on the source metadata
schema, as well as the quantity and information richness of metadata records.


                          Export                               ELG Schema Update
         Repository     Functionality    Metadata Schema         Version Frequency
       CLARIN nodes      OAI-PMH        META-SHARE               minimal   regular
       ELRC-SHARE        OAI-PMH        ELRC-SHARE               minimal   regular
       Hugging Face      REST API       Proprietary (JSON)       relaxed   one-off
       Zenodo            REST API       Proprietary (JSON)       relaxed   one-off
       Zenodo            OAI-PMH        DCAT (XML)               relaxed   regular
       ELE survey           –           Subset of ELG schema     relaxed   one-off
       ELG collection       –           Subset of ELG schema     relaxed   one-off
Table 1 Overview of use cases




4 Implementing Metadata Interoperability

Primarily motivated by our various interoperability use cases, some of which are
described in Section 3, we modified the ELG platform import procedures and poli-
cies, especially with regard to the metadata schema and the publication life cycle
(described in Chapter 2), so that they are able to handle the different interoperability
scenarios. These adaptations are not restricted to the requirements of the use cases
but lay the foundational principles for accommodating a broader range of metadata
import scenarios.
118                 Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

4.1 ELG Metadata Schema – Relaxed Version

The “relaxed” version of the ELG metadata schema aims to accommodate mis-
matches between the ELG schema and schemas used for metadata records that are
automatically imported into the ELG catalogue, especially those from catalogues
with limited information or catalogues populated with metadata records of interest
to a broader range of communities (e. g., Zenodo, EOSC, etc.) and, thus, using more
general schemas, e. g., DCAT (Albertoni et al. 2022) or DataCite44 (DataCite Meta-
data Working Group 2021). This version of the schema features additional alterna-
tive elements for mandatory metadata elements that may be missing from the source
records or that have different data types.
   The first case refers to two elements that are deemed important for ELG purposes:
“media type” and “licence”.
     • The element “media type part” is crucial for ELG, as it is used for attaching im-
       portant metadata properties, such as language, format, size, etc. Even in cases
       where these are included in source records, they may come with different clas-
       sification vocabularies and semantics and, therefore, cannot be imported into
       ELG. For these cases, the additional alternative value “unspecified media part”
       can be used.
     • The element “licence” is crucial for re-usability purposes; for a licence, both a
       name and a URL hyperlink to the respective legal document are required. How-
       ever, in many cases, such as legacy resources, or records in catalogues allowing
       free text as the value of “licence”, the name and URL cannot be determined au-
       tomatically. This is why we introduced the “access rights” element that takes a
       free text value as an alternative to “licence”, specifying the rights of access and
       use at a higher level of abstraction.
   The second case groups together elements which take a value from controlled
vocabularies in ELG, while in other schemas they have a free text value (e. g., “ser-
vice function”, “size unit”, etc.) and combined elements that cannot be distinguished
from the source metadata record (e. g., when size is encoded as free text combining
amount and size unit together). To address the first case, we modified the data type of
the element so that it takes a value from a recommended vocabulary or free text en-
tered by the user; to address the second case, we introduced a new element that takes
free text as a value (e. g., “sizeText” can be used as an alternative to the combination
of “amount” and “size unit”).



4.2 Publication Policies for Imported Metadata Records

ELG rates the quality of the metadata records highly. High quality metadata con-
tributes to the discovery and usage of the resources themselves. A standardised pub-
44   https://schema.datacite.org
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem       119

lication life cycle has been established in ELG for metadata records (see Chapter 2,
Section 6, 24 ff.). However, the same level of quality cannot be enforced across all
metadata records. This is also taken into account in the publication policies. Thus,
while metadata records registered by individuals go through a validation process,
for records automatically imported from other catalogues the same manual valida-
tion processes cannot be set up in a feasible way, i. e., the quality and extent, in
terms of information, of external metadata records remains under the responsibility
of the respective source catalogue. Depending on the harvesting process and source
catalogue, a three-level classification of metadata records is used:
  • Metadata records harvested automatically from collaborating catalogues (CLA-
    RIN nodes, ELRC-SHARE), which have similar metadata requirements as ELG.
    These records are added by individuals, the resource is stored in the repository.
    This is why these metadata records are considered trustworthy, and the records
    are published in the ELG catalogue as is, i. e., without any human validation.
  • Metadata records automatically imported from catalogues with “lighter” meta-
    data requirements (Hugging Face, Zenodo) have originally been added to the
    source catalogue by individuals together with the physical resource. The meta-
    data record and resource is considered trustworthy but it may lack information
    which is important for ELG purposes, and thus marked as “for information” to
    indicate to ELG users that important information may be missing.
  • Metadata records that resulted from bulk collection initiatives (ELE collection,
    ELG collection) are often incomplete, i. e., only a subset of the required infor-
    mation was collected and converted to the ELG schema. These records adhere
    to the relaxed ELG schema, the physical resource may be stored anywhere on-
    line. These records do not undergo the validation process, they are marked and
    can be claimed for further enrichment by their rightful owners (see Chapter 9,
    Section 3.3, p. 179). When a user claims a metadata record, the technical ELG
    team is notified and can approve or reject the claim, taking into account the
    professional email account of the user; if the claim is approved, the metadata
    record is unpublished and assigned to the user for further editing. Once the user
    finishes the editing, the record is submitted for publication and goes through the
    normal publication procedure. Users are notified about the claim procedure of
    these metadata records via e-mail.



5 Interoperability across Repositories

The interoperability across multiple repositories and platforms is of utmost impor-
tance in a broader, federated environment of data and services, as envisaged in ini-
tiatives like EOSC (European Open Science Cloud, see, e. g., Corcho et al. 2021),
NFDI, Gaia-X or the European Commission’s Data Spaces and in accordance with
the FAIR principles (Wilkinson et al. 2016), see Section 2. In the following, we dis-
cuss some of the open issues that need to be addressed in order to achieve this based
on the endeavours presented in this chapter.
120              Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

5.1 Technical Interoperability across Repositories

The first prerequisite for the sharing of metadata records and the construction of
a common master inventory based on the contents of all participating repositories
is that of exchange services. The OAI-PMH protocol, despite its limitation to the
exchange of metadata, constitutes the most widespread and hence usually preferred
option. REST services are becoming more popular, but they are not yet standardised
and thus require customised solutions. Rehm et al. (2020a) explore technical and
semantic interoperability in more detail.



5.2 Semantic Interoperability across Repositories

The use of shared vocabularies for the documentation of resources is the next neces-
sary step towards interoperability. The standardisation and documentation of meta-
data schemas is a requirement that many initiatives have articulated (Hugo et al.
2020; Behnke et al. 2021). While certain metadata vocabularies, such as DC45 ,
DCAT, schema.org46 and DataCite, have become de facto standards, these are gen-
eral schemas that can be used to express core metadata elements required for the
description of any type of digital resource. This, however, competes with the much
more fine-grained documentation needs of specific communities and more detailed
requirements set to achieve machine actionability. For example, “resource type” is
an element that poses problems for all catalogues: in contrast to the general vocab-
ularies (e. g., COAR resource type vocabulary47 , a limited set of values from DC48 ,
Zenodo49 ), communities prefer finer distinctions (cf. the values of “resource type”
in the CLARIN VLO50 ). This creates a burden when moving from general to spe-
cialised catalogues (e. g., from Zenodo to ELG).
   Bridges and mappers between vocabularies are developed, especially between the
popular schemas.51 Yet this is not a scalable approach, as for each new vocabulary a
new mapper has to be built. Instead, a “shared semantic space” is needed as a joint,
ontologically grounded and machine-readable vocabulary, into which all concepts
and terminologies can be mapped (Rehm et al. 2020a). This space can be envisaged
as a reference model able to represent all crucial information typically contained in
the respective metadata schema. However, a single RDF/OWL ontology covering
general and domain or community-specific semantic categories is an almost impos-
sible task to achieve (Labropoulou et al. 2018). An alternative could be a Linked

45 https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
46 https://schema.org
47 https://vocabularies.coar-repositories.org/resource_types/

48 https://www.dublincore.org/specifications/dublin-core/resource-typelist/

49 https://developers.zenodo.org/#representation

50 https://vlo.clarin.eu

51 For the mapping of metadata schemas in the wider LT ecosystem, see McCrae et al. (2015b,a).
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem        121

Data approach52 , in which different communities maintain their independent formal
models and vocabularies and subsequently refer to reference vocabularies or con-
cepts developed in a distributed fashion by the broader community. As an example
of such an approach, a collaboration was initiated between ELG and the AI4EU
project on the mapping of the ontologies used in the two platforms. This work is
continued under the umbrella of the AI Ontology Working Group which includes
members from the European AI on Demand Platform and collaborating projects.53
    Even in this scenario, though, an important issue to be addressed is that of the
appropriate semantic relations. Equivalence relations are not always one-to-one and
also need to take into account the type of elements. Additionally, there are an abun-
dance of similar vocabularies recommended by different communities or serving
different documentation needs. For example, in terms of “language”, a value taken
from ISO 63954 may suffice for general catalogues. But for the metadata of resources
in language-related catalogues, such as ELG, a more detailed value space is required,
that takes into account dialects and other varieties, and these are not included in ISO
639 (Gillis-Webber and Tittel 2019). In ELG we use the BCP 47 recommendation
(Phillips and Davis 2009) alongside values taken from the Glottolog55 vocabulary
(Hammarström et al. 2021) so that we can exploit the finer distinctions made in it for
language varieties. The fact that Glottolog includes a mapping to ISO 639-3 values,
when these exist, facilitates this endeavour and the exchange of metadata records
with catalogues that prefer using ISO 639.



5.3 Minimal Metadata Requirements

The different purposes served by the catalogues have an impact on the exchange of
metadata records, too. For example, Zenodo is used for the publication of research
outcomes by many different organisations and individuals. The fact that there is a
very small set of mandatory elements as well as the fact that providers do not have a
strong incentive to make their resources findable lowers the quality of the metadata
descriptions. In a similar way, individuals that add their resources to the Hugging
Face catalogue are mostly interested in testing their dataset and do not pay attention
to its description. Many metadata elements that are important for ELG purposes, such
as “language”, are simply not included in the formal descriptions of these records.
Often, even free text descriptions are of very low quality and cannot be used for dis-
covery purposes. There is, therefore, a strong need for training resource owners on
the importance of metadata together with the continuous curation by experts (Gor-
don and Habermann 2019). The “claim” procedure adopted in ELG is a step along
these lines. Semi-automatic methods for enriching metadata records by extracting

52   https://www.w3.org/DesignIssues/LinkedData.html
53   https://www.ai4europe.eu/ai-community/working-groups-d/ontology
54   https://www.iso.org/iso-639-language-codes.html
55   https://glottolog.org
122                Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

information from the datasets themselves, as well as other sources, will also play an
important role in ensuring that minimal documentation requirements are met.



5.4 Duplicate Resources

Looking at the resources themselves, the exchange of metadata records across cat-
alogues comes with the risk of creating duplicates and near-duplicates. The same
resource may appear with slightly different names in catalogues and similar descrip-
tions, while the same name is often used for subsets of the resource. The use of
persistent identifiers (PIDs) has been proposed to address this, but it cannot be guar-
anteed that persistent identifiers are indeed unique. Explicit relations between sim-
ilar resources (subsets, raw or annotated versions, versions and updates, etc.) must
be formally recorded in the metadata so that they can be used for deduplication pur-
poses. Establishing relations between the metadata records of the same resource in
different catalogues should also be recorded.



6 Conclusions

In this chapter we have focused on the sharing of metadata between catalogues. This
is only the basis for what is going to be the next level of sharing data and software
which is the ultimate goal. This involves not only a shared semantic space to anchor
and cross-link metadata vocabularies but also technical compatibility and coopera-
tion. ELG has closely collaborated with other platforms to explore platform inter-
operability at various levels (Rehm et al. 2020a). Experiments were conducted with
AI4EU56 , SPEAKER57 and QURATOR58 for the creation of cross-platform work-
flows, where data and services were accessed from one platform and either trans-
ferred to another platform or used for building a pipeline or workflow of different
processing services in another platform. Our initial experiments, explored further by
Moreno-Schneider et al. (2022), demonstrate that interoperability can be partially
achieved, with a certain degree of manual and automatic interventions.
    Finally, we should also mention an alternative that can be used for sharing re-
sources and their documentations across platforms and communities. This consists
of supporting cross-platform search through making search and discovery APIs used
by a platform available to third parties so that they can integrate them in their own
search space (Rehm et al. 2020a). This way, a single query would return matches
from multiple platforms whose publicly available search APIs are integrated in the
platform queried by the user. In this case, search results would show only a minimal

56   https://www.ai4europe.eu
57   https://www.speaker.fraunhofer.de
58   https://qurator.ai
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem                     123

set of metadata redirecting the user to the platform that offers the respective resource.
Again, a shared common space is required but only for a limited set of metadata –
a similar situation to the general catalogues presented above, but only for a small
subset. However, this option presents a scalability problem as soon as the number of
collaborating platforms and respective search APIs grows.
   Decentralised infrastructures such as Gaia-X, in which individual trusted plat-
forms follow a common standard (i. e., the Gaia-X federation services) and become
a networked system freely sharing and exchanging data and services across multi-
ple actors, offer a viable solution addressing this challenge. OpenGPT-X59 is a Ger-
man national project in which large language models are currently being developed,
especially for German but also for English and other European languages. In this
project, which has started in January 2022, we will have the chance to implement
the emerging Gaia-X specifications in the ELG platform so that it joins this emerging
ecosystem.



References

Abramatic, Jean-François, Jan Hrušák, and Sarah Jones, eds. (2021). European Open Science Cloud
   (EOSC) Executive Board: Final Progress Report. Publications Office. DOI: 10.2777/46019.
Albertoni, Riccardo, David Browning, Simon Cox, Alejandra Gonzalez-Beltran, Andrea Perego,
   and Peter Winstanley, eds. (2022). Data Catalog Vocabulary (DCAT) – Version 3. W3C Working
   Draft. URL: https://www.w3.org/TR/vocab-dcat-3/.
Alemu, Getaneh, Brett Stevens, and Penny Ross (2012). “Towards a conceptual framework for user-
   driven semantic metadata interoperability in digital libraries: a social constructivist approach”.
   In: New Library World 113.1/2, p. 15.
Behnke, Claudia, Kees Burger, Yann le Franc, Wim Hugo, Pekka Järveläinen, Jessica Parland-von
   Essen, and Gerard Coen (2021). “D2.6 First reference implementation of the data repositories
   features”. In: DOI: 10.5281/zenodo.5362027. URL: https://zenodo.org/record/5362027/export
   /hx.
Benjelloun, Omar, Shiyu Chen, and Natasha Noy (2020). “Google Dataset Search by the Numbers”.
   In: The Semantic Web (ISWC 2020) – 19th International Semantic Web Conference. Ed. by Jeff
   Z. Pan, Valentina A. M. Tamma, Claudia d’Amato, Krzysztof Janowicz, Bo Fu, Axel Polleres,
   Oshani Seneviratne, and Lalana Kagal. Vol. 12507. Lecture Notes in Computer Science. Athens,
   Greece: Springer, pp. 667–682. DOI: 10.1007/978-3-030-62466-8_41. URL: https://doi.org/1
   0.1007/978-3-030-62466-8_41.
Broeder, Daan, Thierry Declerck, Erhard Hinrichs, Stelios Piperidis, Laurent Romary, Nicoletta
   Calzolari, and Peter Wittenburg (2008). “Foundation of a Component-based Flexible Registry
   for Language Resources and Technology”. In: Proceedings of the Sixth International Confer-
   ence on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco: ELRA. URL:
   http://www.lrec-conf.org/proceedings/lrec2008/pdf/364_paper.pdf.
Broeder, Daan, Thorsten Trippel, Emiliano Degl’Innocenti, Roberta Giacomi, Maurizio Sanesi,
   Mari Kleemola, Katja Moilanen, Henri Ala-Lahti, Caspar Jordan, Iris Alfredsson, Hervé L’Hou-
   rs, and Matej Ďurčo (2019). “SSHOC D3.1 Report on SSHOC (meta)data interoperability prob-
   lems”. In: DOI: 10.5281/ZENODO.3569868. URL: https://zenodo.org/record/3569868.
Broeder, Daan, Dieter van Uytvanck, Maria Gavrilidou, Thorsten Trippel, and Menzo Windhouwer
   (2012). “Standardizing a Component Metadata Infrastructure”. In: Proceedings of the Eighth In-

59   https://opengpt-x.de
124              Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

   ternational Conference on Language Resources and Evaluation (LREC 2012). Istanbul, Turkey:
   ELRA, pp. 1387–1390. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/581_Paper
   .pdf.
Calzolari, Nicoletta, Valeria Quochi, and Claudia Soria, eds. (2011). The Strategic Language Re-
   source Agenda. URL: https://www.academia.edu/1651334/The_Strategic_Language_Resourc
   e_Agenda.
Chan, Lois Mai and Marcia Lei Zeng (2006). “Metadata Interoperability and Standardization –
   A Study of Methodology Part I: Achieving Interoperability at the Schema Level”. In: D-Lib
   Magazine 12.6. DOI: 10.1045/june2006-chan. URL: http://www.dlib.org/dlib/june06/chan/06
   chan.html.
Corcho, Oscar, Magnus Eriksson, Krzysztof Kurowski, Milan Ojsteršek, Christine Choirat, Mark
   van de Sanden, Frederik Coppens, and European Commission, Directorate-General for Re-
   search and Innovation (2021). EOSC Interoperability Framework: Report from the EOSC Exec-
   utive Board Working Groups FAIR and Architecture. Publications Office. DOI: 10.2777/620649.
   URL: https://data.europa.eu/doi/10.2777/620649.
DataCite Metadata Working Group (2021). “DataCite Metadata Schema Documentation for the
   Publication and Citation of Research Data and Other Research Outputs v4.4”. In: DOI: 10.144
   54/3W3Z-SA82. URL: https://schema.datacite.org/meta/kernel-4.4/.
ELRC (2019). ELRC White Paper: Sustainable Language Data Sharing to Support Language
   Equality in Multilingual Europe. Second online edition. URL: https://lr- coordination.eu/sit
   es/default/files/Documents/ELRCWhitePaper.pdf.
Eskevich, Maria, Franciska de Jong, Alexander König, Darja Fišer, Dieter Van Uytvanck, Tero
   Aalto, Lars Borin, Olga Gerassimenko, Jan Hajic, Henk van den Heuvel, Neeme Kahusk, Krista
   Liin, Martin Matthiesen, Stelios Piperidis, and Kadri Vider (2020). “CLARIN: Distributed Lan-
   guage Resources and Technology in a European Infrastructure”. In: Proc. of the 1st Int. Work-
   shop on Language Technology Platforms (IWLTP 2020, co-located with LREC 2020). Ed. by
   Georg Rehm, Kalina Bontcheva, Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasil-
   jevs. Marseille, France: ELRA, pp. 28–34. URL: https://aclanthology.org/2020.iwltp-1.5.
European Commission (2022). European Research Area policy agenda: overview of actions for the
   period 2022–2024. Publications Office. DOI: 10.2777/52110. URL: https://data.europa.eu/doi
   /10.2777/52110.
European Parliament (2018). Language Equality in the Digital Age. European Parliament reso-
   lution of 11 September 2018 on Language Equality in the Digital Age (2018/2028(INI). URL:
   http://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.pdf.
Gavrilidou, Maria, Penny Labropoulou, Elina Desipri, Stelios Piperidis, Haris Papageorgiou, Mon-
   ica Monachini, Francesca Frontini, Thierry Declerck, Gil Francopoulo, Victoria Arranz, and Va-
   lerie Mapelli (2012). “The META-SHARE Metadata Schema for the Description of Language
   Resources”. In: Proceedings of the Eighth International Conference on Language Resources
   and Evaluation (LREC 2012). Istanbul, Turkey: ELRA, pp. 1090–1097. URL: http://www.lrec
   -conf.org/proceedings/lrec2012/pdf/998_Paper.pdf.
Giagkou, Maria, Stelios Piperidis, Georg Rehm, and Jane Dunne, eds. (2022). Language Technol-
   ogy Support of Europe’s Languages in 2020/2021. Various project deliverables (language re-
   ports); EU project European Language Equality (ELE); Grant Agreement no. LC-01641480 –
   101018166 ELE. European Language Equality Project. URL: https://european-language-equal
   ity.eu/deliverables/.
Gillis-Webber, Frances and Sabine Tittel (2019). “The Shortcomings of Language Tags for Linked
   Data When Modeling Lesser-Known Languages”. In: 2nd Conference on Language, Data and
   Knowledge (LDK 2019). Ed. by Maria Eskevich, Gerard de Melo, Christian Fäth, John P. Mc-
   Crae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek, and Milan Dojchinovski. Vol. 70.
   OpenAccess Series in Informatics (OASIcs). Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-
   Zentrum fuer Informatik, 4:1–4:15. DOI: 10.4230/OASIcs.LDK.2019.4. URL: http://drops.da
   gstuhl.de/opus/volltexte/2019/10368.
Gordon, Sean and Ted Habermann (2019). Visualizing The Evolution of Metadata. Version Number:
   v0.0.1. DOI: 10.5281/zenodo.2538983. URL: https://doi.org/10.5281/zenodo.2538983.
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem                   125

Hammarström, Harald, Robert Forkel, Martin Haspelmath, and Sebastian Bank (2021). Glottolog
   database 4.5. Version Number: v4.5, Type: dataset. Leipzig, Germany: Max Planck Institute for
   Evolutionary Anthropology. DOI: 10.5281/ZENODO.5772642. URL: https://zenodo.org/recor
   d/5772642.
Haslhofer, Bernhard and Wolfgang Klas (2010). “A survey of techniques for achieving metadata
   interoperability”. In: ACM Computing Surveys 42.2, pp. 1–37. DOI: 10.1145/1667062.1667064.
   URL: https://dl.acm.org/doi/10.1145/1667062.1667064.
Hinrichs, Erhard and Steven Krauwer (2014). “The CLARIN Research Infrastructure: Resources
   and Tools for eHumanities Scholars”. In: Proceedings of the Ninth International Conference on
   Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland: ELRA, pp. 1525–1531.
   URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/415_Paper.pdf.
Hugo, Wim, Yann Le Franc, Gerard Coen, Jessica Parland-von Essen, and Luiz Bonino (2020).
   “D2.5 FAIR Semantics Recommendations Second Iteration”. In: DOI: 10.5281/zenodo.536201
   0. URL: https://zenodo.org/record/5362010.
Kornai, Andras (2013). “Digital Language Death”. In: PLoS ONE 8.10. DOI: 10.1371/journal.pon
   e.0077056. URL: https://doi.org/10.1371/journal.pone.0077056.
Labropoulou, Penny, Dimitris Galanis, Antonis Lempesis, Mark Greenwood, Petr Knoth, Richard
   Eckart de Castilho, Stavros Sachtouris, Byron Georgantopoulos, Stefania Martziou, Lucas Anas-
   tasiou, Katerina Gkirtzou, Natalia Manola, and Stelios Piperidis (2018). “OpenMinTeD: A
   Platform Facilitating Text Mining of Scholarly Content”. In: Proceedings of WOSP 2018 (co-
   located with LREC 2018). Miyazaki, Japan: ELRA, pp. 7–12. URL: http://lrec-conf.org/works
   hops/lrec2018/W24/pdf/13_W24.pdf.
Labropoulou, Penny, Katerina Gkirtzou, Maria Gavriilidou, Miltos Deligiannis, Dimitris Galanis,
   Stelios Piperidis, Georg Rehm, Maria Berger, Valérie Mapelli, Michael Rigault, Victoria Ar-
   ranz, Khalid Choukri, Gerhard Backfried, José Manuel Gómez Pérez, and Andres Garcia-Silva
   (2020). “Making Metadata Fit for Next Generation Language Technology Platforms: The Meta-
   data Schema of the European Language Grid”. In: Proceedings of the 12th Language Resources
   and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe
   Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard,
   Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA,
   pp. 3421–3430. URL: https://www.aclweb.org/anthology/2020.lrec-1.420/.
Lösch, Andrea, Valérie Mapelli, Stelios Piperidis, Andrejs Vasiļjevs, Lilli Smal, Thierry Declerck,
   Eileen Schnur, Khalid Choukri, and Josef van Genabith (2018). “European Language Resource
   Coordination: Collecting Language Resources for Public Sector Multilingual Information Man-
   agement”. In: Proc. of the Eleventh International Conference on Language Resources and Eval-
   uation (LREC 2018). Miyazaki, Japan: ELRA. URL: https://aclanthology.org/L18-1213.
McCrae, John Philip, Philipp Cimiano, Victor Rodriguez-Doncel, Daniel Vila Suero, Jorge Gracia,
   Luca Matteis, Roberto Navigli, Andrejs Abele, Gabriela Vulcu, and Paul Buitelaar (2015a).
   “Reconciling Heterogeneous Descriptions of Language Resources”. In: Proceedings of the 4th
   Workshop on Linked Data in Linguistics: Resources and Applications. Beijing, China: ACL,
   pp. 39–48. DOI: 10.18653/v1/W15-4205. URL: http://aclweb.org/anthology/W15-4205.
McCrae, John Philip, Penny Labropoulou, Jorge Gracia, Marta Villegas, Víctor Rodríguez-Doncel,
   and Philipp Cimiano (2015b). “One Ontology to Bind Them All: The META-SHARE OWL
   Ontology for the Interoperability of Linguistic Datasets on the Web”. In: The Semantic Web:
   ESWC 2015 Satellite Events. Ed. by Fabien Gandon, Christophe Guéret, Serena Villata, John
   Breslin, Catherine Faron-Zucker, and Antoine Zimmermann. Lecture Notes in Computer Sci-
   ence. Springer International Publishing, pp. 271–282. URL: https://link.springer.com/chapter
   /10.1007/978-3-319-25639-9_42.
Moreno-Schneider, Julián, Rémi Calizzano, Florian Kintzel, Georg Rehm, Dimitris Galanis, and
   Ian Roberts (2022). “Towards Practical Semantic Interoperability in NLP Platforms”. In: Pro-
   ceedings of the 18th Joint ACL-ISO Workshop on Interoperable Semantic Annotation (ISA 2022;
   co-located with LREC 2022). Ed. by Harry Bunt. Marseille, France, pp. 118–126. URL: http:
   //www.lrec-conf.org/proceedings/lrec2022/workshops/ISA-18/pdf/2022.isa18-1.16.pdf.
126              Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis et al.

Phillips, Addison and Mark Davis (2009). Tags for Identifying Languages. Tech. rep. RFC 5646.
   Internet Engineering Task Force. URL: https://datatracker.ietf.org/doc/rfc5646.
Piperidis, Stelios, Penny Labropoulou, Miltos Deligiannis, and Maria Giagkou (2018). “Manag-
   ing Public Sector Data for Multilingual Applications Development”. In: Proceedings of the
   Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Ed.
   by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti
   Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno,
   Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga. Miyazaki, Japan: ELRA. URL: http://w
   ww.lrec-conf.org/proceedings/lrec2018/pdf/648.pdf.
Rehm, Georg, Dimitrios Galanis, Penny Labropoulou, Stelios Piperidis, Martin Welß, Ricardo
   Usbeck, Joachim Köhler, Miltos Deligiannis, Katerina Gkirtzou, Johannes Fischer, Christian
   Chiarcos, Nils Feldhus, Julián Moreno-Schneider, Florian Kintzel, Elena Montiel, Víctor Ro-
   dríguez Doncel, John P. McCrae, David Laqua, Irina Patricia Theile, Christian Dittmar, Kalina
   Bontcheva, Ian Roberts, Andrejs Vasiljevs, and Andis Lagzdiņš (2020a). “Towards an Inter-
   operable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different
   Levels of Interoperability”. In: Proc. of the 1st Int. Workshop on Language Technology Plat-
   forms (IWLTP 2020, co-located with LREC 2020). Ed. by Georg Rehm, Kalina Bontcheva,
   Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasiljevs. Marseille, France, pp. 96–
   107. URL: https://www.aclweb.org/anthology/2020.iwltp-1.15.pdf.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic,
   Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez
   Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek,
   Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Al-
   bina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De
   Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek,
   Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur
   Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás
   Váradi, Kadri Vider, Andy Way, and François Yvon (2020b). “The European Language Technol-
   ogy Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communi-
   cation in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation
   Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christo-
   pher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani,
   Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325.
   URL: https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Lan-
   guages in the Digital Age. 32 volumes on 31 European languages. Heidelberg etc.: Springer.
Rehm, Georg, Hans Uszkoreit, Ido Dagan, Vartkes Goetcherian, Mehmet Ugur Dogan, Coskun Mer-
   mer, Tamás Váradi, Sabine Kirchmeier-Andersen, Gerhard Stickel, Meirion Prys Jones, Stefan
   Oeter, and Sigve Gramstad (2014). “An Update and Extension of the META-NET Study “Eu-
   rope’s Languages in the Digital Age””. In: Proceedings of the Workshop on Collaboration and
   Computing for Under-Resourced Languages in the Linked Open Data Era (CCURL 2014). Ed.
   by Laurette Pretorius, Claudia Soria, and Paola Baroni. Reykjavik, Iceland, pp. 30–37. URL:
   http://georg-re.hm/pdf/CCURL-2014-META-NET.pdf.
Rehm, Georg and Andy Way, eds. (2023). European Language Equality: A Strategic Agenda for
   Digital Language Equality. Cognitive Technologies. Forthcoming. Springer.
Soria, Claudia, Núria Bel, Khalid Choukri, Joseph Mariani, Monica Monachini, Jan Odijk, Stelios
   Piperidis, Valeria Quochi, and Nicoletta Calzolari (2012). “The FLaReNet Strategic Language
   Resource Agenda”. In: Proceedings of the Eighth International Conference on Language Re-
   sources and Evaluation (LREC 2012). Istanbul, Turkey: ELRA, pp. 1379–1386. URL: http://w
   ww.lrec-conf.org/proceedings/lrec2012/pdf/777_Paper.pdf.
STOA (2018). Language equality in the digital age – Towards a Human Language Project. STOA
   study (PE 598.621), IP/G/STOA/FWC/2013-001/Lot4/C2. URL: https://data.europa.eu/doi/10
   .2861/136527.
6 Interoperable Metadata Bridges to the wider Language Technology Ecosystem                   127

Straňák, Pavel, Ondřej Košarko, and Jozef Mišutka (2019). “CLARIN-DSpace repository at LIN-
    DAT/CLARIN : LINDAT/CLARIN FAIR repository for language data”. In: The grey Journal
   – International Journal on Grey Literature 16, pp. 52–61.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Ax-
    ton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E.
    Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier
    Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alas-
    dair J.G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A.C ’t Hoen,
    Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert
    Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik,
    Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris
   A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waag-
    meester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons (2016). “The
    FAIR Guiding Principles for Scientific Data Management and Stewardship”. In: Scientific Data
    3. DOI: 10.1038/sdata.2016.18. URL: http://www.nature.com/articles/sdata201618.
Wittenburg, Peter, Nuria Bel, Lars Borin, Gerhard Budin, Nicoletta Calzolari, Eva Hajicova,
    Kimmo Koskenniemi, Lothar Lemnitzer, Bente Maegaard, Maciej Piasecki, Jean-Marie Pier-
    rel, Stelios Piperidis, Inguna Skadina, Dan Tufis, Remco van Veenendaal, Tamas Váradi, and
    Martin Wynne (2010). “Resource and Service Centres as the Backbone for a Sustainable Ser-
   vice Infrastructure”. In: Proceedings of the Seventh International Conference on Language Re-
    sources and Evaluation (LREC 2010). Valletta, Malta: ELRA. URL: http://www.lrec-conf.org
   /proceedings/lrec2010/pdf/679_Paper.pdf.
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
    Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick
   von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger,
    Mariama Drame, Quentin Lhoest, and Alexander M. Rush (2020). “Transformers: State-of-the-
    art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Meth-
    ods in Natural Language Processing: System Demonstrations. ACL, pp. 38–45. DOI: 10.1865
    3/v1/2020.emnlp-demos.6. URL: https://aclanthology.org/2020.emnlp-demos.6.
Zeng, Marcia Lei and Lois Mai Chan (2006). “Metadata Interoperability and Standardization - A
    Study of Methodology Part II: Achieving Interoperability at the Record and Repository Levels”.
    In: D-Lib Magazine 12.6. DOI: 10.1045/june2006-zeng. URL: http://www.dlib.org/dlib/june0
    6/zeng/06zeng.html.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
                    Part II
         ELG Inventory of
Technologies and Resources
Chapter 7
Language Technology Tools and Services

Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca, Jose Manuel
Gómez-Pérez, Miroslav Jánošík, Dimitris Galanis, Rémi Calizzano, Andis
Lagzdiņš, Milan Straka, and Ulrich Germann




Abstract At the time of writing, the European Language Grid includes more than
800 LT services of varied types, including machine translation (MT), automatic
speech recognition (ASR), text-to-speech synthesis (TTS), and text analysis rang-
ing from simple tokenisers and part-of-speech taggers through to complete named
entity recognition and sentiment analysis systems. This chapter gives a high-level
summary of the development of the ELG service catalogue over time and digs deeper
to discuss the process of service integration by looking at a few example services.



1 Introduction

The European Language Grid platform is able to support a wide variety of different
types of Language Technology tools and services (see Chapter 3 for a more detailed
description). Service types are classified based on the type of data they process as

Ian Roberts
University of Sheffield, UK, i.roberts@sheffield.ac.uk
Andres Garcia Silva · Cristian Berrìo Aroca · Jose Manuel Gómez-Pérez
Expert AI, Spain, agarcia@expert.ai, cberrio@expert.ai, jmgomez@expert.ai
Miroslav Jánošík
HENSOLDT Analytics GmbH, Austria, miroslav.janosik@hensoldt-analytics.com
Dimitris Galanis
Institute for Language and Speech Processing, R. C. “Athena”, Greece, galanisd@athenarc.gr
Rémi Calizzano
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany,
remi.calizzano@dfki.de
Andis Lagzdiņš
Tilde, Latvia, andis.lagzdins@tilde.lv
Milan Straka
Charles University, Czech Republic, straka@ufal.mff.cuni.cz
Ulrich Germann
University of Edinburgh, UK, ulrich.germann@ed.ac.uk


© The Author(s) 2023                                                                         131
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_7
132                                                 Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca et al.

                1200


                1000


                 800
    Services




                 600


                 400


                 200


                     0
                     20


                                 20


                                           0

                                                  20


                                                            1

                                                                       21


                                                                               21


                                                                                         1

                                                                                                 22


                                                                                                             22
                                          2




                                                            2




                                                                                      02
                 20


                             20


                                       20


                                                20


                                                         20


                                                                   20


                                                                             20




                                                                                             20


                                                                                                         20
                                                                                       2
                ar


                             n


                                       p

                                               ec


                                                       ar


                                                                   n


                                                                             p

                                                                                    ec


                                                                                            ar


                                                                                                         n
                          Ju


                                      Se




                                                                Ju


                                                                            Se




                                                                                                      Ju
               M




                                                     M




                                                                                           M
                                              D




                                                                                 D
Fig. 1 Number of tools and services integrated into the European Language Grid over time; the
grey shaded area denotes services whose integration is in progress at the time of writing and will
be complete by the time of publication



input – text, audio, image data, etc. – and what they produce as output – annota-
tions, text, audio, etc. This covers all the well-known service types such as Machine
Translation (MT – text in, text out), Automatic Speech Recognition (ASR – audio in,
text out), and Information Extraction/Text Analysis (IE – text in, annotations out),
but also allows for services such as entity detection in audio data (audio in, annota-
tions out), text-to-speech synthesis (TTS – text in, audio out), or optical character
recognition (OCR – images in, text out).
    Over the course of the original ELG EU project (Figure 1) the platform has grown
from around 100 services available in the initial alpha release in 2020 to over 500 at
the start of 2022 and almost 800 at the time of writing, with more being added all the
time. The early stages of the project concentrated on services supplied by the ELG
project consortium partners – such as ASR from HENSOLDT Analytics, MT from
the University of Edinburgh and Tilde, TTS from Tilde, and a wide variety of Text
Analysis services from Expert.AI, the University of Sheffield and DFKI (Roberts
et al. 2020). More recently, an increasing number of services have been supplied by
the ELG-funded pilot projects (see Part IV) and the platform has also begun to see
contributions from third parties with no direct connection to the ELG consortium
itself (Roberts et al. 2021, 2022). Of particular note is a set of over 500 MT services
covering all pairs of EU official languages from the Neural Translation for the EU
project, discussed in more detail in Section 2.1 One third of these services have been
integrated to date, with the remaining two thirds scheduled for integration during
April and May 2022 (the grey shaded region in the graph), bringing the total number

1   https://nteu.eu
                                                                      English
                                                                                German
                                                                                Italian
                                                                                Spanish
                                                                                               French
                                                                                               Dutch
                                                                                               Swedish
                                                                                               Finnish
                                                                                               Polish
                                                                                               Czech
                                                                                               Greek
                                                                                               Portuguese
                                                                                                                                      Danish
                                                                                                                                      Bulgarian
                                                                                                                                                  Romanian
                                                                                                                                                  Estonian
                                                                                                                                                  Latvian
                                                                                                                                                  Slovenian
                                                                                                                                                  Croatian
                                                                                                                                                  Lithuanian
                                                                                                                                                  Slovak
                                                                                                                                                  Hungarian
                                                                                                                                                                                    Maltese
                                                                                                                                                                                    Irish
                                                                                                                                                                                              Total A
                                                                                                                                                                                                        Total B (24 langs.)
                                                                                                                                                                                                                               Others (69 langs.)   Total
                                            Part-of-Speech Tagging       8      3    3    3    3    4    3    3    2    2     2   3   3     2     2    2    2   2   2   1   2   1   1   1     60        15                     34                    109
                                            Morphology                   5      2    2    2    2    2    3    3    1    1     1   1   3     1     1    5    1   1   1   1   1   1   1   1     43        13                     27                     83
                                            Lemmatization                3      2    2    2    2    2    2    2    1    1     1   2   2     1     1    2    1   1   1   1   1   1   1   1     36        11                     32                     79
                                            Tokenization                 6      4    3    2    2    3    1    1    1    1     1   1   1     1     1    2    1   1   1   1   1   1   1   1     39        10                     27                     76




                             Linguistic
                                            Sentence splitting           1      1                   1                                                                                          3                                                       3




                           pre-processing
                                            Chunking                     1                                                    1                                                                2                                                       2
                           Total pre-processing                       24        12   10   9    9    12   9 9 5 5 6 7 9                      5 5 11 5 5 5 4 5 4 4 4 183 49                                                     120                    352
                                                                                                                                                                                                                                                            7 Language Technology Tools and Services




                           + Classification                           16         6   16   5    4     3   3 2 3 2 4 3 2                      2 2 1 1 1 2 1 2 2       83 8                                                       40                    131
                           + Entity annotation                        17         7    4   4    6     5   5 2 1 2 3 2 1                      1 2        1   1 1      65 14                                                      16                     95




           Text analysis
                           + Linking & disambiguation                  7         2    3   4    4     1             1                                                22 2                                                        5                     29
                           + Sentiment/Opinion mining                 13         3    2   2    2     1   1   1   1 2                                 1 1            30                                                         10                     40
                           + Text transformation                       5         3    1   2    2     1             1                                                15 1                                                        5                     21
                           + Parsing                                   1         2    1   1    1     1   1 1 1 1 2 1 1                      1 1 1 1 1 1 1 1 1 1 1 26 10                                                        27                     63
                           + Other text analysis                      14         8    2   3    1     2   6 6     3 1 2                                              48 4                                                        5                     57
           Total Text Analysis                                        97        43   39   30   29   26   25   20   11   10   19 18 15 9           10   13    7 8 10 6 9 8 5 5 472 88 228                                                             788
           + Machine Translation into …                               90        42   27   31   29   35   32   33   36   36   24 27 27 32          29   26   26 28 25 26 25 26 24 24 760 88 83                                                        931
           + Speech recognition & analysis                             2         2    2    3    2    2    1    2    2    1    2                    2    1    2        1              27 9 21                                                          57
           + Other services                                           11         7    5    4    8    4    2    1    4    2    3 2 4 4              2    3    4 2 2 4 2 2 2 1 85 10 77                                                                172
           Grand Total                                               200 94 73 68 68 67 60 56 53 49 48 47 46 45 43 43 39 38 37 37 36 36 31 30 1344 195 409 1948
Table 1 A snapshot of all services in the ELG platform, grouped by function and supported language. This includes all services integrated as at the end of March
2022, plus 368 additional MT services whose integration is ongoing. EU official languages (type A) are listed individually; type B represents other languages
used in the EU, accession candidate countries, or EEA/EFTA members; “others” refers to languages from the rest of the world. For Machine Translation, the
columns in this table represent the target language, see Table 2 for a breakdown by source.
                                                                                                                                                                                                                                                            133
134                            Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca et al.

of integrated service entries in ELG up to at least 1,148 by June 2022. We hope this
trend will accelerate now that the third platform release is complete.
    Furthermore, the figure of 1,148 hides the fact that a number of services combine
several different functions (such as tokenisation, sentence splitting, part-of-speech
tagging, entity detection, linking and disambiguation) into a single process and/or
offer the same function in more than one language. Counting each language/func-
tion pair individually gives a more informative picture of the scope and coverage
of ELG. For example, the platform currently provides one service that does depen-
dency parsing for Portuguese; it also provides one service that does lemmatisation for
Portuguese. The user who is looking for these two functions does not care whether
they are implemented by one service or by two, only whether or not the European
Language Grid can meet their needs.
    By this measure, as of the end of March 2022, ELG offers 1,576 distinct ser-
vice function/language combinations – already exceeding the 1,300 predicted by the
project in mid-2021 (Rehm et al. 2021) – and is on track to offer at least 1,948 by
June, which are summarised in Table 1. Reading from the bottom up, the 1,948 total
breaks down into 931 MT (47.7% of the total), 788 text analysis (40%), 57 speech
recognition and audio analysis, and 172 services of other types such as text to speech
and OCR. The middle section of Table 1 breaks the 788 text analysis services down
into broad sub-categories, and the top section breaks the largest sub-category (lin-
guistic pre-processing) down into individual functions.
    The largest single category of services is MT, with 770 catalogue entries repre-
senting 931 actual translation services (since some of the models are multilingual,
with the same endpoint accepting input in several different languages and translat-
ing them all to the same target). The available text analysis services range from
low-level text processing tasks such as tokenisation, part-of-speech tagging or mor-
phological analysis, through named entity annotation and on to higher-level services
such as parsing, sentiment analysis and entity linking against knowledge bases. De-
pendency parsing in particular is supported for 60 languages courtesy of the UD-
Pipe parser from Charles University in Prague. For speech, the platform currently
supports speech transcription for 31 languages thanks to tools from HENSOLDT
Analytics and Tilde, alongside other speech processing tools such as the keyword
spotting tool described in Section 3.
    Breaking the numbers down on another dimension, the ELG platform now hosts
at least one service providing support for each of 114 distinct languages. English is
unsurprisingly the most highly represented, but there is good support for other major
EU languages – German, French, Spanish, and Italian all have support for at least
20 service functions aside from machine translation – and in total 28 languages have
support for at least ten functions.
    Of course there is a long tail on both axes, with 16 of the 48 distinct service
functions available in only one language each and 25 in fewer than five languages.
On the other hand 39 out of the 114 languages are supported by only one function,
and 51 by fewer than three. Full multilinguality is still in the future, but for the
languages with larger numbers of speakers at least, significant progress has been
and is being made.
7 Language Technology Tools and Services                                                                                       135




                                                                                                            Total Other (7)
                                                                                                            Total B (20)
                                                Portuguese



                                                                        Lithuanian
                                                                        Hungarian
                                                Romanian

                                                Slovenian
                        Bulgarian




                                                                        Estonian



                                                                                        Croatian
                        Swedish
                        German




                        Spanish




                                                                                        Maltese
              English




                                                                        Latvian




                                                                                                            Total A
                        Finnish




                                                Danish
                                                French




                                                                                        Slovak
                                                Italian
                        Polish
                        Czech

                        Dutch




                                                                                        Greek
Target →




                                                                                        Irish
Source ↓                                                                                                                      Total
English                 7 4   4 3 4 5   3   3   4   3   2   2   3   3   3   3   3   2   2   2   2   2   2   71   2011 102
German           6        1   1 2 3 1   1   1   2   1   1   1   1   1   1   1   1   2   2   1   1   1   1   34    7 3 44
Czech            5      1     2 1 1 1   2   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   30    2 5 37
Polish           4      1 2     1 1 1   2   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   29    2 6 37
Dutch            2      2 1   1   1 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   25    4 2 31
Finnish          4      3 1   1 1   3   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   31    1 1 33
Swedish          4      1 1   1 1 3     1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   29    5 1 35
Bulgarian        3      1 2   2 1 1 1       1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   27    2 3 32
Spanish          3      1 1   1 1 1 1   1       1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   25    3 1 29
Romanian         2      1 1   1 1 1 1   1   1       1   1   1   1   1   1   1   1   1   1   1   1   1   1   24      1 25
French           3      1 1   1 1 2 1   1   1   1       1   1   1   1   1   1   1   1   1   1   1   1   1   26    1 4 31
Slovenian        2      1 2   2 1 1 1   1   1   1   1       1   1   1   1   1   1   1   1   1   1   1   1   26      2 28
Italian          2      1 1   1 1 1 1   1   1   1   1   1       1   1   1   1   1   1   1   1   1   1   1   24      4 28
Danish           2      1 1   1 1 1 2   1   1   1   1   1   1       1   1   1   1   1   1   1   1   1   1   25    4 1 30
Portuguese       3      1 1   1 1 1 1   1   1   1   1   1   1   1       1   1   1   1   1   1   1   1   1   25      1 26
Latvian          3      1 1   1 1 1 1   1   1   1   1   1   1   1   1       1   1   1   1   1   1   1   1   25      2 27
Estonian         3      1 1   1 1 1 1   1   1   1   1   1   1   1   1   1       1   1   1   1   1   1   1   25      1 26
Lithuanian       2      1 1   1 1 1 1   1   1   1   1   1   1   1   1   1   1       1   1   1   1   1   1   24      2 26
Hungarian        2      1 1   1 1 1 1   1   1   1   1   1   1   1   1   1   1   1       1   1   1   1   1   24      1 25
Croatian         2      1 1   1 1 1 1   1   1   1   1   1   1   1   1   1   1   1   1       1   1   1   1   24         24
Slovak           2      1 1   1 1 1 1   1   1   1   1   1   1   1   1   1   1   1   1   1       1   1   1   24         24
Greek            2      1 1   1 1 1 1   1   1   1   1   1   1   1   1   1   1   1   1   1 1         1   1   24         24
Irish            2      1 1   1 1 1 1   1   1   1   1   1   1   1   1   1   1   1   1   1 1 1           1   24         24
Maltese          2      1 1   1 1 1 1   1   1   1   1   1   1   1   1   1   1   1   1   1 1 1 1             24         24
Total A       65 32 29 29 26 31 30 27 25 27 26 26 24 26 25 25 25 25 25 25 24 24 24 24 669 51 52 772
Total B       16 4 2 3 4 1 1 2 3                                                                            36 20 12            68
Total Other      9 6 5 4 5 1 1 3 3 2 3 2 3 1 2 1 1 1 1                                      1               55 17 19            91
Grand Total 90 42 36 36 35 33 32 32 31 29 29 28 27 27 27 26 26 26 26 25 25 24 24 24 760 88 83 931

Table 2 A snapshot of supported MT language pairs as at the end of March 2022, with the addition
of the remaining NTEU services for all pairs of EU official languages



2 Machine Translation

The ELG platform includes MT tools for 781 individual source/target language pairs,
totalling 931 distinct services. Table 2 shows the breakdown; while English still
dominates, it is much less ubiquitous than in the past, with only 21% of services
involving English (102 from English, 90 into English, for a total of 192 out of the
931 available services). All pairs of EU official languages (“type A” in Table 2)
are supported. In addition there is support for unofficial or regional European lan-
guages such as Basque, Galician and Luxembourgish and languages of accession
candidates or free trade partners such as Icelandic, Norwegian2 and Serbian3 as well
as languages important for trade and political reasons such as Modern Standard Ara-
bic, Hindi, Ukrainian and Russian.
2   Both Nynorsk and Bokmål varieties.
3   Both Latin and Cyrillic script.
136                             Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca et al.

    In addition to the MT services contributed by the ELG consortium partners Tilde
(Pinnis and Bergmanis 2020) and University of Edinburgh (Junczys-Dowmunt et al.
2018; Germann et al. 2020; Germann 2020), two contributors in particular deserve
a special mention here: the OPUS-MT ELG pilot project and the EU project Neural
Translation for the European Union (NTEU).
    The OPUS-MT ELG pilot project (Chapter 24, p. 325 ff., also see Tiedemann and
Thottingal 2020) is responsible for 312 of the total 931 translation service options. To
reduce the overall load on the ELG computing infrastructure, many of these language
pairs are supported by multilingual models, where a single Docker container can ac-
cept input and/or produce output in many related languages. For example, there is a
single OPUS model for “West Germanic”, which can translate either way between
any pair of English, German, Dutch, Luxembourgish, Afrikaans, Low Saxon, Gron-
ings and Hunsrik. Some language pairs are supported by multiple models with dif-
ferent performance characteristics, for example, English to German is supported by
a monolingual English-German model, a one-to-many “English to West Germanic”,
and the aforementioned many-to-many West Germanic model. Which model is most
appropriate for a given task will vary, for example, if the input is known to be good-
quality English then the monolingual model may be best, but if the input is a mix of
languages, or English written by native speakers of other Germanic languages, then
the multilingual model may be more accurate. Enabling users to test out different
services on their own real data and switch between them with no technical changes
to their code is one of the greatest benefits of the ELG approach.
    NTEU is a project with a different focus, it was funded to produce high-quality
translation tools for all possible pairs of EU official languages, to reduce the need
for relay translation through a better-resourced language such as English (Bié et
al. 2020; Garcı́a-Martı́nez et al. 2021). This gives a total of 552 translation models
(24 source languages each translating into the other 23 targets), so to spread the
load of developing the models, NTEU involved three partner organisations, each
responsible for models translating into eight target languages (one third of the total
EU24). At the time of writing, one of the three sets of models has been published
as ELG-integrated services and the other two sets are expected to be available by
the time this book is published. The inclusion of these services marks an important
milestone for ELG for two key reasons. First it shows the strong commitment of
ELG to full multilinguality in the European Digital Single Market, and second it is
the single largest contribution to the ELG platform originating outside the original
ELG project consortium and pilot project ecosystem, demonstrating that ELG truly
is a platform for the whole EU language technology community.



3 Automatic Speech Recognition

For automatic speech recognition, ELG currently hosts 48 services covering 30 lan-
guages and dialects. The majority of these have been provided by HENSOLDT An-
alytics, the speech recognition specialist in the ELG project consortium. In addition,
7 Language Technology Tools and Services                                           137

there have also been important contributions from Tilde for the Baltic languages,
and from two of the pilot project organisations: Elhuyar for Basque (see Chapter 15,
p. 271 ff.) and Lingsoft for Scandinavian languages (see Chapter 20, p. 301 ff.). Ling-
soft have also begun to deliver domain-specific ASR services, for example a service
tuned to recognise clinical speech in Finnish. As general purpose ASR systems in-
creasingly become commodities, the creation and provision of domain-specific mod-
els provides an important niche for smaller ASR providers.
    These organisations are all commercial service providers; though the tools them-
selves are based on open source frameworks such as Kaldi4 , the models are the pro-
prietary intellectual property of the respective provider.



3.1 Case Study: Speech Tools from HENSOLDT

In addition to the actual ASR, the components provided by HENSOLDT also per-
form several preprocessing steps: audio is downsampled and converted to the native
format of the respective models (typically 16kHz, 16 bit, mono, signed). Segmenta-
tion and classification of the input audio is carried out next. Any segment classified
as containing an insufficient amount of speech is discarded and not processed by the
ASR. Disfluencies and non-speech within segments identified as audio-segments are
processed by the ASR system via specific non-speech models. Segmentation as well
as classification are parameterised and can be adapted to specific audio conditions
(the components provided within ELG use standard settings). Processing within the
HENSOLDT ASR is staged in a pipelined manner for optimal throughput. Process-
ing parameters can be employed to balance processing speed and accuracy. Like
Lingsoft, HENSOLDT also provides domain-specific models which can be included
in the respective Docker components. The ASR engine itself is aware of processing
throughput as well as of the various models used. It can be adjusted to provide re-
altime processing as well as to reload different sub-models as soon as they become
available. While the current services use one standard model, this allows for future
updates of vocabularies and language models in a transparent manner. Output of the
HENSOLDT ASR component can be provided in 1-best, n-best or lattice formats.
The former is currently used in the deployed components, however, lattice-based out-
put is used indirectly for use of the ASR component for keyword-spotting (KWS)
applications only. A sample result of the detection of keywords via ASR can be seen
in Figure 2.




4   http://kaldi-asr.org
138                             Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca et al.




Fig. 2 Example of the word “court” having been detected as a keyword using HENSOLDT ASR



4 Text Analytics

After the set of MT services, the second largest group of services in the ELG plat-
form are concerned in one way or another with the analysis and annotation of text, as
discussed in Section 1. These cover a wide range from low-level text pre-processing
tasks such as tokenisation and sentence splitting, through named entity annotation
and linking tools (in many languages and domains), to dependency parsing, sum-
marisation, sentiment analysis, and special purpose services such as the detection of
misinformation or hate speech, and spelling and grammar checking.
   Text analysis services have been provided by most members of the ELG project
consortium, Expert.AI contributing their Cogito Discover toolkit, the University of
Sheffield providing many services based on their GATE framework, Charles Uni-
versity providing their UDPipe dependency parser and other tools (e. g., Straka and
Straková 2020; Straka et al. 2019b; Straka 2018; Straková et al. 2019; Straka et al.
2019a) and HENSOLDT (Dikici et al. 2019), ILSP (e. g., Prokopis and Piperidis
2020; Pontiki et al. 2018; Papanikolaou et al. 2016; Pontiki and Papageorgiou 2015)
and DFKI (e. g., Schulz et al. 2022; Aksenov et al. 2021; Leitner et al. 2019) pro-
viding a variety of tools from their respective inventories. In addition, several of the
pilot projects have contributed services in this class, notably
  • European Clinical Case Corpus (Chapter 17, p. 283 ff.) – Fondazione Bruno
    Kessler. Clinical named entity recognisers in six languages.
  • Italian EVALITA Benchmark Linguistic Resources, NLP Services and Tools
    (Chapter 19, p. 295 ff.) – University of Turin. A variety of services based on
    systems that participated in the various EVALITA shared tasks throughout the
7 Language Technology Tools and Services                                           139

    years such as misogyny and hate speech detection and gender prediction, all in
    the Italian language.
  • Lingsoft Solutions as Distributable Containers (Chapter 20, p. 301 ff.) – Ling-
    soft. General text analysis, proofing tools (spelling and grammar checking) and
    morphology analysis, in English and Scandinavian languages. This includes re-
    gional variations, such as distinct services for Swedish as used in Sweden and
    Swedish as used in Finland, and domain variations with specific services for
    medical domain text.
  • Universal Semantic Annotator (Chapter 28, p. 349 ff.) – Sapienza University
    of Rome. This service performs word sense disambiguation, semantic role la-
    belling and parsing for a wide variety of different languages.



4.1 Case Study: Cogito Discover from Expert.AI

Cogito Discover is Expert.AI’s scalable software platform for automatic semantic
metadata generation and auto-classification that can be easily integrated in the pro-
duction environment of document-processing applications or workflows. It can be
deployed on premise and in cloud environments and is available for both Linux and
Windows systems. Cogito Discover services that are included in ELG are:
  • Language detection: Identify the main language used in a text.
  • Part-of-speech annotation: Annotations at different levels (token, word/com-
    pound word, group, clause, sentence) with grammatical types.
  • Named Entity Recognition: Annotation of entities, i. e., people, organisations,
    places, known concepts, unknown concepts and also tags, i. e., URLs, email
    addresses, phone numbers, addresses, dates, time, measures, money, percentage,
    file folder.
  • Semantic annotation: This service returns the concepts spotted in a text which
    are modelled in the Cogito Discover knowledge graph.
  • Lemmatisation: This service returns the lemma of each concept spotted in the
    text that is modelled in the Cogito Discover knowledge graph.
  • Keyword extraction: Annotation of the most relevant information, i. e., main
    syncons, main lemmas, main multiword expressions.
  • Sentiment analysis: Provides a sentiment score (positive or negative) for the
    entities recognised in the text, and an overall score for the whole set of entities
    in the document.
  • Summarisation: Annotation of the most relevant information, i. e., main syncons,
    main lemmas, main multiword expressions, main sentences and main domains.
  • Categorisation: Classify documents using the IPTC taxonomy.
   Most services are available in 12 languages: English, Italian, Spanish, German,
French, Dutch, Portuguese, Chinese, Arabic, Russian, Japanese and Korean.
140                                Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca et al.

   For its deployment in ELG, Expert.AI generated a Docker image containing a
Cogito Discover installation, the linguistic packages, and a general adapter that man-
ages the communication between the ELG platform and Cogito Discover. The gen-
eral adapter was developed using the ELG Spring Boot Starter described in Chapter 4
(Part I, p. 67 ff.)5 , which makes it as easy as possible to create ELG-compliant tools
in Java using Spring Boot.



4.2 Case Study: GATE from University of Sheffield

The University of Sheffield has been developing and maintaining the GATE frame-
work for Natural Language Processing6 for over 20 years. The basic framework
is open source software written in Java and comes with a wide variety of plugins,
some implementing specific NLP algorithms and some providing the generic base
on which other specific rule-based and machine learning-based tools can be built.
   The GATE ecosystem includes its own software-as-a-service platform called
GATE Cloud (Tablan et al. 2013). An early focus of Sheffield’s work in the ELG
project was to develop a bridge to GATE Cloud, i. e., a proxy that accepts ELG API
requests and dispatches them to a service endpoint on GATE Cloud, translating the
resulting annotations into the ELG API response format. The development of this
bridge has enabled the rapid deployment of many GATE Cloud hosted services into
the ELG catalogue with little demand on the computing capacity of the ELG plat-
form itself. At the time of writing, there are 66 GATE-based services integrated in
ELG via the bridging proxy.
   However, GATE Cloud itself has rate limits, so alongside the bridge component,
Sheffield has developed a generic tool that can take any NLP application built against
the GATE framework and bundle the application and all the plugins on which it de-
pends as a Docker image that can run the application in-process within the ELG
infrastructure. This mechanism has been used to wrap up certain particularly signifi-
cant GATE-based applications so they can run directly in the ELG Kubernetes cluster
and take advantage of the ELG platform’s auto-scaling capabilities (see Chapter 5).
   As the ELG EU project draws to a close, things have started to come full circle, as
a number of recent additions to GATE Cloud have in fact been implemented as ELG-
compatible Docker images, with a bridge in the other direction to enable a GATE
application to call out to an endpoint that exposes the ELG internal LT service API.
Some of these ELG-compatible images have been contributed back to ELG.
   In addition, Sheffield has promoted the use of ELG-compatible services and
Docker images in a number of other projects, notably the Horizon 2020 projects
WeVerify7 and RISIS28 . Many of Sheffield’s contributions to these projects have
5 https://gitlab.com/european-language-grid/platform/elg-spring-boot-starter
6 General Architecture for Text Engineering, https://gate.ac.uk, see Cunningham et al. (2013).
7 Wider and Enhanced Verification For You, https://weverify.eu, see Marinova et al. (2020).

8 Research Infrastructure for Science and Innovation Policy Studies, https://www.risis2.eu, see

Reale et al. (2019).
7 Language Technology Tools and Services                                                  141

been implemented as ELG-compatible Docker images, with bridging components
written for those projects to act as clients of the ELG API. The same mechanism has
been used as part of a long-term collaboration between the University of Sheffield
and King’s College London, to integrate medical domain LT services developed in
Python at King’s into an existing GATE-based processing workflow. The use of the
ELG standardised API makes it easy to integrate a variety of services implemented
in different programming languages in a minimally-invasive way.



4.3 Case Study: Microservices At Your Service

With the third release in 2022, the ELG platform has begun to see contributions
from third parties beyond the initial ELG consortium and pilot projects. One notable
source is the project Microservices At Your Service9 , funded by the European Com-
mission’s Connecting Europe Facility (CEF) programme and led by Lingsoft (one of
the organisations funded for a pilot project in the first ELG open call, see Chapter 20,
p. 301 ff.). The project describes its mission as “bridging the gap between NLP re-
search and industry” and it aims to identify open source text analysis tools that could
benefit the community, package them as Docker images, and publish them for wider
use. The project has selected the ELG platform as its primary vehicle for publication
of the tools, and uses the ELG API as its standard specification for interoperability.
   The project concentrates primarily on Finnish, Estonian, Icelandic, Spanish and
Portuguese, plus some tools for minority languages from the same regions such as
Faroese, Galician and Catalan. So far more than 14 services have been published,
including:
    • A proxy to the Finto-AI subject indexing service10 , in Finnish, Swedish and
      English (Suominen et al. 2022)
    • Named entity recognition tools for Swedish and Norwegian, originally from the
      respective national libraries of the two countries (Kummervold et al. 2021)
    • A tokeniser and morphological analysis tool for Estonian (Kaalep and Vaino
      2001)
    • A variety of tools for Icelandic from the University of Reykjavík, including a
      tokeniser, part-of-speech tagger, shallow parser and named entity recogniser, as
      well as machine translation models between Icelandic and English
   One of the Icelandic services, a part-of-speech tagger and lemmatizer, is shown
in Figure 3.




9 https://www.lingsoft.fi/en/microservices-at-your-service-bridging-gap-between-nlp-research-a
nd-industry
10 https://ai.finto.fi
142                                Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca et al.




Fig. 3 Icelandic lemmatizer and part-of-speech tagger from Microservices At Your Service



5 Other Service Types

Right from the start of the ELG project, it was clear that the three principal service
classes (ASR, MT, Text Analytics), while significant, would never be exhaustive.
An important goal of ELG was to remain flexible enough to be able to easily inte-
grate new classes of services and tools that had not been foreseen in the original
proposal. The API specifications were designed with this flexibility in mind, being
based solely on the kinds of data each service expects and returns, rather than placing
any requirements on what the service does with that data.
   Three classes of “other” services have emerged since the beginning of the project:
  • Text-to-speech services that take text and synthesise audio.
  • Audio analysis services that take audio input and return standoff annotations
    over time segments of the audio stream.
  • Image analysis services, in particular optical character recognition (OCR).
    Text-to-speech services have been provided by Tilde within the ELG project con-
sortium (for Latvian and Lithuanian), and by the Elhuyar pilot project (for Basque).
The audio analysis services are the keyword spotting tools from HENSOLDT Ana-
lytics described along with their speech recognition systems in Section 3.
    The University of Sheffield has contributed a multilingual image OCR service de-
veloped as part of the Horizon 2020 EU project WeVerify. The service is based on a
multi-step pipeline of neural models, first running a segmentation model to identify
regions within the image that contain text, then a classifier to identify the writing
system and language of each text block, and finally an appropriate text recognition
model on each block depending on the identified script (Arabic, Bengali-Assamese,
Chinese, Latin, Devanagari, Kanna, Hangul or Cyrillic). An example can be seen
in Figure 4. The models have been deliberately designed not to use the “attention”
mechanism typical of other deep neural models, as this was found to give only
marginal improvements in performance at the cost of significantly increased mem-
ory and compute requirements.
    Part of the reason for ELG funding the open call for pilot projects was precisely
to elicit suggestions of new classes of services that were not previously known to
the project consortium. Two pilots in particular delivered on this: Text2TCS (Sec-
tion 5.1) and Coreon’s MKS as LLOD (Section 5.2).
7 Language Technology Tools and Services                                                    143




Fig. 4 The Multilingual OCR service showing detection of two blocks of text in different scripts
(the bounding boxes are part of the “try out” UI, they have not been added to this figure)
              What is coronavirus?
              There are many different kinds of coronavirus (CoV). Known types of coronavirus include:                                                                                                                                                                                                   Annotations
                                                                                                                                                                                                                                                                                                           c01: coronavirus
              SARS coronavirus (SARS-CoV), which was first detected in 2003;
              MERS coronavirus (MERS-CoV), which first occurred in humans in 2012.                                                                                                                                                                                                                          c02: types
              In humans, certain types of coronavirus can cause illnesses ranging from a common cold to severe pneumonia. Other types
144           of coronavirus can cause a variety of infectious diseases in animals. Some types of coronavirus can be transmitted from
                                                                                                                                                     Ian Roberts, Andres Garcia Silva, Cristian       coronavirus; Aroca
                                                                                                                                                                                          c03: SARSBerrìo          MERS
                                                                                                                                                                                        coronavirus; coronavirus SARS-CoV-2
                                                                                                                                                                                                                            et al.
              animals to humans.
                                                                                                                                                                                                                                                                                                           c04: detected; identified

              The coronavirus identified in China in late 2019 was never before detected in humans.                                                                                                                                                                                                         c05: humans
                                                                                                                                                                                                                                                                                                           c06: illnesses; disease
              On 11 February 2020, WHO assigned the official name COVID-19 (coronavirus disease 2019) to this disease. The designation
                                                                                                                                                                                                                                                                                                           c07: common cold
              for the pathogen (germ) was changed from 2019-nCoV to SARS–CoV-2.
                                                                                                                                                                                                                                                                                                           c08: severe pneumonia
                                                                                   c23: infection with coronavirus SARS–CoV-2
              How dangerous is coronavirus SARS-CoV-2?                                                                                                                                                                                                                                                     c09: infectious diseases
              Similar to seasonal influenza, it affects in particular elderly personsName      Value
                                                                                     and persons  with a weakened immune system.
                                                                                                                                                                                                                                                                                                           c10: animals
              In more severe cases, infection with coronavirus SARS–CoV-2 can, id             c23cause pneumonia or severe breathing
                                                                                    for example,
              difficulties.                                                           term      infection with coronavirus SARS–CoV-2                                                                                                                                                                        c11: transmitted

                                                                                relations  [ … ]                                                                                                                                                                                                          c12: China
              How does coronavirus SARS-CoV-2 spread?
                                                                                                                                                                                                                                                                                                           c13: WHO
              Person-to-person spreading is the most frequent path of infection with coronavirus SARS–CoV-2. Contagion can be caused
              by:                                                                                                                                                                                                                                                                                          c14: COVID-19
                                                                                                                                                                                                                                                                                                           c15: designation
              Mucus and saliva
              Urine and faeces                                                                                                                                                                                                                                                                             c16: pathogen
              Body fluids like for example blood                                                                                                                                                                                                                                                            c17: 2019-nCoV
                                                                                                                                                                                                                                                                                                           c18: SARS–CoV-2
              Features                                                                                                                                                                                                                                                                                     c19: dangerous
                Name                      Value                                                                                                                                                                                                                                                            c20: seasonal influenza
                Graph Link https://live.european-language-grid.eu/temp-storage/retrieve/01h5dwxa-cn5gqz6vmgra6f4t9adkut4gfs0ef                                                                                                                                                                            c21: elderly persons; persons
                TBX Link                  https://live.european-language-grid.eu/temp-storage/retrieve/01h5dwxa-jfkfprgapwsz9i4uea8pxchf6iu5f 
                                                                                                                                                                                                                                                                                                           c22: immune system
                                                                                                                                                                                                                                                                                                           c23: infection with coronavirus
                                                                                                                                                                                                                                                                                                         SARS–CoV-2

Fig. 5 Text2TCS service results in the “try out” GUI, showing links to thec24:termbase
                                                                               pneumonia and graph
                                                                                                                                                                                                                                                                                                           c25: severe breathing difficulties
                                                                                                                                                                                                                                                                                                           c26: spread
                                                                          c11
                                                                    ['transmitted']                                                                                                                                                                                                                        c27: Contagion
                                                                                                                                      activityRelation

                                                                                                                                                                                 c10
                                                                                                                                                                                                                                                                                                           c28: Mucus                                            c15
                                                                                                                                                                             ['animals']                                                                                                                                                                   ['designation']

                                                                                                                                                                        genericRelation genericRelation                                                                                                    c29: saliva           genericRelation     propertyRelation

                                                                                                                                                                      c6                       c31                                                                                                                                     c16
                                                                                                                                                            ['illnesses', 'disease']       ['faeces']
                                                                                                                                                                                                                                                                                                           c30: Urine              ['pathogen']
                                                                                                                                                                                                                                                                                                                                                    genericRelation          propertyRelation


                                                                         genericRelation                  genericRelation genericRelation            genericRelation               genericRelation            genericRelation      genericRelation                                                                                       genericRelation

                     c22
                                                                             activityRelation
                                                                                                           c29              c30              c32                                         c9                          c7                      c8                                                            c31: faeces                 c18                      c2
               ['immune system']                                                                        ['saliva']       ['Urine']      ['Body fluids']                       ['infectious diseases']          ['common cold']     ['severe pneumonia']                                                                           ['SARS–CoV-2']            ['types']

 activityRelation                   propertyRelation                                               genericRelation    genericRelation       genericRelation                        genericRelation               genericRelation               genericRelation                                             c32: Body fluidsassociativeRelation genericRelation                      genericRelation

                                           c21                                                             c33           c28                                           c27                                            c3                                                           c25                                                                    c17                       c14
                                                                                                                                          associativeRelation                                                                                                                                                                originationRelation
                              ['elderly persons', 'persons']                                            ['blood']     ['Mucus']                                    ['Contagion']       ['SARS coronavirus', 'MERS coronavirus', 'coronavirus SARS-CoV-2']            ['severe breathing difficulties']
                                                                                                                                                                                                                                                                                                           c33: blood                                ['2019-nCoV']              ['COVID-19']

                    associativeRelation        genericRelation                                                                                            activityRelation                              activityRelation                                  genericRelation                           genericRelation

                                                                   c4                                                                                                                               c26                                             c24
                                 associativeRelation                                                                                                                                                                  associativeRelation                                    associativeRelation           genericRelation
                                                        ['detected', 'identified']                                                                                                               ['spread']                                    ['pneumonia']

                                                                    activityRelation                                                                                                                                       genericRelation           associativeRelation

                             c20                                                          c12                   c5                                                                                                                                                       c23
                                                                   activityRelation
                     ['seasonal influenza']                                            ['China']            ['humans']                                                                                                                               ['infection with coronavirus SARS–CoV-2']
                     BACK                                                                  spatialRelation originationRelation

                                                                                                   c1                  c13                c19
                                                                                             ['coronavirus']         ['WHO']         ['dangerous']




Fig. 6 The termbase graph generated from the sample input text (Figure 5)



5.1 Pilot Project: Terminological Concept Systems from Natural
            Home                      Technologies                           Resources                              Events                  Documentation                                   About ELG                                                                                                                                                       Contact us



    Language Text from University of Vienna
                                    The European Language Grid has received funding from the European Union’s
                                    Horizon 2020 research and innovation programme under grant agreement
                                    № 825627 (ELG)

               © 2022 ELG Consortium Terms of Use

The Text2TCS project (see Chapter 18, Part IV, p. 289 ff.) aimed to develop a tool for
deriving terminological concept systems from natural language text. This required
the generation not only of typical standoff annotations representing the mentions
of the detected terms in the source text, but also two additional output files for the
termbase in TBX format11 and a visualisation of the terminology as a PNG image.
   These additional outputs did not naturally fit the JSON-based data interchange
formats of the ELG API. It would have been possible to force them into this format
by, for example, encoding the PNG data in base 64 encoding, but instead the ELG
team took this as the impetus to introduce the “temporary storage” helper service
for use by LT service containers. The operation of the temporary storage service
is very simple. LT services can send arbitrary binary data to a well-known URL
http://storage.elg/store (a private host name that resolves only within the
ELG Kubernetes cluster), and will receive in return a publicly-resolvable URL which
can be returned to the caller of the LT service for them to use to retrieve the same

11   https://www.tbxinfo.net
7 Language Technology Tools and Services                                             145

data. Storage URLs include a cryptographically-secure random token to make them
un-guessable, and they expire by default 15 minutes from their generation, at which
time the stored data is permanently deleted.
   Figures 5 and 6 show how this appears in the ELG portal when a user tests the
Text2TCS service using the “try out” mechanism.
   The temporary storage service provides an elegant solution to the problem of
allowing LT services to return binary data without introducing additional complexity
for the majority of services that do not have this requirement.



5.2 Pilot Project: MKS as Linguistic Linked Open Data from Coreon

The pilot project MKS as LLOD by knowledge management company Coreon (see
Chapter 23, Part IV, 319 ff.) is an interesting case that in some ways sits at the bound-
ary between services and resources. The aim of the project was to take Coreon’s exist-
ing knowledge representation systems, known as MKS for Multilingual Knowledge
System, and expose them as Linguistic Linked Open Data (LLOD). There is already
a (de jure and de facto) standard API for querying linked (open) data resources, i. e.,
the SPARQL query language12 , so rather than defining a new format under the ELG
umbrella, we decided to adopt the existing standard.
    For ELG, the question was how best to represent this kind of resource in the ELG
metadata scheme. On the one hand, the object that was being provided by Coreon
was conceptually a data resource, albeit one accessed via a query API rather than via
direct download, but on the other hand the technical method of integration would be
through providing a SPARQL service for users to query. The eventual solution was
in fact a mixture of both.
    The Coreon SPARQL endpoint was integrated into the ELG infrastructure and set
up so that SPARQL queries could be authenticated using access tokens issued by the
ELG Keycloak identity provider, exactly as for other ELG LT services. In parallel,
Coreon developed a “try out” UI to allow users to make test queries through the ELG
catalogue interface. The two were then tied together as follows:
 1. The “try out” UI was registered in its own right as a “service” in the ELG cata-
    logue, whose function is “resource access”.
 2. Each SPARQL endpoint was then registered as an individual “ELG-compatible
    Lexical or Conceptual Resource” (LCR), with a link to the “try out” UI as “this
    resource is queried by that service”.
   Logic was introduced in the ELG catalogue to recognise when a user visits an
ELG-compatible LCR that has an associated query service, and to inject the query
UI as a “try out” tab which is configured with the necessary information and access
token to be able to query the SPARQL endpoint (see Figure 7 for the final result).


12   https://www.w3.org/TR/sparql11-overview/
146                             Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca et al.




Fig. 7 Coreon SPARQL endpoint as an ELG-compatible Lexical/Conceptual Resource



6 Conclusions

Overall, the ELG project has succeeded in its aim to offer a broad variety of dif-
ferent service types covering many languages, and supplied by a range of different
providers both academic and industrial. All the major classes of LT services are well
represented in the ELG catalogue including ASR, MT and text analysis, with further
classes of interest emerging during the course of the project. The generic design of
the LT service execution APIs means that even services that do not exactly fit an
existing class can be easily accommodated in the ELG platform, for example the
HENSOLDT services for keyword spotting in audio required no API changes at all,
only an adaptation of the “try out” GUI mechanism.
   Inevitably, the majority of early contributions to the ELG platform were from the
original ELG project consortium members. This was expected and planned for in
7 Language Technology Tools and Services                                                   147

the original project proposal, and the pilot project funding system was designed to
help broaden the contributor pool more quickly by incentivising providers to adopt
the ELG formats and specifications. It has succeeded in this aim, and many more
details can be found in the various pilot project chapters in Part IV. As the funded
project draws to a close and the ELG platform begins to transition to its long term
sustainable mode of operation, we are seeing an increasing number of third-party
contributions from beyond the original consortium and pilot projects, which stands
the ELG in good stead for its sustainability as a platform over the coming years.



References

Aksenov, Dmitrii, Peter Bourgonje, Karolina Zaczynska, Malte Ostendorff, Julián Moreno-Schnei-
   der, and Georg Rehm (2021). “Fine-grained Classification of Political Bias in German News:
   A Data Set and Initial Experiments”. In: Proceedings of the 5th Workshop on Online Abuse
   and Harms (WOAH 2021). Ed. by Aida Mostafazadeh Davani, Douwe Kiela, Mathias Lam-
   bert, Bertie Vidgen, Vinodkumar Prabhakaran, and Zeerak Waseem. Bangkok, Thailand: ACL,
   pp. 121–131. URL: https://aclanthology.org/2021.woah-1.13.pdf.
Bié, Laurent, Aleix Cerdà-i-Cucó, Hans Degroote, Amando Estela, Mercedes Garcı́a-Martı́nez,
   Manuel Herranz, Alejandro Kohan, Maite Melero, Tony O’Dowd, Sinéad O’Gorman, Mārcis
   Pinnis, Roberts Rozis, Riccardo Superbo, and Artūrs Vasiļevskis (2020). “Neural Translation
   for the European Union (NTEU) Project”. In: Proceedings of the 22nd Annual Conference of
   the European Association for Machine Translation. Lisboa, Portugal: European Association for
   Machine Translation, pp. 477–478. URL: https://aclanthology.org/2020.eamt-1.60.
Cunningham, Hamish, Valentin Tablan, Angus Roberts, and Kalina Bontcheva (2013). “Getting
   More Out of Biomedical Documents with GATE’s Full Lifecycle Open Source Text Analytics”.
   In: PLOS Computational Biology 9.2, pp. 1–16. DOI: 10.1371/journal.pcbi.1002854.
Dikici, Erinç, Gerhard Backfried, and Jürgen Riedler (2019). “The SAIL LABS Media Mining
   Indexer and the CAVA Framework”. In: Interspeech 2019, 20th Annual Conference of the Inter-
   national Speech Communication Association. Ed. by Gernot Kubin and Zdravko Kacic. Graz,
   Austria: ISCA, pp. 4630–4631. URL: https://researchr.org/publication/DikiciBR19.
Garcı́a-Martı́nez, Mercedes, Laurent Bié, Aleix Cerdà, Amando Estela, Manuel Herranz, Rihards
   Krišlauks, Maite Melero, Tony O’Dowd, Sinead O’Gorman, Marcis Pinnis, Artūrs Stafanovič,
   Riccardo Superbo, and Artūrs Vasiļevskis (2021). “Neural Translation for European Union
   (NTEU)”. In: Proceedings of Machine Translation Summit XVIII: Users and Providers Track.
   Association for Machine Translation in the Americas, pp. 316–334. URL: https://aclanthology
   .org/2021.mtsummit-up.23.
Germann, Ulrich (2020). “The University of Edinburgh’s submission to the German-to-English
   and English-to-German Tracks in the WMT 2020 News Translation and Zero-shot Transla-
   tion Robustness Tasks”. In: Proceedings of the Fifth Conference on Machine Translation. ACL,
   pp. 197–201. URL: https://aclanthology.org/2020.wmt-1.18.
Germann, Ulrich, Roman Grundkiewicz, Martin Popel, Radina Dobreva, Nikolay Bogoychev, and
   Kenneth Heafield (2020). “Speed-optimized, Compact Student Models that Distill Knowledge
   from a Larger Teacher Model: the UEDIN-CUNI Submission to the WMT 2020 News Transla-
   tion Task”. In: Proceedings of the Fifth Conference on Machine Translation. ACL, pp. 191–196.
   URL: https://aclanthology.org/2020.wmt-1.17.
Junczys-Dowmunt, Marcin, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Hea-
   field, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev,
   André F. T. Martins, and Alexandra Birch (2018). “Marian: Fast Neural Machine Translation
148                                Ian Roberts, Andres Garcia Silva, Cristian Berrìo Aroca et al.

   in C++”. In: Proceedings of ACL 2018, System Demonstrations. Melbourne, Australia: ACL,
   pp. 116–121. URL: http://www.aclweb.org/anthology/P18-4020.
Kaalep, Heiki-Jaan and Tarmo Vaino (2001). “Complete Morphological Analysis in the Linguist’s
   Toolbox”. In: Congressus Nonus Internationalis Fenno-Ugristarum Pars V, pp. 9–16.
Kummervold, Per E, Javier De la Rosa, Freddy Wetjen, and Svein Arne Brygfjeld (2021). “Op-
   erationalizing a National Digital Library: The Case for a Norwegian Transformer Model”. In:
   Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). Reyk-
   javik, Iceland: Linköping University Electronic Press, Sweden, pp. 20–29. URL: https://aclant
   hology.org/2021.nodalida-main.3.
Leitner, Elena, Georg Rehm, and Julián Moreno-Schneider (2019). “Fine-grained Named Entity
   Recognition in Legal Documents”. In: Semantic Systems. The Power of AI and Knowledge
   Graphs. Proceedings of the 15th International Conference (SEMANTiCS 2019). Ed. by Maribel
   Acosta, Philippe Cudré-Mauroux, Maria Maleshkova, Tassilo Pellegrini, Harald Sack, and York
   Sure-Vetter. Lecture Notes in Computer Science 11702. Karlsruhe, Germany: Springer, pp. 272–
   287. URL: https://link.springer.com/content/pdf/10.1007%2F978-3-030-33220-4_20.pdf.
Marinova, Zlatina, Jochen Spangenberg, Denis Teyssou, Symeon Papadopoulos, Nikos Sarris,
   Alexandre Alaphilippe, and Kalina Bontcheva (2020). “Weverify: Wider and Enhanced Ver-
   ification for You Project Overview and Tools”. In: 2020 IEEE International Conference on
   Multimedia Expo Workshops (ICMEW), pp. 1–4. DOI: 10.1109/ICMEW46912.2020.9106056.
Papanikolaou, Konstantina, Harris Papageorgiou, Nikos Papasarantopoulos, Theoni Stathopoulou,
   and George Papastefanatos (2016). ““Just the Facts” with PALOMAR: Detecting Protest Events
   in Media Outlets and Twitter”. In: Tenth International AAAI Conference on Web and Social
   Media. Vol. 10. 2, pp. 135–142.
Pinnis, Mārcis and Toms Bergmanis (2020). “Tilde’s Neural Machine Translation Technology”. In:
   Latvian Academy of Sciences Yearbook 2020. Latvian Academy of Sciences, pp. 85–89.
Pontiki, Maria and Harris Papageorgiou (2015). “Opinion Mining and Target Extraction in Greek
   Review Texts”. In: Proceedings of the 12th International Conference on Greek Linguistics
   (ICGL 12). Vol. 2. Freie Universität. Berlin, Germany, pp. 871–883.
Pontiki, Maria, Konstantina Papanikolaou, and Haris Papageorgiou (2018). “Exploring the Pre-
   dominant Targets of Xenophobia-motivated Behavior: A Longitudinal Study for Greece”. In:
   Proceedings of the Natural Language Processing meets Journalism Workshop (NLPJ 2018).
   Ed. by Octavian Popescu and Carlo Strapparava. ELRA.
Prokopis, Prokopidis and Stelios Piperidis (2020). “A Neural NLP toolkit for Greek”. In: 11th
   Hellenic Conference on Artificial Intelligence, pp. 125–128. URL: http://nlp.ilsp.gr/setn- 2
   020/3411408.3411430.pdf.
Reale, Emanuela, Grazia Battiato, and Serena Fabrizio (2019). “RISIS2: an innovative research
   infrastructure as a support for STI research community”. In: ISSI, pp. 2658–2659. DOI: 10.52
   81/zenodo.3478408.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Roberts, Ian, Andres Garcia Silva, Miroslav Janosik, Nils Feldhus, Dimitris Galanis, Andis Lagz-
   diņš, and Rémi Calizzano (2022). Deliverable D4.3 Services, Tools and Components (Final
   Release). Project deliverable; EU project European Language Grid (ELG); Grant Agreement
   no. 825627 ELG. URL: https://www.european-language-grid.eu/wp-content/uploads/2022/04
   /ELG-Deliverable-D4.3-final.pdf.
7 Language Technology Tools and Services                                                      149

Roberts, Ian, Andres Garcia Silva, Miroslav Janosik, Andis Lagzdiņš, Nils Feldhus, Georg Rehm,
    Dimitris Galanis, Dusan Varis, and Ulrich Germann (2020). Deliverable D4.1 Services, Tools
    and Components (First Release). Project deliverable; EU project European Language Grid
   (ELG); Grant Agreement no. 825627 ELG. URL: https://www.european- language- grid.eu
   /wp-content/uploads/2021/02/ELG-Deliverable-D4.1-final.pdf.
Roberts, Ian, Andres Garcia Silva, Miroslav Janosik, Andis Lagzdiņš, Nils Feldhus, Georg Rehm,
    Dimitris Galanis, Dusan Varis, and Ulrich Germann (2021). Deliverable D4.2 Grid Content:
    Services, Tools and Components (Interim Release). Project deliverable; EU project European
    Language Grid (ELG); Grant Agreement no. 825627 ELG. URL: https://www.european-langu
    age-grid.eu/wp-content/uploads/2022/04/ELG-Deliverable-D4.2-final.pdf.
Schulz, Konstantin, Jens Rauenbusch, Jan Fillies, Lisa Rutenburg, Dimitrios Karvelas, and Georg
    Rehm (2022). “User Experience Design for Automatic Credibility Assessment of News Con-
    tent About COVID-19”. In: Proceedings of HCI International 2022 – Late Breaking Papers.
   Accepted for publication. 26 June-01 July 2022.
Straka, Milan (2018). “UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task”. In: Proceedings of
    CoNLL 2018: The SIGNLL Conference on Computational Natural Language Learning. Strouds-
    burg, PA, USA: ACL, pp. 197–207.
Straka, Milan and Jana Straková (2020). “UDPipe at EvaLatin 2020: Contextualized Embeddings
    and Treebank Embeddings”. In: Proceedings of LT4HALA 2020 – 1st Workshop on Language
   Technologies for Historical and Ancient Languages. Marseille, France: ELRA, pp. 124–129.
Straka, Milan, Jana Straková, and Jan Hajič (2019a). “Czech Text Processing with Contextual Em-
    beddings: POS Tagging, Lemmatization, Parsing and NER”. In: Proceedings of the 22nd Inter-
    national Conference on Text, Speech and Dialogue (TSD 2019). Cham, Heidelberg, New York
    etc.: Springer, pp. 137–150.
Straka, Milan, Jana Straková, and Jan Hajič (2019b). “UDPipe at SIGMORPHON 2019: Contex-
    tualized Embeddings, Regularization with Morphological Categories, Corpora Merging”. In:
    Proceedings of the 16th SIGMORPHON Workshop on Computational Research in Phonetics,
    Phonology, and Morphology. Stroudsburg, PA, USA: ACL, pp. 95–103.
Straková, Jana, Milan Straka, and Jan Hajič (2019). “Neural Architectures for Nested NER through
    Linearization”. In: Proceedings of the 57th Annual Meeting of the Association for Computa-
    tional Linguistics. Stroudsburg, PA, USA: ACL, pp. 5326–5331.
Suominen, Osma, Mona Lehtinen, and Juho Inkinen (2022). Annif and Finto AI: Developing and
    Implementing Automated Subject Indexing. Macerata. DOI: 10.4403/jlis.it-12740.
Tablan, Valentin, Ian Roberts, Hamish Cunningham, and Kalina Bontcheva (2013). “GATECloud-
    .net: A Platform for large-scale, Open-Source Text Processing on the Cloud”. In: Philosophical
   Transactions of the Royal Society A: Math., Phys. and Eng. Sciences 371.20120071.
Tiedemann, Jörg and Santhosh Thottingal (2020). “OPUS-MT – Building open translation services
    for the World”. In: Proceedings of the 22nd Annual Conference of the European Association for
    Machine Translation (EAMT). Lisboa, Portugal: European Association for Machine Translation,
    pp. 479–480. URL: https://helda.helsinki.fi/bitstream/handle/10138/327852/2020.eamt_1_499
    .pdf.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 8
Datasets, Corpora and other
Language Resources

Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault, Penny
Labropoulou, Miltos Deligiannis, Leon Voukoutis, and Stelios Piperidis




Abstract This chapter provides an overview of what is available in ELG in terms of
datasets, corpora and other language resources (LRs) and how this has been achieved.
We look at the procedures and steps that have been followed to complete the full
resource ingestion cycle, which goes from repository and LR identification to meta-
data description and ingestion. We explain the approaches, priorities and methodol-
ogy. The chapter also outlines the repositories that have been integrated into ELG,
discussing the different procedures followed (metadata conversion, extraction, and
completion, as well as harvesting) and the reasons behind these choices. Furthermore,
the ELG catalogue content is described, with details on key elements and features
as well as accomplishments. The last two sections are devoted to the crucial legal
issues behind such a complex platform and its data management plan, respectively.



1 Introduction

As introduced in Part I, one of the ELG platform’s primary functions is enabling shar-
ing, distribution and deployment of Language Resources and Technologies (LRT).
ELG provides access to thousands of datasets, by far the largest collection of rele-
vant datasets for the European Language Technology community. Users can search
for, download as well as provide different types of resources. As can be seen fur-
ther down, ELG has identified, filtered, described and centralised a vast amount of
datasets and other resources from different inventories and repositories, providing
an easy to use point of search for the LT community. Its aim is to become the “yellow
pages” and the primary platform for the European Language Technology community
(see Chapter 9). Our work in terms of curating and further enriching ELG is ongoing,
with new ingestions and collaborations at the time of writing.

Victoria Arranz · Khalid Choukri · Valérie Mapelli · Mickaël Rigault
ELDA, France, arranz@elda.org, choukri@elda.org, mapelli@elda.org, mickael@elda.org
Penny Labropoulou · Miltos Deligiannis · Leon Voukoutis · Stelios Piperidis
Institute for Language and Speech Processing, R. C. “Athena”, Greece,
penny@athenarc.gr, mdel@athenarc.gr, leon.voukoutis@athenarc.gr, spip@athenarc.gr


© The Author(s) 2023                                                                  151
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_8
152                    Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

    This chapter describes the work carried out so far as well as currently ongo-
ing efforts towards the population of the ELG catalogue with Language Resources
(datasets and language models). This work has consisted in 1. the identification of
sources (inventories and repositories), language resources and models, 2. their anal-
ysis, 3. the selection of elements to be ingested, as well as 4. the conversion or har-
vesting of their metadata descriptions and 5. the ingestion of these descriptions, and
actual LRs, if relevant. All these steps are complex and intertwined tasks that are
operationalised in a collaborative manner.
    As a core element of ELG, the term “Language Resource” (LR, LRs) is used for
resources composed of linguistic material used in the development, improvement or
evaluation of Language Technologies (LT, LTs), but also, in a broader sense, in lan-
guage and language-mediated research studies and applications; examples include
datasets of various types, such as textual, multimodal or multimedia corpora, lexical
data, grammars, language models, etc. In related initiatives and the literature, the
term is often used with a broader meaning, encompassing also tools and services
used for the processing and management of datasets, and standards, guidelines and
similar documents that support the research, development and evaluation of LTs. In
the ELG metadata model (see Labropoulou et al. 2020, and also Chapter 2), we use
the term as first defined for the META-SHARE metadata model (Gavrilidou et al.
2012), i. e., including both data resources and LT tools/services. The alternative term
Language Resource/Technology (LRT) is also used in the context of ELG (Rehm et
al. 2021). However, in this chapter we use LR as referring to datasets and language
models only; tools and services in ELG are discussed in Chapter 7.



2 Identification of Language Resources and Repositories

ELG aims to become the primary marketplace for the European LT community. The
organisations making use of it range from commercial to non-commercial, including
research centres and companies, as well as initiatives and infrastructures, among
others. Linking all these players and supporting them in their interaction is a two-fold
mission, which involves helping them make their tools, services and data available
and also establishing the means for them to find and have access to those they may
require in their work.
   To cover all relevant existing language resource repositories, ELG defined an
identification and collection methodology. First, the ELG project consortium mem-
bers performed a round of identification and analysis contributing their own re-
sources. Second, we reached out to the ELG National Competence Centres (NCCs,
see Chapter 11) to gather more input and pointers to additional existing repositories
and resource inventories. This identification task has been run in parallel with a pri-
ority definition task, which has been adjusted regularly according to achievements
and to the community’s needs and demands.
8 Datasets, Corpora and other Language Resources                                    153

2.1 Identification by the Consortium

ELG examined the available inventories and repositories of all potential LT/LR
providers and users. The initial results have been completed with further collabo-
rative input from the NCCs (see Section 2.2) and ELG’s sister project European
Language Equality (ELE, see Section 2.3.2). With regard to the typology of LRs
searched for, all types and modalities deemed useful for some sort of LT applica-
tion were considered. These comprise corpora, lexicons, terminologies, and derived
resources (such as language models for ASR or TMX models for MT), and also
focus on media such as speech/audio, text, video/audio-visual, images, OCR and
sign language datasets (images, videos). The identification strategy was adjusted
following initial findings. For example, users’ needs guided us to take into account
high-priority dataset types such as language models, and has led us to look into repos-
itories which contain and even focus on such types of resources (see Section 4.2).



2.2 Identification by the National Competence Centres

In addition to the work described above (Section 2.1), a survey was carried out to
gather more input from the NCCs and from other collaborators, often related to their
local and regional repositories (Rehm and Marheinecke 2019). This way we have
been able to identify new repositories and, moreover, we were also provided with
extensive documentation by the NCCs (content, contacts, etc.). The collaboration
with the NCCs has been valuable. We plan to continue the joint work to maximise
ELG’s coverage.



2.3 Collaboratively Filling the Gaps

With its (at the time of writing) 8,873 dataset descriptions and following the inges-
tion of several repositories, ELG is at a compelling stage for taking the next steps
in its dataset provision strategy. It must be stressed that our collaboration with other
initiatives has also had an impact on these numbers. Bearing that in mind, the pop-
ulation of ELG now follows the analysis and identification of gaps from several
perspectives:
 1. The ELG consortium members’ analysis of contributions and ingestion statistics
    in the platform.
 2. The analysis of gaps carried out under a joint strategy, such as the ELE project
    and the ELG pilot projects (see Part IV), which have contributed datasets and
    also shared their own needs with regard to ELG, thus supporting ELG on its
    LRT collection venture from the point of view of the provider and the user.
154                        Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

    3. The analysis of feedback received from technology developers and data users
       who shared their needs with us.


2.3.1 Contributions from the ELG Pilot Projects

The ELG pilot projects were intended to demonstrate the usefulness of ELG by con-
tributing datasets or services to the platform or by making use of existing datasets
or services for the development of innovative LT applications. These contributions
provided by the pilot projects benefit both the community that will have access to
the assets provided as well as the pilot projects themselves that will gain visibility
with their work and by displaying it in ELG. These projects are an excellent proof
of concept for the ELG platform and those pilot projects that provide datasets often
target – and fill – specific gaps. At the time of writing, the already concluded pilot
projects have finished their work, which has resulted in a set of 52 datasets available
through ELG. The pilot projects are described in detail in Part IV of this book.


2.3.2 Contributions from the European Language Equality Project

ELG collaborates with the European Language Equality (ELE) project1 to promote
digital language equality in Europe. In 2021, ELE organised an online survey ad-
dressed primarily to the more than 30 language experts of the consortium to collect
information on language resources and technologies available for the languages2 un-
der investigation (see Chapter 6 for more details). Through a web form, the ELE
consortium partners responsible for one or more of the languages addressed by the
project were able to record and report new language resources and also new resource
repositories. This additional and collaborative collection procedure resulted in ap-
prox. 6,300 records (Arranz et al. 2022), which have already been cleaned up, nor-
malised and curated and finally ingested into ELG (4,127 metadata records for data
resources and 2,215 metadata records for tools). Just like ELG organisation pages,
metadata records can be claimed by the resource creators or other rightful owners
(see Chapter 9, Section 3.3, p. 179) and enriched with further information. This is
why all contact persons included in these metadata records have been notified of
their publication in ELG; we encouraged them to claim their resources and enrich
the descriptions. Complete metadata descriptions are an important aspect of ensuring
findability and future reuse of the resources (see Chapter 2, Section 7).




1   https://european-language-equality.eu
2   https://european-language-equality.eu/languages/
8 Datasets, Corpora and other Language Resources                                       155

2.3.3 Platform Users

Finally, users of the ELG platform can also provide feedback about their interaction
with ELG or about unmet expectations with regard to the availability of datasets or
LT services. With regard to the latter, if users raise a certain need for specific datasets
in relation to specific technologies, the ELG team can investigate whether relevant
datasets or resources exist.



3 Integrating Repositories into ELG

The individual ELG releases follow an evolutionary strategy with regard to the pop-
ulation of the catalogue. This strategy has evolved as procedures have been put in
place and new priorities and needs identified. ELG Release 1 (R1) followed a rather
pragmatic approach, exploring procedures while targeting large repositories under
the management of ELG consortium members. This allowed us to set up procedures,
locate flaws and address problems (e. g., pending legal issues). ELG Release 2 (R2)
launched an ambitious acquisition of very large catalogues which were not compli-
ant with ELG’s structure and metadata schema. This was the case, for instance, for
Quantum Stat and Zenodo (see Section 4 and Arranz et al. 2021). Repositories like
Zenodo are extremely large digital libraries in which many different research arte-
facts are published, which is why it requires a certain amount of effort to find and
extract artefacts that are relevant for ELG. Despite these challenges, the overall result
is rewarding as it provides access to many LT-related datasets, which have not been
directly discoverable so far and which are now made available to the community
through ELG as a one-stop-shop. The LR provision strategy for ELG Release 3 (R3)
has built on top of the processes firmly established in R2. It continued and finished
up the integration of the already initiated repositories, it set up harvesting procedures
for as many ingested repositories as possible and added further repositories.



3.1 Priorities in the Ingestion Work

The list of identified repositories comprised different types of portals, such as those
storing data from evaluation campaigns or shared tasks (e. g., WMT resources,
Yeganova et al. 2021), large catalogues of language resources (e. g., ELRA, Mapelli
et al. 2022), networks of LR repositories (e. g., various META-SHARE nodes,
Piperidis et al. 2014), databanks, initiatives supporting the collection of language
data, etc. This initial list was prioritised by taking into account the following dimen-
sions of the different repositories:
  • Relevance of their content for ELG, its services and users.
  • Access information (conditions of use, prioritising open licensing schemes).
  • Languages covered (covering multiple different languages, filling detected gaps).
156                     Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

    • LR typology (covering different modalities, filling detected gaps).
    • Number of resources (prioritising repositories with larger numbers of resources).
    • Metadata schema (prioritising schemas that allow automated conversions).
    Following this prioritisation strategy, three repositories – all of which are run by
members of the ELG project consortium – were initially selected for ingestion in
ELG Release 1: ELRA3 , ELRC-SHARE4 and the three META-SHARE nodes man-
aged by DFKI5 , ELDA6 and ILSP7 . This choice was strategic, as a proof of concept
for resource availability and metadata conversion, given that the involved partners
were familiar with the content and metadata schemas of these repositories. All the
datasets selected for metadata ingestion were filtered down for legal compliance to
ensure that licensing or distribution conditions that could not be addressed by ELG
at this early stage could be taken care of for a later release. ELG Release 2 continued
with additional repositories under the management of ELG project consortium part-
ners (ELRA-SHARE-LRs 2014, 2016, 2018 and 20208 , and LINDAT/CLARIAH-
CZ9 ) but also by extending its work on the META-SHARE network and looking
into very large digital inventories such as Quantum Stat and Zenodo. The reasons
behind these choices combined strategy and diversity, which were also the goal with
repositories such as Hugging Face for ELG Release 3 (see Section 4.2.4).



3.2 Contributing Language Resources

Interested institutions or individuals can make datasets available for download, i. e.,
hosting datasets in the ELG platform, or they can simply point ELG users to exter-
nal download locations. In both cases, a description of the resource in the form of a
metadata record is needed that can be discovered through the ELG catalogue. Such
metadata descriptions can be manually created in ELG using the corresponding edi-
tor, they can be prepared as an XML file, which is then uploaded and imported into
ELG, or they can be automatically converted from existing metadata records that use
a different schema and imported into ELG afterwards. The flexibility behind these
different options to populate the ELG catalogue makes contributions very easy, they
can be done according to the provider’s needs and preferences.
    ELRC-SHARE follows the metadata-only option; this repository is financed by
the European Commission under the ELRC initiative (Lösch et al. 2021), datasets
will be available through ELRC-SHARE for at least the duration of the ELRC con-
tracts. For that reason, the master copies of the LRs provided to ELG remain within
3 http://catalogue.elra.info
4 https://elrc-share.eu
5 http://metashare.dfki.de

6 http://metashare.elda.org

7 http://metashare.ilsp.gr:8080

8 LRs contributed by LREC participants, see http://www.elra.info/en/lrec/shared-lrs/.

9 LINDAT is the CLARIN Centre for Language Research Infrastructure in the Czech Republic.
8 Datasets, Corpora and other Language Resources                                  157

ELRC-SHARE but corresponding metadata records are available through ELG, en-
abling their discovery through ELG and their download via a redirect to the corre-
sponding ELRC-SHARE page. In addition to contractual reasons, some repositories
prefer to host their LRTs themselves, such as the ELRA catalogue, which distributes
its LRs under a typology of licences that cannot be fully covered or recreated by the
ELG metadata schema for the time being. Repositories like Zenodo or Quantum Stat
mostly provide links to the locations of their datasets, very often these are links to
Github or Gitlab pages. Again, only metadata records with the links to the dataset
locations have been ingested into ELG. Likewise, harvested repositories only export
metadata records (e. g., different CLARIN nodes or Hugging Face).



4 Procedures to Ingest Language Resources

Different repositories need to be approached differently with the goal of extracting
metadata records and ingesting them into ELG. This relates to a number of dimen-
sions that have allowed us to categorise repositories and, thus, to set up procedures
to process them. These relate to the conversion, extraction and completion as well
as harvesting of LR metadata, further described in Sections 4.1, 4.2 and 4.3 below.



4.1 Metadata Conversion

We converted (through mapping) the metadata records of several repositories so
that we could import them into the ELG catalogue, which follows the ELG meta-
data schema (Labropoulou et al. 2020). This was the case for the ELRA catalogue,
the META-SHARE nodes and the initial ingestion of the ELRC-SHARE repository
(managed through harvesting now, see below). This conversion work is complex, but
it has paved the way for improvements and updates on both sides of the conversion
line, on both the source and target metadata elements and descriptions.


4.1.1 From ELRA Catalogue to ELG

The conversion of the LR metadata entries in the ELRA catalogue into the ELG
metadata format followed several steps:
  • Updating the ELRA catalogue XML Schema Definition (XSD): The ELRA cata-
    logue is based upon the META-SHARE structure, it has been adapted to ELRA’s
    specific distribution requirements. Before proceeding with the metadata conver-
    sion, an analysis of discrepancies between the META-SHARE XSD and the
    ELRA catalogue XML files was performed. This allowed us to update the ELRA
    catalogue XSD and to export the XML files in META-SHARE 3.1 format.
158                    Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

  • Mapping between META-SHARE 3.1 and ELG-SHARE 1.0.2: Once exported,
    the ELRA XML files were mapped to the ELG metadata schema 1.0.2. This
    mapping allowed us to adapt the validated ELRA XML files (in META-SHARE
    3.1 format) and to make them compliant with the ELG-SHARE model. Several
    elements had to be adapted for that purpose.
  • Conversion from META-SHARE 3.1 to ELG Metadata Model 1.0.2: Once the
    mapping between the ELRA catalogue and ELG was completed, we imple-
    mented an XSLT stylesheet to transform the META-SHARE 3.1 format to the
    ELG metadata model.
   While the implementation of this first tool required quite a bit of effort, the expe-
rience gained was valuable for the subsequent implementation of other converters.


4.1.2 From META-SHARE to ELG

META-SHARE’s DKFI, ELDA and ILSP nodes are based on META-SHARE XSD
3.0. An already existing XSLT stylesheet was used to convert from META-SHARE
XSD 3.0 to 3.1. We implemented a second XSLT stylesheet to convert META-
SHARE 3.1 XML files into ELG metadata 1.0.2 (as for the ELRA-SHARE con-
version into ELG). This modular approach allowed us to use META-SHARE v3.1
as pivot schema, reusing the implemented XSLTs stylesheets for further conversions
(such as ELRC-SHARE’s below).


4.1.3 From ELRC-SHARE to ELG

ELRC-SHARE is also based on META-SHARE. The initial ingestion was carried
out through conversion, a harvesting protocol was put in place later (see Section 4.3
and Chapter 6 in Part I). To benefit from the ELRA to ELG metadata converter, a
subset of ELRC-SHARE LRs was converted first into the ELRA and then into the
ELG format.


4.1.4 Import into ELG

The XML files converted from the metadata of the different repositories were then
imported into ELG using the API developed for this purpose. Some inconsistencies
remained that led to corrections both in the XML files and the ELRA catalogue.



4.2 Metadata Extraction and Completion

Now we look into those repositories that did not allow for a straightforward conver-
sion or for which building converters was not a feasible option.
8 Datasets, Corpora and other Language Resources                                    159

4.2.1 Zenodo

Zenodo10 is a digital library launched in May 2013 within the OpenAire11 project, to
enable the compilation of research artefacts, such as publications, images, datasets,
software, etc. A good number of those artefacts consists of LRs that may be of inter-
est to the LT community. However, the extremely high number of artefacts in Zenodo
together with the incompatibility of the Zenodo and ELG metadata schemes made
the identification of relevant LRs a big challenge. We opted for a semi-automatic
approach to collect what ELG considers as LRs, using a combination of Python and
directly querying the Zenodo database, among others.12 However, the compilation
of metadata information still required manual intervention to ingest our selection of
actual LRs as well as to add the minimal set of metadata elements which are manda-
tory for ELG and which do not exist in the Zenodo records. This semi-automated pro-
cess required a lot of manual effort. We currently work on an automated harvesting-
oriented approach (see Section 4.3 and Chapter 6 in Part I).


4.2.2 ELRA-SHARE-LRs

The ELRA-SHARE-LRs are provided by participants attending the Language Re-
sources and Evaluation Conference (LREC). Participants can share the LRs they
present at the conference either by uploading them in a special LREC repository or
by linking them to their original download location using an online form. We selected
a subset of these LRs by checking the compliance of licences with the ones accepted
in ELG. Licences that are too vague were left aside (e. g., “Open Source”, “Cre-
ative Commons” without further specification). Given that the original metadata was
available as a spreadsheet, the sheet and conversion tool produced to gather Zenodo
metadata (see above) was adapted. As the ELRA-SHARE-LRs metadata contained
only a minimal set of information, missing but required information was added man-
ually into the spreadsheet to comply with the mandatory ELG metadata (e. g., type
of LR, linguality, annotation, data format, licence, etc.). Finally, the spreadsheet was
converted into XML and ingested into ELG.


4.2.3 Quantum Stat

Quantum Stat enables LR producers to register datasets in the “Big Bad NLP
Database”.13 The procedure for identifying, describing and ingesting datasets into
ELG is as follows: first, an initial table with 481 datasets was exported and analysed
for relevance to ELG by checking licensing information (whether licences are well

10 https://zenodo.org
11 https://www.openaire.eu
12 https://developers.zenodo.org/#records

13 https://datasets.quantumstat.com
160                        Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

identified), dataset type, and whether the resource can be downloaded. The datasets
not complying to the LR description requirements were discarded and only compli-
ant metadata information was kept. Then, as for ELRA-SHARE-LRs and Zenodo,
the minimal set of metadata information was compiled, while also adding missing
information before the actual conversion into XML and ingestion into ELG.


4.2.4 Hugging Face

Often described as a “model zoo”, the Hugging Face14 repository includes a large
collection of machine learning models and datasets that can be used for training
new models, with a focus on the Transformers architecture (Wolf et al. 2020). ELG
collaborates with Hugging Face regarding the import of Hugging Face metadata
records into ELG. One challenge relates to the fact that the description of resources
in Hugging Face does not follow a specific methodology. To begin with, adding de-
scriptions to resources is encouraged but not mandatory. Furthermore, the suggested
metadata elements do not follow a standard schema. The manual work needed to pro-
cess the filtered entries was considerable in order to enrich the information available.
A conversion process was applied based on mapping the elements (see Chapter 6 for
more details).



4.3 Metadata Harvesting

We implemented metadata harvesting solutions for ELRC-SHARE, LINDAT/CLA-
RIAH-CZ, CLARIN-PL and CLARIN-SI as well as Zenodo, as described below.


4.3.1 ELRC-SHARE

Three groups of datasets were originally selected from the three prioritised reposito-
ries to be converted and ingested into ELG Release 1 (see Section 4.1). Of these, only
ELRC-SHARE allowed for the import of the whole list given that its resources met
the following conditions: their licensing conditions allowed it (all data were shared
under CC-BY licences, they were open under the directive on the re-use of public sec-
tor information, or they belong to the public domain), and their metadata elements
were compatible and fully covered by the ELG metadata schema. We have imple-
mented an OAI-PMH15 client that harvests metadata records compliant with the ELG
metadata schema, and we use this for regular harvesting from ELRC-SHARE.




14   https://huggingface.co
15   Open Archives Initiative Protocol for Metadata Harvesting (2015).
8 Datasets, Corpora and other Language Resources                                  161

4.3.2 LINDAT/CLARIAH-CZ

The LINDAT/CLARIAH-CZ repository makes its metadata available for harvesting
through its OAI-PMH end-point.16 Means for ingesting metadata complying to the
META-SHARE schema17 were already in place in ELG and the repository did pro-
vide a mapping from its internal metadata storage to META-SHARE. An attempt
was made at reusing this conversion, but the result was deemed unacceptable as not
all of the available metadata was mapped. After a few iterations we arrived at a
mapping between concepts that are important and required in the ELG schema and
the metadata stored in LINDAT/CLARIAH-CZ. LINDAT updated the metadata for
several of its resources following the feedback received from ELG. Also, based on
the feedback from LINDAT/CLARIAH-CZ, some changes were applied to the ELG
schema. The implementation of this mapping represents around 1,200 changed lines
of code, including some tooling to reflect some of the metadata issues discovered.18


4.3.3 CLARIN-PL and CLARIN-SI

The LINDAT/CLARIAH-CZ repository makes available an OAI-PMH endpoint
which exposes ELG-compatible metadata records. The repository software devel-
oped by the LINDAT/CLARIAH-CZ team, based on DSpace, is also used by several
other CLARIN centres for their repositories, i. e., their metadata records are ready
to be imported into ELG using the same harvesting procedure. For ELG Release 3,
this collaboration has resulted in the regular harvesting of the CLARIN centres in
Slovenia (CLARIN-SI) and Poland (CLARIN-PL).19


4.3.4 Zenodo

As described in Chapter 6 (Part I), Zenodo is a particularly interesting catalogue for
ELG purposes. Zenodo exposes its metadata records through a REST API20 as JSON
data and through an OAI-PMH API21 in a set of standard metadata formats, i. e.,
DC22 , DataCite23 , MARC2124 and DCAT25 . Work is currently ongoing to replace
the semi-manual import of Zenodo metadata records that started for ELG Release

16 http://lindat.mff.cuni.cz/repository/oai/request?verb=Identify
17 http://www.meta-share.org/p/93/Documentation
18 https://github.com/ufal/clarin-dspace/pull/930

19 http://www.clarin.si and https://clarin-pl.eu

20 https://developers.zenodo.org/#rest-api

21 https://developers.zenodo.org/#oai-pmh

22 https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

23 https://schema.datacite.org/meta/kernel-4.4/

24 https://www.loc.gov/marc/bibliographic/

25 https://www.w3.org/TR/vocab-dcat-3/
162                      Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

2 with a more automated process taking advantage of the standard protocols and
schemas offered by Zenodo. This task involves a number of challenges that we are
currently addressing with regard to the selection of the source API, the selection
and conversion of metadata, the selection of a subset of the downloaded metadata
records and the setting-up of an automated procedure for regular harvesting.



5 Language Resources in the ELG Catalogue

After the most recent ingestions of datasets as well as the contributions from the
pilot projects and ELE, the ELG catalogue has reached a total of 8,873 metadata
entries in April 2022, far exceeding our expectations when we started the project. The
majority of these are description records without the data being hosted in ELG (103
resources are fully available through ELG). However, even if not available through
ELG directly, most datasets are available through the referenced repository page,
often available for download, which is reflected in the ELG catalogue too. Figures 1
and 2 illustrate the breakdown of repository sources ingested so far together with the
breakdown of the current numbers per source.


         4500
         4000
         3500
         3000
         2500
         2000
         1500
         1000
          500
            0
                                           I




                                                                                       ns
                                          s

                                        CZ


                                         at




                                         ce




                                                                              rib s
                                       RA




                                        RE




                                                                             tP E
                                          L
                                        do
                                        RE




                          CL N.S
                                     LR




                                                                                        t
                                                                                    EL
                                       .P




                                                                                    jec
                                      St




                                                                                     io
                                    Fa
                                    no
                                      -
                                    A

                                    A
                                  EL




                                   N
                                   H
                                    -




                                                                                  ut
                                 um




                                                                                 ro
                                  RI
                                RE
                     R A - SH




                                 Ze
                                  H




                                ng
                                RI
                                 A
                               -S




                               A
                             RI

                              nt




                              gi
                              A
                              A




                                                                            nt
                           CL
                           RC


                  EL TA




                                                                           lo
                           ua




                           ug
                            H

                          LA




                                                                         co
                                                                         Pi
                         -S




                         Q




                         H
                         E
                        EL




                       /C




                                                                      er
                      M




                                                                   th
                    AT




                                                                      O
                  D
                N
             LI




Fig. 1 Repository sources of the 8,873 datasets available in ELG in April 2022


   Regarding resource types and their linguality, Figure 3 illustrates the numbers.
As expected, the highest numbers apply to corpora (6,236 available in ELG), with
twice as many monolingual corpora as bilingual ones (which in turn are three times
as many as the multilingual ones). Lexical/Conceptual resources are also very well
represented with 2,229 entries.
   One of our bigger concerns at the time of Release 2 was the fact that there were
barely any language descriptions (there were only 7). This has changed with the work
towards ELG Release 3: at the time of writing, we count 408 language descriptions
with the majority being monolingual. Further regarding language descriptions, the
8 Datasets, Corpora and other Language Resources                                               163

                           52 37                                         ELRA
                                                                         ELRC-SHARE
                                         1180
                                                                         META-SHARE
                                                                         ELRA-SHARE-LRs
                                                                         LINDAT/CLARIAH-CZ
                                                        1299             Quantum Stat
           4127                                                          Zenodo
                                                                  71     CLARIN.SI
                                                                144      CLARIN.PL
                                                       353
                                                                         Hugging Face

                                                492       261            ELE
                                                                         Pilot Projects
                                 385
                                                                         Other contributions
                                       254 218

Fig. 2 Repository sources of the 8,873 datasets available in ELG in April 2022


          4000
          3500
          3000
          2500
          2000
          1500
          1000
           500
             0
                        Corpus                 Language description      Lexical/Conceptual
                                                                              resource

                                   Bilingual      Monolingual     Multilingual

Fig. 3 Types of resources according to linguality



number of its “language models” subclass has increased to 358. This is good news
as models are a popular and highly demanded resource type, currently providing the
state of the art for many LT/NLP tasks. ELG is actively encouraging the use of its
platform for the creation of models. The pilot projects have supported this resource
type as well by contributing their models, too.
   ELG also offers a very broad language coverage, with 450 languages represented
by lexical/conceptual resources, and with corpora available in 438 languages, at the
time of writing. The language models cover 156 languages, grammars are available
for 25 languages. These are either monolingual or multilingual resources. Figure 4
shows the language resource type distribution for the EU official languages.
   Finally, different media types are also represented in ELG. As expected, the
largest number of resources belongs to the type “text” with more than 7,000 datasets.
164                                             Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

          4000


          3500


          3000


          2500


          2000


          1500


          1000


           500


            0
                                     French
                 English




                                                                                    Czech
                           Spanish


                                              German
                                                       Finnish
                                                                 Polish
                                                                          Italian


                                                                                            Greek




                                                                                                                           Estonian


                                                                                                                                                  Dutch


                                                                                                                                                                   Hungarian


                                                                                                                                                                                           Romanian


                                                                                                                                                                                                                Croatian
                                                                                                                                                                                                                           Slovak
                                                                                                    Portuguese
                                                                                                                 Swedish




                                                                                                                                                          Danish




                                                                                                                                                                                                                                    Lithuanian
                                                                                                                                                                                                                                                 Irish
                                                                                                                                                                                                                                                         Maltese
                                                                                                                                      Bulgarian




                                                                                                                                                                               Slovenian


                                                                                                                                                                                                      Latvian
                                              Corpora                        Language descriptions                                                 Lexical/Conceptual resources


Fig. 4 Language resource type distribution for the official EU languages



Nonetheless, the type “audio” already offers more than 1,200 resources while cur-
rently 385 image and video resources are available.



6 Language Resources and Legal Issues

Managing legal issues in a large platform such as ELG implies taking care of a wide
variety of legal aspects, often regarding licensing. It also implies taking into account
processes that may differ from one provider to another. A provider may choose to
distribute resources either through implicit or explicit licences, through specific con-
ditions of use, or through considering a particular user status such as profiles or
membership status. Moreover, the need to ensure GDPR compliance requires cer-
tain monitoring processes. For the development of the platform, the project has ben-
efited from the support and advice of a dedicated team of legal experts who helped
deploy the platform in a manner that is legally sound. This ranges from establishing
the necessary legal context (e. g., Privacy Policy and Terms of Use) to stepping in
for consultations. The legal team has also contributed to the preparation of a Data
Management Plan (see Section 7). Below, we briefly describe some of the specific
issues the ELG legal team has taken care of.
Advice on implicit versus explicit licences: One main distinction to make is the
  management of implied (or implicit) versus expressed (or explicit) licences. For
  implied licences, it has become a commonly and widely used practice to grant
8 Datasets, Corpora and other Language Resources                                     165

   users access when they click on the licence terms acceptance button indicated on
   the repository pages.
Advice on conditions and terms of use: The conditions of use of a resource are
   another factor that has been defined and which may require further discussion
   and interaction between the provider and the user. Among the various elements
   to consider in licensing data or tools, we need to review the purpose of use (which
   could be commercial, for research, etc.), as well as the profile of the licencee (this
   is the type of institution, some resources may be restricted to particular types of
   institutions, e. g., academic or commercial)26 .
Financial and distributional issues: Not only legal issues may condition the de-
   livery of resources to a user, but also the financial and distribution policies of
   the provider. Such policies involve a dedicated team, with expertise in technical,
   legal and financial domains. Parameters like the legal profile of the licencee, the
   purpose of use and the pricing policy need to be clearly displayed.
META-SHARE licensing: The selection of LRs for ingestion done for the three
   META-SHARE nodes needed to be revised due to licensing restrictions. These
   involved proprietary licences (e. g., MS-C-NoReD, MS-NC-NoReD and MS-
   Commons-BY-SA), as well as licences that required negotiations with providers.
   To address this, a study of the licences was performed by the ELG legal team
   for discussion with node managers. A proposal for licence mapping was drafted
  where non-restrictive licences were invited to move to Creative Commons li-
   cences. Restrictive licences were encouraged to move to more open licences, too.
Legal checking: The identification of various repositories demonstrated the im-
   portance of legal checking all throughout the information compilation process. In
   some cases (e. g., Zenodo), licences were well identified and could usually be in-
   tegrated in the ELG metadata without further analysis. However, for other cases
  (e. g., ELRA-SHARE-LRs, Quantum Stat), legal information did not always com-
   ply with ELG requirements or was simply missing. Consequently, legal expertise
  was needed to either check and confirm the accuracy of present legal information,
   or to search for and gather the appropriate legal information.
Improvement of the licence list: When we processed the Zenodo datasets, we re-
   alised that several licences were not part of the ELG metadata values. Thus, the
   ELG legal expert was asked to compare the Zenodo list with the ELG list and
   make suggestions to integrate some of those licences into the ELG metadata. A
   list of 68 licences that did not correspond to ELG values was checked, out of
  which 40 could be added to the ELG licence list, whereas the other 28 did not
   need to be added because they were already used within ELG using other labels,
   they were not used, or they had no link.
Addition of conditions of use in the ELG metadata: We decided to add a new
   metadata field corresponding to the “conditions of use” associated to each iden-
   tified licence to improve the search functionality for resources based on their li-
   censing conditions. For “standard” licences, the conditions of use were added by
   the ELG team, based on information gathered from Creative Commons licences,

26   https://live.european-language-grid.eu/terms-of-use
166                      Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

     values from the CLARIN licencing framework27 , META-SHARE licences, and
     the ELRA licence wizard28 . For all other LRs, a thorough analysis of over 300
     licences (all licences in the SPDX list29 ) was done by our legal team who went
     through the different conditions of use such as the intellectual property rights
     granted by the licences, the requirements on redistribution imposed by the licence,
     the requirements on use of the data and, finally, the requirements imposed on
     users (Rigault et al. 2022b).



7 Language Resources and Data Management

ELG is a platform for commercial and non-commercial Language Technologies,
both functional (running services and tools) and non-functional (datasets, resources,
models). In order to achieve this, the consortium in charge of the ELG platform has
enacted several priorities that include the processing of massive amounts of data
and of different types. These large amounts of data derive from partners’ contribu-
tions, external providers willing to share their datasets through ELG, our harvesting
of other repositories as well as different kinds of resource and repository identifi-
cation work. As can be expected, such a data intensive project requires clear data
management policies, in particular considering GDPR constraints. For that purpose,
we implemented a Data Management Plan (DMP) as a concrete necessity for organ-
isational, technical and legal management of all data types processed in the course
of the project (Rigault et al. 2022a). The DMP documents the variety of data types
collected, received and/or processed in the course of the project and reports on how
the data is going to be managed with regard to technical, organisational and legal
aspects. The DMP also complies with best practices and, in particular, with the re-
quirements of Horizon 2020 as well as GDPR obligations. It defines useful practices
to enhance compatibility with the FAIR principles (see Section 7 in Chapter 2 and
Wilkinson et al. 2016)30 , as endorsed and specified for Horizon 2020. Moreover, the
DMP provides advice in terms of best practices for language resource creation in all
steps of an LR life cycle (Choukri and Arranz 2012; Rehm 2016).



8 Conclusions

We integrated more than 10,000 metadata records for datasets, models and other
classes of language resources into the ELG platform. These LRTs have been care-
fully described so as to ease their findability (following the FAIR principles) and to
27 See https://www.clarin.eu/content/licenses-and-clarin-categories#res and https://www.clarin.e
u/content/clarin-license-category-calculator
28 http://wizard.elra.info/principal.php

29 https://spdx.org/licenses/

30 https://www.go-fair.org
8 Datasets, Corpora and other Language Resources                                            167

ensure compliance with the ELG metadata schema while advocating for interoper-
ability. A series of steps and best practices has been followed with the objective of
establishing procedures for resource identification, description and ingestion. The
work carried out during the ELG project has allowed us to consider expertise and
lessons learned to improve protocols and principles. This has been the reason for
updating the integration approach of some repositories (e. g., ELRC-SHARE and
Zenodo). The strategy behind the choice of repositories has also been planned care-
fully, following technical and strategic priorities, as well as evolutionary needs and
demands. ELG users can now either access thousands of resources or contribute
resources through the different means provided. Legal issues have also been con-
sidered with a special focus on licensing. Moreover, a Data Management Plan has
been conceived to address the handling of all types of data (including sensitive data)
within ELG as well as guiding the production and life cycle aspects of LRs.



References

Arranz, Victoria, Khalid Choukri, Valérie Mapelli, Mickaël Rigault, Jan Hajic, Ondrej Kosarko,
   Cristian Berrio, Andrés Garcia-Silva, Rémi Calizzano, Nils Feldhus, Miltos Deligiannis, Penny
   Labropoulou, Stelios Piperidis, and Ulrich Germann (2021). Deliverable D5.2 Data Sets, Iden-
   tified Gaps, Produced Resources and Models (Version 2). Project deliverable; EU project Euro-
   pean Language Grid (ELG); Grant Agreement no. 825627 ELG. URL: https://www.european-
   language-grid.eu/wp-content/uploads/2022/04/ELG-Deliverable-D5.2-final.pdf.
Arranz, Victoria, Khalid Choukri, Valérie Mapelli, Mickaël Rigault, Penny Labropoulou, Miltos
   Deligiannis, Leon Voukoutis, Stelios Piperidis, and Ulrich Germann (2022). Deliverable D5.3
   Data Sets, Models, Identified Gaps, Produced Resources and their Exploitation within ELG
   (Version 3). Project deliverable; EU project European Language Grid (ELG); Grant Agreement
   no. 825627 ELG. URL: https://www.european-language-grid.eu/wp-content/uploads/2022/04
   /ELG-Deliverable-D5.3-final.pdf.
Choukri, Khalid and Victoria Arranz (2012). “An Analytical Model of Language Resource Sustain-
   ability”. In: Proceedings of the Eighth International Conference on Language Resources and
   Evaluation (LREC 2012). Istanbul, Turkey: ELRA, pp. 1395–1402. URL: http://www.lrec-con
   f.org/proceedings/lrec2012/pdf/846_Paper.pdf.
Gavrilidou, Maria, Penny Labropoulou, Elina Desipri, Stelios Piperidis, Haris Papageorgiou, Mon-
   ica Monachini, Francesca Frontini, Thierry Declerck, Gil Francopoulo, Victoria Arranz, and Va-
   lerie Mapelli (2012). “The META-SHARE Metadata Schema for the Description of Language
   Resources”. In: Proceedings of the Eighth International Conference on Language Resources
   and Evaluation (LREC 2012). Istanbul, Turkey: ELRA, pp. 1090–1097. URL: http://www.lrec
   -conf.org/proceedings/lrec2012/pdf/998_Paper.pdf.
Labropoulou, Penny, Katerina Gkirtzou, Maria Gavriilidou, Miltos Deligiannis, Dimitris Galanis,
   Stelios Piperidis, Georg Rehm, Maria Berger, Valérie Mapelli, Michael Rigault, Victoria Ar-
   ranz, Khalid Choukri, Gerhard Backfried, José Manuel Gómez Pérez, and Andres Garcia-Silva
   (2020). “Making Metadata Fit for Next Generation Language Technology Platforms: The Meta-
   data Schema of the European Language Grid”. In: Proceedings of the 12th Language Resources
   and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe
   Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard,
   Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA,
   pp. 3421–3430. URL: https://www.aclweb.org/anthology/2020.lrec-1.420/.
168                      Victoria Arranz, Khalid Choukri, Valérie Mapelli, Mickaël Rigault et al.

Lösch, Andrea, Valérie Mapelli, Khalid Choukri, Maria Giagkou, Stelios Piperidis, Prokopis Proko-
   pidis, Vassilis Papavassiliou, Miltos Deligiannis, Aivars Berzins, Andrejs Vasiljevs, Eileen
   Schnur, Thierry Declerck, and Josef van Genabith (2021). “Collection and Curation of Lan-
   guage Data within the European Language Resource Coordination (ELRC)”. In: Proceedings
   of the Conference on Digital Curation Technologies (QURATOR 2021). Ed. by Adrian Paschke,
   Georg Rehm, Jamal Al Qundus, Clemens Neudecker, and Lydia Pintscher. Vol. 2836. CEUR
   Workshop Proceedings. Berlin, Germany: CEUR-WS.org. URL: http://ceur-ws.org/Vol-2836
   /qurator2021_paper_6.pdf.
Mapelli, Valérie, Victoria Arranz, Hélène Mazo, and Khalid Choukri (2022). “Language Resources
   to Support Language Diversity – the ELRA Achievements”. In: Proceedings of the 13th Lan-
   guage Resources and Evaluation Conference (LREC 2022). Ed. by Nicoletta Calzolari, Frédéric
   Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara,
   Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA,
   pp. 551–558. URL: http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.58.pdf.
Piperidis, Stelios, Harris Papageorgiou, Christian Spurk, Georg Rehm, Khalid Choukri, Olivier Ha-
   mon, Nicoletta Calzolari, Riccardo del Gratta, Bernardo Magnini, and Christian Girardi (2014).
   “META-SHARE: One year after”. In: Proceedings of the 9th Language Resources and Evalu-
   ation Conference (LREC 2014). Ed. by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck,
   Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios
   Piperidis. Reykjavik, Iceland: ELRA, pp. 1532–1538. URL: http://www.lrec-conf.org/proceed
   ings/lrec2014/pdf/786_Paper.pdf.
Rehm, Georg (2016). “The Language Resource Life Cycle: Towards a Generic Model for Creat-
   ing, Maintaining, Using and Distributing Language Resources”. In: Proceedings of the 10th
   Language Resources and Evaluation Conference (LREC 2016). Ed. by Nicoletta Calzolari,
   Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asun-
   cion Moreno, Jan Odijk, and Stelios Piperidis. Portorož, Slovenia: ELRA, pp. 2450–2454. URL:
   https://aclanthology.org/L16-1388.pdf.
Rehm, Georg and Katrin Marheinecke (2019). Deliverable D7.2 National Competence Centres
   and Language Technology Council. Project deliverable; EU project European Language Grid
   (ELG); Grant Agreement no. 825627 ELG. URL: https://www.european-language-grid.eu/wp-
   content/uploads/2021/02/ELG-Deliverable-D7.2-final.pdf.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Rigault, Mickaël, Victoria Arranz, Khalid Choukri, Valérie Mapelli, Pawel Kamocki, and Lucille
   Blanchard (2022a). Deliverable D5.6 Data Management Plan (Version 3). Project deliverable;
   EU project European Language Grid (ELG); Grant Agreement no. 825627 ELG. URL: https:
   //www.european-language-grid.eu/wp-content/uploads/2022/04/ELG-Deliverable-D5.6-final
   .pdf.
Rigault, Mickaël, Victoria Arranz, Valérie Mapelli, Penny Labropoulou, and Stelios Piperidis
   (2022b). “Categorizing Legal Features in a Metadata-Oriented Task: Defining the Conditions
   of Use”. In: Proceedings of the Legal and Ethical Issues Workshop (LREC 2022). Ed. by Nico-
   letta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry
   Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis.
   Marseille, France: ELRA, pp. 22–26.
8 Datasets, Corpora and other Language Resources                                              169

Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Ax-
   ton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E.
   Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier
   Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alas-
   dair J.G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A.C ’t Hoen,
   Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert
   Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik,
   Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris
   A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waag-
   meester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons (2016). “The
   FAIR Guiding Principles for Scientific Data Management and Stewardship”. In: Scientific Data
   3. DOI: 10.1038/sdata.2016.18. URL: http://www.nature.com/articles/sdata201618.
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
   Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick
   von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger,
   Mariama Drame, Quentin Lhoest, and Alexander M. Rush (2020). “Transformers: State-of-the-
   art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Meth-
   ods in Natural Language Processing: System Demonstrations. ACL, pp. 38–45. DOI: 10.1865
   3/v1/2020.emnlp-demos.6. URL: https://aclanthology.org/2020.emnlp-demos.6.
Yeganova, Lana, Dina Wiemann, Mariana Neves, Federica Vezzani, Amy Siu, Inigo Jauregi Unanue,
   Maite Oronoz, Nancy Mah, Aurélie Névéol, David Martinez, Rachel Bawden, Giorgio Maria
   Di Nunzio, Roland Roller, Philippe Thomas, Cristian Grozea, Olatz Perez-de-Viñaspre, Maika
   Vicente Navarro, and Antonio Jimeno Yepes (2021). “Findings of the WMT 2021 Biomedical
   Translation Shared Task: Summaries of Animal Experiments as New Test Set”. In: Proceedings
   of the Sixth Conference on Machine Translation. ACL, pp. 664–683. URL: https://aclantholog
   y.org/2021.wmt-1.70.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 9
Language Technology Companies,
Research Organisations and Projects

Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou




Abstract The European Language Grid is meant to develop into the primary plat-
form of the European Language Technology community. In addition to LT tools
and services (Chapter 7) and Language Resources (Chapter 8), ELG represents the
actual members of this community, i. e., the companies and research organisations
that develop language technologies and that are engaged in related activities. The
goal of becoming the primary platform for LT in Europe implies that ELG should
ideally represent all European companies and all European research organisations
with corresponding metadata records in the ELG catalogue, which are interlinked
with the respective LT tools and services as well as language resources they offer.
This chapter describes the European stakeholders and user groups that are relevant
for the ELG initiative, the composition of the community and the locations of the
companies and research groups as currently listed in ELG. Furthermore, we describe
a number of technical and organisational challenges involved in the preparation of
our list of stakeholders, and outline the process of catalogue population.



1 Introduction

The European Language Grid is meant to develop into the primary platform of the
European LT community. This is why, in addition to functional LT tools and services
and more static Language Resources (LRs), ELG also represents the actual members
of this community, i. e., the companies and research organisations that develop LTs
and that are engaged in related activities such as the integration of LT into existing
systems or support services such as data annotation at scale. This overall goal of
eventually establishing ELG as the primary platform for LT in Europe implies that
ELG should ideally represent all European companies and all European research

Georg Rehm · Katrin Marheinecke · Rémi Calizzano
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de,
katrin.marheinecke@dfki.de, remi.calizzano@dfki.de
Penny Labropoulou
Institute for Language and Speech Processing, R. C. “Athena”, Greece, penny@athenarc.gr


© The Author(s) 2023                                                                      171
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_9
172                             Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou

                  Corpus          Language Description/Model                    Lexical/Conceptual Resource (LCR)
                                         Organization              Project          Tool/Service

7,000

6,000

5,000

4,000

3,000

2,000

1,000

       0
              0



                            0



                                         0



                                                      1



                                                                   1



                                                                                1



                                                                                             1



                                                                                                           2



                                                                                                                         2



                                                                                                                                      2
              2



                          2



                                       2



                                                    2



                                                                 2



                                                                                2



                                                                                              2



                                                                                                            2



                                                                                                                       2



                                                                                                                                    2
           20



                       20



                                    20



                                                 20



                                                              20



                                                                             20



                                                                                           20



                                                                                                         20



                                                                                                                    20



                                                                                                                                 20
         4/



                       7/



                                    0/



                                                 1/



                                                              4/



                                                                           7/



                                                                                       0/



                                                                                                    1/



                                                                                                                    4/



                                                                                                                                 7/
       /0



                     /0



                                  /1



                                               /0



                                                            /0



                                                                         /0



                                                                                      /1



                                                                                                    /0



                                                                                                                  /0



                                                                                                                               /0
     01



                   01



                                01



                                             01



                                                          01



                                                                       01



                                                                                    01



                                                                                                  01



                                                                                                                01



                                                                                                                             01
Fig. 1 Evolution of resources in ELG over time broken down by resource type



organisations in the ELG catalogue, which are interlinked with the respective LT
tools and services as well as language resources these organisations offer on and
through the European Language Grid. In other words, the European Language Grid
also functions as the “yellow pages” of the European LT community, ideally listing
and promoting all relevant members of this community, i. e., small and medium-
sized companies as well as large enterprises, research centers, universities and other
academic institutions that develop LT but also organisations in the periphery of this
core, e. g., integrators and annotation service providers (Rehm et al. 2020, 2021).1
   In addition to serving as the central directory for members of the European LT
community, ELG also includes information about relevant projects in the area.2 The
reasoning behind this is the way many LTs are typically developed, i. e., through
publicly funded project consortia in which academic or commercial organisations
participate. Such projects often result in concrete tools and technologies as well as
language resources, which can then be made available, among others, through ELG,
which allows representing and interlinking these project artefacts (LTs, LRs), the
projects that helped create these artefacts and the members of the respective project
consortia. Technically, project consortia can provide relevant metadata to create and
later edit and update their own project pages in ELG ensuring more visibility as well
as an additional dissemination channel for their projects’ outputs.
   In the second half of the ELG project’s runtime, corresponding activities in terms
of populating the ELG catalogue with information about companies, academic or-
ganisations and projects have been drastically increased so that, towards the end of
the project, ELG now includes convincing figures in terms of community members,
1   https://live.european-language-grid.eu/catalogue/?entity_type__term=Organization
2   https://live.european-language-grid.eu/catalogue/?entity_type__term=Project
9 Language Technology Companies, Research Organisations and Projects                                                  173

                                              Consumers            Providers

500

450

400

350

300

250

200

150

100

 50

  0
          0



                     20



                                 20




                                              1



                                                          1



                                                                     21




                                                                                   1



                                                                                               2



                                                                                                          22




                                                                                                                        2
        02




                                             02



                                                         02




                                                                                 02



                                                                                              02




                                                                                                                      02
                  20



                              20




                                                                   20




                                                                                                       20
         2




                                           /2



                                                       /2




                                                                               2



                                                                                            /2




                                                                                                                    2
      4/



                7/



                            0/




                                                                 7/



                                                                            0/




                                                                                                     4/



                                                                                                                 7/
                                          1



                                                      4




                                                                                           1
  /0



                /0



                            /1



                                       /0



                                                   /0



                                                               /0



                                                                            /1



                                                                                        /0



                                                                                                     /0



                                                                                                                 /0
01



              01



                          01



                                      01



                                                  01



                                                              01



                                                                          01



                                                                                       01



                                                                                                   01



                                                                                                               01
Fig. 2 Number of ELG consumer and provider accounts over time



projects and also active users of the platform. At the time of writing, ELG lists more
than 13,000 metadata records on tools and services, resources, organisations and
projects. Figure 1 shows the corresponding development of the ELG catalogue and
its population over time, differentiated by type of entry.
    Not only the number of resources and organisations listed in ELG is constantly
growing. In addition, the number of users is rising continuously. The number of
ELG users of the consumer category who have a registered a user account went up
significantly at the end of April 2020, after the first official release to the public, and
has grown further ever since. The number of ELG users of the provider category,
i. e., users with the right to integrate metadata, tools and resources in ELG, is also
increasing continuously, albeit more slowly, as can be expected (see Figure 2).
    As encouraging as this development is, ELG is still at the beginning. The platform
has been designed in such a way that it can be actively used by the community and
that it can grow. To achieve this goal of a true one-stop shop for the whole European
LT community, it is necessary to steadily expand the consumer and provider base
and monitor as well as reflect all changes and new developments in the European LT
landscape. Only with this momentum will the desired snowball effect be generated
eventually, which ultimately helps ELG to achieve sustainable success from which
all stakeholders can benefit.
174                Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou

2 The European Language Technology Landscape

One key characteristic of the European Language Technology landscape is its ex-
treme fragmentation, which has been mentioned repeatedly throughout the years, as,
for example, in the META-NET White Paper Series (Rehm and Uszkoreit 2012), in
the META-NET Strategic Research Agenda (Rehm and Uszkoreit 2013; Rehm et al.
2016), in the Final study report on CEF automated translation value proposition in
the context of the European LT market/ecosystem (Vasiljevs et al. 2019) or in the
various reports of the European Language Equality project (especially see Aldabe
et al. 2022). In fact, this extreme fragmentation is one of the main reasons why the
ELG platform has been developed in the first place because the fragmentation is gen-
erally perceived as one of the main reasons why the European LT community has
been unable to unleash its full potential.
   The analysis in the CEF LT Market study (Vasiljevs et al. 2019) shows that Eu-
ropean LT vendors are often SMEs with local or regional, often highly specialised
solutions. In the study, 473 companies were collected that are active in EU member
states in the domain of LT and that fully qualify as LT vendors. According to the
research, the total size of the LT industry within the EU member states (plus Iceland
and Norway) was estimated at approx. 800M€ in the year 2017. In the study sample
investigated, only 14% of the LT vendors had a revenue of more than €10M, whereas
almost half of them (48%) had a revenue below €1M. In terms of size, 52% of the
companies had between 10 and 99 employees, and 26% had less than 10 employees,
both combined representing nearly 80% of the 473 companies studied. Only 44% of
the EU companies in this sample received external funding or venture capital.
   Consequently, the global LT and NLP market continues to be dominated by large
technology enterprises from the United States and Asia which establish “data-driven
intellectual monopolies” (Rikap and Lundvall 2020) – in that regard, large compa-
nies are the exception in Europe. However, these big non-European LT providers
have certain deficiencies regarding under-resourced languages, customisation needs,
as well as security and privacy requirements which is a frequently expressed demand
from corporate clients and European administrations (Overton 2017).
   Despite the fact that the LT market is relatively small when compared to the gen-
eral IT market at large, it is a market with strong competition, which is one of the rea-
sons why many LT developing companies tend to focus on highly specialised niche
markets with less intense competition. This, however, affects profitability, which is,
on average, rather low and margins are compressed. On the other hand, LT can also
be considered a growing market: today, (potential) customers have more awareness
of the benefits of LT, which is also due to marketing activities of large international
players. From a local vendors’ point of view, the large technology enterprises help
create a market awareness that simply did not exist ten years ago. Nevertheless, these
companies are also the toughest competition of the European LT community as they
tend to offer high-quality LT software free of charge or for very low prices, which
European SMEs usually cannot afford to do.
   The STOA study Language equality in the digital age – Towards a Human Lan-
guage Project (STOA 2018), which examines the causes of language barriers in
9 Language Technology Companies, Research Organisations and Projects                         175

Europe and formulates recommendations for policies to overcome these barriers,
mentions among its 11 key recommendations the need for a pan-European LT Plat-
form of resources and services and ELG has stepped up to solve this problem (also
see European Parliament 2018). ELG not only brings together LT resources from all
over Europe supporting almost all European languages (although ELG is not lim-
ited to European languages) but ELG also has the ambition to unite the European
LT community behind these services, tools and resources using one shared umbrella
platform to create a common access point and marketplace from which all languages
and members of the community will eventually benefit (see Part III of this book).
   At the time of writing, ELG contains approx. 1,800 organisations operating in
the European LT sphere. One half of these organisations consists of companies, the
other half of universities and research groups (Figure 3).3

                                                                    Companies
                                                                    Academic Institutions

                                                                    Research Groups

                       406 (22,8%)




                                               890 (50,1%)



                      481 (27,1%)




Fig. 3 Distribution of organisations listed in ELG per type


   The quantitative distribution of LT developing organisations among the respec-
tive countries in Europe already hints at a strongly varying coverage of LT resources
for their respective national and regional languages. Whereas countries like the UK,
Germany or Spain are well or relatively well equipped with LT developing compa-
nies, smaller countries like Malta or Cyprus have only little representation in the
European LT community (see Figure 4).4 Figure 5 shows the geographical distribu-
tion in Europe of organisations listed in ELG.




3 Companies are commercial organisations, academic institutions are universities and research cen-

ters, research groups are sub-groups of academic institutions, e. g., faculties or departments.
4 In Figure 4, countries are ordered by decreasing number of organisations. The country with the

head office of the respective organisation is used as the organisation’s country.
176                        Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou

                            Companies                Academic Institutions         Research Groups


United Kingdom                             53                                             142         98
        Germany                                               104             48           46
           Spain                                64               49          35
          Poland                     39                 46 15
          France                               60      21   16
      Netherlands                               66 6 15
            Italy                     44 13            25

         Belgium                          47    17 7
          Ireland                   36 8 9
        Denmark                      38 5 8
          Austria                   37 6 6
         Bulgaria                 30 12 7
          Finland                   36 4 9
         Norway                     35 6 6
          Greece            16 15 15

         Portugal               24 12 6
         Hungary                23 11 7
         Sweden            14 7 12
      Switzerland          13 9 7
          Estonia               23 2 1
         Czechia          11 7 6
           Latvia         9 58
         Romania         5 5 12

          Iceland           16 1 3
        Lithuania          12 2 4
         Slovakia         9 45
           Serbia        55 6
          Croatia        54 3

         Cyprus         3 3 3
      Luxembourg         7 1 1
         Slovenia       3 1 4
           Malta        2 41
          Turkey        1 0 0

                    0                                   100                            200           300

Fig. 4 Distribution of organisations listed in ELG per type and country
9 Language Technology Companies, Research Organisations and Projects               177




Fig. 5 Organisations listed in ELG per country



3 Organisations in the European Language Grid

To bootstrap the ELG catalogue with as many LT developing European compa-
nies and academic organisations as possible, we decided on the following proce-
dure. First, together with the ELG National Competence Centres (see Chapter 11,
p. 205 ff.), we collected LT developing organisations semi-automatically and in a
decentralised way, i. e., on the national level (Section 3.1). Second, based on the re-
sults of this collection, metadata records were prepared that could be automatically
ingested into the ELG catalogue (Section 3.2). This resulted in the ELG catalogue
being populated with approx. 1,800 metadata records, i. e., pages, each of which de-
scribes one LT developing organisation with a basic profile. These organisation pro-
files can then be claimed by the rightful owners (Section 3.3), i. e., an organisation
described in such an ELG page can take over the maintenance of its own page and
enrich it with additional information, e. g., upload a logo, associate resources with
their organisation etc. (Section 3.4). This bootstrapping procedure enables members
of the European LT community to participate actively in ELG with their own organi-
sation within minutes. As a positive side effect, it enabled ELG – including its sister
project ELE – to produce a fairly detailed picture of the European LT landscape.
178                  Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou

3.1 Collecting the Members of the European LT Community

In order to populate ELG with organisations, we used our own databases, carried
out desk research and, most importantly, we involved the 32 National Competence
Centres (NCCs) to tap into their detailed knowledge of their respective countries’ LT
communities. Our general goal was to identify and to record, in a machine-readable
format, as many national and regional members of the European LT community as
possible so that ELG can eventually provide as complete and up to date a picture
as possible. In September 2020, this data collection task was conducted with NCC
Leads representing their countries and regions to ideally identify all companies and
academic organisations in the European LT community to be listed in ELG.
    To streamline the process, based on data gathered in various workshops, confer-
ences and other events over the last ten years, the ELG project team created lists of
organisations involved in LT activities in all European countries. Each entry in the
list contained, among others, the following information: organisation name, depart-
ment name, website, address (region, ZIP code, city, country) and LT areas in which
they are active. Each NCC Lead received the data records for their country, along
with detailed guidelines, and they were asked to check the data included in the list,
to correct the data if necessary (e. g., remove duplicates with similar names, correct
wrong names of organisations) and to complete them where possible, i. e., to fill in
blanks. Furthermore, the NCCs were asked to do their own research and provide
new, unlisted organisations. The goal was to find all relevant organisations of each
country that develop, market or sell LT in their countries. This way, the ELG con-
sortium wanted to ensure that in addition to well-known orgnaisations also start-ups
andyoung research groups are included in ELG.
    The feedback received from the NCCs was submitted to a comprehensive inter-
nal quality review by the ELG team, which resulted in the final dataset that reflects
a fairly complete representation of the relevant stakeholders and providers of Lan-
guage Technology and language-centric AI in Europe.5



3.2 Preparation and Integration of Metadata Records

The efforts of the NCCs and the ELG team for the collection of data regarding LT
organisations relevant for ELG resulted in two spreadsheets per country contain-
ing companies and research groups respectively. All entries were automatically con-
verted into XML files that are compliant with the ELG metadata schema as described
in Chapter 2. Furthermore, for columns corresponding to metadata elements that take
values from controlled vocabularies (e. g., LT area), we mapped the input to the val-
ues in the controlled vocabulary. This process also served as a sanity check during


5In this procedure, the regulations of the Data Protection Act were adhered to at any time and no
personal data have been published without the consent of the data owners.
9 Language Technology Companies, Research Organisations and Projects              179

which errors were identified and resolved. The procedure resulted in 1,740 XML
files, 867 for companies and 873 for research groups.
    The ELG life-cycle for the publication of individual resources includes a vali-
dation process aiming to ensure the quality of the metadata published in ELG (see
Chapter 2). For the import of the organisation-related XML files, we applied a dif-
ferent procedure that involved their bulk import with the assignment of the tag “im-
ported by ELG”. Metadata records marked as such do not go through a validation
process and are immediately published on ELG.



3.3 Claiming and Enriching Organisation Pages

Once the population of ELG with these entries was completed, a campaign was
launched inviting (via email) legitimate owners to claim, edit and curate the entries
of their own organisations. Since the pages created by the ELG team contained only
minimal information, the representatives of the organisations were invited to enrich
these pages with reliable and accurate content and also to start providing tools, ser-
vices and resources. In several email campaigns, we reached out to contact persons
identified by the NCCs and we informed them about the existence of their organ-
isations’ pages on ELG, also inviting them to take over the pages. To do so, the
legitimate owner can “claim” their organisation’s page as their own by clicking the
“Claim” button on the page (see Figure 6).




Fig. 6 Imported organisation page with a “Claim” button


    The claiming process can only be triggered by persons signed in with an ELG
account (with provider role). This step serves as a security mechanism ensuring cor-
rect and rightful authorisation of eligible persons. Once a request is made, the ELG
team checks its validity, which also includes checking the email address used to reg-
ister the ELG account, making sure that it belongs to the organisation, the page of
which is being claimed. Approval of the request entails that the entry is assigned
180                  Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou

to the claimant and returns to a status that it can be edited. The claiming person is
prompted by email that they can now start editing the metadata entry and ELG page.
Once edited, the page needs to be submitted to publication and the usual ELG valida-
tion process starts, i. e., the changes made to the resource are reviewed by the ELG
team and the entry is made publicly available again.



3.4 Organisation Pages in the European Language Grid

Organisation pages can include different tabs. The “Overview” tab includes a de-
scription of the organisation as well as an info box on the right with data such as
postal address and contact email as well as a link to the organisation’s own website.
This tab can also include keywords that describe the general domain and LT areas
an organisation addresses. ELG pages can also be exported in XML format. The tab
“Related LRTs & projects” lists all resources and technologies the respective organ-
isation has made available on ELG and the projects they are involved in. This helps
companies to promote their tools and resources and to show connections between
companies or research organisations and their research projects and corresponding
results. The “Related organisations” tab is especially important for academic institu-
tions and universities to reflect their relationship to other departments, faculties or
the umbrella organisation (usually the university). Figure 7 provides an example for
a page of an academic organisation. Figure 8 (p. 182) shows a company page.



4 Projects in the European Language Grid

ELG is also able to represent research projects, especially for the purpose of acknowl-
edging the funding that made the development of a technology or resource possible
and also to interlink projects with organisations and resources.6 ELG project pages
are structured in a similar way, but they are especially adapted to the characteristics
and metadata of a typical research project. In addition to information regarding the
start and end of the project, the info box also contains details on the funding agency,
the funding country, the type of project and the amount of funding provided. Besides
the project description and keywords, the “Overview” tab contains the list of consor-
tium partners, that are linked to their respective ELG pages if they exist. Again, the
tab “Related LRTs” lists all technologies and resources associated with or resulting
from the project. Two examples are shown in Figures 9 (p. 183) and 10 (p. 184).




6 At the time of writing, we are preparing a list with more than 500 projects that will be imported

into the ELG catalogue in the second half of 2022; this list was put together in a similar manner as
the list of organisations described in Section 3.1.
9 Language Technology Companies, Research Organisations and Projects              181




Fig. 7 Example ELG organisation page: Bangor University



5 Conclusions

The European Language Grid is meant to develop into the primary platform of the
European LT community. In addition to the technical resources, ELG also repre-
sents the actual members of this community: companies and research organisations
that develop LTs and related organisations. Our ambition is for ELG eventually to
represent all companies and all research organisations active in the European LT
community. In order for ELG to function as a marketplace for European LT, it also
needs to provide core information about the European LT community (i. e., “yellow
pages” functionalities).
    ELG currently contains approx. 1,800 organisations active in the European LT
community. Like every similar repository or digital catalogue with certain artefacts,
one of the key challenges is the maintenance of the records and metadata entries,
i. e., keeping the entries up to date and also making sure that the community is fully
182                Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou




Fig. 8 Example ELG organisation page: Code Runners



represented. Our long-term vision for ELG is to become the primary platform of
the European LT community, which entails that all members of the European LT
community, both commercial and academic, immediately recognise the value, im-
portance and relevance of ELG and, thus, actively want to participate in ELG, keept-
ing their pages up to date, sharing technologies and resources, benefiting from this
European marketplace. Until this intended snowball effect is fully in place, i. e., all
stakeholders recognise the benefit ELG brings about and participate actively, we
will perform, even if time-consuming and logistically challenging, manual updates
of the ELG catalogue, we will continue to convert as many members of the com-
munity as possible into active users and also active providers of ELG and we will
increase our our outreach activities, encouraging more organisations to claim their
ELG pages. As soon as the snowball effect is in place and ELG is accepted as the pri-
mary platform of the European LT community, all participating organisations will
have a sufficient amount of intrinsic motivation to maintain their ELG pages and
to keep their information, technologies and resources up to date. At this time, ELG
strives to be an established player, which is known throughout the community so
9 Language Technology Companies, Research Organisations and Projects           183




Fig. 9 Example ELG project page: EMBEDDIA (Overview)



that also new companies are attracted by and to ELG. In addition to simplifying the
claim process, the attractiveness of ELG will be further enhanced through increased
community-related promotions, new features and improved offerings.
184                 Georg Rehm, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou




Fig. 10 Example ELG project page: EMBEDDIA (Related LRTs)



References

Aldabe, Itziar, Georg Rehm, German Rigau, and Andy Way (2022). Deliverable D3.1 Report on
   existing strategic documents and projects in LT/AI (second revision). Project deliverable; EU
   project European Language Equality (ELE); Grant Agreement no. LC-01641480 – 101018166
   ELE. URL: https://european-language-equality.eu/wp-content/uploads/2022/06/ELE___Deliv
   erable_D3_1__second_revision_2.pdf.
European Parliament (2018). Language Equality in the Digital Age. European Parliament reso-
   lution of 11 September 2018 on Language Equality in the Digital Age (2018/2028(INI). URL:
   http://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.pdf.
Overton, David (2017). Next Generation Internet Initiative – Consultation. URL: https://ec.europ
   a.eu/futurium/en/system/files/ged/ec_ngi_final_report_1.pdf.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic,
   Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez
   Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek,
   Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Al-
   bina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De
   Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek,
   Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur
   Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás
   Váradi, Kadri Vider, Andy Way, and François Yvon (2020). “The European Language Technol-
   ogy Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communi-
   cation in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation
   Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christo-
   pher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani,
   Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325.
   URL: https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
9 Language Technology Companies, Research Organisations and Projects                          185

   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Lan-
   guages in the Digital Age. 32 volumes on 31 European languages. Heidelberg etc.: Springer.
Rehm, Georg and Hans Uszkoreit, eds. (2013). The META-NET Strategic Research Agenda for
   Multilingual Europe 2020. Heidelberg, New York, Dordrecht, London: Springer. URL: http://w
   ww.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf.
Rehm, Georg, Hans Uszkoreit, Sophia Ananiadou, Núria Bel, Audronė Bielevičienė, Lars Borin,
   António Branco, Gerhard Budin, Nicoletta Calzolari, Walter Daelemans, Radovan Garabík,
   Marko Grobelnik, Carmen García-Mateo, Josef van Genabith, Jan Hajič, Inma Hernáez, John
   Judge, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Joseph
   Mariani, John McNaught, Maite Melero, Monica Monachini, Asunción Moreno, Jan Odjik,
   Maciej Ogrodniczuk, Piotr Pęzik, Stelios Piperidis, Adam Przepiórkowski, Eiríkur Rögnvalds-
   son, Mike Rosner, Bolette Sandford Pedersen, Inguna Skadiņa, Koenraad De Smedt, Marko
   Tadić, Paul Thompson, Dan Tufiş, Tamás Váradi, Andrejs Vasiļjevs, Kadri Vider, and Jolanta
   Zabarskaite (2016). “The Strategic Impact of META-NET on the Regional, National and Inter-
   national Level”. In: Language Resources and Evaluation 50.2, pp. 351–374. DOI: 10.1007/s1
   0579-015-9333-4. URL: http://link.springer.com/article/10.1007/s10579-015-9333-4.
Rikap, Cecilia and Bengt-Åke Lundvall (2020). “Big Tech, Knowledge Predation and the Implica-
   tions for Development”. In: Innovation and Development, pp. 1–28. DOI: 10.1080/2157930
   X.2020.1855825.
STOA (2018). Language equality in the digital age – Towards a Human Language Project. STOA
   study (PE 598.621), IP/G/STOA/FWC/2013-001/Lot4/C2. URL: https://data.europa.eu/doi/10
   .2861/136527.
Vasiljevs, Andrejs, Khalid Choukri, Luc Meertens, and Stefania Aguzzi (2019). Final study report
   on CEF Automated Translation value proposition in the context of the European LT market/e-
   cosystem. DOI: 10.2759/142151. URL: https://op.europa.eu/de/publication-detail/-/publicatio
   n/8494e56d-ef0b-11e9-a32c-01aa75ed71a1/language-en.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
                     Part III
ELG Community and Initiative
Chapter 10
European Language Technology Landscape:
Communication and Collaborations

Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens




Abstract The European Language Technology community is a diverse group of
stakeholders that is characterised by severe fragmentation. This chapter provides
an overview of the stakeholders that are relevant for the European Language Grid.
We also briefly describe our communication channels and strategies with regard to
the promotion of ELG. Furthermore, we highlight a few of the current projects and
initiatives and their relationship to and relevance for ELG, especially with regard
to collaborations. The overall goal of the target group-specific communication strat-
egy we developed is to create more and more uptake of ELG in the European LT
community, eventually creating a snowball effect.



1 Introduction

A key challenge to which ELG aims to respond is the ubiquitous fragmentation of the
European LT landscape. ELG addresses this problem by bringing together all Euro-
pean stakeholders under one umbrella platform (European Parliament 2018). While
Chapter 9 (p. 171 ff.) provides a high-level description of the LT companies, research
organisations and projects registered in ELG at the time of writing (including statis-
tics etc.), the present chapter focuses upon the stakeholder groups themselves. The
challenge of severe fragmentation (STOA 2018) has been taken up in ELG from
the very beginning on different levels by implementing various communication and
cooperation activities. Their aim has been to make ELG known in all relevant com-
munities within a short time in such a way that companies and research organisations
develop an active interest in ELG: the more providers offer high quality and attrac-
tive services and datasets, the faster ELG will become a central marketplace, which
in turn will benefit providers and users alike. This is why the ELG consortium pur-
sued a strategy through which the communication activities in combination with the
high quality of the platform and its services and resources, as well as fast and reliable

Georg Rehm · Katrin Marheinecke · Jens-Peter Kückens
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de,
katrin.marheinecke@dfki.de, jens_peter.kueckens@dfki.de


© The Author(s) 2023                                                                 189
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_10
190                                Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens

support services, produce this desired snowball effect. Some of the communication
and cooperation areas and activities were:
National Competence Centres (NCCs) Establish a network of 32 carefully se-
  lected National Competence Centres (see Chapter 11, p. 205 ff.).
ICT-29b) Projects Cooperate with the six EU projects funded through the call
  ICT-29b), i. e., Bergamot1 , Comprise2 , ELITR3 , Embeddia4 , GoURMET5 , Prêt-
  à-LLOD6 and their consortia and networks.
Major European Initiatives Collaborate with all relevant major European initia-
  tives including, among others, the European AI on Demand Platform7 , CLAIRE8 ,
  HumanE AI Net9 , CLARIN10 and others (see Chapter 2, Section 8, p. 27 ff., as
  well as Chapter 6, p. 107 ff.). These collaborations are described in more detail in
  Section 4 (p. 199 ff.) of the present chapter.
Events Organise local, regional and national events together with the ELG Na-
  tional Competence Centres (see Chapter 11, p. 205 ff.).
Talks and Presentations Give talks and presentations, especially at networking
  and outreach events, to decision-makers and multipliers, both in the industrial
  sector but also in scientific European conferences.
Open Calls and Pilot Projects Selected 15 pilot projects, with which we also col-
  laborated in terms of communication activities on their respective regional and
  local levels (see Part IV, p. 256 ff., of this book).
   Next up, Section 2 describes the European Language Technology community in
more detail, focusing upon the different stakeholder groups. A key driver of success
of the ELG initiative is this support and buy-in from the stakeholder community
including the uptake of the platform. In addition to these networking activities in
the project, several public communication channels have been established. Under
the umbrella brand “European Language Technology”, ELG and its sister project
European Language Equality (ELE, see Rehm and Way 2023) have started address-
ing the stakeholders and initiatives listed above, community members outside these
networks and the wider public in order to provide them with news about relevant
project developments, events and updates on ELG features, among others. For this
purpose, social media profiles on Twitter and LinkedIn were established. We also set
up an email newsletter, which was initially published on a monthly basis and later
on changed to a biweekly schedule. These communication channels, their purpose,
effectiveness and the content shared through them is further detailed in Section 3.
1 https://browser.mt
2 https://www.compriseh2020.eu
3 https://elitr.eu

4 http://embeddia.eu

5 https://gourmet-project.eu

6 https://pret-a-llod.github.io

7 https://www.ai4europe.eu

8 https://claire-ai.org

9 https://www.humane-ai.eu/workpackages/

10 https://www.clarin.eu
10 European Language Technology Landscape: Communication and Collaborations        191

2 Stakeholders of the European Language Grid

For our main groups of stakeholders we defined their specific relationships with ELG
and how we will communicate and engage with them in terms of communication
channels but also in terms of messages, considering our overall communication goals.
Most target groups also interact with ELG in one way or another, which is why they
are, in most cases, not just passive audiences but also immediate stakeholders within
the ELG community (Rehm et al. 2020c, 2021). In the following, all stakeholder
groups are defined including aspects relating to communication.



2.1 Language Technology Providers

The interests of LT providers (see Chapter 9 for specific numbers) are different from
those of LT users, which is why specific communication formats need to be applied.
Typically, commercial providers of LT (also see Vasiljevs et al. 2019) want to show-
case their products and promote their solutions and services or their company and –
on a more abstract level – they look for an appropriate marketplace in which they can
participate. In contrast to other target groups, their interactions with ELG are active
and direct. In order to upload or offer a service or tool via ELG, they need specific
technical information and an understanding of how ELG works. This demand is met
through various forms of communication, including a technical documentation with
clear and in-depth explanations of ELG’s functionalities, based on which video tu-
torials were prepared. These videos are shared through all communication channels.
Furthermore, blog articles explain specific ELG features to LT providers and short
announcements of new features are included in the newsletter.
   The more ELG meets business requirements, the more likely LT providers are
to actively use and promote it and to exploit it as an additional or sales channel or
even as their preferred marketplace. Our communication activities addressing LT
providers uses a marketing tonality and promotes the advantages of the ELG initia-
tive. We have also reached out repeatedly to LT providers, inviting them to send
in their questions or feedback with regard to their experience with ELG, including
missing features or suggestions for improvement.
   In many cases research centres and universities are also LT providers, but their
interest is usually not a monetary but a research-driven one. This stakeholder group
provides larger or smaller datasets or perhaps tools or rudimentary, experimental ser-
vices that have evolved from research projects rather than robust, production-ready
services that can be directly monetised. For researchers, sharing their results, the
further development of tools and the exchange with other researchers is the main
driver to use ELG. Finding datasets and tools bundled in one place, they can test
functionalities in the development phase and provide feedback. Ideally, they spread
the word about ELG in scientific articles or in academic conferences, and they can
be approached most easily through these channels. Public communication about the
usefulness of an easy-to-use platform for hosting, sharing and making available LT
192                                   Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens

services has proven an effective measure to attract researchers and initiate direct
communication about ELG.


2.1.1 Participants in the Open Calls – Pilot Projects

ELG tested the platform and demonstrated its usefulness with the help of 15 pilot
projects that it supported financially (see Part IV for more details). After their com-
pletion, the results were fed back into the ELG platform and community. To attract
companies or research centres to submit proposals and to make the selected pilot
projects known, communication activities were necessary. The open calls were pub-
lished and advertised through email campaigns, through the ELG website and on
multiple events. META-FORUM 2019 was the first public occasion where the open
calls have been publicly presented. This target group had a high demand for informa-
tion, therefore different channels like online trainings, videos, fact sheets and news
articles were implemented. The pilot projects were an important measure and in-
strument to make ELG known to a wider audience, communication in this area had
to be especially effective. The overwhelming response with more than 200 project
proposals in total proves that this strategy has worked out and the successful comple-
tion of all 15 selected pilot projects is evidence of successful communication (and a
thorough evaluation of the proposals). The results of the pilot projects were also pre-
sented in the virtual project expos at META-FORUM 2020, 2021 and 2022 and also
in a number of sessions and presentations.11 . Several pilot projects were showcased
on the ELG blog, presenting their activities but also the greater implementations of
making use of a pan-European LT platform, while the promotional videos created
for META-FORUM 2021 were featured in the newsletter and on social media.



2.2 Language Technology Users

The users of Language Technology are the most diverse and also, by a margin, the
biggest target group. Users include almost everyone – from students doing research
for a paper to job seekers in the LT field, to companies looking for a machine trans-
lation solution for the corporate website, just to mention a few examples. Members
of this group can look for information, try to find certain LT services or datasets or
they can be potential buyers or integrators of LT. This enormous group interacts with
ELG in the form of a user, consumer or potential customer (Rehm et al. 2021). This
stakeholder group is addressed by a communication strategy that treats this highly
heterogeneous audience as a homogeneous entity. The strategy involves focusing
on what is common in terms of customer needs and preferences instead of focus-
ing upon the differences of individual subgroups. Communication-wise, messages
promoting ELG are designed to have a general appeal, transmitting communication

11   https://www.european-language-grid.eu/meta-forum-2021/project-expo/
10 European Language Technology Landscape: Communication and Collaborations        193

primarily concentrated on the common needs such as information (ELG as an infor-
mation hub), match-making (a digital marketplace where supply meets demand) and
elimination of language barriers. The main communication channels include email
campaigns, social media posts, regular newsletter editions and the ELG website, but
also presentations and booths at industry events and conferences. For the target group
of LT users, we emphasised the possibilities of modern LT and its various applica-
tion areas. With this aim at stressing the importance of LT, for instance in terms of
cross-language communication, information access and automation in fields such as
research and the information industry, ELG intends to include both experienced and
potential LT users and informs about the important role LT plays in the digital world.


2.2.1 Public Administrations and NGOs

As an EU-funded project, ELG can also provide technologies to public administra-
tions, e. g., to the European institutions or national or regional administrations. For
example, ELG offers the language resources provided by ELRC, which were col-
lected and prepared to serve the needs of public services and administrations across
the EU, Norway and Iceland. At the same time, ELG wants to offer solutions to
non-governmental organisations that often have to pursue their goals with limited
financial resources. They can benefit from ELG as users of LT because they typ-
ically do not have the funding or technological know-how to find LT services or
tools that would suit their needs. Apart from more general forms of communication
like email campaigns or press releases, representatives of public administrations as
well as NGOs were invited to conferences like META-FORUM, where traditionally
one of the keynotes or opening addresses is given by a representative of the EU.


2.2.2 European Citizens – Members of the European Language Communities

This stakeholder group also includes the members of the European language com-
munities, i. e., all citizens of Europe, speaking and representing the official EU lan-
guages, regional or minority languages or any of the other languages spoken in
Europe. Communication, networking and surveying activities have primarily taken
place in the EU project European Language Equality (ELE). Through the tight col-
laboration between ELG and ELE we have been able to identify and exploit a number
of synergies, such as, among others, the EU Citizen Survey, through which we have
been able to learn more about how Europe’s citizens perceive Language Technology
and what kind of preconceptions and demands they have.
194                               Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens

2.3 Additional Horizon 2020 EU Projects

The projects supported through the Horizon 2020 call ICT-29-2018 (see Section 1)
are a special stakeholder group, as their consortia consist of research centres and uni-
versities as well as several industry partners. All projects dealt with domain-specific,
challenge-oriented LT and provided services, tools and datasets which are also show-
cased in ELG. As the projects were especially featured, they benefited from a higher
level of promotion (Rehm et al. 2021). Furthermore, they could make use of the
various features as well as of the vast community connected with ELG. Due to their
outreach into industry and academia, they functioned as excellent multipliers on mul-
tiple occasions. This target group proved to be very dynamic. We were engaged in
active, bi-directional communication with all consortia, e. g., via online meetings,
mutual invitations to each other’s events, or by advertising our projects on our web-
sites. Communication activities with this group had started in early 2019 and turned
out to be successful and vivid.



2.4 Major European Projects and Initiatives

An overarching platform like ELG can only be successful if it is recognised in and
used by the whole LT community. To establish ELG within the LT scene and to avoid
silo-thinking, we communicated closely and in a targeted manner with other major
projects and initiatives in the field including neighbouring areas, in an attempt to es-
tablish collaborations to create synergies and to share best practices. The ELG con-
sortium has cooperated directly with projects active in similar areas, with a similar
scope or working on similar topics, for example, the European AI on Demand Plat-
form (i. e., the AI4EU EU project), CLAIRE, CLARIN and various other projects
and initiatives. In addition to meetings, conferences like META-FORUM are an ap-
propriate format to share information and knowledge about each other’s activities. At
META-FORUM 2019, 2020, 2021 and 2022, many relevant projects and initiatives
showcased their plans and missions with the help of (virtual) expo booths, presenta-
tions or panel discussions. Members of the ELG consortium took every opportunity
to present ELG at conferences and public events to make the ELG concept and ap-
proach known in different sectors and industries. Existing networks like ELRC (Eu-
ropean Language Resource Coordination) and META-NET were tapped regularly
with regards to knowledge transfer and information exchange. Section 4 presents
these collaborations in more detail.



2.5 National Competence Centres

The National Competence Centres (NCC) played a crucial role for ELG’s communi-
cation and promotion activities (see Chapter 11). This stakeholder group also func-
10 European Language Technology Landscape: Communication and Collaborations       195

tioned as an abstract communication channel (Rehm et al. 2021). The NCCs were and
still are an important target group included in our communication channels, they also
served as multipliers of the ELG mission in their own regions and networks, through
mailings, social media posts, newsletter features, face-to-face meetings, conferences,
tutorials, training sessions and promotion events.



2.6 Public at Large

ELG is a public and inclusive platform that also attempts to address citizens inter-
ested in Language Technology. Members of civil society who browse the web and
visit ELG with no specific intention, also need to be addressed adequately. ELG
wants to promote the purpose and usability of LT beyond the borders of tech-savvy
stakeholders. Our communication activities aim not only at experts, but also at the
public at large. Appropriate communication channels are news and blog posts on
the website or videos on platforms such as YouTube. Social media channels, espe-
cially Twitter, are used to communicate updates and project results in a style that
intends to make them interesting and comprehensible to audiences beyond the core
LT community. Of interest are especially those ELG features that have broader so-
cial implications due to related topics in the news, which are perceived positively by
followers and readers with diverse professional and personal backgrounds.



3 Communication and Outreach Activities

As a project with several objectives, addressing various gaps in the European Lan-
guage Technology landscape and serving as a marketplace for research and industry,
ELG depends on the reputation and brand it has established. In addition to the plat-
form’s functionality and positive experience of users and providers interacting with
ELG, another relevant aspect is the ease of access with regard to content and in-
formation served by the platform. This refers to the information architecture of the
website, structure and quality of the technical documentation, responses to requests
directed at the ELG technical team as well as the overall communication strategy.



3.1 Communication Strategy

A communication strategy enables effective communication, in the case of ELG, this
relates to informing specific target audiences and the broad public about the project
and its results, gaining users and providers for ELG and representing ELG as a brand
for pan-European, multilingual and all-encompassing LT. The key elements of the
196                               Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens

communication strategy are the stakeholders, the overall goals of the communication,
the messages to communicate, the communication channels and the timing.
   We have two main communication goals that are closely aligned with our Unique
Selling Propositions (USP), which are the key differentiators from existing platforms
and offerings on the market. The success of the project and the ELG legal entity de-
pend on these two USPs to be widely known by all relevant stakeholders. This is why
the USPs became central messages for communication related to the uptake and pop-
ularity of ELG directed at potential users, participating organisations or stakeholders
to be won over.
        ELG is the primary platform for Language Technologies in Europe.
   ELG strives to become the most important and most relevant marketplace for Lan-
guage Technology in Europe – a one-stop LT shop in which all kinds of stakeholders
can find what they are looking for in terms of services, tools or resources provided
by research or industry. ELG is not only a directory of companies, universities and
research centres, but also contains a repository of thousands of datasets as well as
hundreds of functional tools and services. To make ELG useful and efficient for its
users, visibility and completeness are crucial. Moreover, to include as many relevant
players as possible, one of the main objectives is wide outreach.
         ELG provides Language Technology for Europe built in Europe.
   The second USP relates to the fact that LT from other continents or large global
technology corporations do not have intimate linguistic knowledge of Europe’s lan-
guages including their varieties (i. e., European developers of LT can serve European
demand in a better and more adequate way) and that legal aspects such as copyright
law, the General Data Protection Regulation (GDPR) and other policies are well con-
sidered by European players. The same goes for core European values like privacy,
confidentiality and trustworthiness. Users of ELG do not have to fear their data being
sold to third parties when using or offering services or resources on the platform.



3.2 Communication Campaign

The ELG communication campaign was developed and operationalised with com-
munication experts and continuously revised and expanded to meet the changing
conditions in the project and initiative. The initial situation was thoroughly analysed
and then appropriate marketing measures were planned using various communica-
tion channels including social media.


3.2.1 Communication Objectives

In addition to communicating the overall USPs of ELG to the relevant stakeholders,
all ELG communication activities are also geared towards supporting and realising
10 European Language Technology Landscape: Communication and Collaborations        197

ELG’s overall objectives. We distilled the overall objectives into three main mes-
sages, which are the underlying drivers in all ELG communication activities:
     • Support the Multilingual Digital Single Market by providing technologies for all
       European languages, which can be used by LT user stakeholders in all European
       countries to provide digital offerings, products and solutions that support all
       European languages relevant for the respective LT user stakeholder.
     • Establish and grow a vibrant community and help coordinate all European LT
       activities by becoming the primary platform for LT in Europe.
     • Develop and offer a powerful and scalable LT platform through a novel techno-
       logical approach, which enables innovations and synergies between commercial
       and non-commercial LT providers, buyers and users.


3.2.2 Communication Channels

For ELG, we selected four main areas of communication as the most relevant ones
for informing the main stakeholders and for marketing the project and the platform.
These four areas include the ELG website itself, the annual ELG conference (and
other events), the ELG social media channels and the ELG newsletter. While the
ELG website and the representation of the project at conferences and events was
primarily connected to the ELG brand, a more flexible approach was chosen for
social media and the newsletter.
    For the duration of the project, we maintained, in addition to the actual European
Language Grid, a separate ELG website for information, promotion and marketing
purposes. This website served as the face to the public with all relevant information
on the project itself and its wider setup, including, among others, the ELG architec-
ture, NCCs, annual conferences, newsletter and many other topics. It also included
a news section and a blog. This stand-alone website has been merged with the Euro-
pean Language Grid proper in the summer of 2022 so that all the relevant information
and the European Language Grid itself are now available at the same address.12
    From 2019 to 2022, ELG organised an annual conference (in 2022 in collabora-
tion with the EU project ELE). At these conferences, all relevant aspects of ELG
have been presented and discussed with relevant stakeholders. In addition, ELG par-
ticipated in many other conferences, workshops, industry events and expos. For more
details see Chapter 11, Section 3 (p. 210 ff.).
    In terms of social media channels, ELG uses Twitter and LinkedIn, their main
advantages are the potential to create a very wide reach and large number of fol-
lowers, thus enabling the project to address exactly the right stakeholders. Instead
of establishing dedicated channels for ELG, we decided to create one slightly more
general online identity, namely the umbrella-brand “European Language Technol-
ogy” (ELT), which serves as the name of the social media channels on Twitter13


12   https://www.european-language-grid.eu.
13   https://twitter.com/EuroLangTech
198                                    Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens

and LinkedIn14 . This brand serves as the outlet and interaction channel for ELG and
also for its sister project, European Language Equality. The ELT brand solves the
problem of communicating about two related but different projects through a single
channel, while tackling the topic of European Language Technology from a tech-
nological (ELG) and from a strategic perspective (ELE). The approach has proven
successful, as the ELT channels quickly gathered several hundred followers each.
Table 1 shows some key statistics on both platforms.


                                   Channel                       Twitter LinkedIn
              Followers (total)                                   666         818
              Posts (total)                                       316         150
              Posts per month (example: March 2022)                27          19
              Followers gained per month (example: March 2022)     63          75
              Profile visits per month (example: March 2022)    5,944         198
              Impressions per month (example: March 2022)      40,300       9,248
Table 1 European Language Technology: social media statistics (July 2022)



    The differences in the statistics of the two channels can be attributed to the fact
that while Twitter generally sees more activity in interaction and content recep-
tion, LinkedIn follows more professional conventions and goals. Its user base has a
slightly bigger overlap with the main target groups of ELG. This is why the LinkedIn
channel gained more followers even though there was less activity in comparison to
Twitter. Both channels are used for communicating a variety of contents in specified
formats: 1. new ELG platform features and quotes from reports are shared in specif-
ically designed images (known as shareables); 2. new blog articles are promoted
through links and quotes from the text; 3. upcoming events are promoted using, e. g.,
summaries of the programme and links to the event website; 4. related news from
other sources are shared through links or the retweet/sharing function, ideally with
a comment regarding the relevance for ELG.
    Following the concept of the ELT brand, a newsletter was established under the
same name, sharing information from and about ELG and ELE with a total of ap-
prox. 4,000 subscribers as of July 2022.15 We invited many of our existing contacts
to subscribe to the newsletter, we invite visitors of the website to subscribe to the
newsletter and we also share the newsletter on a regular basis through our other com-
munication channels. At first the newsletter was published on a monthly, later on a
bi-weekly basis. Each issue of the newsletter includes a general introduction to the
latest edition, including a list of highlights from social media and an overview of
press articles in relation to ELT, followed by dedicated sections on ELG and ELE.
The ELG section contains general news from and about ELG, a summary of the lat-
est ELG blog article, a few of the latest tools or services added to ELG and the latest
organisation that joined ELG (short profile and link to their ELG entry).

14   https://www.linkedin.com/company/74073406
15   https://www.european-language-technology.eu/elt-newsletter-archive/
10 European Language Technology Landscape: Communication and Collaborations             199

4 Collaborations with other Projects and Initiatives

ELG is a technology platform for the whole European LT community, which is why
collaboration played and plays an important role for the success and uptake of the
ELG initiative (Rehm et al. 2020c). While we are unable to list all projects and organ-
isations we collaborated with during the ELG project’s runtime, below we attempt to
list the major ones (see Chapter 2, Section 8, p. 27 ff., as well as Chapter 6, p. 107 ff.).
European Language Equality ELG and ELE16 worked together on many differ-
  ent topics. ELE collected more than 6,000 LT and LR records, which were in-
  gested in ELG, resulting in a substantial increase of the total number of available
  resources (Giagkou et al. 2022). The Digital Language Equality metric, developed
  by ELE (Gaspari et al. 2022; Grützner-Zahn and Rehm 2022), is based on the con-
  tents of the ELG catalogue and can be accessed through a dashboard developed
  by ELE and available on ELG.17 While ELE prepares the strategic agenda and
  roadmap towards digital language equality in Europe, ELG offers the appropriate
  platform for sharing and deploying these Language Technologies. The synergies
  between the projects were communicated through blog articles and our shared
  social media channels as well as our shared newsletter.
Open Calls and Pilot Projects ELG collaborated with the organisations behind
  the 15 selected pilot projects in terms of technical aspects and communication
  activities on their respective regional and local levels (see Part IV, p. 256 ff.).
ICT-29b) Projects ELG collaborated with the six EU projects funded through
  the Horizon 2020 call ICT-29b), i. e., Bergamot18 , Comprise19 , ELITR20 , Embed-
  dia21 , GoURMET22 , Prêt-à-LLOD23 and their consortia and networks, especially
  with regard to outreach and communication, coordination and making project re-
  sults available through ELG.
European AI on Demand Platform ELG cooperated with the European AI on
  Demand Platform through the EU project AI4EU.24 Topics include strategic
  and coordination aspects, the technical interoperability between both platforms
  (Rehm et al. 2020b), the preparation of an AI ontology and participation in out-
  reach and promotion events.
HumanE AI Net This EU network of excellence25 , which also belongs to the Eu-
  ropean AI on Demand Platform, aims at facilitating a European brand of trustwor-

16 https://european-language-equality.eu
17 https://live.european-language-grid.eu/catalogue/dashboard
18 https://browser.mt

19 https://www.compriseh2020.eu

20 https://elitr.eu

21 http://embeddia.eu

22 https://gourmet-project.eu

23 https://pret-a-llod.github.io

24 https://www.ai4europe.eu

25 https://www.humane-ai.eu/workpackages/
200                                 Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens

  thy, ethical AI that enhances human capabilities and empowers citizens and soci-
  ety to effectively deal with the challenges of an interconnected globalised world.
  ELG supports this initiative as language is a core topic in human-oriented AI.
  Many organisations involved in ELG are also active in HumanE AI Net through
  specific microprojects that focus on certain research questions, funded by the ini-
  tiative. HumanE AI Net and ELG collaborated with regard to joint outreach and
  promotion activities.
CLAIRE ELG and the Confederation of Laboratories for AI Research in Eu-
  rope26 , the world’s largest network for AI research, collaborated with regard to
  strategic and coordination topics. ELG, representing the language-centric AI land-
  scape, serves as a link between the LT and the AI communities. We also partici-
  pated in various joint events.
CLARIN ELG and the Common Language Resources and Technology Infrastruc-
  ture27 (Eskevich et al. 2020) collaborated with regard to strategic and technical
  aspects such as metadata harvesting (see Chapter 6) and events.
Microservices at your Service This CEF-supported EU project collects and de-
  velops a larger number of functional services, develops ELG-compatible contain-
  ers and makes these available through ELG.28 Additionally, the two projects col-
  laborated by participating in relevant outreach and training events.
NTEU and MAPA The two CEF-supported EU projects Neural Translation for
  the EU (NTEU)29 and Multilingual Anonymisation for Public Administrations
  (MAPA)30 have contributed a large number of tools and services to ELG (Garcı́a-
  Martı́nez et al. 2021). NTEU alone has provided hundreds of high quality machine
  translation models, which are now available through ELG.
WeVerify This EU project develops tools and technologies for the identification
  and verification of various types of news and media (Marinova et al. 2020).31
  Internally, the WeVerify tools make use of several ELG services.
ELRC The CEF-supported EU initiative European Language Resource Coordina-
  tion (ELRC)32 supports multilingual Europe, among others, by collecting publicly
  available language data from national public administrations and making them
  available to the European Union through the repository ELRC-SHARE (Lösch
  et al. 2018). ELG automatically harvests ELRC-SHARE, enabling the discovery
  of these resources through ELG. ELRC and ELG also collaborated in terms of
  joint communication and dissemination activities.
QURATOR The German project QURATOR has developed a technology plat-
  form and large number of tools, services and resources for several digital con-

26 https://claire-ai.org
27 https://www.clarin.eu
28 https://www.lingsoft.fi/en/microservices-at-your-service-bridging-gap-between-nlp-research

-and-industry
29 https://nteu.eu

30 https://mapa-project.eu

31 https://weverify.eu

32 https://www.lr-coordination.eu
10 European Language Technology Landscape: Communication and Collaborations          201

   tent curation use cases (Rehm et al. 2020a).33 Both projects, QURATOR and
   ELG worked together closely from the very beginning in terms of platform de-
   velopment, communication and dissemination, among others, through the annual
   QURATOR conferences. Many tools and resources created by QURATOR are
   available through ELG.
PANQURA This sister project of QURATOR focuses upon the application of
   QURATOR technologies to the COVID-19 pandemic, striving for more trans-
   parency in times of a global crisis.34 Among others, PANQURA has developed
   tools for the automated assessment of the credibility of online content, which are
   now available through ELG (Schulz et al. 2022).
OpenGPT-X and Gaia-X The German project OpenGPT-X develops large lan-
   guage models for the German language.35 The project is part of a group of AI
   projects that will test and deploy their project results through the emerging Gaia-
   X infrastructure.36 In Gaia-X, representatives from business, politics, and science
   are working together to create a federated and secure data infrastructure for Eu-
   rope, addressing the topic of data sovereignty in Europe. OpenGPT-X will not
   only make use of various resources available in and through ELG, the project will
   also extend ELG so that the platform is compatible with Gaia-X, i. e., OpenGPT-X
   will integrate the ELG platform into the emerging Gaia-X infrastructure.
NFDI4DS The project NFDI for Data Science and AI37 is part of the German
   NFDI initiative, which develops, with a total of approx. 20-25 projects, the na-
   tional German research data infrastructure.38 NFDI4DS will support all steps of
   the research data life cycle, including collecting or creating, processing, analysing,
   publishing, archiving, and reusing resources in Data Science and AI. In NFDI4DS,
   ELG will be integrated into the emerging NFDI infrastructure.
DataBri-X The EU project Data Process and Technological Bricks for expand-
   ing digital value creation in European Data Spaces (DataBri-X), which will start
   in October 2022, will develop a toolbox for data processing, data handling and
   data curation. The ELG platform will be used and also extended as one technical
   infrastructure in this project.
SciLake The EU project Democratising and Making Sense out of Heterogeneous
   Scholarly Content (SciLake), which will start in January 2023, will build upon
   the OpenAIRE ecosystem and European Open Science Cloud (EOSC) services to
   facilitate, among others, the creation, interlinking and maintenance of research-
   oriented knowledge graphs. In SciLake we will establish technical bridges be-
   tween the ELG platform and EOSC.




33   https://qurator.ai
34   https://qurator.ai/panqura/
35   https://opengpt-x.de
36   https://gaia-x.eu
37   https://www.nfdi4datascience.de
38   https://www.nfdi.de
202                                   Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens

5 Conclusions

As a community platform and initiative, ELG does not operate in a vacuum without
contact to other projects, groups or initiatives. On the contrary, it is of fundamental
importance that ELG is tightly integrated into the community with active use of the
ELG platform by many members of the community. To achieve this, ELG has de-
fined its target groups and cooperates closely with a number of relevant projects to
exploit existing synergies. These networking and collaboration efforts will be con-
tinued after the runtime of the ELG EU project, i. e., when the ELG legal entity is
established and operational. This approach is based on a clear communication strat-
egy with transparent goals that are pursued jointly with other key stakeholders.
   While we have been able to establish a shared platform for the European LT com-
munity during the 42 months of the ELG project, we now need to concentrate on en-
gaging with more and more stakeholders so that ELG is also utilised and expanded
by more and more active users, resulting in a European Language Grid from the
European LT community for the European LT community.



References

Eskevich, Maria, Franciska de Jong, Alexander König, Darja Fišer, Dieter Van Uytvanck, Tero
   Aalto, Lars Borin, Olga Gerassimenko, Jan Hajic, Henk van den Heuvel, Neeme Kahusk, Krista
   Liin, Martin Matthiesen, Stelios Piperidis, and Kadri Vider (2020). “CLARIN: Distributed Lan-
   guage Resources and Technology in a European Infrastructure”. In: Proc. of the 1st Int. Work-
   shop on Language Technology Platforms (IWLTP 2020, co-located with LREC 2020). Ed. by
   Georg Rehm, Kalina Bontcheva, Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasil-
   jevs. Marseille, France: ELRA, pp. 28–34. URL: https://aclanthology.org/2020.iwltp-1.5.
European Parliament (2018). Language Equality in the Digital Age. European Parliament reso-
   lution of 11 September 2018 on Language Equality in the Digital Age (2018/2028(INI). URL:
   http://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.pdf.
Garcı́a-Martı́nez, Mercedes, Laurent Bié, Aleix Cerdà, Amando Estela, Manuel Herranz, Rihards
   Krišlauks, Maite Melero, Tony O’Dowd, Sinead O’Gorman, Marcis Pinnis, Artūrs Stafanovič,
   Riccardo Superbo, and Artūrs Vasiļevskis (2021). “Neural Translation for European Union
   (NTEU)”. In: Proceedings of Machine Translation Summit XVIII: Users and Providers Track.
   Association for Machine Translation in the Americas, pp. 316–334. URL: https://aclanthology
   .org/2021.mtsummit-up.23.
Gaspari, Federico, Owen Gallagher, Georg Rehm, Maria Giagkou, Stelios Piperidis, Jane Dunne,
   and Andy Way (2022). “Introducing the Digital Language Equality Metric: Technological Fac-
   tors”. In: Proceedings of the Workshop Towards Digital Language Equality (TDLE 2022; co-
   located with LREC 2022). Ed. by Itziar Aldabe, Begoña Altuna, Aritz Farwell, and German
   Rigau. Marseille, France, pp. 1–12. URL: http://www.lrec-conf.org/proceedings/lrec2022/wor
   kshops/TDLE/pdf/2022.tdle-1.1.pdf.
Giagkou, Maria, Stelios Piperidis, Penny Labropoulou, Miltos Deligiannis, Athanasia Kolovou,
   and Leon Voukoutis (2022). “Collaborative Metadata Aggregation and Curation in Support of
   Digital Language Equality Monitoring”. In: Proceedings of the Workshop Towards Digital Lan-
   guage Equality (TDLE 2022; co-located with LREC 2022). Ed. by Itziar Aldabe, Begoña Altuna,
   Aritz Farwell, and German Rigau. Marseille, France, pp. 27–35. URL: http://www.lrec-conf.o
   rg/proceedings/lrec2022/workshops/TDLE/pdf/2022.tdle-1.3.pdf.
10 European Language Technology Landscape: Communication and Collaborations                   203

Grützner-Zahn, Annika and Georg Rehm (2022). “Introducing the Digital Language Equality Met-
   ric: Contextual Factors”. In: Proceedings of the Workshop Towards Digital Language Equality
   (TDLE 2022; co-located with LREC 2022). Ed. by Itziar Aldabe, Begoña Altuna, Aritz Farwell,
   and German Rigau. Marseille, France, pp. 13–26. URL: http://www.lrec-conf.org/proceedings
   /lrec2022/workshops/TDLE/pdf/2022.tdle-1.2.pdf.
Lösch, Andrea, Valérie Mapelli, Stelios Piperidis, Andrejs Vasiļjevs, Lilli Smal, Thierry Declerck,
   Eileen Schnur, Khalid Choukri, and Josef van Genabith (2018). “European Language Resource
   Coordination: Collecting Language Resources for Public Sector Multilingual Information Man-
   agement”. In: Proc. of the Eleventh International Conference on Language Resources and Eval-
   uation (LREC 2018). Miyazaki, Japan: ELRA. URL: https://aclanthology.org/L18-1213.
Marinova, Zlatina, Jochen Spangenberg, Denis Teyssou, Symeon Papadopoulos, Nikos Sarris,
   Alexandre Alaphilippe, and Kalina Bontcheva (2020). “Weverify: Wider and Enhanced Ver-
   ification for You Project Overview and Tools”. In: 2020 IEEE International Conference on
   Multimedia Expo Workshops (ICMEW), pp. 1–4. DOI: 10.1109/ICMEW46912.2020.9106056.
Rehm, Georg, Peter Bourgonje, Stefanie Hegele, Florian Kintzel, Julián Moreno Schneider, Malte
   Ostendorff, Karolina Zaczynska, Armin Berger, Stefan Grill, Sören Räuchle, Jens Rauenbusch,
   Lisa Rutenburg, André Schmidt, Mikka Wild, Henry Hoffmann, Julian Fink, Sarah Schulz,
   Jurica Seva, Joachim Quantz, Joachim Böttger, Josefine Matthey, Rolf Fricke, Jan Thomsen,
   Adrian Paschke, Jamal Al Qundus, Thomas Hoppe, Naouel Karam, Frauke Weichhardt, Chris-
   tian Fillies, Clemens Neudecker, Mike Gerber, Kai Labusch, Vahid Rezanezhad, Robin Schae-
   fer, David Zellhöfer, Daniel Siewert, Patrick Bunk, Lydia Pintscher, Elena Aleynikova, and
   Franziska Heine (2020a). “QURATOR: Innovative Technologies for Content and Data Cura-
   tion”. In: Proceedings of QURATOR 2020 – The conference for intelligent content solutions. Ed.
   by Adrian Paschke, Clemens Neudecker, Georg Rehm, Jamal Al Qundus, and Lydia Pintscher.
   CEUR Workshop Proceedings, Volume 2535. 20/21 January 2020. Berlin, Germany. URL: htt
   p://ceur-ws.org/Vol-2535/paper_17.pdf.
Rehm, Georg, Dimitrios Galanis, Penny Labropoulou, Stelios Piperidis, Martin Welß, Ricardo
   Usbeck, Joachim Köhler, Miltos Deligiannis, Katerina Gkirtzou, Johannes Fischer, Christian
   Chiarcos, Nils Feldhus, Julián Moreno-Schneider, Florian Kintzel, Elena Montiel, Víctor Ro-
   dríguez Doncel, John P. McCrae, David Laqua, Irina Patricia Theile, Christian Dittmar, Kalina
   Bontcheva, Ian Roberts, Andrejs Vasiljevs, and Andis Lagzdiņš (2020b). “Towards an Inter-
   operable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different
   Levels of Interoperability”. In: Proc. of the 1st Int. Workshop on Language Technology Plat-
   forms (IWLTP 2020, co-located with LREC 2020). Ed. by Georg Rehm, Kalina Bontcheva,
   Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasiljevs. Marseille, France, pp. 96–
   107. URL: https://www.aclweb.org/anthology/2020.iwltp-1.15.pdf.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic,
   Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez
   Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek,
   Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Al-
   bina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De
   Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek,
   Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur
   Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás
   Váradi, Kadri Vider, Andy Way, and François Yvon (2020c). “The European Language Technol-
   ogy Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communi-
   cation in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation
   Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christo-
   pher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani,
   Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325.
   URL: https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
204                                   Georg Rehm, Katrin Marheinecke, and Jens-Peter Kückens

   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Rehm, Georg and Andy Way, eds. (2023). European Language Equality: A Strategic Agenda for
   Digital Language Equality. Cognitive Technologies. Forthcoming. Springer.
Schulz, Konstantin, Jens Rauenbusch, Jan Fillies, Lisa Rutenburg, Dimitrios Karvelas, and Georg
   Rehm (2022). “User Experience Design for Automatic Credibility Assessment of News Con-
   tent About COVID-19”. In: Proceedings of HCI International 2022 – Late Breaking Papers.
   Accepted for publication. 26 June-01 July 2022.
STOA (2018). Language equality in the digital age – Towards a Human Language Project. STOA
   study (PE 598.621), IP/G/STOA/FWC/2013-001/Lot4/C2. URL: https://data.europa.eu/doi/10
   .2861/136527.
Vasiljevs, Andrejs, Khalid Choukri, Luc Meertens, and Stefania Aguzzi (2019). Final study report
   on CEF Automated Translation value proposition in the context of the European LT market/e-
   cosystem. DOI: 10.2759/142151. URL: https://op.europa.eu/de/publication-detail/-/publicatio
   n/8494e56d-ef0b-11e9-a32c-01aa75ed71a1/language-en.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 11
ELG National Competence Centres and Events

Katrin Marheinecke, Annika Grützner-Zahn, and Georg Rehm




Abstract The National Competence Centres (NCCs) in ELG are an international net-
work of 32 regional and national networks, lead by one regional/national representa-
tive. The 32 NCCs play a crucial role in ELG, they support the project by bringing in
their corresponding regional and national perspective and stakeholders, organising
ELG workshops and functioning as regional/national representatives. The chapter
explains why, despite a considerable coordination effort, it was worth putting this
network together. One important task carried out by the NCCs was to conduct re-
gional/national dissemination events and to participate in relevant regional/national
events and also in the annual META-FORUM conferences, organised by ELG.



1 Introduction

The diverse Multilingual Europe community, consisting of multiple stakeholder
groups, is an important component of our concept for the ELG (Rehm et al. 2020).
This heterogeneous set of stakeholder groups includes LT provider companies, LT
user/buyer companies, research centres and universities involved in LT research, de-
velopment and innovation activities, language communities, politics and public ad-
ministrations, national funding agencies, language service providers and translators
as well as the European citizen at large (Rehm et al. 2021).
   In this chapter we focus upon one specific part of the wider group of stakeholders
involved in the ELG initiative, i. e., the National Competence Centres (NCCs). The
ELG NCCs are an international network of 32 regional and national networks. Sec-
tion 2 describes the NCCs as well as the activities carried out together with the NCCs.
We also touch upon the setup procedure and the involvement of the NCCs. Confer-
ences, workshops and other events play a crucial role in disseminating the mission
and idea of the ELG initiative, as well as the platform itself. We involved the NCCs
to help spread the word about ELG on the regional and national levels. A major part

Katrin Marheinecke · Annika Grützner-Zahn · Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany,
katrin.marheinecke@dfki.de, annika.gruetzner-zahn@dfki.de, georg.rehm@dfki.de


© The Author(s) 2023                                                               205
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_11
206                               Katrin Marheinecke, Annika Grützner-Zahn, and Georg Rehm

of their involvement was, thus, devoted to the organisation of and participation in
conferences and events.
    Section 3 provides a brief overview of the events and conferences ELG organised
or participated in. We focus upon the four editions of the annual META-FORUM
conference series, which were organised by the ELG project (2019 until 2022). Due
to the impact of the COVID-19 pandemic, more than two thirds of all events planned
under the umbrella of the project had to be organised as virtual events.



2 National Competence Centres

The ELG National Competence Centres comprises 32 colleagues from all over Eu-
rope who all have their own strong regional and national networks, which com-
prise both industry and also research. For the setup of the NCCs, we benefited from
structures and instruments that have been set up by partners of the ELG consor-
tium starting in 2010 and that have been in active use since then, including META-
NET1 , META-SHARE (Piperidis et al. 2014)2 , CRACKER (Cracking the Language
Barrier, Rehm 2017)3 , EFNIL (European Federation of National Institutions for
Language)4 , ELRC (European Language Resource Coordination)5 and the META-
FORUM conference series (Rehm et al. 2016, 2020).
    In ELG, we made use of this large set of collaborators, established infrastructures
and communication instruments. The involvement in different projects and initia-
tives made it possible to set up a strong and representative network of National Com-
petence Centres with broad reach into regional and national networks already during
the ELG proposal preparation phase, i. e., before the project had actually started. We
invited more than 30 experts from the field that met a number of criteria (participa-
tion in relevant initiatives, members of academic organisations, good connections
to industry and research etc.) to participate in ELG as National Competence Centre
Leads with a clearly defined set of tasks and responsibilities.



2.1 Tasks and Responsibilities

The NCCs support the ELG project and initiative in various ways. This international
network of national networks not only significantly contributes to the population
of the ELG cloud platform with services, resources and data sets, it also plays an
important role for broadening the reach of the ELG project and initiative. Early in

1 http://www.meta-net.eu
2 http://www.meta-share.org
3 http://www.cracker-project.eu

4 http://www.efnil.org

5 https://lr-coordination.eu
11 ELG National Competence Centres and Events                                                  207

the project, the NCCs were asked to provide information and share their knowledge,
e. g., on national/regional information about services, datasets, resources, tools, tech-
nologies, research centres, experts, communities, companies, initiatives, projects etc.
Additionally, the NCCs have been crucial as multipliers who spread the word about
ELG and inform regional and national stakeholders and organisations about ELG
and its benefits. The NCCs also fed local needs, ideas and demands back to the ELG
to make sure that the ELG development takes the requirements of their constituency
into account. Moreover, the NCCs helped with general outreach and dissemination
activities, e. g., promoting events like the ELG conferences (Section 3) or the ELG
open calls (see Part IV) through their own established channels and networks.
    Whereas some activities could be performed by the NCCs with sending emails
and providing quickly accessible information, there are a number of tasks that re-
quired more effort. These included:
    • Organisation of a regional/national ELG workshop including agenda prepara-
      tion, advertising and promotion (web, social media, emails), identification of
      speakers and participants etc.
    • Participation in regional/national events (both scientific and industry confer-
      ences and workshops) on behalf of ELG to promote ELG and to interest relevant
      stakeholders from research and industry.
    • Participation in each of the annual ELG conferences (META-FORUM) in order
      to strengthen the LT community, support dissemination activities related to ELG
      and to foster discussion on current LT-related topics and trends.
    • Desk research and information gathering: Collection of relevant regional/na-
      tional information regarding funding programmes, national language (technol-
      ogy) development plans, AI strategies etc. with the overall goal of putting to-
      gether a comprehensive picture of the European LT landscape.6
    These tasks corresponded to the priorities of the ELG project consortium, but
were to be understood as recommendations rather than mandatory activities. The
actual selection of tasks to be organised by an NCC depended on the situation in
their country and was determined individually.
    We organised meetings with all NCC Leads approximately twice a year; originally
at least one annual meeting was meant to be held as a face-to-face meeting co-located
with the annual ELG conference in order to minimise travel efforts.7 Due to the
COVID-19 pandemic, further face-to-face meetings have been impossible, which
is why all follow-up meetings were held virtually. In the NCC meetings, all NCCs
Leads were asked to report briefly on the situation in their countries; furthermore,
planned activities and tasks foreseen were discussed. Contractual and organisational
matters could also be addressed.

6 With regard to these desk research activities, many synergies with the project European Language
Equality (ELE), which started in January 2021 and which included almost all NCCs as consortium
partners, have been identified and made use of, see https://european-language-equality.eu.
7 The first and, so far, only face-to-face meeting of all National Competence Centres took place on

7 October 2019, as a pre-conference meeting of META-FORUM 2019 in Brussels (see Figure 1).
208                              Katrin Marheinecke, Annika Grützner-Zahn, and Georg Rehm




Fig. 1 National Comptence Centre meeting in Brussels, Belgium (7 October 2019)



2.2 Role and Structure

The rationale behind setting up this international network of national networks was
to broaden the reach of the ELG consortium, to provide input with regard to the
linguistic situation in the different countries and to fuel the knowledge transfer and
sharing between national programmes and initiatives on the one hand and ELG on
the other. Since the EU Member States and other European countries have quite
diverse situations and individual language policies, a “one-size-fits-all” approach
would not have worked. It was crucial for ELG to have access to dedicated experts
in all countries to turn to and ask for input. Due to their vast personal connections,
the NCCs were ideally suited to make the ELG initiative known in the local markets
and in the research spheres of their home countries. It was a deliberate decision
to move forward only with academic organisations as NCCs in order to guarantee
independence from any commercial interests.
   The network of NCCs was compiled based on participation in existing structures
and initiatives (META-NET, ELRC NAPs, CLARIN etc.), taking into account sci-
entific standing and existing connections to industry and research. Table 1 lists the
NCC Leads, their country and affiliation. Figure 1 shows the NCC Leads at the NCC
kick-off meeting in October 2019 in Brussels, Belgium.



2.3 Visibility and Promotion

The NCCs provided valuable insights and feedback to the ELG project and initiative.
In return, the project consortium helped increase the visibility of the NCCs and their
institutions, for example, by promoting the NCCs and their organisations on the ELG
11 ELG National Competence Centres and Events                                                209

     Name and Country                                    Institution
Dagmar Gromann          AT   Zentrum für Translationswissenchaft, Universität Wien
Walter Daelemans        BE   Comp. Ling. and Psycholing. Res. Centre (CLiPS), Univ. of Antwerp
Svetla Koeva            BG   Institute for Bulgarian Language, Bulgarian Academy of Sciences
Marko Tadić             HR   Inst. of Ling., Faculty of Hum. and Social Science, Univ. of Zagreb
Dora Loizidou           CY   Department French and Modern Languages, University of Cyprus
Jan Hajič*              CZ   Inst. of Formal and Applied Linguistics, Charles University
Bolette S. Pedersen     DK   Centre for Lang. Tech., Dpt. of Nordic Research, Univ. of Copenhagen
Susanna Oja             EE   Competence Centre for NLP at the Institute of the Estonian Language
Krister Lindén          FI   Department of Digital Humanities, University of Helsinki
François Yvon           FR   Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS
Georg Rehm*             DE   Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
Maria Gavriilidou*      EL   Institute for Language and Speech Processing (ILSP), R. C. “Athena”
Tamás Váradi            HU   Research Institute for Linguistics, Hungarian Academy of Sciences
Eiríkur Rögnvaldsson    IS   School of Humanities, University of Iceland
Andy Way                IE   ADAPT Centre, Dublin City University (DCU)
Bernardo Magnini        IT   Human Language Technology, Fondazione Bruno Kessler (FBK)
Inguna Skadina          LV   Institute of Mathematics and Computer Science, University of Latvia
Albina Auksoriūtė       LT   Institute of the Lithuanian Language
Dimitra Anastasiou      LU   Luxembourg Institute of Science and Technology (LIST)
Michael Rosner          MT   Department Intelligent Computer Systems, University of Malta
Vincent Vandeghinste    NL   Instituut voor de Nederlandse Taal (INT)
Kristine Eide           NO   Norwegian Language Council
Maciej Ogrodniczuk      PL   Institute of Computer Science, Polish Academy of Sciences
António Branco          PT   Department of Informatics, University of Lisbon
Dan Tufiş               RO   Research Institute for AI, Romanian Academy of Sciences
Cvetana Krstev          RS   Faculty of Mathematics, Belgrade University (UBG)
Radovan Garabík         SK   L’udovít Štúr Institute of Linguistics, Slovak Academy of Sciences
Simon Krek              SI   Jozef Stefan Institute (JSI)
Marta Villegas          ES   Barcelona Supercomputing Center (BSC)
Jens Edlund             SE   Royal Institute of Technology (KTH)
Hervé Bourlard          CH   Idiap Research Institute
Kalina Bontcheva*       UK   Department of Computer Science, University of Sheffield
                             * Person belongs to the ELG consortium
Table 1 List of National Competence Centres



website.8 At the ELG conferences, the organisers dedicated several sessions to the
activities and concerns of selected NCCs and also addressed locally relevant aspects
in the conference programme. Furthermore, the NCC meetings served as discussion
platforms where the NCCs could promote their topics and exchange experience and
knowledge with colleagues from other countries.
    The fact that more than two thirds of the project’s runtime took place during the
global COVID-19 pandemic thwarted our collective plans for almost all face-to-face
events and workshops and severely affected our dissemination activities. However,
the shift to virtual formats has allowed interested people to attend conferences or
workshops who might not have attended otherwise because of the effort and expenses

8   https://www.european-language-grid.eu/ncc/
210                           Katrin Marheinecke, Annika Grützner-Zahn, and Georg Rehm

involved. In addition, online events have made it easier for the ELG team to provide
presentations and platform demos because there was no travel component involved.
   In June 2022, a new format was introduced for META-FORUM 2022, as this
final project conference was planned and conducted as a hybrid event, combining
the benefits of face-to-face and online conferences.



2.4 Operational Aspects

Operationally, DFKI as the coordinating partner of the ELG project prepared subcon-
tracts that specified the details of the cooperation between ELG and the NCCs. The
NCCs agreed to take over tasks in the interest of disseminating and promoting the
European Language Grid in their countries with the activities described. In return,
the ELG project reimbursed costs incurred for activities like:
  • Organisation of a regional or national ELG workshop.
  • Participation in the annual ELG conferences 2019 and 2022 (including costs for
    travel and accommodation).
  • Participation in regional or national conferences or other events to promote ELG
    (including costs for travel, accommodation and conference fees, if applicable).
  • Desk research, participation in surveys or questionnaires, communication and
    participation in virtual meetings.



3 Conferences and Workshops

ELG organised four annual conferences (META-FORUM 2019, 2020, 2021, 2022)
to present, discuss and widely disseminate the idea of a joint technology cloud plat-
form under the umbrella of the wider Multilingual Europe topic. While these confer-
ences are described in more detail in Section 3.1, the more focused ELG workshops
and additional events are described in Sections 3.2 and 3.3, respectively.



3.1 META-FORUM Conference Series

META-FORUM is the annual international conference on Language Technologies
in Europe, organised by ELG together with the META-NET Network of Excel-
lence, dedicated to fostering the multilingual European information society. Previous
META-FORUM editions were organised and financially supported through the EU
projects META-NET (T4ME; 2010, 2011, 2012, 2013) and CRACKER (2015, 2016,
2017). For the four editions 2019-2022, ELG took over the organisation of META-
FORUM, which at the same time serves as the annual ELG conference (Section 3.1.1
to 3.1.4). Table 2 shows all META-FORUM conferences so far.
11 ELG National Competence Centres and Events                                                       211

    The two main goals of META-FORUM are community building and outreach
to the wider European Language Technology community including research and in-
dustry. The ELG editions also had the goal of promoting the ELG initiative and also
ELG as the primary platform for Language Technology in Europe. The conferences
featured presentations and project expos with a special collaboration focus in order
to attract users and providers of LT. As the conference also functions as a dissemina-
tion and promotion platform, the ambition was to attract a large and varied number
of participants so that all relevant stakeholder groups were adequately covered.


      Year                     Conference Motto                         Location         Date
      2010   Challenges for Multilingual Europe                        Brussels, BE   Nov. 17/18
      2011   Solutions for Multilingual Europe                         Budapest, HU   June 27/28
      2012   A Strategy for Multilingual Europe                        Brussels, BE   June 20/21
      2013   Connecting Europe for New Horizons                        Berlin, DE     Sept. 19/20
      2015   Technologies for the Multilingual Digital Single Market   Riga, LV       April 27
      2016   Beyond Multilingual Europe                                Lisbon, PT     July 04/05
      2017   Towards a Human Language Project                          Brussels, BE   Nov. 13/14
      2019   Introducing the European Language Grid                    Brussels, BE   Oct. 08/09
      2020   Piloting the European Language Grid                       online         Dec. 01-03
      2021   Using the European Language Grid                          online         Nov. 15-17
      2022   Joining the European Language Grid                        Brussels, BE   June 08/09
Table 2 META-FORUM conference series




3.1.1 META-FORUM 2019

META-FORUM 2019 took place in October 2019 in Brussels.9 Its motto was “Intro-
ducing the European Language Grid”. The first session was dedicated to a presenta-
tion of the overall ELG project including a very first prototype of the platform, which
was demonstrated live on stage to the LT community and stakeholders from the EU
institutions for the very first time. After presentations of the three project areas (ELG
Platform, ELG Content, ELG Community), the open calls for pilot projects were an-
nounced including overall procedures and timeline. Another session focused on the
six LT research projects – ELITR, COMPRISE, Bergamot, EMBEDDIA, Gourmet
and Prêt-à-LLOD – funded under the Horizon 2020 call ICT-29b-2018 “A multilin-
gual Next Generation Internet”. Moreover, panel discussions and presentations on
LT and AI, on LT and digital public services, on news from the language communi-
ties as well as discussions with stakeholders from industry were organised. An expo
featured LT and relevant AI projects. Interest in the ELG platform was very high dur-
ing and after the conference, as evidenced by a high number of relevant discussions
during the sessions and in the breaks. These discussions provided valuable feedback
for the further development of the platform. All in all, feedback regarding the event

9   https://www.european-language-grid.eu/meta-forum-2019/
212                             Katrin Marheinecke, Annika Grützner-Zahn, and Georg Rehm

was overwhelmingly positive. Among others, stakeholders from minority languages
expect ELG to make significant breakthroughs, because they hope to find datasets
more easily. After the conference, we received several enquiries from companies
highly interested in including their services in the ELG platform.


3.1.2 META-FORUM 2020

Due to the global COVID-19 pandemic, META-FORUM 2020 had to be organised
as a virtual event, it was held in early December 2020.10 The motto of the conference
was “Piloting the European Language Grid” and it consisted of three half days of
presentations and panel discussions. META-FORUM 2020 received a lot of interest
with many fruitful conversations. Once again, a strong focus was on presenting the
wider landscape of currently funded projects in the area of LT and language-centric
AI but also the industry perspective was taken into account.
   Holding a conference that is supposed to foster community building and network-
ing as an online event, is a technical challenge. At the same time, the year 2020, with
many cancelled events, made it even more necessary to provide room for open ex-
change among colleagues and (potential) collaborators. This is why we decided to
organise a large project expo to enable in-depth discussions on different approaches
in the various projects.11 Like a face-to-face expo, not only the general idea of the
respective project was presented but the virtual booths also allowed for technical de-
mos, detailed explanations and profound exchange between visitors and the project
representatives. The expo featured 35 projects, all of which had their own dedi-
cated virtual meeting room. We also prepared web pages for each project with an
abstract, project poster and other visual materials provided by the projects. Thus,
visitors could study the material on the website or jump into the project meeting
rooms (i. e., the virtual expo booths) and stay in the meetings as long as they liked.
Apart from the first set of ten ELG pilot projects, the following projects participated
in META-FORUM 2020 with project booths: AI4MEDIA12 , Bergamot13 , COM-
PRISE14 , CURLICAT15 , DSDE16 , Elexis17 , ELG18 , ELITR19 , ELRC20 , EMBED-




10 https://www.european-language-grid.eu/meta-forum-2020
11 https://www.european-language-grid.eu/meta-forum-2020/meta-forum-2020-project-expo/
12 https://ai4media.eu

13 https://browser.mt

14 https://www.compriseh2020.eu

15 http://clip.ipipan.waw.pl/CURLICAT

16 https://www.cjvt.si/rsdo/en/project/

17 https://elex.is

18 https://www.european-language-grid.eu

19 https://elitr.eu

20 http://www.lr-coordination.eu
11 ELG National Competence Centres and Events                                       213

DIA21 , EUCPT22 , FedTerm23 , Gourmet24 , Lynx25 , MAPA26 , MARCELL27 , Mar-
ian28 , MeMAD29 , MT4All30 , NexusLinguarum31 , NTEU32 , Prêt-à-LLOD33 , PRIN-
CIPLE34 , PROVENANCE35 , QURATOR36 and WeVerify37 . In addition, members
of the ELG consortium provided demos of the platform and discussed questions and
ideas of (potential) users, providers and other interested parties.
    Interest in the ELG platform and initiative was considerably stronger than in 2019,
i. e., ELG was gaining more and more traction. META-FORUM has proven to be an
effective marketing and information channel for ELG. Discussions that took place
in the expo provided, again, a lot of valuable feedback and inspiration. This format
worked also very well to advertise the work of the ELG pilot projects. Despite the
challenging conditions, the conference was successful, while it is obvious that virtual
events can only emulate certain parts of a face-to-face event while others – the often
mentioned informal chats over coffee – are difficult to recreate in the virtual format.
While not every participant attended each session, the online format made it possible
for visitors to select only those sessions they are interested in and for which they had
sufficient time capacities. The virtual format made it possible for all participants to
attend including those with time and budget restrictions. A poll during the opening
session showed that more than half of the participants attended META-FORUM for
the first time in 2020. All META-FORUM 2020 sessions are available online.38


3.1.3 META-FORUM 2021

META-FORUM 2021 was the 10th edition of the conference series overall and the
second to take place online, given the ongoing pandemic situation.39 The motto of

21   http://embeddia.eu
22   https://www.presidencymt.eu
23   https://www.eurotermbank.com
24   https://gourmet-project.eu
25   https://lynx-project.eu
26   https://mapa-project.eu
27   http://marcell-project.eu
28   https://marian-project.eu
29   https://memad.eu
30   http://ixa2.si.ehu.eus/mt4all/
31   https://nexuslinguarum.eu
32   https://nteu.eu
33   https://pret-a-llod.github.io
34   https://principleproject.eu
35   https://www.provenanceh2020.eu
36   https://qurator.ai
37   https://weverify.eu
38   https://www.youtube.com/playlist?list=PLL1cFzaG0S5ghZz0HxO5TEUIdwrY7J8qJ
39   https://www.european-language-grid.eu/meta-forum-2021/
214                              Katrin Marheinecke, Annika Grützner-Zahn, and Georg Rehm

the conference was “Using the European Language Grid” and it highlighted the first
actual uses of the ELG platform. The setup of the conference was similar to the
structure used in 2020. However, the project expo was organised using the virtual
meeting space environment Gather.town to further stress the community aspect.40
   As the motto implies, in 2021 using and benefiting from ELG was the main focus.
To demonstrate this, five of the ten successfully finished pilot projects were featured
with their results. Furthermore, representatives from the European LT industry took
part in a panel to discuss their expectations towards and experiences with the ELG
platform. In the more hands-on ELG integration tutorial, potential users who were
considering to integrate their own tools and services into ELG had the chance to
learn how. All META-FORUM 2021 sessions are available online.41
   Overall interest in the conference was enormous and the number of participants
significantly exceeded that of the previous year. The feedback collected in the ses-
sion again proved to be a valuable source of information and was thoroughly evalu-
ated after the conference in order to further enhance the platform development.


3.1.4 META-FORUM 2022

While the virtual editions of META-FORUM 2020 and 2021 were very successful,
there are certain disadvantages of online-only events compared to face-to-face con-
ferences. This is why META-FORUM 2022 was organised as a hybrid event, com-
bining the advantages of flexibility and higher reach with the benefits of face-to-face
discussions. The onsite conference in Brussels was held under appropriate COVID-
19-safe conditions with approx. 100 participants from the European LT community
and representatives of the European Institutions. Several hundred participants at-
tended the conference online.



3.2 ELG Workshops

ELG is committed to community building and collaborating with relevant initiatives
on the European level as well as forming its own network of networks (Section 2).
The network of 32 NCCs acts as local and national bridges to the ELG initiative
and cloud platform. Accordingly, dedicated workshops with and for the national
LT communities have been a crucial task the NCCs were asked to fulfil.42 These
workshops were organised with the goal of making ELG known all over Europe.
   Usually the workshops were organised as individual events by each NCC. In some
cases, they were co-hosted by several NCCs together, e. g., the ELG workshop at
SwissText 2020 (hosted by the ELG NCCs Austria, Switzerland and Germany) or

40   https://www.european-language-grid.eu/meta-forum-2021/project-expo/
41   https://www.youtube.com/playlist?list=PLL1cFzaG0S5iDaCg2SliyA-4axKY0LfiQ
42   https://www.european-language-grid.eu/events/
11 ELG National Competence Centres and Events                                     215

              National Competence Centre(s)      Location            Date
              Switzerland, Austria, Germany   online             23 June 2020
              Lithuania, Latvia, Estonia      Kaunas, LT         21 Sept. 2020
              Poland                          online             27 Oct. 2020
              Finland                         online             15 Dec. 2020
              Germany                         online             20 April 2021
              Austria                         online             11 May 2021
              Switzerland, Austria, Germany   online             14 June 2021
              Belgium, Luxembourg             online             08 July 2021
              Spain                           online             23 Sept. 2021
              Czech Republic, Slovakia        online             18 Oct. 2021
              Denmark                         Copenhagen, DK     16 Nov. 2021
              Netherlands                     online             03 Dec. 2021
              France                          online             08 Feb. 2022
              Bulgaria                        online             11 Feb. 2022
              Serbia                          online             11 March 2022
              Norway                          Oslo, NO           16 March 2022
              Romania                         online             24 March 2022
              Slovenia                        online             27 May 2022
              United Kingdom                  online             17 June 2022
Table 3 Workshops organised by the National Competence Centres



the ELG workshop of the Baltic NCCs of Lithuania, Latvia and Estonia that was
co-located with the Baltic HLT conference in 2020.
   Since all workshops were held during the pandemic, almost all were online events
that usually attracted between 25 and 100 participants. Depending on the country and
target audience of the workshop, they either had a more informative or a more techni-
cal spin, or a combination of both. In an introductory talk by the project coordinator
or a partner of the consortium, ELG and its history, its goals and current status was
presented. In a separate presentation, the technical setup of the platform and its of-
ferings were explained. After that, the NCCs either organised discussion panels or
invited speakers from industry to emphasise the demands and expectations towards
ELG. Especially these talks often spurred interesting and inspiring discussions and
provided valuable feedback for the ELG consortium. In various workshops, a short
hands-on tutorial session was included in which a member of the technical ELG team
explained how to make available services or resources through ELG. Many of the
ELG NCC workshops are available online.43 Table 3 lists all NCC workshops.



3.3 Additional Conferences

Representatives of the ELG consortium took the opportunity to promote the plat-
form and the initiative at numerous occasions throughout the run-time of the ELG
project. In addition to local events, ELG was also present with talks and papers at
43   https://www.youtube.com/channel/UCarEHmsWT2JslcvvWkbhL4A
216                                Katrin Marheinecke, Annika Grützner-Zahn, and Georg Rehm

more than 50 different European and international conferences, such as LT4ALL
(2019), LREC 2020, AI Boost (2021), European Big Data Value Conference (2021),
Fachtagung Maschinelle Verfahren in der Erschließung (Deutsche Nationalbiblio-
thek, 2021) and Wales Academic Symposium on Language Technologies (2022).



4 Conclusions

The collaboration with the National Competence Centres was successful. The impact
they have had in their countries to promote ELG cannot be overstated. Also, the
NCCs’ expert knowledge of language resources in their regions and their contacts to
representatives from industry and research have been and continue to be extremely
useful. Although the formal contracts with the NCCs will expire at the end of the
project, we will make an effort to maintain good working relationships with these
experts in the future and, if possible, to intensify the work again in future projects.
   Under the umbrella of the ELG legal entity we will continue to organise events
and workshops in the coming years to demonstrate new developments and to seek
contact with the communities in the various European countries and regions in or-
der to further promote networking. The annual META-FORUM conference is an
established brand and will continue to be an important activity to bring stakeholders
together and counteract the fragmentation of the European LT community. Experi-
ences from the last years with different meeting formats have significantly extended
the spectrum of what is possible.



References

Piperidis, Stelios, Harris Papageorgiou, Christian Spurk, Georg Rehm, Khalid Choukri, Olivier Ha-
   mon, Nicoletta Calzolari, Riccardo del Gratta, Bernardo Magnini, and Christian Girardi (2014).
   “META-SHARE: One year after”. In: Proceedings of the 9th Language Resources and Evalu-
   ation Conference (LREC 2014). Ed. by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck,
   Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios
   Piperidis. Reykjavik, Iceland: ELRA, pp. 1532–1538. URL: http://www.lrec-conf.org/proceed
   ings/lrec2014/pdf/786_Paper.pdf.
Rehm, Georg, ed. (2017). Language Technologies for Multilingual Europe: Towards a Human Lan-
   guage Project. Strategic Research and Innovation Agenda. CRACKER and Cracking the Lan-
   guage Barrier federation. URL: http://cracker-project.eu/sria/.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic,
   Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez
   Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek,
   Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Al-
   bina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De
   Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek,
   Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur
   Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás
   Váradi, Kadri Vider, Andy Way, and François Yvon (2020). “The European Language Technol-
11 ELG National Competence Centres and Events                                                 217

   ogy Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communi-
   cation in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation
   Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christo-
   pher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani,
   Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325.
   URL: https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Rehm, Georg, Hans Uszkoreit, Sophia Ananiadou, Núria Bel, Audronė Bielevičienė, Lars Borin,
   António Branco, Gerhard Budin, Nicoletta Calzolari, Walter Daelemans, Radovan Garabík,
   Marko Grobelnik, Carmen García-Mateo, Josef van Genabith, Jan Hajič, Inma Hernáez, John
   Judge, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Joseph
   Mariani, John McNaught, Maite Melero, Monica Monachini, Asunción Moreno, Jan Odjik,
   Maciej Ogrodniczuk, Piotr Pęzik, Stelios Piperidis, Adam Przepiórkowski, Eiríkur Rögnvalds-
   son, Mike Rosner, Bolette Sandford Pedersen, Inguna Skadiņa, Koenraad De Smedt, Marko
   Tadić, Paul Thompson, Dan Tufiş, Tamás Váradi, Andrejs Vasiļjevs, Kadri Vider, and Jolanta
   Zabarskaite (2016). “The Strategic Impact of META-NET on the Regional, National and Inter-
   national Level”. In: Language Resources and Evaluation 50.2, pp. 351–374. DOI: 10.1007/s1
   0579-015-9333-4. URL: http://link.springer.com/article/10.1007/s10579-015-9333-4.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 12
Innovation and Marketplace: A Vision for the
European Language Grid

Katja Prinz and Gerhard Backfried




Abstract This chapter provides a comprehensive overview of innovation and the
ELG marketplace as core elements for the generation of value and the creation of
an active, attractive and vibrant community surrounding the European Language
Grid. Innovation is an essential element in making ELG a credible and sustainable
undertaking. However, it does not happen by itself nor materialise in a vacuum. Con-
sequently, ELG provides a habitat for various kinds of innovation and a home for
the necessary community to put innovation into action. The marketplace is essential
for attracting participants supplying and demanding services, resources, components
and technologies on a European scale. Innovation and marketplace – as well as the
overall business model – are tightly connected and need to be developed and man-
aged in a joint manner. Clearly, this is not a one-off activity, but rather needs to be
carried out continuously and extend into the future. ELG is designed and created to
promote the excellence and growth of the European LT market, creating new jobs
and business opportunities and supporting European digital sovereignty. Encompass-
ing a wide array of technologies and resources for many languages spoken across
Europe and in neighbouring regions, it contributes to the Multilingual Digital Single
Market as a cross-European driver for innovation.



1 Introduction

The ELG marketplace and the kinds of innovations it enables form central elements
of ELG and its goal to become the one-stop-shop for Language Technology in Eu-
rope. These aspects are closely interlinked with a series of further topics concerning
the business aspects of ELG in a wider sense, none of which can be viewed in isola-
tion but rather need to be approached in a connected and holistic manner.
   Artificial Intelligence (AI), Natural Language Processing (NLP) and Natural Lan-
guage Understanding (NLU) are highly active areas of research and development

Katja Prinz · Gerhard Backfried
HENSOLDT Analytics GmbH, Austria,
katja.prinz@hensoldt.net, gerhard.backfried@hensoldt.net


© The Author(s) 2023                                                               219
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_12
220                                                    Katja Prinz and Gerhard Backfried

leading to novel applications on a continuous basis. Over time, new actors enter the
stage and change the course of events. In this highly dynamic landscape it is im-
perative to constantly monitor progress, remain alert and be able to adapt to newly
emerging trends. Consequently, any platform and strategy implemented on and by
AI/NLP/NLU need to remain flexible and open to change. Fundamental concepts
such as value-generation provide orientation across time and should form the base
of any strategies developed.
   Neither the marketplace nor innovation make sense without an underlying crowd
of committed actors, which drive the cycles of supply and demand, form the ingre-
dients of cooperation and consulting and are at the heart of creation and innovation.
Establishing and fostering this community who will take LT one step further thus
forms one of the most important tasks to be addressed by ELG.



2 Innovation

In today’s agile, interconnected and virtualised world, the paradigm of open innova-
tion (Chesbrough 2006), connecting many different disciplines, sectors and actors
in a non-linear fashion has gained considerable traction. Under this paradigm, inno-
vation takes place within as well as outside an organisation with knowledge flowing
in both directions. It allows different actors to collaborate and experiment across or-
ganisational boundaries, across different sectors and disciplines, and enables them to
dynamically produce innovation in a heterogeneous manner. Eco-systems like ELG
form a natural habitat for such activities and a powerful environment for innovation.
In this chapter, the concept of innovation is viewed from the angle of open innova-
tion, forming the most appropriate and promising approach for a platform like ELG,
rather than the silo’d and closed kind of innovation which is limited to individual
organisations. For innovation to occur, two fundamental ingredients need to be com-
bined: innovation = invention + adoption (Schrage 2004). Both of these factors must
be present for innovation to take place and to put it into effect in order to generate
new knowledge, to develop new products, services or processes. Any environment
or innovation-strategy consequently has to reflect both factors, balance efforts and
encourage and support both kinds of activities.



2.1 Significance of Innovation

Applications in the fields of AI, NLP or NLU reside in a highly competitive and
dynamic landscape. As technology leaps are produced in rapid succession and mar-
kets and opportunities expand, organisations can and should make use of internal as
well as external ideas and paths to market as they seek to advance their technology
(Chesbrough 2006). Justin Rattner, Intel’s former CTO evangelised the concept of
21st century industrial research where innovation is driven by teams of boundary
12 Innovation and Marketplace: A Vision for the European Language Grid                                 221

spanners that possess multidisciplinary skills. Online platforms such as ELG pro-
vide ideal multi-sided ecosystems for such teams, offering the means to link up and
collaborate and to unite a multitude of participants with the joint aim to create novel
products and services ready for swift adoption. However, beyond providing the tech-
nical framework, resources and tools, such platforms also foster the sharing and ex-
change of knowledge and ideas between participants. As a result of the increased
diversity and connectedness of actors, the generation of genuinely new knowledge
and more radical innovation is possible. Whether and to what extent these goals also
materialise in practice depends on a variety of factors, such as acceptance and open-
ness to a culture of open innovation that also supports the useful and selective sharing
of research results and data. If exercised successfully, open innovation has the po-
tential to eliminate barriers in research and development and generates a dynamic
environment that cannot be achieved with traditional methods.



2.2 Types of Innovation and Innovation Strategies

Innovation may span a wide spectrum concerning products, services, methods, busi-
ness models and even entire organisations. Figure 1 depicts different dimensions and
types of innovation and provides several examples for each kind.

           LEVERAGES EXISTING TECHNICAL COMPETENCES        REQUIRES NEW TECHNICAL COMPETENCES




    REQUIRES                    DISRUPTIVE                              ARCHITECTURAL
    NEW               Open-Source SW for SW companies           Personalized Medicine for Pharma
    BUSINESS         Video on Demand for Rental Services           Digital Imaging for Polaroid
                            Ride-Sharing Services                Internet Search for Newspapers
    MODEL




    LEVERAGES                    ROUTINE                                    RADICAL
                      Next-generation 3 Series for BMW          Biotechnology for Pharma-Industry
    EXISTING            New Index Fund for Vanguard           Jet-engines for Aircraft Manufacturers
    BUSINESS          New 3D Animated Movie for Pixar         Fiber-optic Cables for TelCo Companies
    MODEL




Fig. 1 Innovation landscape (Pisano 2015)


   Routine innovation (or incremental innovation) builds on an organisation’s exist-
ing technological competences and fits with its existing business model and customer
base. Routine innovation aims at improving existing products (or services) contin-
uously until the end of their life-cycles. It typically involves activities to improve
features, reduce costs or expand production lines and mechanisms. Architectural in-
novation combines technological and business model disruptions. Disruptive inno-
vation typically requires a new business model but not necessarily a technological
222                                                    Katja Prinz and Gerhard Backfried

breakthrough. For that reason, it also challenges, or disrupts, the business models of
other actors. Breakthrough innovation can be regarded as the more radical version
of disruptive innovation causing fundamental changes in the market through the in-
troduction of new products, methods or business models. These categories are not
clear-cut and overlap to some extent. However, the dimensions can serve to locate
different types of innovation when designing an innovation strategy. Aside from
these categories, innovation can also be characterised by the kinds and magnitude of
impact caused by it.
   Any innovation strategy needs to specify how the different types of innovation (as
outlined above) fit into the overall business strategy. It must map an organisation’s
value proposition for the defined markets and at the same time set realistic bound-
aries. Furthermore, the strategy must be clearly communicated in order to assure a
common goal for all participants involved, secure their commitment and to stream-
line activities between all partners. Innovation for innovation’s sake or for generic
goals such as “we need to be innovative” are neither sufficient nor effective. Pisano
(2015) emphasises the importance of these inter-connections by defining the term
“innovation strategy” as the “commitment to a set of coherent, mutually reinforcing
policies or behaviours aimed at achieving a specific competitive goal, promoting
alignment among diverse groups within an organisation, clarifying objectives and
priorities, helping focus efforts around them and specifying how various functions
will support it”. Innovation – and an innovation strategy – can neither be developed
nor executed in isolation, but need to be carried out in sync with the defined business
strategies of an organisation to be successful.



2.3 Open Innovation in the ELG Platform and Marketplace

Innovation does not take place in a vacuum, but is tightly connected to the vision,
business, marketplace and sustainability strategies aiming to establish and sustain
ELG as the primary marketplace for LT in Europe. The platform and community are
positioned at the centre around which these different strategies are aligned, support-
ing each other in the overall goal as depicted in Figure 2.
    ELG is a multi-sided and integrated platform and envisoned to function as an
innovation driver during the lifetime of the project as well as beyond. The platform
itself is complemented by a vibrant and active community of users and stakeholders.
These are a key ingredient in creating the critical mass required to make ELG an
established marketplace. Building and strengthening this community consequently
forms an essential element of the ELG innovation and communication strategies.
    Placing the platform and community at the core allows us to adopt an open and
collaborative approach to innovation, which needs to become an inherent element
(a process) of ELG. The principles of Open Innovation as coined by Chesbrough
(2006) form the over-arching theme of this continuous process. Figure 3 provides a
schematic overview of the actors and interactions which need to be aligned for inno-
vation and value creation. It is imperative that all groups are present and participate
12 Innovation and Marketplace: A Vision for the European Language Grid                                            223



                                                         Innovation
                                                          Strategy




                         Business                       Platform /                    Marketplace
                         Strategy                      Community                       Strategy




                                                       Sustainability
                                                         Strategy



Fig. 2 Strategies centred around the ELG platform and community



actively in the process. To attract and motivate these groups, targeted communication
is required.


                                                     Innovation
          Invention, Knowledge Transfer        Business Models, Positioning               Adoption




        LT Research               LT Vendors             Consultants/         LT Buyers               End-users
                                                         Integrators




                      Research Challenges       Gaps, Opportunities             Needs, Expectations, Feedback


                                               Market Intelligence

Fig. 3 ELG innovation cycle


   In line with the overall approach of ELG, in Figure 3 the process of innovation
spans the complete set of activities and actors from invention to adoption. The goal
to generate value within the scope of the business model forms the central element.
Continuous feedback regarding the needs, gaps, expectations and opportunities is
collected via the community, leading to further cycles, which need to be carried
out repeatedly and continuously. As a result of the continuous feedback mechanism,
strategies can be updated and the speed of adoption increased over time, hence al-
224                                                                                 Katja Prinz and Gerhard Backfried

lowing for more rapid cycles of innovation. Figure 4 shows four main dimensions
and associated issues to be addressed and considered regarding innovation in ELG.

                                                                 Processes
                                                 Products                    Services
                                                          Where to innovate?
                                                                 Focus                                Value chains

            Creation of market
                                                                                                             Standardization
       Extension of services

   Product portfolio extension                                                                                       Radical /
                                                 Channel
                                                                                                                     disruptive /
                                                         Brand                                                       breakthrough
Business creation
           Why innovate? Objective                                                                  Intensity How to innovate?
Business refinement
                                                                      Marketplace                                Incremental /
 Reducing costs                                                                                                  sustainable

  Improving efficiency                                                       Rapid Prototyping
                                                                                                                Improving features
      Customer experience
                                                                                                             Academia & Industry
         Product Performance
                                                             Boundaries                                Network / Community
                                     internal                                       With partners
                                                    With whom to innovate?
                                         Supply / push                       Demand / pull

Fig. 4 Dimensions of innovation


    For each dimension, several possible approaches are outlined. Together, they
form a portfolio of possibilities and opportunities which need to be monitored contin-
uously. Depending on the evolution of ELG, they may need to be adapted to chang-
ing conditions and (re-)prioritised. The innovation cycle shown in Figure 3 forms
the blueprint for these continuous activities.
    For the duration of the ELG project the most important element of innovation
is the creation of ELG itself. The use of a platform in the scope of LT as a multi-
sided marketplace, allowing participants to create value together by interacting with
each other represents an innovative business model (Still et al. 2017). Beyond the
platform itself, the creation of products (Section 2.3.1) and services (Section 2.3.2)
form two further promising alleys for innovation activities.


2.3.1 Products

ELG provides a large set of technological components and resources which provide
a broad basis for product offerings as individual products or product bundles. In
terms of innovation potential, both bundles as well as individually improved and
adapted LTs provide a wide range of opportunities. Different setups of where services
are hosted and run are provided by ELG to optimise resource usage and adapt to
the particular needs of customers. An extensive catalogue of tools and resources
provides a single point of entry and access to these tools and LTs.
12 Innovation and Marketplace: A Vision for the European Language Grid              225

2.3.2 Services

Two kinds of services are provided by ELG: services in the sense of running com-
ponents (technological services) and services in the sense of experts providing their
expertise (human services). In terms of the former, the services can be used individ-
ually or in combination (as chains of services) to create value-chains. Combination
and composition allow us to establish more complex workflows, enabling end-users
to benefit from the capabilities of individual providers without having to worry about
any inner workings or being locked in the products of an individual supplier. Corre-
sponding mechanisms regarding billing, licensing and support will provide a single
point of contact for customers. Regarding the latter, ELG provides a virtual agora, a
business-space for connecting stakeholders developing or deploying (complex) so-
lutions which require skills beyond that of individual actors. This includes services
of consultants and integrators who are crucial elements in broadening the adoption
(and hence boosting innovation) of LT. They are expected to act as enablers and mul-
tipliers for putting LT into practice, supporting their introduction into organisational
as well as business processes.


2.3.3 Further Aspects of Innovation

Regarding the intensity of innovation, ELG is expected to mainly operate on a level
of incremental, continuous innovation, improving existing features and extending
the portfolio of features. Through this continuous extension, new combinations of
services and products are expected to become available over time which allow the
implementation of new features. Linking different services and thus producing value
chains in a simple and transparent manner will allow for increased experimenta-
tion and thus for an agile environment for the creation of new features. Regarding
the boundaries of innovation, ELG will focus on the community and stakeholders
present on the platform. A catalogue of resources (services, corpora, datasets etc.)
as well as of LT experts, consultants and integrators provides a prime resource for
locating crucial resources for business. The strength, weight and activity of the com-
munity is one of the determining factors for the overall success and adoption of the
ELG and hence one of the gate factors for innovation. Regarding the objective of
innovation, the refinement as well as creation of business form viable alleys. The
above-mentioned manner of gradual and incremental innovation lends itself to var-
ious kinds of business refinement such as reducing costs, improving the efficiency
or product performance and improving customer experience.
   Business creation may take place via the platform and community and through
the creation of novel services or products via the combination of building blocks
offered by ELG. The creation of standards for resources, processing services and
interfaces can play an important role as it effectively decouples individual compo-
nents and vendors. In combination with the technical environment of ELG, this en-
ables increased resilience, scalability, composability and replaceability of compo-
nents, avoiding vendor lock-in situations. Furthermore, standardisation of these ele-
226                                                             Katja Prinz and Gerhard Backfried

ments will allow for a higher level of experimentation and show-casing and lower
the risk of failure in the development of innovative solutions.



3 Multi-sided Marketplace Approach

To date, there is no general digital umbrella platform for LT in Europe. The ELG plat-
form is designed to fill this gap: it is envisioned to serve as the comprehensive virtual
marketplace, where value is created for all its members in Europe and beyond. Based
on a multi-sided marketplace approach (see Figure 5), ELG will facilitate value and
business creation and efficient transactions coupled with large developer ecosystems
that build innovative technologies and services on top of a digital platform in an open
and agile manner. The advantage of this approach lies in the nature of multi-sided
marketplaces as enablers of transactions driving positive network externalities. They
make it easier and more efficient for the participants from diverse markets to inter-
act with each other, as the friction between different contact points is reduced. In
addition, these interactions increase the value created together which almost comes
naturally due to the network effects. A platform becomes more attractive to poten-
tial new users the more users meet and interact on it. In other words, value increases
for all participants when more users actively use the platform (Sánchez-Cartas and
León 2021). As a marketplace, ELG is designed to make it easy and efficient for
participants to connect and exchange ideas and products. These can be as diverse
as language resources, technologies, services, components, expertise, innovation or
even information. The distinctive feature of the multi-sided approach is that the mar-
ketplace enables direct interactions between two or more sides, who can be both –
product suppliers and demanders at the same time. In other words, value creation is
two-way and continuous.



                                                                      TECHNOLOGIES,
                  COMPENSATE                                             SERVICES,
                                                                      COMPONENTS,
                                                                        RESOURCES




                                                                        PLATFORM
                   CORE          VALUE CREATION                                          COMMUNITY
CONSUME                                           MATCHMAKING
               TRANSACTIONS        INNOVATION                                             BUILDING




                                                                      ORCHESTRATING
                  CONNECTION




Fig. 5 ELG multi-sided marketplace approach
12 Innovation and Marketplace: A Vision for the European Language Grid             227

   The core transactions of the multi-sided marketplace are represented by the left
part in Figure 5 and are mainly concerned with creating value, establishing connec-
tions between supply and demand, and facilitating consumption and compensation
of the products (technologies, services, components and resources) offered by ELG.
Connection is a complex mechanism consisting of the elements portrayed in the right
half of Figure 5. Various kinds of connections are supported and promoted by the
platform, from matchmaking, to matching of technologies, resources and services
vertically and horizontally in order to provide a more comprehensive offering, to
orchestrating all interactions between, users, providers and innovators, as well as
nurturing a vibrant and active community. The multi-sided marketplace approach
encompasses the following principles.
Value Creation ELG aims to be a platform for value creation which will be
  achieved by facilitating reciprocal exchanges between multiple marketplace par-
  ticipants. In addition, participants can create value by tapping into resources and
  capacities that they do not have to own. Any resource exchange handled via ELG
  will reduce transaction costs for each participant and enables access to exter-
  nalised innovation. The cornerstone of the ELG marketplace positioning is the
  value it provides to its participants. As the European marketplace for LT, it con-
  nects previously unmatched supply-side and demand-side participants through
  innovative forms of value creation, capture and delivery. The value proposition
  depends on the components and services, their uniqueness, and the means of deliv-
  ering value to target groups as well as on the right balance between the perceived
  value and the set price. Furthermore, ELG is the orchestrator to ensure value cre-
  ation and high quality of participation on the platform. As such, the unique posi-
  tioning as a marketplace will be based on the value generated and offered across
  verticals (see Figure 6). For example, a particular buyer receives a vertically pack-
  aged LT solution for their desired domain (e. g., the health industry) in the form
  of a unique combination of components and services from ELG. In addition, they
  can select the languages for the desired technologies, services and resources for
  the particular domain.
Connection, Gravity and Flow Whereas traditional offline marketplaces tend to
  push products and technologies to the market, ELG will rather create a pull-effect.
  As a multi-sided marketplace it will be equipped to create network effects, i. e.,
  effects that attract new users to enter the marketplace to be part of an ever-growing
  number of partners who are also part of the network. Together they engage in a
  mutual value exchange process which is orchestrated by the marketplace. ELG
  will enable easy access, meaning that participants can easily plug into the platform
  to share, transact and connect. ELG will function like a magnet in creating a pull
  that attracts participants to the platform with its gravity. Because it is both, a
  transaction and innovation platform, both LT providers and LT users (supply and
  demand) will be present to achieve critical mass. The flow of value will be fostered
  by matchmaking, i. e., making connections between LT providers and LT users.
  Rich data will be used for successful matchmaking and the co-creation of value.
228                                                      Katja Prinz and Gerhard Backfried

                      Technologies,
                      Services,
                      Resources,
                      Components, …

                                                            Languages




                                                      Verticals,
                                                      Domains

Fig. 6 Value dimensions of the marketplace



Compound Growth The marketplace aims at providing its participants a broad
   base that enables compound growth and scaling. Growth will be mainly driven
   by the network effects described above.
Visibility ELG is designed to enhance the visibility of each of its participants,
   extending their reach and networking power. From the LT vendor perspective
   the main interest is to acquire customers. As an umbrella platform for European
   LT, the ELG aims at removing geographic boundaries and language barriers, thus
   fostering the European Digital Single Market.
Community Building A very important aspect of this approach is to attract, grow
   and nurture a vibrant and active community around ELG thus promoting an inter-
   active marketplace. The stakeholders include LT providers, academic research or-
   ganisations, LT customers, EU institutions, public administrations, NGOs, policy
   makers, project consortia, research projects, as well as the ELG National Com-
   petence Centres (NCCs) in 32 European countries. This critical mass of active
   participants also generates the necessary market pull: an excellent case in point
   for this are the several pilot projects funded by ELG (see the chapters in Part IV),
   e. g., Lingsoft, Inc., Coreon GmbH and Elhuyar, among many others, have suc-
   cessfully enhanced the attractivity of the marketplace by contributing highly de-
   manded services, technologies and languages to the platform.



3.1 Foundations for a Successful Marketplace

What are the key ingredients for a successful marketplace? The answer is not straight-
forward because the formation and growth of marketplaces depends on many fac-
tors such as the availability of capital, sufficient demand, talent, legal situation, tax
systems, the innovation and startup culture of a country and many more. Nonethe-
less, there are certain elements successful marketplaces have in common which are
equally important for ELG.
12 Innovation and Marketplace: A Vision for the European Language Grid               229

Attraction Indisputably, success can only be achieved if enough participants are
  attracted to join the ecosystem. This gravity, which is one of the most important
  ingredients, will be supported by a well-balanced interplay of supply and demand
  all of which will be governed by ELG. It is vital for the marketplace to generate a
  market pull in order to fulfil the goals of self-sustainability. The more participants
  the marketplace attracts, the greater will be the network effect and compound
  value growth (a critical mass has to be reached, cf. Bonchek and Choudary 2013).
  The technical foundation to ensure that people are attracted to ELG is an innova-
  tive and state-of-the-art solution for containerised LT components, services and
  resources coupled with cloud solutions to enable fast and efficient interaction and
  speedy and scalable innovation.
Demand Economies of Scale ELG will also rely on demand economies of scale,
  which take advantage of technological improvements on the demand side and are
  driven by demand aggregation, efficiencies in networks, and other phenomena
  (like crowd sourcing of software development) that make bigger networks more
  valuable to their users (Osterwalder and Pigneur 2010). Once the gravity of the
  marketplace is functioning, network effects will be the natural result. Growth via
  network effects leads to market expansion. New buyers enter the marketplace,
  attracted to ELG by the growing number of partners who are part of the network.
Time-to-Market Strategically speaking, ELG will also focus on reduced time-
  to-market objectives: the corporate strategy of the future marketplace will be de-
  signed to truly fulfill the role as accelerator for business creation and will consider
  concepts like “lean management” and “just-in-time” supply chain delivery. Fur-
  thermore, the agile environment will provide a flexible test-bed for trying out new
  technologies and approaches.
Quality Standards In order to be successful, the marketplace needs to facilitate
  the exchange of value which means that the components, services, resources pro-
  vided through ELG require certain quality standards. In order to safeguard the
  quality of products (technologies, services, resources and components) provided,
  ELG standards and quality seals will eventually be implemented. In any case, the
  provision of high-quality state of the art LT, open architecture, reusable software,
  industry-grade robust components provide key ingredients for establishing con-
  fidence and trust in ELG as a whole. In addition, trust in the marketplace will
  be created through transparent product offering and by providing feedback and
  reviews of participants concerning their prior transactions.
Orchestration Furthermore, a proper organisation and infrastructure have to be
  provided to guarantee that the platform smoothly works as enabler of transactions:
  ideally, the whole setup fosters the exchange and creation of value and supports
  doing business in an easy and smooth manner. A prerequisite for this is an attrac-
  tive, simple and transparent licensing and pricing model, and a simple business
  processing scheme (Täuscher and Laudien 2018).
Ecosystem of Participants Successful ecosystems have the ability to provide for
  coopetition (competition and cooperation) and value co-creation, which are ide-
  ally governed by structure and orchestration to work best. ELG will provide
  for the ideal environment to foster the structured creation and well-coordinated
230                                                        Katja Prinz and Gerhard Backfried

   growth of the ecosystem. This principle is also reflected in the paradigm of open
   innovation adopted and encouraged by ELG.



3.2 ELG Ecosystem of Participants

One of the most important ingredients for a sustainable and successful marketplace
rely on the ability of ELG to create, nurture and grow an ecosystem of participants.
ELG is in the process of expanding and sustaining a unique ecosystem by attracting
diverse stakeholder groups holding different roles – reaching from LT suppliers and
demanders to networks and associations, industry members and academia, as well
as policy makers and national competence centers (see Figure 7). By aligning itself
with key associations and initiatives, ELG aims at establishing itself as a central
element in a platform-of-platforms landscape.

                                       32 National
                                       Competence
                                       Centres (NCCs)

Networks, initiatives and
associations (among others):
                                                                   Projects (among others):
ELRC, ECSPM, EFNIL, BDVA, ELRA,
NPLD, GAIA-X, NFDI, CLARIN,                                        ICT-29b), MeMAD, CEF Smart INEA,
DARIAH, W3C, RDA, EOSC,                                            QURATOR, SPEAKER, Elexis, Lynx,
OpenAIRE, CLAIRE, LT-Innovate,                                     Fandango, ELE, etc.
etc.




              Industry

                                                            ELG pilot
                                                            projects




                                  AI4EU – European AI on
                                  demand platform

Fig. 7 ELG ecosystem of participants


    The ecosystem is designed to connect people, foster an environment for open
and two-way communication, create mutually beneficial relationships, and promote
community building. In short, it is there to provide an umbrella platform for its par-
ticipants enabling them to build relationships and to provide value to one another.
The role of community building is very important because it is the driver of the mar-
ketplace. It is needed in order to reach a critical mass of active participants which
eventually generate the intended market pull. From a business perspective, ELG will
provide the infrastructure for an ecosystem allowing to match products, services,
providers (supply) and users (demand), within a multi-sided setup. By orchestrating
different stakeholders’ needs, the ecosystem will allow for matchmaking of demand
12 Innovation and Marketplace: A Vision for the European Language Grid             231

and supply and the continuous multi-directional exchange of values. The technolog-
ical and organisational infrastructure for this matchmaking as well as the interaction
governance principles are key building blocks of ELG.



3.3 Technical and Practical Aspects

From a technical perspective, ELG will be the first large-scale LT platform applying
containerisation through Kubernetes. This choice and combination of technnologies
provides a scalable environment with an web user interface and corresponding back-
end components and REST APIs. During the course of the project and beyond, it
will provide access to a multitude of state-of-the-art technologies, services and com-
ponents. Furthermore, it will include an overarching LT directory of stakeholders
from research, innovation and technology, i. e., it will be the “yellow pages” or the
“who’s who” of the European Language Technology community.
   On the provider side, ELG adheres to a number of standards in order to facilitate
the integration of a large number of disparate tools:
 1. Definition of common APIs for each class of tool, designed to be powerful
    enough to support the necessary use cases but lightweight and flexible enough
    to allow tools to expose their own specific parameters where this makes sense.
 2. Containerisation to isolate tools from one another and to allow each tool to man-
    age its own software dependencies. ELG uses the well-established Kubernetes
    system to manage the deployment, scaling and execution of containers in com-
    bination with Knative to handle auto-scaling of containers on demand.
 3. Orchestration of services will become an important topic as the set of offered
    services grows and the demand for complex workflows becomes visible. This
    may potentially even concern workflows spanning multiple platforms.

   With regard to the user interface, standards in user friendliness are adopted and
marketplace-related features, such as upload/download, licensing, billing, payment
as well as transparent pricing models will be used. In addition, ELG will promote
direct contact to its participants which is important to create additional transparency
and trust in the platform.



4 Conclusions

ELG has set its goal to become the primary platform for Language Technologies
in Europe which incorporates many aspects in one setting: marketplace, business
space and a scalable environment for innovation. With regard to innovation, an open
innovation approach is adopted, putting the combination of creation and adoption
at the centre. Different kinds and granularities of innovation (step-wise and gradual
to disruptive) are enabled by ELG and the way the community behind it is set up
232                                                            Katja Prinz and Gerhard Backfried

and managed. Innovation, however, is not viewed in isolation but rather as a crucial
element within the larger context of the ELG business model. The marketplace will
focus on commercial aspects and communities, linking supply and demand and en-
abling reciprocal value exchange. In addition, ELG will form a business space and
innovation platform in the sense of becoming a virtual agora, bringing researchers,
experts, end-users, requirements and capabilities together in one forum. Moreover,
it will serve as a promoter for open innovation, providing access to (external and
internal) resources and ingredients for innovation. As the umbrella platform shared
by the whole European LT community, it will support the bundling of efforts and
forces and facilitate the reciprocal transaction of values for all participants to grow
and benefit from this scaling.



References

Bonchek, Mark and Sangeet Paul Choudary (2013). “Three Elements of a Successful Platform
    Strategy”. In: Harvard Business Review (January). URL: https://hbr.org/2013/01/three-elemen
    ts-of-a-successful-platform.
Chesbrough, Henry (2006). Open Innovation: Researching a New Paradigm. Oxford University
    Press.
Osterwalder, Alexander and Yves Pigneur (2010). Business Model Generation – A Handbook For
    Visionaries, Game Changers, And Challengers. Wiley.
Pisano, Gary (2015). “You Need an Innovation Strategy”. In: Harvard Business Review (June).
    URL: https://hbr.org/2015/06/you-need-an-innovation-strategy.
Sánchez-Cartas, Juan Manuel and Gonzalo León (2021). “Multisided Platforms and Markets: A
    Survey of the Theoretical Literature”. In: Journal of Economic Surveys 35 (2). URL: https://do
    i.org/10.1111/joes.12409.
Schrage, Michael (2004). “Interview in Ubiquity”. In: ACM Ubiquity (December). URL: http://ub
    iquity.acm.org.
Still, Kaisa, Heidi Korhonen, Miika Kumpulainen, Marko Seppänen, Arho Suominen, and Katri
    Valkokari (2017). “Business Model Innovation of Startups Developing Multisided Digital Plat-
    forms”. In: IEEE 19th Conference on Business Informatics. Vol. 2. Thessaloniki, Greece: IEEE,
    pp. 70–75.
Täuscher, Karl and Sven Laudien (2018). “Understanding Platform Business Models: A Mixed
    Methods Study of Digital Marketplaces”. In: European Management Journal 36 (3), pp. 319–
    329. DOI: 10.1016/j.emj.2017.06.005. URL: https://doi.org/10.1016/j.emj.2017.06.005.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 13
Sustaining the European Language Grid:
Towards the ELG Legal Entity

Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina
Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried,
Katja Prinz, Jose Manuel Gómez-Pérez, and Ulrich Germann



Abstract When preparing the European Language Grid EU project proposal and
designing the overall concept of the platform, the need for drawing up a long-term
sustainability plan was abundantly evident. Already in the phase of developing the
proposal, the centrepiece of the sustainability plan was what we called the “ELG
legal entity”, i. e., an independent organisation that would be able to take over oper-
ations, maintenace, extension and governance of the European Language Grid plat-
form as well as managing and helping to coordinate its community. This chapter
describes our current state of planning with regard to this legal entity. It explains the
different options discussed and it presents the different products specified, which
can be offered by the legal entity in the medium to long run. We also describe which
legal form the organisation will take and how it will ensure the sustainability of ELG.


Georg Rehm · Katrin Marheinecke · Stefanie Hegele
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de,
katrin.marheinecke@dfki.de, stefanie.hegele@dfki.de
Stelios Piperidis
Institute for Language and Speech Processing, R. C. “Athena”, Greece, spip@athenarc.gr
Kalina Bontcheva
University of Sheffield, UK, k.bontcheva@sheffield.ac.uk
Jan Hajič
Charles University, Czech Republic, hajic@ufal.mff.cuni.cz
Khalid Choukri
ELDA, France, choukri@elda.org
Andrejs Vasiļjevs
Tilde, Latvia, andrejs@tilde.lv
Gerhard Backfried · Katja Prinz
HENSOLDT Analytics GmbH, Austria, katja.prinz@hensoldt.net,
gerhard.backfried@hensoldt.net
Jose Manuel Gómez-Pérez
Expert AI, Spain, jmgomez@expert.ai
Ulrich Germann
University of Edinburgh, UK, ulrich.germann@ed.ac.uk


© The Author(s) 2023                                                                     233
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_13
234                  Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

1 Introduction

One of the challenges the European Language Grid initiative aims to address is
the fragmentation of the European Language Technology landscape, with regard to
academia, research institutions and commercial entities. ELG aims to bring together
all stakeholders, currently scattered all over Europe, under the European Language
Grid platform as a common umbrella (Rehm et al. 2021; Vasiljevs et al. 2019). How-
ever, the efforts taken within the project can only be translated into a large-scale
success if ELG continues to exist beyond the project runtime of 42 months. This is
why it had already been foreseen in the ELG project proposal to develop a long-term
sustainability plan during the project. Its centrepiece is the idea of establishing, in
the second half of 2022, a dedicated ELG legal entity, which is meant to take over
operations, maintenance, extension and governance of the European Language Grid
platform as well as managing and helping to coordinate its community. Only with
such a sustainable, long-term activity can the overarching goal of strengthening, har-
monising and bringing together the European LT business and research community
be met. In other words, the sustainability plan and the legal entity are mission-critical
for the success of the project.
    After a brief presentation of the long-term vision of ELG (Section 2), this chapter
describes business and operation models that have been examined in order to assess
if they are suitable for the ELG legal entity (Section 3). Not only shall the ELG
platform and initiative continue to exist, we also want to expand its functionalities
further in order to serve and adapt to evolving user needs even better and to fulfil
ELG’s mission for the European LT community. We explore a number of different
dimensions with regard to the shaping of the ELG legal entity and place special em-
phasis on the description of a set of products we specified that can be offered by the
legal entity. At the same time, it is important to point out that the AI landscape –
including LT – must still be characterised as highly dynamic (Rehm et al. 2020b).
Precise predictions of where the field is headed in Europe in the next years are diffi-
cult to be made right now. It remains to be seen what the post-COVID market will
look like, which breakthroughs will come next in AI and LT, what the impact of the
various ongoing large-scale initiatives will be and how the LT/AI-related situation
in the different European countries will develop in the future. This dynamic situa-
tion creates additional challenges when it comes to specifying the final shape of the
ELG legal entity, which must consequently correspond to this agile and dynamic
environment.



2 Long-term Vision and Mission of ELG

Our vision and long-term goal is to establish ELG as the primary platform and mar-
ketplace for all commercial and non-commercial Language Technologies developed
and offered by the European LT community. In order to achieve this goal, multiple
prerequisites need to be in place, e. g., the ELG cloud platform must have very high
13 Sustaining the European Language Grid: Towards the ELG Legal Entity               235

availability and it must exhibit near real-time performance for individual services,
legally safe service level agreements need to be prepared so that services can be ap-
plied in production environments, simple mechanisms for billing need to be available
and technical support needs to be offered. Trust in the platform and its reliability need
to be established in a transparent manner. Operating these and other components of
the platform and initiative incurs various system-relevant costs (Teece 2017).



2.1 Mission of the European Language Grid

To achieve the goal of becoming the primary platform for European LTs, ELG fol-
lows its mission of creating impact beyond the platform itself:
  • Grow a vibrant community and help coordinate all European LT activities: ELG
    is an initiative from the European LT community for the European LT commu-
    nity, including industry, innovation and research. ELG can only be successful if
    the whole community makes active use of the platform and contributes as well
    as uses datasets and services. ELG collaborates with many related projects, com-
    panies, research organisations and further initiatives (see Chapters 10 and 11),
    most notably its sister project European Language Equality (ELE), which is cur-
    rently developing a strategic agenda and roadmap that specify how to achieve
    digital language equality in Europe by 2030. In the agenda developed by ELE,
    ELG functions as the main technology platform of the ELE Programme so that
    the support of Europe’s languages through technologies can be measured and
    monitored over time (Gaspari et al. 2022; Grützner-Zahn and Rehm 2022).
  • Create and maintain a powerful, scalable and useful Language Technology plat-
    form: ELG’s novel technological approach enables innovations and synergies
    between commercial and non-commercial LT demanders, suppliers and users
    (see Chapter 12). The unique ELG platform is based on the principle of en-
    capsulating services in containers. This approach tackles and solves some of
    the issues of technical interoperability, which is a crucial obstacle on the way
    of cross-provider and cross-platform interoperability. ELG enables providers to
    deposit and deploy their services.
  • Support the Multilingual Digital Single Market: ELG strengthens the commer-
    cial European LT landscape through the pan-European platform and market-
    place. Offering powerful multilingual, cross-lingual and monolingual technolo-
    gies, ELG aims to contribute to the emergence of a truly connected, language-
    crossing Multilingual Digital Single Market. European companies can showcase
    and offer their LTs and consulting services to customers on the ELG marketplace
    (see Chapter 12).
236                   Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

2.2 Added Value for Stakeholders

The implementation of this mission in the form of the ELG platform provides added
value for all stakeholders, e. g., 1. ability to attract participants (i. e., customers, buy-
ers, users, providers etc.), 2. ability to create demand economies of scale, 3. benefit
of reduced time-to-market (especially from lab to market), 4. standardised quality,
5. ease of doing business and a 6. coherent ELG technology exploitation ecosystem.
   Traditional, linear value chains are focused on a one-way process of value cre-
ation, e. g., raw materials are used and manufactured into products, which are then
distributed and used by the consumer, until they are disposed of. For ELG, we fore-
see a two- or multi-way value creation. As a digital platform, ELG will maintain an
ecosystem of reciprocity. LT providers, LT consumers, ELG stakeholders and the
whole ELG community help to generate two-way and reciprocal value as a result of
the combination of resources of its participants, cost benefits (demand economies of
scale) and network effects. As such, marketplace participants will create value by
tapping into resources and capacities that they do not have to own themselves. In
addition, marketplace participants will enjoy cost benefits and positive compound
effects, arising from demand aggregation, from efficiencies in networks and from
technological improvements on the demand side. Third, there is value within the
network itself: growth via network effects will lead to market expansion for each
of the members of the ecosystem. New participants (buyers and suppliers) enter the
marketplace, because they are attracted to ELG by the growing number of partici-
pants who are also part of the network. That way, value is created in a reciprocal,
multi-sided (almost infinite) way. For more details, see Chapter 12.



3 Main Pillars of the Business and Operational Model

Given the large number of possible routes to evaluate as well as decisions to be made
eventually, we stretched the consortium-internal discussion of the main pillars of the
ELG legal entity’s business and operational model over the whole project duration,
initiating the consortium-wide discussion in late 2019, i. e., we started immediately
after the implementation of the proof of concept of the ELG platform. The goal was
to specify, in a step by step fashion, the main ingredients of the sustainability plan.
Relevant intermediate results were presented at META-FORUM 2020 and 2021 as
well as in a number of talks.
    At the very start of the overall process we looked at the setup and models of vari-
ous other organisations that might serve as potential blueprints for ELG or, the other
way around, as examples of organisations that would not work for ELG. We paid
special attention to the domain of Language Technology and related fields, to the
aspect of community-driven organisations, to combining industry and research and
to the relevance of Europe as an overarching umbrella. All organisations we exam-
ined in more detail operate in the sphere of IT, LT or AI. Some of them have been
created as spin-offs of research projects. With regard to their size and setup, though,
13 Sustaining the European Language Grid: Towards the ELG Legal Entity            237

these organisations are very diverse; the similarities with ELG in terms of their re-
spective starting points and target groups also vary considerably. The organisations
are: DBpedia Association1 , World Wide Web Consortium (W3C)2 , Industrial Data
Spaces (IDS)3 , LT Innovate4 , OpenAIRE5 , CLARIN ERIC6 , Big Data Value Associ-
ation (BDVA)7 , Translation Automation User Society (TAUS)8 , ELRA/ELDA9 and
GATE Cloud10 . While discussing and learning more about these organisations – es-
pecially with regard to the type of legal entity they use, their membership as well as
governance and fee structure, revenue streams etc. – we realised that despite some
superficial similarities, none of them could serve as a direct model for the ELG le-
gal entity, i. e., we are not aware of any organisation that could serve as an actual
blueprint. However, we have been able to derive some important questions from this
comparison that have informed the subsequent steps of the process.
   The following sections present the main pillars of the legal entity approx. in the
order in which we discussed and designed them.



3.1 Expectations by the ELG Consortium’s SME Partners

Next up in the overall process of designing the ELG legal entity, we initiated a dis-
cussion with the ELG consortium’s SME partners, primarily to collect their expec-
tations and demands towards a legal entity that operates and maintains the “primary
platform for Language Technology in Europe”. The most important aspects of their
considerations can be summarised as follows.
Sales channel: ELG is, first and foremost, understood as a channel to promote and
   to sell the products and services offered by the SMEs. ELG should stir interest
   and convince potential customers to invest in European LT. This is also true for
   public administrations and governmental bodies, the European Institutions and
   NGOs with the general idea being that interested parties and stakeholders look
   at ELG first in their procurement processes for LT. It was suggested that, in the
   medium to long run, ELG should consider fulfilling or even establishing certain
   quality and security standards as well as some kind of quality seal.
Strategy and collaboration: Europe has strengths in certain areas and language
   combinations but new business opportunities can only be reached by joining
1 https://www.dbpedia.org
2 https://www.w3.org
3 http://www.industrialdataspace.org

4 https://lt-innovate.org

5 https://www.openaire.eu

6 https://www.clarin.eu

7 https://www.bdva.eu

8 https://www.taus.net

9 http://www.elra.info

10 https://cloud.gate.ac.uk
238                   Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

   forces and combining the offers with those of other European players. Missing
   or needed tools and services from others will help expand one’s own set of tools
   and services. The SMEs expect ELG to help in this regard, i. e., identifying and
   closing strategic partnerships (also see Interoperability below).
Buy-in from the whole community: According to the SME partners, ELG must
   be positioned in the right way with regard to other platforms and infrastructures,
   e. g., a controlled transition from META-SHARE to ELG should be achieved
   by also integrating those organisations who have participated in META-SHARE.
   Furthermore, ELG should be backed, i. e., supported and actively used, by na-
   tional centres and institutions. In terms of the governance model, all stakeholders
   should be able to have their say, yet dominance must be avoided. ELG can also
   provide a channel so that the results of national and international funding pro-
   grammes can be disseminated efficiently on an international level.
Information channel: The goal is for ELG to become the primary European plat-
   form for participants from academia, research institutions and commercial enti-
   ties. Especially with regard to industry, the relevance, understanding and benefits
   of LT for companies of all sizes needs to be increased. ELG could function as
   a means to keep interested stakeholders informed by serving as an information
   source and matchmaker for buyers and suppliers alike (marketplace approach, see
   Chapter 12).
Interoperability: 1. Throughout Europe, there is a sizable number of other rel-
   evant platform and infrastructure initiatives including, among others, Gaia-X11 ,
   the European AI-on-demand platform12 , EOSC13 and NFDI14 . The SMEs men-
   tioned their expectation that ELG becomes part of this larger ecosystem of plat-
   forms around Artificial Intelligence, data economy, research data management
   and Open Science, i. e., that ELG should ideally be fully interoperable with
   these other infrastructures, eventually opening up additional markets (Rehm et
   al. 2020a). 2. Furthermore, providers of LT need to understand what the require-
   ments are to participate in ELG and why it is beneficial for them. ELG needs to
   be compatible with existing businesses and should not duplicate existing systems.
   Since various companies already operate their own or managed cloud platforms,
   platform interoperability should be ensured so that ELG complements existing or
   emerging clouds rather than appearing like competition. ELG should avoid cre-
   ating the impression of being yet another collection of data and tools but rather
   emphasise the ability to combine services and resources from different compa-
   nies. 3. For this, however, full interoperability on the level of the actual tools and
   services, i. e., on the level of APIs, annotations, semantic descriptions, closed vo-
   cabularies etc. needs to be achieved (also see Strategy and collaboration above).




11 https://gaia-x.eu
12 https://www.ai4europe.eu
13 https://eosc.eu

14 https://www.nfdi.de
13 Sustaining the European Language Grid: Towards the ELG Legal Entity                239

3.2 Key Aspects of the ELG Legal Entity

Informed by the SME partners’ expectations and other desk research we performed
(see above), we started defining key aspects of the ELG legal entity, as follows:
Not-for-profit or for-profit organisation? There was a broad consensus in the
   consortium-internal discussions that the legal entity should be a not-for-profit or-
   ganisation. This decision is rooted in the overall approach of ELG as an initiative
   from the European LT community for the European LT community. Moving into
   the for-profit direction would constitute a significant change of plan, effectively
   compromising the initiative’s independence and ability to be perceived as neutral
   and non-competitive; this could also jeopardise the initiative’s political standing
   with national and international administrations and funding agencies. In addition,
   the not-for-profit direction comes with additional benefits (e. g., in terms of taxa-
   tion, more favourable funding conditions when participating in EU projects etc.).
Distributed team or central location? Due to the fact that the ELG consortium
   is already a distributed team and that the development of the platform and its
   technical infrastructure is spread across different European countries, the decision
   was made to keep this distributed setup and to build the team virtually rather
   than in one physical location. Current technical setups for remote work enable
   efficient virtual meetings and distributed teams are very common in business by
   now anyway, which is why we made this decision. The suggestion was made to
   position the legal entity’s “headquarter” in the country where the majority of the
   costs are likely to be incurred, which, for the time being, will be the rented cloud
   infrastructure plus part of the personnel costs.
Start small or big? Given that developments in the AI/LT field and in Europe
   as a whole are very dynamic, the preparation of a detailed ten-year plan does not
   seem to be the right approach. A large organisation with a rigid hierarchical struc-
   ture was perceived to be an obstacle in our consortium-internal discussions. In-
   stead, we favour a flexible and agile setup that can react quickly and efficiently to
   changes and new framework conditions. However, the organisation must be large
   enough to ensure that the existing infrastructure and platform can be maintained
   and extended in a meaningful way and so that growth is possible. We currently
   assume a headcount of 10-15 employees for Phase 3 (see Table 1).
Abrupt transition or soft launch? While the ELG EU project will end on 30
   June 2022, various partners of the ELG consortium are involved in a number of
   new projects, in which the European Language Grid plays a certain role. Through
   these new projects, some of the costs of operating the cloud platform can be cov-
   ered. This situation is ideal because it gives the consortium a bit more time and
   flexibility for completing the overall setup of the legal entity. Our goal is to estab-
   lish the legal entity in the second half of 2022, performing a rather soft launch.
Membership organisation? There are good reasons for having a setup that in-
   cludes a membership structure, especially for actively including the many mem-
   bers of the European LT community and also because membership fees can be
   considered a constant, reliable source income if the ELG legal entity is able to
240                  Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

   continuously provide added value. On the other hand, the membership fee needs
   to be reasonable to make sure that interested parties are not deterred from the very
   outset. The specifics are still under discussion.



3.3 Assessment of Operational Costs

Operating the ELG legal entity will create costs, that need to be covered, even if
the organisation itself will be a not-for-profit one. While the key tangible outcome
of the EU project, the implemented and populated cloud platform, is an important
prerequisite for the legal entity, several additional components need to be put in
place. Crucially, the legal entity needs a team and director to take care of operations,
maintenance and further development of the platform, associated tools and the ELG
community. The main cost items are as follows.
Staff Labour costs represent the largest share of the organisation’s expenses. Even
   a minimal team includes employees for operations, development, marketing, sup-
   port and management. It might not be necessary to hire full-time employees for
   each of these areas right away but in order to run a successful organisation, a
   stable team is essential.
Cloud hosting To enable the legal entity to operate the ELG platform, a cloud in-
   frastructure (including CPU, GPU, RAM, SSD and bandwidth) needs to be rented
   from a cloud service provider.
Overhead This refers to costs like rent of office space, hardware like workstations
   and printers, furniture, electricity, heating, etc. Even if remote and part-time work
   might reduce these costs because there is no need to rent larger office spaces,
   overhead still accounts for part of the fixed costs of the organisation.
Legal Especially in the ramp-up phase of an organisation, comprehensive and
   sound legal advice is crucial. The ELG legal entity will have to draw up and
   maintain model contracts and service level agreements for its products. More-
   over, advice on GDPR, tax legislation and human resources issues is needed. The
   legal entity will not have the capacity for an inhouse legal expert, instead, legal
   services will be outsourced.
   To facilitate future planning, a preliminary cost-structure has been developed (Ta-
ble 1). It illustrates the foreseen soft start of the legal entity, which is separated into
three phases. The gradual soft launch is meant to go from a small team that is work-
ing part-time (Phase 1) to a team of 10-15 full-time employees (Phase 3).
13 Sustaining the European Language Grid: Towards the ELG Legal Entity                     241

           Cost Item       Phase 1 (start)     Phase 2 (ramp-up)     Phase 3 (stable)
           Staff                      2,500€               25,000€            100,000€
           Cloud hosting              2,500€               10,000€             20,000€
           Overhead                     500€                2,500€              7,500€
           Legal                           –                2,500€              5,000€
           Total                      5,500€               40,000€            132,500€
Table 1 Estimated monthly costs in three phases (numbers are preliminary and indicative)



3.4 Business Model Canvas

The Business Model Canvas (BMC)15 is a template used in strategic management for
the development or documentation of existing or new business models. It is widely
known and often serves as the first instrument applied when it comes to the visu-
alisation and structuring of business models. The BMC helps to bring all essential
elements of a business model into a scalable system. It consists of a visual chart with
all necessary elements of an organisation or company. The idea is that the company
or startup recognises its potential and weaknesses and understands where to align
their activities by illustrating potential trade-offs (Osterwalder and Pigneur 2010).
The nine “building blocks” of the business model design template that came to be
called the Business Model Canvas were initially proposed by Osterwalder (2004)
based on his work on a business model ontology. It outlines nine segments for the
business model in a simple one-page canvas that can be inspected alongside each
other. The nine BMC segments are: 1. Key Partners, 2. Key Activities, 3. Key Re-
sources, 4. Value Proposition, 5. Customer Relationships, 6. Channels, 7. Customer
Segments, 8. Cost Structure and 9. Revenue Streams. Below we explain how the
ELG legal entity relates to each of the nine segments of the BMC. This ELG-specific
BMC was prepared by all nine ELG consortium partners. First, we asked all partners
to prepare a partner-specific BMC, i. e., to prepare their own vision and approach of
the ELG legal entity. Afterwards we processed the nine individual, partner-specific
BMCs into one consolidated BMC, which is the basis of the following description.
Segment: Key Partners “Who are the key partners/suppliers? What are the mo-
  tivations for the partnerships?”
  One key partner in the ELG BMC are commercial and non-commercial LT ser-
  vice providers, either with or without their own cloud platform. Equally important
  are Language Resource and data providers that own existing data sets and repos-
  itories. These two key partners contribute to the thriving of the ELG platform.
  Their motivation is not (or not only) to use available services and resources, but
  they offer their own services and resources and create value or profit for their
  own organisations. Another key partner is the wider ELG community, including
  the ELG consortium, the 32 National Competence Centres, the national language
  communities, and all running EU projects and initiatives in the field of LT (includ-
15   https://en.wikipedia.org/wiki/Business_Model_Canvas
242                  Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

  ing ELE). This community consists of academic and research partners as well as
  a number of companies that need multilingual datasets and services for their re-
  search. Equally important for raising awareness are the European Commission
  and the European Parliament as well as national institutions such as ministries
  and funding agencies and other established networks and associations.
Segment: Key Activities “What key activities does the value proposition require?
  What activities are the most important in distribution channels, customer relation-
  ships, revenue stream, etc.?”
  The most crucial key activity is the maintenance, further development and opera-
  tion of the ELG platform. It needs to provide an interesting and relevant offering
  in order to grow a critical mass of members and users and gain popularity in the
  whole European LT community and beyond. Regular posting of content and other
  outreach activities (such as events, tutorials, talks, publications, meetups etc.) are
  essential to generate visibility and create a strong reputation (see Chapter 10). All
  communication and dissemination activities have to be treated with the highest
  priority to retain existing users and keep attracting new ones. Leveraging exist-
  ing communication networks and sales channels can support this process and will
  be further explored. Quick and reliable service and support helpdesks are needed
  to strengthen customer relationships. Licensing and billing models need to be
  maintained and promoted. Maintenance and management of cloud storage and
  computing for running services has to be ensured.
Segment: Key Resources “What key resources does the value proposition re-
  quire? What resources are the most important ones in distribution channels, cus-
  tomer relationships, revenue stream etc.?”
  The most important resource is the ELG platform itself with all its functionali-
  ties and included services, corpora and additional information. ELG can be re-
  garded as a set of seed technologies, tools and components that are extended over
  time. Customer feedback can be seen as a useful resource as well. It can come in
  many different forms such as evaluation from market data or helpdesk and user
  support feedback. Equally important is a dedicated ELG team, committed to not
  only maintaining existing technology, but growing it and promoting the impor-
  tance of ELG on an international level. To achieve this, a wide international net-
  work is a key resource. The consortium combines vast experience and expertise,
  good knowledge of ongoing trends and access to numerous European networks
  in academia and industry.
Segment: Value Proposition “Which customer needs are being satisfied? What
  core value is delivered to the customer?”
  ELG is envisioned to become the primary LT platform for Europe and to function
  as a one-stop-shop, offering a rich portfolio of LT services, tools and datasets.
  One of its core values is the availability of state of the art services which are
  fast, effective, robust and high-quality. Another special attribute is the fact that
  ELG is “made in Europe, for Europe”. This strong branding inspires trust and
  confidence and ensures that the system is compliant with European regulations,
  security constraints and ethics. For customer satisfaction, ELG needs to be cus-
  tomisable, cover niches, address verticals and offer direct access to providers. Fur-
13 Sustaining the European Language Grid: Towards the ELG Legal Entity            243

  thermore, all solutions come with high usability and are easy to integrate. Stake-
  holders familiar with the European LT landscape are aware of the fragmentation
  of the community which impairs an effective exchange of resources. ELG is com-
  mitted to tackle this existing fragmentation. Competitive pricing is another value
  that makes ELG attractive for customers. Unique about ELG is that it offers a
  new or additional channel for service providers and consumers. Suppliers can
  gain more visibility, easy portability between providers is guaranteed through
  joint standards. Workflow functionalities will eventually be integrated to com-
  bine services from different providers and even their own clouds. ELG also offers
  added value to academia. It allows the use of services and data and offers easy
  comparison between systems on the same data or different data with the same sys-
  tem. ELG is meant to act as a broker for European LT and as a catalyst to boost
  innovation that also makes both the European industry LT sector and academic
  institutions an attractive employer for young high-potentials.
Segment: Customer Relationships “What relationship that the target customer
  expects are you going to establish? How can you integrate that into your business
  in terms of cost and format?”
  The ELG brand is intended to be a quality seal for customers that guarantees state
  of the art services, a high level of security and compliance with all relevant EU
  regulations. Customers can use ELG through the web UI including code samples
  and libraries or through the APIs or SDKs. High quality guidelines and a user-
  friendly design make processes intuitive. Support through a service helpdesk is
  also possible. Technical onboarding and support packages will be offered and a
  fine-grained customer relationship model is being developed. Essential for tar-
  geting customers is strong brand building. Related marketing activities are tai-
  lored to different audiences and distributed regularly. While retaining customers
  is essential, new potential customers can be attracted through outreach and train-
  ing events, tutorials, webinars and conferences. A brand that has earned people’s
  trust can also create a need for other customer services such as consulting services
  around ELG and language-centric AI.
Segment: Channels “Through which channels do customers want to be reached?
  Which channels work best? How much do they cost? How can they be integrated
  into customers’ routines?”
  Customers will be reached through a variety of channels. Events, both estab-
  lished and new ones, will play an important role, for example, events targeted at
  stakeholders in a specific industry domain. Dedicated networking sessions, con-
  ferences and presentations are also foreseen. Online advertising campaigns will
  accompany all events. Since ELG builds on an existing network of stakeholders,
  email marketing and social media campaigns have proven to be successful means
  of reaching out. Presence on social media channels such as Twitter or LinkedIn
  helps to promote events and maintain customer relationships. ELG itself is a chan-
  nel through which customers can retrieve information, not only about services
  and datasets, but also about the community and events. Cloud platforms that are
  either currently being developed in other EU or national projects as well as exist-
244                 Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

  ing commercial platforms can also act as channels to point potential customers to
  ELG. SEO can also help promoting ELG since users trust search engines.
Segment: Customer Segments “For which segment is value being created? Who
  is the most important customer?”
  The ELG platform offers value to different customer segments. LT providers, both
  commercial and academic ones, can use ELG to offer their services and datasets.
  Research organisations can benefit immensely from the wide offer. Customers
  from industry that demand LT (including large enterprises, SMEs, startups etc.)
  represent an essential customer segment that contributes to turning ELG into a
  flourishing marketplace. The European Union, public administrations and NGOs
  can also integrate ELG services into their current solutions. The same holds true
  for funding agencies and policy makers, advertising companies etc. Other EU
  project consortia as well as project consortia on the national level can benefit
  from the value created by ELG.
Segment: Cost Structure “What are the highest costs? Which key resources or
  activities are most expensive?”
  As mentioned earlier, the highest costs are created by the human resources and the
  digital infrastructure. Personnel costs are created by the team maintaining and fur-
  ther developing ELG including daily operations as well as customer support, but
  also community management work that requires marketing and communication
  activities. Further resources need to be assigned to management and administra-
  tion work that includes budgeting, accounting and legal counselling. Moreover,
  overhead costs are to be covered.
Segment: Revenue Streams “For what value are customers willing to pay?”
  Part of the overall revenue will be generated through different products including
  usage or subscription fees, brokerage fees (marketplace approach), commission
  fees and products such as LT as a Service (LTaaS; hosting of services, models,
  datasets), LT Platform as a Service (PaaS; combining ELG services into work-
  flows) and Repository as a Service (RaaS; hosting service for whole reposito-
  ries). Advertisements can, for instance, showcase companies, services, confer-
  ences etc. Sponsored content, services, data sets, companies etc. present another
  revenue stream as well as commission fees. Paid training events, tutorials, we-
  binars etc. can be offered to commercial stakeholders. Conferences (event reg-
  istration fees; sponsorship packages for companies) are also an opportunity to
  generate income as well as general consulting services around ELG and language-
  centric AI.
    This brief summary of the nine segments is an extract of the ELG BMC, pro-
duced by consolidating the BMCs prepared by the ELG consortium partners. For
many segments, there was broad agreement within the individual BMCs, especially
with regard to key partners, key activities and key resources. Also, in value propo-
sition, customer relationships and channels the answers were largely similar. The
customer segments are quite heterogeneous, though, which may make a targeted ap-
proach more difficult. As far as the cost structure is concerned, there are few devia-
tions. A crucial open question concerns the appropriate size and ambition of the ELG,
in particular with regard to team size. The answers were rather diverse in the case
13 Sustaining the European Language Grid: Towards the ELG Legal Entity                  245

of revenue streams; here, positions could be aligned more closely through the subse-
quent step of specifying and discussing the different ELG products. As a follow-up
step, the exact revenue streams will be evaluated with regard to cost-effectiveness
and sustainability.



3.5 Product Portfolio and Revenue Streams

Together with all partners of the ELG consortium we defined, in a process that in-
cluded several iterations, a portfolio of products that the ELG legal entity can poten-
tially offer. These products are targeted at members of the European LT community
and also at stakeholders interested in using, implementing, integrating or purchasing
European LT. The products are primarily foreseen as revenue streams for the ELG
legal entity so that it is able to cover the fixed costs associated with operating the
ELG legal entity and platform (Section 3.3).
    Such a structured portfolio of products, including associated fees, is necessary for
eventually preparing the budget plan of the legal entity. In the following, we briefly
describe the main categories of the ELG product portfolio; due to space restrictions
we are unable to include all the details (especially aspects such as competitors, pric-
ing, technical preconditions and general prerequisites are left out), i. e., the descrip-
tion in this chapter is not meant to be exhaustive but rather indicative of the overall
plan and vision of the legal entity. It is also important to note that not all products will
be offered right from the start but that the set of products will be expanded gradually
over time.


3.5.1 Product Category: Marketplace

Marketplace Commission ELG features a directory of all European LT devel-
  opers and can enable a match-making process, i. e., ELG facilitates, for potential
  buyers or integrators of LT, the discovery of the right LT provider. In this product,
  ELG receives a commission from every contract generated through the market-
  place (approx. 5-10%). This product can be used by commercial LT developers to
  broaden their reach and to penetrate new markets, especially if the current is lim-
  ited or if the developer is operating in a niche. On the demand side, we foresee
  this product to be used by larger organisations that want to buy LT or integra-
  tors that need a specific LT for a customer project. In order to participate in this
  marketplace, LT developer companies have to agree and to sign a marketplace
  participation framework agreement.
Public Request for Bids Model This product is a potential extension of the mar-
  ketplace commission product: Customers can publicly and maybe anonymously
  post the need for a certain technology or resource or perhaps for an integration
  task and ask supplier companies for bids. Multiple LT developers and integrators
  can post their bids (not publicly) so that the organisation that posted the origi-
246                 Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

   nal request for bids can identify a cost-effective way to move forward. Posting
   the original request for bids would require a small fee to be paid. If a contract is
   established, the usual ELG marketplace fee applies on top of this.


3.5.2 Product Category: Consulting

Technical ELG Platform Consulting The ELG legal entity has enough expertise
  so that it can offer various types of technical consulting services, for example, re-
  garding ELG, providing or using ELG services, combining services, training new
  models and making them available, i. e., services with a clear focus on the ELG
  platform, ecosystem and technical basis. This product is likely to be purchased by
  organisations that have a certain need for LT and that want to test and explore cer-
  tain functionalities, models or tasks, but these organisations realise that they need
  some kind of help, e. g., implementation of prototypes, selection of technologies,
  evaluations etc. Using this product, organisations are able to make full use of the
  ELG platform and all its services. This product can be offered for a one-time fee
  or, for larger companies, also as part of a framework contract.
Conceptual ELG Community Consulting This product is similar to the one de-
  scribed above; it primarily makes use of the ELG team’s in-depth knowledge of
  the ELG community, i. e., of the European LT developer or provider landscape.
  In that regard, the ELG team can support organisations with a certain need for a
  general or specific type of LT in finding the right technology provider. Customers
  interested in this type of product know that they have a certain need for LT but
  they are unsure about the concrete next steps, i. e., where and how to find the
  provider company.
LT Market Intelligence Report The ELG legal entity could exploit its in-depth
  knowledge of the European LT landscape and community and publish an annual
  or semi-annual market intelligence report about the European and maybe also
  global LT landscape including topics such as, among others, emerging trends, new
  players and rising stars, new projects and success stories. Such market analyses
  are highly relevant for a larger group of stakeholders including larger companies
  and enterprises (LT developers, LT users), non-governmental organisations, ven-
  ture capital companies and others. These reports could be offered for a one-time
  fee or as packages that cover multiple reports with a slightly reduced fee.


3.5.3 Product Category: ELG APIs

ELG Power User Flatrate (for commercial users) Through this product, com-
  mercial customers get unlimited and unrestricted access to the ELG APIs of all
  integrated services and tools. This product targets companies of any type (SMEs,
  integrators, enterprises) that have to pay a small monthly or annual fee to be able
  to use it. This subscription product provides direct to all ELG APIs for experimen-
  tation and evaluation purposes, enabling fast comparisons and immediate results.
13 Sustaining the European Language Grid: Towards the ELG Legal Entity            247

  It can also be used to develop smaller LT-driven applications by integrating ELG
  APIs into existing systems. Like with many other products, any surplus generated
  through this product will be transferred to those LT developing companies that
  have provided the ELG-integrated services that were used in the relevant month,
  based on the proportionate number of API calls.
ELG Power User Flatrate (for academic users) Technically, this product is ex-
  actly like the first one but it targets academic users exclusively. The monthly or
  annual fee will be significantly lower than the fee of the power user flatrate for
  commercial users.
ELG Professional Flatrate Conceptually, this product is similar to the first one
  but the professional flatrate includes additional features and support services,
  e. g., faster tools, more compute resources, faster helpdesk support, workflow or
  pipeline functionality etc. The price of this product will be significantly higher
  than the pricer for the first product.


3.5.4 Product Category: LT-as-a-Service

LT-as-a-Service (for commercial users) This product targets commercial LT de-
  velopers. Paying a certain fee, it enables them to host a limited number of LT tools
  or services within the ELG platform with guaranteed performance and availabil-
  ity. In order to be able to host more services or API endpoints in ELG, a differ-
  ent type of product needs to be purchased (see Section 3.5.6). This product is
  especially interesting for those companies that do not operate their own cloud
  infrastructures or that are eager to participate in the ELG initiative, i. e., ELG’s
  LT-as-a-Service product can be seen as an alternative to renting cloud infrastruc-
  ture. Another benefit of this product is that companies are able to extend their
  reach and to open up new markets, i. e., once again ELG can be used as an addi-
  tional sales, promotion and distribution channel. This product can also be set up
  in multiple tiers, representing different maximum numbers of services and cor-
  responding prices. While companies have to pay a certain fee for this product,
  the different ELG APIs products (see Section 3.5.3) will generate revenue, from
  which the companies will benefit. In that regard, it is important to identify the
  right balance over time.
LT-as-a-Service (for academic users) Technically, this product is exactly like
  the previous one but it targets academic users exclusively. The monthly or an-
  nual fee will be significantly lower than the fee of the LT-as-a-Service product for
  commercial users. This product also targets research projects, for which ELG can
  function as a secondary or maybe even primary dissemination and exploitation
  channel for their research results. Like the ELG power user flatrate for academic
  users, we consider making this product available for free for academic users if
  and when the ELG legal entity has established stable revenue streams.
248                  Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

3.5.5 Product Category: Data-as-a-Service

Data-as-a-Service (for commercial users) This product is very similar to LT-as-
  a-Service but instead of focusing upon running services or tools, it only allows
  making datasets or other (static) resources available on ELG, again, with guar-
  anteed availability. Like LT-as-a-Service, this is an entry level product and, thus,
  only allows hosting a limited number of datasets (or up to a certain amount of
  data) on ELG. In case of more demand on the side of the customer, a different
  type of product needs to be purchased (see Section 3.5.6). This product needs to
  be priced lower than the LT-as-a-Service product.
Data-as-a-Service (for academic users) Technically, this product is like the pre-
  vious one but it targets academic users. The monthly or annual fee will be signif-
  icantly lower than the fee of the Data-as-a-Service product for commercial users.


3.5.6 Product Category: Repository-as-a-Service, Platform-as-a-Service

Repository-as-a-Service, Platform-as-a-Service Using this product, customers
  can host whole LT platforms or repositories on ELG while the ELG team takes
  care of all technical aspects including branding, availability, backups etc. This
  product targets a variety of stakeholders including goverments and ministries
  (e. g., for hosting national LT platforms on ELG), smaller or larger companies,
  smaller research groups and also whole research centres. The idea behind the
  product is that setting up and operating a cloud with an LT repository requires a
  lot of effort and expertise, which can be fully avoided by purchasing the corre-
  sponding ELG product. While the branding of the respective hosted platform or
  repository can be adapted to the brand and logo of the respective customer on the
  user interface level, at the same time, all hosted services, tools and other resources
  are automatically also part of the ‘wider’ ELG cloud platform, which will auto-
  matically broaden their reach significantly. We currently foresee three different
  tiers of this product: one entry level tier for research groups, one for SMEs and
  research centres and one for national LT repositories.


3.5.7 Product Category: Events

Training Events and Tutorials In addition to the more technical products de-
  scribed above, training events and tutorials can be offered as products, especially
  for commercial customers. These can be, among others, general ELG-related
  training events (from half a day to multiple days) where the training relates to the
  ELG platform, using, providing and combining services etc. This type of event
  can be offered to organisations that have a need for certain LT and that want to be
  able to make the most of the ELG platform. This product is a pre-packaged and
  generic course, while those training events that involve customisation of content,
13 Sustaining the European Language Grid: Towards the ELG Legal Entity            249

  tailoring the course to the respective customer and its specific needs, would be
  rather considered technical ELG platform consulting (see Section 3.5.2).
Annual Conference The annual ELG conference assembles the whole ELG com-
  munity, including commercial and academic participants, related projects and ini-
  tiatives etc., and also the ELG team. While the annual ELG conference organised
  by the ELG EU project has been free of charge, this model could change (e. g.,
  registration fees, sponsorship packages, paid presentation slots, booths for a fee
  in the industry exhibition etc.).


3.5.8 Product Category: Marketing and Advertisements

Conference Sponsorship This product relates to typical conference sponsorship
  packages, which can be purchased by, typically, companies to position themselves
  as gold or platinum sponsors of the annual ELG conference. This product model
  is well established and accepted in industry and research but to be successful it
  requires the respective conference to be of very high relevance for its community.
Online Advertisements The ELG platform could offer a small part of its screen
  real estate for online advertisements that can be purchased, among others, by
  members of European LT community to position their products or services in
  a more targeted way on the ELG website, for example, when certain keywords
  or search terms are used. In terms of revenue generated, this product only makes
  sense if the website has a very high number of users. Furthermore, it remains to
  be discussed and seen if online ads are a welcome addition on the ELG website
  or if they are perceived as not appropriate.
Sponsored Content Similar to online advertisements, the idea behind this product
  is that customers can pay a small fee to get one or more of their products, services
  or resources or perhaps even their own organisation’s or project’s page in the ELG
  catalogue featured on the ELG website, clearly marked as “sponsored content”
  (for example, the first search result).
Merchandise The final product relates to ELG-branded merchandise, which could
  be sold online, for example, tshirts, hats or pens with the ELG logo.


3.5.9 Miscellaneous

In addition to the actual products offered by the ELG legal entity, there are at least
three other potential revenue streams or activities related to marketing the ELG prod-
ucts. These additional revenue streams cannot be considered products per se.
Foundations The ELG legal entity could approach one or more foundations with
  the request to grant financial support. In return, the foundations could position
  themselves as supporters of the ELG initiative.
Project Grants EU or national project grants are an obvious mechanism to sup-
  port part of the ELG team and platform as well as its operation.
250                  Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

ELG Use Cases as Show Cases Together with larger enterprises and some of the
  commercial LT developer companies represented in ELG, interesting and relevant
  show cases as well as success stories can be published on the ELG website, which
  can function as marketing instruments and testimonials that demonstrate that ELG
  is an important and valuable activity.


3.5.10 Summary and Assessment

The ELG product portfolio is diverse and broad, it offers multiple different options
of moving forward under the umbrella of the legal entity. As mentioned, we will
not start with all products right away but only with a selection. Before we make the
final selection, we will validate the products and their chance of being accepted by
the European LT community with a number of experts from the field. As the most
promising products we currently perceive the ELG APIs (Section 3.5.3) due to the
enormous market for this product, the LT-as-a-Service products (Section 3.5.4) due
to high demand, the marketplace (Section 3.5.1) as well as the consulting product
(Section 3.5.2).
   Additionally, we see a lot of potential in offering countries the technical infrastruc-
ture for the purpose of supporting national LT platforms (Section 3.5.6). Especially
for smaller countries or regions, it is challenging to develop, operate and maintain
an elaborate technical platform all by themselves. For these, having their National
LR/LT Repository hosted as a service within ELG can be an attractive offer. For
ELG, in return, it appears to be an interesting financial pillar to operate such plat-
forms, charging an annual hosting fee.
   Making use of the ELG platform as the primary dissemination and exploitation
channel for research projects is another product idea that has a lot of potential (Sec-
tion 3.5.4). It enables research projects to fully concentrate on the actual research
work without a need for developing complicated exploitation plans on their own be-
cause they can fully rely on ELG for this purpose. This approach can increase the
general visibility of European research results significantly.



3.6 Legal Entity Type

For the creation of a dedicated legal entity with European scope, we considered a
number of different entity types. The decision to move forward with a not-for-profit
organisation was made rather early in the process. The main options that we explored
were a professional association or a foundation. In that regard, each EU country has
its own set of different types of business entities as part of their legal system, which,
generally, all have their own specific sets of rules. These include, among others,
cooperatives, partnerships and limited liability companies. Looking at Germany, for
a not-for-profit organisation, a gGmbH (a not-for-profit private limited company), or
an e. V. (eingetragener Verein, registered association) would be two obvious options.
13 Sustaining the European Language Grid: Towards the ELG Legal Entity               251

An alternative that enjoys some popularity with EU-funded projects is the Belgian
Association without lucrative purpose (AISBL). As the ELG consortium does not
have any partners in Belgium or Luxembourg, the AISBL option was ruled out for
reasons of efficiency. In addition to national entities, there are several types of legal
entities on the level of the European Union.
    The EEIG (European Economic Interest Grouping) is part of European Corporate
Law, created in 1985. An EEIG makes it easier for companies in different countries
to do business together. Its activities must be ancillary to those of its members. Any
profit or loss is attributed to its members. It is liable for VAT and social insurance
of its employees but it is not liable to corporation tax and it has unlimited liability.
Several thousand EEIGs exist and are active in various fields. This legal entity only
applies to companies, it does not include research institutions.
    The SE (Societas Europea) is a European company, established in 2001 by an
EU Regulation. The SE has been growing in popularity ever since. It is a type of
public limited-liability company and allows an organisation to operate its business
in different European countries under the same rules. An SE offers many advantages
such as easily setting up Europe-wide subsidiaries as well as an international holding
company. The company headquarters can be relocated easily and the SE legal form
conveys a strong European image. However, the SE comes with strict foundation
criteria, such as the requirement of high initial capital.
    The SCE (Societas cooperativa Europaea, European Cooperative Company) was
established in 2006, it is related to the SE. An SCE can be established in the European
Economy Area. This entity type was created to remove the need for cooperatives to
establish subsidiaries in each EU Member State in which they operate, and to allow
them to move their registered office and headquarters from one EU Member State to
another. SCEs are governed by a single EEA-wide set of rules and principles which
are supplemented by the laws on cooperatives in each Member State.
    The SPE (Societas privata Europaea) is a European private limited company, it
corresponds to an Ltd. in Anglo-Saxon countries or a GmbH in Germany, Austria
and Switzerland. This legal entity type has been a European Commission proposal
for more than ten years. As of now, it still does not exist.
    For ELG, a crucial requirement is that the selected solution provides flexibility,
agility and the ability to ramp up the operation of the legal entity in a careful way.
The final decision must also be made on the basis of financial considerations, i. e., it
must be specified which products or services can be offered to generate which profit.
    At the time of writing, we will establish a registered association headquartered
in Germany (e. V., eingetragener Verein). This option does not require any initial
capital and frees ELG from the pressure of having to generate income immediately.
Since some of the staff members who will be active in the ELG e. V. in the first phase
are based in Berlin, it appears practical to set up the entity in Germany and under
German law. It must be noted, however, that the legal entity will work in virtual teams
primarily. The only legal entity type on the European level that could be appropriate
for ELG, the SPE, does not exist yet.
252                  Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

4 Summary and Next Steps

This chapter presents the current state of planning of the ELG legal entity, which
is foreseen to be established as an eingetragener Verein, e. V., as a registered, not-
for-profit association, in the second half of 2022. The legal entity will start small,
with a soft launch, and is meant to be flexible and agile. The main pillars of this
concept have been under development since late 2019 and cover most of the crucial
aspects of the legal entity. In terms of financing, a mixed model is envisaged, driven
by the product portfolio (Section 3.5), that includes shared revenue streams through
LT provider companies that use ELG as a sales channel and their customers who use
ELG to find the right providers and suppliers as well as services.
    One aspect that still needs to be specified in more detail is the inclusion and ac-
tive involvement of the European LT community and the governance structure of the
legal entity. As an initiative from the European LT community for the community,
its involvement is crucial to create trust and transparency as well as to provide repre-
sentation to academic and industrial European LT developers. The proper inclusion
of the community in a representative manner will require a number of discussions
and deliberations. Fortunately, with regard to an e. V., these matters do not need to
be fully resolved before establishing the organisation but can also be taken on board
and revised through updates of its statutes.
    Originally we had envisioned to establish the legal entity within the project run-
time and to start with a ‘bigger’ approach than is currently foreseen. The afore-
mentioned delay of a few months in establishing the entity does not pose a prob-
lem because the overall framework conditions have changed in the last 12 to 18
months. Through recently started and publicly funded projects including ELE, ELE2,
OpenGPT-X, NFDI4DataScience and AI as well as the upcoming EU projects
DataBri-X and SciLake, which are about to start in October 2022 and early 2023
respectively, we are able to operate the ELG cloud platform and we can also per-
form some maintenance and other ELG-related work, including the extension of the
ELG platform itself so that it is compatible with the emerging Gaia-X ecosystem. In
addition, SciLake will establish the first bridges to the EOSC ecosystem.
    Since the start of the project, we have been collaborating with the European AI
on demand platform, especially with the AI4EU project, to ensure compatibility of
our approaches in terms of semantically describing resources. Furthering these col-
laborative efforts will facilitate cross-platform search and discovery enabling ELG
resources and other assets to be visible and usable by the wider AI community. Con-
sidering the EU’s plan to deploy the European AI on demand platform, ELG is ready
to act as the central language-related AI hub and marketplace providing access to and
direct use of several thousands of LT services and related data.
    While the future is always difficult to predict, it is clear already now that over the
past three years the interest in ELG has risen constantly and that the legal entity that
will take over the initiative after the EU project has ended has very good starting con-
ditions. The ELG brand has been established in the community and a considerable
buy-in can be observed already now. However, to take advantage of this momentum,
the marketplace, broker, dissemination, exploitation and participation model needs
13 Sustaining the European Language Grid: Towards the ELG Legal Entity                       253

to be extremely simple and easy to grasp to make sure users understand and accept
it and the platform needs to be as user-friendly and all-encompassing as possible in
every regard, including the various levels of technical interoperability. Quality and
security aspects play a crucial role and can become the unique selling proposition as
opposed to providers of LT services from the US or Asia.



References

Gaspari, Federico, Owen Gallagher, Georg Rehm, Maria Giagkou, Stelios Piperidis, Jane Dunne,
   and Andy Way (2022). “Introducing the Digital Language Equality Metric: Technological Fac-
   tors”. In: Proceedings of the Workshop Towards Digital Language Equality (TDLE 2022; co-
   located with LREC 2022). Ed. by Itziar Aldabe, Begoña Altuna, Aritz Farwell, and German
   Rigau. Marseille, France, pp. 1–12. URL: http://www.lrec-conf.org/proceedings/lrec2022/wor
   kshops/TDLE/pdf/2022.tdle-1.1.pdf.
Grützner-Zahn, Annika and Georg Rehm (2022). “Introducing the Digital Language Equality Met-
   ric: Contextual Factors”. In: Proceedings of the Workshop Towards Digital Language Equality
   (TDLE 2022; co-located with LREC 2022). Ed. by Itziar Aldabe, Begoña Altuna, Aritz Farwell,
   and German Rigau. Marseille, France, pp. 13–26. URL: http://www.lrec-conf.org/proceedings
   /lrec2022/workshops/TDLE/pdf/2022.tdle-1.2.pdf.
Osterwalder, Alexander (2004). “The Business Model Ontology: A Proposition in a Design Science
   Approach”. PhD thesis. France: University of Lausanne.
Osterwalder, Alexander and Yves Pigneur (2010). Business Model Generation – A Handbook For
   Visionaries, Game Changers, And Challengers. Wiley.
Rehm, Georg, Dimitrios Galanis, Penny Labropoulou, Stelios Piperidis, Martin Welß, Ricardo
   Usbeck, Joachim Köhler, Miltos Deligiannis, Katerina Gkirtzou, Johannes Fischer, Christian
   Chiarcos, Nils Feldhus, Julián Moreno-Schneider, Florian Kintzel, Elena Montiel, Víctor Ro-
   dríguez Doncel, John P. McCrae, David Laqua, Irina Patricia Theile, Christian Dittmar, Kalina
   Bontcheva, Ian Roberts, Andrejs Vasiljevs, and Andis Lagzdiņš (2020a). “Towards an Inter-
   operable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different
   Levels of Interoperability”. In: Proc. of the 1st Int. Workshop on Language Technology Plat-
   forms (IWLTP 2020, co-located with LREC 2020). Ed. by Georg Rehm, Kalina Bontcheva,
   Khalid Choukri, Jan Hajic, Stelios Piperidis, and Andrejs Vasiljevs. Marseille, France, pp. 96–
   107. URL: https://www.aclweb.org/anthology/2020.iwltp-1.15.pdf.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic,
   Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez
   Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek,
   Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Al-
   bina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De
   Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek,
   Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur
   Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás
   Váradi, Kadri Vider, Andy Way, and François Yvon (2020b). “The European Language Technol-
   ogy Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communi-
   cation in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation
   Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christo-
   pher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani,
   Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325.
   URL: https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
254                    Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis et al.

   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Teece, David J. (2017). “Dynamic Capabilities and (Digital) Platform Lifecycles”. In: Entrepreneur-
   ship, Innovation, and Platforms 37 (Advances in Strategic Management), pp. 211–225. DOI: 1
   0.1108/S0742-332220170000037008.
Vasiljevs, Andrejs, Khalid Choukri, Luc Meertens, and Stefania Aguzzi (2019). Final study report
   on CEF Automated Translation value proposition in the context of the European LT market/e-
   cosystem. DOI: 10.2759/142151. URL: https://op.europa.eu/de/publication-detail/-/publicatio
   n/8494e56d-ef0b-11e9-a32c-01aa75ed71a1/language-en.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
                          Part IV
ELG Open Calls and Pilot Projects
Chapter 14
Open Calls and Pilot Projects

Lukáš Kačena, Jana Hamrlová, and Jan Hajič




Abstract We describe the two ELG open calls for pilot projects, the objective of
which was to demonstrate the use and the advantages of ELG in providing basic LT
for applications and as a basis for more advanced LT-based modules or components
useful to industry. Our main goal was to attract SMEs and research organisations
to either contribute additional tools or resources to the ELG platform (type A pilot
projects) or develop applications using Language Technologies available in the ELG
platform (type B pilot projects). We start with the detailed description of the submis-
sion and evaluation processes, followed by a presentation of the open call results.
Afterwards we describe the supervision and evaluation of the execution phase of the
projects, as well as lessons learned. Overall, we were very satisfied with the setup
and with the results of the pilot projects, which demonstrate an enormous interest in
ELG and the Language Technology topic in general.



1 Introduction

To demonstrate the advantages of ELG (Rehm et al. 2021) in providing LT for appli-
cations and as a basis for more advanced LT-based modules or components useful to
industry, the ELG project set up a mechanism for using close to 30% of its budget for
small scale demonstrator projects (“pilots”) through two open calls. The calls were
prepared using the ICT-29a call specification, making use of the Financial Support to
Third Parties (FSTP) scheme according to the ICT Work Programme 2018-2020 (Eu-
ropean Commission 2017). In total, we provided 1,950,000€ to the selected projects
as FSTP with an awarded amount of up to 200,000€ per project. We established
a lightweight submission procedure and a transparent evaluation process, in which
external evaluators participated as reviewers.
    The main objective of the open calls was to attract SMEs and research organi-
sations to either contribute tools and services to ELG (type A projects) or develop

Lukáš Kačena · Jana Hamrlová · Jan Hajič
Charles University, Czech Republic, kacena@ufal.mff.cuni.cz, hamrlova@ufal.mff.cuni.cz,
hajic@ufal.mff.cuni.cz


© The Author(s) 2023                                                               257
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_14
258                                          Lukáš Kačena, Jana Hamrlová, and Jan Hajič

applications using Language Technologies available in the ELG platform (type B
projects). The results of the pilot projects are included in the ELG platform for dis-
semination, testing and external evaluation by other entities or the public.



2 Organisation of the Open Calls

2.1 Management Structure and Organisation

While agile, simple and lightweight from the proposers’ point of view, the organisa-
tion of the two open calls was an internally complex procedure requiring close col-
laboration of three different teams (management team, technical team, Pilot Board)
with support from a broad panel of external evaluators.


2.1.1 Pilot Board

The Pilot Board (PB) was set up for the supervision of the pilot projects. While the
management team took care of the organisation and handling of the open calls and
the execution of the pilots, the PB provided a forum so that the ELG project could
discuss the progress of the pilots, their feedback and results. The PB was meant to
be the main technical and strategic interface between the pilot projects and the ELG
project proper, so that ELG could maximise its benefits from supporting the pilots
and to make sure that the pilot projects benefit from ELG.
   The PB operational procedures were drafted by the management team and ap-
proved by the ELG Steering Committee. Afterwards, seven PB members were nom-
inated and approved. The operational procedures defined the main responsibilities
of the PB as follows: approval of the open calls and related documentation; pilot
project selection process; supervision of pilot project execution, including progress
monitoring, evaluation of results and approval of the phased payments.


2.1.2 External Evaluators

An independent panel of experienced external evaluators ensured an open, trans-
parent and expert-evaluation based selection process. The pool of evaluators was
created using a separate open call. The evaluators were responsible for evaluating
the project proposals and worked remotely using the web interface of the ELG Open
Calls Platform. They were selected from the pool, avoiding any conflicts of inter-
est. All evaluators were asked to sign a non-conflict of interest declaration and a
confidentiality agreement before being accepted to perform the task.
14 Open Calls and Pilot Projects                                                 259

2.1.3 Management Team

The management team organised the whole Open Calls process, including managing
and directing the technical team. In line with Annex K of the Work Programme (Eu-
ropean Commission 2017) and other relevant sections of the Rules for Participation,
the management team prepared all prerequisites and procedures: the Open Calls Plat-
form, web content, informational materials, forms, contract templates, presentation
and reporting forms and templates, submission procedure, hiring and selection of
external evaluators, call management structure, internal auditing and project results
evaluation procedures. In the initial setup phase, the management team tapped the
legal and financial expertise of the Technology Centre of the Czech Academy of
Sciences, which is charged by the Czech government to host the National Contact
Point (NCP) and other experts related to the preparation, execution and evaluation
of EU framework programmes and projects.


2.1.4 Technical Team

An essential task was to set up the ELG Open Calls Platform for the proposal submis-
sion, evaluation and reporting process. We decided to develop the platform in-house
to ensure that it fit our needs.1 The technical team was responsible for developing
the platform and for support during each phase of the process.



2.2 Timeline

Figure 1 shows the open calls execution timeline. After the announcement, each call
was open for submissions for two months, followed by an evaluation procedure of
approx. two months. After signing the contracts with the selected projects, the exe-
cution phase started. The expected project duration was 9-12 months. Four projects
asked for a short extension of one or two months (which was accepted), mainly due
to COVID-19 related delays of dissemination activities.



2.3 Communication with Stakeholders

Prospective applicants were targeted through various channels, e. g., the open calls
website, a survey for stakeholders and other communication and dissemination ac-
tivities carried out by all ELG consortium members.
    From early 2019 onwards, the open calls were presented on the ELG website.2
The content was regularly updated, starting from basic information including the
1   https://opencalls.european-language-grid.eu
2   https://www.european-language-grid.eu/open-calls
260                                                                                                                                                Lukáš Kačena, Jana Hamrlová, and Jan Hajič
    Mar               Apr    May           Jun        Jul      Aug    Sep       Oct             Nov    Dec          Jan         Feb        Mar     Apr   May     Jun    Jul   Aug         Sep      Oct   Nov    Dec     Jan   Feb    Mar
    2020             2020    2020         2020       2020      2020   2020     2020             2020   2020        2021         2021       2021   2021   2021   2021   2021   2021        2021    2021   2021   2021   2022   2022   2022




    Open Call 1
Open Call                                           Contract
Published                                           Signing


           Project               Evaluation
    Submission Period       (Experts+Pilot Board)



                                                                                                       Project Execution




    Open Call 2
                                                                             Open Call                                                 Contract
                                                                             Published                                                 Signing


                                                                                      Project                Evaluation
                                                                              Submission Period         (Experts+Pilot Board)



                                                                                                                                                                              Project Execution




    Mar               Apr    May           Jun        Jul      Aug    Sep       Oct             Nov    Dec          Jan         Feb        Mar     Apr   May     Jun    Jul   Aug         Sep      Oct   Nov    Dec     Jan   Feb    Mar
    2020             2020    2020         2020       2020      2020   2020     2020             2020   2020        2021         2021       2021   2021   2021   2021   2021   2021        2021    2021   2021   2021   2022   2022   2022




Fig. 1 Open Calls overall timeline



timeline and key parameters at the beginning of the project, followed by the call for
evaluators3 and complete information regarding the open calls4 .
   We first monitored the interest in the open calls using a survey, which ran from
May 2019 until June 2019. A total of 108 respondents participated. The result
showed significant interest in the open calls and also a high demand for more in-
formation. Five months before the first call announcement, a second survey was pre-
pared. We disseminated this survey during the first annual ELG conference META-
FORUM 2019 in October 2019 in Brussels and collected answers from 47 respon-
dents, 84% of which expressed an interest in taking part in the open calls.
   The open calls were promoted through social media (Twitter, LinkedIn), various
e-mail distribution lists, internal networks and collaborators, through the META-
FORUM conference and through other means whenever an opportunity arose.



2.4 Submission Process

As explained in the previous section, in the preparatory period the overall open call
procedure was set up, including all related documents and the development of the
online platform for the management and evaluation of submissions. After the official
announcement of one of our two open calls, applicants could then prepare and submit
their project proposals. There was a continuous need for support, mainly answering
questions we received by the participants via email.
   With regard to the call announcement, we paid special attention to a well-prepared
call documentation, which provided all necessary information for applicants, and
a user-friendly submission platform. The documentation was prepared as an easy-
to-understand document. It contained several annexes: Guide for Applicants, Third
Party Agreement, Project Proposal Template and Evaluation Criteria.

3    https://www.european-language-grid.eu/open-calls/call-for-evaluators
4    https://www.european-language-grid.eu/open-calls
14 Open Calls and Pilot Projects                                                    261

   In the “Guide for Applicants” the management team showed, using screenshots,
how to submit a project proposal through the platform, i. e., how to create an appli-
cant account, how to log in and manage the account, how to create a new project
proposal, fill in the forms and finally submit the proposal. We also maintained a list
of (expected) frequently asked questions, for example “Who can apply for a pilot
project?”, “How much money is allocated for the pilot projects?”, and “Does Brexit
have any implications on eligibility?”.
   The Open Calls Platform was developed using the open source Content Manage-
ment System Drupal with the guiding principle to keep the submission and evalu-
ation process easy and straightforward for the participants and manageable for the
call organisers. The platform runs under the ELG domain5 , while physically residing
with the technical team to ensure quick reactions to any technical problems.



2.5 Evaluation Process

2.5.1 Preparation of the Evaluation Process

The most important part of the preparation of the evaluation process was the selection
and specification of evaluation criteria that match the objectives to be achieved by
the calls. At the same time, the criteria ought to be clear for the external experts
evaluating each proposal.
    The criteria were defined and described in detail in the call documentation. First,
the submitted proposal should fulfill formal requirements (language, submission
date, declaration of honor, legal status, eligible country, number of submitted propos-
als per applicant and no conflict of interest) which were checked by the management
team before any further evaluation. Then, three independent evaluators checked the
binary eligibility criteria: uniqueness, relevance for ELG, and whether the proposal
contains all the required phases (experiment, integration, dissemination). These were
followed by the graded and ranked evaluation criteria: objective fit, technical ap-
proach, business, integration and dissemination plan, budget adequacy, and team.
    In order to identify evaluators with experience in language technologies and eval-
uation, a call for evaluators was published in February 2020. All relevant informa-
tion (description of tasks, eligibility of candidates, selection criteria, contact email
for questions, and a link to the registration form on the Open Call platform) was pub-
lished on the ELG website as well as on the European Commission Funding and Ten-
der portal. In addition, ELG consortium members disseminated the call through var-
ious channels. Potential evaluators were asked to fill in a registration form, through
which contact information, CV, and professional experience related to evaluation
and LT were collected. From about 156 applications, the management team selected
64 evaluators (a total for both project open calls) with relevant expertise in both the
subject field(s) and in evaluating projects of at least similar size.

5   https://opencalls.european-language-grid.eu
262                                          Lukáš Kačena, Jana Hamrlová, and Jan Hajič

   Before assigning projects to evaluators, we sent instructions via email and we
organised webinars in which the evaluation process and criteria were explained.
   All evaluators signed a contract with the ELG project. The contract included a
clause to keep in strict confidence any technical or business information about the
evaluated projects, as well as a no-conflict-of-interest declaration.


2.5.2 Execution of the Proposal Evaluation Process

Each proposal was evaluated by three independent external experts to ensure an trans-
parent selection process. The evaluators were carefully assigned to the proposals by
the management team. We also paid attention to gender (at least one female evaluator
per proposal) and country of residence of the evaluator, avoiding at the same time
possible personal or nationality-based conflicts of interest. The whole process was
monitored by the Pilot Board. Each proposal was assigned to one of the PB members.
These project coaches checked and confirmed or rejected the selection of evaluators
with special regard to conflict of interest.
    After the evaluation, the project coaches prepared summary reports for each pro-
posal assigned to them. In these summaries, the coaches first reviewed the three
reports by the external evaluators. They also suggested potential budget adjustments
and changes of the total number of points (the maximum was 300 points, i. e., 100
points from each evaluator) in range of at most 30 points (open call 1) or 45 points
(open call 2) up or down, where applicable. According to the evaluation criteria,
project proposals by SMEs developing applications using LT available in ELG (B
type projects) received 30 bonus points. Finally, the project coaches reviewed the
eligibility criteria (uniqueness, relevance for ELG and project phases) as checked
by the evaluators and suggested their decision on their fulfilment if the evaluators
differed in opinion. The coaches also assessed the performance of the evaluators and
quality of the reports. After all summary reports had been submitted by the coaches,
a Pilot Board meeting was convened, in which the final ranking and selection was
decided. All proposals were ranked by the total sum of points assigned. The ranked
list was cut at the maximum available financial support (1,365,000€ for open call 1
and 585,000€ for open call 2).



3 Results

3.1 Open Call 1

3.1.1 Overview

The first call was opened on 1 March 2020 and closed on 30 April 2020 in accor-
dance with the timeline (Figure 1). We accepted a total of 110 project proposals for
evaluation from 103 applicants.
14 Open Calls and Pilot Projects                                                   263

                           Submitted by        Type A    Type B    Total
                       Research organisation        43         5        48
                       SME                          36        26        62
                       Total                        79        31       110
Table 1 Proposals submitted to the first open call and accepted for evaluation



    Seven applicants (five SMEs and two research organisations) submitted two pro-
posals (one type A and one type B). Regarding the type of project, 79 submitted
proposals were of type A (contribute resources, services, tools, or datasets to ELG)
and 31 proposals were of type B (develop applications using language resources and
technologies available in ELG), see Table 1. We received proposals from 29 differ-
ent countries, including eligible countries outside the EU (Iceland, Israel, Norway,
Serbia, South Africa, Switzerland, Turkey, United Kingdom). The total amount of fi-
nancing requested by the submitted projects was 16,900,000€. One project requested
283,000€, which was over the limit of 200,000€ per project, and the lowest requested
amount was 50,000€. The average amount requested per project was 153,000€.
    At the end of June 2020, the results of the first open call were announced on the
ELG website, including the list of projects selected for funding.6 The two projects
from the reserve list were informed that they might be selected for financial support
if any of the selected projects rejected the financial support. The remaining projects
were informed that they were not selected. In July 2020, contracts with all selected
projects were signed, and the first payments were made (half of the awarded financial
support), in line with the approved call documentation and procedures. All projects
had started their execution phase by August 6. Furthermore, at the end of July 2020,
abridged versions of the summary evaluation reports were provided to all applicants
through the Open Calls Platform.


3.1.2 Selected Projects

The projects selected in open call 1 are listed in Table 2. All supported organisations
are from the EU – three from Finland, two from Austria, Germany and Italy, and one
from Spain. The awarded budget varies from 87,445€ to 167,375€.
   Although we obtained more proposals from SMEs than from research organisa-
tions, there are three SMEs and seven research organisations among the selected
projects. Similarly, although B type projects from SMEs were preferred, only two B
type projects were accepted for financing which probably reflected the fact that the
ELG platform was still being developed at the time of the first open call. Thus, it
appeared to make more sense to create missing resources or tools rather than build
applications using resources and tools available in ELG.


6   https://www.european-language-grid.eu/open-calls/open-call-1
264                                                    Lukáš Kačena, Jana Hamrlová, and Jan Hajič

           Organisation                          Pilot Project              Type Country Funding
Fondazione Bruno Kessler              European Clinical Case Corpus          A         IT   139,370€
Lingsoft, Inc.                        Lingsoft Solutions as Distributable    A         FI   140,625€
                                      Containers
Coreon GmbH                           MKS as Linguistic Linked Open          A         DE   167,375€
                                      Data
Elhuyar Fundazioa                     Basque-speaking smart speaker          B         ES   117,117€
                                      based on Mycroft AI
Universita’ Degli Studi di Torino     Italian EVALITA Benchmark Lin- A                 IT   126,125€
                                      guistic Resources
University of Helsinki                Open Translation Models, Tools         A         FI   154,636€
                                      and Services
University of Vienna                  Extracting Terminological Con- A                 AT   132,977€
                                      cept Systems from Text
University of Turku                   Textual paraphrase dataset for         A         FI   166,085€
                                      deep language modelling
Weber Consulting KG                   Virtual Personal Assistant Proto- B              AT    87,445€
                                      type
FZI Research Centre for Informa- Streaming Language Processing in            A         DE   132,160€
tion Technology                  Manufacturing
Table 2 List of pilot projects selected for financial support in the first open call



    Four of the eight A type projects aimed to enrich the ELG platform with language
resources and six of them planned to provide various language tools (i. e., two of
the projects provide both resources and tools). The two B type projects promised
speech applications – a smart speaker and a digital twin based on real-time language
translation and analysis. The projects in general often dealt with underrepresented
languages such as Basque, the Nordic languages, and European minority languages.
    Technologically, the projects targeted a diverse set of goals and areas. There are
projects targeting important interdisciplinary areas (medical informatics, manufac-
turing), modern technologies relating to language and semantic as well as world
knowledge (Linked Open Data, paraphrasing) and core scalable technologies (dis-
tributable containers). Evaluation platforms as well as advanced and scalable ma-
chine translation still are and will be relevant issues for Language Technologies. Fi-
nally, the two speech-oriented applied projects broaden the portfolio of the usual
Language Technologies in the desired direction, too.


3.1.3 Feedback provided and Survey for Proposers

With the goal of evaluating and improving our open call procedure, we conducted
several surveys with everyone involved in the first open call. We started with the
14 Open Calls and Pilot Projects                                                  265

project proposers. After the evaluation process we also conducted a survey among
all evaluators. The last survey was conducted among the Pilot Board members.
    Two short surveys were designed for those who submitted a proposal (proposers)
and those who uploaded an initial draft but did not submit a final version (non-
proposers). The survey consisted of 15 questions, some open and some multiple
choice. The survey topics were clustered into three sections: “motivation”, “project
proposals”, and “your organisation”. The information was collected anonymously.
    The surveys were conducted in May 2020. Of the proposers, 73 out of 110 (66%)
responded, and of the non-proposers, 6 out of 17 (35%) responded. The main con-
clusions from the proposers’ survey that were relevant for the setup of the second
open call: Almost 70% of respondents were interested in ELG because of both (func-
tional) services and datasets. Slightly more than two thirds of the respondents pre-
ferred smaller, agile calls over large, consortium-based calls.
    There was a demand for more detailed documentation (e. g., in the form of a
webinar) that allows proposers to better interpret the strategic goals of ELG and
get better information on already existing services in ELG. More details about the
ELG API integration and about the infrastructure for working with data, applications
and possibly also workflows were requested. Some improvements of the Open Calls
Platform and its user-friendliness were made (e. g., limited space).



3.2 Open Call 2

The second open call was launched in October 2020 and experience from the first
open call was reflected in its organisation.


3.2.1 Changes made between Open Call 1 and Open Call 2

The basic parameters, specified in the ELG Grant Agreement, remained the same for
the second open call. Based on the lessons learned from open call 1, we implemented
the following changes in the call documentation and the open call procedure:
  • We improved the explanation of the strategic goals of ELG and the goals of
    the open calls. Links to an overview of ELG, its history and context and to an
    overview of the ELG platform were provided in the call documentation.
  • We also improved the technical documentation of the ELG infrastructure and
    provided an easy-to-find list of currently available services – this was done with
    the launch of ELG Release 1 (June 2020).
  • We organised a webinar, which took place during the submission period, on
    12 November 2020. We explained the goals of the open call and presented the
    call documentation. The second part of the webinar was dedicated to questions
    and a discussion. A recording was made available to all applicants.
  • The documentation, annexes, templates, and forms along with the Open Calls
    Platform were further improved.
266                                                 Lukáš Kačena, Jana Hamrlová, and Jan Hajič

    • In the proposal template, budget breakdowns were requested in a fixed structure
      as well as a more detailed budget justification.
    • New evaluators were recruited and added to the current group, with the aim to
      attract more experienced evaluators.
    • It was decided that the second open call, like the first open call, should have no
      specific thematic focus.


3.2.2 Overview

The second call was opened on 1 October 2020 and closed on 30 November 2020 in
accordance with the open calls timeline (Figure 1). We accepted 103 project propos-
als in total for evaluation.


                           Submitted by        Type A    Type B    Total
                       Research Organisation        38        5        43
                       SME                          28       32        60
                       Total                        66       37       103
Table 3 Proposals submitted to the second open call and accepted for evaluation



   Five applicants (four SMEs and one research organisation) submitted two propos-
als (one type A and one type B). Regarding the project type, 66 proposals were of
type A, and 37 project proposals were of type B. A total of 43 applicants who submit-
ted a proposal in the second open call indicated that they had submitted the same or
a similar proposal in the first open call. We received applications from 28 different
countries, including eligible countries outside the EU (Iran, Israel, Norway, Serbia,
Switzerland, Turkey, United Kingdom). The total amount of financing requested by
the submitted projects was 13,257,919€. The average amount requested per project
was 129,000€, which is less than in the first open call (153,000€).
   In February 2021, the results of the second open call were announced on the ELG
website.7 All applicants were informed about the results. In February and March
2021, contracts with all selected projects were signed, and the first payments were
made (half of the awarded financial support), in line with the call documentation
and procedures. All projects had started their execution phase by 1 April 2021. Fur-
thermore, in March 2021, abridged versions of the summary evaluation reports were
made available to all applicants through the Open Calls Platform.




7   https://www.european-language-grid.eu/open-calls/open-call-2
14 Open Calls and Pilot Projects                                                             267

3.2.3 Selected Projects

The projects selected for financial support in open call 2 are listed in Table 4. The
supported organisations are from five EU countries and the awarded budget varies
between 85,421€ and 137,227€.


           Organisation                     Pilot Project Name           Type Country Funding
Institute for Bulgarian Language     Multilingual Image Corpus 2021        A       BG   110,960€
EDIA BV                              CEFR Labelling and Assessment         B       NL   137,560€
                                     Services
University of West Bohemia           Motion-Capture 3D Sign Lan- A                 CZ    85,421€
                                     guage Resources
Sapienza University of Rome          Universal Semantic Annotator: A               IT   113,228€
                                     A Unified API for Multilingual
                                     WSD, SRL and AMR
Sign Time GmbH                       Sign language explanations for        B       AT   137,227€
                                     terms in a text
Table 4 List of pilot projects selected for financial support in the second open call



   Although we obtained more project proposals from SMEs than from research or-
ganisations, there are two SMEs and three research organisations among the selected
projects. Similarly, only two B type projects were accepted for financing.
   Three A type projects aimed at providing tools to enrich the ELG platform. One
project contributed multilingual annotated data, tools and services for image process-
ing whilst the second one aimed at improving the ELG offer of linguistic tools by
proposing a unified service powered by state-of-the-art neural models for carrying
out annotations on three Natural Language Understanding tasks, i. e., Word Sense
Disambiguation, Semantic Role Labelling and Semantic Parsing, in around 100 lan-
guages. The third A type project expanded the portfolio of language resources avail-
able in ELG by adding a dataset and search tool for Czech sign language. Regarding
the B type projects, one of the projects also dealt with sign language. Its goal was to
simplify text comprehension for deaf people by linking words and phrases to a sign
language encyclopedia. The other project aimed to develop a set of tools, datasets,
and services to enable automatic classification of the reading difficulty of texts on
the Common European Framework of Reference.


3.2.4 Survey for Proposers to the Open Call 2

Just like for the first open call, a survey with 15 questions was designed for those
who submitted a proposal. The survey had three sections: “motivation”, “project
proposals”, “your organisation”. In total, 39 out of 103 proposers (38%) responded.
Regarding the motivation to submit a proposal, contributing services or resources
268                                          Lukáš Kačena, Jana Hamrlová, and Jan Hajič

to ELG to make them available to the ELG community and further development of
an existing software or data project were the most frequent reasons reported by the
respondents. The main expectations toward ELG were that the platform increases
the visibility of the applicant’s organisation on the European level and to get ac-
cess to a large repository of tools and datasets. Also, almost all respondents think
that more EU-funded activities dedicated to Language Technology and Language-
centric AI are needed, preferably in the form of agile calls (with short proposals and
quick evaluations, 9-12 months project run-time). Regarding the specialisation of
respondents, most frequently they specialised in text analytics, machine translation
or speech recognition. Respondents reported more than twenty domains that they
specialise in (most frequently health sector), one fourth of all respondents have no
particular specialisation.



4 Pilot Project Execution

Once the pilot projects were selected and the contracts signed, the continuous support
from the ELG consortium started so that the projects could start their execution.
The first opportunity where the newly selected pilot projects could become more
familiar with ELG were the online meetings with the Pilot Board and other members
of the ELG consortium. During these meetings, basic information about ELG and
its technology as well as guidelines for project execution were presented.
    Project execution (Figure 2) consisted of three phases: Phase 1 – Experiment;
Phase 2 – Integration; Phase 3 – Dissemination. After finishing Phase 1, report-
ing from the applicants was required, and then the Pilot Board decided whether the
project was allowed to continue execution (and consequently, whether the next pay-
ment, 35% of the awarded support, is made). After finishing Phase 3, a final report
was required, and the Pilot Board evaluated the whole project and decided whether
the project receives the final payment (15% of the awarded financial support).
    As mentioned, each project was supervised by a project coach who was responsi-
ble for training the project team, collecting and answering questions during project
execution, collecting reports, and guiding the team through the project phases.
    To advertise them to a wider public, the pilot projects were presented at two an-
nual ELG conferences, i. e., META-FORUM 2020 and META-FORUM 2021, in
dedicated pilot project sessions in which all projects could present their main ap-
proaches and goals. In addition, workshops and training events organised by the ELG
National Competence Centres (NCCs) were also used as opportunities to present cer-
tain pilot projects in the respective countries and regions.
14 Open Calls and Pilot Projects                                                                              269

                              Project Evaluation:
   Project Proposal                                                 Pilot Board Meeting      Contract Signing +
                              three independent
  Submission Period                                                (selection of projects)   first payment (50%)
                             experts per proposal
 1 March – 30 April 2020                                                 June 2020                July 2020
                               May – June 2020




   Project Execution
                                                                                             Pilot Board approves
 (Phase 2: Experiment)                              Reporting to
                                                                                               Phase 2 + second
                                                    Pilot Board
                                                                                                payment (35%)
    Start in July 2020




   Project Execution
                                                                                             Pilot Board approves
 (Phase 3: Integration,                             Reporting to
                                                                                                 Phases 3 & 4 +
       Phase 4:                                     Pilot Board
                                                                                              final payment (15%)
    Dissemination)




Fig. 2 Project execution scheme for pilot projects from the first open call



5 Conclusions

The results of the two open calls demonstrate an enormous interest in the European
Language Grid and the Language Technology topic in general. The interest also
indicates that the setup, including documentation, proposal template, platform etc.,
was easy to follow. In total, we received 213 project proposals from 156 different
institutions (86 SMEs, 70 research organisations) in 32 different countries (including
nine eligible countries outside the European Union); 15 projects were selected for
funding, ten in the first open call and five in the second. The total amount requested
was approx. 30 mil. €, while the available funding amounted to only 1.95 mil. € (an
oversubscription of more than 15 times).
   In the following we briefly summarise the main lessons learned, as gathered
through the different surveys (see Sections 3.1.3 and 3.2.4):
  • We aimed at a simple and light-weight procedure which led to a high number
    of submitted proposals. At the same time, the simplicity of the proposal tem-
    plate may have led to a higher number of low-quality proposals that were not
    adequately described or thought through. In both calls this rather high number
    of proposals required more person days and increased the costs related to the
    external evaluators.
  • The quality of evaluation reports submitted by external evaluators was not en-
    tirely stable and, in some cases, could have been more profound. This was usu-
    ally balanced by the project coach or Pilot Board.
  • It was a good decision to develop the Open Calls Platform internally. Among
    others, it provided us with more flexibility, control over deadlines and quick and
    reliable support from the technical team.
  • In the ELG project budget, the costs for the Open Calls Platform and for the
    proposal evaluation should have been planned more carefully.
270                                                 Lukáš Kačena, Jana Hamrlová, and Jan Hajič

    Overall, we were very satisfied with the open calls setup and with the results of
the pilot projects. While the results improved the ELG offering in terms of data, tools
and services, and the applications developed using the ELG provided mutual benefit
to the developers and ELG, we consider the overwhelming interest in the open calls
an extremely important, albeit non-technical result: it demonstrates that Language
Technologies are of tremendous interest to both researchers and commercial compa-
nies. It also shows that the open calls setup, as designed and implemented, was very
attractive and can be considered as a model in similar undertakings in the future.



References

European Commission (2017). Horizon 2020 – Work Programme 2018-2020. Annex K: Actions
   involving financial support to third parties. Extract from Part 19 – Commission Decision
   C(2017)7124. Brussels, Belgium. URL: https://ec.europa.eu/research/participants/data/ref
   /h2020/other/wp/2018-2020/annexes/h2020-wp1820-annex-k-fs3p_en.pdf.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 15
Basque-speaking Smart Speaker based on
Mycroft AI

Igor Leturia, Ander Corral, Xabier Sarasola, Beñat Jimenez, Silvia Portela, Arkaitz
Anza, and Jaione Martinez




Abstract Speech-driven virtual assistants, known as smart speakers, such as Amazon
Echo and Google Home, are increasingly used. However, commercial smart speak-
ers only support a handful of languages. Even languages for which ASR and TTS
technology is available, such as many official EU member state languages, are not
supported due to a commercial disinterest derived from their – relatively speaking
– rather small number of speakers. This problem is even more crucial for minority
languages, for which smart speakers are not expected anytime soon, or ever. In this
ELG pilot project we developed a Basque-speaking smart speaker, making use of
the open source smart speaker project Mycroft AI and Elhuyar Foundation’s speech
technologies for Basque. Apart from getting it to speak Basque, one of our goals was
to make the smart speaker privacy friendly, non-gendered and use local services, be-
cause these are usual issues of concern. The project has also served to improve the
state of the art of Basque ASR and TTS technology.



1 Overview and Objectives of the Pilot Project

Commercial smart speakers are increasingly popular despite the fact that their lan-
guage coverage leaves much to be desired. Many large official national languages
and practically all minority languages are unsupported by these devices. In many
cases, the lack of support for a language in a smart speaker is not due to the lack of
the necessary speech technologies, i. e., Automatic Speech Recognition (ASR) and
Text To Speech (TTS). ASR and TTS technologies do exist for the Basque language


Igor Leturia · Ander Corral · Xabier Sarasola
Elhuyar Fundazioa, Spain, i.leturia@elhuyar.eus, a.corral@elhuyar.eus, x.sarasola@elhuyar.eus
Beñat Jimenez
Talaios Koop., Spain, jimakker@talaios.coop
Silvia Portela · Arkaitz Anza · Jaione Martinez
Skura Mobile, Spain, silvia@skuramobile.com, arkaitz@skuramobile.com, jaione@skuramobile
.com


© The Author(s) 2023                                                                       271
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_15
272                  Igor Leturia, Ander Corral, Xabier Sarasola, Beñat Jimenez, Silvia Portela et al.

but it is unlikely that they will be implemented in smart speakers developed by the
big technology enterprises because of its relatively small number of speakers.
   On the other hand, there is a rather mature, open source smart speaker project
called Mycroft AI.1 Our ELG pilot project develops an open source smart speaker for
the Basque language, based on Mycroft AI, that makes use of Elhuyar Foundation’s
ASR and TTS technologies. Apart from being open source and in Basque, other
points of interest were the handling of privacy, gender and service locality issues.
   One objective of the project was to improve the state of the art of Basque ASR and
TTS technologies, since it would be necessary to adapt them to the context of a smart
speaker. Specifically, we wanted to 1. improve the performance of Basque ASR tech-
nology for noisy environments; 2. create a grammar-based ASR system instead of
a general vocabulary one to only recognise the commands of the speaker and, thus,
improve precision; 3. create a neural network-based TTS system for Basque and
replace the old HMM one; and 4. try to develop a gender-neutral voice.



2 Mycroft Localisation

A crucial and necessary part of the project was the localisation of Mycroft to Basque
in its broadest sense. This involved not only a string translation process, but also
making it understand speech commands and respond via speech in Basque. Thus,
we had to develop plugins to connect Mycroft to Elhuyar’s ASR and TTS services.
   The localisation also involved the adaptation to Basque of Mycroft’s linguistic
module called lingua-franca, responsible for parsing numbers, days, times, dura-
tions, etc. in speech commands and to pronounce them correctly when responding.
   Finally, the routine job of string translation of any software localisation process
turned out not to be as straightforward for the commands’ part. The parsing of many
skills’ intents from the commands is done by simply detecting some required or
optional keywords and parameters, which is why their translation required more
than just a simple sentence translation. We translated the Mycroft core module and
40+ of its skills (volume control, date, time, lists, alarms, audio record, radio, news,
Wikipedia, weather, jokes, Wikiquote, e-mail etc.).



3 Privacy, Gender and Proximity

As mentioned in Section 1, we wanted to address the privacy and gender concerns
often associated with smart speakers and also promote the use of local services. Re-
garding privacy, users and potential buyers have concerns with having a device in
their homes with a microphone that is always on (Lau et al. 2018). However, respect
for privacy is precisely one of Mycroft AI’s unique selling propositions. They claim

1   https://mycroft.ai
15 Basque-speaking Smart Speaker based on Mycroft AI                                273

that they are “private by default” and that they “promise to never sell your data or
give you advertisements” using their technology. This materialises in the fact that
the wake word (“Hey, Mycroft”) is detected locally, i. e., no audio is sent to remote
servers except when saying a command after the detection of the wake word. On the
other hand, if some big enterprise’s cloud-based ASR or TTS services are used for
the recognition of commands and the utterance of responses, there are logically some
doubts as to what these companies will do with that data. Using Elhuyar’s Basque
ASR and TTS remote APIs from Mycroft, no data would be kept or collected.
    Regarding gender treatment, smart speakers are known for their improper gen-
der treatment, as stated in the Unesco report “I’d blush if I could: closing gender
divides in digital skills through education” (West et al. 2019). According to this re-
port, practically all commercial smart speakers exhibit a female voice and female
personalities, and respond obligingly even to hostile requests, verbal abuse and sex-
ual harassment, which may lead to reinforce and spread gender biases. The report
ends with some recommendations that range from not making digital assistants fe-
male by default to developing neutral voices and personalities, which our project has
tried to follow. The Basque voice installed at the moment is a male voice by default.
Also, the speaker’s name, Mycroft, – although fictional – is male, its “personality” is
neutral, and it has no skill to respond in a docile manner to sexual comments or ver-
bal abuse. However, we have also carried out some experiments in order to develop
a gender-neutral synthetic voice (see Section 4.4).
    We felt that our smart speaker should prioritise the local region and, for instance,
allow listening to local radio stations, read the news from local media or buy goods
or order food from local stores. We developed half a dozen local skills of our own,
including local news, local radio stations, dictionary querying or Basque music.



4 Developments in Basque Speech Technology

4.1 ASR Robustness in Noisy Environments

One of the main challenges regarding the use of ASR technology in a smart speaker
is making it robust enough to be reliable under non-optimal conditions: low volume,
background noise, music, speech, room reverberation, low quality microphone, etc.
   Elhuyar’s ASR system for the Basque language is a general purpose system based
on the Kaldi2 toolkit. The speech data used to train the acoustic model comprises
high quality clean parliamentary speeches. To make our acoustic model more ro-
bust, we used several synthetic data augmentation techniques during the training
phase (Alumäe et al. 2018). This means that training data was 1. synthetically aug-
mented by adding background noises from the MUSAN dataset (Snyder et al. 2015),
which comprises several recordings of music, speech and a wide variety of noises;


2   https://kaldi-asr.org
274            Igor Leturia, Ander Corral, Xabier Sarasola, Beñat Jimenez, Silvia Portela et al.

2. artificially reverberated with various real and simulated room impulse responses
(Ko et al. 2017); and 3. augmented with threefold speed and volume perturbations.



4.2 ASR Closed Grammar-based Recognition

For general purpose ASR systems, typically a large language model is trained with
a vast amount of diverse texts. For a smart speaker, however, where the user is ex-
pected to use a closed set of commands, limiting the ASR’s vocabulary to just the
necessary commands can increase the precision of the speech recognition.
   Since Kaldi internally uses weighted finite state transducers (WFST) to model the
language, simply by converting all the commands defined in Mycroft skills to the for-
mat used by Pynini (a Python library for WFST grammar compilation), we would
obtain a language model limited to Mycroft’s commands. But although Mycroft’s
skills were originally defined using its old-style intent parser Padatious (where the
whole command is defined), nowadays most skills use the new intent parser Adapt,
which defines commands using a few keywords and parameters. This makes it un-
feasible to automatically generate all possible commands containing the keywords
and parameters. Rewriting all skills to the Padatious format would have made the
code much more difficult to maintain as well as losing Adapt’s recall gain. This is
why the creation of a custom grammar was eventually discarded.



4.3 Neural Network-based Basque TTS

Elhuyar’s previous Basque TTS service was based on Hidden Markov Models
(HMMs). In the ELG pilot project we developed a new neural network-based TTS
service. Since the first neural system was published in 2013 (Zen et al. 2013), these
have taken a clear advantage over HMM-based approaches and systems like Taco-
tron 2 (Shen et al. 2018) have achieved naturalness comparable to natural voice.
   The key challenge with neural TTS systems is the size of the training dataset. The
original Tacotron 2 monospeaker system was trained with 24.6 hours of speech, and
subsequent research concluded that 10 hours is the minimum time required to ob-
tain maximum quality (Chung et al. 2019). The only publicly available database of
Basque speech of that size is a multispeaker database created by Google (Kjartans-
son et al. 2020), which contains recordings from 53 speakers with a maximum of 15
minutes per speaker. Modified configurations of Tacotron 2 using speaker embed-
dings have proved successful providing good quality multispeaker TTS systems (Jia
et al. 2018), i. e., systems trained using combined recordings of multiple speakers,
capable to synthesise the voice of each of them. We recorded a small multispeaker
database, combined it with the Google database, and trained a multispeaker TTS
using speaker embeddings, obtaining our own neural quality TTS voices.
15 Basque-speaking Smart Speaker based on Mycroft AI                                          275

4.4 Gender-neutral Voice

Apart from the interventions to address gender issues (Section 3), we conducted
experiments towards obtaining a gender-neutral voice. Tolmeijer et al. (2021) ob-
served that we do not regard voices of intermediate pitch (which is what could be
understood as gender-neutral) as genderless, that we assign them one gender or the
other, and that those that could be best considered as ambiguous in terms of gender
or genderless were those with the greatest division of opinion.
    Most of the literature on the field of generating gender-ambiguous voices seek
gender neutrality through pitch modification, such as Tolmeijer et al. (2021), or the
first genderless voice Q (Carpenter 2019). We employed a different and innovative
approach. We first calculate the average speaker embedding for each gender with
the embeddings obtained in the training and then we compute the embedding that is
midway between the average male and female embeddings. Using this embedding in
the trained Tacotron 2, we can synthesise sentences with a voice which has produced
divided opinions as to its gender and which can thus be considered genderless.



5 Conclusions and Results of the Pilot Project

This ELG pilot project developed an open source Basque-speaking smart speaker
based on Mycroft AI, which respects privacy and which uses a more appropriate ap-
proach regarding the voice’s gender than commercial smart speakers. We connected
Mycroft to Elhuyar’s Basque ASR and TTS services, and we improved the state of
the art of Basque speech technologies. Our ASR for Basque performs better in noisy
environments and we developed a new deep neural network-based TTS for Basque
and made experiments towards a gender-ambiguous synthetic voice. We translated
more than 40 Mycroft skills and developed half a dozen new ones addressing local
services. We tested the Basque Mycroft in PCs and Google AIY Kits.
    Anyone can now download, install on a device and try Mycroft in Basque. While
the ELG pilot project is finished, we continue to work on the project with the aim of,
if possible, bringing a Basque smart speaker device to the market. We believe that
the work carried out, the experience gained and the code developed in the ELG pilot
project can be very useful for other minority language communities that would like
to have access to a smart speaker that speaks their own language.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects. The project has also been helped by the Basque
Government through its Hazitek programme (project DomEus) and the Gipuzkoa Provincial Coun-
cil through its Etorkizuna eraikiz programme (project Mycroft.eus).
276               Igor Leturia, Ander Corral, Xabier Sarasola, Beñat Jimenez, Silvia Portela et al.

References

Alumäe, Tanel, Ottokar Tilk, and Asad Ullah (2018). “Advanced rich transcription system for Esto-
    nian speech”. In: Human Language Technologies – the Baltic Perspective: Proc. of the Eighth
    Int. Conference (Baltic HLT 2018). Ed. by Kadri Muischnek and Kaili Müürisep. Amsterdam,
    the Netherlands: IOS Press, pp. 1–8. DOI: 10.3233/978-1-61499-912-6-1.
Carpenter, Julie (2019). “Why Project Q is More than the World’s First Nonbinary Voice for Tech-
    nology”. In: Interactions 26.6, pp. 56–59. DOI: 10.1145/3358912.
Chung, Yu-An, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, and RJ Skerry-Ryan (2019). “Semi-
    supervised training for improving data efficiency in end-to-end speech synthesis”. In: ICASSP
    2019. IEEE, pp. 6940–6944.
Jia, Ye, Yu Zhang, Ron J Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick
    Nguyen, Ruoming Pang, Ignacio Lopez Moreno, et al. (2018). “Transfer learning from speaker
    verification to multispeaker text-to-speech synthesis”. In: Proceedings of the 32nd International
    Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Asso-
    ciates, pp. 4485–4495.
Kjartansson, Oddur, Alexander Gutkin, Alena Butryna, Isin Demirsahin, and Clara E. Rivera
    (2020). “Open-Source High Quality Speech Datasets for Basque, Catalan and Galician”. In:
    SLTU-CCURL 2020. 11–12 May, Marseille, France, pp. 21–27.
Ko, Tom, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur (2017).
    “A study on data augmentation of reverberant speech for robust speech recognition”. In: ICASSP
    2017, pp. 5220–5224. DOI: 10.1109/ICASSP.2017.7953152.
Lau, Josephine, Benjamin Zimmerman, and Florian Schaub (2018). “Alexa, Are You Listening?
    Privacy Perceptions, Concerns and Privacy-Seeking Behaviors with Smart Speakers”. In: Proc.
    of Human-Computer Interaction 2.CSCW, pp. 1–31. DOI: 10.1145/3274371.
Shen, Jonathan, Ruoming Pang, Ron Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang,
    Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. (2018). “Natural TTS Synthesis
    by Conditioning Wavenet on Mel Spectrogram Predictions”. In: ICASSP 2018. IEEE, pp. 4779–
    4783.
Snyder, David, Guoguo Chen, and Daniel Povey (2015). “Musan: A music, speech, and noise cor-
    pus”. In: arXiv preprint arXiv:1510.08484.
Tolmeijer, Suzanne, Naim Zierau, Andreas Janson, Jalil Sebastian Wahdatehagh, Jan Marco Marco
    Leimeister, and Abraham Bernstein (2021). “Female by Default? – Exploring the Effect of Voice
    Assistant Gender and Pitch on Trait and Trust Attribution”. In: Conference on Human Factors
    in Computing Systems (CHI). New York, NY, USA: ACM, pp. 1–7.
West, Mark, Rebecca Kraut, and Han Ei Chew (2019). I’d blush if I could: closing gender divides
    in digital skills through education. Unesco EQUALS.
Zen, Heiga, Andrew Senior, and Mike Schuster (2013). “Statistical parametric speech synthesis
    using deep neural networks”. In: 2013 IEEE International Conference on Acoustics, Speech
    and Signal Processing. IEEE, pp. 7962–7966.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 16
CEFR Labelling and Assessment Services

Mark Breuker




Abstract Our pilot project aims to develop a set of text collections and annotation
tools to facilitate the creation of datasets (corpora) for the development of AI classifi-
cation models. These classification models can automatically assess a text’s reading
difficulty on the levels described by the Common European Framework of Refer-
ence (CEFR). The ability to accurately and consistently assess the readability level
of texts is crucial to authors and (language) teachers. It allows them to more easily
create and discover content that meets the needs of students with different back-
grounds and skill levels. Also, in the public sector using plain language in written
communication is becoming increasingly important to ensure citizens can easily ac-
cess and comprehend government information. EDIA already provides automated
readability assessment services (available as APIs and an online authoring tool) for
the CEFR in English. Support for Dutch, German and Spanish are added as part of
this project. Using the infrastructure developed in this project the effort for creating
high quality datasets for additional languages is lowered significantly. The tools and
datasets are deployed through the European Language Grid. The project is scheduled
to be completed in the second quarter of 2022.



1 Overview and Objectives of the Pilot Project

The CEFR (Common European Framework of Reference for Languages: Learning,
Teaching, Assessment, Council of Europe 2020) aims to provide a comprehensive
learning, teaching and assessment method that can be used for all European lan-
guages. Indicating the level of learners of foreign languages, the CEFR facilitates
the assessment of a person’s language proficiency. By now, most are familiar with
the six reference levels (A1 – C2) used for this purpose (Figure 1).
   CEFR levels are the foundation for a communicative approach to (foreign) lan-
guage acquisition, teaching and certification. Although the CEFR levels represent a
widely supported approach, the availability and quality of (educational) content la-

Mark Breuker
EDIA b. v., The Netherlands, mark@edia.nl


© The Author(s) 2023                                                                  277
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_16
278                                                                           Mark Breuker




                A1              A2   B1           B2       C1            C2
                     beginner          intermediate             expert


Fig. 1 CEFR proficiency levels



belled with CEFR levels are limited. This is because the highly laborious, error-prone
labelling process is performed manually (save for some exceptions). This results in
several practical obstacles regarding publishing, teaching, and learning:
  • Content creators (publishers, authors, teachers, government officials) struggle
    to use consistent criteria for checking a text’s difficulty level.
  • Teachers have trouble finding or creating appropriate texts for their students.
  • Content managers struggle to monitor the readability level of their content col-
    lections over time.
   To tackle this problem, we have developed an automated text classification tech-
nology using Natural Language Processing. Our technology can perform CEFR text
levelling in a scalable and consistent manner for multiple languages at a very gran-
ular level. By removing blockers through automation, we expect to impact the prac-
tical application of CEFR, enabling the labelling of more content in less time in a
highly consistent manner. This way, we will lay the foundation for making written
content with properly labelled text levels more widely available, adhering to the
CEFR standard. After all, practical obstacles will have been eliminated.
   The European Language Grid (Rehm et al. 2021) provides EDIA with a market-
place to promote, sell and distribute its CEFR services to a broad audience. Through
the standardised ELG catalogue and API specification, developers can more easily
adopt the services provided by EDIA in their applications.



2 Methodology

The infrastructure for the CEFR readability services developed during the pilot
project consists of various components (Figure 2). The infrastructure facilitates the
creation of the CEFR readability assessment services, using the following process:
 1.   Data collection – collect (unlabelled) texts for each language
 2.   Data labelling – label the texts on CEFR reading level using human experts
 3.   Model training – train classification models on the datasets
 4.   Integration – expose the models as REST services on ELG using API proxies
 5.   Authoring – integrate the services in a CEFR levelling and authoring application
16 CEFR Labelling and Assessment Services                                                             279
                      CEFR
              Levelling & Authoring                                       Data Labelling




                   Application                                               Application




       uses                          uses                        output                       input
                                                                   data                       data


                                                     REST API

          CEFR                   Billing
                                                          CEFR   CEFR Labelled        Unlabelled



                                                    Word Lists       Texts                 Texts
       API Proxy                 Service

       uses




      CEFR SERVICE



                                            dataset for

        AI Model



Fig. 2 CEFR infrastructure diagram



3 Implementation

To create the corpus, we collected approx. 1,200 texts per language from various
public sources such as newspapers, magazines, educational sources and government
websites. To speed up the text collection process we developed several text-scraping
algorithms. Each text was stored as plain-text in a database together with information
about its source and copyright licence. To ensure that the unlabelled dataset was well
balanced and covered both easy and more difficult reading levels, we used texts from
sources known to be targeted at basic, intermediate and advanced language users. In
addition we used heuristics-based methods of readability assessment. This provided
us with an initial indication of the reading difficulty of each text.
   Our first attempt at a data labelling application was based on a pairwise compar-
ison algorithm (Crompvoets et al. 2020). We applied this approach on a collection
of 1,200 Dutch texts. The rationale for this approach was that comparing two texts
on reading difficulty is a relatively easy task for teachers and would suffer less from
inconsistent and subjective criteria used when evaluating a text directly on its CEFR-
level. This approach resulted in a rank-ordered list of texts on reading difficulty. Next
we set the boundaries for the CEFR-reading levels within this rank-ordered list. Un-
fortunately we found that we were not able to train a classification model on the
dataset. Upon closer inspection (based on a random sample of 100 texts) we found
280                                                                       Mark Breuker

that many texts were labelled incorrectly (i. e., 25 percent more than two levels off).
Although we compared each text with six other texts (resulting in a total of 7,200
annotations), possibly the number of comparisons per text was still insufficient to
create a reliable measurement. This means the pairwise comparison approach also
offers no benefits compared to labelling each text on CEFR level by three experts
(resulting in a total of 3,600 annotations) with regard to the number of annotations
needed to a reliable dataset.
   In our second attempt we labelled texts directly regarding their CEFR reading
level. This new labelling application provides functionality for organising the unla-
belled texts into various projects which supports working with multiple languages
and creating subsets from the total corpus to label the texts in smaller batches. This
allows us to annotate the texts iteratively which means we can better monitor the
quality of the annotations during the labelling process. Within each project, annota-
tion tasks are created and are assigned to language experts. Each text is evaluated by
three different experts to ensure high quality CEFR assessments. For each text anno-
tators complete an assessment form with criteria described in the CEFR reading level
descriptors (such as vocabulary and grammatical complexity, Alderson et al. 2006).
We have based this approach in part on the CEFR Estim Grid project (Tardieu et al.
2010). Prior to completing the content labelling tasks, annotators participate in an
(online) workshop to collaboratively assess the CEFR level of a small subset of texts
to align on the CEFR level descriptors.
   Once we labelled all texts and completed the datasets we were able to develop the
CEFR readability classification models. The models we created return the predicted
difficulty on a linear scale, which means that we can predict the reading difficulty
more granularly than the 6-level CEFR scale. In other words, we can say, for exam-
ple, that a text is on the more difficult end of the B2 level. Based on the models, we
created web services for assessing the overall readability of a text, difficult words in
the text and alternative words (suggestions) for these difficult words.
   We then integrated our CEFR services into the ELG platform using proxy ser-
vices. A proxy service maps incoming ELG requests onto our classification API
running on our web servers. The proxy service was packaged as a Docker container,
stored in our company’s Docker registry and then deployed on ELG. To improve
performance and avoid blocking requests, we used the Asyncio library to support
asynchronous processing of service requests. To speed up the development of the
proxy services, we switched to using ELG’s Python SDK for later versions of our
service implementations.
   For the authoring application we chose to integrate our CEFR services with the
Fonto editor1 as an add-on. This allowed us to focus on developing the text anal-
ysis rather than basic text editing features. In addition we used the Fonto Content
Quality component to highlight relevant sections in the text and provide feedback to
authors which allows them to improve the readability and quality of their texts. The
Fonto editor is a popular tool by major (educational) publishers, which enables easy
integration and adoption of our technology by new clients.

1   https://www.fontoxml.com
16 CEFR Labelling and Assessment Services                                             281

4 Evaluation

For collecting the texts for our dataset we had planned to use the C4 Corpus (public
domain part)2 which is a huge collection of plain texts, released under a Creative
Commons licence, which appeared to be very useful for our project. However, upon
closer inspection we found that the licence detection algorithm that was used is not
very accurate and that the structure of the texts was not very suitable for our purposes.
Also, the sheer size of this corpus added to the complexity of its processing. We
therefore decided not to use the C4 Corpus, but create a new corpus instead. We
tried various methods for data labelling. Unfortunately the pairwise comparison did
not yield a useful dataset from which we could create a classification model. Possible
explanations may be that the number of comparisons per text was too low, that we
did not select the right pairs of texts for the language teachers to compare, or that the
teachers did not consistently select the most difficult text from each set. This would
need to be investigated further.
    Integrating our services into the ELG was straightforward and easy. Using the
ELG Python SDK we were able to make our services available through ELG. We
also appreciated the thorough review process of our submitted services and datasets
by the ELG team. We received good feedback and support to improve the required
metadata, code performance and overall compatibility with the ELG API specifica-
tion. The standards-based ELG integration (e. g., using the ELG Python SDK) makes
it significantly easier for third-party developers to consume and integrate our ser-
vices in their language learning applications. We have not yet been able to evaluate
the billing services of the ELG in a production setting. We can see that the services
we deployed on ELG have been used multiple times, but we have little information
about the use over time and the types of users (e. g., commercial vs. academic).



5 Conclusions and Results of the Pilot Project

Our goals with this project were to extend our CEFR service to additional languages
beyond English and to use the European Language Grid as a marketplace for com-
mercialising our services. Although the project has not yet been completed we can
already see that the project has helped us to improve our data collection and la-
belling process, which helps to create high quality datasets for training additional
language models. We created CEFR readability classification models using these
datasets which we have made available on ELG as services.3 The services are in-
tegrated into a text authoring application which helps authors assess and improve
the readability of their (educational) texts in multiple languages. Deploying services
on the ELG is currently easy and useful for demonstration and trial purposes. We


2   https://live.european-language-grid.eu/catalogue/#/resource/service/corpus/1186
3   https://live.european-language-grid.eu/catalogue/project/5258
282                                                                                 Mark Breuker

believe the ELG SDKs enable third party developers to more easily discover and
consume our APIs.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects. We thank Cito Labs for their valuable exper-
tise and support in the pairwise labelling experiment; all language teachers involved in labelling
hundreds of texts on the CEFR; FontoXML for their support with integrating our NLP services with
their product; the ELG team for funding our pilot project and the help we received for deploying
our CEFR services and datasets on the ELG platform.




References

Alderson, J. Charles, Neus Figueras, Henk Kuijper, Guenter Nold, Sauli Takala, and Claire Tardieu
   (2006). “Analysing Tests of Reading and Listening in Relation to the Common European Frame-
   work of Reference: The Experience of The Dutch CEFR Construct Project”. In: Language As-
   sessment Quarterly 3.1, pp. 3–30. URL: https://doi.org/10.1207/s15434311laq0301_2.
Council of Europe (2020). Common European Framework of Reference for Languages: Learning,
   teaching, assessment – Companion volume. Strasbourg: Council of Europe Publishing, pp. 53–
   59. URL: https://www.coe.int/lang-cefr.
Crompvoets, Elise A. V., Anton A. Béguin, and Klaas Sijtsma (2020). “Adaptive Pairwise Compar-
   ison for Educational Measurement”. In: Journal of Educational and Behavioral Statistics 45.3,
   pp. 316–338. DOI: 10.3102/1076998619890589. URL: https://doi.org/10.3102/107699861989
   0589.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Tardieu, Claire, Raili Hildén, Magda Lehmann, and Monique Reichert (2010). The CEF-ESTIM
   Grid. URL: http://cefestim.ecml.at.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 17
European Clinical Case Corpus

Bernardo Magnini, Begoña Altuna, Alberto Lavelli, Anne-Lyse Minard, Manuela
Speranza, and Roberto Zanoli




Abstract Interpreting information in medical documents has become one of the most
relevant application areas for language technologies. However, despite the fact that
huge amounts of medical documents (e. g., medical examination reports, hospital
discharge letters, digital medical records) are produced, their availability for research
purposes is still limited, due to strict data protection regulations. Aiming at fostering
advanced information extraction technologies for medical applications, we present
E3C, a corpus of clinical case narratives fully based on freely licensed documents.
E3C (European Clinical Case Corpus) contains a vast selection of clinical cases (i. e.,
narratives presenting a patient’s history) that cover different medical areas, are based
on different styles and produced in different languages. A portion of the corpus has
been manually annotated to be used for training and testing purposes, while a larger
set of documents has been automatically tagged to serve as a baseline for future
research in information extraction.



1 Overview and Objectives of the Pilot Project

The interest in information extraction from clinical narratives has increased in recent
decades, including clinical entity extraction and classification (Schulz et al. 2020;
Grabar et al. 2019; Dreisbach et al. 2019; Luo et al. 2017), clinical prediction systems,
e. g., MIMIC III (Johnson et al. 2016), and the organisation of challenges at CLEF
(Kelly et al. 2019), and Semeval. However, only a few shared datasets have been
created, limiting the potential of developing applications in this area.

Bernardo Magnini · Alberto Lavelli · Manuela Speranza · Roberto Zanoli
Fondazione Bruno Kessler, Italy, magnini@fbk.eu, lavelli@fbk.eu, manspera@fbk.eu,
zanoli@fbk.eu
Begoña Altuna
Fondazione Bruno Kessler, Italy, HiTZ Centre, University of the Basque Country, Spain,
begona.altuna@ehu.eus
Anne-Lyse Minard
Université d’Orléans, France, anne-lyse.minard@univ-orleans.fr


© The Author(s) 2023                                                                     283
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_17
284                    Bernardo Magnini, Begoña Altuna, Alberto Lavelli, Anne-Lyse Minard et al.

   We report upon the E3C (European Clinical Case Corpus) ELG pilot project,
which resulted in a large collection of clinical cases in five European languages:
English, Spanish, French, Italian and Basque. A clinical case is a statement of a clin-
ical practice, presenting the reason for a clinical visit, the description of physical
exams, and the assessment of the patient’s situation. Clinical cases are typically re-
ported and discussed in research papers, and are often used for education purposes in
medicine. In addition, published clinical cases are de-identified, overcoming privacy
issues, and are rich in clinical entities as well as temporal information.

    A 25-year-old man with a history of Klippel-Trenaunay syndrome presented to the hospital
    with mucopurulent bloody stool and epigastric persistent colic pain for 2 wk. Continuous
    superficial ulcers and spontaneous bleeding were observed under colonoscopy. Subsequent
    gastroscopy revealed mucosa with diffuse edema, ulcers, errhysis, and granular and friable
    changes in the stomach and duodenal bulb, which were similar to the appearance of the rec-
    tum. After ruling out other possibilities according to a series of examinations, a diagnosis of
    GDUC was considered. The patient hesitated about intravenous corticosteroids, so he received
    a standardized treatment with pentasa of 3.2 g/d. After 0.5 mo of treatment, the patient’s symp-
    toms achieved complete remission. Follow-up endoscopy and imaging findings showed no
    evidence of recurrence for 26 mo.

   The sample clinical case reported in the box above is about a patient presenting
gastric symptoms, who is finally diagnosed with gastroduodenitis associated with
ulcerative colitis (GDUC). To reach the diagnosis, two medical tests (colonoscopy
and gastroscopy) were performed. Treatment, outcome (complete remission) and
follow-up (no evidence of recurrence) are also present in the text.



2 Corpus Collection and Annotation

The document collection was determined by the available resources for each lan-
guage (e. g., PubMed, scientific journals, medicine leaflets). First, we identified pos-
sible document sources as well as their licenses and re-distribution policies. We se-
lected sources that were either already available under Creative Commons licenses
(i. e., CC-BY or CC-BY-SA), possibly asking for re-distribution permission to the
right holders. In the case of the SPACCC1 and NUBes2 corpora, the texts were ready
to be used by us in terms of licensing and formatting. We automated the text col-
lection as much as possible, for example, in some cases we were able to identify
and extract the section with the clinical case. All English and some French docu-
ments were automatically extracted from PubMed3 , through its API, while medicine
leaflets were automatically crawled and stored in a single file for each language. Jour-
nal articles with clinical cases that could not be extracted automatically were filtered
through the search query “clinical case” in the different languages. In addition to the

1 https://github.com/PlanTL-GOB-ES/SPACCC
2 https://github.com/Vicomtech/NUBes-negation-uncertainty-biomedical-corpus
3 https://pubmed.ncbi.nlm.nih.gov
17 European Clinical Case Corpus                                                               285

extraction of the relevant documents, corresponding metadata was stored to allow
accurate documentation.
   The annotation of temporal information was performed following an adaptation
of the THYME annotation guidelines (Styler et al. 2014).4 Temporal information
refers to the events in a text as well as to chronological references and relations.
To encode temporal information, we defined the following tags and relation types.
Events, time expressions, temporal relations and aspectual relations are widely used
in temporal information tasks, while actor, body part and RML annotations were
added as they convey relevant information of the clinical domain.
    • Events are the events or states relevant to the patient’s clinical timeline.
    • Time expressions refer to points and intervals in time.
    • Temporal relations (TLINK) implement relations that chronologically order
      events and time expressions.
    • Aspectual relations (ALINK) are created between an aspectual event and its
      subordinated non-aspectual event.
    • Actors are the people (or animals) mentioned in the text.
    • Body parts are the parts of the body that are bigger than cells.
    • Results, measurements and lab and test results (RML) are lab test and analytics’
      results, formulaic measurements and measurement values.




Fig. 1 A sentence in a clinical case annotated with both temporal information and clinical entities
(i. e., disorders) with their UMLS codes (marked in red)


  The annotation of clinical entities is mainly based on the guidelines of SEM-
EVAL 2015 Task 14 “Analysis of Clinical Text”5 and on the ASSESS CT guidelines
(Miñarro-Giménez et al. 2018). The annotation of Layer 1 was done fully manually,
while for Layer 2 the automatic annotation was produced with a distant supervision
method that matches clinical entities with disorder concepts in UMLS.



3 Implementation

The E3C corpus is organised in three different layers:
Layer 1: about 25k tokens per language of clinical narratives with full manual or
manually checked annotation of clinical entities, temporal information and factuality,
for benchmarking and linguistic analysis.
4   http://clear.colorado.edu/compsem/documents/THYME_guidelines.pdf
5   http://alt.qcri.org/semeval2015/task14/data/uploads/share_annotation_guidelines.pdf
286                 Bernardo Magnini, Begoña Altuna, Alberto Lavelli, Anne-Lyse Minard et al.

Layer 2: 50-100k tokens per language of clinical narratives with automatic anno-
tation of clinical entities. Distant supervision was used to annotate 8,972 clinical
entities with their corresponding concepts in UMLS.
Layer 3: about 1m tokens per language of non-annotated medical documents (not
necessarily clinical narratives) to be exploited by semi-supervised approaches.
    Table 1 shows the sizes of the layers (document and token numbers). Table 2
shows the numbers of Layer 1 tags to indicate information density in clinical cases.


              English          French               Italian          Spanish     Basque
Layer 1       84 / 25142         81 /25196       86 / 24319    81 / 24681    90 / 22505
Layer 2      171 / 50371      168 / 50490      174 / 49900    162 / 49351 111 / 12541
Layer 3   9779 / 1075709 25740 / 66281501 10213 / 13601915 1876 / 1030907 1232 / 518244
Table 1 Documents/tokens in each language and layer in the E3C corpus.




                   Entity         English French Italian Spanish Basque
                   CLINENTITY        1024    1327      869    1345     1910
                   EVENT             4885    4312     3385    4767     7910
                   ACTOR              682     427      338     319      505
                   BODYPART           968     659      328     814     1410
                   TIMEX3             380     333      298     383      638
                   RML                480     508      383     391     1101
                   ALINK              114      71      109      92      113
                   TLINK             4852    4084     1150    4700     7981
Table 2 Annotations in each language in Layer 1 in the E3C corpus.




4 Evaluation

For temporal information and clinical entity annotation tasks, we performed inter-
annotator agreement (IAA) tests. We measured whether the guidelines had been de-
fined and were understood correctly, and we ensured that the quality of annotations
in the corpus was similar. The IAA phase had been done on the English part of
the corpus. IAA for temporal entities (EVENT, TIMEX3, ACTOR, BODYPART)
was measured using three annotators and six documents. To compute the agreement,
we used the F1-measure metric, which produced the same results as using the Dice
coefficient. The agreement is high for EVENT and ACTOR entities (with an aver-
age of 0.81 and 0.87), but a bit lower for TIMEX3 and BODYPART (with an av-
erage of 0.50 and 0.57). The IAA for temporal relations (TLINK) was split in two
phases: three documents were annotated, the results discussed by the annotators and
17 European Clinical Case Corpus                                                            287

then three new documents were annotated. To measure the agreement, we used the
Tempeval-3 scorer (UzZaman and Allen 2011), implemented for the evaluation of
systems based on the comparison of temporal graphs built from annotations. The
average F1-measure for the first phase was 0.43 and 0.53 for the second.
   The annotation of the clinical entities in Layer 1 was performed by four annota-
tors. Again, the agreement is calculated using F1, whereas for the CUI attribute we
computed the accuracy taking into consideration only the entities identified by two
annotators. The agreement for clinical entity recognition is 0.70 on average (from
0.64 to 0.78). In the entity linking task, the accuracy on entities identified by both
annotators starts at 0.86 (on average 0.89).
   The clinical entities in Layer 2 were annotated automatically using distant super-
vision and UMLS as a controlled vocabulary. A manual assessment of the quality of
these annotated entities would be too demanding in terms of human resources. For
this reason, the quality of Layer 2 has been estimated through an indirect evaluation
that uses the results obtained by distant supervision on Layer 1 (Table 3) as an es-
timation of the quality of the Layer 2 annotations. This approximation is possible
because the documents in Layer 1 and Layer 2 are clinical cases and because they
were extracted from the same kind of publications or from the same existing corpora.


                                 English French Italian Spanish Basque
                      Accuracy 48.33       54.92   58.09    63.64    55.35
Table 3 Estimated accuracy (F1 -measure) of the clinical entities in Layer 2.




5 Conclusions and Results of the Pilot Project

The E3C pilot project aims at fostering advanced information extraction technolo-
gies for medical applications. Results include a large corpus of annotated clinical
cases in five languages. The corpus is available on the ELG platform.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects and from the Basque Government post-doctoral
grant POS_2020_2_0026.
288                  Bernardo Magnini, Begoña Altuna, Alberto Lavelli, Anne-Lyse Minard et al.

References

Dreisbach, Caitlin, Theresa A. Koleck, Philip E. Bourne, and Suzanne Bakken (2019). “A sys-
   tematic review of natural language processing and text mining of symptoms from electronic
   patient-authored text data”. In: Int. Jour. of Medical Informatics 125, pp. 37–46. DOI: 10.1016
   /j.ijmedinf.2019.02.008.
Grabar, Natalia, Cyril Grouin, Thierry Hamon, and Vincent Claveau (2019). “Recherche et ex-
   traction d’information dans des cas cliniques. Présentation de la campagne d’évaluation DEFT
   2019”. In: Actes du Défi Fouille de Textes 2019. Toulouse, France: Actes DEFT 2019, pp. 7–16.
   URL: https://www.irit.fr/pfia2019/wp-content/uploads/2019/07/actes_DEFT_CH_PFIA2019
   .pdf.
Johnson, Alistair E.W., Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad
   Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark (2016).
   “MIMIC-III, a freely accessible critical care database”. In: Scientific Data 3. DOI: 10.1038/sda
   ta.2016.35.
Kelly, Liadh, Hanna Suominen, Lorraine Goeuriot, Mariana Neves, Evangelos Kanoulas, Dan
   Li, Leif Azzopardi, Rene Spijker, Guido Zuccon, Harrisen Scells, and João Palotti (2019).
   “Overview of the CLEF eHealth Evaluation Lab 2019”. In: Experimental IR Meets Multilin-
   guality, Multimodality, and Interaction. Ed. by Fabio Crestani, Martin Braschler, Jacques Savoy,
   Andreas Rauber, Henning Müller, David E. Losada, Gundula Heinatz Bürki, Linda Cappellato,
   and Nicola Ferro. Cham: Springer, pp. 322–339.
Luo, Yuan, William K. Thompson, Timothy M. Herr, Zexian Zeng, Mark A. Berendsen, Siddhartha
   R. Jonnalagadda, Matthew B. Carson, and Justin Starren (2017). “Natural Language Processing
   for EHR-Based Pharmacovigilance: A Structured Review”. In: Drug Safety 40 (11), pp. 1075–
   1089. DOI: 10.1007/s40264-017-0558-6.
Miñarro-Giménez, José Antonio, Catalina Martínez-Costa, Daniel Karlsson, Stefan Schulz, and
   Kirstine Rosenbeck Gøeg (2018). “Qualitative analysis of manual annotations of clinical text
   with SNOMED CT”. In: PLoS ONE 13.12. URL: https://www.ncbi.nlm.nih.gov/pmc/articles
   /PMC6307753/pdf/pone.0209547.pdf.
Schulz, Sarah, Jurica Ševa, Samuel Rodríguez, Malte Ostendorff, and Georg Rehm (2020). “Named
   Entities in Medical Case Reports: Corpus and Experiments”. In: Proceedings of the 12th Lan-
   guage Resources and Evaluation Conference. Marseille, France: ELRA, pp. 4495–4500. URL:
   https://www.aclweb.org/anthology/2020.lrec-1.553.
Styler, William F., Steven Bethard, Sean Finan, Martha Palmer, Sameer Pradhan, Piet C. de Groen,
   Brad Erickson, Timothy Miller, Chen Lin, Guergana Savova, et al. (2014). “Temporal Annota-
   tion in the Clinical Domain”. In: Transactions of the Association for Computational Linguistics
   2. Ed. by Ellen Riloff, pp. 143–154. URL: http://aclweb.org/anthology/Q14-1012.
UzZaman, Naushad and James Allen (2011). “Temporal Evaluation”. In: Proceedings of the 49th
   Annual Meeting of the Association for Computational Linguistics: Human Language Technolo-
   gies. Portland, Oregon, USA: ACL, pp. 351–356. URL: https://aclanthology.org/P11-2061.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 18
Extracting Terminological Concept Systems
from Natural Language Text

Dagmar Gromann, Lennart Wachowiak, Christian Lang, and Barbara Heinisch




Abstract Terminology denotes a language resource that structures domain-specific
knowledge by means of conceptual grouping of terms and their interrelations. Such
structured domain knowledge is vital to various specialised communication settings,
from corporate language to crisis communication. However, manually curating a ter-
minology is both labour- and time-intensive. Approaches to automatically extract ter-
minology have focused on detecting domain-specific single- and multi-word terms
without taking terminological relations into consideration, while knowledge extrac-
tion has specialised on named entities and their relations. We present the Text2TCS
method to extract single- and multi-word terms, group them by synonymy, and in-
terrelate these groupings by means of a pre-specified relation typology to generate
a Terminological Concept System (TCS) from domain-specific text in multiple lan-
guages. To this end, the method relies on pre-trained neural language models.



1 Overview and Objectives

Domain knowledge is paramount to any specialised communication setting. A struc-
tured representation of domain-specific terminology fosters the acquisition of new
domain knowledge, the expansion of existing knowledge, and optimises specialised
discourse by supporting terminological consistency (Budin 1996). Extracting Ter-
minological Concept Systems from Natural Language Text (Text2TCS) is a pilot
project supported by the European Language Grid (ELG) to develop a language
technology that automatically extracts a Terminological Concept System (TCS) from
domain-specific texts in multiple languages. A TCS is a terminological resource that
conceptually structures domain-specific terms and provides hierarchical and non-
hierarchical relations between them. Within the context of terminology science, a
term signifies a domain-specific designation that linguistically represents a domain-
specific concept (ISO1087 2019). A concept groups terms by meaning, which is

Dagmar Gromann · Lennart Wachowiak · Christian Lang · Barbara Heinisch
University of Vienna, Austria, dagmar.gromann@univie.ac.at, lennart.wachowiak@univie.ac.at,
christian.lang@univie.ac.at, barbara.heinisch@univie.ac.at


© The Author(s) 2023                                                                   289
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_18
290                  Dagmar Gromann, Lennart Wachowiak, Christian Lang, and Barbara Heinisch

generally represented as unique characteristics shared by a set of real-world enti-
ties. Once terms have been grouped into concepts based on their synonymous mean-
ing within languages and equivalent meaning across languages, terminology science
foresees interrelations of concepts by terminological relations. Such relations are
categorised into hierarchical, i. e., generic and partitive, and non-hierarchical, e. g.,
causal and spatial, relations. For instance, the sentence COVID causes coughing can
be depicted as a causal relation from the concept that represents the cause COVID
to the effect concept designated by coughing. However, in practice, publicly avail-
able terminologies rarely contain any relations, since manually creating them is time-
and labour-intensive. While Automated Term Extraction (ATE) methods have pro-
liferated (e. g., Astrakhantsev 2018; Lang et al. 2021), additionally structuring ex-
tracted terms by concepts and relations has been neglected. To address this issue,
Text2TCS provides a method and tool to extract terms and interrelations between
domain-specific synonym sets across languages and domains. The Text2TCS imple-
mentation has been integrated and is available on the ELG plattform.1



2 Methodology

The Text2TCS methodology depicted in Figure 1 builds on a pipeline approach
with the following steps: preprocessing, term extraction, relation extraction and post-
processing. The pipeline takes domain-specific natural language sentences or text as
input and outputs a TCS in the TermBase eXchange (TBX) format and as a concept
map. We experimented with several joint term and relation extraction methods, espe-
cially relying on pre-trained Neural Machine Translation and Sequence to Sequence
models such as mT5 (Xue et al. 2021). However, a pipeline approach relying on
fine-tuning XLM-R (Conneau et al. 2020) was finally preferable due to a smaller
model size as well as a substantially higher inference speed and performance relia-
bility. In order to fine-tune pre-trained models, training data needs to be available.
To this end, two terminologists annotated 51 texts spanning distinct domains from
computer science to ecology in English and German with a total of 6,327 terms and
9,460 relations.



2.1 Preprocessing

In a first step, the input text’s language is detected and it is split into individual sen-
tences. The former relies on the Python library PYCLD22 that supports 83 languages.
Language detection is required in order to issue a warning in case the input language
is unsupported and to indicate the language in the final TBX output file. Furthermore,

1   https://live.european-language-grid.eu/catalogue/tool-service/8122
2   https://github.com/aboSamoor/pycld2
18 Extracting Terminological Concept Systems from Natural Language Text                 291


   Preprocessing            Term Extr.           Relation Extr.           Postprocessing

                               Term                Sentence-
     Language                                                                Filter
                             Extraction          Level Relation
     Detection                                                              Relations
                              Model                Extraction

     Sentence
     Boundary                Term Post-            Text-Level
                             processing                                       Merge
     Detection                                      Relation                Synonyms
                                                   Extraction

Fig. 1 Text2TCS extraction pipeline



the detected language is passed on to the sentence boundary detection module that
relies on language-specific rules.
    Sentence boundary detection is achieved using the rule-based Python module
pySBD (Sadvilkar and Neumann 2020), which officially supports 22 languages. This
step is required due to limited input length of current neural language models and
to allow for a sentence-based relation extraction step. Thus, the pipeline can be sure
to support 22 languages (two-digit ISO language codes): am, ar bg, da, de, en, es,
el, fa, fr, hi, hy, it, ja, kk, mr, my, nl, ru, pl, ur, zh. However, the term and relation
extraction models potentially support up to 100 languages.



2.2 Term Extraction

From several distinct experiments with term extraction, which we detail in Lang et
al. (2021), the best performing classifies each token of an input sentence separately,
utilising the same fully connected layer for all tokens after they have been processed
by XLM-R. In term extraction, an established method is (e. g., Hazem et al. 2020)
to first generate all possible term candidates from a sequence/sentence and input the
candidate together with its context for the model to predict whether it is a term or not.
This requires first generating all possible n-grams of a pre-specified length from a
text. Instead, the token classification we propose assigns one of three labels to each
token in a sequence: B-T for beginning of term, T for continuation of term, and n
for not a term (component). For instance, the input sequence “motor vehicle means
any power-driven vehicle.” would be labeled as B-T, T, n, n, B-T, T, n, ex-
tracting the terms “motor vehicle” and “power-driven vehicle”. This approach leads
to a substantial reduction in training and inference time compared to previous meth-
ods. In XLM-R’s own tokeniser, which we utilise, we noticed an issue with trailing
punctuation, e. g. a comma after a term. Thus, we apply an additional cleaning step
in which we remove trailing punctuation from a standard punctuation list, unless the
punctuation appears multiple times in the term, e. g. “U.S.A.”.
292                Dagmar Gromann, Lennart Wachowiak, Christian Lang, and Barbara Heinisch

                                c2
                             ['vehicle']

                                  genericRelation

                                  c1                                     c4
              ['motor vehicle', 'power-driven vehicle']       ['by road', 'on the road']

                                           instrumentalRelation            spatialRelation

                                                       c3
                           ['carrying persons or goods', 'carriage of persons or goods']

Fig. 2 Example TCS from sequence “motor vehicle means any power-driven vehicle, which is
normally used for carrying persons or goods by road or for drawing, on the road, vehicles used for
the carriage of persons or goods”



2.3 Relation Extraction

Related domain-specific mentions in text can either occur within the same sentence
or across sentence boundaries. Thus, two separate models in the pipeline predict
relations: a sentence-level and a text-level model. For sentence-level relation ex-
traction, we input a mention pair followed by a contextualising sentence containing
both mentions to a fine-tuned pre-trained XLM-R model that predicts a relation tak-
ing the relation direction into account (see Wachowiak et al. 2021, for details). We
apply our own relation typology of hierarchical relations, i. e., generic and partitive,
and non-hierarchical relations, i. e., activity, causal, instrumental, origination, spatial,
property, and associative. Generic relations and synonyms frequently occur across
sentence boundaries, which is why we additionally train a text-level relation extrac-
tion model to detect these two, building on our previous model (Wachowiak et al.
2020) fine-tuning XLM-R. This model takes a mention pair as input and classifies
it as a generic relation, synonymy or random, which means no or any other relation.
Since predicting relations for individual term pairs drastically impacts inference time,
we optimize the pipeline to process multiple term pairs and their context sentence
simultaneously.



2.4 Postprocessing

In the last step, synonyms predicted on sentence- and text-level are merged into
concepts. Furthermore, the relations predicted by the two models are filtered to only
include those with high confidence scores and to remove duplicates to provide the
final TCS exemplified in Figure 2.
18 Extracting Terminological Concept Systems from Natural Language Text                       293

3 Evaluation

We evaluated individual steps in the pipeline as well as the overall system on manu-
ally TCS-annotated texts in English, German, Spanish, Portuguese, French, Italian,
Romanian and Russian as well as on standard datasets, where available, for a bet-
ter comparison. The term extraction model outperforms previous neural approaches
(Hazem et al. 2020) from the TermEval challenge by up to 11.6 F1 score and obtained
74% (Precision: 70%, Recall: 78%) on our dataset. The sentence-level relation ex-
traction model obtained a weighted F1 score of up to 53% (Precision: 56%, Recall:
53%) and the text-level relation extraction model of up to 78% (Precision: 78%, Re-
call: 77%) on our manually annotated datasets. The sentence-level extraction is also
compared to a mixed dataset of the SemEval 2007 Task 4 and SemEval 2010 Task 8
relations, on which the model obtains a weighted F1 score of 87% (see Wachowiak
et al. 2021, for details).



4 Conclusions and Results of the Pilot Project

Automatically extracting and structuring domain-specific knowledge from text is
a challenging task. Text2TCS innovatively fine-tunes pre-trained neural language
models in a pipeline approach to first extract terms, second relations on sentence-
and text-level, and finally group synonyms. To this end, this pilot project proposed a
novel typology of terminological relations. A consistent use of relation types across
languages aims to ease the alignment of resulting monolingual TCS across languages.
Integrating such an alignment method is future work. At the moment, the method
takes terms and relations into consideration, however, text frequently contains (parts
of) natural language definitions and their extraction would represent a valuable fu-
ture addition to the method.

Acknowledgements The work described in this article3 has received funding from the EU project
European Language Grid as one of its pilot projects. The computational results presented have been
achieved in part using the Vienna Scientific Cluster (VSC).




References

Astrakhantsev, Nikita (2018). “ATR4S: toolkit with state-of-the-art automatic terms recognition
   methods in Scala”. In: Language Resources and Evaluation 52.3, pp. 853–872.
Budin, Gerhard (1996). Wissensorganisation und Terminologie: Die Komplexität und Dynamik wis-
   senschaftlicher Informations- und Kommunikationsprozesse. Vol. 28. Forum für Fachsprachen-
   Forschung. Gunter Narr Verlag.


3   https://text2tcs.univie.ac.at
294                Dagmar Gromann, Lennart Wachowiak, Christian Lang, and Barbara Heinisch

Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek,
   Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov (2020).
   “Unsupervised Cross-lingual Representation Learning at Scale”. In: Proceedings of the 58th
   Annual Meeting of the Association for Computational Linguistics. Ed. by Dan Jurafsky, Joyce
   Chai, Natalie Schluter, and Joel Tetreault. ACL, pp. 8440–8451. DOI: 10.18653/v1/2020.acl-
   main.747. URL: https://www.aclweb.org/anthology/2020.acl-main.747.
Hazem, Amir, Mérieme Bouhandi, Florian Boudin, and Beatrice Daille (2020). “TermEval 2020:
   TALN-LS2N System for Automatic Term Extraction”. In: Proceedings of the 6th International
   Workshop on Computational Terminology. Ed. by Béatrice Daille, Kyo Kageura, and Ayla
   Rigouts Terryn. Marseille, France: ELRA, pp. 95–100.
ISO1087 (2019). ISO 1087:2019: Terminology work and terminology science – Vocabulary. Stan-
   dard. Geneva, CH: International Organization for Standardization.
Lang, Christian, Lennart Wachowiak, Barbara Heinisch, and Dagmar Gromann (2021). “Transform-
   ing Term Extraction: Transformer-Based Approaches to Multilingual Term Extraction Across
   Domains”. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
   ACL, pp. 3607–3620. DOI: 10.18653/v1/2021.findings-acl.316.
Sadvilkar, Nipun and Mark Neumann (2020). “PySBD: Pragmatic Sentence Boundary Disambigua-
   tion”. In: Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS). ACL,
   pp. 110–114.
Wachowiak, Lennart, Christian Lang, Barbara Heinisch, and Dagmar Gromann (2020). “CogALex-
   VI Shared Task: Transrelation - A Robust Multilingual Language Model for Multilingual Rela-
   tion Identification”. In: Proceedings of the Workshop on the Cognitive Aspects of the Lexicon.
   Ed. by Rong Xiang, Emmanuele Chersoni, Luca Iacoponi, and Enrico Santus. ACL, pp. 59–64.
Wachowiak, Lennart, Christian Lang, Barbara Heinisch, and Dagmar Gromann (2021). “Towards
   Learning Terminological Concept Systems from Multilingual Natural Language Text”. In: 3rd
   Conference on Language, Data and Knowledge (LDK 2021). Ed. by Dagmar Gromann, Gilles
   Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo,
   and Barbara Heinisch. Vol. 93. Open Access Series in Informatics (OASIcs). Dagstuhl, Ger-
   many: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 22:1–22:18. DOI: 10.4230/OASIc
   s.LDK.2021.22.
Xue, Linting, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya
   Barua, and Colin Raffel (2021). “mT5: A Massively Multilingual Pre-trained Text-to-Text
   Transformer”. In: Proceedings of the 2021 Conference of the North American Chapter of the
   Association for Computational Linguistics: Human Language Technologies. ACL, pp. 483–498.
   DOI: 10.18653/v1/2021.naacl-main.41.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 19
Italian EVALITA Benchmark Linguistic
Resources, NLP Services and Tools

Viviana Patti, Valerio Basile, Andrea Bolioli, Alessio Bosca, Cristina Bosco,
Michael Fell, and Rossella Varvara




Abstract Starting from the first edition held in 2007, EVALITA is the initiative for
the evaluation of Natural Language Processing tools for Italian. We describe the
EVALITA4ELG project, whose main aim is to systematically collect the resources
released as benchmarks for this evaluation campaign, and make them easily accessi-
ble through the European Language Grid platform. The collection is moreover inte-
grated with systems and baselines as a pool of web services with a common interface,
deployed on a dedicated hardware infrastructure.



1 Overview and Objectives of the Pilot Project

In Natural Language Processing (NLP), periodic campaigns are a popular means to
set benchmarks for specific tasks, stimulate the development of comparable systems
and ultimately promote research advancement (Nissim et al. 2017). The validation
of NLP models on different datasets strongly depends on the possibility of general-
ising their results on data and languages other than those on which they have been
trained and tested (Magnini et al. 2008). Recent trends are pushing towards propos-
ing benchmarks for multiple tasks (Wang et al. 2018), or for testing the adaptabil-
ity of systems to different textual domains, genres, and languages, including under-
researched and under-resourced ones. The recent specific emphasis on multilingual
assessment is also driven by a growing awareness that language technologies can
help promote multilingualism and linguistic diversity (Joshi et al. 2020). In this con-
text, the EVALITA4ELG project integrates linguistic resources and language tech-
nologies developed under the umbrella of the EVALITA evaluation campaign into
the European Language Grid.



Viviana Patti · Valerio Basile · Cristina Bosco · Michael Fell · Rossella Varvara
University of Turin, Italy, viviana.patti@unito.it, valerio.basile@unito.it, cristina.bosco@unito.it,
michael.fell@unito.it, rosella.varvara@unito.it
Andrea Bolioli · Alessio Bosca
CELI, Italy, andrea.bolioli@h-farm.com, alessio.bosca@h-farm.com


© The Author(s) 2023                                                                            295
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_19
296                Viviana Patti, Valerio Basile, Andrea Bolioli, Alessio Bosca, Cristina Bosco et al.

    EVALITA1 is an initiative of the Italian Association for Computational Linguis-
tics (Associazione Italiana di Linguistica Computazionale, AILC2 ). Since 2007, it
has been providing a shared framework where different systems and approaches can
be evaluated and compared with each other with respect to a large variety of tasks, or-
ganised by the Italian research community. The focus of EVALITA is to support the
advancement of methodologies and techniques for natural language and speech pro-
cessing in an historical perspective, beyond the performance improvement, favour-
ing reproducibility and cross-community engagement.
    The main goal of the EVALITA4ELG project is to leverage more than a decade
of findings of the Italian NLP community, in order to provide easier access to re-
sources and tools for Italian through ELG. We worked towards the achievement of
multiples goals, namely: (i) a survey of the tasks organised in the seven editions
of EVALITA, released as a knowledge graph; (ii) an anonymisation procedure for
improving compliance with current data standard policies; (iii) the integration of
resources and systems developed during EVALITA into the ELG platform; (iv) the
creation of a unified benchmark for evaluating Italian Natural Language Understand-
ing (NLU); (v) the dissemination of a shared protocol and a set of best practices to
describe new resources and tasks in a format that allows a quick integration of meta-
data into the European Language Grid.



2 Methodology

We started by surveying the tasks organised in EVALITA, collecting the resources
and their metadata for upload, and organising this set of information in an ontology.
We anonymised the resources according to the current policies for the protection
of people’s privacy. Finally, we integrated systems and baselines as a pool of web
services with a common interface.



2.1 Surveying the EVALITA Tasks

Starting in 2007, EVALITA has been devoted to the evaluation of NLP tools for
Italian, providing a shared framework in which participating systems are evaluated
on a growing set of different tasks. Rather than being focused on a single task,
EVALITA has always been characterised by a wider variety of tasks: each edition of
the EVALITA campaign, held in 2007 (Magnini et al. 2008), 2009, 2011 (Magnini
et al. 2013), 2014 (Attardi et al. 2015), 2016 (P. Basile et al. 2017), 2018 (Caselli et
al. 2018) and 2020 (V. Basile et al. 2020), has been organised around a set of shared
tasks dealing with both written and spoken language, varying with respect to the

1   http://www.evalita.it
2   https://www.ai-lc.it
19 Italian EVALITA Benchmark Linguistic Resources, NLP Services and Tools           297

challenges tackled and datasets used. The number of tasks has considerably grown,
from five tasks, in the first edition in 2007, to 14 tasks in the latest edition held in
2020. Following the trends of other national and international evaluation campaigns,
like, e. g., SemEval3 , the typology of tasks also evolved, progressively including a
larger variety of exercises oriented to semantics and pragmatics. In particular, the
2016 edition brought a focus on social media data and on the use of shared data
across tasks. Open access to resources and research artifacts is deemed crucial for
the advancement of the state of the art (Caselli et al. 2018) and the availability of
shared evaluation benchmarks is crucial for fostering reproducibility and comparabil-
ity of results. Organisers were encouraged to collaborate, stimulated to the creation
of a shared test set across tasks, and to eventually share all resources with a wider
audience. This has resulted in the creation of GitHub public repositories.4



2.2 The EVALITA Knowledge Graph

Starting from the semi-structured repositories mentioned in the previous section and
from the information collected by surveying seven editions of EVALITA, we built
a knowledge graph (KG) that provides the essential information about the editions
of the EVALITA evaluation campaign. The KG describes EVALITA in terms of or-
ganised tasks, but also of people and institutions that constitute the EVALITA com-
munity throughout the years. The KG is structured around an ontology implemented
in OWL and it is available both on the website of the EVALITA4ELG project5 and
as a service on the ELG platform. The current version of the ontology comprises
148 classes, 37 object properties and nine data properties. The ontology and the KG
are thoroughly described in Patti et al. (2020). As an example, Figure 1 depicts the
structure of the KG around the HaSpeeDe2018 task.
    The knowledge graph can be queried through a SPARQL endpoint, which allows
to inspect the ontology by selecting some variables that occur among the set of triples
(subject, predicate, object) composing the knowledge graph. It is thus possible to
answer relevant questions related to the EVALITA campaign, extracting information
from the KG such as, e. g., “What is the total number of institutions involved as
organisers of tasks in all seven EVALITA campaigns?”:
    SELECT (COUNT(distinct ?institution) AS ?totalInstitutions)
    where {
      ?task e4e:hasInstitution ?institution.
    }
    >>>> result: 55 <<<<




3 https://semeval.github.io
4 https://github.com/evalita2016/data
5 http://evalita4elg.di.unito.it
298                    Viviana Patti, Valerio Basile, Andrea Bolioli, Alessio Bosca, Cristina Bosco et al.

                                           e4e:EvaluationCampaign

                                                  rdf:type




            e4e:Evalita2020                      e4e:Task                    e4e:Evalita2018

                                                                                                    https://github.com/evalita2018/
                                                 rdf:type                                             data/tree/master/HaSpeeDe
                                                                       e4e:isTaskOf
      e4e:isTaskOf                                                                                        Datatype: xsd:anyURI

                                                                                                           e4e:hasWebsite
                                             e4e:isReRunOf
          e4e:HaSpeeDe2020                                                  e4e:HaSpeeDe2018

                                                                                                 e4e:hasPaper

                      e4e:hasInstitution
                                                 e4e:hasOrganizer                          http://ceur-ws.org/Vol-2263/paper010.pdf
                                                                                                      Datatype: xsd:anyURI
         e4e:UniversitàDiTorino                       e4e:ManuelaSanguinetti
              e4e:ILC-CNR                               e4e:MaurizioTesconi
              e4e:IIT-CNR                                e4e:FabioPoletto
              e4e:Acmos                                  e4e:CristinaBosco
                                                       e4e:FeliceDellOrletta
           rdf:type
                                                            rdf:type

             e4e:Institution                                   e4e:Person


Fig. 1 EVALITA knowledge graph; primary classes are colored and their relations illustrated
around the HaSpeeDe2018 task



2.3 Anonymisation of Resources

The EVALITA resources to be made accessible in the ELG platform had to be care-
fully checked and made compliant with the current policies about data releasing and
sharing (e. g., GDPR, Rangel and Rosso 2018), therefore particular attention has
been paid to data anonymisation. The datasets collected for EVALITA4ELG were
anonymised relying on an automatic anonymisation tool developed in the context
of the AnonymAI research project, and then manually reviewed in order to assess
their quality. AnonymAI is a nine months research project co-financed by the H2020
project NGI Trust focusing on providing legally compliant anonymisation profiles
customised to the needs of end users.
   The anonymisation profile applied to the EVALITA4ELG dataset detects and
masks person names, phone numbers, email addresses, mentions/replies/retweets,
and URLs. The most frequent entities that were masked in the anonymisation pro-
cess consist of person names and mentions (e. g., in the SardiStance dataset about
50 person names and 150 mentions).
19 Italian EVALITA Benchmark Linguistic Resources, NLP Services and Tools                    299

2.4 Release of Data and Models through ELG

At the time of this writing, 51 Language Resources and Technologies are linked
to the EVALITA4ELG project in ELG.6 Eight services were fully integrated into
ELG: four of them from the EVALITA 2018 edition, and four of them from the most
recent EVALITA 2020 edition. Of the 2018 systems, three are hate speech detection
systems (HaSpeeDe 2018 task) and one is Gender Detection (GxG). Of the 2020
systems, two are hate speech detectors (HaSpeeDe 2020 task), one is a POS tagger
for spoken language (KIPoS task), and one is a misogyny detection system (AMI
task). All datasets and services are accessible interactively from the ELG website or
programmatically by means of REST API calls or the ELG-provided Python SDK.



3 Conclusions and Results of the Pilot Project

EVALITA4ELG has been a successful effort towards the inclusion of resources for
the Italian language in the European Language Grid. We created a catalogue of re-
sources and models developed during the various editions of the EVALITA cam-
paign, designed in the form of a knowledge graph that can be inspected through
SPARQL queries. We collected the original distribution of the resources used for
EVALITA tasks and we created 44 entries. For 13 resources, together with CELI, we
developed and applied an anonymisation procedure to mask personal and sensitive
data. We integrated eight available systems from different tasks into ELG. Finally,
we organised an event on September 2021 with hybrid participation7 , including an
overview of the project and the results obtained, a tutorial about integrating systems
and resources on ELG, and a round table with 14 invited speakers chosen among the
most active organisers of tasks of EVALITA.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects. The work has also received funding from
the EU’s Horizon 2020 research and innovation programme under grant agreement no. 825618
(AnonymAI, NGI Trust).




References

Attardi, Giuseppe, Valerio Basile, Cristina Bosco, Tommaso Caselli, Felice Dell’Orletta, Simonetta
   Montemagni, Viviana Patti, Maria Simi, and Rachele Sprugnoli (2015). “State of the Art Lan-
   guage Technologies for Italian: The EVALITA 2014 Perspective”. In: Intelligenza Artificiale 9,
   pp. 43–61.


6   https://live.european-language-grid.eu/catalogue/project/1397
7   http://evalita4elg.di.unito.it/conference
300             Viviana Patti, Valerio Basile, Andrea Bolioli, Alessio Bosca, Cristina Bosco et al.

Basile, Pierpaolo, Malvina Nissim, Rachele Sprugnoli, Viviana Patti, and Francesco Cutugno
   (2017). “EVALITA Goes Social: Tasks, Data, and Community at the 2016 Edition”. In: Ital-
   ian Journal of Computational Linguistics 3.1, pp. 93–127.
Basile, Valerio, Danilo Croce, Maria Di Maro, and Lucia C. Passaro (2020). “EVALITA 2020:
   Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools
   for Italian”. In: Proc. of the 7th Evaluation Campaign of Natural Language Processing and
   Speech Tools for Italian. Final Workshop (EVALITA 2020), 17 Dec. 2020. Vol. 2765. CEUR
   Workshop Proceedings.
Caselli, Tommaso, Nicole Novielli, Viviana Patti, and Paolo Rosso (2018). “Evalita 2018: Overview
   on the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian”.
   In: Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech
   Tools for Italian. Final Workshop (EVALITA 2018). Ed. by Tommaso Caselli, Nicole Novielli,
   Viviana Patti, and Paolo Rosso. Torino: CEUR Workshop Proceedings, pp. 3–8.
Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury (2020). “The
   State and Fate of Linguistic Diversity and Inclusion in the NLP World”. In: Proc. of the 58th
   Annual Meeting of the Association for Computational Linguistics. ACL, pp. 6282–6293.
Magnini, Bernardo, Amedeo Cappelli, Fabio Tamburini, Cristina Bosco, Alessandro Mazzei, Vin-
   cenzo Lombardo, Francesca Bertagna, Nicoletta Calzolari, Antonio Toral, Valentina Bartalesi
   Lenzi, Rachele Sprugnoli, and Manuela Speranza (2008). “Evaluation of Natural Language
   Tools for Italian: EVALITA 2007”. In: Proc. of the 6th Int. Conference on Language Resources
   and Evaluation (LREC 2008). Marrakech: ELRA, pp. 2536–2543.
Magnini, Bernardo, Francesco Cutugno, Mauro Falcone, and Emanuele Pianta, eds. (2013). Eval-
   uation of Natural Language and Speech Tools for Italian, International Workshop, EVALITA
   2011, Rome, Italy, January 24-25, 2012, Revised Selected Papers. Vol. 7689. Lecture Notes in
   Computer Science. Springer. URL: https://doi.org/10.1007/978-3-642-35828-9.
Nissim, Malvina, Lasha Abzianidze, Kilian Evang, Rob van der Goot, Hessel Haagsma, Barbara
   Plank, and Martijn Wieling (2017). “Last Words: Sharing Is Caring: The Future of Shared
   Tasks”. In: Computational Linguistics 43.4, pp. 897–904.
Patti, Viviana, Valerio Basile, Cristina Bosco, Rossella Varvara, Michael Fell, Andrea Bolioli, and
   Alessio Bosca (2020). “EVALITA4ELG: Italian Benchmark Linguistic Resources, NLP Ser-
   vices and Tools for the ELG Platform”. In: Italian Journal of Computational Linguistics 6.6-2,
   pp. 105–129. DOI: https://doi.org/10.4000/ijcol.754.
Rangel, Francisco and Paolo Rosso (2018). “On the Implications of the General Data Protection
   Regulation on the Organisation of Evaluation Tasks”. In: Language and Law 5.2, pp. 95–117.
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman (2018).
   “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”.
   In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neu-
   ral Networks for NLP. Brussels: ACL, pp. 353–355.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 20
Lingsoft Solutions as Distributable Containers

Sebastian Andersson and Michael Stormbom




Abstract Lingsoft is one of the leading language technology and language service
providers in the Nordic countries. In the Lingsoft Solutions as Distributable Contain-
ers (LSDISCO) project, we packaged our language technology tools for distribution
as containerised services via the European Language Grid (ELG). As a result, Ling-
soft’s speech recognition, machine translation, proofing, and morphological analysis
was made available to users of the European Language Grid. The services primar-
ily cover Finnish (general and healthcare domain), Swedish (also Finland Swedish),
Danish, Norwegian bokmål and nynorsk, and English. The distribution as container-
ised services is a straightforward way of making our tools available and updated on
ELG and we intend to continue to update our service offerings on ELG with new
tools and languages as we develop them.



1 Overview and Objectives of the Pilot Project

Lingsoft is one of the leading providers of language technology solutions in the
Nordic countries and one of the 100 largest language service providers in the world.
The tools and models that Lingsoft contributed to ELG via the Lingsoft Solutions
as Distributable Containers (LSDISCO) project already existed and in most cases
they were already actively used in production by Lingsoft or our customers. The
goal of the LSDISCO project was to make those tools and models available as ELG-
compatible services for ELG users (Rehm et al. 2021). This included four types of
services:
  • Speech recognition, with the supported languages being Finnish (general and
    healthcare domain), Swedish and Norwegian bokmål
  • Machine translation, for language pairs involving Finnish, Swedish, and English
    in any combination, as well as both directions of Finnish – German



Sebastian Andersson · Michael Stormbom
Lingsoft, Finland, sebastian.andersson@lingsoft.fi, michael.stormbom@lingsoft.fi


© The Author(s) 2023                                                               301
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_20
302                                           Sebastian Andersson and Michael Stormbom

  • Proofing, entailing spelling and grammar error detection for Finnish, Swedish,
    Danish, Norwegian bokmål, and spelling for Norwegian Nynorsk and English
  • Text analysis, entailing morphological analysis (lemmatization and morphology)
    and named entity recognition (NER) for Finnish, Swedish, Danish, Norwegian
    Bokmål, Nynorsk, and English
   The end result of the project was a set of high quality NLP tools for the Nordic
languages available through ELG, for both commercial and non-commercial use,
allowing companies and public organisations throughout Europe to efficiently in-
corporate Nordic language support in their solutions and services.



2 Methodology

The four types of tools and services in scope for the LSDISCO project – speech
recognition, machine translation, proofing and text analysis – have been originally
developed at Lingsoft in different periods in the company and software development
history and for different primary use cases. The least common denominator was
a need for refactoring the tools and service architecture to comply with the ELG
requirements. Especially the machine translation tools needed conversion from an
internally used tool to enable also external distribution as a service via ELG.
   The LSDISCO project was divided into three phases per requirements in the ELG
call outline: 1. Experiment; 2. Integration; 3. Dissemination. The Experiment phase
consisted of refactoring Lingsoft’s tools and architecture to comply with ELG’s inte-
gration requirements. This phase also included enabling a licensing mechanism for
the services and creation or upgrade of the terms of service documentation. For the
Integration phase, we selected the option to integrate our services to ELG via a proxy
container, as this was the most practical option for us requiring the least amount of
additional maintenance. This means that all calls to the ELG service are forwarded
to and processed by Lingsoft’s back end. Upgrades to the services in Lingsoft’s back
end per our normal release update cycle, e. g., model improvements, are then im-
mediately available also in ELG. The dissemination phase consisted of advertising
Lingsoft’s services and the ELG platform on Lingsoft’s website and in suitable fo-
rums such as conferences and trade fairs.



3 Implementation

Lingsoft’s proofing, text analysis and speech recognition services were already to a
large extent ready for ELG integration. The improvements made for those largely
followed the existing development roadmap. The biggest implementation and refac-
toring effort in the LSDISCO project was for enabling serving Lingsoft’s neural ma-
chine translation (NMT) to external users, in this case ELG. The NMT engine and
20 Lingsoft Solutions as Distributable Containers                                 303

models were migrated from a solution serving “only” Lingsoft’s own translation pro-
duction to the same Software as a Service infrastructure as our speech recognition.
This gave us a scalable back end and the possibility to provide user credentials for
NMT usage, thus making important improvements to commercialising Lingsoft’s
machine translation and serving also external organisations.
   To integrate our services with ELG, we implemented the Lingsoft ELG adapter.
The Lingsoft ELG adapter is an API proxy container, illustrated in Figure 1. It ex-
poses the ELG platform’s internal LT Service API specification compatible end-
points and acts as a proxy to the Lingsoft APIs:
   ASR API Lingsoft Speech Recognition API
   NMT API Lingsoft Machine Translation API
   LMC API Lingsoft Language Management Central API (text analysis)
   In the proxy container, we implemented the conversion between the ELG and the
Lingsoft API specifications. The proxy container also includes the mechanism for
forwarding authentication via ELG for Lingsoft’s back end service.
   The Lingsoft ELG Adapter was packaged into a Docker image and submitted
to DockerHub. Lingsoft then filled in the ELG XML metadata specifications for
Lingsoft’s services on the ELG platform, and the ELG technical team could proceed
with the actual integration. The DockerHub image of the Lingsoft ELG Adapter
was created for ELG, but it can be deployed by other organisations in a Docker
environment and integrated with the organisation’s own solutions. All that another



                                  European Language Grid platform


                  ASR API Proxy                 E Proxy
                                           NMT API                LMC API Proxy

                  pod                      pod                    pod




                        ASR API                  NMT API                LMC API

                  pod                      pod                    pod

                                      Lingsoft Service platform



Fig. 1 API proxy containers relay Lingsoft’s services to ELG
304                                              Sebastian Andersson and Michael Stormbom

organisation would need to deploy the same Docker image into their environment
are credentials from Lingsoft that allows calling the Lingsoft back end services.
   As the ELG technical team preferred one service per functionality and language.
This meant that Lingsoft provided a total of 35 services for ELG integration. The
full set of services is presented in Table 1.


                        Service                     Supported Languages/Domains
      Speech recognition                        Finnish, Finnish Healthcare, Swedish,
                                                Norwegian bokmål
      Machine translation                       Finnish ↔ English, English ↔ Swedish,
                                                Finnish ↔ Swedish, German ↔ Finnish
      Proofing                                  Finnish, Finnish Healthcare, Swedish, Fin-
                                                land Swedish, Danish, Norwegian bokmål
                                                and nynorsk, English
      Morphological analysis (incl. Lemmatization) Finnish, Swedish, Danish, Norwegian
                                                   bokmål and nynorsk, English
      Named Entity Recognition (NER)            Finnish, Finnish Wikidata, Finnish YSO,
                                                Swedish, Danish, Norwegian bokmål and
                                                nynorsk, English
Table 1 Lingsoft services and languages




4 Evaluation

Generally, online guidelines and human integration support from ELG were clear
and sufficiently detailed throughout the course of the project. The integrated services
work per expectation in the “try out” user interface on the ELG platform.
   Lingsoft also provided the ELG project with feedback from a commercial per-
spective regarding the integration process and platform functionality. For example,
the demonstration services available in the “try out” box are quite slow. Lingsoft’s
speech recognition supports near real-time “live” subtitling/dictation, but this is not
yet possible to demonstrate via the ELG platform. The commercial aspects of the
platform are also work-in-progress at the time of writing, with no working solution
for billing an ELG end user for the use of, e. g., Lingsoft’s services. At present, we
provide our solutions through ELG mainly for demonstration purposes, as a market-
ing channel, and for non-commercial use.
20 Lingsoft Solutions as Distributable Containers                                   305

5 Conclusions and Results of the Pilot Project

The ELG project allowed us to upgrade our service infrastructure for easier distribu-
tion via ELG as well as through other channels. We believe that we will continue to
utilise other providers’ ELG resources and services for our benefit, especially open
source tools and resources. From our experience with trying to utilise open source
tools from the academic community, the ELG approach of researchers (and other
developers) providing their open source tools as shareable docker containers with
an exposed API is a great improvement over the current situation.
    For Lingsoft, ELG can be seen as an additional distribution channel for tools and
services we already provide. As an SME from Finland, it is expected that an official
EU platform will increase the findability of our services and raise the credibility
of our solutions outside of Finland, where we are well known. ELG is therefore
expected to facilitate reaching customers outside of Finland and the Nordics.
    We provide our tools both for commercial usage (on a Software as a Service sub-
scription model) by companies and organisations, and for research purposes (free
of charge for non-commercial use). In our internal work processes, e. g., subtitling
and translation, the dockerised tools and API access is ideal, as this facilitates keep-
ing our technology pipeline modular, and the core language technology tools easily
replaceable and/or upgradable.
    A centralised catalogue of European language technology, if widely adopted,
will be beneficial to private providers of language technology, such as Lingsoft, for
reaching new customers with our tools and services offerings. Conversely, we hope
our contribution to the platform with our services benefit ELG in becoming widely
adopted by providing more quality items for the ELG catalogue. Our solutions are
robust and widely used with a proven track record. Our spelling and grammar tools
have been distributed with the Microsoft Office suite and are used by the Finnish
Digital and Population Data Agency, as well as several of the largest newspapers in
Sweden; we have collaborated with the Swedish Post and Telecom Authority and
the public service broadcaster SVT in creating speech-to-text for Swedish and our
Finnish speech-to-text is in use for transcription in a number of Finnish organisations,
including the Finnish parliament.
    As ELG grows, we believe we will get good exposure for our services by having
them on display at ELG. The service adapter ELG integration allows us to continu-
ously improve the content of our ELG services with a minimum of additional mainte-
nance effort. We also intend to continue to release new tools and covered languages
in line with our general development roadmap.
    Lingsoft is proud to have been one of the selected organisations for the ELG
integration projects. We look forward to being part of the continued development of
the ELG platform and hope that a substantial part of the ELG visions are fulfilled in
the near future.
    Lingsoft’s services can be found in the European Language Grid.1


1   https://live.european-language-grid.eu/catalogue/search/Lingsoft
306                                                  Sebastian Andersson and Michael Stormbom

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects.




References

Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 21
Motion Capture 3D Sign Language Resources

Zdeněk Krňoul, Pavel Jedlička, Miloš Železný, and Luděk Müller




Abstract The new 3D motion capture data corpus expands the portfolio of exist-
ing language resources by a corpus of 18 hours of Czech sign language. This helps
alleviate the current problem, which is a critical lack of quality data necessary for
research and subsequent deployment of machine learning techniques in this area.
We currently provide the largest collection of annotated sign language recordings
acquired by state-of-the-art 3D human body recording technology for the successful
future deployment of communication technologies, especially machine translation
and sign language synthesis.



1 Overview and Objectives of the Pilot Project

Sign language (SL) is a natural means of communication for deaf people. About 70
million people use SL as their first language and there are more than 100 different di-
alects used around the world. Although significant progress has been made in recent
years in the field of language machine learning techniques, the field of SL processing
struggles with a critical lack of quality data needed for the successful application of
these techniques. SL resources are scarce – they consist of small SL corpora usually
designed for a specific domain such as linguistics or computer science. There are
some motion capture datasets for American Sign Language (ASL) and French Sign
Language (Lu and Huenerfauth 2010; Naert et al. 2017) with a total recorded time
of motion of up to 60 minutes. The situation is even worse for “small” languages.
    The 3D reconstruction of human body motion using images and depth cameras
is a common approach for capturing the movement of the human body (MMPose
Contributors 2020). Current large SL datasets are mostly based on 2D RGB videos
(Vaezi Joze and Koller 2019; Zelinka and Kanis 2020). The main goal of our project
is to deliver a large 3D motion dataset collected using high precision optical marker-
based motion capture and to extend the existing ELG portfolio of language resources

Zdeněk Krňoul · Pavel Jedlička · Miloš Železný · Luděk Müller
University of West Bohemia, Czech Republic, zdkrnoul@ntis.zcu.cz, jedlicka@ntis.zcu.cz,
zelezny@ntis.zcu.cz, muller@kky.zcu.cz


© The Author(s) 2023                                                               307
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_21
308                        Zdeněk Krňoul, Pavel Jedlička, Miloš Železný, and Luděk Müller

by Czech sign language (CSE) data. For comparison SIGNUM, one of the largest
video-based SL datasets, contains approximately 55 hours of SL recordings (Koller
et al. 2015) and one of the largest 3D motion capture datasets contains only 60 min-
utes of SL recordings (Naert et al. 2017).
    Motion capture technology guarantees precise recording of the signer’s move-
ments in 3D space at the cost of a more complex preparation phase compared to
standard video recording. Optical marker-based motion capture has become the in-
dustry standard for capturing movement of the human body. In Jedlička et al. (2020),
we collected the first 3D motion capture dataset for CSE, covering the weather fore-
cast domain. It has a rather limited size and contains recordings of one signer only.
    Our contribution can be summarised as follows:
  • Proof of concept of large-scale motion capture recording of multiple SL speak-
    ers;
  • Provide 3D motion capture data to cover wider domains, grammatical context
    and more signers. We perform proper data post-processing, annotate glosses,
    and develop tools for data extraction from the collected dataset;
  • The largest SL motion capture dataset consisting of recordings of continuous
    SL phrases and a vocabulary of six native SL speakers from carefully selected
    domains, in total more than 18 hours;
  • Tools that allow searching for individual glosses, phrases, or small movement
    sub-units (e. g., given hand shape/action) in the dataset.



2 Methodology and Experiment

A new recording procedure for a large amount of 3D motion capturing of SL was in-
vestigated to ensure sufficient diversity of SL speakers, grammar, and sign contexts.
This makes the new language resource more versatile and useful in many different
research fields such as further linguistic and SL motion analyses. The integral part
of the experiment is data processing.
   In Jedlička et al. (2020), the experimental recording setup with VICON 18 cam-
eras was used as proof of the intended concept. The negative aspect of this setup
was its high complexity; the setup was very time demanding and not suitable for
large-scale data and multiple speakers.
   The new procedure simplifies the process by dividing the setup into two separate
parts: large-scale body movement and small-scale, highly detailed finger movement
are recorded with two separate motion capture camera setups, each of which uses a
reduced number of capture cameras and is adjusted slightly for different speakers.
21 Motion Capture 3D Sign Language Resources                                                 309

2.1 Recording Setup

We used our laboratory equipment, i. e., the VICON motion capture system with
eight cameras. We extended it with a standard color video camera for a reference
video. The frame rate was 100 frames per second (fps) for the motion capture and
25 fps for the reference video. The VICON system records movement using passive
retro-reflexive markers attached to the human body. Movement is modeled as a set
of movements of the rigid parts connected by the skeleton; the marks are placed on
the poles of the rotation axis of the main skeleton joints. Each body part is defined
by at least four markers, except fingertips, see Figure 1.




Fig. 1 Visualisation of SL body marker setup (left) and SL hand-shape marker setup (right)


   The SL body marker setup is based on marker positions defined by the VICON
three-finger standard. It uses a total of 43 markers for tracking upper body, head,
arms, and palms movement. A simple hand pose is provided at the same time and
incorporates tracking of thumb, index, and little fingertips. Moreover, this setup in-
cludes face tracking providing a non-manual component of SL, that is reduced to
seven facial markers. The SL hand-shape marker setup is designed for detailed hand-
shapes recording. Each hand-shape is recorded separately. Data is recorded for the
right hand only. The movement starts from the relaxed hand-shape, then changes
to the given hand-shape and back to the relaxed hand-shape. For both setups, data
capturing was supervised by CSE linguists.



2.2 Data Annotation

An essential step is the annotation of captured SL utterances. We use time-synchron-
ised reference video, the ELAN tool (Figure 2) and SL experts. The annotation of a
sign is done by giving the information of the sign’s meaning (gloss), and the right
and the left hand-shape. If the sign consists of more than one defined hand-shape, the
310                            Zdeněk Krňoul, Pavel Jedlička, Miloš Železný, and Luděk Müller

hand-shapes are annotated as a set of hand-shapes. Both the activities are very labo-
rious and time-consuming. To successfully complete this task, we involved several
trained annotators who worked in parallel.



2.3 Data Post-processing

Post-processing consists of data-cleaning, whole-body motion reconstruction, and
data-solving. Data-cleaning removes noise and fills gaps in the raw 3D data caused
by frequent mutual occlusions of markers during signing, and other noise caused by
the environment. Motion reconstruction and data-solving recalculate marker posi-
tions into the movement of the skeletal model.
    The data of both setups was post-processed. We reconstructed small gaps by the
interpolation standard technique as long as the trajectory was simple enough. Note,
that the recording speed is 100 fps, which is fast enough to contain minimal changes
in trajectory between frames. We used semi-automatic 3D reconstruction of marker
trajectories and labeling, and manual cleaning of swaps and gaps. For the body parts
defined by at least four markers, filling in the trajectories of the marker is well au-
tomatised because at least three points are enough to define the missing position.
    The body marker setup uses only one marker per fingertip and some larger gaps
caused by more complex self-occlusions of body parts can obscure three or more
markers in one rigid segment. Post-processing in those cases is more complicated
and gaps must be filled in manually.
    The full SL body movement is achieved as a composition of the body movement
and corresponding data of the hand-shapes setup. For this purpose, the annotation
of hand-shapes provides us temporal segmentation of the recordings. Thus the fin-
gertip motion segments can provide information about dynamic changes during the
performance of a particular SL hand-shape in a particular data frame.
    The middle part of a given segment is always completed according to the hand-
shape(s) assigned by the annotation. We captured full fingers motion only for the
transition of the given hand-shape from and to the neutral hand-shape. Thus, for the
other frames of the segment, the nearest hand pose with the smallest reconstruction




Fig. 2 Example of annotation
work in ELAN, specifically
designed software for the
analysis of sign languages,
and gestures
21 Motion Capture 3D Sign Language Resources                                       311

error can be used. We consider only those frames that have an alignment error below
a given threshold. The remaining frames will have gaps in the final trajectories.
   We solved the above problem as point-set alignment via Procrustes analysis that
arises especially in tasks like 3D point cloud data registration. The rigid transforma-
tion of two sets of points on top of each other minimises the total distance in 3D
between the corresponding markers (Arun et al. 1987). Since the data is noisy, it
minimises the least-squares error:

                                  ∑
                                  N
                          err =         ||RMfi + t − Mrf
                                                      i
                                                         ||,                       (1)
                                  i=1

where Mf and Mrf are current and reference frame(s) respectively as a set of 3D
points with known correspondences, R is the rotation matrix and t the translation
vector. We define N = 7 as three fingertips (thumb, index, little finger), two wrist
markers, and two knuckles of the index and little fingers. We aligned just the rotation
and translation because the 3D transformation preserves the shape and size (same
hand-shape and SL speaker). For the left hand, we mirrored the reference frame(s).
   The last step is data-solving. It is a process of reconstruction of the 3D motion
of the skeleton from the marker trajectories. For this purpose, we use the VICON
software. The skeleton is well defined to directly control the SL avatar animation or
handle animation retargeting.



2.4 Dataset Parameters

We limited the linguistic domain to two specific fields to reduce the number of unique
signs. Weather forecasts and animal descriptions from the zoological garden domain
were selected by CSE linguists. We were also given a list of all hand-shapes which
occur in these domains. The dataset is collected from six SL speakers, who differ in
their body size, age, and gender.



3 Conclusions and Results of the Pilot Project

SLs are not sufficiently supported through technologies and have only fragmented,
weak, or no support at all. Our ELG pilot project offers a new SL resource designed
for the development of language technologies (LTs) and multilingual services for
Czech. The results contribute to the establishment of the Digital Single Market as
one of ELG’s objectives. In contrast to the all-in-one recording setup, the body move-
ment is recorded separately from the highly detailed recording of hand poses. This
separation reduces the camera setup complexity and the complexity of data during
post-processing, which makes SL recording more flexible and adjustments for new
SL speakers or data easier.
312                             Zdeněk Krňoul, Pavel Jedlička, Miloš Železný, and Luděk Müller

   The project delivered a professionally created SL dataset via state-of-the-art 3D
motion capture technology. The project provides data for the wider research com-
munity through ELG. We have recorded 18 hours of sign language and recorded six
different speakers for two different domains.
   We assume our results will be beneficial for other applications such as next gen-
eration SL synthesis that uses a 3D animated avatar for natural human movement
reproduction or SL analysis or gesture recognition and classification in general.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects.




References

Arun, K. S., T. S. Huang, and S. D. Blostein (1987). “Least-Squares Fitting of Two 3-D Point Sets”.
   In: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9.5, pp. 698–700.
   DOI: 10.1109/TPAMI.1987.4767965.
Jedlička, Pavel, Zdeněk Krňoul, Jakub Kanis, and Miloš Železný (2020). “Sign Language Mo-
   tion Capture Dataset for Data-driven Synthesis”. In: Proceedings of the LREC2020. Marseille,
   France: ELRA, pp. 101–106.
Koller, Oscar, Jens Forster, and Hermann Ney (2015). “Continuous sign language recognition: To-
   wards large vocabulary statistical recognition systems handling multiple signers”. In: Computer
   Vision and Image Understanding 141. Pose & Gesture, pp. 108–125.
Lu, Pengfei and Matt Huenerfauth (2010). “Collecting a motion-capture corpus of American Sign
   Language for data-driven generation research”. In: Proceedings of the NAACL HLT 2010 Work-
   shop on Speech and Language Processing for Assistive Technologies. ACL, pp. 89–97.
MMPose Contributors (2020). OpenMMLab Pose Estimation Toolbox and Benchmark. URL: http
   s://github.com/open-mmlab/mmpose.
Naert, Lucie, Caroline Larboulette, and Sylvie Gibet (2017). “Coarticulation Analysis for Sign
   Language Synthesis”. In: Universal Access in Human – Computer Interaction. Designing Novel
   Interactions. Cham: Springer, pp. 55–75.
Vaezi Joze, Hamid and Oscar Koller (2019). “MS-ASL: A Large-Scale Data Set and Benchmark
   for Understanding American Sign Language”. In: BMVC.
Zelinka, Jan and Jakub Kanis (2020). “Neural Sign Language Synthesis: Words Are Our Glosses”.
   In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 3384–3392.
   DOI: 10.1109/WACV45572.2020.9093516.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 22
Multilingual Image Corpus

Svetla Koeva




Abstract The ELG pilot project Multilingual Image Corpus (MIC 21) provides
a large image dataset with annotated objects and multilingual descriptions in 25
languages. Our main contributions are: the provision of a large collection of high-
quality, copyright-free images; the formulation of an ontology of visual objects based
on WordNet noun hierarchies; precise manual correction of automatic image segmen-
tation and annotation of object classes; and association of objects and images with
extended multilingual descriptions. The dataset is designed for image classification,
object detection and semantic segmentation. It can be also used for multilingual ima-
ge caption generation, image-to-text alignment and automatic question answering for
images and videos.



1 Overview and Objectives of the Pilot Project

Significant progress has been achieved in many multimodal tasks, such as image
caption generation, aligning sentences with images in various types of multimodal
documents and visual question answering. The shift of traditional vision methods
challenged by multimodal big data motivates the creation of a new image dataset,
the Multilingual Image Corpus (MIC21).
    The MIC21 dataset is characterised by carefully selected images from themati-
cally related domains and precise manual annotation for segmentation and classifi-
cation of objects in over 20,000 images. The annotation is performed by drawing
of or correcting automatically generated polygons, from which bounding boxes are
automatically constructed. This allows for wide application of the dataset in various
computer vision tasks: image classification, recognition and classification of single
objects in an image or of all object instances in an image (semantic segmentation).
    The annotation classes which are used belong to a specially designed ontology of
visual objects which provides options for extracting relationships between objects
in images; the construction of diverse datasets with different levels of granularity of

Svetla Koeva
Institute for Bulgarian Language, Bulgarian Academy of Sciences, Bulgaria, svetla@dcl.bas.bg


© The Author(s) 2023                                                                      313
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_22
314                                                                      Svetla Koeva

object classes; and the compilation of appropriate sets of images illustrating differ-
ent thematic domains. The ontology classes and their definitions, accompanied by
illustrative examples, have been translated into 25 languages, which can be used for
automatic interpretation of an image, caption generation and alignment of images
with short texts such as questions and answers about the image content.



2 Methodology

We have divided the annotation process into four main stages: 1. definition of an
ontology of visual objects; 2. collection of appropriate images; 3. automatic object
segmentation and classification; and manual correction of object segmentation and
manual classification of objects. The dataset contains four thematic domains (sport,
transport, arts, security), which group highly related dominant classes such as Ten-
nis player, Soccer player, Limousine, Taxi, Singer, Violinist, Fire engine, and Police
boat in 130 subsets of images. We have used the COCO Annotator (Brooks 2019),
which allows for collaborative work within a project, and offers tracking object in-
stances and labelling objects with disconnected visible parts.



2.1 Ontology of Visual Objects

In current practice, WordNet is typically used in generating text queries for the cre-
ation of search-based image collections. For example, ImageNet uses 21841 synsets
for image collection and their labeling (Russakovsky et al. 2015). A Visual Concept
Ontology is proposed which organises concepts (Botorek et al. 2014), containing 14
top-level ontology classes divided into 90 more specific classes. Other datasets use
a hierarchical organisation of object classes and mutually exclusive classes (Caesar
et al. 2018), however, the number of concepts is usually relatively small.
    The ontology of visual objects created for MIC21 embraces concepts that are the-
matically related and can be depicted in images. The four thematic domains (sport,
transport, arts, security) are represented by 137 dominant classes, which show the
main “players” within these domains. The ontology also embraces the hypernyms of
the dominant classes up to the highest hypernym, which denotes a concrete object,
and non-hierarchically related classes (called attributes) (Koeva 2021). The type of
dominant class and the type of attribute class determine the type of the relation be-
tween them: has instrument, wears, uses, has part, etc. For example, the attribute
classes for Billiard player are Pool table, Billiard ball, and Cue, while for Bowler
– Bowling alley, Bowl, Bowling pin, Bowling shoe etc.; the hypernym classes for
Billiard player and Bowler are Player, Contestant and Person.
    Some of the classes and relations are inherited from WordNet (Miller et al. 1990).
Additional classes and relations are included in the ontology in case they are not
present in WordNet, for example Bowler wears Bowling shoes. Using the ontology
22 Multilingual Image Corpus                                                       315

of visual objects ensures the selection of mutually exclusive classes; the interconnec-
tivity of classes by means of formal relations and an easy extension of the ontology
with more concepts corresponding to visual objects.



2.2 Collection of Images and Metadata

The images in the dataset are collected from a range of repositories offering APIs:
Wikimedia (images with Public Domain License or Non-copyright restrictions li-
cense)1 ; Pexels (images with a free Pexels license allowing free use and modifica-
tions)2 ; Flickr (images with Creative Commons Attribution License, Creative Com-
mons Attribution ShareAlike License, no known copyright restrictions, Public Do-
main Dedication, Public Domain Mark)3 ; Pixabay (images with a free Pixabay li-
cense allowing free use, modifications and redistribution)4 . The Creative Commons
Search API is also used for searches on content available under Creative Commons
licenses5 . Over 750,000 images were collected in total and automatically filtered fur-
ther by image dimensions, license types and for duplication. Each image is equipped
with metadata description in JSON format: filepath; source (name of the repository
or service used to obtain the image); sourceURL (URL of this repository or service
); license; author (if available); authorURL (if available); domain (the domain the
image belongs to); width and height (in pixels) etc.



3 Criteria for the Selection of Images

After the collection of images, we performed additional manual selection to ensure
the quality of the dataset, applying the following criteria: i) The image has to con-
tain a clearly presented object described by a given dominant class; ii ) The object
should (preferably) have no occluded parts; iii) The target object should be in its
usual environment and in a position or use that is normal for its activity or purpose;
iv) The instances of the target object in different images should not represent one
and the same person, animal or artefact; v) Images with small objects, unfocused
objects in the background or images with low quality are not selected; vi) Images
which represent collages of photos or are post-processed are not selected.
   The final selection of images is triple-checked independently by different experts:
after the automatic collection, after the automatic generation of segmentation masks
and during manual annotation.

1 https://commons.wikimedia.org/wiki/Commons:Licensing
2 https://www.pexels.com/license/
3 https://www.flickr.com/services/developer/api/

4 https://pixabay.com/service/license/

5 https://api.creativecommons.org/docs/
316                                                                           Svetla Koeva

3.1 Generation and Evaluation of Suggestions

To accelerate the manual annotation, an image processing pipeline for object detec-
tion and segmentation was developed. Two software packages – YOLACT (Bolya
et al. 2019) and DETECTRON2 (Wu et al. 2019), and Fast R-CNN (Girshick 2015)
models trained on the COCO dataset (Lin et al. 2014) were used for the generation
of annotation suggestions. We also performed automatic relabelling for some of the
predicted classes (usually for the dominant class and for some of its attribute classes),
e. g., the COCO category Person within the subset Golf from the thematic domain
Sport is replaced with the class Golf player. The performance of the models was
evaluated over all domain-specific datasets within the domain Sport (see Figure 1).




Fig. 1 Annotation results: human (left), YOLACT (middle) DETECTRON2 (right)


   The results demonstrate similar behaviour with a slight predominance of one of
the models, which was further used to predict the object classes in the datasets from
the other three thematic domains. Altogether 253,980 segmentation masks were au-
tomatically generated, 194,212 of which were manually adjusted.



3.2 Annotation Protocol

The task for annotators was to outline polygons for individual objects in the image
(either by approving or correcting the automatic segmentation or by creating new
polygons) and to classify the objects against the classes from the predefined ontology.
The annotation follows several conventions:
  • An object within an image is annotated if it represents an instance of a concept
    included in the ontology.
  • All objects from the selected dominant class and its attribute classes are anno-
    tated (for example, Gondola and the related objects Gondolier and Oar).
  • If the object can be associated with different classes, this is recorded within the
    metadata (for example, for a female soldier – Soldier and Woman).
    Quality control is provided by a second annotator who validates the implemen-
tation of the conventions and discusses the quality with the annotation group on a
regular basis. If necessary, some of the images are re-annotated.
22 Multilingual Image Corpus                                                        317

4 Multilingual Classes

For the purpose of the multilingual description of the images, all ontology classes
have been translated into 25 languages: English (Princeton WordNet), Albanian, Bul-
garian, Basque, Catalan, Croatian, Danish, Dutch, Galician, German, Greek, Finnish,
French, Icelandic, Italian, Lithuanian, Polish, Portuguese, Romanian, Russian, Ser-
bian, Slovak, Slovene, Spanish, and Swedish.
   Openly available wordnets have been used from the Extended Open Multilingual
WordNet.6 For the ontology classes which are not inherited from WordNet the ap-
propriate WordNet hypernyms are used. Where WordNet translations are not avail-
able, additional sources of translations as BabelNet7 are employed. The multilingual
translations of classes are presented in a separate JSON file which contains informa-
tion about the language and the translation source. The translations of the ontology
classes are accompanied by their synonyms, the concept definition and usage exam-
ples (if available in the sources).



5 Conclusions and Results of the Pilot Project

The Multilingual Image Corpus provides fully annotated objects within images with
segmentation masks, classified according to an ontology of visual objects, thus of-
fering data to train models specialised in object detection, segmentation and classi-
fication (Table 1). The ontology of visual objects allows easy integration of anno-
tated images in different datasets as well as learning the associations between ob-
jects in images. The ontology classes are translated into 25 languages and supplied
with definitions and usage examples. The explicit association of objects and images
with appropriate text fragments is relevant for multilingual image caption generation,
image-to-text alignment and automatic question answering for images and video.


               Domain     Subsets   Number of Images        Number of Annotations
              Sport            40                   6,915                  65,482
              Transport        50                   7,710                  78,172
              Arts             25                   3,854                  24,217
              Security         15                   2,837                  35,916
              MIC21           130                21,316                   203,797
Table 1 Multilingual Image Corpus: basic statistics




6   http://compling.hss.ntu.edu.sg/omw/summx.html
7   https://babelnet.org/guide
318                                                                                  Svetla Koeva

   All annotations and image metadata are available for commercial and non-com-
mercial purposes in accordance with the Creative Commons Attribution-ShareAlike
4.0 International (CC BY-SA 4.0).

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects.




References

Bolya, Daniel, Chong Zhou, Fanyi Xiao, and Yong Jae Lee (2019). “YOLACT: Real-time Instance
   Segmentation”. In: pp. 9156–9165. DOI: 10.1109/ICCV.2019.00925.
Botorek, Jan, Petra Budı́ková, and Pavel Zezula (2014). “Visual Concept Ontology for Image An-
   notations”. In: CoRR. URL: http://arxiv.org/abs/1412.6082.
Brooks, Justin (2019). COCO Annotator. URL: https://github.com/jsbroks/coco-annotator/.
Caesar, Holger, Jasper Uijlings, and Vittorio Ferrari (2018). “COCO-Stuff: Thing and Stuff Classes
   in Context”. In: Conference on Computer Vision and Pattern Recognition, pp. 1209–1218.
Girshick, Ross (2015). “Fast R-CNN”. In: pp. 1440–1448. DOI: 10.1109/ICCV.2015.169.
Koeva, Svetla (2021). “Multilingual Image Corpus: Annotation Protocol”. In: Proceedings of the
   International Conference on Recent Advances in Natural Language Processing (RANLP 2021).
   INCOMA, pp. 701–707.
Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays,
   Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár (2014). “Microsoft COCO:
   Common Objects in Context”. In: European Conference on Computer Vision (ECCV). Zürich,
   pp. 740–755.
Miller, George, R. Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller (1990).
   “Introduction to WordNet: An on-line lexical database”. In: International Journal of Lexicogra-
   phy 3, pp. 235–244.
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
   Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-
   Fei (2015). “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal
   of Computer Vision 116, pp. 157–173.
Wu, Yuxin, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick (2019). Detec-
   tron2. URL: https://github.com/facebookresearch/detectron2.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 23
Multilingual Knowledge Systems as
Linguistic Linked Open Data

Alena Vasilevich and Michael Wetzel




Abstract Creation and re-usability of language resources in accordance with Linked
Data principles is a valuable asset in the modern data world. We describe the contri-
butions made to extend the Linguistic Linked Open Data (LLOD) stack with a new
resource, Coreon MKS, bringing together concept-oriented, language-agnostic ter-
minology management and graph-based knowledge organisation. We dwell on our
approach to mirroring of Coreon’s original data structure to RDF and supplying it
with a SPARQL endpoint. We integrate MKS into the existing ELG infrastructure,
using it as a platform for making the published MKS discoverable and retrievable
via a industry-standard interface. While we apply this approach to LLOD-ify Coreon
MKS, it can also provide relevant input for standardisation bodies and interoperabil-
ity communities, acting as a blueprint for similar integration activities.



1 Overview and Objectives of the Pilot Project

In a world depending on knowledge sharing, data-driven businesses and research
communities are concerned with the creation, sharing, and use of language resources
in accordance with Linked Data principles, which ensure better data discoverability,
standardised structure, and cost savings for all parties involved in the creation of
structured data. Robust, coherent, and multilingual information standards are needed
to enable information exchange among public organisations, similar to standards that
have been fostering technical interoperability for decades (Guijarro 2009).
    We extend the Linguistic Linked Open Data (LLOD) stack with a new resource,
Multilingual Knowledge System (MKS). MKS caters for the discovery, access, re-
trieval, and re-usability of terminologies and other interoperability assets organised
in knowledge graphs (KG) in a taxonomic fashion. As a semantic knowledge repos-
itory, its main forte is the ability to exchange information among acting systems,
ensuring that its precise meaning is understood and preserved among all parties, in
any language. Injecting structure into the language data and expanding the result-

Alena Vasilevich · Michael Wetzel
Coreon GmbH, Germany, alena@coreon.com, michael@coreon.com


© The Author(s) 2023                                                              319
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_23
320                                                  Alena Vasilevich and Michael Wetzel

ing KG with multilingual terminologies, Coreon uses the European Language Grid
(ELG) as a platform for making the published resources discoverable and retriev-
able through SPARQL, a protocol widely used for the retrieval of information from
Semantic Web resources. While existing SPARQL tools enable users to query knowl-
edge graphs, they are rarely used for termbases and other terminology resources, i. e.,
core data sources for translation and localisation (Stanković et al. 2014). This step
makes Coreon integration into other systems tool-independent: instead of using the
proprietary API, it relies on LLOD standards.
    The goal of our contribution is to deliver MKS resources to the Semantic Web
community, enabling it to query concept-oriented multilingual structured data with
a well-established industry-standard syntax, and to promote the development of data
multilingualism within the Semantic Web. In the long run, MKS as a LLOD resource
can provide relevant input for standardisation bodies and interoperability communi-
ties: acting as a blueprint for similar integration activities, it can be viewed as a
starting point for an international standard. We share our experience with ISO/TC37
SC31 working groups as a draft for a technical recommendation on how to represent
TermBase eXchange (TBX) dialects as RDF.



2 Making Coreon Data Structure LLOD-compatible

Resource Description Framework (RDF) and Web Ontology Language (OWL) are
standardised formats for representing Semantic Web data. They support data inte-
gration and offer a plethora of tools and methods for data access. SPARQL operates
on RDF/OWL resources allowing users to retrieve structured responses to submit-
ted queries. To express queries, it utilises triple patterns that are to be matched by
RDF/OWL triples and filter conditions, imposing ranges for literals (Almendros-
Jiménez and Becerra-Terón 2021). Despite the emerging interest in publishing ter-
minological resources as linked data, the LLOD stack has not been heavily utilised
for this purpose so far (Buono et al. 2020).
    We implemented a solution for Coreon MKS, making termbases discoverable and
accessible for LLOD systems (Chiarcos et al. 2013). Normally data owners deploy
a technology like a RDF triple store for their terminology tool, often developing or
setting up a tedious data-mirroring process. We go beyond the limits of RDF/knowl-
edge graph editors, which tend to be good at relation modeling but have weaknesses
when it comes to capturing linguistic information.
    At the core of the MKS lies a language-independent KG. Unlike other popular so-
lutions within terminology management, linking is performed not at the term but at
the concept level; therefore, abstracting from terms, we can model structured knowl-
edge for phenomena that reflect the non-deterministic nature of human language,
such as word sense ambiguity, synonymy, and multilingualism. Linking per con-
cept also ensures smooth maintenance of relations without additional data clutter:

1   https://www.iso.org/committee/48136.html
     23 Multilingual Knowledge Systems as Linguistic Linked Open Data                    321

     relation edges are independent from labels, terms and their variants, and other meta-
     data. Besides the mirroring process between the Coreon data model and an RDF
     graph, the RDF vocabulary was established, covering classes, relations, additional
     term-descriptive information, and administrative metadata. It binds elements into
     RDF triples. At this stage it was critical to identify information objects and mapping
     of predicates and literals.


 1   {" created_at ": "2021 -04 -20 T13 :04:59.816 Z",
 2        " terms ":[
 3                  {" lang ": "en",
 4                   "value": "screen" ,
 5                   "id": "607ed17b318e0c181786b549" ,
 6                  " concept_id ": "607 ed17b318e0c181786b545 ",
 7                  " properties ": []} ,
 8                  {" lang": "de",
 9                   "value": "Bildschirm" ,
10                   "id": "607ed195318e0c181786b55e" ,
11                  " concept_id ": "607 ed17b318e0c181786b545 ",
12                  " properties ": []}
13        ],
14         "id": "607ed17b318e0c181786b545" }
     Listing 1 Excerpt of the Coreon data structure.



        Listing 1 shows relevant lines within the original JSON data structure that rep-
     resents the sample concept “screen”, with concept ID and individual term IDs
     and their values highlighted. To transform this data structure into an RDF graph,
     the concept and its two terms are bound together in statements, i. e., RDF triples.
     Each triple comprises a subject, a predicate and an object; in our case, the con-
     cept will act as the subject, the terms become objects and the required predicate
     is named hasTerm. The complete sample set of triples serialised in RDF/Turtle is
     provided in Listing 2, with highlighted lines 9-10 indicating that the resource with
     ID 606336dab4dbcf018ed99308 belongs to the OWL class coreon:Concept and
     contains a term with ID 606336dab4dbcf018ed99307.
        In RDF and LOD, data is stored in an atomic manner, with predicates and uniform
     resource identifiers (URIs) linking elements together. In our case, all instances repre-
     sented as classes receive unique identifiers. Together with unique IDs, the namespace
     coreon: unambiguously identifies any given element, regardless of whether it is a
     concept, term, property or a concept relation. Table 1 lists our RDF vocabulary, de-
     rived from the original MKS data structure. During the Coreon-to-RDF conversion,
     there were obvious candidates for classes, like Concept and Term; yet mirroring
     descriptive information like Definition or TermStatus and mapping taxonomic
     and associative concept relations turned out to be challenging. For the predicates we
     had to specify what information can be used, defining owl:range and owl:domain;
     322                                                        Alena Vasilevich and Michael Wetzel

 1   coreon:607ed17b318e0c181786b547 a coreon:Edge ;
 2     coreon:edgeSource coreon:606336dab4dbcf018ed99308 ;
 3     coreon:edgeTarget coreon:607ed17b318e0c181786b545 ;
 4     coreon:type " SUPERCONCEPT_OF " .
 5

 6   coreon:606336dab4dbcf018ed99307 a coreon:Term ;
 7     coreon:value " peripheral device "@en .
 8

 9   coreon:606336dab4dbcf018ed99308 a coreon:Concept;
10     coreon:hasTerm coreon:606336dab4dbcf018ed99307 .
11

12   coreon:607ed17b318e0c181786b545 a coreon:Concept ;
13     coreon:hasTerm coreon:607ed195318e0c181786b55e ,
14       coreon:607ed17b318e0c181786b549 .
15

16   coreon:607ed17b318e0c181786b549 a coreon:Term ;
17     coreon:value " screen "@en .
18

19   coreon:607ed195318e0c181786b55e a coreon:Term ;
20     coreon:value " Bildschirm "@de .
     Listing 2 Triples serialised in RDF / Turtle



 1   coreon : hasTerm
 2     rdf:type owl: ObjectProperty ;
 3     rdfs: comment " makes a term member of a concept " ;
 4     rdfs: domain coreon : Concept ;
 5     rdfs:label "has term" ;
 6     rdfs:range coreon :Term .
     Listing 3 Specification of a predicate




     e. g., the predicate hasTerm can only accept resources of type coreon:Concept as
     a subject (owl:domain). Listing 3 provides a full specification of this predicate.


                            OWL Type                      Coreon RDF Vocabulary
           Classes   owl:Class                 coreon:Admin, coreon:Edge, coreon:Concept,
                                               coreon:Flagset, coreon:Property, coreon:Term
           Predicates owl:ObjectProperty       coreon:hasAdmin,              coreon:hasFlagset,
                                               coreon:hasProperty, coreon:hasTerm
           Values    owl:AnnotationProperty coreon:edgeSource, coreon:edgeTarget, coreon:id,
                                            coreon:name, coreon:type, coreon:value
     Table 1 Derived Coreon RDF vocabulary
    23 Multilingual Knowledge Systems as Linguistic Linked Open Data                       323

    3 Real-Time Data Access via a SPARQL Endpoint

    With the vocabulary defined, we equipped Coreon’s export engine with a RDF pub-
    lication mechanism, including the export in relevant syntax flavours (Turtle, N3,
    JSON-LD). The Coreon cloud service was supplied with a real-time accessible
    SPARQL endpoint via Apache Jena Fuseki.2 It conforms to all published standards
    and tracks revisions and updates in the under-developed areas of the standard. Run-
    ning as a secondary index in parallel with the repository’s data store, Fuseki catches
    any changes made by data maintainers, updating the state of the repository in real
    time. Listing 4 demonstrates a sample SPARQL query over a MKS that deals with
    wine varieties: here, we want to return all terms, including the values of the Usage
    flag in case the terms have them.


1   SELECT ?t ? termvalue ? usagevalue
2       WHERE { ?t rdf:type coreon :Term .
3                ?t coreon :value ? termvalue .
4                OPTIONAL { ?t coreon : hasProperty ?p .
5                             ?p coreon :key " Usage " .
6                             ?p coreon :value ? usagevalue .
7                }
8       }
    Listing 4 Sample SPARQL query over MKS



       Table 2 shows a subset of the linked data structures returned by this query, i. e., a
    term’s URI, its value, and usage recommendation if available.


                                  [t]                         termvalue       usagevalue
              http://www.coreon.com/coreon-rdf#[…]8b8aa   Riesling
              http://www.coreon.com/coreon-rdf#[…]8b8bb   Cabernet Sauvignon Preferred
              http://www.coreon.com/coreon-rdf#[…]8b8be   CS                 Not allowed
              http://www.coreon.com/coreon-rdf#[…]8b8c2   Merlot
    Table 2 Results of the sample SPARQL query (Listing 4): returned grape varieties




    4 Conclusions and Results of the Pilot Project

    We developed a pipeline to make MKS resources LLOD-compatible, mapping
    Coreon data structure to RDF, conceiving the Coreon-RDF vocabulary and pub-
    lishing MKS resources via ELG. Besides making the SPARQL endpoint available
    2   https://jena.apache.org
324                                                         Alena Vasilevich and Michael Wetzel

through ELG, we implemented a productised piece of software, providing TermBase
eXchange-like terminology resources in the RDF and Semantic Web context; a set
of demo repositories is accessible via the endpoint through ELG. Beyond establish-
ing structural interoperability, the implemented interface bridges Coreon with other
Semantic Web systems, enabling querying of elaborate multilingual terminologies.
Our mirroring approach can act as a blueprint for similar conversion and integra-
tion activities, viewed as a starting point for an international standard. Deployed
through ELG, Coreon’s SPARQL interface enables the Semantic Web community
to query rich heterogeneous MKS data with a familiar, industry-standard syntax,
promoting data accessibility and contributing to the development of multilingual re-
sources within the Semantic Web.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects.




References

Almendros-Jiménez, Jesús Manuel and Antonio Becerra-Terón (2021). “Discovery and diagnosis
   of wrong SPARQL queries with ontology and constraint reasoning”. In: Expert Systems with
   Applications 165, p. 113772. DOI: 10.1016/j.eswa.2020.113772.
Buono, Maria Pia Di, Philipp Cimiano, Mohammad Fazleh Elahi, and Frank Grimm (2020).
   “Terme-à-LLOD: Simplifying the Conversion and Hosting of Terminological Resources as
   Linked Data”. In: Proc. of the 7th Workshop on Linked Data in Linguistics, LDL@LREC
   2020, Marseille, France, May 2020. Ed. by Maxim Ionov, John P. McCrae, Christian Chiar-
   cos, Thierry Declerck, Julia Bosque-Gil, and Jorge Gracia. ELRA, pp. 28–35.
Chiarcos, Christian, Philipp Cimiano, Thierry Declerck, and John P. McCrae (2013). “Linguistic
   Linked Open Data. Introduction and Overview”. In: Proceedings of the 2nd Workshop on Linked
   Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other
   language data. Pisa, Italy: ACL, pp. i–xi.
Guijarro, Luis (2009). “Semantic interoperability in eGovernment initiatives”. In: Computer Stan-
   dards & Interfaces 31.1, pp. 174–180. DOI: 10.1016/j.csi.2007.11.011.
Stanković, Ranka, Ivan Obradović, and Miloš Utvić (2014). “Developing Termbases for Expert Ter-
   minology under the TBX Standard”. In: Natural Language Processing for Serbian-Resources
   and Applications, pp. 12–26.


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 24
Open Translation Models, Tools and Services

Jörg Tiedemann, Mikko Aulamo, Sam Hardwick, and Tommi Nieminen




Abstract The ambition of the Open Translation Models, Tools and Services (OPUS-
MT) project is to develop state-of-the art neural machine translation (NMT) models
that can freely be distributed and applied in research as well as professional applica-
tions. The goal is to pre-train translation models on a large scale on openly available
parallel data and to create a catalogue of such resources for streamlined integration
and deployment. For the latter we also implement and improve web services and
computer-assisted translation (CAT) tools that can be used in on-line interfaces and
professional workflows. Furthermore, we want to enable the re-use of models to
avoid repeating costly training procedures from scratch and with this contribute to a
reduction of the carbon footprint in MT research and development. The ELG pilot
project focused on European minority languages and improved translation quality in
low resource settings and the integration of MT services in the ELG infrastructure.



1 Overview and Objectives of the Pilot Project

OPUS-MT (Tiedemann and Thottingal 2020) provides ready-made server solutions
that can be deployed on regular desktop machines to run translations using any NMT
model that has been released through the project.1 The service is powered by Marian-
NMT2 (Junczys-Dowmunt et al. 2018), an efficient open-source framework written
in pure C++ with implementations of state-of-the-art neural machine translation ar-
chitectures. OPUS-MT provides two implementations that can be deployed on regu-
lar Ubuntu servers or through containerised solutions using docker images. Both so-
lutions can easily be configured using JSON and can be deployed with a wide range
of OPUS-MT models. Multiple translation services and nodes can be combined in
one access point through a lightweight API. The coverage is constantly growing and

Jörg Tiedemann · Mikko Aulamo · Sam Hardwick · Tommi Nieminen
University of Helsinki, Finland, jorg.tiedemann@helsinki.fi, mikko.aulamo@helsinki.fi,
sam.hardwick@helsinki.fi, tommi.nieminen@helsinki.fi
1   https://github.com/Helsinki-NLP/Opus-MT
2   https://marian-nmt.github.io


© The Author(s) 2023                                                                     325
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_24
326                       Jörg Tiedemann, Mikko Aulamo, Sam Hardwick, and Tommi Nieminen

improved models are continuously released through our repository as a result of our
on-going model training efforts.
   A dockerised web app is implemented using the Tornado Python framework,
which we adapted for the integration into the European Language Grid environment
providing an interface that can seamlessly be deployed in the ELG infrastructure.
The essential metadata records for the ELG service catalogue are generated from
pre-defined templates using information available from released translation models.
The routines support bilingual as well as multilingual models and can also be used
to set up access points that serve several translation services. Appropriate docker
images are compiled using installation recipes and scripts. We host them on Docker
Hub from where they can be pulled by ELG requests to serve translation requests
directly through the online APIs. Detailed deployment documentation is available
from the repository.3
   At the time of writing, OPUS-MT provides 89 registered MT services within
ELG including a wide variety of bilingual and multilingual models. Registered ser-
vices can be tested online and can also be accessed through the web API and ELG
Python SDK. The translation runs on regular CPUs with minimal resource require-
ments thanks to the efficient decoder implementation in Marian-NMT. Multilingual
models are handled in a special way: multiple source languages can be handled by a
single access point whereas multiple target languages require separate access points.
Metadata records include the relevant information to describe the service provided.
   We also developed plugins for professional translation workflows under the label
of OPUS-CAT4 (Nieminen 2021). Our tools include a local MT engine that can run
on regular desktop machines making MT available without the security and confi-
dentiality risks associated with online services. OPUS-CAT integrates with popular
translation software such as Trados Studio, memoQ, OmegaT and Memsource. It
also provides an integrated fine-tuning procedure for domain adaptation. All OPUS-
MT models can be downloaded and used locally with the MT engine, some of the
plugins can also fetch translations directly from the OPUS-MT services in ELG.



2 Increasing Language Coverage

The general goal of OPUS-MT is to increase language coverage of freely avail-
able machine translation solutions. The project already provides over a thousand
pre-trained translation models covering hundreds of languages in various transla-
tion directions. The ongoing effort is documented by public repositories and regular
updates and we omit further details here as this is a quickly moving target.
   Within our ELG pilot project, we further developed our pipelines and recipes to
systematically train additional NMT models. The effort resulted in the model de-


3   https://github.com/Helsinki-NLP/Opus-MT/tree/master/elg
4   https://helsinki-nlp.github.io/OPUS-CAT/
24 Open Translation Models, Tools and Services                                                 327




Fig. 1 OPUS-MT map: A visualisation of language coverage and model quality according to au-
tomatic evaluation metrics and the Tatoeba MT challenge benchmarks; here: models that translate
from a source language mapped on their glottolog location to English; larger circles indicate bigger
benchmark test sets and the color scale goes from green (high quality) to red (poor quality)



velopment framework OPUS-MT-train5 with support for bilingual and multilingual
models that can be trained on data provided by OPUS6 and the Tatoeba translation
challenge7 (Tiedemann 2020).
    In order to keep track of the development, we heavily rely on the Tatoeba bench-
marks and we implemented an interactive tool to visualize the current state of our
released models. Figure 1 shows an example screenshot.
    The geographic distribution of released models is an appealing way to uncover
blind spots in the NLP landscape. The lack of appropriate data resources is one of the
major bottlenecks that block the development of proper MT solutions for most lan-
guage pairs of the world. Another issue is the narrow focus of research that typically
overemphasises well established tasks for reasons of comparability and measurable
success. OPUS-MT does not have a strict state-of-the-art development focus based
on major benchmarks but rather emphasises language coverage and the focus on
under-researched translation directions. The OPUS-MT map and the Tatoeba MT
challenge try to make this work visible and more attractive.
    The main strategy to tackle issues with limited data resources is to apply transfer
learning and some type of data augmentation. In OPUS-MT we are constantly facing
the problem of limited training data and noise and the ELG pilot project specifically
focused on low-resource scenarios and European minority languages.
    The idea of transfer learning is based on the ability of models to pick up valuable
knowledge from other tasks or languages. In MT, the main type of transfer learning
is based on cross-lingual transfer where multilingual translation models can be used
to push the performance in low-resource settings (Fan et al. 2021). The effect is typi-
cally pronounced with closely related languages where strong linguistic similarities
can lead to big improvements across language boundaries (Tiedemann 2021).

5   https://github.com/Helsinki-NLP/OPUS-MT-train
6   https://opus.nlpl.eu
7   https://github.com/Helsinki-NLP/Tatoeba-Challenge/
328                       Jörg Tiedemann, Mikko Aulamo, Sam Hardwick, and Tommi Nieminen

   In OPUS-MT, we therefore focused on multilingual models of typologically re-
lated languages. In our setup, we rely on language groups and families established
within the ISO 639-5 standard. A dedicated tool for mapping languages to language
groups and connecting them with the hierarchical language tree has been developed
to allow a systematic development of multilingual NMT models based on typolog-
ical relationships.8 The procedures have been integrated in the OPUS-MT training
recipes and can be applied to arbitrary datasets from the Tataobea MT Challenge.
   Table 1 illustrates the effect of cross-lingual transfer with multilingual models
on the example of the Belarusian-English translation benchmark from the Tatoeba
MT Challenge. All models apply the same generic transformer-based architecture
(Vaswani et al. 2017) with identical hyper-parameters and training recipes.


                NMT model            Belarusian −→ English English −→ Belarusian
              Belarusian – English             10.0                      8.2
             East Slavic – English             38.7                     20.8
                  Slavic – English             42.7                     22.9
          Indo-European – English              41.7                     18.1
Table 1 Machine translation between Belarusian and English with different NMT models; scores
refer to BLEU scores measured on the Tatoeba MT Challenge benchmark



   The bilingual baseline model is very poor due to the limited training data that is
available from the Tatoeba dataset (157,524 sentence pairs). Augmenting the training
data with closely related languages such as other (East) Slavic languages leads to
significant improvements, which is not very surprising. The effect can be seen in
both directions. Note that the multi-target models need to be augmented by language
tokens to indicate the output language to be generated. The importance of systematic
benchmarks is also shown in the table where we can see that Indo-European language
model struggles and the effect of positive transfer diminishes due to the capacity
issues of such a complex model setup.
   Finally, we also tested a novel type of data augmentation using a rule-based sys-
tem (RBMT) for back-translation (Sennrich et al. 2016) to produce additional data
for the translation from Finnish to Northern Sámi (Aulamo et al. 2021). Our results
revealed that knowledge from the RBMT system can effectively be injected into a
neural MT model significantly boosting the performance as shown in Table 2.
   We use two benchmarks in our evaluations: the UiT set9 , and the YLE set of 150
sentence pairs from news stories about Sámi culture.10 Preliminary manual evalu-
ation revealed that the NMT-based model was often unable to correctly translate
proper names. Adding copies of monolingual data as suggested by Currey et al.
(2017) helps to alleviate that issue. Furthermore, we also added experiments with
subword regularisation (Kudo 2018) and data tagging (Caswell et al. 2019) to bet-
8 https://github.com/Helsinki-NLP/LanguageCodes
9 2,000 sentence pairs sampled from the Giellatekno Free corpus https://giellatekno.uit.no
10 Collected from https://yle.fi
24 Open Translation Models, Tools and Services                                                329

                                                     Training Data UiT YLE
                  Baseline                                   25,106 18.9 4.3
                  + NMT-bt                                  422,596 34.0 9.8
                  + RBMT-bt                                 378,567 36.3 15.5
                  + NMT-bt + RBMT-bt                        885,301 40.1 10.8
                  + NMT-bt + copy                           845,192 35.7 12.5
                  + RBMT-bt + copy                          757,134 35.7 18.6
                  + NMT-bt + RBMT-bt + SR + TB              885,301 40.0 17.2
Table 2 Training data sizes (sentence pairs) and results (BLEU) for the Finnish-Northern Sámi
translation models using original parallel data (Baseline), augmented data with back-translations
from NMT and RBMT systems (NMT-bt, RBMT-bt), added monolingual data (copy), subword
regularisation (SR) and tagged back-translations (TB) evaluated on the UiT and YLE test sets



ter exploit the distributions in the training data and to distinguish between sources
with different noise levels. Preliminary results are encouraging and deserve further
investigations. In future work, we plan to add pivot-based translation and multilin-
gual models to further improve the performance of the system, to support additional
input languages and to include other Sámi language varieties, too.



3 Conclusions and Results of the Pilot Project

OPUS-MT is an on-going effort to make MT widely available for open research and
development with an extensive language coverage and well established deployment
and integration procedures. Our ELG pilot project made it possible to strengthen the
focus on minority languages and to further exploit transfer and data augmentation
strategies to improve the quality of MT for under-resourced language pairs.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects. We would also like to acknowledge the support
by the FoTran project funded by the European Research Council (no. 771113) and CSC, the Finnish
IT Center for Science, for computational resources.




References

Aulamo, Mikko, Sami Virpioja, Yves Scherrer, and Jörg Tiedemann (2021). “Boosting Neural Ma-
   chine Translation from Finnish to Northern Sámi with Rule-Based Backtranslation”. In: Pro-
   ceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). Reykjavik,
   Iceland: Linköping University Electronic Press, pp. 351–356. URL: https://aclanthology.org/2
   021.nodalida-main.37.
Caswell, Isaac, Ciprian Chelba, and David Grangier (2019). “Tagged Back-Translation”. In: Proc.
   of the Fourth Conf. on Machine Translation, pp. 53–63.
330                       Jörg Tiedemann, Mikko Aulamo, Sam Hardwick, and Tommi Nieminen

Currey, Anna, Antonio Valerio Miceli Barone, and Kenneth Heafield (2017). “Copied Monolingual
   Data Improves Low-Resource Neural Machine Translation”. In: Proceedings of the Second Con-
   ference on Machine Translation. Copenhagen, Denmark: ACL, pp. 148–156. DOI: 10.18653/v
   1/W17-4715. URL: https://aclanthology.org/W17-4715.
Fan, Angela, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal,
   Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom
   Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin (2021). “Beyond
   English-Centric Multilingual Machine Translation”. In: Journal of Machine Learning Research
   22.107, pp. 1–48. URL: http://jmlr.org/papers/v22/20-1307.html.
Junczys-Dowmunt, Marcin, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Hea-
   field, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev,
   André F. T. Martins, and Alexandra Birch (2018). “Marian: Fast Neural Machine Translation
   in C++”. In: Proceedings of ACL 2018, System Demonstrations. Melbourne, Australia: ACL,
   pp. 116–121. URL: http://www.aclweb.org/anthology/P18-4020.
Kudo, Taku (2018). “Subword Regularization: Improving Neural Network Translation Models with
   Multiple Subword Candidates”. In: Proc. of the 56th Annual Meeting of the Association for
   Computational Linguistics, pp. 66–75.
Nieminen, Tommi (2021). “OPUS-CAT: Desktop NMT with CAT integration and local fine-
   tuning”. In: Proceedings of the 16th Conference of the European Chapter of the Association
   for Computational Linguistics: System Demonstrations. ACL, pp. 288–294. DOI: 10.18653/v1
   /2021.eacl-demos.34. URL: https://aclanthology.org/2021.eacl-demos.34.
Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016). “Improving Neural Machine Transla-
   tion Models with Monolingual Data”. In: Proceedings of the 54th Annual Meeting of the Asso-
   ciation for Computational Linguistics (Volume 1: Long Papers), pp. 86–96.
Tiedemann, Jörg (2020). “The Tatoeba Translation Challenge – Realistic Data Sets for Low Re-
   source and Multilingual MT”. In: Proceedings of the Fifth Conference on Machine Translation
   (WMT). ACL, pp. 1174–1182. URL: https://aclanthology.org/2020.wmt-1.139.
Tiedemann, Jörg (2021). “The Development of a Comprehensive Data Set for Systematic Studies of
   Machine Translation”. In: Multilingual Facilitation. Ed. by Mika Hämäläinen, Niko Partanen,
   and Khalid Alnajjar. Finland: University of Helsinki, pp. 248–262. DOI: 10.31885/978951515
   0257.
Tiedemann, Jörg and Santhosh Thottingal (2020). “OPUS-MT – Building open translation services
   for the World”. In: Proceedings of the 22nd Annual Conference of the European Association for
   Machine Translation (EAMT). Lisboa, Portugal: European Association for Machine Translation,
   pp. 479–480. URL: https://helda.helsinki.fi/bitstream/handle/10138/327852/2020.eamt_1_499
   .pdf.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
   Łukasz Kaiser, and Illia Polosukhin (2017). “Attention is all you need”. In: Proceedings of the
   31st International Conference on Neural Information Processing Systems, pp. 6000–6010.



Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 25
Sign Language Explanations for Terms in a Text

Helmut Ludwar and Julia Schuster




Abstract The ELG pilot project SignLookUp serves the goal of developing a func-
tion that makes text documents easier to comprehend for deaf people. This is impor-
tant as many of them are functional illiterates.



1 Overview and Objectives of the Pilot Project

The ELG (Rehm et al. 2021) pilot project SignLookUp aims to make texts easier to
comprehend for deaf people. Deaf people have a difficult access to texts (Luckner
et al. 2005). Learning a written language is a challenge with a hearing impairment
(Harris et al. 2017). Therefore, about 75 percent of deaf people are functional illiter-
ates.




Fig. 1 LookApp visualisation


   The ideal form of accessibility for the deaf would be the complete translation
of texts into sign language. However, this is usually not possible due to limited re-
sources and budgets. The LookApp technology is an intermediate solution and serves
the goal of making texts easier to understand for the deaf.
   SignLookUp is a technology that links texts to a sign language encyclopedia. Deaf
people thus have the possibility to click on difficult or unknown terms in a text and

Helmut Ludwar · Julia Schuster
Sign Time GmbH, Austria, helmut.ludwar@signtime.media, julia.schuster@signtime.media


© The Author(s) 2023                                                                   331
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_25
332                                                    Helmut Ludwar and Julia Schuster

immediately receive the explanation or description of the word in their sign language
which is displayed adjacent to the text. Using mouseover or clicking on the term, a
window pops up and a sign language video is played. Often the explanation of a word
or term in sign language is sufficient to make a whole sentence understandable.
    SignLookUp starts with two sign languages, but is developed in such a way that
it can be easily expanded. The product will be licensed for companies and is free for
the end-user (deaf people). This technology thus supports the deaf in accessing and
making sense of text information on the internet and at the same time promotes the
integration of this marginalised group in our society.



2 Methodology

Selecting the terms that are most important for deaf people to better understand the
whole text is a special challenge. On the one hand, it must of course be those that
are of central importance, but on the other hand, consideration must also be given
to how deaf people experience and understand facts. Last but not least, linguistic pe-
culiarities such as idiomatic expressions, onomatopoeic terms and language images
must also be taken into account when finding terms.
   Therefore, for the creation of the sign language explanations of an item within a
text on a website the following method is used:
 1. Determine the target audience or readers for the website, e. g., language com-
    petence, relevant prior knowledge, thematic interest, age, gender, education.
 2. Perform word analysis (Egle 2020):
      a. Does the author paraphrase or avoid certain terms in a noticeable manner
         (euphemisms, taboos)?
      b. Does the text contain words and expressions that must be understood in a
         figurative sense (linguistic images, metaphors, similes)?
      c. What language-layers or language-uses can be identified?
      d. Does the text contain a foreign word or technical expressions?
      e. Are there words and phrases in the text that can be associated or connoted
         with other ideas (e. g., “She’s feeling blue” → “She’s feeling sad”)?
      f. Do buzzwords, empty phrases, or other stereotypes occur (e. g., “low-hang-
         ing fruit”)?
      g. Do certain words acquire a special meaning when the context is taken into
         account (broadening or narrowing of meaning, emotional coloring)?
      h. From what time do the words used originate? Are they already obsolete
         (archaism) or newly formed (neologism)? What is their purpose?
      i. Can certain words be assigned to a specific area (e. g., technology, art,
         sports)? What is the effect?
      j. Are there exaggerations/understatements?
      k. Is only a part of a whole addressed: synecdoche (e. g., pars pro toto)?
      l. Are synonyms (different terms but describing the same in context) used?
25 Sign Language Explanations for Terms in a Text                                      333

    3. Analysis of the text and selection of items: An automatic analysis of the text to
       show the comprehensibility and complexity of the text and individual words are
       used as a starting point, e. g., creation of the readability index (LIX, W. Lenhard
       and A. Lenhard 2011).1
       Thereafter a specialist who is fluent in the languages, e. g., a deaf person or an
       interpreter, checks whether the passages and terms are understandable for deaf
       people and selects the candidates for explanation based on the following criteria:
        a. Which terms are of central importance to the content?
        b. Special meaning, e. g., opposite of what is written (irony)
        c. Special words from item 2
    4. Providing the following (meta) information: Concept (named entity), lemma,
       context, web link, text language, sign language, version.
    5. Term explanation (for each term):
        a. Explanation of the term in simple language using the guidelines (Netzwerk
           Leichte Sprache 2013).
        b. It must not exceed 30 words and must be as brief as possible.
        c. Must be universal and general so that it is suitable for all uses in a text with
           the same context.
        d. Begins with a relationship to a higher-level or more general term.
        e. Includes the typical features of the term, using semes (the smallest unit of
           meaning) for this purpose.
        f. Add examples
       As a reference for the creation of explanations available sources may be used,
       e. .g., medicine DGS2 , medicine ÖGS3 .
    6. Translation into sign language:
        a. If there is a common sign for the item, it must be used at the beginning,
           followed by the signed explanation.
        b. Translation into sign language glosses
        c. Transfer into sign language animations
        d. Producing a sign language explanation video
    7. Quality assurance according to the four-eyes-principle: The draft version of an
       entry including sign language videos must be checked by a hearing sign lan-
       guage interpreter for completeness and correctness of content. In this way, na-
       tive speaker competencies of both languages, written and sign language, are
       included.




1 https://wortliga.de/textanalyse/
2 https://www.sign-lang.uni-hamburg.de/glex/intro/inhalt.html
3 https://www.equalizent.com
334                                                                                                                  Helmut Ludwar and Julia Schuster

3 Implementation

The beta-version of LookApp (preliminary product name) is implemented in Java-
Script on the server where the respective website to be analysed is located. The work-
flow described below is also shown in Figure 2:

                                                                                       1.
      Request                                                                    Webpage Request
       Order                                                                 https://theventury.com/


                                                                                       2.

                                                                                  Page Content                                                           Client Web Server
                                                                                <head> ... </head>
                                                                                       <body>
                              User Browser                       <div class=“lookapp-scope“> CONTENT </div>
                                                                                         ...
                         TextTranslation (User text)          <script> ...simax.media/js/lookapp--1.0.0.js </script>
                       ContentTranslation (CONTENT)           <script> lookApp.init({secret: '123', lang: 'de', addLoa-
             6.                  Feedback                                 dingSpinner: true, ...}); </script>
                            RequestTranslation                                        </body>
                               Up/Down Vote

                                   3.
      Video Injected Content                           Simax LookApp Server             Store Feedback/Request Translation/Votes           Postgres DB
   (User text) is shown in Popup                                                                               4.
 NEW CONTENT replaces CONTENT
                                                                                                  Request Translations
                                                                              (All Client linked word databases with matching languages)




                                                                                                               5.
                                                                             Collection of words/phrases/sentences with the matching video ID



Fig. 2 LookApp concept



 1. End user goes to a website that offers LookApp.
 2. The web server returns the content of the page which includes:
         a. Parts of the content with the LookApp-scope class
         b. The LookApp JavaScript is fetched from the LookApp server or served in
            a static way.
         c. The script is initialised with certain parameters.
 3. The request 3 actually represents multiple calls between browser and client
         a. At first the “custom options” are loaded
         b. The client-specific CSS file is loaded
         c. Any LookApp action
 4. Depending on the action
         a. Store feedback, requested translations, votes in the database → workflow
            ends here
         b. Query a list of translations belonging to the client side and corresponding
            to the passed parameter lang

 5. Collection of words and explanations
25 Sign Language Explanations for Terms in a Text                                       335

     a. The server then replaces found words with an icon
     b. JavaScript will interpret as hover or clickable video translations
 6. The page content is sent back and replaced by JavaScript.



4 Evaluation

In order to verify the usefulness of the application, a preliminary study was con-
ducted. This involved providing a website with LookApp to a small group of deaf
people and then performing a qualitative survey through sign language interpreters.
   The results show consistently positive feedback regarding assistance for under-
standing as well as the avatar used. In order to be able to make reliable statements,
however, a survey with a larger test group that represents the deaf community must
be carried out.



5 Conclusions and Results of the Pilot Project

As part of the pilot project, a beta version of LookApp was created, which is already
being used on early adopter websites, which is why it is evident that the concept
and implementation can be used with a positive benefit. Further development of the
functions (e. g., use of NLP methods) and the creation of high quality explanations
of as many terms as possible are planned next.
   It has already been shown in this phase of development that there are multiple
advantages. Deaf people have better access to information that cannot be fully trans-
lated into sign language due to time or resource constraints. Although our reading
aid does not provide the convenience of a full sign language translation, it supports
text comprehension in a significant way. Customers who provide large amounts of
information or whose content is updated frequently cannot translate all of their con-
tent into sign language due to time and economic constraints. With LookApp, even
such content can be made much more accessible. Existing and future customers can
thus be offered hybrid solutions. In addition to summaries of a website’s content
in sign language videos according to “Accessibility of websites and mobile applica-
tions” (European Parliament, Council of the European Union 2016), LookApp can
be implemented for the entire content of the website. Implementing LookApp in a
specific website requires only a small financial and organisational effort on the side
of the customer but can produce great effects on the side of deaf users.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects.
336                                                            Helmut Ludwar and Julia Schuster

References

Egle, Gert (2020). Leitfragen zur sprachlichen Analyse. URL: http://teachsam.de/deutsch/d_schre
   ibf/schr_schule/txtanal/txtanal_6_1a.htm.
European Parliament, Council of the European Union (2016). Directive 2016/2102 of 26 Oct. 2016
   on the Accessibility of the Websites and Mobile Applications of Public Sector Bodies. URL:
   http://data.europa.eu/eli/dir/2016/2102/oj/eng.
Harris, Margaret, Emmanouela Terlektsi, and Fiona E. Kyle (2017). “Literacy Outcomes for Pri-
   mary School Children Who Are Deaf and Hard of Hearing: A Cohort Comparison Study”. In:
   vol. 60. American Speech-Language-Hearing Association. URL: https://pubs.asha.org/doi/pdf
   /10.1044/2016_JSLHR-H-15-0403.
Lenhard, Wolfgang and Alexandra Lenhard (2011). Berechnung des Lesbarkeitsindex LIX nach
   Björnson. URL: http://rgdoi.net/10.13140/RG.2.1.1512.3447.
Luckner, John, Ann Sebald, John Cooney, John Young, and Sheryl Muir (2005). “An Examination
   of the Evidence-Based Literacy Research in Deaf Education”. In: American Annals of the Deaf
   150, p. 443.
Netzwerk Leichte Sprache (2013). Regeln für Leichte Sprache. URL: https://www.leichte-sprache
   .org/wp-content/uploads/2017/11/Regeln_Leichte_Sprache.pdf.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 26
Streaming Language Processing in
Manufacturing

Patrick Wiener and Steffen Thoma




Abstract Often underestimated, (semi-)structured textual data sources are an impor-
tant cornerstone in the manufacturing sector for product and process quality tracking.
The ELG pilot project SLAPMAN develops novel methods for industrial text ana-
lytics in the form of scalable, reusable, and potentially stateful microservices, which
can be easily orchestrated by domain experts in order to define quality anomaly pat-
terns, e. g., by analysing machine states and error logs. The results are fully available
as open source and integrated into the IIoT toolbox Apache StreamPipes.



1 Overview and Objectives of the Pilot Project

Continuous process and product quality monitoring is a critical task in the manufac-
turing sector for early detection of anomalies, e. g., gathering insights on potential
machine failures, breakouts or performance degradation. Often underestimated, a
large part of data sources that are able to provide insights to quality deviations are
textual data sources. This includes machine status data and error data, but also pro-
duction plans. Such information is very important for tracking anomalies and an
important source to shop floor workers and other domain experts for identifying po-
tentially critical situations and root causes. While the analysis of real-time measure-
ments is well explored, the automated analysis of textual data is underexplored and
hindered by language barriers and often confusing text codes specific to companies
or domains. The goal of the SLAPMAN project is the development and integration
of streaming language technology (LT) modules from the European Language Grid
(ELG, Rehm et al. 2021) to process, analyse and exploit non-structured or semi-
structured manufacturing process data. These modules have been integrated into the
open-source IIoT toolbox Apache StreamPipes. StreamPipes provides services for
self-service data analytics by pursuing a graphical flow-based modeling approach.
This allows the description of stream processing applications in the form of pro-
cessing pipelines composed of multiple, interconnected pipeline elements. This sig-

Patrick Wiener · Steffen Thoma
FZI Research Center for Information Technology, Germany, wiener@fzi.de, thoma@fzi.de


© The Author(s) 2023                                                                   337
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_26
338                                                     Patrick Wiener and Steffen Thoma

nificantly lowers rather high technological entry barriers towards making streaming
language processing in particular, and LT in general, accessible for non-technical do-
main experts. SLAPMAN developed novel extensions to Apache StreamPipes that
can be easily added to StreamPipes in the form of modular standalone services, e. g.,
streaming adapters to quickly connect textual data sources (e. g., production plans
from MES systems), or pipeline elements for NLP including named entity recogni-
tion (NER), tokenising, word embeddings or translation.



2 Graphical, Flow-based Modeling with Apache StreamPipes

Apache StreamPipes1 is an incubator project of the Apache Software Foundation,
that provides a reusable toolbox to easily connect, analyse and exploit a variety of
IIoT-related data streams without any programming skills. It leverages different tech-
nologies especially from the fields of stream processing, distributed computing, and
the semantic web. Riemer et al. (2014) proposed a methodology for semantics-based
management of event streams based on the dataflow programming paradigm which
is the foundations of StreamPipes. In this regard, StreamPipes allows modelling
stream processing applications in the form of processing pipelines. Pipelines com-
prise a sequence of pipeline elements provided by arbitrary event-driven microser-
vices from an extensible toolbox. Such event-driven microservices are operated in
a distributed environment consisting of multiple, potentially heterogeneous runtime
implementations. In doing so, this facilitates the distributed execution of pipeline el-
ements to account for business or application-specific requirements. Figure 1 gives
a rudimentary overview of a basic named entity recognition pipeline in StreamPipes.
The pipeline consists of three pipeline elements, a textual quality report data source
for a group of flow rate sensors, a named entity recognition processor based on an
ELG service, and a dashboard sink to visualise results.
    The decomposition of complex analytical challenges into smaller function blocks
allows StreamPipes to mitigate the problem of committing to a single stream pro-
cessing technology. On top, it uses semantics to guide non-technical domain ex-
perts throughout the pipeline creation process. In recent years, several profound
extensions to the knowledge base of StreamPipes were implemented to improve
and extend existing capabilities. This includes StreamPipes Connect (Zehnder et
al. 2020), a semantics-based adapter model and edge transformation functions, and
StreamPipes Edge Extensions (Wiener et al. 2020), a methodology for geo-distrib-
uted pipeline deployment and operation. Besides StreamPipes, other solutions for
low-code dataflow programming exist, e. g., Apache Nifi2 , or Node-RED3 .




1 https://streampipes.apache.org
2 https://nifi.apache.org
3 https://nodered.org
26 Streaming Language Processing in Manufacturing                                 339




Fig. 1 Example pipeline in StreamPipes



3 Architecture

From an architectural point of view, SLAPMAN follows the microservice architec-
ture of StreamPipes and provides a seamless integration with LT services offered by
the ELG platform as shown in Figure 2. In general, the ELG platform provides vari-
ous LT services that allow to perform language processing and LT-related operation.
From a technical perspective, LT services are remotely accessible via REST over
HTTP. As such, requests comprising textual data are issued against corresponding
LT services that process the incoming call and in return provide the analysis results.
For instance, using a machine translation service allows to translate quality defect
reports from various plants in different source languages into a common target lan-
guage, e. g., English, in order to globally investigate certain defect patterns.
    In this context, StreamPipes allows to design and develop arbitrary pipeline ele-
ments using an SDK. Therefore, arbitrary LT services available on the ELG platform
can be wrapped as specific pipeline elements providing language processing capabil-
ities to domain experts to be leveraged in a reusable and self-service manner. Once
a user models and deploys a pipeline using one of the LT pipeline elements, textual
data is continuously transferred between participating pipeline elements in an event-
driven manner by means of a topic-based publish/subscribe pattern. As such, output
events from preceding pipeline elements are published to a message broker proto-
340                                                                                 Patrick Wiener and Steffen Thoma

        ELG Platform                    Apache StreamPipes
                                                              User Interface (Graphical Pipeline Modeling)
        ELG Services
                                                                        Core (Pipeline Management)

       ILSP Machine
         Translation                                                   Pipeline Element Microservices

                          External        LT Algorithms (ELG Service Wrapper)              Generic & Specific (IIoT) Algorithms
                         Service Call
                                                                 ...                       Text          Field    ...    Outlier
         NER Bert                           ILSP        NER                Cogito
                                                                                           Filter       Hasher          Detection



         NER Gate                                                             Message Broker

                                                                       Pipeline Element Microservices
              ...




                                                   Text Adapters & Sinks                       Streaming Adapters & Sinks
      Cogito Sentiment
          Analysis                          E-Mail       Telegram        Slack             OPC UA          MQTT           Kafka




Fig. 2 Architecture: ELG platform and StreamPipes integration



col, e. g., Apache Kafka4 . Succeeding pipeline elements subscribe to relevant topics
in order to retrieve the previously published events. The complete life cycle of the
event-driven application is internally managed by the core of StreamPipes which is
responsible for the pipeline management. This includes pipeline element compatibil-
ity based on semantic verification to provide user support and guidance throughout
the pipeline modeling process. In addition, this incorporates message broker pro-
tocol negotiation including system-side topic management of the publish/subscribe
pattern. At run-time, streaming textual events subscribed by LT pipeline elements of
deployed pipelines issue REST calls to remote LT services on the ELG platform to
perform the essential processing tasks. Results are sent back and published again to
the corresponding message broker protocol for further usage. The architectural de-
sign of standalone pipeline element microservices facilitates to extend StreamPipes
with additional LT components.



4 Implementation

The main activities in SLAPMAN were focused on the development of new ex-
tensions for Apache StreamPipes related to language technology. As such, the ex-
tensions were focused on i) wrapping and integrating existing services from the
ELG platform (e. g., NER, rumour veracity, sentiment analysis, machine translation),
ii) developing new data processors and data sinks for Apache StreamPipes related
to LT (e. g., chunker, language detection, part-of-speech-tagger, sentence detection,
tokeniser), iii) developing additional adapters to connect text-focused data sources
(e. g., Telegram, Slack, Manual Input) and iv) developing technical extensions to the

4   https://kafka.apache.org
26 Streaming Language Processing in Manufacturing                                       341

toolbox itself to ease the integration of new NLP models along with general usability
improvements (e. g., file management, word cloud visualization).
    In addition, a new Client API was developed which allows to adapt existing
pipelines and to configure pipeline elements from external applications. This en-
ables users to easily update trained language models using a convenient Java client.
Moreover, from a deployment and orchestration perspective, StreamPipes relies on
Docker as its default installation option. To further alleviate the integration into the
ELG platform based on Kubernetes, a helm chart5 for StreamPipes was developed
which is available for public use. This helm chart paired with the general extensi-
bility of StreamPipes to install new pipeline elements providing LT capabilities at
run-time allows to integrate additional LT algorithms as demands change.



5 Conclusions and Results of the Pilot Project

In the future, we plan on pursuing the following key activities resulting from lessons
learned along the way. In order to better facilitate the integration in existing enter-
prise architectures, StreamPipes is planned to support standard identity and access
management systems such as Keycloak to complement the existing user manage-
ment. This will also be beneficial for a smoother interaction with the ELG platform
itself. In addition, the work on the StreamPipes Python wrapper to simplify the de-
velopment of new pipeline elements and especially the integration of ELG services
is continued. Similarly, the work on the Client API for external pipeline control from
code is planned to be pursued.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects.




5   https://helm.sh
342                                                            Patrick Wiener and Steffen Thoma

References

Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Riemer, Dominik, Ljiljana Stojanovic, and Nenad Stojanovic (2014). “SEPP: Semantics-Based
   Management of Fast Data Streams”. In: Proceedings of IEEE 7th International Conference
   on Service-Oriented Computing and Applications, SOCA 2014. IEEE, pp. 113–118. DOI: 10.1
   109/SOCA.2014.52. URL: http://ieeexplore.ieee.org/document/6978598/.
Wiener, Patrick, Philipp Zehnder, and Dominik Riemer (2020). “Managing Geo-Distributed Stream
   Processing Pipelines for the IIoT with StreamPipes Edge Extensions”. In: Proceedings of the
   14th ACM International Conference on Distributed and Event-Based Systems. DEBS ’20. Mon-
   treal, Quebec, Canada: ACM, pp. 165–176. DOI: 10.1145/3401025.3401764. URL: https://doi
   .org/10.1145/3401025.3401764.
Zehnder, Philipp, Patrick Wiener, Tim Straub, and Dominik Riemer (2020). “StreamPipes Connect:
   Semantics-Based Edge Adapters for the IIoT”. In: The Semantic Web. Ed. by Andreas Harth,
   Sabrina Kirrane, Axel-Cyrille Ngonga Ngomo, Heiko Paulheim, Anisa Rula, Anna Lisa Gentile,
   Peter Haase, and Michael Cochez. Cham: Springer, pp. 665–680.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 27
Textual Paraphrase Dataset for
Deep Language Modelling

Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Valtteri Skantsi, Jemina Kilpeläinen,
Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón, and Otto Tarkka




Abstract The Turku Paraphrase Corpus is a dataset of over 100,000 Finnish para-
phrase pairs. During the corpus creation, we strived to gather challenging paraphrase
pairs, more suitable to test the capabilities of natural language understanding models.
The paraphrases are both selected and classified manually, so as to minimise lexi-
cal overlap, and provide examples that are structurally and lexically different to the
maximum extent. An important distinguishing feature of the corpus is that most of
the paraphrase pairs are extracted and distributed in their native document context,
rather than in isolation. The primary application for the dataset is the development
and evaluation of deep language models, and representation learning in general.



1 Overview and Objectives of the Pilot Project

Natural language processing research focuses increasingly more at a deeper under-
standing of language meaning, which is the enabling factor for the next generation of
language technology applications. Of especially recent interest are neural meaning
representations that are robust to non-trivial re-phrasing of statements with equiv-
alent or near-equivalent meaning. While deep learning methods have effectively
solved many supervised learning tasks where large amounts of task-specific training
data are available, their performance in representation learning tasks is much weaker
(Glockner et al. 2018; Tsuchiya 2018; McCoy et al. 2019). In practical terms, we do
not yet have well-proven general methods that, given arbitrary statements with the
same contextual meaning but very different wording, would reliably produce highly
similar representations for the statements. The fundamental limitation has been the
lack of appropriate training data and learning procedures that are able to infer the
projection from observable surface forms to faithful semantic representations.
   In this ELG pilot project, we set out to address this limitation by building a fully
manually annotated paraphrase corpus for Finnish, the Turku Paraphrase Corpus. In

Jenna Kanerva · Filip Ginter · Li-Hsin Chang · Valtteri Skantsi · Jemina Kilpeläinen · Hanna-Mari
Kupari · Aurora Piirto · Jenna Saarni · Maija Sevón · Otto Tarkka
University of Turku, Finland, jmnybl@utu.fi, lhchan@utu.fi, figint@utu.fi


© The Author(s) 2023                                                                         343
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_27
344       Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Valtteri Skantsi, Jemina Kilpeläinen et al.

addition to building this resource, we also gathered experience and data regarding
how such a resource can be built efficiently and what human resources are needed,
built initial models based on the new resource, and produced baseline results.



2 Methodology

The primary distinguishing feature of our corpus compared to other related efforts is
its fully manual annotation (as opposed to automatic candidate generation), resulting
in paraphrase pairs that are non-trivial and challenging in not being highly lexically
related. In other words, an important objective was to avoid bias due to automatic
candidate selection so as to obtain a more realistic estimate of the performance of
machine learning models on natural language understanding tasks. To this end, we
gather source documents that are potentially rich in paraphrases for fully manual
paraphrase candidate extraction. These documents include alternative translations
of movie subtitles, news headings and articles reporting the same event, discussion
forum messages with identical titles and topics, alternative student translations from
translation course assignments, and student essays answering the same prompts.
    Along with the manual extraction, all paraphrase candidates are manually classi-
fied into categories of paraphrases and non-paraphrases according to the developed
annotation scheme. The design of the annotation scheme strives to capture varying
levels of paraphrasability of candidate paraphrase pairs. We use a scale of four base
labels, 1–4, similar to those used in some other paraphrase corpora (Creutz 2018).
We define the four base labels as label 1 unrelated sentences, label 2 related but
not paraphrases, label 3 paraphrases in the given context but not universally so, and
label 4 universal paraphrases. In addition, label 4 paraphrases can be marked with
optional flags > or < for subsumption, s for style, and i for minor deviations. These
flags mark properties of the paraphrases that do not fulfill the strict universality crite-
ria of the label 4 due to one of several defined reasons. The subsumption flag means
that the paraphrasability is directional; one sentence can be universally substituted
by the other, but not the other way around. The style flag means that the paraphrases
convey the same meaning, but may have differing tones or registers, which make
them not interchangeable in certain circumstances. The minor deviation flag marks
minimal differences in meaning (for example, “this” vs. “that”), or grammatical num-
ber, person, tense, etc. that can be trivially identified automatically. These flags are
independent of each other and thus one label 4 paraphrase pair can have multiple
flags, disregarding the directional subsumption flags. More detailed description of
the labels together with example annotations is given in the annotation guidelines
(Kanerva et al. 2021a).
27 Textual Paraphrase Dataset for Deep Language Modelling                          345

3 Implementation

The annotation work was carried out by six main annotators, each being a native
Finnish speaker with a strong background in language studies by having completed
or ongoing studies in a field related to languages or linguistics. Each annotator
worked 5–9 months either full or part time in a strong collaboration with a broader
project team including supportive roles in the annotation work.
   An annotator starts the process by going through the automatically aligned source
document pair presented side-by-side in a custom annotation tool1 developed for the
paraphrase extraction, and extracts all interesting paraphrase candidates by selecting
the corresponding text passages from both documents. While saving the candidate,
together with the text passage pair the tool also saves the actual position of the text
passage in the original document, therefore supporting studying the paraphrase pairs
in their original document context. To our knowledge, this is the first paraphrase
corpus that includes the document context for the released paraphrase pairs. After
extracting all interesting paraphrase candidates from the source document pair, the
annotator marks the document finished and moves on to the next one.
   The extracted paraphrase candidates are automatically transferred to a separate
annotation tool2 developed specifically for paraphrase labeling. In this tool, each
pair of paraphrase candidates is shown separately, and the annotator can see the
original contexts if necessary. The annotator labels the original paraphrase pair, and
has the option to copy the original text and rewrite the texts into full paraphrases
(label 4 without flags). In cases where the annotator decided to provide a rewritten
pair, two or more pairs of paraphrases are obtained for the corpus: the original pair,
and the rewritten pair(s). The annotators are instructed to rewrite the paraphrase can-
didates in cases where a simple edit, such as word deletion, insertion or synonym
replacement, can be naturally constructed and does not require too much effort.



4 Evaluation

The paraphrase label annotation was guided using a shared annotation manual, daily
meetings, and regularly assigned double annotation batches in order to ensure anno-
tation consistency between the six annotators. The manual paraphrase extraction did
not involve a similarly careful annotator training or consistency monitoring through-
out the project. Instead of ensuring each annotator extracting the same segments if
given the same text, the objective is to collect a diverse set of different paraphrase
candidates, where minor deviations in the personal extraction habits only creates
more diversity to the data. In order to study the extraction behaviour of the annota-
tors, we measure the average number of paraphrase pairs extracted from one docu-


1   https://github.com/TurkuNLP/pick-para-anno
2   https://github.com/TurkuNLP/rew-para-anno
346        Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Valtteri Skantsi, Jemina Kilpeläinen et al.

ment pair, indicating how eager the annotator was to include or exclude borderline
uninteresting, extremely difficult or otherwise debatable pairs from the corpus.
   While the data sources used in the paraphrase extraction step have distinct char-
acteristics in terms of extraction ratios, we use the subset originating from the alter-
native subtitles (approx. 80% of the full corpus) for this study in order to account for
differing source text proportions between the annotators. We measure the average
number of paraphrases extracted from one subtitle document pair (about 15 minutes
worth of the subtitled program’s runtime), while taking into account all document
pairs where the extraction and labeling was carried out by the same annotator, and
the document pair resulted at least one extracted paraphrase. The statistics are shown
in Table 1, the individual extraction rates falling between 13 and 50 pairs indicating
some amount of diversity between the annotators. When measuring the mean lexical
similarity of the extracted paraphrase pairs (together with standard deviation) as well
as annotated paraphrase label distribution for each annotator, we do not notice any
significant difference between annotators oriented towards higher or lower extrac-
tion rates. The label distributions are visualised in Figure 1. Finally, in Table 1 we
measure the proportion of extracted paraphrase pairs each annotator chose to rewrite
during the label annotation (row Rewritten), showing large differences among the an-
notators, between 1.4% and 29.5% of rewritten paraphrase pairs.


                                Ann1      Ann2      Ann3      Ann4      Ann5      Ann6
          Extracted pairs        28,685   18,908      9,553     7,713    6,359     1,897
          Total extracted (%)      39.1     25.8       13.0      10.5       8.7       2.6
          Extracted/doc            23.4     13.2       13.4      22.0     48.9      23.4
          Rewritten (%)             6.8     23.4        1.3      29.5     14.9       1.4
Table 1 Comparison of the six annotators in terms of the average number of paraphrase pairs
extracted from one 15-min subtitle pair (Extracted/doc), as well as the percentage of paraphrase
pairs, where the annotator provided a rewrite (Rewritten); in addition to these two metrics, we also
illustrate the total amount of the paraphrase pairs extracted by the annotator (both raw count and
percentage); note that the number of extracted paraphrases does not sum up to the total corpus size
as the comparison is done on the subtitle subset only (approx. 80% of the full corpus)



    In order to ensure the consistency of the label annotation, approx. 2% of the para-
phrase pairs are double annotated, where two different annotators annotate the labels
independently from one another for the same paraphrase candidates. The two indi-
vidual annotations are merged and conflicting labels resolved together with the anno-
tation team, resulting in a consolidated subset of consensus annotation. The overall
accuracy of the individual annotations against the consensus labels is around 70%,
on the full set of labels permitted in the annotation scheme. The level of agreement is
on par with similar numbers reported in other paraphrase studies (Dolan and Brock-
ett 2005; Creutz 2018). The agreement measures when calculated separately for each
annotator vary between 64% and 76%, the most common disagreements being be-
tween the semantically nearest labels (i. e., labels 3 and 4</>, or labels 4</> and
4), or whether to include or not include the rare additional flags s or i.
27 Textual Paraphrase Dataset for Deep Language Modelling                                        347




                                           100                                      Label 2
                                                                                    Label 3
                                                                                    Label 4</>
               Number of paraphrases (%)   80                                       Label 4


                                           60


                                           40


                                           20


                                            0
                                                 Ann1   Ann2   Ann3   Ann4   Ann5     Ann6


Fig. 1 Label frequencies illustrated separately for the six annotators using the same subtitle subset
of the corpus as in Table 1



5 Conclusions and Results of the Pilot Project

The project resulted in a high quality corpus of Finnish paraphrases including a total
of 104,645 manually classified pairs, 91,604 being naturally occurring pairs directly
extracted from the source documents, while 13,041 are produced through manual
rewriting. The manual extraction method presented in the article both skews the label
distribution towards true paraphrases ensuring efficient use of human resources (98%
being labeled positive) as well as preserves the original document context, making
this the first released corpus of paraphrasing in context. The contextual information
is used in Kanerva et al. (2021b), where we present a novel approach to paraphrase
detection by framing the task as detecting the target paraphrase span from the given
document, a similar setting as used in question answering. In addition to the actual
corpus, the project also provided models trained for paraphrase classification and
fine-tuned sentence representations.
    All resources presented in this article are available through the European Lan-
guage Grid3 and also on the TurkuNLP website4 under the CC-BY-SA license.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects. In addition, this work was supported by the
Academy of Finland and the Digicampus project. Computational resources were provided by CSC
– IT Center for Science.



3   https://live.european-language-grid.eu/catalogue/corpus/7754
4   https://turkunlp.org/paraphrase.html
348        Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Valtteri Skantsi, Jemina Kilpeläinen et al.

References

Creutz, Mathias (2018). “Open Subtitles Paraphrase Corpus for Six Languages”. In: Proceedings
   of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Ed.
   by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti
   Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno,
   Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga. Miyazaki, Japan: ELRA, pp. 1364–1369.
Dolan, William B. and Chris Brockett (2005). “Automatically Constructing a Corpus of Senten-
   tial Paraphrases”. In: Proceedings of the Third International Workshop on Paraphrasing (IWP
   2005), pp. 9–16.
Glockner, Max, Vered Shwartz, and Yoav Goldberg (2018). “Breaking NLI Systems with Sentences
   that Require Simple Lexical Inferences”. In: Proceedings of the 56th Annual Meeting of the
   Association for Computational Linguistics (Volume 2: Short Papers). ACL, pp. 650–655. DOI:
   10.18653/v1/P18-2103. URL: https://aclanthology.org/P18-2103.
Kanerva, Jenna, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen,
   Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón, et al. (2021a). “Annotation
   Guidelines for the Turku Paraphrase Corpus”. In: arXiv preprint arXiv:2108.07499.
Kanerva, Jenna, Hanna Kitti, Li-Hsin Chang, Teemu Vahtola, Mathias Creutz, and Filip Gin-
   ter (2021b). “Semantic Search as Extractive Paraphrase Span Detection”. In: arXiv preprint
   arXiv:2112.04886.
McCoy, Tom, Ellie Pavlick, and Tal Linzen (2019). “Right for the Wrong Reasons: Diagnosing
   Syntactic Heuristics in Natural Language Inference”. In: Proceedings of the 57th Annual Meet-
   ing of the Association for Computational Linguistics. ACL, pp. 3428–3448. DOI: 10.18653/v1
   /P19-1334. URL: https://aclanthology.org/P19-1334.
Tsuchiya, Masatoshi (2018). “Performance Impact Caused by Hidden Bias of Training Data for
   Recognizing Textual Entailment”. In: Proceedings of the 11th International Conference on Lan-
   guage Resources and Evaluation (LREC 2018). Miyazaki, Japan: ELRA, pp. 1506–1511. URL:
   https://aclanthology.org/L18-1239.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 28
Universal Semantic Annotator

Roberto Navigli, Riccardo Orlando, Cesare Campagnano, and Simone Conia




Abstract Explicit semantic knowledge has often been considered a necessary in-
gredient to enable the development of intelligent systems. However, current state-
of-the-art tools for the automatic extraction of such knowledge often require expert
understanding of the complex techniques used in lexical and sentence-level seman-
tics and their linguistic theories. To overcome this limitation and lower the barrier
to entry, we present the Universal Semantic Annotator (USeA) ELG pilot project,
which offers a transparent way to automatically provide high-quality semantic anno-
tations in 100 languages through state-of-the-art models, making it easy to exploit
semantic knowledge in real-world applications.



1 Overview and Objectives of the Pilot Project

Natural Language Processing (NLP) is the field of Artificial Intelligence (AI) which
aims at enabling computers to process, understand and generate text in the same
way as we humans do. Although AI systems are nowadays able to process massive
amounts of text, they are still far from achieving true Natural Language Understand-
ing (NLU). Indeed, current systems still struggle in explicitly identifying and ex-
tracting the meaning or semantics conveyed by a text of interest. Nonetheless, the
integration of explicit semantics has already been successfully exploited in a wide
array of downstream tasks that span multiple areas of AI from NLP with informa-
tion retrieval, question answering, text summarisation, and machine translation, to
computer vision with visual semantic role labeling and situation recognition. Un-
fortunately, expert knowledge of lexical semantics, sentence-level semantics and
complex deep learning techniques often becomes a roadblock in the integration of
explicit semantic information into downstream tasks and real-world applications, es-
pecially in multilingual scenarios. To lower the entry point for semantic knowledge
integration into multilingual applications, we present the Universal Semantic Anno-

Roberto Navigli · Riccardo Orlando · Cesare Campagnano · Simone Conia
Sapienza University of Rome, Italy, navigli@diag.uniroma1.it, orlando@diag.uniroma1.it,
campagnano@di.uniroma1.it, conia@di.uniroma1.it


© The Author(s) 2023                                                               349
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_28
350               Roberto Navigli, Riccardo Orlando, Cesare Campagnano, and Simone Conia

tator (USeA) project, the first unified API for three core tasks in NLU: Word Sense
Disambiguation (WSD), Semantic Role Labeling (SRL), and Abstract Meaning Rep-
resentation (AMR) parsing. With USeA, we offer a simple yet efficient way to use
state-of-the-art multilingual models within a single framework accessible via REST
API, browsers, and programmatically. This will ease the integration of NLU models
in NLP pipelines (also for low-resource languages), allowing them to exploit explicit
semantic information to improve their performance.



2 Methodology

USeA is the first unified set of APIs for high-performance multilingual NLU, sup-
porting 100 languages. USeA employs state-of-the-art multilingual neural networks
to provide automatic semantic annotations for WSD, SRL and AMR Parsing.
Word Sense Disambiguation (WSD) is the task of associating a word in context
with its most appropriate sense from a sense inventory (Bevilacqua et al. 2021b).
USeA provides word sense labels using an improved version of the state-of-the-art
WSD model proposed by Conia and Navigli (2021), which, differently from other
ready-to-use tools for WSD based on graph-based heuristics (Moro et al. 2014; Scoz-
zafava et al. 2020) or non-neural models (Papandrea et al. 2017), is built on top of a
Transformer encoder. Crucially, thanks to BabelNet 5 (Navigli et al. 2021), a multi-
lingual encyclopedic dictionary, USeA is able to disambiguate text in 100 languages.
Semantic Role Labeling (SRL) is the task of answering the question “Who did
What, to Whom, Where, When, and How?” (Màrquez et al. 2008), providing a struc-
tured and explicit representation of the underlying semantics of a sentence. Differ-
ently from other available SRL systems, USeA encapsulates an improved version of
the neural model introduced by Conia et al. (2021a), which performs state-of-the-art
cross-lingual SRL with heterogeneous linguistic inventories.
Abstract Meaning Representation (AMR) parsing is the task of capturing the
semantics of a sentence through a rooted directed acyclic graph, with nodes rep-
resenting concepts and edges representing their relations (Banarescu et al. 2013).
USeA offers a multilingual version of SPRING (Bevilacqua et al. 2021a), a recent
state-of-the-art, end-to-end system for Text-to-AMR generation.



3 Implementation

The USeA pipeline is organised in five self-contained modules that are transparent
to the end user, as shown in Figure 1.
Orchestrator Module. The Orchestrator Module is the core of USeA and serves
as an entry point for the semantic API. Being an end-to-end system, the end user
28 Universal Semantic Annotator                                                              351

                                                                   Server

                                                                Preprocessing
                                                                   Module
                                 Client


                              Python SDK



                                                                Orcherstrator
                              RESTful API



                               Browser



                                                   WSD Module   SRL Module      AMR Module




Fig. 1 USeA architecture: a user sends text to the USeA server and receives semantic information;
in the server, the orchestrator processes the input using task-specific modules



is only required to send raw text to our service. The input text is then processed by
the Preprocessing Module and the result sent to the WSD, SRL and AMR Parsing
modules. In particular, since the SRL and AMR Parsing tasks are more demanding,
we offload the WSD module to CPU and run SRL and AMR Parsing requests on
GPU to optimise hardware usage. The responses from the three semantic modules
are then combined and sent back to the end user.
Preprocessing Module. The preprocessing module takes care of producing the pre-
processing information that is usually needed by NLP systems, i. e., language iden-
tification, document splitting, tokenisation, lemmatisation, and part-of-speech tag-
ging. In order to support as many languages as possible while keeping low hardware
requirements, the preprocessing module is built around Trankit (Nguyen et al. 2021)
and supports 100 languages with a single model.
WSD Module. We developed AMuSE-WSD (Orlando et al. 2021) as our WSD
module. Its neural architecture is based on XLM-RoBERTa (Conneau et al. 2020),
a multilingual Transformer model. More specifically, given a word in context, the
WSD module i) builds a contextualised representation of the word using the hidden
states of XLM-RoBERTa, ii) applies a non-linear transformation to obtain a sense-
specific representation, and iii) computes the output score distribution over all the
possible senses of the input word.
SRL Module. InVeRo-XL (Conia et al. 2021b) is the SRL system we developed
for USeA. Similarly to the WSD module, the SRL module is also based on XLM-
RoBERTa. In particular, given an input sentence, the SRL module i) builds a se-
quence of contextualised word representations using the hidden states of XLM-
RoBERTa, ii) identifies and disambiguates each predicate in the sentence, and iii)
for each predicate, produces its arguments and their semantic roles.
AMR Parsing Module. The AMR Parsing Module is heavily based on SPRING
(Blloshmi et al. 2021), which we extended to support multiple languages. SPRING is
a sequence-to-sequence Transformer model that operates as a parser by “translating”
an input sentence into a linearised AMR graph. We extend SPRING to support 100
languages by replacing BART with the multilingual version of T5.
352                    Roberto Navigli, Riccardo Orlando, Cesare Campagnano, and Simone Conia

                                             English datasets                       Multilingual datasets
                               Se2    Se3     Se07 Se13 Se15          All          Se13 Se15        Xl-Wsd
    Moro et al. (2014)         67.0   63.5     51.6    66.4    70.3   65.5         65.6     –        52.9
    Papandrea et al. (2017)    73.8   70.8     64.2    67.2    71.5    –            –       –         –
    Scozzafava et al. (2020)   71.6   72.0     59.3    72.2    75.8   71.7         73.2    66.2      57.7
    USeA (WSD)                 77.8   76.0     72.1    77.7    81.5   77.5         76.8    73.0      66.2
Table 1 English WSD results in F1 scores on Senseval-2 (SE2), Senseval-3 (SE3), SemEval-
2007 (SE07), SemEval-2013 (SE13), SemEval-2015 (SE15), and the concatenation of the datasets
(ALL); we also include results on multilingual WSD in SemEval-2013 (DE, ES, FR, IT), SemEval-
2015 (IT, ES), and XL-WSD (average over 17 languages, English excluded)


                               Catalan         Czech      German        English           Spanish    Chinese
    AllenNLP’s SRL demo           –              –             –            86.5            –           –
    InVeRo                        –              –             –            86.2            –           –
    USeA (SRL)                   83.3           85.9          87.0          86.8           81.8        84.9
Table 2 Comparison between USeA and other recent automatic tools for SRL; F1 scores on argu-
ment labeling with pre-identified predicates on the CoNLL-2012 English test set and the CoNLL-
2009 test sets converted from dependency-based to span-based



4 Evaluation

USeA offers state-of-the-art models for multilingual WSD, SRL and AMR Parsing.
Here, we report its results on standard gold benchmarks for each task.
Results in WSD. We evaluate our WSD Module against other disambiguation tools
on gold standard benchmarks for English and multilingual WSD, covering 17 lan-
guages. The results (Table 1) show that USeA outperforms its competitors by a wide
margin, especially in multilingual WSD (+8.5% in F1 Score on XL-WSD).
Results in SRL. We report the performance of our SRL Module on two gold stan-
dard benchmarks for SRL, CoNLL-20091 and CoNLL-2012, covering six languages.
USeA is the first package to provide annotations in languages other than English
while also outperforming its competitors in English (Table 2).
Results in AMR Parsing. Finally, we examine the performance of our AMR Pars-
ing Module on AMR 3.02 , which is currently the largest AMR-annotated corpus.
Even though USeA supports 100 languages, it is still competitive with other recently
proposed English-only AMR parsing systems (Table 3).




1 The CoNLL-2009 dataset was originally intended for dependency-based SRL. We convert
dependency-based annotations to span-based annotations using the gold syntactic trees.
2 https://catalog.ldc.upenn.edu/LDC2020T02
28 Universal Semantic Annotator                                                            353

                                                            SMATCH
                        Lyu et al. (2021)                      75.8
                        Zhou et al. (2021)                     81.2
                        SPRING (Bevilacqua et al. 2021a)       83.0
                        USeA (AMR-Parsing)                     80.9
Table 3 SMATCH score obtained by USeA compared with recent literature on AMR 3.0 (English)



5 Conclusions and Results of the Pilot Project

We presented the USeA project, providing an overview on its objectives and on how
we worked towards achieving them. We hope that USeA will represent a useful tool
for the integration of explicit semantic knowledge – word meanings, semantic role
labels, and graph-like semantic representations – into real-world applications.

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects.




References

Banarescu, Laura, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob,
   Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider (2013). “Abstract Mean-
   ing Representation for Sembanking”. In: Proceedings of the 7th Linguistic Annotation Workshop
   and Interoperability with Discourse, pp. 178–186. URL: https://aclanthology.org/W13-2322.
Bevilacqua, Michele, Rexhina Blloshmi, and Roberto Navigli (2021a). “One SPRING to Rule
   Them Both: Symmetric AMR Semantic Parsing and Generation without a Complex Pipeline”.
   In: Proc. of AAAI 35.14, pp. 12564–12573. URL: https://ojs.aaai.org/index.php/AAAI/article
   /view/17489.
Bevilacqua, Michele, Tommaso Pasini, Alessandro Raganato, and Roberto Navigli (2021b). “Re-
   cent Trends in Word Sense Disambiguation: A Survey”. In: Proc. of IJCAI-21, pp. 4330–4338.
   DOI: 10.24963/ijcai.2021/593.
Blloshmi, Rexhina, Michele Bevilacqua, Edoardo Fabiano, Valentina Caruso, and Roberto Navigli
   (2021). “SPRING Goes Online: End-to-End AMR Parsing and Generation”. In: Proceedings of
   EMNLP, pp. 134–142. URL: https://aclanthology.org/2021.emnlp-demo.16.
Conia, Simone, Andrea Bacciu, and Roberto Navigli (2021a). “Unifying Cross-Lingual Semantic
   Role Labeling with Heterogeneous Linguistic Resources”. In: Proceedings of NAACL, pp. 338–
   351. URL: https://www.aclweb.org/anthology/2021.naacl-main.31.
Conia, Simone and Roberto Navigli (2021). “Framing Word Sense Disambiguation as a Multi-Label
   Problem for Model-Agnostic Knowledge Integration”. In: Proceedings of EACL, pp. 3269–
   3275. URL: https://www.aclweb.org/anthology/2021.eacl-main.286.
Conia, Simone, Riccardo Orlando, Fabrizio Brignone, Francesco Cecconi, and Roberto Navigli
   (2021b). “InVeRo-XL: Making Cross-Lingual Semantic Role Labeling Accessible with Intelli-
   gible Verbs and Roles”. In: Proceedings of EMNLP, pp. 319–328. URL: https://aclanthology.o
   rg/2021.emnlp-demo.36/.
Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek,
   Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov (2020).
   “Unsupervised Cross-lingual Representation Learning at Scale”. In: Proceedings of the 58th
354                  Roberto Navigli, Riccardo Orlando, Cesare Campagnano, and Simone Conia

   Annual Meeting of the Association for Computational Linguistics. Ed. by Dan Jurafsky, Joyce
   Chai, Natalie Schluter, and Joel Tetreault. ACL, pp. 8440–8451. DOI: 10.18653/v1/2020.acl-
   main.747. URL: https://www.aclweb.org/anthology/2020.acl-main.747.
Lyu, Chunchuan, Shay B. Cohen, and Ivan Titov (2021). “A Differentiable Relaxation of Graph
   Segmentation and Alignment for AMR Parsing”. In: Proc. of EMNLP, pp. 9075–9091. URL:
   https://aclanthology.org/2021.emnlp-main.714.
Màrquez, Lluı́s, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson (2008). “Semantic
   Role Labeling: An Introduction to the Special Issue”. In: Comp. Linguistics 34.2, pp. 145–159.
   URL: https://aclanthology.org/J08-2001.
Moro, Andrea, Alessandro Raganato, and Roberto Navigli (2014). “Entity Linking meets Word
   Sense Disambiguation: A Unified Approach”. In: TACL 2, pp. 231–244. URL: https://aclantho
   logy.org/Q14-1019.
Navigli, Roberto, Michele Bevilacqua, Simone Conia, Dario Montagnini, and Francesco Cecconi
   (2021). “Ten Years of BabelNet: A Survey”. In: Proc. of IJCAI-21, pp. 4559–4567. DOI: 10.2
   4963/ijcai.2021/620.
Nguyen, Minh Van, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen (2021).
   “Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Pro-
   cessing”. In: Proceedings of the 16th Conference of the European Chapter of the Association
   for Computational Linguistics: System Demonstrations. ACL, pp. 80–90. DOI: 10.18653/v1/2
   021.eacl-demos.10. URL: https://aclanthology.org/2021.eacl-demos.10.
Orlando, Riccardo, Simone Conia, Fabrizio Brignone, Francesco Cecconi, and Roberto Navigli
   (2021). “AMuSE-WSD: An All-in-one Multilingual System for Easy Word Sense Disambigua-
   tion”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Pro-
   cessing: System Demonstrations. ACL, pp. 298–307. DOI: 10.18653/v1/2021.emnlp-demo.34.
   URL: https://aclanthology.org/2021.emnlp-demo.34.
Papandrea, Simone, Alessandro Raganato, and Claudio Delli Bovi (2017). “SupWSD: A Flexible
   Toolkit for Supervised Word Sense Disambiguation”. In: Proceedings of the 2017 Conference
   on Empirical Methods in Natural Language Processing: System Demonstrations. Copenhagen,
   Denmark: ACL, pp. 103–108. DOI: 10.18653/v1/D17-2018. URL: https://www.aclweb.org/an
   thology/D17-2018.
Scozzafava, Federico, Marco Maru, Fabrizio Brignone, Giovanni Torrisi, and Roberto Navigli
   (2020). “Personalized PageRank with Syntagmatic Information for Multilingual Word Sense
   Disambiguation”. In: Proceedings of the 58th Annual Meeting of the Association for Computa-
   tional Linguistics: System Demonstrations. ACL, pp. 37–46. DOI: 10.18653/v1/2020.acl-dem
   os.6.
Zhou, Jiawei, Tahira Naseem, Ramón Fernandez Astudillo, and Radu Florian (2021). “AMR Pars-
   ing with Action-Pointer Transformer”. In: Proceedings of the 2021 Conference of the North
   American Chapter of the Association for Computational Linguistics: Human Language Tech-
   nologies. ACL, pp. 5585–5598. DOI: 10.18653/v1/2021.naacl-main.443. URL: https://aclanth
   ology.org/2021.naacl-main.443.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 29
Virtual Personal Assistant Prototype YouTwinDi

Franz Weber and Gregor Jarisch




Abstract YouTwinDi is the next step in a digitised world in which the digital twin
evolves and interacts with other digital twins and makes autonomous decisions in
the interest of its human twin. In this scenario, security and digital ethics assure
ethical decisions and IT specialists concur on improving the digital landscape with
ethical models. This vision also includes overcoming language barriers. A continu-
ous match of supply and demand as well as tailored searches help human twins to
improve their lives in all respects. YouTwinDi uses the most advanced translation
and language analysis technologies, allowing the user and its digital twin to interact
with all European citizens without being blocked by language barriers.



1 Overview and Objectives of the Pilot Project

The goal of this ELG (Rehm et al. 2021) pilot project was to build the prototype of a
personal virtual assistant, which can be installed on a small device or integrated in an
ELG-compatible container. We wanted to demonstrate that this can be accomplished
using ELG language resources and technologies while keeping highest security stan-
dards. We use the open source software EDDI which is running in a docker container
for the natural language interface. This prototype is the basis for the development of
a minimum viable product ready for market launch.
   We believe that conversational AI applications are well suited to support interac-
tions between people that speak different languages due to their real-time nature and
the ability to create personalised customer experiences at scale.
   In line with the broader ELG principle that “with 24 official EU and many more
additional languages, multilingualism in Europe and an inclusive Digital Single Mar-
ket can only be enabled through Language Technologies”, the YouTwinDi1 solution
was developed on top of our existing technology and integrated into the European
Language Grid. We use APIs to translate text input (or speech input, via speech-

Franz Weber · Gregor Jarisch
Labs.ai, Austria, franz@labs.ai, gregor@labs.ai
1   https://www.youtwindi.com


© The Author(s) 2023                                                                355
G. Rehm (ed.), European Language Grid, Cognitive Technologies,
https://doi.org/10.1007/978-3-031-17258-8_29
356                                                      Franz Weber and Gregor Jarisch

to-text technologies) and to recognise intents to query specific data sources and to
provide feedback in the language spoken by the user either in written or spoken
form (via text-to-speech technologies). YouTwinDi uses these features to add trans-
lations of web audio and video streams and to convert the channels into text streams
– two appropriate examples are the automatic translation of the European Commis-
sion’s LinkedIn broadcast events or the automatic translation of local radio stations.
Through the integration of ELG APIs we can also integrate technologies such as sen-
timent analysis into YouTwinDi. Such features are fundamental especially for public
institutions to better support citizens.



2 Methodology

The basis for the Digital Twin prototype is our open source chatbot framework EDDI
(Enhanced Dialogue Driven Intelligence).2 This solution has several features that
simplify the integration of and with the available ELG resources.
   Our software development process is based on the agile software development
approach, in particular on Scrum. All product features are listed and prioritised in a
product backlog, which consists of what needs to be done to successfully deliver a
working software system, including bug fixes and non-functional requirements.
   Cross-functional teams estimate and sign up to deliver potentially shipable in-
crements of software during successive sprints, typically lasting 30 days. Once a
sprint’s backlog is committed, no further functionality can be added to the sprint ex-
cept by the team. Once a sprint has been delivered, the product backlog is analysed
and re-prioritised, if necessary, and the next set of deliverables is selected for the
next sprint. From the lean product development best practices we have adopted the
concept of minimum viable product (MVP) as a strategy to avoid building products
that customers do not need or want, realising often the product with the agreed num-
ber of features and the minimum level of quality that can be easily verified by senior
users. We develop our solution keeping in mind the ability to interface with external
services and resources via APIs and building software development kits. This allows
us to integrate fast and to test the integration with available ELG building blocks.
   Each feature under development was monitored in terms of costs (human re-
sources and hardware as well as software resources) and in terms of delivery. Ac-
ceptance tests were linked to use cases and test criteria. Integration has always been
important for us as an open source solution provider, which is why all our software
features are available at the API level. Modern concepts as Graph API and authenti-
cation and authorisation security are at the core of our software development method-
ology, allowing for easy testing and integration with existing systems.
   Our development strictly follows the Service Oriented Architecture (SOA) con-
cept, removing the bottleneck of dependencies and permitting the usage of indepen-
dent layers to achieve the development goals. We also subscribe to the concept of

2   https://www.eddi.labs.ai
29 Virtual Personal Assistant Prototype YouTwinDi                                  357

microservices (already adopted by ELG), which allows us to easily embed our so-
lution in the ELG ecosystem. Our goal was to develop a portable solution that can
run on a small hardware solution (e. g., Raspberry Pi) and that can also be interfaced
with the ELG platform or directly embedded in ELG as a container.
    We value change management and have documented all steps to integrate our
solution using “how to” documents and guidelines.



2.1 Use Case 1: Automated Translation of local News

The Newbly3 use case relates to the delivery of local news in foreign languages (see
Figure 1). In this use case, the user interacts via text or voice with YouTwinDi.
    • The automated translation translates the topic expressed in the search query into
      the local language (set in the configuration).
    • YouTwinDi initiates a look up for the topic in local news and social media in the
      local language.
    • YouTwinDi checks if the news is categorised as fake, in which case the user is
      alerted and asked if they want to proceed anyway.
  If the news is not categorised as fake, the user is presented with the news and the
news is stored in order to be periodically checked against the fake news database,
which case YouTwinDi will notify the user accordingly.



2.2 Use Case 2: Secure Communication between Virtual Assistants

The second use case revolves around communication between multiple virtual assis-
tants. Imagine a friend has a wish list on an ecommerce platform – you could ask
your friend for access to this list, but that would make your friend anticipate the
present. One solution for this challenge can be personal assistants negotiating for
a piece of information. Your bot could ask your friend’s bot what to gift the friend
based on the online wish list, which, in the case of Amazon, is provided by Alexa. As
your and your friend’s virtual assistants are “friends” themselves (trusted domain),
they are allowed to communicate such information without your friend receiving a
notification.



3 Implementation

The pilot project consisted of five work packages. Work Package 1 was dedicated
to the research of potential suitable hardware to be used for the prototype. In addi-
3   https://newb.ly
358                                                            Franz Weber and Gregor Jarisch




Fig. 1 YouTwinDi use case 1: automated translation of local news



tion we verified if running a containerised version of EDDI would be possible on
the shortlisted hardware. For the prototype we decided to use a standard Android
smartphone. We also specified the use cases (see Sections 2.1 and 2.2).
    Work Package 2 focused upon the integration of EDDI into the ELG platform
and setting up the needed containers. We implemented the two use cases, as defined
in WP1, on the Android phone. The first use case is defined as translating news
from the German language into other languages, such as Romanian or Croatian using
machine translation tools available in ELG. The second use case concentrates on the
communication between two virtual assistants where one wants to obtain a birthday
wish list from the other assistant’s owner.
    Work Package 3 concentrated on preparing the hardware and installing the soft-
ware including the use cases on the selected Android smartphone running in a con-
tainer. In order to accomplish this some modifications had to be applied to the operat-
ing system. Afterwards we could easily install EDDI running in a container, however,
we came to realise that the ELG language technology tools would be too large to run
on the smartphone in a container. From a security point of view our goal was to have
all technologies on the device in order to provide maximum security and privacy to
users. As this was not possible, we decided for the prototype to be able to call remote
services.
29 Virtual Personal Assistant Prototype YouTwinDi                                    359

   Work Package 4 was dedicated to finalising and testing the prototype. In addition,
we created a presentation and documented which compromises we had to engage in
compared to the initial specification in WP 1 and WP 2.
   Work Package 5 took care of all dissemination activities. This was an ongoing
process from the beginning to the end of the pilot project. We set up a project website4
which was updated on a regular basis with updates and news about the pilot project.
We also posted updates on social media, such as LinkedIn and Twitter. The audience
reached with the project website was, on average, 145 unique users per month. In
total, users were reached from Austria, USA, Czech Republic, China, Netherlands,
Canada, Germany, United Arab Emirates, Switzerland and Croatia.



4 Conclusions and Results of the Pilot Project

The main technology achievements of our pilot project can be summarised as follows.
We could successfully demonstrate that Docker containers can run on small devices
such as Android smartphones and that applications such as EDDI and databases such
as MongoDB can run within these containers. We could also show that peer-to-peer
networks for communication between virtual assistants are possible with both a pub-
lic and a private section of the accessible data and a handshake and identity check
mechanisms to verify both users of the virtual assistants with key exchange and
end-to-end encryption in order to achieve highest security standards. Based on the
research work during the pilot project and the implemented prototype, we plan to
develop the software further to a minimum viable product.
    YouTwinDi is the next step in a digitised world in which the digital twin evolves
and interacts with other digital twins and makes autonomous decisions in the interest
of its human twin. In this scenario, security and digital ethics assure ethical decisions
and IT specialists concur on improving the digital landscape with ethical models.
This vision also includes overcoming language barriers. A continuous match of sup-
ply and demand as well as tailored searches help human twins to improve their lives
in all respects. YouTwinDi uses the most advanced translation and language analy-
sis technologies, allowing the user and its digital twin to interact with all European
citizens without being blocked by language barriers.
    We use our existing open source software EDDI which is running in a docker
container for the natural language interface. This prototype is the basis for the devel-
opment of a minimum viable product ready for market launch.
    The ELG pilot project YouTwinDi had two major innovation aspects:
Technical innovation For the first time an AI application runs within a Docker
   container on a small hardware device without any technical limitations.
Creative-economical innovation The creative-economical innovation relates to the
   idea that the digital twin interacts with other digital twins and makes autonomous
   decisions in the interest of its human twin.
4   https://youtwindi.com
360                                                               Franz Weber and Gregor Jarisch

Acknowledgements The work described in this article has received funding from the EU project
European Language Grid as one of its pilot projects.




References

Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiļjevs,
   Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus,
   Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Gala-
   nis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris
   Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid
   Choukri, Valérie Mapelli, Mickaël Rigault, Jūlija Meļņika, Miro Janosik, Katja Prinz, Andres
   Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language
   Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings
   of the 16th Conference of the European Chapter of the Association for Computational Linguis-
   tics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. URL: https://w
   ww.aclweb.org/anthology/2021.eacl-demos.26.pdf.




Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
    The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.