SPEECH TECHNOLOGIES FOR DAILY LIFE:
Voice Assistants, ChatBots and Spoken Dialogue Systems

14th-17th November 2023, Jaca, Spain

RED TEMÁTICA EN TECNOLOGÍAS DEL HABLA (RTTH)
Collaborators:
CÁTEDRA RTVE-UNIVERSIDAD DE ZARAGOZA
CÁTEDRA BTS GROUP-UNIVERSIDAD DE ZARAGOZA
CURSOS EXTRAORDINARIOS UNIVERSIDAD DE ZARAGOZA


We are thrilled to announce the seventh edition of the RTTH Schools series. This time, in the format of a Fall School, it will be devoted to the topic of "Speech Technologies for Daily Life: Voice Assistants, Chatbots, and Spoken Dialogue Systems."
In today's fast-paced world, speech technologies have become integral to our daily lives, revolutionizing the way we interact with technology and enhancing our overall experiences. This school aims to provide participants with a comprehensive understanding of voice assistants, chatbots, and spoken dialogue systems, covering their underlying technologies, applications, and future trends.
Through expert-led lectures, hands-on workshops, and interactive discussions, attendees will delve into the world of conversational agents, natural language processing, automatic speech recognition, and machine learning algorithms powering these technologies. The program will also include a keynote talk by a renowned expert in the field.
By the end of the school, participants will have gained valuable insights and practical skills to develop and deploy their own speech-enabled applications, enabling them to leverage the power of speech technologies for various domains such as smart homes, healthcare, customer service, and more.
The school is open to all Master's and PhD students, researchers, and professionals interested in learning about or refreshing their knowledge of conversational systems. Participants will have the opportunity to present their work in a special Three Minute Thesis session and receive feedback from the experts.
The RTTH Fall School 2023 is sponsored by the Spanish Thematic Network on Speech Technology (RTTH), the Cátedra RTVE at the University of Zaragoza, and the Cátedra BTS GROUP Nuevas Tecnologías de Telecomunicación at the University of Zaragoza, and is organized by the ViVoLab research group of the Aragon Institute of Engineering Research (I3A) of the University of Zaragoza.
The Fall School 2023 is organized as a 4-day intensive course to be held on November 14th-17th in Jaca (Spain). Join us for an immersive and enlightening experience that will shape the future of human-computer interaction!

Important Dates

  • Early registration by September 30th, 2023
  • Three Minute Thesis title and abstract submission (Master's and PhD students) by October 31st, 2023
  • Standard registration until October 31st, 2023
  • RTTH Fall School: November 14th-17th, 2023

Speakers

  • M. Inés Torres, Universidad del País Vasco
  • Zoraida Callejas, Universidad de Granada
  • Luis Fernando D'Haro, Universidad Politécnica de Madrid
  • Javier Mikel Olaso, Universidad del País Vasco
  • Hermann Ney, RWTH Aachen University

Program

Tuesday, November 14th

Place: Salón de Actos, Residencia Universitaria

  • 09:00 - 09:30 Opening Ceremony
    Carmen Marta, Eduardo Lleida, Carlos Mártinez
  • 09:30 - 11:00 Introduction to Conversational Systems
    M. Inés Torres Download slides
    The objective of this lecture is to provide a basic understanding of the principal components of a Conversational System. First, we will present each of the modules and technologies involved, such as automatic speech recognition (ASR), natural language understanding (NLU), natural language generation (NLG), and text-to-speech (TTS), among others. Then, we will focus on the Dialogue Manager, which is the core decision-making module of the system, and discuss various approaches for its design.
  • 11:00 - 11:30 Coffee Break
  • 11:30 - 13:00 Keynote: Data-driven speech and language technology: from small to large models
    Hermann Ney Download slides
    Today, data-driven methods like neural networks and deep learning are widely used for speech and language processing. We will revisit the evolution of data-driven methods over the last 40 years and present a unifying view of the underlying principles, based on the Bayes decision rule. Specifically, the talk will focus on speech recognition and language modelling.

  • 13:00 - 15:00 Lunch Break

  • 15:00 - 17:30 Conversational Systems Development: An Engineering Perspective
    Zoraida Callejas Download slides
    The rapid evolution of conversational systems is driving advancements in the methods, techniques, and tools used for their development. In this class, we will explore conversational systems development from an engineering perspective, encompassing the entire lifecycle of the process. We will cover the prototyping techniques employed in industrial scenarios, emphasizing practical system design considerations for achieving a seamless and fluid conversational user experience across various communication channels. Additionally, we will examine the prevalent architectures used for development and deployment, focusing on widely used solutions like Google DialogFlow, Amazon Alexa, or RASA. By the end of the class, you will have gained comprehensive insights into the practical intricacies of developing conversational systems and be well-equipped to start the practical lab.
  • 18:00 Visit to the Ciudadela de Jaca
    The Castillo de San Pedro is a pentagonal fortification constructed at the end of the 16th century. Built by Tiburcio Spannocchi, an Italian engineer serving Felipe II, its purpose was to defend the Aragonese border with France. Located outside the walls on land known as Burnao, the castle features a moat, bastions, escarpments, barracks, magazines, tunnels, and an impressive entrance accessed by a drawbridge. The castle represents the new military architecture of its time, influenced by artillery use, with thick walls, slopes, and specific locations for cannons. This style of construction, known as the Italian trace, is evident in the design. Despite housing a military garrison, the castle has seen few war-related events. Notably, during the War of Independence, it was briefly taken by French troops but later recaptured by Spanish soldiers under General Espoz y Mina. Since then, the castle has gradually lost its strategic significance. In 1968, the castle underwent extensive restoration, earning it the prestigious "Europa Nostra" award. Today, the Castillo de San Pedro Consortium, comprising the Ministry of Defence, Huesca Provincial Council, and Jaca City Council, oversees its management, preservation, restoration, and cultural promotion.

Wednesday, November 15th

Place: Palacio de Congresos

  • 09:00 - 10:15 Speech in Spoken Dialogue Systems
    M. Inés Torres Download slides
    In classical approaches to Spoken Dialogue Systems (SDSs), audio signals are first decoded into word sequences by Automatic Speech Recognition (ASR) systems. Natural Language Processing (NLP) techniques are then applied to understand the user and, finally, respond accordingly. However, this approach ignores crucial information embedded in the speech signal, such as the speaker's emotional state and prosody, or the environmental noise level, which could be essential for enhancing dialogue strategies. In this presentation, we will emphasise the significant role of the information encoded in the audio signal. Within this framework, we will also introduce some hybrid architectures that include both classical modules and novel end-to-end models.
  • 10:15 - 11:00 Understanding ChatGPT: Technology, Trends and Challenges for Conversational Systems (Part 1)
    Luis Fernando D'Haro Download slides
    Conversational systems such as ChatGPT have become a trending technology due to their ability to provide seamless, personalized, and engaging experiences to users. Thanks to the widespread use of messaging apps and voice assistants, and to mass access to outstanding DNN-based models, these systems are transforming the way we interact with technology and with each other, and how industries provide accessible and customizable services to end users.
    In this class, we will first look at the technology at a high level, covering its strengths and limitations. We will then move on to current trends and open challenges, including automatic evaluation, ethical aspects, and some pointers to current models for open-domain and task-oriented dialogue systems. Finally, personal experiences and current research will be presented along the way to describe the efforts we are making in Spain to advance and shape the future of conversational systems.
  • 11:00 - 11:30 Coffee Break
  • 11:30 - 13:00 Understanding ChatGPT: Technology, Trends and Challenges for Conversational Systems (Part 2)
    Luis Fernando D'Haro

  • 13:00 - 15:00 Lunch Break

  • 15:00 - 18:00 Automatic Dialogue Evaluation for Conversational Systems
    Luis Fernando D'Haro Download slides Download tutorial files
    For a long time, conversational systems (chatbots) have attracted significant interest from both academia and industry. The widespread usage of ChatGPT last year brought even more attention, from both the public and researchers, especially to generative models and open-domain dialogue systems. However, current metrics are not fully aligned with the training process in such systems, and performance is mainly measured through extensive human evaluations, which are both time- and cost-intensive. Hence, it is important for researchers and practitioners to know and understand the automatic evaluation metrics that have been proposed.
    In this class, we will first describe the existing taxonomy of dialogue evaluation and standard benchmarks. Then, we will present the common NLG metrics used in dialogue evaluation and the problems associated with them. Next, we will look at the newly proposed reference-free and model-based metrics. After that, we will describe future research directions. Finally, we will carry out some hands-on tasks to understand and practice the different aspects of the pipeline for evaluating dialogue systems.

Thursday, November 16th

Place: Palacio de Congresos

  • 09:00 - 11:00 Hands-on workshop: Conversational Systems Development with RASA, Part 1
    M. Inés Torres, Javier Mikel Olaso, Zoraida Callejas Rasa folder
    RASA is the leading open-source platform for conversational systems development. Throughout the lab sessions, we will embark on a hands-on exploration of the fundamental components that constitute the RASA ecosystem. We will examine RASA's core functionalities, such as intent recognition, entity extraction, and dialogue management using interactive stories, all with an emphasis on best practices. Through practical exercises and challenges, you will gain experience in designing and training conversational models within the RASA framework, resulting in a fully functional system.
  • 11:00 - 11:30 Coffee Break
  • 11:30 - 13:00 Hands-on workshop: Conversational Systems Development with RASA, Part 2
    M. Inés Torres, Javier Mikel Olaso, Zoraida Callejas

  • 13:00 - 15:00 Lunch Break

  • 15:00 - 18:00 Conversational Systems Development with RASA, Part 3
    M. Inés Torres, Javier Mikel Olaso, Zoraida Callejas

Friday, November 17th

Place: Palacio de Congresos

  • 09:30 - 13:00 Visit to the Canfranc Underground Laboratory (Laboratorio Subterráneo de Canfranc)
    The Laboratorio Subterráneo de Canfranc (LSC) is a facility for Underground Science. It is conceived as a Consortium of the Spanish Ministry of Science and Innovation, the Aragon Regional Government and the University of Zaragoza.
    The experimental halls of the Laboratorio Subterráneo de Canfranc (LSC) have been excavated in the rock, 800 m deep under Mount Tobazo on the Spanish side of the Aragon Pyrenees. The rock filters out cosmic radiation, providing the "cosmic silence" needed to study rarely occurring natural phenomena, such as the interactions with an atomic nucleus of neutrinos of cosmic origin or of particles of the invisible "dark matter". Dark matter accounts for 85% of the mass of the Universe, but we do not know what it is made of.
    The underground experimental activities are supported by an external building located at Canfranc Estación, with a mechanical workshop, specialised laboratories, offices for LSC staff, and conference, exhibition and meeting rooms for users. The building has been available since January 2011.

  • 13:00 - 15:00 Lunch Break

  • 15:00 - 16:00 This is Alexa
    Antonio Bonafonte
  • 16:00 - 17:30 Three Minute Thesis Presentations
    Miren Mirari San Martín Lacunza: Improving Accessibility in Public Web Pages
    The accessibility of web pages from public institutions is important in ensuring equal access to information and services for all individuals, including those with disabilities. In this project, we aim to improve the accessibility of the web page of the Government of La Rioja. In particular, we focus on three aspects that involve applying different natural language processing techniques. First, we will automatically caption all the images on the web pages; second, we will provide transcriptions of all the videos from the Government of La Rioja; and, finally, we will improve the readability of the contents of the web page. In summary, this project is a first step towards making web pages from public institutions adaptable to the needs of each particular user who visits them.
    Federico Costa: Transcribing Catalan with Gruut
    In this research project we develop a grapheme-to-phoneme transcriber for Catalan using Gruut. Gruut is a tokenizer and statistical IPA phonemizer for multiple languages but, right now, Catalan is not supported. To this end, our strategy was to transfer the knowledge of the rule-based Segre transcriber to Gruut. We will discuss our experience developing this tool and the main results obtained.
    Miguel Angel Pastor Yoldi: Detection of extra-linguistic parameters in voice signals
    The great advances made in recent years in the field of machine learning have enabled the widespread adoption of human-machine interaction systems. Specifically, Automatic Speech Recognition (ASR) systems are now part of the daily lives of millions of people. This kind of system has reached high accuracy in speech transcription tasks. Nevertheless, there are extra-linguistic parameters in the voice signal that these systems do not detect, for example the emotional state of the speaker or some pathologies that affect the speech apparatus. The aim of the thesis is to develop and study systems able to detect extra-linguistic parameters using machine learning techniques such as Deep Neural Networks and Self-Supervised Models. Self-supervised models are particularly interesting due to the reduced size of the available datasets.
    Luis Ricardo Garcia Oyervides: Video Semantic Search
    Audiovisual content has become the preferred medium of knowledge sharing, so we must find cost-effective ways to understand and organize this content. This research evaluates the effectiveness and scalability of a tool for indexing and searching videos using semantic data vectors generated by processing the video frames with neural networks such as CLIP. We develop a proof of concept by testing several state-of-the-art methods for the different tasks needed, such as generating image embeddings, scene segmentation, and vector indexing. We will test our system with different focus groups of potential users using a subjective measurement. This proof of concept aims to create a baseline useful to future researchers in this area.
    Pablo Ascorbe Fernández: PrevenIA: Chatbot for suicide information and prevention
    Suicide is a health and social issue worldwide; hence, simple access to reliable sources of information is crucial to mitigate this problem for those affected and their relatives. The project's objective is to build a chatbot capable of synthesizing information and answering questions from a dataset composed of reliable, professionally supplied documents. It will be integrated into Discord to ease access, and it will record users' assessments of the answers for later evaluation. In addition, a web application will allow the functionality of the chatbot to be tested.
    David Gimeno Gómez: Contributions to Automatic Lipreading for Spanish
    In the last few decades, there has been an increasing interest in Visual Speech Recognition (VSR), a challenging task that aims to interpret speech solely by reading the speaker's lips. The lack of the auditory sense implies multiple challenges that should be considered, such as visual ambiguities and the difficulty of modelling silence. Nonetheless, recognizing speech without the need for audio cues can offer a wide range of applications, e.g., silent visual passwords, active speaker detection, visual keyword spotting, or the development of silent speech interfaces that would be able to improve the lives of people who experience difficulties in producing speech.
    Ana-Maria Bucur: Computational Approaches for Mental Disorders Detection from Social Media Data
    In the interdisciplinary context of mental disorders detection, we are searching for novel cues of mental health problems found in social media and developing different methods for detecting them from textual and visual information. While most works covering depression focus on symptoms, we also explore the relationship between depression and manifestations of happiness in social media and show that the life of users diagnosed with depression is not always depressing, which can guide psychological interventions for improving well-being.
    Aitor Etxalar Zugarramurdi: Question answering for human-machine interaction in educational context of the BERREKIN project
    The work consists of developing a question answering module for instructing students on the usage of an industrial machine. The main goal is to provide an automatic interface that responds to the different possible questions regarding the configuration and processes of the machine, and to improve the module through automatic interaction with domain experts when answers are wrong.
    The Master's and PhD students participating in the Fall School are invited to demonstrate their ability to communicate complex ideas, breakthroughs, and research findings effectively in a limited timeframe, just like an elevator pitch in the professional world. Each participant will have three minutes to captivate the audience, conveying the essence of their research projects while maintaining clarity and engagement.
    The Three Minute Thesis Presentations will provide numerous benefits, including:
    🌐 Networking opportunities with fellow researchers and academics.
    🧠 Exposure to diverse research perspectives and approaches.
    🤝 An interactive platform for exchanging ideas and feedback.
    💼 Practice in concise and impactful communication.
    🏆 A chance to win the Best Three Minute Thesis Award!

    How to participate?
    Master's or PhD students interested in participating in the Three Minute Thesis Presentations should submit a title and abstract of their research project by October 31st, 2023 to
  • 17:30 - 18:00 Fall school closing

Venue and Accommodation

The Venue
The Fall School will take place at

The Accommodation
The Fall School accommodation will be at the "Residencia Universitaria de Jaca" in Jaca, Spain.

The cost of the accommodation and meals (breakfast, lunch and dinner) is included in the registration for the members of the Spanish RTTH.
For all others, accommodation is not included in the registration, but if the budget allows, the organization will reimburse the accommodation expenses in full or in part.
All participants must contact the "Residencia Universitaria de Jaca" and identify themselves as Fall School students to book their room.

Rates

Room              Price per day
Single            26,40 €
Double            40,70 €
Triple            59,40 €
Lunch & Dinner    + 22 €/person

Address

Residencia Universitaria de Jaca
C/ Universidad, 3
22700 Jaca (Huesca)
Phone: +34 974 360 196
Mail

Local Information

Jaca is a unique city that impresses all who come to visit because its attractions are many and varied.
Its strategic location at the foot of the Pyrenees, surrounded by high and snowy mountains, its clean and sunny atmosphere, and its important cultural heritage, resulting from its over two thousand years of history, have made Jaca a city with an international vocation and a top tourist destination.
What to do in Jaca
Jaca Agenda

How to arrive

By car

Jaca is connected by the N-330 road (almost completed A-23 motorway) to Huesca and Zaragoza, and by the N-240 road (under construction A-21 motorway) to Pamplona.
The N-330 national road continues towards France, allowing crossing through the Somport pass or the tunnel from Canfranc.

By bus

Regular bus lines run daily direct services between Jaca and Zaragoza, Huesca, and Pamplona.
For more information about buses from Zaragoza and Huesca, visit the website
If you are coming from Pamplona, obtain more information here.

By train

Until the Canfranc railway line is reactivated, the closest train station is in Zaragoza. From there, you can take a bus to Jaca.

Registration

Registration is closed. Registration is free for RTTH members and students.
If you belong to an RTTH research group or are a Master's/PhD student, write the name of the group and university in the "Professional activities you carry out" field and choose "Reducida" in the "Type of tuition" field of the registration form.
The RTTH registration covers breakfasts, coffee breaks, lunches, dinners and accommodation during the school in the Residencia de Jaca.
Travel expenses must be paid by the participants. Later on, the organization will reimburse the expenses of participants belonging to an RTTH research group. For all others, the organization will reimburse their expenses in full or in part, depending on the available budget, with priority given to students who are not RTTH members.


Important note

Participants must bring their own laptop to follow the practical sessions.
If you do not have a laptop, please contact the organization.

Registration fees

Registration category           Reduced fee    Standard fee
RTTH member                     Free           Free
Student, non-RTTH member        Free           Free
Non-student, non-RTTH member    110 €          130 €

Biographies

Hermann Ney

Hermann Ney is director of science at AppTek, McLean, VA and senior professor of computer science at RWTH Aachen University, Germany. His main research interests lie in the area of machine learning, neural networks and applications to speech recognition, machine translation and other tasks in natural language processing.
He and his team contributed to a large number of European (e.g. TC-STAR, QUAERO, TRANSLECTURES, EU-BRIDGE) and American (e.g. GALE, BOLT, BABEL) large-scale joint projects. His work has resulted in more than 700 conference and journal papers with an h-index of 113 and 64,000 citations (based on Google Scholar). More than 50 of his former PhD students work for IT companies like Amazon, Apple, Cerence, eBay, Google and Nuance.
The results of his research contributed to various operational research prototypes and commercial systems. In 1993 Philips Dictation Systems Vienna introduced a large-vocabulary continuous-speech recognition product for medical applications. In 1997 Philips Dialogue Systems Aachen introduced a spoken dialogue system for train timetable information via the telephone. In the German project VERBMOBIL, his team introduced the phrase-based approach to data-driven machine translation, which in 2008 was used by his former PhD students at Google as the starting point for the service Google Translate. In the EU project TC-STAR, the first research prototype system for spoken language translation of real-life domains was built.
Awards:
2005 Technical Achievement Award of the IEEE Signal Processing Society;
2013 Award of Honour of the International Association for Machine Translation;
2019 IEEE James L. Flanagan Speech and Audio Processing Award;
2021 ISCA Medal for Scientific Achievements (ISCA: Int. Speech Communication Ass.).

M. Inés Torres

M. Inés Torres received her PhD in Physics from the UPV/EHU in 1990, including an internship at the CNET-Lannion (France). She was a visiting researcher at the Polytechnic University of Valencia (Spain), visiting Faculty at Carnegie Mellon University and visiting Professor at the University of California under the Fulbright program. She is currently a Full Professor of Computer Science at the UPV/EHU. Prof. Torres has a multidisciplinary academic and industrial background in the fields of Speech and Language Technologies, focusing on data-driven approaches.
Her current research interests involve Human-Machine interaction, Speech processing, including Emotional Speech Identification, and Spoken Dialogue Systems. She has successfully coordinated the H2020 EMPATHIC project, led research contracts and national projects, and published in outstanding scientific journals and conferences, among other activities. Currently, she is leading the UPV/EHU's participation in the H2020-MSCA-RISE MENHIR action, two research contracts with companies, one industrial project, and two national projects, and is supervising three PhD students.

Zoraida Callejas

Zoraida Callejas is an Associate Professor at the University of Granada, from which she obtained a PhD in 2008. Her research focuses on areas related to dialogue systems, conversational systems, and speech and language technologies. She has published more than 150 contributions to scientific journals, books and conferences, and has published 2 books. She has been an invited researcher at Technical University of Liberec (Czech Republic), University of Trento (Italy), Ulster University (UK), Technical University of Berlin (Germany), Ulm University (Germany), and Télécom ParisTech (France), among others.
She has participated in more than 15 projects in European, Spanish and local calls. She currently coordinates the EU H2020 project MENHIR, with a consortium of 8 international partners. The main focus of MENHIR is the use of conversational technologies for mental health. She is also the coordinator of the Chair RTVE-UGR on deep speech synthesis and conversational AI and their applications to news verification.
In 2020 she received the prestigious recognition "Premio del Consejo Social" of the University of Granada for outstanding contributions to technological innovation in healthcare. In 2023 she received the award "Granada, City of Science and Innovation" in the category "Women in Science".

Luis Fernando D'Haro

Luis Fernando D’Haro is an Associate Professor at Universidad Politécnica de Madrid (ETSIT, UPM), Spain, and a member of the Speech Technology and Machine Learning Group. His research mainly focuses on spoken dialogue and natural language processing systems. He has written more than 150 international peer-reviewed publications in different areas of speech technologies, particularly on dialogue systems; he is the editor of 2 books and co-author of 5 research and educational books; and he has actively participated in more than 25 research projects at national and international level. He is also a reviewer for 5 different top journals and for research projects. He is currently the PI and Coordinator of the European Project ASTOUND (101071191 – HORIZON-EIC-2021-PATHFINDERCHALLENGES-01).
Prof. D’Haro has also organized many activities to promote research on, and access to, dialogue technologies and resources, including the co-organization of workshops and challenges like the WoChat, DBDC and DSTC series (currently including Track four at DSTC11 - multilingual and robust dialogue metrics evaluation). He has also been an organizer of conferences such as Interspeech in 2014, the Human Agent Interaction conference, and the International Workshop on Spoken Dialog System Technology (IWSDS) in 2018, and was General Chair for IWSDS 2020. He was Senior Member for the Chanel workshop at the Johns Hopkins Summer school (JSALT2020). Finally, he has been the Faculty Advisor for the Genuine2 (SGC4) and Thaurus (SGC5) Teams for the prestigious Amazon Alexa Prize Grand Challenge.

Javier Mikel Olaso

Javier Mikel Olaso is a Researcher and Software Developer in the Speech Interactive Research Group (SPIN) at the University of the Basque Country (UPV/EHU). He received his PhD in Physics Engineering, with a specialty in Computer Languages and Systems, from the UPV/EHU in 2017.
He was a visiting researcher at the Centre for Speech Technology Research, School of Informatics at the University of Edinburgh. He was also a visiting researcher at Telecom ParisTech (TPT) Engineering School in Paris. Since 2007, he has been working in research and development of spoken dialogue systems. He has participated in several projects in European, Spanish and local calls.

Photographs