Author(s): Suvrat Arora
Originally published on Towards AI.
Hey Siri, How’s the weather today? — if this statement sounds familiar, you are no stranger to the field of computational linguistics and conversational AI.
Source: Creative Commons
In recent years, we have seen an explosion in the use of voice assistants, chatbots, and other conversational agents that use natural language to communicate with humans. These technologies have revolutionized the way we interact with computers — enabling us to access information, make purchases, and carry out an array of tasks through simple voice commands or text messages. At the heart of these technologies lies the field of computational linguistics, which combines the study of linguistics and computer science to develop computational models and algorithms for processing and understanding human language. In this article, we will dig into the basics of Computational Linguistics and Conversational AI and look at the architecture of a standard Conversational AI pipeline.
What is Computational Linguistics?
Computational linguistics combines the study of linguistics with computer algorithms and models to process and analyze human language. It includes tasks such as Speech Recognition, Machine Translation, and Sentiment Analysis, which aim to enable computers to understand, generate, and analyze language.
What is Conversational AI?
Conversational AI refers to developing and implementing Artificial Intelligence (AI) systems that can engage in natural language conversations with humans. These AI systems use various technologies, such as Natural Language Processing (NLP), Speech Recognition, and Speech Synthesis, to understand, process, and generate human-like responses. Conversational AI is used in many applications, including chatbots, virtual assistants, and voice-activated devices. These systems can provide customer support, automate tasks, and improve user experiences by allowing users to interact with technology more naturally and intuitively.
Applications of Conversational AI
Applications of Conversational AI are vast. With more data becoming available every minute, building Conversational AI applications grows ever more feasible. Some of the most common applications are listed below:
Virtual Assistants: Virtual assistants, such as Siri and Alexa, are popular applications of conversational AI. These assistants use voice recognition and NLP to answer questions, set reminders, and perform various tasks, such as making calls, sending messages, and playing music, all through natural language interactions with users.
Chatbots: Chatbots are AI-powered software programs that simulate human conversations. They are used to automate customer service interactions, provide information, and handle simple tasks, such as scheduling appointments or ordering products. Chatbots can be integrated into websites, messaging apps, and other platforms to provide instant support and assistance to users. A prominent example of a chatbot is OpenAI's ChatGPT.
How are Computational Linguistics and Conversational AI different?
Computational linguistics and conversational AI are related fields, but they have different focuses and goals. Computational linguistics is primarily concerned with analyzing and processing human language using computational methods. Conversational AI, by contrast, focuses on developing computer programs that can engage in natural language conversations with humans, using natural language processing, machine learning, and other techniques to create chatbots, voice assistants, and other conversational agents that can understand and respond to human language.
While computational linguistics is a broad field that encompasses many different areas of natural language processing, conversational AI is more focused on creating intelligent agents that can carry out specific conversational tasks, such as answering questions or providing recommendations. In summary, computational linguistics is the foundation for conversational AI, and conversational AI is one of the many applications of computational linguistics.
Linguistic Structure of Speech
Before we dive into the intricacies of conversational AI, it is imperative that we develop an understanding of the linguistic structure of speech. Speech, or audio, is simply a disturbance in the environment that can be represented as an acoustic signal. While written text comprises categorical units (each word separated by whitespace), speech comprises non-categorical signals and is hence continuous in nature. The mapping from units of speech to units of written text is not one-to-one, and for most languages, no simple rule governs it; instead, the structure of spoken language is described by the field of linguistics.
In the broad field of linguistics, Phonetics refers to the study of the physical properties of sounds used in speech, including their production, transmission, and perception. It focuses on the characteristics of individual sounds and how the vocal tract articulates them. Phonology, on the other hand, is the study of the abstract sound system of a language, including the way sounds are organized and combined to form words and phrases. It examines the patterns of sound in a language, such as how sounds change based on their position in a word and how sounds can be used to distinguish between different words.
Hierarchical Organization of Units of Speech
Speech can be considered as the association/organization of its fundamental units. The elementary units of speech in hierarchical order are described below:
– Phone: A phone is a unit of sound used in the study of phonetics. It is the smallest unit of sound that can be perceived by the human ear.
– Phoneme: A phoneme is the smallest unit of sound that can change the meaning of a word. For example, in English, the sounds /p/ and /b/ are different phonemes because they can change the meaning of a word (e.g., “pat” versus “bat”).
– Syllable: A syllable is a unit of sound that is made up of one or more phonemes and typically contains a vowel sound. It is a basic unit of rhythm in spoken language and can be thought of as a beat or pulse. Syllables usually consist of a syllable nucleus (usually a vowel sound), an optional initial consonant sound called the onset, and an ending/final consonant sound called the coda. As per the Sonority Sequencing Principle (SSP), the onset is the least sonorous sound in a syllable; sonority increases toward the nucleus and then gradually decreases toward the coda.
Sonority Sequencing Principle (SSP) (Source: Created by Author)
– Word: A word is a unit of language that represents a specific concept, object, action, or idea. It is made up of one or more syllables and is typically used to communicate meaning in speech or writing.
– Utterance: An utterance is a unit of speech produced by a speaker in a single uninterrupted turn, usually with a specific purpose or intention. It can be a word, phrase, or sentence conveying a message or expressing a particular emotion or attitude. Utterances are the building blocks of spoken language, and they can be analyzed in terms of their linguistic features, such as syntax, semantics, and phonetics. In linguistics, the study of utterances is called pragmatics, which is concerned with the use of language in context and the social and cultural factors that shape communication.
Speech in terms of Elementary Units (Source: HAL Open Science)
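The onset–nucleus–coda decomposition described above can be sketched with a naive function that splits a written syllable on its first and last vowel letters. This is purely illustrative: real phonological analysis operates on phonemes rather than spelling, and the function name and vowel set here are my own assumptions.

```python
VOWELS = set("aeiou")

def split_syllable(syl):
    """Naively split a syllable into (onset, nucleus, coda) by vowel letters.
    Assumes the syllable contains at least one vowel letter."""
    first = next(i for i, ch in enumerate(syl) if ch in VOWELS)
    last = next(i for i in range(len(syl) - 1, -1, -1) if syl[i] in VOWELS)
    return syl[:first], syl[first:last + 1], syl[last + 1:]

print(split_syllable("strength"))   # ('str', 'e', 'ngth')
```

For "strength", the onset "str" rises in sonority toward the nucleus "e", and sonority falls again through the coda "ngth", in line with the SSP.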
The architecture of a Conversational AI Pipeline
With the onset of deep learning and greater data availability, conversational AI models have shown improved accuracy and a reduced need for hand-crafted linguistic knowledge in building language services. Now that we are familiar with what conversational AI is and with the linguistic structure of speech, let’s look at a typical Conversational AI pipeline.
Conversational AI Pipeline (Source: Created by Author)
A Conversational AI pipeline comprises two components:
– Speech AI: Automatic Speech Recognition (ASR) and Text to Speech (TTS) conversion
– Natural Language Processing: Natural Language Understanding (NLU) and Natural Language Generation (NLG)
Intuitively, conversational AI must deal primarily with human speech. However, deriving meaning directly from audio signals is difficult. Hence, conversational AI models convert the speech signal to text (Automatic Speech Recognition), perform the required processing on the text (NLP), and finally convert the output back to speech signals (Text to Speech (TTS) conversion).
In the subsequent sections, we’ll explore the components of a conversational AI pipeline in detail.
Speech AI, at a rudimentary level, involves the mapping of speech to text and vice versa. On this ground, Speech AI has broadly two phases:
Automatic Speech Recognition (ASR): Automatic Speech Recognition systems help transcribe spoken audio to text. It is also called Speech to Text conversion.
Text-to-Speech Conversion (TTS): As the name suggests, Text to Speech (TTS) conversion involves mapping written text to spoken audio.
Let us discuss each of them one by one:
Automatic Speech Recognition (ASR)
In an ASR system, the input is a speech signal, and the output is the most likely sequence of written words.
Speech-to-Text Mapping (Source: Research Gate)
Precisely, an ASR system can be defined as a function:
W = f(X)
where X is the recorded input speech signal, W is the most probable text sequence for X, and f is the speech-to-text mapping function.
The definition of such a function in practice is quite difficult, and thus, the objectives of the ASR are accomplished using a series of consecutive models.
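In practice, this mapping is framed probabilistically: the system searches for the word sequence that is most likely given the observed signal. Using Bayes' rule (a standard formulation in the ASR literature, not spelled out in the original):

```latex
W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)
```

Here P(X | W) is estimated by the acoustic and pronunciation models, and P(W) by the language model, which is precisely the division of labor among the models described next.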
The flow diagram of a typical ASR system is shown below:
ASR System (Source: Research Gate)
As we can see in the diagram above, a raw speech signal is fed as input to the ASR system. This signal must first be pre-processed to reduce background noise and other disturbances. The pre-processed audio then passes through the following models to be mapped to text:
1. Feature Extraction:
Feature extraction is a critical component of automatic speech recognition (ASR) systems. Since most models cannot work directly on raw audio signals, feature extraction converts them into a series of numerical features that the ASR system can analyze and interpret. The goal of feature extraction is to capture the most salient information in the audio signal that is relevant for speech recognition while minimizing the effects of noise and other distortions.
There are several techniques used for feature extraction in ASR, but the most widely used method is called Mel-Frequency Cepstral Coefficients (MFCCs). MFCCs are based on the human auditory system’s ability to analyze sounds and have been shown to be effective in representing speech signals.
The process of extracting MFCC features from an audio signal involves several steps:
Pre-Emphasis: The first step is to apply a pre-emphasis filter to the audio signal. This filter amplifies the high-frequency components of the signal, which makes it easier to extract meaningful features.
Frame Blocking: The audio signal is divided into short frames, typically 20–30 milliseconds long. The frames overlap to ensure continuity between them.
Windowing: A window function, such as a Hamming or Hanning window, is applied to each frame to reduce spectral leakage caused by discontinuities at the frame edges.
Fourier Transform: A Fourier transform is applied to each frame to convert the time-domain signal into the frequency domain.
Mel-Scale Filtering: The resulting spectrum is passed through a set of triangular filters that are spaced evenly on a Mel-scale, which is a perceptually-based scale that reflects the way humans hear sound. The filters are used to emphasize the frequencies that are most important for speech recognition.
Logarithmic Transformation: The output of each filter is transformed using a logarithm function, which compresses the dynamic range of the spectrum and makes it easier to represent the signal with a small number of coefficients.
Discrete Cosine Transform: Finally, a Discrete Cosine Transform (DCT) is applied to the logarithmic filter outputs, resulting in a set of Mel-Frequency Cepstral Coefficients (MFCCs) that represent the speech signal.
The resulting set of MFCC features for each frame is then used as input to the acoustic model of the ASR system, which maps the features to phonemes or words.
In summary, feature extraction in ASR is a complex process that involves several steps to convert raw audio signals into a set of numerical features that can be used to recognize speech. The choice of feature extraction technique and the specific parameters used can have a significant impact on the accuracy and robustness of the ASR system.
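The seven steps above can be sketched end-to-end in NumPy. This is an illustrative implementation, not production code; the sample rate, frame sizes, filter count, and coefficient count are common defaults I have assumed, not values prescribed by the article.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=26, n_coeffs=13):
    """Sketch of MFCC extraction following the steps described above."""
    # 1. Pre-emphasis: amplify high-frequency components
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2. Frame blocking: short overlapping frames
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])

    # 3. Windowing: Hamming window reduces spectral leakage
    frames *= np.hamming(frame_len)

    # 4. Fourier transform -> power spectrum
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 5. Triangular filters evenly spaced on the mel scale
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 6. Logarithmic compression (small constant avoids log(0))
    log_energy = np.log(power @ fbank.T + 1e-10)

    # 7. DCT-II of log filter-bank energies; keep the first few coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1)
                   / (2 * n_filters))
    return log_energy @ basis.T

# One second of a synthetic 440 Hz tone as stand-in audio
t = np.linspace(0, 1, 16000, endpoint=False)
features = mfcc(np.sin(2 * np.pi * 440 * t))
print(features.shape)   # one 13-dimensional vector per 10 ms hop
```

Each row of the result is the MFCC vector for one frame; in a real system, these vectors would be fed to the acoustic model.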
2. Acoustic Model:
The acoustic model is a fundamental component of Automatic Speech Recognition (ASR) systems. Its main function is to transform the acoustic signal of spoken words into a sequence of phonetic units, which can then be further processed by language and lexical models. The acoustic model’s accuracy directly affects an ASR system’s overall performance. Simply put, an acoustic model can be defined in terms of a function that maps acoustic features to phonetic units.
The classical acoustic model is based on Hidden Markov Models (HMMs), mathematical models that represent the probability distribution of a sequence of observations. In the context of ASR, the observations are the acoustic features extracted from the speech signal, such as Mel-frequency cepstral coefficients (MFCCs), which represent the signal’s spectral characteristics. The HMM is a probabilistic model that estimates the probability of each observation given a hidden state or phoneme.
The acoustic model is trained using a large corpus of speech data, which is typically transcribed into phonetic units. The training data is used to estimate the parameters of the HMM, which include the mean and variance of the acoustic features for each phoneme. During training, the HMM learns to associate each phoneme with a unique set of acoustic features and to model the transitions between phonemes.
The accuracy of the acoustic model is critical to the performance of an ASR system. Inaccuracies in the model can lead to errors in phoneme recognition, which can significantly degrade the overall performance of the system. As a result, ongoing research in ASR is focused on improving the accuracy of the acoustic model through techniques such as deep neural networks (DNNs) and convolutional neural networks (CNNs), which have shown promising results in recent years.
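A minimal sketch of the HMM machinery described above, using the forward algorithm to score an observation sequence. The two phoneme states, transition matrix, and emission probabilities are toy numbers invented for illustration; real acoustic models have thousands of states and continuous emission densities.

```python
import numpy as np

# Toy HMM with two hidden phoneme states and three discrete observation
# symbols (standing in for quantized acoustic feature vectors).
pi = np.array([0.6, 0.4])            # initial state probabilities
A = np.array([[0.7, 0.3],            # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # emission: P(observation | state)
              [0.1, 0.3, 0.6]])

def forward(obs):
    """Total probability of an observation sequence (forward algorithm)."""
    alpha = pi * B[:, obs[0]]        # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate and emit at each step
    return alpha.sum()

print(forward([0, 1, 2]))
```

In a full recognizer, such likelihoods would be computed for competing phoneme-sequence hypotheses, and the decoder would keep the best-scoring ones.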
3. Pronunciation Model (Lexicons):
A pronunciation model is a set of rules and patterns used by automatic speech recognition (ASR) systems to transcribe phonetic units to words. It helps the system recognize the correct sounds of words by providing information about how each phoneme (unit of sound) is pronounced in a particular language. Without a pronunciation model, ASR systems would have a much harder time accurately transcribing spoken words, as there are often multiple ways to pronounce the same word or sound in different dialects and accents. In summary, a pronunciation model can be considered as a function that helps map phonetic units to words.
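At its simplest, a pronunciation lexicon can be sketched as a lookup table from phoneme sequences to words. The ARPAbet-style symbols and entries below are illustrative; real lexicons contain tens of thousands of entries, often with multiple pronunciations per word.

```python
# Toy pronunciation lexicon: phoneme sequence -> written word.
lexicon = {
    ("P", "AE", "T"): "pat",
    ("B", "AE", "T"): "bat",
    ("K", "AE", "T"): "cat",
}

def phones_to_word(phones):
    """Map a phoneme sequence to a word, or <unk> if it is not in the lexicon."""
    return lexicon.get(tuple(phones), "<unk>")

print(phones_to_word(["B", "AE", "T"]))   # bat
```

Note how /p/ and /b/ in the first two entries distinguish "pat" from "bat", mirroring the phoneme example given earlier.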
4. Language Model:
If you’re familiar with the field of NLP, you probably already know what a language model is. A language model is a statistical model (traditionally n-gram based) used to predict the probability of a word or a sequence of words in a given context.
For example, consider the following sentence: “I went to the store and bought some apples.”
A language model would analyze the sentence and assign probabilities to each word based on the context in which it appears. For instance, the word “store” would have a higher probability than “apples” since it is more likely to occur after “went to the.” Similarly, the phrase “some apples” would have a higher probability than “some bananas” since it is a more common collocation in English.
The ASR system would use the language model to predict the most likely transcription of a given audio input based on the probabilities assigned to each word generated by the Pronunciation Model. This would help to minimize errors and improve the accuracy of the ASR output.
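An n-gram language model can be sketched as bigram counts over a corpus. Here the "corpus" is just the example sentence above, and the probabilities are unsmoothed maximum-likelihood estimates; real systems train on far larger corpora and apply smoothing so unseen bigrams do not get zero probability.

```python
from collections import Counter

# Tiny bigram language model trained on a toy one-sentence corpus.
corpus = "i went to the store and bought some apples".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("to", "the"))   # 1.0: "the" always follows "to" here
```

With this model, a hypothesis containing "went to the" scores higher than one containing "went to apples", which is exactly the re-ranking effect the decoder exploits.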
5. Decoder:
The decoder takes the outputs of all the above models into consideration and yields the most optimal transcription of the audio, usually via a graph-based search.
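Such a search can be sketched, in highly simplified form, as a beam search over per-step word probabilities. The probabilities below are invented for illustration; a real decoder searches a weighted graph combining acoustic, pronunciation, and language model scores.

```python
import math

def beam_search(step_probs, beam_width=2):
    """Toy beam search: step_probs[t] maps candidate words at step t
    to their probabilities; keep only the best beam_width hypotheses."""
    beams = [([], 0.0)]                       # (word sequence, log probability)
    for probs in step_probs:
        candidates = [(seq + [w], score + math.log(p))
                      for seq, score in beams for w, p in probs.items()]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                        # best-scoring hypothesis

steps = [{"I": 0.6, "eye": 0.4},
         {"went": 0.7, "want": 0.3},
         {"to": 0.8, "two": 0.2}]
print(" ".join(beam_search(steps)))   # I went to
```

Pruning to a fixed beam width keeps the search tractable even when the hypothesis space grows exponentially with utterance length.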
Text to Speech (TTS) Conversion
In the Conversational AI pipeline, the TTS component comes into play after the Natural Language Processing (NLP) component has analyzed the user’s text input and generated a response. The TTS component then converts the response into audio, which can be played back to the user. There are a number of different techniques and algorithms used in TTS conversion, including rule-based systems and neural network-based systems.
A typical Text to Speech conversion pipeline would include the following:
– Text Pre-Processing: normalization of the text (conversion to lowercase, omission of special characters, etc.)
– Mel-Spectrogram Generation: a synthesis network generates a mel-spectrogram from the text.
– Audio Wave Generation: a vocoder network generates a waveform from the mel-spectrogram.
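The text pre-processing step can be sketched as a small normalization function. This is a minimal sketch of my own; real TTS front ends also expand numbers, dates, abbreviations, and so on before synthesis.

```python
import re

def normalize_text(text):
    """Lowercase the text and strip special characters, collapsing whitespace.
    Apostrophes are kept so contractions survive normalization."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)   # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

print(normalize_text("Hello, World! It's 5 o'clock."))
```

The normalized string is what the synthesis network would consume when generating the mel-spectrogram.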
Natural Language Processing (NLP)
NLP stands for Natural Language Processing, which is a branch of Artificial Intelligence that deals with the interaction between computers and human language in text form. NLP focuses on enabling machines to understand, interpret, and generate human language and is used in a wide range of applications such as chatbots, virtual assistants, sentiment analysis, and machine translation.
NLP can be branched into two subfields:
Natural Language Understanding (NLU)
Natural Language Generation (NLG)
Natural Language Understanding focuses on enabling machines to understand and interpret human language. NLU is used to analyze text input and extract meaning from it, enabling machines to recognize entities, understand the relationship between words, and classify text based on its content.
Natural Language Generation focuses on enabling machines to generate human-like language. NLG is used to convert data and information into natural language text, enabling machines to automatically generate reports, summaries, and other types of written content.
Applications of NLP
The applications of NLP are vast and ever-expanding. Some of them are enlisted below:
Sentiment Analysis: NLP can be used to analyze text and determine the sentiment or emotion expressed in the text. Sentiment analysis is used in a wide range of applications, such as social media monitoring, customer feedback analysis, and market research.
Machine Translation: NLP is used to enable machines to translate text from one language to another automatically. Machine translation is used in a variety of applications, such as international business communication and online content localization.
Information Extraction: NLP algorithms are used to extract structured data from unstructured text, such as identifying named entities (people, organizations, etc.), relationships between entities, and other key information. Information extraction is used in a wide range of applications, such as data mining, customer relationship management, and fraud detection.
Text Summarization: NLP can be used to automatically generate summaries of long texts, such as articles, reports, and news stories. Text summarization is used in a variety of applications, such as content curation, news aggregation, and information retrieval.
In the conversational AI pipeline, NLP comes into action after the ASR phase. The converted text is processed to perform the desired task, and the output proceeds to the text-to-speech conversion phase, which outputs it to the end user.
Challenges in Computational Linguistics and Conversational AI
Conversational AI is a rapidly evolving field that has seen significant advancements in recent years, but there are still several challenges that must be addressed in order to create more effective and reliable conversational systems. Some of the key challenges in conversational AI include the following:
Natural Language Understanding: One of the biggest challenges in conversational AI is accurately understanding the meaning and context of user input. NLU algorithms must be able to interpret ambiguous language, recognize sarcasm, and understand the relationship between words and phrases.
Context Awareness: Conversational AI systems must be able to recognize and remember the context of the conversation, including previous exchanges and the user’s history and preferences. This requires sophisticated algorithms that can identify and track context in real time.
Personalization: Conversational systems must be able to personalize the conversation to the individual user, including their preferences, history, and unique communication style. This requires advanced machine-learning techniques that can adapt to each user over time.
Integration with Other Systems: Conversational systems must be able to integrate with other systems and platforms, such as customer relationship management (CRM) tools, e-commerce platforms, and social media. This requires robust API integration and an understanding of the underlying data structures and workflows of each system.
Ethics and Privacy: Conversational AI systems must be designed with ethics and privacy in mind, ensuring that user data is secure and that the system operates in an ethical and responsible manner. This requires a deep understanding of data privacy laws, ethical frameworks, and best practices for data management and security.
This article endeavored to explore Computational Linguistics and Conversational AI from a fundamental level. Let us have a look at the key takeaways from what we’ve learned.
Computational Linguistics is a broad field that encompasses the computational processing of human language, while Conversational AI is a subfield that aims at building systems that can perform human-like interactions based on speech commands.
It is imperative to understand the linguistic structure of speech to comprehend Computational Linguistics better. Human speech is an organization of fundamental units (phones and phonemes) that combine to form syllables, words, and ultimately utterances.
A typical conversational AI pipeline is as follows: Automatic Speech Recognition (ASR) -> Natural Language Processing (NLP) -> Text to Speech (TTS) conversion.
Despite all this progress, Computational Linguistics and Conversational AI continue to face challenges in deciphering inputs, owing to limited context awareness, personalization, and more.
That’s all for this article; feel free to leave a comment with any feedback or questions.
Since you’ve read the article up till here, I’m certain our interests do match, so please feel free to connect with me on LinkedIn for any queries or potential opportunities.