For the past 4 days [1], I've been going through the notes and video lectures of Stanford's NLP course. I covered all the content, but it went by relatively quickly, so I'm going to summarize at a high level what I learned.
So what is Natural Language Processing? Well, it's processing language naturally. Processing language means taking a body of text, aka a corpus, and extracting its meaning so that we can use it in applications like chatbots or translating between languages. We aim to do this naturally, meaning we don't hard-code rules into our system and instead use "natural" representations of words, which typically involves machine learning.
The core problem of NLP is actually just creating these representations of words. We normally represent what a word is by using other words, like we do in dictionaries, but this doesn't work for computers because the computer won't know what the defining words mean either. Words in NLP are represented the same way computers represent everything else: numbers. These numbers aren't ASCII codes but vectors of an arbitrary dimension, which gives meaning room to develop. The simplest way to create these vectors is to use one-hot vectors, which are filled with zeros except for a one at the spot corresponding to the word being represented. However, this takes up a lot of space and doesn't allow meaning to develop across the dimensions. We can do much better, yet still retain lots of information, if we decrease the dimensionality and train an ML model to learn the word vectors. Intuitively, we can think of dimensions as representing properties of words, for example gender or plurality, though in reality the dimensions are much messier and more complicated.
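Here's a minimal sketch of the difference in Python, using a toy vocabulary and a randomly initialized embedding matrix as stand-ins for real trained embeddings:

```python
import numpy as np

# Toy vocabulary; real systems use tens of thousands of words.
vocab = ["king", "queen", "man", "woman", "apple"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector: all zeros except a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

# Dense embeddings: a |V| x d matrix that a model would learn during training
# (e.g. word2vec or GloVe); here it's random, so the values carry no meaning yet.
d = 4
embedding_matrix = np.random.randn(len(vocab), d)

def embed(word):
    """Look up the dense d-dimensional vector for a word."""
    return embedding_matrix[word_to_idx[word]]

print(one_hot("queen"))  # [0. 1. 0. 0. 0.] -- sparse, no notion of similarity
print(embed("queen"))    # 4 floats that would encode meaning after training
```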
The amazing part of turning words into word vectors is that the words' locations in our vector space will correspond to their meanings, allowing us to do super cool things like this:
Woman - Man = Queen - King
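In code, these analogies are usually solved with vector arithmetic plus cosine similarity. A sketch, assuming `embeddings` is a dict of word → pretrained vector (e.g. loaded from GloVe):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' by finding the word whose vector is
    closest to b - a + c (e.g. man : king :: woman : ? -> queen)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# analogy("man", "king", "woman", embeddings) would ideally return "queen".
```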
Now that we can turn words into meaningful vectors, can we make chatbots and translators yet? Not quite, because taking words like synonyms, determiners, and pronouns (e.g. "the" and "he") out of context can spell big trouble semantically, so we also have to encode the contexts of words in addition to the words themselves. This is done through attention, which stores a representation of the most relevant places in the input sentence relative to the word we're trying to encode. There is a lot of research on how best to find the spots to 'attend' to, but it is mostly learned by ML models.
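At its simplest, attention is just a weighted average of the input, where the weights come from comparing each input position against the word we're encoding. A minimal dot-product sketch (the query, keys, and values would come from a model's hidden states):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_product_attention(query, keys, values):
    """Score each input position against the query, turn the scores into
    weights, and return a weighted sum of the values: the most relevant
    positions contribute the most to the resulting context vector."""
    scores = keys @ query             # one score per input position
    weights = softmax(scores)         # attention distribution over positions
    return weights @ values, weights  # context vector, plus where we attended

# query: (d,) vector for the word being encoded;
# keys, values: (n, d) matrices with one row per word in the input sentence.
```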
Now that we have the tools to turn text into meaningful representations that include context, we should talk about applying our embeddings and making them useful. The two main parts of solving any NLP task are encoding our input text into embeddings that implicitly store meaning, and decoding these embeddings into the output we want. The decoding part depends on what task we're trying to solve (e.g. chatbots, translation). The majority of the rest of the course (and probably most of NLP research) is focused on optimizing these encoding and decoding processes.
Let's talk about optimizing the encoding part. The majority of high-performance techniques leverage machine learning, so we should talk about how machine learning is used in the context of NLP. Neural networks are used because "meaning" is complicated and can't easily be evaluated with hand-written rules. More specifically, recurrent neural networks (RNNs) are used to store "memory" of what was seen previously, which is especially important for incorporating context into our embeddings. For memory, RNNs use gated units like GRUs or LSTMs, which have weight matrices for combining the previous hidden state with the current input. Basic RNNs do not use memory efficiently, as demonstrated by the vanishing gradients problem, where in back-propagation the error gets spread thinner the further it's propagated. Some extensions to RNNs that improve performance and memory include using bidirectional nodes (explained later), deeper layers, initializing weights as identity matrices to propagate clearer error, and using ReLU to mitigate vanishing gradients.
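To make the "memory" idea concrete, here is a sketch of a single vanilla RNN step in numpy (a GRU or LSTM adds gates on top of this same pattern; the identity initialization of the recurrent weights is one of the tricks mentioned above):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One step of a vanilla RNN: mix the previous hidden state ("memory")
    with the current input word vector and squash through tanh."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

hidden_size, input_size = 8, 4
W_h = np.eye(hidden_size)                     # identity init for clearer gradient flow
W_x = np.random.randn(hidden_size, input_size) * 0.1
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in np.random.randn(10, input_size):   # a toy sequence of 10 word vectors
    h = rnn_step(h, x_t, W_h, W_x, b)         # h carries context forward through time
```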
Recent research has focused on optimizing the performance of these deep learning models, and the major breakthroughs include improvements on RNNs made by ELMo, ULMfit/GPT, transformers, and BERT. ELMo introduced bidirectional encoding, where the model considers not only previous words but also future ones. Intuitively, this is related to how we read, as we use information later in the text to inform our understanding of previous ideas and subjects. The next breakthrough was ULMfit and GPT, which introduced the notion of fine-tuning. Instead of training language models for specific tasks, these approaches train a generalized model to learn word embeddings that can then be optimized on various objectives corresponding to the task at hand (e.g. chatbots, translation). The problem then became speed, which is where transformers come in. Transformers use specialized units consisting of multi-head attention and a feed-forward neural network (FFNN), resulting in shallower layers that still embed context and allowing for parallelizable computations. The final product of all these innovations is BERT, which uses everything mentioned above, adds deep bidirectionality for improved performance, and masks words in the input during training so the model can learn from both directions without trivially seeing the words it's predicting.
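Here's a toy sketch of that masking step. Real BERT is fancier (of the ~15% of tokens it selects, it sometimes substitutes random words or leaves them unchanged), but this shows the basic idea of hiding tokens and training the model to recover them:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Hide a fraction of the input tokens; the model is then trained to
    predict the originals from the surrounding (bidirectional) context."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)      # loss is computed only on these positions
        else:
            masked.append(tok)
            targets.append(None)     # no loss on unmasked positions
    return masked, targets

print(mask_tokens("the cat sat on the mat".split()))
```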
That's it for RNNs, but many NLP tasks don't rely heavily on long-range context and instead focus on narrower, more local features. Convolutional NNs only examine phrases (fixed-size windows of words) rather than entire sentences, but they are parallelizable, much faster, and work better for localized tasks like named entity recognition and sentiment analysis. Another type of NN used is the recursive NN, which takes advantage of the nested structure of language, making it great for bubbling up sentiment. Recursive NNs require a parse tree and apply weights to combine lower levels, "bubbling up" meaning toward the root.
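As a rough illustration of the CNN idea, here's a single learned filter sliding over 3-word windows of a sentence's embeddings (the embeddings and filter are random stand-ins for trained parameters):

```python
import numpy as np

def conv_over_words(embeddings, filt):
    """Slide one filter over every window of consecutive word vectors,
    producing a feature per phrase position (one CNN feature map), then ReLU."""
    width = filt.shape[0]
    n = embeddings.shape[0]
    feats = [np.sum(embeddings[i:i + width] * filt) for i in range(n - width + 1)]
    return np.maximum(feats, 0.0)

sentence = np.random.randn(7, 50)    # 7 words, 50-dim embeddings (stand-ins)
filt = np.random.randn(3, 50)        # one filter spanning 3-word phrases
print(conv_over_words(sentence, filt).max())  # max-pool into a single sentence feature
```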
As mentioned before, decoding is dependent on the task at hand so this section will be segmented in terms of different NLP tasks. I'll also include methods of evaluation for each task. In general, the decoding architecture is similar to encoding in that it uses NNs and back-propagation to update weight matrices to optimize an objective. The objectives will differ based on the task, speaking of which...
Machine Translation (MT) is translating between languages using machines. In the past, the common method of MT was to use translation models, which translate documents (articles, paragraphs, sentences, phrases... basically whatever you want them to be) by sequentially translating smaller phrases using encoders and decoders. The problem with translation models is that they don't easily take into account the context of the phrase being translated and fail to capture long-term dependencies. To resolve this, language models (LMs) were introduced. LMs tell us how likely a given sentence or phrase is by assigning a probability to each piece of text, and they generate output sequences only after seeing the entire input, so they can use attention to capture context and long-term dependencies.
An optimization problem with LMs is decreasing the giant search space they need to go through to find translations with the highest probability of being correct. The two main methods are using search algorithms and decreasing the vocabulary. The two main search algorithms are greedy search and beam search. Greedy search goes linearly through the output and picks the most probable token at each step. It's efficient and natural, but it explores only a small section of the search space, and a single mistake can snowball and affect the rest of the translation. Beam search is the most commonly used: it maintains the best K candidates at each step, then expands the set by adding tokens from the next step and choosing the best K candidates again. Choosing a good K is another optimization problem, as a low K results in greedy-search-like behavior while a high K results in a high runtime.
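A minimal beam search sketch, assuming a hypothetical `step_fn(prefix)` that returns a dict of next-token probabilities from the model (greedy search is just the special case k=1; real decoders also add length normalization):

```python
import math

def beam_search(step_fn, start_token, end_token, k=5, max_len=20):
    """Keep the k highest-scoring partial outputs; at each step extend every
    candidate with every possible next token, then keep the best k again."""
    beams = [([start_token], 0.0)]                    # (tokens, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            (finished if tokens[-1] == end_token else beams).append((tokens, score))
        if not beams:                                 # all surviving candidates ended
            break
    return max(finished + beams, key=lambda c: c[1])  # best complete hypothesis
```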
The other way to decrease the search space for LMs is to decrease our vocabulary size, which decreases the number of potential outputs we have to search through. A simple solution is to add words to our vocabulary until we reach a threshold, then map everything else to an unknown-word token (e.g. &lt;unk&gt;).
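A quick sketch of that vocabulary truncation, keeping only the most frequent words and funneling the rest into the unknown-word token:

```python
from collections import Counter

def build_vocab(corpus_tokens, max_size=50000, unk="<unk>"):
    """Keep only the max_size most frequent words; everything else maps to <unk>."""
    counts = Counter(corpus_tokens)
    vocab = {unk: 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

def to_ids(tokens, vocab, unk="<unk>"):
    """Convert tokens to vocabulary ids, falling back to the <unk> id."""
    return [vocab.get(t, vocab[unk]) for t in tokens]
```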
Recent research in MT has focused on back-translation and unsupervised training methods. A big problem with MT is that translations have historically required human expertise, so there is much less data to use in MT than in other fields where NNs have been successful. Back-translation creates bilingual data from monolingual text by translating from one language to another, then using that output to translate in the reverse direction. This self-training seems circular but can work if we have two separate models with independent parameters. Another way to use monolingual text in MT is unsupervised learning, where we do not label any data and the model learns from the data itself. The idea is that if we train word vectors for a new language using text from that language, the embeddings should look similar to the vector space of a different language after applying an orthogonal matrix. Unsupervised MT learns this orthogonal matrix and then uses back-translation to train a translation model. Unsupervised learning is better for large amounts of text (>100k), as its performance stays roughly constant with respect to data size, while supervised learning's grows logarithmically.
Finally, for evaluation, the gold standard is and will probably always be human evaluation, even though it's costly and inefficient. Another general method of evaluation is to use the results of the current task to perform another task. For example, we can test our translations by using them in a test for dependency parsing, which is easier to evaluate. The problem with this is that we lose the finer points of the current task, since it's usually more complicated than the tasks with better evaluation methods. Finally, the machine standard for MT is the Bilingual Evaluation Understudy (BLEU). This algorithm scores translations by comparing them with n-grams (groups of n words) from human reference translations and multiplies together the ratios of matching n-grams for several values of n. Because this precision-based matching lets overly short translations score deceptively well, a brevity penalty is multiplied into the product to discourage translations shorter than the reference. BLEU works well enough but has many pitfalls, such as only working well for large texts (a single zero count for some n zeroes out the whole product) and typically having only a single reference translation to compare against.
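If you want to play with BLEU, NLTK ships an implementation; a minimal example (assuming nltk is installed), with smoothing to dodge the zero-product problem on short texts:

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = ["the cat is on the mat".split()]   # one (or more) tokenized references
hypothesis = "the cat sat on the mat".split()    # candidate translation

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```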
The next type of task we'll talk about is parsing, where our goal is to create a parse tree to analyze the syntactic structure of sentences. The two main types of parse trees are created by dependency parsing and constituency parsing. Dependency parsing lists each word as a node and displays links to its dependents along with their relationships. The two subproblems are learning (creating a model given sentences and their dependency graphs) and parsing (creating a dependency graph given a model and a sentence). Dependency parsing generally has well-defined rules, so solutions use state machines, and models are trained to predict which transitions are taken in the state machine.
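You can see a dependency parse without building anything yourself; spaCy ships a pretrained parser (assuming the library and its small English model are installed):

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")

for token in doc:
    # each word points at its head word, labeled with the dependency relation
    print(f"{token.text:>6} --{token.dep_}--> {token.head.text}")
```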
For constituency parsing, the goal is not to find dependencies between individual words but to display the syntactic structure of sentences using a recursive set of rules called a context-free grammar. Sentences are broken down into mini-sentences or phrases, which are in turn recursively broken down into their own phrases or words. Due to the recursive nature of context-free grammars, we can utilize recursive NNs to help build models for constituency parsing.
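For intuition about context-free grammars, here's a tiny toy grammar parsed with NLTK's chart parser (assuming nltk is installed); real grammars are vastly larger and learned from treebanks:

```python
import nltk

# A toy context-free grammar: sentences recursively break into phrases.
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det N
  VP  -> V NP
  Det -> 'the'
  N   -> 'dog' | 'cat'
  V   -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    tree.pretty_print()   # draws the constituency tree in ASCII
```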
As mentioned before, parsing is one of the easier tasks to evaluate, as there are sets of grammatical rules that sentences follow. However, there are some types of ambiguity, namely structure ambiguity ("I ate a fruit covered in stuff": is it me or the fruit that's covered in stuff?) and coordination ambiguity ("I like yellow apples and kiwis": does "yellow" apply only to the apples or to the kiwis too?), that make evaluation a little more complicated, but it's still much simpler than other NLP tasks.
Much of this ambiguity falls in the task of coreference resolution, where we aim to identify all mentions that refer to the same real world entity (can include people, objects, or ideas). Coreference resolution results don't seem immediately practical, but they're helpful in more complex tasks such as text understanding, machine translation, and dialogue systems.
There are two steps in coreference resolution: detecting the mentions (easy) and clustering the mentions (hard). For mention detection, there are three kinds of mentions we want to catch: pronouns, named entities, and noun phrases. For pronouns we can use part-of-speech (POS) taggers, as pronouns are literally a POS; for named entities we can use named entity recognition (NER), for which pretrained models exist; and for noun phrases we can use the aforementioned parsers. But it's not that simple, as some phrases are hard to classify as mentions, such as "it is sunny" and "the best donut in the world." While "it" and "donut" look like a pronoun and a noun, they aren't referring to anything in the real world, so they shouldn't be treated as mentions. To resolve this, we could train a classifier just to deal with bad mentions, but it's generally easier to maintain a list of bad mentions. [2]
Now that we can detect mentions, we need to cluster them so we can see which mentions refer to what. There are four methods of doing this. The first is a rule-based algorithm that uses a fixed set of rules to determine what a mention references. Hobbs' naive algorithm works well for most sentences, but it fails on Winograd schemas, sentences where you can swap one word to change the reference, as in "she poured water from the pitcher into the cup until it was [full | empty]." If we use 'full,' 'it' refers to the cup, while if we use 'empty,' 'it' refers to the pitcher. These Winograd schemas imply that we need real-world knowledge to resolve references.
One way to do this is through mention pairing, where we train a binary classifier that assigns each pair of mentions a probability of being coreferent. If the probability for a pair is above a threshold, we add a coreference link between them to build our mention reference graph. While mention pairing picks up real-world knowledge through ML, it doesn't work well when there are many occurrences of certain words like "she," since linking the current "she" to past "she"s is difficult when we try to find every prior link.
A solution to this is to predict only one antecedent for each mention. Mention ranking assigns each mention to its highest-scoring antecedent and uses a dummy NA mention to allow for a null link. The only thing left is to figure out how to assign probabilities to pairs of mentions. Non-neural approaches exist, such as a statistical classifier with features like person/number/gender agreement, semantic compatibility, syntactic constraints, recency, grammatical role, parallelism, etc. For neural approaches, we can use a simple feed-forward neural network (FFNN) or more advanced models that feed word embeddings into a CNN, which feeds into a Bi-LSTM, which then outputs span representations. Spans are subsequences of words and consist of the hidden states for the start and end words, along with an attention-based representation of the other words in the span. To calculate whether spans i and j are coreferent, we add (is i a mention?) + (is j a mention?) + (do they look coreferent?), where each term has its own weight matrix and FFNN. Basically, to calculate the probability of two mentions being coreferent, we create span representations and ask if the spans are coreferent. Note that there is a quadratic number of spans, so lots of pruning is necessary to keep the runtime reasonable.
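Here's a rough sketch of that three-term scoring function with tiny stand-in FFNNs (the shapes, pair features, and random parameters are all placeholders; real span representations come out of the Bi-LSTM described above):

```python
import numpy as np

def ffnn(x, W1, W2):
    """A tiny feed-forward scorer: one ReLU hidden layer, one output score."""
    return float(W2 @ np.maximum(W1 @ x, 0.0))

def coreference_score(span_i, span_j, params):
    """s(i, j) = s_m(i) + s_m(j) + s_a(i, j): how mention-like each span is,
    plus how coreferent the pair looks."""
    mention_i = ffnn(span_i, *params["mention"])
    mention_j = ffnn(span_j, *params["mention"])
    pair_features = np.concatenate([span_i, span_j, span_i * span_j])
    return mention_i + mention_j + ffnn(pair_features, *params["pair"])

d, h = 6, 8                              # span representation size, hidden size
rng = np.random.default_rng(0)
params = {
    "mention": (rng.normal(size=(h, d)), rng.normal(size=(1, h))),
    "pair":    (rng.normal(size=(h, 3 * d)), rng.normal(size=(1, h))),
}
print(coreference_score(rng.normal(size=d), rng.normal(size=d), params))
```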
The final and most natural method of resolving mentions is clustering, where each mention starts in its own singleton cluster and we merge a pair of clusters at each step. To determine which pairs to merge, we assign each cluster-cluster pair a score based on mention-cluster pairs, which are in turn based on mention-mention pairs. Each of these is turned into a representation, and the final cluster-cluster representation is turned into a score.
For evaluation, we can use precision and recall, where precision is the number of correct mentions in a cluster over the total number of mentions in the cluster, and recall is the number of correct mentions in a cluster over the total number of mentions in that gold category. Averaging the precision and recall scores over mentions gives the B-cubed score. Finding the right amount of clustering is a balancing act: underclustering means better precision (since the clusters are smaller and more likely to contain a single group) and overclustering means better recall (since the clusters are larger and more likely to cover a whole group).
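A small sketch of computing B-cubed precision and recall over predicted and gold clusters (mentions here are just strings for illustration, and both clusterings are assumed to cover the same mentions):

```python
def b_cubed(predicted_clusters, gold_clusters):
    """For every mention: precision is the fraction of its predicted cluster
    that is truly coreferent with it, recall is the fraction of its gold
    cluster that the predicted cluster recovered; average over all mentions."""
    pred_of = {m: c for c in predicted_clusters for m in c}
    gold_of = {m: c for c in gold_clusters for m in c}
    mentions = list(gold_of)
    precision = recall = 0.0
    for m in mentions:
        overlap = len(pred_of[m] & gold_of[m])
        precision += overlap / len(pred_of[m])
        recall += overlap / len(gold_of[m])
    return precision / len(mentions), recall / len(mentions)

pred = [frozenset({"A", "B", "C"}), frozenset({"D"})]
gold = [frozenset({"A", "B"}), frozenset({"C", "D"})]
print(b_cubed(pred, gold))   # (precision, recall)
```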
In summary, coreference is an important tool used in many other NLP tasks and is a great example of how multiple technologies and techniques can be used to innovate in a specific task.
Speaking of being multifaceted, question answering (QA) systems aim to extract information from different sources to meet a user's needs, often requiring other NLP tasks like sentiment analysis and named entity recognition to better evaluate the type of question being asked. Since we can separate QA into subtasks, one might think we could throw a bunch of existing NLP solutions together to create a comprehensive QA system, but the problem with combining NLP models is that most solutions use NNs, and this type of transfer learning is difficult for current NN architectures, implying that we need a brand-new architecture built for QA.
This new NN is called the Dynamic Memory Network (DMN) and is able to incorporate multiple techniques via episodic memory. DMNs consist of four modules. The input module takes in any information that would be useful in a QA system (e.g. articles, past questions and answers) and creates a sequence of fact representations / word embeddings. The question module reads the question and outputs an encoded representation of it. The episodic memory module runs over the input multiple times, each pass paying attention to a different subset of facts from the input and updating the growing episodic memory representation until a satisfactory answer is reached. The answer module determines when a good answer has been reached by considering the inputs and the episodic memory representation.
DMNs also work in visual question answering (questions about images or videos) by replacing the input module with a CNN.
Natural language generation (NLG) is perhaps the flashiest, most attention-grabbing task in NLP and refers to any setting where we generate text, e.g. MT, summarization, dialogue, creative writing, freeform QA, image captioning, etc. What initially made me interested in learning NLP was watching a large language model generate an article from a single headline; I was blown away by how nuanced and realistic the output was. So how does NLG work? It's the same encoder-decoder pattern we've seen before, but there are some optimizations specific to NLG that we'll discuss.
Many NLG tasks use beam search to generate output, but there's a tension when beam search is applied to NLG. As mentioned before, a high k value in beam search generally corresponds to more accurate output, but for NLG, high k values encourage safe, generic responses that score decently for most inputs. If we're building a chatbot or summarizer, we don't want repetitive text, no matter how safe it is. To increase diversity, we can use sampling-based decoding, where instead of running beam search and taking the highest-scoring output, we sample from the probability distribution. A variant of this called top-n sampling is more reliable because it samples only among the top n choices instead of the entire distribution. Another diversity knob is softmax temperature, a hyperparameter where we divide each value inside the softmax by a constant (the temperature). Higher temperatures make the probability distribution more uniform and therefore more diverse, while lower temperatures make the distribution spiky and less diverse.
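A compact sketch of both knobs, temperature and top-n sampling, applied to a model's raw scores (logits) for the next token:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_n=None, seed=None):
    """Divide the logits by the temperature (higher = flatter, more diverse;
    lower = spikier, safer), optionally keep only the top-n tokens, then
    sample from the resulting distribution instead of taking the argmax."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_n is not None:
        too_unlikely = np.argsort(probs)[:-top_n]   # indices outside the top n
        probs[too_unlikely] = 0.0
        probs /= probs.sum()
    return rng.choice(len(probs), p=probs)          # index of the sampled token

print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_n=2))
```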
Now we'll talk about specific subtasks in NLG, the first of which is summarization: generating a body of text shorter than the input that contains its main information. The two classes of strategies are extractive summarization, where we select parts of the original text to include (easier but restrictive), and abstractive summarization, where we generate new text (flexible but harder). Pre-neural solutions were mostly extractive, selecting and reordering content from the input; graph-based solutions were used to identify central sentences with many connections to other sentences. Neural solutions use the standard encoder + decoder + attention combination with additional optimizations like multi-head attention, global content selection, and reinforcement learning. The use of reinforcement learning actually serves a single purpose: to maximize the evaluation score on the ROUGE metric. ROUGE evaluates summaries by calculating n-gram overlap, similar to BLEU but without the brevity penalty, and reinforcement learning can optimize it directly where normal strategies couldn't, because the ROUGE scoring function is non-differentiable.
The next NLG task we'll discuss is dialogue, which can be used in assistive, cooperative, adversarial, casual, or therapeutic contexts. Pre-neural techniques involved predefined templates, while neural techniques are, again, the encoder + decoder combo. The problems with NNs in dialogue are similar to those in summarization: the responses are often dull and unrelated to the input. Some solutions involve boosting the scores of rare words during beam search and using a hybrid model where we condition the decoder on the input, then train a retrieve-and-refine model over a corpus of predefined responses. Repetitiveness is also a problem and can be reduced by blocking repeating n-grams during beam search.
Other NLG tasks include storytelling, where a model creates a story based on a text, image, or other type of prompt. People have also tried generating poetry, using finite state acceptors to define the possible sequences that obey a set of desired rhythm constraints and then using them to limit the output during beam search.
We briefly covered evaluation for certain NLG tasks, but there is a general strategy for NLG even though evaluation seems highly subjective and arbitrary. First, automatic evaluation for machine translation is pretty bad, but it's worse for summarization (more open-ended), and even worse for dialogue (even more open-ended)... as for poetry and NLG music, non-human evaluation seems like a lost cause. While high-level automatic evaluation for NLG is out of the question for now, we can still define more focused automatic metrics like fluency, correct style, diversity, relevance to input, length and repetition, as well as task-specific metrics (e.g. compression rate for summarization or rhythm for poetry). These are small and often superficial measures of quality but can still help in creating "better" output.
Even though human evaluation is the gold standard, it's nowhere near perfect. People are slow, expensive, inconsistent, illogical, lose concentration, misinterpret instructions, and often can't explain why they feel what they feel. Evaluative questions are subjective, so respondents will have different expectations and answers. One example of this is evaluating chatbots, where even conversations with a human standing in as the bot don't necessarily score well.
The final task we'll explore is multitask learning, a meta-task that aims to build general models that can perform multiple tasks. Multitask learning is the bridge to achieving general NLP systems that can understand and store information and perform NLP tasks with minimal training and content.
There are three supertask categories for NLP: language modeling, QA, and dialogue. We'll focus on QA, as the techniques used should apply to the other supertasks. In QA, there are 10 types of questions: factual, translation, summary, multiple choice, sentiment, feelings of subjects, short response answers, change in dialogue state, translation to a new language, and anaphora resolution. When training this model, the researchers use meta-supervised learning, where the type of question is included as part of the question itself.
The model they used takes the top techniques for each category and combines them. The structure consists of an initial encoding with character n-gram embeddings fed into a Bi-LSTM layer, then a coattention layer and two transformer layers, before separating into two branches: one is a decoder that generates a word to output, while the other uses a question pointer, context pointer, and vocabulary distribution to generate an output distribution.
To train this model, they used a round-robin approach, training on mini-batches from alternating subtasks. For content, they utilized anti-curriculum learning, where they start with hard tasks and transition to easier ones, preventing the model from getting stuck in local optima. The model ended up performing very poorly on translation, as the words from other languages weren't used in the other subtasks, so the researchers oversampled translation in the mini-batches to improve it; a sketch of that scheduling is below.
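A small sketch of round-robin mini-batch scheduling with oversampling, assuming `task_loaders` is a hypothetical dict mapping each task name to an iterator of mini-batches:

```python
from itertools import cycle

def round_robin_batches(task_loaders, oversample=(), steps=1000):
    """Alternate mini-batches across subtasks in a fixed cycle; tasks listed
    in `oversample` appear twice per cycle (e.g. translation, as above)."""
    schedule = []
    for task in task_loaders:
        schedule.append(task)
        if task in oversample:
            schedule.append(task)          # this task gets extra batches per cycle
    tasks = cycle(schedule)
    for _ in range(steps):
        task = next(tasks)
        yield task, next(task_loaders[task])   # (task name, next mini-batch)

# for task, batch in round_robin_batches(task_loaders, oversample=("translation",)):
#     train_step(task, batch)
```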
All in all, multitask learning is a step towards more general NLP systems that combine tasks and allow for completing tasks that models do not need to be explicitly trained on. This is getting closer to how humans understand language, as we slowly accumulate knowledge to be able to answer a wider range of questions, even those we've never seen before.
NLP is pretty freaking cool. For some meta-thoughts, it would be so cool for a model to generate this blog post from the notes I took. There are a ton of practical NLP applications in industry, such as chatbots, customer service, understanding health records and biomedical literature, and speech recognition, to name a few. With NLP power comes NLP risk too: NLP research can be misused for spam and better bots, and it raises privacy and security issues, ethical dilemmas, training bias, and more. Now more than ever, it's important for CS students and researchers to acquaint themselves with basic knowledge of how technology can be used against society and to consider potential consequences when building.
1: As of the time I wrote this first paragraph ↩
2: A linguistics note: coreference is when two mentions refer to the same entity, and anaphora is when a term refers to another term. Anaphora depends on something else in the text, while coreference doesn't need to. We can have anaphora without coreference, e.g. "no dancer twists her knee": "her" refers to the dancer, so it's anaphora, but "no dancer" does not refer to a real-world entity, so there is no coreference. ↩