Since the advent of Transformers in 2017, Large Language Models (LLMs) have completely changed the process of training ML models for language tasks. Earlier, for a given task and a given dataset, we used to play around with various models like RNNs, LSTMs, Decision Trees, etc., training each of them on a subset of the data and testing on the rest. Whichever model gave the best accuracy was chosen as the winner. Of course, a lot of model hyper-parameters also needed to be tuned and experimented with, and for many problems feature engineering was necessary as well. But those days are almost over!
With the advent of transformer-based LLMs, we now have huge models with 100+ million parameters which do not really require that kind of experimentation. What used to be called “training” earlier is now divided into two steps: “pre-training” and “fine-tuning”. To give an analogy, the earlier (or should I say ancient!) model training process was like our vocational training or skill development programs, which train you for a specific task. In contrast, the pre-training process is like a holistic college education which gives you a broad set of skills that you can then fine-tune according to your career of choice. There is also in-context learning, especially for LLMs with 1+ billion parameters, which is more like those super-smart students who can crack exams with just a couple of hours of preparation and, most importantly, forget everything they have learnt soon after the exams are over. Yes, such folks do exist!
Let’s take a deeper look at all three: pre-training, fine-tuning and in-context learning.
The way humans learn language is very different from how a computer learns it. For humans, language is a mode of expression. We have an experience or an observation, which we wish to put into words for the sake of conversation. But for a computer, there is no such thing as an experience or an observation. There could be vision through cameras, but that is not an observation or an experience.
For a computer, it is just about input and output, and that too in bits. A computer is purely algorithmic and statistical in nature. If a computer has learnt a language model, it basically means that, given a set of words, it can predict the next word with good accuracy. And this is what the pre-training process is all about! What we call pre-training is basically training an LLM on a large amount of data (a few billion words at least) with the primary task of predicting words in a sentence. Now there are two ways in which we can do this.
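To make the prediction objective concrete, here is a toy next-word predictor built from bigram counts. This is of course nothing like a transformer internally; it is just a minimal sketch of the same statistical idea of “given the preceding words, predict the next one”, using an invented ten-word corpus.

```python
from collections import Counter, defaultdict

# Toy corpus; a real LLM would be pre-trained on billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which.
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word):
    """Return the word most frequently seen after `word` in the corpus."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat", since "cat" follows "the" most often
```

The point is only that “learning a language model” boils down to learning the statistics of word sequences; pre-training does exactly this, just with a vastly more powerful model.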
One way is called MLM (Masked Language Model), used by bi-directional models like BERT, in which a certain percentage of words in the training set are masked, and the task of the model is to predict these missing words. Note that in this task the model can see the words preceding as well as succeeding the missing word, and that’s why it’s called bi-directional. For pre-training BERT, another task called Next Sentence Prediction (NSP) was also used, but researchers have found its utility to be marginal, with MLM being good enough for all practical purposes.
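A quick sketch of how MLM training examples are built (BERT masks roughly 15% of tokens; the masking scheme below is simplified, real BERT also sometimes substitutes random tokens instead of `[MASK]`):

```python
import random

random.seed(1)  # fixed seed so the example is reproducible

def mask_tokens(tokens, mask_rate=0.15):
    """Randomly replace ~mask_rate of tokens with [MASK], keeping targets."""
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append("[MASK]")
            labels[i] = tok  # the model must predict this original token
        else:
            masked.append(tok)
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The model receives the masked sequence and is trained to recover the tokens stored in `labels`, using context from both directions around each `[MASK]`.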
There is another kind of model, called auto-regressive (e.g. GPT), which is uni-directional: it is trained to predict the next word without seeing the succeeding ones. This is because auto-regressive models are specifically designed for language generation, which makes it necessary for the model to be pre-trained in a uni-directional manner.
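The uni-directional constraint is enforced with a causal attention mask: position i may attend only to positions up to i, so the model can never peek at future tokens. A minimal sketch for a 4-token sequence:

```python
# Causal (uni-directional) attention mask: 1 means "may attend", 0 means
# "hidden". Row i is what token i is allowed to see. A bi-directional
# model like BERT would use an all-ones mask instead.
n = 4
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
for row in causal_mask:
    print(row)
```

The lower-triangular pattern is what lets an auto-regressive model generate text one token at a time without contradiction between training and generation.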
So in this pre-training process, we are not training the model for specific language tasks but only making it learn how to predict words in a sentence, which is what learning a language model is all about. This pre-training process is usually very expensive (a few thousand to more than a million dollars) and takes a long time (a few days to a few months) to complete.
Now when we want to use these pre-trained language models (PLMs) for specific tasks (e.g. sentence classification, named entity recognition, question-answering, etc.), we need to fine-tune them, which usually requires much less data (~100k words) as compared to what’s needed for pre-training (a few billion words). During the fine-tuning process, we add a task-specific layer to the PLM and carry out the usual backpropagation method using a suitable loss function. Note that during fine-tuning too, all the model parameters are updated through gradient descent, not just the task-specific layer. The reason this takes much less time as compared to pre-training is simply that the dataset required is much smaller.
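Here is a minimal sketch of the fine-tuning mechanics, with made-up numbers rather than a real PLM: a freshly added task head (a logistic-regression layer for binary classification) sits on top of the encoder’s pooled output, and its weights move via one gradient-descent step on a cross-entropy loss. In real fine-tuning the encoder’s own parameters receive gradient updates too.

```python
import math

encoder_output = [0.2, -0.1, 0.4]   # pretend pooled sentence embedding
head_weights = [0.0, 0.0, 0.0]      # freshly added task layer, zero-init
bias = 0.0

def predict(vec, w, b):
    z = sum(x * wi for x, wi in zip(vec, w)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> P(class = 1)

p = predict(encoder_output, head_weights, bias)  # 0.5: head knows nothing yet

# One gradient-descent step on binary cross-entropy for a positive example.
# The gradient of the loss w.r.t. each weight is (p - label) * x.
label, lr = 1.0, 1.0
head_weights = [w - lr * (p - label) * x
                for w, x in zip(head_weights, encoder_output)]
bias = bias - lr * (p - label)
p_after = predict(encoder_output, head_weights, bias)
print(p, p_after)  # the prediction moves toward the label
```

Real fine-tuning repeats this loop over ~100k labelled examples, with the gradients flowing all the way back through the encoder as well.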
PLMs also allow us to freeze certain layers and fine-tune the rest, and some of these tricks could be exploited to get better performance. But the general experience has been that freezing too many layers leads to poor performance. The reason PLMs have become so powerful and so popular is that the pre-training process has to be done only once in a task-agnostic manner, and then for each specific task a simple and much cheaper fine-tuning process is enough. And thanks to many AI companies, most of these pre-trained language models are available for free download from Hugging Face.
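What freezing means mechanically, as an illustrative sketch (the layer names below are hypothetical, not a real PLM’s): parameters whose names match a frozen prefix are simply excluded from the gradient update, so only the top encoder layer and the task head change.

```python
params = {
    "embeddings.w": 1.0,
    "encoder.layer1.w": 1.0,
    "encoder.layer2.w": 1.0,
    "task_head.w": 1.0,
}
frozen_prefixes = ("embeddings.", "encoder.layer1.")

def sgd_step(params, grads, lr=0.1):
    """Apply one SGD update, skipping any parameter in a frozen layer."""
    return {
        name: value if name.startswith(frozen_prefixes)
        else value - lr * grads[name]
        for name, value in params.items()
    }

grads = {name: 0.5 for name in params}  # pretend gradients from backprop
updated = sgd_step(params, grads)
print(updated)  # frozen params stay at 1.0, the rest move to 0.95
```

In PyTorch-based libraries the same effect is typically achieved by turning off gradient tracking on the frozen parameters, but the outcome is the one shown here: frozen weights never move.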
Now although task-specific fine-tuning is a relatively cheap task (a few dollars) for models like BERT with a few hundred million parameters, it becomes quite expensive for large GPT-like models which have several billion parameters. And that’s simply because during fine-tuning too, all the model parameters have to be updated, so even on a smaller dataset, fine-tuning costs for LLMs like GPT can be prohibitively large! Maybe we should call them Huge Language Models (HLMs) instead.
The saving grace for these models has been what’s called in-context or few-shot learning through prompt design. Instead of fine-tuning the model with a hundred or a thousand input texts, the model just takes a few task-specific examples (usually fewer than 10) as input, and then quickly figures out how to perform well on that task. Note that in this process, no update of the model weights happens! No backpropagation and no gradient descent! Yes, that’s what’s magical and mysterious about this process.
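A few-shot prompt is nothing more than a string: a handful of labelled examples plus the new input, packed together and sent to the model in a single call. A sketch for sentiment classification (the review texts and the `Review:`/`Sentiment:` template are invented for illustration):

```python
# In-context learning: the examples live entirely in the prompt.
# The model's weights are never touched.
examples = [
    ("This movie was fantastic!", "positive"),
    ("Utterly boring and too long.", "negative"),
    ("A delightful surprise.", "positive"),
]
query = "I want my two hours back."

prompt_lines = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
prompt_lines.append(f"Review: {query}\nSentiment:")  # model completes this line
prompt = "\n\n".join(prompt_lines)
print(prompt)
```

This string is the entire “training” the model gets; a capable LLM completes the final `Sentiment:` line from the pattern alone.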
We don’t yet fully understand how this works, but researchers have suggested that GPT-like models do some kind of Bayesian inference using the prompts. In simple words, using just a few examples, the model is able to figure out the context of the task, and then it makes the prediction using this information. There is no learning involved in the sense of changing model parameters through gradient descent, and it is more of a search in the space of all contexts previously learnt by the model during pre-training. Hopefully we will get a full grasp on this through further research using open-source LLMs like BLOOM.
So what’s next in Natural Language Processing (NLP)? Are we going to have another fancy architecture, completely different from attention-based transformers, developed by some big lab which will take NLP to the next level? And will there then be a completely new training process? Most probably not! Transformers are here to stay, and so is the training process. This is because, when looked at from the perspective of Geometric Deep Learning, the transformer is not just an arbitrary architecture that did well on language tasks by chance. It satisfies certain important properties, which will be hard to replicate in a totally different architecture. So yes, get deep into transformer-based LLMs and their training process without being afraid of this knowledge becoming redundant in the near future. Transformers are here to stay!