What is the impact of dataset size on ML models? Are large models always better, or could you have smaller, better models trained on smaller, smarter datasets?

Introduction

In the world of ML, 2019 was the year NLP had its ImageNet moment: the year applying transfer learning became practical enough to solve real problems rather than just toy research datasets.

Note: 

What is transfer learning? 

Transfer learning is a means to extract knowledge from a source setting and apply it to a different target setting.

What is Imagenet?

ImageNet is an image database organized according to the WordNet hierarchy.

Find out more about transfer learning and ImageNet here: 

  1. ImageNet
  2. Transfer Learning – Machine Learning’s Next Frontier

Since then, transfer learning has been applied to virtually every problem in NLP, from form-data parsing to automatic summarization, document classification, NER and more.

More than 300 language models have appeared since BERT, a pre-trained language model that can be fine-tuned (via transfer learning) to solve many downstream NLP problems. However, building these language models is not simple: it is expensive (GPT-3, released in 2020, cost roughly 6-8 Million USD to train) and tedious, as they require large amounts of text data in order to be useful for downstream tasks. Their training also carries a huge carbon footprint, which is not good news.

That leads to the question: is there an alternative? Can we build better, or at least similarly performing, models with a fraction of the data needed to train these large, expensive language models?

It turns out we can!

Impact of dataset size on building ML models

The size of the training data matters a lot when building a useful model. Finding the optimal dataset size is not straightforward and there is no single answer; it depends on the task at hand.

In general, having too little training data results in poorly approximated ML models. A model with a huge number of learnable parameters will overfit a small training set, and a model with too few learnable parameters will underfit it; both result in poor performance. Too little test data introduces high variance in the measured performance.
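To make this concrete, here is a minimal sketch using scikit-learn's learning_curve on a synthetic classification task. The dataset, classifier and sizes are purely illustrative, not from our experiments; the point is simply to show how the train/validation gap shrinks as the training set grows.

```python
# Illustrative only: how training-set size affects performance, using
# scikit-learn's learning_curve on a synthetic classification task.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, va in zip(train_sizes,
                     train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # With few samples the gap between train and validation accuracy is
    # large (high variance); it narrows as the training set grows.
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
```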

As mentioned earlier, the latest trend in NLP is to apply transfer learning to problems like NER, classification and summarization. In transfer learning, we first train a language model on vast amounts of unlabelled data, not for any specific end task but to learn lexical, semantic and distributional information.

Once we have a language model, it can be used downstream to build, for instance, a classifier with very few labelled samples.

Essentially, pre-training and transfer learning solve the problem of requiring a large annotated dataset, improving sample efficiency.
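As a rough illustration of that downstream step, here is a minimal sketch using the Hugging Face transformers library to fine-tune a pre-trained encoder as a classifier on a handful of labelled examples. The model name, texts and labels are placeholders for illustration, not our actual setup.

```python
# Illustrative only: fine-tuning a pre-trained encoder on a tiny labelled set.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["Suspicious login from unknown IP", "Routine system update applied"]
labels = [1, 0]  # toy labels: 1 = security-relevant, 0 = benign

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)


class TinyDataset(torch.utils.data.Dataset):
    """Wraps the tokenised examples so the Trainer can iterate over them."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=TinyDataset(texts, labels),
)
trainer.train()  # only the small labelled set is task-specific; the rest is pre-trained
```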

But a large language model does not always guarantee a good task-specific model: when the target domain is very different from the text in the pre-training corpus, the model does not adapt or perform well.

Smaller models with small datasets

Let’s say you want to build an ML model for NER and you have a small, labelled but noisy dataset. Could you use transfer learning to build models that work well in your domain?

At Elemendar, we experimented with multiple transfer-learning-based language models for NER. None of them was accurate or practical enough to be genuinely useful.

However, we have achieved better performance than those transfer-learning-based models with much simpler custom models, built with a fraction of the data that BERT uses.

Observations

Some observations from our experiments so far on building NLP models on cybersecurity domain data:

  1. Simpler models trained on high-quality data are better than larger pre-trained models.
  2. Simpler models, such as a BiLSTM-CRF, are much easier to serve in production than something like BERT, which is massive in size (a sketch of such a tagger follows this list).
  3. Since the available language models are built mostly on general text corpora (news, Wikipedia, generic blogs, etc.), they do not adapt well to the cybersecurity domain.
  4. Smaller models with simpler architectures work well compared to complex models like BERT or XLNet, provided the training data is of very high quality.
  5. Weak supervision is very useful for the data annotation problem: we can programmatically create labelled datasets to address the cold-start (no annotations) problem, as in the toy sketch below.
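
To illustrate the weak supervision idea from point 5, here is a toy sketch of keyword-based labelling functions combined by majority vote. The rules and label names are made up for illustration; they are not our production heuristics.

```python
# Illustrative only: noisy labelling functions combined by majority vote.
from collections import Counter

MALWARE, BENIGN, ABSTAIN = "MALWARE", "BENIGN", None

def lf_known_families(token):
    # Votes MALWARE if the token matches a small list of known family names.
    return MALWARE if token.lower() in {"emotet", "trickbot", "wannacry"} else ABSTAIN

def lf_common_words(token):
    # Votes BENIGN for everyday vocabulary that is unlikely to be an entity.
    return BENIGN if token.lower() in {"the", "report", "email", "server"} else ABSTAIN

LABELLING_FUNCTIONS = [lf_known_families, lf_common_words]

def weak_label(token):
    """Majority vote over the labelling functions; None if every LF abstains."""
    votes = [v for v in (lf(token) for lf in LABELLING_FUNCTIONS) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print([(t, weak_label(t)) for t in "The Emotet report".split()])
# [('The', 'BENIGN'), ('Emotet', 'MALWARE'), ('report', 'BENIGN')]
```

In practice these weak labels are noisy, but they give a model something to learn from before any hand annotation exists.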

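And to illustrate the kind of simpler architecture mentioned in point 2, here is a minimal BiLSTM-CRF tagger sketch in PyTorch, using the third-party pytorch-crf package. The vocabulary size, tag set and hyperparameters are placeholders, not the values used in our models.

```python
# Illustrative only: a minimal BiLSTM-CRF sequence tagger.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embedding_dim=100, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        feats, _ = self.lstm(self.embedding(tokens))
        return -self.crf(self.emissions(feats), tags, mask=mask, reduction="mean")

    def predict(self, tokens, mask):
        # Viterbi decoding: the most likely tag sequence per sentence.
        feats, _ = self.lstm(self.embedding(tokens))
        return self.crf.decode(self.emissions(feats), mask=mask)


# Toy forward pass: a batch of 2 sentences, max length 5.
model = BiLSTMCRF(vocab_size=5000, num_tags=9)
tokens = torch.randint(1, 5000, (2, 5))
tags = torch.randint(0, 9, (2, 5))
mask = torch.ones(2, 5, dtype=torch.bool)
print(model.loss(tokens, tags, mask).item())
print(model.predict(tokens, mask))
```

A model like this is a few million parameters at most, which makes serving and retraining far cheaper than deploying a full transformer.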
Conclusion

Rather than chasing marginal accuracy improvements by searching for state-of-the-art models, you can improve your results drastically by improving the quality of your data.

Transfer-learning-based models like BERT, XLNet and ALBERT can serve as good baselines, but they might not be as useful out of the box as they are in other domains.

Your model performance is only as good as your training data.