Until recently, to get started with creating a machine learning model, you needed data specifically prepared for your task. Preparing data for training models is usually a manual and time-consuming process, during which domain experts add instructive labels to the data that the model tries to learn. With the advent of Large Language Models (LLMs) like ChatGPT and GPT-4, this process may not always be necessary to create a first working model.
There are at least three ways that LLMs can help you circumvent the requirement of manually preparing data for your application.
Let’s look at each of those ways in more detail. We will use ChatGPT and ag_news for demonstration. ag_news is a dataset of news articles, each tagged thematically as Business, World, Sports, or Sci/Tech.
The simplest way to use a Large Language Model for your application is to ask the model to perform the task without giving it any examples of the correct input/output pairs you expect. This is referred to as zero-shot prompting in AI terminology.
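As a rough sketch, a zero-shot prompt for the ag_news categories could look like the following. The prompt wording and the helper name are illustrative assumptions, not a fixed recipe; the resulting string would be sent to a chat-completion endpoint of your choice.

```python
# Zero-shot classification sketch: the model gets the task description and
# the candidate labels, but no solved examples.
LABELS = ["Business", "World", "Sports", "Sci/Tech"]

def build_zero_shot_prompt(article: str) -> str:
    """Build a prompt asking the model to pick one ag_news label (hypothetical wording)."""
    return (
        "Classify the following news article into exactly one of these "
        f"categories: {', '.join(LABELS)}.\n"
        "Respond with the category name only.\n\n"
        f"Article: {article}"
    )

prompt = build_zero_shot_prompt(
    "Stocks rallied after the central bank held rates steady."
)
# The prompt would then be sent to an LLM API, e.g. (pseudocode, not executed):
# response = client.chat.completions.create(
#     model="gpt-4", messages=[{"role": "user", "content": prompt}]
# )
print(prompt)
```

The single-label instruction ("Respond with the category name only") keeps the reply easy to parse programmatically.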
Assuming you are happy with the performance of the model, you can integrate it into your application. The major downside of this approach is cost: using an LLM as your machine learning model may be up to 100x more expensive than using a smaller, more specialized model.
Another route is to have the LLM generate synthetic data and use it to train a smaller model to perform the task.
With this approach, you need to ensure that the synthetic data are diverse: you may need to steer the model's generations in particular directions, and then remove examples that are too similar to each other.
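One simple way to drop near-duplicate generations is to compare word overlap between examples. The sketch below uses Jaccard similarity over word sets with an arbitrary 0.8 threshold; in practice you might use embedding-based similarity instead, but the filtering logic is the same.

```python
# Near-duplicate filter for synthetic examples (illustrative threshold).

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa = {w.strip(".,!?") for w in a.lower().split()}
    sb = {w.strip(".,!?") for w in b.lower().split()}
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def filter_near_duplicates(examples, threshold=0.8):
    """Keep an example only if it is not too similar to any already-kept one."""
    kept = []
    for ex in examples:
        if all(jaccard(ex, k) < threshold for k in kept):
            kept.append(ex)
    return kept

synthetic = [
    "The central bank raised interest rates by a quarter point.",
    "The central bank raised interest rates by a quarter point today.",
    "The home team won the championship in overtime.",
]
deduped = filter_near_duplicates(synthetic)
print(deduped)  # the second sentence is dropped as a near-duplicate
```

The greedy pass is O(n²) in the number of examples, which is usually fine for a few thousand synthetic samples; larger corpora call for approximate methods such as MinHash.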
Finally, instead of using domain experts to add labels to your data, you can use an LLM. You can then use the labeled data to train a smaller model. Prompting the model for labels works much like the zero-shot approach described above.
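Since the model replies in free-form text, you typically need a small parsing step to map each reply onto a valid label before training the smaller model. The helper below is a hypothetical sketch, assuming replies that mention the category by name; ambiguous replies are rejected rather than guessed.

```python
# Map a free-form LLM reply onto one of the ag_news labels (sketch).
LABELS = ["Business", "World", "Sports", "Sci/Tech"]

def normalize_label(reply: str):
    """Return the single label mentioned in the reply, or None if ambiguous/absent."""
    reply = reply.strip().lower()
    matches = [label for label in LABELS if label.lower() in reply]
    return matches[0] if len(matches) == 1 else None

print(normalize_label("Sports"))                   # exact match
print(normalize_label("I think this is Business."))  # label embedded in a sentence
print(normalize_label("World or Sports"))          # ambiguous, rejected
```

Examples that come back as None can be re-prompted or simply excluded; a small fraction of discarded labels is usually cheaper than training on noisy ones.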