In the fine-tuning discussion we touched, among many other things, on the question 'what kind of data do I use to fine-tune my model?'. Many datasets are available in places like Hugging Face (but always make sure the dataset is actually made for your model, i.e. that it uses the most appropriate, most legible syntax for your model). What do I do if I cannot find a dataset?
The idea is quite simple and elegant: If I have a small data sample, can I use a teacher LLM to create a new, more comprehensive dataset (= synthetic data) on that basis and use it to teach (fine-tune) a student LLM?
The results of the LLM2LLM paper seem promising, showing enhanced performance of LLMs in the low-data regime and outperforming alternative methods. LLM2LLM seems to reduce the dependence on labor-intensive data curation and pave the path toward more scalable, higher-performance LLMs, which is particularly beneficial for data-constrained domains and tasks. It's quite neat: they provide the prompts they used in the annex of the paper.
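The core loop of such a teacher-student scheme can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the `student_predict`, `teacher_augment`, and `fine_tune` functions below are hypothetical stand-ins (a real setup would call an actual student model, prompt a strong teacher LLM for variants of the examples the student got wrong, and run a genuine fine-tuning step).

```python
# Toy sketch of an LLM2LLM-style iterative loop (all functions are
# simplified stand-ins, not the paper's actual implementation).

KNOWN = set()  # stand-in for the student's learned knowledge

def student_predict(example):
    # Toy "student": answers correctly only if it was trained on the question.
    return example["answer"] if example["question"] in KNOWN else None

def teacher_augment(hard_example, n=2):
    # Toy "teacher": in practice this would prompt a strong LLM to generate
    # variations of each example the student got wrong.
    return [{"question": f"{hard_example['question']} (variant {i})",
             "answer": hard_example["answer"]} for i in range(n)]

def fine_tune(dataset):
    # Stand-in for a real fine-tuning step: the toy student simply
    # memorises the questions it is trained on.
    KNOWN.update(ex["question"] for ex in dataset)

seed_data = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(5)]
fine_tune(seed_data[:2])  # the student starts out knowing only part of the seed

train_set = list(seed_data)
for step in range(3):
    # Find the examples the current student still gets wrong.
    wrong = [ex for ex in train_set if student_predict(ex) != ex["answer"]]
    if not wrong:
        break
    # The teacher generates synthetic data only from the student's errors,
    # and the student is re-tuned on the enlarged dataset.
    train_set += [aug for ex in wrong for aug in teacher_augment(ex)]
    fine_tune(train_set)
```

The key design choice (as in the paper) is that the teacher only augments the examples the student fails on, so each iteration concentrates the synthetic data where the student is weakest.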
What everything comes down to is: what is the quality of the data and how well does the synthetic dataset represent learnings from reality?
Unfortunately, this seems to be not super straightforward:
- Irrespective of the teacher model, accuracy stays below 30% (GPT-4-Turbo, Llama-70B only 20%)
- Is there inherent bias of the (incomplete) teacher system?
- Are the teacher LLM's representations transferable to the general population of the domain-specific, data-scarce subject area?
- The sample dataset's quality has an outsized effect on the result; any error or 'bad signal' could produce a seriously mis-tuned student LLM
- There is a high risk of overfitting when starting with too small a dataset
All of this, of course, gets particularly fraught when such datasets are then used to build models in sensitive domains: hiring, performance-related reviews, healthcare, etc.
Still a cool idea that can potentially squeeze out a few percentage points of performance.
PS: an interesting mention re synthetic datasets by two Anthropic researchers: the key point from their perspective is to capture the logic and reasoning behind arriving at the data. (https://www.youtube.com/watch?v=UTuuTTnjxMQ)