Merlyn Mind’s education-specific language models

Ashish Jagmohan and Aditya Vempaty
June 21, 2024

Over the past several months, our AI team has been developing an AI platform with a suite of large language models that are built for the unique workflows and safety needs of education.

Today we are releasing three of the models on the open-source community Hugging Face.

Our LLMs enable teachers and students to have a generative AI experience that retrieves content from curriculum chosen by the user, not from the entirety of the internet. The result is an engagement that is curriculum-aligned, hallucination-resistant, and age-appropriate.

The LLMs, and their ability to interact with specific content, are components of a broader generative AI platform for education that we’re building.

Here’s a little about how and why we built these open-source models.

Public models and instruction-tuning  

In contrast to closed models like GPT-4, several smaller foundation language models have been publicly released over the last few months. Models such as LLaMA, Pythia, Falcon, StableLM, and MPT have shown impressive capabilities at smaller sizes, from 7B to 65B parameters.

The AI community has used the open access afforded by these public base models to build fine-tuned models for specific use cases. A popular approach has been instruction- and chat-tuning, where the base models are fine-tuned on general instruction and chat datasets created via a mixture of human and AI generation. This approach has resulted in models like Vicuna, Koala, Alpaca, Falcon-Instruct, MPT-Instruct, and many others.

We’ve found that task-specific fine-tuning of small, high-performing base models yields better performance for our use cases than similarly sized general instruction-tuned models. We’ve fine-tuned three task-specific models (details below) on a 12B-parameter Pythia base:

  1. An appropriateness classification model that detects when a query is inappropriate for in-classroom discussion. 
  2. A corpus-grounded question-answering model that grounds its answers in the ingested corpus.
  3. A teacher-assistant that makes helpful recommendations based on the ongoing classroom discussion, suggesting research activities and topics for further exploration.  

As a point of comparison with a strong general instruction-tuned model: our corpus-grounded question-answering model shows a hallucination rate of roughly 1-2% in our internal benchmark evaluations, whereas the best similarly sized open-source instruction-tuned LLMs, such as the Vicuna-Evol-Instruct model, show a hallucination rate above 10% on the same benchmark. As another point of comparison, our appropriateness-classification model yields ~99% sensitivity to unsafe topics with a ~10% false-positive rate on our internal safety benchmark dataset. In contrast, the Vicuna-instruction model yields only ~44% sensitivity on the same dataset. Larger instruct models, such as Falcon-40b-Instruct and the recently released MPT-30b-instruct, give high sensitivity (~90%) but also have a high false-positive rate (50-60%) on the same benchmark.
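
To make the two classification figures above concrete, here is a minimal sketch of how sensitivity and false-positive rate can be computed from labeled predictions. The function name and the toy data are illustrative assumptions, not our benchmark or evaluation code.

```python
def sensitivity_and_fpr(labels, predictions):
    """Return (sensitivity, false-positive rate).

    labels / predictions are sequences of booleans where True means
    "inappropriate for the classroom" (the positive class).
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # unsafe, flagged
    fn = sum(1 for y, p in pairs if y and not p)      # unsafe, missed
    fp = sum(1 for y, p in pairs if not y and p)      # safe, wrongly flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # safe, passed

    # Sensitivity: share of truly unsafe queries that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # False-positive rate: share of safe queries that were flagged.
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return sensitivity, fpr


# Toy data (hypothetical): 4 unsafe queries followed by 6 safe ones.
labels = [True, True, True, True, False, False, False, False, False, False]
predictions = [True, True, True, False, True, False, False, False, False, False]

sens, fpr = sensitivity_and_fpr(labels, predictions)
print(f"sensitivity={sens:.2f}, false-positive rate={fpr:.2f}")
```

A model that flags everything reaches 100% sensitivity at a 100% false-positive rate, which is why the two numbers are reported together.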

This continuum between task/domain-specificity and general instruction-following, for small models, represents highly relevant tradeoffs between cost and accuracy and is of great interest to us. 

Training our education models 

To train our task-specific education models, we’ve used a mix of open-source and proprietary high-quality datasets, both human-written and synthetic. For the appropriateness model, we identified 100+ categories of topics deemed inappropriate for in-classroom discussion and created several queries per topic; we then added a similar number of appropriate queries. For corpus-grounded question-answering and grade-appropriate teacher assistance, we built datasets with retrieved information snippets and dialogs with related question-answer pairs, research activities, and follow-up topics. Ensuring high-quality training datasets proved to be key.

In each case, we used low-rank adaptation with quantization and gradient checkpointing for supervised fine-tuning. Careful design of task-specific input and output sequence formats made a significant difference in performance. Also significant were optimizations like the choice of the number of input and output tokens (which affected memory and training time) and inference parameter selection (trading off latency and quality). Hyperparameter tuning (such as learning rate and number of epochs) proved crucial, which in turn made it critical to optimize the fine-tuning computation.
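
As a rough illustration of why low-rank adaptation keeps fine-tuning cheap, the sketch below counts trainable parameters for a single weight matrix. Low-rank adaptation freezes the original matrix W (d_out x d_in) and trains two small factors, B (d_out x r) and A (r x d_in). The matrix size and rank used here are illustrative assumptions; the settings we actually used are not stated in this post.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-`rank` update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in);
    the original matrix stays frozen.
    """
    return rank * (d_out + d_in)


# Illustrative values only (assumptions, not our training configuration).
d = 5120   # width of one square projection matrix
r = 8      # adapter rank

full = full_finetune_params(d, d)
lora = lora_trainable_params(d, d, r)
ratio = lora / full

print(f"full fine-tune: {full:,} params")
print(f"low-rank adapter: {lora:,} params ({ratio:.4%} of full)")
```

Because the adapter adds only r * (d_out + d_in) trainable values per adapted matrix, gradients and optimizer state stay small, which is what makes repeated hyperparameter sweeps affordable.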

For benchmarking, we used a combination of classification metrics, human evaluation, and AI critique. 

  1. The appropriateness model was evaluated using standard classification metrics on held-out datasets with ground-truth labels. It yields ~99% sensitivity (i.e., the rate at which unsafe content is correctly detected) with a ~10% false-positive rate.
  2. The corpus-grounded question-answering model was evaluated, via critique, on multiple dimensions. We were especially interested in hallucination, i.e., cases where the model introduces information not included in the content used for grounding. The hallucination rate for the model was less than 3%.
  3. The teacher-assistant model was evaluated on a multi-dimensional suitability rubric, including checks that it produced the requested number of artifacts (e.g., research activities or exploratory topics) and that the produced artifacts were grade-appropriate and relevant to the classroom discussion. On held-out datasets, 95%+ of its responses were deemed “suitable” on all dimensions.
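
The "suitable on all dimensions" criterion in the last item can be sketched as follows. The dimension names mirror the checks described above, but the class, field names, and toy results are hypothetical and are not our actual rubric or data.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """Per-response rubric verdicts (hypothetical field names)."""
    has_requested_count: bool   # produced the requested number of artifacts
    grade_appropriate: bool     # artifacts fit the grade level
    relevant: bool              # artifacts relate to the classroom discussion

    def suitable(self) -> bool:
        # A response counts as suitable only if every dimension passes.
        return self.has_requested_count and self.grade_appropriate and self.relevant


def suitability_rate(results: list[RubricResult]) -> float:
    """Fraction of responses that pass all rubric dimensions."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.suitable() for r in results) / len(results)


# Toy verdicts (hypothetical): one response fails on relevance.
results = [
    RubricResult(True, True, True),
    RubricResult(True, True, True),
    RubricResult(True, True, False),
    RubricResult(True, True, True),
]
rate = suitability_rate(results)
print(f"suitable on all dimensions: {rate:.0%}")
```

Requiring every dimension to pass is stricter than averaging per-dimension scores: a single failed check marks the whole response as unsuitable.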

Try our models out

Our models are publicly available on Hugging Face. Try them out for yourself, and let us know what you think!

Merlyn Mind Appropriateness model

Merlyn Mind Corpus-qa model

Merlyn Mind Teacher-assistant model

Ashish Jagmohan is a Distinguished AI Scientist at Merlyn Mind. Aditya Vempaty is an Engineering Manager and Senior Principal AI Scientist at Merlyn Mind.