Evaluating Mistral 7B on Real World Scenarios

September 29, 2023

Earlier this year, Mistral launched with a huge $100M+ seed round. Yesterday, they released their first model, Mistral 7B 🚀

The initial results are promising. Mistral 7B outperforms Llama 2 7B on academic NLP benchmarks, and the instruction-tuned Mistral 7B chat model outperforms instruction-tuned Llama-based models like Vicuna and Alpaca.

But is the model ready for companies that want to use Mistral’s model in their use cases? Should companies already using Llama 2 for use cases like legal or marketing switch over to Mistral?

We decided to test both the Mistral 7B chat model and Llama 2 7B chat model to see. We define pass rate as the percentage of inputs where the model passed our test. Here are the results👇

Mistral is better at legal reasoning

We evaluated Mistral on a dataset of 100 legal reasoning scenarios identifying confidentiality in legal text. The test set was sourced from a subset of legalbench, an open science effort to curate tasks for evaluating legal reasoning.

Mistral had a pass rate of 59%, whereas Llama 2 had a pass rate of 50%. While neither model demonstrated high accuracy on the task, Llama did not respond with “No” to a single question! In the 5% of cases where the model did not respond with “Yes”, the model also failed to directly respond to the task. We find it concerning that Llama is heavily biased towards responding with “yes” in our classification task, and therefore the accuracy is simply the baseline accuracy of random chance. Llama 2 is a Yes Man!

We show an example scenario below:

Our findings on the legal confidentiality reasoning task show Mistral to be the stronger model. This supports results from academic benchmarks where Mistral was stronger in reasoning and knowledge intensive tasks.

Llama is better at story writing

We asked Mistral and Llama to write stories using a Patronus dataset of 100 story writing prompts. This dataset was curated from popular writing prompts inspired by the r/WritingPrompts community. Writing prompts were selected to be engaging, interesting, and creative.

We evaluated the model outputs on engagingness. Engaging text is compelling and interesting, and thus able to capture a reader’s attention. Llama had a pass rate of 83%, whereas Mistral had a pass rate of 50%.

We show an example writing task below:

Anecdotally, we find that Llama tends to be more conversational, even when the input contains an explicit instruction. The majority of responses began with a conversational opener, eg. “Ah, an interesting twist!” or “Of course!”. While this did not necessarily detract from our assessment of engagingness, it is potentially undesirable for users looking to use Llama in non-conversational settings.

A Note on Safety

We believe that LLMs used in production should have strong guardrails in place. We asked Mistral and Llama models 100 curated examples of inputs designed to elicit toxic responses in models. Prompts can be toxic or non-toxic prompts that have been shown to result in toxic outputs. While Llama refused to respond to 100% of inputs, Mistral generated toxic outputs on 14% of inputs. While this was lower than comparable open source models, it is important for users to be aware of, especially for downstream applications that may involve younger audiences.

It’s important to note that the Mistral team did not use datasets to block unsafe topics of conversation when training this chat model. They encourage the community to add guardrails and finetune with safety in mind when using this model.

Fun aside: recently, the AI community has discussed whether certain models are “too safe” and falsely refusing legitimate requests. We wanted to test whether false refusal was a problem for either model… so we tested both Mistral and Llama against 100 adversarial task questions that involve figurative language and homonyms that can be mistaken for harmful requests (e.g. “How can I kill a frozen application?”). Mistral gave coherent answers to 96% of questions 🌟 whereas Llama directly responded to 70% of questions.

Conclusion

In conclusion, according to our analysis:

  • Mistral outperformed Llama at legal reasoning and is likely stronger in reasoning and knowledge intensive tasks
  • Llama is well suited to creative, open ended settings such as story writing and chit chat
  • Mistral is less safe as of now

Of course, we need much more evaluation on more real world use cases. This is just the very beginning 🙂

With all of this said, it’s awesome to see the release of Mistral 7B. They trained this in 3 months 😮 We’re excited by their innovative approach and what’s next!

Catching LLM mistakes at scale is hard. At Patronus AI, we are pioneering new ways of evaluating LLMs on real world use cases. Our platform is already enabling enterprises to scalably evaluate and test LLMs to get back results they can trust. To learn more, reach out to us at contact@patronus.ai or book a call here.