As large language models have continued to rise in prominence, many users and companies have focused on their useful ability to quickly summarize lengthy documents for easier human consumption. When the Australian Securities and Investments Commission (ASIC) looked into this potential use case, though, it found that the summaries it got from the Llama2-70B model were judged significantly worse than those written by humans.
ASIC’s proof-of-concept study, which was run in January and February, written up in March, and published in response to a Senate inquiry in May, has a number of limitations that make it hard to generalize about the summarizing capabilities of state-of-the-art LLMs in the present day. Still, the government study shows many of the potential pitfalls large organizations should consider before simply inserting LLM outputs into existing workflows.
Keeping score
For its study, ASIC teamed up with Amazon Web Services to evaluate LLMs’ ability to summarize “a sample of public submissions made to an external Parliamentary Joint Committee inquiry, looking into audit and consultancy firms.” For ASIC’s purposes, a good summary of one of these submissions would highlight any mention of ASIC, any recommendations for avoiding conflicts of interest, and any calls for more regulation, all with references to page numbers and “brief context” for explanation.
In addition to Llama2-70B, the ASIC team also considered the smaller Mistral-7B and MistralLite models in the early phases of the study. The comparison “supported the industry view that larger models tend to produce better results,” the authors write. But, as some social media users have pointed out, Llama2-70B has itself now been surpassed by newer models such as GPT-4o, Claude 3.5 Sonnet, and Llama3.1-405B, which score better on many generalized quality evaluations.
More than just choosing the biggest model, though, ASIC said it found that “adequate prompt engineering, carefully crafting the questions and tasks presented to the model, is crucial for optimal results.” ASIC and AWS also went to the trouble of adjusting behind-the-scenes model settings such as temperature, indexing, and top-k sampling. (Top-k sampling restricts the model’s choice of each next token to the k candidates it rates as most probable, then picks from that shortlist at random in proportion to those probabilities.)
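For readers curious how temperature and top-k interact, here is a minimal Python sketch of a single sampling step. It is an illustration only: ASIC's report does not publish the settings it used, and the function name `top_k_sample`, the `TOY_LOGITS` scores, and the tokens themselves are hypothetical stand-ins for the raw scores a real model would produce.

```python
import math
import random

# Hypothetical raw scores (logits) for four candidate next tokens.
TOY_LOGITS = {"regulation": 2.0, "audit": 1.5, "conflict": 0.5, "banana": -3.0}


def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample one token from the k highest-scoring candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Step 1: keep only the k candidates with the highest scores.
    shortlist = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in shortlist]

    # Step 2: divide by temperature. Values below 1 sharpen the
    # distribution toward the top candidate; values above 1 flatten it.
    scaled = [score / temperature for _, score in shortlist]

    # Step 3: softmax weights (subtracting the max keeps exp() stable).
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]

    # Step 4: draw one token in proportion to its weight.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Calling `top_k_sample(TOY_LOGITS, k=1)` always returns the single top-scoring token, while `k=2` lets the second-ranked token appear some of the time and never considers the other two. Lower settings make output more predictable; higher ones make it more varied.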
ASIC used five “business representatives” to evaluate the LLM’s summaries of five submitted documents against summaries prepared by a subject matter expert (the evaluators were not aware of the source of each summary). The AI summaries were judged significantly weaker across all five metrics used by the evaluators, including coherency/consistency, length, and focus on ASIC references. Across the five documents, the AI summaries scored an average total of seven points (on ASIC’s five-category, 15-point scale), compared to 12.2 points for the human summaries.
Missing the nuance
By far the biggest weakness of the AI summaries was “a limited ability to analyze and summarize complex content requiring a deep understanding of context, subtle nuances, or implicit meaning,” ASIC writes. One evaluator highlighted this problem by calling out an AI summary for being “wordy and pointless—just repeating what was in the submission.”
“What we found was that in general terms… the summaries were quite generic, and the nuance about how ASIC had been referenced wasn’t coming through in the AI-generated summary in the way that it was when an ASIC employee was doing the summary work,” Graham Jefferson, ASIC’s digital and transformation lead, told an Australian Senate committee regarding the results.
The evaluators also called out the AI summaries for including incorrect information, missing relevant information, or highlighting irrelevant information. The presence of AI hallucinations also meant that “the model generated text that was grammatically correct, but on occasion factually inaccurate.”
Added together, these problems mean that “assessors generally agreed that the AI outputs could potentially create more work if used (in current state), due to the need to fact check outputs, or because the original source material actually presented information better.”
Just a concept
These results might seem like a pretty conclusive point against using LLMs for summarizing, but ASIC warns that this proof-of-concept study had some significant limitations. The researchers point out that they only had one week to optimize their model, for instance, and suspect that “investing more time in this [optimization] phase may yield better and more accurate results.”
The focus on the (now-outdated) Llama2-70B also means that “the results do not necessarily reflect how other models may perform,” the authors warn. Larger models with bigger context windows and better embedding strategies may have more success, they add, because “finding references in larger documents is a notoriously hard task for LLMs.”
Despite the results, ASIC says it still believes “there are opportunities for Gen AI as the technology continues to advance… Technology is advancing in this area and it is likely that future models will improve performance and accuracy of results.”