Microsoft enhances Bing search with new language models, claiming to reduce costs while delivering faster, more accurate results.
- Bing combines large and small language models to enhance search.
- Using NVIDIA technology, Bing reduced operational costs and improved latency.
- Bing says the update improves speed without compromising result quality.
Microsoft has announced updates to Bing’s search infrastructure, incorporating large language models (LLMs), small language models (SLMs), and new optimization techniques.
This update aims to improve performance and reduce costs in search result delivery.
In an announcement, the company states:
“At Bing, we are always pushing the boundaries of search technology. Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search capabilities. While transformer models have served us well, the growing complexity of search queries necessitated more powerful models.”
Performance Gains
Using LLMs in search systems can create problems with speed and cost.
To solve these problems, Bing has trained SLMs, which it claims deliver roughly 100 times the throughput of LLMs.
The announcement reads:
“LLMs can be expensive to serve and slow. To improve efficiency, we trained SLM models (~100x throughput improvement over LLM), which process and understand search queries more precisely.”
Bing also uses NVIDIA TensorRT-LLM to optimize how its SLMs run.
TensorRT-LLM is a library that reduces the time and cost of running large models on NVIDIA GPUs.
Impact On “Deep Search”
According to a technical report from Microsoft, integrating NVIDIA’s TensorRT-LLM technology has enhanced the company’s “Deep Search” feature.
Deep Search leverages SLMs in real time to provide relevant web results.
Before optimization, Bing’s original transformer model had a 95th percentile latency of 4.76 seconds per batch (20 queries) and a throughput of 4.2 queries per second per instance.
With TensorRT-LLM, the latency was reduced to 3.03 seconds per batch, and throughput increased to 6.6 queries per second per instance.
This represents a 36% reduction in latency and, according to Microsoft, a 57% reduction in operational costs.
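The percentage figures follow directly from the reported measurements. A quick sketch of the arithmetic (using only the numbers stated above; note the 57% cost figure is Microsoft's own claim, while the throughput numbers imply a 57% gain in queries served per instance):

```python
# Check the relative changes implied by the reported Deep Search metrics.
baseline_latency_s = 4.76   # 95th percentile latency per 20-query batch (original model)
optimized_latency_s = 3.03  # same metric with TensorRT-LLM

baseline_qps = 4.2          # queries/sec per instance (original model)
optimized_qps = 6.6         # queries/sec per instance with TensorRT-LLM

latency_reduction = (baseline_latency_s - optimized_latency_s) / baseline_latency_s
throughput_gain = (optimized_qps - baseline_qps) / baseline_qps

print(f"Latency reduction: {latency_reduction:.0%}")  # ~36%
print(f"Throughput gain:   {throughput_gain:.0%}")    # ~57%
```

More queries per second from the same hardware means fewer instances are needed for the same load, which is how a throughput gain translates into lower serving cost.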
The company states:
“… our product is built on the foundation of providing the best results, and we will not compromise on quality for speed. This is where TensorRT-LLM comes into play, reducing model inference time and, consequently, the end-to-end experience latency without sacrificing result quality.”
Benefits For Bing Users
This update brings several potential benefits to Bing users:
- Faster search results with optimized inference and quicker response times
- Improved accuracy through enhanced SLM capabilities, delivering more contextualized results
- Cost efficiency, allowing Bing to invest in further innovations and improvements
Why Bing’s Move to LLM/SLM Models Matters
Bing’s shift to a combined LLM/SLM approach with TensorRT-LLM optimization could impact the future of search.
As users ask more complex questions, search engines need to better understand and deliver relevant results quickly. Bing aims to do that using smaller language models and advanced optimization techniques.
While we’ll have to wait and see the full impact, Bing’s move sets the stage for a new chapter in search.
Featured Image: mindea/Shutterstock
By Matt G. Southern, Senior News Writer at Search Engine Journal