The rise of AI applications has made the quality of search and retrieval systems increasingly critical. We conducted a detailed evaluation comparing Exa's neural search capabilities against Bing's API, focusing on each engine's ability to return relevant results for real-world, highly semantic queries. We used the Patronus AI automated evaluation suite to perform the comparison, generating aggregate metrics and handy visualizations in the process.
We chose a highly semantic query set and tested whether each engine's results semantically matched the search query. We describe our methodology below.
We first constructed a representative evaluation dataset with the following attributes:
To keep the comparison fair, we augmented the Bing search results with page contents retrieved through Exa, since the Bing API returns only URLs. This ensures the evaluation focuses solely on the relevance of the results.
Our code to query Exa and Bing Search is shown below:
# Example implementation
from exa_py import Exa
# Bing client via the Azure Bing Web Search SDK (azure-cognitiveservices-search-websearch)
from azure.cognitiveservices.search.websearch import WebSearchClient
from msrest.authentication import CognitiveServicesCredentials

# Replace the placeholder API keys with your own credentials.
exa_client = Exa(api_key="TODO")
bing_client = WebSearchClient(
    endpoint="https://api.bing.microsoft.com",
    credentials=CognitiveServicesCredentials("TODO"),
)

query = "best online language learning apps with proven effectiveness for native english speakers learning mandarin chinese"

# Exa neural search: top 5 results with highlights and summaries
exa_results = exa_client.search_and_contents(
    query,
    type="neural",
    use_autoprompt=True,
    num_results=5,
    text=False,
    highlights=True,
    summary=True,
)

# Bing Web Search: top 5 results for the same query
bing_results = bing_client.web.search(
    query=query,
    count=5,
    text_decorations=True,
    text_format="HTML",
)
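To perform the augmentation step described earlier, the Bing result URLs can be passed to Exa's contents endpoint. The snippet below is a minimal sketch that assumes the Azure SDK's response shape (web_pages.value) and Exa's get_contents method; adapt it to the client versions you use.

# Sketch: augment Bing results with page contents fetched via Exa,
# since the Bing API returns only URLs, titles, and snippets.
bing_urls = [page.url for page in bing_results.web_pages.value]
bing_contents = exa_client.get_contents(
    bing_urls,
    text=True,       # fetch the full page text for each Bing result URL
    summary=True,    # generate summaries, matching what Exa returns natively
    highlights=True, # and highlights, so both engines are judged on the same fields
)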
Results were evaluated using an independent judge evaluator on the Patronus platform, assessing both summary quality and result relevance. This evaluator allowed us to obtain reliable evaluation results at scale, ensuring high human-AI alignment in the process. Results were evaluated on a PASS/FAIL basis, based on the following judge definition:
"Given a search query in USER INPUT, a summary of the content from the returned search result in MODEL OUTPUT, and highlights (or snippets) from the returned search results, determine whether the MODEL OUTPUT or RETRIEVED CONTEXT provide useful and relevant information related to the USER INPUT."
We ran the following code to kick off an evaluation with the Patronus experiments framework:
# Example implementation
from patronus import Client

patronus_client = Client(api_key="TODO")

# Remote judge evaluator configured with the query-result relevance criteria above
query_result_relevance = patronus_client.remote_evaluator(
    evaluator_id_or_alias="judge",
    criteria="is-search-query-result-relevant",
)

# Run one experiment per search engine so the results can be compared side by side
patronus_client.experiment(
    project_name="web-search-comparison",
    data=exa_results,
    evaluators=[query_result_relevance],
    experiment_name="exa",
)
patronus_client.experiment(
    project_name="web-search-comparison",
    data=bing_results,
    evaluators=[query_result_relevance],
    experiment_name="bing",
)
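For reference, each experiment row pairs the original search query with the content returned for that query, mirroring the judge definition's USER INPUT, MODEL OUTPUT, and RETRIEVED CONTEXT. Below is a minimal sketch of how a single Exa response could be mapped into such rows; the field names and the exa_rows helper are our assumptions for illustration, not a confirmed Patronus schema.

# Sketch (hypothetical): build experiment rows from an Exa response.
# The keys mirror the judge definition's USER INPUT, MODEL OUTPUT, and
# RETRIEVED CONTEXT, and are assumptions rather than confirmed field names.
exa_rows = [
    {
        "evaluated_model_input": query,                   # the search query
        "evaluated_model_output": result.summary,         # summary of the result page
        "evaluated_model_retrieved_context": result.highlights,  # highlight snippets
    }
    for result in exa_results.results
]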
Exa outperformed Bing Search in search result relevance: the Comparisons view shows a pass rate of 60% for Exa versus 38% for Bing.
Let's dig into some example queries to understand the performance differences!
Query: “best online language learning apps with proven effectiveness for native english speakers learning mandarin chinese”
Exa's result recommended Ninchanese for native English speakers learning Mandarin Chinese. Patronus scored the result as PASS because it is directly relevant to the user query.
Bing's result listed general language learning apps for 2024, and Patronus scored it as FAIL. The evaluator's explanation shows why: the results were general in scope and not specific to native English speakers learning Mandarin Chinese.
These examples point to three dimensions along which the two APIs differ:
1. Semantic Understanding
2. Result Relevance
3. Content Depth
Implications for Developers
The results demonstrate clear advantages for applications that depend on the qualities above.
Our evaluation reveals that Exa's neural search capabilities provide significantly more relevant results for technical and complex queries compared to traditional search APIs. This makes it particularly valuable for applications requiring deep semantic understanding and technical content retrieval.