The rise of AI applications has made the quality of search and retrieval systems increasingly critical. We conducted a detailed evaluation comparing Exa's neural search capabilities against Bing's API, focusing on each engine's ability to return relevant results for real-world, highly semantic queries. We used the Patronus AI automated evaluation suite to perform the comparison, generating aggregate metrics and handy visualizations in the process.
We chose a highly semantic query set and tested whether each engine's results semantically matched the search query. We describe our methodology below.
We first constructed a representative evaluation dataset with the following attributes:
To keep the comparison fair, we augmented the Bing search results with page contents retrieved through Exa, since the Bing API returns only URLs. This ensures the evaluation focuses solely on the relevance of the results.
Our code to query Exa and Bing Search is shown below:
# Example implementation
from exa_py import Exa
# Bing client via the Azure Bing Web Search SDK (azure-cognitiveservices-search-websearch)
from azure.cognitiveservices.search.websearch import WebSearchClient
from msrest.authentication import CognitiveServicesCredentials

# Replace the placeholder API keys with your own credentials.
exa_client = Exa(api_key="TODO")
bing_client = WebSearchClient(
    endpoint="https://api.bing.microsoft.com",
    credentials=CognitiveServicesCredentials("TODO"),
)

query = "best online language learning apps with proven effectiveness for native english speakers learning mandarin chinese"

# Exa neural search: top 5 results with highlights and summaries
exa_results = exa_client.search_and_contents(
    query,
    type="neural",
    use_autoprompt=True,
    num_results=5,
    text=False,
    highlights=True,
    summary=True,
)

# Bing Web Search: top 5 results for the same query
bing_results = bing_client.web.search(
    query=query,
    count=5,
    text_decorations=True,
    text_format="HTML",
)
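To perform the augmentation step described earlier, the Bing result URLs can be passed to Exa's contents endpoint. The snippet below is a minimal sketch that assumes the Azure SDK's response shape (web_pages.value) and Exa's get_contents method; adapt it to the client versions you use.

# Sketch: augment Bing results with page contents fetched via Exa,
# since the Bing API returns only URLs, titles, and snippets.
bing_urls = [page.url for page in bing_results.web_pages.value]
bing_contents = exa_client.get_contents(
    bing_urls,
    text=True,       # fetch the full page text for each Bing result URL
    summary=True,    # generate summaries, matching what Exa returns natively
    highlights=True, # and highlights, so both engines are judged on the same fields
)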
Results were evaluated using an independent judge evaluator on the Patronus platform, assessing both summary quality and result relevance. This evaluator allowed us to obtain reliable evaluation results at scale, ensuring high human-AI alignment in the process. Results were evaluated on a PASS/FAIL basis, based on the following judge definition:
"Given a search query in USER INPUT, a summary of the content from the returned search result in MODEL OUTPUT, and highlights (or snippets) from the returned search results, determine whether the MODEL OUTPUT or RETRIEVED CONTEXT provide useful and relevant information related to the USER INPUT."
We ran the following code to kick off an evaluation with the Patronus experiments framework:
# Example implementation
from patronus import Client

patronus_client = Client(api_key="TODO")

# Remote judge evaluator configured with the query-result relevance criteria above
query_result_relevance = patronus_client.remote_evaluator(
    evaluator_id_or_alias="judge",
    criteria="is-search-query-result-relevant",
)

# Run one experiment per search engine so the results can be compared side by side
patronus_client.experiment(
    project_name="web-search-comparison",
    data=exa_results,
    evaluators=[query_result_relevance],
    experiment_name="exa",
)
patronus_client.experiment(
    project_name="web-search-comparison",
    data=bing_results,
    evaluators=[query_result_relevance],
    experiment_name="bing",
)
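For reference, each experiment row pairs the original search query with the content returned for that query, mirroring the judge definition's USER INPUT, MODEL OUTPUT, and RETRIEVED CONTEXT. Below is a minimal sketch of how a single Exa response could be mapped into such rows; the field names and the exa_rows helper are our assumptions for illustration, not a confirmed Patronus schema.

# Sketch (hypothetical): build experiment rows from an Exa response.
# The keys mirror the judge definition's USER INPUT, MODEL OUTPUT, and
# RETRIEVED CONTEXT, and are assumptions rather than confirmed field names.
exa_rows = [
    {
        "evaluated_model_input": query,                   # the search query
        "evaluated_model_output": result.summary,         # summary of the result page
        "evaluated_model_retrieved_context": result.highlights,  # highlight snippets
    }
    for result in exa_results.results
]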
Exa outperformed Bing Search in search result relevance: the Comparisons view shows a pass rate of 60% for Exa versus 38% for Bing.
Let's dig into some example queries to understand the performance differences!
Query: “best online language learning apps with proven effectiveness for native english speakers learning mandarin chinese”
Exa's result recommended Ninchanese for native English speakers learning Mandarin Chinese. Patronus scored the result as PASS because it is directly relevant to the user query.
Bing's result listed general language learning apps for 2024, and Patronus scored it as FAIL. The evaluator's explanation shows why: the results were general in scope and not specific to native English speakers learning Mandarin Chinese.
These examples point to three dimensions along which the two APIs differ:
1. Semantic Understanding
2. Result Relevance
3. Content Depth
Implications for Developers
The results demonstrate clear advantages for applications that depend on the qualities above.
Our evaluation reveals that Exa's neural search capabilities provide significantly more relevant results for technical and complex queries compared to traditional search APIs. This makes it particularly valuable for applications requiring deep semantic understanding and technical content retrieval.