Advancing Search Capabilities: From Lexical to Multi-Modal with Deep Lake¶
Install the main libraries
!pip install deeplake
Load the Data from Deep Lake¶
The following code opens the dataset in read-only mode from Deep Lake at the path al://activeloop/restaurant_reviews_complete. The scraped_data object then contains the complete restaurant dataset, featuring 160 restaurants and over 24,000 images, ready for data extraction and processing.
import deeplake
scraped_data = deeplake.open_read_only("al://activeloop/restaurant_reviews_complete")
print(f"Scraped {len(scraped_data)} reviews")
Scraped 18625 reviews
1) Create the Dataset and Use an Inverted Index for Filtering¶
In the first stage of this course, we’ll cover Lexical Search, a traditional and foundational approach to information retrieval.
An inverted index is a data structure commonly used in search engines and databases to facilitate fast full-text searches. Unlike a row-wise search, which scans each row of a document or dataset for a search term, an inverted index maps each unique word or term to the locations (such as document IDs or row numbers) where it appears. This setup allows for very efficient retrieval of information, especially in large datasets.
For small datasets with up to 1,000 documents, row-wise search can provide efficient performance without needing an inverted index. For medium-sized datasets (10,000+ documents), inverted indexes become useful, particularly if search queries are frequent. For large datasets of 100,000+ documents, using an inverted index is essential to ensure efficient query processing and meet performance expectations.
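To make the idea concrete, here is a minimal, illustrative sketch of an inverted index in plain Python. This is a toy version of the structure Deep Lake builds internally, and the reviews list is made up for the example:
from collections import defaultdict

# Toy corpus standing in for the review column
reviews = ["great burritos", "nice place for a drink", "burritos and tacos"]

# Map each term to the set of row ids where it appears
inverted = defaultdict(set)
for row_id, text in enumerate(reviews):
    for term in text.lower().split():
        inverted[term].add(row_id)

print(inverted["burritos"])  # {0, 2}: only these rows need to be fetched
A lookup now touches only the rows listed under the query term, instead of scanning every row of the dataset.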
import deeplake
from deeplake import types
# Create a dataset
inverted_index_dataset = "local_inverted_index"
ds = deeplake.create(f"file://{inverted_index_dataset}")
We now create three columns in the dataset: restaurant_name, restaurant_review, and owner_answer. All three columns are text-based and use an inverted index to improve search efficiency.
ds.add_column("restaurant_name", types.Text(index_type=types.Inverted))
ds.add_column("restaurant_review", types.Text(index_type=types.Inverted))
ds.add_column("owner_answer", types.Text(index_type=types.Inverted))
Extract the data¶
This code extracts restaurant details from scraped_data into separate lists:

- Initialize lists: restaurant_name, restaurant_review, and owner_answer are initialized to store the respective field for each restaurant.
- Populate lists: for each entry (el) in scraped_data, the code appends el['restaurant_name'] to restaurant_name, el['restaurant_review'] to restaurant_review, and el['owner_answer'] to owner_answer.

After running, each list holds a specific field from all restaurants, ready for further processing.
restaurant_name = []
restaurant_review = []
owner_answer = []

# Collect each field from every scraped record
for el in scraped_data:
    restaurant_name.append(el['restaurant_name'])
    restaurant_review.append(el['restaurant_review'])
    owner_answer.append(el['owner_answer'])
Add the data to the dataset¶
We add the collected restaurant names, reviews, and owner answers to the dataset ds. Using ds.append(), we insert three columns: "restaurant_name", "restaurant_review", and "owner_answer", populated with the values from our lists. After appending the data, ds.commit() saves the changes permanently to the dataset, ensuring all new entries are stored and ready for further processing.
ds.append({
    "restaurant_name": restaurant_name,
    "restaurant_review": restaurant_review,
    "owner_answer": owner_answer
})
ds.commit()
ds
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=18625)
Search for the restaurant using a specific word¶
We define a search query to find entries in the dataset ds where the word "burritos" appears in the restaurant_review column. The command ds.query() runs a TQL query with SELECT *, which retrieves all entries matching the condition CONTAINS(restaurant_review, '{word}'), limited here to the first 4 matches. This search filters the dataset to show only records containing the specified word (burritos) in their reviews. The results are saved in the variable view.
Deep Lake offers a high-performance SQL-based query engine for data analysis called TQL (Tensor Query Language). You can find the official documentation here.
word = 'burritos'
view = ds.query(f"""
SELECT *
WHERE CONTAINS(restaurant_review, '{word}')
LIMIT 4
""")
view
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=4)
Show the results¶
for row in view:
    print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Restaurant name: Los Amigos Review: Best Burritos i have ever tried!!!!! Wolderful!!! Restaurant name: Los Amigos Review: Really good breakfast burrito, and just burritos in general Restaurant name: Los Amigos Review: Ordered two of their veggie burritos, nothing crazy just added extra cheese and sour cream. They even repeated the order back to me and everything was fine, then when I picked the burritos up and got home they put zucchini and squash in it.. like what?? Restaurant name: Los Amigos Review: Don't make my mistake and over order. The portions are monstrous. The wet burritos are as big as a football.
AI data retrieval systems today face three challenges: limited modalities, lack of accuracy, and high costs at scale. Deep Lake 4.0 addresses these by enabling true multi-modality, enhancing accuracy, and reducing query costs by 2x with index-on-the-lake technology.
Consider a scenario where we store all our data locally on a computer. Initially, this may be adequate, but as the volume of data grows, managing it becomes increasingly challenging. The computer’s storage becomes limited, data access slows, and sharing information with others is less efficient.
To address these challenges, we can transition our data storage to the cloud using Deep Lake. Designed specifically for handling large-scale datasets and AI workloads, Deep Lake enables up to 10 times faster data access. With cloud storage, hardware limitations are no longer a concern: Deep Lake offers ample storage capacity, secure access from any location, and streamlined data sharing.
This approach provides a robust and scalable infrastructure that can grow alongside our projects, minimizing the need for frequent hardware upgrades and ensuring efficient data management.
2) Create the Dataset and Use BM25 to Retrieve the Data¶
Our advanced "Index-On-The-Lake"
technology enables sub-second query performance directly from object storage, such as S3
, using minimal compute power and memory resources. Achieve up to 10x greater cost efficiency
compared to in-memory databases and 2x faster performance
than other object storage solutions, all without requiring additional disk-based caching.
With Deep Lake, you benefit from rapid streaming columnar access to train deep learning models directly, while also executing sub-second indexed queries for retrieval-augmented generation.
In this stage, the system uses BM25 for a straightforward lexical search. This approach is efficient for retrieving documents based on exact or partial keyword matches.
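For intuition, here is a hedged, self-contained sketch of the classic Okapi BM25 scoring formula. Deep Lake's own implementation is internal to its query engine; the parameters k1 and b below are the conventional defaults, and the corpus is made up:
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms over a small corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed inverse document frequency
        tf = doc_terms.count(term)                       # term frequency in this document
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [r.lower().split() for r in ["great burritos", "good drinks", "best burritos ever"]]
print(bm25_score(["burritos"], corpus[0], corpus))
Unlike a plain CONTAINS filter, BM25 produces a relevance score, so results can be ranked rather than merely matched.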
We start by importing deeplake and setting up an organization ID org_id and dataset name dataset_name_bm25. Next, we create a new dataset with the specified name and location in Deep Lake storage.

We then add three columns to the dataset: restaurant_name, restaurant_review, and owner_answer. Each column uses a BM25 index, which optimizes it for relevance-based searches, enhancing the ability to rank results based on how well they match search terms.

Finally, we use ds_bm25.commit() to save these changes to the dataset and ds_bm25.summary() to display an overview of the dataset's structure and contents.
If you don't have a token yet, you can sign up and then log in on the official Activeloop website, then click the Create API token button to obtain a new API token. There, under Select organization, you can also find your organization ID(s).
import os, getpass
os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("Activeloop API token: ")
org_id = "<your_org_id>"
dataset_name_bm25 = "bm25_test"
ds_bm25 = deeplake.create(f"al://{org_id}/{dataset_name_bm25}")
# Add columns to the dataset
ds_bm25.add_column("restaurant_name", types.Text(index_type=types.BM25))
ds_bm25.add_column("restaurant_review", types.Text(index_type=types.BM25))
ds_bm25.add_column("owner_answer", types.Text(index_type=types.BM25))
ds_bm25.commit()
ds_bm25.summary()
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=0)
+-----------------+-----------------+
| column | type |
+-----------------+-----------------+
| restaurant_name |text (bm25 Index)|
+-----------------+-----------------+
|restaurant_review|text (bm25 Index)|
+-----------------+-----------------+
| owner_answer |text (bm25 Index)|
+-----------------+-----------------+
Add data to the dataset¶
We add data to the ds_bm25
dataset by appending the two columns, filled with values from the lists we previously created.
After appending, ds_bm25.commit()
saves the changes, ensuring the new data is permanently stored in the dataset. Finally, ds_bm25.summary()
provides a summary of the dataset's updated structure and contents, allowing us to verify that the data was added successfully.
ds_bm25.append({
    "restaurant_name": restaurant_name,
    "restaurant_review": restaurant_review,
    "owner_answer": owner_answer
})
ds_bm25.commit()
ds_bm25.summary()
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=18625)
+-----------------+-----------------+
| column | type |
+-----------------+-----------------+
| restaurant_name |text (bm25 Index)|
+-----------------+-----------------+
|restaurant_review|text (bm25 Index)|
+-----------------+-----------------+
| owner_answer |text (bm25 Index)|
+-----------------+-----------------+
Search for the restaurant using a specific sentence¶
We define a query, "I want burritos", to find relevant restaurant reviews in the dataset. Using ds_bm25.query(), we search and rank entries in restaurant_review based on BM25 similarity to the query. The code orders results by how well they match the query (BM25_SIMILARITY), from highest to lowest relevance, and limits the output to the top 6 results. The final list of results is stored in view_bm25.
query = "I want burritos"
view_bm25 = ds_bm25.query(f"""
SELECT *
ORDER BY BM25_SIMILARITY(restaurant_review, '{query}') DESC
LIMIT 6
""")
view_bm25
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=6)
Show the results¶
for row in view_bm25:
    print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Restaurant name: Los Amigos Review: Best Burritos i have ever tried!!!!! Wolderful!!! Restaurant name: Los Amigos Review: Fantastic burritos! Restaurant name: Cheztakos!!! Review: Great burritos Restaurant name: La Costeña Review: Awesome burritos! Restaurant name: La Costeña Review: Awesome burritos Restaurant name: La Costeña Review: Bomb burritos
3) Create the Dataset and use Vector Similarity Search¶
If you want to generate text embeddings for similarity search, you can choose a proprietary model like text-embedding-3-large from OpenAI, or you can opt for an open-source model. The MTEB leaderboard on Hugging Face provides a selection of open-source models that have been tested for their effectiveness at converting text into embeddings, which are numerical representations that capture the meaning and nuances of words and sentences. Using these embeddings, you can perform similarity search, grouping similar pieces of text (like sentences or documents) based on their meaning.
Selecting a model from the MTEB leaderboard offers several benefits: these models are ranked based on performance across a variety of tasks and languages, ensuring that you’re choosing a model that’s both accurate and versatile. If you prefer not to use a proprietary model, a high-performing model from this list is an excellent alternative.
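As an illustration only (this lesson uses OpenAI embeddings below), an open-source model from the leaderboard could be swapped in via the sentence-transformers library. The model name here is just an example pick, and note that its output dimension must match the embedding column you create later:
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Example open-source model; any MTEB-listed model with a compatible dimension works
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["Great burritos", "Nice place for a drink"])
print(vectors.shape)  # (2, 384) -- a different dimension than text-embedding-3-large's 3072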
We start by installing and importing the openai library to access OpenAI's API for generating embeddings. Next, we define the function embedding_function, which takes texts as input (either a single string or a list of strings) and a model name, defaulting to "text-embedding-3-large". For each text, we replace newline characters with spaces to keep the input clean and uniform. Finally, we use openai.embeddings.create() to generate embeddings for each text and return a list of these embeddings, which can be used for cosine similarity comparisons.
!pip install openai
We set the OpenAI API key in the environment using getpass.
.
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")
import openai

def embedding_function(texts, model="text-embedding-3-large"):
    # Accept a single string or a list of strings
    if isinstance(texts, str):
        texts = [texts]
    # Newlines can degrade embedding quality, so replace them with spaces
    texts = [t.replace("\n", " ") for t in texts]
    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]
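A quick, optional sanity check of the function (assumes a valid OPENAI_API_KEY is set; text-embedding-3-large produces 3072-dimensional vectors, which is why the embedding column below is created with that size):
vec = embedding_function("Great burritos")[0]
print(len(vec))  # 3072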
Create the dataset and add the columns¶
Next, we add four columns to vector_search:

- embedding: stores vector embeddings with a dimension size of 3072, enabling vector-based similarity searches.
- restaurant_name: a text column with a BM25 index, optimizing it for relevance-based text search.
- restaurant_review: another text column with a BM25 index, also optimized for efficient and ranked search results.
- owner_answer: a text column with an inverted index, allowing fast and efficient filtering based on a specific owner answer.

Finally, we use vector_search.commit() to save these new columns, ensuring the dataset structure is ready for further data additions and queries.
dataset_name_vs = "vector_indexes"
vector_search = deeplake.create(f"al://{org_id}/{dataset_name_vs}")
# Add columns to the dataset
vector_search.add_column(name="embedding", dtype=types.Embedding(3072))
vector_search.add_column(name="restaurant_name", dtype=types.Text(index_type=types.BM25))
vector_search.add_column(name="restaurant_review", dtype=types.Text(index_type=types.BM25))
vector_search.add_column(name="owner_answer", dtype=types.Text(index_type=types.Inverted))
vector_search.commit()
The following loop processes each review in restaurant_review in batches of 500 and converts it into a numerical embedding. These embeddings, stored in embeddings_restaurant_review, represent each review as a vector, enabling cosine similarity searches and comparisons within the dataset.
Deep Lake will handle the search computations, providing us with the final results.
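For intuition, this is the quantity that cosine_similarity computes inside the TQL queries used below, sketched in NumPy. Deep Lake evaluates it server-side; this snippet is illustrative only:
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # ~0.707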
# Create embeddings
batch_size = 500
embeddings_restaurant_review = []
for i in range(0, len(restaurant_review), batch_size):
    embeddings_restaurant_review += embedding_function(restaurant_review[i : i + batch_size])
# Add data to the dataset
vector_search.append({
    "restaurant_name": restaurant_name,
    "restaurant_review": restaurant_review,
    "embedding": embeddings_restaurant_review,
    "owner_answer": owner_answer
})
vector_search.commit()
vector_search.summary()
Dataset(columns=(embedding,restaurant_name,restaurant_review,owner_answer), length=18625)
+-----------------+---------------------+
| column | type |
+-----------------+---------------------+
| embedding | embedding(3072) |
+-----------------+---------------------+
| restaurant_name | text (bm25 Index) |
+-----------------+---------------------+
|restaurant_review| text (bm25 Index) |
+-----------------+---------------------+
| owner_answer |text (Inverted Index)|
+-----------------+---------------------+
Search for the restaurant using a specific sentence¶
We start by defining a search query, "A restaurant that serves good burritos.".

- Generate an embedding for the query: we call embedding_function(query) to generate an embedding for the query. Since embedding_function returns a list, we access the first (and only) item with [0], storing the result in embed_query.
- Convert the embedding to a string: we convert embed_query (a list of numbers) into a single comma-separated string using ",".join(str(c) for c in embed_query). This stores the embedding as a formatted string in str_query, ready for interpolation into the TQL query.
query = "A restaurant that serves good burritos."
embed_query = embedding_function(query)[0]
str_query = ",".join(str(c) for c in embed_query)
- Define the query with cosine similarity: we construct a TQL query (query_vs) to search within the vector_search dataset. The query calculates the cosine similarity between the embedding column and str_query, the embedding of our query "A restaurant that serves good burritos.". This similarity score (score) measures how closely each entry matches our query.
- Order by score and limit results: the query orders results by score in descending order, showing the most relevant matches first. We limit the results to the top 3 matches to focus on the best results.
- Execute the query: vector_search.query(query_vs) runs the query on the dataset, storing the output in view_vs, which contains the top 3 most similar entries based on cosine similarity. This retrieves the records in vector_search most relevant to our query.
query_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 3
"""
view_vs = vector_search.query(query_vs)
view_vs
Dataset(columns=(embedding,restaurant_name,restaurant_review,owner_answer,row_id,score), length=3)
for row in view_vs:
    print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Restaurant name: Cheztakos!!! Review: Great burritos Restaurant name: Los Amigos Review: Nice place real good burritos. Restaurant name: La Costeña Review: Awesome burritos
If we want to filter for a specific owner answer, such as "Thank you", we set word = "Thank you" to define the desired answer. Here, the inverted index on the owner_answer column lets us efficiently filter results on this value.
word = "Thank you"
query_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
WHERE CONTAINS(owner_answer, '{word}')
ORDER BY score DESC
LIMIT 3
"""
view_vs = vector_search.query(query_vs)
view_vs
Dataset(columns=(embedding,restaurant_name,restaurant_review,owner_answer,row_id,score), length=3)
for row in view_vs:
    print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']} \nOwner Answer: {row['owner_answer']}")
Restaurant name: Taqueria La Espuela Review: My favorite place for super burrito and horchata Owner Answer: Thank you for your continued support! Restaurant name: Chaat Bhavan Mountain View Review: Great place with good food Owner Answer: Thank you for your positive feedback! We're thrilled to hear that you had a great experience at our restaurant and enjoyed our delicious food. Your satisfaction is our priority, and we can't wait to welcome you back for another wonderful dining experience. Thanks, Team Chaat Bhavan Restaurant name: Chaat Bhavan Mountain View Review: Good food. Owner Answer: Thank you for your 4-star rating! We're glad to hear that you had a positive experience at our restaurant. Your feedback is valuable to us, and we appreciate your support. If there's anything specific we can improve upon to earn that extra star next time, please let us know. We look forward to serving you again soon. Thanks, Team Chaat Bhavan
4) Explore Results with Hybrid Search¶
In this stage, the system enhances its search capabilities by combining BM25 with Approximate Nearest Neighbors (ANN) for a hybrid search. This approach blends lexical search with semantic search, improving relevance by considering both keywords and semantic meaning. The introduction of a Large Language Model (LLM) allows the system to generate text-based answers, delivering direct responses instead of simply listing relevant documents.
We open the vector_search dataset to perform a hybrid search. First, we define a query, "I feel like a drink", and generate its embedding using embedding_function(query)[0]. We then convert this embedding into a comma-separated string embedding_string, preparing it for use in combined text and vector-based searches.
vector_search = deeplake.open(f"al://{org_id}/{dataset_name_vs}")
Search for the correct restaurant using a specific sentence¶
query = "I feel like a drink"
embed_query = embedding_function(query)[0]
embedding_string = ",".join(str(c) for c in embed_query)
We create two queries:

- Vector search (tql_vs): calculates cosine similarity with embedding_string and returns the top 5 matches by score.
- BM25 search (tql_bm25): ranks restaurant_review by BM25 similarity to query, also limited to the top 5.

We then execute both queries, storing vector results in vs_results and BM25 results in bm25_results. This allows us to compare results from both search methods.
tql_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{embedding_string}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 5
"""
tql_bm25 = f"""
SELECT *, BM25_SIMILARITY(restaurant_review, '{query}') AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
ORDER BY BM25_SIMILARITY(restaurant_review, '{query}') DESC
LIMIT 5
"""
vs_results = vector_search.query(tql_vs)
bm25_results = vector_search.query(tql_bm25)
vs_results
Dataset(columns=(embedding,restaurant_name,restaurant_review,owner_answer,row_id,score), length=5)
bm25_results
Dataset(columns=(embedding,restaurant_name,restaurant_review,owner_answer,row_id,score), length=5)
Show the scores¶
for el_vs in vs_results:
    print(f"vector search score: {el_vs['score']}")

for el_bm25 in bm25_results:
    print(f"bm25 score: {el_bm25['score']}")
vector search score: 0.5322654247283936 vector search score: 0.46281781792640686 vector search score: 0.4580579102039337 vector search score: 0.45585304498672485 vector search score: 0.4528498649597168 bm25 score: 13.076177597045898 bm25 score: 11.206666946411133 bm25 score: 11.023599624633789 bm25 score: 10.277934074401855 bm25 score: 10.238584518432617
First, we import the required libraries and set up two pieces:

- Document class: defined with pydantic.BaseModel, each Document has an id, a data dictionary, and an optional score used for ranking.
- softmax function: normalizes a list of scores (retrieved_score) using the softmax formula. Scores are exponentiated, capped at max_weight to avoid overflow, and then normalized to sum to 1, returning new_weights, a list of normalized scores.
Install the required libraries
!pip install numpy pydantic
import math
import numpy as np
from typing import Any, Dict, List, Optional
from pydantic import BaseModel
class Document(BaseModel):
    id: str
    data: Dict[str, Any]
    score: Optional[float] = None

def softmax(retrieved_score: List[float], max_weight: int = 700) -> List[float]:
    # Exponentiate each score, capping at max_weight to avoid overflow
    exp_scores = [math.exp(min(score, max_weight)) for score in retrieved_score]
    # Sum of the exponentials for normalization
    sum_exp_scores = sum(exp_scores)
    # Normalize so the weights sum to 1
    new_weights = [score / sum_exp_scores for score in exp_scores]
    return new_weights
Normalize the score¶
- Apply softmax to the scores: we extract the score values from vs_results and bm25_results and apply softmax to them, storing the results in vss and bm25s. This scales both sets of scores for easy comparison.
- Create document dictionaries: we create dictionaries docs_vs and docs_bm25 to store documents from vs_results and bm25_results, respectively. For each result, we add the restaurant_name and restaurant_review along with the normalized score. Each document is keyed by its row_id.

This standardizes the scores and organizes the results, allowing comparison across both vector and BM25 search methods.
vs_score = vs_results["score"]
bm_score = bm25_results["score"]
vss = softmax(vs_score)
bm25s = softmax(bm_score)
print(vss)
print(bm25s)
[0.21224761685297047, 0.19800771415362647, 0.1970674552539808, 0.19663342673946818, 0.19604378699995426] [0.7132230191866898, 0.10997834807700335, 0.09158030054295993, 0.04344738382536802, 0.04177094836797888]
vs_results
Dataset(columns=(embedding,restaurant_name,restaurant_review,owner_answer,row_id,score), length=5)
docs_vs = {}
docs_bm25 = {}

for el, score in zip(vs_results, vss):
    docs_vs[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)

for el, score in zip(bm25_results, bm25s):
    docs_bm25[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)
We define weights for our hybrid search: VECTOR_WEIGHT and LEXICAL_WEIGHT are both set to 0.5, giving equal importance to vector-based and BM25 scores.

- Initialize the results dictionary: we create an empty dictionary, results, to store documents with their combined scores from both search methods.
- Combine scores: we iterate over the unique document IDs from docs_vs and docs_bm25. For each document, we add it to results, defaulting to whichever version is available (vector or BM25). We then calculate a weighted score: vs_score from the vector results (if the ID is present in docs_vs) and bm_score from the BM25 results (if present in docs_bm25). The final results[k].score is the sum of vs_score and bm_score.

This produces a fused score for each document in results, ready for ranking in the hybrid search.
docs_vs
{'17502': Document(id='17502', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Nice place for a drink'}, score=0.21224761685297047), '17444': Document(id='17444', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks, good food'}, score=0.19800771415362647), '4022': Document(id='4022', data={'restaurant_name': 'Eureka! Mountain View', 'restaurant_review': 'Good drinks and burgers'}, score=0.1970674552539808), '17426': Document(id='17426', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks an easy going bartenders'}, score=0.19663342673946818), '5136': Document(id='5136', data={'restaurant_name': 'Scratch', 'restaurant_review': 'Just had drinks. They were good!'}, score=0.19604378699995426)}
docs_bm25
{'3518': Document(id='3518', data={'restaurant_name': 'Olympus Caffe & Bakery', 'restaurant_review': 'I like the garden to sit down with friends and have a drink.'}, score=0.7132230191866898), '2637': Document(id='2637', data={'restaurant_name': 'Mifen101 花溪米粉王', 'restaurant_review': 'Feel like I’m back in China.'}, score=0.10997834807700335), '11383': Document(id='11383', data={'restaurant_name': 'Ludwigs Biergarten Mountain View', 'restaurant_review': 'Beer is fresh tables are big feel like a proper beer garden'}, score=0.09158030054295993), '2496': Document(id='2496', data={'restaurant_name': 'Seasons Noodles & Dumplings Garden', 'restaurant_review': 'Comfort food, excellent service! Feel like back to home.'}, score=0.04344738382536802), '10788': Document(id='10788', data={'restaurant_name': 'Casa Lupe', 'restaurant_review': 'Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos'}, score=0.04177094836797888)}
Fusion method¶
def fusion(docs_vs: Dict[str, Document], docs_bm25: Dict[str, Document]) -> Dict[str, Document]:
    VECTOR_WEIGHT = 0.5
    LEXICAL_WEIGHT = 0.5
    results: Dict[str, Document] = {}
    for k in set(docs_vs) | set(docs_bm25):
        # Keep whichever version of the document is available
        results[k] = docs_vs.get(k) or docs_bm25.get(k)
        # Weighted contribution from each retriever (0 if the id is absent)
        vs_score = VECTOR_WEIGHT * docs_vs[k].score if k in docs_vs else 0
        bm_score = LEXICAL_WEIGHT * docs_bm25[k].score if k in docs_bm25 else 0
        results[k].score = vs_score + bm_score
    return results
results = fusion(docs_vs, docs_bm25)
results
{'2637': Document(id='2637', data={'restaurant_name': 'Mifen101 花溪米粉王', 'restaurant_review': 'Feel like I’m back in China.'}, score=0.013747293509625419), '5136': Document(id='5136', data={'restaurant_name': 'Scratch', 'restaurant_review': 'Just had drinks. They were good!'}, score=0.024505473374994282), '17426': Document(id='17426', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks an easy going bartenders'}, score=0.024579178342433523), '17444': Document(id='17444', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks, good food'}, score=0.02475096426920331), '2496': Document(id='2496', data={'restaurant_name': 'Seasons Noodles & Dumplings Garden', 'restaurant_review': 'Comfort food, excellent service! Feel like back to home.'}, score=0.005430922978171003), '4022': Document(id='4022', data={'restaurant_name': 'Eureka! Mountain View', 'restaurant_review': 'Good drinks and burgers'}, score=0.0246334319067476), '3518': Document(id='3518', data={'restaurant_name': 'Olympus Caffe & Bakery', 'restaurant_review': 'I like the garden to sit down with friends and have a drink.'}, score=0.08915287739833623), '17502': Document(id='17502', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Nice place for a drink'}, score=0.02653095210662131), '11383': Document(id='11383', data={'restaurant_name': 'Ludwigs Biergarten Mountain View', 'restaurant_review': 'Beer is fresh tables are big feel like a proper beer garden'}, score=0.011447537567869991), '10788': Document(id='10788', data={'restaurant_name': 'Casa Lupe', 'restaurant_review': 'Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos'}, score=0.00522136854599736)}
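As an aside, weighted softmax fusion is only one option: Reciprocal Rank Fusion (RRF) is a common alternative that combines ranks instead of score magnitudes. A minimal sketch, assuming the conventional constant k=60 (not used elsewhere in this lesson):
def rrf(rankings, k=60):
    """rankings: lists of document ids, each ordered best-first."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each appearance contributes 1/(k + rank), so top ranks dominate
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

# Example: fuse the ids from the vector and BM25 result lists
print(rrf([["3518", "17502", "17444"], ["17502", "2637", "3518"]]))
RRF sidesteps the need to normalize heterogeneous score scales; below we continue with the weighted softmax fusion computed above.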
We sort the results dictionary by each document's combined score in descending order, ensuring that the highest-ranking documents appear first.
sorted_documents = dict(sorted(results.items(), key=lambda item: item[1].score, reverse=True))
sorted_documents
{'3518': Document(id='3518', data={'restaurant_name': 'Olympus Caffe & Bakery', 'restaurant_review': 'I like the garden to sit down with friends and have a drink.'}, score=0.3566115095933449), '17502': Document(id='17502', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Nice place for a drink'}, score=0.10612380842648524), '17444': Document(id='17444', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks, good food'}, score=0.09900385707681324), '4022': Document(id='4022', data={'restaurant_name': 'Eureka! Mountain View', 'restaurant_review': 'Good drinks and burgers'}, score=0.0985337276269904), '17426': Document(id='17426', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks an easy going bartenders'}, score=0.09831671336973409), '5136': Document(id='5136', data={'restaurant_name': 'Scratch', 'restaurant_review': 'Just had drinks. They were good!'}, score=0.09802189349997713), '2637': Document(id='2637', data={'restaurant_name': 'Mifen101 花溪米粉王', 'restaurant_review': 'Feel like I’m back in China.'}, score=0.054989174038501676), '11383': Document(id='11383', data={'restaurant_name': 'Ludwigs Biergarten Mountain View', 'restaurant_review': 'Beer is fresh tables are big feel like a proper beer garden'}, score=0.045790150271479965), '2496': Document(id='2496', data={'restaurant_name': 'Seasons Noodles & Dumplings Garden', 'restaurant_review': 'Comfort food, excellent service! Feel like back to home.'}, score=0.02172369191268401), '10788': Document(id='10788', data={'restaurant_name': 'Casa Lupe', 'restaurant_review': 'Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos'}, score=0.02088547418398944)}
Show the results¶
We will output a list of restaurants in order of relevance, showing each name and review based on the hybrid search results.
for v in sorted_documents.values():
    print(f"Restaurant name: {v.data['restaurant_name']} \nReview: {v.data['restaurant_review']}")
Restaurant name: Olympus Caffe & Bakery Review: I like the garden to sit down with friends and have a drink. Restaurant name: St. Stephen's Green Review: Nice place for a drink Restaurant name: St. Stephen's Green Review: Good drinks, good food Restaurant name: Eureka! Mountain View Review: Good drinks and burgers Restaurant name: St. Stephen's Green Review: Good drinks an easy going bartenders Restaurant name: Scratch Review: Just had drinks. They were good! Restaurant name: Mifen101 花溪米粉王 Review: Feel like I’m back in China. Restaurant name: Ludwigs Biergarten Mountain View Review: Beer is fresh tables are big feel like a proper beer garden Restaurant name: Seasons Noodles & Dumplings Garden Review: Comfort food, excellent service! Feel like back to home. Restaurant name: Casa Lupe Review: Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos
This code completes the RAG (Retrieval-Augmented Generation) approach by generating an LLM-based answer to a user’s question, using results retrieved in the previous step. Here’s how it works:
- Setup and initialization: we import json for handling JSON responses and initialize the OpenAI client to interact with the language model.
- Define the generate_question function: this function accepts question (the user's question) and information (a list of relevant chunks retrieved in the previous step, providing context).
- System and user prompts: the system_prompt instructs the model to act as a restaurant assistant, using the provided chunks to answer clearly and without repetition, and directs it to format its response as JSON. The user_prompt combines the user's question and the information chunks.
- Generate and parse the response: using client.chat.completions.create(), the system and user prompts are sent to the LLM (specified as gpt-4o-mini). The response is parsed as JSON and the answer field is extracted. If parsing fails, False is returned.
import json
from openai import OpenAI
client = OpenAI()
def generate_question(question: str, information: list):
    system_prompt = """You are a helpful assistant specialized in providing answers to questions about restaurants. Below is a question from a user, along with the most relevant information chunks about restaurants from a Deep Lake database. Using these chunks, construct a clear and informative answer that addresses the question, incorporating key details without repeating information.
The output must be in JSON format with the following structure:
{
    "answer": "The answer to the question."
}
"""
    user_prompt = f"Here is a question from a user: {question}\n\nHere is the most relevant information about restaurants: {information}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},
    )
    try:
        content = response.choices[0].message.content
        parsed = json.loads(content)
        return parsed["answer"]  # extract the answer field
    except (json.JSONDecodeError, KeyError):
        return False
This function takes a restaurant-related question and retrieves the best response based on the given context. It completes the RAG process by combining relevant information and LLM-generated content into a concise answer.
information = [f'Review: {el["restaurant_review"]}, Restaurant name: {el["restaurant_name"]}' for el in view_vs]
result = generate_question(query, information)
result
"If you're feeling like a drink, consider visiting Taqueria La Espuela, which is known for its refreshing horchata. Alternatively, you might enjoy Chaat Bhavan Mountain View, a great place with good food and a lively atmosphere."
Let's run a search on multiple datasets¶
In