# Vector Databases Lab

In this lab, we will explore how vector databases work and how to use them effectively. You can run this lab in a Jupyter notebook, either locally or on a cloud service such as Google Colab.

We will use the [Chroma]() vector database in embedded mode, together with embeddings of natural-language text computed by the sentence transformers library.
You can install the required libraries using pip:

In [None]:
%pip install chromadb sentence-transformers pandas tqdm notebook

Here is an example of how embeddings can be created and stored using a sentence transformers model. We then query them the naive way, going vector by vector and printing the distance to each one. We use L2 (Euclidean) distance as the similarity measure because that is the default in ChromaDB; a smaller distance means the texts are more similar. You can experiment with other measures, such as cosine similarity.

In [None]:
from sentence_transformers import SentenceTransformer

# Load a small embedding model (other models are listed at https://www.sbert.net/docs/pretrained_models.html)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Turn text into numbers
sentences = ["This is the Large Scale Data Management Course", "We are learning about vector databases", "Chroma is a great open-source vector database"]
embeddings = model.encode(sentences)

print(f"Vector dimension: {len(embeddings[0])}")
#If you want, you can print the first vector by uncommenting the next line
#print(f"First vector: {embeddings[0]}")

# Computing similarity between vectors manually using L2 distance (used by default in Chroma)
import numpy as np
def euclidean_distance(vec1, vec2):
    return np.linalg.norm(vec1 - vec2)

for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        # This is a distance, not a similarity: smaller means closer
        dist = euclidean_distance(embeddings[i], embeddings[j])
        print(f"Euclidean distance between sentence {i} and {j}: {dist}")
# You can already see that the last two sentences are more similar to each other than to the first one.

query_sentence = "I want to learn about vector databases"
# Encode the query once instead of re-encoding it on every loop iteration
query_vec = model.encode([query_sentence])[0]

for i in range(len(sentences)):
    dist = euclidean_distance(query_vec, embeddings[i])
    print(f"Euclidean distance between query and sentence {i}: {dist}")
# You can see that the second and third sentences are more similar to the query than the first one.


Let us do the same using the Chroma vector database. Notice that the distances differ from the ones above (Chroma's default `l2` space reports the squared L2 distance), but the ranking of the results is the same.
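
To see why a different distance scale need not change the ranking, here is a minimal sketch using only the Python standard library. It assumes that the value reported by Chroma's default `l2` space is the squared Euclidean distance, as its documentation describes. The toy vectors and helper names below are made up for illustration and are not part of the lab code.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_l2(a, b):
    """Squared Euclidean distance (same as l2, without the square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (larger means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Hypothetical toy "embeddings", normalized to unit length
toy_docs = [unit(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8])]
toy_query = unit([0.5, 0.9, 0.1])

indices = range(len(toy_docs))
rank_l2 = sorted(indices, key=lambda i: l2(toy_query, toy_docs[i]))
rank_sq = sorted(indices, key=lambda i: squared_l2(toy_query, toy_docs[i]))
# Cosine is a similarity, so sort in descending order
rank_cos = sorted(indices, key=lambda i: -cosine_similarity(toy_query, toy_docs[i]))

print("L2 ranking:        ", rank_l2)
print("Squared L2 ranking:", rank_sq)
print("Cosine ranking:    ", rank_cos)
```

Squaring is monotonic for non-negative numbers, so L2 and squared L2 always agree on the order. For unit-length vectors, squared L2 equals `2 - 2 * cosine_similarity`, so cosine agrees as well in that case.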

In [None]:
import chromadb

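# Create a Chroma client (in-memory, embedded mode)
client = chromadb.Client()
# Create a fresh collection, deleting any leftover one from a previous run
try:
    client.delete_collection("lsdm_course")
except Exception:
    # The collection did not exist yet, so there is nothing to delete
    pass
collection = client.create_collection("lsdm_course")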

# Add documents and their embeddings to the collection
for i, sentence in enumerate(sentences):
    collection.add(
        documents=[sentence],
        embeddings=[embeddings[i].tolist()],
        ids=[str(i)]
    )
# Now, let's query the collection
query_embedding = model.encode([query_sentence])[0].tolist()
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=2
)

print("Chroma query results:")
# The query result already contains the documents, so no extra lookup is needed
for doc_id, distance, document in zip(results['ids'][0], results['distances'][0], results['documents'][0]):
    print(f"Document ID: {doc_id}, Distance: {distance}, Sentence: {document}")

## (Your turn) Scaling up

We will now use a real dataset to store and query vectors using the Chroma vector database: the TMDB movie dataset from [Kaggle](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata). Metadata such as title, genres, vote average, and release date enrich our documents and let us filter queries later.

* **Task 1**: Load the entire dataset and create embeddings for all the documents.
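
As a hint for the ingestion step, adding several thousand documents in one call may be slow or exceed a batch limit, so one option is to ingest in chunks. The sketch below shows only the chunking logic; the `batched` helper, the toy lists, and the batch size are assumptions for illustration, and the actual `collection.add(...)` call is left as a comment.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-ins for the real `ids` and `documents` lists of the lab
toy_ids = [str(i) for i in range(10)]
toy_documents = [f"Movie {i}: plot summary" for i in range(10)]

BATCH_SIZE = 4  # assumption: choose a value that suits your Chroma version and memory

for id_batch, doc_batch in zip(batched(toy_ids, BATCH_SIZE), batched(toy_documents, BATCH_SIZE)):
    # In the lab, a call such as
    # collection.add(ids=id_batch, documents=doc_batch, metadatas=...)
    # would go here, with the metadata list sliced the same way.
    print(f"Would ingest {len(id_batch)} documents, first id = {id_batch[0]}")
```

Slicing the ids, documents, and metadata lists with the same helper keeps the three lists aligned within each batch.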

In [None]:
import pandas as pd
import chromadb
from chromadb.utils import embedding_functions

# --- Step 1: Load the Dataset ---
print("Downloading dataset...")
# The dataset is hosted locally to avoid Kaggle authentication issues
url = "https://cloud.univ-grenoble-alpes.fr/public.php/dav/files/ZAn6WWRqgzGzaaC"
df = pd.read_csv(url)

# Select only the columns we need
df = df[['id', 'title', 'overview', 'genres', 'vote_average', 'release_date']]

# Remove movies that have no plot overview
print(f"Original count: {len(df)}")
df = df.dropna(subset=['overview'])
print(f"Cleaned count: {len(df)}")

# Combine title + overview to have a more complete context
documents = (df['title'] + ": " + df['overview']).tolist()

metadatas = df[['title', 'genres', 'vote_average', 'release_date']].to_dict(orient='records')

ids = [str(x) for x in df['id'].tolist()]

print(df)

# Initialize and ingest data into ChromaDB
chroma_client = chromadb.Client()

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

try:
    chroma_client.delete_collection("tmdb_movies")
except Exception:
    # The collection did not exist yet, so there is nothing to delete
    pass

# TODO: create the "tmdb_movies" collection with sentence_transformer_ef, assign it to `collection`, and ingest the data here

* **Task 2**: Query some vectors using metadata filters. For example, find movies with a high vote average or specific genres. You can find the documentation for the collection query method [here](https://docs.trychroma.com/collections/query). What do you notice when you try to filter using nested metadata (such as genres)?
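
If nested values turn out to be awkward to filter on, one possible workaround is to flatten them into scalar fields before ingestion. The sketch below assumes that the `genres` column holds a JSON-encoded list of objects with `id` and `name` keys; check your copy of the dataset before relying on that. The `flatten_genres` helper and the `genre_...` field names are hypothetical.

```python
import json

def flatten_genres(genres_json):
    """Turn a JSON-encoded list of {"id", "name"} objects into flat boolean flags."""
    names = [genre["name"] for genre in json.loads(genres_json)]
    return {f"genre_{name.replace(' ', '_')}": True for name in names}

# Illustrative input in the assumed format
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'
flat = flatten_genres(raw_genres)
print(flat)

# With flat boolean fields, a filter needs only scalar comparisons, for example:
where_filter = {"$and": [{"vote_average": {"$gt": 7.0}}, {"genre_Action": True}]}
print(where_filter)
```

The flattened dictionary can be merged into each movie's metadata record before it is added to the collection.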

In [None]:
# This example query filters on a flat metadata field only; it does not use nested metadata such as the genres
results = collection.query(
    query_texts=["A movie about treasure"],
    n_results=2,
    where={"vote_average": {"$gt": 7.0}} #TODO add a filter on genres as well
)

for doc_id, distance, metadata in zip(results['ids'][0], results['distances'][0], results['metadatas'][0]):
    print(f"Document ID: {doc_id}, Distance: {distance}, Title: {metadata['title']}, Genres: {metadata['genres']}, Vote Average: {metadata['vote_average']}")

* **Task 3**: Benchmark the query time of the vector database against the naive approach of going vector by vector. Do this for a random sample of 1000 movies and 10 random (or hand-chosen) queries.
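
A possible starting point for the timing harness is sketched below, using only the standard library and random vectors in place of real embeddings. The helper names, the vector dimension, and the number of repeats are assumptions. The same `mean_query_time` function could wrap a call to `collection.query` for the Chroma side of the comparison.

```python
import random
import time

def brute_force_top_k(query, vectors, k):
    """Return the indices of the k vectors closest to `query` (squared L2), scanning all of them."""
    dists = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    return [i for _, i in sorted(dists)[:k]]

def mean_query_time(search_fn, queries, repeats=3):
    """Average wall-clock seconds per query over `repeats` passes through `queries`."""
    start = time.perf_counter()
    for _ in range(repeats):
        for q in queries:
            search_fn(q)
    return (time.perf_counter() - start) / (repeats * len(queries))

random.seed(0)
DIM = 16  # toy dimension; real embeddings are much larger
toy_vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
toy_queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]

naive_time = mean_query_time(lambda q: brute_force_top_k(q, toy_vectors, k=5), toy_queries)
print(f"Naive search: {naive_time * 1e3:.3f} ms per query")
```

Repeating the queries several times and averaging reduces the influence of one-off effects such as cache warm-up.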

In [None]:
# TODO benchmark naive vs chromadb

* **Task 4**: Check the similarity of the search results when using different embedding functions from the sentence transformers library. How do the results differ? Optionally, you can plot the similarity matrix between the results of different embedding functions.
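
One simple way to quantify how much two result lists agree is the Jaccard overlap of the returned IDs. The sketch below builds a small overlap matrix from made-up results; the model names and IDs are placeholders, not actual outputs.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of IDs: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result lists are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical top-3 result IDs returned for the same query by three models
results_by_model = {
    "model_A": ["m1", "m2", "m3"],
    "model_B": ["m1", "m2", "m4"],
    "model_C": ["m5", "m2", "m6"],
}

names = list(results_by_model)
overlap_matrix = [
    [jaccard(results_by_model[row], results_by_model[col]) for col in names]
    for row in names
]

for name, row in zip(names, overlap_matrix):
    print(name, [f"{value:.2f}" for value in row])
```

Jaccard overlap ignores rank order. If the order matters for your comparison, a rank-aware measure may be a better fit.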

In [None]:
# TODO check how the embeddings change the results