In the age of AI and big data, the ability to find similar items in vast datasets is crucial. Whether you're building a recommendation engine, a semantic search engine, or a fraud detection system, traditional databases often fall short when it comes to handling the complexity of modern data. This is where vector databases come in, and in this post, we'll explore how you can leverage the power of vector search directly within your PostgreSQL database using the pgvector extension.
A vector database is a specialized database designed to store, manage, and search high-dimensional data points called vectors or embeddings. These vectors are numerical representations of unstructured data like text, images, or audio, and they capture the semantic meaning of the data. This allows you to find similar items by comparing their vector representations, a task that is difficult to achieve with traditional databases that rely on exact keyword matching.
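To make this concrete, here is a toy sketch of the comparison idea: the vectors below are made-up three-dimensional "embeddings" (real models produce hundreds of dimensions), and cosine similarity scores how alike they are.

```python
from math import sqrt

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; a real model would produce these
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.15]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # near 1.0: semantically close
print(cosine_similarity(cat, car))     # much lower: semantically distant
```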

Vector databases are used in a wide range of applications, including recommendation engines, semantic search, and fraud detection systems.
pgvector is an open-source extension for PostgreSQL that adds vector similarity search capabilities to your existing relational database. This means you can store and search your vector embeddings alongside your other application data, leveraging the power of PostgreSQL's robust and feature-rich ecosystem.
With pgvector, you can store and query embeddings with plain SQL, build indexes for fast approximate search, and combine vector similarity with ordinary relational filters.
There are two main ways to get started with pgvector: using a pre-built Docker image or installing it manually.
The easiest way to get up and running with pgvector is to use a Docker image. There are several pre-built images available, such as pgvector/pgvector and ankane/pgvector, which come with PostgreSQL and the pgvector extension already installed.
Here is an example docker-compose.yml file that you can use to start a PostgreSQL container with pgvector:
services:
  db:
    image: pgvector/pgvector:pg16
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: vectordb
      POSTGRES_USER: testuser
      POSTGRES_PASSWORD: testpwd
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
You will also need to create an init.sql file that enables the vector extension and creates the facts_dataset table (which we will use to store our facts) when the container starts:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE facts_dataset (
    id SERIAL PRIMARY KEY,
    fact TEXT,
    embedding VECTOR(384)
);
With these two files in place, you can start the container by running docker-compose up -d.
For those who prefer a manual installation, you can install PostgreSQL and then build and install the pgvector extension from source.
Install PostgreSQL: If you don't already have PostgreSQL installed, you can download it from the official website or install it using your operating system's package manager. For example, on Ubuntu, you can install PostgreSQL from the PostgreSQL Apt Repository.
Install pgvector: You can install pgvector from your distribution's packages or compile it from source. For example, on Debian/Ubuntu, you can install it with apt install postgresql-XX-pgvector, where XX is your PostgreSQL version. To compile from source, you can clone the pgvector repository from https://github.com/pgvector/pgvector and follow the installation instructions.
Enable the extension: Once pgvector is installed, connect to your PostgreSQL database and run the following commands to enable the extension and create the table:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE facts_dataset (
    id SERIAL PRIMARY KEY,
    fact TEXT,
    embedding VECTOR(384)
);
To demonstrate how to use pgvector, we will use a local dataset of 50 facts stored in a file named facts.txt, containing a simple list of facts, one per line.
Create a file named facts.txt in the same directory as your script with content like this:
The capital of France is Paris
Mars is also colloquially called the Red Planet
William Shakespeare wrote the play "Romeo & Juliet"
... (and 47 more facts)
These instructions assume that uv is already installed. Use the following commands to initialize the project and
install the necessary dependencies:
uv init vectordb
cd vectordb
uv add psycopg2-binary sentence-transformers
To perform a semantic search, we need to convert our facts, and later our questions, into vector embeddings. You can use a pre-trained model from a provider like OpenAI, or an open-source model from a library like Hugging Face. The key is that you are converting text into a numerical representation.
Here is a Python script that reads your facts.txt file, generates an embedding for each fact, and inserts the data into the facts_dataset table:
import psycopg2
from sentence_transformers import SentenceTransformer

# Load the embedding model; all-MiniLM-L6-v2 produces 384-dimensional
# vectors, matching the VECTOR(384) column defined earlier
model = SentenceTransformer("all-MiniLM-L6-v2")

# Connect to your PostgreSQL database
conn = psycopg2.connect("host=localhost port=5432 dbname=vectordb user=testuser password=testpwd")
cur = conn.cursor()

# Read the facts from the text file and insert them into the database
with open("facts.txt", "r", encoding="utf-8") as f:
    for fact in f:
        fact = fact.strip()
        if not fact:
            continue  # skip blank lines
        # Pass the embedding in its text form, e.g. '[0.1, 0.2, ...]',
        # which pgvector parses into a vector value
        embedding = model.encode(fact).tolist()
        cur.execute(
            "INSERT INTO facts_dataset (fact, embedding) VALUES (%s, %s)",
            (fact, str(embedding)),
        )

# Commit the changes and close the connection
conn.commit()
cur.close()
conn.close()
print("\nSuccessfully loaded facts into the database.")
Let's name this script populate.py and run it with:
uv run populate.py
When running the script for the first time, allow a few minutes for the embedding model to download. Subsequent executions will be significantly faster as the model is cached locally.
Now that our facts data is in the database, we can ask a new question, find the most similar stored fact, and return its answer. pgvector provides several operators for calculating the distance between vectors:
<->: L2 distance (Euclidean)
<#>: negative inner product
<=>: cosine distance

Let's find the answer to a new question: "Who wrote Romeo & Juliet?". We'll generate an embedding for this query and use the cosine distance operator (<=>) to find the most similar facts in our database.
import psycopg2
from sentence_transformers import SentenceTransformer

# Embed the question with the same model used to populate the table
model = SentenceTransformer("all-MiniLM-L6-v2")
message = "Who wrote Romeo & Juliet?"
embedding = model.encode(message).tolist()

conn = psycopg2.connect("host=localhost port=5432 dbname=vectordb user=testuser password=testpwd")
cursor = conn.cursor()

# Pass the embedding in its text form so pgvector can parse it
query_data = {"embedding": str(embedding)}
cursor.execute(
    "SELECT fact, embedding <=> %(embedding)s AS distance FROM facts_dataset ORDER BY distance LIMIT 3",
    query_data,
)

for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
You should see output similar to the following:
('William Shakespeare wrote the play "Romeo & Juliet"', 0.22425718452079202)
('The capital of France is Paris', 0.8331506600037014)
('Mars is also colloquially called the Red Planet', 1.0083558881047898)
As shown, the top result identifies William Shakespeare as the author of Romeo & Juliet. This hit is significantly closer in distance than the other facts (0.22 vs. 0.83 and 1.0). In a production environment, you should implement a similarity threshold (e.g., < 0.5) so that when no relevant data is found, the application refrains from providing an incorrect or 'hallucinated' answer.
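A minimal sketch of such a threshold check, using hypothetical hard-coded rows shaped like the query results above rather than a live database connection:

```python
# Hypothetical (fact, distance) rows, shaped like the query results above
results = [
    ('William Shakespeare wrote the play "Romeo & Juliet"', 0.224),
    ("The capital of France is Paris", 0.833),
    ("Mars is also colloquially called the Red Planet", 1.008),
]

SIMILARITY_THRESHOLD = 0.5  # tune per model and dataset

# Keep only hits close enough to trust
relevant = [(fact, dist) for fact, dist in results if dist < SIMILARITY_THRESHOLD]

if relevant:
    print("Best match:", relevant[0][0])
else:
    print("No sufficiently similar fact found; declining to answer.")
```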
When you search for the "closest" or "most similar" vectors, you're asking the database to measure the distance between them. But "distance" can mean different things. pgvector offers three primary ways to measure it, and choosing the right one is critical for the success of your search application. In the previous example we chose cosine distance, for reasons that will become clear below.
Let's break them down with intuitive explanations and clear examples.
<-> (L2 distance): Use L2 distance when the magnitude of your vector has a clear and important meaning. For example, imagine you are representing images with a simple vector of their average red, green, and blue values: [R, G, B]. A bright red image might be [250, 10, 20]; a dark red image might be [50, 2, 4]. These vectors point in a similar direction (they are both reddish), but their magnitudes are very different. L2 distance correctly identifies them as being far apart, because "bright red" is visually very different from "dark red." The magnitude, representing brightness/intensity, is crucial.
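Sketching the RGB example in plain Python (toy values, no external libraries) shows how strongly magnitude drives L2 distance:

```python
from math import sqrt

def l2_distance(a, b):
    # Euclidean distance: square root of the summed squared differences
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

bright_red = [250, 10, 20]
dark_red = [50, 2, 4]            # same direction, much smaller magnitude
other_bright_red = [245, 12, 18]

print(l2_distance(bright_red, dark_red))          # large: brightness differs
print(l2_distance(bright_red, other_bright_red))  # small: visually similar
```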
<=> (cosine distance): Use cosine distance when the direction of the vector is what encodes the meaning, which is almost always the case for semantic text search with modern language models (like those from OpenAI or Hugging Face). The example described earlier in this post is a good illustration of when to use cosine distance. When language models create embeddings for text, they place concepts in a high-dimensional space.
Cosine distance is perfect here because it only cares about the conceptual direction (the meaning), not the arbitrary length of the vector produced by the model. And this is why we can ask pgvector "Who wrote Romeo & Juliet" and our example is able to tell us the answer is William Shakespeare.
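A small sketch makes the contrast with L2 distance explicit: scaling a vector leaves its cosine distance unchanged, while a perpendicular vector is maximally distant.

```python
from math import sqrt

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

a = [3.0, 4.0]
b = [0.3, 0.4]   # same direction, one tenth the magnitude
c = [4.0, -3.0]  # perpendicular direction

print(cosine_distance(a, b))  # ~0.0: same meaning despite different lengths
print(cosine_distance(a, c))  # ~1.0: unrelated direction
```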
<#> (inner product): The inner product A · B = Σ(Aᵢ * Bᵢ) rewards vectors that agree in both direction and magnitude. Use it when you care about both, in a multiplicative way. It is often used in recommendation systems and when working with embeddings that have been specifically trained for inner product search. Note that pgvector's <#> operator returns the negative inner product, so smaller values still mean more similar.
Imagine you have user embeddings and item embeddings trained together. The model is trained so that the inner product of a user and an item they are likely to enjoy is high.
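As a toy sketch (the embeddings below are invented, not from any trained model), a plain inner product scores how well an item aligns with a user:

```python
def inner_product(a, b):
    # Sum of element-wise products; pgvector's <#> returns the negative of this
    return sum(x * y for x, y in zip(a, b))

# Invented user/item embeddings; a real model would learn these jointly
user = [0.8, 0.1, 0.6]
item_scifi = [0.9, 0.0, 0.7]    # aligns with the user's tastes
item_romance = [0.1, 0.9, 0.0]  # does not

print(inner_product(user, item_scifi))   # high score: likely recommendation
print(inner_product(user, item_romance)) # low score
```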