Vector Database with Django and pgvector

I have a small Django project I have been playing with for a while. There’s more about it in this series of posts.

The application just keeps a list of items (mostly weblinks etc.) and lets me organize them. I’ve been experimenting with using embeddings for finding similar items, suggesting tags, and so on.

I am using OpenAI’s text-embedding-ada-002 to compute the embeddings for the objects. Because I was too lazy to actually set up or use a vector database, I just built a table of all the item-item distances and used that to query for similar stuff. This was a great hack for figuring out how I wanted to compute and use the embeddings, but it’s obviously not going to scale well.
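For context, computing an embedding is a single API call. A rough sketch, assuming the current openai Python client (the older openai.Embedding.create interface differs a little):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text):
    # text-embedding-ada-002 returns a 1536-dimensional vector
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding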

After looking around at various options out there like Pinecone, I decided I wanted to try pgvector, since I was already using Postgres and didn’t want to deal with another database. This turned out to work better than I expected, as pgvector has good Django support built in. Just:

pip install pgvector
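For reference, here is roughly what that Django support looks like once everything is set up. This is a minimal sketch, and the model and field names are mine, not the real ones from the app:

from django.db import models
from pgvector.django import VectorField

class Item(models.Model):
    # hypothetical stand-in for the real model; the app stores weblinks etc.
    title = models.CharField(max_length=200)
    url = models.URLField()
    # text-embedding-ada-002 produces 1536-dimensional vectors
    embedding = VectorField(dimensions=1536, null=True)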

I really only had to do two things to make it work that were not obvious from the docs.

The first was that I was running Postgres from a standard container image and wanted to keep that workflow. pgvector is available as a Docker image called ankane/pgvector. My only issue was that I had been using an image based on Postgres 16, so my existing database was incompatible with the pgvector image. Fortunately they provide a Dockerfile, and it was easy to build my own image.

The second was that the pgvector docs say you can enable the extension via a migration that looks like this:

from django.db import migrations
from pgvector.django import VectorExtension

class Migration(migrations.Migration):
    operations = [
        VectorExtension()
    ]

That’s correct, but you really want to start with a blank migration that is part of your app’s migration chain. Django will make one for you if you just run:

python manage.py makemigrations <app_name> --name <migration_name> --empty

Just tell it the app name, give the migration a name, and then add that operation in.
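The resulting migration file ends up looking something like this (the app label and dependency name here are placeholders; makemigrations fills in the real ones):

from django.db import migrations
from pgvector.django import VectorExtension

class Migration(migrations.Migration):

    # makemigrations points this at your app's previous migration
    dependencies = [
        ("myapp", "0007_previous_migration"),
    ]

    operations = [
        VectorExtension(),
    ]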

After that, everything worked fine.
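“Worked fine” meaning that nearest-neighbour lookups become ordinary ORM queries. A sketch, using the hypothetical Item model from above:

from pgvector.django import CosineDistance

# `item` is an existing Item instance that already has an embedding.
# Find the five items most similar to it, by cosine distance.
similar = (
    Item.objects
    .exclude(pk=item.pk)
    .exclude(embedding__isnull=True)
    .order_by(CosineDistance("embedding", item.embedding))[:5]
)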

Well, not exactly everything. Now I can’t use the standard postgres service in the GitLab CI/CD to run my tests… hmm

Running a custom Postgres image in GitLab CI/CD

Fortunately it’s pretty easy to fix up the CI/CD config, once you’ve tried a lot of other things first. Normally, setting up the postgres service in .gitlab-ci.yml looks something like this:

services:
    - postgres:12.2-alpine

But you can use your own Docker image with this syntax. The alias is the hostname that the rest of the CI/CD environment uses to reach the container:


services:
  # use the pgvector image from Docker Hub for testing
  - name: ankane/pgvector:latest
    alias: postgres
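On the Django side, the test settings just need to point at that alias as the database host. The database name, user, and password below are placeholders for whatever you pass to the service via the POSTGRES_* variables:

# Django database settings used by the CI job (names are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "test_db",
        "USER": "runner",
        "PASSWORD": "runner",
        "HOST": "postgres",  # the service alias from .gitlab-ci.yml
        "PORT": "5432",
    }
}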

And then everything was ok.

I could also use my custom Dockerfile if I put it in a place where the CI/CD could find it.

