@cacrespo
Owner

@cacrespo cacrespo commented Sep 1, 2025

This pull request introduces semantic search functionality for the blog, allowing users to search for posts and articles based on the meaning of their content, not just keywords.

  • Integrated pgvector for storing vector embeddings in the PostgreSQL database.
  • Used the sentence-transformers library to generate embeddings for post and article content.
  • Implemented a custom search method that combines semantic search (using cosine distance) with traditional keyword-based search for more comprehensive results.
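As a rough illustration of the combined approach described above (not the code in this PR), the sketch below filters documents by cosine distance or a plain keyword match, using only the standard library. The function names, the in-memory `docs` list, and the toy 2-d vectors are hypothetical; in the PR itself the distance is computed in PostgreSQL via pgvector and the embeddings come from sentence-transformers.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat zero vectors as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def hybrid_search(query_text, query_vec, docs, dmax=0.5):
    """Keep docs that are semantically close (distance <= dmax) or that
    contain the query as a substring; order results by distance."""
    needle = query_text.lower()
    results = []
    for doc in docs:
        distance = cosine_distance(query_vec, doc["embedding"])
        keyword_hit = (
            needle in doc["title"].lower() or needle in doc["content"].lower()
        )
        if distance <= dmax or keyword_hit:
            results.append((distance, doc))
    results.sort(key=lambda pair: pair[0])
    return [doc for _, doc in results]


# Toy data: hand-written 2-d vectors instead of real embeddings.
docs = [
    {"title": "Intro to Django", "content": "Models and views", "embedding": [1.0, 0.0]},
    {"title": "Cooking pasta", "content": "Boil water", "embedding": [0.0, 1.0]},
    {"title": "Web frameworks", "content": "Flask vs others", "embedding": [0.9, 0.1]},
]
found = hybrid_search("pasta", [1.0, 0.0], docs)
```

Here "Cooking pasta" is far from the query vector but still returned through the keyword branch, which is the point of combining both strategies.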

Unrelated changes (I was too lazy to create another PR 🙈):

  • Improved the Django Admin to better manage content.
  • Removed the search button from the base template.

I'm not 100% convinced about these changes, but I think it would be a pity not to integrate them. For now, let's move forward with the implementation, and maybe later we can improve it.

What don’t I like?

  • In the end, semantic search turns out to be very expensive in terms of database queries. I'm convinced it can be optimized.
  • We’re adding a lot of libraries and also changing the Postgres image.
  • I added quite a bit of specific logic in the models (normalizing title and content, and giving more weight to the title in searches), but I feel it might be more appropriate to move that logic into the views.
  • The save() methods on Article and Post feel redundant. The same goes for search().
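For reference, one possible way to give the title more weight, kept outside the models as a plain helper, is sketched below. This is only an assumption about how the weighting could work at the vector level: the helper names and the `title_weight=0.7` default are hypothetical and are not taken from the PR.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def weighted_embedding(title_vec, content_vec, title_weight=0.7):
    """Blend title and content vectors, favouring the title.

    Both inputs are normalized first so that neither dominates just
    because of its magnitude; the blend is normalized again at the end.
    """
    if len(title_vec) != len(content_vec):
        raise ValueError("title and content vectors must have the same length")
    title_unit = l2_normalize(title_vec)
    content_unit = l2_normalize(content_vec)
    blended = [
        title_weight * t + (1.0 - title_weight) * c
        for t, c in zip(title_unit, content_unit)
    ]
    return l2_normalize(blended)


# Toy vectors: the title points along the first axis, the content along the second.
vec = weighted_embedding([1.0, 0.0], [0.0, 1.0])
```

A helper like this could be shared by both `Article` and `Post`, which would also reduce the duplication between their `save()` methods.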

Last but not least: how should we handle migration code? Should migrations be generated in dev and then pushed to the repo, or generated in prod and then pushed? What’s the best approach and why?

@cacrespo cacrespo changed the title Semantic search [WIP] Implement semantic search for posts and articles Sep 6, 2025
@cacrespo cacrespo marked this pull request as ready for review September 6, 2025 22:41
        return self.title

    def save(self, *args, **kwargs):
        title_vec = T.encode(self.title)
Collaborator

@eduzen eduzen Sep 8, 2025

I think all this code deserves its own function; it will be more readable that way:

self.embedding = self.prepare_vector_information()
return super().save(*args, **kwargs)
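A minimal, Django-free sketch of that refactor might look like the following. `FakeEncoder` and `BaseModel` are hypothetical stand-ins (for the real encoder `T` and for `django.db.models.Model`) so the example runs on its own, and the averaging step is an assumption, not the logic in the diff.

```python
class FakeEncoder:
    """Hypothetical stand-in for the real encoder (T in the diff)."""

    def encode(self, text):
        # Toy 2-d "embedding": text length and vowel count.
        vowels = sum(ch in "aeiou" for ch in text.lower())
        return [float(len(text)), float(vowels)]


T = FakeEncoder()


class BaseModel:
    """Stand-in for django.db.models.Model so the sketch runs without Django."""

    def save(self, *args, **kwargs):
        self.saved = True


class Post(BaseModel):
    def __init__(self, title, content):
        self.title = title
        self.content = content
        self.embedding = None

    def prepare_vector_information(self):
        """Build the vector stored on the instance (assumed: plain average)."""
        title_vec = T.encode(self.title)
        content_vec = T.encode(self.content)
        return [(a + b) / 2 for a, b in zip(title_vec, content_vec)]

    def save(self, *args, **kwargs):
        # save() stays short: compute the embedding, then delegate.
        self.embedding = self.prepare_vector_information()
        return super().save(*args, **kwargs)


post = Post("Hello", "World wide")
post.save()
```

With the helper extracted, `save()` reads as two steps, and the embedding logic can be tested without touching the database.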

Owner Author

I went a bit further here and created a new class to handle embeddings — I think we can do more with them (or delete it forever... whichever happens first).

blog/models.py Outdated
        super().save(*args, **kwargs)

    @classmethod
    def search(cls, q, dmax=0.5):
Collaborator

Why a class method? If search is going to be used through the ORM, like objects.search, I think we should use a custom manager: https://docs.djangoproject.com/en/5.2/topics/db/managers/

Owner Author

Understood

Collaborator

@eduzen eduzen left a comment

I left some comments.

@cacrespo
Owner Author

Test comment

@cacrespo cacrespo merged commit 017facc into main Sep 28, 2025
1 check passed
@cacrespo cacrespo deleted the semantic_search branch September 28, 2025 15:01
@cacrespo cacrespo mentioned this pull request Oct 3, 2025