@abeglova abeglova commented Oct 30, 2025

What are the relevant tickets?

part of https://github.com/mitodl/hq/issues/9135

Description (What does it do?)

This is a v0 implementation of vector search. It uses the huggingface/sentence-transformers/msmarco-distilbert-base-tas-b model. For now only title and description are vectorized, and search performance, in terms of returning the most relevant results, definitely has room for improvement. The new index is only updated, and the new search mode only works, if the model is loaded into OpenSearch; for now the plan is to run it only on rc. Also, the new combined hybrid index is currently updated only via the recreate_index command, not by upsert actions when learning resources are created or updated. This is partly because this PR is already large and partly because I wanted to make sure that vector embeddings do not affect the existing search until the hybrid search is closer to being production ready. Additionally, for now the new hybrid index is not linked to contentfiles, and content file content is not used in the search.
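For readers unfamiliar with hybrid search in OpenSearch, here is a minimal sketch of what a hybrid query body can look like. It is not taken from this PR: the function name, the choice of `multi_match` for the keyword clause, and the default `k` are assumptions for illustration. Only the `title_embedding` field name and the title/description fields come from the conversation.

```python
from typing import Any


def build_hybrid_query(text: str, model_id: str, k: int = 10) -> dict[str, Any]:
    """Build a hybrid query body that mixes a keyword and a neural clause.

    Hypothetical helper: the PR's real query builder may look different.
    """
    return {
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical clause over the two fields that are vectorized.
                    {
                        "multi_match": {
                            "query": text,
                            "fields": ["title", "description"],
                        }
                    },
                    # Semantic clause against a knn_vector field; the query
                    # text is embedded by the model registered in OpenSearch.
                    {
                        "neural": {
                            "title_embedding": {
                                "query_text": text,
                                "model_id": model_id,
                                "k": k,
                            }
                        }
                    },
                ]
            }
        }
    }
```

OpenSearch normally expects a hybrid query to run through a search pipeline with a normalization processor, so that the scores of the two clauses can be combined. That part is not shown here.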

How can this be tested?

Verify that the search page works normally. Searching "intro to ai class" should return no results at this stage.

Log in as an admin and select "hybrid" as the search mode from the admin options, or go to http://open.odl.local:8062/search?search_mode=hybrid. The search will be empty for now regardless of whether you enter a term, but the site won't crash.

Run

docker-compose run web ./manage.py recreate_index --combined_hybrid

The task should finish right away and you should see "Skipping indexing hybrid index reindexing because no vector model is configured." in the logs.
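The early exit described in this step can be sketched as a simple guard. The function name, its arguments, and the boolean return value are hypothetical and only illustrate the pattern. The log message is the one quoted in the step.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger(__name__)


def reindex_hybrid_index(
    model_id: Optional[str], index_fn: Callable[[str], None]
) -> bool:
    """Run index_fn only when a vector model is configured.

    Hypothetical sketch: returns True if indexing ran, False if skipped.
    """
    if not model_id:
        # No model registered, so there is nothing to embed documents with.
        log.info(
            "Skipping indexing hybrid index reindexing because "
            "no vector model is configured."
        )
        return False
    index_fn(model_id)
    return True
```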

From the shell, run

from learning_resources_search.indexing_api import update_local_index_settings_for_hybrid_search, register_model, create_ingest_pipeline

update_local_index_settings_for_hybrid_search()
register_model()
create_ingest_pipeline()
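For context on the last of these calls, an OpenSearch ingest pipeline for embeddings is typically built around a `text_embedding` processor. The sketch below is not the PR's `create_ingest_pipeline`: the helper name, the pipeline description, and the `description_embedding` field are assumptions, and only `title_embedding` appears in the diff.

```python
from typing import Any


def build_ingest_pipeline_body(
    model_id: str, field_map: dict[str, str]
) -> dict[str, Any]:
    """Build an ingest pipeline body that embeds text fields at index time.

    Hypothetical helper; field_map maps source text fields to the
    knn_vector fields that receive the embeddings.
    """
    return {
        "description": "Embed learning resource text fields",
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    "field_map": field_map,
                }
            }
        ],
    }


# Assumed field names: only "title_embedding" is confirmed by the PR.
EXAMPLE_FIELD_MAP = {
    "title": "title_embedding",
    "description": "description_embedding",
}
```

With opensearch-py, a body like this would usually be sent with `conn.ingest.put_pipeline(id=..., body=...)`. The pipeline id is left out because the PR does not show it.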

Run

docker-compose run web ./manage.py recreate_index --combined_hybrid

Go to
http://open.odl.local:8062/search

Search should still work normally

Go to
http://open.odl.local:8062/search?search_mode=hybrid

Search should work and facets should work
Searching "intro to ai class" should return results.

github-actions bot commented Oct 30, 2025

OpenAPI Changes

10 changes: 0 error, 0 warning, 10 info
info	[request-parameter-property-type-generalized] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/contentfiles/
		for the 'query' request parameter 'resource_id', the type/format of property '/items/' was generalized from 'integer'/'' to 'number'/''

info	[request-parameter-property-type-generalized] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/courses/{learning_resource_id}/contentfiles/
		for the 'query' request parameter 'resource_id', the type/format of property '/items/' was generalized from 'integer'/'' to 'number'/''

info	[request-parameter-property-type-generalized] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/learning_resources/{learning_resource_id}/contentfiles/
		for the 'query' request parameter 'resource_id', the type/format of property '/items/' was generalized from 'integer'/'' to 'number'/''

info	[request-parameter-enum-value-added] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/learning_resources_search/
		added the new enum value 'hybrid' to the 'query' request parameter 'search_mode'

info	[request-parameter-enum-value-added] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/learning_resources_user_subscription/
		added the new enum value 'hybrid' to the 'query' request parameter 'search_mode'

info	[request-parameter-enum-value-added] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/learning_resources_user_subscription/check/
		added the new enum value 'hybrid' to the 'query' request parameter 'search_mode'

info	[request-parameter-enum-value-added] at head/openapi/specs/v1.yaml	
	in API POST /api/v1/learning_resources_user_subscription/subscribe/
		added the new enum value 'hybrid' to the 'query' request parameter 'search_mode'

info	[request-property-enum-value-added] at head/openapi/specs/v1.yaml	
	in API POST /api/v1/learning_resources_user_subscription/subscribe/
		added the new 'hybrid' enum value to the request property 'search_mode/allOf[#/components/schemas/SearchModeEnum]/'

info	[request-property-enum-value-added] at head/openapi/specs/v1.yaml	
	in API POST /api/v1/learning_resources_user_subscription/subscribe/
		added the new 'hybrid' enum value to the request property 'search_mode/allOf[#/components/schemas/SearchModeEnum]/'

info	[request-property-enum-value-added] at head/openapi/specs/v1.yaml	
	in API POST /api/v1/learning_resources_user_subscription/subscribe/
		added the new 'hybrid' enum value to the request property 'search_mode/allOf[#/components/schemas/SearchModeEnum]/'


Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

@abeglova abeglova marked this pull request as ready for review November 3, 2025 16:10
@shanbady shanbady self-requested a review November 4, 2025 14:28

@shanbady shanbady left a comment

I had to update the update_local_index_settings_for_hybrid_search like so to have it work locally:

def update_local_index_settings_for_hybrid_search():
    """Apply the cluster settings needed to run hybrid search locally."""
    conn = get_conn()
    # Null out the archived index-state-management settings first.
    conn.cluster.put_settings(
        body={
            "persistent": {
                "archived.plugins.index_state_management.metadata_migration.status": None,
                "archived.plugins.index_state_management.template_migration.control": None,
            }
        }
    )
    # Allow ML Commons to run models on non-ML nodes and raise its memory threshold.
    conn.cluster.put_settings(
        body={
            "persistent": {
                "plugins": {
                    "ml_commons": {
                        "only_run_on_ml_node": "false",
                        "native_memory_threshold": "99",
                    }
                }
            }
        }
    )

I also was not seeing any results for search_mode=hybrid after I ran everything. The search endpoint was returning 200 with 0 results (the regular endpoint still worked).


def get_vector_model_id():
    conn = get_conn()
    model_name = "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"

would be good to move this to settings.py
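One way to act on this suggestion is to read the model name from the environment with the current value as the default. This is a minimal sketch: the setting name `OPENSEARCH_VECTOR_MODEL_NAME` and the helper are hypothetical, and the project may use its own settings helpers instead of `os.environ`.

```python
import os
from typing import Mapping, Optional

# Default taken from the PR; the environment variable name is an assumption.
DEFAULT_VECTOR_MODEL_NAME = (
    "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"
)


def get_vector_model_name(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured model name, falling back to the default."""
    env = os.environ if environ is None else environ
    return env.get("OPENSEARCH_VECTOR_MODEL_NAME", DEFAULT_VECTOR_MODEL_NAME)
```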

EMBEDDING_FIELDS = {
    "title_embedding": {
        "type": "knn_vector",
        "dimension": 768,

would be good to have this either as a constant or in settings.py. If possible, it would be good to derive this from the model itself, since the two depend on one another, similar to how we use encoder.dim() for qdrant
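A minimal sketch of keeping the dimension tied to the model, assuming a small lookup table rather than querying the model itself. `MODEL_DIMENSIONS` and `build_embedding_fields` are hypothetical names, and the mapping only carries the two keys visible in the diff.

```python
from typing import Any, Iterable

# The model from this PR produces 768-dimensional vectors.
MODEL_DIMENSIONS = {
    "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b": 768,
}


def build_embedding_fields(
    model_name: str, field_names: Iterable[str]
) -> dict[str, dict[str, Any]]:
    """Build knn_vector mappings whose dimension follows the chosen model.

    Raises KeyError for a model without a known dimension, so a model
    change cannot silently leave a stale dimension in the mapping.
    """
    dimension = MODEL_DIMENSIONS[model_name]
    return {
        name: {"type": "knn_vector", "dimension": dimension}
        for name in field_names
    }
```

Failing loudly on an unknown model is a design choice in this sketch. A mismatch between the mapping dimension and the model output would otherwise only show up when documents are indexed.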

    return MLCommonClient(conn)


def register_model():

missing docstring (same for create_ingest_pipeline, get_ml_client, and update_local_index_settings_for_hybrid_search)

    conn.indices.refresh(index)


def get_vector_model_id():

missing docstring

shanbady commented Nov 4, 2025

I do see there is a combined hybrid index but it has 0 documents and its status is also yellow:

yellow open   micromasters_combined_hybrid_b3db3d11e24841b4917df7cde114ad10 LTD55I9OQZ6ImnDv7KGGGA   2   2          0            0       416b           416b
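The row above follows the default column order of `_cat/indices` (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size). A small sketch for reading it, assuming that default layout. The class and function names are made up for illustration.

```python
from typing import NamedTuple


class CatIndexRow(NamedTuple):
    health: str
    status: str
    index: str
    uuid: str
    primaries: int
    replicas: int
    docs_count: int
    docs_deleted: int
    store_size: str
    primary_store_size: str


def parse_cat_indices_row(line: str) -> CatIndexRow:
    """Parse one whitespace-separated row of default _cat/indices output."""
    parts = line.split()
    if len(parts) != 10:
        raise ValueError(f"expected 10 columns, got {len(parts)}")
    health, status, index, uuid, pri, rep, count, deleted, size, pri_size = parts
    return CatIndexRow(
        health, status, index, uuid,
        int(pri), int(rep), int(count), int(deleted),
        size, pri_size,
    )
```

Read this way, the row shows 2 primaries, 2 replicas, and a document count of 0. A yellow status means the primaries are allocated but at least one replica is not, which is common on a single-node local cluster.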

abeglova commented Nov 4, 2025

I'm not sure if the micromasters_ indexes are still used. In any case, they will be ignored by learn, since learn ignores any indexes that don't have the prefix settings.OPENSEARCH_INDEX.
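The prefix filtering described here can be sketched as follows. The helper name is hypothetical, and the `{prefix}_` naming convention is an assumption inferred from the index name quoted earlier in the thread.

```python
from typing import Iterable


def indexes_for_prefix(index_names: Iterable[str], prefix: str) -> list[str]:
    """Keep only the indexes whose name starts with '<prefix>_'.

    Assumes names follow a '<prefix>_<type>_<suffix>' convention.
    """
    marker = f"{prefix}_"
    return [name for name in index_names if name.startswith(marker)]
```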

shanbady commented Nov 4, 2025

Appears to be working after switching branches and then switching back, for some reason. The "micromasters_" is just a local prefix from my configuration.

shanbady commented Nov 4, 2025

Not sure how much of a concern this is, but are we keeping the regular search indexes updated even with hybrid search?

It seems like we are updating the hybrid search index only with the "recreate_index" command, however subscription emails etc. are percolated off of the regular index. Just want to make sure we won't accidentally send emails for items that don't show up in search results. I don't think this is an issue if the search view is only visible to admin users for now.

abeglova commented Nov 4, 2025

Yes, the plan is to keep both the existing indexes and the new hybrid index for now, since the hybrid search is not ready to show users. The hybrid search view is only going to be visible to admins for the time being. I think subscription emails won't be sent anyway for items that are in the new index but not the old one, since the only way that would happen is if something is unpublished but the new index is updated.

Once the hybrid search is production ready we can get rid of the old indexes

@shanbady shanbady left a comment

Looks good. LGTM
