Ab/hybrid search #2663

abeglova · 2025-10-30T21:22:29Z

What are the relevant tickets?

part of https://github.com/mitodl/hq/issues/9135

Description (What does it do?)

This is a v0 implementation of vector search. It uses the huggingface/sentence-transformers/msmarco-distilbert-base-tas-b model. For now only title and description are vectorized, and search performance, in terms of returning the most relevant results, definitely has room for improvement. The new index is only updated, and the new search mode only works, if the model is loaded into OpenSearch; for now the plan is to run it only on rc. Also, the new combined hybrid index is only updated via the recreate_index command for now, not by upsert actions when learning resources are created or updated. This is partly because this PR is already large and partly because I wanted to make sure that vector embeddings do not affect the existing search until the hybrid search is closer to being production ready. Additionally, for now the new hybrid index is not linked to contentfiles, and content file content is not used in the search.
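For readers unfamiliar with OpenSearch hybrid search, here is a minimal sketch of what a request body combining a keyword clause and a vector clause can look like. It is not taken from this PR: the function name build_hybrid_query, the choice of multi_match for the lexical side, and the default k are assumptions for illustration; only the title_embedding field name comes from the diff.

```python
from typing import Any


def build_hybrid_query(text: str, model_id: str, k: int = 10) -> dict[str, Any]:
    """Build an OpenSearch hybrid query with one lexical and one neural clause.

    Hypothetical helper for illustration; the PR may assemble its query differently.
    """
    return {
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical clause: ordinary full-text match on the text fields.
                    {
                        "multi_match": {
                            "query": text,
                            "fields": ["title", "description"],
                        }
                    },
                    # Semantic clause: OpenSearch embeds `text` with the given
                    # model and runs a k-NN search against the vector field.
                    {
                        "neural": {
                            "title_embedding": {
                                "query_text": text,
                                "model_id": model_id,
                                "k": k,
                            }
                        }
                    },
                ]
            }
        }
    }
```

In OpenSearch, the scores of the sub-queries in a hybrid query are combined by a search pipeline with a normalization processor, so a body like this is normally sent together with such a pipeline.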

How can this be tested?

Verify that the search page works normally. Also verify that searching "intro to ai class" returns no results.

Log in as an admin and select "hybrid" as the search mode from the admin options, or go to http://open.odl.local:8062/search?search_mode=hybrid. The search will be empty for now, whether or not you enter a search term, but the site won't crash.

Run

docker-compose run web ./manage.py recreate_index --combined_hybrid

The task should finish right away, and the logs should show: Skipping indexing hybrid index reindexing because no vector model is configured.

From the shell, run

from learning_resources_search.indexing_api import update_local_index_settings_for_hybrid_search, register_model, create_ingest_pipeline

update_local_index_settings_for_hybrid_search()
register_model()
create_ingest_pipeline()
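As background for the create_ingest_pipeline() step, the following is a rough sketch of the kind of ingest pipeline body OpenSearch expects for generating embeddings at index time. It is an assumption about the shape, not the PR's actual code: the helper name build_ingest_pipeline_body and the description_embedding field are hypothetical; title_embedding is the field name that appears in the diff.

```python
def build_ingest_pipeline_body(model_id: str) -> dict:
    """Build an ingest pipeline body that embeds text fields into vector fields.

    Hypothetical helper; field names other than title_embedding are assumed.
    """
    return {
        "description": "Embed title and description for hybrid search",
        "processors": [
            {
                # text_embedding is the OpenSearch neural-search processor that
                # runs the registered model over the source fields at ingest time.
                "text_embedding": {
                    "model_id": model_id,
                    # Maps each source text field to the knn_vector field
                    # that receives its embedding.
                    "field_map": {
                        "title": "title_embedding",
                        "description": "description_embedding",
                    },
                }
            }
        ],
    }
```

A body like this would typically be registered through the client's ingest API, which is why the model has to be registered first: the pipeline needs the model id that registration returns.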

Run

docker-compose run web ./manage.py recreate_index --combined_hybrid

Go to
http://open.odl.local:8062/search

Search should still work normally

Go to
http://open.odl.local:8062/search?search_mode=hybrid

Search should work and facets should work
Searching "intro to ai class" should return results.

github-actions · 2025-10-30T21:38:57Z

OpenAPI Changes

10 changes: 0 error, 0 warning, 10 info
info	[request-parameter-property-type-generalized] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/contentfiles/
		for the 'query' request parameter 'resource_id', the type/format of property '/items/' was generalized from 'integer'/'' to 'number'/''

info	[request-parameter-property-type-generalized] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/courses/{learning_resource_id}/contentfiles/
		for the 'query' request parameter 'resource_id', the type/format of property '/items/' was generalized from 'integer'/'' to 'number'/''

info	[request-parameter-property-type-generalized] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/learning_resources/{learning_resource_id}/contentfiles/
		for the 'query' request parameter 'resource_id', the type/format of property '/items/' was generalized from 'integer'/'' to 'number'/''

info	[request-parameter-enum-value-added] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/learning_resources_search/
		added the new enum value 'hybrid' to the 'query' request parameter 'search_mode'

info	[request-parameter-enum-value-added] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/learning_resources_user_subscription/
		added the new enum value 'hybrid' to the 'query' request parameter 'search_mode'

info	[request-parameter-enum-value-added] at head/openapi/specs/v1.yaml	
	in API GET /api/v1/learning_resources_user_subscription/check/
		added the new enum value 'hybrid' to the 'query' request parameter 'search_mode'

info	[request-parameter-enum-value-added] at head/openapi/specs/v1.yaml	
	in API POST /api/v1/learning_resources_user_subscription/subscribe/
		added the new enum value 'hybrid' to the 'query' request parameter 'search_mode'

info	[request-property-enum-value-added] at head/openapi/specs/v1.yaml	
	in API POST /api/v1/learning_resources_user_subscription/subscribe/
		added the new 'hybrid' enum value to the request property 'search_mode/allOf[#/components/schemas/SearchModeEnum]/'

info	[request-property-enum-value-added] at head/openapi/specs/v1.yaml	
	in API POST /api/v1/learning_resources_user_subscription/subscribe/
		added the new 'hybrid' enum value to the request property 'search_mode/allOf[#/components/schemas/SearchModeEnum]/'

info	[request-property-enum-value-added] at head/openapi/specs/v1.yaml	
	in API POST /api/v1/learning_resources_user_subscription/subscribe/
		added the new 'hybrid' enum value to the request property 'search_mode/allOf[#/components/schemas/SearchModeEnum]/'

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

shanbady

I had to change update_local_index_settings_for_hybrid_search as follows to get it working locally:

def update_local_index_settings_for_hybrid_search():
    settings_body = {
        "persistent": {
            "archived.plugins.index_state_management.metadata_migration.status": None,
            "archived.plugins.index_state_management.template_migration.control": None,
        }
    }
    conn = get_conn()
    conn.cluster.put_settings(body=settings_body)
    settings_body = {
        "persistent": {
            "plugins": {
                "ml_commons": {
                    "only_run_on_ml_node": "false",
                    "native_memory_threshold": "99",
                }
            }
        }
    }
    conn = get_conn()
    conn.cluster.put_settings(body=settings_body)

I also was not seeing any results for search_mode=hybrid after I ran everything. The search endpoint was returning 200 with 0 results (the regular endpoint still worked).

shanbady · 2025-11-04T14:33:29Z

learning_resources_search/connection.py

+
+def get_vector_model_id():
+    conn = get_conn()
+    model_name = "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"


Would be good to move this to settings.py.
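A minimal sketch of the idea, assuming a setting read from the environment. The name OPENSEARCH_VECTOR_MODEL_NAME and the helper get_vector_model_name are hypothetical, and the project's own settings helpers may be a better fit than os.environ:

```python
import os

# Default taken from the hard-coded value in get_vector_model_id().
DEFAULT_VECTOR_MODEL_NAME = (
    "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"
)


def get_vector_model_name(environ=None) -> str:
    """Return the configured vector model name, falling back to the default.

    `environ` is injectable for testing; OPENSEARCH_VECTOR_MODEL_NAME is a
    hypothetical setting name, not one defined in this PR.
    """
    env = os.environ if environ is None else environ
    return env.get("OPENSEARCH_VECTOR_MODEL_NAME", DEFAULT_VECTOR_MODEL_NAME)
```

With something like this in settings.py, get_vector_model_id() could read the model name from settings instead of a string literal.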

shanbady · 2025-11-04T14:38:07Z

learning_resources_search/constants.py

+EMBEDDING_FIELDS = {
+    "title_embedding": {
+        "type": "knn_vector",
+        "dimension": 768,


Would be good to have this either as a constant or in settings.py. If possible, it would be good to derive it from the model itself, since the two depend on one another, similar to how we use encoder.dim() for qdrant.
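One way this could look, as a sketch only: keep the dimension next to the model name and build the mapping from it. The lookup table stands in for actually querying the model, and VECTOR_MODEL_DIMENSIONS, build_embedding_fields, and description_embedding are hypothetical names; the real EMBEDDING_FIELDS may carry more keys than the two shown in the diff.

```python
# Known output dimension per model; 768 is the value used in the diff.
VECTOR_MODEL_DIMENSIONS = {
    "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b": 768,
}


def build_embedding_fields(
    model_name: str,
    field_names=("title_embedding", "description_embedding"),
) -> dict:
    """Build knn_vector mappings whose dimension follows the chosen model.

    Raises KeyError for an unknown model so a mismatch fails loudly instead
    of silently creating an index with the wrong vector size.
    """
    dimension = VECTOR_MODEL_DIMENSIONS[model_name]
    return {
        name: {"type": "knn_vector", "dimension": dimension}
        for name in field_names
    }
```

Swapping the model then only requires adding one entry to the table, and the index mapping cannot drift out of sync with it.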

shanbady · 2025-11-04T14:40:06Z

learning_resources_search/indexing_api.py

+    return MLCommonClient(conn)
+
+
+def register_model():


Missing docstring (same for create_ingest_pipeline, get_ml_client, and update_local_index_settings_for_hybrid_search).

shanbady · 2025-11-04T14:41:20Z

learning_resources_search/connection.py

    conn.indices.refresh(index)
+
+
+def get_vector_model_id():


missing docstring

shanbady · 2025-11-04T15:33:08Z

I do see there is a combined hybrid index but it has 0 documents and its status is also yellow:

yellow open   micromasters_combined_hybrid_b3db3d11e24841b4917df7cde114ad10 LTD55I9OQZ6ImnDv7KGGGA   2   2          0            0       416b           416b

abeglova · 2025-11-04T17:02:09Z

I'm not sure if the micromasters_ indexes are still used. In any case, they will be ignored by learn, since learn ignores any index whose name doesn't start with the prefix in settings.OPENSEARCH_INDEX.
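To illustrate the prefix behavior, a small sketch assuming a plain startswith check; indexes_for_prefix and the example prefix "learn" are hypothetical, and the real lookup in learn may work differently:

```python
def indexes_for_prefix(index_names, prefix):
    """Keep only the index names that start with the configured prefix."""
    return [name for name in index_names if name.startswith(prefix)]


# Example: with a hypothetical prefix of "learn", the micromasters_ index
# from the listing above would be skipped.
visible = indexes_for_prefix(
    [
        "micromasters_combined_hybrid_b3db3d11e24841b4917df7cde114ad10",
        "learn_combined_hybrid_example",
    ],
    "learn",
)
```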

shanbady · 2025-11-04T17:48:48Z

It appears to be working after switching branches and then switching back, for some reason. The "micromasters_" prefix is just a local prefix from my configuration.

shanbady · 2025-11-04T20:05:54Z

Not sure how much of a concern this is, but are we keeping the regular search indexes updated even with hybrid search?

It seems like we are updating the hybrid search index only with the recreate_index command; however, subscription emails etc. are percolated off of the regular index. I just want to make sure we won't accidentally send emails for items that don't show up in search results. I don't think this is an issue if the hybrid search view is only visible to admin users for now.

abeglova · 2025-11-04T20:50:28Z

Yes, the plan is to keep both the existing indexes and the hybrid search for now, since the hybrid search is not ready to show users. The hybrid search view is only going to be visible to admins for the time being. I think subscription emails won't be sent anyway for items that are in the new index but not the old one, since the only way that would happen is if something is unpublished but the new index is updated.

Once the hybrid search is production ready, we can get rid of the old indexes.

shanbady

Looks good. LGTM

abeglova force-pushed the ab/hybrid-search branch from d537e3b to afd8ad8 Compare October 30, 2025 21:38

abeglova marked this pull request as ready for review November 3, 2025 16:10

shanbady self-requested a review November 4, 2025 14:28

shanbady requested changes Nov 4, 2025

shanbady assigned abeglova Nov 4, 2025

shanbady added the Waiting on author label Nov 4, 2025

shanbady approved these changes Nov 5, 2025

hybrid search v0

783e449

abeglova force-pushed the ab/hybrid-search branch from dd47848 to 783e449 Compare November 12, 2025 15:08

3 participants
