-
Notifications
You must be signed in to change notification settings - Fork 3
Ab/hybrid search #2663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Ab/hybrid search #2663
Conversation
d537e3b to
afd8ad8
Compare
OpenAPI ChangesShow/hide 10 changes: 0 error, 0 warning, 10 infoUnexpected changes? Ensure your branch is up-to-date with |
shanbady
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to update the update_local_index_settings_for_hybrid_search like so to have it work locally:
def update_local_index_settings_for_hybrid_search():
settings_body = {
"persistent": {
"archived.plugins.index_state_management.metadata_migration.status": None,
"archived.plugins.index_state_management.template_migration.control": None,
}
}
conn = get_conn()
conn.cluster.put_settings(body=settings_body)
settings_body = {
"persistent": {
"plugins": {
"ml_commons": {
"only_run_on_ml_node": "false",
"native_memory_threshold": "99",
}
}
}
}
conn = get_conn()
conn.cluster.put_settings(body=settings_body)I also was not seeing any results for search_mode=hybrid after I ran everything. the search endpoint was returning 200 with 0 results (regular endpoint still worked)
|
|
||
| def get_vector_model_id(): | ||
| conn = get_conn() | ||
| model_name = "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to move this to settings.py
| EMBEDDING_FIELDS = { | ||
| "title_embedding": { | ||
| "type": "knn_vector", | ||
| "dimension": 768, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to have this either as a constant or in settings.py - if possible it would be good to derive this from the model itself since they are dependent on one another similar to how we use encoder.dim() for qdrant
| return MLCommonClient(conn) | ||
|
|
||
|
|
||
| def register_model(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing docstring (same for create_ingest_pipeline get_ml_client and update_local_index_settings_for_hybrid_search)
| conn.indices.refresh(index) | ||
|
|
||
|
|
||
| def get_vector_model_id(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing docstring
|
I do see there is a combined hybrid index but it has 0 documents and its status is also yellow: |
|
I'm not sure if the |
|
appears to be working after switching branches then switching back for some reason. the "micromaster_" is just a local prefix from my configuration |
|
not sure how much of a concern this is but - are we keeping both the regular search indexes updated even with hybrid search? It seems like we are updating the search index only with the "recreate_index" command however subscription emails etc are percolated off of the regular index. just want to make sure we won't accidentally send emails for items that dont show up in search results. I dont think this is an issue if the search view is only visible to admin users for now |
|
Yes the plan is to keep both the new indexes and the hybrid search for now since the hybrid search is not ready to show users. The hybrid search view is only going to be visible to admins for the time being. I think subscription emails won't be sent anyway for items that are in the new index but not the old one since the only way that would happen is if something is unpublished but the new index is updated Once the hybrid search is production ready we can get rid of the old indexes |
shanbady
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. LGTM
dd47848 to
783e449
Compare
What are the relevant tickets?
part of https://github.com/mitodl/hq/issues/9135
Description (What does it do?)
This is a v0 implementation for vector search. This uses the huggingface/sentence-transformers/msmarco-distilbert-base-tas-b model. For now only title and description are vectorized the search performance in terms of returning the most relevant results definitely has room for improvement. The new index is only updated and the new search mode only works if the model into opensearch and for now the plan is to only run it on rc. Also the new combined hybrid index is only updated via the recreate_index command for now and not upsert actions when learning resources are created or updated. Partially this is because this pr is already large and partially i wanted to make sure that vector embedding do not affect the existing search until the hybrid search is closer to being production ready. Additionally ,for now the new hybrid index is not linked to contentfiles and content file content is not used in the search
How can this be tested?
Verify that the search page works normally. Also searching "intro to ai class" returns no results
Login as an admin and select "hybrid" as the search mode from the admin options or go to http://open.odl.local:8062/search?search_mode=hybrid. The search will be empty for now regardless of whether you have a term but the site won't crash
Run
docker-compose run web ./manage.py recreate_index --combined_hybridThe task should finish right away and you should seeSkipping indexing hybrid index reindexing because no vector model is configured.in the logsfrom the shell run
Run
docker-compose run web ./manage.py recreate_index --combined_hybridGo to
http://open.odl.local:8062/search
Search should still work normally
Go to
http://open.odl.local:8062/search?search_mode=hybrid
Search should work and facets should work
Searching "intro to ai class" should return results.