[CV2-5117] batch formats #111
base: master
Conversation
def classify(self, task_prompt, items_count, max_tokens_per_item=200):
    pass
sorry, I think all changes in this file are just updates from formatter
I think line 292 is the only substantive change: for item in data["input_items"]:
:param num_of_keywords: int
:returns: str
"""
# TODO: loop over passed in items, or actually trigger processing in batch?
This looks fine as a first pass on the idea imo
@ahmednasserswe adding you here for visibility, let us know if you have any thoughts about this file specifically or the refactor as a whole.
class Message(BaseModel):
    body: GenericItem
    request_id: Union[str, int, float]
    items: List[GenericItem]  # to support batch, this is a list (can be only 1 item)
Looks fine to me, not terribly complicated!
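For concreteness, a payload matching this shape might look something like the sketch below (the `model_name` key and the item fields are illustrative assumptions, not something this PR defines):

```python
# Hypothetical batch message; "model_name" and the item fields are
# assumed here for illustration only.
message = {
    "model_name": "classycat",   # assumed routing key for the target model service
    "request_id": "req-123",
    "items": [                   # always a list, even when there is only 1 item
        {"id": "item-1", "text": "first text to classify"},
        {"id": "item-2", "text": "second text to classify"},
    ],
}
```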
id: Union[str, int, float]
id: Union[str, int, float]  # id (in calling system) for this content
content_hash: Optional[str] = None
callback_url: Optional[str] = None
Taking this out of here feels bad - why couldn't each item have its own callback?
Classycat is implemented so that it makes a single callback with all of the data in it. I suppose we could support both options? Like if the callback is present at the top level, make a single callback with all the results; if at the individual level, respond per item? (But does the entire batch input get returned with each individual item, or does presto have to parse apart the input list?) I guess having neither or both types of callback would be an error.
Thinking about this further, I really like having the callback at the upper level, because it is an instruction to the Presto system, not to the model (the model doesn't know what to do with it)
That's fine, but it does likely mean that we can't use batch processing for any event on Check API without some somewhat significant revisions - we'd likely be adding callback functionality specifically to deal with batching. Not a terrible outcome, but it does make it more complicated elsewhere
> batch processing for any event on Check API without some somewhat significant revisions
Can you explain how you see this working so we can ensure the format encompasses that as well?
Sure - right now, if we allowed for callback urls at the individual item level, we'd basically not have to make any changes to Check-API (or Alegre, for that matter). The downside, of course, is that you'd open yourself up to making a ton of HTTP calls, which is not great (IMO the answer here is to allow callbacks at either level), but also: no revisions, and potentially a single path for receiving responses for work regardless of whether the provenance was a batch job or an individual job. If we go the route of only a top-level callback, we'd probably have to do some temporary redis key storage to remember what was sent out and what needs to be done upon receiving the results of work. Also, if we're running large batches of vectored work... are we signing up for potentially gigantic POST bodies? Will they even work in SQS, or do they need to live in SQS at all? I think they would?
Anyways, to answer your question, I think the meat of the issue is that we'd have to introduce a bunch more state management paths for remembering and dealing with the outcomes of callbacks on Check API and Alegre - right now it's basically implied, because the structure of the request contains all we need to know, but this potentially just introduces more overhead. Not a terrible thing! Just something to consider.
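To make the temporary redis key storage idea concrete, a minimal sketch (key naming, TTL, and function names are assumptions, not anything in this PR):

```python
# Rough sketch of remembering what was sent out so a top-level
# callback can be reconciled later; all names here are assumed.
import json
import redis

r = redis.Redis()

def remember_batch(request_id, item_ids, ttl_seconds=3600):
    # store the dispatched item ids under a per-request key
    r.setex(f"presto:batch:{request_id}", ttl_seconds, json.dumps(item_ids))

def on_batch_callback(request_id, results):
    raw = r.get(f"presto:batch:{request_id}")
    expected = set(json.loads(raw)) if raw else set()
    returned = {item["id"] for item in results}
    return expected - returned  # ids we sent but got no result back for
```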
OK, so it sounds like for the format, you'd be ok with "callback can be at individual item or batch level (but must be at least one and not both)"?
> remember what was sent out and what needs to be done upon receiving the results of work
Or, you have to carry enough state along with the object to know what to do when it comes back. Which is what we are trying to support in making sure the ids match back to the original payload, i.e. for timpani, the items carry a target_state.
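If we did go with "either level, at least one, not both", the rule itself is simple to enforce; a sketch (the function name and message shape are assumed, not from this PR):

```python
# Sketch of the "exactly one callback level" rule under discussion.
def validate_callbacks(message: dict) -> None:
    top_level = message.get("callback_url") is not None
    per_item = any(item.get("callback_url") for item in message.get("items", []))
    if top_level == per_item:  # true for both-present and neither-present
        raise ValueError(
            "provide callback_url at the batch level or on individual items, not both or neither"
        )
```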
I'm generally in the callback belongs at the model level and not at the item level camp.
I think it's right that we would make changes to Alegre/Check API if they start sending batches of items.
- If a caller wants individual callbacks, they should send items one at a time.
- If the caller wants one callback for a batch, then they should send a batch of items.
Right now, everything on Alegre/Check is set up for sending items one at a time. When we implement the bulk similarity endpoint on Alegre, I'd expect to make changes to the callback handling in Alegre so that it can receive back one batch of items. This makes sense because Alegre can then process that batch efficiently as a batch - i.e., calling the bulk insertion point on OpenSearch/Elasticsearch. Having items come back one at a time from a bulk submission prevents the caller from handling the response as a batch.
tl;dr fine if we want to support callbacks at an individual level, but I don't see it as a priority because we already have a workaround: If a caller wants individual callbacks from Presto they should send items to Presto one at a time.
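For example (a sketch only, assuming the standard elasticsearch client; the index name and document shape are made up), a batch callback lets the receiver do one bulk round trip instead of N single inserts:

```python
# Sketch: why receiving results as one batch is useful downstream.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_batch(results):
    # index the whole callback payload in a single bulk request
    actions = (
        {"_index": "similarity_items", "_id": item["id"], "_source": item}
        for item in results
    )
    helpers.bulk(es, actions)
```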
Alright, I'm bought in for the top-level callback - thank you both for the discussion, I think it was useful to at least air it out as it'll have ripple effects. Ship it at the top level!
ashkankzme left a comment
sorry that it took me a while to get to it. I tried my best to be thorough and read everything in depth and left some comments. looking forward to the discussions!
Presto models should use these structures to ensure that tools that consume the data can access it in a consistent way across models

Details of specific input formats are usually in the schema.py files
one thing I tried to start in the previous refactor of parse_message() was to have these model-specific types defined in the model / in a separate file than schemas.py (e.g. classycat_response.py). going forward we don't want to keep updating schemas.py for new models/edits to existing models, and instead update the files specific to the model under construction/review. I have already filed a ticket for taking out whatever 'response classes' are defined in the schemas.py file, see this jira ticket
so I would edit this line to include the example of classycat_response.py and emphasize defining these models separately going forward.
General design

* Models that don't handle requests in parallel need to implement batch format (but can give errors if they receive more than 1 item)
* "top level" elements are processed and controlled by the Presto system
It would be beneficial to specify exactly what we mean by 'top level' here, i.e. is it specific fields (we should name them if so) or are there more generic rules for defining 'top level' elements?
yes, it is specific fields (which we should name) and the general concept that the outermost set of keys talks to Presto, and the next level down talks to the specific presto model service.
* Models that don't handle requests in parallel need to implement batch format (but can give errors if they receive more than 1 item)
* "top level" elements are processed and controlled by the Presto system
* Elements inside parameters, and inside individual input items are passed through to the model
is there a reason not to pass other parameters into the models? it's good to have the flexibility of using other parameters when needed, if doing so is not disruptive
yeah, this is confusing. By "top level" I mean the outermost set of dictionary keys in the payload. I think everything is passed through to the model? Presto is responsible for enforcing and checking keys at the outermost level, but shouldn't really enforce or check inside parameters or items lists (other than that ids exist). Anything that isn't in the presto schema isn't promised to be passed through (we could always add to the schema if it becomes too constraining).
I think greater flexibility is often faster in the short term, but then everything ends up having to do lots of conditional checking for elements that may or may not exist, vs being able to check and enforce schemas and fail back to the caller before enqueuing malformed messages to models
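A sketch of the intended split (exact key names vary across this PR, e.g. items vs input_items; the other names are illustrative):

```python
# Outermost keys: validated and acted on by Presto itself.
# Everything inside "parameters" and each item: passed through to the
# model; Presto only checks that item ids exist.
payload = {
    "model_name": "classycat",                       # Presto: routing (assumed name)
    "callback_url": "https://example.org/callback",  # Presto: where to send results
    "parameters": {                                  # opaque to Presto
        "schema_id": "abc123",
    },
    "items": [
        {"id": "item-1", "text": "..."},             # "id" enforced, the rest passes through
    ],
}
```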
* Models that don't handle requests in parallel need to implement batch format (but can give errors if they receive more than 1 item)
* "top level" elements are processed and controlled by the Presto system
* Elements inside parameters, and inside individual input items are passed through to the model
* All of the content in a batch (a single HTTP request) must go to the same model
future versions of classycat with local classification could receive an input batch, some of which would be processed locally (the items with previous similar references) and some processed with an LLM (more novel items). I would say this is a desired case for one batch being processed by two different models, and the current architecture allows for it. The outputs would ofc still be as if all were processed by the same model.
It would be good to have clarity on what exactly we mean here by 'all items in a batch go to the same model'?
I guess I mean more precisely, "from the perspective of the calling service, all of the items in a batch must be dispatched to the same Presto model service". Classycat can of course internally decide to process via different LLMs "under the hood", but don't mix payloads intended for classycat with payloads intended for paraphrase multilingual in the same call.
## Outgoing requests from presto

* The outgoing response request includes the payloads and parameters from the incoming request
* items in the response must to include their id, so that the caller can match individual-level properties from incoming request.
typo: extra 'to'
class ClassyCatResponse(BaseModel):
    # TODO: replase with the presto repose class?
is this to replace ClassyCatResponse with the presto response class? can you elaborate a bit more on what this could look like in practice please?
lol this comment may beat my personal best for number of spelling errors per line!
I think I meant that the response message and list of items that ClassyCat was defining are now part of the PrestoResponse class in schemas.py and enforced for all models. So ClassyCat can import and extend that class and the ResponseItem class (into the classycat schema files) to add the ClassyCat-specific fields.
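Something like this sketch, assuming PrestoResponse and ResponseItem end up in schemas.py as described (the import path and the ClassyCat-specific field are assumptions):

```python
# Sketch of ClassyCat extending the shared presto response classes;
# the import path and field names are assumed for illustration.
from typing import List, Optional

from lib.schemas import PrestoResponse, ResponseItem  # assumed location

class ClassyCatResponseItem(ResponseItem):
    labels: Optional[List[str]] = None  # classycat-specific output field (assumed)

class ClassyCatResponse(PrestoResponse):
    items: List[ClassyCatResponseItem]
```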
if 'yake_keywords' in model_name:
if (
    result_instance is None
):  # in case the model does not have a parse_input_message method implemented
imo this is not a more readable format than what was before lol, any chance to manually make this pretty or revert this line back?
@skyemeedan is there a way to opt out of formatting everything in the future? aside from cases like these, it makes it difficult to find out what the actual diffs are.
I agree it is hard to read when substantive and formatting changes are mixed together. Which is why I'm a huge fan of configuring my editor to use a strict PEP-8 syntax formatting hook (especially if we all do it!). I find it saves a huge amount of time not having to manually mess with formatting or linting fixes.
I could make a separate PR to first apply the formatting to the whole repo? ;-)
I can also just move the comment up a line and it should fit within the line length
)

for item in data["parameters"]["items"]:
for item in data["input_items"]:
As scott notes, this is the substantive change in this file, now looping over the set of input items, unnested from parameters
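In other words, a before/after sketch of the access pattern (handle() is a stand-in for the surrounding per-item work, not a function from this PR):

```python
# Before: items were nested under "parameters"
for item in data["parameters"]["items"]:
    handle(item)

# After: items are a top-level list of input items
for item in data["input_items"]:
    handle(item)
```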
Description
Proposal for changes and batchification of presto request and response structures, standardization of argument handling across classes, and some suggestions on edits to get us there
Reference: CV2-5117
How has this been tested?
Has it been tested locally? Are there automated tests?
Are there any external dependencies?
Are there changes required in sysops terraform for this feature or fix?
Have you considered secure coding practices when writing this code?
Please list any security concerns that may be relevant.