
Optimize for "session" scenario#8

Open
pavlo-liapota wants to merge 6 commits into dmcg:master from pavlo-liapota:optimize-for-session

Conversation

pavlo-liapota (Contributor) commented Feb 12, 2023

I have started to think about how we can optimize for the scenario you were talking about: start a session and explore possible anagrams.

As I understand it, we don't need to show all (potentially millions of) possible anagrams, so we don't need to generate them right away.

In my first commit I changed the code so that we don't generate all anagrams. I return only one of them for now, but we need to think about what output we want to show to the user. Maybe all possible words from which anagrams can be built? Or maybe the top 10 most commonly used? (We may need some external resource to sort words by how often they are used.) The important part is that I still have a tree, and all anagrams can be generated from it.
It takes 3100ms to generate all anagrams for REFACTORING TO KOTL, but only 1000ms to generate the resulting tree.
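The tree idea can be sketched roughly like this (the `Node` shape and `firstAnagram` name are my illustration, not the actual types in this PR): each node holds a word plus the subtrees that can follow it, so a single anagram can be pulled out cheaply while the full set stays implicit.

```kotlin
// Illustrative sketch only: names and structure are assumptions,
// not the actual code in this PR.
class Node(val word: String, val children: List<Node>) {
    // Walk one branch to produce a single anagram, without
    // materialising the (potentially millions of) other results.
    fun firstAnagram(): String =
        if (children.isEmpty()) word
        else word + " " + children.first().firstAnagram()
}
```

Enumerating everything would mean walking every branch; returning one result only touches one path.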

In my second commit I removed the code that makes sure duplicated results like A CAT and CAT A are not generated. Basically, I removed the analogue of this:

remainingCandidateWords = remainingCandidateWords.subList(
      1, remainingCandidateWords.size
)

It is not an issue to have such duplicates in the resulting tree, as we don't generate all results anyway. This makes the code simpler and improves the performance of resulting-tree generation from 1000ms to 700ms for the REFACTORING TO KOTL input.

pavlo-liapota (Contributor, Author) commented Feb 12, 2023

Now I can describe how the cache works.

First, let's imagine that we allow duplicated anagrams like A CAT and CAT A, and that we don't care about a maximum anagram depth.
In this case, the process function will always return the same result when called with the same inputLetters. We don't care about depth, and we don't need to filter candidate words based on words that are already used, so nothing else can influence the result. This means that if, during our computation, we need to call the process function with the same input several times, we can compute the result once, cache it, and reuse it for all subsequent calls.
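In code this is ordinary memoisation; a minimal sketch (the stubbed body and the counter are my illustration, not the PR's actual process function):

```kotlin
// Minimal memoisation sketch with assumed names. With no depth limit and
// duplicates allowed, the result depends only on the remaining letters,
// so a single map keyed by those letters is a valid cache.
var computations = 0  // counts how often the real work actually runs

val cache = mutableMapOf<String, List<String>>()

fun process(inputLetters: String): List<String> =
    cache.getOrPut(inputLetters) {
        computations++
        listOf("anagrams of $inputLetters")  // stand-in for the real search
    }
```

Calling `process("ACAT")` repeatedly performs the expensive work only once; every later call is a map lookup.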

And in fact the process function is quite often called several times with the same input.
Imagine the input letters "A HOME CAT". Using just the HOME letters we can build the following anagrams: HOME, HEM O, EH MO, EM HO, HM OE. Each time the remaining letters will be "ACAT". This means that during our computation we will call the process function 5 times with the input "ACAT".
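A small helper makes that concrete (`remaining` is an illustrative function of mine, not code from the PR): subtracting the letters of the words used so far always leaves the same pool, no matter which branch we took.

```kotlin
// Illustrative helper, not from the PR: remove one occurrence of each
// letter of `word` from `letters` (spaces ignored), preserving order.
fun remaining(letters: String, word: String): String {
    val pool = word.toMutableList()
    val sb = StringBuilder()
    for (c in letters.replace(" ", "")) {
        if (!pool.remove(c)) sb.append(c)  // keep letters `word` didn't consume
    }
    return sb.toString()
}
```

Playing HOME leaves "ACAT", and playing HEM followed by O also leaves "ACAT", so process("ACAT") is reached along every one of those 5 branches.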

If we want to have maximum depth as a parameter, then we need to cache per inputLetters-and-depth combination (the resulting anagrams will of course differ for the same input letters if a different maximum depth is allowed).

And if we don't want to generate duplicated anagrams, then we additionally need the word index as a parameter, so that after using a word we only try words with the same or a higher index as continuations. In this case we need a cache per inputLetters, depth, and index combination.
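In that fullest variant the cache key could look like this (a sketch with assumed names): the same letters can now yield different subtrees, so all three values must participate in the key, and a data class gives us the equals/hashCode this needs for free.

```kotlin
// Sketch with assumed names, not the PR's actual key type.
data class CacheKey(val inputLetters: String, val depth: Int, val minWordIndex: Int)

// The value type would be whatever the tree node representation is.
val treeCache = HashMap<CacheKey, Any>()
```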

pavlo-liapota (Contributor, Author) commented Feb 12, 2023

I can suggest the following implementation for your scenario.

The user provides input letters. We compute a tree with all results and show all (or the top N) possible first words.
Then the user selects a first word and we show all possible second words.
We repeat until no letters are left.

So we just need to compute the tree once; after that, everything can be taken from the tree in constant time.
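The session loop then reduces to tree navigation; a rough sketch (the `Node` type and function names are my assumptions, mirroring the description above, not the PR's code):

```kotlin
// Assumed tree shape: the root's children are the possible first words.
class Node(val word: String, val children: List<Node>)

// Words to offer the user at the current position in the tree.
fun nextWords(current: Node): List<String> = current.children.map { it.word }

// Descend into the child matching the user's selection. A List lookup is
// linear in the number of children; a Map<String, Node> would make it O(1).
fun select(current: Node, word: String): Node =
    current.children.first { it.word == word }
```

Each user choice is just a lookup on the current node, so no anagram generation happens during the session itself.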

In this case we don't need a depth parameter, so I removed it in my third commit to make the code a bit faster and even simpler. But of course we can keep it if needed.

pavlo-liapota (Contributor, Author) commented Feb 12, 2023

Now we can compute a tree for the REFACTORING TO KOTL input in 450ms,
and for REFACTORING TO KOTLIN in 1600ms.

pavlo-liapota (Contributor, Author) commented:

In the suggested solution we don't reuse the cache during a session; we just use it to compute the tree faster, and from then on we use only the tree.
We don't need to reuse the cache between sessions either, but if we do (for example, to speed up sequential sessions with similar inputs), we may need to limit its size to avoid an out-of-memory exception.
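If the cache were ever kept across sessions, one standard way to bound it (an assumption for illustration, not something this PR does) is an access-ordered LinkedHashMap that evicts its eldest entry once a capacity is exceeded:

```kotlin
// Sketch of a size-bounded cache; not part of this PR.
// accessOrder = true makes iteration order least-recently-used first,
// so the evicted "eldest" entry is the one touched longest ago.
class BoundedCache<K, V>(private val maxEntries: Int) :
    LinkedHashMap<K, V>(16, 0.75f, true) {
    override fun removeEldestEntry(eldest: MutableMap.MutableEntry<K, V>): Boolean =
        size > maxEntries
}
```

This reuses java.util.LinkedHashMap's built-in eviction hook rather than implementing LRU bookkeeping by hand.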

pavlo-liapota (Contributor, Author) commented:

In the "session" scenario we don't generate all anagrams, and some code like permuteInto is no longer needed. But we may want to keep that code and its related tests.
Just extending the existing code with new code may make it hard to optimize for both scenarios.
So should we just move the new code to a new package and write new tests for it?

pavlo-liapota (Contributor, Author) commented:

I have reverted my changes and copied the new implementation for the session scenario into another package.
I have also created speed tests for it.

pavlo-liapota (Contributor, Author) commented:

I have implemented a simple console application that lets you explore possible anagrams.
For example, I was able to find the anagram RETROFITTING ON CLOAK for the input REFACTORING TO KOTLIN :)
