Merge Czech stemmer #151
base: master
|
@ojwb, @jimregan Hi guys, what is the state of this PR? Can we help with it somehow? Another implementation of a Czech Snowball stemmer can be found here: https://www.fit.vut.cz/research/product/133/.en (GNU GPL) |
|
@jan-zajic It needs the points above resolving, but I think that's just a case of me finding the time to. I'm trying to clear the backlog of Snowball tickets, so hopefully soon. We couldn't really merge a GNU GPL stemmer as currently Snowball has a BSD-style licence - moving to a mixed licence situation would make things harder to understand and manage for users. From a quick look this other stemmer appears to use the usual R1 definition (which is good), but it is quite a lot more complex (which is bad unless it does a better job as a result). Do you know how it compares in effectiveness to the one in this PR? If it's better, do you know if the copyright holders might consider relicensing it for inclusion in Snowball releases? |
fcc1b03 to
3afee58
Compare
|
I had a bit more of a look at the GPL snowball stemmer. I noticed the However fundamentally we can't use this implementation without an agreement to relicense. The source download has So I went back to looking at the Dolamic stemmer. Comparing the snowball implementation with the Java implementations http://members.unine.ch/jacques.savoy/clef/CzechStemmerLight.txt and http://members.unine.ch/jacques.savoy/clef/CzechStemmerAggressive.txt I spotted some inconsistencies (code snippets in the same snowball/light/aggressive order): vs vs Note that the light stemmer comment says There's another inconsistency in vs vs Here the comments I tried changing the first case in the snowball code and the differences look plausible but unfortunately I don't know the Czech language to a useful extent. I didn't try the second case yet. @jimregan @jan-zajic Any thoughts? |
|
Here's a scripted analysis of the effects of the various changes to palatalise I covered above:
|
|
There's one remaining inconsistency I've spotted, this one's in Here the light stemmer removes The older version of the light stemmer listed in the original paper removes all four suffixes. Changing to removing all 4 gives:
|
The Java code removes this ending but it was missing from the Snowball version. Looking at the changes resulting from this, it seems a clear improvement so I've concluded it was an accidental omission. See snowballstem#151
|
Three more notes: Comparing the code I noticed that I also noticed that there's a bug in the Java versions in one group of palatalise rules: Here we check The final thing I noticed is that the Snowball version applies the palatalise step rather differently to the Java versions. E.g. consider This changes Almost every case is handled like this in snowball, except for The That at least makes things more similar, but fundamentally it seems the palatalise step in snowball will be much less effective as the final character will often have already been removed. The code in the paper (which seems pseudo-code for an earlier version of the light stemmer) removes the vowels like the snowball version does, then unconditionally performs (This also may mean that the conclusions in the paper about the light vs aggressive stemmers may not entirely apply to the Java versions we have access to, but in the absence of a comparison of the Java versions going with the light stemmer still seems sensible.) |
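To make the difference concrete, here is a minimal Python sketch of a Java-style palatalise step, in which the first character of the removed suffix is left in place so the rules can inspect it. The rule table is an assumption pieced together from the endings named in this thread (`ci/ce/či/če`, `zi/ze/ži/že`, `čtě/čti/čtí`, `ště/šti/ští`); the function names are hypothetical, and this is not the code of either implementation.

```python
# Hypothetical rule table, assembled from the endings mentioned in this thread.
PALATALISE_RULES = [
    (("čtě", "čti", "čtí"), "ck"),
    (("ště", "šti", "ští"), "sk"),
    (("ci", "ce", "či", "če"), "k"),
    (("zi", "ze", "ži", "že"), "h"),
]


def palatalise(word: str) -> str:
    """Undo palatalisation on a word that still ends in the suffix's first vowel."""
    for endings, replacement in PALATALISE_RULES:
        for ending in endings:
            if word.endswith(ending):
                return word[: -len(ending)] + replacement
    # No rule matched: just drop the character that was left behind.
    return word[:-1]


def remove_suffix_then_palatalise(word: str, suffix: str) -> str:
    """Strip all but the first character of suffix, then palatalise."""
    if not suffix or not word.endswith(suffix):
        return word
    kept = word[: len(word) - len(suffix) + 1]
    return palatalise(kept)


print(remove_suffix_then_palatalise("ruce", "e"))
print(remove_suffix_then_palatalise("hradem", "em"))
```

If the vowel has already been deleted before `palatalise` runs, none of the two- and three-character endings in this table can match, which is one way to read the concern above about the Snowball version being less effective.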
|
A further difference is that in the snowball implementation if It looks like this could be a deliberate change, as the snowball code does However, the cursor doesn't get reset before We can fix just the latter with |
|
Any progress on this issue? As we understand it, there is some kind of comparison between two implementations -- one of which cannot be used anyway because of licensing -- and there are some tradeoffs on both sides? Maybe the original (simpler?) contributed algorithm (with an acceptable license) is good enough? Can we somehow help to move this forward? I reviewed the issues above and at this moment they are too technical for me (I'm not familiar with the stemming problem domain), but maybe I could provide feedback on something as a Czech speaker. |
Progress stalled on needing input from someone who knows Czech reasonably well. I thought I'd found someone who could help (this was probably late 2023/early 2024) but they never got back to me and I failed to chase it up. If you're a Czech speaker and wanting to get this resolved, that would definitely be useful.
There is a GPL implementation of a different algorithm mentioned above, which indeed would need relicensing as Snowball uses a 3-clause BSD licence. That one would also need to be rewritten in Snowball as well as relicensed. However the comparisons are against a Java implementation that's meant to be of the same algorithm (and this Java implementation is 2-clause BSD so compatible, see: http://members.unine.ch/jacques.savoy/clef/).
We don't want to just merge something with unresolved issues because that's likely to need significant changes later, and those are disruptive for typical users of these stemmers (because you need to rebuild your whole search database).
I'll need to review the discussion as it's been 9 months, but I think we should be able to resolve this together. |
|
Ok thanks for clarification. Count me in if you need help. |
|
@hauktoma Great. There are a few points to resolve, so I'll cover one at a time. The first question is really about syllables in Czech. I'll try to give some background to what we're doing and why. If you don't follow please say and I can clarify. (I'm also happy to do this on chat or a video or phone call if you think it would be easier to do it interactively.) We want to avoid the stemming algorithm removing suffixes too aggressively and mapping words to the same stem which aren't actually related (or are somewhat related but really have too different a meaning). Most of the Snowball stemmers make use of a simple idea to help with this, which is to define regions at the end of the word from which a suffix can be removed. For most languages these are defined by counting the number of spans of vowels, then of non-vowels, etc - https://snowballstem.org/texts/r1r2.html shows some examples. As well as R1 and R2 there's also an RV for some languages which that page doesn't mention. This is essentially approximating counting syllables, while the original Czech stemming algorithm this implementation is based on used a cruder character-counting approach instead. In his original Snowball implementation jimoregan essentially retrofitted use of R1 and RV which I think was a good idea. However it seems in Czech that clusters of just consonants can form a syllable, so probably our R1 and RV definitions for Czech ought to take that into account. See my comment above for what led me to this conclusion, but the key point is this quote:
And the actual question is for the purposes of determining these regions, should we consider And if so, should |
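For readers unfamiliar with the regions, the following Python sketch computes R1 and R2 using the usual Snowball definition (R1 starts after the first non-vowel that follows a vowel; R2 is the same rule applied inside R1). The Czech vowel set and the function names are assumptions made for illustration; they are not taken from the PR.

```python
# Assumed vowel inventory for illustration only.
CZECH_VOWELS = set("aeiouyáéěíóúůý")


def region_start(word: str, start: int = 0) -> int:
    """Index just after the first non-vowel that follows a vowel, from start."""
    i, n = start, len(word)
    while i < n and word[i] not in CZECH_VOWELS:  # skip to the first vowel
        i += 1
    while i < n and word[i] in CZECH_VOWELS:  # skip the run of vowels
        i += 1
    # i now sits on the first non-vowel after a vowel, or at the end of the word.
    return i + 1 if i < n else n


def r1(word: str) -> str:
    return word[region_start(word):]


def r2(word: str) -> str:
    return word[region_start(word, region_start(word)):]


for w in ("beautiful", "deskami", "vlk"):
    print(w, repr(r1(w)), repr(r2(w)))
```

With this plain definition a word such as `vlk`, which contains no letter from the vowel set, gets an empty R1. That is the situation the syllabic-consonant question is about.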
|
To be honest I am not entirely sure about the idea handling the I'll try to sum the points up here and then provide examples at the end:
My betting/statistical impression is that implementing this may have more negative effect than positive. Especially for the @ojwb can you please review my reasoning about this and provide feedback on whether it is correct? If you think this may be worth a bit more investigation or that the examples provided below are not good enough to make a decision, I can try to consult some colleagues or dig up some more formal materials about this. @ojwb maybe one quick question and clarification: you mentioned R1. Does that mean that the default stemming strategy for a language (unless someone who knows the language implements it differently) is to remove one suffix? Does R2 mean removing two suffixes? Can the number of suffixes removed be variable under certain conditions? What is the setting/strategy for Czech (R1 or R2) and where did it come from? Note: I have no problem with discussing this in real time on a call, but maybe keep it as an option for when we hit a wall on something or some complex clarification is needed. As a total layman in stemming/linguistics I am not sure if I would be able to have a real-time conversation on this topic. But if you get the feeling that explaining something would be too much trouble in written/async form, let's do it. Example of more complex word for
|
|
Hi @ojwb, @hauktoma, The current discussion in this thread is beyond my time and expertise, so I decided to try to contact and find experts from I will try to reach people who could help more with this topic and I will let you know how it turned out. I think that if there is support for the Czech language in Snowball, it must be done as best as possible, since the impact will be great on a large number of open source projects and solutions above them. |
|
Thanks. I need to work through this in detail, but a couple of notes:
I think we'd probably just do something like work left to right (or perhaps right to left if that turns out to work better) and if a consonant is determined to be a syllabic consonant then it would not be regarded as a consonant for the letter which follows.
No, they're just different regions, and the region which is appropriate for each suffix is chosen based on considering the language's structure, and also empirically what seems to work better. It's typically better to lean towards being conservative in when to remove since overstemming is more problematic than understemming.
There are often conditions on whether a particular suffix is removed, and there's often an order suffixes are considered in, so removing one suffix may expose another that can then be removed too. I think jimregan came up with the current region setting for Czech, presumably based on the Java implementation's cruder character counts. |
|
I think trying to resolve some of the simpler points above will help us resolve the others, as they're somewhat interconnected (if nothing else it'll be some progress!).
I tried comparing the CzechStemmerLight java stemmer as downloaded and with this fix applied:
|
|
I've compiled a list of things to resolve at the top of the ticket.
Testing strongly shows
Changing the Snowball implementation makes no difference here (probably due to the oddness around when to remove a character vs calling |
|
I noticed another oddity in This version leaves the first character of a removed suffix behind when calling This means Testing changing this to handle these suffixes like others where we call |
|
To check there weren't any further discrepancies between the Java and Snowball versions, I tried adjusting the Snowball version to use the same stem-length checks as the Java code (with the various fixes) instead of R1 and RV: Doing this, I found we can split palatalise to simplify things. The main point of note though is instead of The difference is that the former will remove the longest of the suffixes that is in R1, while the latter will find the longest of the suffixes and only remove it if it is in R1 (e.g. Need to actually test which works better, but the former is what the Java code does. Update: Testing show
|
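The distinction drawn in the previous comment between `setlimit tomark p1 for ([substring])` and `[substring] R1` can be sketched in Python as follows. This is only an illustration of the two matching strategies as described there; the helper names, the suffix list and the value of `p1` are assumptions chosen for the example.

```python
def longest_suffix(text, suffixes):
    """Longest entry of suffixes that text ends with, or None."""
    matches = [s for s in suffixes if text.endswith(s)]
    return max(matches, key=len, default=None)


def strip_longest_within_r1(word, p1, suffixes):
    """Only consider suffixes lying wholly inside R1 (word[p1:])."""
    match = longest_suffix(word[p1:], suffixes)
    return word[: len(word) - len(match)] if match else word


def strip_longest_if_in_r1(word, p1, suffixes):
    """Find the longest suffix first, then remove it only if it is in R1."""
    match = longest_suffix(word, suffixes)
    if match and len(word) - len(match) >= p1:
        return word[: len(word) - len(match)]
    return word


SUFFIXES = ["ech", "atech"]  # illustrative subset
print(strip_longest_within_r1("datech", 3, SUFFIXES))
print(strip_longest_if_in_r1("datech", 3, SUFFIXES))
```

In this example the first strategy falls back to the shorter suffix `ech`, while the second finds `atech`, sees that it starts before `p1`, and leaves the word unchanged.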
|
I've been looking at using the palatalise approach from the previous comment with R1 based on vowels. It causes a lot of changes, the vast majority for the better:
Based on the above, it seems clear we should adjust palatalise in this way, and then take a look at the splits and see if we can eliminate most of them. |
In order to try to better understand this I compared the suffixes with those listed at https://en.wikipedia.org/wiki/Czech_declension (which I'd expect to be a reliable source for something like this, but if there's a better one please point me at it). Suffixes we remove but which wikipedia's list doesn't seem to support:
I could perhaps believe There are also two suffixes we don't remove but wikipedia lists:
@hauktoma Can you help resolve any of these? |
Use a definition of R1 more like the usual Snowball one, but take syllabic consonants 'l' and 'r' into account. It seems 'm' and 'n' can also be syllabic consonants but are much rarer so we ignore these for now at least. Testing suggests enforcing a minimum of 3 characters before R1 (like the Danish, Dutch and German stemmers do) helps so we do that here too. See snowballstem#151
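A rough Python sketch of what such an R1 definition could look like is given below. The rule used to decide when `l` or `r` is syllabic (preceded by a consonant, and followed by a non-vowel or the end of the word) is an assumption made for illustration, as are the vowel set and all names; the Snowball code in the PR may decide this differently.

```python
VOWELS = set("aeiouyáéěíóúůý")  # assumed vowel inventory
SYLLABIC = set("lr")            # 'm' and 'n' are ignored, as in the commit message
MIN_PREFIX = 3                  # minimum number of characters before R1


def nucleus_flags(word):
    """Left to right, mark each character that acts as a syllable nucleus."""
    flags = []
    for i, ch in enumerate(word):
        if ch in VOWELS:
            flags.append(True)
        elif ch in SYLLABIC:
            # Assumed rule: syllabic only between a consonant and a non-vowel/end.
            after_consonant = i > 0 and not flags[i - 1]
            before_non_vowel = i + 1 == len(word) or word[i + 1] not in VOWELS
            flags.append(after_consonant and before_non_vowel)
        else:
            flags.append(False)
    return flags


def r1_start(word):
    flags, n, i = nucleus_flags(word), len(word), 0
    while i < n and not flags[i]:  # skip to the first nucleus
        i += 1
    while i < n and flags[i]:      # skip the run of nuclei
        i += 1
    start = i + 1 if i < n else n
    return min(max(start, MIN_PREFIX), n)


for w in ("vlkem", "prstem", "deskami", "oko"):
    print(w, repr(w[r1_start(w):]))
```

Under these assumptions `vlkem` gets R1 `em`, whereas a vowel-only definition would leave its R1 empty.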
We can just handle the first character specially - after that we know the previous character is a consonant because otherwise we'd have already stopped. See snowballstem#151
There seems no benefit from having a separate region we can remove possessive suffixes in. See snowballstem#151
This is a singular masculine animate instrumental form - e.g. koněm (horse). Results in 28 merges on the sample vocabulary which all look good.
Previously -í* was handled like -ě*, but experimentation shows treating it the same as -i works much better.
Use {uo} instead of {u*} for ů; use e.g. {ev} instead of {e^} for ě.
This makes no difference to the output, but reduces work for short words we wouldn't modify anyway, some of which are likely to be very common.
This improves handling of about 30 words.
This improves handling of about 90 words.
This improves handling of about 360 words.
This improves handling of about 23 words.
This improves handling of about 13 words.
This improves handling of about 22 words.
This improves handling of about 20 words.
This only improves handling of a small number of words (3 in the sample vocabulary and I know of at least 2 more), but it's a simple rule which doesn't seem to have false positives.
|
I've come to the conclusion that the current converting of -č to -k after removing a suffix starting with e, i, or í does slightly more harm than good. In case a Czech speaker wants to take a look, here's the comparison for removing these two rules: Based on Firefox's machine translation and some looking up of words in wiktionary, to me it seems the merges are almost all improvements, the splits are mostly worse (but some are actually better), and the more complicated cases are mostly better. It seems the cases these rules help are verb and adjective forms (either getting conflated with other forms of the same verb/adjective, or conflating with a related noun). Perhaps there is scope for more targeted rules which aim to address only these cases. |
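For anyone wanting to reproduce this kind of comparison, the terms "merges" and "splits" can be made concrete with a small Python sketch: group a vocabulary by stem under the old and the new rules, then report new groups that are unions of old ones (merges) and old groups that are unions of new ones (splits). The function names and the two toy stemmers are hypothetical; this is not the script used to produce the comparison pages.

```python
from collections import defaultdict


def stem_classes(words, stem):
    """Partition words into sets that share a stem."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {frozenset(group) for group in groups.values()}


def is_union_of(group, partition):
    """True if group is made up of two or more whole sets from partition."""
    parts = [p for p in partition if p & group]
    return len(parts) >= 2 and all(p <= group for p in parts)


def compare_stemmers(words, old_stem, new_stem):
    old = stem_classes(words, old_stem)
    new = stem_classes(words, new_stem)
    merges = [g for g in new - old if is_union_of(g, old)]
    splits = [g for g in old - new if is_union_of(g, new)]
    return merges, splits


# Toy stemmers for illustration: drop the last letter, with or without c -> k.
def drop_last(word):
    return word[:-1]


def drop_last_and_harden(word):
    stem = word[:-1]
    return stem[:-1] + "k" if stem.endswith("c") else stem


merges, splits = compare_stemmers(["ruka", "ruku", "ruce"], drop_last, drop_last_and_harden)
print("merges:", [sorted(g) for g in merges])
print("splits:", [sorted(g) for g in splits])
```

Groups that change without being a clean merge or split would fall into the "more complicated cases" bucket, which this sketch does not report.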
Based on Firefox's machine translation and some looking up of words in wiktionary, to me it seems the merges are almost all improvements, the splits are mostly worse (but some are actually better), and the more complicated cases are mostly better. This seems particularly true when the suffix removed starts with `e` - for those there are hardly any words removing this rule harms.
|
Same conclusion for -ž to -h (especially after removing a suffix starting with e, where it is almost never helpful to change -ž to -h). Comparison: compare.html |
Based on Firefox's machine translation and some looking up of words in wiktionary, to me it seems the merges are almost all improvements, the splits are mostly worse (but some are actually better), and the more complicated cases are mostly better.
|
Similarly for -z to -h: compare.html The rules for -čt and -št seem almost universally good (I already added 6 exceptions for -št). The rules for -c to -k are harming some cases, but they improve significantly more than they harm. Some exceptions may be helpful there too. |
|
červencem (July) and červenka (robin) still are though. There is an inherent ambiguity here as července is a form of both words, but it's bad that we make this worse. |
|
The merges nearly all seem to be better. One niche exception I have noticed: The splits that are definitely better and separate words with different meanings: Some of the wrong splits seem to be splitting nouns and verbs of the same meaning, and I am not sure whether that is desirable. as well as noun vs adjective of the same meaning: I’m not sure how much value I’m adding here, as I’m mainly highlighting points you’ve likely already noticed through the machine translations. However, I was able to get in touch with people at Charles University in Prague, specifically from the Institute of Formal and Applied Linguistics. I will forward this discussion to them, and hopefully they will be able to assist with these nuances and help guide this forward. |
While "wrong", neither word carries much useful meaning in the context of a search query (if you're using a stopword list, both these words would likely be on it). So I'm not too concerned about this one.
It is generally desirable to conflate different parts of speech with the same meaning. However if we have to trade that off against the wrong splits above, I'd probably choose to avoid the wrong splits since they will tend to be more problematic than the additional merges are helpful (searching for "ancestor" and finding apparently unrelated pages is unhelpful, whereas a document on the subject of "attack", "sheep" or "monkey" is likely to use multiple parts of speech and so to still be found). If there were very few wrong splits and of very rare words (and/or perhaps of words with a tenuously connected meaning) there might be an argument to be made. Also even with a (not actually achievable) perfect stemming algorithm, we won't conflate with other words with the same or very similar meaning - "attack" vs "onslaught", "sheep" vs "lamb", "monkey" vs "ape". This algorithm also doesn't generally aim to handle verb forms (it just happens to handle some because some verb suffixes are the same as some noun suffixes). There is an "aggressive" variant which does try to handle more parts of speech, but apparently it's "known to overstem" so we've gone for the light approach. Perhaps that's worth a revisit at some point (since e.g. the work-in-progress here has a more nuanced definition of the minimum stem than the original paper), but getting this merged has already dragged out for too long so I'm reluctant to broaden the scope significantly when we're pretty close to the finish line.
If nothing else, it's reassuring to have feedback, and it's definitely appreciated.
Probably the most pertinent question they might be able to help with is coming up with rules about when we should try to change a soft trailing consonant to a hard equivalent after removing certain suffixes. For example, adding a suffix starting with e, i or í to a stem ending k will change that k to a c, so when removing such a suffix we may want to change c back to k. The problem is that some stems already ended in c before adding a suffix and changing those to k can be problematic (we don't really care what the stem is as it's handled as an opaque token, so it's only actually a problem if it results in a stem collision with something else but if the stem ending c is a valid form of the word it'll end up split from other forms). So ideally we want to be able to look at a stem ending -c and know what to do with it (and erring on the side of not changing it). The code currently in git always changes it to a -k, which seems to help more words than it hurts but gets quite a lot of cases wrong. One common approach used in other stemmers is to check the character before (e.g. only change if the c is preceded by one of bčhkňřsšťuy is a condition I've tried). And similarly for -z to -h, etc (which I've currently turned off as without conditions they seem more harmful than helpful). I've been looking at whether we can usefully mine this information from wiktionary - there is machine-readable JSONL available from https://kaikki.org/dictionary/rawdata.html (both the English and Czech versions are potentially useful as each wiktionary includes entries for other languages, e.g. https://en.wiktionary.org/wiki/kancl%C3%A9%C5%99#Czech - there are unsurprisingly more Czech entries in the Czech wiktionary, but being too comprehensive here may be unhelpful and "Czech words someone has bothered to write an English definition for" might be a reasonable way to select a subset of less obscure words). |
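As an illustration of the "check the character before" idea, here is a minimal Python sketch that changes a trailing c to k only when the preceding character is in a given set. The character set is the one quoted in the comment above as a condition that was tried; the function name is hypothetical, and nothing here says how well this particular condition performs on real vocabulary.

```python
# Preceding characters after which a trailing c is changed back to k.
# Taken from the condition mentioned in the comment above; treat as experimental.
HARDEN_AFTER = set("bčhkňřsšťuy")


def harden_trailing_c(stem: str) -> str:
    """Change a final c to k only if the character before it is in HARDEN_AFTER."""
    if len(stem) >= 2 and stem.endswith("c") and stem[-2] in HARDEN_AFTER:
        return stem[:-1] + "k"
    return stem


# "ruc" is what is left of ruce (a form of ruka) after removing -e;
# "ulic" is the stem of ulice, which genuinely ends in c.
for stem in ("ruc", "ulic", "c"):
    print(stem, "->", harden_trailing_c(stem))
```

The same shape of rule could be written for -z to -h and the other replacements, each with its own set of preceding characters.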

This has been on the web site since 2012, but never actually got
included in the code distribution.
Points to resolve:
- č suffix in snowball vs če in Java (Snowball seems to have copied -č typo in Java comment)
- čtí/ští in Java vs čté/šté in Snowball (again seems to be due to Java comment typo)
- len- 2 instead of len- 3 for Java ště/šti/ští check. Seems fairly clear improvement.
- palatalise.
- palatalise doesn't otherwise match.
- do_case doesn't make a replacement then do_possessive won't get called, but in the java code, removePossessives is always called. Merge Czech stemmer #151 (comment)
- palatalise except for -es/-ém/-ím
- setlimit tomark p1 for ([substring]) vs [substring] R1
- -ětem isn't listed by https://en.wikipedia.org/wiki/Czech_declension but seems to be valid from e.g. https://en.wiktionary.org/wiki/hrab%C4%9B https://en.wiktionary.org/wiki/markrab%C4%9B and https://en.wiktionary.org/wiki/ml%C3%A1d%C4%9B
- -os, -es, -iho, -imu aren't listed by https://en.wikipedia.org/wiki/Czech_declension
- -ich seems to only be a suffix for two pronouns
- -ima? Probably not.
- -ímu (with a diacritic on the i)? Yes.
- -ěte and -ěti while the aggressive stemmer removes -ete and -eti (no caron on the e). The snowball implementation follows the light stemmer. The older version of the light stemmer listed in the original paper removes all four suffixes. Analysis in Merge Czech stemmer #151 (comment) suggests maybe to leave as-is? Probably this was trying to make the stemmer partly ignore diacritics, see next point.
- { desce desk deska deskami deskou desková deskové deskový deskových desku desky deskách } + { dešti deštích deště } - seems to be conflating "plate" and "rain"; simple tests suggest this (and numerous other conflations due to palatalise) are fixable by imposing some sort of region check on the palatalise step, but need to experiment to determine what region definition is appropriate (and whether it should be the same for all palatalise replacements)
- červenk but so does červenka ("robin"); also some other forms of "July" such as červencový stem to červenc. There is an inherent ambiguity here as července is a form of both words, but it's bad that we make this worse.