fire219 · kj7rrv · Jul 11, 2022 · Jul 11, 2022 · Jul 11, 2022 · Jul 12, 2022
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,7 @@
 config.yaml
 knownUsers.json
 spamLog.json
+gptc_data
+reputation_data
 __pycache__
 spamImages
diff --git a/GPL-3.0 b/GPL-3.0
diff --git a/LICENSE b/LICENSE
@@ -1,12 +1,34 @@
-Copyright 2021 Matthew Petry (fireTwoOneNine), Samuel Sloniker (kj7rrv)
+Copyright 2021-2024 Matthew Petry (fireTwoOneNine), Samuel Sloniker (kj7rrv)
 
-Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), 
-to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, 
-and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+This program is free software: you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free Software
+Foundation, either version 3 of the License, or (at your option) any later
+version.
 
-The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the GNU General Public License for more details.
+You should have received a copy of the GNU General Public License along with
+this program. If not, see <https://www.gnu.org/licenses/>.
 
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
-DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
-OR OTHER DEALINGS IN THE SOFTWARE.
+Code from commit 9dbe028c8f8fb765e963cd5cd59b8a2b04a30178 and earlier,
+including all code by Matthew Petry, was released under the MIT license, and
+may still be used under its terms:
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
+of the Software, and to permit persons to whom the Software is furnished to do
+so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-![Cedar Sentinel](/readme_files/logo_sm.png)
+![CedarSentinel](/readme_files/logo_sm.png)
 ## A Discord/IRC bot for automated spam detection
 
 ### About
@@ -7,33 +7,38 @@
 automatically detect spam messages *(or any sort of message you don't like!)*.
 It can alert your moderators instantly, allowing them to take action faster.
 It also has support for
-[**Matterbridge**](https://github.com/42wim/matterbridge/)-type chat bridges,
+[Matterbridge](https://github.com/42wim/matterbridge/)-type chat bridges,
 allowing it to also handle messages from many more chat platforms!
 
+CedarSentinel can be extended using plugins. It comes with three plugins by
+default. `cs_gptc` uses [GPTC](https://git.kj7rrv.com/kj7rrv/gptc) to analyze
+the content of messages. `cs_reputation` tracks user reputation; this can be
+incremented or decremented automatically. Finally, `cs_length` simply indicates
+the number of characters in the message.
+
+Unless you are writing your own plugin, you will most likely want to use the
+`cs_gptc` plugin, as it does most of the work of determining whether or not a
+message is spam. It is also a good idea to use
+`cs_reputation` to prevent the bot from deleting messages from trusted users.
+GPTC does occasionally flag messages incorrectly, so exempting established
+users from having messages automatically deleted helps to reduce false
+positives. `cs_length` may be helpful if you find that CedarSentinel often
+incorrectly flags short messages as spam; GPTC seems to be less accurate at
+classifying very short messages. You can configure CedarSentinel to not delete
+messages under a certain length.
+
 ### How to install
 
 ```bash
 git clone https://github.com/fire219/CedarSentinel.git
 cd CedarSentinel
-pip install pyyaml discord.py irc git+https://git.kj7rrv.com/kj7rrv/gptc@master
-cp exampleconfig_<irc|discord>.yaml config.yaml
+pip install pyyaml discord.py irc gptc tornado lark
+cp example_config.yaml config.yaml
 # edit the config file with your editor of choice!
 ```
 
-After doing this, Cedar Sentinel can be executed in the same way as any other
-Python script.
-
-#### Configuring for use with OCR
-
-CedarSentinel can be optionally configured to do [Optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (also known as OCR) on all images. This system requires many additional Python modules to be installed:
-
-```bash
-apt install tesseract-ocr # or the equivalent command in your OS of choice
-pip install opencv-python pytesseract numpy # this command is unnecessary if you don't want image OCR functionality
-# change the value of "ocrEnable" in your config.yaml file to "true" to enable OCR
-```
-
-If any of the above modules are missing, CedarSentinel will alert you at startup, but otherwise will work normally (albeit with OCR disabled).
+After doing this, CedarSentinel can be executed in the same way as any other
+Python script; the file to run is `bot.py`.
 
 ### Using CedarSentinel
 
@@ -50,136 +55,92 @@ that *aren't* spam, check out the ***How to train models*** section below.
 ### How to train models
 
 As previously mentioned, CedarSentinel comes with a model trained for the
-Pine64 Chat Network. If this model just isn't working for you, then it is
-time to check out the **modelbuilder.py** tool included with CedarSentinel.
-
-The Model Builder is designed to intake spam logs that CedarSentinel itself
-creates *(assuming you have it enabled in the config!)* and convert them into
-GPTC models that CedarSentinel makes its decisions based on. 
-
-As a starting point, **empty_model.json** is included for use in training from
-scratch. If you wish to do so, configure CedarSentinel to use this model. If
-you want to build on the existing data, leave it with the default model. Now,
-let it sit running in your chat for a while. Assuming you haven't dramatically
-changed `script.txt` (see "CedarScript" below), it should start logging
-messages in your server. *(If you used the empty_model, it will likely log all
-messages!)*. After an indeterminate amount of time (up to you, but more
-messages are better!), it's time to use the Model Builder.
-
-First, run it as `modelbuilder.py import`. This will import the messages from
-your "spam" log into its workspace. It should then show you a message and ask
-you if it is Good, Spam, or Unknown. Answer accordingly, and then hit enter.
-It will then give you the next message, and so on. Take your time, and make
-sure you label messages correctly. CedarSentinel relies on this data to make
-its decisions. If you need to stop at any time, hit Ctrl+C, and it will save
-your progress and exit. Remember to delete `spam_log.json` after
-`modelbuilder.py import`! When its time to start labeling the messages again,
-simply run `modelbuilder.py` without arguments, and it will start back where
-you left off.
-
-Once you have labeled all messages in your log, you now need to compile the
-model. You can do this by simply running `modelbuilder.py compile`. At this
-point, your new model (at **compiled_model.json**) is ready for use in
-CedarSentinel! Go ahead and delete your existing spam log, set CedarSentinel
-to use this model in the config, and restart it! If you've followed these
-instructions properly, CedarSentinel should now be trained to work on your
-server.
-
-As time goes on, you may refine the model by continuing to use the logs
-CedarSentinel creates. You can follow these instructions again from
-`modelbuilder.py import`, and as long as you have not deleted the workspace
-file (**model_work.json**), it will build on your existing model. ***It will
-take time to get CedarSentinel fully acclimated with your server, so don't be
-alarmed if the first few iterations aren't very effective!*** As the model
-improves, so will the detection rate.
-
-If you want to export your model in raw format for use with GPTC outside
-CedarSentinel, run `modelbuilder.py export`. The raw model will be in
-`raw_model_export.py`.
+Pine64 Chat Network. If this model just isn't working for you, then it is time
+to check out the GPTC module's training tool. This is accessed through a Web
+interface, available at [`http://localhost:8888`](http://localhost:8888) when
+CedarSentinel is running. Documentation is included on the page.
+
+For security reasons, this tool can only be accessed from `localhost`. If you
+want to use it over a network, then you can either use a reverse proxy or
+access it through a command-line browser over SSH; it is designed to work well
+in [Lynx](https://lynx.invisible-island.net/current/) as well as graphical
+browsers.
+
+In most cases, the default model should be a good starting point, and you can
+build on it instead of starting over. However, if you find that CedarSentinel
+is not working well even after retraining with several days of new data, it may
+be best to delete the model and start from scratch. To do this, simply stop
+CedarSentinel, delete `gptc.db`, and restart the bot. This will delete the
+entire model and all logged messages, allowing for a fresh start.
 
 ### CedarScript
 
-CedarSentinel's responses to messages are defined by `script.txt`. Note that a
-syntax error in `script.txt` will likely result in an error message that
-appears to be caused by a bug in CedarSentinel. (If you do get an error,
-please submit an issue anyway, just in case it is a CedarSentinel bug. If you
-have changed `script.txt`, please include your modified version.)
-
-#### `if ... end`
-
-    if ...
-        ...
-    end
-
-#### `if ... else ... end`
-
-    if ...
-        ...
-    else
-        ...
-    end
-
-#### Comparisons
+CedarSentinel's responses to messages are defined by `script.txt`. This file is
+written in CedarScript, a domain-specific language (DSL) designed for this
+purpose. Please note that CedarScript is not Turing-complete; it is designed to
+express simple if-else logic, not advanced algorithms.
 
-CedarScript's comparison syntax is different from that of most programming
-languages. The following are some simple comparisons:
+#### Conditions
 
-    <0.5 confidence>
-    [20 length]
-    {5 reputation}
-    /5 reputation\
+If statements have the following syntax:
 
-The syntax is an opening comparator, a space-separated list of values, and a
-closing comparator. The following comparators are defined:
-
-| Comparators | Python equivalent |
-|-------------|-------------------|
-| `<...>`     | `<`               |
-| `[...]`     | `<=`              |
-| `{...}`     | `==`              |
-| `/...\`     | `!=`              |
-
-Items in a comparison are compared in order, left to right. For `{...}` and
-`/...\`, order is not significant, but it is for `<...>` and `[...]`. The
-example comparisons listed above translate to the following in Python:
-
-    0.5 < confidence
-    20 <= length
-    5 == reputation
-    5 != reputation
-
-Comparisons are not limited to two values, although having more than three is
-rarely, if ever, useful. `<0.4 confidence 0.6>` translates to `0.4 <
-confidence < 0.6`.
-
-##### Conjunctions, grouping, and order of operations.
+```
+if [condition] {
+    ...
+} else {
+    ...
+}
+```
 
-CedarSentinel's conjunctions (`and`, `or`) and grouping (with parenthesis)
-work the same as Python's. Any difference is a bug that should be reported.
+The `else` block is optional. The `[condition]` is written as a comparison
+between values. The supported operators are `<`, `<=`, `==`, `!=`, `>=`, and
+`>`; they work as would be expected. Multiple comparisons can be combined using
+`and` and `or`; parentheses can be used for grouping, but are not syntactically
+required.
 
 #### Inputs
 
-Inputs are values provided to the script by CedarSentinel. The following are
-available:
+Inputs are values provided to the script by CedarSentinel plugins. Their names
+consist of a dollar sign (`$`), followed by the name of the plugin (without the
+`cs_` prefix), a period (`.`), and the input name. These inputs are provided by
+the default plugins:
 
-| Name         | Meaning                             |
-|--------------|-------------------------------------|
-| `confidence` | Confidence that the message is spam |
-| `reputation` | The message's author's reputation   |
-| `length`     | The length of the message           |
+| Name                 | Meaning                             |
+|----------------------|-------------------------------------|
+| `$gptc.confidence`   | Confidence that the message is spam |
+| `$reputation.value`  | The message's author's reputation   |
+| `$length.length`     | The length of the message           |
 
 #### Actions
 
-Actions are how the script tells CedarSentinel what to do in respone to
-the message. The following are available:
-
-| Name                 | Meaning                                               |
-|----------------------|-------------------------------------------------------|
-| `flag`               | Flag the message as spam                              |
-| `moderate`           | IRC: same as `flag`; Discord: flag and delete message |
-| `log`                | Log the message for manual classification             |
-| `increasereputation` | Increase the author's reputation                      |
-| `decreasereputation` | Decrease the author's reputation                      |
+Actions are how the script tells CedarSentinel what to do in response to the
+message. Their names have the same format as input names, except they begin
+with an at sign (`@`) instead of a dollar sign. Also, CedarSentinel itself,
+without relying on plugins, provides some actions; their names do not contain
+periods.
+
+| Name                   | Meaning                                                        |
+|------------------------|----------------------------------------------------------------|
+| `@flag`                | Flag the message as spam                                       |
+| `@delete`              | IRC: same as `@flag`; Discord: flag and delete message         |
+| `@gptc.log_good`       | Add the message to the model as known good                     |
+| `@gptc.log_prob_good`  | Log the message for manual review as uncertain but likely good |
+| `@gptc.log_unknown`    | Log the message for manual review as unknown quality           |
+| `@gptc.log_prob_spam`  | Log the message for manual review as uncertain but likely spam |
+| `@gptc.log_spam`       | Add the message to the model as known spam                     |
+| `@reputation.increase` | Increase the author's reputation                               |
+| `@reputation.decrease` | Decrease the author's reputation                               |
+
+Please note that `@gptc.log_good` and `@gptc.log_spam` are best avoided.
+CedarSentinel can never be entirely sure that a message is good or bad; these
+actions add messages directly into the model with no human review, so there is
+a significant possibility of messages being inaccurately categorized and never
+detected. This, of course, will make CedarSentinel less effective. This would
+most likely occur accidentally, but a malicious user could potentially
+deliberately insert harmful content into the training database as well. For
+these reasons, it is most likely best to use `@gptc.log_prob_good` and
+`@gptc.log_prob_spam` instead, and review the logged messages manually using
+the training tool.
 
 ### Contributors
 
@@ -188,25 +149,11 @@ Samuel Sloniker (kj7rrv). Feel free to fork it and push your improvements
 and/or bugfixes upstream!
 
 ### License
-MIT License
 
-```
-Copyright 2021 Matthew Petry (fireTwoOneNine) and Samuel Sloniker (kj7rrv)
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"),  to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense,  and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
-```
+Copyright 2021-2024 Matthew Petry (fireTwoOneNine) and Samuel Sloniker (kj7rrv)
+
+Code from commit 9dbe028c8f8fb765e963cd5cd59b8a2b04a30178 and earlier,
+including all code by Matthew Petry, was released under the MIT license, and
+may still be used under its terms. Code from after that commit is released
+under the GNU General Public License, version 3 or later. See `LICENSE` for
+more details.
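
The CedarScript section of the README change above names inputs as `$plugin.input` and actions as `@plugin.action`, and recommends exempting trusted users and short messages from deletion. The Python sketch below illustrates that naming convention and one possible set of if-else rules. It is not CedarSentinel's implementation: the `Inputs` class, the `qualified_name` and `decide` helpers, and every threshold are hypothetical, and it assumes the confidence value lies between 0 and 1.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, chosen only for illustration.
SPAM_CONFIDENCE = 0.9      # at or above this, treat the message as likely spam
UNSURE_CONFIDENCE = 0.6    # between this and SPAM_CONFIDENCE, ask for review
TRUSTED_REPUTATION = 10    # authors at or above this are never auto-deleted
MIN_DELETE_LENGTH = 20     # shorter messages are flagged, not deleted


@dataclass
class Inputs:
    """Values a script would read; the comments give the README's input names."""
    confidence: float  # $gptc.confidence (assumed to be in the range 0..1)
    reputation: int    # $reputation.value
    length: int        # $length.length


def qualified_name(plugin_module: str, name: str, sigil: str) -> str:
    """Build a name like '$gptc.confidence' from 'cs_gptc' and 'confidence'.

    sigil is '$' for inputs and '@' for actions, as the README describes.
    """
    short = plugin_module[3:] if plugin_module.startswith("cs_") else plugin_module
    return f"{sigil}{short}.{name}"


def decide(inputs: Inputs) -> List[str]:
    """Return the action names that a comparable script might trigger."""
    actions: List[str] = []
    if inputs.confidence >= SPAM_CONFIDENCE:
        trusted = inputs.reputation >= TRUSTED_REPUTATION
        too_short = inputs.length < MIN_DELETE_LENGTH
        # Only delete when the author is not trusted and the message is long
        # enough for the classifier to be reasonably reliable.
        actions.append("@flag" if trusted or too_short else "@delete")
        # Log for human review rather than training the model directly.
        actions.append(qualified_name("cs_gptc", "log_prob_spam", "@"))
        actions.append(qualified_name("cs_reputation", "decrease", "@"))
    elif inputs.confidence >= UNSURE_CONFIDENCE:
        actions.append(qualified_name("cs_gptc", "log_unknown", "@"))
    else:
        actions.append(qualified_name("cs_gptc", "log_prob_good", "@"))
        actions.append(qualified_name("cs_reputation", "increase", "@"))
    return actions
```

For example, `decide(Inputs(confidence=0.95, reputation=50, length=100))` starts with `@flag` rather than `@delete`, because the author's reputation is above the hypothetical trust threshold. The sketch uses only the `log_prob_*` and `log_unknown` actions, following the README's advice to avoid `@gptc.log_good` and `@gptc.log_spam`.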