58 commits
ff573af
Move reputation code to separate file
kj7rrv Jul 11, 2022
477b2c6
Update copyright year in README
kj7rrv Jul 11, 2022
117461e
Move GPTC code to plugin
kj7rrv Jul 11, 2022
8795524
Plugins
kj7rrv Jul 12, 2022
b694f0f
Remove debug mode
kj7rrv Jul 12, 2022
57bbc2b
Format code with black
kj7rrv Jul 12, 2022
9454f4a
Integrate `moderation` into core
kj7rrv Jul 12, 2022
d7f9679
Pass config to plugins in a more logical way
kj7rrv Jul 12, 2022
8390512
Document new CedarScript commands and existence of plugins
kj7rrv Jul 12, 2022
5be987a
Remove unused code
kj7rrv Jul 12, 2022
1c605d3
Clean up code
kj7rrv Jul 12, 2022
2bd3ea1
Simplify plugin API
kj7rrv Jul 12, 2022
260f163
Log confidence
kj7rrv Jul 12, 2022
3b2d551
New plugin API module
kj7rrv Jul 12, 2022
ae0e655
Move OCR importer down
kj7rrv Jul 12, 2022
c9ed2e3
Move message generation out of platform code
kj7rrv Jul 12, 2022
058fd24
Discord fixes
kj7rrv Jul 12, 2022
78d8360
Add blank line to CLI output
kj7rrv Jul 12, 2022
52fa95a
remove spamNotifyPing
kj7rrv Jul 12, 2022
a957d5e
Move chat code to separate files
kj7rrv Jul 12, 2022
06d4696
Improve platform importing
kj7rrv Jul 12, 2022
a04e6dc
Improve config handling
kj7rrv Jul 12, 2022
ce6bb62
Support GPTC ngrams
kj7rrv Jul 13, 2022
11cb311
Format code with black
kj7rrv Jul 13, 2022
f70cb7d
Move plugin data to subdirs
kj7rrv Jul 13, 2022
b88d0a5
Update README
kj7rrv Jul 13, 2022
0d309ec
Add data dirs to gitignore
kj7rrv Jul 13, 2022
00e5194
oops
kj7rrv Jul 13, 2022
2952eee
GPTC DB
kj7rrv Jul 14, 2022
a7d7a95
Add Web trainer
kj7rrv Jul 14, 2022
aaeab5f
Minor fix for lynx compatibility; model updates
kj7rrv Jul 14, 2022
e547848
Format code with black; update db
kj7rrv Jul 14, 2022
e7bec00
Remove old test file; update model
kj7rrv Jul 17, 2022
918a10c
Trainer UI improvements; model changes
kj7rrv Jul 17, 2022
f0accc1
Download GPTC from PyPI
kj7rrv Jul 23, 2022
e6cdbbe
New CedarScript syntax
kj7rrv Jul 30, 2022
c5e50fa
Add more required libraries
kj7rrv Jul 30, 2022
6324c7f
Format code
kj7rrv Jul 30, 2022
d12c904
Add profanity filter
kj7rrv Mar 14, 2023
076f0c6
Fix for discord.py updates
kj7rrv Mar 14, 2023
d746c8d
Remove profanity module
kj7rrv Mar 9, 2024
f030c84
Use GPTC in script, not profanity module
kj7rrv Mar 9, 2024
ce0d3a2
Update copyright year
kj7rrv Mar 9, 2024
7ad64df
Remove profanity module from config
kj7rrv Mar 9, 2024
2f3f8e3
New CedarScript
kj7rrv Mar 9, 2024
91c88a9
Updated database
kj7rrv Mar 9, 2024
a596ced
Add list of plugins to README
kj7rrv Mar 9, 2024
2987cf9
Remove OCR functionality
kj7rrv Mar 9, 2024
5679fc2
Only listen on localhost to avoid security issues
kj7rrv Mar 9, 2024
2fe70bf
Rewrite README
kj7rrv Mar 9, 2024
76e29d3
Improve trainer UI
kj7rrv Mar 10, 2024
b6fa0ea
Improved script
kj7rrv Mar 10, 2024
9dbe028
New database
kj7rrv Mar 10, 2024
4644168
Switch license to GNU GPL
kj7rrv Mar 11, 2024
2dd35cd
Add first version of bridge module
kj7rrv Mar 11, 2024
a5d5ec3
Keep bridge and delete "nothing" platform
kj7rrv Mar 11, 2024
6f23b70
Add ability to run only GPTC server
kj7rrv Mar 17, 2024
b0df437
Manually add training data
kj7rrv Mar 17, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,5 +1,7 @@
config.yaml
knownUsers.json
spamLog.json
gptc_data
reputation_data
__pycache__
spamImages
674 changes: 674 additions & 0 deletions GPL-3.0

Large diffs are not rendered by default.

40 changes: 31 additions & 9 deletions LICENSE
@@ -1,12 +1,34 @@
Copyright 2021 Matthew Petry (fireTwoOneNine), Samuel Sloniker (kj7rrv)
Copyright 2021-2024 Matthew Petry (fireTwoOneNine), Samuel Sloniker (kj7rrv)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
OR OTHER DEALINGS IN THE SOFTWARE.
Code from commit 9dbe028c8f8fb765e963cd5cd59b8a2b04a30178 and earlier,
including all code by Matthew Petry, was released under the MIT license, and
may still be used under its terms:

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
257 changes: 102 additions & 155 deletions README.md
@@ -1,4 +1,4 @@
![Cedar Sentinel](/readme_files/logo_sm.png)
![CedarSentinel](/readme_files/logo_sm.png)
## A Discord/IRC bot for automated spam detection

### About
@@ -7,33 +7,38 @@
automatically detect spam messages *(or any sort of message you don't like!)*.
It can alert your moderators instantly, allowing them to take action faster.
It also has support for
[**Matterbridge**](https://github.com/42wim/matterbridge/)-type chat bridges,
[Matterbridge](https://github.com/42wim/matterbridge/)-type chat bridges,
allowing it to also handle messages from many more chat platforms!

CedarSentinel can be extended using plugins. It comes with three plugins by
default. `cs_gptc` uses [GPTC](https://git.kj7rrv.com/kj7rrv/gptc) to analyze
the content of messages. `cs_reputation` tracks user reputation; this can be
incremented or decremented automatically. Finally, `cs_length` simply indicates
the number of characters in the message.

Unless you are writing your own plugin, you will almost certainly want
to use the `cs_gptc` plugin, as it does most of the work in actually
determining whether or not a message is spam. It is also a good idea to use
`cs_reputation` to prevent the bot from deleting messages from trusted users.
GPTC does occasionally flag messages incorrectly, so exempting established
users from having messages automatically deleted helps to reduce false
positives. `cs_length` may be helpful if you find that CedarSentinel often
incorrectly flags short messages as spam; GPTC seems to be less accurate at
classifying very short messages. You can configure CedarSentinel to not delete
messages under a certain length.
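
The way these three plugins can complement each other is sketched below in Python. This is only an illustration, not CedarSentinel's own code: the `MessageInfo` class, the `choose_response` function, and all three thresholds are hypothetical, and in a real deployment this logic would be expressed in `script.txt` rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class MessageInfo:
    confidence: float  # spam confidence, as reported by cs_gptc
    reputation: int    # author's reputation, as tracked by cs_reputation
    length: int        # message length in characters, as reported by cs_length


# Hypothetical thresholds, chosen only for illustration.
SPAM_CONFIDENCE = 0.9
TRUSTED_REPUTATION = 10
MIN_LENGTH = 20


def choose_response(msg: MessageInfo) -> str:
    """Return "delete", "flag", or "ignore" for a message."""
    if msg.confidence < SPAM_CONFIDENCE:
        return "ignore"
    # Trusted authors and very short messages are flagged for a moderator
    # instead of being deleted, to limit the cost of a false positive.
    if msg.reputation >= TRUSTED_REPUTATION or msg.length < MIN_LENGTH:
        return "flag"
    return "delete"
```

With these assumed thresholds, a long, high-confidence message from a new user is deleted, while the same message from an established user is only flagged.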

### How to install

```bash
git clone https://github.com/fire219/CedarSentinel.git
cd CedarSentinel
pip install pyyaml discord.py irc git+https://git.kj7rrv.com/kj7rrv/gptc@master
cp exampleconfig_<irc|discord>.yaml config.yaml
pip install pyyaml discord.py irc gptc tornado lark
cp example_config.yaml config.yaml
# edit the config file with your editor of choice!
```

After doing this, Cedar Sentinel can be executed in the same way as any other
Python script.

#### Configuring for use with OCR

CedarSentinel can be optionally configured to do [Optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (also known as OCR) on all images. This system requires many additional Python modules to be installed:

```bash
apt install tesseract-ocr # or the equivalent command in your OS of choice
pip install opencv-python pytesseract numpy # this command is unnecessary if you don't want image OCR functionality
# change the value of "ocrEnable" in your config.yaml file to "true" to enable OCR
```

If any of the above modules are missing, CedarSentinel will alert you at startup, but otherwise will work normally (albeit with OCR disabled).
After doing this, CedarSentinel can be executed in the same way as any other
Python script; the file to run is `bot.py`.

### Using CedarSentinel

@@ -50,136 +55,92 @@ that *aren't* spam, check out the ***How to train models*** section below.
### How to train models

As previously mentioned, CedarSentinel comes with a model trained for the
Pine64 Chat Network. If this model just isn't working for you, then it is
time to check out the **modelbuilder.py** tool included with CedarSentinel.

The Model Builder is designed to intake spam logs that CedarSentinel itself
creates *(assuming you have it enabled in the config!)* and convert them into
GPTC models that CedarSentinel makes its decisions based on.

As a starting point, **empty_model.json** is included for use in training from
scratch. If you wish to do so, configure CedarSentinel to use this model. If
you want to build on the existing data, leave it with the default model. Now,
let it sit running in your chat for a while. Assuming you haven't dramatically
changed `script.txt` (see "CedarScript" below), it should start logging
messages in your server. *(If you used the empty_model, it will likely log all
messages!)*. After an indeterminate amount of time (up to you, but more
messages are better!), it's time to use the Model Builder.

First, run it as `modelbuilder.py import`. This will import the messages from
your "spam" log into its workspace. It should then show you a message and ask
you if it is Good, Spam, or Unknown. Answer accordingly, and then hit enter.
It will then give you the next message, and so on. Take your time, and make
sure you label messages correctly. CedarSentinel relies on this data to make
its decisions. If you need to stop at any time, hit Ctrl+C, and it will save
your progress and exit. Remember to delete `spam_log.json` after
`modelbuilder.py import`! When it's time to start labeling the messages again,
simply run `modelbuilder.py` without arguments, and it will start back where
you left off.

Once you have labeled all messages in your log, you now need to compile the
model. You can do this by simply running `modelbuilder.py compile`. At this
point, your new model (at **compiled_model.json**) is ready for use in
CedarSentinel! Go ahead and delete your existing spam log, set CedarSentinel
to use this model in the config, and restart it! If you've followed these
instructions properly, CedarSentinel should now be trained to work on your
server.

As time goes on, you may refine the model by continuing to use the logs
CedarSentinel creates. You can follow these instructions again from
`modelbuilder.py import`, and as long as you have not deleted the workspace
file (**model_work.json**), it will build on your existing model. ***It will
take time to get CedarSentinel fully acclimated with your server, so don't be
alarmed if the first few iterations aren't very effective!*** As the model
improves, so will the detection rate.

If you want to export your model in raw format for use with GPTC outside
CedarSentinel, run `modelbuilder.py export`. The raw model will be in
`raw_model_export.py`.
Pine64 Chat Network. If this model just isn't working for you, then it is time
to check out the GPTC module's training tool. This is accessed through a Web
interface, available at [`http://localhost:8888`](http://localhost:8888) when
CedarSentinel is running. Documentation is included on the page.

For security reasons, this tool can only be accessed from `localhost`. If you
want to use it over a network, then you can either use a reverse proxy or
access it through a command-line browser over SSH; it is designed to work well
in [Lynx](https://lynx.invisible-island.net/current/) as well as graphical
browsers.

In most cases, the default model should be a good starting point, and you can
build on it instead of starting over. However, if you find that CedarSentinel
is not working well even after retraining with several days of new data, it may
be best to delete the model and start from scratch. To do this, simply stop
CedarSentinel, delete `gptc.db`, and restart the bot. This will delete the
entire model and all logged messages, allowing for a fresh start.

### CedarScript

CedarSentinel's responses to messages are defined by `script.txt`. Note that a
syntax error in `script.txt` will likely result in an error message that
appears to be caused by a bug in CedarSentinel. (If you do get an error,
please submit an issue anyway, just in case it is a CedarSentinel bug. If you
have changed `script.txt`, please include your modified version.)

#### `if ... end`

if ...
...
end

#### `if ... else ... end`

if ...
...
else
...
end

#### Comparisons
CedarSentinel's responses to messages are defined by `script.txt`. This file is
written in CedarScript, a domain-specific language (DSL) designed for this
purpose. Please note that CedarScript is not Turing-complete; it is designed to
express simple if-else logic, not advanced algorithms.

CedarScript's comparison syntax is different from that of most programming
languages. The following are some simple comparisons:
#### Conditions

<0.5 confidence>
[20 length]
{5 reputation}
/5 reputation\
If statements have the following syntax:

The syntax is an opening comparator, a space-separated list of values, and a
closing comparator. The following comparators are defined:

| Comparators | Python equivalent |
|-------------|-------------------|
| `<...>` | `<` |
| `[...]` | `<=` |
| `{...}` | `==` |
| `/...\` | `!=` |

Items in a comparison are compared in order, left to right. For `{...}` and
`/...\`, order is not significant, but it is for `<...>` and `[...]`. The
example comparisons listed above translate to the following in Python:

0.5 < confidence
20 <= length
5 == reputation
5 != reputation

Comparisons are not limited to two values, although having more than three is
rarely, if ever, useful. `<0.4 confidence 0.6>` translates to `0.4 <
confidence < 0.6`.

##### Conjunctions, grouping, and order of operations.
```
if [condition] {
...
} else {
...
}
```

CedarSentinel's conjunctions (`and`, `or`) and grouping (with parenthesis)
work the same as Python's. Any difference is a bug that should be reported.
The `else` block is optional. The `[condition]` is written as a comparison
between values. The supported operators are `<`, `<=`, `==`, `!=`, `>=`, and
`>`; they work as would be expected. Multiple comparisons can be combined using
`and` and `or`; parentheses can be used for grouping, but are not syntactically
required.
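
To make the comparison semantics concrete, here is a rough Python equivalent of evaluating one condition. It is a sketch under stated assumptions, not the parser CedarSentinel actually uses: the `compare` helper, the sample input values, and the exact spelling of the CedarScript condition in the comment are all hypothetical.

```python
import operator

# The six comparison operators listed above, mapped to Python equivalents.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    ">": operator.gt,
}


def compare(left, op, right):
    """Apply one comparison operator to two values."""
    return OPERATORS[op](left, right)


# Hypothetical input values for a single message.
inputs = {
    "$gptc.confidence": 0.95,
    "$length.length": 12,
    "$reputation.value": 3,
}

# Assumed CedarScript condition being modeled:
#   $gptc.confidence > 0.9 and ($length.length >= 20 or $reputation.value < 5)
result = compare(inputs["$gptc.confidence"], ">", 0.9) and (
    compare(inputs["$length.length"], ">=", 20)
    or compare(inputs["$reputation.value"], "<", 5)
)
```

The explicit parentheses in the modeled condition avoid relying on any particular precedence between `and` and `or`.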

#### Inputs

Inputs are values provided to the script by CedarSentinel. The following are
available:
Inputs are values provided to the script by CedarSentinel plugins. Their names
consist of a dollar sign (`$`), followed by the name of the plugin (without the
`cs_` prefix), a period (`.`), and the input name. These inputs are provided by
the default plugins:

| Name | Meaning |
|--------------|-------------------------------------|
| `confidence` | Confidence that the message is spam |
| `reputation` | The message's author's reputation |
| `length` | The length of the message |
| Name | Meaning |
|----------------------|-------------------------------------|
| `$gptc.confidence` | Confidence that the message is spam |
| `$reputation.value` | The message's author's reputation |
| `$length.length` | The length of the message |
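
The naming rule described above can be sketched in Python as follows. The `parse_input_name` function is hypothetical and exists only to illustrate the `$plugin.input` format; it is not part of CedarSentinel's API.

```python
def parse_input_name(name: str) -> tuple[str, str]:
    """Split an input name such as "$gptc.confidence" into
    (plugin module name, input name)."""
    if not name.startswith("$"):
        raise ValueError(f"input names start with '$': {name!r}")
    plugin, sep, input_name = name[1:].partition(".")
    if not sep or not plugin or not input_name:
        raise ValueError(f"expected '$plugin.input': {name!r}")
    # The plugin is written without its "cs_" prefix, so restore it here.
    return "cs_" + plugin, input_name
```

For example, `parse_input_name("$gptc.confidence")` yields the plugin name `cs_gptc` and the input name `confidence`.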

#### Actions

Actions are how the script tells CedarSentinel what to do in response to
the message. The following are available:

| Name | Meaning |
|----------------------|-------------------------------------------------------|
| `flag` | Flag the message as spam |
| `moderate` | IRC: same as `flag`; Discord: flag and delete message |
| `log` | Log the message for manual classification |
| `increasereputation` | Increase the author's reputation |
| `decreasereputation` | Decrease the author's reputation |
Actions are how the script tells CedarSentinel what to do in response to the
message. Their names have the same format as input names, except they begin
with an at sign (`@`) instead of a dollar sign. Also, CedarSentinel itself,
without relying on plugins, provides some actions; their names do not contain
periods.

| Name | Meaning |
|------------------------|----------------------------------------------------------------|
| `@flag` | Flag the message as spam |
| `@delete` | IRC: same as `@flag`; Discord: flag and delete message |
| `@gptc.log_good` | Add the message to the model as known good |
| `@gptc.log_prob_good` | Log the message for manual review as uncertain but likely good |
| `@gptc.log_unknown` | Log the message for manual review as unknown quality |
| `@gptc.log_prob_spam` | Log the message for manual review as uncertain but likely spam |
| `@gptc.log_spam` | Add the message to the model as known spam |
| `@reputation.increase` | Increase the author's reputation |
| `@reputation.decrease` | Decrease the author's reputation |

Please note that `@gptc.log_good` and `@gptc.log_spam` are best avoided.
CedarSentinel can never be entirely sure that a message is good or bad; these
actions add messages directly into the model with no human review, so there is
a significant possibility of messages being inaccurately categorized and never
detected. This, of course, will make CedarSentinel less effective. This would
most likely occur accidentally, but a malicious user could potentially
deliberately insert harmful content into the training database as well. For
these reasons, it is most likely best to use `@gptc.log_prob_good` and
`@gptc.log_prob_spam` instead, and review the logged messages manually using
the training tool.
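
As a rough illustration of the action naming rule and of the advice above, the following Python sketch separates core actions from plugin actions and picks out the two actions that bypass human review. Both functions are hypothetical helpers written for this example; CedarSentinel does not necessarily expose anything like them.

```python
from typing import Optional

# Actions that add messages to the model with no human review.
UNREVIEWED_ACTIONS = {"@gptc.log_good", "@gptc.log_spam"}


def parse_action_name(name: str) -> tuple[Optional[str], str]:
    """Split an action name into (plugin module name, action name).

    Core actions such as "@flag" contain no period, so their plugin
    is reported as None.
    """
    if not name.startswith("@"):
        raise ValueError(f"action names start with '@': {name!r}")
    body = name[1:]
    if "." not in body:
        return None, body
    plugin, _, action = body.partition(".")
    return "cs_" + plugin, action


def unreviewed_actions(actions: list[str]) -> list[str]:
    """Return the actions in a list that skip manual review."""
    return [a for a in actions if a in UNREVIEWED_ACTIONS]
```

A check like `unreviewed_actions` could be run over the action names used in a script as a reminder to prefer the `log_prob_*` variants.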

### Contributors

@@ -188,25 +149,11 @@ Samuel Sloniker (kj7rrv). Feel free to fork it and push your improvements
and/or bugfixes upstream!

### License
MIT License

```
Copyright 2021 Matthew Petry (fireTwoOneNine) and Samuel Sloniker (kj7rrv)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
Copyright 2021-2024 Matthew Petry (fireTwoOneNine) and Samuel Sloniker (kj7rrv)

Code from commit 9dbe028c8f8fb765e963cd5cd59b8a2b04a30178 and earlier,
including all code by Matthew Petry, was released under the MIT license, and
may still be used under its terms. Code from after that commit is released
under the GNU General Public License, version 3 or later. See `LICENSE` for
more details.