73 changes: 36 additions & 37 deletions docs/src/LM.md
@@ -1,16 +1,16 @@
# Statistical Language Model
# Statistical Language Models

**TextAnalysis** provide following different Language Models
**TextAnalysis** provides the following different language models:

- **MLE** - Base Ngram model.
- **Lidstone** - Base Ngram model with Lidstone smoothing.
- **Laplace** - Base Ngram language model with Laplace smoothing.
- **WittenBellInterpolated** - Interpolated Version of witten-Bell algorithm.
- **KneserNeyInterpolated** - Interpolated version of Kneser -Ney smoothing.
- **MLE** - Base n-gram model using Maximum Likelihood Estimation.
- **Lidstone** - Base n-gram model with Lidstone smoothing.
- **Laplace** - Base n-gram language model with Laplace smoothing.
- **WittenBellInterpolated** - Interpolated version of the Witten-Bell algorithm.
- **KneserNeyInterpolated** - Interpolated version of Kneser-Ney smoothing.

## APIs

To use the API, we first *Instantiate* desired model and then load it with train set
To use the API, first instantiate the desired model and then train it with a training set:

```julia
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
@@ -25,31 +25,31 @@ KneserNeyInterpolated(word::Vector{T}, discount:: Float64=0.1, unk_cutoff=1, unk

(lm::Langmodel)(text, min::Integer, max::Integer)
```
Arguments:
**Arguments:**

* `word` : Array of strings to store vocabulary.
* `word`: Array of strings to store the vocabulary.

* `unk_cutoff`: Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

* `unk_label`: token for unknown labels
* `unk_label`: Token for unknown labels.

* `gamma`: smoothing argument gamma
* `gamma`: Smoothing parameter gamma.

* `discount`: discounting factor for `KneserNeyInterpolated`
* `discount`: Discounting factor for `KneserNeyInterpolated`.

for more information see docstrings of vocabulary
For more information, see the docstrings of the vocabulary functions.

```julia
julia> voc = ["my","name","is","salman","khan","and","he","is","shahrukh","Khan"]

julia> train = ["khan","is","my","good", "friend","and","He","is","my","brother"]
# voc and train are used to train vocabulary and model respectively
# voc and train are used to train the vocabulary and model respectively

julia> model = MLE(voc)
MLE(Vocabulary(Dict("khan"=>1,"name"=>1,"<unk>"=>1,"salman"=>1,"is"=>2,"Khan"=>1,"my"=>1,"he"=>1,"shahrukh"=>1,"and"=>1…), 1, "<unk>", ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan", "<unk>"]))

julia> voc
11-element Array{String,1}:
11-element Vector{String}:
"my"
"name"
"is"
@@ -62,42 +62,41 @@ julia> voc
"Khan"
"<unk>"

# you can see "<unk>" token is added to voc
julia> fit = model(train,2,2) #considering only bigrams
# You can see the "<unk>" token is added to voc
julia> fit = model(train,2,2) # considering only bigrams

julia> unmaskedscore = score(model, fit, "is" ,"<unk>") #score output P(word | context) without replacing context word with "<unk>"
julia> unmaskedscore = score(model, fit, "is" ,"<unk>") # score output P(word | context) without replacing context word with "<unk>"
0.3333333333333333

julia> masked_score = maskedscore(model,fit,"is","alien")
0.3333333333333333
#as expected maskedscore is equivalent to unmaskedscore with context replaced with "<unk>"
# As expected, maskedscore is equivalent to unmaskedscore with context replaced with "<unk>"

```

!!! note

When you call `MLE(voc)` for the first time, It will update your vocabulary set as well.
When you call `MLE(voc)` for the first time, it will update your vocabulary set as well.
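
The other model types are constructed analogously. A minimal sketch follows; the `gamma` and `discount` values are illustrative, and passing `gamma` as the second positional argument is an assumption that mirrors the `KneserNeyInterpolated` signature shown above:

```julia
julia> lid = Lidstone(voc, 0.4)              # Lidstone smoothing with gamma = 0.4

julia> lap = Laplace(voc)                    # Lidstone smoothing with gamma fixed at 1

julia> wb = WittenBellInterpolated(voc)      # interpolated Witten-Bell model

julia> kn = KneserNeyInterpolated(voc, 0.2)  # interpolated Kneser-Ney model, discount = 0.2

julia> kn_fit = kn(train, 2, 2)              # fit bigram counts, as with MLE above
```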

## Evaluation Method
## Evaluation Methods

### `score`

used to evaluate the probability of word given context (*P(word | context)*)
Used to evaluate the probability of a word given its context (*P(word | context)*):

```@docs
score
```

Arguments:
**Arguments:**

1. `m` : Instance of `Langmodel` struct.
2. `temp_lm`: output of function call of instance of `Langmodel`.
3. `word`: string of word
4. `context`: context of given word
1. `m`: Instance of `Langmodel` struct.
2. `temp_lm`: The output of calling the `Langmodel` instance on the training text.
3. `word`: String of the word.
4. `context`: Context of the given word.

- In case of `Lidstone` and `Laplace` it apply smoothing and,

- In Interpolated language model, provide `Kneserney` and `WittenBell` smoothing
- For `Lidstone` and `Laplace` models, smoothing is applied.
- For interpolated language models, `KneserNey` and `WittenBell` smoothing are provided.
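
A minimal sketch of `score` with a smoothed model, continuing the REPL session above. The variable names are illustrative, and passing `gamma` positionally is an assumption (see the note on constructor signatures above); outputs are omitted since the exact probabilities depend on the training data:

```julia
julia> smoothed = Lidstone(voc, 0.4)       # gamma = 0.4, illustrative

julia> smoothed_fit = smoothed(train, 2, 2)  # bigram counts

julia> score(smoothed, smoothed_fit, "my", "is")  # P("my" | "is") with Lidstone smoothing
```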

### `maskedscore`
```@docs
@@ -121,28 +120,28 @@ entropy
perplexity
```
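
A hedged usage sketch for these evaluation helpers, continuing the session above. It assumes `entropy` and `perplexity` accept a vector of space-joined n-gram strings, where the last token of each string is the word and the preceding tokens are its context; check the docstrings for the exact form in your version:

```julia
julia> maskedscore(model, fit, "is", "alien")  # OOV context "alien" is replaced by "<unk>" before scoring

julia> bigrams = ["my is", "is my"]            # hypothetical space-joined bigrams

julia> H = entropy(model, fit, bigrams)        # average negative log2 probability

julia> perplexity(model, fit, bigrams)         # equals 2^H
```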

## Preprocessing

For Preprocessing following functions:
The following functions are available for preprocessing:
```@docs
everygram
padding_ngram
```
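
A brief sketch of both helpers; the keyword names follow the docstrings above and should be treated as assumptions if your version differs:

```julia
julia> seq = ["To", "be", "or", "not"]

julia> everygram(seq, min_len = 1, max_len = -1)  # every n-gram, from unigrams up to the full sequence

julia> example = ["1", "2", "3", "4", "5"]

julia> padding_ngram(example, 2, pad_left = true, pad_right = true)
# expected to yield bigrams padded with sentence markers,
# e.g. ["<s> 1", "1 2", "2 3", "3 4", "4 5", "5 </s>"]
```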

## Vocabulary

Struct to store Language models vocabulary
A struct to store language model vocabulary.

checking membership and filters items by comparing their counts to a cutoff value
It checks membership and filters items by comparing their counts to a cutoff value.

It also Adds a special "unknown" tokens which unseen words are mapped to
It also adds a special "unknown" token which unseen words are mapped to:

```@repl
using TextAnalysis
words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
vocabulary = Vocabulary(words, 2)

# lookup a sequence or words in the vocabulary
# Look up a sequence of words in the vocabulary

word = ["a", "-", "d", "c", "a"]

12 changes: 6 additions & 6 deletions docs/src/classify.md
@@ -1,25 +1,25 @@
# Classifier

Text Analysis currently offers a Naive Bayes Classifier for text classification.
TextAnalysis currently offers a Naive Bayes Classifier for text classification.

To load the Naive Bayes Classifier, use the following command -
To load the Naive Bayes Classifier, use the following command:

using TextAnalysis: NaiveBayesClassifier, fit!, predict

## Basic Usage

Its usage can be done in the following 3 steps.
It can be used in the following 3 steps:

1- Create an instance of the Naive Bayes Classifier model -
1. Create an instance of the Naive Bayes Classifier model:
```@docs
NaiveBayesClassifier
```

2- Fitting the model weights on input -
2. Fit the model weights on training data:
```@docs
fit!
```
3- Predicting for the input case -
3. Make predictions on new data:
```@docs
predict
```
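
Putting the three steps together, a minimal end-to-end sketch (the class labels and example documents are illustrative):

```julia
using TextAnalysis: NaiveBayesClassifier, fit!, predict

# 1. Instantiate the classifier with the candidate classes
m = NaiveBayesClassifier([:legal, :financial])

# 2. Fit the model weights on labelled documents
fit!(m, "this is a financial document", :financial)
fit!(m, "this is a legal document", :legal)

# 3. Predict class probabilities for new text (returns a Dict of class => probability)
predict(m, "this should be predicted as a legal document")
```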