Picture a teacher calling on an unprepared student who is not particularly studious but is highly inventive. He gives fantastically creative but typically wrong answers, entertaining the whole class. The neural networks I developed in this project, LSTMsky (LSTM-based NLP, current branch) and BERTsky and BARTsky (Transformer-based LLMs, see the "transformers" branch), are designed to behave in precisely this way, generating whimsical and fanciful responses. In the world of language models, this tendency to confidently produce convincing but untrue information is called “hallucination.” While this is usually seen as a flaw, here it is precisely what I want: perfect hallucinations, that is, answers that sound convincing and are built from genuine facts yet, when combined, never represent the actual truth. Unless you are deeply familiar with the subject, you would never notice.
Although the neural network is trained on a corpus of Wikipedia articles, it must generate connected, logically coherent, plausible text WITHOUT DIRECTLY QUOTING (!!!) passages from these articles. Consider the generation successful if a layman in the subject would believe the text, and failed if the network outputs nonsense or a verbatim quote from a Wikipedia article! This kind of language model output is known as hallucination; it is unwanted in most applications, but highly desirable in this Storyteller project.
A special parameter, “smoothing” (not to be confused with temperature), regulates the intellectual level of the generated text. A network trained with small “smoothing” values returns “academic” texts, while large values make the output read like a “schoolboy” answer.
While transformer-based NLP models are powerful and I do have them available (see the "transformers" branch), there are compelling reasons to choose LSTM networks in certain scenarios. Transformers generally outperform LSTMs across most tasks, except for one key aspect: their computational cost. Training transformers requires enormous datasets and thousands of GPU-hours, making them prohibitively expensive for many users.
In contrast, LSTM-based NLP models are much more accessible. They can learn language effectively from a relatively small corpus—sometimes as little as 300 Wikipedia articles—and can be trained quickly, even on a CPU. If you're interested in studying the inner workings of NLP, LSTMs are a great starting point: they're fast, cheap, and capable of generating reasonable texts up to 250 tokens in length. However, for longer texts, LSTMs may lose track of the topic and begin to drift.
In summary: start with LSTM if you want an approachable introduction to NLP model training. They’re efficient and practical for small-scale experiments, whereas transformers are best suited for large-scale, resource-intensive projects.
Below are examples of generated text for the specified smoothing values. Note that Smoothing = 0 should correspond to the highly intelligent text and Smoothing = 0.1 to the schoolboy text. The seed text is shown in bold. Training is ongoing, so the current model output may be rough in places; still, it gives you a flavor of the network's capabilities.
Smoothing = 0:

- **In the early twentieth century, it was suggested that** to develop a consistent understanding of the fundamental concepts of mathematics, it was sufficient to study observation. For example, a single electron in an unexcited atom is classically depicted as a particle moving in a circular path around the atomic nucleus...
- **Charles_darwin in his book** The Road to Serfdom (1944), Friedrich Hayek (1899–1992) asserted that the free-market understanding of economic freedom as present in capitalism is a requisite of political freedom. This philosophy is really sent to think that are said to be true of the evil trait that is very possible for it. Although many slaves have escaped or have been freed since 2007, as of 2012, only one slave owner had been sentenced to serve time in prison.
- **The idea of philosophy is** a source of academic discussion.
- **The story begins with** a chapter on the island of Thrinacia, with the crew overriding odysseus's wishes to remain away from the island.
- **Mathematics is one of** is one of the most important forms of philosophical knowledge.

Smoothing = 0.1:

- **In the early twentieth century, it was suggested that** the chinese crossbow was transmitted to the roman world on such occasions, although the greek gastraphetes provides an alternative origin. (My comment: the gastraphetes is an ancient Greek crossbow.)
- **Charles_darwin in his book** The Road to Serfdom (1944), friedrich hayek (1899–1992) asserted that the free-market understanding of economic freedom as present in capitalism is a requisite of political freedom. This philosophy is really not 206 and stated that it is good for the consequences of actions.
- **The idea of philosophy is** a myth.
- **The story begins with** a chapter on the Islands of Weathertop, and is known as Five Years.
- **Mathematics is one of** the most important aspects of the argues of the mathematicians.
Since my computational resources are limited, this storyteller must be trainable on a single old GPU in a reasonable time (days).
This neural network, LSTMsky, is an old-fashioned NLP model with an LSTM layer at its core (see also my transformer branch with the transformer-based BERTsky and BARTsky). It predicts the words that follow the beginning of a sentence (the prompt), which lets it generate text that continues an input seed. The training corpus is split into words by the project's own tokenizer and converted into numerical sequences for learning. The architecture uses embeddings, LSTMs, and feed-forward layers.
This neural network (NN) predicts the words that follow an incomplete sentence. It accepts a phrase and continues it for as long as needed, inserting appropriate punctuation. The purposes of this NN are to:
- Test whether an NN can instantly fool software designed to detect AI-generated text.
- Serve as a simple demonstration of a natural language processing NN.
- Entertain. The NN produces funny stories that make you wonder whether they are real.
The model is trained on a text corpus (generated by `Extract_wiki_text_content.py`), which is tokenized and converted into numerical sequences for learning. The architecture uses embeddings, LSTMs, and feed-forward layers. Note that the NN uses its own tokenizer instead of the nltk package, so that users can inspect its machinery.
Training does not yet have a scheduler that reduces the learning rate, adds the second cost function, and so on. These steps are still done manually, but this will be fixed in the near future.
- Corpus Loading: The dataset created by `Extract_wiki_text_content.py` is loaded using Python's `pickle` module.
- Tokenizer:
  - A custom tokenizer preprocesses text by adding spaces around punctuation and mapping words to unique indices.
  - The `Tokenizer` class includes methods to preprocess text, fit the tokenizer on a corpus, and convert text to sequences of indices.
- `TextDataset`:
  - Converts the tokenized corpus into input-output pairs for training. For each sequence, `n-gram` sequences are created where a portion of the sequence is the input and the subsequent tokens are the target for prediction.
  - The dataset supports multi-word prediction through a `predict_steps` parameter.
- `DataLoader`:
  - Handles batching, shuffling, and padding sequences to ensure that batches can be efficiently processed by the model. A custom `collate_fn` function is used for padding.
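The data pipeline above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: the class `SimpleTokenizer`, the helpers `make_pairs` and `pad_batch`, the reserved `<pad>`/`<unk>` indices, the lowercasing, and the choice of left padding are all hypothetical.

```python
import re


class SimpleTokenizer:
    """Hypothetical stand-in for the project's Tokenizer class."""

    def __init__(self):
        # Assumption: index 0 is reserved for padding, 1 for unknown words.
        self.word_index = {"<pad>": 0, "<unk>": 1}
        self.index_word = {0: "<pad>", 1: "<unk>"}

    @staticmethod
    def preprocess(text):
        # Put spaces around punctuation so each mark becomes its own token.
        spaced = re.sub(r'([.,!?;:()"])', r" \1 ", text.lower())
        return spaced.split()

    def fit(self, corpus):
        # Assign the next free index to every word not seen before.
        for text in corpus:
            for word in self.preprocess(text):
                if word not in self.word_index:
                    index = len(self.word_index)
                    self.word_index[word] = index
                    self.index_word[index] = word

    def text_to_sequence(self, text):
        return [self.word_index.get(w, 1) for w in self.preprocess(text)]


def make_pairs(sequence, predict_steps):
    """Build (growing prefix, next `predict_steps` tokens) training pairs."""
    pairs = []
    for end in range(1, len(sequence) - predict_steps + 1):
        pairs.append((sequence[:end], sequence[end:end + predict_steps]))
    return pairs


def pad_batch(batch, pad_id=0):
    """Left-pad inputs to a common length, as a collate_fn might do."""
    max_len = max(len(inp) for inp, _ in batch)
    inputs = [[pad_id] * (max_len - len(inp)) + inp for inp, _ in batch]
    targets = [tgt for _, tgt in batch]
    return inputs, targets


tokenizer = SimpleTokenizer()
sentence = "The story begins, and the story ends."
tokenizer.fit([sentence])
seq = tokenizer.text_to_sequence(sentence)
pairs = make_pairs(seq, predict_steps=2)
inputs, targets = pad_batch(pairs)
```

In a PyTorch setup, a function like `pad_batch` would additionally convert the padded lists to tensors before being passed to `DataLoader` as `collate_fn`.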
The NextWordPredictor model is designed to handle multi-word predictions and consists of the following components:
- Embedding Layer:
  - Converts input tokens into dense vector representations of size `embed_size`.
- LSTM:
  - A two-layer LSTM processes the input embeddings, capturing temporal dependencies in the sequence.
- Layer Normalization:
  - Normalizes the output of the LSTM's final hidden state for improved stability.
- Feed-Forward Layers:
  - A series of fully connected layers (optionally with BatchNorm) process the hidden state to generate predictions.
- Final Linear Layer:
  - Outputs a tensor of shape `(batch_size, predict_steps, vocab_size)` containing predictions for multiple words.
- Custom Weight Initialization:
  - Xavier initialization is used for weights, and biases are initialized to zero for better convergence.
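The component list above could map to PyTorch roughly as follows. This is a minimal sketch, not the project's actual `NextWordPredictor`: the class name `NextWordPredictorSketch`, the layer sizes, the single feed-forward layer, the use of `padding_idx=0`, and applying Xavier initialization to the linear layers only are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class NextWordPredictorSketch(nn.Module):
    """Hypothetical sketch of an embedding -> LSTM -> feed-forward predictor."""

    def __init__(self, vocab_size, embed_size=64, hidden_size=128,
                 ff_size=128, predict_steps=2):
        super().__init__()
        self.predict_steps = predict_steps
        self.vocab_size = vocab_size
        # Assumption: token index 0 is the padding token.
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.lstm = nn.LSTM(embed_size, hidden_size,
                            num_layers=2, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.ff = nn.Sequential(nn.Linear(hidden_size, ff_size), nn.ReLU())
        # One set of vocabulary logits per predicted step.
        self.out = nn.Linear(ff_size, predict_steps * vocab_size)
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        # Xavier weights and zero biases, here for the linear layers only.
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            nn.init.zeros_(module.bias)

    def forward(self, tokens):
        embedded = self.embedding(tokens)       # (batch, seq_len, embed_size)
        _, (hidden, _) = self.lstm(embedded)    # (num_layers, batch, hidden)
        last = self.norm(hidden[-1])            # final layer's hidden state
        logits = self.out(self.ff(last))
        return logits.view(-1, self.predict_steps, self.vocab_size)


model = NextWordPredictorSketch(vocab_size=50)
logits = model(torch.randint(1, 50, (4, 10)))   # batch of 4, 10 tokens each
```

The final `view` call is what turns one flat output vector into the `(batch_size, predict_steps, vocab_size)` shape described above.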
Modern natural language processing models increasingly rely on attention mechanisms, or combine attention layers with LSTM architectures, rather than using pure LSTM. Attention-based models generally achieve superior performance on long and complex texts, but they often require more computational resources and longer training times.
In contrast, LSTM networks are faster to train and more resource-efficient. For short text generation tasks (up to 500 words), pure LSTM architectures can actually outperform their attention-based counterparts, offering a practical and effective solution.
If you’re interested in a hands-on comparison between attention-based and LSTM-based models, check out the transformer branch of the Storyteller project, which features models built with attention mechanisms.
The model uses two loss functions. The first is a custom loss function, `multi_word_loss`, which computes the average cross-entropy across the predicted steps. This is quite conventional for natural language processing neural networks.
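To make "average cross-entropy across the predicted steps" concrete, here is a small worked example in plain Python. The function names `cross_entropy` and `multi_word_loss_sketch` and the toy logits are hypothetical; the project's own loss presumably operates on batched tensors rather than lists.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against one target index."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def multi_word_loss_sketch(step_logits, step_targets):
    """Average the per-step cross-entropy over all predicted steps."""
    losses = [cross_entropy(logits, target)
              for logits, target in zip(step_logits, step_targets)]
    return sum(losses) / len(losses)


# Two predicted steps over a three-word vocabulary.
step_logits = [[2.0, 0.0, 0.0],   # step 1: fairly confident in word 0
               [0.0, 0.0, 0.0]]   # step 2: no preference at all
step_targets = [0, 1]
loss = multi_word_loss_sketch(step_logits, step_targets)
```

The confident first step contributes a small loss, the undecided second step contributes log(3), and the result is their mean.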
The second loss, `LabelSmoothingLoss`, is also a cross-entropy loss, but with a smoothing parameter that damps the target word's probability and raises the probability of the other words from the corpus. It helps avoid overconfidence; in other words, this cost mimics your hesitation about the correct answer to a question. Training should switch to the second cost once it reaches a plateau after successive learning-rate reductions, which helps training continue further. The switch is done manually so far and will be automated in the future.
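The sketch below shows the textbook form of label smoothing in plain Python. It is an assumption that the project's `LabelSmoothingLoss` follows this formulation; the function names and the choice to spread the smoothing mass evenly over the non-target words are hypothetical.

```python
import math


def log_softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def label_smoothing_loss_sketch(logits, target, smoothing):
    """Cross-entropy against a softened target distribution."""
    vocab_size = len(logits)
    log_probs = log_softmax(logits)
    # The target keeps 1 - smoothing; the rest is shared by the other words.
    off_value = smoothing / (vocab_size - 1)
    loss = 0.0
    for index, log_p in enumerate(log_probs):
        weight = 1.0 - smoothing if index == target else off_value
        loss -= weight * log_p
    return loss


logits = [4.0, 1.0, 0.0]  # the model is confident in word 0
plain = label_smoothing_loss_sketch(logits, target=0, smoothing=0.0)
smoothed = label_smoothing_loss_sketch(logits, target=0, smoothing=0.1)
```

With `smoothing=0.0` the function reduces to ordinary cross-entropy. With `smoothing=0.1` the same confident prediction is penalized more, which is the "hesitation" effect described above.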
Good cost values: the model starts producing meaningful text when `multi_word_loss` returns values below 0.35.
- The model generates text by recursively predicting the next tokens for a given seed text.
- Predictions are translated back to words using the tokenizer's `index_word` dictionary.
- Since the neural network serves to amuse the user, the user is the one who judges the output: the model can be considered well trained as soon as the user finds most of the answers amazing and logically structured.
- The user must put the seed text in `seeders.py` for inference.
- The training loop uses the `Adam` optimizer at the beginning of training and then alternates between `Adam` and `AdamW` once it reaches a plateau.
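One possible way to automate the plateau check that triggers the optimizer switch is sketched below. The helper names `reached_plateau` and `next_optimizer`, the `patience` and `min_delta` values, and the loss history are all hypothetical; the project currently makes this decision manually.

```python
def reached_plateau(losses, patience=3, min_delta=1e-3):
    """True if the last `patience` epochs did not beat the earlier best loss
    by at least `min_delta`."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < min_delta


def next_optimizer(current):
    """Alternate between the two optimizer names."""
    return "AdamW" if current == "Adam" else "Adam"


# Made-up loss history: fast progress first, then stagnation.
history = [1.20, 0.90, 0.70, 0.6995, 0.6998, 0.6996]
optimizer_name = "Adam"
if reached_plateau(history):
    optimizer_name = next_optimizer(optimizer_name)
```

In a real training loop the returned name would select between `torch.optim.Adam` and `torch.optim.AdamW` when rebuilding the optimizer. For learning-rate reduction alone, PyTorch also provides `torch.optim.lr_scheduler.ReduceLROnPlateau`.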
Copy the zip file with the code from this repository and unzip it in your home folder, or run the following in your terminal:

    git clone https://github.com/Vlasenko2006/Storyteller.git

Once you have the code, you need to install the required packages:
- Python 3.8+
- PyTorch
- tqdm
- scikit-learn
- numpy
- Anaconda (recommended for managing the environment)
I recommend installing the dependencies with Anaconda using a .yaml file (see below).
Find the environment.yml in your folder. If you don't have it, copy and save the code below as environment.yml, and run conda env create -f environment.yml to create the environment.
    name: story_gen
    channels:
      - defaults
      - conda-forge
    dependencies:
      - python=3.8
      - pytorch=1.10
      - torchvision
      - torchaudio
      - pytorch-cuda=11.3
      - tqdm
      - scikit-learn
      - numpy
      - pyyaml
      - pip
      - pip:
          - wikipedia-api

Once you have created your environment, activate it by running the command below in your terminal:

    conda activate story_gen

Once the environment is activated, you can run the script:

    python story_telling_nn.py

Run the provided script to:
- Load the dataset.
- Train the model on the dataset.
- Periodically save checkpoints and generate text predictions.
After training, you can generate new stories by providing a seed text to the predict_sequence function.
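The recursive generation idea can be illustrated without a trained network. In this sketch, `generate`, `toy_predict_next`, and the lookup table are hypothetical stand-ins; the real `predict_sequence` function works with token indices and the tokenizer's `index_word` dictionary, and its signature may differ.

```python
def generate(seed_tokens, predict_next, num_rounds):
    """Extend the seed by repeatedly appending the predicted tokens."""
    tokens = list(seed_tokens)
    for _ in range(num_rounds):
        # Each round feeds everything generated so far back into the predictor.
        tokens.extend(predict_next(tokens))
    return tokens


# Toy predictor: looks only at the last word and returns two "next" words,
# mimicking a model trained with predict_steps = 2.
toy_table = {
    "story": ["begins", "with"],
    "with": ["a", "chapter"],
    "chapter": ["on", "the"],
}


def toy_predict_next(tokens):
    return toy_table.get(tokens[-1], ["."])


text = " ".join(generate(["the", "story"], toy_predict_next, num_rounds=3))
```

Because each round conditions on its own earlier output, small errors can accumulate, which is one reason long generations may drift away from the seed topic.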
- Handles multi-word predictions.
- Customizable architecture:
  - Embedding size, LSTM size, feed-forward layers, and more can be adjusted.
- Flexible tokenizer with preprocessed text.
- Trains efficiently using `DataLoader` with padding support.