This repo contains an approach I implemented for the Disaster Tweets competition on Kaggle.
Each sample in the train and test sets has the following information:
- The text of a tweet
- A keyword from that tweet (although this may be blank!)
- The location the tweet was sent from (may also be blank)
The data comes in three files:
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format
Each file contains the following columns (a minimal loading snippet follows the list):
- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
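
A minimal sketch of loading and inspecting the data with pandas; the file paths and dataframe names are assumptions (adjust them to wherever the CSVs live):

```python
import pandas as pd

# Load the competition files (paths are an assumption; adjust as needed)
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(train_df.columns.tolist())          # expected: id, keyword, location, text, target
print(train_df["target"].value_counts())  # class balance: disaster (1) vs. not (0)
```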
The implementation uses the following Python libraries:
- pandas
- NumPy
- re (regular expressions)
- NLTK
- Matplotlib
- TensorFlow
- GloVe (pre-trained word embeddings)
Text preprocessing covers the following steps (a sketch of the pipeline follows the list):
- cleaning the text data in both the train and test sets
- tokenizing the text
- removing stopwords (based on nltk.corpus.stopwords)
- lemmatizing the text entries
- joining the token lists back into single strings
- importing the GloVe embeddings (used to build the embedding matrix, sketched further below)
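
A minimal sketch of this cleaning pipeline; the exact regular expressions and the `clean_text` helper name are illustrative assumptions, not necessarily what the notebook uses:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("punkt_tab")  # needed by newer NLTK versions
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Clean one tweet: lowercase, strip URLs and punctuation, tokenize,
    remove stopwords, lemmatize, and join the tokens back into one string."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)               # keep letters and spaces only
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)                             # token list -> joined string

# Applied to the text column of both dataframes (train_df / test_df from the loading sketch above)
train_df["text"] = train_df["text"].apply(clean_text)
test_df["text"] = test_df["text"].apply(clean_text)
```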
Next, I use a TextVectorization layer and build an embedding matrix. The embedding matrix is an N-by-300 matrix, where N is the number of distinct words in the vocabulary built from the training text. If a word is not found in embeddings_index, its row in the matrix is left as all zeros.
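
The corresponding code might look roughly like this; it is a sketch that assumes the 300-dimensional `glove.6B.300d.txt` file, and the sequence length and variable names are assumptions:

```python
import numpy as np
import tensorflow as tf

# Build the vocabulary from the cleaned training text
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=50)
vectorizer.adapt(train_df["text"].tolist())
vocab = vectorizer.get_vocabulary()

# Load GloVe vectors into a word -> vector lookup (embeddings_index)
embeddings_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        embeddings_index[word] = np.array(values, dtype="float32")

# One row per vocabulary entry; words missing from GloVe keep an all-zero row
embedding_dim = 300
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for i, word in enumerate(vocab):
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# The matrix can then initialise a frozen Keras Embedding layer
embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(vocab),
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
```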