How to: SPAM detection with TensorFlow

TensorFlow is an open-source machine learning library with a specific focus on deep learning and neural networks. There are endless uses of TensorFlow, but let’s see how we can use it for SPAM detection by utilizing its natural language processing capabilities.

I’ll be using a dataset containing 5,574 text messages, out of which 747 messages are categorized as SPAM, and 4,827 messages are categorized as HAM. The dataset is downloaded from here.

Loading Data

Let’s start by loading the data into a Pandas dataframe.

import pandas as pd

df_sms = pd.read_csv('data/SMSSpamCollection.csv', sep = '\t', header = None)

Setting column name

df_sms.columns=['category', 'message']
df_sms.head()

Let’s preview our data

Visualization

Let’s see what we have in our dataframe

df_sms.category.value_counts()

Let’s plot those numbers

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.countplot(x='category', data=df_sms)
plt.show()
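The plot makes the imbalance visible, and it helps to put a number on it. Here is a small sketch using only the counts quoted in the introduction (747 SPAM, 4,827 HAM). The "always predict HAM" baseline is a hypothetical reference point, not something the rest of this post trains.

```python
# Message counts quoted in the introduction of this post.
n_spam = 747
n_ham = 4827
n_total = n_spam + n_ham

spam_share = n_spam / n_total
ham_share = n_ham / n_total

# A hypothetical classifier that labels every message as HAM is right on
# all HAM messages and wrong on all SPAM messages, so its accuracy equals
# the HAM share of the dataset.
baseline_accuracy = ham_share

print(f"SPAM share: {spam_share:.1%}")
print(f"Baseline accuracy (always HAM): {baseline_accuracy:.1%}")
```

Roughly 13.4% of the messages are SPAM, so any accuracy figure needs to be read against a baseline of about 86.6%.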

Let’s see a word cloud for both HAM and SPAM messages. This will give us an idea of which words carry more weight in each category.

import wordcloud  # third-party "wordcloud" package, needed by generate_wordcloud below

df_spam = df_sms[df_sms.category == 'spam'].copy()
df_ham = df_sms[df_sms.category == 'ham'].copy()

def generate_wordcloud(data_frame, category):
    text = ' '.join(data_frame['message'].astype(str).tolist())
    stopwords = set(wordcloud.STOPWORDS)
    
    fig_wordcloud = wordcloud.WordCloud(stopwords=stopwords,background_color='lightgrey',
                    colormap='viridis', width=800, height=600).generate(text)
    
    plt.figure(figsize=(10,7), frameon=True)
    plt.imshow(fig_wordcloud)  
    plt.axis('off')
    plt.title(category, fontsize=20 )
    plt.show()

Using the above function, let’s generate a word cloud for SPAM messages

generate_wordcloud(df_spam, 'SPAM')

Pretty obvious I guess. Let’s look at a word cloud for HAM messages

generate_wordcloud(df_ham, 'HAM')

No comments here. Let’s move forward.

Prepare data for training

Let’s prepare our data for model building. For this purpose, we will be converting our category label to numeric data. This means SPAM will be marked as 1, and HAM messages will be marked as 0.

df_sms['category'] = df_sms['category'].map( 
    {'spam': 1, 'ham': 0} )

df_sms.head()

Secondly, we will now split our dataset into training and testing datasets; this is essential for training our model and then validating it against our testing dataset.

# importing libraries to split data into train and test set

from sklearn.model_selection import train_test_split

X = df_sms['message'].values
y = df_sms['category'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

With the above code, we have split our dataset 80/20: 80% of the messages will be used to train the model, and the other 20% will be used for testing it.
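To get a feel for the actual sizes, here is the arithmetic behind the split. It assumes that all 5,574 messages from the introduction end up in the dataframe, and that the test count is rounded up with the remainder going to training, which is my understanding of how train_test_split handles a fractional test_size. Check len(X_train) and len(X_test) on your own run to confirm.

```python
import math

n_samples = 5574   # total messages, as quoted in the introduction (assumed all loaded)
test_size = 0.20   # same fraction passed to train_test_split above

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```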

Now, we will use Keras’ built-in text functions to preprocess the data. This will be done using Tokenizer functions fit_on_texts and texts_to_sequences.

# importing libraries for text preprocessing

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

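The Tokenizer itself lowercases the text and removes punctuation, and then fit_on_texts builds a vocabulary index that maps each word to an integer, with more frequent words getting lower numbers. For example, if the training data consisted only of the message “Free entry”, the index would be [“free”] = 1; [“entry”] = 2, etc.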

token = Tokenizer()
token.fit_on_texts(X_train)
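If you are curious what that vocabulary index looks like, here is a rough pure-Python approximation. It is a simplified sketch and not the Keras implementation: the helper names (simple_tokens, build_word_index) and the two toy messages are made up for illustration, and the filter string is meant to mirror the Tokenizer's default, which strips most punctuation but keeps apostrophes.

```python
from collections import Counter

# Punctuation and whitespace characters replaced by spaces before splitting.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokens(text):
    """Lowercase the text, blank out filtered characters, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an integer, starting at 1, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Toy messages, not taken from the real dataset.
toy_messages = ["Free entry! Win a FREE prize", "Are you free tonight?"]
word_index = build_word_index(toy_messages)
print(word_index)
```

Index 0 is never assigned to a word, which is why it can safely be used as the padding value later on.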

texts_to_sequences will replace the words with their relevant index number from the above-created dictionary.

# applying sequences from the dictionary on both training and test dataset

encoded_train = token.texts_to_sequences(X_train)
encoded_test = token.texts_to_sequences(X_test)
print(encoded_train[0:2])

The output will be something like this
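To make the lookup step concrete without relying on the real dataset, here is a toy sketch. The vocabulary below is made up and simply stands in for token.word_index, and to_sequence is a hypothetical helper. One detail worth knowing: with a Tokenizer created without an oov_token, as above, words that were not seen during fit_on_texts are silently dropped, and the sketch mimics that.

```python
# Hypothetical vocabulary, standing in for token.word_index.
word_index = {"free": 1, "entry": 2, "win": 3, "prize": 4}

def to_sequence(message, word_index):
    """Replace each known word with its index; skip words not in the vocabulary."""
    words = message.lower().split()
    return [word_index[w] for w in words if w in word_index]

# "today" is not in the vocabulary, so it is dropped from the sequence.
encoded = to_sequence("Free entry today", word_index)
print(encoded)
```

This matters for the test set and for new messages at prediction time, since they can contain words the training data never had.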

The next step will be to apply the padding, as deep learning models expect that all data passed in should have the same shape and form.

# applying 8 words padding to training and test dataset

max_length = 8

padded_train = pad_sequences(encoded_train, maxlen = max_length, padding = 'post')
padded_test = pad_sequences(encoded_test, maxlen = max_length, padding = 'post')

print(padded_train)

The output after applying the padding will look like this
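Here is a small pure-Python sketch of what the padding step does to a single sequence. The helper pad_post is hypothetical and returns a plain list, whereas pad_sequences returns a NumPy array. It mirrors the settings used above: padding='post' appends zeros at the end, and because truncating is left at its default of 'pre', sequences longer than maxlen lose their first words, not their last.

```python
def pad_post(sequence, maxlen, value=0):
    """Keep the last `maxlen` items, then append `value` until the length is `maxlen`."""
    trimmed = sequence[-maxlen:]
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([5, 3], maxlen=4)             # shorter than maxlen: zeros appended
long = pad_post([1, 2, 3, 4, 5, 6], maxlen=4)  # longer than maxlen: first items dropped

print(short)
print(long)
```

With max_length = 8, only the last eight words of a long message reach the model. If you would rather keep the beginning of each message, pad_sequences accepts truncating='post'.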

Now that our training and test datasets are ready, let’s move towards building our model.

Model building

Let’s start by importing the required libraries

# importing libraries for model building

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.callbacks import EarlyStopping

For model building, we will use Keras’ Embedding layer, which is used on text data for neural networks. Its basic requirement is integer-encoded input, meaning every word should be represented by a number, which we have already done above using Tokenizer.

vocab_size = len(token.word_index) + 1

# creating a deep learning model using Embedding from Keras

deep_learning = Sequential()
deep_learning.add(Embedding(vocab_size, 24, input_length = max_length))
deep_learning.add(Flatten())
deep_learning.add(Dense(500, activation='relu'))
deep_learning.add(Dense(200, activation='relu'))
deep_learning.add(Dropout(0.5))
deep_learning.add(Dense(100, activation='relu'))
deep_learning.add(Dense(1, activation='sigmoid'))

# compile the model
deep_learning.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics = ['accuracy'])

# summarize the model (summary() prints the table itself)
deep_learning.summary()

We will get the summary output below from the above code
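The parameter counts in that summary can be worked out by hand, which is a handy sanity check on the architecture. In this sketch the layer sizes come from the model above, but the vocab_size value is a made-up placeholder, since the real one depends on your training data.

```python
max_length = 8        # padded sequence length used above
embedding_dim = 24    # second argument to Embedding above
vocab_size = 8000     # HYPOTHETICAL value; the real one is len(token.word_index) + 1

def dense_params(n_inputs, n_units):
    """A Dense layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding_params = vocab_size * embedding_dim   # one vector per vocabulary entry
flatten_size = max_length * embedding_dim       # Flatten has no parameters

dense_total = (
    dense_params(flatten_size, 500)
    + dense_params(500, 200)
    + dense_params(200, 100)
    + dense_params(100, 1)
)

print("Flatten output size:", flatten_size)
print("Dense layers total:", dense_total)
print("Embedding (hypothetical vocab):", embedding_params)
```

Dropout adds no parameters either, so the dense part stays the same whatever the vocabulary size turns out to be.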

Now that our model is ready, let’s call the fit method to train it. We will use EarlyStopping to stop the training once the validation loss has stopped improving for 10 consecutive epochs.

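early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)

# fit the model
# NOTE: the original listing is cut off after "v"; the last three arguments are an
# assumed completion (validation on the test split, with the early_stop callback)
deep_learning.fit(x=padded_train,
         y=y_train,
         epochs=50,
         validation_data=(padded_test, y_test),
         verbose=1,
         callbacks=[early_stop])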

Training will typically complete in 10 to 12 epochs. Below should be the output.
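The patience logic is easy to misread, so here is a simplified pure-Python sketch of it. This is not the Keras source: stopping_epoch is a hypothetical helper, the loss values are made up, and details such as min_delta and restoring the best weights are ignored.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2.
toy_losses = [0.50, 0.40, 0.45, 0.41, 0.42]
stopped_at = stopping_epoch(toy_losses, patience=3)
print(stopped_at)
```

Under this logic, a run with patience=10 that stops after 10 to 12 epochs would have reached its best validation loss within the first couple of epochs.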

Let’s check our accuracy score after model training

from sklearn.metrics import confusion_matrix, accuracy_score

preds = (deep_learning.predict(padded_test) > 0.5).astype("int32")
accuracy_score(y_test, preds)

The accuracy score came out to be 0.9829596412556054
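Accuracy alone can hide how the model does on the smaller SPAM class, which is where a confusion matrix helps (we imported confusion_matrix above but have not used it). Here is a pure-Python sketch of the idea with made-up labels and probabilities. None of these numbers come from the trained model.

```python
# Made-up ground truth (1 = SPAM, 0 = HAM) and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
probs = [0.1, 0.2, 0.05, 0.7, 0.3, 0.1, 0.9, 0.4, 0.8, 0.2]

# Same thresholding rule as above: strictly greater than 0.5 means SPAM.
y_pred = [int(p > 0.5) for p in probs]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # SPAM caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # HAM passed through
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # HAM wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # SPAM missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the messages flagged as SPAM, how many were SPAM
recall = tp / (tp + fn)      # of the real SPAM messages, how many were flagged

print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```

On the real predictions, confusion_matrix(y_test, preds) from scikit-learn gives the same four counts in one call.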

Let’s save our deep learning model for performing testing and then later utilize it in our applications.

deep_learning.save("spam_detection_model")

We will also save our tokenizer using Pickle

import pickle

with open('spam_detection_model/tokenizer.pkl', 'wb') as output:
    pickle.dump(token, output, pickle.HIGHEST_PROTOCOL)

Let’s Predict

Let’s load our saved deep learning model

import tensorflow as tf

# load from the same path the model was saved to above
model = tf.keras.models.load_model("spam_detection_model")
with open('spam_detection_model/tokenizer.pkl', 'rb') as handle:
    tokenizer = pickle.load(handle)

Now, let’s pass our sample SMS message and perform preprocessing

sms = ["Hi, I'll be late, lets check this tomorrow "]
sms_proc = tokenizer.texts_to_sequences(sms)
sms_proc = pad_sequences(sms_proc, maxlen=max_length, padding='post')

Let’s see the prediction

pred = (model.predict(sms_proc) > 0.5).astype("int32").item()
print(pred)

The output of this should be 0 or 1, where 0 represents a HAM message and 1 represents SPAM. For our test above, the model correctly identified the message as HAM, which means not spam.

Now that our model is saved, we can utilize it in any way we want. For example, you could build a background connector at the gateway that screens and categorizes each message as it is received. This is not limited to SMS; the same approach can be used to classify emails as well.

Let me know what you think. All the working code can be found at my GitHub repository here.
