
TensorFlow is an open-source machine learning library with a specific focus on deep learning and neural networks. There are endless uses of TensorFlow, but let’s see how we can use it for SPAM detection by utilizing its natural language processing capabilities.
I’ll be using a dataset containing 5,574 text messages, out of which 747 messages are categorized as SPAM, and 4,827 messages are categorized as HAM. The dataset is downloaded from here.
Loading Data
Let’s start by loading the data into a Pandas dataframe.
import pandas as pd
df_sms = pd.read_csv('data/SMSSpamCollection.csv', sep='\t', header=None)
Setting column names
df_sms.columns = ['category', 'message']
Let’s preview our data
df_sms.head()

Visualization
Let’s see what we have in our dataframe
df_sms.category.value_counts()

Let’s plot those numbers
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.countplot(x='category', data=df_sms)
plt.show()

Let’s see a word cloud for both HAM and SPAM messages. This will give us an idea of which words carry more weight in each category.
df_spam = df_sms[df_sms.category == 'spam'].copy()
df_ham = df_sms[df_sms.category == 'ham'].copy()
import wordcloud
def generate_wordcloud(data_frame, category):
    # join every message into one string for the word cloud
    text = ' '.join(data_frame['message'].astype(str).tolist())
    stopwords = set(wordcloud.STOPWORDS)
    fig_wordcloud = wordcloud.WordCloud(stopwords=stopwords, background_color='lightgrey',
                                        colormap='viridis', width=800, height=600).generate(text)
    plt.figure(figsize=(10, 7), frameon=True)
    plt.imshow(fig_wordcloud)
    plt.axis('off')
    plt.title(category, fontsize=20)
    plt.show()
Using the above function, let’s generate a word cloud for SPAM messages
generate_wordcloud(df_spam, 'SPAM')

Pretty obvious I guess. Let’s look at a word cloud for HAM messages
generate_wordcloud(df_ham, 'HAM')

No comments here. Let’s move forward.
Prepare data for training
Let’s prepare our data for model building. For this purpose, we will be converting our category label to numeric data. This means SPAM will be marked as 1, and HAM messages will be marked as 0.
df_sms['category'] = df_sms['category'].map(
{'spam': 1, 'ham': 0} )
df_sms.head()

Secondly, we will now split our dataset into training and testing datasets; this is essential for training our model and then validating it against our testing dataset.
# importing libraries to split data into train and test set
from sklearn.model_selection import train_test_split
X = df_sms['message'].values
y = df_sms['category'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
With the above code, we have split our dataset into an 80/20 split, where 80% of messages will be used to train the model, and the other 20% will be used for testing the model.
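To make the 80/20 idea concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. It only illustrates the concept and is not scikit-learn's implementation: the helper name `simple_split` and the toy messages are made up for this example, and the rows that end up in each set will differ from what `train_test_split` selects.

```python
import random

def simple_split(items, labels, test_size=0.20, seed=42):
    """Shuffle indices with a fixed seed, then hold out `test_size` of them."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)      # fixed seed -> repeatable split
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# hypothetical toy data: 10 messages, labels 1 = spam, 0 = ham
messages = [f"message {i}" for i in range(10)]
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

X_tr, X_te, y_tr, y_te = simple_split(messages, labels)
print(len(X_tr), len(X_te))  # 8 training messages, 2 test messages
```

Fixing the seed plays the same role as `random_state=42` above: rerunning the split gives the same partition, so results stay comparable between runs.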
Now, we will use Keras’ built-in text functions to preprocess the data. This will be done using Tokenizer functions fit_on_texts and texts_to_sequences.
# importing libraries for text preprocessing
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
The Tokenizer strips punctuation and lowercases the text by default, and fit_on_texts builds a vocabulary index that maps each word to an integer, with more frequent words getting lower numbers. For example, if the only message were “Free entry”, the index would be [“free”] = 1; [“entry”] = 2.
token = Tokenizer()
token.fit_on_texts(X_train)
texts_to_sequences will replace the words with their relevant index number from the above-created dictionary.
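The two steps, building the index and then looking words up in it, can be sketched in plain Python. This is a simplified stand-in rather than the Keras Tokenizer itself: the function names and the sample messages are invented for illustration, and the punctuation list is only assumed to be close to the Keras default.

```python
from collections import Counter

# punctuation this sketch strips out (an assumption, close to the Keras default)
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans(PUNCTUATION, ' ' * len(PUNCTUATION)))
    return cleaned.split()

def build_word_index(texts):
    """Most frequent word gets index 1; index 0 stays free for padding."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace known words with their index; unknown words are skipped."""
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]

train = ["Free entry, free prize!", "See you at lunch"]   # made-up messages
word_index = build_word_index(train)
print(word_index)
print(to_sequences(["Free lunch today"], word_index))     # [[1, 7]]
```

Note that "today" never appeared in the training messages, so it simply disappears from the encoded sequence. The real Tokenizer behaves the same way unless an `oov_token` is configured, which is why the index is fitted on the training set only and then reused for the test set.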
# applying sequences from the dictionary on both training and test dataset
encoded_train = token.texts_to_sequences(X_train)
encoded_test = token.texts_to_sequences(X_test)
print(encoded_train[0:2])
The output will be something like this

The next step will be to apply the padding, as deep learning models expect that all data passed in should have the same shape and form.
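As a rough illustration of what padding does, the sketch below pads every sequence at the end and cuts long sequences down to a fixed length. It is a plain-Python approximation with made-up input, not the Keras function. One detail worth knowing is that `pad_sequences` truncates from the front by default (`truncating='pre'`), so with only `padding='post'` set, a long message keeps its last words rather than its first; the sketch mirrors that behaviour.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; sequences that are too long keep their LAST `maxlen` items."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]                      # drop leading tokens if too long
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

# made-up encoded messages of different lengths
encoded = [[5, 12, 7], [3, 9, 1, 4, 8, 2, 6, 11, 10, 13]]
for row in pad_post(encoded, maxlen=8):
    print(row)
# [5, 12, 7, 0, 0, 0, 0, 0]
# [1, 4, 8, 2, 6, 11, 10, 13]
```

If the opening words of a message matter more than the ending, passing `truncating='post'` to `pad_sequences` keeps the start instead.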
# applying 8 words padding to training and test dataset
max_length = 8
padded_train = pad_sequences(encoded_train, maxlen = max_length, padding = 'post')
padded_test = pad_sequences(encoded_test, maxlen = max_length, padding = 'post')
print(padded_train)
The output after applying the padding will look like this

Now that our training and test datasets are ready, let’s move on to building our model.
Model building
Let’s start by importing the required libraries
# importing libraries for model building
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.callbacks import EarlyStopping
For model building, we will use Keras’ Embedding layer, which is used on text data for neural networks. Its basic requirement is integer-encoded input, meaning a number should represent every word, which we have already done above using Tokenizer.
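Conceptually, an embedding layer is a lookup table with one vector per vocabulary index, and Flatten simply concatenates the vectors of a padded message into one long feature list. The sketch below shows that idea in plain Python with tiny made-up sizes and random values; in Keras the table is learned during training, and all names here are illustrative only.

```python
import random

vocab_size, embed_dim, max_length = 10, 4, 3   # tiny illustrative sizes, not the article's
rng = random.Random(0)

# one row of `embed_dim` numbers per vocabulary index (row 0 is the padding index)
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

def embed_and_flatten(sequence):
    """Look up each index's vector, then concatenate them into one flat list."""
    vectors = [embedding_table[idx] for idx in sequence]
    return [x for vec in vectors for x in vec]

features = embed_and_flatten([7, 2, 0])   # a padded sequence of length 3
print(len(features))   # 3 tokens x 4 dims = 12 values fed to the Dense layers
```

With the sizes used in this article (8 tokens per message, 24 dimensions per word), the flattened vector has 8 x 24 = 192 values. The table needs a row for index 0 as well, which is why the vocabulary size is the number of words plus one.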
vocab_size = len(token.word_index) + 1
# creating a deep learning model using Embedding from Keras
deep_learning = Sequential()
deep_learning.add(Embedding(vocab_size, 24, input_length = max_length))
deep_learning.add(Flatten())
deep_learning.add(Dense(500, activation='relu'))
deep_learning.add(Dense(200, activation='relu'))
deep_learning.add(Dropout(0.5))
deep_learning.add(Dense(100, activation='relu'))
deep_learning.add(Dense(1, activation='sigmoid'))
# compile the model
deep_learning.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics = ['accuracy'])
# summarize the model
print(deep_learning.summary())
The above code prints the following summary
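The parameter counts reported in that summary can be cross-checked by hand. The sketch below does the arithmetic for the layers defined above; the helper names are made up, and since the real vocabulary size depends on the training data, the value 8000 is only a placeholder.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

def count_params(vocab_size, embed_dim=24, max_length=8):
    flat = max_length * embed_dim                 # Flatten output: 8 * 24 = 192
    layers = {
        "embedding": vocab_size * embed_dim,      # one vector per vocabulary index
        "dense_500": dense_params(flat, 500),
        "dense_200": dense_params(500, 200),
        "dense_100": dense_params(200, 100),      # Dropout adds no parameters
        "dense_1": dense_params(100, 1),
    }
    layers["total"] = sum(layers.values())
    return layers

# vocab_size depends on the training data; 8000 is a placeholder, not the article's value
for name, n in count_params(vocab_size=8000).items():
    print(f"{name:>10}: {n:,}")
```

The Dense layers contribute 96,500, 100,200, 20,100 and 101 parameters regardless of the vocabulary; only the embedding row changes with `vocab_size`.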

Now that our model is ready, let’s call the fit method to train it. We will use EarlyStopping to stop the training once the validation loss has not improved for 10 consecutive epochs (patience=10).
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)
# fit the model
deep_learning.fit(x=padded_train,
y=y_train,
epochs=50,
v
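For reference, `EarlyStopping(monitor='val_loss', ...)` only has something to monitor if `fit` receives validation data (through `validation_data` or `validation_split`), and the callback has to be handed over through `callbacks`. The patience rule itself can be sketched in plain Python. This is a simplified illustration with made-up loss values, not the Keras implementation.

```python
def stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:                       # improvement: remember it, reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# made-up validation losses: best value at epoch 2, then slowly getting worse
losses = [0.20, 0.08] + [0.09 + 0.01 * i for i in range(20)]
print(stopping_epoch(losses, patience=10))   # 12
```

In this toy run the best loss is reached at epoch 2 and training stops 10 epochs later. Unless `restore_best_weights=True` is passed to `EarlyStopping`, the model keeps the weights from the last epoch rather than from the best one.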
Training will typically complete in 10 to 12 epochs. Below should be the output.

Let’s check our accuracy score after model training
from sklearn.metrics import confusion_matrix, accuracy_score
preds = (deep_learning.predict(padded_test) > 0.5).astype("int32")
accuracy_score(y_test, preds)
The accuracy score came out to be 0.9829596412556054
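Accuracy on its own can look flattering here, because the dataset is imbalanced: with 4,827 of 5,574 messages being HAM, a model that always answers HAM would already score about 0.866 on the full dataset. Precision and recall for the SPAM class give a fuller picture. The sketch below shows the arithmetic in plain Python on made-up labels and probabilities; it does not reproduce the article's results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = spam."""
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return tn, fp, fn, tp

# made-up labels and sigmoid outputs, only to show the arithmetic
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
probs  = [0.1, 0.2, 0.05, 0.7, 0.3, 0.01, 0.9, 0.4, 0.8, 0.2]
y_pred = [int(p > 0.5) for p in probs]        # same 0.5 threshold as above

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)    # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)       # of actual spam, how much was caught
print(tn, fp, fn, tp)         # 6 1 1 2
print(accuracy, round(precision, 3), round(recall, 3))

# baseline from the dataset figures: predicting HAM for everything
print(round(4827 / 5574, 3))  # 0.866
```

On the real test set, the already imported `confusion_matrix(y_test, preds)` returns the same four counts as a 2x2 array.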
Let’s save our deep learning model for performing testing and then later utilize it in our applications.
deep_learning.save("spam_detection_model")
We will also save our tokenizer using Pickle
import pickle
with open('spam_detection_model/tokenizer.pkl', 'wb') as output:
    pickle.dump(token, output, pickle.HIGHEST_PROTOCOL)
Let’s Predict
Let’s load our saved deep learning model and tokenizer
import pickle
import tensorflow as tf
# load from the same path the model was saved to
s_model = tf.keras.models.load_model("spam_detection_model")
with open('spam_detection_model/tokenizer.pkl', 'rb') as handle:
    tokenizer = pickle.load(handle)
Now, let’s pass our sample SMS message and perform preprocessing
sms = ["Hi, I'll be late, lets check this tomorrow "]
sms_proc = tokenizer.texts_to_sequences(sms)
sms_proc = pad_sequences(sms_proc, maxlen=max_length, padding='post')
Let’s see the prediction
pred = (s_model.predict(sms_proc) > 0.5).astype("int32").item()
print(pred)
The output will be either 0 or 1, where 0 represents HAM and 1 represents SPAM. For the test message above, the model correctly identified it as HAM, meaning not spam.
Now that our model is saved, we can use it in any way we want. For example, a background connector at the gateway could screen and categorize each message as it is received. This is not limited to SMS; it can be used to classify emails as well.
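A screening step of that kind could look roughly like the sketch below. Everything in it is hypothetical: `screen_message` and `fake_scorer` are invented names, and the scorer is a stand-in so the example runs without a trained model. In a real connector the scorer would wrap the tokenizer, the padding step and the model's predict call.

```python
def screen_message(text, predict_spam_probability, threshold=0.5):
    """Label one incoming message using any callable that returns P(spam)."""
    probability = predict_spam_probability(text)
    label = "spam" if probability > threshold else "ham"
    return {"message": text, "label": label, "probability": probability}

# stand-in scorer so the sketch runs without a trained model;
# in practice this would wrap the tokenizer, pad_sequences and model.predict
def fake_scorer(text):
    return 0.95 if "free" in text.lower() else 0.02

for sms in ["FREE entry to win a prize", "Hi, I'll be late"]:
    print(screen_message(sms, fake_scorer)["label"])
# spam
# ham
```

Passing the scorer in as a function keeps the screening logic independent of TensorFlow, so the same wrapper could sit in front of an email classifier as well.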
Let me know what you think. All the code can be found in my GitHub repository here.