Spam detection in text messages using logistic regression based on gradient descent
DOI: https://doi.org/10.17721/ISTS.2025.9.74-80
Keywords: machine learning, natural language processing, gradient optimization, email filtering, text preprocessing, classification
Abstract
Background. With the growing volume of email communication, the problem of spam filtering is becoming increasingly relevant. According to statistical research, spam constitutes a significant portion of global email traffic, posing risks to both the security and the efficiency of electronic communication. In this context, natural language processing (NLP) and machine learning methods are gaining particular importance. The aim of this study is to develop a model for classifying email messages as spam or non-spam using logistic regression implemented via gradient descent, combined with text data processing methods.
Methods. The model was trained on a dataset containing over 5,000 email messages labeled as spam or non-spam. The data were preprocessed by removing noise components such as punctuation, numbers, stop words, and short tokens, followed by lemmatization. The cleaned texts were converted into numerical format using TF-IDF vectorization with L2 normalization. To address the class imbalance, the SMOTE method was applied. The model was trained using a classical gradient descent scheme with a sigmoid activation function and a logarithmic loss function.
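As an illustration of the training procedure described above, the sketch below reconstructs the pipeline in Python: TF-IDF vectorization with L2 normalization, SMOTE oversampling of the minority class, and logistic regression fitted by batch gradient descent with a sigmoid activation and a logarithmic (cross-entropy) loss. The toy messages, learning rate, iteration count, and SMOTE settings are illustrative assumptions, not the values used in the study.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE


def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic_gd(X, y, lr=0.5, n_iter=2000):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)               # predicted P(spam | x)
        grad_w = X.T @ (p - y) / n_samples   # gradient of log loss w.r.t. weights
        grad_b = np.mean(p - y)              # gradient w.r.t. bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy, already-cleaned messages (the study itself uses ~5,000 labeled emails).
texts = [
    "win a free prize now", "claim your cash reward today",
    "free money click here now", "exclusive offer win big today",
    "meeting rescheduled to monday", "please review the attached report",
    "lunch at noon tomorrow", "quarterly results are attached",
    "see you at the project call", "notes from yesterday's meeting",
    "draft agenda for the workshop", "travel reimbursement form attached",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# TF-IDF features with L2 normalization.
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(texts).toarray()

# SMOTE oversampling of the minority (spam) class before training.
X_res, y_res = SMOTE(k_neighbors=2, random_state=42).fit_resample(X, labels)
w, b = train_logistic_gd(X_res, y_res.astype(float))

# Classify with a 0.5 threshold on the sigmoid output.
predictions = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(predictions)
```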
Results. The resulting model achieved high performance on the test set: overall accuracy was 98%, with an F1-score of 0.92 for the spam class and 0.99 for the non-spam class. The recall for spam reached 0.90, indicating the model's ability to detect most unwanted messages without excessive false positives. The balance between precision and recall is also reflected in macro and weighted average F1-scores, both exceeding 0.96.
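Metrics of the kind reported above (overall accuracy, per-class precision, recall, and F1, plus macro and weighted averages) can be obtained from held-out predictions as in the short sketch below; the label and prediction arrays are placeholders to make the call runnable and do not reproduce the study's test split or numbers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Placeholder test labels and predictions (illustrative only).
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

print("accuracy:", accuracy_score(y_test, y_pred))
# Per-class precision/recall/F1 with macro and weighted averages.
print(classification_report(y_test, y_pred, target_names=["non-spam", "spam"]))
```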
Conclusions. The findings demonstrate the effectiveness of combining logistic regression, gradient descent, and text preprocessing for the spam classification task, even in the presence of imbalanced data. The proposed approach is both efficient and interpretable, making it suitable for practical implementation in email filtering systems.
License
Copyright (c) 2025 Information systems and technologies security

This work is licensed under a Creative Commons Attribution 4.0 International License.
