Data Mining Techniques in Direct Marketing on Imbalanced Data using Tomek Link Combined with Random Under-sampling


Yılmaz Ü., Gezer C., Aydın Z., Güngör V. Ç.

5th International Conference on Information System and Data Mining, ICISDM 2021, Virtual, Online, United States Of America, 27 - 29 May 2021, pp.67-73

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1145/3471287.3471299
  • City: Virtual, Online
  • Country: United States Of America
  • Page Numbers: pp.67-73
  • Keywords: Data Mining, Direct Marketing, Imbalanced Data, Machine Learning, Tomek Link

Abstract

© 2021 ACM. Determining potential customers is very important in direct marketing. Data mining techniques are among the most important methods companies use to identify potential customers. However, since the number of potential customers is very low compared to the number of non-potential customers, a class imbalance problem arises that significantly degrades the performance of data mining techniques. In this paper, different combinations of basic and advanced resampling techniques, such as the Synthetic Minority Over-sampling Technique (SMOTE), Tomek Link, Random Under-Sampling (RUS), and Random Over-Sampling (ROS), were evaluated to improve the performance of customer classification. Different feature selection techniques, such as Information Gain, Gain Ratio, Chi-squared, and Relief, were used to reduce the number of non-informative features in the data. Classification performance was compared across several data mining techniques, such as LightGBM, XGBoost, Gradient Boosting, Random Forest, AdaBoost, ANN, Logistic Regression, Decision Trees, SVC, and the Bagging Classifier, based on ROC AUC and sensitivity metrics. The combination of Tomek Link and Random Under-Sampling as the resampling technique, with the Chi-squared method as the feature selection algorithm, showed superior performance among the combinations tested. Detailed performance evaluations demonstrated that, with the proposed approach, LightGBM, a gradient boosting algorithm based on decision trees, gave the best results among the classifiers, with 0.947 sensitivity and a 0.896 ROC AUC value.
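The best-performing resampling combination, Tomek Link cleaning followed by random under-sampling, can be sketched in plain NumPy. This is an illustrative sketch, not the paper's implementation: the function name and interface are assumptions, and in practice libraries such as imbalanced-learn provide `TomekLinks` and `RandomUnderSampler` for the same purpose. A Tomek link is a pair of mutually nearest neighbours with opposite class labels; removing the majority-class member of each link cleans the class boundary before the remaining majority examples are randomly under-sampled to the minority-class size.

```python
import numpy as np

def tomek_link_then_rus(X, y, rng=None):
    """Illustrative sketch: Tomek Link cleaning + random under-sampling.

    Assumes y contains non-negative integer class labels and that the
    majority class stays larger than the minority class after cleaning.
    """
    rng = np.random.default_rng(rng)
    # Pairwise squared Euclidean distances; a point is never its own neighbour.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(1)                       # nearest neighbour of each sample
    maj = np.bincount(y).argmax()          # majority class label
    # Tomek link: i and nn[i] are mutual nearest neighbours with different labels.
    link = (nn[nn] == np.arange(len(y))) & (y != y[nn])
    drop = link & (y == maj)               # remove only the majority member
    X, y = X[~drop], y[~drop]
    # Random under-sampling: keep all minority samples, match the count
    # with a uniform random subset of the (cleaned) majority samples.
    min_idx = np.flatnonzero(y != maj)
    maj_idx = rng.choice(np.flatnonzero(y == maj),
                         size=len(min_idx), replace=False)
    keep = np.sort(np.concatenate([min_idx, maj_idx]))
    return X[keep], y[keep]
```

After this step the two classes are balanced, so the downstream classifiers (LightGBM, XGBoost, etc.) train on an even class distribution rather than the original skewed one.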