A Comparison of Oversampling Methods on Imbalanced Topic Classification of Korean News Articles

Year
2017
Volume 18
Issue 4
Pages
391-437
Authors
Yirey Suh, Jaemyung Yu, Jonghoon Mo, Leegu Song, Cheongtag Kim
Abstract
Machine learning has progressed to match human performance, including in the field of text classification. However, when training data are imbalanced, classifiers do not perform well. Oversampling is one way to overcome the problem of imbalanced data, and there are many oversampling methods that can be conveniently implemented. While comparative studies of oversampling methods on non-text data have been conducted, studies comparing oversampling methods under a unifying framework on text data are scarce. This study finds that while oversampling methods generally improve the performance of classifiers, similarity is an important factor that influences the performance of classifiers on imbalanced and resampled data.
Keywords: Imbalanced data, oversampling methods, SMOTE, topic classification
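
To make the idea of oversampling concrete, the following is a minimal sketch of random oversampling, the simplest member of this family: minority-class documents are duplicated at random until every class matches the size of the largest class. The abstract names only SMOTE among its keywords and does not say which other methods were compared, so this is an illustration of the general idea rather than the authors' procedure. The function name random_oversample and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        # All original documents belonging to this class
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Draw with replacement to fill the gap up to the target size
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


# Hypothetical toy corpus: four "economy" articles and one "sports" article
texts = ["economy a", "economy b", "economy c", "economy d", "sports a"]
labels = ["economy"] * 4 + ["sports"]

new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

Unlike random duplication, SMOTE creates synthetic minority examples by interpolating between neighboring feature vectors, so it would be applied to a numeric representation of the documents rather than to raw text.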