ML: Imbalanced Data Set

Research on imbalanced classes often defines an imbalanced data set as one whose minority class makes up about 10 to 20 percent of the total number of samples. Real-world imbalanced data sets include: 1) credit card fraud, about 2 percent; 2) HIV prevalence, about 0.4 percent; 3) disk drive failures, about 1 percent; 4) conversion rates of online ads, about 1e-3 to 1e-6 percent; 5) factory production defect rates, about 0.1 percent.

Commonly used approaches to deal with imbalanced data sets in machine learning:

  1. oversampling the minority class.
  2. undersampling the majority class (Wallace et al., 2011).
  3. synthesizing new minority examples (Chawla et al., 2004).
  4. adjusting class weights in the cross-entropy loss (misclassification costs).
  5. modifying an existing algorithm to be more sensitive to rare classes.
  6. throwing away the minority examples and switching to an anomaly detection problem (Goh and Rudin, 2014).
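The two sampling approaches (1 and 2) can be sketched in a few lines of numpy. This is a minimal illustration, not a production recipe; the toy labels and the 90/10 split are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced labels: 90 majority (class 0), 10 minority (class 1).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in features

min_idx = np.flatnonzero(y == 1)
maj_idx = np.flatnonzero(y == 0)

# Approach 1: oversample the minority class with replacement.
over = rng.choice(min_idx, size=len(maj_idx), replace=True)
X_over = np.vstack([X[maj_idx], X[over]])
y_over = np.concatenate([y[maj_idx], y[over]])

# Approach 2: undersample the majority class without replacement.
under = rng.choice(maj_idx, size=len(min_idx), replace=False)
X_under = np.vstack([X[under], X[min_idx]])
y_under = np.concatenate([y[under], y[min_idx]])

print(np.bincount(y_over))   # balanced at 90/90
print(np.bincount(y_under))  # balanced at 10/10
```

Note the trade-off: oversampling duplicates minority points (risking overfitting to them), while undersampling discards majority information.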

Unlike oversampling (Approach 1) and undersampling (Approach 2), which select examples at random to adjust class proportions, neighbor-based approaches examine the instance space carefully and decide what to do based on each instance's neighborhood. For instance, Tomek links are pairs of opposite-class instances that are each other's nearest neighbors. In other words, they are pairs of opposing instances that lie very close together. The Tomek-link algorithm finds such pairs and removes the majority instance of each pair. The idea is to sharpen the border between the minority and majority classes, making the minority region more distinct.
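The Tomek-link cleaning step described above can be sketched as follows. The function name and the brute-force pairwise-distance computation are illustrative choices, assuming small data sets and Euclidean distance.

```python
import numpy as np

def tomek_link_removal(X, y, majority=0):
    """Drop the majority member of every Tomek link: a pair of
    opposite-class points that are each other's nearest neighbor."""
    # Pairwise Euclidean distances; inf on the diagonal so a point
    # is never its own nearest neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)

    drop = set()
    for i, j in enumerate(nn):
        # Mutual nearest neighbors of opposite classes form a Tomek link.
        if nn[j] == i and y[i] != y[j]:
            drop.add(i if y[i] == majority else j)
    keep = np.array([k for k in range(len(y)) if k not in drop])
    return X[keep], y[keep]
```

On points 0, 1, 5, 6 with labels 0, 1, 0, 0, the pair (0, 1) is a Tomek link, so the majority point at 0 is removed while the same-class pair (5, 6) is untouched.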

For synthesizing new examples (Approach 3), SMOTE (Synthetic Minority Oversampling Technique), proposed by Chawla et al. (2004), derives new minority examples by interpolating between existing ones. The algorithm can only operate within the body of available examples, never outside it. This implies that SMOTE can only fill in the convex hull of the existing minority examples; it cannot create new minority regions beyond that hull.
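The interpolation at the heart of SMOTE can be sketched like this. The function name, `k`, and `n_new` are illustrative parameters, not the paper's exact interface; the core step is new = x_i + lambda * (x_j - x_i) for a random minority neighbor x_j.

```python
import numpy as np

def smote_sample(X_min, k=2, n_new=4, seed=0):
    """Minimal SMOTE-style sketch: interpolate between a minority
    point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # Indices of the k nearest minority neighbors of each point.
    neighbors = np.argsort(d, axis=1)[:, :k]

    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(neighbors[i])
        lam = rng.random()  # interpolation weight in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)
```

Every synthetic point lies on a segment between two existing minority points, which is exactly why the output stays inside the convex hull noted above.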

For adjusting class weights (Approach 4), it is worth noting that increasing the importance of the minority class usually only raises the cost of misclassifying that class (False Negatives, assuming the minority class is positive). The learner then shifts its separating surface to reduce these False Negative errors accordingly.
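A class-weighted cross-entropy loss can be written down directly. The function name and the example weights (10:1) are illustrative; the point is that a False Negative now contributes w_pos times the unweighted penalty.

```python
import numpy as np

def weighted_bce(y_true, p_pred, w_pos=10.0, w_neg=1.0):
    """Class-weighted binary cross-entropy: errors on the positive
    (minority) class cost w_pos, errors on the negative class w_neg."""
    eps = 1e-12
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    loss = -(w_pos * y_true * np.log(p)
             + w_neg * (1 - y_true) * np.log(1 - p))
    return loss.mean()
```

For a positive example scored at p = 0.1, the weighted loss is exactly 10 times the unweighted one, which is what pushes the decision surface toward fewer False Negatives.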

Regarding performance metrics, accuracy is generally not a good indicator on imbalanced data. Instead, use the ROC (Receiver Operating Characteristic) curve, the precision-recall curve (and F1 score), a lift curve, or a profit curve.

The area under the ROC curve (AUC) equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative example. The F1 score is the harmonic mean of precision and recall; it is commonly used in text processing when a single aggregate measure is preferred. Cohen's kappa is an evaluation statistic that accounts for how much agreement would be expected by chance.
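The probabilistic reading of AUC above can be checked by computing it directly over all (positive, negative) score pairs. The function name is an illustrative choice; ties are counted as half, matching the usual convention.

```python
import numpy as np

def auc_rank_probability(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example is scored higher; ties count as one half."""
    pos = np.asarray(scores_pos)[:, None]   # shape (n_pos, 1)
    neg = np.asarray(scores_neg)[None, :]   # shape (1, n_neg)
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

print(auc_rank_probability([0.9, 0.8], [0.1, 0.2]))  # perfect ranking: 1.0
```

This pairwise definition is why AUC is insensitive to the class ratio itself, making it a sensible summary on imbalanced data.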

References:

Nice blog: https://www.svds.com/learning-imbalanced-classes/

Wallace, Small, Brodley and Trikalinos, "Class Imbalance, Redux", IEEE International Conference on Data Mining, 2011.

Krawczyk, "Learning from imbalanced data: open challenges and future directions", 2016.

Torgo, Ribeiro et al., "SMOTE for Regression", 2013.

N.V. Chawla, N. Japkowicz, A. Kolcz, “Editorial: Special Issue on Learning from Imbalanced Data Sets”, ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 1-6, 2004.

Goh, Siong Thye and Rudin, Cynthia, "Box Drawings for Learning with Imbalanced Data", 2014, arXiv:1403.3378.