Class Imbalance Learning


Sudarsun Santhiappan, PhD scholar at IIT Madras since 2013
Prof. Ravindran, associate professor in Computer Science at IIT Madras.

Abstract

 
The data classification task assigns labels to data points using a model learned from a collection of pre-labeled data points. The Class Imbalance Learning (CIL) problem is concerned with the performance of classification algorithms in the presence of under-represented data and severe class distribution skews. Due to the inherently complex characteristics of imbalanced datasets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. It is important to study CIL because it is rare to find a classification problem in real-world scenarios that follows a balanced class distribution. In this article, we present how machine learning has become an integral part of modern life and how some real-world problems are modeled as CIL problems. We also provide a detailed survey of the fundamentals of and solutions to class imbalance learning. We conclude the survey by presenting some of the challenges and opportunities in class imbalance learning.

Introduction

 
Machine learning is a core sub-area of artificial intelligence that enables computers to learn without being explicitly programmed. When exposed to new data, such programs can learn, grow, change, and develop by themselves. Machine learning is a method of data analysis that automates analytical model building to find hidden insights without being explicitly programmed where to look. The iterative aspect of machine learning allows models to adapt independently as they are exposed to new data. The algorithms learn from previous computations to produce reliable, repeatable decisions and results. A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.[1]

Machine Learning has now become an avenue for absorbing and modeling what one knows and doesn’t know, to enable more targeted information to help individuals learn. Platform companies such as Google, LinkedIn, and Amazon are building algorithms that represent us and our needs for data-driven decisions. Learning lifelong from the data generated by individuals will allow machines to know enough about individuals to make recommendations that adapt to our changing contexts. At the core of this opportunity is data, tonnes of it, and algorithms that convert data into a matrix of comparisons for decision making.

“Customers who bought this item also bought” is a popular phrase on Amazon’s online retail site, which has long used data-driven approaches to recommend products. With LinkedIn’s Economic Graph[2], insights such as the changing nature of work, the skills that comprise specific jobs, work-related behaviors that pre-signal a change of job, and skills-gap recognition have become readily available for informed decision making. On the medical imaging front, machine learning models[3] are used to detect melanoma from images of pigmented skin lesions. Android-powered mobile phones allow Google to access data generated by a very large population of individuals and to provide assistance with recommendations and decision making in several day-to-day activities such as travel, food, and shopping. Autonomous cars use digital imaging and image classification models to identify whether an obstacle is a human, an animal, or an object, so that it can be negotiated appropriately. In agriculture, models of crops and soil conditions are used to decide which crop to grow under particular soil and climatic conditions. In the speech recognition space, machine learning models are used to identify a speaker based on the unique characteristics learned about every speaker during the model training phase.

Data Classification

A typical machine learning classification task comprises a training set, an evaluation set, a testing set of data points, and a performance evaluation metric[4]. A dataset can be either unlabeled or labeled. A labeled dataset is one in which every input data point has an associated output label (categorical value). Labeled datasets are expensive to construct, as human labor is required to verify the truth of the output labels.

Mathematically, the set of labels is represented as Y={y1, y2, …, ym}, the ith input data point is represented as Xi, and its associated output label as yi, where yi ∈ Y. A typical dataset is represented as D=⟨X,y⟩, where X is typically a matrix of N input row vectors of P dimensions and y is a column vector of length N. The objective of a machine learning classifier is to learn a function f : X → Y that minimizes the misclassification error, i.e., the fraction of data points Xi for which the predicted label ŷi = f(Xi) differs from the true label yi.

The training set is the dataset from which one or more statistical machine learning models are learned. The evaluation dataset is used to select the best machine learning algorithm or strategy by measuring the performance of the learned models on it using the evaluation metric(s). Some of the popular point-based performance metrics are accuracy, precision, recall, and F1-score, while curve-based performance metrics include ROC, PRC, AUC, and cost curves. Once a statistical model is chosen and its parameters are tuned using the evaluation process, its performance on the test dataset is measured using the same evaluation metric and reported.
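To make this workflow concrete, the following Python sketch (assuming the scikit-learn library; the candidate models, split sizes, and synthetic dataset are illustrative) selects between two candidate classifiers using F1-score on an evaluation set and then reports the chosen model's performance on a held-out test set.

```python
# A minimal sketch of the train / evaluation / test workflow described above,
# assuming scikit-learn is available; model choices and split sizes are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic dataset with a skewed 95:5 class distribution.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                    stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                stratify=y_rest, random_state=42)

candidates = {"logreg": LogisticRegression(max_iter=1000),
              "tree": DecisionTreeClassifier(max_depth=5, random_state=42)}

# Model selection: pick the candidate with the best F1-score on the evaluation set.
best_name, best_model = max(
    ((name, m.fit(X_train, y_train)) for name, m in candidates.items()),
    key=lambda kv: f1_score(y_val, kv[1].predict(X_val)))

# Report the chosen model's performance on the held-out test set with the same metric.
print(best_name, "test F1 =", f1_score(y_test, best_model.predict(X_test)))
```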

Supervised classification algorithms learn from labeled examples, where the training data comprises an input data representation along with the desired output label. For example, a quality test of a machine could have data points labeled either “F” (fail) or “P” (pass). The learning algorithm receives a set of inputs, represented predominantly as vectors of feature values, along with the corresponding correct output labels, and learns by comparing its predicted outputs with the correct outputs to find the misclassification error. It then modifies the model accordingly so as to minimize this error. Supervised learning is commonly used in applications where historical data is used to learn a model and predict likely future events. For example, it can predict which insurance customer is likely to file a claim, whether a user would buy a product or not, or whether a cricket team would win its next match, and so forth.

Semi-supervised classification algorithms are used for the same applications as supervised learning, but they use a combination of both labeled and unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data, since unlabeled data is easy to acquire and less expensive. Semi-supervised learning is useful when the cost associated with labeling is too high to allow for a fully labeled training process.


[1]Tom Mitchell, Carnegie Mellon University

[2]https://www.linkedin.com/economic-graph

[3]https://skinvision.com/

[4]https://ils.unc.edu/courses/2013_spring/inls509_001/lectures/10-EvaluationMetrics.pdf


Class Imbalance Learning

The Class Imbalance Learning (CIL) problem is concerned with the performance of classification algorithms in the presence of under-represented data and severe class distribution skews. Due to the inherently complex characteristics of imbalanced datasets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. It is necessary to study CIL because it is rare to find classification problems in real-world scenarios that follow balanced class distributions. We present some real-world use cases where the data distribution is naturally imbalanced and the class of interest is typically the minority class.

Computer Assisted Coding: A computer assisted coding system (CACS) is a software system that analyzes healthcare documents (medical charts) and produces appropriate medical codes for specific phrases and terms within the document. ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD)[5], a medical classification list by the World Health Organization (WHO). It contains codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases. The ICD-10 Clinical Modification has about 69,833 codes, and the ICD-10 Procedure Coding System has 71,918 codes, which makes for a formidable class imbalance learning challenge: the coding task is a multi-label classification problem with tens of thousands of classes and a severely skewed class distribution. Companies like BuddiHealth[6] provide deep learning and knowledge-based hybrid solutions to this extreme classification problem, while companies like 3M, Optum, Nuance, Atiego, ezCAC, and Opera provide statistical NLP and rule-based solutions[7].

Suicide Prevention Tools[8]: Facebook has introduced a suicide-prevention feature that uses AI to identify posts indicating suicidal or harmful thoughts. The AI scans the posts and their associated comments, compares them to others that merited intervention, and, in some cases, passes them along to its community team for review. The company plans to pro-actively reach out to users it believes are at risk, showing them a screen with suicide-prevention resources including options to contact a helpline or reach out to a friend. Algorithms, trained on report data from the network’s close to two billion users, are constantly on the lookout for warning signs in content that users post, as well as replies that are received. When a red flag is raised, a team of human reviewers is alerted, and the user can be contacted and offered help.

Fraud Management: This is an example of big data with class imbalance characteristics, where the suspicious behaviors are “fortunately” rare events. Big data analytics link heterogeneous information from transaction data, which enables the service provider to pick up these behaviors automatically. For example, a series of cash-in transactions to the same account from different locations might be an attempt to avoid paying for domestic transfers, or several cash-ins immediately followed by a cash-out could indicate money laundering. Best practice states that no actions are purely automated; the fraud analyst always has the final say. As a population’s behavior evolves over time, the parameters of the fraud detection models must adapt to remain optimal. In response to this, machine learning algorithms predict a natural evolution of behavior based on historical data as well as previous actions taken by fraud managers in decision making, and also propose modifications to the detection model for future anomaly detection.

Churn Prediction: One of the most important business metrics for executives is churn rate, the rate at which customers stop doing business with you. Today, data-driven companies use data science to effectively predict which customers are likely to churn. Such programs allow companies to pro-actively protect revenue by incentivizing potential churners to stay. Networking and communication companies typically have business-level data, usage logs, and call center/support tickets, among other data assets. This data is generated from their consumer and business customer interactions, and these datasets vary in terms of volume and user behavior. The machine learning task of interest is to predict whether a customer will become a churner or not, and churners typically form a very small subset of the entire customer population.

Buyer Prediction: Millions of consumers visit e-commerce sites, but only a few thousand visitors buy products, which makes the imbalance ratio of the order of 1000:1 or more. A typical e-retailer wants to improve the customer experience and the conversion rate. The objective is to identify potential buyers based on their demographics, historical transaction patterns, click patterns, browsing patterns across different pages, etc. Deep data analysis reveals buying behaviors that are highly dependent on activities such as the number of clicks, session duration, previous sessions, purchase sessions, click rate per session, etc. By applying machine learning and predictive analytics methods, the propensity score of each visitor can be estimated. This leads to multiple benefits for the e-retailer: offering the right, targeted products to customers at the right time, increasing the conversion rate, and improving customer satisfaction.

Foundations of Imbalanced Learning

Any dataset that exhibits an unequal distribution among its classes can be considered imbalanced. However, the common understanding in the community is that imbalanced data correspond to datasets exhibiting significant, and in some cases extreme, imbalances. Specifically, this form of imbalance is referred to as a between-class imbalance; between-class imbalances in the order of 100:1, 1000:1, and 10000:1 are not uncommon, where in each case one class severely outrepresents another. To highlight the implications of the imbalanced learning problem in the real world, consider the real-life problem of classifying a visitor as a buyer or a non-buyer on an online retail portal, which typically has a ratio of 1:1000 or more. In practice, we find that classifiers tend to provide a severely imbalanced degree of accuracy [1], with the majority class having close to 100 percent accuracy and the minority class having accuracies of 0-5 percent. We require a classifier that provides high accuracy for the minority class without severely jeopardizing the accuracy of the majority class.

Intrinsic imbalance is a direct result of the nature of the data space. However, imbalanced data are not solely restricted to the intrinsic variety. Variable factors such as time and storage also give rise to datasets that are imbalanced. Imbalances of this type are considered extrinsic, i.e., the imbalance is not directly related to the nature of the data space. Extrinsic imbalances are equally as interesting as their intrinsic counterparts, since it may very well occur that the data space from which an extrinsically imbalanced dataset is obtained is not imbalanced at all. For instance, suppose a dataset is procured from a continuous stream of balanced data over a specific interval of time, and during this interval the transmission has sporadic interruptions where data are not transmitted. The acquired dataset can then be imbalanced, in which case it is an extrinsic imbalanced dataset attained from a balanced data space. In addition to intrinsic and extrinsic imbalance [2] [3], it is important to understand the difference between relative imbalance and imbalance due to rare instances (or “absolute rarity”).

Data complexity is a broad term that comprises issues such as class overlapping, lack of representative data, small disjuncts, and others. As a simple example, consider the distributions depicted in Figure 1, where the stars and circles represent the minority and majority classes, respectively. By inspection, we see that both the distributions in Figures 1a and 1b exhibit relative imbalances. However, notice how Figure 1a has no overlapping examples between its classes and has only one concept pertaining to each class, whereas Figure 1b has both multiple concepts and severe overlapping. Also of interest is subconcept C in the distribution of Figure 1b. This concept might go unlearned by some inducers due to its lack of representative data; this issue embodies imbalance due to rare instances, which we proceed to explore.

Imbalance due to rare instances is representative of domains where minority class examples are very limited, i.e., where the target concept is rare. In such a situation, the lack of representative data makes learning difficult regardless of the between-class imbalance. Furthermore, the minority concept may additionally contain a subconcept with limited instances, amounting to diverging degrees of classification difficulty. This, in fact, is the result of another form of imbalance called within-class imbalance [4] [5] [6], which concerns itself with the distribution of representative data for sub-concepts within a class. These ideas are again highlighted in our simplified example in Figure 1. In Figure 1b, cluster B represents the dominant minority class concept and cluster C represents a subconcept of the minority class. Cluster D represents two sub-concepts of the majority class and cluster A (anything not enclosed) represents the dominant majority class concept. For both classes, the number of examples in the dominant clusters significantly outnumbers the examples in the respective subconcept clusters, so this data space exhibits both within-class and between-class imbalances. Moreover, if we completely remove the examples in cluster B, the data space would then have a homogeneous minority class concept that is easily identified (cluster C), but that can go unlearned due to its severe underrepresentation.

The existence of within-class imbalance is closely intertwined with the problem of small disjuncts, which has been shown to greatly degrade classification performance [4] [5] [6]. Briefly, the problem of small disjuncts can be understood as follows: a classifier will attempt to learn a concept by creating multiple disjunct rules that describe the main concept [7]. In the case of homogeneous concepts, the classifier will generally create large disjuncts, i.e., rules that cover a large portion (cluster) of examples pertaining to the main concept. However, in the case of heterogeneous concepts, small disjuncts, i.e., rules that cover a small cluster of examples pertaining to the main concept, arise as a direct result of under-represented subconcepts [7]. Moreover, since classifiers attempt to learn both majority and minority concepts, the problem of small disjuncts is not restricted to the minority concept. On the contrary, small disjuncts of the majority class can arise from noisy misclassified minority class examples or under-represented sub-concepts. However, because of the vast representation of majority class data, this occurrence is infrequent. A more common scenario is that noise may influence disjuncts in the minority class. In this case, the validity of the clusters corresponding to the small disjuncts becomes an important issue, i.e., whether these examples represent an actual subconcept or are merely attributed to noise. For example, in Figure 1b, suppose a classifier generates disjuncts for each of the two noisy minority samples in cluster A; then these would be illegitimate disjuncts attributed to noise, in contrast to cluster C, which is a legitimate cluster formed from a severely under-represented subconcept.

The last issue to consider is the combination of imbalanced data and the small sample size problem. In many of today’s data analysis and knowledge discovery applications, it is often unavoidable to have data with high dimensionality and small sample size; some specific examples include face recognition and gene expression data analysis, among others. Traditionally, the small sample size problem has been studied extensively in the pattern recognition community, and dimensionality reduction methods have been widely adopted to handle it, e.g., principal component analysis (PCA) and various extension methods [8]. However, when the representative datasets’ concepts exhibit imbalances of the forms described earlier, the combination of imbalanced data and small sample size presents a new challenge to the community. In this situation, two critical issues arise simultaneously. First, since the sample size is small, all of the issues related to absolute rarity and within-class imbalance are applicable. Second, and more importantly, learning algorithms often fail to generalize inductive rules over the sample space when presented with this form of imbalance. In this case, the combination of small sample size and high dimensionality hinders learning because of the difficulty involved in forming conjunctions over a large number of features with limited samples. If the sample space is sufficiently large, a set of general (albeit complex) inductive rules can be defined for the data space. However, when samples are limited, the rules formed can become too specific, leading to overfitting.

Furthermore, this also suggests that the conventional evaluation practice of using singular assessment criteria, such as the overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning. Therefore, more informative assessment metrics, such as receiver operating characteristic (ROC) curves, precision-recall curves, and cost curves, are necessary for conclusive evaluations of performance in the presence of imbalanced data [9].

Solutions to the problem of Class Imbalance Learning

When standard learning algorithms are applied to imbalanced data, the induction rules that describe the minority concepts are often fewer and weaker than those of the majority concepts, since the minority class is often both outnumbered and underrepresented. To provide a concrete understanding of the direct effects of the imbalanced learning problem on standard learning algorithms, let’s consider the popular decision tree learning algorithm, where imbalanced data exposes inadequacies in the splitting criterion at each node.

Decision trees use a recursive, top-down greedy search algorithm that uses a feature selection scheme (e.g., information gain) to select the best feature as the split criterion at each node of the tree; a successor node is then created for each of the possible values of the split feature. As a result, the training set is successively partitioned into smaller subsets that are ultimately used to form disjoint rules pertaining to class concepts. These rules are finally combined so that the final hypothesis minimizes the total error rate across each class. The problem with this procedure in the presence of imbalanced data is two-fold. First, successive partitioning of the data space results in fewer and fewer observations of minority class examples, resulting in fewer leaves describing minority concepts and successively weaker confidence estimates. Second, concepts that have dependencies on different feature space conjunctions can go unlearned due to the sparseness introduced through partitioning. Here, the first issue correlates with the problems of relative and absolute imbalance, while the second issue best correlates with between-class imbalance and the problem of high dimensionality. In both cases, the effects of imbalanced data on decision tree classification performance are detrimental. In the following sections, we review the solutions proposed to overcome the effects of imbalanced data. A battery of methods to address the class imbalance condition is available in the literature [10] [11]. Typically, the methods are categorized into sampling methods, cost-sensitive methods, kernel methods, and active learning methods.

Sampling methods The mechanics of random oversampling follow naturally from its description: a set of randomly selected minority examples is replicated and appended to the original dataset. In this way, the number of total examples is increased and the class distribution balance is adjusted accordingly, which provides a mechanism for varying the degree of class distribution balance to any desired level. While oversampling appends data to the original dataset, random undersampling removes data from it. At first glance, the oversampling and undersampling methods appear to be functionally equivalent, since they both alter the size of the original dataset and can actually provide the same proportion of balance. However, this commonality is only superficial; each method introduces its own set of problematic consequences that can potentially hinder learning [12] [13]. In the case of undersampling, the problem is relatively obvious: removing examples from the majority class may cause the classifier to miss important concepts pertaining to the majority class. In regards to oversampling, the problem is a little more opaque: since oversampling simply appends replicated data to the original dataset, multiple instances of certain examples become “tied,” leading to overfitting [12]. In particular, overfitting in oversampling occurs when classifiers produce multiple clauses in a rule for multiple copies of the same example, which causes the rule to become too specific; although the training accuracy will be high in this scenario, the classification performance on unseen testing data is generally far worse [14].
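The following Python sketch illustrates the mechanics of plain random oversampling and undersampling; it assumes only NumPy, and the toy dataset and the target 50:50 balance are illustrative choices rather than recommendations.

```python
# A minimal sketch of random oversampling and undersampling using NumPy only;
# the target of an exactly balanced distribution is illustrative.
import numpy as np

def random_oversample(X, y, minority_label, rng=np.random.default_rng(0)):
    """Replicate randomly chosen minority examples until both classes have equal size."""
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label, rng=np.random.default_rng(0)):
    """Drop randomly chosen majority examples until both classes have equal size."""
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([kept_maj, min_idx])
    return X[keep], y[keep]

# Example: a 95:5 imbalanced toy dataset becomes 50:50 after either method.
X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)
Xo, yo = random_oversample(X, y, minority_label=1)
Xu, yu = random_undersample(X, y, minority_label=1)
print(np.bincount(yo), np.bincount(yu))
```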

Informed undersampling based on EasyEnsemble and BalanceCascade [15] was proposed to overcome the information loss introduced by the traditional random undersampling method. Another example of informed undersampling uses the K-nearest neighbor (KNN) classifier to achieve undersampling. Based on the characteristics of the given data distribution, four KNN undersampling methods were proposed in [16], namely, NearMiss-1, NearMiss-2, NearMiss-3, and the “most distant” method. Another method, one-sided selection (OSS) [17], selects a representative subset of the majority class and combines it with the set of all minority examples to form a preliminary set, which is further refined by using data cleaning techniques.
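The ensemble flavor of informed undersampling can be sketched as follows: several balanced subsets are drawn by undersampling the majority class, one classifier is trained per subset, and their predictions are averaged. This is a minimal sketch in the spirit of EasyEnsemble, with the base learner and number of subsets chosen for illustration; it is not the exact algorithm of [15].

```python
# A minimal EasyEnsemble-style sketch: train one classifier per balanced undersample
# of the majority class and average their predicted probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)
min_idx, maj_idx = np.where(y == 1)[0], np.where(y == 0)[0]

members = []
for _ in range(10):                                   # 10 balanced subsets
    sub_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([sub_maj, min_idx])
    members.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

# Average the minority-class probabilities of all ensemble members.
avg_proba = np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
print("ensemble AUC:", roc_auc_score(y, avg_proba))
```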

Synthetic sampling is a powerful method that has shown a great deal of success in various applications [18]. The SMOTE algorithm creates artificial data based on the feature-space similarities between existing minority examples. To create a synthetic sample around a minority point in vector space, SMOTE randomly selects one of its K nearest minority neighbors, multiplies the corresponding feature vector difference by a random number in [0, 1], and adds this vector to the minority point. Though it has many promising benefits, the SMOTE algorithm also has its drawbacks, including over-generalization and variance [19], which are largely attributed to the way in which it creates synthetic samples. Specifically, SMOTE generates the same number of synthetic data samples for each original minority example and does so without consideration of neighboring examples, which increases the occurrence of overlap between classes. To overcome these issues, only selected subsets of the minority class are subjected to synthetic sample generation. Borderline-SMOTE [20] uses only the minority samples near the decision boundary to generate new synthetic samples. MWMOTE [21] identifies the hard-to-learn, informative minority class samples and assigns them weights according to their Euclidean distance from the nearest majority class samples; it then generates the synthetic samples from the weighted informative minority class samples using a clustering approach. SCUT [22] oversamples minority class examples through the generation of synthetic examples and employs cluster analysis to undersample the majority classes; in addition, it handles both within-class and between-class imbalance. To this end, various adaptive sampling methods have been proposed to overcome these limitations; some representative work includes Adaptive Synthetic Sampling (ADASYN) [23], ProWSyn [24] and R-SMOTE [25].
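The interpolation step at the heart of SMOTE can be sketched in a few lines of Python. The sketch below assumes only NumPy; the neighborhood size k and the number of synthetic points are illustrative, and this is not the reference implementation of [18].

```python
# A minimal sketch of the SMOTE interpolation step: move a random fraction of the way
# from a minority point towards one of its k nearest minority neighbors.
import numpy as np

def smote_samples(X_min, n_synthetic, k=5, rng=np.random.default_rng(0)):
    """Generate synthetic minority points by interpolating towards k-nearest minority neighbors."""
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # k nearest minority neighbors of x (excluding x itself), by Euclidean distance
        d = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbors)
        gap = rng.random()                          # random number in [0, 1]
        synthetic.append(x + gap * (X_min[j] - x))  # interpolated synthetic point
    return np.array(synthetic)

# Example: add 200 synthetic minority points to a 50-point minority class.
X_min = np.random.randn(50, 2) + 3.0
X_new = smote_samples(X_min, n_synthetic=200)
print(X_new.shape)  # (200, 2)
```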

Data cleaning techniques, such as Tomek links [26], have been effectively applied to remove the overlap that is introduced by sampling methods. If two instances form a Tomek link, then either one of them is noise or both are near a border. Therefore, one can use Tomek links to “clean up” unwanted overlap between classes after synthetic sampling, where all Tomek links are removed until all minimally distanced nearest-neighbor pairs are of the same class. By removing overlapping examples, one can establish well-defined class clusters in the training set, which can in turn lead to well-defined classification rules and improved classification performance. Some representative work in this area includes the OSS method [17], the integration of the condensed nearest neighbor (CNN) rule with Tomek links (CNN+Tomek links) [27], the neighborhood cleaning rule (NCL) [28] based on the edited nearest neighbor (ENN) rule, which removes examples that differ from two of their three nearest neighbors, and the integrations of SMOTE with ENN (SMOTE+ENN) and SMOTE with Tomek links (SMOTE+Tomek) [28].
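A minimal sketch of Tomek-link based cleaning is shown below, assuming NumPy: a pair of points forms a Tomek link if they are each other's nearest neighbors and carry different labels. The variant shown removes only the majority-class member of each link, which is one common cleaning choice rather than the only one.

```python
# A minimal sketch of Tomek-link detection and one common cleaning variant.
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek links (mutual nearest neighbors, different labels)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                 # nearest neighbor of every point
    links = []
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links

def remove_majority_side(X, y, majority_label):
    """Drop only the majority-class member of every Tomek link."""
    drop = {i if y[i] == majority_label else j for i, j in tomek_links(X, y)}
    keep = np.array([k for k in range(len(y)) if k not in drop])
    return X[keep], y[keep]

# Example on an overlapping two-class toy dataset.
X = np.vstack([np.random.randn(200, 2), np.random.randn(20, 2) + 1.0])
y = np.array([0] * 200 + [1] * 20)
Xc, yc = remove_majority_side(X, y, majority_label=0)
print(len(y), "->", len(yc))
```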

Cluster-based sampling algorithms are particularly interesting because they provide an added element of flexibility that is not available in most simple and synthetic sampling algorithms, and accordingly can be tailored to target very specific problems. The cluster-based oversampling (CBO) algorithm [5] was proposed to deal effectively with the within-class imbalance problem in tandem with the between-class imbalance problem. The CBO algorithm makes use of the K-means clustering technique. Empirical results of CBO are very suggestive of the nature of the imbalanced learning problem, namely, that targeting within-class imbalance in tandem with between-class imbalance is an effective strategy for imbalanced datasets.

The integration of sampling strategies with ensemble learning techniques has also been studied in the community. For instance, the SMOTEBoost [29] algorithm is based on the idea of integrating SMOTE with AdaBoost.M2. Specifically, SMOTEBoost introduces synthetic sampling at each boosting iteration. In this way, each successive classifier in the ensemble focuses more on the minority class. Since each classifier in the ensemble is built on a different sampling of the data, the final voted classifier is expected to have a broadened and well-defined decision region for the minority class. Another integrated approach, the DataBoost-IM [30] method, combines the data generation techniques introduced in [31] with AdaBoost.M1 to achieve high predictive accuracy for the minority class without sacrificing accuracy on the majority class. Briefly, DataBoost-IM generates synthetic samples according to the ratio of difficult-to-learn samples between classes. RAMOBoost [32] adaptively ranks minority class instances at each learning iteration according to a sampling probability distribution that is based on the underlying data distribution, and can adaptively shift the decision boundary toward difficult-to-learn minority and majority class instances by using a hypothesis assessment procedure.

RUSBoost [33] is a modification of AdaBoost.M1 for solving between-class imbalance problems by undersampling the majority class. It is documented [33] to perform better than SMOTEBoost [29], which addresses class imbalance by oversampling the minority class. The RUSBoost algorithm performs random undersampling of the majority class at every AdaBoost iteration to match the population size of the minority class, as prescribed by the data sample distribution computed from the misclassification error and exponential loss estimates.

Cost-Free Learning (CFL) [34] is a type of learning that does not require the cost terms associated with misclassifications and/or rejects as inputs. The goal of this type of learning is to obtain optimal classification results without using any cost information. Based on information theory, CFL seeks to maximize the normalized mutual information between the targets and the decision outputs of classifiers, which could be binary or multi-class classifications with or without abstaining. Another advantage of the method is its ability to derive optimal rejection thresholds for abstaining classifications and the “equivalent” costs in binary classifications, which can be used as a reference for cost adjustments in cost-sensitive learning.

An adaptive sampling method with optimal cost [35] for class imbalance learning was proposed to adaptively oversample the minority (positive) examples and undersample the majority (negative) examples, forming different sub-classifiers from different subsets of training data with the best cost ratio adaptively chosen, and combining these sub-classifiers according to their accuracy to create a strong classifier. The sample weights are computed based on the prediction probability of every sample, given by a pair of induced SVM classifiers built on two equal-sized partitions of the training instances.

Weighted Extreme Learning Machines (ELM) [36] [37] were proposed as a generalized cost-sensitive learning method to deal with imbalanced data distributions, where weights are assigned to every training instance based on the user’s needs. Although per-sample weights are possible, the authors proposed to use the class proportion as the common weight of every sample from a class. They also proposed an alternative weighting scheme that uses the golden ratio[9] in computing the common weights for the majority classes. An adaptive semi-unsupervised weighted oversampling (A-SUWO) method [38] was proposed for imbalanced datasets, which clusters the minority instances using a semi-unsupervised hierarchical clustering approach and adaptively determines the size to oversample each sub-cluster using its classification complexity and cross-validation. The minority instances are weighted based on their Euclidean distance to the majority class, based on which they are oversampled.

An inverse random undersampling [39] method was proposed for class imbalance learning, where several distinct training sets are constructed by severely undersampling the majority class to sizes smaller than the minority class, in order to bias the learned decision boundaries towards the minority class.

Cost-sensitive methods While sampling methods attempt to balance distributions by considering the representative proportions of class examples in the distribution, cost-sensitive learning methods consider the costs associated with misclassifying examples [40]. Instead of creating balanced data distributions through different sampling strategies, cost-sensitive learning targets the imbalanced learning problem by using different cost matrices that describe the costs for misclassifying any particular data example. Recent research indicates that there is a strong connection between cost-sensitive learning and learning from imbalanced data; therefore, the theoretical foundations and algorithms of cost-sensitive methods can be naturally applied to imbalanced learning problems [41] [15].

There are many different ways of implementing cost-sensitive learning, but, in general, the majority of techniques fall under three categories. The first class of techniques applies misclassification costs to the dataset as a form of data space weighting; these techniques are essentially cost-sensitive bootstrap sampling approaches where misclassification costs are used to select the best training distribution for induction. The second class applies cost-minimizing techniques to the combination schemes of ensemble methods; this class consists of various meta-techniques where standard learning algorithms are integrated with ensemble methods to develop cost-sensitive classifiers. Both of these classes have rich theoretical foundations that justify their approaches, with cost-sensitive data space weighting methods building on the translation theorem [42], and cost-sensitive meta-techniques building on the MetaCost framework [43]. In fact, many of the existing research works often integrate the MetaCost framework with data space weighting and adaptive boosting to achieve stronger classification results; to this end, we consider both of these classes of algorithms as one in the following discussion. The last class of techniques incorporates cost-sensitive functions or features directly into classification paradigms to essentially “fit” the cost-sensitive framework into these classifiers. Because many of these techniques are specific to a particular paradigm, there is no unifying framework for this class of cost-sensitive learning, but in many cases, solutions that work for one paradigm can often be abstracted to work for others.
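The first category, data space weighting, can be sketched by converting an assumed misclassification cost matrix into per-example weights for a standard learner; the cost values below are illustrative, and scikit-learn's sample_weight mechanism is used as the weighting hook.

```python
# A minimal sketch of cost-sensitive data-space weighting: costs from an assumed
# cost matrix become per-example weights for a standard learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)

# cost[i][j] = cost of predicting class j when the true class is i (illustrative values)
cost = np.array([[0.0, 1.0],    # false positive costs 1
                 [10.0, 0.0]])  # false negative (missed minority) costs 10

# Weight each example by the cost of misclassifying it.
sample_weight = np.where(y == 1, cost[1, 0], cost[0, 1])

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
print(confusion_matrix(y, clf.predict(X)))
```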

Motivated by the pioneering work on the AdaBoost algorithms [44], several cost-sensitive boosting methods for imbalanced learning have been proposed. Three cost-sensitive boosting methods, AdaC1, AdaC2, and AdaC3, were proposed in [45], which introduce cost items into the weight updating strategy of AdaBoost. In essence, these algorithms increase the probability of sampling a costly example at each iteration, giving the classifier more instances of costly examples for a more targeted approach to induction. In general, it was observed that the inclusion of cost factors into the weighting scheme of AdaBoost imposes a bias towards the minority concepts and also increases the use of more relevant data samples in each hypothesis, providing for a more robust form of classification. Another cost-sensitive boosting algorithm that follows a similar methodology is AdaCost [46]. AdaCost, like AdaC1, introduces cost sensitivity inside the exponent of the weight-update formula of AdaBoost. However, instead of applying the cost items directly, AdaCost uses a cost-adjustment function that aggressively increases the weights of costly misclassifications and conservatively decreases the weights of high-cost examples that are correctly classified.
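To show schematically how cost items can enter a boosting weight update, the sketch below multiplies a per-example cost into the AdaBoost distribution update, in the spirit of AdaC2. The exact formulas differ among AdaC1, AdaC2, AdaC3, and AdaCost, so this is an illustration of the idea rather than a faithful implementation of [45] or [46].

```python
# A schematic sketch of a cost-sensitive boosting round: per-example costs enter both
# the classifier weight (alpha) and the example weight update. Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
y = np.where(y01 == 1, 1, -1)                  # boosting convention: labels in {-1, +1}
cost = np.where(y == 1, 5.0, 1.0)              # assumed higher cost for minority errors

D = np.full(len(y), 1.0 / len(y))              # example weight distribution
ensemble = []
for _ in range(20):                            # 20 boosting rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = stump.predict(X)
    correct = pred == y
    num = np.sum(cost[correct] * D[correct])
    den = np.sum(cost[~correct] * D[~correct]) + 1e-12
    alpha = 0.5 * np.log(num / den)            # cost-weighted classifier weight
    D = cost * D * np.exp(-alpha * y * pred)   # costs enter the weight update directly
    D /= D.sum()
    ensemble.append((alpha, stump))

score = sum(a * m.predict(X) for a, m in ensemble)
print("training error:", np.mean(np.sign(score) != y))
```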

Though these cost-sensitive algorithms can significantly improve classification performance, they take for granted the availability of a cost matrix and its associated cost items. In many situations, an explicit description of the misclassification costs is unknown, i.e., only an informal assertion is known, such as that misclassifications on the positive class are more expensive than those on the negative class [47].

With respect to cost-sensitive decision trees, cost-sensitive fitting can take three forms: first, cost-sensitive adjustments can be applied to the decision threshold; second, cost-sensitive considerations can be given to the split criteria at each node; and lastly, cost-sensitive pruning schemes can be applied to the tree. A decision tree threshold moving scheme for imbalanced data with unknown misclassification costs was studied in [48]. When considering cost sensitivity in the split criterion, the task at hand is to fit an impurity function that is insensitive to unequal costs. For instance, traditionally, accuracy is used as the impurity function for decision trees, which chooses the split with minimal error at each node. However, this metric is sensitive to changes in sample distributions and thus inherently sensitive to unequal costs. In [49], three specific impurity functions, Gini, Entropy, and DKM, were shown to have improved cost insensitivity compared with the accuracy/error rate baseline.
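The first of these forms, threshold moving, can be sketched as follows: a decision tree is trained as usual, and the decision threshold on the predicted minority-class probability is lowered according to an assumed cost ratio. The 10:1 costs and the resulting threshold are illustrative; this is not the specific scheme of [48].

```python
# A minimal sketch of cost-sensitive threshold moving on a decision tree: the decision
# threshold on the minority-class probability is lowered under an assumed 10:1 cost ratio.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

proba = clf.predict_proba(X)[:, 1]          # estimated P(minority | x)
c_fn, c_fp = 10.0, 1.0                      # assumed misclassification costs
threshold = c_fp / (c_fp + c_fn)            # cost-optimal threshold under these costs
y_default = (proba >= 0.5).astype(int)
y_moved = (proba >= threshold).astype(int)

print("minority recall at 0.5 :", recall_score(y, y_default))
print("minority recall at %.2f:" % threshold, recall_score(y, y_moved))
```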

Cost sensitivity can be introduced to neural networks in four ways [50]: first, cost-sensitive modifications can be applied to the probabilistic estimates; second, the neural network outputs can be made cost-sensitive; third, cost-sensitive modifications can be applied to the learning rate η; and fourth, the error minimization function can be adapted to account for expected costs. A ROC-based method for determining the cost is proposed in [51], which allows one to select the most efficient cost factor for a given dataset. A neural network based cost-sensitive method proposed in [36] uses Weighted Extreme Learning Machines to solve class imbalance problems; the authors also showed that, by assigning different weights to each training sample, WELM can be generalized to a cost-sensitive learning method.

Kernel methods The SMOTE with Different Costs (SDC) method [52] and the ensembles of over/undersampled SVMs [53][54][55][56] combine kernel methods and sampling to address the imbalance problem. The SDC algorithm uses different error costs [52] for different classes to bias the SVM, in order to shift the decision boundary away from positive instances and make positive instances more densely distributed, in an attempt to guarantee a more well-defined boundary. Meanwhile, the methods proposed in [54][55] develop ensemble systems by modifying the data distributions without modifying the underlying SVM classifier.
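The idea of biasing an SVM with different per-class error costs can be approximated with the class_weight parameter of a standard SVM implementation, as in the scikit-learn based sketch below; the cost values are illustrative, and this is a stand-in for, not an implementation of, the SDC method of [52].

```python
# A minimal sketch of an SVM biased by different per-class error costs.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

plain = SVC(kernel="rbf", gamma="scale").fit(X, y)
costed = SVC(kernel="rbf", gamma="scale",
             class_weight={0: 1, 1: 20}).fit(X, y)   # illustrative 20x cost on minority errors

print(classification_report(y, plain.predict(X), digits=3))
print(classification_report(y, costed.predict(X), digits=3))
```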

The Granular Support Vector Machines—Repetitive Undersampling algorithm (GSVM-RU) was proposed in [57] to integrate SVM learning with undersampling methods. The major characteristics of GSVMs are two-fold. First, GSVMs can effectively analyze the inherent data distribution by observing the trade-offs between the local significance of a subset of data and its global correlation. Second, GSVMs improve the computational efficiency of SVMs through the use of parallel computation. In the context of imbalanced learning, the GSVM-RU method takes advantage of the GSVM by using an iterative learning procedure that uses the SVM itself for undersampling.

A kernel-boundary-alignment algorithm is proposed in [58], which considers training-data imbalance as prior information to augment SVMs and improve class-prediction accuracy. Using a simple example, the authors first show that SVMs can suffer from high incidences of false negatives when the training instances of the target class are heavily outnumbered by the training instances of a non-target class. The remedy they propose is to adjust the class boundary by modifying the kernel matrix according to the imbalanced data distribution.

One example of kernel modification is the kernel classifier construction algorithm proposed in [59], based on orthogonal forward selection (OFS) and the regularized orthogonal weighted least squares (ROWLS) estimator. This algorithm optimizes generalization in the kernel-based learning model by introducing two major components that deal with imbalanced data distributions for two-class datasets. The first component integrates the concepts of leave-one-out (LOO) cross-validation and the area under the curve (AUC) evaluation metric to develop an LOO-AUC objective function as a selection mechanism for the optimal kernel model. The second component takes advantage of the cost sensitivity of the parameter estimation cost function in the ROWLS algorithm to assign greater weights to erroneous data examples in the minority class than to those in the majority class.


SSL methods A semi-supervised classification method for the prognosis of ACLF was proposed in 2015 [60], where the authors constructed an imbalanced prediction model based on the small sphere and large margin (SSLM) approach [61], which classifies two classes of samples (improved patients, deceased patients) by maximizing their margin. SSLM was shown to perform better than One-Class SVM and Support Vector Data Description (SVDD) methods. The authors also experimented with a semi-supervised Twin SVM [62] by adding unlabeled patients to the dataset.

Transductive graph-based semi-supervised learning methods usually build an undirected graph utilizing both labeled and unlabeled samples as vertices. These methods propagate the label information of labeled samples to their neighbors through edges in order to obtain the predicted labels of unlabeled samples. Most popular semi-supervised learning approaches are sensitive to the initial label distribution found in imbalanced labeled datasets, and the class boundary will be severely skewed by the majority classes in an imbalanced classification. In [63], the authors propose a simple and effective approach to alleviate the unfavorable influence of the imbalance problem by iteratively selecting a few unlabeled samples and adding them to the minority classes to form a balanced labeled dataset for the subsequent learning methods. Experiments on UCI datasets [64] and the MNIST handwritten digits dataset [65] showed that the proposed approach outperforms other existing state-of-the-art methods.


An SSL method in [66] uses a transductive learning approach to build upon a graph-based phase field model [67] that handles imbalanced class distributions. This method is able to encourage or penalize the memberships of data to different classes according to an explicit a priori model that avoids biased classifications. Experiments conducted on real-world benchmarks support the better performance of the model compared to several state-of-the-art semi-supervised learning algorithms.

Predicting splice sites in a genome using a semi-supervised learning approach [68] is a challenging problem due to the highly imbalanced distribution of the data, i.e., the small number of splice sites as compared to the number of non-splice sites. To address this challenge, the authors propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. Experiments on five highly imbalanced splice site datasets, with positive-to-negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, it was found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines.

A semi-supervised learning approach that learns from both labeled and unlabeled instances, in particular self-training with decision tree learners as base learners, is proposed in [69]. The authors show that the standard decision tree algorithm cannot be an effective base learner in a self-training algorithm for semi-supervised learning. The main reason is that the basic decision tree learner does not produce reliable probability estimates for its predictions, so those estimates cannot serve as a proper selection criterion in self-training. The authors consider the effect of several modifications to the basic decision tree learner that produce better probability estimates than using the distributions at the leaves of the tree. They show that these modifications do not produce better performance when using only the labeled data, but they do benefit more from the unlabeled data in self-training. The modifications considered are the Naive Bayes Tree, a combination of no-pruning and Laplace correction, grafting, and a distance-based measure. They then extend this improvement to algorithms for ensembles of decision trees and show that the ensemble learner gives an extra improvement over the adapted decision tree learners.
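A generic self-training loop of the kind discussed above can be sketched as follows: a base learner is fit on the labeled pool, its most confident predictions on the unlabeled pool are added as pseudo-labels, and the process repeats. The base learner, confidence threshold, and dataset are illustrative; this is not the modified decision-tree procedure of [69].

```python
# A minimal sketch of self-training with a probability-based selection criterion.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[np.random.default_rng(0).choice(len(y), size=100, replace=False)] = True
y_work = y.copy()                                    # labels are only read where labeled is True

for _ in range(5):                                   # a few self-training rounds
    clf = GaussianNB().fit(X[labeled], y_work[labeled])
    proba = clf.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= 0.95            # keep only high-confidence predictions
    idx = np.where(~labeled)[0][confident]
    if len(idx) == 0:
        break
    y_work[idx] = clf.predict(X[idx])                # pseudo-label and add to the labeled pool
    labeled[idx] = True

print("labeled pool grew to", labeled.sum(), "examples")
```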

In [70], the authors describe the stochastic semi-supervised learning approach that was used in their submission to all six tasks in the 2009-2010 Active Learning Challenge. The method was designed to tackle the binary classification problem under the condition that the number of labeled data points is extremely small and the two classes are highly imbalanced. It starts with only one positive seed given by the contest organizer. The authors randomly picked additional unlabeled data points and treated them as “negative” seeds, based on the fact that the positive label is rare across all datasets. A classifier was trained using these “labeled” data points and then used to predict the unlabeled dataset, and the final result was taken to be the average of “n” stochastic iterations. Supervised learning was used once a large number of labels had been purchased. Their approach was shown to work well on 5 out of 6 datasets, which ranked them 3rd in the contest.

A framework to address the imbalanced data problem using semi-supervised learning is proposed in [71]. Specifically, from a supervised problem, the authors create a semi-supervised problem and then use a semi-supervised learning method to identify the most relevant instances with which to establish a well-defined training set. They present extensive experimental results demonstrating that the proposed framework significantly outperforms all other sampling algorithms in 67% of the cases across three different classifiers and ranks second best in the remaining 33% of the cases.

A combined co-training and random subspace generation technique is proposed in [72] for sentiment classification problems. There are two main advantages of this dynamic strategy over the static strategy in generating random subspaces. First, the dynamic strategy makes the involved subspace classifiers quite different from each other, even when the training data becomes similar after some iterations. Second, considering that the most helpful features (e.g., sentiment words) for sentiment classification usually account for a small portion of the feature space, it is possible that one random subspace might contain few useful features.

When this happens in the static strategy, the corresponding subspace classifier will perform badly in selecting correct samples from the unlabeled data. This makes semi-supervised learning fail. In comparison, the dynamic strategy can avoid this phenomenon.

Challenges in CIL

The availability of a humongous supply of raw data in many of today’s real-world applications opens up several opportunities for learning from imbalanced data to play an important role across different application domains. However, there are several new challenges [9] at the same time. Here, we briefly present several aspects of future research directions in this critical research domain.

Understanding the Fundamental Problems

The majority of imbalanced learning research works focus on improving the performance of specific algorithms paired with specific datasets, with only a limited theoretical understanding of the principles of the problem space and the consequences of the various assumptions made. For instance, several imbalanced learning algorithms published in the literature claim to improve a performance metric by some margin over previous solutions, on a case-by-case basis for specific datasets. But there exist situations where learning from the original dataset may provide better performance. This leads us to an important question: to what extent do imbalanced learning methods help with learning? This question could be further refined as:

1. When a method outperformed other methods, what are the underlying effects that led to the better performance?
2. Does the solution provide clarity on the fundamental understanding of the problem at hand?
3. Can the solution scale to various other types of data?

We believe that these fundamental questions should be studied with greater interest both theoretically and empirically in order to thoroughly understand the essence of imbalanced learning problems and solutions. Furthermore, we should also find answers to the following specific questions that would allow us to gauge the solutions better:

  • What assumptions must hold for imbalanced learning algorithms to work better than learning from the original distributions?
  • To what degree should one artificially balance [73] [74] the original dataset by adjusting sample distributions?
  • How do imbalanced data distributions affect the computational complexity of learning algorithms?
  • What is the general error bound, given an imbalanced data distribution?
  • Is there a general theoretical methodology that can alleviate the impediment of learning from imbalanced datasets for specific algorithms and application domains?

Estabrooks et al. [73] suggested that a combination of different expressions of resampling methods may be an effective solution to the tuning problem. Weiss and Provost [74] have analyzed, for a fixed training set size, the relationship between the class distribution of training data (expressed as the percentage of minority class examples) and classifier performance in terms of accuracy and AUC. Based on a thorough analysis of 26 datasets, it was suggested that if accuracy is selected as the performance criterion, the best class distribution tends to be near the naturally occurring class distribution.

However, if the AUC is selected as the assessment metric, then the best class distribution tends to be near the balanced class distribution. Based on these observations, a “budget-sensitive” progressive sampling strategy was proposed to efficiently sample the minority and majority class examples such that the resulting training class distribution can provide the best performance.

Uniform Benchmark Platform

Class imbalance learning researchers typically use standard multi-class datasets in a one-versus-rest configuration to emulate binary class imbalance problems and report their results. Although there are currently many publicly available benchmarks for assessing the performance of classification algorithms, such as the UCI Repository [64] and the LIBSVM datasets[10], there are very few benchmarks that are solely dedicated to imbalanced learning problems. None of the existing classification data repositories identify or mention the imbalance ratio as a dataset characteristic. This limitation can severely affect the long-term development of research in class imbalance learning in the following ways:

1. Lack of a uniform benchmark for standardized performance assessments.
2. Lack of data sharing and data interoperability across different disciplinary domains.
3. Increased procurement costs.

Standardized Evaluation

Accuracy is a measure of trueness, given by (TP+TN)/N, computed from the confusion matrix. Consider a classifier that returns the label of the majority class for any input. When 1001 random test points (with a 1:1000 imbalance ratio) are tried, the estimated accuracy is (1000+0)/1001 ≈ 99.9%, which in turn gives the wrong impression that the classifier's performance is high.

Precision is a measure of exactness, given by TP/(TP+FP), and recall is a measure of completeness, given by TP/(TP+FN). It is apparent from the formulas that precision is sensitive to data distributions, while recall is not. An assertion based solely on recall is ambiguous, since recall provides no insight into how many examples are incorrectly labeled as positive (minority). Similarly, precision cannot assert how many positive examples are labeled incorrectly. The F1-score combines precision and recall to effectively evaluate classification performance in imbalanced learning scenarios.
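The following short numeric sketch evaluates these point metrics on the majority-class predictor from the 1:1000 example above; the counts come directly from that illustrative confusion matrix.

```python
# A minimal numeric sketch of the point metrics defined above, on the illustrative
# confusion matrix of a majority-class predictor over 1001 points (1:1000 imbalance):
# it predicts every point as the majority class, so TP = 0 for the minority class.
tp, fn = 0, 1          # the single minority point is missed
fp, tn = 0, 1000       # all majority points are predicted correctly
n = tp + fn + fp + tn

accuracy = (tp + tn) / n                                  # ~0.999, looks excellent
precision = tp / (tp + fp) if (tp + fp) else 0.0          # 0.0
recall = tp / (tp + fn)                                   # 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```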

The traditional point-based evaluation metrics such as accuracy, precision, and recall are not sufficient on their own when handling class imbalance learning problems, as they are sensitive to the data distribution. Without an accompanying curve-based analysis, it is very difficult to provide any concrete relative evaluation of different algorithms over varying data distributions.

Therefore, it is necessary for the community to establish, as a standard, the practice of using curve-based evaluation techniques such as the ROC curve, the Precision-Recall curve, and the cost curve. This is not only because each technique provides its own set of answers to different fundamental questions, but also because an analysis in the evaluation space of one technique can be correlated to the evaluation space of another. This standard would lead to increased transitivity and a broader understanding of the functional abilities of existing and future works.

SSL from Imbalanced Data

The key idea of semi-supervised learning is to exploit the unlabeled examples by using the labeled examples to modify, refine, or reprioritize the hypothesis obtained from the labeled data alone [75]. Some pertinent questions include:

1. How can we identify whether an unlabeled data example came from a balanced or imbalanced underlying distribution?
2. Given an imbalanced training data with labels, what are the effective and efficient methods for recovering the unlabeled data examples?
3. What kind of biases may be introduced in the recovery process given imbalanced labeled data?

Summary

We have motivated the need for class imbalance learning, a special use case of machine learning, by highlighting its wide applicability in real-world classification tasks. We did so by introducing the fundamentals of class imbalance learning and the battery of solutions available in the literature to combat imbalance.

We have also presented some real-world problem scenarios with imbalance characteristics. We concluded by outlining the open opportunities and challenges in the field of class imbalance learning research.

Acknowledgements

This research work was partly supported by a funding grant from IIT Madras under project CSE/1415/831/RFTP/BRAV.


References

[1] M. Kubat, R. C. Holte, and S. Matwin, “Machine Learning for the Detection of Oil Spills in Satellite Radar Images.,” Machine Learning, vol. 30, no. 2-3, pp. 195–215, 1998.

[2] G. M. Weiss, “Mining with Rare Cases.,” in The Data Mining and Knowledge Discovery Handbook (O. Maimon and L. Rokach, eds.), pp. 765–776, Springer, 2005.

[3] G. M. Weiss, “Mining with rarity: a unifying framework.,” SIGKDD Explorations, vol. 6, no. 1, pp. 7–19, 2004.

[4] R. Prati, G. Batista, and M. Monard, “Class imbalances versus class overlapping: an analysis of a learning system behavior,” MICAI 2004: Advances in Artificial Intelligence, pp. 312–321, 2004.

[5] T. Jo and N. Japkowicz, “Class imbalances versus small disjuncts.,” SIGKDD Explorations, vol. 6, no. 1, pp. 40–49, 2004.

[6] R. C. Prati, G. E. A. P. A. Batista, and M. C. Monard, “Learning with Class Skews and Small Disjuncts.,” in SBIA (A. L. C. Bazzan and S. Labidi, eds.), vol. 3171 of Lecture Notes in Computer Science, pp. 296–306, Springer, 2004.

[7] J. R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[8] W.-H. Yang, D.-Q. Dai, and H. Yan, “Feature Extraction and Uncorrelated Discriminant Analysis for High-Dimensional Data.,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 5, pp. 601–614, 2008.

[9] H. He and E. Garcia, “Learning from Imbalanced Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, pp. 1263–1284, Sept. 2009.

[10] H. He and Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press, 1st ed., 2013.

[11] P. Branco, L. Torgo, and R. P. Ribeiro, “A Survey of Predictive Modeling on Imbalanced Domains.,” ACM Comput. Surv., vol. 49, no. 2, pp. 31:1–31:50, 2016.

[12] D. Mease, A. Wyner, and A. Buja, “Boosted classification trees and class probability/quantile estimation,” The Journal of Machine Learning Research, vol. 8, pp. 409–439, 2007.

[13] C. Drummond and R. Holte, “C4.5, class imbalance, and cost sensitivity: why under-sampling beats oversampling,” Workshop on Learning from Imbalanced Datasets II, pp. 1–8, 2003.

[14] R. C. Holte, L. Acker, and B. W. Porter, “Concept Learning and the Problem of Small Disjuncts.,” in IJCAI (N. S. Sridharan, ed.), pp. 813–818, Morgan Kaufmann, 1989.

[15] X.-Y. Liu and Z.-H. Zhou, “The Influence of Class Imbalance on Cost-Sensitive Learning: An Empirical Study.,” in ICDM, pp. 970–974, IEEE Computer Society, 2006.

[16] J. Zhang and I. Mani, “KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction,” in Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Datasets, 2003.

[17] M. Kubat and S. Matwin, “Addressing the Curse of Imbalanced Training Sets: One-Sided Selection,” in In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186, Morgan Kaufmann, 1997.

[18] N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[19] B. X. Wang and N. Japkowicz, “Imbalanced Data Set Learning with Synthetic Samples,” 2004.

[20] H. Han, W. Wang, and B. Mao, “Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning.,” in ICIC (1) (D.-S. Huang, X.-P. Zhang, and G.-B. Huang, eds.), vol. 3644 of Lecture Notes in Computer Science, pp. 878–887, Springer, 2005.

[21] S. Barua, M. M. Islam, X. Yao, and K. Murase, “MWMOTE-Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning.,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 405–425, 2014.

[22] A. Agrawal, H. L. Viktor, and E. Paquet, “SCUT: Multi-Class Imbalanced Data Classification using SMOTE and Cluster-based Undersampling.,” in KDIR (A. L. N. Fred, J. L. G. Dietz, D. Aveiro, K. Liu, and J. Filipe, eds.), pp. 226–234, SciTePress, 2015.

[23] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning.,” in IJCNN, pp. 1322–1328, IEEE, 2008.

[24] S. Barua, M. M. Islam, and K. Murase, “ProWSyn: Proximity Weighted Synthetic Oversampling Technique for Imbalanced Data Set Learning.,” in PAKDD (2) (J. Pei, V. S. Tseng, L. Cao, H. Motoda, and G. Xu, eds.), vol. 7819 of Lecture Notes in Computer Science, pp. 317–328, Springer, 2013.

[25] Y. Dong and X. Wang, “A New Over-Sampling Approach: Random-SMOTE for Learning from Imbalanced Data Sets.,” in KSEM (H. Xiong and W. B. Lee, eds.), vol. 7091 of Lecture Notes in Computer Science, pp. 343–352, Springer, 2011.

[26] I. Tomek, “Two Modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, no. 11, pp. 769–772, 1976.

[27] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data,”ACM SIGKDD Explorations Newsletter – Special issue on learning from imbalanced datasets, vol. 6, no. 1, pp. 20–29, 2004.

[28] J. Laurikkala, “Improving Identification of Difficult Small Classes by Balancing Class Distribution.,” in AIME (S. Quaglini, P. Barahona, and S. Andreassen, eds.), vol. 2101 of Lecture Notes in Computer Science, pp. 63–66, Springer, 2001.

[29] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, “SMOTEBoost: Improving Prediction of the Minority Class in Boosting.,” in PKDD (N. Lavrac, D. Gamberger, H. Blockeel, and L. Todorovski, eds.), vol. 2838 of Lecture Notes in Computer Science, pp. 107–119, Springer, 2003.

[30] H. Guo and H. L. Viktor, “Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach.,” SIGKDD Explorations, vol. 6, no. 1, pp. 30–39, 2004.

[31] H. Guo and H. L. Viktor, “Boosting with Data Generation: Improving the Classification of Hard to Learn Examples.,” in IEA/AIE (R. Orchard, C. Yang, and M. Ali, eds.), vol. 3029 of Lecture Notes in Computer Science, pp. 1082–1091, Springer, 2004.

[32] S. Chen, H. He, and E. A. Garcia, “RAMOBoost: Ranked Minority Oversampling in Boosting.,” IEEE Trans Neural Netw, vol. 21, pp. 1624–42, Oct. 2010.

[33] C. Seiffert, T. M. Khoshgoftaar, J. V. Hulse, and A. Napolitano, “RUSBoost: A Hybrid Approach to Alleviating Class Imbalance.,” IEEE Trans. Systems, Man, and Cybernetics, Part A, vol. 40, no. 1, pp. 185–197, 2010.

[34] X. Zhang and B.-G. Hu, “A New Strategy of Cost-Free Learning in the Class Imbalance Problem.,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 12, pp. 2872–2885, 2014.

[35] Y. Peng, “Adaptive Sampling with Optimal Cost for Class-Imbalance Learning.,” in AAAI (B. Bonet and S. Koenig, eds.), pp. 2921–2927, AAAI Press, 2015.

[36] W. Zong, G.-B. Huang, and Y. Chen, “Weighted extreme learning machine for imbalance learning.,” Neurocomputing, vol. 101, pp. 229–242, 2013.

[37] X. Gao, Z. Chen, S. Tang, Y. Zhang, and J. Li, “Adaptive weighted imbalance learning with application to abnormal activity recognition.,” Neurocomputing, vol. 173, pp. 1927–1935, 2016.

[38] I. Nekooeimehr and S. K. Lai-Yuen, “Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets.,” Expert Syst. Appl., vol. 46, pp. 405–416, 2016.

[39] M. A. Tahir, J. Kittler, and F. Yan, “Inverse random under sampling for class imbalance problem and its application to multi-label classification.,” Pattern Recognition, vol. 45, no. 10, pp. 3738–3750, 2012.

[40] C. Elkan, “The Foundations of Cost-Sensitive Learning,” in IJCAI, pp. 973–978, 2001.

[41] N. V. Chawla, N. Japkowicz, and A. Kotcz, “Editorial: special issue on learning from imbalanced data sets.,” SIGKDD Explorations, vol. 6, no. 1, pp. 1–6, 2004.

[42] B. Zadrozny, J. Langford, and N. Abe, “Cost-Sensitive Learning by Cost-Proportionate Example Weighting.,” in ICDM, pp. 435–, IEEE Computer Society, 2003.

[43] P. M. Domingos, “MetaCost: A General Method for Making Classifiers Cost-Sensitive.,” in KDD (U. M. Fayyad, S. Chaudhuri, and D. Madigan, eds.), pp. 155–164, ACM, 1999.

[44] Y. Freund and R. E. Schapire, “Experiments with a New Boosting Algorithm,” in International Conference on Machine Learning, pp. 148–156, 1996.

[45] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, “Cost-sensitive boosting for classification of imbalanced data.,” Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007.

[46] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, “AdaCost: Misclassification Cost-Sensitive Boosting.,” in ICML (I. Bratko and S. Dzeroski, eds.), pp. 97–105, Morgan Kaufmann, 1999.

[47] M. A. Maloof, “Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown,” Analysis, vol. 21, pp. 1263–1284, 2003.

[48] K. M. Ting, “An Instance-Weighting Method to Induce Cost-Sensitive Trees.,” IEEE Trans. Knowl. Data Eng., vol. 14, no. 3, pp. 659–665, 2002.

[49] C. Drummond and R. C. Holte, “Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria.,” in ICML (P. Langley, ed.), pp. 239–246, Morgan Kaufmann, 2000.

[50] M. Kukar and I. Kononenko, “Cost-Sensitive Learning with Neural Networks.,” in ECAI, pp. 445–449, 1998.

[51] B. Krawczyk and M. Wozniak, “Cost-Sensitive Neural Network with ROC-Based Moving Threshold for Imbalanced Classification.,” in IDEAL (K. Jackowski, R. Burduk, K. Walkowiak, M. Wozniak, and H. Yin, eds.), vol. 9375 of Lecture Notes in Computer Science, pp. 45–52, Springer, 2015.

[52] R. Akbani, S. Kwek, and N. Japkowicz, “Applying Support Vector Machines to Imbalanced Datasets.,” in ECML (J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, eds.), vol. 3201 of Lecture Notes in Computer Science, pp. 39–50, Springer, 2004.

[53] F. Vilariño, P. Spyridonos, J. Vitrià, and P. Radeva, “Experiments with SVM and Stratified Sampling with an Imbalanced Problem: Detection of Intestinal Contractions.,” in ICAPR (2) (S. Singh, M. Singh, C. Apté, and P. Perner, eds.), vol. 3687 of Lecture Notes in Computer Science, pp. 783–791, Springer, 2005.

[54] P. Kang and S. Cho, “EUS SVMs: Ensemble of Under-Sampled SVMs for Data Imbalance Problems.,” in ICONIP (1) (I. King, J. Wang, L. Chan, and D. L. Wang, eds.), vol. 4232 of Lecture Notes in Computer Science, pp. 837–846, Springer, 2006.

[55] Y. Liu, A. An, and X. Huang, “Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles.,” in PAKDD (W. K. Ng, M. Kitsuregawa, J. Li, and K. Chang, eds.), vol. 3918 of Lecture Notes in Computer Science, pp. 107–118, Springer, 2006.

[56] B. X. Wang and N. Japkowicz, “Boosting support vector machines for imbalanced data sets.,” Knowl. Inf. Syst., vol. 25, no. 1, pp. 1–20, 2010.

[57] Y. Tang and Y.-Q. Zhang, “Granular SVM with Repetitive Undersampling for Highly Imbalanced Protein Homology Prediction.,” in GrC, pp. 457–460, IEEE, 2006.

[58] G. Wu and E. Y. Chang, “Aligning Boundary in Kernel Space for Learning Imbalanced Dataset,” in Proceedings of the Fourth IEEE International Conference on Data Mining, ICDM ’04, (Washington, DC, USA), pp. 265–272, IEEE Computer Society, 2004.

[59] X. Hong, S. Chen, and C. J. Harris, “A Kernel-Based Two-Class Classifier for Imbalanced Data Sets.,” IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 28–41, 2007.

[60] Y. Xu, Y. Zhang, Z. Yang, X. Pan, and G. Li, “Imbalanced and semi-supervised classification for prognosis of ACLF.,” Journal of Intelligent and Fuzzy Systems, vol. 28, no. 2, pp. 737–745, 2015.

[61] M. Wu and J. Ye, “A Small Sphere and Large Margin Approach for Novelty Detection Using Training Data with Outliers.,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 2088–2092, 2009.

[62] Jayadeva, R. Khemchandani, and S. Chandra, “Twin Support Vector Machines for Pattern Classification.,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 5, pp. 905–910, 2007.

[63] F. Li, C. Yu, N. Yang, F. Xia, G. Li, and F. Kaveh-Yazdy, “Iterative Nearest Neighborhood Oversampling in Semisupervised Learning from Imbalanced Data,” The Scientific World Journal, vol. 2013, Article ID 875450, Dec. 2013.

[64] M. Lichman, “UCI Machine Learning Repository,” 2013.

[65] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.

[66] A. E. Ghoul and H. Sahbi, “Semi-supervised learning using a graph-based phase field model for imbalanced data set classification.,” in ICASSP, pp. 2942–2946, IEEE, 2014.

[67] M. Rochery, I. Jermyn, and J. Zerubia, “Phase Field Models and Higher-Order Active Contours.,” in ICCV, pp. 970–976, IEEE Computer Society, 2005.

[68] A. Stanescu and D. Caragea, “Ensemble-based semi-supervised learning approaches for imbalanced splice site datasets.,” in BIBM (H. J. Zheng, W. Dubitzky, X. Hu, J.-K. Hao, D. P. Berrar, K.-H. Cho, Y. Wang, and D. R. Gilbert, eds.), pp. 432–437, IEEE Computer Society, 2014.

[69] J. Tanha, M. van Someren, and H. Afsarmanesh, “Semi-supervised self-training for decision tree classifiers,” International Journal of Machine Learning and Cybernetics, 2015.

[70] J. Xie and T. Xiong, “Stochastic Semi-supervised Learning.,” in Active Learning and Experimental Design @ AISTATS (I. Guyon, G. C. Cawley, G. Dror, V. Lemaire, and A. R. Statnikov, eds.), vol. 16 of JMLR Proceedings, pp. 85–98, JMLR.org, 2011.

[71] B. A. Almogahed and I. A. Kakadiaris, “Empowering Imbalanced Data in Supervised Learning: A Semi-supervised Learning Approach.,” in ICANN (S. Wermter, C. Weber, W. Duch, T. Honkela, P. D. Koprinkova-Hristova, S. Magg, G. Palm, and A. E. P. Villa, eds.), vol. 8681 of Lecture Notes in Computer Science, pp. 523–530, Springer, 2014.

[72] S. Li, Z. Wang, G. Zhou, and S. Y. M. Lee, “Semi-Supervised Learning for Imbalanced Sentiment Classification.,” in IJCAI (T. Walsh, ed.), pp. 1826–1831, IJCAI/AAAI, 2011.

[73] A. Estabrooks, T. Jo, and N. Japkowicz, “A Multiple Resampling Method for Learning from Imbalanced Data Sets.,” Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.

[74] F. J. Provost and G. M. Weiss, “Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction,” CoRR, vol. abs/1106.4557, pp. 315–354, 2011.

[75] X. Zhu, “Semi-Supervised Learning Literature Survey,” Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison, 2005.