
Class Imbalance Learning



A data classification task assigns labels to data points using a model that is learned from a collection of pre-labeled data points. The Class Imbalance Learning (CIL) problem is concerned with the performance of classification algorithms in the presence of under-represented data and severe class distribution skews. Because imbalanced datasets have inherently complex characteristics, learning from them requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. CIL is important to study because real-world classification problems rarely follow balanced class distributions. In this article, we describe how machine learning has become an integral part of modern life and how some real-world problems are modeled as CIL problems. We also provide a detailed survey of the fundamentals of, and solutions to, class imbalance learning. We conclude the survey by presenting some of the challenges and opportunities in class imbalance learning.


Machine learning is a core sub-area of artificial intelligence that enables computers to learn without being explicitly programmed. When exposed to new data, such programs can learn, change, and develop by themselves. As a method of data analysis, machine learning automates analytical model building to find hidden insights without being told where to look. Its iterative nature allows models to adapt independently as they are exposed to new data: the algorithms learn from previous computations to produce reliable, repeatable decisions and results.

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.[1]

Machine learning has now become an avenue for absorbing and modeling what one knows and doesn’t know, so that more targeted information can help individuals learn. Platform companies such as Google, LinkedIn, and Amazon are building algorithms that represent us and our needs for data-driven decisions. Learning continuously from the data generated by individuals will allow machines to know enough about those individuals to make recommendations that adapt to their changing contexts. At the core of this opportunity is data, in very large quantities, and algorithms that convert data into a matrix of comparisons for decision making.

“Customers who bought this item also bought” is a familiar phrase from Amazon’s online retail store, which has been using data-driven approaches to recommend products. With LinkedIn’s Economic Graph[2], insights such as the changing nature of work, the skills that comprise specific jobs, work-related behaviors that pre-signal a change of job, and skills gaps have become readily available for informed decision making. In medical imaging, machine learning models[3] are used to detect melanoma from images of pigmented skin lesions. Android-powered mobile phones give Google access to data generated by a very large population of individuals, which it uses to assist with recommendations and decisions in day-to-day activities such as travel, food, and shopping. Autonomous cars use digital imaging and image classification models to identify whether an obstacle is a human, an animal, or an object, so that the car can negotiate it appropriately. In agriculture, models of crops and soil conditions are used to decide which crop to grow given particular soil and climatic conditions. In speech recognition, machine learning models are used to identify the speaker based on the unique characteristics learned about every speaker during the model training phase.

Data Classification

A typical machine learning classification task comprises a training set, an evaluation set, and a testing set of data points, together with a performance evaluation metric[4]. The datasets can be either unlabeled or labeled. A labeled dataset is one in which every input data point has an associated output label (a categorical value). Labeled datasets are expensive to construct because human labor is required to verify the truth of the output labels.

Mathematically, the set of labels is represented as $Y = \{y_1, y_2, \ldots, y_m\}$, the $i$-th input data point is represented as $X_i$, and its associated output label is represented as $y_i$, where $y_i \in Y$. A typical dataset is represented as $D = \langle X, y \rangle$, where $X$ is typically a matrix of $N$ input row vectors of $P$ dimensions and $y$ is a column vector of length $N$. The objective of a machine learning classifier is to learn a function $f : X \to y$ that minimizes the misclassification error, commonly written as

$$\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i \neq y_i],$$

where $\hat{y}_i$ and $y_i$ are the predicted and true labels for a data point $X_i$, respectively.
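The misclassification error described above can be illustrated with a minimal sketch. It assumes the error is measured as the fraction of data points whose predicted label differs from the true label; the function name and the toy labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def misclassification_error(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of data points whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one data point is required")
    mismatches = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mismatches / len(y_true)


# Hypothetical toy data: N = 5 data points, label set Y = {"P", "F"}.
y_true = ["P", "P", "F", "P", "F"]
y_pred = ["P", "F", "F", "P", "P"]

error = misclassification_error(y_true, y_pred)
print(error)  # 2 mismatches out of 5 points -> 0.4
```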

The training set is the dataset population from which one or more statistical machine learning models are learned. The evaluation set is used to select the best machine learning algorithm or strategy by measuring the performance of the learned models on it using the evaluation metric(s). Popular point-based performance metrics include accuracy, precision, recall, and F1-score; popular curve-based performance metrics include ROC, PRC, AUC, and cost curves. Once the statistical model is chosen and its parameters are tuned using the evaluation process, its performance on the test set is measured using the same evaluation metric and reported.
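As a rough illustration of the point-based metrics named above, the sketch below computes accuracy, precision, recall, and F1-score from true and predicted labels. The function name, the choice of "F" as the positive class, and the toy labels are assumptions made for this example only.

```python
from typing import Dict, Sequence


def point_metrics(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> Dict[str, float]:
    """Accuracy, precision, recall and F1-score with respect to one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical evaluation set: 8 "P" (pass) points and 2 "F" (fail) points.
y_true = ["P"] * 8 + ["F"] * 2
y_pred = ["P"] * 7 + ["F"] + ["F", "P"]

metrics = point_metrics(y_true, y_pred, positive="F")
print(metrics)
```

On this toy data the accuracy is 0.8, while precision and recall for the rarer "F" class are both 0.5, so the metrics can tell quite different stories about the same predictions.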

Supervised classification algorithms learn from labeled examples, where the training data comprises an input data representation along with the desired output label. For example, a quality test of a piece of machinery could have data points labeled either “F” (fail) or “P” (pass). The learning algorithm receives a set of inputs, represented predominantly as vectors of feature values, along with the corresponding correct output labels. It learns by comparing its predicted outputs with the correct outputs to find the misclassification error, and then modifies the model so as to minimize that error. Supervised learning is commonly used in applications where historical data is used to learn a model that predicts likely future events. For example, it can predict which insurance customers are likely to file a claim, whether a user will buy a product, or whether a cricket team will win its next match.
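The learn-by-correcting loop described above can be sketched with a perceptron. The perceptron is only one simple choice made here for illustration, not a method the article prescribes; the two features and the toy pass/fail data are hypothetical, with "P" encoded as +1 and "F" as -1.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return +1 (pass) or -1 (fail) from the sign of a linear score."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


def train_perceptron(
    X: Sequence[Sequence[float]], y: Sequence[int], epochs: int = 100
) -> Tuple[List[float], float]:
    """Adjust the model only when a prediction disagrees with the correct label."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):
            if predict(weights, bias, x) != target:
                # Misclassified: move the decision boundary toward the correct side.
                weights = [w + target * xi for w, xi in zip(weights, x)]
                bias += target
                errors += 1
        if errors == 0:  # every training point is classified correctly
            break
    return weights, bias


# Hypothetical machinery test data: two made-up feature values per data point.
X = [[1.0, 1.0], [2.0, 1.5], [1.5, 0.5], [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]]
y = [1, 1, 1, -1, -1, -1]  # +1 = "P" (pass), -1 = "F" (fail)

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(predictions == y)
```

The toy data is linearly separable, which is what allows this particular update rule to reach zero training errors; real data generally offers no such guarantee.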

Semi-supervised classification algorithms are used for the same applications as supervised learning, but they train on a combination of labeled and unlabeled data: typically a small amount of labeled data with a large amount of unlabeled data, because unlabeled data is easier and less expensive to acquire. Semi-supervised learning is useful when the cost associated with labeling is too high to allow for a fully labeled training process.
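One way to combine a few labeled points with many unlabeled ones is self-training, sketched below. Self-training and the nearest-centroid model are assumptions chosen for illustration, not techniques named in the article, and all function names and data points are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def class_centroids(points: List[Point], labels: List[str]) -> Dict[str, Point]:
    """Mean point of each class; this is the whole model in this sketch."""
    sums: Dict[str, List[float]] = {}
    for (x, y), label in zip(points, labels):
        acc = sums.setdefault(label, [0.0, 0.0, 0.0])
        acc[0] += x
        acc[1] += y
        acc[2] += 1.0
    return {label: (acc[0] / acc[2], acc[1] / acc[2]) for label, acc in sums.items()}


def nearest_class(point: Point, cents: Dict[str, Point]) -> Tuple[str, float]:
    """Label of the closest centroid and the squared distance to it."""
    label = min(cents, key=lambda name: squared_distance(point, cents[name]))
    return label, squared_distance(point, cents[label])


def self_train(labeled: List[Point], labels: List[str], unlabeled: List[Point]) -> Dict[Point, str]:
    """Repeatedly pseudo-label the most confident unlabeled point, then refit."""
    points, known, pool = list(labeled), list(labels), list(unlabeled)
    pseudo: Dict[Point, str] = {}
    while pool:
        cents = class_centroids(points, known)
        # "Most confident" here means closest to any class centroid.
        best = min(range(len(pool)), key=lambda i: nearest_class(pool[i], cents)[1])
        label, _ = nearest_class(pool[best], cents)
        point = pool.pop(best)
        points.append(point)
        known.append(label)
        pseudo[point] = label
    return pseudo


# Hypothetical data: two labeled points and six unlabeled points.
labeled = [(0.0, 0.0), (10.0, 10.0)]
labels = ["P", "F"]
unlabeled = [(1.0, 1.0), (2.0, 0.0), (9.0, 9.0), (8.0, 10.0), (3.0, 2.0), (7.0, 8.0)]

pseudo_labels = self_train(labeled, labels, unlabeled)
print(pseudo_labels)
```

The sketch works here because the two groups are well separated; when they overlap, early pseudo-labeling mistakes can reinforce themselves.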

[1] Tom Mitchell, Carnegie Mellon University






