Standard Machine Learning Datasets for Imbalanced Classification

An imbalanced classification problem is a problem that involves predicting a class label where the distribution of class labels in the training dataset is skewed.

Many real-world classification problems have an imbalanced class distribution, so it is important for machine learning practitioners to get familiar with working with these types of problems.

In this tutorial, you will discover a suite of standard machine learning datasets for imbalanced classification.

After completing this tutorial, you will know:

Standard machine learning datasets with an imbalance of two classes.
Standard datasets for multiclass classification with a skewed class distribution.
Popular imbalanced classification datasets used for machine learning competitions.

Let's get started.


Tutorial Overview

This tutorial is divided into three parts; they are:

Binary Classification Datasets
Multiclass Classification Datasets
Competition and Other Datasets

Binary Classification Datasets

Binary classification predictive modeling problems are those with two classes.

Typically, imbalanced binary classification problems describe a normal state (class 0) and an abnormal state (class 1), such as fraud, a diagnosis, or a fault.

In this section, we will take a closer look at three standard binary classification machine learning datasets with a class imbalance. These datasets are small enough to fit in memory and have been well studied, providing the basis of investigation in many research papers.

The names of these datasets are as follows:

Pima Indians Diabetes (Pima)
Haberman Breast Cancer (Haberman)
German Credit (German)

Each dataset will be loaded and the nature of the class imbalance will be summarized.
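The same summary is repeated for every dataset in this tutorial, so it can be wrapped in a small helper. The sketch below is an illustration only: the function name class_breakdown is hypothetical (it does not appear in the examples that follow), and the label array is made up so the snippet runs without downloading anything.

```python
# Hypothetical helper (not part of the tutorial's own examples):
# count and percentage of examples for each class label.
from numpy import array, unique

def class_breakdown(y):
    """Return {label: (count, percent)} for a 1-D array of class labels."""
    labels, counts = unique(y, return_counts=True)
    n = len(y)
    return {
        label: (count, 100.0 * count / n)
        for label, count in zip(labels.tolist(), counts.tolist())
    }

# toy labels, made up for illustration: eight of class 0, two of class 1
y = array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
breakdown = class_breakdown(y)
for label, (count, percent) in breakdown.items():
    print(' - Class %s: %d (%.5f%%)' % (label, count, percent))
```

Passing the last column of any of the datasets in this tutorial in place of the toy array should give the same kind of breakdown that the full examples print.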

Pima Indians Diabetes (Pima)

Each record describes the medical details of a female, and the prediction is the onset of diabetes within the next five years.

A sample of the first five rows of the dataset is shown below.

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...

The example below loads the dataset and summarizes its class breakdown.

# Summarize the Pima Indians Diabetes dataset
from numpy import unique
from pandas import read_csv
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
# get the values
values = dataframe.values
# split into input columns (all but the last) and the class label (last column)
X, y = values[:, :-1], values[:, -1]
# gather details
n_rows = X.shape[0]
n_cols = X.shape[1]
classes = unique(y)
n_classes = len(classes)
# summarize
print('N Examples: %d' % n_rows)
print('N Inputs: %d' % n_cols)
print('N Classes: %d' % n_classes)
print('Classes: %s' % classes)
print('Class Breakdown:')
# class breakdown: count and percentage of examples in each class
for c in classes:
    total = len(y[y == c])
    ratio = (total / float(len(y))) * 100
    print(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))

Running the example provides the following output.

N Examples: 768
N Inputs: 8
N Classes: 2
Classes: [0. 1.]
Class Breakdown:
 - Class 0.0: 500 (65.10417%)
 - Class 1.0: 268 (34.89583%)
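Class imbalance is often also quoted as a ratio of majority to minority examples rather than as percentages. The short sketch below works through that arithmetic using the two counts reported in the Pima output above; the variable names are illustrative only.

```python
# counts taken from the Pima class breakdown above
counts = {0: 500, 1: 268}

majority = max(counts.values())
minority = min(counts.values())

# number of majority examples for each minority example
ratio = majority / minority
print('Majority to minority ratio: %.2f:1' % ratio)
```

A ratio a little under 2:1 is a mild imbalance compared with the competition datasets later in the tutorial.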

Haberman Breast Cancer (Haberman)

Each record describes the medical details of a patient, and the prediction is whether or not the patient survived after five years.

A sample of the first five rows of the dataset is shown below.

30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
31,65,4,1
...

The example below loads the dataset and summarizes its class breakdown.

# Summarize the Haberman Breast Cancer dataset
from numpy import unique
from pandas import read_csv
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
dataframe = read_csv(url, header=None)
# get the values
values = dataframe.values
# split into input columns (all but the last) and the class label (last column)
X, y = values[:, :-1], values[:, -1]
# gather details
n_rows = X.shape[0]
n_cols = X.shape[1]
classes = unique(y)
n_classes = len(classes)
# summarize
print('N Examples: %d' % n_rows)
print('N Inputs: %d' % n_cols)
print('N Classes: %d' % n_classes)
print('Classes: %s' % classes)
print('Class Breakdown:')
# class breakdown: count and percentage of examples in each class
for c in classes:
    total = len(y[y == c])
    ratio = (total / float(len(y))) * 100
    print(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))

Running the example provides the following output.

N Examples: 306
N Inputs: 3
N Classes: 2
Classes: [1 2]
Class Breakdown:
 - Class 1: 225 (73.52941%)
 - Class 2: 81 (26.47059%)

German Credit (German)

Each record describes the financial details of a person, and the prediction is whether the person is a good credit risk.

A sample of the first five rows of the dataset is shown below.

A11,6,A34,A43,1169,A65,A75,4,A93,A101,4,A121,67,A143,A152,2,A173,1,A192,A201,1
A12,48,A32,A43,5951,A61,A73,2,A92,A101,2,A121,22,A143,A152,1,A173,1,A191,A201,2
A14,12,A34,A46,2096,A61,A74,2,A93,A101,3,A121,49,A143,A152,1,A172,2,A191,A201,1
A11,42,A32,A42,7882,A61,A74,2,A93,A103,4,A122,45,A143,A153,1,A173,2,A191,A201,1
A11,24,A33,A40,4870,A61,A73,3,A93,A101,4,A124,53,A143,A153,2,A173,2,A191,A201,2
...

The example below loads the dataset and summarizes its class breakdown.

# Summarize the German Credit dataset
from numpy import unique
from pandas import read_csv
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/german.csv'
dataframe = read_csv(url, header=None)
# get the values
values = dataframe.values
# split into input columns (all but the last) and the class label (last column)
X, y = values[:, :-1], values[:, -1]
# gather details
n_rows = X.shape[0]
n_cols = X.shape[1]
classes = unique(y)
n_classes = len(classes)
# summarize
print('N Examples: %d' % n_rows)
print('N Inputs: %d' % n_cols)
print('N Classes: %d' % n_classes)
print('Classes: %s' % classes)
print('Class Breakdown:')
# class breakdown: count and percentage of examples in each class
for c in classes:
    total = len(y[y == c])
    ratio = (total / float(len(y))) * 100
    print(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))

Running the example provides the following output.

N Examples: 1000
N Inputs: 20
N Classes: 2
Classes: [1 2]
Class Breakdown:
 - Class 1: 700 (70.00000%)
 - Class 2: 300 (30.00000%)

Multiclass Classification Datasets

Multiclass classification predictive modeling problems are those with more than two classes.

Typically, imbalanced multiclass classification problems describe multiple different events, some significantly more common than others.

In this section, we will take a closer look at three standard multiclass classification machine learning datasets with a class imbalance. These datasets are small enough to fit in memory and have been well studied, providing the basis of investigation in many research papers.

The names of these datasets are as follows:

Glass Identification (Glass)
E-coli (Ecoli)
Thyroid Gland (Thyroid)

Note: it is common in research papers to transform imbalanced multiclass classification problems into imbalanced binary classification problems by grouping all of the majority classes into one class and leaving the smallest minority class.
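The grouping described in the note can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a procedure taken from any particular paper: the labels are made up, and the smallest class is mapped to 1 while every other class is merged into 0.

```python
from numpy import array, unique, where

# toy multiclass labels, made up for illustration
y = array(['a', 'a', 'a', 'b', 'b', 'c', 'a', 'b'])

# find the class with the fewest examples
labels, counts = unique(y, return_counts=True)
smallest = labels[counts.argmin()]

# 1 for the smallest minority class, 0 for all other classes grouped together
y_binary = where(y == smallest, 1, 0)

print('Smallest class: %s' % smallest)
print(y_binary)
```

If two classes tie for the smallest count, as some classes do in the Ecoli dataset below, argmin picks the first one in sorted label order, so the choice of minority class would need to be made explicitly.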

Each dataset will be loaded and the nature of the class imbalance will be summarized.

Glass Identification (Glass)

Each record describes the chemical content of glass, and the prediction involves the type of glass.

A sample of the first five rows of the dataset is shown below.

1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1
1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1
1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1
1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1
1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1
...

The first column represents a row identifier and can be removed.

The example below loads the dataset and summarizes its class breakdown.

# Summarize the Glass Identification dataset
from numpy import unique
from pandas import read_csv
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
dataframe = read_csv(url, header=None)
# get the values
values = dataframe.values
# split into input columns (all but the last) and the class label (last column)
X, y = values[:, :-1], values[:, -1]
# gather details
n_rows = X.shape[0]
n_cols = X.shape[1]
classes = unique(y)
n_classes = len(classes)
# summarize
print('N Examples: %d' % n_rows)
print('N Inputs: %d' % n_cols)
print('N Classes: %d' % n_classes)
print('Classes: %s' % classes)
print('Class Breakdown:')
# class breakdown: count and percentage of examples in each class
for c in classes:
    total = len(y[y == c])
    ratio = (total / float(len(y))) * 100
    print(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))

Running the example provides the following output.

N Examples: 214
N Inputs: 9
N Classes: 6
Classes: [1. 2. 3. 5. 6. 7.]
Class Breakdown:
 - Class 1.0: 70 (32.71028%)
 - Class 2.0: 76 (35.51402%)
 - Class 3.0: 17 (7.94393%)
 - Class 5.0: 13 (6.07477%)
 - Class 6.0: 9 (4.20561%)
 - Class 7.0: 29 (13.55140%)

E-coli (Ecoli)

Each record describes the result of different tests, and the prediction involves the protein localization site name.

A sample of the first five rows of the dataset is shown below.

0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
0.56,0.40,0.48,0.50,0.49,0.37,0.46,cp
0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp
0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp
...

The first column represents a row identifier or name and can be removed.

The example below loads the dataset and summarizes its class breakdown.

# Summarize the Ecoli dataset
from numpy import unique
from pandas import read_csv
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.csv'
dataframe = read_csv(url, header=None)
# get the values
values = dataframe.values
# split into input columns (all but the last) and the class label (last column)
X, y = values[:, :-1], values[:, -1]
# gather details
n_rows = X.shape[0]
n_cols = X.shape[1]
classes = unique(y)
n_classes = len(classes)
# summarize
print('N Examples: %d' % n_rows)
print('N Inputs: %d' % n_cols)
print('N Classes: %d' % n_classes)
print('Classes: %s' % classes)
print('Class Breakdown:')
# class breakdown: count and percentage of examples in each class
for c in classes:
    total = len(y[y == c])
    ratio = (total / float(len(y))) * 100
    print(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))

Running the example provides the following output.

N Examples: 336
N Inputs: 7
N Classes: 8
Classes: ['cp' 'im' 'imL' 'imS' 'imU' 'om' 'omL' 'pp']
Class Breakdown:
 - Class cp: 143 (42.55952%)
 - Class im: 77 (22.91667%)
 - Class imL: 2 (0.59524%)
 - Class imS: 2 (0.59524%)
 - Class imU: 35 (10.41667%)
 - Class om: 20 (5.95238%)
 - Class omL: 5 (1.48810%)
 - Class pp: 52 (15.47619%)

Thyroid Gland (Thyroid)

Each record describes the result of different tests on a thyroid, and the prediction involves the medical diagnosis of the thyroid.

A sample of the first five rows of the dataset is shown below.

107,10.1,2.2,0.9,2.7,1
113,9.9,3.1,2.0,5.9,1
127,12.9,2.4,1.4,0.6,1
109,5.3,1.6,1.4,1.5,1
105,7.3,1.5,1.5,-0.1,1
...

The example below loads the dataset and summarizes its class breakdown.

# Summarize the Thyroid Gland dataset
from numpy import unique
from pandas import read_csv
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/new-thyroid.csv'
dataframe = read_csv(url, header=None)
# get the values
values = dataframe.values
# split into input columns (all but the last) and the class label (last column)
X, y = values[:, :-1], values[:, -1]
# gather details
n_rows = X.shape[0]
n_cols = X.shape[1]
classes = unique(y)
n_classes = len(classes)
# summarize
print('N Examples: %d' % n_rows)
print('N Inputs: %d' % n_cols)
print('N Classes: %d' % n_classes)
print('Classes: %s' % classes)
print('Class Breakdown:')
# class breakdown: count and percentage of examples in each class
for c in classes:
    total = len(y[y == c])
    ratio = (total / float(len(y))) * 100
    print(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))

Running the example provides the following output.

N Examples: 215
N Inputs: 5
N Classes: 3
Classes: [1. 2. 3.]
Class Breakdown:
 - Class 1.0: 150 (69.76744%)
 - Class 2.0: 35 (16.27907%)
 - Class 3.0: 30 (13.95349%)

Competition and Other Datasets

This section lists additional datasets that are used less often in research papers, are larger, or are used as the basis of machine learning competitions.

The names of these datasets are as follows:

Credit Card Fraud (Credit)
Porto Seguro Auto Insurance Claim (Porto Seguro)

Each dataset will be loaded and the nature of the class imbalance will be summarized.

Credit Card Fraud (Credit)

Each record describes a credit card transaction, and the prediction is whether or not it is classified as fraud.

The data is about 144 megabytes uncompressed or 66 megabytes compressed.

Download the dataset and unzip it into your current working directory.

A sample of the first five rows of the file (the header row plus four records) is shown below.

"Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
0,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62,"0"
0,1.19185711131486,0.26615071205963,0.16648011335321,0.448154078460911,0.0600176492822243,-0.0823608088155687,-0.0788029833323113,0.0851016549148104,-0.255425128109186,-0.166974414004614,1.61272666105479,1.06523531137287,0.48909501589608,-0.143772296441519,0.635558093258208,0.463917041022171,-0.114804663102346,-0.183361270123994,-0.145783041325259,-0.0690831352230203,-0.225775248033138,-0.638671952771851,0.101288021253234,-0.339846475529127,0.167170404418143,0.125894532368176,-0.00898309914322813,0.0147241691924927,2.69,"0"
1,-1.35835406159823,-1.34016307473609,1.77320934263119,0.379779593034328,-0.503198133318193,1.80049938079263,0.791460956450422,0.247675786588991,-1.51465432260583,0.207642865216696,0.624501459424895,0.066083685268831,0.717292731410831,-0.165945922763554,2.34586494901581,-2.89008319444231,1.10996937869599,-0.121359313195888,-2.26185709530414,0.524979725224404,0.247998153469754,0.771679401917229,0.909412262347719,-0.689280956490685,-0.327641833735251,-0.139096571514147,-0.0553527940384261,-0.0597518405929204,378.66,"0"
1,-0.966271711572087,-0.185226008082898,1.79299333957872,-0.863291275036453,-0.0103088796030823,1.24720316752486,0.23760893977178,0.377435874652262,-1.38702406270197,-0.0549519224713749,-0.226487263835401,0.178228225877303,0.507756869957169,-0.28792374549456,-0.631418117709045,-1.0596472454325,-0.684092786345479,1.96577500349538,-1.2326219700892,-0.208037781160366,-0.108300452035545,0.00527359678253453,-0.190320518742841,-1.17557533186321,0.647376034602038,-0.221928844458407,0.0627228487293033,0.0614576285006353,123.5,"0"
...

The example below loads the dataset and summarizes its class breakdown.

# Summarize the Credit Card Fraud dataset
from numpy import unique
from pandas import read_csv
# load the dataset (the first row of the file is a header)
dataframe = read_csv('creditcard.csv')
# get the values
values = dataframe.values
# split into input columns (all but the last) and the class label (last column)
X, y = values[:, :-1], values[:, -1]
# gather details
n_rows = X.shape[0]
n_cols = X.shape[1]
classes = unique(y)
n_classes = len(classes)
# summarize
print('N Examples: %d' % n_rows)
print('N Inputs: %d' % n_cols)
print('N Classes: %d' % n_classes)
print('Classes: %s' % classes)
print('Class Breakdown:')
# class breakdown: count and percentage of examples in each class
for c in classes:
    total = len(y[y == c])
    ratio = (total / float(len(y))) * 100
    print(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))

Running the example provides the following output.

N Examples: 284807
N Inputs: 30
N Classes: 2
Classes: [0. 1.]
Class Breakdown:
 - Class 0.0: 284315 (99.82725%)
 - Class 1.0: 492 (0.17275%)
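For a file of this size, the class breakdown can also be computed without loading the input columns. The sketch below assumes the label column is named Class, as in the header row of the sample above, and uses the usecols argument of pandas read_csv. A small made-up CSV held in memory stands in for creditcard.csv so the snippet runs on its own.

```python
from io import StringIO
from pandas import read_csv

# made-up stand-in for a large CSV whose label column is named "Class"
csv_text = (
    '"Time","V1","Amount","Class"\n'
    '0,-1.3,149.62,"0"\n'
    '0,1.1,2.69,"0"\n'
    '1,-1.3,378.66,"1"\n'
    '1,-0.9,123.5,"0"\n'
)

# read only the label column; the other columns are skipped during parsing
labels = read_csv(StringIO(csv_text), usecols=['Class'])['Class']

# count examples per class and convert the counts to percentages
counts = labels.value_counts().sort_index()
percents = counts / len(labels) * 100
for label in counts.index:
    print(' - Class %s: %d (%.5f%%)' % (label, counts.loc[label], percents.loc[label]))
```

Replacing StringIO(csv_text) with the path to the downloaded file should apply the same idea to the real data.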

Porto Seguro Auto Insurance Claim (Porto Seguro)

Each record describes a person's car insurance details, and the prediction involves whether or not the person will make an insurance claim.

The data is about 42 megabytes compressed.

Download the dataset and unzip it into your current working directory.

A sample of the first five rows of the file (the header row plus four records) is shown below.

id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,ps_ind_11_bin,ps_ind_12_bin,ps_ind_13_bin,ps_ind_14,ps_ind_15,ps_ind_16_bin,ps_ind_17_bin,ps_ind_18_bin,ps_reg_01,ps_reg_02,ps_reg_03,ps_car_01_cat,ps_car_02_cat,ps_car_03_cat,ps_car_04_cat,ps_car_05_cat,ps_car_06_cat,ps_car_07_cat,ps_car_08_cat,ps_car_09_cat,ps_car_10_cat,ps_car_11_cat,ps_car_11,ps_car_12,ps_car_13,ps_car_14,ps_car_15,ps_calc_01,ps_calc_02,ps_calc_03,ps_calc_04,ps_calc_05,ps_calc_06,ps_calc_07,ps_calc_08,ps_calc_09,ps_calc_10,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
7,0,2,2,5,1,0,0,1,0,0,0,0,0,0,0,11,0,1,0,0.7,0.2,0.7180703307999999,10,1,-1,0,1,4,1,0,0,1,12,2,0.4,0.8836789178,0.3708099244,3.6055512755000003,0.6,0.5,0.2,3,1,10,1,10,1,5,9,1,5,8,0,1,1,0,0,1
9,0,1,1,7,0,0,0,0,1,0,0,0,0,0,0,3,0,0,1,0.8,0.4,0.7660776723,11,1,-1,0,-1,11,1,1,2,1,19,3,0.316227766,0.6188165191,0.3887158345,2.4494897428,0.3,0.1,0.3,2,1,9,5,8,1,7,3,1,1,9,0,1,1,0,1,0
13,0,5,4,9,1,0,0,0,1,0,0,0,0,0,0,12,1,0,0,0.0,0.0,-1.0,7,1,-1,0,-1,14,1,1,2,1,60,1,0.316227766,0.6415857163,0.34727510710000004,3.3166247904,0.5,0.7,0.1,2,2,9,1,8,2,7,4,2,7,7,0,1,1,0,1,0
16,0,0,1,2,0,0,1,0,0,0,0,0,0,0,0,8,1,0,0,0.9,0.2,0.5809475019,7,1,0,0,1,11,1,1,3,1,104,1,0.3741657387,0.5429487899000001,0.2949576241,2.0,0.6,0.9,0.1,2,4,7,1,8,4,2,2,2,4,9,0,0,0,0,0,0
...

The example below loads the dataset and summarizes its class breakdown. Note that, like the earlier examples, it treats the last column as the class label. In this file the header row lists target as the second column, so the breakdown reported here describes the last column, ps_calc_20_bin, rather than target.

# Summarize the Porto Seguro's Safe Driver Prediction dataset
from numpy import unique
from pandas import read_csv
# load the dataset (the first row of the file is a header)
dataframe = read_csv('train.csv')
# get the values
values = dataframe.values
# split into input columns (all but the last) and the class label (last column)
X, y = values[:, :-1], values[:, -1]
# gather details
n_rows = X.shape[0]
n_cols = X.shape[1]
classes = unique(y)
n_classes = len(classes)
# summarize
print('N Examples: %d' % n_rows)
print('N Inputs: %d' % n_cols)
print('N Classes: %d' % n_classes)
print('Classes: %s' % classes)
print('Class Breakdown:')
# class breakdown: count and percentage of examples in each class
for c in classes:
    total = len(y[y == c])
    ratio = (total / float(len(y))) * 100
    print(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))

Running the example provides the following output.

N Examples: 595212
N Inputs: 58
N Classes: 2
Classes: [0. 1.]
Class Breakdown:
 - Class 0.0: 503955 (84.66815%)
 - Class 1.0: 91257 (15.33185%)
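The header row in the sample above lists target as the second column, while the example takes the last column of the file as the class label. One way to avoid depending on column position is to select the label by name. The sketch below is an assumption-laden illustration: it uses a made-up four-row CSV in place of train.csv, and it treats id as a row identifier to be dropped from the inputs, which is a guess based on the column name.

```python
from io import StringIO
from numpy import unique
from pandas import read_csv

# made-up stand-in for train.csv: the label is the second column, named "target"
csv_text = (
    'id,target,ps_ind_01,ps_calc_20_bin\n'
    '7,0,2,1\n'
    '9,0,1,0\n'
    '13,1,5,0\n'
    '16,0,0,0\n'
)
dataframe = read_csv(StringIO(csv_text))

# select the label by column name rather than by position
y = dataframe['target'].values
# assumption: "id" is a row identifier, so it is dropped along with the label
X = dataframe.drop(columns=['id', 'target']).values

print('N Examples: %d' % X.shape[0])
print('N Inputs: %d' % X.shape[1])
print('Classes: %s' % unique(y))
```

Running the same selection on the downloaded file would give a class breakdown for target, which is not expected to match the figures printed above for the last column.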

Summary

In this tutorial, you discovered a suite of standard machine learning datasets for imbalanced classification.

Specifically, you learned:

Standard machine learning datasets with an imbalance of two classes.
Standard datasets for multiclass classification with a skewed class distribution.
Popular imbalanced classification datasets used for machine learning competitions.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
