What is Machine Learning?
A machine is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (This is Tom Mitchell's classic definition.)
Basic Machine Learning Process
- Data Input: This is when we give the computer examples to learn from. The data can be pictures, numbers, text, or sounds. Example: showing many pictures of cats and dogs to a computer.
- Abstraction: This is when the computer looks for patterns in the data. It tries to figure out what makes a cat different from a dog (like ears, shape, size, etc.). Basically, it learns the important features.
- Generalization: This is when the computer uses what it learned on new data it has never seen before. If it sees a new picture, it can decide whether it’s a cat or a dog based on the patterns it learned.
Three Broad Categories of Machine Learning
- Supervised Learning: Here, the computer learns from labelled examples (input data with correct output). The computer learns the relationship between input and output and then predicts answers for new data it hasn’t seen before. It is like learning with a teacher.
- Unsupervised Learning: Here, the computer learns without being given the correct output. The computer tries to find patterns or groups on its own. It is like learning without a teacher.
- Reinforcement Learning: Here, the computer learns by trying things and getting rewards or penalties. The computer (agent) takes an action, gets feedback, and learns to make better decisions over time. It is like learning by trial and error.
Supervised Learning
The computer learns from past information (experience). To make decisions, a machine needs basic information, which is provided in the form of training data: past information on a specific task. If the training data is of poor quality, the predictions will also be inaccurate. Later, the computer works with test data and tries to place each item in the right category.
When we are trying to predict a categorical variable, the problem is known as a classification problem. For example, predicting whether a tumor is malignant or benign.
When we are trying to predict a real-valued variable, the problem is known as a regression problem. For example, predicting the price of a house according to its size.
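As a sketch of supervised classification, a decision tree can learn the tumor example above from a handful of labelled records. The feature values below are invented toy numbers, not real medical data, and the library shown is scikit-learn (introduced later in these notes):

```python
# A minimal supervised-classification sketch using scikit-learn.
# The feature values are made-up toy numbers, not real medical data.
from sklearn.tree import DecisionTreeClassifier

# Each row: [tumor size, tumor density] -- hypothetical features
X_train = [[1.0, 0.2], [1.2, 0.3], [4.5, 0.9], [5.0, 0.8]]
y_train = ["benign", "benign", "malignant", "malignant"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)          # learn from labelled examples

pred = clf.predict([[4.8, 0.85]])  # classify data it has not seen before
print(pred[0])
```

Because the labels here are categories ("benign"/"malignant"), this is a classification problem; replacing the labels with real-valued targets would turn it into regression.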
Regression
Here, both the predictor variable and the target variable are continuous in nature. In linear regression, a straight-line relationship is fitted between the predictor and the target variables. In simple linear regression there is only one predictor variable, whereas multiple linear regression includes several predictor variables.
A typical linear regression model can be represented in the form y = α + βx, where x is the predictor variable, y is the target variable, α is the intercept, and β is the slope.
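Fitting y = α + βx can be sketched with scikit-learn's LinearRegression. The house sizes and prices below are invented so that the data lies exactly on the line y = 2x:

```python
# Fitting a simple linear regression y = alpha + beta*x with scikit-learn.
# The house sizes and prices are invented illustration values.
from sklearn.linear_model import LinearRegression

X = [[50], [80], [110], [140]]   # predictor: house size (sq. m)
y = [100, 160, 220, 280]         # target: price (in thousands), exactly 2*x

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])   # alpha (intercept) and beta (slope)
```

On this toy data the fitted intercept is 0 and the slope is 2, recovering the line the points were drawn from.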
Unsupervised Learning
In this, there is no labelled training data to learn from, and no prediction to be made. The objective is to try to find natural groupings or patterns within the data. For example, customer segmentation.
Clustering is the main type of unsupervised learning. It groups similar objects together. One of the most commonly adopted similarity measures is distance: two data items are considered part of the same cluster if the distance between them is small.
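Distance-based grouping can be sketched with k-means clustering from scikit-learn; the 2-D points below are toy values forming two obvious groups:

```python
# A distance-based clustering sketch using k-means from scikit-learn.
from sklearn.cluster import KMeans

# Toy 2-D points forming two well-separated groups
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)   # points close to each other receive the same label
```

No labels were supplied; the algorithm discovers the two groups purely from the distances between points.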
Another variant of unsupervised learning is association analysis, in which associations between data elements are identified. For example, market basket analysis.
Reinforcement Learning
In this, machines learn to perform tasks autonomously. The computer tries to improve its performance on a task. When a sub-task is accomplished successfully, a reward is given; when a sub-task is not executed correctly, no reward is given. This continues until the machine is able to complete the whole task. For example, self-driving cars.
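The reward-driven trial-and-error loop described above can be sketched as a tiny epsilon-greedy "bandit" agent. The two actions, their reward probabilities, and the exploration rate are all invented for illustration:

```python
# A tiny trial-and-error (epsilon-greedy) learning sketch.
# The actions and reward probabilities are invented for illustration.
import random

random.seed(0)                       # fixed seed for reproducibility
reward_prob = [0.2, 0.8]             # hidden reward chances; action 1 is better
values = [0.0, 0.0]                  # the agent's estimated value of each action
counts = [0, 0]

for _ in range(1000):
    # explore a random action 10% of the time, otherwise exploit the best
    if random.random() < 0.1:
        a = random.randrange(2)
    else:
        a = values.index(max(values))
    reward = 1 if random.random() < reward_prob[a] else 0  # reward or penalty
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]          # running average

print(values)   # the estimate for action 1 should end up higher
```

Through repeated feedback alone, without any labelled examples, the agent's value estimates come to favour the action that is rewarded more often.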
Python offers a machine learning library named scikit-learn, which provides various classification, regression, and clustering algorithms.
Basic Types of Data in Machine Learning
A data set is a collection of related information. Each row of a data set is called a record. An attribute gives information about a specific characteristic. Attributes are also known as features, variables, dimensions, or fields.
Data can be broadly categorized into two types:
1. Qualitative data: Also known as categorical data, it provides information which cannot be measured. For example, remarks on a student's performance as "good", "average", or "poor", or a name, roll number, etc.
Qualitative data is further divided into two types:
a) Nominal data: These are values which are not numeric, so mathematical operations cannot be performed on them; only a basic count (and hence the mode) is possible. For example, blood group, nationality, gender, etc.
b) Ordinal data: They can be arranged in ascending/descending order. For example, grades (A, B, C, D, E), customer satisfaction (Very Happy, Happy, Unhappy). Basic counting (mode), median, etc. are possible.
2. Quantitative data: Also known as numeric data, it gives information about the quantity of an object. So, it can be measured. It is further divided into two types:
a) Interval data: Here, not only the order but also the exact difference between two values is known. For example, temperature, date, time, etc. Only addition and subtraction apply to this data, since there is no true zero point and ratios are therefore not meaningful.
b) Ratio data: Here, exact value can be measured. These values can be added, subtracted, multiplied or divided. For example, height, weight, age, salary, etc.
The attributes can either be discrete or continuous. For example, roll number is a discrete attribute, but height is a continuous attribute.
Variance is the measure of how much a set of numbers is spread out from their average value. It is the mathematical way to measure consistency. The formula to calculate variance for a population is: σ² = Σ(xi − μ)² / N
Here, xi denotes the individual values in the data set.
μ denotes the arithmetic mean.
N denotes the total number of values.
Standard Deviation
Standard deviation (σ) is the square root of the variance: σ = √(Σ(xi − μ)² / N).
A larger value of variance or standard deviation indicates more dispersion in the data.
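The population variance and standard deviation can be computed directly from their formulas; a minimal sketch using only Python's standard library, with made-up data:

```python
# Population variance and standard deviation computed from the formula,
# using only the standard library. The data values are invented.
import math

data = [4, 8, 6, 2]                  # toy values
mu = sum(data) / len(data)           # arithmetic mean
variance = sum((x - mu) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)
print(variance, std_dev)
```

For these four values the mean is 5, the squared deviations are 1, 9, 1, and 9, so the variance is 20 / 4 = 5 and the standard deviation is √5 ≈ 2.236.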
Imputation is a method of assigning a value to data elements that have missing values. The mean, median, or mode is the most frequently assigned value.
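A mean-imputation sketch using scikit-learn's SimpleImputer; the data is a single toy column with one missing value:

```python
# Mean imputation with scikit-learn's SimpleImputer on a toy column.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])   # one missing value
imp = SimpleImputer(strategy="mean")     # "median" or "most_frequent" also work
filled = imp.fit_transform(X)
print(filled)                            # the NaN is replaced by the column mean
```

The missing entry is filled with the mean of the observed values, (1 + 3) / 2 = 2.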
A structured representation of raw input data as a meaningful pattern is called a model. It might be a mathematical equation, a graph, or a tree structure.
The process of selecting a model and fitting it to a data set is called model training.
In some cases, the input data is divided into three parts: training data, validation data, and test data. The validation data is used iteratively to measure model performance during training, while the test data is used only once, for the final evaluation.
In k-fold cross-validation, the data set is divided into k non-overlapping random partitions called folds. The value of k can be set to any number, but 10-fold cross-validation is a popular approach: one of the folds is used as the test data for validating a model trained on the remaining 9 folds. This is repeated 10 times, with each fold used once as the test data, and the average performance across all folds is reported.
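A 10-fold cross-validation sketch with scikit-learn; the iris data set and the decision tree model are placeholders chosen for illustration, not part of the original notes:

```python
# 10-fold cross-validation with scikit-learn on a placeholder data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(len(scores), scores.mean())   # 10 fold scores and their average
```

Each of the 10 scores comes from training on 9 folds and testing on the held-out fold; the mean of the scores is the reported performance.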
Underfitting occurs if the target function is kept too simple, making it unable to capture the essential variation in the data. For example, trying to represent non-linear data with a linear model. Underfitting can also occur due to the unavailability of sufficient training data. Underfitting leads to poor performance and poor generalization. It can be avoided by increasing model complexity, adding relevant features, or using more training data.
Overfitting is a situation where the model emulates the training data too closely, so that any specific deviation in the training data, such as noise, gets embedded in the model. Overfitting leads to poor generalization on unseen data; it can be reduced by simplifying the model, using more training data, or applying cross-validation.