# Introduction to Machine Learning

So far we have covered how to visualize data, plot graphs, and work with data sets. Now it is time to learn how to build a machine learning model.

Introduction to machine learning:

In machine learning, we train a model on existing data so that it can predict unknown or future values. There are various methods for doing this.

Supervised learning: Here we train algorithms using labelled examples (that is, the data set we train on must include labels). The steps involved in supervised machine learning are:

- **Data acquisition:** Collect the data. According to scikit-learn (one of the machine learning tools, discussed in upcoming tutorials), there should be at least 50 samples.
- **Data cleaning:** Clean the data, for example by removing or filling in missing values. The Pandas library plays a major role in this step.
- **Training the model:** Once the data is cleaned, it is divided into two parts, training data and testing data. The training data is used to train the model, and the test data is used to check the accuracy of the trained model. The larger share of the data goes to the training set.
- **Model testing:** Once the model is trained, we test it on the test data.
- **Adjusting model parameters:** Sometimes testing gives poor results, so we adjust the parameters of the model and train it again.
- **Model deployment:** Once we confirm that the model works well, we are ready to deploy it into the market.

Issue with training and testing the data

Is it valid to use a single split of the data to evaluate the whole model?

Yes, we can do it this way, and we can keep updating the model as more data is added.

There is also an alternative: divide the data into three parts instead of two.

- Train data
- Validation data
- Test data

The validation data is used for initial testing, and any changes to the model are made based on those results, so that the model performs better on the test data set.
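The three-way split described above can be sketched in plain Python. This is only an illustration: the function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are assumptions chosen for the example, not values prescribed by this tutorial.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and split them into train, validation and test lists.

    The fractions are illustrative defaults; whatever is left after the
    train and validation parts becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave some data for the test set")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

# Hypothetical data set: 100 row identifiers.
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Shuffling before slicing matters when the rows are stored in some order (for example, sorted by date or by label); otherwise one of the three parts could end up unrepresentative.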

**Confusion matrix:**

| | Predicted true | Predicted false |
|---|---|---|
| Actually true condition | True positive | False negative |
| Actually false condition | False positive | True negative |
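As a rough illustration, the four cells of the confusion matrix can be counted from a list of actual labels and a list of predicted labels. The helper name `confusion_counts` and the sample labels below are made up for this sketch (1 means true, 0 means false).

```python
def confusion_counts(actual, predicted):
    """Return (true_pos, false_neg, false_pos, true_neg) for binary labels."""
    true_pos = false_neg = false_pos = true_neg = 0
    for a, p in zip(actual, predicted):
        if a and p:
            true_pos += 1    # actually true, predicted true
        elif a and not p:
            false_neg += 1   # actually true, predicted false
        elif not a and p:
            false_pos += 1   # actually false, predicted true
        else:
            true_neg += 1    # actually false, predicted false
    return true_pos, false_neg, false_pos, true_neg

# Hypothetical labels for eight samples.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```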

**A few key classification metrics:**

- Accuracy
- Recall
- Precision
- F1 score

**Accuracy:**

Accuracy is the ratio of the number of correct predictions made by the model to the total number of predictions.

**Accuracy** = Correct predictions / Total predictions

Accuracy is not a good choice for unbalanced data, because a model that always predicts the majority class can still get a high score.
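To see why, here is a small sketch with made-up labels: 95 negative samples, 5 positive samples, and a "model" that always predicts the negative class. The numbers are hypothetical and chosen only to make the effect obvious.

```python
# Hypothetical unbalanced data set: 5 positives (1) and 95 negatives (0).
actual = [1] * 5 + [0] * 95

# A useless "model" that predicts the majority class for every sample.
predicted = [0] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# How many of the positive samples did the model actually find?
positives_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print("accuracy:", accuracy)
print("positives found:", positives_found, "of", sum(actual))
```

The accuracy looks high even though the model never finds a single positive case, which is the situation the other metrics are meant to expose.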

**Recall:**

Recall is the ability of a model to find all the relevant cases within a data set.

**Recall** = True positives / ( True positives + False negatives )

**Precision:**

Precision is the ability of a classification model to identify only the relevant data points.

**Precision** = True positives / ( True positives + False positives )

**F1 score:**

This metric combines recall and precision. It is the harmonic mean of the two.

**F1 score** = 2* ( precision * recall ) / ( precision + recall )
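The recall, precision and F1 formulas can be sketched directly from the confusion matrix counts. The function names and the counts used below (8 true positives, 2 false positives, 4 false negatives) are hypothetical; returning 0.0 when a denominator is zero is a convention assumed for this example.

```python
def recall(true_pos, false_neg):
    """True positives / (true positives + false negatives)."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0

def precision(true_pos, false_pos):
    """True positives / (true positives + false positives)."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0

def f1_score(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    total = precision_value + recall_value
    return 2 * precision_value * recall_value / total if total else 0.0

# Hypothetical counts taken from a confusion matrix.
tp, fp, fn = 8, 2, 4

r = recall(tp, fn)
p = precision(tp, fp)
f1 = f1_score(p, r)

print("recall:", round(r, 3))
print("precision:", round(p, 3))
print("F1 score:", round(f1, 3))
```

Because it is a harmonic mean, the F1 score always lies between precision and recall, and it drops quickly when either of the two is low.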

**Regression:**

Regression is when a model attempts to predict continuous values, unlike a classification model, which predicts categories. E.g. predicting the cost of land based on its surroundings and other parameters.

Evaluation metrics for regression:

- Mean absolute error.
- Mean squared error.
- Root mean square error.

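**Mean absolute error:**

Suppose the model predicts a value $\hat{x}_i$ for each of the $n$ samples, and the corresponding true values are $x_i$ (where $i$ runs from 1 to $n$).

The mean absolute error is the average of the absolute differences between them:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - \hat{x}_i \right|$$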

**Issue with this metric:**

The mean absolute error weights all errors linearly, so a prediction that is very far from the true value gets no extra penalty for being extreme.

**Mean squared error:**

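This evaluation metric resolves the issue with mean absolute error: squaring the differences gives large errors much more weight. With $x_i$ the true value and $\hat{x}_i$ the predicted value for sample $i$:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2$$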

Issue with mean squared error

When the differences are squared, the extreme values get heavy weight, but the units are squared too. If we were creating a model to predict land cost, the unit of the error would be squared cost, which doesn't have a direct meaning.

**Root mean squared error:**

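This evaluation metric solves both issues: it still penalizes large errors, and taking the square root brings the result back to the original units.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2}$$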
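The three regression metrics can be compared on a small example. The sketch below uses plain Python; the function names and the land prices are made up for illustration, and one prediction is deliberately far off to show how each metric reacts.

```python
import math

def mean_absolute_error(true_values, predictions):
    """Average of the absolute differences."""
    errors = [abs(t - p) for t, p in zip(true_values, predictions)]
    return sum(errors) / len(errors)

def mean_squared_error(true_values, predictions):
    """Average of the squared differences (units are squared too)."""
    errors = [(t - p) ** 2 for t, p in zip(true_values, predictions)]
    return sum(errors) / len(errors)

def root_mean_squared_error(true_values, predictions):
    """Square root of the mean squared error (back in the original units)."""
    return math.sqrt(mean_squared_error(true_values, predictions))

# Hypothetical land prices; the last prediction is off by 50.
true_values = [100, 150, 200, 250]
predictions = [110, 140, 200, 300]

mae = mean_absolute_error(true_values, predictions)
mse = mean_squared_error(true_values, predictions)
rmse = root_mean_squared_error(true_values, predictions)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", round(rmse, 2))
```

The single large error pulls the root mean squared error well above the mean absolute error, while both stay in the same units as the prices.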

**MACHINE LEARNING WITH PYTHON**

Introduction to scikit learn:

It is one of the most powerful and popular ML packages, as it has a lot of machine learning algorithms built in.

It may not come installed in Jupyter Notebook. To install it, type the following in a cell and run it, then wait for some time:

    conda install scikit-learn

This is the end of this tutorial on Introduction to Machine Learning.