The term “big data” was everywhere in 2019, and it will keep its momentum in the years to come. In a previous post, I introduced some concepts about big data, machine learning, and data mining (see post: Understanding Big Data, Data Mining, and Machine Learning in 5 Minutes). Now let's dig deeper into machine learning with a brief walk-through of the most commonly used ML algorithms: no abstract theories, just pictures, small illustrative sketches, and examples of how they are used.
The list of algorithms covered in this article includes:
- Decision tree
- Random forest
- Logistic regression
- Support vector machine
- Naive Bayes
- k-Nearest Neighbors
- k-means
- AdaBoost
- Neural network
- Markov chain
1. Decision Tree
A decision tree classifies a set of data into different groups based on certain attributes. At each node it runs a test on one attribute, and the branch taken by that judgment splits the data into two groups, and so on down the tree. The tests (the questions asked at each node) are learned from the existing data, so when new data arrive, the computer can follow the branches and drop each record into the right leaf, that is, the right category.
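If you want to see this in practice, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the two attributes and the group labels below are made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: each row is [attribute A, attribute B]; the label is the group it belongs to.
X = [[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0]]
y = ["group 1", "group 1", "group 2", "group 2", "group 1", "group 2"]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# A new record is classified by answering the test at each node until it reaches a leaf.
print(tree.predict([[1, 1]]))  # -> ['group 2']
```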
2. Random Forest
Randomly sample from the original data to form different subsets.
Matrix S is the original data set: rows 1 to N are the records, columns A, B, and C are the features, and the last column holds the category of each record.
Create random subsets from S; let's say we get M subsets.
From these subsets we train M decision trees:
Feed new data into these M trees and we get M results; counting which result appears most often across all M trees (a majority vote) gives the final prediction.
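Here is a minimal sketch of that bootstrap-and-vote idea using scikit-learn's RandomForestClassifier; the synthetic "matrix S" is generated only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic "matrix S": 100 rows, 3 feature columns, and a binary class label.
X, y = make_classification(n_samples=100, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)

# n_estimators is the number M of trees; each tree is trained on a random bootstrap subset of S.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# The forest's prediction is the majority vote over the individual trees.
print(forest.predict(X[:1]))
votes = [forest.classes_[int(t.predict(X[:1])[0])] for t in forest.estimators_]
print(votes)  # each tree's vote; the most common one becomes the final answer
```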
3. Logistic Regression
The probability we want to predict must lie between 0 and 1, and a simple linear model cannot guarantee that: once the input moves outside a certain range, the linear model's output falls outside that interval.
We are better off with a model of this kind.
So how can we get this model?
This model needs to satisfy two conditions: "greater than or equal to 0" and "less than or equal to 1".
Transforming the formula gives us the logistic regression model:
By fitting it to the original data, we can get the corresponding coefficients.
And we get the plot of the logistic model.
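As a rough sketch of the same idea, the code below fits scikit-learn's LogisticRegression to a tiny one-feature data set (the numbers are invented) and shows how the sigmoid squashes the linear score into a probability between 0 and 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Squashes any real number into (0, 1), which the linear model alone cannot do.
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# The learned coefficients define the linear part; the sigmoid turns it into a probability.
z = model.intercept_[0] + model.coef_[0][0] * 3.5
print(sigmoid(z))                          # probability computed by hand
print(model.predict_proba([[3.5]])[0, 1])  # the same probability from scikit-learn
```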
4. Support Vector Machine
To separate two classes with a hyperplane, the best choice is the hyperplane that leaves the maximum margin to both classes. Because Z2 > Z1, the green one is the better hyperplane.
Express the hyperplane as a linear equation; for one class the equation evaluates to greater than or equal to 1, and for the other class to less than or equal to -1.
Calculate the distance from a point to the hyperplane using the equation in the graph:
This gives the expression for the total margin below. The aim is to maximize the margin, so what we need to do is minimize the denominator.
For example, we use 3 points to find the optimal hyperplane, and define the weight vector as (2, 3) - (1, 1).
This gives a weight vector of the form (a, 2a); substitute the two points into the hyperplane equation.
Once a is determined, the points that satisfy the equation with (a, 2a) are the support vectors, and the equation obtained by substituting a and w0 back in is the support vector machine.
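Here is a minimal linear-SVM sketch with scikit-learn's SVC; the three points below are made up and are not necessarily the same ones as in the figure:

```python
from sklearn.svm import SVC

X = [[1, 1], [2, 0], [2, 3]]
y = [-1, -1, 1]

clf = SVC(kernel="linear", C=1e6)  # a large C approximates the hard-margin case
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # the weight vector w and bias w0 of the hyperplane
print(clf.support_vectors_)        # the points that sit on the margin
print(clf.predict([[3, 4]]))       # classify a new point
```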
5. Naive Bayes
Here's an example from NLP: given a piece of text, determine whether its attitude (sentiment) is positive or negative.
To solve the problem, we only need to look at some of the words, and the text is then represented by those words and their counts.
The original question is: given a sentence, which category does it belong to? By applying Bayes' rule, it becomes an easy question.
The question turns into: in this class, what is the probability that this sentence occurs? And don't forget the other two probabilities in the equation.
Example: the probability that the word "love" occurs is 0.1 in the positive class and 0.001 in the negative class.
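A minimal sketch of that word-count approach, assuming scikit-learn's CountVectorizer and MultinomialNB; the tiny training sentences are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I love this movie", "what a great film", "I hate this movie", "a terrible film"]
labels = ["positive", "positive", "negative", "negative"]

# Represent each text by its word counts, as described above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

# Bayes' rule combines the per-class word probabilities with the class priors.
print(model.predict(vectorizer.transform(["I love this film"])))  # -> ['positive']
```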
6. k-Nearest Neighbors
When a new data point arrives, it belongs to the category that has the most points among its nearest neighbors.
For example, to distinguish "dog" from "cat", we judge from two features, "claws" and "sound". Circles and triangles are the known categories; what about the "star"?
When K = 3, the three lines connect the nearest 3 points. Circles are the majority, so the "star" belongs to the "cat" category.
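A minimal k-NN sketch mirroring the cat/dog example, using scikit-learn's KNeighborsClassifier; the numeric feature values are invented:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two features per animal: [claws, sound] in arbitrary made-up units.
X = [[9, 8], [8, 9], [7, 8],   # cats (the circles)
     [2, 3], [3, 2], [1, 2]]   # dogs (the triangles)
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3, as in the figure
knn.fit(X, y)

# The "star": its 3 nearest neighbors are cats, so it is classified as a cat.
print(knn.predict([[8, 7]]))  # -> ['cat']
```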
7. k-means
Separate the data into 3 clusters; here the pink cluster is the biggest, while the yellow one is the smallest.
Pick points 3, 2, and 1 as the default (initial) centers, calculate the distance from every remaining point to these centers, and assign each point to the cluster whose center is closest.
After assigning the points, calculate the mean of each cluster and set it as the new center.
After a few rounds, we can stop once the cluster assignments no longer change.
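The same assign-then-recompute loop in a minimal scikit-learn sketch; the 2-D points below are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 1.5],   # one dense group
              [5, 5], [5.5, 5], [6, 5.5],   # another group
              [9, 1], [9.5, 1.5]])          # a small third group

# n_clusters = 3; scikit-learn repeats the assign-then-recompute-centers loop
# until the assignments stop changing (or a maximum number of iterations is hit).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # which of the 3 clusters each point ended up in
print(kmeans.cluster_centers_)  # the means that serve as the final centers
```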
8. AdaBoost
AdaBoost is one method of boosting.
Boosting gathers up classifiers that individually don't give satisfactory results and combines them into a classifier that performs better.
As shown below, tree 1 and tree 2 don't perform well individually, but if we feed in the same data and sum up their results, the final result is more convincing.
As an example of AdaBoost, in handwriting recognition the panel can extract many features, such as the direction of the first stroke, the distance between the starting point and the ending point, and so on.
During training, the machine learns a weight for each feature. For example, the digits 2 and 3 start in a very similar way, so that feature contributes little to telling them apart and gets a small weight.
But this alpha angle is highly distinctive, so that feature gets a large weight. The final outcome is the result of considering all of these weighted features.
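A minimal boosting sketch with scikit-learn's AdaBoostClassifier; the synthetic data set stands in for the handwriting features and is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The default weak learner is a depth-1 decision tree ("stump"); boosting reweights
# the data so that later stumps focus on the examples the earlier ones got wrong.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)

print(boost.score(X, y))             # accuracy of the combined classifier
print(boost.estimator_weights_[:5])  # how much each weak learner's vote counts
```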
9. Neural Network
In a neural network, an input may end up in one of two or more classes. A neural network is made up of neurons and the connections between them. The first layer is the input layer and the last layer is the output layer; the hidden layers and the output layer each have their own classifiers.
When an input enters the network and activates the neurons, the computed scores are passed on to the next layer. The scores shown in the output layer are the scores for each class; the example below yields class 1. The same input passed to different nodes produces different scores because each node has its own weights and bias; this is propagation.
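A tiny forward-pass sketch in NumPy; the weights and biases are made-up numbers, just to show how scores travel from layer to layer and produce one score per class:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.array([0.5, -1.2, 3.0])       # the input layer (3 features)

W1 = np.array([[0.2, -0.5, 1.0],     # hidden layer: 2 neurons, each with its own weights...
               [0.7,  0.1, -0.3]])
b1 = np.array([0.1, -0.2])           # ...and its own bias

W2 = np.array([[1.5, -1.0],          # output layer: one row of weights per class
               [-0.5, 2.0]])
b2 = np.array([0.0, 0.1])

hidden = relu(W1 @ x + b1)           # activation of the hidden layer
scores = W2 @ hidden + b2            # one score per class in the output layer
print(scores, "-> class", int(np.argmax(scores)))
```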
10. Markov Chain
A Markov chain consists of states and transitions. For example, let's build a Markov chain from the sentence "the quick brown fox jumps over the lazy dog". First, we treat every word as a state, and then we calculate the probabilities of the transitions between states.
These are the probabilities calculated from one single sentence. When you train the computer on massive amounts of text, you get a much bigger state-transition matrix, for example the words that can follow "the" and their corresponding probabilities.
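A minimal sketch that builds the word-to-word transition probabilities from that one sentence; with a large corpus, the same counting produces the bigger transition matrix described above:

```python
from collections import Counter, defaultdict

text = "the quick brown fox jumps over the lazy dog"
words = text.split()

# Count how often each word (state) is followed by each other word.
transitions = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    transitions[current_word][next_word] += 1

# Normalize the counts into transition probabilities for each state.
probs = {w: {nxt: c / sum(cnt.values()) for nxt, c in cnt.items()}
         for w, cnt in transitions.items()}

print(probs["the"])  # -> {'quick': 0.5, 'lazy': 0.5}
```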