“今天的文章算是一篇扫盲文,较为系统地分门别类罗列了机器学习常见的算法以及算法的概要,主要是有监督算法、无监督算法、半监督算法、强化学习算法,再而细分成Regression、Classification、Ensembling、Clustering、Association来展开介绍。
One of the most well-known and essential sub-fields of data science is machine learning. The term machine learning was first used in 1959 by IBM researcher Arthur Samuel. From there, the field of machine learning gained much interest from others, especially for its use in classifications.
When you start your journey into learning and mastering the different aspects of data science, perhaps the first sub-field you come across is machine learning. Machine learning is the name used to describe a collection of computer algorithms that can learn and improve by gathering information while they are running.
Any machine learning algorithm is built upon some data. Initially, the algorithm uses some “training data” to build an intuition of solving a specific problem. Once the algorithm passes the learning phase, it can then use the knowledge it gained to solve similar problems based on different datasets.
In general, we categorize machine learning algorithms into 4 categories:
Each of these categories is designed for a purpose; for example, supervised learning is designed to scale the training data's scope and make predictions of future or new data based on that. On the other hand, unsupervised algorithms are used to organize and filter data to make sense of it.
Under each of those categories lay various specific algorithms that are designed to perform certain tasks. This article will cover 5 basic algorithms every data scientist must know to cover machine learning basics.
Regression algorithms are supervised algorithms used to find possible relationships among different variables to understand how much the independent variables affect the dependent one.
You can think of regression analysis as an equation, For example, if I have the equation y = 2x + z
, y is my dependant variable, and x,z are the independent ones. Regression analysis finds how much do x and z affect the value of y.
The same logic applies to more advanced and complex problems. To adapt to the various problems, there are many types of regression algorithms; perhaps the top 5 are:
Classification in machine learning is the process of grouping items into categories based on a pre-categorized training dataset. Classification is considered a supervised learning algorithm.
These algorithms use the training data's categorization to calculate the likelihood that a new item will fall into one of the defined categories. A well-known example of classification algorithms is filtering incoming emails into spam or not-spam.
There are different types of classification algorithms; the top 4 ones are:
Ensembling algorithms are supervised algorithms made of combining the prediction of two or more other machine learning algorithms to produce more accurate results. Combining the results can either be done by voting or averaging the results. Voting is often used during classification and averaging during regression.
Ensembling algorithms have 3 basic types: Bagging, Boosting, and Stacking.
Clustering algorithms are a group of unsupervised algorithms used to group data points. Points within the same cluster are more similar to each other than to points in different clusters.
There are 4 types of clustering algorithms:
Association algorithms are unsupervised algorithms used to discover the probability of some items to occur together in a specific dataset. It is mostly used in the market-basket analysis.
The most used association algorithm is Apriori.
The Apriori algorithm is a mining algorithm used commonly used in transactional databases. Apriori is used to mine frequent itemsets and generate some association rules from those item sets.
For example, if a person buys milk and bread, then they are likely to also get some eggs. These insights are built upon previous purchases from various clients. Association rules are then formed according to a specific threshold for confidence set by the algorithm based on how frequently these items are brought together.
Machine learning is one of the most famous, well-researched sub-field of data science. New machine learning algorithms are always under development to reach better accuracy and faster execution.
Regardless of the algorithm, it can generally be categorized as one of four categories: supervised, unsupervised, semi-supervised, and reinforced algorithms. Each one of these categories holds many algorithms that are used for different purposes.
In this article, I have gone through 5 types of supervised/ unsupervised algorithms that every machine learning beginner should be familiar with. These algorithms are well-studied and widely-used that you only need to understand how to use it rather than how to implement it.
Most famous Python machine learning modules — such as Scikit Learn — contain a pre-defined version of most — if not all — of these algorithms.
So,
My advice is, understand the mechanic, and master the usage and start building.