
Classification and regression techniques: decision tree and knn


Lectures 12 and 13: Classification and regression techniques: decision tree and k-nearest neighbor

-understand what is meant by the task of classification and be able to identify scenarios where it is useful. Understand how it relates to data wrangling.

  • Classification
    • Definition: given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a predictive model for the class attribute as a function of the values of the other attributes. How does this differ from prediction? It is essentially the same task: classification predicts a category.
    • Useful scenarios:
      • animal classification
      • bank classifying borrower
      • detecting tax fraud/tax cheats
    • Given a collection of records (training set)
      • Each record contains a set of attributes + one class label
    • Find a predictive model for class label as a function of the values of other attributes
    • Goal: previously unseen records should be assigned a class as accurately as possible
    • Test set = used to determine the accuracy of the model
      • The given data set is divided into training & test sets
      • Training set used to build the model
      • Test set used to validate the model
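As a hedged illustration of this workflow, here is a minimal sketch using scikit-learn; the library choice and the toy records are assumptions for illustration, not part of the lecture.

```python
# Minimal sketch of the train/test workflow above (library choice and
# toy data are illustrative assumptions).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each record: a set of attributes plus one class label
X = [[25, 0], [40, 1], [35, 0], [50, 1], [23, 0], [60, 1], [30, 0], [45, 1]]
y = ["no", "yes", "no", "yes", "no", "yes", "no", "yes"]

# The given data set is divided into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model
print(accuracy_score(y_test, model.predict(X_test)))     # validate the model
```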

-understand what is meant by the task of regression and be able to identify scenarios where it is useful. Understand how it relates to data wrangling.

  • Regression
    • given a collection of records, where each record contains a set of attributes and one of the attributes is the target variable; regression uses the other attributes to fit the target variable.
    • useful scenarios:
      • predict a number / continuous value
      • predicting ice-creams consumption from temperature
      • predict activity level of a target gene
    • Similar to classification, but for continuous variables
    • Given a collection of records (training set)
      • Each record contains a set of attributes, one of which is the target variable
    • Learn predictive model from data
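A similarly hedged sketch of the regression task, fitting the ice-cream example above with a linear model; the library, the model choice and the numbers are illustrative assumptions.

```python
# Fit the target variable (consumption) from another attribute (temperature).
# Library choice and data values are illustrative assumptions.
from sklearn.linear_model import LinearRegression

temps = [[15], [20], [25], [30], [35]]   # attribute: temperature (deg C)
sales = [120, 180, 240, 310, 370]        # target variable: ice creams sold

model = LinearRegression().fit(temps, sales)  # learn predictive model from data
print(model.predict([[28]]))                  # predicted consumption at 28 deg C
```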

-understand the use of accuracy as a metric for measuring the performance of a classification method. Understand how TP, TN, FP and FN are used in the accuracy calculation. The formula for accuracy will be provided on the exam

  • Appreciate the benefits and limitations of accuracy as a performance metric for classification
    • suppose there are 2 classes, 0 and 1, with 9990 and 10 instances respectively
    • accuracy can be misleading: predicting every test case as class 0 still scores 99.9%
  • Metrics for performance evaluation
    • TP: should say yes, and we said yes.
    • FN: should say yes, but we said no.
    • FP: should say no, but we said yes.
    • TN: should say no, and we said no.
    • The first letter records whether the prediction was correct (T) or wrong (F); the second letter is the predicted class, positive (P) or negative (N).
    • Accuracy formula is on the formula sheet: Accuracy = (TP + TN) / (TP + TN + FP + FN)
    • Accuracy can be misleading here: a model may fail to detect any minority-class example in a dataset and still achieve very high accuracy. Other measures, such as recall, are therefore useful complements.
  • Divide training data into
    • Training set
    • Test set
  • Learn decision tree using the training set
  • Evaluate performance of decision tree on the test set
  • Can be summarized in a Confusion Matrix (contingency table)
    • Actual class
    • Predicted class

  • For an accurate decision tree classifier -> minimise:
    • False positives
    • False negatives
  • Limitations of accuracy
    • E.g. a 2-class problem:
      • Number of class 0 = 9990
      • Number of class 1 = 10
    • If model predicts everything to be class 0
      • Accuracy = 9990/10000 = 99.9%
      • Accuracy is misleading as model does not detect any class 1 example
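The same calculation in plain Python, treating class 1 as the positive class:

```python
# Accuracy of the predict-everything-as-class-0 model on the 9990/10 dataset,
# treating class 1 as the positive class.
TP, TN, FP, FN = 0, 9990, 0, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.999 -> 99.9%, yet TP = 0: no class 1 example is detected
```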

-understand the operation and rationale of the k nearest neighbor algorithm for classification

  • Requires
    • Set of stored records
    • Distance Metric to compute distance between records
    • Value of k -> number of nearest neighbours to retrieve
  • To classify an unknown record:
    1. Compute distance to other training records
      • Euclidean distance
      • Pearson coefficient
    2. Identify k nearest neighbours
      • data points that have the k smallest distance
    3. Use class labels of nearest neighbours to determine the class label of unknown record
      • Take the majority vote of class labels among the k nearest neighbours
      • OR weight the vote according to distance
        • Weight factor w = 1/d² (both options appear in the sketch below)
    4. Choosing the value of k
      • Too small -> sensitive to noise points
      • Too large -> neighbourhood may include points from other classes
  • K-nearest neighbors (K-NN / KNN)
    • k-nearest neighbors of a record x are data points that have the k smallest distance to x
    • process
      • compute the distance to other training records
      • identify k nearest neighbors
      • use class labels of nearest neighbors to determine the class label of unknown record
    • Advantages
      • easier to calculate than a decision tree
      • can handle datasets with complex structure
    • Disadvantages
      • hard to determine the value of k
        • if k is too small, sensitive to noise points
        • if k is too large, the neighbourhood may include points from other classes
      • classification may be slow for large datasets
      • the training data must be stored in order to make predictions
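A from-scratch sketch of the steps above, assuming Euclidean distance, majority voting and the optional w = 1/d² weighting; the function name and the toy data are mine.

```python
# From-scratch sketch of k-NN classification: Euclidean distance,
# majority vote, and optional 1/d^2 distance weighting.
import math
from collections import Counter

def knn_classify(train, unknown, k=3, weighted=False):
    """train: list of (attribute_list, class_label); unknown: attribute list."""
    # Step 1: compute the distance to every training record
    dists = [(math.dist(x, unknown), label) for x, label in train]
    # Step 2: identify the k nearest neighbours (k smallest distances)
    nearest = sorted(dists, key=lambda t: t[0])[:k]
    # Step 3: vote; optionally weight each vote by w = 1/d^2
    votes = Counter()
    for d, label in nearest:
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return votes.most_common(1)[0][0]

train = [([1, 1], "A"), ([1, 2], "A"), ([5, 5], "B"), ([6, 5], "B")]
print(knn_classify(train, [2, 2], k=3))  # -> "A"
```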

-understand the operation and rationale of the decision tree algorithm for classification

  • A flow-chart-like tree structure
    • Internal node = denotes a test on an attribute
    • Branch = an outcome of the test
    • Leaf nodes = class labels / class distribution
  • Advantage:
    • Easy to interpret
    • Relatively efficient to construct
    • Fast for making a decision about a test instance
  • Disadvantage:
    • Sometimes too simple for data with complex structure
    • For complex datasets, the tree might grow very big & not easy to understand
    • May behave strangely for some types of features (e.g. student ID)
  • Decision Tree
    • How to specify test condition
      • Depends on attribute types
        • nominal (e.g. family, sports, luxury for a car)
        • ordinal (e.g. small, medium, large for size)
        • continuous (e.g. ranged numeric data)
      • Depends on number of ways to split
        • 2-way split
        • Multi-way split
    • Splitting Based on nominal attributes
      • Multi-way split: use many partitions as distinct values
      • Binary split: divides values into two subsets. Need to find optimal partitioning.
      • how good is a split?
        • compute the impurity of the parent node (before splitting) minus the sum of the impurities of its children nodes, where each child's impurity is scaled by N(child) / N(parent), the fraction of the parent's data points that land in that child
        • the result is called the gain; the larger the gain, the better the split
  • Create the decision tree
    • Step 1: calculate the information gain of the candidate [Left Child] / [Right Child] split for each feature
    • Step 2: choose the feature + split with the highest information gain and use this as the root node and its split
    • Step 3: recurse on each child, terminating when a node consists of only Cheat = No or only Cheat = Yes (a minimal sketch of Steps 1–2 follows below)
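A minimal sketch of Steps 1–2 for a single binary candidate split, assuming entropy as the impurity measure; the toy labels echo the Cheat = Yes/No example.

```python
# Sketch of Steps 1-2: entropy as the impurity measure and the information
# gain of one candidate left/right split (toy labels are illustrative).
import math
from collections import Counter

def entropy(labels):
    """H = -sum_i p_i * log2(p_i) over the class proportions at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Gain = H(parent) - sum_j (N(v_j)/N) * H(v_j)."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["no"] * 7 + ["yes"] * 3
left, right = ["no"] * 6 + ["yes"], ["no"] + ["yes"] * 2
print(information_gain(parent, [left, right]))  # ~0.19 bits
```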

-Understand how a decision tree may be used to make predictions about the class of a test instance

  • Hunt’s algorithm
    • Dt = set of training records that reach a node t
      • If Dt contains records that all belong to the same class yt, then t is a leaf node labelled as yt
      • If Dt is an empty set, then t is a leaf node labelled by the default class yd
      • If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
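A runnable skeleton of Hunt's algorithm under these three rules; the toy multi-way split on the first attribute is an assumption for illustration, and a real learner would choose the attribute test by information gain instead.

```python
# Runnable skeleton of Hunt's algorithm as stated above, with a toy
# multi-way split on the first remaining attribute.
def hunts(Dt, default):
    labels = [label for _, label in Dt]
    if not Dt:                      # Dt empty -> leaf labelled with default yd
        return ("leaf", default)
    if len(set(labels)) == 1:       # all records in one class yt -> leaf yt
        return ("leaf", labels[0])
    # more than one class: apply an attribute test, recurse on each subset
    subsets = {}
    for rec, label in Dt:
        subsets.setdefault(rec[0], []).append((rec[1:], label))
    majority = max(set(labels), key=labels.count)
    return ("split on attr", {v: hunts(s, majority) for v, s in subsets.items()})

data = [(("yes", "single"), "cheat"),
        (("no", "married"), "no-cheat"),
        (("no", "single"), "no-cheat")]
print(hunts(data, "no-cheat"))
```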

-Understand the key steps in building a decision tree. How to split the instances, how to specify the attribute test condition, how to determine the best split and how to decide when to stop splitting

  • Specify test condition
    • Depends on attribute types
      • Nominal
      • Ordinal
      • Continuous
    • Depends on number of ways to split
      • 2-way split
      • Multi-way split

  • How the test condition is applied, by attribute type and number of ways to split:
    • 2-way (binary) split
      • Nominal / ordinal: divides the values into two subsets; need to find the optimal partitioning
      • Continuous: binary decision, (A < v) or (A >= v); consider all possible splits and find the best cut; can be more compute intensive
    • Multi-way split
      • Nominal / ordinal: use as many partitions as there are distinct values
      • Continuous: discretisation to form an ordinal categorical attribute
        • Static: discretise once at the beginning
        • Dynamic: ranges can be found by equal interval / equal frequency / clustering
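For the continuous case, a hedged sketch of "consider all possible splits and find the best cut": scan candidate thresholds v and keep the (A < v) / (A >= v) split with the lowest weighted child entropy. The data and function names are illustrative assumptions.

```python
# Scan candidate cut points v for a continuous attribute and keep the
# binary split (A < v) vs (A >= v) with the lowest weighted child entropy.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_w, best_v = float("inf"), None
    for i in range(1, len(pairs)):           # midpoints between sorted values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for a, l in pairs if a < v]
        right = [l for a, l in pairs if a >= v]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if w < best_w:
            best_w, best_v = w, v
    return best_v

print(best_cut([60, 70, 75, 85, 90, 95],
               ["no", "no", "no", "yes", "yes", "yes"]))  # -> 80.0
```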

  • Determine best split
    • Nodes with homogeneous class distribution are preferred
  • Entropy: measure of node impurity
    • Low entropy = low uncertainty & high purity
    • High entropy = high uncertainty & low purity
    • Measures homogeneity of a node
      • Maximum -> records are equally distributed among all classes
      • Minimum -> all records belong to one class
  • How good is a split?
    • Compare the entropy of the parent node (before splitting) with the entropy of the children nodes (after splitting)
    • Information gain = H(parent) − Σj [N(vj)/N] × H(vj), where
      • H(vj) = impurity measure of child node vj
      • j = child node index
      • N(vj) = number of data points in child node vj
      • N = number of data points in the parent node
  • The information gain is the mutual information between the class feature & the feature being split on
  • Splitting using the information gain therefore chooses the feature with the highest information shared with the class variable
  • Compute the gain of all candidate splits & choose the largest gain
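As a quick worked check of this formula (the numbers are my own): a parent node with 4 "yes" and 4 "no" records has H(parent) = 1.0 bit. Splitting it into children (3 yes, 1 no) and (1 yes, 3 no) gives H ≈ 0.811 for each child; the weighted sum is (4/8)(0.811) + (4/8)(0.811) = 0.811, so the information gain is 1.0 - 0.811 ≈ 0.189 bits.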

  • When to stop
    • If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
    • If Dt is an empty set, then t is a leaf node labeled by the default class, yd

-understand the advantages and disadvantages of using k nearest neighbor or decision tree for classification and the reasons for these advantages and disadvantages

  • KNN-Classification
    • Advantage:
      • The cost of the learning process is zero
      • No assumptions about the characteristics of the concepts to be learned have to be made
      • Complex concepts can be learned by local approximation using simple procedures
    • Disadvantages:
      • Outlier sensitivity: K-NN is very sensitive to outliers, as it chooses the neighbours purely on a distance criterion.
      • Missing-value treatment: K-NN inherently has no capability for dealing with missing values.
      • Slow algorithm: K-NN may be very easy to implement, but as the dataset grows the efficiency or speed of the algorithm declines very fast.
      • Curse of dimensionality: K-NN works well with a small number of input variables, but as the number of variables grows it struggles to predict the output of a new data point.
      • K-NN needs homogeneous features: if you build K-NN using a common distance such as Euclidean or Manhattan, the features must share the same scale, since absolute differences in features are weighted equally, i.e. a given distance in feature 1 must mean the same as in feature 2.
      • Optimal number of neighbours: one of the biggest issues with K-NN is choosing the optimal number of neighbours to consider when classifying a new data entry.
      • Imbalanced data causes problems: K-NN does not perform well on imbalanced data. If we consider two classes, A and B, and the majority of the training data is labelled A, the model will give a lot of preference to A, which may result in the less common class B being wrongly classified.
  • Decision Tree Classification
    • Advantage:
      1. Decision Trees are easy to explain. It results in a set of rules.
      2. It follows the same approach as humans generally follow while making decisions.
      3. Interpretation of a complex decision tree model can be simplified through visualisation; even a non-expert can follow the logic.
      4. The number of hyper-parameters to be tuned is almost zero.
    • Disadvantages:
      1. Decision trees have a high probability of overfitting.
      2. They often give lower prediction accuracy for a dataset compared to other machine-learning algorithms.
      3. Information gain in a decision tree with categorical variables is biased towards attributes with a greater number of categories.
      4. Calculations can become complex when there are many class labels.

-understand the use of entropy as a node impurity measure for decision tree node splitting. Understand the benefits of entropy for this task and why it is effective for assessing the goodness of a split

  • Entropy
    • We have seen entropy in the feature correlation section, where it was used to measure the amount of uncertainty in an outcome
    • Entropy can also be viewed as an impurity measure
    • The set {A,B,C,A,A,A,A,A} has low entropy: low uncertainty and high purity
    • The set {A,B,C,D,B,E,A,F} has high entropy: high uncertainty and low purity
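A quick check of these two claims in Python (base-2 logarithms; the numeric values are my own arithmetic):

```python
# Entropy of the two example sets above, using base-2 logarithms.
import math
from collections import Counter

def entropy(items):
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())

print(entropy(list("ABCAAAAA")))  # ~1.06 bits: low uncertainty, high purity
print(entropy(list("ABCDBEAF")))  # 2.5 bits: high uncertainty, low purity
```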

-appreciate why it is necessary to separate the dataset into training and testing in order to fairly evaluate the performance of a classifier

  • Why we need to split the dataset into training set and test set
    • Learn the decision tree using the training set and evaluate its performance on the held-out test set. Because the test records were not used in training, this gives an unbiased evaluation.

-appreciate that when the distribution of the testing data is quite different from the distribution of the training data, the accuracy of the classifier may be degraded

  • Patterns learned from the training data may no longer hold for test data drawn from a different distribution, so the measured accuracy can drop.

-appreciate the benefits and limitations of accuracy as a performance metric for classification

  • See the limitations-of-accuracy example above: with 9990 class 0 and 10 class 1 records, predicting everything as class 0 yields 99.9% accuracy while detecting no class 1 example.
