Created: Nov 2, 2021 06:10 PM
Topics: Random Forest
Decision Trees
Motivation and intuition behind trees
Composition
- Node: tests a single attribute
- Branch: one for each value of the attribute
- Leaf: assigns a class to the sample
Function Class
A decision tree can represent any function of the input attributes
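For instance, a depth-2 tree already captures XOR of two binary inputs (one path per input combination). A tiny illustrative sketch (the function name and data are made up, not from the notes):

```python
# A depth-2 "tree" written as nested ifs: root tests x1, children test x2.
def xor_tree(x1, x2):
    if x1 == 0:
        return 1 if x2 == 1 else 0
    else:
        return 0 if x2 == 1 else 1

# The tree reproduces XOR exactly on all four inputs.
assert [xor_tree(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```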
What Makes a Good Decision Tree
Principle of Occam’s Razor: the simpler the better
- Learning the simplest & smallest decision tree ⇒ NP-complete problem (exponential computational complexity, hard to solve)
The Greedy Learning Algorithm
- Start with an empty tree
- Select the best possible attribute to split on
- Generate child nodes: one for each value
- Partition samples according to their values & assign to appropriate child node
- Recurse: repeat for each child node; return if all samples at a node are in the same class
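A minimal sketch of this greedy recursion for categorical features; the helper names (`entropy`, `information_gain`, `build_tree`) and the toy data are illustrative assumptions, not from the notes:

```python
from collections import Counter
import math

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y) over the labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Decrease in entropy from splitting on attribute `attr`."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - cond

def build_tree(rows, labels, attrs):
    # Return a leaf if all samples share a class (or no attributes remain).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Select the best possible attribute to split on (highest information gain).
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {"attr": best, "children": {}}
    # One child per value; partition the samples and recurse on each child.
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return tree

# Toy usage: two categorical attributes, binary class.
rows = [{"outlook": "sunny", "wind": "weak"},
        {"outlook": "sunny", "wind": "strong"},
        {"outlook": "rain",  "wind": "weak"},
        {"outlook": "rain",  "wind": "strong"}]
labels = ["yes", "no", "yes", "no"]
print(build_tree(rows, labels, ["outlook", "wind"]))
```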
Choosing the Best Variable to Split
Pick the variable that gives more certainty about the classification
- Good split: samples of different classes go to different child nodes
- Bad split: an equal number of samples from each class falls into every child node
Theorizing Choosing the Best Variable
Entropy
The expected number of bits needed to encode a randomly drawn value of X under the most efficient code
High: uniform-like distribution, less predictable
Low: has peaks, more predictable
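In standard notation (writing p(x) for P(X = x) and using base-2 logs; the symbols are assumed):

$$H(X) = -\sum_{x} p(x)\,\log_2 p(x)$$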
Conditional Entropy
Entropy of the class label Y conditioned on knowing the value of a feature X
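With the same assumed notation:

$$H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) = -\sum_{x} p(x) \sum_{y} p(y \mid x)\,\log_2 p(y \mid x)$$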
Information Gain
Decrease in entropy of the class label after splitting on feature X_i
Choose a variable that maximizes information gain
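With Y the class label and X_i a candidate feature (assumed notation), the gain and the selection rule are:

$$IG(X_i) = H(Y) - H(Y \mid X_i), \qquad i^{*} = \arg\max_i \, IG(X_i)$$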
Know When to Stop
- Stopping early biases the search toward a simpler tree
- If we stop too late: every sample ends up in its own leaf (overfitting)
Case When Information Gain Is Zero
No single split improves the information gain on its own (e.g., XOR-like data), so introduce an over-split hyperparameter
- Over-split: randomly split & explore a few more levels
- Then prune unused nodes
Choosing the Best Decision Tree
Decision trees will overfit!
- Standard (fully grown) D-Tree: bias = 0 (it can fit the training data exactly)
- High variance: a slight change in the data → a hugely different tree
Heuristics Toward Obtaining Simpler Trees
- Lower bounding # of samples per leaf node
- Upper bounding depth
- Ensemble Learning (Random Forests)
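A minimal scikit-learn sketch of the first two heuristics (assuming sklearn is available; the hyperparameter values are arbitrary illustrations, and the ensemble case is shown under Ensemble Learning below):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Upper-bound the depth and lower-bound the number of samples per leaf node.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```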
Continuous Variable Decision Tree
Decision Stumps
Determine a threshold t (a decision stump) to split on for a continuous variable X_j
- Branch 1: X_j < t
- Branch 2: X_j ≥ t
Can allow repeated splits on the same variable w/ different thresholds
Choosing t
❌ search through all possible values
✔️ Sort data
- Consider candidate splits between consecutive sorted values (e.g., at their midpoints)
- Only split at the t with the highest IG
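A minimal sketch of this threshold search: sort by the feature, try midpoints between consecutive distinct values, keep the t with the highest information gain. Function names and the toy data are assumptions:

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))          # sort data by the feature value
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    base, n = entropy(ys), len(ys)
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                             # no boundary between equal values
        t = (xs[i] + xs[i - 1]) / 2              # candidate split point (midpoint)
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy usage: one continuous feature, binary labels -> best split at t = 2.5.
print(best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))
```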
Formulas
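One standard way to write the objective for a stump split on a continuous feature X_j at threshold t (this notation is an assumption):

$$IG(X_j, t) = H(Y) - \big[\,P(X_j < t)\,H(Y \mid X_j < t) + P(X_j \ge t)\,H(Y \mid X_j \ge t)\,\big]$$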
Interpretation of Decision Trees
- May be readable, but the splits may not follow human logic
- Need to use heuristics to avoid overfitting (depth limit, leaf bin size threshold, etc.)
- Very high variance: a small change in data will result in a completely different tree
Ensemble Learning
Reduce variance in models by:
- Training multiple models on the same data set
- Averaging their predictions
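A minimal sketch of this averaging idea with scikit-learn (assuming sklearn is available): RandomForestClassifier trains many trees, each on a bootstrap resample of the data, and averages their predictions internally. The n_estimators value is an arbitrary illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Ensemble of 100 trees; predictions are averaged across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))   # accuracy of the averaged ensemble
```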