Reducing Overfitting and Complexity of Decision Trees by Limiting Max-Depth and Pruning

By: Edward Krueger, Sheetal Bongale and Douglas Franklin.

Photo by Ales Krivec on Unsplash

In another article, we discussed basic concepts around decision trees or CART algorithms and the advantages and limitations of using a decision tree in Regression or Classification problems.

Read more on that here:

For a video introduction on Decision Trees, check out this 8-minute lesson:

In this article, we are going to focus on:

  • Overfitting in decision trees
  • How limiting maximum depth can prevent overfitting in decision trees
  • How cost-complexity pruning can prevent overfitting in decision trees
  • Implementing a full tree, a limited max-depth tree and a pruned tree in Python
  • The advantages and…


Boost the performance of recursive functions with memoization

By: Edward Krueger and Douglas Franklin.

Photo by Florian Krumm on Unsplash

What is Memoization?

Memoization is a type of caching that stores the results of a deterministic function. More specifically, memoization is an optimization technique that speeds up programs by storing the results of function calls and returning the cached result when the same inputs occur again.

In other words, memoization prevents a program from running the same calculation twice.
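As a minimal sketch of this idea (not the article's slow_func.py), the example below wraps a deliberately slow function with functools.lru_cache from the standard library. The names slow_square and cached_square and the 0.05 second delay are assumptions made for illustration.

```python
import time
from functools import lru_cache


def slow_square(n):
    """Deterministic but artificially slow (hypothetical example function)."""
    time.sleep(0.05)  # stand-in for an expensive calculation
    return n * n


@lru_cache(maxsize=None)  # unbounded cache keyed on the call arguments
def cached_square(n):
    return slow_square(n)


# First call: the result is computed and stored in the cache.
start = time.perf_counter()
first = cached_square(12)
first_elapsed = time.perf_counter() - start

# Second call with the same input: the stored result is returned.
start = time.perf_counter()
second = cached_square(12)
second_elapsed = time.perf_counter() - start

print(first, second)
print(f"first call: {first_elapsed:.4f}s, repeat call: {second_elapsed:.4f}s")
print(cached_square.cache_info())
```

The repeat call skips slow_square entirely, which is why memoization is only safe for deterministic functions: a cached result must be the same value a fresh call would have produced.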

Let’s see this behavior with an artificially slow Python function.

slow_func.py

When we run slow_func.py we get the following output:


An explanation of decision trees with an example in Python

By: Edward Krueger, Sheetal Bongale and Douglas Franklin.

Photo by Danka & Peter on Unsplash

What is a Decision Tree?

A decision tree is a supervised machine learning algorithm that can be used for regression and classification problems. A decision tree follows a set of nested if-else conditions to make predictions.
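To make the "nested if-else" description concrete, here is a hand-written sketch of what a small fitted classification tree amounts to at prediction time. The features, thresholds and class names are invented for illustration; a real tree learns its thresholds from training data.

```python
def predict_fruit(weight_g, is_smooth):
    """Hand-written stand-in for a depth-2 decision tree (hypothetical values)."""
    if weight_g <= 150:        # root node: split on weight
        if is_smooth:          # left child: split on skin texture
            return "plum"
        return "kiwi"
    else:
        if is_smooth:          # right child: split on skin texture
            return "apple"
        return "orange"


print(predict_fruit(100, True))
print(predict_fruit(220, False))
```

Each call follows exactly one path from the root to a leaf, and the leaf determines the prediction.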

Since decision trees can be used for both classification and regression, the algorithm used to grow them is often called CART (Classification and Regression Trees). There is no single decision tree algorithm. Multiple algorithms have been proposed to build decision trees, but we will focus on the CART algorithm used in scikit-learn.

Decision trees are binary trees where each node represents a…


Great point!

It is a common mistake to diagnose a model as "overfitting the data" simply by comparing a metric in training versus the same metric in testing. Some models are just designed in a way such that they tend to have high train accuracy.

You make the point for Random Forest very clearly in your graphs.


This is one case where understanding how a model works would lead to one fewer rabbit hole. Depending on your specific implementation, a decision tree evaluates all possible splits or a subset of them. However, for any given split, the feature's scale doesn't make any difference in computing the gain in Gini (or entropy or misclassification rate). So the encoders should produce equivalent results.

Of course, since random forests are random, you get a little fluctuation.
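The scale argument can be checked with a small pure-Python sketch. It is an illustration only, not scikit-learn's implementation: it computes the best Gini gain over all threshold splits of a single feature, then repeats the search after an order-preserving re-encoding. The data and the helper names gini and best_gini_gain are made up for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_gini_gain(values, labels):
    """Largest Gini gain over all threshold splits of one feature."""
    n = len(labels)
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best


sizes = [1, 2, 3, 4, 5, 6]               # hypothetical encoded feature
labels = ["a", "a", "b", "a", "b", "b"]  # hypothetical classes
rescaled = [10 * v + 7 for v in sizes]   # same ordering, different scale

print(best_gini_gain(sizes, labels))
print(best_gini_gain(rescaled, labels))
```

Both searches consider the same partitions of the rows, so the gains match. Only the ordering of the feature values affects which splits are available, not their magnitude.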


Simple vs. Complex models for predicting fish mass

By: Edward Krueger and Douglas Franklin.

Photo by Mike Swigunski on Unsplash

This article will discuss a data science competition we did with one of our classes. We will discuss the five best-scoring models and their complexity.

Introduction

The challenge is to create a machine learning model that predicts fish weight. The student whose model has the lowest mean-squared error (MSE) will be declared the winner!
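For reference, the scoring metric can be sketched in a few lines. The weights and predictions below are invented numbers, not data from the competition, and the function name is our own.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical fish weights in grams and two hypothetical sets of predictions.
actual_weights = [200.0, 450.0, 1000.0]
model_a = [210.0, 440.0, 990.0]
model_b = [250.0, 400.0, 900.0]

mse_a = mean_squared_error(actual_weights, model_a)
mse_b = mean_squared_error(actual_weights, model_b)
print(mse_a, mse_b)
```

Because errors are squared, a single large miss is penalized far more heavily than several small ones, so the model with the lower MSE here is the one whose predictions stay consistently close.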

The Challenge

Hello! Welcome to the famous Tsukiji fish market of Tokyo, Japan! We came here to collect data on the fish they have, but we didn’t wake up at 5 am for the tuna auction. By the time we showed up, there…


How to write flexible, reusable decorators in Python

By: Edward Krueger and Douglas Franklin.

Photo by Macau Photo Agency on Unsplash

Introduction

We begin with some background on functional programming concepts and a discussion of timing and tracing.

Next, we illustrate the decorator pattern and its syntax with two examples, tracefunc and timefunc. To do this, we use the Python libraries functools and time.

Then we move to a deeper discussion of functools.wraps and how it preserves the metadata of a decorated function. Lastly, we show this preservation with some examples.
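The sketch below gives a rough idea of what such a timing decorator and the metadata preservation might look like. It is not the article's code: the body of timefunc, the comparison decorator timefunc_no_wraps, and the functions add and sub are assumptions made for illustration.

```python
import functools
import time


def timefunc(func):
    """Report how long each call to func takes."""
    @functools.wraps(func)  # copy func's name, docstring, etc. onto wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f}s")
        return result
    return wrapper


def timefunc_no_wraps(func):
    """Same pattern without functools.wraps, for comparison only."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@timefunc
def add(a, b):
    """Return a + b."""
    return a + b


@timefunc_no_wraps
def sub(a, b):
    """Return a - b."""
    return a - b


print(add.__name__, add.__doc__)  # metadata of the original function survives
print(sub.__name__, sub.__doc__)  # metadata of the inner wrapper shows instead
```

Without functools.wraps, the decorated function reports the closure's name and has no docstring, which makes traces and help() output harder to read.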

Why One Might Trace and Time a Function

Tracing is recording the inputs and outputs of functions as the program runs. Experienced programmers use tracing to troubleshoot programs, often as a substitute for…


How these powerful coding tools allow function flexibility

By: Edward Krueger and Douglas Franklin.

Photo by Michael Dziedzic on Unsplash

Introduction

We’ve all heard of arguments and keyword arguments (args and kwargs) when discussing Python functions. Positional arguments are matched to parameters by their order, while keyword arguments, as the name suggests, are matched by name. When writing functions, *args and **kwargs are often written directly into the function definition.

This function can handle any number of args and kwargs because of the asterisk(s) used in the function definition. These asterisks are packing and unpacking operators.
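As a small sketch of packing and unpacking (with a hypothetical function name and made-up values, not the article's original file):

```python
def describe(*args, **kwargs):
    """Pack positional arguments into a tuple and keyword arguments into a dict."""
    return args, kwargs


# Packing: the call site passes any mix of positional and keyword arguments.
positional, named = describe(1, "two", unit="kg", precise=True)
print(positional)
print(named)

# Unpacking: * spreads a sequence into positional arguments,
# ** spreads a dict into keyword arguments.
values = [3, 4]
options = {"sep": "-"}
unpacked = describe(*values, **options)
print(unpacked)
```

Inside the function, args is always a tuple and kwargs is always a dict, regardless of how many arguments the caller supplied.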

When this file is run, the following output is generated.


Debug your code with pure functions and a tracer

By: Edward Krueger and Douglas Franklin.

Photo by Fotis Fotopoulos on Unsplash

Introduction

Our goal is to have a codebase of pure functions that we can decorate with a tracer. By applying this decorator to pure functions, we can debug code without using a cumbersome debugger. This reduces developer pain as debuggers are often tedious and difficult to work with.
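A brief sketch of why purity matters for tracing, using hypothetical functions of our own: a pure function's result depends only on its arguments, so a record of inputs and output fully describes the call. An impure function can return different results for the same inputs, so the same record can be misleading.

```python
def scale(values, factor):
    """Pure: the result depends only on the arguments, and nothing is modified."""
    return [v * factor for v in values]


FACTOR = 2  # hidden state read by the impure version below


def scale_impure(values):
    """Impure: the result also depends on a module-level variable."""
    return [v * FACTOR for v in values]


before = scale_impure([1, 2, 3])
FACTOR = 10                       # state changes somewhere else in the program
after = scale_impure([1, 2, 3])   # same input, different output

print(scale([1, 2, 3], 2), scale([1, 2, 3], 2))  # identical every time
print(before, after)
```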

What are Pure Functions?

Strictly speaking, in a functional programming paradigm, all functions are pure functions. What are they, how can we code them and why are they useful?

Before we get into that, let's review mathematical functions.

Mathematical functions

Mathematical functions, such as cos(x), return a single value. Give it an x


Use a decorator to trace your functions

By: Edward Krueger and Douglas Franklin

Photo by Maksym Kaharlytskyi on Unsplash

Introduction

Our goal is to create a reusable way to trace functions in Python. We do this by coding a decorator with Python's functools library. This decorator will then be applied to the functions whose calls we want to trace.

Tracing Decorator: @tracefunc

The code below represents a common decorator pattern that has a reusable and flexible structure. Notice the placement of functools.wraps. It is a decorator for our closure. This decorator preserves func’s metadata as it is passed to the closure.
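For readers without access to the original file, here is a minimal sketch of this pattern. It is an assumption about the general shape, not a copy of the article's tracer.py, so its line numbers and output format may differ; the traced function power is hypothetical.

```python
import functools


def tracefunc(func):
    """Print the inputs and output of every call to func."""
    @functools.wraps(func)  # decorates the closure so it keeps func's metadata
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(f"{func.__name__}(args={args!r}, kwargs={kwargs!r}) -> {result!r}")
        return result
    return wrapper


@tracefunc
def power(base, exponent=2):
    """Raise base to exponent."""
    return base ** exponent


value = power(3, exponent=3)
# prints: power(args=(3,), kwargs={'exponent': 3}) -> 27
```

Because the closure accepts *args and **kwargs and forwards them unchanged, the same decorator can be applied to functions with any signature.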

tracer.py

If we did not use functools.wraps to decorate our closure on line 7, the wrong name…

Edward Krueger

Data Scientist, Software Developer and Educator
