In supervised machine learning, computers put data through a model to make predictions. Today, you'll see how decision trees can identify cancer cells.

Understanding the data

Physicians diagnose cancer by analyzing suspect cells after a biopsy. Researchers at the University of Wisconsin quantified biopsy images so that computers could analyze them too.

What humans see

Under a microscope, cancer looks "primitive and aggressive," a chaotic agglomeration of cells with irregularly shaped, sized, and patterned nuclei.

What computers "see"

Researchers quantified ten characteristics of cell nuclei in breast-cancer biopsy images.

Radius

Perimeter

Area

Texture

Smoothness

Compactness

Concavity

Concave Points

Symmetry

Fractal Dimensions

Identifying Features

For each biopsy, the researchers calculated every attribute's average, standard error, and highest values. So, the data has 30 features (a.k.a. predictors, variables).
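
The count of 30 comes from crossing the ten characteristics with the three summary statistics. A minimal sketch of that arithmetic; the feature names built here are illustrative and may not match the column names in the actual data file:

```python
from itertools import product

# The ten nucleus characteristics listed above.
characteristics = [
    "radius", "perimeter", "area", "texture", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Three summaries computed per biopsy (labels are illustrative).
summaries = ["average", "standard error", "largest"]

feature_names = [f"{s} {c}" for c, s in product(characteristics, summaries)]
print(len(feature_names))  # 30
```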

Assessing their value add

Computers prioritize features that contribute the most information to the model. Decision trees do this by analyzing the distribution of classes of observations.

An example

Of the 569 biopsies in the data set, the largest radius ranges from approximately 8 to 32 μm.

The cells in one class of data, benign, tend to be smaller...

…whereas those in the other class, malignant, tend to be larger.

These two classes of data have different distributions, which means the largest radius could be useful for cancer diagnosis.

Seeing all predictors

A computer will analyze all 30 features in this way.

Features with less overlap provide the model with more information.
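
One way to put a number on "overlap" is to bin both classes on the same axis and add up the smaller share in each bin. The sketch below is only an illustration: `histogram_overlap` is a hypothetical helper, the radius values are made up, and the article's histograms may be built differently.

```python
def histogram_overlap(a, b, lo, hi, bins=6):
    """Shared area of two normalized histograms: 0 = disjoint, 1 = identical."""
    width = (hi - lo) / bins

    def shares(values):
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)  # clamp hi into last bin
            counts[i] += 1
        return [c / len(values) for c in counts]

    return sum(min(p, q) for p, q in zip(shares(a), shares(b)))


# Made-up largest-radius values (μm), not taken from the real data set.
benign = [10.5, 11.2, 12.0, 12.8, 13.1, 13.9, 14.6, 15.3]
malignant = [14.9, 16.8, 18.2, 19.5, 21.0, 22.4, 24.7, 27.9]

overlap = histogram_overlap(benign, malignant, lo=8, hi=32)
print(overlap)  # 0.125: the two classes share little of the axis
```

A feature whose classes overlap almost completely would score close to 1 and tell the model very little.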

Finding forks

While building a decision tree, computers divide the data points into homogeneous groups.

Picking the splits

The computer must find forks ("if-then" statements) that split the data into branches.

A majority vote in each branch determines a biopsy's classification.
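
A single fork with a majority vote on each side can be sketched as follows. The threshold, the helper names, and the data are all assumptions made for illustration; the sketch also assumes the threshold leaves at least one sample on each side.

```python
from collections import Counter


def majority(labels):
    """Most common label in a branch."""
    return Counter(labels).most_common(1)[0][0]


def fit_fork(values, labels, threshold):
    """Return the majority label of the (left, right) branches."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    return majority(left), majority(right)


def predict(value, threshold, votes):
    # The fork itself: a single if-then statement.
    return votes[0] if value <= threshold else votes[1]


# Made-up largest-radius values (μm) and diagnoses.
radii = [11.0, 12.5, 13.0, 14.2, 15.8, 17.5, 19.0, 21.3, 24.0]
diagnoses = ["benign", "benign", "benign", "benign", "malignant",
             "benign", "malignant", "malignant", "malignant"]

votes = fit_fork(radii, diagnoses, threshold=15.0)
print(votes)                       # ('benign', 'malignant')
print(predict(17.5, 15.0, votes))  # 'malignant', although that sample is benign
```

The last line shows the cost of a majority vote: a benign sample that lands in a mostly malignant branch is misclassified.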

Finding the best split point requires making trade-offs.

A split point that captures every malignant sample has many false positives.

Total Error
39.7%

However, a split point that avoids all false positives has many false negatives.

Total Error
20.2%
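
The trade-off can be made concrete by counting both kinds of mistake at different split points. This is a sketch on made-up values, so its counts will not reproduce the error percentages shown in the article; `error_counts` is a hypothetical helper that calls a sample malignant when its value exceeds the threshold.

```python
def error_counts(values, labels, threshold):
    """Return (false positives, false negatives) for 'malignant if value > threshold'."""
    false_pos = sum(v > threshold and lab == "benign"
                    for v, lab in zip(values, labels))
    false_neg = sum(v <= threshold and lab == "malignant"
                    for v, lab in zip(values, labels))
    return false_pos, false_neg


# Made-up largest-radius values (μm): 7 benign, 6 malignant.
radii = [11.0, 12.5, 13.0, 13.5, 13.8, 14.5, 15.0,
         16.0, 17.0, 18.0, 19.0, 21.3, 24.0]
diagnoses = ["benign", "benign", "benign", "malignant", "benign", "benign",
             "benign", "malignant", "malignant", "benign", "malignant",
             "malignant", "malignant"]

for threshold in (13.2, 18.5, 15.5):
    fp, fn = error_counts(radii, diagnoses, threshold)
    print(threshold, fp, fn, round((fp + fn) / len(radii), 3))
# 13.2 catches every malignant sample: 4 false positives, 0 false negatives
# 18.5 avoids every false positive:    0 false positives, 3 false negatives
# 15.5 sits in between:                1 false positive,  1 false negative
```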

At the best split, both branches are as homogeneous as possible. Computers find this using math (like the Gini index).

Total Error
7.7%
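
One common homogeneity measure for this search is Gini impurity, which is 0 for a branch containing a single class. The sketch below tries the midpoint between each pair of neighboring values and keeps the one with the lowest size-weighted impurity. It is a simplified illustration on made-up data with one feature, not the exact procedure behind the article's numbers.

```python
def gini(labels):
    """Gini impurity of a group: 0.0 when every label is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the most homogeneous split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up largest-radius values (μm) with one benign sample among the malignant ones.
radii = [11.0, 12.5, 13.0, 14.0, 15.0, 16.0, 17.5, 19.0, 21.5]
diagnoses = ["benign", "benign", "benign", "malignant", "benign",
             "malignant", "malignant", "malignant", "malignant"]

threshold, impurity = best_split(radii, diagnoses)
print(threshold, round(impurity, 4))  # 15.5 0.1778
```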

Combining forks

Adding forks can improve a tree's prediction accuracy. A tree with one fork is called a stump. One with many is called a bushy tree.

Tree Depth	Total Error
1	7.7%
2	5.2%
3	2.4%
4	1.1%
5	0.6%
6	0%
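
The same pattern, error shrinking as forks are added, shows up in a toy tree grown on a single made-up feature. This is a minimal sketch, assuming greedy splits chosen by Gini impurity and a majority vote in each leaf. The error is measured on the same samples the tree was built from, and the numbers are unrelated to the table above.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build(rows, depth):
    """rows: list of (value, label). Returns a nested dict, or a label for a leaf."""
    labels = [lab for _, lab in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    rows = sorted(rows)
    best = None
    for i in range(1, len(rows)):
        if rows[i - 1][0] == rows[i][0]:
            continue
        left, right = rows[:i], rows[i:]
        score = (len(left) * gini([lab for _, lab in left])
                 + len(right) * gini([lab for _, lab in right]))
        if best is None or score < best[0]:
            best = (score, (rows[i - 1][0] + rows[i][0]) / 2, left, right)
    if best is None:  # all values identical: cannot split further
        return majority
    _, threshold, left, right = best
    return {"threshold": threshold,
            "left": build(left, depth - 1),
            "right": build(right, depth - 1)}


def predict(tree, value):
    while isinstance(tree, dict):
        tree = tree["left"] if value <= tree["threshold"] else tree["right"]
    return tree


def training_error(rows, depth):
    tree = build(rows, depth)
    wrong = sum(predict(tree, v) != lab for v, lab in rows)
    return wrong / len(rows)


# Made-up (largest radius, diagnosis) pairs.
rows = [(11.0, "benign"), (12.5, "benign"), (13.0, "benign"),
        (14.0, "malignant"), (15.0, "benign"), (16.0, "malignant"),
        (17.5, "malignant"), (19.0, "malignant"), (21.5, "malignant")]

errors = [training_error(rows, depth) for depth in range(1, 5)]
print([round(e, 3) for e in errors])  # never increases; hits 0.0 once the tree is deep enough
```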

Too perfect?

A 0% error rate is indeed too good to be true. In our next installment, you'll learn about training & test errors, the trouble with trees, and great alternatives.

Check out the data here

R2D3

R2D3 is an experiment in expressing statistical thinking with interactive design. Find us at @r2d3us.

Questions? Check out the FAQs.

Stephanie interprets R2

Stephanie is currently at Netflix (and hiring!). In the past, she's been at Stitch Fix, Cardiogram, Sift Science, Google, Bain & Company, and Vector Capital. She has an MS in Statistics from Stanford.

Find Stephanie: LinkedIn Twitter Email

Tony visualizes with D3

Tony is a product designer at Tecton.ai, where he works on UX for Data Scientists and Data Engineers. Prior to Tecton, Tony worked at Facebook AI, Noodle Analytics, H2O, and Sift Science. He holds an MFA in Interaction Design from the School of Visual Arts in New York City, where he tried to change Congress with a fancy infographic.

Find Tony: Portfolio Twitter Blog LinkedIn Email