Machine Learning with
Spark™ and Python®
Essential Techniques for
Predictive Analytics
Second Edition
Michael Bowles
Machine Learning with Spark™ and Python®: Essential Techniques for Predictive Analytics, Second Edition
Published by
John Wiley & Sons, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2020 by John Wiley & Sons, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978‐1‐119‐56193‐4
ISBN: 978‐1‐119‐56201‐6 (ebk)
ISBN: 978‐1‐119‐56195‐8 (ebk)
Manufactured in the United States of America
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted
under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission
of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clear-
ance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 646‐8600. Requests to the
Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111
River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or war-
ranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all
warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be
created or extended by sales or promotional materials. The advice and strategies contained herein may not
be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in
rendering legal, accounting, or other professional services. If professional assistance is required, the services
of a competent professional person should be sought. Neither the publisher nor the author shall be liable for
damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation
and/or a potential source of further information does not mean that the author or the publisher endorses
the information the organization or website may provide or recommendations it may make. Further, readers
should be aware that Internet websites listed in this work may have changed or disappeared between when
this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department
within the United States at (877) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley publishes in a variety of print and electronic formats and by print‐on‐demand. Some material included
with standard print versions of this book may not be included in e‐books or in print‐on‐demand. If this book
refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2019940771
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc.
and/or its affiliates, in the United States and other countries, and may not be used without written permis-
sion. Spark is a trademark of the Apache Software Foundation, Inc. Python is a registered trademark of the
Python Software Foundation. All other trademarks are the property of their respective owners. John Wiley &
Sons, Inc. is not associated with any product or vendor mentioned in this book.
I dedicate this book to my expanding family of children and grandchildren, Scott,
Seth, Cayley, Rees, and Lia. Being included in their lives is a constant source of joy for
me. I hope it makes them smile to see their names in print. I also dedicate it to my
close friend Dave, whose friendship remains steadfast in spite of my best efforts.
I hope this makes him smile too.
Mike Bowles, Silicon Valley 2019
About the Author
Dr. Michael Bowles (Mike) holds bachelor’s and master’s degrees in mechanical
engineering, an ScD in instrumentation, and an MBA. He has worked in aca-
demia, technology, and business. Mike currently works with companies where
artificial intelligence or machine learning are integral to success. He serves var-
iously as part of the management team, a consultant, or advisor. He also teaches
machine learning courses at UC Berkeley and Hacker Dojo, a co-working space
and startup incubator in Mountain View, CA.
Mike was born in Oklahoma and took his bachelor's and master's degrees
there, then after a stint in Southeast Asia went to Cambridge for his ScD and the C.
Stark Draper Chair at MIT after graduation. Mike left Boston to work on com-
munications satellites at Hughes Aircraft Company in Southern California, and
then after completing an MBA at UCLA moved to the San Francisco Bay Area
to take roles as founder and CEO of two successful venture-backed startups.
Mike remains actively involved in technical and startup-related work. Recent
projects include the use of machine learning in industrial inspection and auto-
mation, financial prediction, predicting biological outcomes on the basis of
molecular graph structures, and financial risk estimation. He has participated
in due diligence work on companies in the artificial intelligence and machine
learning arenas. Mike can be reached through mbowles.com.
About the Technical Editor
James York-Winegar is an Infrastructure Principal with Accenture Enkitec
Group. James helps companies of all sizes from startups to enterprises with their
data lifecycle by helping them bridge the gap between systems management
and data science. He started his career in physics, where he did large-scale
quantum chemistry simulations on supercomputers, and went into technology.
He holds a master’s in Data Science from Berkeley.
Acknowledgments
I’d like to acknowledge the splendid support that people at Wiley have offered
during the course of writing this book and making the revisions for this second
edition. It began with Robert Elliot, the acquisitions editor who first contacted me
about writing a book—very easy to work with. Tom Dinse has done a splendid
job editing this second edition. He’s been responsive, thorough, flexible, and
completely professional, as I’ve come to expect from Wiley. I thank you.
I’d also like to acknowledge the enormous comfort that comes from having
such a quick, capable computer scientist as James Winegar doing the technical
editing on the book. James has brought a more consistent style and has made
a number of improvements that will make the code that comes along with the
book easier to use and understand. Thank you for that.
The example problems used in the book come from the University of California
at Irvine’s data repository. UCI does the machine learning community a great
service by gathering these data sets, curating them, and making them freely
available. The reference for this material is:
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.
Contents at a Glance
Introduction xxi
Chapter 1 The Two Essential Algorithms for Making Predictions 1
Chapter 2 Understand the Problem by Understanding the Data 23
Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data 77
Chapter 4 Penalized Linear Regression 129
Chapter 5 Building Predictive Models Using Penalized Linear Methods 169
Chapter 6 Ensemble Methods 221
Chapter 7 Building Ensemble Models with Python 265
Index 329
Contents
Introduction xxi
Chapter 1 The Two Essential Algorithms for Making Predictions 1
Why Are These Two Algorithms So Useful? 2
What Are Penalized Regression Methods? 7
What Are Ensemble Methods? 9
How to Decide Which Algorithm to Use 11
The Process Steps for Building a Predictive Model 13
Framing a Machine Learning Problem 15
Feature Extraction and Feature Engineering 17
Determining Performance of a Trained Model 18
Chapter Contents and Dependencies 18
Summary 20
Chapter 2 Understand the Problem by Understanding the Data 23
The Anatomy of a New Problem 24
Different Types of Attributes and Labels Drive
Modeling Choices 26
Things to Notice about Your New Data Set 27
Classification Problems: Detecting Unexploded
Mines Using Sonar 28
Physical Characteristics of the Rocks Versus Mines Data Set 29
Statistical Summaries of the Rocks Versus Mines Data Set 32
Visualization of Outliers Using a Quantile-Quantile Plot 34
Statistical Characterization of Categorical Attributes 35
How to Use Python Pandas to Summarize the Rocks
Versus Mines Data Set 36
Visualizing Properties of the Rocks Versus Mines Data Set 39
Visualizing with Parallel Coordinates Plots 39
Visualizing Interrelationships between Attributes and Labels 41
Visualizing Attribute and Label Correlations Using a Heat Map 48
Summarizing the Process for Understanding the Rocks Versus
Mines Data Set 50
Real-Valued Predictions with Factor Variables: How Old Is
Your Abalone? 50
Parallel Coordinates for Regression Problems—Visualize
Variable Relationships for the Abalone Problem 55
How to Use a Correlation Heat Map for Regression—
Visualize Pair-Wise Correlations for the Abalone Problem 59
Real-Valued Predictions Using Real-Valued Attributes:
Calculate How Your Wine Tastes 61
Multiclass Classification Problem: What Type of Glass Is That? 67
Using PySpark to Understand Large Data Sets 72
Summary 75
Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data 77
The Basic Problem: Understanding Function Approximation 78
Working with Training Data 79
Assessing Performance of Predictive Models 81
Factors Driving Algorithm Choices and
Performance—Complexity and Data 82
Contrast between a Simple Problem and a Complex Problem 82
Contrast between a Simple Model and a Complex Model 85
Factors Driving Predictive Algorithm Performance 89
Choosing an Algorithm: Linear or Nonlinear? 90
Measuring the Performance of Predictive Models 91
Performance Measures for Different Types of Problems 91
Simulating Performance of Deployed Models 105
Achieving Harmony between Model and Data 107
Choosing a Model to Balance Problem Complexity, Model
Complexity, and Data Set Size 107
Using Forward Stepwise Regression to Control Overfitting 109
Evaluating and Understanding Your Predictive Model 114
Control Overfitting by Penalizing Regression
Coefficients—Ridge Regression 116
Using PySpark for Training Penalized Regression Models on
Extremely Large Data Sets 124
Summary 127
Chapter 4 Penalized Linear Regression 129
Why Penalized Linear Regression Methods Are So Useful 130
Extremely Fast Coefficient Estimation 130
Variable Importance Information 131
Extremely Fast Evaluation When Deployed 131
Reliable Performance 131
Sparse Solutions 132
Problem May Require Linear Model 132
When to Use Ensemble Methods 132
Penalized Linear Regression: Regulating Linear Regression
for Optimum Performance 132
Training Linear Models: Minimizing Errors and More 135
Adding a Coefficient Penalty to the OLS Formulation 136
Other Useful Coefficient Penalties—Manhattan and ElasticNet 137
Why Lasso Penalty Leads to Sparse Coefficient Vectors 138
ElasticNet Penalty Includes Both Lasso and Ridge 140
Solving the Penalized Linear Regression Problem 141
Understanding Least Angle Regression and Its
Relationship to Forward Stepwise Regression 141
How LARS Generates Hundreds of Models of Varying Complexity 145
Choosing the Best Model from the Hundreds
LARS Generates 147
Using Glmnet: Very Fast and Very General 152
Comparison of the Mechanics of Glmnet and
LARS Algorithms 153
Initializing and Iterating the Glmnet Algorithm 153
Extension of Linear Regression to Classification Problems 157
Solving Classification Problems with Penalized Regression 157
Working with Classification Problems Having More
Than Two Outcomes 161
Understanding Basis Expansion: Using Linear
Methods on Nonlinear Problems 161
Incorporating Non-Numeric Attributes into Linear Methods 163
Summary 166
Chapter 5 Building Predictive Models Using Penalized Linear Methods 169
Python Packages for Penalized Linear Regression 170
Multivariable Regression: Predicting Wine Taste 171
Building and Testing a Model to Predict Wine Taste 172
Training on the Whole Data Set before Deployment 175
Basis Expansion: Improving Performance by
Creating New Variables from Old Ones 179
Binary Classification: Using Penalized Linear
Regression to Detect Unexploded Mines 182
Build a Rocks Versus Mines Classifier for Deployment 191
Multiclass Classification: Classifying Crime Scene
Glass Samples 200
Linear Regression and Classification Using PySpark 203
Using PySpark to Predict Wine Taste 204
Logistic Regression with PySpark: Rocks Versus Mines 208
Incorporating Categorical Variables in a
PySpark Model: Predicting Abalone Rings 213
Multiclass Logistic Regression with Meta
Parameter Optimization 217
Summary 219
Chapter 6 Ensemble Methods 221
Binary Decision Trees 222
How a Binary Decision Tree Generates Predictions 224
How to Train a Binary Decision Tree 225
Tree Training Equals Split Point Selection 227
How Split Point Selection Affects Predictions 228
Algorithm for Selecting Split Points 229
Multivariable Tree Training—Which Attribute to Split? 229
Recursive Splitting for More Tree Depth 230
Overfitting Binary Trees 231
Measuring Overfit with Binary Trees 231
Balancing Binary Tree Complexity for Best Performance 232
Modifications for Classification and Categorical Features 235
Bootstrap Aggregation: “Bagging” 235
How Does the Bagging Algorithm Work? 236
Bagging Performance—Bias Versus Variance 239
How Bagging Behaves on Multivariable Problem 241
Bagging Needs Tree Depth for Performance 245
Summary of Bagging 246
Gradient Boosting 246
Basic Principle of Gradient Boosting Algorithm 246
Parameter Settings for Gradient Boosting 249
How Gradient Boosting Iterates toward a Predictive Model 249
Getting the Best Performance from Gradient Boosting 250
Gradient Boosting on a Multivariable Problem 253
Summary for Gradient Boosting 256
Random Forests 256
Random Forests: Bagging Plus Random Attribute Subsets 259
Random Forests Performance Drivers 260
Random Forests Summary 261
Summary 262
Chapter 7 Building Ensemble Models with Python 265
Solving Regression Problems with Python Ensemble Packages 265
Using Gradient Boosting to Predict Wine Taste 266
Using the Class Constructor for GradientBoostingRegressor 266
Using GradientBoostingRegressor to Implement a
Regression Model 268
Assessing the Performance of a Gradient Boosting Model 271
Building a Random Forest Model to Predict Wine Taste 272
Constructing a RandomForestRegressor Object 273
Modeling Wine Taste with RandomForestRegressor 275
Visualizing the Performance of a Random Forest
Regression Model 279
Incorporating Non-Numeric Attributes in Python
Ensemble Models 279
Coding the Sex of Abalone for Gradient Boosting
Regression in Python 280
Assessing Performance and the Importance of Coded
Variables with Gradient Boosting 282
Coding the Sex of Abalone for Input to Random Forest
Regression in Python 284
Assessing Performance and the Importance of
Coded Variables 287
Solving Binary Classification Problems with Python
Ensemble Methods 288
Detecting Unexploded Mines with Python
Gradient Boosting 288
Determining the Performance of a Gradient
Boosting Classifier 291
Detecting Unexploded Mines with Python Random Forest 292
Constructing a Random Forest Model to Detect
Unexploded Mines 294
Determining the Performance of a Random Forest Classifier 298
Solving Multiclass Classification Problems with
Python Ensemble Methods 300
Dealing with Class Imbalances 301
Classifying Glass Using Gradient Boosting 301
Determining the Performance of the Gradient Boosting
Model on Glass Classification 306
Classifying Glass with Random Forests 307
Determining the Performance of the Random Forest
Model on Glass Classification 310
Solving Regression Problems with PySpark Ensemble Packages 311
Predicting Wine Taste with PySpark Ensemble Methods 312
Predicting Abalone Age with PySpark Ensemble Methods 317
Distinguishing Mines from Rocks with PySpark
Ensemble Methods 321
Identifying Glass Types with PySpark Ensemble Methods 325
Summary 327
Index 329
Introduction
Extracting actionable information from data is changing the fabric of modern
business in ways that directly affect programmers. One way is the demand
for new programming skills. Market analysts predict demand for people with
advanced statistics and machine learning skills will exceed supply by 140,000
to 190,000 by 2018. That means good salaries and a wide choice of interesting
projects for those who have the requisite skills. Another development that affects
programmers is progress in developing core tools for statistics and machine
learning. This relieves programmers of the need to program intricate algorithms
for themselves each time they want to try a new one. Among general-purpose
programming languages, Python developers have been in the forefront, building
state-of-the-art machine learning tools, but there is a gap between having the
tools and being able to use them efficiently.
Programmers can gain general knowledge about machine learning in a
number of ways: online courses, a number of well-written books, and so on. Many
of these give excellent surveys of machine learning algorithms and examples of
their use, but because of the availability of so many different algorithms, it’s
difficult to cover the details of their usage in a survey.
This leaves a gap for the practitioner. The number of algorithms available
requires making choices that a programmer new to machine learning might not
be equipped to make until trying several, and it leaves the programmer to fill
in the details of the usage of these algorithms in the context of overall problem
formulation and solution.
This book attempts to close that gap. The approach taken is to restrict the algo-
rithms covered to two families of algorithms that have proven to give optimum
performance for a wide variety of problems. This assertion is supported by
their dominant usage in machine learning competitions, their early inclusion in
newly developed packages of machine learning tools, and their performance in
comparative studies (as discussed in Chapter 1, “The Two Essential Algorithms
for Making Predictions”). Restricting attention to two algorithm families makes
it possible to provide good coverage of the principles of operation and to run
through the details of a number of examples showing how these algorithms
apply to problems with different structures.
The book largely relies on code examples to illustrate the principles of oper-
ation for the algorithms discussed. I’ve discovered in the classes I have taught
at University of California, Berkeley, Galvanize, University of New Haven, and
Hacker Dojo, that programmers generally grasp principles more readily by
seeing simple code illustrations than by looking at math.
This book focuses on Python because it offers a good blend of functionality
and specialized packages containing machine learning algorithms. Python is an
often-used language that is well known for producing compact, readable code.
That fact has led a number of leading companies to adopt Python for prototyp-
ing and deployment. Python developers are supported by a large community
of fellow developers, development tools, extensions, and so forth. Python is
widely used in industrial applications and in scientific programming, as well.
It has a number of packages that support computationally intensive applica-
tions like machine learning, and it has a good collection of the leading machine
learning algorithms (so you don’t have to code them yourself). Python is a better
general-purpose programming language than specialized statistical languages
such as R or SAS (Statistical Analysis System). Its collection of machine learning
algorithms incorporates a number of top-flight algorithms and continues to
expand.
Who This Book Is For
This book is intended for Python programmers who want to add machine
learning to their repertoire, either for a specific project or as part of keeping
their toolkit relevant. Perhaps a new problem has come up at work that requires
machine learning. With machine learning being covered so much in the news
these days, it’s a useful skill to claim on a resume.
This book provides the following for Python programmers:
■ A description of the basic problems that machine learning attacks
■ Several state-of-the-art algorithms
■ The principles of operation for these algorithms
■ Process steps for specifying, designing, and qualifying a machine learning system
■ Examples of the processes and algorithms
■ Hackable code
To get through this book easily, your primary background requirements include
an understanding of programming or computer science and the ability to read
and write code. The code examples, libraries, and packages are all Python, so the
book will prove most useful to Python programmers. In some cases, the book
runs through code for the core of an algorithm to demonstrate the operating
principles, but then uses a Python package incorporating the algorithm to apply
the algorithm to problems. Seeing code often gives programmers an intuitive
grasp of an algorithm in the way that seeing the math does for others. Once
the understanding is in place, examples will use developed Python packages
with the bells and whistles that are important for efficient use (error checking,
handling input and output, developed data structures for the models, defined
predictor methods incorporating the trained model, and so on).
In addition to having a programming background, some knowledge of math
and statistics will help get you through the material easily. Math requirements
include some undergraduate-level differential calculus (knowing how to take a
derivative and a little bit of linear algebra), matrix notation, matrix multiplication,
and matrix inverse. The main use of these will be to follow the derivations of
some of the algorithms covered. Many times, that will be as simple as taking a
derivative of a simple function or doing some basic matrix manipulations. Being
able to follow the calculations at a conceptual level may aid your understanding
of the algorithm. Understanding the steps in the derivation can help you to under-
stand the strengths and weaknesses of an algorithm and can help you to decide
which algorithm is likely to be the best choice for a particular problem.
This book also uses some general probability and statistics. The requirements
for these include some familiarity with undergraduate-level probability and con-
cepts such as the mean value of a list of real numbers, variance, and correlation.
You can always look through the code if some of the concepts are rusty for you.
This book covers two broad classes of machine learning algorithms: penal-
ized linear regression (for example, Ridge and Lasso) and ensemble methods
(for example, Random Forest and Gradient Boosting). Each of these families
contains variants that will solve regression and classification problems. (You
learn the distinction between classification and regression early in the book.)
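For orientation, here is a minimal sketch (on synthetic data, not one of the book's case studies) showing where these two families live in scikit-learn; the particular parameter values are placeholders.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# tiny synthetic regression problem, purely for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),                          # penalized linear regression
    "lasso": Lasso(alpha=0.1),                          # penalized linear regression
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))            # R^2 on the training data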
Readers who are already familiar with machine learning and are only inter-
ested in picking up one or the other of these can skip to the two chapters cov-
ering that family. Each method gets two chapters—one covering principles of
operation and the other running through usage on different types of problems.
Penalized linear regression is covered in Chapter 4, “Penalized Linear Regres-
sion,” and Chapter 5, “Building Predictive Models Using Penalized Linear
Methods.” Ensemble methods are covered in Chapter 6, “Ensemble Methods,”
and Chapter 7, “Building Ensemble Models with Python.” To familiarize yourself
with the problems addressed in the chapters on usage of the algorithms, you
might find it helpful to skim Chapter 2, “Understand the Problem by Under-
standing the Data,” which deals with data exploration. Readers who are just
starting out with machine learning and want to go through from start to finish
might want to save Chapter 2 until they start looking at the solutions to prob-
lems in later chapters.
What This Book Covers
As mentioned earlier, this book covers two algorithm families that are relatively
recent developments and that are still being actively researched. They both
depend on, and have somewhat eclipsed, earlier technologies.
Penalized linear regression represents a relatively recent development in
ongoing research to improve on ordinary least squares regression. Penalized
linear regression has several features that make it a top choice for predictive analytics. Penalized linear regression introduces a tunable parameter that makes it possible to balance the resulting model between overfitting and underfitting. It also yields information on the relative importance of the various inputs to the predictions it makes. Both of these features are vitally important to the process of developing predictive models. In addition, penalized linear regression yields the best prediction performance in some classes of problems, particularly underdetermined problems and problems with very many input parameters such as genetics and text mining. Furthermore, there's been a great deal of recent development of coordinate descent methods, making training penalized linear regression models extremely fast.
To help you understand penalized linear regression, this book recapitulates
ordinary linear regression and other extensions to it, such as stepwise regres-
sion. The hope is that these will help cultivate intuition.
Ensemble methods are one of the most powerful predictive analytics tools
available. They can model extremely complicated behavior, especially for prob-
lems that are vastly overdetermined, as is often the case for many web-based
prediction problems (such as returning search results or predicting ad click-
through rates). Many seasoned data scientists use ensemble methods as their
first try because of their performance. They are relatively simple to use, and
they also rank variables in terms of predictive performance.
Ensemble methods have followed a development path parallel to penalized
linear regression. Whereas penalized linear regression evolved from over-
coming the limitations of ordinary regression, ensemble methods evolved to
overcome the limitations of binary decision trees. Correspondingly, this book’s
coverage of ensemble methods covers some background on binary decision trees
because ensemble methods inherit some of their properties from binary decision
trees. Understanding them helps cultivate intuition about ensemble methods.
What Has Changed Since the First Edition
In the three years since the first edition was published, Python has more firmly
established itself as the primary language for data science. Developers of plat-
forms like Spark for big data or TensorFlow and Torch for deep learning have
adopted Python interfaces to reach the widest set of data scientists. The two
classes of algorithms emphasized in the first edition continue to be heavy favor-
ites and are now available as part of PySpark.
The beauty of this marriage is that the code required to build machine learning
models on truly gargantuan data sets is no more complicated than what’s required
on smaller data sets.
PySpark illustrates several important developments, making it cleaner and easier to invoke very powerful machine learning tools through relatively simple Python code that is easy to read and write. When the first edition of this book was
written, building machine learning models on very large data sets required
spinning up hundreds of processors, which required vast knowledge of data
center processes and programming. It was cumbersome and frankly not very
effective. Spark architecture was developed to correct this difficulty.
Spark made it possible to easily rent and employ large numbers of processors
for machine learning. PySpark added a Python interface. The result is that the
code to run a machine learning algorithm in PySpark is not much more compli-
cated than to run the plain Python versions of programs. The algorithms that
were the focus of the first edition continue to be heavily used favorites and are
available in Spark. So it seemed natural to add PySpark examples alongside the
Python examples in order to familiarize readers with PySpark.
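As a rough taste of what that looks like (a sketch, not one of the book's worked examples; the file name and column names here are invented), a penalized linear model in PySpark's ML library can be set up in a few lines:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("intro-sketch").getOrCreate()

# hypothetical CSV file with numeric columns x1, x2 and a label column y
df = spark.read.csv("data.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = assembler.transform(df)

# elasticNetParam=1.0 gives the lasso penalty; 0.0 gives ridge
lr = LinearRegression(featuresCol="features", labelCol="y",
                      regParam=0.01, elasticNetParam=1.0)
model = lr.fit(train)
print(model.coefficients, model.intercept)

spark.stop()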
In this edition, all the code examples are in Python 3, since Python 2 is due to fall out of support. In addition to providing the code in text form, the code is also available in Jupyter notebooks for each chapter. When executed, the notebook code will draw the graphs and tables you see in the figures.
How This Book Is Structured
This book follows the basic order in which you would approach a new prediction
problem. The beginning involves developing an understanding of the data and
determining how to formulate the problem, and then proceeds to try an algorithm
and measure the performance. In the midst of this sequence, the book outlines
the methods and reasons for the steps as they come up. Chapter 1 gives a more
thorough description of the types of problems that this book covers and the
methods that are used. The book uses several data sets from the UC Irvine data
repository as examples, and Chapter 2 exhibits some of the methods and tools
that you can use for developing insight into a new data set. Chapter 3, “Predic-
tive Model Building: Balancing Performance, Complexity, and Big Data,” talks
about the difficulties of predictive analytics and techniques for addressing them.
It outlines the relationships between problem complexity, model complexity,
data set size, and predictive performance. It discusses overfitting and how to
reliably sense overfitting. It talks about performance metrics for different types
of problems. Chapters 4 and 5, respectively, cover the background on penalized
linear regression and its application to problems explored in Chapter 2. Chapters
6 and 7 cover background and application for ensemble methods.
What You Need to Use This Book
To run the code examples in the book, you need to have Python 3.x, SciPy, NumPy, pandas, scikit-learn, and PySpark. These can be difficult to install due to cross-dependencies and version issues. To make the installation easy, I've used a free distribution of these packages that's available from Continuum Analytics (http://continuum.io/). Its Anaconda product is a free download and includes Python 3.x and all the packages you need to run the code in this book (and more). I've run the examples on Ubuntu 14.04 Linux but haven't tried them on other operating systems.
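One quick way to confirm an environment is ready is to check that the required packages import and print their versions, as in this small sketch (the versions you see will differ):

import sys
import numpy
import scipy
import pandas
import sklearn
import pyspark

# print the interpreter version, then each package version
print("python", sys.version.split()[0])
for module in (numpy, scipy, pandas, sklearn, pyspark):
    print(module.__name__, module.__version__)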
PySpark will need a Linux environment. If you’re not running on Linux, then
probably the easiest way to run the examples will be to use a virtual machine.
Virtual Box is a free open source virtual machine—follow the directions to
download Virtual Box and then install Ubuntu 18.04 and use Anaconda to install
Python, PySpark, etc. You’ll only need to employ a VM to run the PySpark exam-
ples. The non-Spark code will run anywhere you can open a Jupyter notebook.
Reader Support for This Book
Source code available in the book’s repository can help you speed your learning.
The chapters include installation instructions so that you can get coding along
with reading the book.
Source Code
As you work through the examples in this book, you may choose either to type
in all the code manually or to use the source code files that accompany the book.
All the source code used in this book is available for download from http://www.wiley.com/go/pythonmachinelearning2e. You will find the code snippets
from the source code are accompanied by a download icon and note indicating
the name of the program so that you know it’s available for download and can
easily locate it in the download file.
Besides providing the code in text form, it is also included in a Python note-
book. If you know how to run a Jupyter notebook, you can run the code cell-
by-cell. The output will appear in the notebook, the figures will get drawn, and
printed output will appear below the code block.
After you download the code, just decompress it with your favorite com-
pression tool.
How to Contact the Publisher
If you believe you’ve found a mistake in this book, please bring it to our attention.
At John Wiley & Sons, we understand how important it is to provide our cus-
tomers with accurate content, but even with our best efforts an error may occur.
In order to submit your possible errata, please email it to our Customer Service
Team at wileysupport@wiley.com with the subject line “Possible Book Errata
Submission”.
Chapter 1

The Two Essential Algorithms for Making Predictions
This book focuses on the machine learning process and so covers just a few of
the most effective and widely used algorithms. It does not provide a survey of
machine learning techniques. Too many of the algorithms that might be included
in a survey are not actively used by practitioners.
This book deals with one class of machine learning problems, generally
referred to as function approximation. Function approximation is a subset of
problems that are called supervised learning problems. Linear regression and its
classifier cousin, logistic regression, provide familiar examples of algorithms for
function approximation problems. Function approximation problems include
an enormous breadth of practical classification and regression problems in all
sorts of arenas, including text classification, search responses, ad placements,
spam filtering, predicting customer behavior, diagnostics, and so forth. The list
is almost endless.
Broadly speaking, this book covers two classes of algorithms for solving
function approximation problems: penalized linear regression methods and
ensemble methods. This chapter introduces you to both of these algorithms,
outlines some of their characteristics, and reviews the results of comparative
studies of algorithm performance in order to demonstrate their consistent high
performance.
This chapter then discusses the process of building predictive models. It
describes the kinds of problems that you’ll be able to address with the tools
covered here and the flexibilities that you have in how you set up your problem
and define the features that you’ll use for making predictions. It describes process
steps involved in building a predictive model and qualifying it for deployment.
Why Are These Two Algorithms So Useful?
Several factors make the penalized linear regression and ensemble methods a
useful collection. Stated simply, they will provide optimum or near-optimum
performance on the vast majority of predictive analytics (function approxima-
tion) problems encountered in practice, including big data sets, little data sets,
wide data sets, tall skinny data sets, complicated problems, and simple prob-
lems. Evidence for this assertion can be found in two papers by Rich Caruana
and his colleagues:
■ "An Empirical Comparison of Supervised Learning Algorithms," by Rich Caruana and Alexandru Niculescu-Mizil¹
■ "An Empirical Evaluation of Supervised Learning in High Dimensions," by Rich Caruana, Nikos Karampatziakis, and Ainur Yessenalina²
In those two papers, the authors chose a variety of classification problems
and applied a variety of different algorithms to build predictive models. The
models were run on test data that were not included in training the models,
and then the algorithms included in the studies were ranked on the basis of
their performance on the problems. The first study compared 9 different basic
algorithms on 11 different machine learning (binary classification) problems.
The problems used in the study came from a wide variety of areas, including
demographic data, text processing, pattern recognition, physics, and biology.
Table 1.1 lists the data sets used in the study using the same names given by
the study authors. The table shows how many attributes were available for
predicting outcomes for each of the data sets, and it shows what percentage of
the examples were positive.
Table 1.1: Sketch of Problems in Machine Learning Comparison Study

DATA SET NAME    NUMBER OF ATTRIBUTES    % OF EXAMPLES THAT ARE POSITIVE
Adult            14                      25
Bact             11                      69
Cod              15                      50
Calhous          9                       52
Cov_Type         54                      36
HS               200                     24
Letter.p1        16                      3
Letter.p2        16                      53
Medis            63                      11
Mg               124                     17
Slac             59                      50
The term positive example in a classification problem means an experiment (a
line of data from the input data set) in which the outcome is positive. For example,
if the classifier is being designed to determine whether a radar return signal
indicates the presence of an airplane, then the positive example would be those
returns where there was actually an airplane in the radar’s field of view. The
term positive comes from this sort of example where the two outcomes represent
presence or absence. Other examples include presence or absence of disease in
a medical test or presence or absence of cheating on a tax return.
Not all classification problems deal with presence or absence. For example,
determining the gender of an author by machine-reading his or her text or
machine-analyzing a handwriting sample has two classes—male and female—
but there’s no sense in which one is the absence of the other. In these cases,
there’s some arbitrariness in the assignment of the designations “positive” and
“negative.” The assignments of positive and negative can be arbitrary, but once
chosen must be used consistently.
Some of the problems in the first study had many more examples of one
class than the other. These are called unbalanced. For example, the two data sets
Letter.p1 and Letter.p2 pose closely related problems in correctly classifying
typed uppercase letters in a wide variety of fonts. The task with Letter.p1 is to
correctly classify the letter O in a standard mix of letters. The task with Letter.p2 is to correctly classify A–M versus N–Z. The percentage of positives shown
in Table 1.1 reflects this difference.
Table 1.1 also shows the number of “attributes” in each of the data sets. Attrib-
utes are the variables you have available to base a prediction on. For example,
to predict whether or not an airplane will arrive at its destination on time,
you might incorporate attributes such as the name of the airline company,
the make and year of the airplane, the level of precipitation at the destination
airport, the wind speed and direction along the flight path, and so on. Hav-
ing a lot of attributes upon which to base a prediction can be a blessing and a
curse. Attributes that relate directly to the outcomes being predicted are a
blessing. Attributes that are unrelated to the outcomes are a curse. Telling the
difference between blessed and cursed attributes requires data. Chapter 3, “Pre-
dictive Model Building: Balancing Performance, Complexity, and Big Data,”
goes into that in more detail.
Table 1.2 shows how the algorithms covered in this book fared relative to the
other algorithms used in the study. Table 1.2 shows which algorithms showed
the top five performance scores for each of the problems listed in Table 1.1. Algo-
rithms covered in this book are spelled out (boosted decision trees, Random
Forests, bagged decision trees, and logistic regression). The first three of these
are ensemble methods. Penalized regression was not fully developed when the
study was done and wasn’t evaluated. Logistic regression is a close relative and
is used to gauge the success of regression methods. Each of the 9 algorithms
used in the study had 3 different data reduction techniques applied, for a total of
27 combinations. The top five positions represent roughly the top 20 percent of
performance scores. The row next to the heading Covt indicates that the boosted
decision trees algorithm was the first and second best relative to performance,
the Random Forests algorithm was the fourth and fifth best, and the bagged
decision trees algorithm was the third best. In the cases where algorithms not
covered here were in the top five, an entry appears in the Other column. The
algorithms that show up there are k-nearest neighbors (KNNs), artificial neural nets
(ANNs), and support vector machines (SVMs).
Logistic regression captures top-five honors in only one case in Table 1.2. The reason for that is that these data sets have few attributes (at most 200) relative to examples (5,000 in each data set). There's plenty of data to resolve a model with so few attributes, and yet the training sets are small enough that the training time is not excessive.

Table 1.2: How the Algorithms Covered in This Book Compare on Different Problems

ALGORITHM   BOOSTED DECISION TREES   RANDOM FORESTS   BAGGED DECISION TREES   LOGISTIC REGRESSION   OTHER
Covt 1, 2 4, 5 3
Adult 1, 4 2 3, 5
LTR.P1 1 SVM, KNN
LTR.P2 1, 2 4, 5 SVM
MEDIS 1, 3 5 ANN
SLAC 1, 2, 3 4, 5
HS 1, 3 ANN
MG 2, 4, 5 1, 3
CALHOUS 1, 2 5 3, 4
COD 1, 2 3, 4, 5
BACT 2, 5 1, 3, 4
As you’ll see in Chapter 3 and in the examples covered in Chapter 5, “Building
Predictive Models Using Penalized Linear Methods," and Chapter 7, "Building Ensemble Models with Python," the penalized regression methods perform
best relative to other algorithms when there are numerous attributes and not
enough examples or time to train a more complicated ensemble model.
Caruana et al. have run a newer study (2008) to address how these algorithms
compare when the number of attributes increases. That is, how do these algo-
rithms compare on big data? A number of fields have significantly more attrib-
utes than the data sets in the first study. For example, genomic problems have
several tens of thousands of attributes (one attribute per gene), and text mining
problems can have millions of attributes (one attribute per distinct word or per
distinct pair of words). Table 1.3 shows how linear regression and ensemble
methods fare as the number of attributes grows. The results in Table 1.3 show
the ranking of the algorithms used in the second study. The table shows the
performance on each of the problems individually and in the far right column
shows the ranking of each algorithm’s average score across all the problems.
The algorithms used in the study are broken into two groups. The top group
of algorithms are ones that will be covered in this book. The bottom group will
not be covered.
The problems shown in Table 1.3 are arranged in order of their number of
attributes, ranging from 761 to 685,569. Linear (logistic) regression is in the top
three for 5 of the 11 test cases used in the study. Those superior scores were
concentrated among the larger data sets. Notice that boosted decision tree
(denoted by BSTDT in Table 1.3) and Random Forests (denoted by RF in Table 1.3)
algorithms still perform near the top. They come in first and second for overall
score on these problems.
The algorithms covered in this book have other advantages besides raw pre-
dictive performance. An important benefit of the penalized linear regression
models that the book covers is the speed at which they train. On big problems,
training speed can become an issue. In some problems, model training can take
days or weeks. This time frame can be an intolerable delay, particularly early
in development when iterations are required to home in on the best approach.
Besides training very quickly, after being deployed a trained linear model can
produce predictions very quickly—quickly enough for high-speed trading or
Internet ad insertions. The study demonstrates that penalized linear regression
can provide the best answers available in many cases and be near the top even
in cases where they are not the best.
In addition, these algorithms are reasonably easy to use. They do not have
very many tunable parameters. They have well-defined and well-structured
input types. They solve several types of problems in regression and classification.
It is not unusual to be able to arrange the input data and generate a first
trained model and performance predictions within an hour or two of starting a
new problem.
One of their most important features is that they indicate which of their input
variables is most important for producing predictions. This turns out to be
an invaluable feature in a machine learning algorithm. One of the most time-
consuming steps in the development of a predictive model is what is sometimes
called feature selection or feature engineering. This is the process whereby the data
scientist chooses the variables that will be used to predict outcomes. By rank-
ing features according to importance, the algorithms covered in this book aid
in the feature-engineering process by taking some of the guesswork out of the
development process and making the process more sure.
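As a small illustration of that point (a sketch on synthetic data, not one of the book's case studies), both families report a ranking of the inputs after fitting: a random forest through feature_importances_ and a penalized linear model through the size of its coefficients.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
# only the first two columns actually drive the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("forest importances:", np.round(forest.feature_importances_, 3))

lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso coefficients:", np.round(lasso.coef_, 3))   # unimportant inputs shrink toward 0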
What Are Penalized Regression Methods?
Penalized linear regression is a derivative of ordinary least squares (OLS) regres-
sion—a method developed by Gauss and Legendre roughly 200 years ago.
Penalized linear regression methods were designed to overcome some basic
limitations of OLS regression. The basic problem with OLS is that sometimes it
overfits the problem. Think of OLS as fitting a line through a group of points,
as in Figure 1.1. This is a simple prediction problem: predicting y, the target
value given a single attribute x. For example, the problem might be to predict
men’s salaries using only their heights. Height is slightly predictive of salaries
for men (but not for women).
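As a concrete, if toy, illustration of the setup in Figure 1.1, the sketch below fits an ordinary least squares line to six invented (height, salary) points; the numbers are made up purely to mirror the figure.

import numpy as np

# six made-up (height, salary) points standing in for Figure 1.1
heights = np.array([62.0, 65.0, 67.0, 70.0, 72.0, 75.0])     # inches
salaries = np.array([48.0, 55.0, 52.0, 61.0, 66.0, 63.0])    # $1000s

# ordinary least squares fit of a line: salary ~ slope * height + intercept
slope, intercept = np.polyfit(heights, salaries, deg=1)
print("slope: %.2f, intercept: %.2f" % (slope, intercept))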
Figure 1.1: Ordinary least squares fit (y, target value, versus x, attribute value)
The points represent men’s salaries versus their heights. The line in Figure 1.1
represents the OLS solution to this prediction problem. In some sense, the line
is the best predictive model for men’s salaries given their heights. The data set
has six points in it. Suppose that the data set had only two points in it. Imagine
that there’s a population of points, like the ones in Figure 1.1, but that you do
not get to see all the points. Maybe they are too expensive to generate, like the
genetic data mentioned earlier. There are enough humans available to isolate
the gene that is the culprit; the problem is that you do not have gene sequences
for many of them because of cost.
To simulate this in the simple example, imagine that instead of six points you’re
given only two of the six points. How would that change the nature of the line
fit to those points? It would depend on which two points you happened to get.
To see how much effect that would have, pick any two points from Figure 1.1
and imagine a line through them. Figure 1.2 shows some of the possible lines
through pairs of points from Figure 1.1. Notice how much the lines vary depend-
ing on the choice of points.
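To see the effect numerically rather than by eye, the short sketch below (using the same made-up height and salary numbers as before) fits a separate line to every pair of points and reports how widely the slopes swing.

from itertools import combinations
import numpy as np

heights = np.array([62.0, 65.0, 67.0, 70.0, 72.0, 75.0])
salaries = np.array([48.0, 55.0, 52.0, 61.0, 66.0, 63.0])

slopes = []
for i, j in combinations(range(len(heights)), 2):
    # a line fit to just two points passes through both of them exactly
    slope, intercept = np.polyfit(heights[[i, j]], salaries[[i, j]], deg=1)
    slopes.append(slope)

print("min slope: %.2f, max slope: %.2f" % (min(slopes), max(slopes)))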
The problem with having only two points to fit a line is that there is not enough
data for the number of degrees of freedom. A line has two degrees of freedom.
Having two degrees of freedom means that there are two independent param-
eters that uniquely determine a line. You can imagine grabbing hold of a line
in the plane and sliding it up and down in the plane or twisting it to change
its slope. So, vertical position and slope are independent. They can be changed
separately, and together they completely specify a line. The degrees of freedom
of a line can be expressed in several equivalent ways (where it intercepts the
y-axis and its slope, two points that are on the line, and so on). All of these rep-
resentations of a line require two parameters to specify.
When the number of degrees of freedom is equal to the number of points, the
predictions are not very good. The lines hit the points used to draw them, but
there is a lot of variation among lines drawn with different pairs of points.

Figure 1.2: Fitting lines with only two points (possible lines through pairs of the points in Figure 1.1; y, target value, versus x, attribute value)

You cannot place much faith in a prediction that has as many degrees of freedom
as the number of points in your data set. The plot in Figure 1.1 had six points
and fit a line (two degrees of freedom) through them. That is six points and two
degrees of freedom. The thought problem of determining the genes causing a
heritable condition illustrated that having more genes to choose from makes it
necessary to have more data in order to isolate a cause from among the 20,000
or so possible human genes. The 20,000 different genes represent 20,000 degrees
of freedom. Data from even 20,000 different persons will not suffice to get a
reliable answer, and in many cases, all that can be afforded within the scope of
a reasonable study is a sample from 500 or so persons. That is where penalized
linear regression may be the best algorithm choice.
Penalized linear regression provides a way to systematically reduce degrees of
freedom to match the amount of data available and the complexity of the under-
lying phenomena. These methods have become very popular for problems with
very many degrees of freedom. They are a favorite for genetic problems where
the number of degrees of freedom (that is, the number of genes) can be several
tens of thousands and for problems like text classification where the number of
degrees of freedom can be more than a million. Chapter 4, “Penalized Linear
Regression,” gives more detail on how these methods work, sample code that
illustrates the mechanics of these algorithms, and examples of the process for
implementing machine learning systems using available Python packages.
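To make that concrete, here is a minimal sketch on synthetic data (not one of the book's examples): with 1,000 attributes and only 50 rows the problem is underdetermined, and the lasso penalty drives most coefficients to exactly zero.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_rows, n_cols = 50, 1000                  # far more attributes than examples
X = rng.normal(size=(n_rows, n_cols))
# only 3 of the 1,000 attributes actually matter
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2] + 0.05 * rng.normal(size=n_rows)

model = Lasso(alpha=0.05, max_iter=10000).fit(X, y)
nonzero = np.flatnonzero(model.coef_)
print("nonzero coefficients:", len(nonzero), "of", n_cols)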
What Are Ensemble Methods?
The other family of algorithms covered in this book is ensemble methods. The
basic idea with ensemble methods is to build a horde of different predictive
models and then combine their outputs—by averaging the outputs or taking the
majority answer (voting). The individual models are called base learners. Some
results from computational learning theory show that if the base learners are
just slightly better than random guessing, the performance of the ensemble can
be very good if there is a sufficient number of independent models.
One of the problems spurring the development of ensemble methods has
been the observation that some particular machine learning algorithms exhibit
instability. For example, the addition of fresh data to the data set might result in
a radical change in the resulting model or its performance. Binary decision trees
and traditional neural nets exhibit this sort of instability. This instability causes
high variance in the performance of models, and averaging many models can
be viewed as a way to reduce the variance. The trick is how to generate large
numbers of independent models, particularly if they are all using the same base
learner. Chapter 6, “Ensemble Methods,” will get into the details of how this is
done. The techniques are ingenious, and it is relatively easy to understand their
basic principles of operation. Here is a preview of what’s in store.
The ensemble methods that enjoy the widest availability and usage incorporate
binary decision trees as their base learners. Binary decision trees are often por-
trayed as shown in Figure 1.3. The tree in Figure 1.3 takes a real number, called
x, as input at the top, and then uses a series of binary (two-valued) decisions to
decide what value should be output in response to x. The first decision is whether
x is less than 5. If the answer to that question is “no,” the binary decision tree
outputs the value 4 indicated in the circle below the No leg of the upper decision
box. Every possible value for x leads to some output y from the tree. Figure 1.4
plots the output (y) as a function of the input to the tree (x).
Figure 1.3: Binary decision tree example (input x at the top; first split on x < 5, second split on x < 3; leaf outputs y = 4, y = 2, and y = 1)

Figure 1.4: Input-output graph for the binary decision tree example (output y versus input x)
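Read literally, that tree is just two nested comparisons. The sketch below is an illustrative Python rendering of Figure 1.3, taking the Yes branches as the left-hand leaves in the figure (so x < 3 returns 2, values between 3 and 5 return 1, and everything else returns 4).

def tree_predict(x):
    """Literal rendering of the binary decision tree in Figure 1.3."""
    if x < 5:            # first split
        if x < 3:        # second split on the Yes branch
            return 2
        return 1
    return 4             # No branch of the first split

# every input value maps to one of the three leaf values,
# producing the step function plotted in Figure 1.4
print([tree_predict(x) for x in [1, 2, 3, 4, 5, 6, 7]])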
This description raises the question of where the comparisons (for example,
x < 5?) come from and where the output values (in the circles at the bottom of
the tree) come from. These values come from training the binary tree on the
input data. The algorithm for doing that training is not difficult to understand
and is covered in Chapter 6. The important thing to note at this point is that the
values in the trained binary decision tree are fixed, given the data. The process
for generating the tree is deterministic. One way to get differing models is to
take random samples of the training data and train on these random subsets.
That technique is called Bagging (short for bootstrap aggregating). It gives a way
to generate a large number of slightly different binary decision trees. Those are
then averaged (or voted for a classifier) to yield a final result. Chapter 6 describes
in more detail this technique and other more powerful ones.
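A minimal sketch of Bagging under those assumptions (synthetic regression data, shallow trees as the base learner) looks like this; Chapter 6 develops the real algorithm, and scikit-learn's BaggingRegressor packages the same steps.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_trees = 100
predictions = np.zeros((n_trees, len(X_test)))

for i in range(n_trees):
    # Bootstrap sample: draw n rows *with* replacement from the training set.
    rows = rng.randint(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(max_depth=4).fit(X_train[rows], y_train[rows])
    predictions[i] = tree.predict(X_test)

single = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
print("single tree MSE :", mean_squared_error(y_test, single.predict(X_test)))
print("bagged trees MSE:", mean_squared_error(y_test, predictions.mean(axis=0)))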
How to Decide Which Algorithm to Use
Table 1.4 gives a sketch comparison of these two families of algorithms.
Penalized linear regression methods have the advantage that they train very
quickly, whereas training times for other methods on large data sets can extend
to hours, days, or even weeks. Training usually needs to be done several times before a deployable
solution is arrived at. Long training times can stall development and deploy-
ment on large problems. The rapid training time for penalized linear methods
makes them useful for the obvious reason that faster is better. Depending on the
problem, these methods may suffer some performance disadvantages relative
to ensemble methods. Chapter 3 gives more insight into the types of problems
where penalized regression might be a better choice and those where ensemble
methods might be a better choice. Penalized linear methods can sometimes be
a useful first step in your development process even in the circumstance where
they yield inferior performance to ensemble methods.
Early in development, a number of training iterations will be necessary for
purposes of feature selection and feature engineering and for solidifying the
mathematical problem statement. Deciding what you are going to use as input
to your predictive model can take some time and thought. Sometimes that is
obvious, but usually it requires some iteration. Throwing in everything you can
find is not usually a good solution.

Table 1.4: High-Level Tradeoff between Penalized Linear Regression and Ensemble Algorithms

                             TRAINING   PREDICTION   PROBLEM      DEALS WITH
                             SPEED      SPEED        COMPLEXITY   WIDE ATTRIBUTE
Penalized Linear Regression  +          +            –            +
Ensemble Methods             –          –            +            –
Trial and error is typically required to determine the best inputs for a model.
For example, if you’re trying to predict whether a visitor to your website will
click a link for an ad, you might try using demographic data for the visitor.
Maybe that does not give you the accuracy that you need, so you try incorpo-
rating data regarding the visitor’s past behavior on the site—what ad the visitor
clicked during past site visits or what products the visitor has bought. Maybe
adding data about the site the visitor was on before coming to your site would
help. These questions lead to a series of experiments where you incorporate
the new data and see whether it hurts or helps. This iteration is generally time-
consuming both for the data manipulations and for training your predictive model.
Penalized linear regression will generally be faster than an ensemble method,
and the time difference can be a material factor in the development process.
For example, if the training set is on the order of a gigabyte, training times
may be on the order of 30 minutes for penalized linear regression and 5 or 6
hours for an ensemble method. If the feature engineering process requires 10
iterations to select the best feature set, the computation time alone comes to
the difference between taking a day or taking a week to accomplish feature
engineering. A useful process, therefore, is to train a penalized linear model in
the early stages of development, feature engineering, and so on. That gives the
data scientist a feel for which variables are going to be useful and important
as well as a baseline performance for comparison with other algorithms later
in development.
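As a sketch of that early-baseline step, a cross-validated lasso fit on a synthetic data set (the data and column count below are stand-ins for whatever your problem supplies) trains in seconds and already reports which variables carry weight and what performance to beat later.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# 1,000 rows, 50 candidate features, only a handful of them informative.
X, y = make_regression(n_samples=1000, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LassoCV(cv=5).fit(X_train, y_train)        # trains in seconds
print("baseline R^2 on held-out data:", baseline.score(X_test, y_test))

# Nonzero coefficients flag the variables worth keeping in later experiments.
print("features retained by the lasso:", np.flatnonzero(baseline.coef_))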
Besides enjoying a training time advantage, penalized linear methods gen-
erate predictions much faster than ensemble methods. Generating a predic-
tion involves using the trained model. The trained model for penalized linear
regression is simply a list of real numbers—one for each feature being used to
make the predictions. The number of floating-point operations involved is the
number of variables being used to make predictions. For highly time-sensitive
predictions such as high-frequency trading or Internet ad insertions, compu-
tation time makes the difference between making money and losing money.
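In code, applying a trained penalized linear model is a single dot product plus an intercept, one multiply-add per feature, which is why its prediction latency is so low; the coefficient and feature values below are made up purely for illustration.

import numpy as np

coefficients = np.array([0.42, -1.3, 0.0, 2.7])   # one real number per feature
intercept = 0.15
x_new = np.array([1.0, 0.5, 3.2, -0.8])           # one incoming example

# Four features, so roughly four floating-point multiply-adds per prediction.
prediction = np.dot(coefficients, x_new) + intercept
print(prediction)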
For some problems, linear methods may give equivalent or even better
performance than ensemble methods. Some problems do not require com-
plicated models. Chapter 3 goes into some detail about the nature of problem
complexity and how the data scientist’s task is to balance problem complexity,
predictive model complexity, and data set size to achieve the best deployable
model. The basic idea is that on problems that are not complex and problems
for which sufficient data are not available, linear methods may achieve better
overall performance than more complicated ensemble methods. Genetic data
provide a good illustration of this type of problem.
The general perception is that there’s an enormous amount of genetic data
around. Genetic data sets are indeed large when measured in bytes, but in
terms of generating accurate predictions, they aren’t very large. To understand
this distinction, consider the following thought experiment. Suppose that you
have two people, one with a heritable condition and the other without. If you
had genetic sequences for the two people, could you determine which gene was
responsible for the condition? Obviously, that’s not possible because many genes
will differ between the two persons. So how many people would it take? At a
minimum, it would take gene sequences for as many people as there are genes,
and given any noise in the measurements, it would take even more. Humans
have roughly 20,000 genes, depending on your count. And each datum costs
roughly $1,000. So having just enough data to resolve the disease with perfect
measurements would cost $20 million.
This situation is very similar to fitting a line to two points, as discussed
earlier in this chapter. Models need to have fewer degrees of freedom than
the number of data points. The data set typically needs to be a multiple
of the degrees of freedom in the model. Because the data set size is fixed, the
degrees of freedom in the model need to be adjustable. The chapters dealing
with penalized linear regression will show you how the adjustability is built into
penalized linear regression and how to use it to achieve optimum performance.
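Here is a minimal illustration of that adjustability on synthetic "wide" data (many more columns than rows, a deliberate stand-in for the genetic setting): as the lasso penalty grows, the number of features the model actually uses, and with it the effective degrees of freedom, shrinks.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 200 "patients" but 2,000 "genes": far more degrees of freedom than data points.
X, y = make_regression(n_samples=200, n_features=2000, n_informative=10,
                       noise=1.0, random_state=0)

for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_used = np.sum(model.coef_ != 0)
    print("alpha =", alpha, "->", n_used, "of 2000 features with nonzero weight")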
NOTE The two broad categories of algorithms addressed in this book match
those that Jeremy Howard and I presented at Strata Conference in 2012. Jeremy took
ensemble methods, and I took penalized linear regression. We had fun arguing about
the relative merits of the two groups. In reality, however, those two cover something
like 80 percent of the model building that I do, and there are good reasons for that.
Chapter 3 goes into more detail about why one algorithm or another is a
better choice for a given problem. It has to do with the complexity of the problem
and the number of degrees of freedom inherent in the algorithms. The linear
models tend to train rapidly and often give equivalent performance to nonlinear
ensemble methods, especially if the data available are somewhat constrained.
Because they’re so rapid to train, it is often convenient to train linear models
for early feature selection and to ballpark achievable performance for a specific
problem. The linear models considered in this book can give information about
variable importance to aid in the feature selection process. The ensemble methods
often give better performance if there are adequate data and also give somewhat
indirect measures of relative variable importance.
The Process Steps for Building a Predictive Model
Using machine learning requires several different skills. One is the required
programming skill, which this book does not address. The other skills have to
do with getting an appropriate model trained and deployed. These other skills
are what the book does address. What do these other skills include?
Initially, problems are stated in somewhat vague language-based terms like
“Show site visitors links that they’re likely to click on.” To turn this into a working
system requires restating the problem in concrete mathematical terms, finding
data to base the prediction on, and then training a predictive model that will
predict the likelihood of site visitors clicking the links that are available for
presentation. Stating the problem in mathematical terms makes assumptions
about what features will be extracted from the available data sources and how
they will be structured.
How do you get started with a new problem? First, you look through the
available data to determine which of the data might be of use in prediction.
“Looking through the data” means running various statistical tests on the data
to get a feel for what they reveal and how they relate to what you’re trying to
predict. Intuition can guide you to some extent. You can also quantify the out-
comes and test the degree to which potential prediction features correlate with
these outcomes. Chapter 2, “Understand the Problem by Understanding the
Data,” goes through this process for the data sets that are used to characterize
and compare the algorithms outlined in the rest of the book.
By some means, you develop a set of features and start training the machine
learning algorithm that you have selected. That produces a trained model and
estimates its performance. Next, you want to consider making changes to the
features set, including adding new ones or removing some that proved unhelpful,
or perhaps changing to a different type of training objective (also called a target)
to see whether it improves performance. You’ll iterate various design decisions
to determine whether there’s a possibility of improving performance. You may
pull out the examples that show the worst performance and then attempt to
determine if there’s something that unites these examples. That may lead to
another feature to add to the prediction process, or it might cause you to bifur-
cate the data and train different models on different populations.
The goal of this book is to make these processes familiar enough to you that
you can march through these development steps confidently. That requires your
familiarity with the input data structures required by different algorithms as
you frame the problem and begin extracting the data to be used in training and
testing algorithms. The process usually includes several of the following steps:
1. Extract and assemble features to be used for prediction.
2. Develop targets for the training.
3. Train a model.
4. Assess performance on test data.
NOTE The first pass can usually be improved on by trying different sets of fea-
tures, different types of targets, and so on.
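A minimal end-to-end sketch of those four steps follows; the synthetic attribute matrix, the target, and the choice of model are placeholders for whatever a real problem supplies.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Step 1: extract and assemble features (here, a synthetic attribute matrix).
# Step 2: develop targets (make_classification supplies y; in practice this is
#         where clicks, purchases, or other outcomes become a target column).
X, y = make_classification(n_samples=2000, n_features=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Step 3: train a model.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Step 4: assess performance on data the model never saw during training.
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))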
Machine learning requires more than familiarization with a few packages.
It requires understanding and having practiced the process involved in devel-
oping a deployable model. This book aims to give you that understanding. It
assumes basic undergraduate math and some basic ideas from probability and
statistics, but the book doesn’t presuppose a background in machine learning.
At the same time, it intends to arm readers with the very best algorithms for a
wide class of problems, not necessarily to survey all machine learning algorithms
or approaches. There are a number of algorithms that are interesting but that
don’t get used often, for a variety of reasons. For example, perhaps they don’t
scale well, maybe they don’t give insight about what is going on inside, maybe
they’re difficult to use, and so on. It is well known, for example, that Gradient
Boosting (one of the algorithms covered here) is the leading winner of online
machine learning competitions by a wide margin. There are good reasons why some
algorithms are more often used by practitioners, and this book will succeed to
the extent that you understand these when you’ve finished reading.
Framing a Machine Learning Problem
Beginning work on a machine learning competition presents a simulation of
a real machine learning problem. The competition presents a brief description
(for example, announcing that an insurance company would like to better pre-
dict loss rates on their automobile policies). As a competitor, your first step is
to open the data set, take a look at the data available, and identify what form
a prediction needs to take to be useful. The inspection of the data will give an
intuitive feel for what the data represent and how they relate to the prediction
job at hand. The data can give insight regarding approaches. Figure 1.5 depicts
the process of starting from a general language statement of objective and
moving toward an arrangement of data that will serve as input for a machine
learning algorithm.
[Figure 1.5: Framing a machine learning problem. A vague objective ("Let's get better results") raises the questions "How?", "What does 'better' mean?", and "What helpful data are available?", and is refined into a table of attributes and targets that can feed a machine learning algorithm.]
The generalized statement caricatured as “Let’s get better results” has first
to be converted into specific goals that can be measured and optimized. For a
website owner, specific performance might be improved click-through rates or
more sales (or more contribution margin). The next step is to assemble data that
might make it possible to predict how likely a given customer is to click various
links or to purchase various products offered online. Figure 1.5 depicts these
data as a matrix of attributes. For the website example, they might include other
pages the visitor has viewed or items the visitor has purchased in the past. In
addition to attributes that will be used to make predictions, the machine learning
algorithms for this type of problem need to have correct answers to use for
training. These are denoted as targets in Figure 1.5. The algorithms covered in
this book learn by detecting patterns in past behaviors, but it is important that
they not merely memorize past behavior; after all, a customer might not repeat
a purchase of something he bought yesterday. Chapter 3 discusses in detail how
this process of training without memorizing works.
Usually, several aspects of the problem formulation can be done in more than
one way. This leads to some iteration between framing the problem, selecting
and training a model, and producing performance estimates. Figure 1.6 depicts
this process.

[Figure 1.6: Iteration from formulation to performance. The loop runs from a qualitative problem description to a mathematical problem description, then to model training and performance assessment, and finally to a deployed model, with a feedback path to (re-)frame the problem.]

The problem may come with specific quantitative training objectives, or part
of the job might be extracting these data (called targets or labels). Consider, for
instance, the problem of building a system to automatically trade securities.
To trade automatically, a first step might be to predict changes in the price of
a security. The prices are easily available, so it is conceptually simple to use
historical data to build training examples for which the future price changes
are known. But even that involves choices and experimentation. Future price
change could be computed in several different ways. The change could be the
difference between the current price and the price 10 minutes in the future. It
could also be the change between the current price and the price 10 days in
the future. It could also be the difference between the current price and the
maximum/minimum price over the next 10 minutes. The change in price could
be characterized by a two-state variable taking values “higher” or “lower”
depending on whether the price is higher or lower 10 minutes in the future.
Each of these choices will lead to a predictive model, and the predictions will
be used for deciding whether to buy or sell the security. Some experimentation
will be required to determine the best choice.
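Each of those target definitions is a one- or two-line transformation of the same price series; the minute-by-minute prices below are simulated solely to show the mechanics.

import numpy as np

prices = np.cumsum(np.random.RandomState(0).normal(size=500)) + 100.0
horizon = 10   # "10 minutes in the future" for minute-by-minute data

# Target A: price change measured 10 steps ahead.
change_ahead = prices[horizon:] - prices[:-horizon]

# Target B: change relative to the maximum price over the next 10 steps.
future_max = np.array([prices[i + 1:i + 1 + horizon].max()
                       for i in range(len(prices) - horizon)])
change_to_max = future_max - prices[:-horizon]

# Target C: a two-state label, 1 for "higher" and 0 for "lower", 10 steps out.
higher_or_lower = (change_ahead > 0).astype(int)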
Feature Extraction and Feature Engineering
Deciding which variables to use for making predictions can also involve exper-
imentation. This process is known as feature extraction and feature engineering.
Feature extraction is the process of taking data from a free-form arrangement,
such as words in a document or on a web page, and arranging them into rows
and columns of numbers. For example, a spam-filtering problem begins with
text from emails and might extract things such as the number of capital letters
in the document, the number of words in all caps, the number of times the
word "buy" appears in the document, and other numeric features selected to
highlight the differences between spam and non-spam emails.
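A minimal sketch of that kind of extraction reduces raw email text to a short row of numbers; the three features below are simply the ones this paragraph mentions.

def extract_features(email_text):
    """Turn free-form email text into a fixed-length row of numeric features."""
    words = email_text.split()
    n_capital_letters = sum(ch.isupper() for ch in email_text)
    n_all_caps_words = sum(1 for w in words if len(w) > 1 and w.isupper())
    n_buy = sum(1 for w in words if w.strip('.,!?').lower() == 'buy')
    return [n_capital_letters, n_all_caps_words, n_buy]

print(extract_features("BUY now!! Limited offer, buy TODAY and save"))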
Feature engineering is the process of manipulating and combining features
to arrive at more informative ones. Building a system for trading securities
involves feature extraction and feature engineering. Feature extraction would
be deciding what things will be used to predict prices. Past prices, prices of
related securities, interest rates, and features extracted from news releases have
all been incorporated into various trading systems that have been discussed
publicly. In addition, securities prices have a number of engineered features
with names like stochastic, MACD (moving average convergence divergence), and
RSI (relative strength index) that are basically functions of past prices that their
inventors believed to be useful in securities trading.
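Feature engineering in that setting often reduces to simple functions of the recent price history; the moving-average and crude momentum columns below are generic illustrations in the spirit of those indicators, not implementations of them.

import numpy as np

def moving_average(prices, window):
    """Average of the most recent `window` prices, one value per time step."""
    return np.convolve(prices, np.ones(window) / window, mode='valid')

prices = np.cumsum(np.random.RandomState(1).normal(size=250)) + 50.0
ma_short = moving_average(prices, 5)
ma_long = moving_average(prices, 20)

# One engineered feature: short-term average minus long-term average over the
# time steps where both are defined, a crude momentum signal.
momentum = ma_short[-len(ma_long):] - ma_long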
After a reasonable set of features is developed, you can train a predictive
model like the ones described in this book, assess its performance, and make
a decision about deploying the model. Generally, you’ll want to make changes
to the features used, if for no other reason than to confirm that your model’s
performance is adequate. One way to determine which features to use is to try
all combinations, but that can take a lot of time. Inevitably, you’ll face competing
pressures to improve performance but also to get a trained model into use
quickly. The algorithms discussed in this book have the beneficial property of
providing metrics on the utility of each attribute in producing predictions. One
training pass will generate rankings on the features to indicate their relative
importance. This information helps speed the feature engineering process.
NOTE Data preparation and feature engineering is estimated to take 80 to 90
percent of the time required to develop a machine learning model.
The model training process, which begins each time a baseline set of features
is attempted, also involves a process. A modern machine learning algorithm,
such as the ones described in this book, trains something like 100 to 5,000 differ-
ent models that have to be winnowed down to a single model for deployment.
The reason for generating so many models is to provide models of all different
shades of complexity. This makes it possible to choose the model that is best
suited to the problem and data set. You don’t want a model that’s too simple or
you give up performance, but you don’t want a model that’s too complicated or
you’ll overfit the problem. Having models in all shades of complexity lets you
pick one that is just right.
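Gradient boosting gives a concrete picture of that winnowing: each boosting stage adds one more tree, so a single training run yields hundreds of nested models of increasing complexity, and held-out error picks the one that is just right. The sketch below uses synthetic data and scikit-learn's staged predictions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                random_state=0).fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after 1 tree, 2 trees, ...,
# 500 trees: in effect 500 models of increasing complexity from one run.
test_errors = [mean_squared_error(y_test, pred)
               for pred in gbm.staged_predict(X_test)]
best_n_trees = int(np.argmin(test_errors)) + 1
print("model chosen for deployment uses", best_n_trees, "trees")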
Determining Performance of a Trained Model
The fit of a model is determined by how well it performs on data that were not
used to train the model. This is an important step and conceptually simple. Just
set aside some data. Don’t use it in training. After the training is finished, use
the data you set aside to determine the performance of your algorithm. This
book discusses several systematic ways to hold out data. Different methods have
different advantages, depending mostly on the size of the training data. As easy
as it sounds, people continually figure out complicated ways to let the test data
“leak” into the training process. At the end of the process, you’ll have an algorithm
that will sift through incoming data and make accurate predictions for you. It
might need monitoring as changing conditions alter the underlying statistics.
Chapter Contents and Dependencies
Different readers may want to take different paths through this book, depend-
ing on their backgrounds and whether they have time to understand the basic
principles. Figure 1.7 shows how chapters in the book depend on one another.
[Figure 1.7: Dependence of chapters on one another. The diagram relates Chapter 1 (Two Essential Algorithms), Chapter 2 (Understand the Problem by Understanding the Data), Chapter 3 (Predictive Model Building), Chapter 4 (Penalized Linear Regression), Chapter 5 (Applying Penalized Linear Regression), Chapter 6 (Ensemble Methods), and Chapter 7 (Applying Ensemble Methods).]
Chapter 2 goes through the various data sets that will be used for problem
examples to illustrate the use of the algorithms that will be developed and to
compare algorithms to each other based on performance and other features.
The starting point with a new machine learning problem is digging into the
data set to understand it better and to learn its problems and idiosyncrasies.
Part of the point of Chapter 2 is to demonstrate some of the tools available in
Python for data exploration. You might want to go through some but not all of
the examples shown in Chapter 2 to become familiar with the process and then
come back to Chapter 2 when diving into the solution examples later.
Chapter 3 explains the basic tradeoffs in a machine learning problem and
introduces several key concepts that are used throughout the book. One key
concept is the mathematical description of predictive problems. The basic dis-
tinctions between classification and regression problems are shown. Chapter 3
also introduces the concept of using out-of-sample data for determining the
performance of a predictive model. Out-of-sample data are data that have not
been included in the training of the model. Good machine learning practice
demands that a developer produce solid estimates of how a predictive model
will perform when it is deployed. This requires excluding some data from the
training set and using it to simulate fresh data. The reasons for this requirement,
the methods for accomplishing it, and the tradeoffs between different methods
are described. Another key concept is that there are numerous measures of
system performance. Chapter 3 outlines these methods and discusses tradeoffs
between them. Readers who are already familiar with machine learning can
browse this chapter and scan the code examples instead of reading it carefully
and running the code.
Chapter 4 shows the core ideas of the algorithms for training penalized
regression models. The chapter introduces the basic concepts and shows how
the algorithms are derived. Some of the examples introduced in Chapter 3
are used to motivate the penalized linear regression methods and algorithms
for their solution. The chapter runs through code for the core algorithms for
solving penalized linear regression training. Chapter 4 also explains several
extensions to linear regression methods. One of these extensions shows how to
code factor variables as real numbers so that linear regression methods can be
applied. Linear regression can be used only on problems where the predictors
are real numbers; that is, the quantities being used to make predictions have
to be numeric. Many practical and important problems have variables like
“single, married, or divorced” that can be helpful in making predictions. To
incorporate variables of this type (called categorical variables) in a linear regres-
sion model, means have been devised to convert categorical variables to real
number variables. Chapter 4 covers those methods. In addition, Chapter 4 also
shows methods (called basis expansion) for getting nonlinear functions out of
linear regression. Sometimes basis expansion can be used to squeeze a little
more performance out of linear regression.
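Both ideas, coding a categorical variable as numbers and expanding a numeric variable with new basis functions, come down to small data-manipulation steps; the marital-status column and squared income term below are only illustrations of the kind of transformation Chapter 4 develops.

import pandas as pd

df = pd.DataFrame({
    "income": [48.0, 72.5, 55.0, 90.0],
    "marital_status": ["single", "married", "divorced", "married"],
})

# One-hot (dummy) coding turns the categorical column into 0/1 real-number
# columns that a linear regression can consume.
coded = pd.get_dummies(df, columns=["marital_status"])

# Basis expansion: add a nonlinear function of an existing numeric column so
# that a linear model can fit a curve in the original variable.
coded["income_squared"] = coded["income"] ** 2
print(coded)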
Chapter 5 applies the penalized regression algorithms developed in Chapter 4
to a number of the problems outlined in Chapter 2. The chapter outlines the
Python packages that implement penalized regression methods and uses them
to solve problems. The objective is to cover a wide enough variety of problems
that practitioners can find a problem close to the one that they have in front
of them to solve. Besides quantifying and comparing predictive performance,
Chapter 5 looks at other properties of the trained algorithms. Variable selection
and variable ranking are important to understand. This understanding will
help speed development on new problems.
Chapter 6 develops ensemble methods. Because ensemble methods are most
frequently based on binary decision trees, the first step is to understand the
principles of training and using binary decision trees. Many of the properties
of ensemble methods are ones that they inherit directly from binary decision
trees. With that understanding in place, the chapter explains the three principal
ensemble methods covered in the book. The common names for these are Bagging,
Boosting, and Random Forest. For each of these, the principles of operation
are outlined and the code for the core algorithm is developed so that you can
understand the principles of operation.
Chapter 7 uses ensemble methods to solve problems from Chapter 2 and then
compares the various algorithms that have been developed. The comparison
involves a number of elements. Predictive performance is one element of
comparison. The time required for training and performance is another element.
All the algorithms covered give variable importance ranking, and this information
is compared on a given problem across several different algorithms.
In my experience, teaching machine learning to programmers and com-
puter scientists, I’ve learned that code examples work better than mathematics
for some people. The approach taken here is to provide some mathematics,
algorithm sketches, and code examples to illustrate the important points. Nearly
all the methods that are discussed will be found in the code included in the
book and on the website. The intent is to provide hackable code to help you get up
and running on your own problems as quickly as possible.
Summary
This chapter has given a specification for the kinds of problems that you’ll be able
to solve and a description of the process steps for building predictive models.
The book concentrates on two algorithm families. Limiting the number of algo-
rithms covered allows for a more thorough explanation of the background for
these algorithms and of the mechanics of using them. This chapter showed some
comparative performance results to motivate the choice of these two particular
families. The chapter discussed the different strengths and characteristics of each.
Random documents with unrelated
content Scribd suggests to you:
In July, 1841, Mr. Payne patented his invention for sulphate of iron in
London; and in June and November, 1846, in France; and in 1846 in London,
for carbonate of soda.[13] The materials employed in Payne’s process are
sulphate of iron and sulphate of lime, both being held in solution with water.
The timber is placed in a cylinder in which a vacuum is formed by the
condensation of steam, assisted by air pumps; a solution of sulphate of iron is
then admitted into the vessel, which instantly insinuates itself into all the
pores of the wood, previously freed from air by the vacuum, and, after about
a minute’s exposure, impregnates its entire substance; the sulphate of iron is
then withdrawn, and another solution of sulphate of lime thrown in, which
enters the substance of the wood in the same manner as the former solution,
and the two salts react upon each other, and form two new combinations
within the substance of the wood—muriate of iron, and muriate of lime. One
of the most valuable properties of timber thus prepared is its perfect
incombustibility: when exposed to the action of flame or strong heat, it simply
smoulders, and emits no flame. We may also reasonably infer that with such a
compound in its pores, decay must be greatly retarded, and the liability to
worms lessened, if not prevented. The greatest drawback consists in the
increased difficulty of working. This invention has been approved by the
Commissioners of Woods and Forests, and has received much approbation
from the architectural profession. Mr. Hawkshaw, C.E., considers that this
process renders wood brittle. It was employed for rendering wood
uninflammable in the Houses of Parliament (we presume, in the carcase; for
steaming was used for the joiner’s work), British Museum, and other public
buildings; and also for the Royal Stables at Claremont.
In 1842, Mr. Bethell stated before the Institute of Civil Engineers, London,
that silicate of potash, or soluble glass, rendered wood uninflammable.
In 1842, Professor Brande proposed corrosive sublimate in turpentine, or oil
of tar, as a preservative solution.
In 1845, Mr. Ransome suggested the application of silicate of soda, to be
afterwards decomposed by an acid in the fibre of the wood; and in 1846, Mr
Payne proposed soluble sulphides of the earth (barium sulphide, c.), to be
also afterwards decomposed in the woods by acids.
In 1855, a writer in the ‘Builder’ suggested an equal mixture of alum and
borax (biborate of soda) to be used for making wood uninflammable. We have
no objection to the use of alum and borax to render wood uninflammable,
providing it does not hurt the wood.
Such are the principal patents, suggestions, and inventions, up to the year
1856; but there are many more which have been brought before the public,
some of which we will now describe.
Dr. Darwin, some years since, proposed absorption, first, of lime water,
then of a weak solution of sulphuric acid, drying between the two, so as to
form a gypsum (sulphate of lime) in the pores of the wood, the latter to be
previously well seasoned, and when prepared to be used in a dry situation.
Dr. Parry has recommended a preparation composed of bees-wax, roll
brimstone, and oil, in the proportion of 1, 2, and 3 ounces to ¾ gallon of
water; to be boiled together and laid on hot.
Mr. Pritchard, C.E., of Shoreham, succeeded in establishing pyrolignite of
iron and oil of tar as a preventive of dry rot; the pyrolignite to be used very
pure, the oil applied afterwards, and to be perfectly free from any particle of
ammonia.
Mr. Toplis recommends the introduction into the pores of the timber of a
solution of sulphate or muriate of iron; the solution may be in the proportion
of about 2 lb. of the salt to 4 or 5 gallons of water.
An invention has been lately patented by Mr. John Cullen, of the North
London Railway, Bow, for preserving wood from decay. The inventor proposes
to use a composition of coal-tar, lime, and charcoal; the charcoal to be
reduced to a fine powder, and also the lime. These materials to be well mixed,
and subjected to heat, and the wood immersed therein. The impregnation of
the wood with the composition may be materially aided by means of
exhaustion and pressure. Wood thus prepared is considered to be proof
against the attacks of the white ant.
The process of preserving wood from decay invented by Mr. L. S. Robins, of
New York, was proposed to be worked extensively by the “British Patent Wood
Preserving Company.” It consists in first removing the surface moisture, and
then charging and saturating the wood with hot oleaginous vapours and
compounds. As the Robins’ process applies the preserving material in the form
of vapour, the wood is left clean, and after a few hours’ exposure to the air it
is said to be fit to be handled for any purposes in which elegant workmanship
is required. Neither science nor extraordinary skill is required in conducting
the process, and the treatment under the patent is said to involve only a
trifling expense.
Reference has already been made to the use of petroleum. The almost
unlimited supply of it within the last few years has opened out a new and
almost boundless source of wealth. An invention has been patented in the
name of Mr. A. Prince, which purports to be an improvement in the mode of
preserving timber by the aid of petroleum. The invention consists, firstly, in
the immersion of the timber in a suitable vessel or receptacle, and to exhaust
the air therefrom, by the ordinary means of preserving wood by saturation.
The crude petroleum is next conveyed into the vessel, and thereby caused to
penetrate into every pore or interstice of the woody fibre, the effect being, it
is said, to thoroughly preserve the wood from decay. He also proposes to mix
any cheap mineral paint or pigment with crude petroleum to be used as a
coating for the bottom of ships before the application of the sheathing, and
also to all timber for building or other purposes. The composition is
considered to render the timber indestructible, and to repel the attacks of
insects. Without expressing any opinion upon this patent as applied to wood
for building purposes, we must again draw attention to the high inflammability
of petroleum.
The ‘Journal’ of the Board of Arts and Manufactures for Upper Canada
considers the following to be the cheapest and the best mode of preserving
timber in Canada: Let the timbers be placed in a drying chamber for a few
hours, where they would be exposed to a temperature of about 200°, so as to
drive out all moisture, and by heat, coagulate the albuminous substance,
which is so productive of decay. Immediately upon being taken out of the
drying chamber, they should be thrown into a tank containing crude
petroleum. As the wood cools, the air in the pores will contract, and the
petroleum occupy the place it filled. Such is the extraordinary attraction
shown by this substance for dry surfaces, that by the process called capillary
attraction, it would gradually find its way into the interior of the largest pieces
of timber, and effectually coat the walls and cells, and interstitial spaces.
During the lapse of time, the petroleum would absorb oxygen, and become
inspissated, and finally converted into a bituminous substance, which would
effectually shield the wood from destruction by the ordinary processes of
decay. The process commends itself on account of its cheapness. A drying
chamber might easily be constructed of sheet iron properly strengthened, and
petroleum is very abundant and accessible. Immediately after the pieces of
timber have been taken out of the petroleum vat, they should be sprinkled
with wood ashes in order that a coating of this substance may adhere to the
surface, and carbonate of potash be absorbed to a small depth. The object of
this is to render the surface incombustible; and dusting with wood ashes until
quite dry will destroy this property to a certain extent.
The woodwork of farm buildings in this country is sometimes subjected to
the following: Take two parts of gas-tar, one part of pitch, one part half
caustic lime and half common resin; mix and boil these well together, and put
them on the wood quite hot. Apply two or three coats, and while the last coat
is still warm, dash on it a quantity of well-washed sharp sand, previously
prepared by being sifted through a sieve. The surface of the wood will then
have a complete stone appearance, and may be durable. It is, of course,
necessary, that the wood be perfectly dry, and one coat should be well
hardened before the next is put on. It is necessary, by the use of lime and
long boiling, to get quit of the ammonia of the tar, as it is considered to injure
the wood.
Mr. Abel, the eminent chemist to the War Department, recommends the
application of silicate of soda in solution, for giving to wood, when applied to
it like paint, a hard coating, which is durable for several years, and is also a
considerable protection against fire. The silicate of soda, which is prepared for
use in the form of a thick syrup, is diluted in water in the proportion of 1 part
by measure of the syrup to 4 parts of water, which is added slowly, until a
perfect mixture is obtained by constant stirring. The wood is then washed
over two or three times with this liquid by means of an ordinary whitewash
brush, so as to absorb as much of it as possible. When this first coating is
nearly dry, the wood is painted over with another wash made by slaking good
fat lime, diluted to the consistency of thick cream. Then, after the limewash
has become moderately dry, another solution of the silicate of soda, in the
proportion of 1 of soda to 2 of water, is applied in the same manner as the
first coating. The preparation of the wood is then complete; but if the lime
coating has been applied too quickly, the surface of the wood may be found,
when quite dry, after the last coating of the silicate, to give off a little lime
when rubbed with the hand; in which case it should be once more coated
over with a solution of the silicate of the same strength as in the first
operation. If Mr. Abel had been an architect or builder, he would never have
invented this process. What would the cost be? and would not a special clerk
of the works be necessary to carry out this method in practice?
The following coating for piles and posts, to prevent them from rotting, has
been recommended on account of its being economical, impermeable to
water, and nearly as hard as stone: Take 50 parts of resin, 40 of finely
powdered chalk, 300 parts of fine white sharp sand, 4 parts of linseed oil, 1
part of native red oxide of copper, and 1 part of sulphuric acid. First, heat the
resin, chalk, sand, and oil, in an iron boiler; then add the oxide, and, with
care, the acid; stir the composition carefully, and apply the coat while it is still
hot. If it be not liquid enough, add a little more oil. This coating, when it is
cold and dry, forms a varnish which is as hard as stone.
Another method for fencing, gate-posts, garden stakes, and timber which is
to be buried in the earth, may be mentioned. Take 11 lb. of blue vitriol
(sulphate of copper) and 20 quarts of water; dissolve the vitriol with boiling
water, and then add the remainder of the water. The end of the wood is then
to be put into the solution, and left to stand four or five days; for shingle,
three days will answer, and for posts, 6 inches square, ten days, Care should
be taken that the saturation takes place in a well-pitched tank or keyed box,
for the reason that any barrel will be shrunk by the operation so as to leak.
Instead of expanding an old cask, as other liquids do, this shrinks it. This
solution has also been used in dry rot cases, when the wood is only slightly
affected.
It will sometimes be found that when oak fencing is put up new, and tarred
or painted, a fungus will vegetate through the dressing, and the interior of the
wood be rapidly destroyed; but when undressed it seems that the weather
desiccates the gum or sap, and leaves only the woody fibre, and the fence
lasts for many years.
About fifteen years ago, Professor Crace Calvert, F.R.S., made an
investigation for the Admiralty, of the qualities of different woods used in ship-
building. He found the goodness of teak to consist in the fact that it is highly
charged with caoutchouc; and he considered that if the tannin be soaked out
of a block of oak, it may then be interpenetrated by a solution of caoutchouc,
and thereby rendered as lasting as teak.
We can only spare the space for a few words about this method.
1st. We have seen lead which has formed part of the gutter of a building
previous to its being burnt down: lead melts at 612° F.; caoutchouc at 248°
F.; therefore caoutchouc would not prevent wood from being destroyed by
fire. At 248° caoutchouc is highly inflammable, burns with a white flame and
much smoke.
2nd. We are informed by a surgical bandage-maker of high repute, that
caoutchouc, when used in elastic kneecaps, c., will perish, if the articles are
left in a drawer for two or three years. When hard, caoutchouc is brittle.
Would it be advisable to interpenetrate oak with a solution of caoutchouc?
In 1825, Mr. Hancock proposed a solution of 1½ lb. of caoutchouc in 3 lb. of
essential oil, to which was to be added 9 lb. of tar. Mr. Parkes, in 1843, and M.
Passez, in 1845, proposed to dissolve caoutchouc in sulphur: painting or
immersing the wood. Maconochie, in 1805, after his return from India,
proposed distilled teak chips to be injected into fir woods.
Although England has been active in endeavouring to discover the best and
cheapest remedy for dry rot, France has also been active in the same
direction.
M. le Comte de Chassloup Lambat, Member of the late Imperial Senate of
France, considers that, as sulphur is most prejudicial to all species of fungi,
there might, perhaps, be some means of making it serviceable in the
preservation of timber. We know with what success it is used in medicine. It is
also known that coopers burn a sulphur match in old casks before using them
—a practice which has evidently for its object the prevention of mustiness,
often microscopic, which would impart a bad flavour to the wine.
M. de Lapparent, late Inspector-General of Timber for the French Navy,
proposed to prevent the growth of fungi by the use of a paint having flour of
sulphur as a basis, and linseed oil as an amalgamater. In 1862 he proposed
charring wood; we have referred to this process in our last chapter (p. 96).
The paint was to be composed of:
Flour of sulphur 200 grammes 3,088 grains.
Common linseed oil 135 ” 2,084 ”
Prepared oil of manganese 30 ” 463 ”
He considered that by smearing here and there either the surfaces of the
ribs of a ship, or below the ceiling, with this paint, a slightly sulphurous
atmosphere will be developed in the hold, which will purify the air by
destroying, at least in part, the sporules of the fungi. He has since stated that
his anticipations have been fully realized. M. de Lapparent also proposes to
prevent the decay of timber by subjecting it to a skilful carbonization with
common inflammable coal gas. An experiment was made at Cherbourg, which
was stated to be completely successful. The cost is only about 10 cents per
square yard of framing and planking.[14] M. de Lapparent’s gas method is
useful for burning off old paint. We saw it in practice (April, 1875) at Waterloo
Railway Station, London, and it appeared to be effective.
At the suggestion of MM. Le Châtelier (Engineer-in-chief of mines) and
Flachat, C.E.’s, M. Ranee, a few years since, injected in a Légé and Fleury
cylinder certain pieces of white fir, red fir, and pitch pine with chloride of
sodium, which had been deprived of the manganesian salts it contained, to
destroy its deliquescent property. Some pieces were injected four times, but
the greatest amount of solution injected into pitch pine heart-wood was from
3 to 4 per cent., and very little more was injected into the white and red fir
heart-wood. It was also noticed that sapwood, after being injected four times,
only gained 8 per cent. in weight in the last three operations. The
experiments made to test the relative incombustibility of the injected wood
showed that the process was a complete failure; the prepared wood burning
as quickly as the unprepared wood.
M. Paschal le Gros, of Paris, has patented his system for preserving all kinds
of wood, by means of a double salt of manganese and of zinc, used either
alone or with an admixture of creosote. The solution, obtained in either of the
two ways, is poured into a trough, and the immersion of the logs or pieces of
wood is effected by placing them vertically in the trough in such a manner
that they are steeped in the liquid to about three-quarters of their length. The
wood is thus subjected to the action of the solution during a length of time
varying from twelve to forty-eight hours. The solution rises in the fibres of the
wood, and impregnates them by the capillary force alone, without requiring
any mechanical action. The timber is said to become incombustible, hard, and
very lasting.
M. Fontenay, C.E., in 1832, proposed to act upon the wood with what he
designated metallic soap, which could be obtained from the residue in
greasing boxes of carriages; also from the acid remains of oil, suet, iron, and
brass dust; all being melted together. In 1816 Chapman tried experiments
with yellow soap; but to render it sufficiently fluid it required forty times its
weight of water, in which the quantity of resinous matter and tallow would
scarcely exceed ⅟80th; therefore no greater portion of these substances could
be left in the pores of the wood, which could produce little effect.
M. Letellier, in 1837, proposed to use deuto-chloride of mercury as a
preservative for wood.
M. Dondeine’s process was formerly used in France and Germany. It is a
paint, consisting of many ingredients, the principal being linseed oil, resin,
white lead, vermilion, lard, and oxide of iron. All these are to be well mixed,
and reduced by boiling to one-tenth, and then applied with a brush. If applied
cold, a little varnish or turpentine to be added.
Little is known in England of the inventions which have arisen in foreign
countries not already mentioned.
M. Szerelmey, a Hungarian, proposed, in 1868, potassa, lime, sulphuric
acid, petroleum, c., to preserve wood.
In Germany, the following method is sometimes used for the preservation of
wood: Mix 40 parts of chalk, 40 parts of resin, 4 of linseed oil; melting them
together in an iron pot; then add 1 part of native oxide of copper, and
afterwards, carefully, 1 part of sulphuric acid. The mixture is applied while hot
to the wood by means of a brush, and it soon becomes very hard.[15]
Mr. Cobley, of Meerholz, Hesse, has patented the following preparation. A
strong solution of potash, baryta, lime, strontia, or any of their salts, are
forced into the pores of timber in a close iron vessel by a pump. After this
operation, the liquid is run off from the timber, and hydro-fluo-silicic acid is
forced in, which, uniting with the salts in the timber, forms an insoluble
compound capable of rendering the wood uninflammable.
About the year 1800, Neils Nystrom, chemist, Norkopping, recommended a
solution of sea salt and copperas, to be laid upon timber as hot as possible, to
prevent rottenness or combustion. He also proposed a solution of sulphate of
iron, potash, alum, c., to extinguish fires.
M. Louis Vernet, Buenos Ayres, proposed to preserve timber from fire by
the use of the following mixture: Take 1 lb. of arsenic, 6 lb. of alum, and 10
lb. of potash, in 40 gallons of water, and mix with oil, or any suitable tarry
matters, and paint the timber with the solution. We have already referred to
the conflicting evidence respecting alum and water for wood: we can now
state that Chapman’s experiments proved that arsenic afforded no protection
against dry rot. Experiments in Cornwall have proved that where arsenical
ores have lain on the ground, vegetation will ensue in two or three years after
removal of the ore. If, therefore, alum or arsenic have no good effect on
timber with respect to the dry rot, we think the use of both of them together
would certainly be objectionable.
The last we intend referring to is a composition frequently used in China,
for preserving wood. Many buildings in the capital are painted with it. It is
called Schoicao, and is made with 3 parts of blood deprived of its febrine, 4
parts of lime and a little alum, and 2 parts of liquid silicate of soda. It is
sometimes used in Japan.
It would be practically useless to quote any further remedies, and the
reader is recommended to carefully study those quoted in this chapter, and of
their utility to judge for himself, bearing in mind those principles which we
have referred to before commencing to describe the patent processes. A large
number of patents have been taken out in England for the preservation of
wood by preservative processes, but only two are now in use,—that is, to any
extent,—viz. Bethell’s and Burnett’s. Messrs. Bethell and Co. now impregnate
timber with copper, zinc, corrosive sublimate, or creosote; the four best
patents.
We insert here a short analysis of different methods proposed for seasoning
timber:—
Vacuum and Pressure Processes generally.
Bréant’s.
Bethell’s.
Payne’s.
Perin’s.
Tissier’s.
Vacuum by Condensation of Steam.
Tissier.
Bréant.
Payne.
Renard Perin, 1848.
Brochard and Watteau, 1847.
Separate Condenser.
Tissier.
Employ Sulphate of Copper in closed vessels.
Bethell’s Patent, 11th July, 1838.
Tissier, 22nd October, 1844.
Molin’s Paper, 1853.
Payen’s Pamphlet.
Légé and Fleury’s Pamphlet.
Current of Steam.
Moll’s Patent, 19th January, 1835.
Tissier’s ” 22nd October, 1844.
Payne’s ” 14th Nov., 1846.
Meyer d’Uslaw, 2nd January, 1851.
Payen’s Pamphlet.
Hot Solution.
Tissier’s Patent, 22nd October, 1844.
Knab’s Patent, 8th September, 1846.
Most solutions used are heated.
The following are the chief ingredients which have been recommended, and
some of them tried, to prevent the decomposition of timber, and the growth
of fungi:—
Acid, Sulphuric.
” Vitriolic.
” of Tar.
Carbonate of Potash.
” Soda.
” Barytes.
Sulphate of Copper.
” Iron.
” Zinc.
” Lime.
” Magnesia.
” Barytes.
” Alumina.
” Soda.
Salt, Neutral.
Salt, Selenites.
Oil, Vegetable.
” Animal.
” Mineral.
Muriate of Soda.
Marcosites, Mundic.
” Barytes.
Nitrate of Potash.
Animal Glue.
” Wax.
Quick Lime.
Resins of different kinds.
Sublimate, Corrosive.
Peat Moss.
For the non-professional reader we find we have three facts:
1st. The most successful patentees have been Bethell and Burnett, in
England; and Boucherie, in France: all B’s.
2nd. The most successful patents have been knighted. Payne’s patent was,
we believe, used by Sirs R. Smirke and C. Barry; Kyan’s, by Sir R. Smirke;
Burnett’s, by Sirs M. Peto, P. Roney, and H. Dryden; while Bethell’s patent can
claim Sir I. Brunel, and many other knights. We believe Dr. Boucherie received
the Legion of Honour in France.
3rd. There are only at the present time three timber-preserving works in
London, and they are owned by Messrs. Bethell and Co., Sir F. Burnett and
Co., and Messrs. Burt, Boulton, and Co.: all names commencing with the letter
B.
For the professional reader we find we have three hard facts:
The most successful patents may be placed in three classes, and we give
the key-note of their success.
1st. One material and one application.—Creosote, Petroleum. Order—Ancient
Egyptians, or Bethell’s, Burmese.
2nd. Two materials and one application.—Chloride of zinc and water; sulphate
of copper and water; corrosive sublimate and water. Order—Burnett,
Boucherie, Kyan.
3rd. Two materials and two applications.—Sulphate of iron and water;
afterwards sulphate of lime and water. Payne.
We thus observe there are twice three successful patent processes.
Any inventions which cannot be brought under these three classes have had
a short life; at least, we think so.
The same remarks will apply to external applications for wood—for
instance, coal-tar, one application, is more used for fencing than any other
material.
We are much in want of a valuable series of experiments on the application
of various chemicals on wood to resist burning to pieces; without causing it to
rot speedily.
CHAPTER VI.
ON THE MEANS OF PREVENTING DRY ROT IN
MODERN HOUSES; AND THE CAUSES OF THEIR
DECAY.
Although writers on dry rot have generally deemed it a new disease, there
is foundation to believe that it pervaded the British Navy in the reign of
Charles II. “Dry rot received a little attention” so writes Sir John Barrow,
“about the middle of the last century, at some period of Sir John Pringle’s
presidency of the Royal Society of London.” As timber trees were, no doubt,
subject to the same laws and conditions 500 years ago as they are at the
present day, it is indeed extremely probable that if at that time unseasoned
timber was used, and subjected to heat and moisture, dry rot made its
appearance. We propose in this chapter to direct attention to the several
causes of the decay of wood, which by proper building might be averted.
The necessity of proper ventilation round the timbers of a building has been
repeatedly advised in this volume; for even timber which has been naturally
seasoned is at all times disposed to resume, from a warm and stagnant
atmosphere, the elements of decay. We cannot therefore agree with the
following passage from Captain E. M. Shaw’s book on ‘Fire Surveys,’ which is
to be found at page 44:—“Circulation of air should on no account be
permitted in any part of a building not exposed to view, especially under
floors, or inside skirting boards, or wainscots.” In the course of this chapter,
the evil results from a want of a proper circulation of air will be shown.
In warm cellars, or any close confined situations, where the air is filled with
vapour without a current to change it, dry rot proceeds with astonishing
rapidity, and the timber work is destroyed in a very short time. The bread
rooms of ships; behind the skirtings, and under the wooden floors, or the
basement stories of houses, particularly in kitchens, or other rooms where
there are constant fires; and, in general, in every place where wood is
exposed to warmth and damp air, the dry rot will soon make its appearance.
All kinds of stoves are sure to increase the disease if moisture be present.
The effect of heat is also evident from the rapid decay of ships in hot
climates; and the warm moisture given out by particular cargoes is also very
destructive. Hemp will, without being injuriously heated, emit a moist warm
vapour: so will pepper (which will affect teak) and cotton. The ship ‘Brothers’
built at Whitby, of green timber, proceeded to St. Petersburgh for a cargo of
hemp. The next year it was found on examination that her timbers were
rotten, and all the planking, except a thin external skin. It is also an important
fact that rats very rarely make their appearance in dry places: under floors
they are sometimes very destructive.
As rats will sometimes destroy the structural parts of wood framing, a few
words about them may not be out of place. If poisoned wheat, arsenic, c.,
be used, the creatures will simply eat the things and die under the floor,
causing an intolerable stench. The best method is to make a small hole in a
corner of the floor (unless they make it themselves) large enough to permit
them to come up; the following course is then recommended:—Take oil of
amber and ox-gall in equal parts; add to them oatmeal or flour sufficient to
form a paste, which divide into little balls, and lay them in the middle of the
infested apartment at night time. Surround the balls with a number of saucers
filled with water—the smell of the oil is sure to attract the rats, they will
greedily devour the balls, and becoming intolerably thirsty will drink till they
die on the spot. They can be buried in the morning.
Building timber into new walls is often a cause of decay, as the lime and
damp brickwork are active agents in producing putrefaction, particularly
where the scrapings of roads are used, instead of sand, for mortar. Hence it is
that bond timbers, wall plates, and the ends of girders, joists, and lintels are
so frequently found in a state of decay. The ends of brestsummers are
sometimes cased in sheet lead, zinc, or fire-brick, as being impervious to
moisture. The old builders used to bed the ends of girders and joists in loam
instead of mortar, as directed in the Act of Parliament, 19 Car. II. c. 3, for
rebuilding the City of London.
In Norway, all posts in contact with the earth are carefully wrapped round
with flakes of birch bark for a few inches above and below the ground.
Timber that is to lie in mortar—as, for instance, the ends of joists, door sills
and frames of doors and windows, and the ends of girders—if pargeted over
with hot pitch, will, it is said, be preserved from the effects of the lime. In
taking down, some years since, in France, some portion of the ancient
Château of the Roque d’Oudres, it was found that the extremities of the oak
girders were perfectly preserved, although these timbers were supposed to
have been in their places for upwards of 600 years. The whole of these
extremities buried in the walls were completely wrapped round with plates of
cork. When demolishing an ancient Benedictine church at Bayonne, it was
found that the whole of the fir girders were entirely worm eaten and rotten,
with the exception, however, of the bearings, which, as in the case just
mentioned, were also completely wrapped round with plates of cork. These
facts deserve consideration.
If any of our professional readers should wish to try cork for the ends of
girders, they will do well to choose the Spanish cork, which is the best.
In this place it may not be amiss to point out the dangerous consequences
of building walls so that their principal support depends on timber. The usual
method of putting bond timber into walls is to lay it next the inside; this bond
often decays, and, of course, leaves the walls resting only upon the external
course or courses of brick; and fractures, bulges, or absolute failures are the
natural consequences. This evil is in some degree avoided by placing the bond
in the middle of the wall, so that there is brickwork on each side, and by not
putting continued bond for nailing the battens to. We object to placing bond
in the middle of a wall: the best way, where it can be managed, is to corbel
out the wall, resting the ends of the joists on the top course of bricks; thus
doing away with the wood-plate. In London, wood bond is prohibited by Act
of Parliament, and hoop-iron bond (well tarred and sanded) is now generally
used. The following is an instance of the bad effects of placing wood bond in
walls: In taking down portions of the audience part and the whole of the
corridors of the original main walls of Covent Garden Theatre, London, in
1847, which had only been built about thirty-five years, the wood horizontal
bond timbers, although externally appearing in good condition, were found,
on a close examination by Mr. Albano, much affected by shrinkage, and the
majority of them quite rotten in the centre; consequently the whole of them
were ordered to be taken out in short lengths, and the space to be filled in
with brickwork and cement.
Some years since we had a great deal to do with “Fire Surveys;” that is to
say, surveying buildings to estimate the cost of reinstating them after being
destroyed by fire; and we often noticed that the wood bond, being rotten,
was seriously charred by the fire, and had to be cut out in short lengths, and
brickwork in cement “pinned in” in its place. Brestsummers and story posts
are rarely sufficiently burnt to affect the stability of the front wall of a shop
building.
In bad foundations, it used to be common, before concrete came into
vogue, to lay planks to build upon. Unless these planks were absolutely wet,
they were certain to rot in such situations, and the walls settled; and most
likely irregularly, rending the building to pieces. Instances of such kind of
failure frequently occur. It was found necessary, a few years since, to
underpin three of the large houses in Grosvenor Place, London, at an
immense expense. In one of these houses the floors were not less than three
inches out of level, the planking had been seven inches thick, and most of it
was completely rotten: it was of yellow fir. A like accident happened to Norfolk
House, St. James’s Square, London, where oak planking had been used.
As an example of the danger of trusting to timber in supporting heavy stone
or brickwork, the failure of the curb of the brick dome of the church of St.
Mark, at Venice, may be cited. This dome was built upon a curb of larch
timber, put together in thicknesses, with the joints crossed, and was intended
to resist the tendency which a dome has to spread outwards at the base. In
1729, a large crack and several smaller ones were observed in the dome. On
examination, the wooden curb was found to be in a completely rotten state,
and it was necessary to raise a scaffold from the bottom to secure the dome
from ruin. After it was secured from falling, the wooden curb was removed,
and a course of stone, with a strong band of iron, was put in its place.
It is said that another and very important source of destruction is the
applying end to end of two different kinds of wood: oak to fir, oak to teak or
lignum vitæ; the harder of the two will decay at the point of juncture.
The bad effects resulting from damp walls are still further increased by
hasty finishing. To enclose with plastering and joiners’ work the walls and
timbers of a building while they are in a damp state is the most certain means
of causing the building to fall into a premature state of decay.
Mr. George Baker, builder of the National Gallery, London, remarked, in
1835, “I have seen the dry rot all over Baltic timber in three years, in
consequence of putting it in contact with moist brickwork; the rot was caused
by the badness of the mortar, it was so long drying.”
Slating the external surface of a wall, to keep out the rain or damp, is
sometimes adopted: a high wall (nearly facing the south-west) of a house
near the north-west corner of Blackfriars Bridge, London, has been recently
slated from top to bottom, to keep out damp.
However well timber may be seasoned, if it be employed in a damp
situation, decay is the certain consequence; therefore it is most desirable that
the neighbourhood of buildings should be well drained, which would not only
prevent rot, but also increase materially the comfort of those who reside in
them. The drains should be made water-tight wherever they come near to the
walls; as walls, particularly brick walls, draw up moisture to a very
considerable height: very strict supervision should be placed over workmen
while the drains of a building are being laid. Earth should never be suffered to
rest against walls, and the sunk stories of buildings should always be
surrounded by an open area, so that the walls may not absorb moisture from
the earth: even open areas require to be properly built. We will quote a case
to explain our meaning. A house was erected about eighteen months ago, in
the south-east part of London, on sloping ground. Excavations were made for
the basement floor, and a dry area, “brick thick, in cement,” was built at the
back and side of the house, the top of the area wall being covered with a
stone coping; we do not know whether the bottom of the area was drained.
On the top of the coping was placed mould, forming one of the garden beds
for flowers. Where the mould rested against the walls, damp entered. The
area walls should have been built, in the first instance, above the level of the
garden-ground—which has since been done—otherwise, in course of time, the
ends of the next floor joists would have become attacked by dry rot.
Some people imagine that if damp is in a wall the best way to get rid of it is
to seal it in, by plastering the inside and stuccoing the outside of the wall; this
is a great mistake; damp will rise higher and higher, until it finds an outlet;
rotting in the meanwhile the wood bond and ends of all the joists. We were
asked recently to advise in a curious case of this kind at a house in Croydon.
On wet days the wall (stucco, outside; plaster, inside) was perfectly wet:
bands of soft red bricks in the wall, at intervals, were the culprits. To prevent
moisture rising from the foundations, some substance that will not allow it to
pass should be used at a course or two above the footings of the walls, but it
should be below the level of the lowest joists. “Taylor’s damp course” bricks
are good, providing the air-passages in them are kept free for air to pass
through: they are allowed sometimes to get choked up with dirt. Sheets of
lead or copper have been used for that purpose, but they are very expensive.
Asphalted felt is quite as good; no damp can pass through it. Care must,
however, be taken in using it if only one wall, say a party wall, has to be built.
To lay two or three courses of slates, bedded in cement, is a good method,
providing the slates “break joint,” and are well bedded in the cement.
Workmen require watching while this is being done, because if any opening
be left for damp to rise, it will undoubtedly do so. A better method is to build
brickwork a few courses in height with Portland cement instead of common
mortar, and upon the upper course to lay a bed of cement of about one inch
in thickness; or a layer of asphalte (providing the walls are all carried up to
the same level before the asphalte is applied hot). As moisture does not
penetrate these substances, they are excellent materials for keeping out wet;
and it can easily be seen if the mineral asphalte has been properly applied. To
keep out the damp from basement floors, lay down cement concrete 6 inches
thick, and on the top, asphalte 1 inch thick, and then lay the sleepers and
joists above; or bed the floor boards on the asphalte.
The walls and principal timbers of a building should always be left for some
time to dry after it is covered in. This drying is of the greatest benefit to the
work, particularly the drying of the walls; and it also allows time for the
timbers to get settled to their proper bearings, which prevents after-
settlements and cracks in the finished plastering. It is sometimes said that it is
useful because it allows the timber more time to season; but when the
carpenter considers that it is from the ends of the timber that much of its
moisture evaporates, he will see the impropriety of leaving it to season after it
is framed, and also the cause of framed timbers of unseasoned wood failing at
the joints sooner than in any other place. No parts of timber require the
perfect extraction of the sap so much as those that are to be joined.
When the plastering is finished, a considerable time should be allowed for
the work to get dry again before the skirtings, the floors, and other joiners’
work be fixed. Drying will be much accelerated by a free admission of air,
particularly in favourable weather. When a building is thoroughly dried at first,
openings for the admission of fresh air are not necessary when the
precautions against any new accessions of moisture have been effectual.
Indeed, such openings only afford harbour for vermin: unfortunately, however,
buildings are so rarely dried when first built, that air-bricks, &c., in the floors
are very necessary, and if the timbers were so dried as to be free from water
(which could be done by an artificial process), the wood would only be fit for
joinery purposes. Few of our readers would imagine that water forms ⅕th
part of wood. Here is a table (compiled from ‘Box on Heat,’ and Péclet’s great
work ‘Traité de la Chaleur’):—
                    Wood.
  Elements.       Ordinary state.
  Carbon              ·408
  Hydrogen            ·042
  Oxygen              ·334
  Water               ·200
  Ashes               ·016
  Total              1·000
Many houses at our seaport towns are erected with mortar, having sea-sand
in its composition, and then dry rot makes its appearance. If no other sand
can be obtained, the best way is to have it washed at least three times (the
contractor being under strict supervision, and subject to heavy penalties for
evasion). After each washing it should be left exposed to the action of the
sun, wind, and rain: the sand should also be frequently turned over, so that
the whole of it may in turn be exposed; even then it tastes saltish, after the
third operation. A friend of ours has a house at Worthing, which was erected a
few years since with sea-sand mortar, and on a wet day there is always a
dampness hanging about the house—every third year the staircase walls have
to be repapered: it “bags” from the walls.
In floors next the ground we cannot easily prevent the access of damp, but
this should be guarded against as far as possible. All mould should be
carefully removed, and, if the situation admits of it, a considerable thickness
of dry materials, such as brickbats, dry ashes, broken glass, clean pebbles,
concrete, or the refuse of vitriol-works (but no lime, unless unslaked), should
be laid under the floor, and over these a coat of smiths’ ashes, or of pyrites,
where they can be procured. The timber for the joists should be well
seasoned; and it is advisable to cut off all connection between wooden
ground floors and the rest of the woodwork of the building. A flue carried up
in the wall next the kitchen chimney, commencing under the floor, and
terminating at the top of the wall, and covered to prevent the rain entering,
would take away the damp under a kitchen floor. In Hamburg it is a common
practice to apply mineral asphalte to the basement floors of houses to prevent
capillary attraction; and in the towns of the north of France, gas-tar has
become of very general use to protect the basement of the houses from the
effects of the external damp.
Many houses in the suburbs (particularly Stucconia) of London are erected
by speculating builders. As soon as the carcase of a house is finished (perhaps
before) the builder is unable to proceed, for want of money, and the carcase
is allowed to stand unfinished for months. Showers of rain saturate the
previously unseasoned timbers, and pools of water collect on the basement
ground, into which they gradually, but surely, soak. Eventually the houses are
finished (probably by half a dozen different tradesmen, employed by a
mortgagee); bits of wood, rotten sawdust, shavings, &c., being left under the
basement floor. The house when finished, having pretty (!) paper on the
walls, plate-glass in the window-sashes, and a bran new brick and stucco
portico to the front door, is quickly let. Dry rot soon appears, accompanied
with its companions, the many-coloured fungi; and when their presence
should be known from their smell, the anxious wife probably exclaims to her
husband, “My dear! there is a very strange smell which appears to come from
the children’s playroom: had you not better send for Mr. Wideawake, the
builder, for I am sure there is something the matter with the drains.” Defective
ventilation, dry rot, green water thrown down sinks, &c., do not cause smells;
it’s the drains, of course!
There is another cause which affects all wood most materially, which is the
application of paint, tar, or pitch before the wood has been thoroughly dried.
The nature of these bodies prevents all evaporation; and the result of this is
that the centre of the wood is transformed into touchwood. On the other
hand, the doors, pews, and carved work of many old churches have never
been painted, and yet they are often found to be perfectly sound, after having
existed more than a century. In Chester, Exeter, and other old cities, where
much timber was formerly used, even for the external parts of buildings, it
appears to be sound and perfect, though black with age, and has never been
painted.
Mr. Semple, in his treatise on ‘Building in Water,’ mentions an instance of
some field-gates made of home fir, part of which, being near the mansion,
were painted; while the rest, being in distant parts of the grounds, were not
painted. Those which were painted soon became quite rotten, but the others,
which were not painted, continued sound.
Another cause of dry rot, which is sometimes found in suburban and
country houses, is the presence of large trees near the house. We are
acquainted with the following remarkable instance:—At the northern end of
Kilburn, London, stands Stanmore Cottage, erected a great many years ago:
about fifty feet in front of it is an old elm-tree. The owner, a few years since,
noticed cracks round the portico of the house; these cracks gradually
increased in size, and other cracks appeared in the window arches, and in
different parts of the external and internal walls. The owner became alarmed,
and sent for an experienced builder, who advised underpinning the walls.
Workmen immediately commenced to remove the ground from the
foundations, and it was then found that the foundations, as well as the joists,
were honeycombed by the roots of the elm-tree, which were growing
alongside the joists, the whole being surrounded by large masses of white
and yellow dry-rot fungus.
The insufficient use of tarpaulins is another frequent cause of dry rot. A
London architect had (a few years since) to superintend the erection of a
church in the south-west part of London; an experienced builder was
employed. The materials were of the best description and quality. When the
walls were sufficiently advanced to receive the roof, rain set in; as the clown
in one of Shakespeare’s plays observed, “the rain, it raineth every day;” it was
so, we are told, in this case for some days. The roof when finished was ceiled
below with a plaster ceiling; and above (not with “dry oakum without pitch”
but) with slates. A few months afterwards some of the slates had to be
reinstated, in consequence of a heavy storm, and it was then discovered that
nearly all the timbers of the roof were affected by dry rot. This was an air-
tight roof.
In situations favourable to rot, painting prevents every degree of
exhalation, depriving at the same time the wood of the influence of the air,
and the moisture runs through it, and insidiously destroys the wood. Most
surveyors know that moist oak cills to window frames will soon rot, and the
painting is frequently renewed; a few taps with a two-feet brass rule joint on
the top and front of cill will soon prove their condition. Wood should be left a year
or more before it is painted; or, better still, never painted at all. Artificers can
tell by the sound of any substance whether it be healthy or decayed as
accurately as a musician can distinguish his notes: thus, a bricklayer strikes
the wall with his crow, and a carpenter a piece of timber with his hammer.
The Austrians used formerly to try the goodness of the timber for ship-
building by the following method: One person applies his ear to the centre of
one end of the timber, while another, with a key, hits the other end with a
gentle stroke. If the wood be sound and good, the stroke will be distinctly
heard at the other end, though the timber should be fifty feet or more in
length. Timber affected with rot yields a particular sound when struck; but if it
were painted, and the distemper had made much progress, even a gentle
stroke will break the outside like a shell. The auger is a very useful instrument
for testing wood; the wood or sawdust it brings out can be judged by its
smell, which may be the fresh smell of pure wood; the vinous smell, or first
degree of fermentation, which is alcoholic; or the second degree, which is
putrid. The sawdust may also be tested by rubbing it between the fingers.
According to Colonel Berrien, the Michigan Central Railroad Bridge, at Niles,
was painted before seasoning, with “Ohio fire-proof paint,” forming a glazed
surface. After five years it was so rotten as to require rebuilding.
Painted floor-cloths are very injurious to wooden floors, and frequently
produce rottenness in the floors that are covered with them, as the painted
cloth prevents the access of air, and retains whatever dampness the boards
may absorb, and therefore soon causes decay. Carpets are not so injurious,
but still assist in retarding free evaporation.
Captain E. M. Shaw, in ‘Fire Surveys,’ thus writes of the floors of a building,
“They might with advantage be caulked like a ship’s deck, only with dry
oakum, without pitch.” Let us see how far oil floor-cloth and kamptulicon will
assist us in obtaining an air-tight floor.
In London houses there is generally one room on the basement floor which
is carefully covered over with an oiled floor-cloth. In such a room the dry rot
often makes its appearance. The wood absorbs the aqueous vapour which the
oil-cloth will not allow to escape; and being assisted by the heat of the air in
such apartments, the decay goes on rapidly. Sometimes, however, the dry rot
is only confined to the top of the floor. At No. 106, Fenchurch Street, London,
a wood floor was washed (a few years since) for a tenant, and oil-cloth was
laid down. Circumstances necessitated his removal a few months afterwards;
and it was then found that the oil-cloth had grown, so to speak, to the wood
flooring, and had to be taken off with a chisel: the dry rot had been
engendered merely on the surface of the floor boards, as they were sound
below as well as the joists: air bricks were in the front wall.
We have seen many instances of dry rot in passages, where oiled floor-cloth
has been nailed down and not been disturbed for two or three years.
In ordinary houses, where floor-cloth is laid down in the front kitchen, with no
ventilation under the floors, and a fire burning every day in the stove, dry rot
often appears. In the back kitchen, where there is no floor-cloth, and only an
occasional fire, it rarely appears. The air is warm and stagnant under one
floor, and cold and stagnant under the other: at temperatures of 32° to 40° Fahrenheit
the progress of dry rot is very slow.
And how does kamptulicon behave itself? The following instances of the
rapid progress of dry rot from external circumstances have recently been
communicated to us; they show that, under favourable circumstances as to
choice of timber and seasoning, this fungus growth can be readily produced
by casing-in the timber with substances impervious, or nearly so, to air.
At No. 29, Mincing Lane, London, in two out of three rooms on the first
floor, upon a fire-proof floor constructed on the Fox and Barrett principle (of
iron joists and concrete with yellow pine sleepers, on strips of wood bedded in
cement, to which were nailed the yellow pine floor-boards) kamptulicon was
nailed down by the tenant’s orders. In less than nine months the whole of the
wood sleepers, and strips of wood, as well as the boards, were seriously
injured by dry rot; whilst the third room floor, which had been covered with a
carpet, was perfectly sound.
At No. 79, Gracechurch Street, London, a room on the second floor was
inhabited, as soon as finished, by a tenant who had kamptulicon laid down.
This floor was formed in the ordinary way, with the usual sound boarding of
strips of wood, and concrete two inches thick filled in on the same, leaving a
space of about two inches under the floor boards. The floor was so seriously
decayed by dry rot in a few months, down to the level of the concrete
pugging, that it could be pulled up with the hand; below that level it
remained sound.
We will now leave oil-cloth and kamptulicon, and try what “Keene’s cement”
will do for an “air-tight” partition of a house.
At No. 16, Mark Lane, London, a partition was constructed of sound yellow
deal quarters, covered externally with “Keene’s cement, on lath, both sides.” It
was removed about two years after its construction, when it was found that
the timber was completely perished from dry rot; so much so, that the
timbers parted in the middle in places, and were for some time afterwards
moist.
It is still unfortunately the custom to keep up the old absurd fashion of
disguising woods, instead of revealing their natural beauties. Instead of
wasting time in perfect imitations of scarce or dear woods, it would be much
better to employ the same amount of time in fully developing the natural
characteristics of many of our native woods, now neglected for decorative
purposes because they are cheap and common. Many of our very
commonest woods are very beautifully grained, but their excellences for
ornamentation are lost because our decorators have not studied the best
mode of developing their beauties. Who would wish that stained deal should
be painted in imitation of oak? or that the other materials of a less costly and
inferior order should have been painted over instead of their natural faces
being exposed to view? There are beauties in all the materials used. The
inferior serve to set off by comparison the more costly, and increase their
effect. The red, yellow, and white veins of the pine timber are beautiful: the
shavings are like silk ribbons, which only nature could vein after that fashion,
and to imitate which would puzzle all the tapissiers of the Rue Mouffetard, in
Paris.
Why should not light and dark woods be commonly used in combination
with each other in our joinery? Wood may be stained of various shades, from
light to dark. The dirt or dust does not show more on stained wood than it
does on paint, and can be as easily cleaned and refreshed by periodical coats
of varnish. Those parts subjected to constant wear and tear can be protected
by more durable materials, such as finger-plates, &c. Oak can be stained dark,
almost black, by means of bichromate of potash diluted with water. Wash the
wood over with a solution of gallic acid of any required strength, and allow it
to thoroughly dry. To complete the process, wash with a solution of iron in the
form of “tincture of steel,” or a decoction of vinegar and iron filings, and a
deep and good stain will be the result. If a positive black is required, wash the
wood over with gallic acid and water two or three times, allowing it to dry
between every coat; the staining with the iron solution may be repeated. Raw
linseed oil will stay the darker process at any stage.
Doors made up of light deal, and varied in the staining, would look as well
as the ordinary graining. Good and well-seasoned materials would have to be
used, and the joiners’ work well fitted and constructed. Mouldings of a
superior character, and in some cases gilt, might be used in the panels, &c.
For doors, plain oak should be used for the stiles and rails, and pollard oak for
the panels. If rose-wood or satin-wood be used, the straight-grained wood is
the best adapted for stiles and rails; and for mahogany doors, the lights and
shades in the panels should be stronger than in the stiles and rails.
Dark and durable woods might be used in parts most exposed to wear and
tear.
Treads of stairs might be framed with oak nosings, if not at first, at least
when necessary to repair the nosings.
Skirtings could be varied by using dark and hard woods for the lower part
or plinth, lighter wood above, and finished with superior mouldings. It must,
however, be remembered that, contrary to the rule that holds good with
regard to most substances, the colours of the generality of woods become
considerably darker by exposure to the light; allowance would therefore have
to be made for this. All the woodwork must, previously to being fixed, be well
seasoned.
The practice here recommended would be more expensive than the
common method of painting, but in many cases it would be better than
graining, and cheaper in the long run. Oak wainscot and Honduras mahogany
doors are twice the price of deal doors; Spanish mahogany three times the
price. When we consider that by using the natural woods, French polished, we
save the cost of four coats of paint and graining (the customary modes), the
difference in price is very small. An extra 50l. laid out on a 500l. house would
give some rooms varnished and rubbed fittings, without paint. Would it not be
worth the outlay? It may be said that spots of grease and stains would soon
disfigure the bare wood; if so, they could easily be removed by the following
process: Take a quarter of a pound of fuller’s earth, and a quarter of a pound
of pearlash, and boil them in a quart of soft water, and, while hot, lay the
composition on the greased parts, allowing it to remain on them for ten or
twelve hours; after which it may be washed off with fine sand and water. If a
floor be much spotted with grease, it should be completely washed over with
this mixture, and allowed to remain for twenty-four hours before it is
removed.
Let us consider how we paint our doors, cupboards, &c., at the present
time. For our best houses, the stiles of our doors are painted French white;
and the panels, pink, or salmon colour! For cheaper houses, the doors,
cupboards, window linings, &c., are generally two shades of what is called
“stone colour” (as if stone was always the same colour), and badly executed
into the bargain: the best rooms having the woodwork grained in imitation of
oak, or satin-wood, &c. And such imitations! Mahogany and oak are now even
imitated on leather and paper-hangings. Wood, well and cleanly varnished,
stained, or, better still, French polished, must surely look better than these
daubs. But French polish is not extensively used in England: it is confined to
cabinet pieces and furniture, except in the houses of the aristocracy. Clean,
colourless varnish ought to be more generally used to finish off our
woodwork, instead of the painting now so common. The varnish should be
clean and colourless, as the yellow colour of the ordinary varnishes greatly
interferes with the tints of the light woods.
In the Imperial Palace, at Berlin, one or two of the Emperor’s private rooms
are entirely fitted up with deal fittings; doors, windows, shutters, and
everything else of fir-wood. “Common deal,” if well selected, is beautiful,
cheap, and pleasing.
We have seen the offices of Herr Krauss (architect to Prince and Princess
Louis of Hesse), who resides at Mayence, and they are fitted up, or rather the
walls and ceilings are lined, with picked pitch pine-wood, parts being carved,
and the whole French polished, and the effect is much superior to any paint,
be it “stone colour,” “salmon colour,” or even “French white.”
The reception-room, where the Emperor of Germany usually transacts
business with his ministers, and receives deputations, &c., as well as the
adjoining cabinets, are fitted with deal, not grained and painted, but well
French polished. The wood is, of course, carefully selected, carefully wrought,
and excellently French polished, which is the great secret of the business. In
France, it is a very common practice to polish and wax floors.
The late Sir Anthony Carlisle had the interior woodwork of his house, in
Langham Place, London, varnished throughout, and the effect of the
varnished deal was very like satin-wood.
About forty years since, Mr. J. G. Crace, when engaged on the decoration of
the Duke of Hamilton’s house, in the Isle of Arran, found the woodwork of red
pine so free from knots, and so well executed, that instead of painting it, he
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.
More than just a book-buying platform, we strive to be a bridge
connecting you with timeless cultural and intellectual values. With an
elegant, user-friendly interface and a smart search system, you can
quickly find the books that best suit your interests. Additionally,
our special promotions and home delivery services help you save time
and fully enjoy the joy of reading.
Join us on a journey of knowledge exploration, passion nurturing, and
personal growth every day!
ebookbell.com
Machine Learning With Spark And Python 2nd Edition Michael Bowles
  • 9. Machine Learning with Spark™ and Python® Essential Techniques for Predictive Analytics Second Edition Michael Bowles
  • 10. Machine Learning with Spark™ and Python® : Essential Techniques for Predictive Analytics, Second Edition Published by John Wiley & Sons, Inc. 10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com Copyright © 2020 by John Wiley & Sons, Inc., Indianapolis, Indiana Published simultaneously in Canada ISBN: 978‐1‐119‐56193‐4 ISBN: 978‐1‐119‐56201‐6 (ebk) ISBN: 978‐1‐119‐56195‐8 (ebk) Manufactured in the United States of America No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clear- ance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 646‐8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/ go/permissions. Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or war- ranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read. For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002. Wiley publishes in a variety of print and electronic formats and by print‐on‐demand. Some material included with standard print versions of this book may not be included in e‐books or in print‐on‐demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may down- load this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com. Library of Congress Control Number: 2019940771 Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permis- sion. Spark is a trademark of the Apache Software Foundation, Inc. Python is a registered trademark of the Python Software Foundation. 
All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
  • 11. I dedicate this book to my expanding family of children and grandchildren, Scott, Seth, Cayley, Rees, and Lia. Being included in their lives is a constant source of joy for me. I hope it makes them smile to see their names in print. I also dedicate it to my close friend Dave, whose friendship remains steadfast in spite of my best efforts. I hope this makes him smile too. Mike Bowles, Silicon Valley 2019
  • 13. vii About the Author Dr. Michael Bowles (Mike) holds bachelor’s and master’s degrees in mechanical engineering, an ScD in instrumentation, and an MBA. He has worked in academia, technology, and business. Mike currently works with companies where artificial intelligence or machine learning are integral to success. He serves variously as part of the management team, a consultant, or advisor. He also teaches machine learning courses at UC Berkeley and Hacker Dojo, a co-working space and startup incubator in Mountain View, CA. Mike was born in Oklahoma and took his bachelor’s and master’s degrees there, then after a stint in Southeast Asia went to Cambridge for ScD and C. Stark Draper Chair at MIT after graduation. Mike left Boston to work on communications satellites at Hughes Aircraft Company in Southern California, and then after completing an MBA at UCLA moved to the San Francisco Bay Area to take roles as founder and CEO of two successful venture-backed startups. Mike remains actively involved in technical and startup-related work. Recent projects include the use of machine learning in industrial inspection and automation, financial prediction, predicting biological outcomes on the basis of molecular graph structures, and financial risk estimation. He has participated in due diligence work on companies in the artificial intelligence and machine learning arenas. Mike can be reached through mbowles.com.
  • 15. ix About the Technical Editor James York-Winegar is an Infrastructure Principal with Accenture Enkitec Group. James helps companies of all sizes from startups to enterprises with their data lifecycle by helping them bridge the gap between systems management and data science. He started his career in physics, where he did large-scale quantum chemistry simulations on supercomputers, and went into technology. He holds a master’s in Data Science from Berkeley.
  • 17. xi Acknowledgments I’d like to acknowledge the splendid support that people at Wiley have offered during the course of writing this book and making the revisions for this second edition. It began with Robert Elliot, the acquisitions editor who first contacted me about writing a book—very easy to work with. Tom Dinse has done a splendid job editing this second edition. He’s been responsive, thorough, flexible, and completely professional, as I’ve come to expect from Wiley. I thank you. I’d also like to acknowledge the enormous comfort that comes from having such a quick, capable computer scientist as James Winegar doing the technical editing on the book. James has brought a more consistent style and has made a number of improvements that will make the code that comes along with the book easier to use and understand. Thank you for that. The example problems used in the book come from the University of California at Irvine’s data repository. UCI does the machine learning community a great service by gathering these data sets, curating them, and making them freely available. The reference for this material is: Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository (http:// archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.
  • 19. xiii Contents at a Glance Introductionxxi Chapter 1 The Two Essential Algorithms for Making Predictions 1 Chapter 2 Understand the Problem by Understanding the Data 23 Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data 77 Chapter 4 Penalized Linear Regression 129 Chapter 5 Building Predictive Models Using Penalized Linear Methods 169 Chapter 6 Ensemble Methods 221 Chapter 7 Building Ensemble Models with Python 265 Index 329
  • 21. xv Contents Introductionxxi Chapter 1 The Two Essential Algorithms for Making Predictions 1 Why Are These Two Algorithms So Useful? 2 What Are Penalized Regression Methods? 7 What Are Ensemble Methods? 9 How to Decide Which Algorithm to Use 11 The Process Steps for Building a Predictive Model 13 Framing a Machine Learning Problem 15 Feature Extraction and Feature Engineering 17 Determining Performance of a Trained Model 18 Chapter Contents and Dependencies 18 Summary20 Chapter 2 Understand the Problem by Understanding the Data 23 The Anatomy of a New Problem 24 Different Types of Attributes and Labels Drive Modeling Choices 26 Things to Notice about Your New Data Set 27 Classification Problems: Detecting Unexploded Mines Using Sonar 28 Physical Characteristics of the Rocks Versus Mines Data Set 29 Statistical Summaries of the Rocks Versus Mines Data Set 32 Visualization of Outliers Using a Quantile-Quantile Plot 34 Statistical Characterization of Categorical Attributes 35 How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set 36 Visualizing Properties of the Rocks Versus Mines Data Set 39 Visualizing with Parallel Coordinates Plots 39 Visualizing Interrelationships between Attributes and Labels 41
  • 22. xvi Contents Visualizing Attribute and Label Correlations Using a Heat Map 48 Summarizing the Process for Understanding the Rocks Versus Mines Data Set 50 Real-Valued Predictions with Factor Variables: How Old Is Your Abalone? 50 Parallel Coordinates for Regression Problems—Visualize Variable Relationships for the Abalone Problem 55 How to Use a Correlation Heat Map for Regression— Visualize Pair-Wise Correlations for the Abalone Problem 59 Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes 61 Multiclass Classification Problem: What Type of Glass Is That? 67 Using PySpark to Understand Large Data Sets 72 Summary75 Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data 77 The Basic Problem: Understanding Function Approximation 78 Working with Training Data 79 Assessing Performance of Predictive Models 81 Factors Driving Algorithm Choices and Performance—Complexity and Data 82 Contrast between a Simple Problem and a Complex Problem 82 Contrast between a Simple Model and a Complex Model 85 Factors Driving Predictive Algorithm Performance 89 Choosing an Algorithm: Linear or Nonlinear? 90 Measuring the Performance of Predictive Models 91 Performance Measures for Different Types of Problems 91 Simulating Performance of Deployed Models 105 Achieving Harmony between Model and Data 107 Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size 107 Using Forward Stepwise Regression to Control Overfitting 109 Evaluating and Understanding Your Predictive Model 114 Control Overfitting by Penalizing Regression Coefficients—Ridge Regression 116 Using PySpark for Training Penalized Regression Models on Extremely Large Data Sets 124 Summary127 Chapter 4 Penalized Linear Regression 129 Why Penalized Linear Regression Methods Are So Useful 130 Extremely Fast Coefficient Estimation 130 Variable Importance Information 131 Extremely Fast Evaluation When Deployed 131 Reliable Performance 131 Sparse Solutions 132
  • 23. Contents xvii Problem May Require Linear Model 132 When to Use Ensemble Methods 132 Penalized Linear Regression: Regulating Linear Regression for Optimum Performance 132 Training Linear Models: Minimizing Errors and More 135 Adding a Coefficient Penalty to the OLS Formulation 136 Other Useful Coefficient Penalties—Manhattan and ElasticNet137 Why Lasso Penalty Leads to Sparse Coefficient Vectors 138 ElasticNet Penalty Includes Both Lasso and Ridge 140 Solving the Penalized Linear Regression Problem 141 Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression 141 How LARS Generates Hundreds of Models of Varying Complexity145 Choosing the Best Model from the Hundreds LARS Generates 147 Using Glmnet: Very Fast and Very General 152 Comparison of the Mechanics of Glmnet and LARS Algorithms 153 Initializing and Iterating the Glmnet Algorithm 153 Extension of Linear Regression to Classification Problems 157 Solving Classification Problems with Penalized Regression 157 Working with Classification Problems Having More Than Two Outcomes 161 Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems 161 Incorporating Non-Numeric Attributes into Linear Methods 163 Summary 166 Chapter 5 Building Predictive Models Using Penalized Linear Methods 169 Python Packages for Penalized Linear Regression 170 Multivariable Regression: Predicting Wine Taste 171 Building and Testing a Model to Predict Wine Taste 172 Training on the Whole Data Set before Deployment 175 Basis Expansion: Improving Performance by Creating New Variables from Old Ones 179 Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines 182 Build a Rocks Versus Mines Classifier for Deployment 191 Multiclass Classification: Classifying Crime Scene Glass Samples 200 Linear Regression and Classification Using PySpark 203 Using PySpark to Predict Wine Taste 204 Logistic Regression with PySpark: Rocks Versus Mines 208 Incorporating Categorical Variables in a PySpark Model: Predicting Abalone Rings 213
  • 24. xviii Contents Multiclass Logistic Regression with Meta Parameter Optimization 217 Summary219 Chapter 6 Ensemble Methods 221 Binary Decision Trees 222 How a Binary Decision Tree Generates Predictions 224 How to Train a Binary Decision Tree 225 Tree Training Equals Split Point Selection 227 How Split Point Selection Affects Predictions 228 Algorithm for Selecting Split Points 229 Multivariable Tree Training—Which Attribute to Split? 229 Recursive Splitting for More Tree Depth 230 Overfitting Binary Trees 231 Measuring Overfit with Binary Trees 231 Balancing Binary Tree Complexity for Best Performance 232 Modifications for Classification and Categorical Features 235 Bootstrap Aggregation: “Bagging” 235 How Does the Bagging Algorithm Work? 236 Bagging Performance—Bias Versus Variance 239 How Bagging Behaves on Multivariable Problem 241 Bagging Needs Tree Depth for Performance 245 Summary of Bagging 246 Gradient Boosting 246 Basic Principle of Gradient Boosting Algorithm 246 Parameter Settings for Gradient Boosting 249 How Gradient Boosting Iterates toward a Predictive Model 249 Getting the Best Performance from Gradient Boosting 250 Gradient Boosting on a Multivariable Problem 253 Summary for Gradient Boosting 256 Random Forests 256 Random Forests: Bagging Plus Random Attribute Subsets 259 Random Forests Performance Drivers 260 Random Forests Summary 261 Summary262 Chapter 7 Building Ensemble Models with Python 265 Solving Regression Problems with Python Ensemble Packages265 Using Gradient Boosting to Predict Wine Taste 266 Using the Class Constructor for GradientBoostingRegressor266 Using GradientBoostingRegressor to Implement a Regression Model 268 Assessing the Performance of a Gradient Boosting Model 271 Building a Random Forest Model to Predict Wine Taste 272 Constructing a RandomForestRegressor Object 273
  • 25. Contents xix Modeling Wine Taste with RandomForestRegressor 275 Visualizing the Performance of a Random Forest Regression Model 279 Incorporating Non-Numeric Attributes in Python Ensemble Models 279 Coding the Sex of Abalone for Gradient Boosting Regression in Python 280 Assessing Performance and the Importance of Coded Variables with Gradient Boosting 282 Coding the Sex of Abalone for Input to Random Forest Regression in Python 284 Assessing Performance and the Importance of Coded Variables 287 Solving Binary Classification Problems with Python Ensemble Methods 288 Detecting Unexploded Mines with Python Gradient Boosting 288 Determining the Performance of a Gradient Boosting Classifier 291 Detecting Unexploded Mines with Python Random Forest 292 Constructing a Random Forest Model to Detect Unexploded Mines 294 Determining the Performance of a Random Forest Classifier 298 Solving Multiclass Classification Problems with Python Ensemble Methods 300 Dealing with Class Imbalances 301 Classifying Glass Using Gradient Boosting 301 Determining the Performance of the Gradient Boosting Model on Glass Classification 306 Classifying Glass with Random Forests 307 Determining the Performance of the Random Forest Model on Glass Classification 310 Solving Regression Problems with PySpark Ensemble Packages311 Predicting Wine Taste with PySpark Ensemble Methods 312 Predicting Abalone Age with PySpark Ensemble Methods 317 Distinguishing Mines from Rocks with PySpark Ensemble Methods 321 Identifying Glass Types with PySpark Ensemble Methods 325 Summary327 Index329
  • 27. xxi Introduction Extracting actionable information from data is changing the fabric of modern business in ways that directly affect programmers. One way is the demand for new programming skills. Market analysts predict demand for people with advanced statistics and machine learning skills will exceed supply by 140,000 to 190,000 by 2018. That means good salaries and a wide choice of interesting projects for those who have the requisite skills. Another development that affects programmers is progress in developing core tools for statistics and machine learning. This relieves programmers of the need to program intricate algorithms for themselves each time they want to try a new one. Among general-purpose programming languages, Python developers have been in the forefront, building state-of-the-art machine learning tools, but there is a gap between having the tools and being able to use them efficiently. Programmers can gain general knowledge about machine learning in a number of ways: online courses, a number of well-written books, and so on. Many of these give excellent surveys of machine learning algorithms and examples of their use, but because of the availability of so many different algorithms, it’s difficult to cover the details of their usage in a survey. This leaves a gap for the practitioner. The number of algorithms available requires making choices that a programmer new to machine learning might not be equipped to make until trying several, and it leaves the programmer to fill in the details of the usage of these algorithms in the context of overall problem formulation and solution. This book attempts to close that gap. The approach taken is to restrict the algo- rithms covered to two families of algorithms that have proven to give optimum performance for a wide variety of problems. This assertion is supported by their dominant usage in machine learning competitions, their early inclusion in
  • 28. xxii Introduction newly developed packages of machine learning tools, and their performance in comparative studies (as discussed in Chapter 1, “The Two Essential Algorithms for Making Predictions”). Restricting attention to two algorithm families makes it possible to provide good coverage of the principles of operation and to run through the details of a number of examples showing how these algorithms apply to problems with different structures. The book largely relies on code examples to illustrate the principles of oper- ation for the algorithms discussed. I’ve discovered in the classes I have taught at University of California, Berkeley, Galvanize, University of New Haven, and Hacker Dojo, that programmers generally grasp principles more readily by seeing simple code illustrations than by looking at math. This book focuses on Python because it offers a good blend of functionality and specialized packages containing machine learning algorithms. Python is an often-used language that is well known for producing compact, readable code. That fact has led a number of leading companies to adopt Python for prototyp- ing and deployment. Python developers are supported by a large community of fellow developers, development tools, extensions, and so forth. Python is widely used in industrial applications and in scientific programming, as well. It has a number of packages that support computationally intensive applica- tions like machine learning, and it is a good collection of the leading machine learning algorithms (so you don’t have to code them yourself). Python is a better general-purpose programming language than specialized statistical languages such as R or SAS (Statistical Analysis System). Its collection of machine learning algorithms incorporates a number of top-flight algorithms and continues to expand. Who This Book Is For This book is intended for Python programmers who want to add machine learning to their repertoire, either for a specific project or as part of keeping their toolkit relevant. Perhaps a new problem has come up at work that requires machine learning. With machine learning being covered so much in the news these days, it’s a useful skill to claim on a resume. This book provides the following for Python programmers: ■ ■ A description of the basic problems that machine learning attacks ■ ■ Several state-of-the-art algorithms ■ ■ The principles of operation for these algorithms ■ ■ Process steps for specifying, designing, and qualifying a machine learning system
  • 29. Introduction xxiii ■ ■ Examples of the processes and algorithms ■ ■ Hackable code To get through this book easily, your primary background requirements include an understanding of programming or computer science and the ability to read and write code. The code examples, libraries, and packages are all Python, so the book will prove most useful to Python programmers. In some cases, the book runs through code for the core of an algorithm to demonstrate the operating principles, but then uses a Python package incorporating the algorithm to apply the algorithm to problems. Seeing code often gives programmers an intuitive grasp of an algorithm in the way that seeing the math does for others. Once the understanding is in place, examples will use developed Python packages with the bells and whistles that are important for efficient use (error checking, handling input and output, developed data structures for the models, defined predictor methods incorporating the trained model, and so on). In addition to having a programming background, some knowledge of math and statistics will help get you through the material easily. Math requirements include some undergraduate-level differential calculus (knowing how to take a derivative and a little bit of linear algebra), matrix notation, matrix multiplication, and matrix inverse. The main use of these will be to follow the derivations of some of the algorithms covered. Many times, that will be as simple as taking a derivative of a simple function or doing some basic matrix manipulations. Being able to follow the calculations at a conceptual level may aid your understanding of the algorithm. Understanding the steps in the derivation can help you to under- stand the strengths and weaknesses of an algorithm and can help you to decide which algorithm is likely to be the best choice for a particular problem. This book also uses some general probability and statistics. The requirements for these include some familiarity with undergraduate-level probability and con- cepts such as the mean value of a list of real numbers, variance, and correlation. You can always look through the code if some of the concepts are rusty for you. This book covers two broad classes of machine learning algorithms: penal- ized linear regression (for example, Ridge and Lasso) and ensemble methods (for example, Random Forest and Gradient Boosting). Each of these families contains variants that will solve regression and classification problems. (You learn the distinction between classification and regression early in the book.) Readers who are already familiar with machine learning and are only inter- ested in picking up one or the other of these can skip to the two chapters cov- ering that family. Each method gets two chapters—one covering principles of operation and the other running through usage on different types of problems. Penalized linear regression is covered in Chapter 4, “Penalized Linear Regres- sion,” and Chapter 5, “Building Predictive Models Using Penalized Linear
  • 30. xxiv Introduction Methods.” Ensemble methods are covered in Chapter 6, “Ensemble Methods,” and Chapter 7, “Building Ensemble Models with Python.” To familiarize yourself with the problems addressed in the chapters on usage of the algorithms, you might find it helpful to skim Chapter 2, “Understand the Problem by Under- standing the Data,” which deals with data exploration. Readers who are just starting out with machine learning and want to go through from start to finish might want to save Chapter 2 until they start looking at the solutions to prob- lems in later chapters. What This Book Covers As mentioned earlier, this book covers two algorithm families that are relatively recent developments and that are still being actively researched. They both depend on, and have somewhat eclipsed, earlier technologies. Penalized linear regression represents a relatively recent development in ongoing research to improve on ordinary least squares regression. Penalized linear regression has several features that make it a top choice for ­ predictive analytics. Penalized linear regression introduces a tunable parameter that makes it possible to balance the resulting model between overfitting and underfitting. It also yields information on the relative importance of the various inputs to the predictions it makes. Both of these features are vitally important to the proc­ ess of developing predictive models. In addition, ­ penalized linear regression yields the best ­ prediction performance in some classes of ­ problems, particularly underdetermined problems and problems with very many input parameters such as genetics and text mining. Furthermore, there’s been a great deal of recent development of coordinate descent methods, making training penalized linear regression models extremely fast. To help you understand penalized linear regression, this book recapitulates ordinary linear regression and other extensions to it, such as stepwise regres- sion. The hope is that these will help cultivate intuition. Ensemble methods are one of the most powerful predictive analytics tools available. They can model extremely complicated behavior, especially for prob- lems that are vastly overdetermined, as is often the case for many web-based prediction problems (such as returning search results or predicting ad click- through rates). Many seasoned data scientists use ensemble methods as their first try because of their performance. They are relatively simple to use, and they also rank variables in terms of predictive performance. Ensemble methods have followed a development path parallel to penalized linear regression. Whereas penalized linear regression evolved from over- coming the limitations of ordinary regression, ensemble methods evolved to overcome the limitations of binary decision trees. Correspondingly, this book’s
  • 31. Introduction xxv coverage of ensemble methods covers some background on binary decision trees because ensemble methods inherit some of their properties from binary decision trees. Understanding them helps cultivate intuition about ensemble methods. What Has Changed Since the First Edition In the three years since the first edition was published, Python has more firmly established itself as the primary language for data science. Developers of plat- forms like Spark for big data or TensorFlow and Torch for deep learning have adopted Python interfaces to reach the widest set of data scientists. The two classes of algorithms emphasized in the first edition continue to be heavy favor- ites and are now available as part of PySpark. The beauty of this marriage is that the code required to build machine learning models on truly gargantuan data sets is no more complicated than what’s required on smaller data sets. PySpark illustrates several important developments, making it cleaner and easier to invoke very powerful machine learning tools through relatively simple easy to read and write Python code. When the first edition of this book was written, building machine learning models on very large data sets required spinning up hundreds of processors, which required vast knowledge of data center processes and programming. It was cumbersome and frankly not very effective. Spark architecture was developed to correct this difficulty. Spark made it possible to easily rent and employ large numbers of processors for machine learning. PySpark added a Python interface. The result is that the code to run a machine learning algorithm in PySpark is not much more compli- cated than to run the plain Python versions of programs. The algorithms that were the focus of the first edition continue to be heavily used favorites and are available in Spark. So it seemed natural to add PySpark examples alongside the Python examples in order to familiarize readers with PySpark. In this edition all the code examples are in Python 3, since Python 2 is due to fall out of support and, in addition to providing the code in text form, the code is also available in Jupyter notebooks for each chapter. The notebook code when executed will draw graphs and tables you see in the figures. How This Book Is Structured This book follows the basic order in which you would approach a new prediction problem. The beginning involves developing an understanding of the data and determining how to formulate the problem, and then proceeds to try an algorithm and measure the performance. In the midst of this sequence, the book outlines
  • 32. xxvi Introduction the methods and reasons for the steps as they come up. Chapter 1 gives a more thorough description of the types of problems that this book covers and the methods that are used. The book uses several data sets from the UC Irvine data repository as examples, and Chapter 2 exhibits some of the methods and tools that you can use for developing insight into a new data set. Chapter 3, “Predic- tive Model Building: Balancing Performance, Complexity, and Big Data,” talks about the difficulties of predictive analytics and techniques for addressing them. It outlines the relationships between problem complexity, model complexity, data set size, and predictive performance. It discusses overfitting and how to reliably sense overfitting. It talks about performance metrics for different types of problems. Chapters 4 and 5, respectively, cover the background on penalized linear regression and its application to problems explored in Chapter 2. Chapters 6 and 7 cover background and application for ensemble methods. What You Need to Use This Book To run the code examples in the book, you need to have Python 3.x, SciPy, numpy, pandas, and scikit-learn and PySpark. These can be difficult to install due to cross-dependencies and version issues. To make the installation easy, I’ve used a free distribution of these packages that’s available from Continuum Analytics (http://continuum.io/). Its Anaconda product is a free download and includes Python 3.x and all the packages you need to run the code in this book (and more). I’ve run the examples on Ubuntu 14.04 Linux but haven’t tried them on other operating systems. PySpark will need a Linux environment. If you’re not running on Linux, then probably the easiest way to run the examples will be to use a virtual machine. Virtual Box is a free open source virtual machine—follow the directions to download Virtual Box and then install Ubuntu 18.05 and use Anaconda to install Python, PySpark, etc. You’ll only need to employ a VM to run the PySpark exam- ples. The non-Spark code will run anywhere you can open a Jupyter notebook. Reader Support for This Book Source code available in the book’s repository can help you speed your learning. The chapters include installation instructions so that you can get coding along with reading the book. Source Code As you work through the examples in this book, you may choose either to type in all the code manually or to use the source code files that accompany the book.
All the source code used in this book is available for download from http://www.wiley.com/go/pythonmachinelearning2e. You will find that the code snippets drawn from the source code are accompanied by a download icon and a note indicating the name of the program, so that you know it's available for download and can easily locate it in the download file.

Besides providing the code in text form, it is also included in a Jupyter notebook for each chapter. If you know how to run a Jupyter notebook, you can run the code cell by cell. The output will appear in the notebook, the figures will get drawn, and printed output will appear below the code block.

After you download the code, just decompress it with your favorite compression tool.

How to Contact the Publisher

If you believe you've found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur. In order to submit your possible errata, please email it to our Customer Service Team at [email protected] with the subject line "Possible Book Errata Submission".
CHAPTER 1

The Two Essential Algorithms for Making Predictions

This book focuses on the machine learning process and so covers just a few of the most effective and widely used algorithms. It does not provide a survey of machine learning techniques. Too many of the algorithms that might be included in a survey are not actively used by practitioners.

This book deals with one class of machine learning problems, generally referred to as function approximation. Function approximation is a subset of problems that are called supervised learning problems. Linear regression and its classifier cousin, logistic regression, provide familiar examples of algorithms for function approximation problems. Function approximation problems include an enormous breadth of practical classification and regression problems in all sorts of arenas, including text classification, search responses, ad placements, spam filtering, predicting customer behavior, diagnostics, and so forth. The list is almost endless.

Broadly speaking, this book covers two classes of algorithms for solving function approximation problems: penalized linear regression methods and ensemble methods. This chapter introduces you to both of these algorithms, outlines some of their characteristics, and reviews the results of comparative studies of algorithm performance in order to demonstrate their consistent high performance.

This chapter then discusses the process of building predictive models. It describes the kinds of problems that you'll be able to address with the tools covered here and the flexibilities that you have in how you set up your problem
and define the features that you'll use for making predictions. It describes process steps involved in building a predictive model and qualifying it for deployment.

Why Are These Two Algorithms So Useful?

Several factors make the penalized linear regression and ensemble methods a useful collection. Stated simply, they will provide optimum or near-optimum performance on the vast majority of predictive analytics (function approximation) problems encountered in practice, including big data sets, little data sets, wide data sets, tall skinny data sets, complicated problems, and simple problems. Evidence for this assertion can be found in two papers by Rich Caruana and his colleagues:

■ "An Empirical Comparison of Supervised Learning Algorithms," by Rich Caruana and Alexandru Niculescu-Mizil¹
■ "An Empirical Evaluation of Supervised Learning in High Dimensions," by Rich Caruana, Nikos Karampatziakis, and Ainur Yessenalina²

In those two papers, the authors chose a variety of classification problems and applied a variety of different algorithms to build predictive models. The models were run on test data that were not included in training the models, and then the algorithms included in the studies were ranked on the basis of their performance on the problems. The first study compared 9 different basic algorithms on 11 different machine learning (binary classification) problems. The problems used in the study came from a wide variety of areas, including demographic data, text processing, pattern recognition, physics, and biology. Table 1.1 lists the data sets used in the study using the same names given by the study authors. The table shows how many attributes were available for predicting outcomes for each of the data sets, and it shows what percentage of the examples were positive.

Table 1.1: Sketch of Problems in Machine Learning Comparison Study

Data Set Name   Number of Attributes   % of Examples That Are Positive
Adult           14                     25
Bact            11                     69
Cod             15                     50
Calhous          9                     52
Cov_Type        54                     36
HS             200                     24
Letter.p1       16                      3
Letter.p2       16                     53
Medis           63                     11
Mg             124                     17
Slac            59                     50

The term positive example in a classification problem means an experiment (a line of data from the input data set) in which the outcome is positive. For example, if the classifier is being designed to determine whether a radar return signal indicates the presence of an airplane, then the positive examples would be those returns where there was actually an airplane in the radar's field of view. The term positive comes from this sort of example where the two outcomes represent presence or absence. Other examples include presence or absence of disease in a medical test or presence or absence of cheating on a tax return.

Not all classification problems deal with presence or absence. For example, determining the gender of an author by machine-reading his or her text or machine-analyzing a handwriting sample has two classes—male and female—but there's no sense in which one is the absence of the other. In these cases, there's some arbitrariness in the assignment of the designations "positive" and "negative." The assignments of positive and negative can be arbitrary, but once chosen must be used consistently.

Some of the problems in the first study had many more examples of one class than the other. These are called unbalanced. For example, the two data sets Letter.p1 and Letter.p2 pose closely related problems in correctly classifying typed uppercase letters in a wide variety of fonts. The task with Letter.p1 is to correctly classify the letter O in a standard mix of letters. The task with Letter.p2 is to correctly classify A–M versus N–Z. The percentage of positives shown in Table 1.1 reflects this difference.

Table 1.1 also shows the number of "attributes" in each of the data sets. Attributes are the variables you have available to base a prediction on. For example, to predict whether or not an airplane will arrive at its destination on time, you might incorporate attributes such as the name of the airline company, the make and year of the airplane, the level of precipitation at the destination airport, the wind speed and direction along the flight path, and so on. Having a lot of attributes upon which to base a prediction can be a blessing and a curse. Attributes that relate directly to the outcomes being predicted are a blessing. Attributes that are unrelated to the outcomes are a curse. Telling the difference between blessed and cursed attributes requires data. Chapter 3, "Predictive Model Building: Balancing Performance, Complexity, and Big Data," goes into that in more detail.
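To make "attributes" and "percentage of positive examples" concrete, the short sketch below computes both for a binary classification data set. It uses scikit-learn's built-in breast cancer data purely as a stand-in; it is not one of the data sets from the studies.

import numpy as np
from sklearn.datasets import load_breast_cancer

# Load a small binary classification data set to stand in for
# the kinds of problems summarized in Table 1.1.
data = load_breast_cancer()
X, y = data.data, data.target

n_examples, n_attributes = X.shape
pct_positive = 100.0 * np.mean(y == 1)

print("Number of examples:  ", n_examples)
print("Number of attributes:", n_attributes)
print("% positive examples: ", round(pct_positive))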
  • 38. 4 Chapter 1 ■ The Two Essential Algorithms for Making Predictions Table 1.2 shows how the algorithms covered in this book fared relative to the other algorithms used in the study. Table 1.2 shows which algorithms showed the top five performance scores for each of the problems listed in Table 1.1. Algo- rithms covered in this book are spelled out (boosted decision trees, Random Forests, bagged decision trees, and logistic regression). The first three of these are ensemble methods. Penalized regression was not fully developed when the study was done and wasn’t evaluated. Logistic regression is a close relative and is used to gauge the success of regression methods. Each of the 9 algorithms used in the study had 3 different data reduction techniques applied, for a total of 27 combinations. The top five positions represent roughly the top 20 percent of performance scores. The row next to the heading Covt indicates that the boosted decision trees algorithm was the first and second best relative to performance, the Random Forests algorithm was the fourth and fifth best, and the bagged decision trees algorithm was the third best. In the cases where algorithms not covered here were in the top five, an entry appears in the Other column. The algorithms that show up there are k-nearest neighbors (KNNs), artificial neural nets (ANNs), and support vector machines (SVMs). Logistic regression captures top-five honors in only one case in Table 1.2. The reason for that is that these data sets have few attributes (at most 200) relative to examples (5,000 in each data set). There’s plenty of data to resolve a model with Table 1.2: How the Algorithms Covered in This Book Compare on Different Problems ALGORITHM BOOSTED DECISION TREES RANDOM FORESTS BAGGED DECISION TREES LOGISTIC REGRESSION OTHER Covt 1, 2 4, 5 3 Adult 1, 4 2 3, 5 LTR.P1 1 SVM, KNN LTR.P2 1, 2 4, 5 SVM MEDIS 1, 3 5 ANN SLAC 1, 2, 3 4, 5 HS 1, 3 ANN MG 2, 4, 5 1, 3 CALHOUS 1, 2 5 3, 4 COD 1, 2 3, 4, 5 BACT 2, 5 1, 3, 4
  • 39. Chapter 1 ■ The Two Essential Algorithms for Making Predictions 5 so few attributes, and yet the training sets are small enough that the training time is not excessive. As you’ll see in Chapter 3 and in the examples covered in Chapter 5, “Building Predictive Models Using Penalized Linear Methods,” and Chapter 7, “Build­ ing En­ semble Models with Python,” the penalized regression methods perform best relative to other algorithms when there are numerous attributes and not enough examples or time to train a more complicated ensemble model. Caruana et al. have run a newer study (2008) to address how these algorithms compare when the number of attributes increases. That is, how do these algo- rithms compare on big data? A number of fields have significantly more attrib- utes than the data sets in the first study. For example, genomic problems have several tens of thousands of attributes (one attribute per gene), and text mining problems can have millions of attributes (one attribute per distinct word or per distinct pair of words). Table 1.3 shows how linear regression and ensemble methods fare as the number of attributes grows. The results in Table 1.3 show the ranking of the algorithms used in the second study. The table shows the performance on each of the problems individually and in the far right column shows the ranking of each algorithm’s average score across all the problems. The algorithms used in the study are broken into two groups. The top group of algorithms are ones that will be covered in this book. The bottom group will not be covered. The problems shown in Table 1.3 are arranged in order of their number of attributes, ranging from 761 to 685,569. Linear (logistic) regression is in the top three for 5 of the 11 test cases used in the study. Those superior scores were concentrated among the larger data sets. Notice that boosted decision tree (denoted by BSTDT in Table 1.3) and Random Forests (denoted by RF in Table 1.3) algorithms still perform near the top. They come in first and second for overall score on these problems. The algorithms covered in this book have other advantages besides raw pre- dictive performance. An important benefit of the penalized linear regression models that the book covers is the speed at which they train. On big problems, training speed can become an issue. In some problems, model training can take days or weeks. This time frame can be an intolerable delay, particularly early in development when iterations are required to home in on the best approach. Besides training very quickly, after being deployed a trained linear model can produce predictions very quickly—quickly enough for high-speed trading or Internet ad insertions. The study demonstrates that penalized linear regression can provide the best answers available in many cases and be near the top even in cases where they are not the best.
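To see the tradeoff the study points to, the hedged sketch below (not from the book's code download) builds a synthetic "wide" classification problem with far more attributes than examples and compares a penalized logistic regression against gradient boosting on held-out data. The data generator, the penalty strength C=0.1, and the other settings are arbitrary choices for illustration, and both the scores and the timings will vary from run to run.

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Wide synthetic problem: far more attributes (2,000) than examples (300).
X, y = make_classification(n_samples=300, n_features=2000,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

for name, model in [("penalized logistic regression",
                     LogisticRegression(C=0.1, max_iter=5000)),
                    ("gradient boosting",
                     GradientBoostingClassifier(random_state=0))]:
    start = time.time()
    model.fit(X_train, y_train)                      # train on the training split
    elapsed = time.time() - start
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name, "- AUC:", round(auc, 3), "- training time (s):", round(elapsed, 2))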
  • 41. Chapter 1 ■ The Two Essential Algorithms for Making Predictions 7 In addition, these algorithms are reasonably easy to use. They do not have very many tunable parameters. They have well-defined and well-structured input types. They solve several types of problems in regression and classification. It is not unusual to be able to arrange the input data and generate a first trained model and performance predictions within an hour or two of starting a new problem. One of their most important features is that they indicate which of their input variables is most important for producing predictions. This turns out to be an invaluable feature in a machine learning algorithm. One of the most time- consuming steps in the development of a predictive model is what is sometimes called feature selection or feature engineering. This is the process whereby the data scientist chooses the variables that will be used to predict outcomes. By rank- ing features according to importance, the algorithms covered in this book aid in the feature-engineering process by taking some of the guesswork out of the development process and making the process more sure. What Are Penalized Regression Methods? Penalized linear regression is a derivative of ordinary least squares (OLS) regres- sion—a method developed by Gauss and Legendre roughly 200 years ago. Penalized linear regression methods were designed to overcome some basic limitations of OLS regression. The basic problem with OLS is that sometimes it overfits the problem. Think of OLS as fitting a line through a group of points, as in Figure 1.1. This is a simple prediction problem: predicting y, the target value given a single attribute x. For example, the problem might be to predict men’s salaries using only their heights. Height is slightly predictive of salaries for men (but not for women). x – attribute value y – target value Figure 1.1: Ordinary least squares fit
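The kind of fit shown in Figure 1.1 takes only a couple of lines of NumPy. The sketch below is a minimal illustration with made-up height and salary values; the numbers are invented and are not data from the book.

import numpy as np

# Six made-up (x, y) points standing in for height versus salary.
x = np.array([62.0, 64.0, 66.0, 68.0, 70.0, 72.0])
y = np.array([48.0, 55.0, 53.0, 61.0, 64.0, 70.0])

# np.polyfit with degree 1 returns the ordinary least squares slope and intercept.
slope, intercept = np.polyfit(x, y, 1)
print("OLS fit: y =", round(slope, 2), "* x +", round(intercept, 2))

# Predictions from the fitted line at each attribute value.
print(np.round(slope * x + intercept, 1))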
  • 42. 8 Chapter 1 ■ The Two Essential Algorithms for Making Predictions The points represent men’s salaries versus their heights. The line in Figure 1.1 represents the OLS solution to this prediction problem. In some sense, the line is the best predictive model for men’s salaries given their heights. The data set has six points in it. Suppose that the data set had only two points in it. Imagine that there’s a population of points, like the ones in Figure 1.1, but that you do not get to see all the points. Maybe they are too expensive to generate, like the genetic data mentioned earlier. There are enough humans available to isolate the gene that is the culprit; the problem is that you do not have gene sequences for many of them because of cost. To simulate this in the simple example, imagine that instead of six points you’re given only two of the six points. How would that change the nature of the line fit to those points? It would depend on which two points you happened to get. To see how much effect that would have, pick any two points from Figure 1.1 and imagine a line through them. Figure 1.2 shows some of the possible lines through pairs of points from Figure 1.1. Notice how much the lines vary depend- ing on the choice of points. The problem with having only two points to fit a line is that there is not enough data for the number of degrees of freedom. A line has two degrees of freedom. Having two degrees of freedom means that there are two independent param- eters that uniquely determine a line. You can imagine grabbing hold of a line in the plane and sliding it up and down in the plane or twisting it to change its slope. So, vertical position and slope are independent. They can be changed separately, and together they completely specify a line. The degrees of freedom of a line can be expressed in several equivalent ways (where it intercepts the y-axis and its slope, two points that are on the line, and so on). All of these rep- resentations of a line require two parameters to specify. When the number of degrees of freedom is equal to the number of points, the predictions are not very good. The lines hit the points used to draw them, but there is a lot of variation among lines drawn with different pairs of points. You x – attribute value y – target value Figure 1.2: Fitting lines with only two points
  • 43. Chapter 1 ■ The Two Essential Algorithms for Making Predictions 9 cannot place much faith in a prediction that has as many degrees of freedom as the number of points in your data set. The plot in Figure 1.1 had six points and fit a line (two degrees of freedom) through them. That is six points and two degrees of freedom. The thought problem of determining the genes causing a heritable condition illustrated that having more genes to choose from makes it necessary to have more data in order to isolate a cause from among the 20,000 or so possible human genes. The 20,000 different genes represent 20,000 degrees of freedom. Data from even 20,000 different persons will not suffice to get a reliable answer, and in many cases, all that can be afforded within the scope of a reasonable study is a sample from 500 or so persons. That is where penalized linear regression may be the best algorithm choice. Penalized linear regression provides a way to systematically reduce degrees of freedom to match the amount of data available and the complexity of the under- lying phenomena. These methods have become very popular for problems with very many degrees of freedom. They are a favorite for genetic problems where the number of degrees of freedom (that is, the number of genes) can be several tens of thousands and for problems like text classification where the number of degrees of freedom can be more than a million. Chapter 4, “Penalized Linear Regression,” gives more detail on how these methods work, sample code that illustrates the mechanics of these algorithms, and examples of the process for implementing machine learning systems using available Python packages. What Are Ensemble Methods? The other family of algorithms covered in this book is ensemble methods. The basic idea with ensemble methods is to build a horde of different predictive models and then combine their outputs—by averaging the outputs or taking the majority answer (voting). The individual models are called base learners. Some results from computational learning theory show that if the base learners are just slightly better than random guessing, the performance of the ensemble can be very good if there is a sufficient number of independent models. One of the problems spurring the development of ensemble methods has been the observation that some particular machine learning algorithms exhibit instability. For example, the addition of fresh data to the data set might result in a radical change in the resulting model or its performance. Binary decision trees and traditional neural nets exhibit this sort of instability. This instability causes high variance in the performance of models, and averaging many models can be viewed as a way to reduce the variance. The trick is how to generate large numbers of independent models, particularly if they are all using the same base learner. Chapter 6, “Ensemble Methods,” will get into the details of how this is done. The techniques are ingenious, and it is relatively easy to understand their basic principles of operation. Here is a preview of what’s in store.
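As a preview of the averaging idea, the following sketch trains many shallow trees on bootstrap samples of a made-up data set and averages their predictions. The synthetic data, the tree depth, and the number of trees are all arbitrary choices for illustration, not settings from the book.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# Simple synthetic regression problem: y is a noisy function of x.
x = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=x.shape[0])

# Train a horde of shallow trees, each on a bootstrap sample of the data.
n_trees = 100
predictions = np.zeros((n_trees, x.shape[0]))
for i in range(n_trees):
    rows = rng.randint(0, x.shape[0], x.shape[0])   # sample rows with replacement
    tree = DecisionTreeRegressor(max_depth=3, random_state=i)
    tree.fit(x[rows], y[rows])
    predictions[i] = tree.predict(x)

# The ensemble prediction is the average over the base learners.
ensemble_prediction = predictions.mean(axis=0)
print(np.round(ensemble_prediction[:5], 3))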
The ensemble methods that enjoy the widest availability and usage incorporate binary decision trees as their base learners. Binary decision trees are often portrayed as shown in Figure 1.3.

[Figure 1.3: Binary decision tree example. The decision nodes test x < 5? and x < 3?; the leaf outputs are y = 2, y = 1, and y = 4, with the No branch of the top node leading to y = 4.]

The tree in Figure 1.3 takes a real number, called x, as input at the top, and then uses a series of binary (two-valued) decisions to decide what value should be output in response to x. The first decision is whether x is less than 5. If the answer to that question is "no," the binary decision tree outputs the value 4 indicated in the circle below the No leg of the upper decision box. Every possible value for x leads to some output y from the tree. Figure 1.4 plots the output (y) as a function of the input to the tree (x).

[Figure 1.4: Input-output graph for the binary decision tree example; input x on the horizontal axis, output y on the vertical axis.]
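The logic of a tree like the one in Figure 1.3 is just nested comparisons. In the minimal sketch below, the only detail taken from the text is that the No branch of the top node returns 4; which of the two remaining leaves returns 1 and which returns 2 is an assumption made here for illustration.

def tree_predict(x):
    """Hand-coded binary decision tree in the spirit of Figure 1.3."""
    if x < 5:            # first decision: is x less than 5?
        if x < 3:        # second decision, on the Yes branch
            return 2     # leaf value (assumed assignment)
        return 1         # leaf value (assumed assignment)
    return 4             # the No branch of the top node outputs 4

# Every input value leads to one of the leaf outputs.
for x in [1, 2.5, 3, 4.9, 5, 7]:
    print(x, "->", tree_predict(x))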
This description raises the question of where the comparisons (for example, x < 5?) come from and where the output values (in the circles at the bottom of the tree) come from. These values come from training the binary tree on the input data. The algorithm for doing that training is not difficult to understand and is covered in Chapter 6. The important thing to note at this point is that the values in the trained binary decision tree are fixed, given the data. The process for generating the tree is deterministic. One way to get differing models is to take random samples of the training data and train on these random subsets. That technique is called Bagging (short for bootstrap aggregating). It gives a way to generate a large number of slightly different binary decision trees. Those are then averaged (or voted for a classifier) to yield a final result. Chapter 6 describes in more detail this technique and other more powerful ones.

How to Decide Which Algorithm to Use

Table 1.4 gives a sketch comparison of these two families of algorithms.

Table 1.4: High-Level Tradeoff between Penalized Linear Regression and Ensemble Algorithms

                              Training Speed   Prediction Speed   Problem Complexity   Deals with Wide Attribute
Penalized Linear Regression   +                +                  –                    +
Ensemble Methods              –                –                  +                    –

Penalized linear regression methods have the advantage that they train very quickly. Training times on large data sets can extend to hours, days, or even weeks. Training usually needs to be done several times before a deployable solution is arrived at. Long training times can stall development and deployment on large problems. The rapid training time for penalized linear methods makes them useful for the obvious reason that faster is better. Depending on the problem, these methods may suffer some performance disadvantages relative to ensemble methods. Chapter 3 gives more insight into the types of problems where penalized regression might be a better choice and those where ensemble methods might be a better choice. Penalized linear methods can sometimes be a useful first step in your development process even in the circumstance where they yield inferior performance to ensemble methods.

Early in development, a number of training iterations will be necessary for purposes of feature selection and feature engineering and for solidifying the mathematical problem statement. Deciding what you are going to use as input to your predictive model can take some time and thought. Sometimes that is
  • 46. 12 Chapter 1 ■ The Two Essential Algorithms for Making Predictions obvious, but usually it requires some iteration. Throwing in everything you can find is not usually a good solution. Trial and error is typically required to determine the best inputs for a model. For example, if you’re trying to predict whether a visitor to your website will click a link for an ad, you might try using demographic data for the visitor. Maybe that does not give you the accuracy that you need, so you try incorpo- rating data regarding the visitor’s past behavior on the site—what ad the visitor clicked during past site visits or what products the visitor has bought. Maybe adding data about the site the visitor was on before coming to your site would help. These questions lead to a series of experiments where you incorporate the new data and see whether it hurts or helps. This iteration is generally time- consuming both for the data manipulations and for training your predictive model. Penalized linear regression will generally be faster than an ensemble method, and the time difference can be a material factor in the development process. For example, if the training set is on the order of a gigabyte, training times may be on the order of 30 minutes for penalized linear regression and 5 or 6 hours for an ensemble method. If the feature engineering process requires 10 iterations to select the best feature set, the computation time alone comes to the difference between taking a day or taking a week to accomplish feature engineering. A useful process, therefore, is to train a penalized linear model in the early stages of development, feature engineering, and so on. That gives the data scientist a feel for which variables are going to be useful and important as well as a baseline performance for comparison with other algorithms later in development. Besides enjoying a training time advantage, penalized linear methods gen- erate predictions much faster than ensemble methods. Generating a predic- tion involves using the trained model. The trained model for penalized linear regression is simply a list of real numbers—one for each feature being used to make the predictions. The number of floating-point operations involved is the number of variables being used to make predictions. For highly time-sensitive predictions such as high-frequency trading or Internet ad insertions, compu- tation time makes the difference between making money and losing money. For some problems, linear methods may give equivalent or even better performance than ensemble methods. Some problems do not require com- plicated models. Chapter 3 goes into some detail about the nature of problem complexity and how the data scientist’s task is to balance problem complexity, predictive model complexity, and data set size to achieve the best deployable model. The basic idea is that on problems that are not complex and problems for which sufficient data are not available, linear methods may achieve better overall performance than more complicated ensemble methods. Genetic data provide a good illustration of this type of problem. The general perception is that there’s an enormous amount of genetic data around. Genetic data sets are indeed large when measured in bytes, but in
  • 47. Chapter 1 ■ The Two Essential Algorithms for Making Predictions 13 terms of generating accurate predictions, they aren’t very large. To understand this distinction, consider the following thought experiment. Suppose that you have two people, one with a heritable condition and the other without. If you had genetic sequences for the two people, could you determine which gene was responsible for the condition? Obviously, that’s not possible because many genes will differ between the two persons. So how many people would it take? At a minimum, it would take gene sequences for as many people as there are genes, and given any noise in the measurements, it would take even more. Humans have roughly 20,000 genes, depending on your count. And each datum costs roughly $1,000. So having just enough data to resolve the disease with perfect measurements would cost $20 million. This situation is very similar to fitting a line to two points, as discussed earlier in this chapter. Models need to have fewer degrees of freedom than the number of data points. The data set typically needs to be a multiple of the degrees of freedom in the model. Because the data set size is fixed, the degrees of freedom in the model need to be adjustable. The chapters dealing with penalized linear regression will show you how the adjustability is built in to penalized linear regression and how to use it to achieve optimum performance. NOTE The two broad categories of algorithms addressed in this book match those that Jeremy Howard and I presented at Strata Conference in 2012. Jeremy took ensemble methods, and I took penalized linear regression. We had fun arguing about the relative merits of the two groups. In reality, however, those two cover something like 80 percent of the model building that I do, and there are good reasons for that. Chapter 3 goes into more detail about why one algorithm or another is a better choice for a given problem. It has to do with the complexity of the problem and the number of degrees of freedom inherent in the algorithms. The linear models tend to train rapidly and often give equivalent performance to nonlinear ensemble methods, especially if the data available are somewhat constrained. Because they’re so rapid to train, it is often convenient to train linear models for early feature selection and to ballpark achievable performance for a specific problem. The linear models considered in this book can give information about variable importance to aid in the feature selection process. The ensemble methods often give better performance if there are adequate data and also give somewhat indirect measures of relative variable importance. The Process Steps for Building a Predictive Model Using machine learning requires several different skills. One is the required programming skill, which this book does not address. The other skills have to do with getting an appropriate model trained and deployed. These other skills are what the book does address. What do these other skills include?
  • 48. 14 Chapter 1 ■ The Two Essential Algorithms for Making Predictions Initially, problems are stated in somewhat vague language-based terms like “Show site visitors links that they’re likely to click on.” To turn this into a working system requires restating the problem in concrete mathematical terms, finding data to base the prediction on, and then training a predictive model that will predict the likelihood of site visitors clicking the links that are available for presentation. Stating the problem in mathematical terms makes assumptions about what features will be extracted from the available data sources and how they will be structured. How do you get started with a new problem? First, you look through the available data to determine which of the data might be of use in prediction. “Looking through the data” means running various statistical tests on the data to get a feel for what they reveal and how they relate to what you’re trying to predict. Intuition can guide you to some extent. You can also quantify the out- comes and test the degree to which potential prediction features correlate with these outcomes. Chapter 2, “Understand the Problem by Understanding the Data,” goes through this process for the data sets that are used to characterize and compare the algorithms outlined in the rest of the book. By some means, you develop a set of features and start training the machine learning algorithm that you have selected. That produces a trained model and estimates its performance. Next, you want to consider making changes to the features set, including adding new ones or removing some that proved unhelpful, or perhaps changing to a different type of training objective (also called a target) to see whether it improves performance. You’ll iterate various design decisions to determine whether there’s a possibility of improving performance. You may pull out the examples that show the worst performance and then attempt to determine if there’s something that unites these examples. That may lead to another feature to add to the prediction process, or it might cause you to bifur- cate the data and train different models on different populations. The goal of this book is to make these processes familiar enough to you that you can march through these development steps confidently. That requires your familiarity with the input data structures required by different algorithms as you frame the problem and begin extracting the data to be used in training and testing algorithms. The process usually includes several of the following steps: 1. Extract and assemble features to be used for prediction. 2. Develop targets for the training. 3. Train a model. 4. Assess performance on test data. NOTE The first pass can usually be improved on by trying different sets of fea- tures, different types of targets, and so on.
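These four steps map directly onto a few lines of scikit-learn. The sketch below is a minimal illustration using a built-in data set and an arbitrary model choice; it is not code from the book's download.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Extract and assemble features to be used for prediction
#    (here they come pre-assembled as a numeric matrix).
data = load_breast_cancer()
X = data.data

# 2. Develop targets for the training (0/1 labels in this data set).
y = data.target

# Hold out test data so performance can be assessed later (step 4).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 3. Train a model.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# 4. Assess performance on the held-out test data.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("Out-of-sample AUC:", round(auc, 3))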
  • 49. Chapter 1 ■ The Two Essential Algorithms for Making Predictions 15 Machine learning requires more than familiarization with a few packages. It requires understanding and having practiced the process involved in devel- oping a deployable model. This book aims to give you that understanding. It assumes basic undergraduate math and some basic ideas from probability and statistics, but the book doesn’t presuppose a background in machine learning. At the same time, it intends to arm readers with the very best algorithms for a wide class of problems, not necessarily to survey all machine learning algorithms or approaches. There are a number of algorithms that are interesting but that don’t get used often, for a variety of reasons. For example, perhaps they don’t scale well, maybe they don’t give insight about what is going on inside, maybe they’re difficult to use, and so on. It is well known, for example, that Gradient Boosting (one of the algorithms covered here) is the leading winner of online machine competitions by a wide margin. There are good reasons why some algorithms are more often used by practitioners, and this book will succeed to the extent that you understand these when you’ve finished reading. Framing a Machine Learning Problem Beginning work on a machine learning competition presents a simulation of a real machine learning problem. The competition presents a brief description (for example, announcing that an insurance company would like to better pre- dict loss rates on their automobile policies). As a competitor, your first step is to open the data set, take a look at the data available, and identify what form a prediction needs to take to be useful. The inspection of the data will give an intuitive feel for what the data represent and how they relate to the prediction job at hand. The data can give insight regarding approaches. Figure 1.5 depicts the process of starting from a general language statement of objective and moving toward an arrangement of data that will serve as input for a machine learning algorithm. Let’s get better results. ??? How? targets attributes ??? What does “better” mean? ??? Any available helpful data Figure 1.5: Framing a machine learning problem
  • 50. 16 Chapter 1 ■ The Two Essential Algorithms for Making Predictions The generalized statement caricatured as “Let’s get better results” has first to be converted into specific goals that can be measured and optimized. For a website owner, specific performance might be improved click-through rates or more sales (or more contribution margin). The next step is to assemble data that might make it possible to predict how likely a given customer is to click various links or to purchase various products offered online. Figure 1.5 depicts these data as a matrix of attributes. For the website example, they might include other pages the visitor has viewed or items the visitor has purchased in the past. In addition to attributes that will be used to make predictions, the machine learning algorithms for this type of problem need to have correct answers to use for training. These are denoted as targets in Figure 1.5. The algorithms covered in this book learn by detecting patterns in past behaviors, but it is important that they not merely memorize past behavior; after all, a customer might not repeat a purchase of something he bought yesterday. Chapter 3 discusses in detail how this process of training without memorizing works. Usually, several aspects of the problem formulation can be done in more than one way. This leads to some iteration between framing the problem, selecting and training a model, and producing performance estimates. Figure 1.6 depicts this process. The problem may come with specific quantitative training objectives, or part of the job might be extracting these data (called targets or labels). Consider, for instance, the problem of building a system to automatically trade securities. To trade automatically, a first step might be to predict changes in the price of a security. The prices are easily available, so it is conceptually simple to use historical data to build training examples for which the future price changes are known. But even that involves choices and experimentation. Future price change could be computed in several different ways. The change could be the difference between the current price and the price 10 minutes in the future. It could also be the change between the current price and the price 10 days in the future. It could also be the difference between the current price and the maximum/minimum price over the next 10 minutes. The change in price could (Re-)Frame the Problem Qualitative Problem Description Mathematical Problem Description Model Training and Performance Assessment Deployed Model Figure 1.6: Iteration from formulation to performance
  • 51. Chapter 1 ■ The Two Essential Algorithms for Making Predictions 17 be characterized by a two-state variable taking values “higher” or “lower” depending on whether the price is higher or lower 10 minutes in the future. Each of these choices will lead to a predictive model, and the predictions will be used for deciding whether to buy or sell the security. Some experimentation will be required to determine the best choice. Feature Extraction and Feature Engineering Deciding which variables to use for making predictions can also involve exper- imentation. This process is known as feature extraction and feature engineering. Feature extraction is the process of taking data from a free-form arrangement, such as words in a document or on a web page, and arranging them into rows and columns of numbers. For example, a spam-filtering problem begins with text from emails and might extract things such as the number of capital letters in the document and the number of words in all caps, the number of times the word “buy” appears in the document and other numeric features selected to highlight the differences between spam and non-spam emails. Feature engineering is the process of manipulating and combining features to arrive at more informative ones. Building a system for trading securities involves feature extraction and feature engineering. Feature extraction would be deciding what things will be used to predict prices. Past prices, prices of related securities, interest rates, and features extracted from news releases have all been incorporated into various trading systems that have been discussed publicly. In addition, securities prices have a number of engineered features with names like stochastic, MACD (moving average convergence divergence), and RSI (relative strength index) that are basically functions of past prices that their inventors believed to be useful in securities trading. After a reasonable set of features is developed, you can train a predictive model like the ones described in this book, assess its performance, and make a decision about deploying the model. Generally, you’ll want to make changes to the features used, if for no other reason than to confirm that your model’s performance is adequate. One way to determine which features to use is to try all combinations, but that can take a lot of time. Inevitably, you’ll face competing pressures to improve performance but also to get a trained model into use quickly. The algorithms discussed in this book have the beneficial property of providing metrics on the utility of each attribute in producing predictions. One training pass will generate rankings on the features to indicate their relative importance. This information helps speed the feature engineering process. NOTE Data preparation and feature engineering is estimated to take 80 to 90 percent of the time required to develop a machine learning model.
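A hedged pandas sketch of both activities follows, using a made-up price series. The 10-step-ahead target, the 20-period moving average, and the 5-period momentum feature are arbitrary choices for illustration rather than recommendations.

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# A made-up price series standing in for a security's history.
prices = pd.Series(100.0 + rng.normal(scale=0.5, size=500).cumsum())
frame = pd.DataFrame({"price": prices})

# Candidate targets: different ways of defining "future price change."
frame["change_10_steps"] = frame["price"].shift(-10) - frame["price"]
frame["up_or_down"] = (frame["change_10_steps"] > 0).astype(int)

# Engineered features: functions of past prices only.
frame["moving_avg_20"] = frame["price"].rolling(20).mean()
frame["momentum_5"] = frame["price"] - frame["price"].shift(5)

# Drop rows where the look-ahead target or look-back features are undefined.
frame = frame.dropna()
print(frame.head())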
  • 52. 18 Chapter 1 ■ The Two Essential Algorithms for Making Predictions The model training process, which begins each time a baseline set of features is attempted, also involves a process. A modern machine learning algorithm, such as the ones described in this book, trains something like 100 to 5,000 differ- ent models that have to be winnowed down to a single model for deployment. The reason for generating so many models is to provide models of all different shades of complexity. This makes it possible to choose the model that is best suited to the problem and data set. You don’t want a model that’s too simple or you give up performance, but you don’t want a model that’s too complicated or you’ll overfit the problem. Having models in all shades of complexity lets you pick one that is just right. Determining Performance of a Trained Model The fit of a model is determined by how well it performs on data that were not used to train the model. This is an important step and conceptually simple. Just set aside some data. Don’t use it in training. After the training is finished, use the data you set aside to determine the performance of your algorithm. This book discusses several systematic ways to hold out data. Different methods have different advantages, depending mostly on the size of the training data. As easy as it sounds, people continually figure out complicated ways to let the test data “leak” into the training process. At the end of the process, you’ll have an algorithm that will sift through incoming data and make accurate predictions for you. It might need monitoring as changing conditions alter the underlying statistics. Chapter Contents and Dependencies Different readers may want to take different paths through this book, depend- ing on their backgrounds and whether they have time to understand the basic principles. Figure 1.7 shows how chapters in the book depend on one another. Chapter 1 Two Essential Algorithms Chapter 2 Understand the Problem by Understanding the Data Chapter 4 Penalized Linear Regression Chapter 5 Applying Penalized Linear Regression Chapter 6 Ensemble Methods Chapter 7 Applying Ensemble Methods Chapter 3 Predictive Model Building Figure 1.7: Dependence of chapters on one another
Chapter 2 goes through the various data sets that will be used for problem examples to illustrate the use of the algorithms that will be developed and to compare algorithms to each other based on performance and other features. The starting point with a new machine learning problem is digging into the data set to understand it better and to learn its problems and idiosyncrasies. Part of the point of Chapter 2 is to demonstrate some of the tools available in Python for data exploration. You might want to go through some but not all of the examples shown in Chapter 2 to become familiar with the process and then come back to Chapter 2 when diving into the solution examples later.

Chapter 3 explains the basic tradeoffs in a machine learning problem and introduces several key concepts that are used throughout the book. One key concept is the mathematical description of predictive problems. The basic distinctions between classification and regression problems are shown. Chapter 3 also introduces the concept of using out-of-sample data for determining the performance of a predictive model. Out-of-sample data are data that have not been included in the training of the model. Good machine learning practice demands that a developer produce solid estimates of how a predictive model will perform when it is deployed. This requires excluding some data from the training set and using it to simulate fresh data. The reasons for this requirement, the methods for accomplishing it, and the tradeoffs between different methods are described. Another key concept is that there are numerous measures of system performance. Chapter 3 outlines these measures and discusses tradeoffs among them. Readers who are already familiar with machine learning can browse this chapter and scan the code examples instead of reading it carefully and running the code.

Chapter 4 shows the core ideas of the algorithms for training penalized regression models. The chapter introduces the basic concepts and shows how the algorithms are derived. Some of the examples introduced in Chapter 3 are used to motivate the penalized linear regression methods and algorithms for their solution. The chapter runs through code for the core algorithms for solving penalized linear regression training. Chapter 4 also explains several extensions to linear regression methods. One of these extensions shows how to code factor variables as real numbers so that linear regression methods can be applied. Linear regression can be used only on problems where the predictors are real numbers; that is, the quantities being used to make predictions have to be numeric. Many practical and important problems have variables like “single, married, or divorced” that can be helpful in making predictions. To incorporate variables of this type (called categorical variables) in a linear regression model, methods have been devised to convert categorical variables to real-number variables; Chapter 4 covers those methods. In addition, Chapter 4 shows methods (called basis expansion) for getting nonlinear functions out of linear regression. Sometimes basis expansion can be used to squeeze a little more performance out of linear regression.
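Both extensions just described — converting categorical variables into numeric columns and using basis expansion so that a linear model can capture a nonlinear function — can be sketched briefly. The snippet below is illustrative only; the column name, the values, and the particular pandas and NumPy calls are assumptions, not the book’s own examples.

```python
import numpy as np
import pandas as pd

# Categorical variable -> numeric columns (dummy / one-hot coding).
df = pd.DataFrame({"marital_status": ["single", "married", "divorced", "married"]})
dummies = pd.get_dummies(df["marital_status"], prefix="marital")
print(dummies)   # one indicator column per category

# Basis expansion: augment a single predictor x with functions of x
# (here, its powers) so that a model that is linear in the expanded
# columns can represent a nonlinear function of x.
x = np.linspace(-1.0, 1.0, 5)
expanded = np.column_stack([x, x ** 2, x ** 3])
print(expanded)
```

A model fit to the expanded columns is still linear in its coefficients, which is why the penalized linear methods of Chapter 4 apply unchanged.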
Chapter 5 applies the penalized regression algorithms developed in Chapter 4 to a number of the problems outlined in Chapter 2. The chapter outlines the Python packages that implement penalized regression methods and uses them to solve problems. The objective is to cover a wide enough variety of problems that practitioners can find a problem close to the one that they have in front of them to solve. Besides quantifying and comparing predictive performance, Chapter 5 looks at other properties of the trained algorithms. Variable selection and variable ranking are important to understand, and this understanding will help speed development on new problems.

Chapter 6 develops ensemble methods. Because ensemble methods are most frequently based on binary decision trees, the first step is to understand the principles of training and using binary decision trees. Many of the properties of ensemble methods are ones that they inherit directly from binary decision trees. With that understanding in place, the chapter explains the three principal ensemble methods covered in the book. The common names for these are Bagging, Boosting, and Random Forest. For each of these, the principles of operation are outlined and the code for the core algorithm is developed so that you can understand how they work.

Chapter 7 uses ensemble methods to solve problems from Chapter 2 and then compares the various algorithms that have been developed. The comparison involves a number of elements. Predictive performance is one element of comparison; the time required for training and for generating predictions is another. All the algorithms covered give variable importance rankings, and this information is compared on a given problem across several different algorithms. (A minimal sketch of such a variable-importance comparison appears after the Summary below.)

In my experience teaching machine learning to programmers and computer scientists, I’ve learned that code examples work better than mathematics for some people. The approach taken here is to provide some mathematics, algorithm sketches, and code examples to illustrate the important points. Nearly all the methods that are discussed will be found in the code included in the book and on the website. The intent is to provide hackable code to help you get up and running on your own problems as quickly as possible.

Summary

This chapter has given a specification for the kinds of problems that you’ll be able to solve and a description of the process steps for building predictive models. The book concentrates on two algorithm families. Limiting the number of algorithms covered allows for a more thorough explanation of the background for these algorithms and of the mechanics of using them. This chapter showed some comparative performance results to motivate the choice of these two particular families, and it discussed the different strengths and characteristics of these two algorithm families.
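As a complement to the comparisons described in the Chapter 6 and 7 overviews, the sketch below trains two of the ensemble methods named there — Random Forest and (gradient) Boosting — on a synthetic problem and compares their variable-importance rankings. The data, the scikit-learn estimators, and all settings are assumptions made for illustration; the book carries out these comparisons in full on the Chapter 2 data sets.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic problem, invented for illustration: six candidate features,
# only the first three influence the target.
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=500)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    # Both ensembles expose a per-feature importance score after training;
    # argsort gives the ranking from most to least important column.
    ranking = np.argsort(model.feature_importances_)[::-1]
    print(name, "ranks columns (most to least important):", ranking)
```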
  • 56. In July, 1841, Mr. Payne patented his invention for sulphate of iron in London; and in June and November, 1846, in France; and in 1846 in London, for carbonate of soda.[13] The materials employed in Payne’s process are sulphate of iron and sulphate of lime, both being held in solution with water. The timber is placed in a cylinder in which a vacuum is formed by the condensation of steam, assisted by air pumps; a solution of sulphate of iron is then admitted into the vessel, which instantly insinuates itself into all the pores of the wood, previously freed from air by the vacuum, and, after about a minute’s exposure, impregnates its entire substance; the sulphate of iron is then withdrawn, and another solution of sulphate of lime thrown in, which enters the substance of the wood in the same manner as the former solution, and the two salts react upon each other, and form two new combinations within the substance of the wood—muriate of iron, and muriate of lime. One of the most valuable properties of timber thus prepared is its perfect incombustibility: when exposed to the action of flame or strong heat, it simply smoulders, and emits no flame. We may also reasonably infer that with such a compound in its pores, decay must be greatly retarded, and the liability to worms lessened, if not prevented. The greatest drawback consists in the increased difficulty of working. This invention has been approved by the Commissioners of Woods and Forests, and has received much approbation from the architectural profession. Mr. Hawkshaw, C.E., considers that this process renders wood brittle. It was employed for rendering wood uninflammable in the Houses of Parliament (we presume, in the carcase; for steaming was used for the joiner’s work), British Museum, and other public buildings; and also for the Royal Stables at Claremont. In 1842, Mr. Bethell stated before the Institute of Civil Engineers, London, that silicate of potash, or soluble glass, rendered wood uninflammable. In 1842, Professor Brande proposed corrosive sublimate in turpentine, or oil of tar, as a preservative solution. In 1845, Mr. Ransome suggested the application of silicate of soda, to be afterwards decomposed by an acid in the fibre of the wood; and in 1846, Mr Payne proposed soluble sulphides of the earth (barium sulphide, c.), to be also afterwards decomposed in the woods by acids. In 1855, a writer in the ‘Builder’ suggested an equal mixture of alum and borax (biborate of soda) to be used for making wood uninflammable. We have no objection to the use of alum and borax to render wood uninflammable, providing it does not hurt the wood.
  • 57. Such are the principal patents, suggestions, and inventions, up to the year 1856; but there are many more which have been brought before the public, some of which we will now describe. Dr. Darwin, some years since, proposed absorption, first, of lime water, then of a weak solution of sulphuric acid, drying between the two, so as to form a gypsum (sulphate of lime) in the pores of the wood, the latter to be previously well seasoned, and when prepared to be used in a dry situation. Dr. Parry has recommended a preparation composed of bees-wax, roll brimstone, and oil, in the proportion of 1, 2, and 3 ounces to ¾ gallon of water; to be boiled together and laid on hot. Mr. Pritchard, C.E., of Shoreham, succeeded in establishing pyrolignite of iron and oil of tar as a preventive of dry rot; the pyrolignite to be used very pure, the oil applied afterwards, and to be perfectly free from any particle of ammonia. Mr. Toplis recommends the introduction into the pores of the timber of a solution of sulphate or muriate of iron; the solution may be in the proportion of about 2 lb. of the salt to 4 or 5 gallons of water. An invention has been lately patented by Mr. John Cullen, of the North London Railway, Bow, for preserving wood from decay. The inventor proposes to use a composition of coal-tar, lime, and charcoal; the charcoal to be reduced to a fine powder, and also the lime. These materials to be well mixed, and subjected to heat, and the wood immersed therein. The impregnation of the wood with the composition may be materially aided by means of exhaustion and pressure. Wood thus prepared is considered to be proof against the attacks of the white ant. The process of preserving wood from decay invented by Mr. L. S. Robins, of New York, was proposed to be worked extensively by the “British Patent Wood Preserving Company.” It consists in first removing the surface moisture, and then charging and saturating the wood with hot oleaginous vapours and compounds. As the Robins’ process applies the preserving material in the form of vapour, the wood is left clean, and after a few hours’ exposure to the air it is said to be fit to be handled for any purposes in which elegant workmanship is required. Neither science nor extraordinary skill is required in conducting the process, and the treatment under the patent is said to involve only a trifling expense. Reference has already been made to the use of petroleum. The almost unlimited supply of it within the last few years has opened out a new and
  • 58. almost boundless source of wealth. An invention has been patented in the name of Mr. A. Prince, which purports to be an improvement in the mode of preserving timber by the aid of petroleum. The invention consists, firstly, in the immersion of the timber in a suitable vessel or receptacle, and to exhaust the air therefrom, by the ordinary means of preserving wood by saturation. The crude petroleum is next conveyed into the vessel, and thereby caused to penetrate into every pore or interstice of the woody fibre, the effect being, it is said, to thoroughly preserve the wood from decay. He also proposes to mix any cheap mineral paint or pigment with crude petroleum to be used as a coating for the bottom of ships before the application of the sheathing, and also to all timber for building or other purposes. The composition is considered to render the timber indestructible, and to repel the attacks of insects. Without expressing any opinion upon this patent as applied to wood for building purposes, we must again draw attention to the high inflammability of petroleum. The ‘Journal’ of the Board of Arts and Manufactures for Upper Canada considers the following to be the cheapest and the best mode of preserving timber in Canada: Let the timbers be placed in a drying chamber for a few hours, where they would be exposed to a temperature of about 200°, so as to drive out all moisture, and by heat, coagulate the albuminous substance, which is so productive of decay. Immediately upon being taken out of the drying chamber, they should be thrown into a tank containing crude petroleum. As the wood cools, the air in the pores will contract, and the petroleum occupy the place it filled. Such is the extraordinary attraction shown by this substance for dry surfaces, that by the process called capillary attraction, it would gradually find its way into the interior of the largest pieces of timber, and effectually coat the walls and cells, and interstitial spaces. During the lapse of time, the petroleum would absorb oxygen, and become inspissated, and finally converted into a bituminous substance, which would effectually shield the wood from destruction by the ordinary processes of decay. The process commends itself on account of its cheapness. A drying chamber might easily be constructed of sheet iron properly strengthened, and petroleum is very abundant and accessible. Immediately after the pieces of timber have been taken out of the petroleum vat, they should be sprinkled with wood ashes in order that a coating of this substance may adhere to the surface, and carbonate of potash be absorbed to a small depth. The object of this is to render the surface incombustible; and dusting with wood ashes until quite dry will destroy this property to a certain extent.
  • 59. The woodwork of farm buildings in this country is sometimes subjected to the following: Take two parts of gas-tar, one part of pitch, one part half caustic lime and half common resin; mix and boil these well together, and put them on the wood quite hot. Apply two or three coats, and while the last coat is still warm, dash on it a quantity of well-washed sharp sand, previously prepared by being sifted through a sieve. The surface of the wood will then have a complete stone appearance, and may be durable. It is, of course, necessary, that the wood be perfectly dry, and one coat should be well hardened before the next is put on. It is necessary, by the use of lime and long boiling, to get quit of the ammonia of the tar, as it is considered to injure the wood. Mr. Abel, the eminent chemist to the War Department, recommends the application of silicate of soda in solution, for giving to wood, when applied to it like paint, a hard coating, which is durable for several years, and is also a considerable protection against fire. The silicate of soda, which is prepared for use in the form of a thick syrup, is diluted in water in the proportion of 1 part by measure of the syrup to 4 parts of water, which is added slowly, until a perfect mixture is obtained by constant stirring. The wood is then washed over two or three times with this liquid by means of an ordinary whitewash brush, so as to absorb as much of it as possible. When this first coating is nearly dry, the wood is painted over with another wash made by slaking good fat lime, diluted to the consistency of thick cream. Then, after the limewash has become moderately dry, another solution of the silicate of soda, in the proportion of 1 of soda to 2 of water, is applied in the same manner as the first coating. The preparation of the wood is then complete; but if the lime coating has been applied too quickly, the surface of the wood may be found, when quite dry, after the last coating of the silicate, to give off a little lime when rubbed with the hand; in which case it should be once more coated over with a solution of the silicate of the same strength as in the first operation. If Mr. Abel had been an architect or builder, he would never have invented this process. What would the cost be? and would not a special clerk of the works be necessary to carry out this method in practice? The following coating for piles and posts, to prevent them from rotting, has been recommended on account of its being economical, impermeable to water, and nearly as hard as stone: Take 50 parts of resin, 40 of finely powdered chalk, 300 parts of fine white sharp sand, 4 parts of linseed oil, 1 part of native red oxide of copper, and 1 part of sulphuric acid. First, heat the resin, chalk, sand, and oil, in an iron boiler; then add the oxide, and, with care, the acid; stir the composition carefully, and apply the coat while it is still
  • 60. hot. If it be not liquid enough, add a little more oil. This coating, when it is cold and dry, forms a varnish which is as hard as stone. Another method for fencing, gate-posts, garden stakes, and timber which is to be buried in the earth, may be mentioned. Take 11 lb. of blue vitriol (sulphate of copper) and 20 quarts of water; dissolve the vitriol with boiling water, and then add the remainder of the water. The end of the wood is then to be put into the solution, and left to stand four or five days; for shingle, three days will answer, and for posts, 6 inches square, ten days, Care should be taken that the saturation takes place in a well-pitched tank or keyed box, for the reason that any barrel will be shrunk by the operation so as to leak. Instead of expanding an old cask, as other liquids do, this shrinks it. This solution has also been used in dry rot cases, when the wood is only slightly affected. It will sometimes be found that when oak fencing is put up new, and tarred or painted, a fungus will vegetate through the dressing, and the interior of the wood be rapidly destroyed; but when undressed it seems that the weather desiccates the gum or sap, and leaves only the woody fibre, and the fence lasts for many years. About fifteen years ago, Professor Crace Calvert, F.R.S., made an investigation for the Admiralty, of the qualities of different woods used in ship- building. He found the goodness of teak to consist in the fact that it is highly charged with caoutchouc; and he considered that if the tannin be soaked out of a block of oak, it may then be interpenetrated by a solution of caoutchouc, and thereby rendered as lasting as teak. We can only spare the space for a few words about this method. 1st. We have seen lead which has formed part of the gutter of a building previous to its being burnt down: lead melts at 612° F.; caoutchouc at 248° F.; therefore caoutchouc would not prevent wood from being destroyed by fire. At 248° caoutchouc is highly inflammable, burns with a white flame and much smoke. 2nd. We are informed by a surgical bandage-maker of high repute, that caoutchouc, when used in elastic kneecaps, c., will perish, if the articles are left in a drawer for two or three years. When hard, caoutchouc is brittle. Would it be advisable to interpenetrate oak with a solution of caoutchouc? In 1825, Mr. Hancock proposed a solution of 1½ lb. of caoutchouc in 3 lb. of essential oil, to which was to be added 9 lb. of tar. Mr. Parkes, in 1843, and M. Passez, in 1845, proposed to dissolve caoutchouc in sulphur: painting or
  • 61. immersing the wood. Maconochie, in 1805, after his return from India, proposed distilled teak chips to be injected into fir woods. Although England has been active in endeavouring to discover the best and cheapest remedy for dry rot, France has also been active in the same direction. M. le Comte de Chassloup Lambat, Member of the late Imperial Senate of France, considers that, as sulphur is most prejudicial to all species of fungi, there might, perhaps, be some means of making it serviceable in the preservation of timber. We know with what success it is used in medicine. It is also known that coopers burn a sulphur match in old casks before using them —a practice which has evidently for its object the prevention of mustiness, often microscopic, which would impart a bad flavour to the wine. M. de Lapparent, late Inspector-General of Timber for the French Navy, proposed to prevent the growth of fungi by the use of a paint having flour of sulphur as a basis, and linseed oil as an amalgamater. In 1862 he proposed charring wood; we have referred to this process in our last chapter (p. 96). The paint was to be composed of: Flour of sulphur 200 grammes 3,088 grains. Common linseed oil 135 ” 2,084 ” Prepared oil of manganese 30 ” 463 ” He considered that by smearing here and there either the surfaces of the ribs of a ship, or below the ceiling, with this paint, a slightly sulphurous atmosphere will be developed in the hold, which will purify the air by destroying, at least in part, the sporules of the fungi. He has since stated that his anticipations have been fully realized. M. de Lapparent also proposes to prevent the decay of timber by subjecting it to a skilful carbonization with common inflammable coal gas. An experiment was made at Cherbourg, which was stated to be completely successful. The cost is only about 10 cents per square yard of framing and planking.[14] M. de Lapparent’s gas method is useful for burning off old paint. We saw it in practice (April, 1875) at Waterloo Railway Station, London, and it appeared to be effective. At the suggestion of MM. Le Châtelier (Engineer-in-chief of mines) and Flachat, C.E.’s, M. Ranee, a few years since, injected in a Légé and Fleury cylinder certain pieces of white fir, red fir, and pitch pine with chloride of sodium, which had been deprived of the manganesian salts it contained, to destroy its deliquescent property. Some pieces were injected four times, but
  • 62. the greatest amount of solution injected into pitch pine heart-wood was from 3 to 4 per cent., and very little more was injected into the white and red fir heart-wood. It was also noticed that sapwood, after being injected four times, only gained 8 per cent. in weight in the last three operations. The experiments made to test the relative incombustibility of the injected wood showed that the process was a complete failure; the prepared wood burning as quickly as the unprepared wood. M. Paschal le Gros, of Paris, has patented his system for preserving all kinds of wood, by means of a double salt of manganese and of zinc, used either alone or with an admixture of creosote. The solution, obtained in either of the two ways, is poured into a trough, and the immersion of the logs or pieces of wood is effected by placing them vertically in the trough in such a manner that they are steeped in the liquid to about three-quarters of their length. The wood is thus subjected to the action of the solution during a length of time varying from twelve to forty-eight hours. The solution rises in the fibres of the wood, and impregnates them by the capillary force alone, without requiring any mechanical action. The timber is said to become incombustible, hard, and very lasting. M. Fontenay, C.E., in 1832, proposed to act upon the wood with what he designated metallic soap, which could be obtained from the residue in greasing boxes of carriages; also from the acid remains of oil, suet, iron, and brass dust; all being melted together. In 1816 Chapman tried experiments with yellow soap; but to render it sufficiently fluid it required forty times its weight of water, in which the quantity of resinous matter and tallow would scarcely exceed ⅟80th; therefore no greater portion of these substances could be left in the pores of the wood, which could produce little effect. M. Letellier, in 1837, proposed to use deuto-chloride of mercury as a preservative for wood. M. Dondeine’s process was formerly used in France and Germany. It is a paint, consisting of many ingredients, the principal being linseed oil, resin, white lead, vermilion, lard, and oxide of iron. All these are to be well mixed, and reduced by boiling to one-tenth, and then applied with a brush. If applied cold, a little varnish or turpentine to be added. Little is known in England of the inventions which have arisen in foreign countries not already mentioned. M. Szerelmey, a Hungarian, proposed, in 1868, potassa, lime, sulphuric acid, petroleum, c., to preserve wood.
  • 63. In Germany, the following method is sometimes used for the preservation of wood: Mix 40 parts of chalk, 40 parts of resin, 4 of linseed oil; melting them together in an iron pot; then add 1 part of native oxide of copper, and afterwards, carefully, 1 part of sulphuric acid. The mixture is applied while hot to the wood by means of a brush, and it soon becomes very hard.[15] Mr. Cobley, of Meerholz, Hesse, has patented the following preparation. A strong solution of potash, baryta, lime, strontia, or any of their salts, are forced into the pores of timber in a close iron vessel by a pump. After this operation, the liquid is run off from the timber, and hydro-fluo-silicic acid is forced in, which, uniting with the salts in the timber, forms an insoluble compound capable of rendering the wood uninflammable. About the year 1800, Neils Nystrom, chemist, Norkopping, recommended a solution of sea salt and copperas, to be laid upon timber as hot as possible, to prevent rottenness or combustion. He also proposed a solution of sulphate of iron, potash, alum, c., to extinguish fires. M. Louis Vernet, Buenos Ayres, proposed to preserve timber from fire by the use of the following mixture: Take 1 lb. of arsenic, 6 lb. of alum, and 10 lb. of potash, in 40 gallons of water, and mix with oil, or any suitable tarry matters, and paint the timber with the solution. We have already referred to the conflicting evidence respecting alum and water for wood: we can now state that Chapman’s experiments proved that arsenic afforded no protection against dry rot. Experiments in Cornwall have proved that where arsenical ores have lain on the ground, vegetation will ensue in two or three years after removal of the ore. If, therefore, alum or arsenic have no good effect on timber with respect to the dry rot, we think the use of both of them together would certainly be objectionable. The last we intend referring to is a composition frequently used in China, for preserving wood. Many buildings in the capital are painted with it. It is called Schoicao, and is made with 3 parts of blood deprived of its febrine, 4 parts of lime and a little alum, and 2 parts of liquid silicate of soda. It is sometimes used in Japan. It would be practically useless to quote any further remedies, and the reader is recommended to carefully study those quoted in this chapter, and of their utility to judge for himself, bearing in mind those principles which we have referred to before commencing to describe the patent processes. A large number of patents have been taken out in England for the preservation of wood by preservative processes, but only two are now in use,—that is, to any extent,—viz. Bethell’s and Burnett’s. Messrs. Bethell and Co. now impregnate
  • 64. timber with copper, zinc, corrosive sublimate, or creosote; the four best patents. We insert here a short analysis of different methods proposed for seasoning timber:— Vacuum and Pressure Processes generally. Bréant’s. Bethell’s. Payne’s. Perin’s. Tissier’s. Vacuum by Condensation of Steam. Tissier. Bréant. Payne. Renard Perin, 1848. Brochard and Watteau, 1847. Separate Condenser. Tissier. Employ Sulphate of Copper in closed vessels. Bethell’s Patent, 11th July, 1838. Tissier, 22nd October, 1844. Molin’s Paper, 1853. Payen’s Pamphlet. Légé and Fleury’s Pamphlet. Current of Steam. Moll’s Patent, 19th January, 1835. Tissier’s ” 22nd October, 1844. Payne’s ” 14th Nov., 1846. Meyer d’Uslaw, 2nd January, 1851. Payen’s Pamphlet. Hot Solution.
  • 65. Tissier’s Patent, 22nd October, 1844. Knab’s Patent, 8th September, 1846. Most solutions used are heated. The following are the chief ingredients which have been recommended, and some of them tried, to prevent the decomposition of timber, and the growth of fungi:— Acid, Sulphuric. ” Vitriolic. ” of Tar. Carbonate of Potash. ” Soda. ” Barytes. Sulphate of Copper. ” Iron. ” Zinc. ” Lime. ” Magnesia. ” Barytes. ” Alumina. ” Soda. Salt, Neutral. Salt, Selenites. Oil, Vegetable. ” Animal. ” Mineral. Muriate of Soda. Marcosites, Mundic. ” Barytes. Nitrate of Potash. Animal Glue. ” Wax. Quick Lime. Resins of different kinds. Sublimate, Corrosive. Peat Moss. For the non-professional reader we find we have three facts:
  • 66. 1st. The most successful patentees have been Bethell and Burnett, in England; and Boucherie, in France: all B’s. 2nd. The most successful patents have been knighted. Payne’s patent was, we believe, used by Sirs R. Smirke and C. Barry; Kyan’s, by Sir R. Smirke; Burnett’s, by Sirs M. Peto, P. Roney, and H. Dryden; while Bethell’s patent can claim Sir I. Brunel, and many other knights. We believe Dr. Boucherie received the Legion of Honour in France. 3rd. There are only at the present time three timber-preserving works in London, and they are owned by Messrs. Bethell and Co., Sir F. Burnett and Co., and Messrs. Burt, Boulton, and Co.: all names commencing with the letter B. For the professional reader we find we have three hard facts: The most successful patents may be placed in three classes, and we give the key-note of their success. 1st. One material and one application.—Creosote, Petroleum. Order—Ancient Egyptians, or Bethell’s, Burmese. 2nd. Two materials and one application.—Chloride of zinc and water; sulphate of copper and water; corrosive sublimate and water. Order—Burnett, Boucherie, Kyan. 3rd. Two materials and two applications.—Sulphate of iron and water; afterwards sulphate of lime and water. Payne. We thus observe there are twice three successful patent processes. Any inventions which cannot be brought under these three classes have had a short life; at least, we think so. The same remarks will apply to external applications for wood—for instance, coal-tar, one application, is more used for fencing than any other material. We are much in want of a valuable series of experiments on the application of various chemicals on wood to resist burning to pieces; without causing it to rot speedily.
  • 67. CHAPTER VI. ON THE MEANS OF PREVENTING DRY ROT IN MODERN HOUSES; AND THE CAUSES OF THEIR DECAY. Although writers on dry rot have generally deemed it a new disease, there is foundation to believe that it pervaded the British Navy in the reign of Charles II. “Dry rot received a little attention” so writes Sir John Barrow, “about the middle of the last century, at some period of Sir John Pringle’s presidency of the Royal Society of London.” As timber trees were, no doubt, subject to the same laws and conditions 500 years ago as they are at the present day, it is indeed extremely probable that if at that time unseasoned timber was used, and subjected to heat and moisture, dry rot made its appearance. We propose in this chapter to direct attention to the several causes of the decay of wood, which by proper building might be averted. The necessity of proper ventilation round the timbers of a building has been repeatedly advised in this volume; for even timber which has been naturally seasoned is at all times disposed to resume, from a warm and stagnant atmosphere, the elements of decay. We cannot therefore agree with the following passage from Captain E. M. Shaw’s book on ‘Fire Surveys,’ which is to be found at page 44:—“Circulation of air should on no account be permitted in any part of a building not exposed to view, especially under floors, or inside skirting boards, or wainscots.” In the course of this chapter, the evil results from a want of a proper circulation of air will be shown. In warm cellars, or any close confined situations, where the air is filled with vapour without a current to change it, dry rot proceeds with astonishing rapidity, and the timber work is destroyed in a very short time. The bread rooms of ships; behind the skirtings, and under the wooden floors, or the basement stories of houses, particularly in kitchens, or other rooms where there are constant fires; and, in general, in every place where wood is exposed to warmth and damp air, the dry rot will soon make its appearance. All kinds of stoves are sure to increase the disease if moisture be present. The effect of heat is also evident from the rapid decay of ships in hot climates; and the warm moisture given out by particular cargoes is also very
  • 68. destructive. Hemp will, without being injuriously heated, emit a moist warm vapour: so will pepper (which will affect teak) and cotton. The ship ‘Brothers’ built at Whitby, of green timber, proceeded to St. Petersburgh for a cargo of hemp. The next year it was found on examination that her timbers were rotten, and all the planking, except a thin external skin. It is also an important fact that rats very rarely make their appearance in dry places: under floors they are sometimes very destructive. As rats will sometimes destroy the structural parts of wood framing, a few words about them may not be out of place. If poisoned wheat, arsenic, c., be used, the creatures will simply eat the things and die under the floor, causing an intolerable stench. The best method is to make a small hole in a corner of the floor (unless they make it themselves) large enough to permit them to come up; the following course is then recommended:—Take oil of amber and ox-gall in equal parts; add to them oatmeal or flour sufficient to form a paste, which divide into little balls, and lay them in the middle of the infested apartment at night time. Surround the balls with a number of saucers filled with water—the smell of the oil is sure to attract the rats, they will greedily devour the balls, and becoming intolerably thirsty will drink till they die on the spot. They can be buried in the morning. Building timber into new walls is often a cause of decay, as the lime and damp brickwork are active agents in producing putrefaction, particularly where the scrapings of roads are used, instead of sand, for mortar. Hence it is that bond timbers, wall plates, and the ends of girders, joists, and lintels are so frequently found in a state of decay. The ends of brestsummers are sometimes cased in sheet lead, zinc, or fire-brick, as being impervious to moisture. The old builders used to bed the ends of girders and joists in loam instead of mortar, as directed in the Act of Parliament, 19 Car. II. c. 3, for rebuilding the City of London. In Norway, all posts in contact with the earth are carefully wrapped round with flakes of birch bark for a few inches above and below the ground. Timber that is to lie in mortar—as, for instance, the ends of joists, door sills and frames of doors and windows, and the ends of girders—if pargeted over with hot pitch, will, it is said, be preserved from the effects of the lime. In taking down, some years since, in France, some portion of the ancient Château of the Roque d’Oudres, it was found that the extremities of the oak girders were perfectly preserved, although these timbers were supposed to have been in their places for upwards of 600 years. The whole of these extremities buried in the walls were completely wrapped round with plates of cork. When demolishing an ancient Benedictine church at Bayonne, it was
  • 69. found that the whole of the fir girders were entirely worm eaten and rotten, with the exception, however, of the bearings, which, as in the case just mentioned, were also completely wrapped round with plates of cork. These facts deserve consideration. If any of our professional readers should wish to try cork for the ends of girders, they will do well to choose the Spanish cork, which is the best. In this place it may not be amiss to point out the dangerous consequences of building walls so that their principal support depends on timber. The usual method of putting bond timber into walls is to lay it next the inside; this bond often decays, and, of course, leaves the walls resting only upon the external course or courses of brick; and fractures, bulges, or absolute failures are the natural consequences. This evil is in some degree avoided by placing the bond in the middle of the wall, so that there is brickwork on each side, and by not putting continued bond for nailing the battens to. We object to placing bond in the middle of a wall: the best way, where it can be managed, is to corbel out the wall, resting the ends of the joists on the top course of bricks; thus doing away with the wood-plate. In London, wood bond is prohibited by Act of Parliament, and hoop-iron bond (well tarred and sanded) is now generally used. The following is an instance of the bad effects of placing wood bond in walls: In taking down portions of the audience part and the whole of the corridors of the original main walls of Covent Garden Theatre, London, in 1847, which had only been built about thirty-five years, the wood horizontal bond timbers, although externally appearing in good condition, were found, on a close examination by Mr. Albano, much affected by shrinkage, and the majority of them quite rotten in the centre, consequently the whole of them were ordered to be taken out in short lengths, and the space to be filled in with brickwork and cement. Some years since we had a great deal to do with “Fire Surveys;” that is to say, surveying buildings to estimate the cost of reinstating them after being destroyed by fire; and we often noticed that the wood bond, being rotten, was seriously charred by the fire, and had to be cut out in short lengths, and brickwork in cement “pinned in” in its place. Brestsummers and story posts are rarely sufficiently burnt to affect the stability of the front wall of a shop building. In bad foundations, it used to be common, before concrete came into vogue, to lie planks to build upon. Unless these planks were absolutely wet, they were certain to rot in such situations, and the walls settled; and most likely irregularly, rending the building to pieces. Instances of such kind of failure frequently occur. It was found necessary, a few years since, to
  • 70. underpin three of the large houses in Grosvenor Place, London, at an immense expense. In one of these houses the floors were not less than three inches out of level, the planking had been seven inches thick, and most of it was completely rotten: it was of yellow fir. A like accident happened to Norfolk House, St. James’s Square, London, where oak planking had been used. As an example of the danger of trusting to timber in supporting heavy stone or brickwork, the failure of the curb of the brick dome of the church of St. Mark, at Venice, may be cited. This dome was built upon a curb of larch timber, put together in thicknesses, with the joints crossed, and was intended to resist the tendency which a dome has to spread outwards at the base. In 1729, a large crack and several smaller ones were observed in the dome. On examination, the wooden curb was found to be in a completely rotten state, and it was necessary to raise a scaffold from the bottom to secure the dome from ruin. After it was secured from falling, the wooden curb was removed, and a course of stone, with a strong band of iron, was put in its place. It is said that another and very important source of destruction is the applying end to end of two different kinds of wood: oak to fir, oak to teak or lignum vitæ; the harder of the two will decay at the point of juncture. The bad effects resulting from damp walls are still further increased by hasty finishing. To enclose with plastering and joiners’ work the walls and timbers of a building while they are in a damp state is the most certain means of causing the building to fall into a premature state of decay. Mr. George Baker, builder of the National Gallery, London, remarked, in 1835, “I have seen the dry rot all over Baltic timber in three years, in consequence of putting it in contact with moist brickwork; the rot was caused by the badness of the mortar, it was so long drying.” Slating the external surface of a wall, to keep out the rain or damp, is sometimes adopted: a high wall (nearly facing the south-west) of a house near the north-west corner of Blackfriars Bridge, London, has been recently slated from top to bottom, to keep out damp. However well timber may be seasoned, if it be employed in a damp situation, decay is the certain consequence; therefore it is most desirable that the neighbourhood of buildings should be well drained, which would not only prevent rot, but also increase materially the comfort of those who reside in them. The drains should be made water-tight wherever they come near to the walls; as walls, particularly brick walls, draw up moisture to a very considerable height: very strict supervision should be placed over workmen while the drains of a building are being laid. Earth should never be suffered to
  • 71. rest against walls, and the sunk stories of buildings should always be surrounded by an open area, so that the walls may not absorb moisture from the earth: even open areas require to be properly built. We will quote a case to explain our meaning. A house was erected about eighteen months ago, in the south-east part of London, on sloping ground. Excavations were made for the basement floor, and a dry area, “brick thick, in cement,” was built at the back and side of the house, the top of the area wall being covered with a stone coping; we do not know whether the bottom of the area was drained. On the top of the coping was placed mould, forming one of the garden beds for flowers. Where the mould rested against the walls, damp entered. The area walls should have been built, in the first instance, above the level of the garden-ground—which has since been done—otherwise, in course of time, the ends of the next floor joists would have become attacked by dry rot. Some people imagine that if damp is in a wall the best way to get rid of it is to seal it in, by plastering the inside and stuccoing the outside of the wall; this is a great mistake; damp will rise higher and higher, until it finds an outlet; rotting in the meanwhile the wood bond and ends of all the joists. We were asked recently to advise in a curious case of this kind at a house in Croydon. On wet days the wall (stucco, outside; plaster, inside) was perfectly wet: bands of soft red bricks in wall, at intervals, were the culprits. To prevent moisture rising from the foundations, some substance that will not allow it to pass should be used at a course or two above the footings of the walls, but it should be below the level of the lowest joists. “Taylor’s damp course” bricks are good, providing the air-passages in them are kept free for air to pass through: they are allowed sometimes to get choked up with dirt. Sheets of lead or copper have been used for that purpose, but they are very expensive. Asphalted felt is quite as good; no damp can pass through it. Care must, however, be taken in using it if only one wall, say a party wall, has to be built. To lay two or three courses of slates, bedded in cement, is a good method, providing the slates “break joint,” and are well bedded in the cement. Workmen require watching while this is being done, because if any opening be left for damp to rise, it will undoubtedly do so. A better method is to build brickwork a few courses in height with Portland cement instead of common mortar, and upon the upper course to lay a bed of cement of about one inch in thickness; or a layer of asphalte (providing the walls are all carried up to the same level before the asphalte is applied hot). As moisture does not penetrate these substances, they are excellent materials for keeping out wet; and it can easily be seen if the mineral asphalte has been properly applied. To keep out the damp from basement floors, lay down cement concrete 6 inches
  • 72. thick, and on the top, asphalte 1 inch thick, and then lay the sleepers and joists above; or bed the floor boards on the asphalte. The walls and principal timbers of a building should always be left for some time to dry after it is covered in. This drying is of the greatest benefit to the work, particularly the drying of the walls; and it also allows time for the timbers to get settled to their proper bearings, which prevents after- settlements and cracks in the finished plastering. It is sometimes said that it is useful because it allows the timber more time to season; but when the carpenter considers that it is from the ends of the timber that much of its moisture evaporates, he will see the impropriety of leaving it to season after it is framed, and also the cause of framed timbers of unseasoned wood failing at the joints sooner than in any other place. No parts of timber require the perfect extraction of the sap so much as those that are to be joined. When the plastering is finished, a considerable time should be allowed for the work to get dry again before the skirtings, the floors, and other joiners’ work be fixed. Drying will be much accelerated by a free admission of air, particularly in favourable weather. When a building is thoroughly dried at first, openings for the admission of fresh air are not necessary when the precautions against any new accessions of moisture have been effectual. Indeed, such openings only afford harbour for vermin: unfortunately, however, buildings are so rarely dried when first built, that air-bricks, c., in the floors are very necessary, and if the timbers were so dried as to be free from water (which could be done by an artificial process), the wood would only be fit for joinery purposes. Few of our readers would imagine that water forms ⅕th part of wood. Here is a table (compiled from ‘Box on Heat,’ and Péclet’s great work ‘Traité de la Chaleur’):— Wood. Elements. Ordinary state. Carbon ·408 Hydrogen ·042 Oxygen ·334 Water ·200 Ashes ·016 1·000 Many houses at our seaport towns are erected with mortar, having sea-sand in its composition, and then dry rot makes its appearance. If no other sand can be obtained, the best way is to have it washed at least three times (the
  • 73. contractor being under strict supervision, and subject to heavy penalties for evasion). After each washing it should be left exposed to the action of the sun, wind, and rain: the sand should also be frequently turned over, so that the whole of it may in turn be exposed; even then it tastes saltish, after the third operation. A friend of ours has a house at Worthing, which was erected a few years since with sea-sand mortar, and on a wet day there is always a dampness hanging about the house—every third year the staircase walls have to be repapered: it “bags” from the walls. In floors next the ground we cannot easily prevent the access of damp, but this should be guarded against as far as possible. All mould should be carefully removed, and, if the situation admits of it, a considerable thickness of dry materials, such as brickbats, dry ashes, broken glass, clean pebbles, concrete, or the refuse of vitriol-works; but no lime (unless unslaked) should be laid under the floor, and over these a coat of smiths’ ashes, or of pyrites, where they can be procured. The timber for the joists should be well seasoned; and it is advisable to cut off all connection between wooden ground floors and the rest of the woodwork of the building. A flue carried up in the wall next the kitchen chimney, commencing under the floor, and terminating at the top of the wall, and covered to prevent the rain entering, would take away the damp under a kitchen floor. In Hamburg it is a common practice to apply mineral asphalte to the basement floors of houses to prevent capillary attraction; and in the towns of the north of France, gas-tar has become of very general use to protect the basement of the houses from the effects of the external damp. Many houses in the suburbs (particularly Stucconia) of London are erected by speculating builders. As soon as the carcase of a house is finished (perhaps before) the builder is unable to proceed, for want of money, and the carcase is allowed to stand unfinished for months. Showers of rain saturate the previously unseasoned timbers, and pools of water collect on the basement ground, into which they gradually, but surely, soak. Eventually the houses are finished (probably by half a dozen different tradesmen, employed by a mortgagee); bits of wood, rotten sawdust, shavings, c., being left under the basement floor. The house when finished, having pretty (!) paper on the walls, plate-glass in the window-sashes, and a bran new brick and stucco portico to the front door, is quickly let. Dry rot soon appears, accompanied with its companions, the many-coloured fungi; and when their presence should be known from their smell, the anxious wife probably exclaims to her husband, “My dear! there is a very strange smell which appears to come from the children’s playroom: had you not better send for Mr. Wideawake, the
  • 74. builder, for I am sure there is something the matter with the drains.” Defective ventilation, dry rot, green water thrown down sinks, c., do not cause smells, it’s the drains, of course! There is another cause which affects all wood most materially, which is the application of paint, tar, or pitch before the wood has been thoroughly dried. The nature of these bodies prevents all evaporation; and the result of this is that the centre of the wood is transformed into touchwood. On the other hand, the doors, pews, and carved work of many old churches have never been painted, and yet they are often found to be perfectly sound, after having existed more than a century. In Chester, Exeter, and other old cities, where much timber was formerly used, even for the external parts of buildings, it appears to be sound and perfect, though black with age, and has never been painted. Mr. Semple, in his treatise on ‘Building in Water,’ mentions an instance of some field-gates made of home fir, part of which, being near the mansion, were painted; while the rest, being in distant parts of the grounds, were not painted. Those which were painted soon became quite rotten, but the others, which were not painted, continued sound. Another cause of dry rot, which is sometimes found in suburban and country houses, is the presence of large trees near the house. We are acquainted with the following remarkable instance:—At the northern end of Kilburn, London, stands Stanmore Cottage, erected a great many years ago: about fifty feet in front of it is an old elm-tree. The owner, a few years since, noticed cracks round the portico of the house; these cracks gradually increased in size, and other cracks appeared in the window arches, and in different parts of the external and internal walls. The owner became alarmed, and sent for an experienced builder, who advised underpinning the walls. Workmen immediately commenced to remove the ground from the foundations, and it was then found that the foundations, as well as the joists, were honeycombed by the roots of the elm-tree, which were growing alongside the joists, the whole being surrounded by large masses of white and yellow dry-rot fungus. The insufficient use of tarpaulins is another frequent cause of dry rot. A London architect had (a few years since) to superintend the erection of a church in the south-west part of London; an experienced builder was employed. The materials were of the best description and quality. When the walls were sufficiently advanced to receive the roof, rain set in; as the clown in one of Shakespeare’s plays observed, “the rain, it raineth every day;” it was so, we are told, in this case for some days. The roof when finished was ceiled
  • 75. below with a plaster ceiling; and above (not with “dry oakum without pitch” but) with slates. A few months afterwards some of the slates had to be reinstated, in consequence of a heavy storm, and it was then discovered that nearly all the timbers of the roof were affected by dry rot. This was an air- tight roof. In situations favourable to rot, painting prevents every degree of exhalation, depriving at the same time the wood of the influence of the air, and the moisture runs through it, and insidiously destroys the wood. Most surveyors know that moist oak cills to window frames will soon rot, and the painting is frequently renewed; a few taps with a two-feet brass rule joint on the top and front of cill will soon prove their condition. Wood should be a year or more before it is painted; or, better still, never painted at all. Artificers can tell by the sound of any substance whether it be healthy or decayed as accurately as a musician can distinguish his notes: thus, a bricklayer strikes the wall with his crow, and a carpenter a piece of timber with his hammer. The Austrians used formerly to try the goodness of the timber for ship- building by the following method: One person applies his ear to the centre of one end of the timber, while another, with a key, hits the other end with a gentle stroke. If the wood be sound and good, the stroke will be distinctly heard at the other end, though the timber should be fifty feet or more in length. Timber affected with rot yields a particular sound when struck, but if it were painted, and the distemper had made much progress, with no severe stroke the outside breaks like a shell. The auger is a very useful instrument for testing wood; the wood or sawdust it brings out can be judged by its smell; which may be the fresh smell of pure wood: the vinous smell, or first degree of fermentation, which is alcoholic; or the second degree, which is putrid. The sawdust may also be tested by rubbing it between the fingers. According to Colonel Berrien, the Michigan Central Railroad Bridge, at Niles, was painted before seasoning, with “Ohio fire-proof paint,” forming a glazed surface. After five years it was so rotten as to require rebuilding. Painted floor-cloths are very injurious to wooden floors, and frequently produce rottenness in the floors that are covered with them, as the painted cloth prevents the access of air, and retains whatever dampness the boards may absorb, and therefore soon causes decay. Carpets are not so injurious, but still assist in retarding free evaporation. Captain E. M. Shaw, in ‘Fire Surveys,’ thus writes of the floors of a building, “They might with advantage be caulked like a ship’s deck, only with dry oakum, without pitch.” Let us see how far oil floor-cloth and kamptulicon will assist us in obtaining an air-tight floor.
  • 76. In London houses there is generally one room on the basement floor which is carefully covered over with an oiled floor-cloth. In such a room the dry rot often makes its appearance. The wood absorbs the aqueous vapour which the oil-cloth will not allow to escape; and being assisted by the heat of the air in such apartments, the decay goes on rapidly. Sometimes, however, the dry rot is only confined to the top of the floor. At No. 106, Fenchurch Street, London, a wood floor was washed (a few years since) for a tenant, and oil-cloth was laid down. Circumstances necessitated his removal a few months afterwards; and it was then found that the oil-cloth had grown, so to speak, to the wood flooring, and had to be taken off with a chisel: the dry rot had been engendered merely on the surface of the floor boards, as they were sound below as well as the joists: air bricks were in the front wall. We have seen many instances of dry rot in passages, where oiled floor-cloth has been nailed down and not been disturbed for two or three years. In ordinary houses, where floor-cloth is laid down in the front kitchen, no ventilation under the floors, and a fire burning every day in the stove, dry rot often appears. In the back kitchen, where there is no floor-cloth, and only an occasional fire, it rarely appears. The air is warm and stagnant under one floor, and cold and stagnant under the other: at the temperature of 32° to 40° the progress of dry rot is very slow. And how does kamptulicon behave itself? The following instances of the rapid progress of dry rot from external circumstances have recently been communicated to us; they show that, under favourable circumstances as to choice of timber and seasoning, this fungus growth can be readily produced by casing-in the timber with substances impervious, or nearly so, to air. At No. 29, Mincing Lane, London, in two out of three rooms on the first floor, upon a fire-proof floor constructed on the Fox and Barrett principle (of iron joists and concrete with yellow pine sleepers, on strips of wood bedded in cement, to which were nailed the yellow pine floor-boards) kamptulicon was nailed down by the tenant’s orders. In less than nine months the whole of the wood sleepers, and strips of wood, as well as the boards, were seriously injured by dry rot; whilst the third room floor, which had been covered with a carpet, was perfectly sound. At No. 79, Gracechurch Street, London, a room on the second floor was inhabited, as soon as finished, by a tenant who had kamptulicon laid down. This floor was formed in the ordinary way, with the usual sound boarding of strips of wood, and concrete two inches thick filled in on the same, leaving a space of about two inches under the floor boards. The floor was seriously
decayed by dry rot in a few months, down to the level of the concrete pugging (below which it remained sound), and could be pulled up with the hand.

We will now leave oil-cloth and kamptulicon, and try what “Keene’s cement” will do for an “air-tight” partition of a house. At No. 16, Mark Lane, London, a partition was constructed of sound yellow deal quarters, covered externally with “Keene’s cement” on lath, both sides. It was removed about two years after its construction, when it was found that the timber had completely perished from dry rot; so much so that the timbers parted in the middle in places, and were for some time afterwards moist.

It is still, unfortunately, the custom to keep up the old absurd fashion of disguising woods instead of revealing their natural beauties. Instead of wasting time in perfect imitations of scarce or dear woods, it would be much better to employ the same amount of time in fully developing the natural characteristics of many of our native woods, now neglected for decorative purposes because they are cheap and common. Many of our very commonest woods are beautifully grained, but their excellences for ornamentation are lost because our decorators have not studied the best mode of developing their beauties. Who would wish that stained deal should be painted in imitation of oak? or that other materials of a less costly and inferior order should be painted over instead of their natural faces being exposed to view? There are beauties in all the materials used; the inferior serve to set off by comparison the more costly, and increase their effect. The red, yellow, and white veins of pine timber are beautiful: the shavings are like silk ribbons, which only nature could vein after that fashion, and to imitate which would puzzle all the tapissiers of the Rue Mouffetard, in Paris. Why should not light and dark woods be commonly used in combination with each other in our joinery?

Wood may be stained of various shades, from light to dark. Dirt or dust does not show more on stained wood than it does on paint, and can be as easily cleaned; the surface can be refreshed by periodical coats of varnish. Those parts subjected to constant wear and tear can be protected by more durable materials, such as finger-plates, &c. Oak can be stained dark, almost black, by means of bichromate of potash diluted with water. Another method is to wash the wood over with a solution of gallic acid of any required strength, and allow it to dry thoroughly; to complete the process, wash with a solution of iron in the form of “tincture of steel,” or a decoction of vinegar and iron filings, and a deep and good stain will be the result. If a positive black is required, wash the
wood over with gallic acid and water two or three times, allowing it to dry between every coat; the staining with the iron solution may then be repeated. Raw linseed oil will stay the darkening process at any stage.

Doors made up of light deal, and varied in the staining, would look as well as the ordinary graining. Good and well-seasoned materials would have to be used, and the joiners’ work well fitted and constructed. Mouldings of a superior character, and in some cases gilt, might be used in the panels, &c. For doors, plain oak should be used for the stiles and rails, and pollard oak for the panels. If rose-wood or satin-wood be used, the straight-grained wood is best adapted for the stiles and rails; and for mahogany doors, the lights and shades in the panels should be stronger than in the stiles and rails. Dark and durable woods might be used in the parts most exposed to wear and tear. Treads of stairs might be framed with oak nosings, if not at first, at least when it becomes necessary to repair the nosings. Skirtings could be varied by using dark and hard woods for the lower part or plinth, lighter wood above, finished with superior mouldings. It must, however, be remembered that, contrary to the rule that holds good with regard to most substances, the colours of the generality of woods become considerably darker by exposure to the light; allowance would therefore have to be made for this. All the woodwork must be well seasoned before it is fixed.

The practice here recommended would be more expensive than the common method of painting, but in many cases it would be better than graining, and cheaper in the long run. Oak wainscot and Honduras mahogany doors are twice the price of deal doors; Spanish mahogany three times the price. When we consider that by using the natural woods, French polished, we save the cost of four coats of paint and graining (the customary modes), the difference in price is very small. An extra 50l. laid out on a 500l. house (a tenth of its cost) would give some rooms varnished and rubbed fittings, without paint. Would it not be worth the outlay?

It may be said that spots of grease and stains would soon disfigure the bare wood; if so, they could easily be removed by the following process: take a quarter of a pound of fuller’s earth and a quarter of a pound of pearlash, boil them in a quart of soft water, and, while hot, lay the composition on the greased parts, allowing it to remain on them for ten or twelve hours, after which it may be washed off with fine sand and water. If a floor be much spotted with grease, it should be completely washed over with
this mixture, and allowed to remain for twenty-four hours before it is removed.

Let us consider how we paint our doors, cupboards, &c., at the present time. For our best houses, the stiles of our doors are painted French white, and the panels pink or salmon colour! For cheaper houses, the doors, cupboards, window linings, &c., are generally two shades of what is called “stone colour” (as if stone were always the same colour), and badly executed into the bargain; the best rooms have the woodwork grained in imitation of oak, or satin-wood, &c. And such imitations! Mahogany and oak are now even imitated on leather and paper-hangings. Wood well and cleanly varnished, stained, or, better still, French polished, must surely look better than these daubs. But French polish is not extensively used in England: it is confined to cabinet pieces and furniture, except in the houses of the aristocracy. Clean, colourless varnish ought to be more generally used to finish off our woodwork, instead of the painting now so common; the varnish should be clean and colourless, as the yellow colour of the ordinary varnishes greatly interferes with the tints of the light woods.

In the Imperial Palace at Berlin, one or two of the Emperor’s private rooms are entirely fitted up with deal: doors, windows, shutters, and everything else of fir-wood. “Common deal,” if well selected, is beautiful, cheap, and pleasing. We have seen the offices of Herr Krauss (architect to Prince and Princess Louis of Hesse), who resides at Mayence: the walls and ceilings are lined with picked pitch pine-wood, parts being carved and the whole French polished, and the effect is much superior to any paint, be it “stone colour,” “salmon colour,” or even “French white.” The reception-room in which the Emperor of Germany usually transacts business with his ministers and receives deputations, &c., and the adjoining cabinets, are fitted with deal, not grained and painted, but well French polished. The wood is, of course, carefully selected, carefully wrought, and excellently French polished, which is the great secret of the business. In France it is a very common practice to polish and wax floors. The late Sir Anthony Carlisle had the interior woodwork of his house in Langham Place, London, varnished throughout, and the effect of the varnished deal was very like satin-wood. About forty years since, Mr. J. G. Crace, when engaged on the decoration of the Duke of Hamilton’s house in the Isle of Arran, found the woodwork of red pine so free from knots, and so well executed, that instead of painting it, he