Machine Learning Models in Python with sk-learn
One of the easiest libraries for getting started with machine learning in Python is scikit-learn (or commonly just "sk-learn"). Part of the simplicity of this library is that the same functions and steps are used for all different types of machine learning models! Specifically, we will always follow four steps:
Import the library for the specific machine learning model we want to use,
Create an instance of the model,
Train the model using model.fit(...),
Use the model by calling model.predict(...), which uses the model we trained in the previous step to predict the outcome.
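To illustrate that the four steps stay the same across model types, here is a minimal sketch that runs them for two different sk-learn models. The tiny dataset (`X`, `y`) is made up for illustration, and `DecisionTreeRegressor` is chosen only as an example of a second model; neither comes from the original walk-through.

```python
# Step 1 - Import the libraries for the models we want to use:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up training data for illustration only: y = 2 * x + 1
X = [[1], [2], [3], [4]]   # one independent variable per row
y = [3, 5, 7, 9]           # the outcome for each row

predictions = {}
for Model in (LinearRegression, DecisionTreeRegressor):
    model = Model()            # Step 2 - Create an instance of the model
    model.fit(X, y)            # Step 3 - Train the model
    result = model.predict([[2]])   # Step 4 - Predict the outcome for x = 2
    predictions[Model.__name__] = result[0]

print(predictions)
```

Only the import and the class name change between the two models; the calls to `fit` and `predict` are identical.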
Example: Linear Regression
Linear Regression is one of the easiest machine learning models to understand: the model uses the training data to find a "line of best fit" and then uses that line to predict the outcome. The sk-learn linear regression model is sklearn.linear_model.LinearRegression, which means our import line would be:
# Step 1 - Import the library:
from sklearn.linear_model import LinearRegression
The second step will always be to create a new instance of the model, which we'll call model:
# Step 2 - Create an instance of the model:
model = LinearRegression()
The third step requires us to train our model. In this example, we will use the diamonds dataset to predict the price based on the size (carat weight) of the diamond. To do this:
We use a pandas DataFrame to load the data,
We specify the independent variables as a list of column names -- here we only have the carat weight, so we have a list of one element: ["carat"]
We specify the dependent variable directly as a single column name (as we are predicting only one outcome): "price"
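The difference between selecting with a list of column names and selecting with a single column name can be seen on a small stand-in DataFrame. This is only a sketch: the three rows below are hypothetical values, not rows from the diamonds dataset.

```python
import pandas as pd

# Hypothetical stand-in rows (NOT the real diamonds data):
df = pd.DataFrame({
    "carat": [0.5, 1.0, 1.5],
    "price": [2000, 6000, 10000],
})

# A list of column names selects a DataFrame (2-D: rows x columns):
X = df[["carat"]]

# A single column name selects a Series (1-D: one value per row):
y = df["price"]

print(type(X).__name__, X.shape)   # DataFrame (3, 1)
print(type(y).__name__, y.shape)   # Series (3,)
```

The independent variables are passed as a 2-D table so that more columns can be added to the list later, while the single dependent variable is a 1-D column.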
# Step 3 - Train the model:
import pandas as pd
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/diamonds.csv")
model = model.fit( df[ ["carat"] ], df["price"] )
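After training, the "line of best fit" can be inspected through the `coef_` (slope) and `intercept_` attributes of the fitted model. The sketch below uses a small made-up DataFrame in place of the diamonds data so that it runs without a download; its prices are hypothetical and lie exactly on the line price = 8000 * carat - 2000.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data (NOT the diamonds dataset), exactly linear by construction:
df = pd.DataFrame({
    "carat": [0.5, 1.0, 1.5, 2.0],
    "price": [2000.0, 6000.0, 10000.0, 14000.0],
})

model = LinearRegression()
model.fit(df[["carat"]], df["price"])

# One slope per independent variable, plus a single intercept:
slope = model.coef_[0]
intercept = model.intercept_

# For this toy data the values are approximately 8000 and -2000:
print("slope:", slope)
print("intercept:", intercept)
```

With real data such as the diamonds dataset the points do not lie exactly on a line, so the fitted slope and intercept describe the closest line rather than an exact rule.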
In the final step, we can predict the price of a diamond by providing values for the independent variables. We can do this by creating a simple DataFrame with the same column ("carat") and using the model to predict the price:
# Step 4 - Predict the outcome:
# Create a new DataFrame from scratch to predict the price of a 1, 2, and 3 ct.
# diamond. We'll call this `df2`:
data = []
data.append( {"carat": 1} )
data.append( {"carat": 2} )
data.append( {"carat": 3} )
df2 = pd.DataFrame(data)
# Add a new column to `df2` with the predicted prices:
df2["price_predict"] = model.predict( df2[ ["carat"] ] )
Putting this all together, df2 contains the predicted price for each diamond:
| | carat | price_predict |
|---|---|---|
| 0 | 1 | 5500.065038 |
| 1 | 2 | 13256.490656 |
| 2 | 3 | 21012.916274 |
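Because the model is a straight line, the predictions in the table can be checked by hand. The sketch below recovers the slope and intercept from the first two predicted prices and confirms that the same line reproduces the third; the helper name `line` is chosen here for illustration.

```python
# Predicted prices copied from the table above (carat -> price_predict):
predicted = {1: 5500.065038, 2: 13256.490656, 3: 21012.916274}

# Slope: change in predicted price per additional carat
slope = predicted[2] - predicted[1]

# Intercept: the line's value at 0 carats, worked back from the 1 ct. prediction
intercept = predicted[1] - slope * 1

def line(carat):
    """Evaluate the recovered line: price = intercept + slope * carat."""
    return intercept + slope * carat

print(round(slope, 2), round(intercept, 2))   # 7756.43 -2256.36
print(round(line(3), 2))                      # 21012.92
```

The negative intercept shows that the line is only meaningful over the range of carat weights seen in the training data; it would predict a negative price for a very small diamond.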
Example Walk-Throughs with Worksheets
Video 1: Simple Linear Regression in Python
Video 2: Multiple Linear Regression in Python
Practice Questions
Q1: Which of the following is correct?
Q2: The DataFrame in the image is the Iris dataset. We want to predict PetalLengthCm (petal length in centimeters) from three other independent variables: SepalLengthCm, SepalWidthCm, and PetalWidthCm. Which variable should be placed in the place of d in the following snippet of code?