Linear Regression in Python (sk-learn)
One of the easiest ways to start working on machine learning in Python is a library called scikit-learn (or commonly just "sk-learn"). Part of the simplicity of this library is that the same functions and steps are used for all the different types of machine learning models! Specifically, we will always follow four steps:
Import the library for the specific machine learning model we want to use,
Create an instance of the model,
Train the model using model.fit(...),
Use the model by calling model.predict(...), which uses the model we trained in the previous step to predict the outcome.
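As a minimal sketch of why these four steps carry over between models, the example below runs the same steps with two different sk-learn models. The training data is made up for illustration (it follows y = 2x + 1), and DecisionTreeRegressor is used here only as an example of a second model type.

```python
# Step 1 - Import the libraries (two different model types):
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up training data for illustration only: y = 2x + 1
X = [[1], [2], [3], [4]]   # independent variable (one column)
y = [3, 5, 7, 9]           # dependent variable

predictions = {}
for model_class in (LinearRegression, DecisionTreeRegressor):
    model = model_class()                  # Step 2 - Create an instance
    model.fit(X, y)                        # Step 3 - Train the model
    result = model.predict([[5]])          # Step 4 - Predict the outcome for x = 5
    predictions[model_class.__name__] = float(result[0])

print(predictions)
```

Only the import and the class name change between the two models. The predictions themselves differ: the linear model extends the trend past the training data, while the decision tree repeats the value of the closest training examples.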
Example: Linear Regression
Linear regression is one of the easiest machine learning models to understand: the model uses the training data to find a "line of best fit", which it then uses to predict the outcome. The sk-learn linear regression model is sklearn.linear_model.LinearRegression, which means our import line would be:
# Step 1 - Import the library:
from sklearn.linear_model import LinearRegression
The second step will always be to create a new instance of the model, which we'll call model:
# Step 2 - Create an instance of the model:
model = LinearRegression()
The third step requires us to train our model. In this example, we will use the diamonds dataset to predict the price based on the size (carat weight) of the diamond. To do this:
We use a pandas DataFrame to load the data,
We specify the independent variables as a list of column names. Here we only have the carat weight, so we have a list of one element: ["carat"]
We specify the dependent variable directly as a single column name, since we are predicting only one outcome: "price"
# Step 3 - Train the model:
import pandas as pd
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/diamonds.csv")
model = model.fit( df[ ["carat"] ], df["price"] )
In the final step, we can predict the price of a diamond by providing values for the independent variables. We can do this by creating a simple DataFrame and using the model to predict the price:
# Step 4 - Predict the outcome:
# Create a new DataFrame from scratch to predict the price of a 1, 2, and 3 ct.
# diamond. We'll call this `df2`:
data = []
data.append( {"carat": 1} )
data.append( {"carat": 2} )
data.append( {"carat": 3} )
df2 = pd.DataFrame(data)
# Add a new column to `df2` with the predicted prices:
df2["price_predict"] = model.predict( df2 )
Putting this all together:
   carat  price_predict
0      1    5500.065038
1      2   13256.490656
2      3   21012.916274
Complete Linear Regression source code.
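Because the model is a straight line, the three predictions in the output above are enough to recover its equation. The sketch below is a worked example that uses only those predicted prices; the variable names are illustrative. On a fitted model, the same two numbers should be available as model.coef_ and model.intercept_.

```python
# Predicted prices copied from the output above.
carats = [1, 2, 3]
predicted = [5500.065038, 13256.490656, 21012.916274]

# Slope: change in predicted price per additional carat.
slope = (predicted[1] - predicted[0]) / (carats[1] - carats[0])

# Intercept: predicted price where the line crosses carat = 0.
intercept = predicted[0] - slope * carats[0]

print(f"price = {intercept:.2f} + {slope:.2f} * carat")

# Check the line against the third row of the output.
check = intercept + slope * carats[2]
print(f"3 ct. check: {check:.2f}")
```

Under this model, each additional carat adds about 7,756 to the predicted price. The intercept is negative, so the line predicts a negative price for very small diamonds, which is a reminder that a line of best fit is only an approximation of the data.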