Simple Linear Regression
What is simple linear regression?
Simple linear regression aims to find a relationship between two variables and derive a mathematical equation that explains how a dependent variable changes with an independent variable. With simple linear regression, we generally want to answer two questions:
Is there a relationship between the variables we have?
You can determine the relationship between income and spending, experience and salary, or humidity and temperature. But, as an example, there is NO relationship between the height of a student and their exam scores.
Can we forecast / predict values with this?
With regression, we can train a model and find out how well it predicts values. Can we use what we know about the relationship to predict new values?
Example: What will be the temperature tomorrow? How much will my bakery sell this year compared to last year? How much will my salary be if I have 5 years of experience?
Variable roles
In simple linear regression, variables can take one of 2 roles.
Dependent Variable
The variable whose value we want to predict or forecast. We call it dependent because its value depends on something else. We will call this variable y.
Independent Variable
This is the variable which we can control or change in order to affect the dependent variable. We will call this variable x.
Example: If an apple costs $1.00 and you buy 10 of them, the total cost will be $10.00. The dependent variable here is the total cost, while the independent variable is the number of apples you want to buy.
The Simple Linear Equation Mathematical Model
When we use simple linear regression we call it linear because, well… the mathematical model represents a straight line in a 2D plane. Let’s think about it for a second.
What is the math equation for a straight line?
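As a reminder, a straight line in a 2D plane is commonly written in slope-intercept form. Using the same symbols this article relies on for the regression equation, a is the slope of the line and b is the intercept (the value of y when x = 0):

\[ y = ax + b \]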
Real-World Example
In the real world, data is sometimes not linear and behaves differently from what we expect. At first glance, it may seem that the data has no relationship at all. For simple linear regression, what you need to look for is data that roughly follows a linear pattern.
Suppose that you work as a Data Analyst for the human resources department of a company that has over 10,000 employees. Your boss wants to know if an employee’s years of experience have anything to do with the amount of money they earn. Of course, since you are a Data Analyst, you can check the database of employees and quickly verify the following:
- What is their current salary?
- How many years of experience does the person have?
Assume that you are able to get data from 30 random employees which looks something like this:
Employee ID | Years of Experience | Salary |
---|---|---|
1 | 1.1 | 39343 |
2 | 1.3 | 46205 |
3 | 1.5 | 37731 |
4 | 2.0 | 43525 |
5 | 2.2 | 39891 |
6 | 2.9 | 56642 |
7 | 3.0 | 60150 |
8 | 3.2 | 54445 |
… | … | … |
26 | 9.0 | 105582 |
27 | 9.5 | 116969 |
28 | 9.6 | 112635 |
29 | 10.3 | 122391 |
30 | 10.5 | 121872 |
After you check the table, you plot all these values in a 2D scatter plot and get an image like so.
Scatter Plot: Years of Experience vs Salary
As you can see, the dots somewhat resemble a line. Let’s go ahead and draw an imaginary line and see if we can pass through all the dots.
Scatter Plot: Years of Experience vs Salary with line
As you can see the line doesn’t pass through ALL the dots, but it’s somewhat close. What does this mean? Why in some cases are the dots close or far away from our imaginary line?
So far we know that:
- The data follows a roughly linear pattern.
- The data has 2 important variables: SALARY and YEARS OF EXPERIENCE. This means that we can start to model our data with a linear equation.
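One way to put a number on "roughly linear" is the Pearson correlation coefficient. The sketch below is only an illustration: it uses the 13 rows visible in the table above (the elided rows are left out), and the helper name `pearson_r` is made up for this example.

```python
import math

# Only the rows shown in the table above; the elided ("…") rows are NOT included.
years = [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 9.0, 9.5, 9.6, 10.3, 10.5]
salary = [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
          105582, 116969, 112635, 122391, 121872]


def pearson_r(xs, ys):
    """Pearson correlation: close to +1 or -1 means nearly linear, close to 0 means no linear trend."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(years, salary)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 supports what the scatter plot suggests: salary tends to rise along a line as experience grows.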
Question: We know that SALARY and YEARS OF EXPERIENCE are our variables but which one is the dependent and which one is the independent variable?
The Possibility of Errors
As we mentioned before, data is not always consistent and can behave in different ways. This means that our linear equation needs to account for a possible error. But how do we represent that error in the equation? How can that error be visualized in the scatter plot?
Let’s assume that all the employees chosen for the table above are from San Antonio, TX, but the HR department accidentally added employees from Seattle, WA to the data set. The cost of living in Seattle is 28.6% higher than in San Antonio. This would explain why some data points in the plot are farther away from the imaginary line we traced. These deviations are treated as errors in our data.
Error Lines for Simple Linear Regression
In our linear equation, let’s represent that error with the Greek letter ε.
\[ SALARY = a(XP) + b + ε \]
ε is the error our data may contain. Simple linear regression aims to draw the line that minimizes the total error between the line and the data points (in ordinary least squares, the sum of the squared vertical distances). The error term is often left out when the equation is written down, but the important thing is that our model accounts for it and that we can now write the linear equation in terms familiar to our example.
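To make "minimizing the error" concrete, here is a minimal sketch of ordinary least squares for a single predictor, written in plain Python. It assumes only the 13 rows visible in the table above, so the slope and intercept it prints will differ from values obtained on the full 30-employee data set. The helper name `fit_line` is hypothetical.

```python
# Only the rows shown in the table above; the elided ("…") rows are NOT included.
years = [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 9.0, 9.5, 9.6, 10.3, 10.5]
salary = [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
          105582, 116969, 112635, 122391, 121872]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope a, intercept b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b


a, b = fit_line(years, salary)

# The error (ε) of each point: observed salary minus the salary on the line.
residuals = [y - (a * x + b) for x, y in zip(years, salary)]
sse = sum(e ** 2 for e in residuals)  # the quantity least squares minimizes

print(f"slope a = {a:.2f}, intercept b = {b:.2f}")
print(f"sum of squared errors = {sse:.0f}")
```

Each residual is the vertical distance between a dot and the line. Any other choice of slope or intercept would give a larger sum of squared errors on these same points.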
Exercise 1: Playing with Scikit-learn
Scikit-learn is a machine-learning library whose built-in linear regression model will help us analyze and predict data. You can run the program 02-e1.py, which uses a data set of employees and their years of experience. The program plots a sample of 30 employees drawn from the company’s employees.
Exercise 2: Finding the Slope and Intercept
Before we go any further, let’s analyze our equation once again. We know that our equation has been updated like so:
\[ SALARY = a(XP) + b + ε \]
We have been able to determine what x and y are, but what about a and b? Let us recall what each of the missing values means:
- a is the slope, or coefficient, of the line. The slope represents the estimated change in the dependent variable (in this case, the SALARY) for each one-unit increase in the independent variable (one more year of experience).
- b is the intercept, or the value of y when x=0. In our example, it is the SALARY when the YEARS OF EXPERIENCE is 0.
Hold on a second. If I join the company with no experience, will my salary be 0? That doesn’t sound right. Let’s go ahead and figure out what the actual value is.
Using scikit-learn, we can fit a linear regression model and find the values of a and b. Let’s analyze the code.
First, we need to import the data from the CSV file:
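# Libraries used in this exercise
import pandas as pd
from sklearn import linear_model
# Importing dataset
dataset = pd.read_csv("Experience_vs_Salary.csv")
x = dataset.iloc[:, :-1].values  # All columns except the last ("Experience"), kept 2D for scikit-learn
y = dataset.iloc[:, 1].values  # Second column ("Salary")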
Then we make an instance of the LinearRegression model class and fit the model to the data. The fit method will analyze the values from our CSV file and find the slope and intercept values.
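model = linear_model.LinearRegression()
model.fit(x, y)
print("Coefficient:", model.coef_[0])  # slope a
print("Intercept:", model.intercept_)  # intercept b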
As you can see, the code has returned the value for the coefficient and intercept of our linear equation. Let’s update our linear equation with this.
\[ Intercept = 25792.20 \]
\[ Coefficient = 9449.96 \]
\[ SALARY = 9449.96(XP) + 25792.20 + ε \]
We know that the model gave us an intercept of 25792.20. This means that an employee with no experience would have an estimated salary of $25,792.20. But what does the 9449.96 mean? It means that, for every additional year of experience, the estimated salary increases by $9,449.96. But wait a moment: how can we be sure these are the correct values? If we grab another 30 random employees and check their salaries, are we going to get the same values?
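Whatever the answer to those questions, the fitted equation can already be used to produce point estimates. The sketch below simply plugs the coefficient and intercept reported above into the linear equation and leaves the error term ε out. The function name `predict_salary` is made up for this illustration.

```python
SLOPE = 9449.96       # coefficient reported by the model above
INTERCEPT = 25792.20  # intercept reported by the model above


def predict_salary(years_of_experience):
    """Point estimate from the fitted line; the error term ε is not modeled here."""
    return SLOPE * years_of_experience + INTERCEPT


for xp in (0, 1, 5):
    print(f"{xp} years -> ${predict_salary(xp):,.2f}")
```

For 5 years of experience this gives 9449.96 × 5 + 25792.20 = $73,042.00. Treat it as an estimate: actual salaries will sit above or below the line by their own ε.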