I have been recently looking at new SUVs, which come with a hefty price tag. With such a large finanacial decision I found myself wondering: How does the vehicle lose its value over time? I have heard statements like “you lose thousands of dollars the minute you drive it out of the dealership” and “foreign made cars keep their value the longest”, but I would like to see the data behind it and investigate it myself.

While I am looking at the value of cars, I was also curious about whether it is possible to get by without an SUV. Depending on where you live there are a lot of alternatives for ride sharing, rentals, or transit available. Let’s be honest most of the time you don’t have the vehicle packed to the brim. From personal experience I only really need a big car when camping, for road trips, or skiing. The rest of the time I could potentially get by with rentals or ride-sharing services, depending on the result of the analysis.

I dusted off the old web scraper and ended up collecting some data from the internet. I looked for ads selling new and used SUVs across Canada. I wanted to get a basic bird’s-eye view so I focused only on a limited number of parameters relating to SUVs: age of car, manufacturer, mileage, and cost. There are a lot more parameters that go into the pricing of a car, but this was just a cursory look. I focused on ads posted across all the provinces to ensure that I get a large representative dataset for Canada. Throughout the analysis I use the R programming language to perform the analysis and create the visualizations.

1 Load Ads and Remove Duplicate Entries

I mined SUV ads in Canada and saved them as a CSV file. I recorded:

Price in Canadian Dollars
Mileage of vehicle (in units of kilometers, ironic, I know)
Ad title which usually was something like: “2018 Chevrolet Equinox Premier” or “2014 GMC Acadia SLE”

The ads that I collected had quite a few duplicate entries. These duplicates are likely due to premium ads that place an ad over multiple pages. This was not an issue and with a few lines of code those duplicate ads were pruned:

df.full = read.csv("carCost.csv", 
              sep = ";",
              stringsAsFactors = FALSE)
print(sprintf('Number of rows in original Data: %d', nrow(df.full)))

## [1] "Number of rows in original Data: 5644"

df = df.full[!duplicated(df.full$title),]
print(sprintf('Number of unique rows in original Data: %d', nrow(df)))

## [1] "Number of unique rows in original Data: 3141"

There were 5644 ad entries that were scraped and of those 3141 were unique entries. The large number of duplicates that had to be removed is explained by the fact that websites will let people pay for ads that will display them on every single page. The scraping process tends to mine them as duplicates.

2 Ad data exploration

2.1 The Distribution of mileage in the ads

I want to see how a car value changes with time so I had to ensure that there was a good age and mileage distribution of the vehicles that I will study. The first thing I did was create a histogram of mileage:

library(plotly)

#Style X Label
x.label = list(title = "<b>Mileage [km]</b>", #xlabel 
               color = "#7f7f7f",  #change text color
               showgrid = TRUE) #add vertical grid lines

#Style y-label
y.label = list(title = "<b>Number of Entries</b>",  #ylabel
               color = "#7f7f7f") #change text color

#Style the marker
markerLayout = list(color = "#eeeeee", line = list(color = "#1c9099", width = 2))

line = list(type = "line", 
            x0 = mean(df$mileage),
            x1 = mean(df$mileage),
            y0 = 0,
            y1 = 1,
            yref = "paper",
            line = list(color = "#FF0000"))
text.mean <- list(yref = 'paper', 
                  xref = "x", 
                  ax = 20,
                  y = 0.9, 
                  arrowcolor = "#FF0000",
                  font = list(color = "#FF0000"),
                  xanchor = "left",
                  x = mean(df$mileage), text = sprintf("Mean: %.0f km",mean(df$mileage)))

#generate plot
plot_ly(x = ~df$mileage, type = "histogram", marker = markerLayout)%>%
  layout(xaxis = x.label, 
         yaxis = y.label) %>%
  layout(shapes = line, annotations = text.mean)

This plot tells me that there is a good proporion of new cars and a decent chunk with 50,000 – 200,000 km. This distribution should do a good job of describing the used and new car market. There are 276 vehicles with less than 5,000 km on the list.

2.2 Distribution of the manufacturing year of the ads

The age of the car can be extracted from the year mentioned in the ad of the title. I use regular expressions to extract the numbers from the text. I also remove ads that are too old (cars prior to 1990) and 2019 models as I am writing this in 2018.

#initialize empty column
df$year = 0

library(stringr)
car_year = str_match(df$title, "\\d{4}")
df$year = as.numeric(car_year)

#Filter the ads based on year
df = df[!df$year<1990,]
df = df[!df$year==2019,]

#Create age column
df$age = 2018- df$year

#Style X Label
x.label = list(title = "<b>Year</b>", #xlabel 
               color = "#7f7f7f",  #change text color
               showgrid = TRUE) #add vertical grid lines

#Style y-label
y.label = list(title = "<b>Number of Entries</b>",  #ylabel
               color = "#7f7f7f") #change text color

#Style the marker
markerLayout = list(color = "#eeeeee", line = list(color = "#1c9099", width = 2))

line = list(type = "line", 
            x0 = median(df$year),
            x1 = median(df$year),
            y0 = 0,
            y1 = 1,
            yref = "paper",
            line = list(color = "#FF0000"))
text.mean <- list(yref = 'paper', 
                  xref = "x", 
                  ax = 20,
                  y = 0.9, 
                  arrowcolor = "#FF0000",
                  font = list(color = "#FF0000"),
                  xanchor = "left",
                  x = median(df$year), text = sprintf("Median: %.0f",median(df$year)))


#generate plot
plot_ly(x = ~df$year, type = "histogram", marker = markerLayout)%>%
  layout(xaxis = x.label, 
         yaxis = y.label) %>%
  layout(shapes = line, annotations = text.mean)

The median year of manufacture for the collected ads was 2015. There were not too many SUV options on the market before 2010, which shows up as a drop-off on the histogram.

2.3 Relationship between cost and mileage

The next step is to establish a relationship between vehicle cost and the mileage. Mileage is an easy to extract parameter and it has a defined relationship to the cost. I plot all the ads on a graph and attempt to visually define this mileage-cost relationship.

#Style X Label
x.label = list(title = "<b>Mileage [km]</b>", #xlabel 
               ticksuffix = " km",
               hoverformat = ".0f",
               color = "#7f7f7f",  #change text color
               showgrid = TRUE) #add vertical grid lines

#Style y-label
y.label = list(title = "<b>Price [$]</b>",  #ylabel
               color = "#7f7f7f", #change text color
               hoverformat = ".0f",
               tickprefix = "$ ",
               range = c(0, 100000)) 

markerLayout = list(color = "#1c9099", 
                    size = 4,
                    opacity = 0.5,
                    line = list(width = 0))


#generate plot
plot_ly(data = df, x = ~df$mileage, y = ~df$price, type = 'scatter',
        mode = 'markers',
        marker = markerLayout)%>%
  layout(xaxis = x.label, 
         yaxis = y.label)

There is an expeontially decaying relationship between mileage and the price. There is initially a wide range in price for brand new vehicles and the difference narrows as the mileage increases. This narrowing indicates that as a car gets more miles on it, people don’t want to pay a lot for it, regardless of make.

Some of the really low-priced cars with little mileage (in the $5000 range with less than a few thousand miles) are either old cars that were not used very often, vehivles that may have been in an accident, or there was a typo in the ad. I could potentially determine if the car has been in an accident by analysing the text of the ad itself, but that would require a much more in-depth analysis and was out of the scope.

2.4 How much mileage does an average car drive throughtout its life?

I would like to determine what the “average” SUV mileage is as the vehicle ages. It obviously ages quite a bit, but I can plot the 3000 ads to get a good idea of a rough average via a best-fit line.

#Determine line of best that links the age of the car to an average mileage.
linear.model <- lm(mileage ~ age + 0, data=df)  # build linear regression model on full data with zero y-intercept
age.pred = data.frame(age=seq(0,15))
mile.pred = predict(linear.model, age.pred)

#Style X Label
x.label = list(title = "<b>Age [Years]</b>", #xlabel 
               color = "#7f7f7f",  #change text color
               showgrid = TRUE, #add vertical grid lines
               ticksuffix = " yrs",
               range = c(0,11),
               dtick = 1)

#Style y-label
y.label = list(title = "<b>Mileage [km]</b>",  #ylabel
               color = "#7f7f7f", #change text color
               hoverformat = ".0f",
               ticksuffix = " km",
               range = c(0, 300000)) 

#generate plot
plot_ly(data = df, x = ~df$age, y = ~df$mileage, split = ~df$age, type = 'violin',
        box = list(visible = T),
        meanline = list(visible = T),
        line = list(color = "#1c9099"),
        marker = list(color = "#1c9099")) %>%
  add_trace(x = age.pred, y = mile.pred, 
            inherit=FALSE ,
            type = "scatter", mode = "lines", name = "Average SUV Mileage",
              line = list(width = 2, color = 'red')) %>% 
  layout(xaxis = x.label, 
         yaxis = y.label,
         showlegend = FALSE)

The best-fit line underestimates the mileage on newer vehicles, but still passes through the 25th and 75th quartiles for most of the ads. There might be a better relationship, but simple linear relationship describes a lot of the variation. An average SUV drives 1.731064810^{4} km per year in Canada based on the ads surveyed.

2.5 How does the vehicle make influence the rate at which an SUV loses its value?

I first use regular expressions to extract the manfacturer name from the ad. A summary of the brands and the ads is shown in the next table. The number of ads is biased in the top 11 ads, since I was targeting those brands and had filtered the ads to load more pages with just those brands. The rest of the manufacturers is an approximatly representative sample distribution of the cars on sale.

df$brand = ""
car_brands = sort(c("Ford", "Jeep", "BMW", "Mercedes Benz", "Hyundai", "GMC", "Dodge", "Nissan",
                    "Mazda","Volkswagen", "Toyota", "Chevrolet", "Mercedes-Benz", "Lincoln",
                    "Audi", "Hummer", "Lexus", "Saturn", "Infiniti", "Kia", "Suzuki","Mitsubishi",
                    "Acura", "Honda", "Land Rover", "Porsche","Cadillac", "Buick", "Subaru",
                    "Chrysler", "Volvo", "Jaguar", "Pontiac"))

for (brand in car_brands){
  str_index = grep(pattern = brand, x=df$title, ignore.case = TRUE)
  df$brand[str_index] = brand
}

#Display a summary of the brands
library(knitr)
df$brand = as.factor(df$brand)
brand.summary.table = sort(summary(df$brand), decreasing = TRUE)
kable(brand.summary.table, 
      caption = "Number of car ads by manufacturer.", 
      format = "html", 
      table.attr = "style='width:50%;'")

Number of car ads by manufacturer.
	x
Ford	268
Jeep	267
Toyota	261
Honda	249
Hyundai	217
Chevrolet	214
Dodge	213
Mercedes-Benz	213
GMC	210
BMW	201
Volkswagen	175
Nissan	116
Kia	62
Mazda	50
Lexus	47
Land Rover	46
Audi	43
Lincoln	37
Cadillac	32
Subaru	32
Mitsubishi	31
Buick	30
Acura	29
Infiniti	27
Porsche	19
Chrysler	6
Hummer	6
Volvo	5
Pontiac	4
Suzuki	4
Saturn	3
Jaguar	2

The next series of plots shows the SUV sales price compared to the age of the car. I did not consider any detailed features of the cars (such as Turbo option, limited editions, or wether the vehilce had previsouly been in an accident). These features could be scraped from the actual ad text and the features could be worked into a more detailed analytical data model, but I was only interested in a cursory look at the time of writing. I used 11 brands, that were the most common SUV ads that I found. Some of the rarer makes did not have enough data points to fit a good relationship.

The following plots show that some brands lose a lot of value initially and then plateau, while others lose value steadily throughout their lifetime. The red best-fit line is an average price for each brand and follows an exponential decay and is based on the average mileage driven of 1.731064810^{4} km per year. The exponential decay also takes into account the mileage of each ad. The equation that I used for modelling was

P r i c e = e^{(a \times a g e)} + P r i c e_{f i n a l} + b \times M i l e a g e

$Price = e^{(a\times age)} + Price_{final} + b\times Mileage$ In the equation

P r i c e

$Price$ is the predicted SUV price,

a

$a$ is the exponential decay constant which determines the steepness of the price decay,

P r i c e_{f i n a l}

$Price_{final}$ is the theoretical final value of the car as it gets older, and

b

$b$ is the value loss for each kilometer that the SUV drives.

Once the curve has been fit to each brand, theoretically you would only need to know the age of the car and the mileage, to predict a price. In the following plots there will be points on either side of the best-fit line. There are numerous reasons that explain this variation:

The model is an average for each manufacturer and groups all SUVs into one, even if there is a high-end and low-end model.
The ads have different mileages than the “average” SUV that the best-fit line is based on.
The model does not consider special features unique to each ad (such as any upgrades to the base model, limited editions, or if the car had been previously in an accident)
Sometimes people ask for more or less than the car is worth when they put it up for sale.

library(scales)   # needed for formatting y-axis labels to non-scientific type
library(ggplot2)
 
car_list = c("Jeep",    "Ford",     "Honda",  
             "Toyota", "BMW",    "Dodge",   "Hyundai","Chevrolet", "GMC", "Mercedes-Benz","Volkswagen")
car_type = c("Domestic","Domestic", "Foreign",
             "Foreign","Foreign","Domestic","Foreign","Domestic","Domestic", "Foreign", "Foreign" )
#Initialize plot list
plots <- list()
fit.brand.list <- list()
brand.summary = data.frame()

#Style labels
x.label = list(title = "<b>Age [Years]</b>", #xlabel 
               color = "#7f7f7f",  #change text color #hoverinfo = "x",
               ticksuffix = ' yrs',
               showgrid = TRUE, #add vertical grid lines
               range = c(0,11))

y.label = list(title = "<b>Price [$]</b>",  #ylabel
               color = "#7f7f7f", #change text color
               hoverformat = '.0f',
               tickprefix = '$',
               range = c(0, 80000)) 

i = 1
for (cur_brand in car_list){
  df.brand =  subset(df, brand %in% cur_brand)
  df.brand = df.brand[order(df.brand$age),]
  
  fit.brand = nls(price ~ priceInitial + a*exp(-b*age) + c*mileage,
        data = df.brand,
        start = list(priceInitial= 10000,a=10000, b = 0.2,c = -0.02)
        )
  fit.brand.list[[i]] = fit.brand
  newdf = data.frame(brand = cur_brand,
                     new_cost = coef(fit.brand)[1]+coef(fit.brand)[2],
                     car_type = car_type[i],
                     decay = coef(fit.brand)[3],
                     num = nrow(df.brand),
                     priceInitial = coef(fit.brand)[1],
                     a = coef(fit.brand)[2],
                     b = coef(fit.brand)[3],
                     c = coef(fit.brand)[4],
                     year1 = predict(fit.brand, data.frame(age = 1, mileage = mile.pred[2])),
                     year3 = predict(fit.brand, data.frame(age = 3, mileage = mile.pred[4])),
                     year5 = predict(fit.brand, data.frame(age = 5, mileage = mile.pred[6])),
                     year7 = predict(fit.brand, data.frame(age = 7, mileage = mile.pred[8])),
                     year10 = predict(fit.brand, data.frame(age = 10, mileage = mile.pred[11])),
                     mean_error = mean(abs(predict(fit.brand, df.brand) - df.brand$price)))
  brand.summary = rbind(brand.summary, newdf)
  
  #predict the forecast line
  cost.pred = newdf$priceInitial + 
    newdf$a*exp(-newdf$b*df.brand$age) + 
    newdf$c*linear.model$coefficients[1]*df.brand$age
  
  df.pred = data.frame(df.brand$age, cost.pred)
  colnames(df.pred) = c("age", "cost")

  #plot the resulting fit lines
  plots[[i]] = plot_ly(x=df.brand$age, 
                       y=df.brand$price, 
                       type = 'scatter',
                       mode = 'markers',
                       name = 'Individual Ads',
                       marker = list(color = "#1c9099", opacity = 0.4)) %>%
    add_trace(x=df.pred$age, 
              y=df.pred$cost, 
              type = 'scatter',
              mode = 'lines+markers',
              name = 'Average Price',
              line = list(color = "#aa2222", line = list(width = 1)),
              marker = list('red',color = "#ffffff", opacity = 0)) %>%
    layout(title = cur_brand,
           xaxis = x.label,
           yaxis = y.label)
  #
  i = i+1
}
colnames(brand.summary) <- c("brand", "new_cost", "car_type", 
                             "decay", "num", "priceInitial", "a", "b","c",
                             "year1", "year3", "year5", "year7", "year10",
                             "mean_error")
htmltools::tagList(list(plots[]))

3 Average SUV depreciation throughout its life

Using the previous results I can present the average depreciation for each brand at multiple times throughout the vehicle’s life.

library(RColorBrewer)

#Style X Label
x.label = list(title = "", #xlabel 
               color = "#7f7f7f",  #change text color
               showgrid = TRUE) #add vertical grid lines

#Style y-label
y.label = list(title = "<b>Value [CAD $]</b>",  #ylabel
               color = "#7f7f7f", #change text color
               range = c(0, 80000),
               hoverformat = ".0f",
               hoverinfo = "y+text",
               text = "%") 
#Sort the brands by percentage
brand.summary$brand <- factor(brand.summary$brand, 
                                levels = brand.summary$brand[order(brand.summary$new_cost, decreasing = TRUE)])

legend.text1 = list(xref = "paper", yref = "paper", x = 1.03, y= 0.92,
                    font = list(family = "Arial", size = 15),
                    text = "<b>Domestic</b>", textangle = -90, showarrow = FALSE)
legend.text2 = list(xref = "paper", yref = "paper", x = 1.03, y= 0.45, 
                    font = list(family = "Arial", size = 15),
                    text = "<b>Foreign</b>", textangle = -90, showarrow = FALSE)

plot_ly(x = ~brand.summary$brand, 
        y = ~brand.summary$new_cost, 
        type = 'bar',
        color = ~car_type,
        name = 'New Cost',
        legendgroup = car_type) %>% 
  add_trace(y = ~brand.summary$year1, name = '1 Year Value', legendgroup = car_type) %>%
  add_trace(y = ~brand.summary$year3, name = '3 Year Value', legendgroup = car_type) %>%
  add_trace(y = ~brand.summary$year5, name = '5 Year Value', legendgroup = car_type) %>%
  add_trace(y = ~brand.summary$year7, name = '7 Year Value', legendgroup = car_type) %>%
  add_trace(y = ~brand.summary$year10, name = '10 Year Value', legendgroup = car_type) %>%
  layout(xaxis = x.label, 
         yaxis = y.label,
         legend = list(x = 1.03, y = 1, bgcolor = 'rgba(255, 255, 255, 0)', 
                       bordercolor = 'rgba(0, 0, 0, 255)', tracegroupgap =20),
         annotations = legend.text1) %>%
  layout(annotations = legend.text2)

There is a $50,000 spread between the average cost of a high-end and a low-end SUV. The cost of foregin made SUVs are sorted by new value. Based on the spread there are a few general, more accurate statements that can be made:

Domestic SUVs are on average either really expensive or really affordable with domestics cars situated in the middle.
The smallest amount of value you are going to lose in the first year is approximately $3,600 on a Hyunadai, which is still over 10% of its new value.
The graph shows that some domestic vehicles (Jeep, Chevrolet, and Dodge) tend to plateau out in the later years, while all studied foreign vehicles tend to continue to decrease in value. This means that an older domestic SUV will tend to lose value slower than foreign-made cars.

I can also examine some of the percentages of the car value. The following table is sorted by the first year value, meaning the value of car (as percent of original cost) after 1 year of wonership.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

year1 = data.frame(brand = brand.summary$brand, 
                   year1_percent = brand.summary$year1/brand.summary$new_cost*100,
                   year5_percent = brand.summary$year5/brand.summary$new_cost*100,
                   year10_percent = brand.summary$year10/brand.summary$new_cost*100)
year1 = year1[with(year1, order(-year1_percent)), ]
rownames(year1) = 1:nrow(year1)

#Display table somwehat formated 
kable(year1,
      digit = 1,
      caption = "Average vehicle percent value at various points in time",
      col.names = c('Brand','Year 1','Year 5','Year 10'),
      format = "html", 
      table.attr = "style='width:50%;'")

Average vehicle percent value at various points in time
Brand	Year 1	Year 5	Year 10
Toyota	88.9	56.8	33.6
Hyundai	88.7	51.1	16.5
Honda	86.5	49.4	25.2
Volkswagen	86.2	46.5	17.6
Ford	85.2	46.9	23.1
BMW	84.1	43.2	17.3
GMC	81.7	37.8	16.2
Jeep	81.0	43.0	28.0
Mercedes-Benz	78.0	38.9	17.7
Chevrolet	73.6	32.3	22.5
Dodge	70.3	31.2	19.9

Domestic vehicles had on average a higher first-year drop in percent value for the SUVs in these ads. In the first year, foreign-made cars tend to hold their value better (except a Mercedes Benz).
After 5 years, foreign cars on average still have higher percent value (compared to brand new value) than domestic vehciles.
After 10 years though the percent value for domestic cars is on average higher than for foreign cars.
Interestingly enough, a Toyota appears to be a good financial investment, as you retain a lot of value after 1 year, 5 years, and still have a surprising percentage left after 10 years. A Huyandai is the second-most sound investment (from the studied brands), but something happens after 5 years and their value plummets.
These values are what people are willing to pay for them. So it seems reasonable that the higher the percentage of the original cost, the higher the perception of the public that that particular brand is a sound SUV. Whether that is true or not, I cannot personally say.
Note that these are percent values. A Mercedes Benz can still be pricey after 10 years, but will have lost a much higher percent value than other vehicles.

The next table examines the loss of value per km driven. This could be due to either high maintanece cost, or the likelyhood of breaking down. Domestic cars are at the bottom of the list, meaning that you can put a load of miles on them and not lose too much values. Keep in mind that these numbers are not super accurate, as the age of the car had a higher infleunce on the cost and therefore was more sensitive to changes than these numbers. Still that should give you a general idea.

mileage.adjustment = brand.summary[,c("brand", "c")]
mileage.adjustment$c = abs(mileage.adjustment$c)

mileage.adjustment = mileage.adjustment[order(mileage.adjustment$c ,decreasing = TRUE),]
rownames(mileage.adjustment) <- 1:nrow(mileage.adjustment)
colnames(mileage.adjustment)[2] = "$ Value Loss Per KM"
kable(mileage.adjustment, digits = 3)

brand	$ Value Loss Per KM
Mercedes-Benz	0.168
BMW	0.082
Hyundai	0.050
Volkswagen	0.045
Dodge	0.041
Ford	0.040
Jeep	0.026
Honda	0.024
Toyota	0.023
Chevrolet	0.012
GMC	0.009

How much will an SUV cost you?