I have been recently looking at new SUVs, which come with a hefty price tag. With such a large finanacial decision I found myself wondering: How does the vehicle lose its value over time? I have heard statements like “you lose thousands of dollars the minute you drive it out of the dealership” and “foreign made cars keep their value the longest”, but I would like to see the data behind it and investigate it myself.
While I am looking at the value of cars, I was also curious about whether it is possible to get by without an SUV. Depending on where you live there are a lot of alternatives for ride sharing, rentals, or transit available. Let’s be honest most of the time you don’t have the vehicle packed to the brim. From personal experience I only really need a big car when camping, for road trips, or skiing. The rest of the time I could potentially get by with rentals or ride-sharing services, depending on the result of the analysis.
I dusted off the old web scraper and ended up collecting some data from the internet. I looked for ads selling new and used SUVs across Canada. I wanted to get a basic bird’s-eye view so I focused only on a limited number of parameters relating to SUVs: age of car, manufacturer, mileage, and cost. There are a lot more parameters that go into the pricing of a car, but this was just a cursory look. I focused on ads posted across all the provinces to ensure that I get a large representative dataset for Canada. Throughout the analysis I use the R programming language to perform the analysis and create the visualizations.
I mined SUV ads in Canada and saved them as a CSV file. I recorded:
The ads that I collected had quite a few duplicate entries. These duplicates are likely due to premium ads that place an ad over multiple pages. This was not an issue and with a few lines of code those duplicate ads were pruned:
df.full = read.csv("carCost.csv",
sep = ";",
stringsAsFactors = FALSE)
print(sprintf('Number of rows in original Data: %d', nrow(df.full)))
## [1] "Number of rows in original Data: 5644"
df = df.full[!duplicated(df.full$title),]
print(sprintf('Number of unique rows in original Data: %d', nrow(df)))
## [1] "Number of unique rows in original Data: 3141"
There were 5644 ad entries that were scraped and of those 3141 were unique entries. The large number of duplicates that had to be removed is explained by the fact that websites will let people pay for ads that will display them on every single page. The scraping process tends to mine them as duplicates.
I want to see how a car value changes with time so I had to ensure that there was a good age and mileage distribution of the vehicles that I will study. The first thing I did was create a histogram of mileage:
library(plotly)
#Style X Label
x.label = list(title = "<b>Mileage [km]</b>", #xlabel
color = "#7f7f7f", #change text color
showgrid = TRUE) #add vertical grid lines
#Style y-label
y.label = list(title = "<b>Number of Entries</b>", #ylabel
color = "#7f7f7f") #change text color
#Style the marker
markerLayout = list(color = "#eeeeee", line = list(color = "#1c9099", width = 2))
line = list(type = "line",
x0 = mean(df$mileage),
x1 = mean(df$mileage),
y0 = 0,
y1 = 1,
yref = "paper",
line = list(color = "#FF0000"))
text.mean <- list(yref = 'paper',
xref = "x",
ax = 20,
y = 0.9,
arrowcolor = "#FF0000",
font = list(color = "#FF0000"),
xanchor = "left",
x = mean(df$mileage), text = sprintf("Mean: %.0f km",mean(df$mileage)))
#generate plot
plot_ly(x = ~df$mileage, type = "histogram", marker = markerLayout)%>%
layout(xaxis = x.label,
yaxis = y.label) %>%
layout(shapes = line, annotations = text.mean)
This plot tells me that there is a good proporion of new cars and a decent chunk with 50,000 – 200,000 km. This distribution should do a good job of describing the used and new car market. There are 276 vehicles with less than 5,000 km on the list.
The age of the car can be extracted from the year mentioned in the ad of the title. I use regular expressions to extract the numbers from the text. I also remove ads that are too old (cars prior to 1990) and 2019 models as I am writing this in 2018.
#initialize empty column
df$year = 0
library(stringr)
car_year = str_match(df$title, "\\d{4}")
df$year = as.numeric(car_year)
#Filter the ads based on year
df = df[!df$year<1990,]
df = df[!df$year==2019,]
#Create age column
df$age = 2018- df$year
#Style X Label
x.label = list(title = "<b>Year</b>", #xlabel
color = "#7f7f7f", #change text color
showgrid = TRUE) #add vertical grid lines
#Style y-label
y.label = list(title = "<b>Number of Entries</b>", #ylabel
color = "#7f7f7f") #change text color
#Style the marker
markerLayout = list(color = "#eeeeee", line = list(color = "#1c9099", width = 2))
line = list(type = "line",
x0 = median(df$year),
x1 = median(df$year),
y0 = 0,
y1 = 1,
yref = "paper",
line = list(color = "#FF0000"))
text.mean <- list(yref = 'paper',
xref = "x",
ax = 20,
y = 0.9,
arrowcolor = "#FF0000",
font = list(color = "#FF0000"),
xanchor = "left",
x = median(df$year), text = sprintf("Median: %.0f",median(df$year)))
#generate plot
plot_ly(x = ~df$year, type = "histogram", marker = markerLayout)%>%
layout(xaxis = x.label,
yaxis = y.label) %>%
layout(shapes = line, annotations = text.mean)
The median year of manufacture for the collected ads was 2015. There were not too many SUV options on the market before 2010, which shows up as a drop-off on the histogram.
The next step is to establish a relationship between vehicle cost and the mileage. Mileage is an easy to extract parameter and it has a defined relationship to the cost. I plot all the ads on a graph and attempt to visually define this mileage-cost relationship.
#Style X Label
x.label = list(title = "<b>Mileage [km]</b>", #xlabel
ticksuffix = " km",
hoverformat = ".0f",
color = "#7f7f7f", #change text color
showgrid = TRUE) #add vertical grid lines
#Style y-label
y.label = list(title = "<b>Price [$]</b>", #ylabel
color = "#7f7f7f", #change text color
hoverformat = ".0f",
tickprefix = "$ ",
range = c(0, 100000))
markerLayout = list(color = "#1c9099",
size = 4,
opacity = 0.5,
line = list(width = 0))
#generate plot
plot_ly(data = df, x = ~df$mileage, y = ~df$price, type = 'scatter',
mode = 'markers',
marker = markerLayout)%>%
layout(xaxis = x.label,
yaxis = y.label)
There is an expeontially decaying relationship between mileage and the price. There is initially a wide range in price for brand new vehicles and the difference narrows as the mileage increases. This narrowing indicates that as a car gets more miles on it, people don’t want to pay a lot for it, regardless of make.
Some of the really low-priced cars with little mileage (in the $5000 range with less than a few thousand miles) are either old cars that were not used very often, vehivles that may have been in an accident, or there was a typo in the ad. I could potentially determine if the car has been in an accident by analysing the text of the ad itself, but that would require a much more in-depth analysis and was out of the scope.
I would like to determine what the “average” SUV mileage is as the vehicle ages. It obviously ages quite a bit, but I can plot the 3000 ads to get a good idea of a rough average via a best-fit line.
#Determine line of best that links the age of the car to an average mileage.
linear.model <- lm(mileage ~ age + 0, data=df) # build linear regression model on full data with zero y-intercept
age.pred = data.frame(age=seq(0,15))
mile.pred = predict(linear.model, age.pred)
#Style X Label
x.label = list(title = "<b>Age [Years]</b>", #xlabel
color = "#7f7f7f", #change text color
showgrid = TRUE, #add vertical grid lines
ticksuffix = " yrs",
range = c(0,11),
dtick = 1)
#Style y-label
y.label = list(title = "<b>Mileage [km]</b>", #ylabel
color = "#7f7f7f", #change text color
hoverformat = ".0f",
ticksuffix = " km",
range = c(0, 300000))
#generate plot
plot_ly(data = df, x = ~df$age, y = ~df$mileage, split = ~df$age, type = 'violin',
box = list(visible = T),
meanline = list(visible = T),
line = list(color = "#1c9099"),
marker = list(color = "#1c9099")) %>%
add_trace(x = age.pred, y = mile.pred,
inherit=FALSE ,
type = "scatter", mode = "lines", name = "Average SUV Mileage",
line = list(width = 2, color = 'red')) %>%
layout(xaxis = x.label,
yaxis = y.label,
showlegend = FALSE)
The best-fit line underestimates the mileage on newer vehicles, but still passes through the 25th and 75th quartiles for most of the ads. There might be a better relationship, but simple linear relationship describes a lot of the variation. An average SUV drives 1.731064810^{4} km per year in Canada based on the ads surveyed.
I first use regular expressions to extract the manfacturer name from the ad. A summary of the brands and the ads is shown in the next table. The number of ads is biased in the top 11 ads, since I was targeting those brands and had filtered the ads to load more pages with just those brands. The rest of the manufacturers is an approximatly representative sample distribution of the cars on sale.
df$brand = ""
car_brands = sort(c("Ford", "Jeep", "BMW", "Mercedes Benz", "Hyundai", "GMC", "Dodge", "Nissan",
"Mazda","Volkswagen", "Toyota", "Chevrolet", "Mercedes-Benz", "Lincoln",
"Audi", "Hummer", "Lexus", "Saturn", "Infiniti", "Kia", "Suzuki","Mitsubishi",
"Acura", "Honda", "Land Rover", "Porsche","Cadillac", "Buick", "Subaru",
"Chrysler", "Volvo", "Jaguar", "Pontiac"))
for (brand in car_brands){
str_index = grep(pattern = brand, x=df$title, ignore.case = TRUE)
df$brand[str_index] = brand
}
#Display a summary of the brands
library(knitr)
df$brand = as.factor(df$brand)
brand.summary.table = sort(summary(df$brand), decreasing = TRUE)
kable(brand.summary.table,
caption = "Number of car ads by manufacturer.",
format = "html",
table.attr = "style='width:50%;'")
x | |
---|---|
Ford | 268 |
Jeep | 267 |
Toyota | 261 |
Honda | 249 |
Hyundai | 217 |
Chevrolet | 214 |
Dodge | 213 |
Mercedes-Benz | 213 |
GMC | 210 |
BMW | 201 |
Volkswagen | 175 |
Nissan | 116 |
Kia | 62 |
Mazda | 50 |
Lexus | 47 |
Land Rover | 46 |
Audi | 43 |
Lincoln | 37 |
Cadillac | 32 |
Subaru | 32 |
Mitsubishi | 31 |
Buick | 30 |
Acura | 29 |
Infiniti | 27 |
Porsche | 19 |
Chrysler | 6 |
Hummer | 6 |
Volvo | 5 |
Pontiac | 4 |
Suzuki | 4 |
Saturn | 3 |
Jaguar | 2 |
The next series of plots shows the SUV sales price compared to the age of the car. I did not consider any detailed features of the cars (such as Turbo option, limited editions, or wether the vehilce had previsouly been in an accident). These features could be scraped from the actual ad text and the features could be worked into a more detailed analytical data model, but I was only interested in a cursory look at the time of writing. I used 11 brands, that were the most common SUV ads that I found. Some of the rarer makes did not have enough data points to fit a good relationship.
The following plots show that some brands lose a lot of value
initially and then plateau, while others lose value steadily throughout
their lifetime. The red best-fit line is an average price for each brand
and follows an exponential decay and is based on the average mileage
driven of 1.731064810^{4} km per year. The exponential decay also takes
into account the mileage of each ad. The equation that I used for
modelling was
Once the curve has been fit to each brand, theoretically you would only need to know the age of the car and the mileage, to predict a price. In the following plots there will be points on either side of the best-fit line. There are numerous reasons that explain this variation:
library(scales) # needed for formatting y-axis labels to non-scientific type
library(ggplot2)
car_list = c("Jeep", "Ford", "Honda",
"Toyota", "BMW", "Dodge", "Hyundai","Chevrolet", "GMC", "Mercedes-Benz","Volkswagen")
car_type = c("Domestic","Domestic", "Foreign",
"Foreign","Foreign","Domestic","Foreign","Domestic","Domestic", "Foreign", "Foreign" )
#Initialize plot list
plots <- list()
fit.brand.list <- list()
brand.summary = data.frame()
#Style labels
x.label = list(title = "<b>Age [Years]</b>", #xlabel
color = "#7f7f7f", #change text color #hoverinfo = "x",
ticksuffix = ' yrs',
showgrid = TRUE, #add vertical grid lines
range = c(0,11))
y.label = list(title = "<b>Price [$]</b>", #ylabel
color = "#7f7f7f", #change text color
hoverformat = '.0f',
tickprefix = '$',
range = c(0, 80000))
i = 1
for (cur_brand in car_list){
df.brand = subset(df, brand %in% cur_brand)
df.brand = df.brand[order(df.brand$age),]
fit.brand = nls(price ~ priceInitial + a*exp(-b*age) + c*mileage,
data = df.brand,
start = list(priceInitial= 10000,a=10000, b = 0.2,c = -0.02)
)
fit.brand.list[[i]] = fit.brand
newdf = data.frame(brand = cur_brand,
new_cost = coef(fit.brand)[1]+coef(fit.brand)[2],
car_type = car_type[i],
decay = coef(fit.brand)[3],
num = nrow(df.brand),
priceInitial = coef(fit.brand)[1],
a = coef(fit.brand)[2],
b = coef(fit.brand)[3],
c = coef(fit.brand)[4],
year1 = predict(fit.brand, data.frame(age = 1, mileage = mile.pred[2])),
year3 = predict(fit.brand, data.frame(age = 3, mileage = mile.pred[4])),
year5 = predict(fit.brand, data.frame(age = 5, mileage = mile.pred[6])),
year7 = predict(fit.brand, data.frame(age = 7, mileage = mile.pred[8])),
year10 = predict(fit.brand, data.frame(age = 10, mileage = mile.pred[11])),
mean_error = mean(abs(predict(fit.brand, df.brand) - df.brand$price)))
brand.summary = rbind(brand.summary, newdf)
#predict the forecast line
cost.pred = newdf$priceInitial +
newdf$a*exp(-newdf$b*df.brand$age) +
newdf$c*linear.model$coefficients[1]*df.brand$age
df.pred = data.frame(df.brand$age, cost.pred)
colnames(df.pred) = c("age", "cost")
#plot the resulting fit lines
plots[[i]] = plot_ly(x=df.brand$age,
y=df.brand$price,
type = 'scatter',
mode = 'markers',
name = 'Individual Ads',
marker = list(color = "#1c9099", opacity = 0.4)) %>%
add_trace(x=df.pred$age,
y=df.pred$cost,
type = 'scatter',
mode = 'lines+markers',
name = 'Average Price',
line = list(color = "#aa2222", line = list(width = 1)),
marker = list('red',color = "#ffffff", opacity = 0)) %>%
layout(title = cur_brand,
xaxis = x.label,
yaxis = y.label)
#
i = i+1
}
colnames(brand.summary) <- c("brand", "new_cost", "car_type",
"decay", "num", "priceInitial", "a", "b","c",
"year1", "year3", "year5", "year7", "year10",
"mean_error")
htmltools::tagList(list(plots[]))
Using the previous results I can present the average depreciation for each brand at multiple times throughout the vehicle’s life.
library(RColorBrewer)
#Style X Label
x.label = list(title = "", #xlabel
color = "#7f7f7f", #change text color
showgrid = TRUE) #add vertical grid lines
#Style y-label
y.label = list(title = "<b>Value [CAD $]</b>", #ylabel
color = "#7f7f7f", #change text color
range = c(0, 80000),
hoverformat = ".0f",
hoverinfo = "y+text",
text = "%")
#Sort the brands by percentage
brand.summary$brand <- factor(brand.summary$brand,
levels = brand.summary$brand[order(brand.summary$new_cost, decreasing = TRUE)])
legend.text1 = list(xref = "paper", yref = "paper", x = 1.03, y= 0.92,
font = list(family = "Arial", size = 15),
text = "<b>Domestic</b>", textangle = -90, showarrow = FALSE)
legend.text2 = list(xref = "paper", yref = "paper", x = 1.03, y= 0.45,
font = list(family = "Arial", size = 15),
text = "<b>Foreign</b>", textangle = -90, showarrow = FALSE)
plot_ly(x = ~brand.summary$brand,
y = ~brand.summary$new_cost,
type = 'bar',
color = ~car_type,
name = 'New Cost',
legendgroup = car_type) %>%
add_trace(y = ~brand.summary$year1, name = '1 Year Value', legendgroup = car_type) %>%
add_trace(y = ~brand.summary$year3, name = '3 Year Value', legendgroup = car_type) %>%
add_trace(y = ~brand.summary$year5, name = '5 Year Value', legendgroup = car_type) %>%
add_trace(y = ~brand.summary$year7, name = '7 Year Value', legendgroup = car_type) %>%
add_trace(y = ~brand.summary$year10, name = '10 Year Value', legendgroup = car_type) %>%
layout(xaxis = x.label,
yaxis = y.label,
legend = list(x = 1.03, y = 1, bgcolor = 'rgba(255, 255, 255, 0)',
bordercolor = 'rgba(0, 0, 0, 255)', tracegroupgap =20),
annotations = legend.text1) %>%
layout(annotations = legend.text2)
There is a $50,000 spread between the average cost of a high-end and a low-end SUV. The cost of foregin made SUVs are sorted by new value. Based on the spread there are a few general, more accurate statements that can be made:
Domestic SUVs are on average either really expensive or really affordable with domestics cars situated in the middle.
The smallest amount of value you are going to lose in the first year is approximately $3,600 on a Hyunadai, which is still over 10% of its new value.
The graph shows that some domestic vehicles (Jeep, Chevrolet, and Dodge) tend to plateau out in the later years, while all studied foreign vehicles tend to continue to decrease in value. This means that an older domestic SUV will tend to lose value slower than foreign-made cars.
I can also examine some of the percentages of the car value. The following table is sorted by the first year value, meaning the value of car (as percent of original cost) after 1 year of wonership.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
year1 = data.frame(brand = brand.summary$brand,
year1_percent = brand.summary$year1/brand.summary$new_cost*100,
year5_percent = brand.summary$year5/brand.summary$new_cost*100,
year10_percent = brand.summary$year10/brand.summary$new_cost*100)
year1 = year1[with(year1, order(-year1_percent)), ]
rownames(year1) = 1:nrow(year1)
#Display table somwehat formated
kable(year1,
digit = 1,
caption = "Average vehicle percent value at various points in time",
col.names = c('Brand','Year 1','Year 5','Year 10'),
format = "html",
table.attr = "style='width:50%;'")
Brand | Year 1 | Year 5 | Year 10 |
---|---|---|---|
Toyota | 88.9 | 56.8 | 33.6 |
Hyundai | 88.7 | 51.1 | 16.5 |
Honda | 86.5 | 49.4 | 25.2 |
Volkswagen | 86.2 | 46.5 | 17.6 |
Ford | 85.2 | 46.9 | 23.1 |
BMW | 84.1 | 43.2 | 17.3 |
GMC | 81.7 | 37.8 | 16.2 |
Jeep | 81.0 | 43.0 | 28.0 |
Mercedes-Benz | 78.0 | 38.9 | 17.7 |
Chevrolet | 73.6 | 32.3 | 22.5 |
Dodge | 70.3 | 31.2 | 19.9 |
Domestic vehicles had on average a higher first-year drop in percent value for the SUVs in these ads. In the first year, foreign-made cars tend to hold their value better (except a Mercedes Benz).
After 5 years, foreign cars on average still have higher percent value (compared to brand new value) than domestic vehciles.
After 10 years though the percent value for domestic cars is on average higher than for foreign cars.
Interestingly enough, a Toyota appears to be a good financial investment, as you retain a lot of value after 1 year, 5 years, and still have a surprising percentage left after 10 years. A Huyandai is the second-most sound investment (from the studied brands), but something happens after 5 years and their value plummets.
These values are what people are willing to pay for them. So it seems reasonable that the higher the percentage of the original cost, the higher the perception of the public that that particular brand is a sound SUV. Whether that is true or not, I cannot personally say.
Note that these are percent values. A Mercedes Benz can still be pricey after 10 years, but will have lost a much higher percent value than other vehicles.
The next table examines the loss of value per km driven. This could be due to either high maintanece cost, or the likelyhood of breaking down. Domestic cars are at the bottom of the list, meaning that you can put a load of miles on them and not lose too much values. Keep in mind that these numbers are not super accurate, as the age of the car had a higher infleunce on the cost and therefore was more sensitive to changes than these numbers. Still that should give you a general idea.
mileage.adjustment = brand.summary[,c("brand", "c")]
mileage.adjustment$c = abs(mileage.adjustment$c)
mileage.adjustment = mileage.adjustment[order(mileage.adjustment$c ,decreasing = TRUE),]
rownames(mileage.adjustment) <- 1:nrow(mileage.adjustment)
colnames(mileage.adjustment)[2] = "$ Value Loss Per KM"
kable(mileage.adjustment, digits = 3)
brand | $ Value Loss Per KM |
---|---|
Mercedes-Benz | 0.168 |
BMW | 0.082 |
Hyundai | 0.050 |
Volkswagen | 0.045 |
Dodge | 0.041 |
Ford | 0.040 |
Jeep | 0.026 |
Honda | 0.024 |
Toyota | 0.023 |
Chevrolet | 0.012 |
GMC | 0.009 |
SUVs (like other cars) lose a great deal of values in the first year, which should not surprise anybody. Depending on where you live, there might be alternatives to SUV ownership that can be explored from a purely financial point of view.
These altenatives will likely only work for people who live and work in urban areas who have access to public transit or ride-sharing options. One caveat is that people who buy SUVs likely need them for specific reasons (like hauling large items or going camping), which would make these options impractical. I personally don’t use our SUV except on the weekends to go camping or skiing.
But consider as example an “average” Jeep or an “average” Hyundai. In the first year the value of the jeep will drop from $45,500 to $36,500, which represents a loss of $9,000 in the first year. The value of the average Hyundai will drop by $3,700 (from $32,400 to $28,700). I live in Alberta and the annual cost of insurance for me was $1500, which also has to be factored in, bringing the first-year loss to $10,500 for the jeep and $5,200 for the Hyundai.
We are however fortunate in Alberta that we have cheap car rentals. I was able to rent a Hyundai Tucson (base model) for 4 days for $260 which came with unlimted mileage. Rental cost likely change depending on demand and season, but I will assume a cost of $65 per day. I also know that this is not true for all provinces.
This means that the $10,500 which you lose anyway on an average jeep would potentially get you 160 days of car rental with unlimited mileage. If you are considering an average Hyundai this number would be 80 days for $5,200. There are 52 weeks in a year and 104 weekend days. If you don’t use your car every day of the year and use it only on the occasional weekend trip, it might actually make fincancial sense to plan out alternatives instead. This analysis does not include convenice factors, which are likely why people get those cars in the first place. Given the fact that people are flocking towards urban centers, a financial argument against car ownership can be made.
// add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); });