Predicting Used Car Prices

Industry: Used Car Market

Location: USA

Technologies: Python, Pandas, Jupyter, TensorFlow

Every day, thousands of cars are sold in the United States. Auto dealers make money from this: they buy undervalued cars from some people and resell them to others. To maximize profit, a dealer needs to know the real (market) price of a car and how long it will take to sell.

The Data

The initial data was a history of online sales, apparently obtained by scraping sites such as craigslist and eBay. The main problem with such data is that it is dirty and unstructured. Some fields (for example, the year of manufacture and the odometer reading) are filled in everywhere, but the equipment list and even the engine volume are either absent or buried in a single paragraph of free text.

Data cleaning & preparation

To extract useful data from the textual description, we used a combination of heuristics: regular expressions, fuzzy string matching, vocabularies, term frequency and so on. The combination is different for each field. For example, to find the engine volume the key component was the regex '\d\.\d' (digit, point, digit), which matches 1.5 or 2.0. To detect a color, we searched for matches in a dictionary of the 32 most popular colors and their synonyms. Fuzzy string matching deserves special mention: it is an inexact string search that relies on the Levenshtein distance between strings. So, for example, 'Radio System' and 'FM Radio' are close to each other, which means they are treated as the same thing. Fuzzy string search was part of extracting almost every component.
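The heuristics described above can be illustrated with a minimal sketch. It is not the project's code: the function names, the small color vocabulary, the digit guards around the regex and the distance threshold are all assumptions made for the example. It also uses a plain character-level Levenshtein distance, so multi-word phrases may need a token-based comparison on top of it.

```python
import re

# Hypothetical mini-vocabulary; the project used 32 colors plus synonyms.
COLOR_SYNONYMS = {
    "black": "black",
    "white": "white",
    "silver": "silver",
    "grey": "gray",
    "gray": "gray",
    "red": "red",
    "burgundy": "red",
}

# 'digit point digit', guarded so that parts of longer numbers
# such as '12.5' or '1.55' are not picked up (the guard is an assumption).
ENGINE_RE = re.compile(r"(?<!\d)(\d\.\d)(?!\d)")


def extract_engine_volume(text):
    """Return the first 'digit.digit' token as a float, or None."""
    match = ENGINE_RE.search(text)
    return float(match.group(1)) if match else None


def extract_color(text):
    """Return the canonical color for the first vocabulary word found."""
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in COLOR_SYNONYMS:
            return COLOR_SYNONYMS[word]
    return None


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_lookup(term, vocabulary, max_distance=2):
    """Return the closest vocabulary entry, or None if nothing is close enough."""
    term = term.lower()
    best = min(vocabulary, key=lambda entry: levenshtein(term, entry.lower()))
    return best if levenshtein(term, best.lower()) <= max_distance else None
```

For instance, fuzzy_lookup("sunrof", ["sunroof", "fm radio"]) resolves the misspelling to "sunroof", while a phrase far from every entry returns None.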

Price error target & metric

Two identical cars sold in the same place at the same time may have different prices: one sells for 5k, the other for 6k, simply because the sellers see the market differently. It is impossible to predict the exact price, because from the point of view of the data these are the same car. The best possible prediction is 5.5k ± 0.5k, where 0.5k is the market price gap. This was chosen as the target metric: the smaller the average gap, the more accurate the prediction. We took one specific model (the Honda Civic, popular in the USA), manually divided its listings into clusters (similar year, odometer, engine size, sale time, etc.) and found the average gap: about $900-1,000 (with an average price of $16,500). The gap estimate does not affect the quality or accuracy of the model in any way: the model is made as accurate as possible regardless. The estimate simply tells us how close we are to the truth.
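One way to compute such a gap is sketched below. The exact definition is an assumption: here the gap is the mean absolute deviation of each price from its cluster's average, which reproduces the 5.5k ± 0.5k example. The field names, the 10,000-mile odometer buckets and the sample listings are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def cluster_key(listing):
    """Hypothetical clustering rule: same year, engine and 10k-mile bucket."""
    return (listing["year"], listing["odometer"] // 10_000, listing["engine"])


def market_gap(listings, key):
    """Mean absolute deviation of prices from their cluster average."""
    clusters = defaultdict(list)
    for listing in listings:
        clusters[key(listing)].append(listing["price"])

    deviations = []
    for prices in clusters.values():
        if len(prices) < 2:
            continue  # a single listing says nothing about price spread
        center = mean(prices)
        deviations.extend(abs(price - center) for price in prices)
    return mean(deviations) if deviations else None


# Made-up listings, not taken from the project's data.
LISTINGS = [
    {"year": 2015, "odometer": 42_000, "engine": 1.8, "price": 16_000},
    {"year": 2015, "odometer": 47_000, "engine": 1.8, "price": 17_000},
    {"year": 2012, "odometer": 81_000, "engine": 1.8, "price": 10_000},
    {"year": 2012, "odometer": 88_000, "engine": 1.8, "price": 10_400},
    {"year": 2010, "odometer": 120_000, "engine": 1.8, "price": 7_500},
]

gap = market_gap(LISTINGS, cluster_key)
```

The first cluster contributes deviations of 500 each and the second 200 each, while the lone 2010 listing is ignored.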

ML Model

A Random Forest (an ensemble of decision trees) was chosen as the predictive model. This is a classic solution for this kind of problem. Its advantages are strong predictive performance, simplicity, clarity, interpretability, stability, resistance to overfitting and fast training. It also shows how much each parameter influences the result, which is very useful for research and debugging as well as in communication with the customer.
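To show the mechanics behind a random forest, here is a deliberately tiny, standard-library-only sketch. It is not the model used in the project: each "tree" is a single split (a stump), and the number of trees, the number of features tried per tree and the sample data are arbitrary choices for the example. What it keeps is the core idea: every tree is trained on a bootstrap sample with a random subset of features, and the forest averages their predictions.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature_indices):
    """Find the one threshold split that minimises squared error."""
    best = None  # (error, feature, threshold, left_mean, right_mean)
    for f in feature_indices:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean, right_mean = mean(left), mean(right)
            error = (sum((t - left_mean) ** 2 for t in left)
                     + sum((t - right_mean) ** 2 for t in right))
            if best is None or error < best[0]:
                best = (error, f, threshold, left_mean, right_mean)
    if best is None:  # no split possible: always predict the sample mean
        overall = mean(targets)
        return (0, float("inf"), overall, overall)
    return best[1:]


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train stumps on bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    per_tree = max(1, n_features // 3)  # arbitrary choice for this sketch
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # with replacement
        features = rng.sample(range(n_features), per_tree)
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample],
                                features))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps."""
    return mean(left if row[f] <= threshold else right
                for f, threshold, left, right in forest)


# Made-up data: (age in years, odometer in thousands of miles) -> price.
ROWS = [(1, 10), (2, 25), (3, 40), (5, 60), (7, 90), (9, 120), (11, 150), (13, 180)]
PRICES = [19000, 17500, 16000, 13500, 11000, 9000, 7000, 5500]

forest = fit_forest(ROWS, PRICES)
newer = predict(forest, (2, 20))
older = predict(forest, (12, 170))
```

On this toy data the newer, lower-mileage car receives the higher predicted price. A real project would normally rely on an existing library implementation with full-depth trees rather than a hand-written one.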


The price prediction is very accurate. For the Honda Civic, the mean absolute error was $950, very close to the real market gap. The results are similarly good for other models: about 5-7% on average. After training, the model reports a list of parameters with the importance of each:

  1. sale_age    961122646315638.750000 ################
  2. year    955390524401118.750000 ###############
  3. odometer    422230866955567.562500 #######
  4. engine    178803878256016.906250 ##
  5. transmission    67672749040132.929688 #
  6. trim_level    65038557678927.375000 #
  7. location_state    36002446179795.140625
  8. interior_color    17877616362522.753906
  9. body_style    11330188236041.732422
  10. sunroof    7707656710506.492188
  11. sale_time    6701589845994.826172
  12. exterior_color    6620146018522.518555
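
The raw scores in the list are hard to compare by eye. A small sketch that converts them into shares of the total is shown here; the scores are copied from the list, while the function name and the output format are assumptions for the example.

```python
# Raw importance scores copied from the list above.
IMPORTANCES = {
    "sale_age": 961122646315638.750000,
    "year": 955390524401118.750000,
    "odometer": 422230866955567.562500,
    "engine": 178803878256016.906250,
    "transmission": 67672749040132.929688,
    "trim_level": 65038557678927.375000,
    "location_state": 36002446179795.140625,
    "interior_color": 17877616362522.753906,
    "body_style": 11330188236041.732422,
    "sunroof": 7707656710506.492188,
    "sale_time": 6701589845994.826172,
    "exterior_color": 6620146018522.518555,
}


def importance_shares(importances):
    """Scale raw scores so that they sum to 1."""
    total = sum(importances.values())
    return {name: score / total for name, score in importances.items()}


shares = importance_shares(IMPORTANCES)
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<15} {share:6.1%}")
```
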
We can see that age and odometer are very important, while exterior color matters much less. We also see that sale_time is only weakly related to the price: if we fix all the other parameters of the car, the sale period has little effect on its price. Intuitively there should be a correlation, but in practice, if a car is not sold within 2 weeks, the seller lowers the price and it then sells very quickly (36% of all cars are sold in the first 4 days, 69% within 16 days and 82% within 32 days). A car sells either quickly or not at all. Predicting the selling time was important for the customer, but thanks to the readability of the random forest we were able to explain why it could not be predicted with high accuracy.
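Shares like "sold in the first 4 days" can be computed from a list of days on market, as in the sketch below. The function name and the sample values are made up, and treating each limit as inclusive is an assumption.

```python
def share_sold_within(days_on_market, limit):
    """Fraction of sold listings that were on the market for at most `limit` days."""
    sold_in_time = sum(1 for days in days_on_market if days <= limit)
    return sold_in_time / len(days_on_market)


# Made-up sample, not the project's data.
DAYS_ON_MARKET = [1, 2, 4, 10, 16, 30, 32, 50, 90, 120]

within_4 = share_sold_within(DAYS_ON_MARKET, 4)
within_16 = share_sold_within(DAYS_ON_MARKET, 16)
within_32 = share_sold_within(DAYS_ON_MARKET, 32)
```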