Every day, thousands of cars are sold in the United States. Some people (auto dealers) make money on this: they buy undervalued cars from some people and resell them to others. To maximize their profit, dealers need to understand a car's real (market) price and how long it will take to sell.
The initial data was a history of online sales, apparently obtained by scraping resources such as Craigslist, eBay, etc. The main problem with such data is that it is dirty and unstructured. Some fields (for example, the year of manufacture and the odometer reading) are filled in everywhere, but the equipment list or even the engine volume is either absent or buried in a single paragraph of text.
To extract useful data from the textual description, we used a combination of heuristics: regular expressions, fuzzy string matching, vocabularies, term frequency and so on. The combination is different for each field. For example, for the engine the key component was the regex '\d\.\d' (digit, point, digit), which matches 1.5 or 2.0 (the engine volume). To detect the color, we searched for matches in a dictionary of the 32 most popular colors and their synonyms. Fuzzy string matching deserves special mention: it is an inexact string search that relies on the Levenshtein distance between strings. So, for example, 'Radio System' and 'FM Radio' are considered close to each other, which means they are the same thing. Fuzzy string search was part of extracting almost every component.
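The heuristics above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `COLOR_SYNONYMS`, `extract_engine_volume`, `extract_color` and `closest_term` are hypothetical, the color dictionary is a tiny stand-in for the real 32-color one, the regex adds lookarounds to the bare '\d\.\d' so that it does not fire inside longer numbers, and the length-normalized similarity score is an assumption.

```python
import re

# Hypothetical stand-in for the 32-color dictionary: synonym -> canonical color.
COLOR_SYNONYMS = {
    "black": "black",
    "white": "white",
    "silver": "silver",
    "gray": "gray",
    "grey": "gray",
    "charcoal": "gray",
}

# Digit, point, digit, not embedded in a longer number such as 12.55.
ENGINE_RE = re.compile(r"(?<![\d.])(\d\.\d)(?!\d)")


def extract_engine_volume(text):
    """Return the first 'd.d' pattern in the text as a float, or None."""
    match = ENGINE_RE.search(text)
    return float(match.group(1)) if match else None


def extract_color(text):
    """Return the canonical color of the first known color word, or None."""
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in COLOR_SYNONYMS:
            return COLOR_SYNONYMS[word]
    return None


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def closest_term(phrase, vocabulary):
    """Pick the vocabulary entry most similar to the phrase."""
    return max(vocabulary, key=lambda term: similarity(phrase, term))


description = "2012 Honda Civic 1.8L, charcoal, 45k miles"
print(extract_engine_volume(description))
print(extract_color(description))
print(closest_term("Leather seets", ["Leather seats", "Sunroof", "FM Radio"]))
```

In practice a similarity threshold would also be needed, so that a phrase with no good match in the vocabulary is rejected instead of being mapped to the least bad entry.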
Two identical cars sold in the same place at the same time may have different prices: one is sold for 5k, the other for 6k, simply because the sellers see the market differently. It is impossible to predict the exact price, because from the data's point of view these are the same car. The best prediction would be 5.5k ± 0.5k, where 0.5k is the market price gap. This gap was chosen as the target metric: the smaller the average gap, the more accurate the prediction can be. We took one specific model (the Honda Civic, popular in the USA), manually divided its listings into clusters (similar year, odometer, engine size, sales time, etc.) and found the average gap: about $900-1000 (with an average cost of $16,500). The gap estimate does not affect the quality or accuracy of the model in any way; the model will still be as accurate as it can be. The estimate simply gives us an understanding of how close we are to the truth.
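As a rough illustration of how such a gap can be estimated, the sketch below groups listings into clusters and averages the absolute deviation of each price from its cluster mean. The clustering key (year, 10,000-mile odometer bucket, engine size), the field names and the choice of mean absolute deviation are all assumptions made for the example; the original clusters were built manually.

```python
from collections import defaultdict
from statistics import mean


def cluster_key(car, odometer_step=10_000):
    """Hypothetical cluster: same year, same odometer bucket, same engine."""
    return (car["year"], car["odometer"] // odometer_step, car["engine"])


def average_price_gap(cars):
    """Mean absolute deviation of prices from their cluster's mean price."""
    clusters = defaultdict(list)
    for car in cars:
        clusters[cluster_key(car)].append(car["price"])

    deviations = []
    for prices in clusters.values():
        if len(prices) < 2:
            continue  # a single listing says nothing about the spread
        center = mean(prices)
        deviations.extend(abs(price - center) for price in prices)
    return mean(deviations) if deviations else None


# Made-up listings: the first two are "the same car" priced at 5k and 6k.
cars = [
    {"year": 2015, "odometer": 41_000, "engine": 1.8, "price": 5_000},
    {"year": 2015, "odometer": 43_500, "engine": 1.8, "price": 6_000},
    {"year": 2012, "odometer": 90_000, "engine": 1.8, "price": 4_000},
]
print(average_price_gap(cars))
```

For the two matching listings the cluster mean is 5.5k and each price deviates from it by 0.5k, which reproduces the gap from the example above; the third listing has no peers and is skipped.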
Decision trees (in particular, random forests) were chosen as the predictive model. This is a classic solution for this kind of problem. The advantages are: state-of-the-art performance, simplicity, clarity, interpretability, stability and resistance to overfitting, and speed of learning. The model gives a good understanding of which parameters matter and how much, which is very useful for research and debugging, and also in communication with the customer.
The price prediction is very accurate. For the Honda Civic, the mean absolute error was $950, which is very close to the real market gap. It is just as good for other car models: the error is about 5-7% on average.
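For reference, the two figures can be related with a small sketch. The helper names and the sample prices are made up for illustration, and expressing the relative error as the mean absolute error divided by the mean price is an assumption about how the percentage was obtained.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between real and predicted prices."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_error(actual, predicted):
    """Mean absolute error as a fraction of the mean real price."""
    return mean_absolute_error(actual, predicted) / (sum(actual) / len(actual))


# Made-up prices chosen to mirror the Honda Civic figures from the text.
actual = [16_000, 17_000]
predicted = [15_000, 17_900]

print(mean_absolute_error(actual, predicted))
print(round(relative_error(actual, predicted) * 100, 1))
```

With the numbers quoted for the Honda Civic, $950 on an average cost of $16,500 is roughly 5.8%, which falls inside the 5-7% range.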
After training, the model provides a list of parameters along with the importance of each: