We've been working on a predictive ML model for one of our customers in the construction industry. Here are some unstructured thoughts on the experience:
- The importance of feature engineering, especially when using unstructured data sources as inputs. Are you going to stem or lemmatize your keywords? Are you going to select the words with the most occurrences, or the ones that differentiate most with respect to the output variable? Are you going to one-hot-encode some of your numeric data? It's impressive to see how differently the models performed with different answers to these questions
- Over-fitting is your enemy: do everything you can to detect it and avoid it
- Take an industrialized approach: build or buy the capabilities to train multiple models with tweaked parameters. Having the results for multiple combinations will give you a deeper understanding than one-by-one steps
- Squeeze the lemon: there were 3 separate occasions where we thought "that's it, we're doing all we can with the data, there's no way to improve". But taking a step back led to 3 "wait-a-minute-why-don't-we-use-that" moments. If you feel you're stuck, take a step back: are you really using all the data you have?
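To make the points above about feature choices, over-fitting checks and training many combinations at once a bit more concrete, here is a minimal Python sketch. It is not the model from this project: the k-nearest-neighbour regressor, the synthetic data and the names `run_grid`, `ks` and `log_options` are all assumptions made up for illustration. The idea is simply to loop over every parameter combination and compare the error on training data with the error on held-out data.

```python
import itertools
import math
import random
import statistics


def knn_predict(train, x, k):
    """Predict y as the mean target of the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)


def median_abs_pct_error(train, rows, k):
    """Median of |prediction - actual| / |actual| over the given rows."""
    return statistics.median(
        abs(knn_predict(train, x, k) - y) / abs(y) for x, y in rows
    )


def run_grid(rows, ks=(1, 5, 25), log_options=(False, True), seed=0):
    """Fit one model per parameter combination; report train and validation error."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    results = []
    for k, use_log in itertools.product(ks, log_options):
        # Feature-engineering switch: use the raw feature or its logarithm.
        data = [(math.log(x) if use_log else x, y) for x, y in shuffled]
        cut = int(len(data) * 0.8)  # 80% train, 20% held out for validation
        train, valid = data[:cut], data[cut:]
        results.append({
            "k": k,
            "use_log": use_log,
            "train_error": median_abs_pct_error(train, train, k),
            "valid_error": median_abs_pct_error(train, valid, k),
        })
    # Best validation error first; a big train/validation gap hints at over-fitting.
    return sorted(results, key=lambda r: r["valid_error"])


# Synthetic data for illustration only: y depends on log(x) plus noise.
rng = random.Random(42)
rows = [(float(x), 1000 + 300 * math.log(x) + rng.gauss(0, 50)) for x in range(1, 201)]
results = run_grid(rows)
for r in results:
    print(r)
```

With `k=1` every training point is its own nearest neighbour, so the training error is zero while the validation error is not. That gap is the kind of over-fitting signal worth looking for, and having all combinations side by side makes it easy to spot.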
Our model's median absolute error started at 300% and is down to 25% in the span of 2 weeks (mean absolute error is 10% higher, mostly due to a few crazy outliers). It's been a tough, fun, chaotic, insightful, frustrating and satisfying journey all at once. We're going to do our best to get MAE down to 20%; for now the weekend beckons. Happy weekend!
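For anyone wondering why the median and the mean of the absolute errors can sit so far apart, here is a small Python sketch. The numbers are invented for illustration and are not our project data, and expressing each error relative to the actual value is an assumption about how such percentages are computed.

```python
import statistics


def abs_pct_errors(actual, predicted):
    """Absolute error of each prediction, relative to the actual value."""
    return [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]


# Made-up values: four decent predictions and one wild outlier.
actual = [100, 200, 150, 120, 80]
predicted = [110, 180, 165, 108, 400]

errors = abs_pct_errors(actual, predicted)
median_err = statistics.median(errors)  # barely affected by the outlier
mean_err = statistics.fmean(errors)     # pulled up by the outlier

print(f"median absolute error: {median_err:.0%}")
print(f"mean absolute error:   {mean_err:.0%}")
```

A single bad prediction drags the mean well above the median, which is why we track both.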