This month, I set out to read the Machine Learning Design Patterns book with the goal of deepening my knowledge and stimulating new ideas for future work. My target was 75 pages, but thanks to well-written and easily digestible pages, I managed to get through 126.
My reading covered two main topics: common challenges in machine learning and data representation design patterns. Here are some key learnings from the book:
- Data quality: Both the training and test datasets must be complete. In image recognition, the labels must cover all of the classes the model is meant to learn, or else the model will fail to recognize those images or will assign incorrect labels. In prediction tasks, the training and test data must thoroughly represent the target population or the model will not be effective. A common example: a model trained to predict the prices of mansions will not do well on small condominiums, even when given the same input features.
- Data drift: It is essential to continuously monitor model inputs for drift, which occurs when the input features change over time while the model stays static. For example, take a sentiment classification model trained on early-2000s data but still running today without retraining. It would not understand new vocabulary and its connotations, leading to worse classification outcomes. (A small drift-check sketch appears after this list.)
- Data scaling and transformation: Machine learning models work well with data scaled to a specific range, say 0 to 1 or -1 to 1. This is because model optimizers behave well with values in this range and because models can be sensitive to mismatched magnitudes across features. To get to this scale, options include min-max scaling, z-score normalization, and clipping of outlier values. In addition, log transforms can be applied to heavily skewed data to bring it closer to a normal distribution prior to normalization (see the scaling sketch below).
- Data embedding: Embeddings are quite useful when data is represented in high dimensionality, such as a one-hot encoding of a feature with many values. One-hot encoding loses the relationships between the values. By adding an embedding layer that represents the data in a lower-dimensional space, the model can learn how close the feature values are to one another and also takes in fewer inputs (say, a two-value embedding versus many one-hot encoded variables). See the embedding sketch below.
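On the drift point, here is a minimal sketch of what input monitoring might look like, comparing a feature's training-time distribution against what the model sees in production with a two-sample Kolmogorov-Smirnov test from SciPy. The feature values, sample sizes, and alert threshold are all made up for illustration, and the book doesn't prescribe this particular test.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical feature values: one sample captured at training time,
# one sampled from live traffic (simulated here with a shifted distribution).
train_feature = np.random.normal(loc=0.0, scale=1.0, size=5000)
live_feature = np.random.normal(loc=0.4, scale=1.2, size=5000)

# Two-sample KS test: a small p-value suggests the live distribution
# no longer matches the distribution the model was trained on.
result = ks_2samp(train_feature, live_feature)
if result.pvalue < 0.01:  # arbitrary alert threshold
    print(f"Possible input drift (KS statistic={result.statistic:.3f}, p={result.pvalue:.2e})")
```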
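For the scaling and transformation techniques, here is a rough NumPy sketch of the four options mentioned above; the feature values and clipping bounds are invented for illustration.

```python
import numpy as np

# Made-up, heavily skewed feature values.
x = np.array([1.0, 5.0, 10.0, 200.0, 10_000.0])

# Min-max scaling: squeeze values into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation.
z_score = (x - x.mean()) / x.std()

# Clipping: cap outliers at chosen bounds before scaling.
clipped = np.clip(x, 0.0, 500.0)

# Log transform first, then normalize, for long-tailed data.
log_x = np.log1p(x)
log_z = (log_x - log_x.mean()) / log_x.std()
```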
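And for embeddings, a minimal Keras sketch (my library choice, not necessarily the book's) of a categorical feature with seven possible values mapped to two-dimensional vectors instead of a seven-column one-hot encoding. The "day of week" feature and the embedding size are assumptions for the example.

```python
import tensorflow as tf

# A categorical feature with 7 possible values (say, day of week).
# One-hot encoding would feed the model 7 inputs; the embedding feeds 2.
embedding = tf.keras.layers.Embedding(input_dim=7, output_dim=2)

# Integer-encoded category ids for a small batch of examples.
day_ids = tf.constant([0, 3, 6])
vectors = embedding(day_ids)  # shape (3, 2): one learned 2-d vector per category
print(vectors.numpy())
```

When the layer is trained as part of a larger model, its weights are learned along with everything else, so categories that behave similarly end up close together in that two-dimensional space.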
It makes sense that the book focuses so heavily on ideas like data drift, data transformation, and lower-dimensional embeddings, since data work makes up 80+ percent of a machine learning system. The models themselves are often straightforward and generally applied out of the box by industry practitioners, except in the most complex or niche use cases. Academics and research scientists will continue to innovate with models, but that's not the focus of this book or of most applied machine learning roles.
On to October.