For this month, I am starting a DataCamp track to learn machine learning fundamentals in R. I have begun a course called “Supervised Learning for Classification,” with the first topic being KNN models. I originally set out to complete one course (also called a module) per week, but instead I’m choosing quality over quantity here.
Let me summarize what I’ve learned so far:
- Supervised learning is teaching a computer/program to learn from prior examples. If the goal of that learning is to put items into categories, then it is a classification problem.
- KNN (k-nearest neighbors) is an algorithm that classifies an observation based on how close it sits to other observations in a feature space, typically measured by the Euclidean distance between points. For an example of a feature space, think of a three-dimensional space with each axis being the amount of red, green, or blue in a given image (see the small sketch after this list).
- The “k” in KNN determines how many neighboring points are used to make the classification. If k=1, a point is classified the same as its single nearest neighbor. If k>1, the classification is decided by a plurality vote among the k nearest neighbors. If there is a tie for the leading vote, the point is assigned to one of the tied classes at random.
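To make the distance and voting ideas concrete, here is a small sketch of my own (the RGB values and labels are made up for illustration, not taken from the course) that computes Euclidean distances by hand and takes a plurality vote among the k closest points:

# Toy feature space: red, green, blue intensity of a few labeled images
train <- rbind(c(200, 30, 30), c(180, 40, 20), c(20, 200, 40), c(30, 180, 60))
colnames(train) <- c("red", "green", "blue")
labels <- c("mostly red", "mostly red", "mostly green", "mostly green")

# A new, unlabeled observation
new_point <- c(red = 190, green = 35, blue = 25)

# Euclidean distance from the new point to each training point
dists <- sqrt(rowSums(sweep(train, 2, new_point)^2))

# Plurality vote among the k nearest neighbors
k <- 3
nearest <- order(dists)[1:k]
table(labels[nearest])  # the most frequent label is the prediction

With k=3, two of the three closest points are “mostly red,” so that label wins the vote.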
Thankfully, R ships with the class package, which implements KNN. Once the data is obtained and cleaned, fitting the model is straightforward:
> library(class)
> knn(train = trainingdata, test = testdata, cl = traininglabels, k = 1)
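For a self-contained version of that call, here is a sketch using R’s built-in iris dataset in place of the course’s data (the variable names mirror the call above and are only illustrative):

library(class)

# Split the built-in iris data into training and test sets
set.seed(42)
idx <- sample(nrow(iris), 100)
trainingdata   <- iris[idx, 1:4]
testdata       <- iris[-idx, 1:4]
traininglabels <- iris$Species[idx]
testlabels     <- iris$Species[-idx]

# Fit and predict in one step; KNN has no separate training phase
predictions <- knn(train = trainingdata, test = testdata,
                   cl = traininglabels, k = 1)

# Check accuracy on the held-out observations
mean(predictions == testlabels)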
As I continue to learn, I’ve realized I shouldn’t focus on syntax and code. Instead, I should focus on how each algorithm works, why it would be used, and the trade-offs involved in implementing it.