🧮 tf-idf
= a statistical measure that evaluates how relevant a word is to a document in a collection of documents
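Formula
- $w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$
- $t$: term
- $d$: document
- $\mathrm{tf}_{t,d}$: log-transformed term frequency, e.g. $\mathrm{tf}_{t,d} = \log(1 + \mathrm{count}(t,d))$
- → greater when the term is frequent in a document
- $\mathrm{idf}_t$: inverse document frequency, $\mathrm{idf}_t = \log(N / \mathrm{df}_t)$
- $N$ = collection size, $\mathrm{df}_t$ = number of documents containing term $t$
- → greater when the term is rare in the collection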
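A minimal sketch of the weighting described above, useful for checking the two "greater when" rules on a toy collection. The function name `tf_idf`, the base-10 logs, and the `log10(1 + count)` form of the term frequency are assumptions made for illustration; other variants (e.g. `1 + log10(count)`, smoothed idf) are also in common use.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = log10(1 + count) and idf = log10(N / df).
    """
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts within this document
        weights.append({
            term: math.log10(1 + count) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy collection (made-up sentences, whitespace tokenization)
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
```

In this toy collection "the" occurs in every document, so its idf (and therefore its weight) is zero, while "mat", which occurs in only one document, gets a higher weight than "cat", which occurs in two.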
To compare 2 documents instead of 2 words: