🧮 tf-idf
= a statistical measure of how relevant a word is to a document in a collection of documents
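Formula
$w_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$
- $t$: term
- $d$: document
- $\text{tf}_{t,d} = \log_{10}(\text{count}(t,d) + 1)$: log-transformed term frequency
- → greater when the term is frequent in the document
- $\text{idf}_t = \log_{10}(N / \text{df}_t)$: inverse document frequency
- $N$ = collection size, $\text{df}_t$ = number of documents containing term $t$
- → greater when the term is rare in the collection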
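A minimal Python sketch of the weighting described above. It assumes one common variant: term frequency as log10(count + 1) and inverse document frequency as log10(N / df). Other variants exist, so treat the exact log form as an assumption. The function names and the toy corpus are hypothetical, for illustration only.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Log-transformed term frequency (assumed variant: log10(count + 1))."""
    return math.log10(Counter(doc_tokens)[term] + 1)


def idf(term, docs):
    """Inverse document frequency: log10(N / df).

    Returns 0.0 when the term occurs in no document, to avoid dividing by zero
    (a choice made for this sketch).
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / df) if df else 0.0


def tf_idf(term, doc_tokens, docs):
    """tf-idf weight of a term in one document of the collection."""
    return tf(term, doc_tokens) * idf(term, docs)


# Toy collection (hypothetical): each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In this toy collection, "the" appears in every document, so its idf and therefore its weight is 0, while "mat" appears in only one document and gets the highest weight of the three.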
To compare 2 documents instead of 2 words: