🧮 tf-idf

= a statistical measure that evaluates how relevant a word is to a document in a collection of documents

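Formula

  $w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$

  • $i$: term
  • $j$: document
  • $\mathrm{tf}_{i,j}$: log-transformed term frequency, commonly $\log_{10}(\mathrm{count}(i,j) + 1)$
    • → greater when the term is frequent in the document
  • $\mathrm{idf}_i$: inverse document frequency, $\log_{10}(N / \mathrm{df}_i)$
    • $N$ = collection size, $\mathrm{df}_i$ = number of documents containing word $i$
    • → greater when the term is rare in the collection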
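
A minimal sketch of the weighting described above: log-transformed term frequency multiplied by inverse document frequency. The base-10 logarithm, the +1 inside the term-frequency log, the function name `tf_idf`, and the toy corpus are all assumptions made for illustration; other variants of the formula are also in use.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: log10(count + 1) * log10(N / df).
    """
    n_docs = len(docs)
    # df[term] = number of documents that contain the term at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw count of each term in this document
        weights.append({
            term: math.log10(count + 1) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
w = tf_idf(docs)
```

In this toy corpus, "the" appears in every document, so its idf is log10(3/3) = 0 and its weight is 0 everywhere, while "mat" (found in one document only) outweighs "cat" (found in two).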

To compare 2 documents instead of 2 words: