Watch text turn into numbers, one step at a time.
Every technique from the deck (One-Hot Encoding, Count Vectorizer, Bag-of-Words, N-grams, TF-IDF, and Word2Vec/FastText embeddings) runs here against a real Python backend. Type your own sentences, hit run, and the API tokenizes the text, builds the vocabulary, and constructs the vectors live, the same way the reference notebook does.
Each panel sends your text to a Flask endpoint in app.py. That endpoint runs the same scikit-learn, numpy, and gensim code as the reference notebook (CountVectorizer, TfidfVectorizer, OneHotEncoder, a hand-rolled Bag-of-Words counter, an N-gram generator, and the gensim.models Word2Vec and FastText trainers) and streams the intermediate results back as JSON. The page then reveals those results stage by stage along the pipeline tape at the top of each panel, so you can see tokenization happen before the vocabulary appears, and the vocabulary settle before the vectors fill in.
One-Hot Encoding
Every unique word in the vocabulary gets one dedicated position in the vector. A word's vector is all zeros except a single 1 at its own index — so the vector length always equals the vocabulary size.
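To make the "single 1 at its own index" idea concrete, here is a minimal from-scratch sketch in plain Python. It is not the OneHotEncoder call the backend uses; the helper names build_vocabulary and one_hot are hypothetical, and the alphabetical index order is an assumption chosen for readability.

```python
def build_vocabulary(tokens):
    """Give each unique token one index (alphabetical order, by assumption)."""
    return {word: index for index, word in enumerate(sorted(set(tokens)))}


def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's own index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector


tokens = "the cat sat on the mat".split()
vocabulary = build_vocabulary(tokens)  # 5 unique words: cat, mat, on, sat, the
vectors = {word: one_hot(word, vocabulary) for word in vocabulary}

for word, vector in vectors.items():
    print(f"{word:>4} {vector}")
```

"the" appears twice in the sentence but still gets only one position, so every vector here has length 5.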
Count Vectorizer
Builds a vocabulary from the whole corpus, then counts how many times each vocabulary word appears in every document — a document–term frequency matrix. Word order is discarded.
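The two steps above (build the vocabulary, then count per document) can be sketched without any library. This is an approximation, not the backend's code: it assumes lowercasing, tokens of two or more word characters, and an alphabetically sorted vocabulary, which is how scikit-learn's CountVectorizer behaves with its default settings as far as I know. The names tokenize and count_matrix are hypothetical.

```python
import re
from collections import Counter

# Assumption: keep only tokens of two or more word characters,
# so single letters such as "a" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(document):
    """Lowercase the text and extract tokens."""
    return TOKEN_PATTERN.findall(document.lower())


def count_matrix(corpus):
    """Return (vocabulary, rows): one row of term counts per document."""
    tokenized = [tokenize(document) for document in corpus]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


corpus = ["The cat sat on the mat", "The dog sat on the log", "A cat and a dog"]
vocabulary, matrix = count_matrix(corpus)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word, and shuffling the words inside a document would leave its row unchanged.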
Bag-of-Words
Bag-of-Words is the concept: text as an unordered bag of words where only frequency matters. Count Vectorizer is simply scikit-learn's implementation of that idea. Below: a from-scratch BoW counter, a binary (presence/absence) variant, and the cosine similarity between the resulting document vectors.
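A rough sketch of those three pieces in plain Python, assuming simple lowercase whitespace tokenization. The function names bow_vectors, binarize, and cosine_similarity are hypothetical and are not taken from app.py or the notebook.

```python
import math
from collections import Counter


def bow_vectors(corpus):
    """Count vectors over a shared, alphabetically sorted vocabulary."""
    tokenized = [document.lower().split() for document in corpus]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


def binarize(vector):
    """Presence/absence variant: any positive count becomes 1."""
    return [1 if count > 0 else 0 for count in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocabulary, (doc_a, doc_b) = bow_vectors(corpus)

count_similarity = cosine_similarity(doc_a, doc_b)
binary_similarity = cosine_similarity(binarize(doc_a), binarize(doc_b))
print(round(count_similarity, 3), round(binary_similarity, 3))
```

The count vectors score higher than the binary ones here because the repeated "the" carries more weight when frequencies are kept.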
N-grams
An N-gram is a run of N consecutive tokens. Counting unigrams (N=1) is plain BoW; bigrams (N=2) and trigrams (N=3) keep a sliver of local word order that BoW throws away, at the cost of a much bigger vocabulary.
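A sliding window is one simple way to generate N-grams. The sketch below assumes whitespace tokenization and joins each window with a space; the function name ngrams is hypothetical.

```python
def ngrams(tokens, n):
    """Slide a window of n tokens across the sequence, one step at a time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sequence of T tokens yields T - N + 1 N-grams, and none at all when N is larger than T.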
TF-IDF Vectorizer
Term Frequency × Inverse Document Frequency. Words that show up in almost every document (like "the" or "is") get pulled down; words that are frequent in one document but rare across the corpus get pushed up.
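The push and pull can be seen in a small from-scratch sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is what scikit-learn documents for TfidfVectorizer's default settings; whitespace tokenization is a simplification, and the function name tfidf is hypothetical.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return (vocabulary, idf, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [document.lower().split() for document in corpus]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears at least once.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, idf, rows


corpus = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary, idf, rows = tfidf(corpus)

for term in vocabulary:
    print(f"{term:>5} idf={idf[term]:.3f}")
```

"the" appears in all three documents and gets the lowest IDF, "sat" appears in two and lands in the middle, and words unique to one document score highest.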
Word Embeddings — Word2Vec & FastText
Instead of counting, embeddings learn dense, low-dimensional vectors from context, so words used in similar contexts end up with similar vectors. This panel trains real gensim Word2Vec (Skip-gram and CBOW) and FastText models on your sentences, in the background, right now.
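Training itself needs gensim, but the inputs these models learn from are easy to sketch in plain Python. The block below is an illustration under stated assumptions, not gensim's implementation: skipgram_pairs lists the (center, context) pairs a Skip-gram model would see with a fixed window (CBOW uses the same windows but predicts the center word from its context), and char_ngrams lists FastText-style character n-grams with "<" and ">" as word-boundary markers. Both function names are hypothetical, and real trainers add details such as randomized window sizes and subsampling.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
trigram_pieces = char_ngrams("where", n_min=3, n_max=3)

print(pairs[:4])
print(trigram_pieces)
```

Because FastText builds a word's vector from pieces like these, it can produce a vector for a word it never saw during training, which plain Word2Vec cannot.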