Text Vectorization Lab

How this lab is wired

Each panel sends your text to a Flask endpoint in app.py. That endpoint runs the exact scikit-learn / numpy / gensim calls from the reference notebook — CountVectorizer, TfidfVectorizer, OneHotEncoder, a hand-rolled Bag-of-Words counter, an N-gram generator, and a gensim.models.Word2Vec/FastText trainer — and streams the intermediate results back as JSON. The page then reveals those results stage by stage along the pipeline tape at the top of each panel, so you can see tokenization happen before the vocabulary appears, and the vocabulary settle before the vectors fill in.

Raw sentences › Tokenize › Build vocabulary › One-hot vectors › sklearn check

Input corpus

one sentence per line

Sentences
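
The first four stages of this tape (raw sentences, tokenize, build vocabulary, one-hot vectors) can be sketched in plain Python. This is an illustration only, not the code in app.py: the whitespace tokenizer, the sorted vocabulary order, and the function name one_hot_encode are assumptions made for the example.

```python
def one_hot_encode(sentences):
    """Return (vocab, vectors) for a list of sentences.

    Assumed tokenizer: lowercase, then split on whitespace.
    """
    # Tokenize: one list of tokens per sentence.
    tokenized = [s.lower().split() for s in sentences]

    # Build vocabulary: sorted unique tokens, each mapped to a column index.
    vocab = sorted({tok for sent in tokenized for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}

    # One-hot vectors: one row per token, with a single 1 in that token's column.
    vectors = []
    for sent in tokenized:
        rows = []
        for tok in sent:
            row = [0] * len(vocab)
            row[index[tok]] = 1
            rows.append(row)
        vectors.append(rows)
    return vocab, vectors


vocab, vectors = one_hot_encode(["the cat sat", "the dog sat"])
print(vocab)       # vocabulary in column order
print(vectors[0])  # one row per token of the first sentence
```

Each row has the same length as the vocabulary, so the representation grows with every new word and carries no notion of similarity between words.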

Raw corpus › Tokenize docs › Fit vocabulary › Count matrix › Transform new doc

Input corpus

one document per line

Documents

Max features (optional)

New document to transform (optional)

remove English stop words
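
The options above (max features, a new document to transform, stop-word removal) map onto the fit and transform steps of a count vectorizer. The sketch below mimics that behavior in plain Python rather than calling scikit-learn. The token pattern copies CountVectorizer's default of two or more word characters; the tiny STOP_WORDS set and the alphabetical tie-breaking for max_features are assumptions for illustration, not scikit-learn's own list or rule.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters
# Tiny illustrative stop list (assumption), not scikit-learn's English list.
STOP_WORDS = frozenset({"the", "a", "an", "is", "on", "and", "of"})


def tokenize(doc, remove_stop_words=False):
    tokens = TOKEN_RE.findall(doc.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def fit_vocabulary(docs, max_features=None, remove_stop_words=False):
    """Map each kept term to a column index (columns in alphabetical order)."""
    totals = Counter()
    for doc in docs:
        totals.update(tokenize(doc, remove_stop_words))
    terms = list(totals)
    if max_features is not None:
        # Keep the most frequent terms across the corpus; ties broken alphabetically.
        terms = sorted(terms, key=lambda t: (-totals[t], t))[:max_features]
    return {term: i for i, term in enumerate(sorted(terms))}


def transform(doc, vocab, remove_stop_words=False):
    """Count vocabulary terms in one document; unseen terms are ignored."""
    row = [0] * len(vocab)
    for tok in tokenize(doc, remove_stop_words):
        if tok in vocab:
            row[vocab[tok]] += 1
    return row


docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab = fit_vocabulary(docs, remove_stop_words=True)
matrix = [transform(d, vocab, remove_stop_words=True) for d in docs]
new_row = transform("the cat chased the dog", vocab, remove_stop_words=True)
print(vocab, matrix, new_row)
```

Note that "chased" in the new document is dropped: transforming a document never extends a vocabulary that has already been fitted.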

Raw corpus › Tokenize › Vocabulary › BoW matrix › Binary BoW › Cosine similarity

Input corpus

one document per line

Documents
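
The Bag-of-Words stages can be sketched as follows: a count matrix, its binary (presence or absence) variant, and the cosine similarity between two document rows. This is a plain-Python illustration under an assumed whitespace tokenizer, not the hand-rolled counter in app.py.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, counts, binary) for a list of documents."""
    tokenized = [d.lower().split() for d in docs]  # assumed tokenizer
    vocab = sorted({t for doc in tokenized for t in doc})
    counts = []
    for doc in tokenized:
        bag = Counter(doc)
        counts.append([bag[term] for term in vocab])
    # Binary BoW records only whether a term occurs, not how often.
    binary = [[1 if c > 0 else 0 for c in row] for row in counts]
    return vocab, counts, binary


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


docs = ["the cat sat on the mat", "the cat lay on the rug"]
vocab, counts, binary = bag_of_words(docs)
similarity = cosine_similarity(counts[0], counts[1])
print(vocab)
print(counts)
print(binary)
print(round(similarity, 3))
```

Because word order is discarded, two documents with the same words in a different order get identical rows and a similarity of 1.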

Sentence › Unigrams › Bigrams › Trigrams › N-gram matrices

Input

Sentence for manual N-grams

Corpus for N-gram matrices (one document per line)
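
A manual N-gram generator is a sliding window over the token list. The sketch below shows one way to produce unigrams, bigrams, and trigrams for a sentence and a count matrix over a small corpus; the function names and the whitespace tokenizer are assumptions for illustration, not the lab's own generator.

```python
def ngrams(tokens, n):
    """Slide a window of n tokens across the list and join each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_matrix(docs, n):
    """Return (vocab, matrix) counting each n-gram per document."""
    per_doc = [ngrams(d.lower().split(), n) for d in docs]
    vocab = sorted({g for grams in per_doc for g in grams})
    matrix = [[grams.count(g) for g in vocab] for grams in per_doc]
    return vocab, matrix


tokens = "the quick brown fox".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(unigrams)
print(bigrams)
print(trigrams)

bigram_vocab, bigram_counts = ngram_matrix(["a b a b", "b a"], 2)
print(bigram_vocab, bigram_counts)
```

A sentence of k tokens yields k - n + 1 n-grams, so a sentence shorter than n produces none.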

Raw corpus › Term frequency (TF) › Inverse doc. freq. (IDF) › TF × IDF › sklearn matrix › Top words / doc

TF-IDF(t, d) = [ count(t, d) / total words in d ] × log( N / (1 + df(t)) )
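
The formula above can be computed directly. The sketch below follows it term by term, assuming a whitespace tokenizer and the natural logarithm (the formula does not state a base). It is a manual calculation, so its numbers will generally differ from a TfidfVectorizer matrix: with default settings scikit-learn uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (idf, scores): idf per term, and one {term: score} dict per doc."""
    tokenized = [d.lower().split() for d in docs]  # assumed tokenizer
    n_docs = len(tokenized)
    vocab = sorted({t for doc in tokenized for t in doc})

    # df(t): number of documents that contain term t at least once.
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    # IDF as written in the formula: log(N / (1 + df(t))), natural log assumed.
    idf = {t: math.log(n_docs / (1 + df[t])) for t in vocab}

    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        # TF = count(t, d) / total words in d, then multiplied by IDF.
        scores.append({t: (counts[t] / total) * idf[t] for t in counts})
    return idf, scores


docs = ["cat sat", "cat ran", "dog ran fast", "bird flew"]
idf, scores = tf_idf(docs)
for doc, row in zip(docs, scores):
    top = sorted(row, key=row.get, reverse=True)
    print(doc, "->", top)
```

One consequence of the 1 + df(t) denominator: a term that appears in every document gets a negative IDF, since N / (1 + N) is below 1, and a term that appears in all but one document gets exactly 0.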

Input corpus

one document per line

Documents

Training sentences › Train Word2Vec › Similarity › Most similar › PCA plot › FastText OOV

Training sentences

one per line · trains live, takes a couple of seconds
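
The FastText OOV stage works because FastText represents a word through its character n-grams, so a word that never appeared in training can still be given a vector from the n-grams it shares with known words. The sketch below shows only that n-gram extraction, not the gensim training the lab runs. The function name char_ngrams is hypothetical; the lengths 3 to 6 match gensim's default min_n and max_n, and the angle brackets mark word boundaries as FastText does.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# "runs" is treated here as an out-of-vocabulary word (illustrative choice).
known_words = ["running", "runner", "jumping"]
oov_word = "runs"

oov_grams = set(char_ngrams(oov_word))
overlap = {
    w: sorted(oov_grams & set(char_ngrams(w)))
    for w in known_words
}
for word, shared in overlap.items():
    print(word, "shares", shared)
```

Word2Vec has no such fallback: it stores one vector per whole word, so a lookup for an unseen word fails.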

Watch text turn into numbers, one step at a time.

One-Hot Encoding

Count Vectorizer

Bag-of-Words

N-grams

TF-IDF Vectorizer

Word Embeddings — Word2Vec & FastText