Bag Of Words (BOW)

NewNerd
3 min read · Oct 22, 2020


Bag Of Words (BOW) is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words and a measure of the presence of those known words.

Let’s understand this with a toy corpus of four reviews:
r1: This pasta is very tasty and affordable.
r2: This pasta is not tasty and is affordable.
r3: This pasta is delicious and cheap.
r4: Pasta is tasty and pasta tastes good.

To build a BOW representation, we follow a few steps.

Step 1:
Construct a dictionary that contains the set of all unique words in the corpus of reviews.
Set = {This, pasta, is, very, …} ; the set contains d unique words
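Step 1 can be sketched in a few lines of Python. This is only a minimal illustration: the tokenization rule (lowercase, letters only) and the helper names `tokenize` and `build_vocabulary` are assumptions made for this example, not something a particular library prescribes.

```python
import re

# The four toy reviews from the corpus above.
reviews = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only,
    # so "Pasta" and "pasta" count as the same word.
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    # Map each unique word to an index, in first-seen order.
    vocab = {}
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

vocab = build_vocabulary(reviews)
d = len(vocab)  # number of unique words, i.e. the vector dimension
print(d, list(vocab))
```

With this tokenization the toy corpus yields d = 12 unique words; a different tokenization (for example, keeping case) would give a different d.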

Step 2:
Now we convert each review into a d-dimensional vector.
The simplest scoring method is a Boolean (binary) vector, which marks the presence of each word wi with a Boolean value:
1, if wi occurs at least once.
0, if wi is absent.

Let’s take two reviews
r1: This pasta is very tasty and affordable
r2: This pasta is not tasty and is affordable
To convert a review into a d-dimensional Boolean vector, first construct a vector with one position for each of the d unique words in the corpus.
Then, for each review, put 1 if the word exists in that review and 0 if it is absent.
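Here is a small sketch of that conversion for r1 and r2. The vocabulary is built from all four toy reviews in first-seen order; the lowercase tokenization and the helper name `binary_vector` are assumptions for illustration.

```python
import re

corpus = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    # Assumption: lowercase, letters only.
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary as an ordered list of unique words (first-seen order).
vocab = []
for review in corpus:
    for word in tokenize(review):
        if word not in vocab:
            vocab.append(word)

def binary_vector(text, vocab):
    # 1 if the vocabulary word occurs at least once in the text, else 0.
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocab]

v1 = binary_vector(corpus[0], vocab)  # r1
v2 = binary_vector(corpus[1], vocab)  # r2
print(vocab)
print(v1)
print(v2)
```

Note that "is" appears twice in r2 but still gets a 1, because a Boolean vector records presence, not counts.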

Now we can check whether r1 and r2 are similar by computing the distance between their two vectors. The vectors differ only in the positions for "very" and "not", so:
||v1 − v2|| = √(number of differing words in r1 and r2)
||v1 − v2|| = √(1² + 1²) = √2
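The same arithmetic can be checked in code. In this sketch the two binary vectors are written out by hand, assuming the word order given in the comment; for 0/1 vectors the squared Euclidean distance equals the number of positions where the vectors differ.

```python
import math

# Assumed word order:
# this, pasta, is, very, tasty, and, affordable, not, delicious, cheap, tastes, good
v1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # r1
v2 = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0]  # r2

# Count the positions where the two vectors disagree ("very" and "not").
differing = sum(1 for a, b in zip(v1, v2) if a != b)

# Euclidean distance between the two vectors.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

print(differing, distance)
```

A smaller distance means the two reviews share more of their vocabulary, although, as r1 and r2 show, it says nothing about whether they agree in meaning.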

NOTE: The vector representation of every review is d-dimensional.

Uni-gram, bi-gram, n-gram

Treating every single word as its own dimension can be a problem, because a combination of two words sometimes carries more meaning than either word alone.

For example:
This pasta is very tasty. This is the best in New York.
In this example, New York is a location and should be treated as a single unit, so we need to preserve some of the sequence information.

Now we introduce uni-grams, bi-grams, and so on with an example:
r1: This pasta is very tasty and affordable.
r2: This pasta is not tasty and is affordable.
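To make the idea concrete, here is a minimal sketch that extracts uni-grams, bi-grams and tri-grams from r1. The helper names `tokenize` and `ngrams` and the lowercase tokenization are assumptions for this example.

```python
import re

def tokenize(text):
    # Assumption: lowercase, letters only.
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    # Slide a window of n consecutive words over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

r1 = "This pasta is very tasty and affordable."
tokens = tokenize(r1)

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

Each bi-gram such as "very tasty" becomes its own dimension, so word pairs that belong together are kept as one feature.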

NOTE:
In Bag Of Words (BOW), uni-grams discard the sequence information entirely, while bi-grams, tri-grams, or any n-grams with n > 1 retain some of it.

On a realistic corpus, the number of dimensions with bi-grams is usually greater than or equal to the number of dimensions with uni-grams.
As n increases in n-grams, the number of dimensions generally increases as well.
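A quick way to see this growth is to count the unique n-grams in the four toy reviews for n = 1, 2, 3. This is a sketch under the same assumed tokenization (lowercase, letters only); the helper name `ngram_vocabulary` is made up for the example, and the trend on such a tiny corpus is only indicative.

```python
import re

reviews = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    # Assumption: lowercase, letters only.
    return re.findall(r"[a-z]+", text.lower())

def ngram_vocabulary(docs, n):
    # Collect the set of unique n-grams across all documents.
    vocab = set()
    for doc in docs:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n + 1):
            vocab.add(" ".join(tokens[i:i + n]))
    return vocab

# Number of dimensions for uni-grams, bi-grams and tri-grams.
dims = {n: len(ngram_vocabulary(reviews, n)) for n in (1, 2, 3)}
print(dims)
```

On this corpus the counts come out as 12, 17 and 18 dimensions. Very short or repetitive texts can break the pattern, which is why the statement above is a rule of thumb rather than a guarantee.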

ADVANTAGE:
* Very simple to understand and implement.

DISADVANTAGES:
* Bag Of Words leads to a high-dimensional feature vector because of the large size of the vocabulary.
* Bag Of Words leads to a sparse matrix, since each sentence contains only a few of the vocabulary words and so has few non-zero values.
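The sparsity can be measured directly. The sketch below builds the binary matrix for the four toy reviews and computes the fraction of zero entries; the tokenization and variable names are assumptions for illustration.

```python
import re

reviews = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    # Assumption: lowercase, letters only.
    return re.findall(r"[a-z]+", text.lower())

# Sorted vocabulary of all unique words in the corpus.
vocab = sorted({word for review in reviews for word in tokenize(review)})

# One binary row per review, one column per vocabulary word.
matrix = []
for review in reviews:
    present = set(tokenize(review))
    matrix.append([1 if word in present else 0 for word in vocab])

total = len(matrix) * len(vocab)          # all cells in the matrix
nonzero = sum(sum(row) for row in matrix)  # cells equal to 1
sparsity = 1 - nonzero / total             # fraction of zero cells

print(total, nonzero, round(sparsity, 3))
```

Even in this tiny corpus nearly half of the cells are zero. With a real vocabulary of tens of thousands of words and short documents, almost every cell is zero, which is why BOW matrices are normally stored in a sparse format.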
