# Cosine Similarity and SEO

## Overview

Cosine similarity in SEO is a metric used to measure how similar documents are, irrespective of their size. Mathematically, it calculates the cosine of the angle between two vectors projected in a multi-dimensional space.

This is useful because even when two similar documents are far apart by Euclidean distance (because of their sizes), chances are they will still be oriented close together. The smaller the angle, the higher the cosine similarity.

**By the end of this article, you will know:**

- What is cosine similarity, and how does it work?
- How do you compute document similarity in Python?
- What is soft cosine similarity, and how does it differ from cosine similarity?
- When should you use soft cosine similarity, and how do you calculate it in Python?

**Introduction:**

A common way to match similar documents is to count the number of common words between them. But this approach has an inherent flaw.

As document size increases, the number of common words tends to increase even when the documents talk about entirely different topics. The cosine of the angle helps overcome this fundamental flaw of the 'count-the-common-words' or Euclidean distance approach.

**What is Cosine Similarity, and why is it beneficial?**

Cosine similarity is used to determine how similar documents are, irrespective of their size. Mathematically, cosine similarity measures the cosine of the angle between two vectors projected in a multi-dimensional space.

In this context, the two vectors are arrays containing the phrase counts of documents.

**How does cosine similarity in SEO differ from the number of common words as a similarity metric?**

Raw common-word counts grow with document length: two long documents can share many words even when they cover different topics. Cosine similarity, by contrast, compares the orientation (the angle) of the document vectors rather than their magnitude, so document length does not inflate the score.


**Cosine similarity examples:**

Let's assume you have three documents about a couple of star cricket players – Sachin Tendulkar and Dhoni.

Two of the documents, (A) and (B), are from the Wikipedia pages of the respective players, and the third document (C) is a smaller snippet from Dhoni's Wikipedia page.

As you can see, all three documents share a common theme – the game of cricket. Our goal is to quantitatively estimate the similarity between the documents.

For ease of understanding, let's consider only the top three common words among the documents: 'Dhoni', 'Sachin' and 'Cricket'. You would expect Doc B and Doc C (the two documents on Dhoni) to have a higher similarity than Doc A and Doc B, because Doc C is essentially a snippet from Doc B itself.

However, if we go by the raw number of common words, the two larger documents will share the most words and could be judged the most similar, which is exactly what we want to avoid.

The results are more consistent if we use the cosine similarity score to evaluate the similarity. As you include more words from the documents, it becomes harder to visualize the higher-dimensional space, but you can still calculate cosine similarity using the vector formula:
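For two word-count vectors **A** and **B** with n entries, the formula is:

```latex
\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}
             = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \; \sqrt{\sum_{i=1}^{n} B_i^2}}
```

Since word counts are never negative, the score ranges from 0 (no words in common) to 1 (identical orientation).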

**How to Compute Cosine Similarity in Python?**

We have the following three texts:

- Doc Trump (A): Mr Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin.
- Doc Trump Election (B): President Trump says Putin had no political interference in the election outcome. He says it was a witch hunt by political parties. He claimed President Putin is a friend who had nothing to do with the election.
- Doc Putin (C): Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career.

Since Doc B has more in common with Doc A than with Doc C, I would expect the cosine similarity between A and B to be greater than between C and B.

```python
# Define the documents
doc_trump = ("Mr Trump became president after winning the political election. "
             "Though he lost the support of some republican friends, "
             "Trump is friends with President Putin.")

doc_election = ("President Trump says Putin had no political interference in "
                "the election outcome. He says it was a witch hunt by political "
                "parties. He claimed President Putin is a friend who had "
                "nothing to do with the election.")

doc_putin = ("Post elections, Vladimir Putin became President of Russia. "
             "President Putin had served as the Prime Minister earlier "
             "in his political career.")

documents = [doc_trump, doc_election, doc_putin]
```

To compute the cosine similarity, you need the word counts of the words in each document. The CountVectorizer or the TfidfVectorizer from scikit-learn does this for us. The result is a sparse matrix. Below, I optionally convert it to a pandas DataFrame to view the word frequencies in tabular format.

```python
# Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create the Document-Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)

# OPTIONAL: convert the sparse matrix to a pandas DataFrame
# to see the word frequencies in tabular format
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  # use get_feature_names() on scikit-learn < 1.0
                  columns=count_vectorizer.get_feature_names_out(),
                  index=['doc_trump', 'doc_election', 'doc_putin'])
```
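The listing above builds the document-term matrix but stops short of the similarity computation itself. A minimal sketch of that last step, using scikit-learn's `cosine_similarity` on the same three documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    # Doc Trump (A)
    "Mr Trump became president after winning the political election. "
    "Though he lost the support of some republican friends, "
    "Trump is friends with President Putin.",
    # Doc Trump Election (B)
    "President Trump says Putin had no political interference in the "
    "election outcome. He says it was a witch hunt by political parties. "
    "He claimed President Putin is a friend who had nothing to do with the election.",
    # Doc Putin (C)
    "Post elections, Vladimir Putin became President of Russia. "
    "President Putin had served as the Prime Minister earlier in his political career.",
]

# Build the document-term matrix of word counts
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)

# Pairwise cosine similarity between the three document vectors (3x3 matrix)
sim = cosine_similarity(sparse_matrix)
print(sim.round(2))
```

As expected, the A–B pair scores highest. Note that `CountVectorizer` does no stemming, so 'election' and 'elections' count as different words.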

**Soft Cosine Similarity in SEO:**

Suppose you have another set of documents on a completely different topic, say 'food'. You want a similarity metric that gives higher scores to documents belonging to the same topic and lower scores when comparing documents from different topics.

In such a case, we need to take the semantics into account: words that are similar in meaning should be treated as similar.

For example, 'President' vs 'Prime Minister', 'Food' vs 'Dish', and 'Hi' vs 'Hello' should be considered similar. Converting the words into their respective word vectors and then computing the similarities addresses this problem.
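As an illustration of the idea (the vocabulary, the word-similarity matrix `s`, and the count vectors below are made up for this sketch, not taken from a trained word-vector model), soft cosine can be computed directly from its definition:

```python
import numpy as np

# Toy vocabulary: 'president' and 'prime_minister' are declared 80% similar.
# In practice this matrix would come from word-embedding similarities.
vocab = ['president', 'prime_minister', 'election']
s = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])   # pairwise word-similarity matrix

a = np.array([2, 0, 1])  # hypothetical word counts of doc 1
b = np.array([0, 2, 1])  # hypothetical word counts of doc 2

def cosine(a, b):
    """Ordinary cosine similarity: only exact word matches count."""
    return (a @ b) / (np.sqrt(a @ a) * np.sqrt(b @ b))

def soft_cosine(a, b, s):
    """Soft cosine: cross-terms weighted by word-to-word similarity."""
    return (a @ s @ b) / (np.sqrt(a @ s @ a) * np.sqrt(b @ s @ b))

print(round(cosine(a, b), 2))        # 0.2
print(round(soft_cosine(a, b, s), 2))  # 0.84
```

Plain cosine sees only the one shared word ('election') and scores 0.2, while soft cosine also credits the 'president'/'prime_minister' overlap and scores 0.84. With real documents you would derive `s` from word embeddings (for example, via gensim) rather than hand-coding it.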

**Conclusion:**

Now you should understand the mathematics behind cosine similarity and why it is more effective than magnitude-based metrics like Euclidean distance. Soft cosine can be an excellent option if you need a similarity metric that helps with clustering documents. If you want to dig further into natural language processing, genesis Education comes highly recommended.

credits: https://www.machinelearningplus.com/nlp/cosine-similarity/
