An Introduction to Semantic Search

02 December 2021 - 3 mins read time
Tags: Search Engine NLP ML DL Elasticsearch BERT

Last year I was looking at the latest use-case trends in Natural Language Processing and, as usual, I found articles about Transformer-based models everywhere. After learning about Transformers, I started looking at the various areas where they are being used. One application caught my eye: Google Search. As I dug into it further, I became familiar with the phrase ‘Semantic Search’.

Below are some brief notes from my learning.

Search Engine

According to Wikipedia, a search engine is a software system designed to carry out web searches. In other words, it is a program that helps people find the information they are looking for online using keywords or phrases.

Search engines are able to return results quickly even with millions of websites online by scanning the Internet continuously and indexing every page they find.

When a user enters a search term, the search engine looks at the page titles, contents and keywords it has indexed and uses algorithms (step-by-step operations) to produce a list of sites, with the most relevant websites at the top of the list.

Objectives of Search Engines

The main objectives of a search engine are:
• Faster response time for users
• Lower service cost
• Better utilization of knowledge bases

There are mainly two types of searches:

Lexical Search

The original search engines were lexical: they looked for literal matches of the query words without understanding the query’s meaning, and returned only links that contained the exact query terms. They look for these words in home pages, images, URLs, top-level domains/subdomains, meta tags, etc.
Traditional search engines are based on the concept of an inverted index, which maps each keyword to the documents that contain it. The engine looks up the query keywords across the documents stored in a knowledge base, database or document store and returns the documents with the most keyword matches.
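
To make the inverted index idea concrete, here is a minimal sketch in Python. It is only an illustration: the whitespace tokenizer, the function names and the three sample documents are my own assumptions, and real engines add stemming, stop-word handling and better scoring on top of this.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lexical_search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    hits = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Sort by match count (descending), then by doc id for a stable order.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample documents, made up for this example.
documents = {
    "doc1": "how to train a puppy",
    "doc2": "puppy food and puppy toys",
    "doc3": "teaching a young dog new tricks",
}

index = build_inverted_index(documents)
results = lexical_search(index, "train a puppy")
print(results)  # [('doc1', 3), ('doc2', 1), ('doc3', 1)]
```

Note that a query such as "canine" returns nothing here, even though doc3 is clearly about dogs. That gap is exactly what semantic search tries to close.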

Semantic Search

In simple terms, semantic search refers to search algorithms that mine the data based on the meaning of the search query. It not only finds keywords that are lexically similar to the search query but also searches based on the intent and contextual meaning of the query.
Semantic search engines, in addition to searching for keywords, make use of deep learning techniques: dense vectors are used to build the index, and queries are also converted into dense vectors. The distance between the query vector and the indexed vectors is then calculated, and the nearest documents are returned as the relevant results.
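
The nearest-vector step can be sketched as follows. This is a toy example under loud assumptions: the 3-dimensional vectors are hand-made numbers, not the output of any real model (in practice they would come from a Transformer-based encoder such as BERT), and cosine similarity is just one common choice of closeness measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_documents(query_vector, doc_vectors, top_k=2):
    """Return the top_k (doc_id, similarity) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical, hand-made "embeddings"; a real system would compute these
# with an embedding model instead of writing them by hand.
doc_vectors = {
    "puppy training tips": [0.9, 0.1, 0.0],
    "dog obedience classes": [0.8, 0.2, 0.1],
    "stock market news": [0.0, 0.1, 0.9],
}

# Imagined embedding of a query like "how do I teach my dog".
query_vector = [0.85, 0.15, 0.05]

top_matches = nearest_documents(query_vector, doc_vectors, top_k=2)
for doc_id, score in top_matches:
    print(f"{doc_id}: {score:.3f}")
```

The two dog-related documents come out on top even though the query shares no keywords with them, because their vectors point in nearly the same direction as the query vector.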

Components of a Semantic Search Engine System

Corpus - The corpus is the raw, unprocessed data related to a particular topic: an unstructured collection of information.
Indexing - Indexing is the process of generating an index for the documents so that each one can be easily located when needed.
Vectorization - Vectorization is the process of generating a dense vector for each document in the corpus. Together, these vectors form a comprehensive embedding space for the corpus’s domain.
Document Store - Document stores are the core of a search system. They contain the indexed documents, which can be queried by search keywords as well as by the semantic meaning of the query. Elasticsearch is one example.
User Interface - The user interface is where a user enters the query for which relevant information needs to be found.
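
As a rough picture of how these components fit together, here is a small sketch. Everything in it is hypothetical: `InMemoryDocumentStore` is a stand-in I made up for a real document store such as Elasticsearch, and `count_vectorize` is a word-count placeholder for a real embedding model, so it shows the data flow rather than true semantic matching.

```python
class InMemoryDocumentStore:
    """Hypothetical stand-in for a real document store."""

    def __init__(self, vectorize):
        self.vectorize = vectorize  # the vectorization component (pluggable)
        self.texts = {}             # doc_id -> raw text
        self.vectors = {}           # doc_id -> vector (the index)

    def index(self, doc_id, text):
        """Indexing: store the document together with its vector."""
        self.texts[doc_id] = text
        self.vectors[doc_id] = self.vectorize(text)

    def search(self, query, top_k=1):
        """Vectorize the query and return the best-scoring document texts."""
        query_vector = self.vectorize(query)
        scored = []
        for doc_id, vector in self.vectors.items():
            # Dot product as a simple similarity score.
            score = sum(q * d for q, d in zip(query_vector, vector))
            scored.append((score, doc_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.texts[doc_id] for _, doc_id in scored[:top_k]]


# Placeholder vectorizer: counts a few fixed words. A real system would
# call an embedding model here instead.
VOCABULARY = ["search", "engine", "vector", "index", "cooking"]


def count_vectorize(text):
    words = text.lower().split()
    return [words.count(term) for term in VOCABULARY]


# Corpus: the raw documents (made up for this example).
corpus = {
    "a": "a search engine builds an index",
    "b": "cooking pasta at home",
}

store = InMemoryDocumentStore(count_vectorize)
for doc_id, text in corpus.items():
    store.index(doc_id, text)

# The "user interface" is reduced to a single function call here.
answer = store.search("how does a search index work")
print(answer)  # ['a search engine builds an index']
```

Because the vectorizer is passed in as a function, swapping the placeholder for a real model would not require changing the store itself.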