Google's SMITH algorithm outperforms BERT on long-form text

A recent research paper from Google describes work being done on a Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder designed to match long queries to long content - a task that the BERT algorithm finds difficult.

Quoting from the abstract of the paper:

"In recent years, self-attention based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input...Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048."

In plain English: BERT and models like it use semantic matching to relate the language of a search query to sentences within a page's content. But because the cost of self-attention grows quadratically with input length, they are limited to short passages (512 tokens in BERT's case) and have trouble matching long content to long queries.

Unlike BERT, which is designed to understand words within sentences, SMITH is able to predict further content of a page based on its top content, and also to understand page structure - sections, passages, sentences - and match queries to passages within the entire content of a page.
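To make the scaling argument from the abstract concrete, here is a rough back-of-the-envelope sketch. It is not the paper's implementation: the function names and the block size of 32 tokens are assumptions chosen purely for illustration. It counts how many token pairs a single full self-attention pass has to score, and compares that with a simplified two-level scheme in which attention runs inside fixed-size blocks first and then across one summary per block.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by one full self-attention pass (quadratic in length)."""
    return num_tokens * num_tokens


def hierarchical_attention_pairs(num_tokens: int, block_size: int) -> int:
    """Pairs scored by a simplified two-level scheme (illustrative assumption).

    Level 1: full attention inside each fixed-size block.
    Level 2: full attention across one summary vector per block.
    """
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    within_blocks = num_blocks * block_size * block_size
    across_blocks = num_blocks * num_blocks
    return within_blocks + across_blocks


if __name__ == "__main__":
    # 512 and 2048 are the input lengths quoted in the abstract;
    # block_size=32 is a hypothetical value, not taken from the paper.
    for n in (512, 2048):
        full = full_attention_pairs(n)
        hier = hierarchical_attention_pairs(n, block_size=32)
        print(f"{n} tokens: full={full:,} hierarchical={hier:,}")
```

Under these assumptions, going from 512 to 2048 tokens multiplies the full-attention workload by 16, while the block-wise count grows far more slowly. That is the general intuition behind splitting a long document into smaller units before matching, although the real model involves considerably more than this count suggests.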

The researchers have concluded that SMITH is better than BERT at understanding and matching queries to the content of long pages:

“The experimental results on several benchmark datasets show that our proposed SMITH model outperforms previous state-of-the-art Siamese matching models including HAN, SMASH and BERT for long-form document matching...The SMITH model which enjoys longer input text lengths compared with other standard self-attention models is a better choice for long document representation learning and matching.”

It's unknown at this point whether Google is actually using SMITH in its ranking algorithm. But any tool as promising as this one seems to be will likely be used sooner or later.

However, because BERT and SMITH, by design, have different capabilities, Google will likely continue to use both.



