Google's SMITH algorithm outperforms BERT on long-form text
By David Boggs
11 January 2021

SMITH matches passages within the context of the entire content of a document


A recent research paper from Google describes work being done on a Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder designed to match long queries to long content - a task that the BERT algorithm finds difficult.

Quoting from the abstract of the paper:

"In recent years, self-attention based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input...Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048."

In plain English: BERT and tools like it rely on semantic matching to identify the sentences in website content that relate to the language used in a search query. But because the computational cost of self-attention grows quadratically with input length, they have trouble matching long content to long queries.
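That quadratic cost is easy to see in code. Below is a minimal NumPy sketch (my own illustration, not code from the paper or from Google) of the score matrix at the heart of self-attention: every token attends to every other token, so quadrupling the input from 512 to 2048 tokens makes the matrix sixteen times larger.

```python
# Illustrative sketch only - not Google's or the paper's code.
import numpy as np

def attention_scores(n_tokens, d_model=64, seed=0):
    """Random query/key vectors, one per token; returns the n x n score matrix."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_tokens, d_model))  # query vectors
    k = rng.standard_normal((n_tokens, d_model))  # key vectors
    return q @ k.T / np.sqrt(d_model)             # shape: (n_tokens, n_tokens)

for n in (512, 2048):
    scores = attention_scores(n)
    print(f"{n:>4} tokens -> {scores.shape} score matrix, {scores.size:,} entries")
# 512 tokens -> (512, 512) score matrix, 262,144 entries
# 2048 tokens -> (2048, 2048) score matrix, 4,194,304 entries (16x the work)
```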

Unlike BERT, which is designed to understand words within the context of a sentence, SMITH can predict the later content of a page from its opening content. It also understands page structure - sections, passages, sentences - and matches queries to passages within the entire content of a page.
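The paper's actual architecture isn't reproduced here, but a toy Python sketch can show the two-level "hierarchical" idea the name refers to: encode each small block of sentences on its own, then encode the sequence of block representations into a single document vector. The block and document encoders below are trivial stand-ins for the transformers SMITH actually uses, so only the structure is meaningful.

```python
# Toy sketch of hierarchical (two-level) document encoding - my own illustration,
# not the SMITH implementation. Real SMITH uses transformers at both levels.
import numpy as np

def encode_block(tokens, d=64):
    # Stand-in for the token-level transformer: a hashed, normalized bag of words.
    vec = np.zeros(d)
    for tok in tokens:
        vec[hash(tok) % d] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def encode_document(text, block_size=32, d=64):
    tokens = text.split()  # assumes non-empty text
    # Level 1: split the document into fixed-size sentence blocks, encode each.
    blocks = [encode_block(tokens[i:i + block_size], d)
              for i in range(0, len(tokens), block_size)]
    # Level 2: stand-in for the document-level transformer: average the blocks.
    doc = np.mean(blocks, axis=0)
    norm = np.linalg.norm(doc)
    return doc / norm if norm else doc

def similarity(doc_a, doc_b):
    # Siamese setup: the same encoder on both sides, compared by cosine similarity.
    return float(encode_document(doc_a) @ encode_document(doc_b))
```

The payoff of this structure is that attention at the first level only ever sees one short block at a time, while the second level sees the whole document through a much shorter sequence of block vectors, which is how SMITH stretches the usable input from 512 to 2048 tokens.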

The researchers have concluded that SMITH is better than BERT at understanding and matching queries to the content of long pages:

“The experimental results on several benchmark datasets show that our proposed SMITH model outperforms previous state-of-the-art Siamese matching models including HAN, SMASH and BERT for long-form document matching...The SMITH model which enjoys longer input text lengths compared with other standard self-attention models is a better choice for long document representation learning and matching.”

It's unknown at this point whether Google is actually using SMITH in its ranking algorithm. But a tool as promising as this one will likely be put to use sooner or later.

However, because BERT and SMITH, by design, have different capabilities, Google will likely continue to use both.


If you found this article helpful and would like to see more like it, please share it via the Share This Article link at the top of the page.

And if you have questions or comments, you can easily send them to me with the Quick Reply form below, or send me an e-mail.


David Boggs
David@DavidHBoggs.com

Paper: Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching, https://arxiv.org/abs/2004.12297