Google estimates that 90% of email is "machine-generated": it follows a structure that is largely consistent, with small variable changes (e.g. a template). Google has many uses for identifying templates and their underlying variables: allowing the Google Assistant to understand flight receipts and power automatic calendar reminders, answering questions about purchase delivery from Amazon receipts, etc.
For our purposes, Google also indicates that the template tags generated by their internal systems (Juicer/Crusher) are used by other teams. Notably, "sales promotions" are called out as a reason to skip the inbox altogether.
We can also presume this "template tagging" is used for Smart Reply generation.
So, understanding how templates are deduced helps us understand the implications for outbound and, ideally, develop methods to optimize.
This is a tough problem (for Google)
Author's Note: The research material is quite dense, and I only have a general grasp of the specifics. To make this easier to grok myself, I'll start with an analogy of sorting marbles, then progress to what I learned about Google's template induction engine.
Imagine you have a jar of marbles, each with a different color, and you want them sorted by similarity. If the number of marbles in the jar is small, and the number of colors is also small, you might take a simple strategy: pick up a marble, compare it to every other marble one by one (noting how similar each pair is), then group the most similar marbles. A common way to score how similar a pair is, is the Jaccard Index: the share of features two items have in common out of all the features either one has.
Comparing every pair this way is useful for limiting false negatives (since every potential grouping is considered), but it scales very poorly. Where n is the number of marbles, the number of comparisons is n × (n − 1) / 2.
For Google to review only 100,000 emails with pairwise Jaccard comparisons, they would need to run nearly 5 billion comparisons (100,000 × 99,999 / 2 = 4,999,950,000). Since the real number of emails is much higher, they use a different method.
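To make the scaling problem concrete, here is a minimal Python sketch of the pairwise approach. The marble names and color sets are made up for illustration, and `jaccard` and `pair_count` are hypothetical helper names, not anything from Google's systems.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical marbles, each described by the set of colors it contains.
marbles = {
    "m1": {"blue", "yellow"},
    "m2": {"blue", "yellow", "red"},
    "m3": {"green"},
}

# Compare every marble with every other marble exactly once.
scores = {
    (x, y): jaccard(marbles[x], marbles[y])
    for x, y in combinations(sorted(marbles), 2)
}

print(scores)
print(pair_count(100_000))  # comparisons needed for 100,000 items
```

Three marbles need only three comparisons, but the count grows roughly with the square of the number of items, which is why this does not work at inbox scale.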
Hashing your marbles
Imagine that, instead of comparing each marble to every other marble, you noted each marble's main color, e.g. Red = r, Blue = b, and so on. Then you note any specks by count and color. So a primarily blue marble with three yellow flecks would be marked "b3y". After repeating this process for each marble, you would have buckets of patterns. Since this process only requires reviewing each marble once, then comparing the condensed features, or hashes, the computation requirements are much lower. Reviewing and sorting 100,000 marbles would require only 100,000 hashes, plus the computation to compare the marbles that share a similar hash.
This is roughly how Google groups emails to determine if they are templates.
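The bucketing idea can be sketched in a few lines of Python. This is only an illustration of the marble analogy, not Google's implementation; `marble_label` and `bucket` are hypothetical names, and the sample marbles are invented.

```python
from collections import defaultdict


def marble_label(main_color: str, flecks: dict) -> str:
    """Build a compact label such as 'b3y' from a main color and fleck counts."""
    # Sort fleck colors so the same marble always yields the same label.
    # Using only the first letter is a simplification: 'blue' and 'black'
    # would collide in a real system.
    parts = [f"{count}{color[0]}" for color, count in sorted(flecks.items())]
    return main_color[0] + "".join(parts)


def bucket(marbles):
    """Group marble names by label in a single pass over the marbles."""
    buckets = defaultdict(list)
    for name, main_color, flecks in marbles:
        buckets[marble_label(main_color, flecks)].append(name)
    return dict(buckets)


marbles = [
    ("m1", "blue", {"yellow": 3}),
    ("m2", "blue", {"yellow": 3}),
    ("m3", "red", {}),
    ("m4", "blue", {"yellow": 2}),
]

buckets = bucket(marbles)
print(buckets)
```

Each marble is examined once, so the work grows linearly with the number of marbles rather than with the number of pairs.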
But say the color flecks also have their own properties, like size and shape. Generating a complete hash of all these properties would create hashes so unique that few would match completely. You still want to match the primarily blue marbles with three yellow flecks, even if one of those flecks is more oblong.
What you may do then, is generate miniature hashes. By rotating the marble and hashing only the visible portion, then repeating multiple times, you will have several hashes for each marble.
- Front: b3y-flat
- Back: b2y-flat1y-oblong
By comparing all the Front hashes with each other, and then all the Back hashes, you can quickly find even "mostly" similar marbles without comparing individual marbles. For instance, while the b2y-flat1y-oblong Back hash is unlikely to match the other b3y-flat marbles, the Front hash of that same marble would.
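A small Python sketch of the multiple-hash idea follows. It assumes each marble has exactly one label per side, as in the Front/Back example; the marble names, the third marble, and the `candidate_pairs` helper are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical marbles with one label per visible side.
marbles = {
    "m1": {"front": "b3y-flat", "back": "b3y-flat"},
    "m2": {"front": "b3y-flat", "back": "b2y-flat1y-oblong"},
    "m3": {"front": "r1g-flat", "back": "r1g-flat"},
}


def candidate_pairs(marbles):
    """Return pairs of marbles that share a label on at least one side."""
    buckets = defaultdict(set)
    for name, views in marbles.items():
        for side, label in views.items():
            # Key on (side, label) so Front labels only meet other Front labels.
            buckets[(side, label)].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


pairs = candidate_pairs(marbles)
print(pairs)
```

Here m1 and m2 end up as a candidate pair through their matching Front labels even though their Back labels differ, while m3 is never compared with either of them.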
Google reads your template, before reading the email
Now, if Google created hashes from the full content of every email, the resulting hashes would be incredibly diverse, like marbles where the hash label is b3y5r8p2b9i. While this would ensure the highest number of templates are deduced, it introduces additional computation constraints at Google scale.
So, Google's Juicer system does not read the content of emails first. Instead, it generates a hash of the structure of your email.
How you see an email (the text):
Hi Kyle,
This is the text of that email.
It's fairly simple.
- Kyle
How your email client sees that email (HTML):
<div> <span style="font-family:'Helvetica Neue','Liberation Sans',Arial,'sans serif';font-size:13px">Hi Kyle, </span> </div>
<div> <br> </div>
<div> <span style="font-family:'Helvetica Neue','Liberation Sans',Arial,'sans serif';font-size:13px"> This is the text of that email. </span> </div>
<div> <br> </div>
<div> <span style="font-family:'Helvetica Neue','Liberation Sans',Arial,'sans serif';font-size:13px">It's fairly simple. </span> </div>
<div> <br> </div>
<div> <span style="font-family:'Helvetica Neue','Liberation Sans',Arial,'sans serif';font-size:13px"> - Kyle</span> </div>
<img src="https://dogpatchadvisors.oramalthea.com/api/mailings/opened/PMRGSZBCHI2DALBCN5ZGOIR2EI2TOYJQGYYTSMZNGUZDOYJNGQ2TIZJNMI2GCNZNGMYGINBUGEIOBVHEYCELBCOZSXE43JN5XCEORCGQRCYITTNFTSEORCNV3WY3SPMZHXKQSBGBYGIQ2FNJWTQUTUMNKUSMSNPBYXS5KYGMYEGTKTOJ3TMTSMORBHOPJCPU=======.gif" alt="" width="1" height="1">
How Gmail's template engine views the email (HTML structure):
/div/span/style /div/br /div/span/style /div/br /div/span/style /div/br /div/span/style /img
In the final example, all specific content is removed, and only a map of the content structure remains. Google then hashes this structure to determine whether the email is part of a template and whether it should be investigated further.
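As a rough illustration of hashing structure instead of content, the Python sketch below strips an HTML email down to its tag paths and hashes the result. It is a simplification under stated assumptions: it ignores attributes such as style, uses SHA-256 only because it is readily available, and the names `StructureParser`, `structure_paths`, and `structure_hash` are hypothetical. The source does not describe Juicer's actual representation or hash function.

```python
import hashlib
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class StructureParser(HTMLParser):
    """Collect the tag path of every element, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.paths.append("/" + "/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


def structure_paths(html: str) -> list:
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.paths


def structure_hash(html: str) -> str:
    """Hash the ordered list of tag paths; the text content never enters the hash."""
    joined = " ".join(structure_paths(html))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


email_a = "<div><span>Hi Kyle,</span></div><div><br></div><div><span>- Kyle</span></div>"
email_b = "<div><span>Hi Dana,</span></div><div><br></div><div><span>- Sam</span></div>"
email_c = "<div><p>Hi Dana,</p></div>"

print(structure_paths(email_a))
print(structure_hash(email_a) == structure_hash(email_b))
print(structure_hash(email_a) == structure_hash(email_c))
```

Two emails with different wording but identical markup produce the same hash, while a change in markup produces a different one.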
Only sorting marbles when we have enough...
In the contrived marble analogy, you might decide to investigate marbles only when a certain number of them share a given label. So, as you drop your marbles into buckets, you only name a bucket once it holds a good number of marbles.
Google does the same thing, both to reduce computation overhead and to increase the utility of the system (labeling a template sent to two people has little use for determining flight patterns). Privacy is another reason: the presumption is that once a template has been sent to a threshold number of recipients, the variables will hold all of the personal data, so the template itself will be less concerning from a privacy perspective.
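The thresholding step might look something like the sketch below. The `min_count` value is purely hypothetical, since the source gives no actual threshold, and the short strings stand in for real structure hashes.

```python
from collections import Counter


def named_templates(structure_hashes, min_count):
    """Return only the hashes seen at least min_count times, with their counts."""
    counts = Counter(structure_hashes)
    return {h: c for h, c in counts.items() if c >= min_count}


# Placeholder hashes for six observed emails.
observed = ["aaa", "bbb", "aaa", "ccc", "aaa", "bbb"]

templates = named_templates(observed, min_count=3)
print(templates)
```

Only the structure seen three times is treated as a template; the rarer structures are left alone.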
What does this mean for us?
Potentially, this means that despite additional merge tags, Google may simply view our emails as templates sent to fewer recipients. Thus, we should aim to increase both the variation density (how much of the message changes) and the variation diversity (the difference and frequency of merge tags and their combinations). For instance, if we have 3 merge tags with 3 variations each, the potential variation diversity is 27 (3 × 3 × 3). But if there are correlations that cause merge tag values to appear together, the variation diversity is reduced.
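The effect of correlation on variation diversity can be checked with a quick Python sketch. The tag names and values are invented, and the pairing rule is an assumed example of a correlation: three independent tags with three values each allow 3 × 3 × 3 = 27 combinations, but tying one tag to another cuts that down.

```python
from itertools import product

# Hypothetical merge tags, each with three possible values.
tags = {
    "greeting": ["Hi", "Hello", "Hey"],
    "opener": ["A", "B", "C"],
    "cta": ["X", "Y", "Z"],
}

# Every combination when the tags vary independently.
all_combos = list(product(*tags.values()))

# Assumed correlation: the call to action is fully determined by the opener.
pairing = {"A": "X", "B": "Y", "C": "Z"}
correlated = [combo for combo in all_combos if combo[2] == pairing[combo[1]]]

print(len(all_combos))   # independent tags
print(len(correlated))   # cta tied to opener
```

With the call to action locked to the opener, only 9 of the 27 combinations can ever appear, so the emails look more alike than the number of merge tags suggests.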