Which letters are friends?

The other day I was wondering. Which letters are friends?

Like, if a word is a party, and letters are guests, which letters want to hang out with each other at the party?

Basically, which letters are you likely to find together in the same words?

Part A: Which letters are friends

I have some intuitions about this at the start. Maybe you do too! What would you guess?

Maybe TR, that seems common. Or QU is an easy pick!

Hey, couldn’t I just google this?

Part B: Just Googling It

The simple version of this question is “Which pairs of letters are the most common?”

This is a really easy question to answer because it comes up all the time in cryptography. You can even look it up on the Wikipedia. It’s TH, HE, and a few others.

But… this doesn’t really answer my question. These letters hang out a lot because they are the most popular letters. They get invited to the most popular words. T and H might not actually like each other, they’re just both in “The”, which is like, the high-school football team of letters.

I want to know how much letters show up together, across all words. In the enormous social melting pot of English, which letters seem to just, enjoy hanging out?

This is not something you can google.

Part C: Setting Some Goals

To find out which letters are friends, we need to know how often they show up in the same word (regardless of how often that word is used).

First, we should find the number of words in which the letters appear together, compared to the total number of words. That means we need a list of words!

Part D: Choosing a Corpus

Choosing a corpus is hard. Also, the word “corpus” is very scary.

My project is very silly, so I’m going to call it a Word Pile.

Part E: Choosing a Word Pile

I can’t use a random sample of text from books or websites, because I don’t want word frequency to matter. I could filter duplicates, but that’s hard and messy (plurals, contractions, sometimes there’s a different e on resume).

I really just need a list of all English words. The trouble with this is that there’s not really a definitive list of English words. Dictionaries exist but they have heaps of arcane obscure scientific jargon, like “Aardvark”. Also they charge money for their lists of words.

I don’t mind including informal language, so I think I have a solution. I’m going to use a Google Ngram dataset of English word frequency. This is great because it’s a list of words from English texts, sorted by how common they are.

There’s also four hundred thousand of them.

Part F: Too Many Words in my Word Pile

400,000 words seems like too many words.

Also, it will include a lot of weird slang, typos, scientific jargon and weird acronyms.

The average person has a vocabulary of around 30,000 words, so I’m going to use the 30,000 most frequent words.
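Trimming the Word Pile could look something like the sketch below. It assumes the words arrive as a list already sorted from most to least common. The function name `build_word_pile`, the lowercasing, and the skipping of duplicates and non-letter entries are assumptions made for illustration, not necessarily what the original code did.

```python
def build_word_pile(words_by_frequency, size=30_000):
    """Keep the first `size` usable words from a frequency-sorted list.

    Assumptions (not from the original post): words are lowercased,
    duplicates are skipped, and anything that is not plain a-z letters
    (hyphens, digits, accented characters) is dropped.
    """
    pile = []
    seen = set()
    for word in words_by_frequency:
        word = word.strip().lower()
        if not (word.isascii() and word.isalpha()):
            continue  # skip "x-ray", "3d", "résumé" and friends
        if word in seen:
            continue  # "The" and "the" are the same guest
        seen.add(word)
        pile.append(word)
        if len(pile) == size:
            break
    return pile
```

With a list like this, the last word in the pile is simply `pile[-1]`.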

I checked out what the 30,000th word was to get a vibe for if it was suitably common. I swear to god I am not making this up, the word was “substring”.

That seems like a good omen so I did some code and now we’re going to run The Calculation.

Part G: Mathematically Defining Friendship

First, we should compare the frequency of the pair to the frequency of the two letters individually. This means we won’t just see the popular letters! For example, X and Y aren’t cool. They’re not invited to many words. But, they could still be friends!

We want to know, **if** letters are invited to a party, who is their plus one?

To calculate this, we can figure out the number of times we’d **expect** two letters to show up together based on how common they are, and compare that to how much they **actually** show up together.

Math time!

For the letters X and Y in a given corpus:

Expected hangs = (count of X in the corpus) * (count of Y in the corpus) / (number of letters in the corpus)

Then we can figure out how much more they are hanging out than if they were just mingling:

Hang differential = count of XY + count of YX - Expected hangs

We can normalise this to find how much more or less than expected the letters show up together as a percentage.

Friendship score = 100 * (hang differential) / (expected hangs)

This gives us a “friendship score” equal to the percentage difference of hangs, with a minimum of -100% if they never hang out.
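The three formulas above can be sketched in a few lines of Python. This is a minimal version under some assumptions: the words contain only the letters a to z, every adjacent pair inside a word counts as one hang, and a double letter like ZZ is counted once rather than twice. The function name `friendship_scores` is made up for this example.

```python
from collections import Counter
from itertools import combinations_with_replacement
import string


def friendship_scores(words):
    """Return {"QU": score, ...}, the friendship score (%) for each letter pair."""
    letter_counts = Counter()
    pair_counts = Counter()
    for word in words:
        word = word.lower()
        letter_counts.update(word)
        # Every pair of neighbouring letters in the word is one "hang".
        pair_counts.update(a + b for a, b in zip(word, word[1:]))

    total_letters = sum(letter_counts.values())
    if total_letters == 0:
        return {}

    scores = {}
    for x, y in combinations_with_replacement(string.ascii_lowercase, 2):
        expected = letter_counts[x] * letter_counts[y] / total_letters
        if expected == 0:
            continue  # at least one of the letters never shows up at all
        observed = pair_counts[x + y]
        if x != y:
            observed += pair_counts[y + x]  # XY and YX both count
        scores[(x + y).upper()] = 100 * (observed - expected) / expected
    return scores
```

As a tiny worked example, take the two-word pile `["quiz", "quit"]`. There are 8 letters, with Q and U appearing twice each, so the expected hangs for QU are 2 * 2 / 8 = 0.5. They actually hang out twice, which gives a friendship score of 100 * (2 - 0.5) / 0.5 = 300.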

Part H: The Secret Lives of Words (And Quiz!)

Which letters do you think are best friends?

Remember, our definition is:

  • Letters which show up adjacent to one another,

  • in common English words,

  • more than you would expect based on their overall frequency.

Have a guess!

The answers are below so here’s some classic best friends to scroll through.

Bill and Ted

Troy and Abed

Woody and Buzz

Part I (The Letter): It’s Q and U

The letters which are best friends are: Q and U!

This isn’t very surprising. Q is not a very common letter and it almost always shows up with U.

Here are the top 5, plus another 5 after that (aka the top 10):

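| Friendship | Friendship score (%) | Expected hangs | Observed hangs |
| --- | --- | --- | --- |
| QU | 2651.58 | 11.81 | 325 |
| ZZ | 1367.6 | 1.84 | 27 |
| JU | 575.24 | 17.03 | 115 |
| IZ | 494.02 | 50.17 | 298 |
| GN | 479.57 | 400.12 | 2319 |
| CK | 477.19 | 92.69 | 535 |
| FF | 445.25 | 42.73 | 233 |
| IV | 436.62 | 202.38 | 1086 |
| EX | 424.48 | 93.81 | 492 |
| EV | 422.98 | 277.07 | 1449 |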

How did you go on the quiz?

Also the entire word QUIZ is in the top 5! Isn’t that fun.

Part J: Analysis aka Letter Gossip

Let’s get into it!

ZZ really surprised me. Double letters are always gonna be less common, because they only have one combination to be counted (ZZ vs IZ + ZI), but Z is just really into self-love! They love a night in with some jaZZ, followed by piZZa and maybe a puZZle.

F also spends a lot of time alone. But, that’s because they can’t get oFF work. Things are diFFerent at the oFFice, and the staFF need to make their oFFers eFFective.

G and N overwhelmingly hang out in the -ing suffix, but they have some cool side hustles with gn- words. They also get their freak on unexpectedly in the middle of a siGNificant number of words.

J and U have a fun dynamic where 95% of the time J stands up against the wall and U chats them up. Don’t JUdge, it’s JUst that parties make them JUmpy. Although, now and then they adJUst to the vibe of the word. They can even mix it up on the dance floor, but only if you play the fUJis (or weirdly, hallelUJah).

Part 12: Letter Enemies

Now we know the best friends.

But I’m a Letter Gossip at heart, and I want to know the lowest friendship scores.

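| Friendship | Friendship score (%) | Expected hangs | Observed hangs |
| --- | --- | --- | --- |
| VW | -92.16 | 25.51 | 2 |
| VV | -93.32 | 29.94 | 2 |
| HH | -93.8 | 112.86 | 7 |
| AA | -97.67 | 1456.79 | 34 |
| UU | -97.81 | 182.9 | 4 |
| II | -98.61 | 1367.95 | 19 |
| JQ | -100 | 1.1 | 0 |
| KQ | -100 | 4.01 | 0 |
| QY | -100 | 6.25 | 0 |
| QZ | -100 | 1.18 | 0 |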

QY, QZ, KQ and JQ fully are not on speaking terms.

In general the vowels did not like their own company.

EE and OO are a bit further up the list, while the others seem to hang out exclusively in acronyms (and people screaming).

Now we know which letters are friends!

Thanks for reading.

Part K: Doubt

But… doesn’t it seem like some of these “friendships” are a little one-sided?

Like, are Q and U actual friends?

If you read the room, it seems like in reality, Q is super awkward and U is the only person who takes pity on them. But U probably has a ton of better friends, right?

In fact, a bunch of our “best friendships” look like a weirdo letter who has picked one of the cool letters, and totally won’t leave them alone.

That’s not friendship at all!

Part L: But Do You “Like” Like Me?

Let’s figure out which letters have genuine, non-toxic two-sided friendships.

Part M: Mathematically Defining Friendship (For Real This Time)

Our new question is, how much does each letter like hanging out with each other letter?

To do this, we need a new one-sided friendship score for each pair. Actually, we need two for each pair, since they only go one direction.

This is pretty easy to calculate. We’ll say the expected amount of adjacency (i.e. friendship) from one letter towards another depends only on the other letter’s overall frequency, then compare that to how often the two actually show up next to each other.

For X’s friendship towards Y:

expectedFriendshipXtoY = (count of Y in dataset) / 26

actualFriendshipXtoY = (count of XY) + (count of YX)

friendshipDifferenceXtoY = (actualFriendshipXtoY - expectedFriendshipXtoY) / expectedFriendshipXtoY

Part N: How To Know If Your Friend (Who is a Letter) is Toxic

With this data, we can figure out a “onesidedness” score for each friendship.

This is a little tricky because we’ve been using percentage differences, but those are hard to compare accurately as values less than 0 are not on the same scale.

The gap between -90% and +500% is more significant than the gap between 500% and 1000%.

We’re going to invert the negative values so that the one-sided friendship score reflects the number of times smaller or bigger than the expected value the friendship actually is.

If the actual number of hangouts is 0, we’re going to give that a score of -15 since that’s about where it tops out.
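Here is one way to read that inversion as code. It is a sketch, not a confirmed copy of the original calculation: it assumes a positive difference is kept as is, while a negative one becomes minus the number of times smaller than expected (so half as many hangs as expected scores -2). The function names and the `never_hang_score` parameter are made up for this example. The `onesidedness` helper assumes the onesidedness score is the absolute gap between the two one-sided scores, which is consistent with the values in the table in Part O (for QU, 20.34 - 0.38 = 19.96).

```python
def one_sided_score(actual, expected, never_hang_score=-15.0):
    """Score one letter's friendship towards another.

    Assumed reading of the post:
      - more hangs than expected: the plain difference (0.5 = 50% more)
      - fewer hangs than expected: minus the "times smaller" factor
      - no hangs at all: the fixed floor of -15 from the post
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if actual == 0:
        return never_hang_score
    difference = (actual - expected) / expected
    if difference >= 0:
        return difference
    return -expected / actual  # e.g. 50 actual vs 100 expected gives -2.0


def onesidedness(score_x_to_y, score_y_to_x):
    """Assumed definition: the absolute gap between the two one-sided scores."""
    return abs(score_x_to_y - score_y_to_x)
```

Inverting the negative side this way puts "ten times fewer" and "ten times more" at a similar distance from zero, which is what makes the two directions comparable.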

Ready to name and shame?

Part O: Uncomfortable Letter Dynamics

The most one-sided friendships are:

| Friendship | Onesidedness | First letter friendship score | Second letter friendship score |
| --- | --- | --- | --- |
| AV | 39.1 | -33.33 | 5.77 |
| QU | 19.96 | 20.34 | 0.38 |
| GL | 17.96 | -20 | -2.04 |
| DL | 16.92 | 0.25 | -16.67 |
| EX | 16.5 | -2.27 | 14.23 |
| OV | 16.44 | -12.5 | 3.94 |
| BR | 15.08 | 2.58 | -12.5 |
| PY | 14.67 | -2 | -16.67 |
| EV | 13.55 | 0.64 | 14.19 |
| IZ | 13.45 | -1.85 | 11.6 |

Wow.

As suspected, a bunch of our “best friends” are actually social parasites.

A absolutely cannot stand V to an unprecedented extent and would rather be literally anywhere else.

U is extremely lukewarm about Q. It’s the least popular of the vowels, but Q is just so keen on it and it’s a bit creepy.

V is a total vowel-chaser, hanging out with A, O and E whenever it gets the chance. E tolerates this because E is friendly with everyone (it even gets along with X), but O is weirded out.

This is so depressing! Aren’t there some really good letter friendships out there?

Part P: Is Letter Love Even Real?

With our current data, we know which letter each letter likes most.

Given this, we can figure out which (if any) letters are best friends with each other!
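Finding mutual best friends is a small lookup problem, sketched below. It assumes the one-sided scores live in a dictionary keyed by `(x, y)`, meaning "how much x likes y". The function name `mutual_best_friends` and the choice to ignore a letter's score with itself are assumptions for illustration.

```python
def mutual_best_friends(scores):
    """Return the pairs of letters that are each other's top pick.

    `scores` maps (x, y) to x's one-sided friendship score towards y.
    """
    best = {}
    for (x, y), score in scores.items():
        if x == y:
            continue  # assumption: your own company doesn't count
        # Keep y if it is the first candidate or beats the current favourite.
        if x not in best or score > scores[(x, best[x])]:
            best[x] = y

    # A pair is mutual when each letter's favourite is the other letter.
    pairs = {
        tuple(sorted((x, y)))
        for x, y in best.items()
        if best.get(y) == x
    }
    return sorted(pairs)
```

The `best` dictionary built along the way is also the edge list of a "who likes whom most" graph, with one arrow leaving each letter.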

I can now reveal the dramatic truth. Are any letters actually best friends?

Here’s some more best friends to build the tension.

Frodo kissing Sam on his sweet little head

Frog and Toad on a tandem bicycle

Part Q: The Big Reveal

There are two pairs of best friends in commonly used English.

They are:

E and R

And…

I and N

Given they have the lower friendship differential, I’m ready to proclaim I and N the strongest friendship of all the letters!

Here’s a directional graph of the best friend network:

[Image: directional graph showing one-way relationships between letters]

Obviously, the vowels are very popular.

R is the best friend of both E and A, joining their respective circles together.

N performs the same role for I and O, and brings its friend G to the party as well.

This feels true, which is a nice feeling to have about statistics.

Now we know which letters are friends!