Which letters are friends?

The other day I was wondering. Which letters are friends?

Like, if a word is a party, and letters are guests, which letters want to hang out with each other at the party?

Basically, which letters are you likely to find together?

Part 1: Which letters are friends

Hey, couldn’t I just google this?

Part 1A: Can I just google this?

The simple version of this question is “Which letter pairs are the most common?”

This is a really easy question to answer because it comes up all the time in cryptology. You can look it up on Wikipedia. It’s TH, HE, and a few others.

But… this doesn’t really answer my question. These letters hang out a lot because they are the most popular letters, and they get invited to the most popular words. T and H might not actually like each other, they’re just both on the school sports team “The”.

I want to know how much more letters show up together than their popularity alone would predict. In the enormous social melting pot of English, which letters drift towards each other?

This is not something you can google.

Part 1B: Mathematically defining friendship

To find out which letters are friends, we need to know how often they show up in the same words.

We also want to adjust this for the total frequency of the letters. X and Y aren’t invited to many words, but that doesn’t necessarily mean they are not friends.

We want to know that **if** letters are invited to a party, who is their plus one?

To calculate this, we can figure out the number of times we’d **expect** two letters to show up together based on how common they are, and compare that to how much they **actually** show up together.

Math time!

For the letters X and Y in a given corpus:

Expected hangs = (count of X in the corpus) * (count of Y in the corpus) / (number of letters in the corpus)

Then we can figure out how much more they are hanging out than if they were just mingling:

Hang differential = count of XY + count of YX - Expected hangs

We can normalise this to find how much more or less than expected the letters show up together as a percentage.

friendship score = 100*hang differential / expected hangs

This gives us a “friendship score” equal to the percentage difference of hangs, with a minimum of -100% if they never hang out.
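Here is a minimal Python sketch of the scoring above. It assumes the corpus is just a list of words, and the function name `friendship_scores` is made up for illustration; it is not necessarily how the original calculation was coded.

```python
from collections import Counter
from string import ascii_lowercase


def friendship_scores(words):
    """Map each unordered letter pair to (score %, expected hangs, observed hangs)."""
    letter_counts = Counter()
    pair_counts = Counter()
    for word in words:
        # Assumption: anything that is not a-z is simply dropped from the word.
        letters = [ch for ch in word.lower() if ch in ascii_lowercase]
        letter_counts.update(letters)
        # Adjacent letters only; sorting the pair folds XY and YX together.
        for a, b in zip(letters, letters[1:]):
            pair_counts["".join(sorted((a, b)))] += 1

    total_letters = sum(letter_counts.values())
    scores = {}
    present = sorted(letter_counts)
    for i, x in enumerate(present):
        for y in present[i:]:
            expected = letter_counts[x] * letter_counts[y] / total_letters
            observed = pair_counts[x + y]
            score = 100 * (observed - expected) / expected
            scores[x + y] = (score, expected, observed)
    return scores
```

On the tiny two-word pile `["quiz", "jazz"]`, Q and U meet once against an expected 0.125 times, so `friendship_scores(["quiz", "jazz"])["qu"]` comes out at a score of 700, while J and Q never meet and land on -100.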

Part 2: Choosing the corpus

Choosing a corpus is hard. Also, the word “corpus” is very formal.

My project is very silly, so I’m going to switch to “Word Pile”.

Part 2A: Choosing the Word Pile

I can’t use a random clump of text cribbed from a bunch of books or websites, because I don’t want word frequency to matter.

I’m interested in how often letters turn up together in any given word, which means I need some kind of list of words.

The trouble with this is that there’s not really a definitive list of English words. Dictionaries exist but they have heaps of words that most people don’t ever use or think about, and also they don’t just let you download the list of words for free.

I don’t mind including informal language, so I think I have a solution. I’m going to use a Google Ngram dataset of English word frequency. This is great because it’s a list of words from English texts, sorted by how common they are.

There’s also four hundred thousand of them.

Part 2B: Too many words in my Word Pile

400,000 words seems like too many words.

Also, it will include a lot of weird slang, typos, scientific jargon and weird acronyms.

The average person has a vocabulary of around 30,000 words so I’m just going to use the 30,000 most frequent words.

I checked out what the 30,000th word was to get a vibe for if it was suitably common, and I swear to god I am not making this up, the word was “substring”.

That seems like a good omen so I did some code and now we’re going to run The Calculation.
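For the curious, trimming the Word Pile could look something like the sketch below. It assumes the list arrives as lines of text with the word first (optionally followed by a count) and already sorted from most to least frequent. The function name `top_words` and the choice to skip entries containing anything other than the letters a to z are illustrative assumptions, not details of the actual code.

```python
def top_words(lines, limit=30_000):
    """Keep the first `limit` purely alphabetic words from a frequency-sorted listing."""
    words = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0].lower()
        # Assumption: entries with digits, punctuation or accents are skipped.
        if word.isascii() and word.isalpha():
            words.append(word)
        if len(words) == limit:
            break
    return words
```

With a real file this might be called as `top_words(open(path, encoding="utf-8"))`, where `path` points at whatever frequency list you downloaded.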

Part 2: The Secret Lives of Words

What is your guess for which letters are best friends?

Part 2A: Quiz!

Remember, our definition is the letters which show up adjacent to one another in common English words more than you would expect based on their overall frequency.

Have a guess!

The answers are below so I’m putting some images of classic best friends while you make your guess.

Bill and Ted

Troy and Abed

Woody and Buzz

Part 2B: It’s Q and U

The letters which are best friends are Q and U!

This isn’t very surprising. Q is not a very common letter and it almost always shows up with U.

Here are the top 5, plus another 5 after that:

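| Friendship | Friendship score (%) | Expected hangs | Observed hangs |
|---|---|---|---|
| QU | 2651.58 | 11.81 | 325 |
| ZZ | 1367.6 | 1.84 | 27 |
| JU | 575.24 | 17.03 | 115 |
| IZ | 494.02 | 50.17 | 298 |
| GN | 479.57 | 400.12 | 2319 |
| CK | 477.19 | 92.69 | 535 |
| FF | 445.25 | 42.73 | 233 |
| IV | 436.62 | 202.38 | 1086 |
| EX | 424.48 | 93.81 | 492 |
| EV | 422.98 | 277.07 | 1449 |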

How did you go on the quiz? Three of the top 5 are in the word Quizzes!

Part 2C: Letter gossip

Here is some juicy gossip about the letter friends.

ZZ really surprised me. Double letters have a big headwind because they only have one combination to be counted instead of two, but Z is just really into self-love! They love a night in with some Jazz, followed by Pizza and maybe a Puzzle.

F also spends a lot of solo time, but that’s because they can’t get Off work, the Office is just a Different vibe and the Staff need to make their Offers sound Official.

G and N overwhelmingly hang out in the -ing suffix, but have some cool side hustles with gn- words. They also get their freak on unexpectedly in a Significant number of words.

J and U have a fun dynamic where 95% of the time J stands up against the wall and U stands next to them, but every now and then they Adjust to the vibe and even mix it up on the dance floor with Fuji and Fujitsu.

Part 2D: Letter enemies

Now I know the best friends, but I’m a letter gossip at heart, and I want to know the lowest friendship scores:

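| Friendship | Friendship score (%) | Expected hangs | Observed hangs |
|---|---|---|---|
| VW | -92.16 | 25.51 | 2 |
| VV | -93.32 | 29.94 | 2 |
| HH | -93.8 | 112.86 | 7 |
| AA | -97.67 | 1456.79 | 34 |
| UU | -97.81 | 182.9 | 4 |
| II | -98.61 | 1367.95 | 19 |
| JQ | -100 | 1.1 | 0 |
| KQ | -100 | 4.01 | 0 |
| QY | -100 | 6.25 | 0 |
| QZ | -100 | 1.18 | 0 |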

QY, QZ, KQ and JQ are fully not on speaking terms.

In general the vowels did not like their own company. EE is a bit further up the list, while the others seem to hang out exclusively in corporate and government acronyms.

Part 2E: Letter chill bros

The chillest, most “I’ll see you when I see you” letters are N and W.

They get together 152 times out of an expected 152.5. A and I are a little more friendly, and E and Y are a little less. These letters are completely in the acquaintance zone.

Now we know which letters are friends!

Part 2F: Doubt

But… doesn’t it seem like some of these “friendships” are a little one-sided?

Like, are Q and U actual friends?

If you read the room, it seems like Q is just super awkward and U is the only person who takes pity on them - but U probably has a ton of better friends, right?

In fact, a bunch of our “best friendships” look like a weirdo letter who has picked one of the cool letters and totally won’t leave them alone.

That’s not how friends work at all!

Part 3: But do you “like” like me?

Let’s figure out which letters have genuine, non-toxic two-sided friendships.

Part 3A: Mathematically defining one-sided friendships

Our new question is, how much does each letter like hanging out with each other letter?

To do this, we need a new one-sided friendship score for each pairing.

This is pretty easy to calculate. For X and Y:

expectedFriendshipXtoY = (count of X) / 26

actualFriendshipXtoY = (count of XY) + (count of YX)

friendshipDifferenceXtoY = (actualFriendshipXtoY - expectedFriendshipXtoY) / expectedFriendshipXtoY
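As a rough sketch, the one-sided score for a single letter could be computed like this in Python. The function name `one_sided_scores` is hypothetical, and the sketch assumes every word uses only the 26 letters a to z; it illustrates the formulas above and is not the original code.

```python
from collections import Counter
from string import ascii_lowercase


def one_sided_scores(words, letter):
    """Score how much `letter` hangs out with each of the 26 letters."""
    letter_count = 0
    hangs = Counter()
    for word in words:
        word = word.lower()
        letter_count += word.count(letter)
        # Each adjacent pair counts once, whichever side `letter` is on.
        for a, b in zip(word, word[1:]):
            if a == letter:
                hangs[b] += 1
            elif b == letter:
                hangs[a] += 1
    if letter_count == 0:
        raise ValueError(f"{letter!r} never appears in the word list")
    # Expected hangs: the letter's count shared evenly across 26 possible friends.
    expected = letter_count / 26
    return {other: (hangs[other] - expected) / expected for other in ascii_lowercase}
```

Note that the score here is a plain ratio rather than a percentage, so 6.39 means 639% more hangs than expected and -1 means the two letters never meet.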

Part 3B: Who does E like?

Now we have a list of the friends of each letter and a score for each. Here’s the list for E!

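| Friend | Friendship score | Expected hangs | Actual hangs |
|---|---|---|---|
| R | 6.39 | 883.04 | 6526 |
| S | 3.68 | 883.04 | 4136 |
| N | 3.29 | 883.04 | 3786 |
| D | 3.13 | 883.04 | 3645 |
| T | 2.84 | 883.04 | 3388 |
| L | 2.53 | 883.04 | 3121 |
| C | 1.07 | 883.04 | 1825 |
| M | 1.06 | 883.04 | 1820 |
| V | 0.64 | 883.04 | 1449 |
| P | 0.51 | 883.04 | 1334 |
| I | 0.29 | 883.04 | 1142 |
| G | 0.24 | 883.04 | 1098 |
| A | 0.2 | 883.04 | 1061 |
| H | 0.19 | 883.04 | 1053 |
| B | -0.19 | 883.04 | 716 |
| F | -0.31 | 883.04 | 613 |
| W | -0.31 | 883.04 | 611 |
| K | -0.33 | 883.04 | 591 |
| E | -0.4 | 883.04 | 528 |
| X | -0.44 | 883.04 | 492 |
| U | -0.55 | 883.04 | 395 |
| Y | -0.6 | 883.04 | 356 |
| O | -0.74 | 883.04 | 234 |
| Z | -0.74 | 883.04 | 227 |
| J | -0.85 | 883.04 | 136 |
| Q | -0.91 | 883.04 | 76 |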

E’s best friend is R.

Like most letters, E thinks Q is a total weirdo and doesn’t understand why U hangs out with them.

E is unusually good friends with V, and unusually bad friends with J.

Part 3C: How to know if your friendship (with a letter) is toxic

With this data, we can figure out a “onesidedness” score for each friendship.

This is a little tricky because we’ve been using percentage differences, but those are hard to compare accurately as values less than 0 are not on the same scale.

The gap between -90% and +500% is more significant than the gap between 500% and 1000%.

We’re going to invert the negative values so that the one-sided friendship score reflects the number of times smaller or bigger than the expected value the friendship actually is.

If the actual number of hangouts is 0, we’re going to give that a score of -15 since that’s about where it tops out.
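The exact inversion isn’t spelled out here, so the Python sketch below is only one possible reading of it and may not reproduce the published numbers exactly. It assumes positive scores are kept as they are, scores below expectation become minus the number of times smaller than expected, and zero hangs get the fixed -15. The names `adjusted_score` and `onesidedness` are hypothetical, and treating onesidedness as the gap between the two directions is also an assumption, though one that is consistent with the table in Part 3D.

```python
ZERO_HANGS_SCORE = -15.0  # fixed score for pairs that never hang out


def adjusted_score(expected, actual):
    """One-sided score with shortfalls inverted into 'times smaller than expected'."""
    if actual == 0:
        return ZERO_HANGS_SCORE
    if actual >= expected:
        # Same ratio as before: 2.0 means two expected-amounts above expectation.
        return (actual - expected) / expected
    # Assumed inversion: half the expected hangs scores -2, a tenth scores -10.
    return -(expected / actual)


def onesidedness(score_x_to_y, score_y_to_x):
    """Assumed definition: the gap between the two directional scores."""
    return abs(score_x_to_y - score_y_to_x)
```

For example, a letter that shows up next to another half as often as expected gets `adjusted_score(10, 5) == -2.0`, and two directional scores of 5.77 and -33.33 are 39.1 apart.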

Ready to name and shame?

Part 3D: The most uncomfortable letter friendships

The most one-sided friendships are:

| Friendship | Onesidedness | First letter friendship score | Second letter friendship score |
|---|---|---|---|
| AV | 39.1 | -33.33 | 5.77 |
| QU | 19.96 | 20.34 | 0.38 |
| GL | 17.96 | -20 | -2.04 |
| DL | 16.92 | 0.25 | -16.67 |
| EX | 16.5 | -2.27 | 14.23 |
| OV | 16.44 | -12.5 | 3.94 |
| BR | 15.08 | 2.58 | -12.5 |
| PY | 14.67 | -2 | -16.67 |
| EV | 13.55 | 0.64 | 14.19 |
| IZ | 13.45 | -1.85 | 11.6 |

Wow.

As suspected, a bunch of our “best friends” are actually social parasites.

A absolutely cannot stand V to an unprecedented extent and would rather be literally anywhere else.

U is extremely lukewarm about Q. It’s the least popular of the vowels, but Q is just so keen on it and it’s a bit creepy.

V is a total vowel-chaser, hanging out with A, O and E whenever it gets the chance. E tolerates this because E is friendly with everyone (it even tolerates X), but O is weirded out.

This is so depressing! Aren’t there some really good letter friendships out there?

Part 4: Is letter love even real?

With our current data, we know which letter each other letter likes the most. Given this, we can figure out which letters are best friends with each other!
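One way to sketch that check in Python: pick each letter’s top-scoring friend, then keep only the pairs that pick each other. The function name `mutual_best_friends` and the shape of the input (a dictionary of dictionaries of one-sided scores) are assumptions made for illustration, and the example scores are invented so as not to spoil anything.

```python
def mutual_best_friends(scores):
    """Return sorted pairs of letters that are each other's top-scoring friend.

    `scores[x][y]` is assumed to hold x's one-sided friendship score for y.
    """
    # Each letter's best friend is the other letter it scores highest.
    best = {
        x: max(friends, key=friends.get)
        for x, friends in scores.items()
        if friends
    }
    pairs = set()
    for x, y in best.items():
        # Ignore letters whose favourite company is themselves.
        if x != y and best.get(y) == x:
            pairs.add(tuple(sorted((x, y))))
    return sorted(pairs)


# Invented toy scores: x and y pick each other, z picks x but is not picked back.
toy_scores = {
    "x": {"y": 3.0, "z": 1.0},
    "y": {"x": 2.5, "z": 0.5},
    "z": {"x": 4.0, "y": 0.1},
}
print(mutual_best_friends(toy_scores))
```

The intermediate `best` dictionary is also a list of one-way arrows from each letter to its favourite, which is the kind of data a directional best friend graph needs.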

We already know that E is best friends with R, but is R best friends with E? I can now reveal the dramatic truth. Are any letters actually best friends?

Here’s some more best friends to build the tension.

Frodo kissing Sam on his sweet little head

Frog and Toad on a tandem bicycle

Part 5: The Big Reveal

There are two pairs of best friends in commonly used English.

They are:

E and R

And…

I and N

Given they have the lower friendship differential, I’m ready to proclaim I and N the strongest friendship of all the letters. Here’s a directional graph of the best friend network!

Directional graph showing one-way relationships between letters

Obviously the vowels are very popular.

R is the best friend of both E and A, joining their respective circles together.

N performs the same role for I and O, and brings its friend G to the party as well.

This feels true which is a nice feeling to have about statistics. Now we know which letters are friends!