Zipf’s Law: Breakdown & Application in App Development

Hassan
DataDrivenInvestor

I’d like to preface the main topic of this article with how I came to think about Zipf’s law in the first place. While working on a project, I was creating some seed data for my app, and I realized how important seed data is for seeing the relationships you’ve established in your app at scale. As I built it out, I realized that my join model, PostCategory, wouldn’t really work with the seed data that was being pushed in.

This led me to think about user post validation. What if a user chooses the wrong tags on a post and their post ends up populating the wrong filters? That would potentially create a bad user experience. I thought of creating a set of buzzwords for each category, then sweeping the post title and content for those words. If none of the buzzwords hit, I would alert the user to double-check that they had picked the right category tag. This led me to think: how can I possibly cover all the words that a category can cover?
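The buzzword sweep described above could look something like the following. This is only a minimal sketch in Python: the category names, the keyword sets, and the function names `matches_category` and `category_warning` are all hypothetical, and a real app would derive its keywords from its own data.

```python
import re

# Hypothetical keyword sets, invented purely for illustration.
CATEGORY_KEYWORDS = {
    "cooking": {"recipe", "bake", "oven", "ingredients"},
    "travel": {"flight", "hotel", "itinerary", "passport"},
}


def matches_category(title, content, category):
    """Return True if any keyword for `category` appears in the post text."""
    # Lowercase the text and split it into a set of distinct words.
    words = set(re.findall(r"[a-z']+", f"{title} {content}".lower()))
    return bool(words & CATEGORY_KEYWORDS.get(category, set()))


def category_warning(title, content, category):
    """Return a warning string when no keyword hits, otherwise None."""
    if matches_category(title, content, category):
        return None
    return f"Are you sure this post belongs in '{category}'?"
```

Returning a warning rather than rejecting the post keeps the check advisory, which matches the idea of nudging the user instead of blocking them.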

The answer was Zipf’s Law.

Zipf’s Law is an empirical law formulated using mathematical statistics. It is often described as a discrete counterpart of the continuous Pareto distribution, which underlies the Pareto Principle that I discuss in more depth below. George Zipf, a linguist at Harvard University, wasn’t the first person to observe this law, but he did the most research into it and made it famous. One commonly quoted consequence of the law is that roughly the top 18% of most frequently used words account for about 80% of word occurrences.

Top 20 Most Used Words

  1. The
  2. Of
  3. And
  4. To
  5. A
  6. In
  7. Is
  8. I
  9. That
  10. It
  11. For
  12. You
  13. Was
  14. With
  15. On
  16. As
  17. Have
  18. But
  19. Be
  20. They

The pattern of Zipf’s law is this: the second most used word appears about half as often as the most used word, the third most used word appears about one third as often, and so on, with a word’s frequency roughly inversely proportional to its rank. An example of this can be seen even in the famous play Romeo & Juliet.

Zipf’s law in action
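To check this pattern on a text of your own, you can rank the words by count and compare each count with the idealized prediction, which is the top word’s count divided by the rank. The sketch below assumes a very simple tokenizer (lowercase letters and apostrophes only); the function names and the sample sentence are made up for illustration, and a sentence this short will not follow the law closely.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_prediction(top_count, rank):
    """Idealized Zipf count: the top word's count divided by the rank."""
    return top_count / rank


if __name__ == "__main__":
    # Placeholder text; swap in a full play or novel to see the pattern emerge.
    sample = "the cat and the dog and the bird saw the fox"
    table = rank_frequency(sample)
    top_count = table[0][2]
    for rank, word, count in table[:5]:
        # Columns: rank, word, observed count, idealized Zipf count
        print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

Real texts only follow the prediction approximately, so expect the observed and idealized columns to drift apart, especially on short inputs.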

Zipf’s law is closely tied to the Pareto Principle, which states that roughly 20% of the causes are responsible for 80% of the outcomes. This holds not only in language but in many areas of life, including coding. One of the first things we learn as coders is that the application creation process is 80% planning and 20% coding. The same pattern shows up in the way society is structured: a commonly quoted figure is that 80% of the wealth in the world is owned by 20% of people.

Zipf’s law also touches on the principle of least effort. Life follows the path of least resistance and seeks the most efficient way, and this applies to the evolution of human language too. As speakers, we want to use as few words as possible to convey our ideas, so that communication is as efficient as possible for us. As listeners, we want to hear as many words as possible, so that we have to do less work to understand what is being conveyed. You may notice that the words higher in rank are shorter than the words at the tail end of the curve.

The law also seems to hold for random processes as well as deliberate ones. We definitely choose the topics that we speak about, so why does the data point to this power law time and time again? It seems that Zipf’s law is hardwired into our brains.

Another idea associated with the law is the preferential attachment process, in which something is handed out according to how much is already possessed. This phenomenon helps explain things such as:

  1. Viral views: The more views something has, the more likely it is to be recommended or shown in the news, both of which give it more publicity. It’s a snowball effect.
  2. The usage of words: The easiest words to use were used the most historically, and they end up highest in frequency rank because of that ease of use.
  3. Even small things, such as the path you take walking from your kitchen to your living room: 20% of your carpet will account for 80% of your carpet wear.
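The snowball effect in the list above can be played with in a toy simulation. The sketch below is an assumption-laden model, not a description of how any real recommendation system works: every item starts with one view, and each new view goes to an item with probability proportional to the views it already has. The function name and parameters are hypothetical.

```python
import random


def simulate_preferential_attachment(num_items, num_rounds, seed=0):
    """Hand out one 'view' per round, weighted by the views each item already has."""
    rng = random.Random(seed)
    views = [1] * num_items  # every item starts with a single view
    for _ in range(num_rounds):
        # choices() picks an index with probability proportional to its weight
        winner = rng.choices(range(num_items), weights=views, k=1)[0]
        views[winner] += 1
    # Sort so the most viewed item comes first
    return sorted(views, reverse=True)
```

Running it with something like `simulate_preferential_attachment(10, 1000)` typically gives totals that are far from even, because early leads tend to compound.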

Zipf’s law is just another observation about human behavior. We make applications for other humans, and taking laws such as this one into account helps us design our apps better. In my situation, I could have taken the most common words used in each category and, within a certain degree of accuracy, felt comfortable that the validator method in my app was covering the words that matter most.
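One way to pick those words is to count word occurrences across example posts for a category and keep the top-ranked words until they cover a chosen share of all occurrences. This is a rough sketch under my own assumptions: the function name `words_covering`, the `target_share` parameter, and the simple tokenizer are illustrative, not part of any library.

```python
import re
from collections import Counter


def words_covering(texts, target_share=0.8):
    """Return the top-ranked words that together cover `target_share` of occurrences."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    selected, covered = [], 0
    # Walk the words from most to least frequent, stopping once coverage is reached.
    for word, count in counts.most_common():
        if total == 0 or covered / total >= target_share:
            break
        selected.append(word)
        covered += count
    return selected
```

In practice you would probably want to drop very common function words first (the kind in the top 20 list earlier), since they show up in every category and say little about any one of them.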

The law could also be applied in AI. One could potentially use Zipf’s law as the bias term in an AI algorithm, nudging the weights of a node towards the more likely outcome. It could potentially be applied in a computer vision algorithm to help it correctly predict what it is looking at, not just for reading handwriting but for any area of life where Zipf’s law has been observed.
