Telling Stories With Data: A Simple Math Guide

Nare K.
DataDrivenInvestor

--

Journalism is about reporting facts in a way that helps people understand the issues that matter. Now that open source data and digital tools that make data analysis fast and easy are proliferating, data journalism is becoming increasingly relevant and necessary.

But what exactly is data journalism?

Data journalism is all about telling stories using numbers. It is about carefully looking for and finding patterns. It is about finding both anecdotes and evidence that goes beyond anecdotes. A data journalist’s role is to identify stories, add reliability to them and bring data to life.

As Adrian Holovaty, one of the founding fathers of data journalism, put it:

“Data journalism is about creating structure from chaos.”

Nare Krmoyan ©

Remember: just as sound math enlivens your stories and makes them more reliable, mathematical mistakes harm your credibility. The good news is that newsroom math is easy: add, subtract, multiply, divide and conquer the world with your stories!

Below I am briefly going over some simple mathematical and statistical concepts that might be helpful for anyone who wishes to leverage the power of data to come up with more compelling stories. More specifically, I cover how to

  • Calculate and use percentage change, medians, rates, ranges, averages, quartiles.
  • Use standard deviation to identify outliers.
  • Know what correlation and linear regression are.

1. Percentage change

Calculating percentage change helps you compare a new number to an old number.

Formula: (new - old) / old

Example:

In a village, 25 accidents were reported this year and 32 were reported last year.

Change = (25 - 32) / 32 = -0.22
The number of accidents has decreased by about 22%.
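
For anyone who prefers code to a calculator, here is a minimal Python sketch of the same calculation. The function name percentage_change is my own choice for illustration. Note that the old number (32, last year's count) goes in the denominator, as in the formula above.

```python
def percentage_change(old, new):
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        raise ValueError("percentage change is undefined when the old value is 0")
    return (new - old) / old * 100


# Figures from the village example: 32 accidents last year, 25 this year.
change = percentage_change(old=32, new=25)
print(f"{change:.1f}%")  # -21.9%
```

A negative result means a decrease, so the number of accidents fell by roughly 22%.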

2. Rates

Calculating rates allows you to compare populations or samples of different sizes.

Formula: events / population * ‘per’ unit

Example:

In a city with a population of 248,000, 37 suicides were reported.
Rate per 100,000 = 37 / 248,000 * 100,000 = 14.9
The suicide rate was around 15 per 100,000 people.
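
The same rate calculation can be sketched in a few lines of Python. The function name rate_per and its default "per" unit of 100,000 are assumptions I made for this example.

```python
def rate_per(events, population, per=100_000):
    """Return the number of events per `per` people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events / population * per


# Figures from the city example: 37 incidents, population 248,000.
city_rate = rate_per(events=37, population=248_000)
print(round(city_rate, 1))  # 14.9
```

Because the result is expressed per 100,000 people, it can be set side by side with the rate of a much larger or much smaller city.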

3. Univariate statistics

Besides simple arithmetic, data journalists might also consider applying some statistical calculations.

Descriptive statistics is about taking a single variable in a collection of data and describing its characteristics. Both measures of centrality and measures of variability can be insightful.

Measures of Centrality

  • Mean (average): The total of the values, divided by the number of values.
  • Median: The middle value of an ordered list. If the list has an even number of values, take the mean of the two middle values.
  • Mode: The most common value.
  • Outliers: Atypical values far from the center of the data. These are not a measure of centrality themselves, but this might be where a story is: where things are different from the average.

Variability

  • Maximum and minimum: largest and smallest values.
  • Range: the distance between the maximum and minimum.
  • Quartiles: the medians of each half of the ordered list of values. Halfway down from the median is the first quartile. Halfway up from the median is the third quartile.
  • Standard deviation: roughly, the typical distance of the values from the mean (technically, the square root of the average squared distance from the mean). [Here is a simple free online standard deviation calculator]

Example:

Test scores of 5 students were 98; 75; 81; 75; 70

Mean: (98 +75 +81 +75 +70)/5 = 79.8. The average score for the particular test is 79.8

Median: first, let’s order the list 70; 75; 75; 81; 98. We now see that the middle value is 75.

Mode: Two students received a score of 75, so it is the mode.

Outliers: 98 looks like an atypically high score, since it is farther from the median than any of the other scores. Whether it is a true outlier takes a further check using the standard deviation.

Maximum and Minimum: 98 is the largest value and 70 is the smallest.

Range: 98 - 70 = 28
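
If you would rather let a computer do the work, the numbers above can be reproduced with Python's built-in statistics module. This is only a sketch: the variable names are mine, and I chose the sample standard deviation (stdev, which divides by n - 1); statistics.pstdev gives the population version instead.

```python
import statistics

scores = [98, 75, 81, 75, 70]

mean = statistics.mean(scores)           # 79.8
median = statistics.median(scores)       # 75
mode = statistics.mode(scores)           # 75
value_range = max(scores) - min(scores)  # 28

# Quartiles: with the default "exclusive" method, Q1 and Q3 here are the
# medians of the lower half (70, 75) and the upper half (81, 98).
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Sample standard deviation, about 10.9 for these five scores.
spread = statistics.stdev(scores)

print(mean, median, mode, value_range)  # 79.8 75 75 28
print(q1, q3)                           # 72.5 89.5
print(round(spread, 1))                 # 10.9
```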

Stories are found in outliers. Standard deviation helps determine whether a value is in fact a true outlier. A common rule of thumb treats a value as an outlier if it lies more than 3 standard deviations from the mean.
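
Here is a minimal sketch of that 3-standard-deviation check in Python, applied to the test scores from the example. The function name find_outliers and the use of the sample standard deviation are my own assumptions; other definitions of "outlier" exist.

```python
import statistics


def find_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # all values identical, so nothing can stand out
    return [v for v in values if abs(v - mean) / sd > threshold]


scores = [98, 75, 81, 75, 70]

# How many standard deviations is the top score from the mean?
z_98 = (98 - statistics.mean(scores)) / statistics.stdev(scores)
print(round(z_98, 2))         # 1.67
print(find_outliers(scores))  # []
```

So although 98 looks high, it is under 2 standard deviations from the mean and is not flagged. With only five values the rule is hard to trigger at all; it becomes more useful on larger data sets.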

*Note: A lot of real-world data is approximately normally distributed. This means that if we take our data points (read: the numbers we have) and plot how often each value occurs (read: draw a histogram), we get a bell-shaped curve with its peak in the middle, at the mean. The wider the curve, the greater the standard deviation.

One useful method for assessing the “normality” of data is the Empirical Rule, which says that in normally distributed data

  • About 68% of values fall within 1 standard deviation of the mean
  • About 95% of values fall within 2 standard deviations of the mean
  • About 99.7% of values fall within 3 standard deviations of the mean
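
To see the Empirical Rule in action, the sketch below generates 10,000 simulated values from a normal distribution and counts how many fall within 1, 2 and 3 standard deviations of the mean. The data is simulated purely for illustration (the mean of 100 and standard deviation of 15 are arbitrary choices of mine), and share_within is a hypothetical helper name.

```python
import random
import statistics


def share_within(values, k):
    """Return the fraction of values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)


random.seed(42)  # fixed seed so the simulation is repeatable
simulated = [random.gauss(100, 15) for _ in range(10_000)]

shares = {k: share_within(simulated, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```

The printed shares should land close to 68%, 95% and 99.7%. If your own data gives very different shares, it is probably not normally distributed.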

*Note: Don’t confuse the mean with the median. The average (the mean) is pulled toward outliers, whereas the median is barely affected by them.
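
A quick way to see the difference is to introduce an extreme value and watch what happens to each measure. In the sketch below, the second list is a made-up scenario in which the top test score was mistyped as 980 instead of 98.

```python
import statistics

scores = [98, 75, 81, 75, 70]
# Hypothetical data-entry error: the top score typed as 980 instead of 98.
scores_with_typo = [980, 75, 81, 75, 70]

mean_before = statistics.mean(scores)
median_before = statistics.median(scores)
mean_after = statistics.mean(scores_with_typo)
median_after = statistics.median(scores_with_typo)

print(mean_before, median_before)  # 79.8 75
print(mean_after, median_after)    # 256.2 75
```

One extreme value more than triples the mean, while the median does not move.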

4. Multivariate statistics

One concept in multivariate statistics is correlation, which is used to understand the relationship between variables in the data. One common measure of correlation is Pearson’s r, which describes the strength and direction of the linear relationship between a pair of variables. It is a number that ranges from -1 to 1.

To keep things simple, I will not scare anyone with formulas that look complicated (I say ‘look’ because in reality these formulas are simple and easy to understand, but fancy notation makes them look incomprehensible). Instead, here is a link to an online Pearson’s r calculator, which you can use to skip the tedious calculations and get the number in a few seconds. Once you have the number, you have to interpret it. A positive r means that as one variable goes up, the other tends to go up. A negative r means that as one variable goes up, the other tends to go down. The closer r is to 1 or -1, the stronger the linear relationship; an r near 0 means little or no linear relationship.
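
For readers curious about what such a calculator does behind the scenes, here is a minimal Python sketch of Pearson's r. The function name pearson_r is my own, and the hours-studied and test-score figures are invented purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return co_movement / math.sqrt(spread_x * spread_y)


# Invented example data: hours studied and the resulting test score.
hours = [1, 2, 3, 4, 5]
test_scores = [52, 60, 57, 70, 74]

r = pearson_r(hours, test_scores)
print(round(r, 2))  # 0.93
```

An r of about 0.93 is a strong positive correlation: in this made-up data, students who studied more tended to score higher.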

*Note: Correlation does not imply causation. Even skilled statisticians sometimes make the mistake of confusing correlation with causation. A correlation that arises by coincidence or through a hidden third factor, with no causal link between the two variables, is called a “spurious correlation”. In fact, spurious correlations in themselves frequently result in super engaging journalistic stories. Feel free to check out some funny examples.

Another useful concept in multivariate statistics is linear regression. It is a method for predicting the value of a dependent variable from the value of an independent variable by fitting a straight line through the data. Just as with correlation, the mathematical formula might look more complex than it is. So, journalists might again use an online calculator, or use the data analysis add-ons in Excel, Google Sheets, or any other spreadsheet program.

*Note: With linear regression, predictions are based on historical data. The assumption is that the future will most probably resemble the past. Hence, before telling stories with certainty, journalists should remember to look at the big picture and assess whether that assumption holds.
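
As a rough illustration of what the spreadsheet add-ons compute, the sketch below fits a straight line using ordinary least squares, the most common form of simple linear regression. The function name fit_line and the hours-studied data are invented for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    if spread_x == 0:
        raise ValueError("cannot fit a line when x never changes")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / spread_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented example data: hours studied (independent) and test score (dependent).
hours = [1, 2, 3, 4, 5]
test_scores = [52, 60, 57, 70, 74]

slope, intercept = fit_line(hours, test_scores)
predicted = intercept + slope * 6  # predicted score for 6 hours of study

print(round(slope, 1), round(intercept, 1))  # 5.4 46.4
print(round(predicted, 1))                   # 78.8
```

In this made-up data, each extra hour of study goes with about 5.4 more points. The prediction for 6 hours lies outside the range of the data, which is exactly where the "future resembles the past" assumption deserves the most scrutiny.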

The Bottom Line: Data journalism is indeed an exciting branch of journalism. Finding patterns and deviations from those patterns, using numbers to make bold statements, detecting fraud — all of these are foundations for “breaking stories” that media professionals adore. Although some journalists have slight “math apprehension”, everyone is capable of grasping and using basic concepts. In my post I tried to lightly introduce how to

  • Calculate and use percentage change, medians, rates, ranges, averages, quartiles.
  • Use standard deviation to identify outliers.
  • Know what correlation and linear regression are.

Hope it was helpful.
