5 Mistakes to Avoid When Selecting a News API

By Ran Geva
Published in DataDrivenInvestor
5 min read · Jan 12, 2021

Thanks to digital news publications, millions of news articles are now published every day. For many organizations, selecting the right news API can help keep track of it all.

The ability to efficiently collect news from the web is critical. The need to collect news data typically falls into two distinct categories:

  • News scraper: Sometimes the data needed is specific or small in volume. For example, you might need to collect data from specific sites, or specific data from a variety of sites. In these cases, a solution like ScrapingHub lets you manage the parsing and structuring of the data yourself.
  • On-demand news API: On the other hand, sometimes you need a lot of data. For example, you may want all news articles in English with the keyword “bitcoin” that had high social engagement within the last 30 days. In that case, you’ll probably want to use a service like Webhose, where the crawling, scraping, and structuring of data are done for you. The data is then stored in a repository or database so that it can be searched.
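To make the “bitcoin” example concrete, here is a minimal sketch of the filter such a query expresses, run against records already held in memory. The `Article` fields, the `shares` engagement count, and the `min_shares` threshold are assumptions made for illustration; they are not the schema or query syntax of Webhose or any other provider.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Article:
    title: str
    language: str
    published: datetime
    shares: int  # hypothetical social-engagement count


def search(articles, keyword, language="english", min_shares=100, days=30, now=None):
    """Return articles matching keyword, language, engagement, and recency."""
    now = now or datetime.now(timezone.utc)
    since = now - timedelta(days=days)
    return [
        a for a in articles
        if a.language == language
        and keyword.lower() in a.title.lower()
        and a.shares >= min_shares
        and a.published >= since
    ]
```

With a hosted news API, the same conditions would typically be sent as query parameters, and the provider would run the filter over its own repository.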

But sometimes organizations need a bit more than just news data. Your organization might need enriched data as well. For example, you may need to create advanced queries to gather the exact news you’re after (e.g. by person, organization, or location — or a combination of all three). Or you need to aggregate and analyze the data for the purpose of delivering insights. Enriching the data yourself will take time and resources away from your insights, so you’d rather someone else do it for you. Another possibility is that you want to build or refine your machine learning models on top of enriched data.
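As a rough illustration of what enriched data makes possible, the sketch below filters articles by a combination of person, organization, and location. It assumes each record carries an `entities` dictionary with `persons`, `organizations`, and `locations` lists; that layout is hypothetical and stands in for whatever an enrichment service actually returns.

```python
def entity_match(article, persons=(), organizations=(), locations=()):
    """True if the article mentions every requested entity (case-insensitive)."""
    entities = article.get("entities", {})
    wanted = {
        "persons": persons,
        "organizations": organizations,
        "locations": locations,
    }
    return all(
        {name.lower() for name in names}
        <= {found.lower() for found in entities.get(kind, [])}
        for kind, names in wanted.items()
    )
```

Without the enrichment step, the same question would have to be answered by keyword matching on raw text, which cannot tell a person from a place with the same name.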

So keeping these distinct needs in mind, let’s review a few mistakes to avoid when selecting a news API.

1) It’s not comprehensive

For brands that need to conduct constant competitor analysis of dozens or even hundreds of products simultaneously, selecting a news API with comprehensive news coverage is paramount. That’s also true for media and web monitoring companies that need to keep up with the endless flow of information being produced every minute of the day. Financial management companies and other enterprise-level companies also rely heavily on comprehensive, high-quality news data feeds to develop accurate artificial intelligence (AI) or machine learning (ML) algorithms.

But many news APIs don’t cover the huge number of news articles published online every minute. They also might not cover specific niche sites. Take Google’s Programmable Search Engine, once known as the Google News API, for example. It crawls and indexes sites according to its own algorithm, so new and niche sites might be overlooked. Another point to consider is that many news APIs don’t crawl content in multiple languages. Even those that do may not let you search or query content by language.

In contrast, Webhose’s advanced on-demand crawlers include coverage of millions of news articles in over 80 languages. That includes every geographic territory with online access.

2) The data isn’t machine-readable and ready to integrate into your solution

When organizations collect data for the purpose of analyzing it, they need structured data. Fields and values on the web pages must be mapped (e.g. title, post text, comments, dates, author names, etc.) so the data can be delivered in a format ready for analysis. This includes standardizing and normalizing the data so that it can be quickly ingested into an AI or machine-learning application. Unfortunately, preparing and cleaning data is still a major struggle for organizations.
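The mapping and normalizing step can be sketched as follows. This is a minimal illustration, not any provider’s pipeline: the alias table and the input field names (`headline`, `byline`, `pubDate`, and so on) are invented for the example, and timestamps without a timezone are assumed to be UTC.

```python
from datetime import datetime, timezone

# Hypothetical aliases: different sites label the same field differently.
FIELD_ALIASES = {
    "title": ("title", "headline"),
    "text": ("text", "body", "content"),
    "author": ("author", "byline"),
    "published": ("published", "date", "pubDate"),
}


def normalize(raw):
    """Map a raw scraped record onto one schema with clean values."""
    record = {}
    for field, aliases in FIELD_ALIASES.items():
        # Take the first alias that is present and non-empty.
        value = next((raw[a] for a in aliases if raw.get(a)), None)
        if isinstance(value, str):
            value = " ".join(value.split())  # collapse stray whitespace
        record[field] = value
    if record["published"]:
        parsed = datetime.fromisoformat(record["published"])
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        record["published"] = parsed.astimezone(timezone.utc).isoformat()
    return record
```

Every record that comes out has the same four keys and UTC timestamps, which is what lets downstream analysis or model training treat articles from different sites uniformly.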

Webhose standardizes and normalizes data for organizations that need it for the next step, whether that’s analysis or building an AI or ML algorithm. We also offer multiple ways to ingest the data to fit different needs, via our News API and Firehose API.

3) It’s not continuous

If news sites aren’t crawled continuously, customers miss out on the most relevant data, which is essential for accurate competitive analysis, financial analysis, or media and web monitoring. Organizations also rely on accurate data as a foundation for their AI and ML algorithms.

For continuous news data feeds, you’ll want to make sure you select a news API with low latency. (In other words, it should be able to process a lot of data very quickly with minimal delay.)
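One way to compare latency across providers during a trial is to measure the delay between when an article was published and when it became available in the feed. The sketch below assumes each record carries `published` and `crawled` timestamps; both field names are hypothetical, and the 95th percentile uses the simple nearest-rank method.

```python
import math
from datetime import datetime, timedelta
from statistics import median


def crawl_latencies(articles):
    """Seconds between publication and crawl for each article."""
    return [(a["crawled"] - a["published"]).total_seconds() for a in articles]


def summarize(latencies):
    """Median and nearest-rank 95th percentile of a list of latencies."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {"median": median(ordered), "p95": ordered[rank - 1]}


if __name__ == "__main__":
    t0 = datetime(2021, 1, 12, 12, 0, 0)
    sample = [
        {"published": t0, "crawled": t0 + timedelta(seconds=s)}
        for s in (60, 120, 300, 3600)
    ]
    print(summarize(crawl_latencies(sample)))
```

Looking at a high percentile alongside the median helps show whether a feed is fast on average but slow for a tail of less frequently crawled sources.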

Webhose delivers comprehensive coverage with low latency. (One caveat to keep in mind: our source coverage heavily favors frequently updated sources. If a news story breaks and directs traffic to a previously obscure site, that site is added as a news source. As interest in that source subsides, though, its crawling latency increases over time. This remains the exception rather than the rule.)

4) It’s not scalable

Maybe your organization built an in-house crawler that suited your needs at the time, and now it’s time to scale. Or maybe you just have a specific query without a predefined list of sources. Once your business gets to the point where it needs data from hundreds of thousands of sources you haven’t previously crawled, you’ll need an advanced news data feed.

Webhose’s News API scales easily because its crawlers use sophisticated pattern-matching heuristics on newly discovered sites. It applies knowledge about the structure of previously crawled sites to sites it has never crawled before.

5) It only includes current news data, not past news data

Past news data can add significant value for organizations looking to detect patterns in data and make accurate predictions about the future. Take SESAMm, a big data company that helps clients construct financial market forecasts and strategies for all asset classes. They rely on recent and continuously updated news articles (as well as blogs, discussions, and reviews), then apply natural language processing (NLP) to build customer indicators based on sentiment, emotions, and ESG scores on a wide variety of financial assets. The accuracy of this type of predictive analysis depends on the comprehensive, large datasets that only advanced crawlers offer.

Advanced news APIs like Webhose deliver these large datasets, which include archived news data going back to 2008: up to 25TB of archived news stories.

Test Drive Your News API

This list should help you save time and effort once your organization is at the point where it’s starting to search for a news API. There are many options, and it’s important to select the one that best suits your needs.

Still not sure which news API to use? Try a few of them, keeping in mind the mistakes described above. Sign up for a free 10-day trial of Webhose’s News API.

CEO of Webhose, a leading provider of data from the open, deep, and dark web, serving enterprise clients worldwide.