The data elephant in the room. [Image by Midjourney]

Business ecosystems and data decentralisation in the financial markets

The data elephant in the room

Published in DataDrivenInvestor · 14 min read · Nov 29, 2022


Even though definitions of a business ecosystem vary, one of its defining characteristics is a marketplace built around producers and consumers freely interacting over a platform delivered by an orchestrator. Ecosystems may be designed as closed or open but are generally defined through a coordinated set of rules across participants who exchange value over the network. [1][2]

In the case of financial markets, participants were without doubt early adopters of business ecosystems in the form of exchanges, transactional networks and information-sharing platforms. In time these transactional networks became increasingly open and widespread with the de-mutualisation of exchanges, the emergence of peer-to-peer block trading facilities and the establishment of open protocols defining a common data dictionary for communication (e.g. FIX, SWIFT). [3]

However, one area that has remained stubbornly closed across the traditional financial markets, to the detriment of liquidity, participation, and information symmetry, is data distribution. As the financial industry became digitised, as far back as the 1980s, large incumbent information vendors emerged, and a large component of their business was to assemble data from a variety of sources, clean and aggregate it, and then sell it back to the participants.

Indeed, the latest trend in this long-established data hoarding business is to locate public datasets used as crucial components of investment strategies, acquire the rights to them and then immediately put them behind an expensive enterprise subscription. We may blame the incumbents for seizing the opportunity, but in my opinion the problem lies deeper: it is a problem of industry collaboration.

With the emergence of significant innovation around open and trustless ecosystems, however, we may finally be in a position to address the incumbents' monopolistic hold on the industry. Smaller-scope examples may be found in Decentralised Finance (DeFi), in the form of price oracles such as the Pyth Network; however, there is still much to do to make this innovation usefully widespread, especially in traditional finance.

It is also a common misconception that data architecture is a purely technical exercise. It is not; it is inextricably linked to business agility, risk, and outcomes.

This is not a technical paper; it is a high-level discussion. However, we do need to get into the technical details a little, so hang on tight as we dive a (little) bit deeper into how this problem could be addressed.

Impediments to inter-organisational data collaboration

One of the defining problems to date in the capital markets has been the availability of good-quality, reliable data for decision making across diverse portfolios. In solving this data availability problem, large data aggregators emerged across different asset classes, effectively acting as re-vending mechanisms for data that is largely generated by consumers' own activities. This re-vending paradigm, which sometimes adds little value in the process, is one of the defining problems of the financial markets.

Part of the problem with this paradigm is that information vendors hold a disproportionate sway over how this data is used and have constantly re-invented ways to monetise it. The result is organisations paying multiple times for the same data through various relationships or usages, whether directly or through third-party service providers.

It has also stymied innovation and frustrated service providers seeking to deploy new products, only to be faced with cost-prohibitive redistribution fees from monopolistic information vendors. In fact, information vendors can freely give data at zero cost to service providers they favour, yet make the same data cost-prohibitive to a service provider they do not.

The traditional model is also not compatible with public cloud architectures, where the concept of data being shared across organisational boundaries has become less and less meaningful in an environment of virtualised spaces and homogenised shared services.

Information vendors have never been incentivised to address these problems in their licensing models: not only is the problem very profitable, but the effort to reconcile different users of the data across various re-vending relationships is too complicated to solve without significant administrative overhead. Rather than innovate around the technology, they have instead sought to 'plug the gaps' in their licensing models.

This approach has consistently held back innovation in the industry. With the latest technology, however, there are potentially better ways to solve the problem: not by attempting to thwart fair reward for the honest work of reconciling and cleansing data, but by addressing the obvious inequity of charging the same end client repeatedly for the same data.

Before we get on with how this issue might be resolved, we will first talk through the definitions involved and the popular data architectures used within enterprise deployments and by information vendors.

Innovation but not the kind you want

Ever since the centralisation of information vendors and the monetisation of industry data through re-vending, there have been constant attempts to find new ways of charging for data as technology and use cases have evolved. This has led to definitions of data usage that have made life increasingly difficult for service providers and disincentivised inter-organisational data sharing. For those who are unfamiliar, such contractual terms introduce concepts such as:

  • Non-displayed data — data used by machines to make decisions or generate derived works; it is not displayed to an end user and therefore does not require an end-user license.
  • Derived works — original works in whose calculation the data has been used, in whole or in part. Information vendors consider them to be based on or derived from the underlying data.
  • Systematic or non-systematic — whether the data is delivered through a digital method (e.g. a web portal or application), which is considered systematic, or through another type of distribution (e.g. a PDF in an email), which is considered non-systematic.
  • Redistribution — the transfer of data, in whole or in part, including derived works whether viewable or non-viewable, to an entity other than the licensed entity.

This has meant that inter-organisational workflows involving data usage have been severely limited by the inability to share data. Resolving this problem would go a long way toward improving collaboration, removing risk, and increasing innovation in the financial markets.

Figure 1: Venue data is re-vended back to participants through information vendors who control redistribution rights, including multiple service providers redistributing the same data. [Image by Author]

Information vendors and intra-organisational patterns

Innovation around data distribution has largely focused on optimising licensing by resolving intra-company data discovery and usage: solving for centralised collection and storage from producers, and then distribution of the data to consumers. This is true of participants who have attempted to optimise their license costs, but also of information vendors, who likewise centralise data before distributing it back again.

The most recent attempts to address the license cost, discoverability, and quality issues within an organisation's boundaries are data virtualisation, data federation and data fabric [5]. Like a medicine treating only the symptoms, they focus purely on how to work around an unwieldy license paradigm. A simplified sketch of the virtualisation pattern follows the definitions below.

· Data virtualisation — a single interface for accessing distributed data with different data models without needing to know where it is physically located

· Data federation — a single interface, presented as a virtual database, that provides a unified data model for accessing distributed data with different underlying models. The intent of federation is to leave the data where it is but provide a unified view across multiple sources.

· Data fabric — an architecture across a variety of data sources and services that provides consistency in capability across endpoints and multi-cloud environments, standardising across cloud, on-premises and edge devices.
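
To make the distinction concrete, here is a minimal, illustrative sketch of the virtualisation pattern in Python. The source names, schemas and in-memory stores are assumptions made for the sake of the example; a production virtualisation layer would add query push-down, caching and security.

```python
# A minimal, illustrative sketch of the data virtualisation pattern.
# The source names and schemas are hypothetical.

from typing import Dict, Iterable, List


class DataSource:
    """A single underlying source with its own data model."""

    def __init__(self, name: str, records: List[Dict]):
        self.name = name
        self._records = records

    def query(self, **filters) -> Iterable[Dict]:
        for record in self._records:
            if all(record.get(k) == v for k, v in filters.items()):
                yield record


class VirtualisationLayer:
    """A single interface over many sources; the caller never needs
    to know where the data physically lives."""

    def __init__(self, sources: List[DataSource]):
        self._sources = sources

    def query(self, **filters) -> List[Dict]:
        results = []
        for source in self._sources:
            for record in source.query(**filters):
                # Tag provenance so lineage is not lost in aggregation.
                results.append({**record, "_source": source.name})
        return results


# Hypothetical usage: two sources holding bond and equity reference data.
layer = VirtualisationLayer([
    DataSource("bond_master", [{"isin": "XS0000000001", "asset_class": "bond"}]),
    DataSource("equity_master", [{"isin": "US0000000001", "asset_class": "equity"}]),
])
print(layer.query(asset_class="bond"))
```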

Figure 2: Variations on a theme. Alternate patterns to solve data distribution, focused on intra-company, however limited in their ability to deal with inter-organisational concerns. [Image by Author]

All three methods aim to improve quality and efficiency and to remove risk from the usage of data within an organisation. They also go some way towards solving the geopolitical issues associated with the use and storage of data within sovereign countries, from a single organisation's perspective. They are, however, poor patterns for distributing and collaborating on data across organisational boundaries.

The fundamental issue with using these patterns to address cross-organisational data collaboration and re-vending is that the answer is not a better way to centralise and distribute the data, but a way for data producers and consumers to act as peers in an exchange of value, while retaining sovereignty over the data they own.

The data devil is in the details

This sounds deceptively simple, and you'd be forgiven for thinking: if that is all that is required, why hasn't it been resolved already? The reason is that there are complicating factors affecting which data is shareable and where it may be located.

  • Geopolitical — There are geopolitical ramifications in terms of the ability to access the data, the need to prevent foreign or undesirable access, and the right of access for local regulators. This includes areas such as obfuscation, data privacy and data sovereignty.
  • Proprietary — The data could be intellectual property and proprietary to the owner, whether built from ground-up collection or derived from other sources.
  • Confidential — Confidentiality or restrictions in distribution may apply to the data, such that it can only be distributed to authorised parties involved in a transaction. There may also be data embargoes to navigate and restrictions on transferability.
  • Personal — Data that identifies a natural person must be treated under regulations such as GDPR, which protect the rights of the individual and include the 'right to be forgotten'.

Bring forth the decentralised data

In essence, to solve the inter-organisational collaboration issue we need to solve four fundamental problems. Solving these goes to the heart of facilitating the free-form exchange of information across organisational boundaries. A simple sketch of how these concerns might be captured as metadata follows the list.

  • Sovereignty — rather than thinking of the regulatory issues being one of physical location or domicile, it is rather one of sovereignty. Who has sovereignty over the data, and can they control access to parties who legitimately have rights to read the data?
  • Distribution — to be able to run analytics and create derived works from the data, does the data need to be distributed to another instance across physical or virtual boundaries to analyse it?
  • Audit-ability — if the data does cross boundaries during the process of distribution, how can I know what data was shared in the distribution and who it was shared with?
  • Scalability — data has a variability in size, speed, and time decay. Can we cater for the smallest transactional data right through to the larger objects such as videos, contracts, and legal documents?
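
As a purely illustrative sketch (the field names are assumptions, not a standard), these four concerns could be captured as metadata attached to every dataset offered into the ecosystem:

```python
# Illustrative only: one way to represent the four concerns as metadata.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SharedDataset:
    dataset_id: str
    owner: str                      # Sovereignty: who controls access rights
    jurisdiction: str               # Sovereignty: governing legal regime
    authorised_readers: List[str]   # Sovereignty: parties with read rights
    location_uri: str               # Distribution: the data stays at this location
    content_hash: str               # Auditability: fingerprint of what was shared
    access_log: List[str] = field(default_factory=list)  # Auditability: who read it
    size_bytes: int = 0             # Scalability: from small ticks to video/contracts
    expires_at: Optional[str] = None  # Lifecycle: data decays and may be destroyed

    def grant_access(self, party: str) -> None:
        if party not in self.authorised_readers:
            self.authorised_readers.append(party)

    def record_access(self, party: str) -> None:
        if party not in self.authorised_readers:
            raise PermissionError(f"{party} has no rights to {self.dataset_id}")
        self.access_log.append(party)
```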

The key to breaking the data monopolies created by the centralisation of data across venues and information vendors is to solve the business problem of inter-organisational collaboration. This could be done by creating an ecosystem of peers that are incentivised to share data across organisational boundaries. The ecosystem may be based on trust, or be trustless; there are benefits to either, but the fundamental concept is that peer-to-peer sharing of data reduces the need to collate that data centrally and then redistribute or re-vend it back to the participants that created it.

Data shared across the peer-to-peer ecosystem could be broadcast, restricted or private. Multiple models are possible.

Accomplishing this free-form exchange of information involves combining several branches of the latest technology: immutable ledgers to track data lineage and auditability, data distribution and pointers, and a way to analyse data without having access to the underlying raw data.

In the simplest terms, you need to be able to share the data but prevent it from actually ‘moving’ anywhere.

Figure 3: A simplified example of a distributed ecosystem of peers collaborating over broadcast and private data across permissioned interactions. [Image by Author]

It is not a stretch of the imagination that the same blockchain technology used to decentralise the exchange of value in currency could also be used to decentralise the exchange of data. [7] However, decentralised storage of data on the blockchain has significant issues which cannot be easily overcome.

  • Blockchain was not designed to keep large data blocks on chain
  • Blockchain is designed as an immutable ledger; it is not designed for content that changes frequently
  • Keeping content blocks on chain in perpetuity is neither a practical nor an efficient use of long-term storage and would lead to significant cost issues
  • Data has a lifecycle and can degrade over time; keeping all of it online indefinitely is poor design and a waste of resources
  • How does an individual exercise their right to be forgotten if their data is stored immutably on a blockchain? Data destruction is as important as data creation.

However, blockchain is an excellent technology for creating an immutable ledger that records data lineage and audit information and can be accessed across all nodes for transparency.

Combined with other technologies, blockchain can be used as a distributed ledger alongside a distributed data technology. The distributed data technology itself is the subject of many white papers and is not central to this discussion; what matters here is how we manage the content over the network.

Combined in this way, the two provide a transparent but permissioned data-sharing ecosystem, and hence the mechanism for creating the peer-to-peer ecosystem. This is perfectly fine for data generated internally, over which the party has sovereignty, but not for data that has been re-vended from somewhere else.
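
As a hedged illustration of how the two pieces fit together, the sketch below shows a hash-chained ledger that records what was shared and with whom, while the content itself stays on the owning node and is referenced only by a pointer and a content hash. It is deliberately simplified; a real deployment would use an actual blockchain or distributed ledger platform.

```python
# Simplified sketch: a tamper-evident lineage ledger that references data
# by pointer and hash rather than storing the content itself.

import hashlib
import json
import time
from typing import Dict, List


def sha256(payload: str) -> str:
    return hashlib.sha256(payload.encode()).hexdigest()


class LineageLedger:
    """Append-only chain of lineage records; not a full blockchain,
    just enough to illustrate auditable data sharing."""

    def __init__(self):
        self.entries: List[Dict] = []

    def record_share(self, dataset_pointer: str, content_hash: str,
                     owner: str, recipient: str) -> Dict:
        previous_hash = self.entries[-1]["entry_hash"] if self.entries else "GENESIS"
        entry = {
            "timestamp": time.time(),
            "dataset_pointer": dataset_pointer,  # where the data lives, not the data
            "content_hash": content_hash,        # fingerprint of what was shared
            "owner": owner,
            "recipient": recipient,
            "previous_hash": previous_hash,
        }
        entry["entry_hash"] = sha256(json.dumps(entry, sort_keys=True))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Check that the chain of entries is intact."""
        for i, entry in enumerate(self.entries):
            expected_prev = self.entries[i - 1]["entry_hash"] if i else "GENESIS"
            if entry["previous_hash"] != expected_prev:
                return False
        return True
```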

Data that travels without moving

Using a distributed ledger to track utilisation across a peer-to-peer ecosystem helps to solve the issue of auditability; however, we still have the problems of sovereignty, distribution, and scalability. To solve these, we need the concept of different types of data being shared, but not actually distributed in the process. We need to segment the data located within each node.

· Broadcast data — data that can be available to all participants in the ecosystem and is shared across all nodes in a way that is generally accessible

· Permissioned data — data that is available only to select participants that are involved in a transaction or have a temporal relationship for the purpose of a particular circumstance

· Private data — data that resides only in the node and may be used for local enrichment but cannot be shared outside of the node with other participants in the ecosystem.

Classifying data as broadcast, permissioned or private allows it to be scoped effectively according to whether it is shareable by the originator and, if so, with whom. The arrangement also allows data from third-party providers, with the redistribution issues mentioned earlier, to remain in the local node.
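
A minimal sketch of this classification, with names chosen purely for illustration, might look like the following: an enumeration of the three visibility types and a check that a node applies before sharing an item with a peer.

```python
# Illustrative sketch: classifying data items and gating what leaves the node.

from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Visibility(Enum):
    BROADCAST = "broadcast"        # available to all participants
    PERMISSIONED = "permissioned"  # only to parties named on the item
    PRIVATE = "private"            # never leaves the local node


@dataclass
class DataItem:
    item_id: str
    visibility: Visibility
    permitted_parties: Set[str] = field(default_factory=set)


def can_share(item: DataItem, requesting_party: str) -> bool:
    """Decide whether the local node may share this item with a peer."""
    if item.visibility is Visibility.BROADCAST:
        return True
    if item.visibility is Visibility.PERMISSIONED:
        return requesting_party in item.permitted_parties
    return False  # PRIVATE data is used only for local enrichment
```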

The final piece of the puzzle is how data is shared and stored. In this ecosystem, data is shared from one party to another, but it is not copied in the process: what is shared is a pointer to the location of the data, rather than the content itself being transmitted to the receiving party. This allows data to be shared without being copied across the ecosystem.

Numerous papers have been written on distributed file systems [8]; however, one of the keys to retaining sovereignty is to keep the data locally, where it should reside, rather than allowing it to be copied across a distributed set of nodes. This helps to satisfy the regulatory and control conditions relating to sovereignty.
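
The sketch below illustrates the pointer idea under the same hedged assumptions as before: the 'shared' artefact is only a location reference plus a content hash, and the owning node enforces access when the pointer is resolved, so the content never has to be replicated across the ecosystem.

```python
# Illustrative sketch: share a pointer, keep the content on the owning node.

import hashlib
from typing import Dict, Optional, Set


class OwningNode:
    """Holds the raw content locally and enforces access on resolution."""

    def __init__(self):
        self._store: Dict[str, bytes] = {}
        self._acl: Dict[str, Set[str]] = {}

    def publish(self, key: str, content: bytes, allowed: Set[str]) -> Dict[str, str]:
        self._store[key] = content
        self._acl[key] = allowed
        # The "shared" artefact is just this pointer, not the content.
        return {
            "location": f"node-a://{key}",
            "content_hash": hashlib.sha256(content).hexdigest(),
        }

    def resolve(self, key: str, requester: str) -> Optional[bytes]:
        if requester in self._acl.get(key, set()):
            return self._store[key]
        return None  # sovereignty retained: data never leaves without rights


# Hypothetical usage: the pointer can circulate freely; only fund-x can resolve it.
node = OwningNode()
pointer = node.publish("trade-20221129-001", b"price=101.25", allowed={"fund-x"})
print(pointer, node.resolve("trade-20221129-001", "fund-x"))
```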

A final point about multi-party computation

Even though the peer-to-peer data-sharing mechanism removes the need for centralisation (and the information vendors who sit in the middle), there will still be circumstances in which third-party data must be shared across the ecosystem. This data can be shared without compromising its sovereignty by using cryptographic techniques under the general category of multi-party computation [6]. Using these methods, not only does the data avoid traversing the ecosystem, but sensitive data remains encrypted while third parties are still able to derive insights from it. Two such techniques are listed below, followed by a simplified sketch of the first.

  • Private set intersection — allows two parties to join their sets and discover identifiers held in common. This would be very useful in circumstances such as master security management or trade reconciliations.
  • Homomorphic encryption — allows for certain types of computation to be performed on encrypted data without needing to decrypt the data first, preserving the privacy of the raw data. This is useful in circumstances requiring personally identifiable information, or when seeking to use data without needing to have access to the raw data. [9]
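
To ground the first of these, here is a deliberately naive sketch of the idea behind private set intersection using salted hashes: each party reveals only blinded identifiers, so neither learns entries the other does not hold. Real PSI protocols rely on stronger cryptography (e.g. Diffie-Hellman or oblivious-transfer constructions); this toy version is for illustration only.

```python
# Naive illustration of private set intersection via salted hashing.
# Not cryptographically secure; real protocols use stronger primitives.

import hashlib
from typing import Set


def blind(identifiers: Set[str], shared_salt: bytes) -> Set[str]:
    """Hash each identifier with an agreed salt so raw values are not exchanged."""
    return {hashlib.sha256(shared_salt + i.encode()).hexdigest() for i in identifiers}


def private_intersection(mine: Set[str], their_blinded: Set[str],
                         shared_salt: bytes) -> Set[str]:
    """Return my identifiers whose blinded form also appears on the other side."""
    return {i for i in mine
            if hashlib.sha256(shared_salt + i.encode()).hexdigest() in their_blinded}


# Hypothetical usage: two custodians reconciling trade identifiers.
salt = b"agreed-out-of-band"
party_a = {"TRADE-001", "TRADE-002", "TRADE-003"}
party_b_blinded = blind({"TRADE-002", "TRADE-004"}, salt)
print(private_intersection(party_a, party_b_blinded, salt))  # {'TRADE-002'}
```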

These techniques allow data to be shared, but not copied, and to be protected in different ways according to its sovereignty and sensitivity.

Outside the scope of this paper are incentivisation schemes, such as those used in the Pyth Network [10], which provide useful pattern examples. Incentivisation schemes may be used to encourage the sharing of data across the ecosystem and to increase its 'flywheel' effects.

Incentivisation schemes, virtuous cycles and anti-manipulation or anti-abuse algorithms are deep topics that may be covered in future articles. They are important issues with a direct bearing on the success of a truly collaborative peer-to-peer data ecosystem.

The call for innovators

This paper has described an ambitious direction for establishing greater industry collaboration and peer-to-peer sharing within the institutional investment landscape. It has presented a business problem and a technical approach to a decentralised ecosystem in which peers are producers and consumers, but can also be service providers, returning value to the ecosystem through services such as aggregation, reconciliation, and cleansing.

It can be expected that vendors or ecosystems will arise in time to solve the inter-organisational data collaboration issue in financial markets, and this will provide a welcome catalyst for solving some of the data monopolisation issues that have plagued the industry for decades. It is worth remembering that the Bloomberg terminal was first released in 1982; there has been virtually no innovation on this model in 40 years.

With the latest technological advancements, however, there has never been a better time to disrupt the legacy data incumbents and bring a new approach to the industry. The scope is ambitious, and not all of the technology presented is proven. The industry needs innovators to present the next generation of data collaboration technologies.

In the meantime, I welcome any discussion or discourse around solving these issues. They remain among the key issues to solve if we are ever going to truly unlock innovation in the financial markets for all participants.
