Does Information Want to be Free?

A low cost to copy and arbitrary price discrimination make it a tricky quasi-commodity

Know someone who might like Capital Gains? Use the referral program to gain access to my database of book reviews (1), an invite to the Capital Gains Discord (2), stickers (10), and a mug (25). Scroll to the bottom of the email version of this edition or subscribe to get your referral link!

The history of the catchphrase "information wants to be free" is a nice encapsulation of the tricky economics of buying and selling data. It's actually part of a longer and more ambiguous phrase from Stewart Brand:

[I]nformation sort of wants to be expensive because it is so valuable—the right information in the right place just changes your life. On the other hand, information almost wants to be free because the cost of getting it out is getting lower and lower all the time. So you have these two things fighting against each other.

Like Spiegelman's Monster, the phrase, once it found the right environment, was stripped down to the simplest version of itself that could still reproduce. But the broader tension is interesting! The direct cost of transmitting a few kilobytes of information is low, but if that information is a little chunk of code that OpenAI used to radically speed up inference, or a press release announcing a public company merger, or the outline of a new trade war gambit by the US or China, it can easily be a billion-dollar piece of data.

This slippery characteristic, where it's valuable but cheap to transmit, leads to all sorts of odd contortions in “data” businesses. Reddit, for example, wants to make it very easy for human users to access Reddit, to read comments, and to respond to them. Those users are typing up valuable training data! But Reddit wants to make it very hard for people to access that data, because if there are two sellers, the market-clearing price is roughly the cost of collecting the data. An AI lab might pay Reddit a small premium over the cost of running the servers and hiring the engineers, but they're not going to pay more than they have to.
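The undercutting dynamic behind that claim can be sketched with toy numbers (all values hypothetical): two sellers of identical data keep undercutting each other to win the sale, until the price sits just above the cost of collection.

```python
# Sketch of the undercutting dynamic (hypothetical numbers): with two
# sellers of identical data, each undercuts the other until the price
# approaches the cost of collecting it.

COST = 10.0   # cost of collecting and serving the data
TICK = 0.5    # smallest undercut

price = 100.0  # seller A's opening ask
while price - TICK > COST:
    price -= TICK  # the other seller undercuts to win the whole sale
print(price)  # settles just above cost: 10.5
```

The same logic is why a second seller of "Reddit-equivalent" data would collapse the price toward scraping-and-serving costs.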

This sets up the most interesting feature of information businesses: they, too, want information to be free. The most profitable ones are those that get customers to input search queries, voluntarily post photos, videos, and status updates, or share the maximum price they'd pay or the minimum price they'd accept when trading a given amount of stock, all for free. These businesses (search engines, social media, and exchanges) all create an environment where one set of users creates valuable data, which the business can then monetize.

There's an echo of this in other information businesses, like media. Take a story that plays out over a two-hour video and you probably have something people will pay to watch, subscribe to access, or tolerate ads to watch for free. But you also have the raw material to condense the premise and a few key scenes into a trailer, which will be readily available for free online and which the studio will actually pay to get wider distribution.

There are often low-friction media environments where messages spread easily, and high-friction ones where it's easier to collect money. In investing, a classic media stack is:

  1. Give away some high-level ideas for free on Twitter, in order to

  2. Acquire subscribers for more in-depth or niche pieces on Substack, which can lead to

  3. Getting poached by a fund to generate those same ideas in-house.

This process is actually very helpful for media quality: nobody wants to degrade their brand by appealing to a large audience of low- or zero-monetization readers when a smaller audience is willing to pay a lot more (a thousand true fans still rings true). It also means that hucksters who stay in the mass-audience business are implicitly claiming that they don't have much skill, and that more efficient monetization would be bad for them.

That process, of using free content to advertise for lightly-monetized offerings, and using lightly-monetized offerings to capture the maximum stake in the upside, illustrates something else about information businesses: they have a tendency either to go full-stack or to have strictly-separated layers with distinct functions. The best example of the full-stack approach is the big ad platforms. Colloquially, people will sometimes talk about these companies "selling user data to advertisers," but that's the last thing they want to do, because those advertisers would then have data they could keep reusing without paying again. (Also, the users would be very annoyed if this critique were actually true.)

Instead, what's best for these companies is to ruthlessly hoard the data, and only sell the outcome of analyzing it. A social network that monetizes by selling travel companies a list of everyone who wants to book a trip this week, and car companies a list of everyone talking about buying a new car, is a lot less lucrative than one that lets all of those advertisers bid against each other for impressions. And that business is especially strong if the platforms slowly reduce how much targeting information the brands get. A brand's market research may indicate that its product is primarily purchased by 25-49-year-old males, but if the ad targeting system identifies a 51-year-old woman as a likely prospect, she might see the ad (which will be priced in competition with all the ads targeting exactly her demographic).
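The bidding mechanism can be sketched in a few lines. This is a simplified second-price auction with made-up advertisers and bids, not any platform's actual pricing logic, but it shows why selling impressions beats selling lists: the platform captures the competition between advertisers on every single impression.

```python
# Simplified second-price auction (hypothetical bidders and numbers):
# each advertiser bids what an impression of this user is worth to them,
# and the winner pays the runner-up's bid.

def second_price_auction(bids):
    """bids: dict of advertiser -> bid. Returns (winner, price paid)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price

# A 51-year-old woman the targeting system flags as a likely car buyer:
# the brand's demographic filter would have excluded her, but the
# platform's own signal puts her impression up for auction anyway.
bids = {"car_brand": 2.40, "travel_site": 2.10, "insurer": 1.75}
winner, price = second_price_auction(bids)
print(winner, price)  # car_brand wins, pays the runner-up's 2.10
```

If the platform instead handed the "likely car buyer" list to the brand, it would get paid once for data the brand could reuse indefinitely.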

In an adversarial environment like trading, the data economics get even more complicated. Stock quotes that are real-time enough for human purposes are an economic complement to data feeds that are as real-time as FPGA programmers, networking experts, and physics can allow. The fastest traders want slower data to be widely available, because it means that they'll have more counterparties to trade with; a world where retail investors still have 15-minute delayed quotes and have to pay for up-to-the-decisecond-but-not-microsecond latency is a world where the fastest traders will represent more of the total liquidity, and bid/ask spreads will be correspondingly larger.

Take the same business and apply the full-stack approach, where one platform has a monopoly on data and the other side just puts money into a black box and gets some return it can't predict, and you end up with a system where the exchange is both the venue for trading and the liquidity provider doing the trading. That system has existed in the distant past (the NYSE was collectively run by people who owned seats allowing them to trade there directly). More recently, Alameda was the main liquidity provider on FTX, and Binance has been accused of doing the same thing on its exchange. But a monopoly data-and-liquidity provider doesn't know how much someone else would pay to use its data to be a liquidity provider. And different liquidity providers can run different strategies that don't fully overlap: maybe one of them passively makes markets, another also takes directional bets but cares a lot about execution, while a third is making a market in something else, like ETFs, and is mostly trading to hedge. As it turns out, it's more profitable for exchanges to sell access at various tiers of data quality and then let the market figure out what each tier is worth.
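A toy comparison, with entirely hypothetical numbers, makes the tradeoff concrete: an exchange trading against its own customers earns only what its one strategy captures, while selling tiered data feeds lets non-overlapping outside strategies each pay up to what the data is worth to them.

```python
# Toy comparison (all numbers hypothetical): in-house trading vs. selling
# tiered data access to outside liquidity providers.

in_house_strategy_profit = 50.0  # what the exchange's own desk can extract

# Outside firms with non-overlapping strategies value the feed differently;
# the exchange prices tiers and lets them self-select.
willingness_to_pay = {
    "passive_market_maker": 40.0,
    "directional_taker": 25.0,
    "etf_hedger": 15.0,
}
tier_prices = [40.0, 25.0, 15.0]  # one tier priced at each buyer's value

# Each firm buys the most expensive tier at or below what the data is
# worth to it (here, tiers happen to match the three valuations exactly).
revenue = sum(
    max((p for p in tier_prices if p <= value), default=0.0)
    for value in willingness_to_pay.values()
)
print(revenue)  # 40 + 25 + 15 = 80, versus 50 from trading in-house
```

The gap comes from information the monopolist can't observe: it never learns what its data is worth to strategies it doesn't run.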

That's a simple sketch, but it outlines the fundamentals of a profitable data business: the first goal is to identify some flywheel for cheaply collecting information that other people could use to make money. The higher-level strategic question is where in the value chain to draw a sharp boundary, beyond which outside companies figure out exactly how to use that information.

The Diff has talked about information economics at the level of companies, but also at a higher level.

