Imagine that we’re on some kind of data black market. Someone offers a hard drive that they say contains a data set. They won’t say what it is, but they can guarantee that we don’t already have it. How much should we be willing to pay? And what should the knowledge we already hold do to the price? I’ve made some notes, but they are mainly just me rambling[1].

Most people, myself included, have the intuition that zero is pretty close to the right number. I’m going to try to argue that it’s a little bit more complicated.

The first thing that gives this away is that if you already have all the data that you think it is possible to have, then this mystery data set would have a very high value. Even if you believe that more data is better, you might still assign a low value to it if you have specifically collected the data that you believe to be valuable; that would assume that this new set is in the useless camp. But if you thought that you had all the data, then this new set must reveal an unknown connection, or describe an unknown phenomenon. If you don’t have any data at all, then you probably don’t have the skills needed to use this new set, but let’s leave that to one side for the moment. Let’s assume that “we” are the same in both cases, differentiated only by the data we’re hoarding. Our skill at manipulating and exploiting data is the same.

If you don’t have much data then, for any non-trivial amount that you’d be willing to pay to get that data, you would probably be better off designing a collection method to give you exactly what you want. You’d be in a better position to exploit it immediately[2]. I’m starting to see parallels between this and marginal returns in standard economics, although I don’t really know anything about that beyond its name and a vague sense of the parallel. It’s also related to the ‘classical science’ versus ‘big data’ philosophy divide. In other words, make a hypothesis and then get data, versus get a lot of data and then test hypotheses against it[3].

There are a lot of parallel avenues to explore, but I still don’t know how to price this data set. My intuition is that there will be some kind of asymptotic curve that’s at or near zero for most of the graph, but spikes to infinity near the point where we already hold all the previously known knowledge. This is probably a more extreme shape than it would be if it were a historic artefact on offer to the British Museum.
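To make that shape a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the formula, the function name `toy_price`, and the parameters `fraction_held` (how much of the data we believe could exist that we already hold) and `base` (an arbitrary scale). It isn’t a real pricing model, just one curve that sits near zero for a long time and then blows up as we approach "having everything".

```python
def toy_price(fraction_held: float, base: float = 1.0) -> float:
    """Toy value of a guaranteed-new, unseen data set.

    fraction_held: share of all the data we think could exist that we
                   already hold, from 0.0 up to (but not including) 1.0.
    base:          arbitrary scale factor (hypothetical, purely illustrative).
    """
    if not 0.0 <= fraction_held < 1.0:
        raise ValueError("fraction_held must be in [0, 1)")
    # Assumed shape: zero when we hold nothing, unbounded as we near 1.
    return base * fraction_held / (1.0 - fraction_held)


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5, 0.9, 0.99, 0.999):
        print(f"holding {f:.3f} of everything -> toy price {toy_price(f):.2f}")
```

Swapping in a steeper function (for example, raising the denominator to a power) would keep the curve flatter for longer and make the final spike sharper, which may be closer to the shape I have in mind.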

If there is any real work on this sort of thing then I’d like to read about it.

  1. I wrote this on a train in Spain! 

  2. Speaking probabilistically, you might roll the dice and get _exactly_ what you needed!

  3. This is the same argument I have with traditional POE: people ask first and measure later because sensors are expensive. I sympathize with their position historically, but I don’t think that argument holds any more.