January 18th, 2012

Influence and causality: castration doesn't make you live longer

This is hopefully obvious to most readers, but in my experience many people still confuse the following two statements: "whenever A holds, B is more likely to hold" and "A causes B". They are not the same: the first expresses statistical dependence, the second causal influence.


Let me demonstrate this distinction with the following example: if you look at statistics, you may observe that "people who do not have testicles live longer". Clearly, this doesn't imply that if you do have testicles then you should cut them off to live longer. (I mean, really, please don't try this at home.) It simply reflects the fact that women tend not to have testicles and also tend to live longer than men: sex influences both variables, so they are statistically dependent without either causing the other. This extreme example - and several others - clearly demonstrates what can go wrong when statistical dependence is misinterpreted as causal influence, yet the distinction is very often overlooked, not only by the general public and journalists but, very sadly, even by scientists and policy-makers.
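
To make this concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the function names, the 50/50 sex split, and the lifespan numbers (means of 82 and 77 years, spread of 8) are invented, not real demographic data. The point is only that the lifespan function never looks at testicles, yet a gap between the two groups still shows up in the "observed" data.

```python
import random


def lifespan(is_female, rng):
    # Hypothetical model: lifespan depends on sex only, never on testicles.
    # The means (82, 77) and the spread (8) are made-up illustrative numbers.
    return rng.gauss(82.0 if is_female else 77.0, 8.0)


def observational_gap(n=100_000, seed=0):
    """Mean lifespan without testicles minus mean lifespan with testicles."""
    rng = random.Random(seed)
    with_t, without_t = [], []
    for _ in range(n):
        is_female = rng.random() < 0.5
        has_testicles = not is_female  # sex determines this variable too
        years = lifespan(is_female, rng)
        (with_t if has_testicles else without_t).append(years)
    return sum(without_t) / len(without_t) - sum(with_t) / len(with_t)


if __name__ == "__main__":
    print(f"observed gap: {observational_gap():.2f} years")
```

In this toy world the group without testicles lives about five years longer on average, purely because sex drives both variables.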


When analysing the social influence of people, blogs and news sites, the distinction between causation and dependence is highly relevant: we may observe that whenever a particular user shares something in a social network, the volume of activity around the topic - on average - increases. But this alone does not imply that the user is directly responsible for the increase in activity, nor that she or he is influential.


Fortunately, in social networks, there are ways to explicitly record causal influence: for example, if Alice retweets Bob's message, or shares his post, it is very likely that there was a direct causal relationship between Bob's activity and Alice's activity. But often such influences remain implicit in the data: instead of explicitly forwarding Alice's message, Bob may just post the same piece of information without acknowledging Alice as a source of inspiration. These situations make it very hard (although not impossible, that's my job) to disambiguate between alternative stories explaining the data: was Bob influenced by Alice, or is it just a coincidence that they both shared the same piece of information, each having been influenced by third-party sources?


The most powerful, although usually very costly, way of detecting causal influence is through intervention: to go back to our castration example, this amounts to cutting off a few guys' testicles, implanting them into women, and then measuring how long these patients live. If you can do that - set the value of a few variables yourself and observe what happens to the rest of the variables - you really are in a better position to detect causal relationships.
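
Here is a hedged sketch of why intervention helps, as a self-contained Python simulation. All names and numbers are assumptions for illustration only (the lifespan means of 82 and 77 years and the spread of 8 are invented). In the `observe` setting, having testicles is determined by sex; in the `intervene` setting, the experimenter sets it with a coin flip, regardless of sex. Lifespan depends on sex alone in both cases.

```python
import random


def lifespan(is_female, rng):
    # Hypothetical model: lifespan depends on sex only (made-up numbers).
    return rng.gauss(82.0 if is_female else 77.0, 8.0)


def observe(is_female, rng):
    # Observational setting: the variable is determined by sex.
    return not is_female


def intervene(is_female, rng):
    # Interventional setting: the experimenter sets the variable by coin flip.
    return rng.random() < 0.5


def gap(assign_testicles, n=100_000, seed=0):
    """Mean lifespan without testicles minus mean lifespan with testicles."""
    rng = random.Random(seed)
    with_t, without_t = [], []
    for _ in range(n):
        is_female = rng.random() < 0.5
        has_testicles = assign_testicles(is_female, rng)
        years = lifespan(is_female, rng)
        (with_t if has_testicles else without_t).append(years)
    return sum(without_t) / len(without_t) - sum(with_t) / len(with_t)


if __name__ == "__main__":
    print(f"observational gap:  {gap(observe):.2f} years")
    print(f"interventional gap: {gap(intervene):.2f} years")
```

Under these assumptions the observational gap is around five years, while the interventional gap is close to zero: once the variable is set at random, it no longer carries any information about sex, so only a genuine causal effect could produce a difference.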


In a recently published study, Facebook's data team did just that in the context of social networks: they intervened. During their experiment in 2010, they randomly filtered users' news feeds to see how likely users were to share certain content when they do vs. when they do not see their friends' related activities. Unsurprisingly for Facebook, the scale of the experiment was humongous: it involved around 250 million users and 78 million shared URLs, amounting to over 1 billion (as in a thousand million, 10^9) user-URL pairs. This randomised experiment allowed them to gain unprecedented qualitative insights into the dynamics of social influence: the effect of peer influence on the latency of sharing times; the effect of multiple friends sharing the same piece of information; connections between tie strength and influence. I encourage everyone interested in either causality or influence in social networks to look at the paper.


Finally, to illustrate how hard inferring the presence of influence is, consider this blog post: the underlying truth is that I first read about the Facebook study on TechCrunch, then I looked it up on Google News and chose to read the New Scientist coverage, which finally pointed me to the Facebook note and the paper. Now, had I not included this final paragraph, just imagine how hard it would've been to identify these sources of influence. Well, this is my (or rather, my algorithms') job.