Full disclosure, I work on the Analytics team for a large Internet company in San Francisco. Members of my team and the teams I work with have T-shirts or office signs that say things like “I <3 Big Data” and “I love it when you call me Big Data.”
This post is going to be … controversial… to some of them.
As you have no doubt gathered from the title, I’m not a big fan of what tends to be called “big data.” For the uninitiated, big data refers to extremely large datasets (think hundreds of millions to billions of records, if not more) that reveal patterns, most frequently about how users are interacting with a product or metadata about users. Examples might include click data from Facebook — essentially a list of every mouse click any user has performed on a Facebook page, including e.g. the location on the page and the link URL target and the time of day and the location of the user, among other things — or information about phone calls — the metadata the NSA is so interested in keeping tabs on, which might include things like phone numbers, locations, dates, times, durations of calls, etc.
Let me tell you right off the bat (250+ words in…) that I don’t hate big data because I’m afraid of being snooped on or I’m paranoid the gummymint or The Facebook is after me, or tracking me, or… anything me, really. Honestly, they don’t give a flying duck about me. No, I can break my hatred of big data down into three distinct reasons, none of which have anything to do with paranoia.
The first is that as someone who works with data a lot, I can tell you it’s messy. This is true of “small” data provided by people paid to give you data. You’ll get a CSV that should have 24 rows, and it has 23. Maybe it has 25. Maybe a row or a single cell is blank for some reason; maybe it’s ASCII-encoded and they have to report a name with an accented letter, or, God forbid, an emoji, so it just shows up as “?”. This stuff is actually really complicated, and you usually won’t know you have a problem until suddenly you have one, which could happen at any time for any reason, no matter how stupid, esoteric, or obvious.
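To make that concrete, here's a sketch of the kind of defensive validation this forces you into. The file contents, expected row count, and problem categories are all invented for illustration — real pipelines accumulate dozens of checks like these, one per past surprise:

```python
import csv
import io

# A hypothetical vendor file: we were promised 24 data rows, we got 23,
# one cell is blank, and an accented name has been mangled into "?"s.
raw = "id,name,value\n" + "\n".join(
    f"{i},user_{i},{i * 10}" for i in range(1, 23)
) + "\n23,Jos? Mu?oz,\n"

def validate(text, expected_rows):
    """Return a list of human-readable problems found in a CSV payload."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    if len(data) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(data)}")
    for lineno, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(f"line {lineno}: {len(row)} fields, want {len(header)}")
        if any(cell.strip() == "" for cell in row):
            problems.append(f"line {lineno}: blank field")
        if "?" in ",".join(row):
            problems.append(f"line {lineno}: possible mangled encoding")
    return problems

issues = validate(raw, expected_rows=24)
```

Every one of those checks exists because the "stupid, esoteric, or obvious" failure it catches has actually happened to someone.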
Now imagine you’re getting data not from a paid vendor but from users clicking on things. Maybe they have a spotty Internet connection, so the click event doesn’t get logged. Maybe you accidentally send the click event twice. Maybe they clicked on the wrong thing. Maybe the team who’s in charge of ingesting click data and mapping it back to users is at their offsite in Acapulco when their server goes down, or they changed the event code just before they left and didn’t tell anyone. So you pull the data the next morning to find out how many links were clicked, and you see the number is way down — is that because of a new feature that was just pushed? Is it a holiday in some large portion of the world so people aren’t using your website? Is there a bug in the website? Or is the bug in the mobile app, or the click event firing, or the click event ingestion, or in the query used to pull the click count? Who can even say?
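Some of those failure modes you can at least defend against. A duplicate-send, for instance, is commonly handled by making ingestion idempotent: the client attaches an event ID, and the server drops retries it has already seen. A minimal sketch, with all field names invented:

```python
# Idempotent click-event ingestion: if the client retries and sends the
# same event twice, dedup on a client-generated event_id so downstream
# counts aren't inflated. Schema is hypothetical.
seen_ids = set()
clean_events = []

def ingest(event):
    """Accept an event once; silently drop retries of the same event_id."""
    if event["event_id"] in seen_ids:
        return False
    seen_ids.add(event["event_id"])
    clean_events.append(event)
    return True

events = [
    {"event_id": "a1", "user": "u1", "target": "/home"},
    {"event_id": "a1", "user": "u1", "target": "/home"},  # accidental resend
    {"event_id": "b2", "user": "u2", "target": "/about"},
]
for e in events:
    ingest(e)
```

Of course, this only helps with the duplicate case — the silently dropped event, the holiday, and the Acapulco offsite are still on you.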
Related to this, typically from datasets this large, you can even find patterns that have nothing to do with what the dataset originally set out to record. For instance, a dataset might simply tell you how long a user has your application open in a window. From this information, you might be able to infer how “engaged” a user is with your website. But this is a potentially terrible proxy; for instance, just now I got up and wandered into the kitchen for 5 minutes; I was not engaged with www.carscafmoo.com during that time, even though the window was open. By this measure, your most engaged user is the one who literally dies after loading your site and isn’t discovered for days. So you get around this by developing other heuristics (scroll actions, click actions, page views, keeping the window open for a certain length of time, etc.) that, if you’re really lucky, are based on a broad survey of user interactions, but are still messy, and probably aren’t actually based on anything you could remotely describe as “scientific”1. Basically, without access to the user’s laptop camera to see what they’re actually doing, you’re screwed, and no matter what Zuckerberg does you can’t actually turn on the camera without turning on the indicator light on a modern MacBook unless you are able to physically reinstall the camera, so that’s pretty unlikely.
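Here's a sketch of one such heuristic — sessionizing by idle timeout, so that only time between closely spaced user actions counts as "engaged." The 60-second cutoff is exactly the kind of arbitrary, unscientific constant I'm complaining about:

```python
# One common engagement heuristic, sketched: instead of trusting
# "window open" time, sum only the gaps between consecutive user
# actions that fall under an idle timeout. Timestamps are in seconds.
IDLE_TIMEOUT = 60  # an arbitrary cutoff, not a scientific one

def engaged_seconds(action_times, timeout=IDLE_TIMEOUT):
    """Sum inter-action gaps shorter than `timeout`."""
    total = 0
    for prev, cur in zip(action_times, action_times[1:]):
        gap = cur - prev
        if gap <= timeout:
            total += gap
    return total

# Clicks/scrolls at t=0, 10, 30, then a 300-second wander to the
# kitchen, then more activity at t=330 and 350.
actions = [0, 10, 30, 330, 350]
engaged_seconds(actions)  # → 50, versus 350 of raw "window open" time
```

The kitchen break contributes nothing, which is an improvement — but notice the dead-user problem just moved: now your most engaged user is whoever jiggles their mouse most often.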
To sum up, you’re generally measuring proxies that correlate with or approximate whatever actual metric you’re searching for, and you’re constantly asking yourself, “How well are we answering this question?” The answer is almost always, “Not very well.”
The second reason I hate big data is because it is descriptive of general trends, but frequently applied to specific people. Here’s an arbitrary example of a dataset that might show the behaviors of two groups of hypothetical users:
You can see visually that Group 1 tends to be higher than Group 2; indeed, Group 1 has a mean of ~10.4 and Group 2’s mean is down at 8.7; unsurprisingly (by construction) the difference between the two groups is statistically significant2. Let’s say Group 1 is users of my website with webbed fingers and Group 2 is normal people, and the value we’re recording is logins per month. Here’s how this information would typically get reported (imagine this is part of an infographic; note the arbitrary heights of the bars):
Except that as you can see, several of the points from Group 1 are actually pretty low — in fact, 21 of the 100 points in Group 1 are actually below the mean for Group 2. What is actually true is that on average from our particular sampling of the dataset a user with webbed fingers will log in 18% more often than a user without webbed fingers. Any particular user will fall within a pretty large variance, so just because I have webbed hands doesn’t mean you should start targeting me as a high-login user; I may actually behave much more like a non-webbed-hand person. That’s neither a compelling nor pithy narrative, so we distill it as much as we can in order to tell our particular story (with our particular spin — “Hey, webbies love our website!”); but from my point of view that’s not sound statistics or data science or big data analytics, it’s marketing. Sure, that’s a role that needs to be fulfilled, but every time I see an infographic or a headline like the one above, I immediately wonder about the actual shapes of the underlying distributions. I would propose a WAG that 90+% of the “big data” roles at tech companies fall into support of marketing or sales and involve exactly this sort of broad-strokes “analysis.”
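You can reproduce the shape of this example yourself. The means below match the ones quoted above; the standard deviation (2.0) is my assumption, since I haven't given you the underlying spread:

```python
import random
import statistics

# Two groups of 100 points with means near 10.4 and 8.7, echoing the
# example above. The standard deviation of 2.0 is an assumption.
rng = random.Random(42)
group1 = [rng.gauss(10.4, 2.0) for _ in range(100)]
group2 = [rng.gauss(8.7, 2.0) for _ in range(100)]

m1, m2 = statistics.mean(group1), statistics.mean(group2)
lift = (m1 - m2) / m2  # the group-level story: Group 1 is clearly higher

# The individual-level story: plenty of Group 1 members sit below
# Group 2's mean, so the group label predicts little about any one person.
below = sum(1 for x in group1 if x < m2)
```

The group-level `lift` and the individual-level `below` count are both true at the same time — which is exactly why the infographic version is marketing rather than analysis.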
Of course, there are great insights to be gleaned from proper analysis of big data — in particular, recommendation models based on grouping users’ actions or preferences and then extrapolating to what other members of that group have liked or done tend to work pretty well (think Netflix recommendations), but even then there’s so much individual variance that it’s very difficult to present a great set of predictions for any individual user.
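The core of that group-then-extrapolate idea fits in a few lines. This is a toy nearest-neighbor recommender — not how Netflix actually does it, and the users, items, and ratings are all invented:

```python
# Toy collaborative filtering: find the user whose ratings most resemble
# yours, then suggest something they liked that you haven't seen.
# All names and ratings are made up for illustration.
ratings = {
    "alice": {"drama_a": 5, "comedy_b": 1, "thriller_c": 4},
    "bob":   {"drama_a": 4, "comedy_b": 2, "thriller_c": 5, "drama_d": 5},
    "carol": {"drama_a": 1, "comedy_b": 5},
}

def similarity(a, b):
    """Negative mean absolute rating gap over shared items (none shared -> -inf)."""
    shared = set(a) & set(b)
    if not shared:
        return float("-inf")
    return -sum(abs(a[i] - b[i]) for i in shared) / len(shared)

def recommend(user):
    others = {u: r for u, r in ratings.items() if u != user}
    nearest = max(others, key=lambda u: similarity(ratings[user], others[u]))
    unseen = {i: r for i, r in ratings[nearest].items() if i not in ratings[user]}
    return max(unseen, key=unseen.get) if unseen else None

recommend("alice")  # → "drama_d", because alice's tastes track bob's
```

Even here, the individual-variance problem shows up immediately: "alice" gets "drama_d" because she resembles "bob" on three data points, which is a generalization dressed up as a personal insight.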
So if big data works for generalizations but isn’t great at specific, individual insights, what is it good for? This brings me to my third problem with big data — that it’s really a misnomer. “Big data” datasets actually are potentially great for individual data, messiness aside. But per our definition above, that’s not actually what big data is — big data isn’t just an enormous dataset, it’s using that dataset to reveal patterns. And if you’re looking at an individual, say, their cell phone records, you’re not really looking for patterns and thus not really performing “big data” analysis. You’re just looking stuff up. If you want me to look up the metadata around your phone calls between 8 AM on the 9th and 10 PM on the 10th, I can do that with just your data. Hell, I can do that if the only data I have is your calls between those specific hours. That’s not big data, that’s just data.
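That lookup really is this boring. Here's what it amounts to, with a schema, phone numbers, and dates all invented for illustration:

```python
import datetime as dt

# "Just looking stuff up": filtering one person's call metadata by a
# time window needs no patterns and no scale. Records are invented.
calls = [
    {"to": "555-0101", "start": dt.datetime(2014, 6, 9, 7, 30), "secs": 120},
    {"to": "555-0102", "start": dt.datetime(2014, 6, 9, 9, 15), "secs": 45},
    {"to": "555-0103", "start": dt.datetime(2014, 6, 10, 21, 0), "secs": 600},
    {"to": "555-0104", "start": dt.datetime(2014, 6, 11, 8, 0), "secs": 30},
]

lo = dt.datetime(2014, 6, 9, 8, 0)    # 8 AM on the 9th
hi = dt.datetime(2014, 6, 10, 22, 0)  # 10 PM on the 10th
in_window = [c for c in calls if lo <= c["start"] <= hi]
```

One list comprehension. No clusters, no models, no patterns — just data.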
Now, if you’ve read all of this (and why would you), what you’ll hopefully see is that I don’t actually hate big data — I think data is messy and hard to work with, and I think that “big data” is a term that is frequently misapplied, over-applied, or just poorly defined. The technological capability to store and process enormous amounts of information, and the resulting ability to glean both broad patterns and individual insights from that information, can and does drive enormous value. Even the Big Data Lite broad generalizations serve a real and necessary marketing and sales function, especially in a world of high-level (and frequently somewhat innumerate) decision making.
I just wish I didn’t know enough about the term to cringe every time I see it.