Undergrad here. It would be great if you could elaborate on this as I am having a growing interest in this field. What's the best way to approach it from a comp sci perspective? Who do I talk to in the department to get my feet wet? Where are the internships and such? Where do I start? I'm really eager to know.
Preamble: I am a Big Data Bod. My background is CS (MEng, Manchester) and I work for a global systems integrator wrangling clusters across Western Europe.
If you go back to the original question of "Why is big data not stats when big data is about analysis?" (for anyone still wondering), that's a pretty easy one. Statistics, as a practice (rather than a theory), is generally about accounting for unknowns. It asks questions like "How do I model a population of millions from a sample of 1,000?" or "What's the chance this result is wrong?" These are great questions to ask and of course have applicability across technology-oriented analytics. Focusing just on those analytics, Big Data (as a practice) is generally concerned with much finer-grained resolution. Traditional statistical techniques are great for modelling how a population is going to behave. They're not great for telling me how you are going to behave, while you're doing whatever it is you're doing, before you even know I'm involved.
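To make the "sample of 1,000" question concrete, here's a rough sketch of the textbook margin-of-error calculation for a sampled proportion. The population, the 30% rate and the function name are all made up for illustration, and it uses the normal approximation, which assumes a simple random sample.

```python
import math
import random

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a sampled proportion.

    Uses the normal approximation; z=1.96 corresponds to ~95% confidence.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical setup: a very large population in which 30% of people
# have some trait. The population is so much larger than the sample
# that we simply simulate independent draws at the true rate.
random.seed(42)
TRUE_RATE = 0.30
SAMPLE_SIZE = 1000

# Ask 1,000 random people instead of inspecting everyone.
sample = [random.random() < TRUE_RATE for _ in range(SAMPLE_SIZE)]
estimate = sum(sample) / len(sample)
moe = margin_of_error(len(sample), estimate)

print(f"estimated rate: {estimate:.3f} +/- {moe:.3f}")
```

Note what this does and doesn't give you: a decent estimate of how the whole population behaves (to within about three percentage points), and nothing at all about any one individual in it.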
And that brings us onto the core definition of big data, courtesy of IBM: The Four Vs. Volume (big data gets big, quickly), Velocity (big data changes, quickly), Variety (lots of formats, sources, targets etc.) and Veracity (this one is marketing fluff).
That definition of the field is an answer to a question: How does big data differ from, well, data? And that is the beginning of the answer to the question of "What do I need to know to get into big data?"
An earlier poster pretty much nailed it. Big Data is Normal Data, just with more configuration files. As much as we talk about legacy frameworks like MapReduce and its newer cousin Spark, 90% of the work done on most clusters is launched through a SQL interface from a business intelligence tool. 90% of the effort expended is in traditional practices like ETL, MDM and schema design.
This applies even in cutting-edge environments with mature, high-quality data science groups. A great example of this is a recent blog post from Jay Kreps. Despite talking about the hottest technology in the ecosystem (Kafka) on the website of one of the hottest startups (Confluent), he's talking about the importance of robust schemas and change control.
And that is the foundation of a good big data bod; certainly what we look for in our hires. Do you have a good grounding in data warehousing and business intelligence? Can you tell your 3NF from your snowflake?
But that's just the beginning, and for a CS bod, pretty mundane. What makes you great? The full stack. You need a range of knowledge covering hardware, networking, operating systems, Java, Scala, Python, SQL and security, plus at least one framework/platform of choice. Plus you need a solid grounding in at least the fundamentals of modern analytics: statistics, machine learning, graphs and search. Or more of one than the other if you prefer analysis to engineering.
Now, you might be thinking to yourself, 'holy shit my university doesn't even offer that many courses!' That's an ugly truth. University is poor preparation for the actual work. The #1 most valuable thing for a big data person? Hands-on experience. Every time.