Initially, one might think that data science is difficult because of the rigorous and complex math behind machine learning algorithms. Or you might find data cleaning difficult, that is, getting the data into a usable state for model fitting. However, as this article states, these can actually be a couple of the easiest parts of data science. Once you understand the math, you can pretty quickly grasp new algorithms, and once you have some experience cleaning data, it becomes a tedious but moderately easy task. Some machine learning algorithms are even implemented for you already in software libraries and can simply be treated as a black box. And when trying to figure out which model to fit to a problem, there are handy flowcharts and diagrams available that can greatly simplify the process.

Instead, data science is difficult because one must know what questions to ask as well as how to answer them (i.e., figuring out what data to collect and how to determine whether you have found a solution). Data collection, cleaning, exploratory analysis and visualization, model fitting, model analysis, and model validation are all necessary steps when trying to answer a data-based question. One can only improve these skills through constant practice and exposure. I feel there can be some misunderstanding about what exactly data science is, and hopefully this post helps readers gain a better idea of the various components that go into answering questions with a data-driven approach, as well as serving as a preview of the difficulties that can arise when entering the field.
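To make the steps above a bit more concrete, here is a minimal sketch of the collect, clean, fit, and validate loop in plain Python. Everything in it is an assumption made for illustration: the data is simulated (a noisy line `y = 3x + 2` with some missing readings), and the helper names `collect`, `clean`, `fit_line`, and `rmse` are hypothetical. A real project would use actual data and most likely a library such as scikit-learn.

```python
import random

def collect(n=200, seed=0):
    """Simulated data collection: y = 3x + 2 plus noise, with some missing targets."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 3 * x + 2 + rng.gauss(0, 1)
        # Roughly 5% of the readings are recorded as missing (None).
        rows.append((x, None if rng.random() < 0.05 else y))
    return rows

def clean(rows):
    """Cleaning: drop rows whose target value is missing."""
    return [(x, y) for x, y in rows if y is not None]

def fit_line(rows):
    """Model fitting: ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(rows, slope, intercept):
    """Validation: root mean squared error of the fitted line on the given rows."""
    squared = sum((y - (slope * x + intercept)) ** 2 for x, y in rows)
    return (squared / len(rows)) ** 0.5

data = clean(collect())
split = int(0.8 * len(data))              # hold out the last 20% for validation
train, held_out = data[:split], data[split:]
slope, intercept = fit_line(train)
error = rmse(held_out, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} held-out RMSE={error:.2f}")
```

The code itself is the easy part. Deciding that a straight line is the right model, that dropping missing rows is acceptable, and that held-out RMSE is the right measure of success are the judgment calls that take practice.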

Thank you for this post! I found it super useful, especially the diagram depicting what statistical methods to use in various situations. It also made me aware of techniques I did not know about, especially when it comes to dimensionality reduction.

I really hope someone comes up with an interactive version of the diagram, where clicking a certain subsection of it would lead to a tutorial page showing how to run those methods of data analysis (with relevant examples).

Glad you liked the diagrams! I was really happy to find them myself too, and thought it would be nice to share aha. If you check out the link under “diagrams” in the post (aka this link: http://scikit-learn.org/stable/tutorial/machine_learning_map/), you can actually click the different methods and get a tutorial page on that specific method. 🙂

Oh, I did not notice that. Thank you so much once again! 😀

You could make an interactive diagram with Gephi, Tableau or Domo. It would actually be relatively easy! If you wanted to do that and submit it to a contest, I recommend dataisbeautiful.com; they offer paid contests for little things like this all the time.

For me, what makes data science difficult is the sheer amount of data we deal with. When a program of mine runs on a terabyte of data, I’m basically just praying that things are working like I hope/think they’re working.

Humans weren’t designed to juggle terabytes of numerical data. It’s weird being able to do that with the push of a button. Even after four years, I still find it unsettling and unintuitive.
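One thing that can take some of the praying out of a big run is to have the program check its own assumptions as it goes. Below is a minimal sketch, not a production recipe: the helper names `chunked` and `checked_mean`, the chunk size, and the valid range are all hypothetical, and a small in-memory list stands in for the terabyte.

```python
def chunked(values, size):
    """Yield successive lists of at most `size` items from any iterable."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def checked_mean(values, size, lo, hi):
    """Streaming mean that fails loudly if any value falls outside [lo, hi]."""
    count, total = 0, 0.0
    for chunk in chunked(values, size):
        # NaN fails this comparison too, so it is reported as out of range.
        bad = [v for v in chunk if not (lo <= v <= hi)]
        if bad:
            raise ValueError(f"{len(bad)} out-of-range values, e.g. {bad[0]}")
        count += len(chunk)
        total += sum(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count, count

# Stand-in data: small enough that the streaming answer can be checked directly.
readings = [float(i % 100) for i in range(10_000)]
mean, n = checked_mean(iter(readings), size=1_000, lo=0.0, hi=99.0)
assert n == len(readings)
assert abs(mean - sum(readings) / len(readings)) < 1e-9
```

Running the same code on a sample small enough to verify by hand, and then letting the range and count checks ride along on the full run, will not make terabytes intuitive, but it does replace some of the hoping with evidence.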