Getting started in data science can be a bit overwhelming. You need to know statistics, programming, machine learning… and within each of those domains there are many, many subdomains that can dominate a person’s focus. Once they’re done reading everything there is to know about one thing, they may not feel any further along than when they started.
Through the course of this post, I’ll talk a bit about some of the challenges to getting started, the best way to think about data science as a discipline, and how to get started.
Challenges of Getting Started
The Volume of Resources is Overwhelming
One of the amazing things about trying to get into data science today is that there are a ton of resources… that said, it can be a bit overwhelming to navigate it all. What should you start with? What should you spend your time and money on? These are classic questions that it seems no aspiring data scientist has the answer to.
There is a myriad of machine learning courses on Udemy, Udacity, and other learning sites, many of which cover machine learning from a very shallow perspective. I’ve seen that when individuals use this type of resource, they often jump from course to course trying to go deeper, but end up getting a lot of redundant material. That’s not to say these resources are bad, but you need direction on when a course like that fits into your learning.
There are also very expensive paid programs that do very well at getting one on their feet in the space. These programs can be excellent when it comes to the curation of content and the linear learning structure one needs to be successful, but all in all, they may not be absolutely necessary to get your start in the field of data science. In either case, gathering some fundamentals and familiarizing yourself with the space can help you be more successful and can also help you validate your interest.
The other thing that happens is one might do an ML course, then a basic Python course, then jump to an ML engineering course, and whatever else. This leads to a very disjointed education where the principles one is learning don’t necessarily build off of one another.
An all too common phenomenon in data science is what’s known as imposter syndrome. Imposter syndrome is the feeling that maybe you don’t belong, or don’t deserve the title you have because there is so much you don’t know. Because the data science umbrella is so broad, it makes it difficult for a data scientist to have great depth in every single sub-discipline of the field. A data scientist is often looked to as the expert on all things data science, and as a result blind spots are frequently highlighted. The fact is, there is so much to know in this field, and you will struggle to learn it all. Something that is key to overcoming imposter syndrome is having a good understanding of what really falls under the domain of data science, what is essential, & what constitutes a nice-to-have.
From master’s programs, nano-degrees, & video lessons to newsletters, textbooks, & more, these resources can be expensive! While many of them undoubtedly have great content, the cost can constitute a barrier for many. This ties back to the note on imposter syndrome and having a good understanding of what one really needs and in what priority: if you are brand new to the field and start off learning about deep learning, you may burn a chunk of change while feeling no closer to your first role as a data scientist.
What is Data Science, Really?
As I’ve alluded to, a key first step is really grasping the workflow of a data scientist and how different skills and technologies come into play.
In the following section we’ll break that down.
First and foremost, you need to be able to get your hands on data. Regardless of what you hope to do with it, having the skills to get it is a key first step.
If you’re not already familiar, get your feet wet with SQL. SQL stands for structured query language. It’s all about pulling data out of a database. It’s actually pretty simple as far as code goes, as the main purpose is to ask a database for data.
Let’s say you’re a data analyst at a B2B SaaS company. They have all of their sales activity data in a series of database tables. Your boss wants to know which industries they should focus on. As a first step, it would be up to you to find out what industries the company already sells to, and how well they perform in each of them. This could be a highly valuable analysis and the first step is to be able to write a SQL query.
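To make that concrete, here is a rough sketch of what such a query could look like. The table name deals, its columns, and the sample rows are all made up for illustration, and Python’s built-in sqlite3 module stands in for whatever database the company actually uses.

```python
import sqlite3

# In-memory database standing in for the company's sales database.
# The schema and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deals (account TEXT, industry TEXT, amount REAL, won INTEGER)")
conn.executemany(
    "INSERT INTO deals VALUES (?, ?, ?, ?)",
    [
        ("Acme", "Retail", 12000, 1),
        ("Globex", "Retail", 8000, 0),
        ("Initech", "Finance", 30000, 1),
        ("Umbrella", "Healthcare", 15000, 1),
        ("Hooli", "Finance", 22000, 1),
    ],
)

# Which industries do we already sell to, and how well does each perform?
query = """
SELECT industry,
       COUNT(*)                                      AS deals,
       SUM(won)                                      AS deals_won,
       SUM(CASE WHEN won = 1 THEN amount ELSE 0 END) AS revenue
FROM deals
GROUP BY industry
ORDER BY revenue DESC
"""
industry_summary = conn.execute(query).fetchall()
for row in industry_summary:
    print(row)
```

The whole analysis boils down to a GROUP BY on industry with a few aggregates, which is typical of how far basic SQL can take you.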
If you’re just starting out, I’d almost always say get started with SQL first. I won’t break down the specifics of SQL here, but it is a great first step to thinking about data, its structure, and how you’d pull it and use it in the right ways.
There are many flavors of SQL; which one you use depends on your DBMS (database management system). You’ll likely hear of Redshift, MySQL, SQL Server, and MariaDB, among many others… The syntactical differences are often pretty minimal and are a simple Google search away. Don’t feel a need to familiarize yourself with the unique flavors of each right off the bat… they aren’t sufficiently different for that to be necessary, especially right at the start.
There are great SQL courses out there. Khan Academy has a free SQL course that’s a great introduction. Codecademy & DataCamp also have excellent SQL courses that will give you the basics of what you need to get started with data collection.
Beyond SQL, there are many other places you can get data. Data can come from a CSV, an Excel file, directly from websites, JSON, & more. While familiarization is helpful, I’d call those nice-to-haves. In your early days as a data scientist you need to know you can write SQL queries well.
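For a sense of what those other sources look like in practice, here is a small sketch that reads CSV and JSON using only Python’s standard library. The sample text and field names are invented; in real work they would come from files or from a website’s response.

```python
import csv
import io
import json

# Hypothetical sample data; normally this would be read from a file or a web response.
csv_text = "account,industry,amount\nAcme,Retail,12000\nInitech,Finance,30000\n"
json_text = '[{"account": "Hooli", "industry": "Finance", "amount": 22000}]'

# csv.DictReader yields one dict per row, keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# json.loads turns JSON text into Python lists and dicts.
json_rows = json.loads(json_text)

# CSV values always arrive as strings, so convert amounts before combining sources.
records = [{**row, "amount": float(row["amount"])} for row in csv_rows]
records += [{**row, "amount": float(row["amount"])} for row in json_rows]
print(records)
```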
Data cleaning is all about getting your data into a state where it’s usable for whatever analysis should follow.
There are many aspects to data cleaning: how do we handle missing values, are the data types correct, does any re-encoding of variables need to take place, and many other things, largely in consideration of the analysis to follow.
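Here is a minimal sketch of those three concerns on a tiny, made-up dataset, using plain Python. Filling the gap with the median and one-hot encoding the industry are just one reasonable set of choices I’m assuming for the example; the right ones depend on the analysis you have planned.

```python
from statistics import median

# Hypothetical raw records: amounts arrive as text, one is missing,
# and the industry labels are inconsistent.
raw = [
    {"industry": "retail ", "amount": "12000"},
    {"industry": "Finance", "amount": ""},
    {"industry": "FINANCE", "amount": "30000"},
    {"industry": "Healthcare", "amount": "15000"},
]

# 1. Data types: convert amount text to float, using None for missing values.
for row in raw:
    row["amount"] = float(row["amount"]) if row["amount"] else None

# 2. Missing values: fill gaps with the median of the observed amounts.
observed = [row["amount"] for row in raw if row["amount"] is not None]
fill_value = median(observed)
for row in raw:
    if row["amount"] is None:
        row["amount"] = fill_value

# 3. Re-encoding: normalize the label text, then one-hot encode industry.
for row in raw:
    row["industry"] = row["industry"].strip().title()
categories = sorted({row["industry"] for row in raw})
cleaned = [
    {"amount": row["amount"],
     **{f"is_{c.lower()}": int(row["industry"] == c) for c in categories}}
    for row in raw
]
print(cleaned)
```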
Data wrangling operates as an adjacent step to data cleaning. This also has to do with getting your data in the right format to be useful.
You may have a series of datasets that you need to combine into one. To do that, you might use what is called a join or a union. There is also the question of making your dataset wide versus long. I won’t go into the specifics here, but having those few simple operations in your tool belt will go a long way.
You can typically accomplish most data wrangling needs with SQL. With that said, there is a lot of additional functionality available via R or Python.
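As a quick illustration of those operations, the sketch below stacks two yearly tables with a union, attaches the industry with a join, and reshapes one wide record into long form. The tables, columns, and values are hypothetical, and sqlite3 is again just a stand-in database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (account TEXT, industry TEXT);
CREATE TABLE deals_2022 (account TEXT, amount REAL);
CREATE TABLE deals_2023 (account TEXT, amount REAL);
INSERT INTO accounts VALUES ('Acme', 'Retail'), ('Initech', 'Finance');
INSERT INTO deals_2022 VALUES ('Acme', 12000), ('Initech', 30000);
INSERT INTO deals_2023 VALUES ('Acme', 9000);
""")

# UNION ALL stacks rows from tables that share the same columns;
# JOIN brings in columns from a different table by matching on a key.
combined = conn.execute("""
SELECT a.industry, d.account, d.year, d.amount
FROM (
    SELECT account, 2022 AS year, amount FROM deals_2022
    UNION ALL
    SELECT account, 2023 AS year, amount FROM deals_2023
) AS d
JOIN accounts AS a ON a.account = d.account
ORDER BY d.account, d.year
""").fetchall()

# Wide: one row per account with a column per year.
wide = [{"account": "Acme", "2022": 12000.0, "2023": 9000.0}]

# Long: one row per account-year pair.
long_rows = [
    {"account": row["account"], "year": year, "amount": row[year]}
    for row in wide
    for year in ("2022", "2023")
]
print(combined)
print(long_rows)
```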
Exploratory Data Analysis
Exploratory data analysis, or EDA, is about familiarizing yourself with your data. This includes looking at samples of your dataset, checking its data types, assessing the relationships between different combinations of variables through different charting options, and assessing individual variables using summary statistics.
Prior to modeling, engaging in the EDA process helps one understand the patterns and relationships between different variables, and lends itself well to the analysis you will eventually conduct.
This is where skills with data visualization software come in. While you can create these visualizations with R or Python, it’s sometimes helpful to have them exist in a data visualization platform like Tableau, Domo, Power BI, etc.
Everything up to this point is what I’d say might qualify you to be a data analyst. With that said, your typical data scientist might introduce more sophisticated methods or approaches to any given area we’ve already covered. This is not to say you won’t find data analysts doing work beyond this scope either, but it’s just typically where I’d identify the distinction.
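A first pass at EDA can be as small as the sketch below: summary statistics for one variable and a correlation between two. The numbers are invented, and the pearson helper is a hand-written illustration rather than a library function.

```python
from statistics import mean, median, stdev

# Hypothetical dataset: deal size and number of sales calls for six deals.
deal_size = [12000, 8000, 30000, 15000, 22000, 9000]
sales_calls = [4, 3, 9, 5, 7, 3]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Summary statistics for a single variable.
summary = {
    "n": len(deal_size),
    "min": min(deal_size),
    "median": median(deal_size),
    "mean": mean(deal_size),
    "max": max(deal_size),
    "stdev": stdev(deal_size),
}

# Relationship between two variables.
r = pearson(deal_size, sales_calls)
print(summary)
print(round(r, 3))
```

A scatter plot of the same two columns would be the natural charting companion to that correlation.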
Statistical Analysis & Modeling
Once you have some good understanding of your data, this is where you get to flex your statistics muscles. This includes things like probability density functions, t-tests, linear regression, logistic regression, hypothesis testing and so forth.
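To give a flavor of one of those tools, here is simple linear regression fit by ordinary least squares, written out by hand so the arithmetic is visible. The data points are invented for the example.

```python
# Hypothetical data: x could be sales calls, y could be deal size in thousands.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = S_xy / S_xx, intercept = mean_y - slope * mean_x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# R squared: the share of the variance in y explained by the fitted line.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```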
There are many statistics tools and methods to be familiar with, and this is a major area in which to differentiate yourself as a data scientist. Data science as a field has major roots in statistics, but statistics is often overlooked in favor of more complex machine learning approaches.
Having a strong statistics background will set you apart as a data scientist.
Machine learning, or ML, builds abstractions on top of a lot of the more basic statistics principles you might use otherwise. A lot of the most useful ML algorithms are a different packaging of traditional linear regression, or they build many linear regression models, each with slight changes to the inputs or the data passed in, to eventually land at an ideal model.
This is true all the way through neural networks and deep learning.
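One way to see the connection: a single neural network “neuron” with a sigmoid activation is a linear combination of the inputs passed through a squashing function, which is logistic regression. The sketch below trains one with plain gradient descent. The data, learning rate, and iteration count are assumptions picked for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: hours of product usage (x) and whether the account renewed (y).
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

# The model is the same linear form as regression, w * x + b,
# wrapped in a sigmoid so the output reads as a probability.
w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(5000):
    grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        # Gradient of the log loss with respect to w and b.
        grad_w += (p - y) * x
        grad_b += (p - y)
    w -= learning_rate * grad_w / len(xs)
    b -= learning_rate * grad_b / len(xs)

def predict(x):
    """Predicted probability of renewal for a given usage level."""
    return sigmoid(w * x + b)

print(predict(1.0), predict(4.0))
```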
When you’re first getting your start, working to have a fundamental understanding of the statistics and math that supports more complex algorithms will prove helpful when it comes to having confidence in what you’re using, but also will help you better identify appropriate use of a given algorithm.
It’s easy for ML to appear overwhelming. The label artificial intelligence, often used as a synonym, makes it all the more surprising when one realizes these methods are built off of traditional statistical models.
My recommendation here is to learn the basics first. Give yourself enough time to master the data analyst track and get exposed to a variety of statistics tools and analyses. Don’t be overwhelmed by the notion of ML. Instead, work to expose yourself to the broad expanse of machine learning algorithms and seek to understand what differentiates each from whatever else is out there.
The last thing I’d like to highlight is what’s known as ML engineering.
For starters, I should clarify that this is not something I explicitly consider part of the data science domain. While many data scientists have these skills, it is not absolutely essential as a data scientist that you know everything there is to know about this area.
In fact, machine learning engineer is a very common job title where the entirety of one’s work is in this area. ML engineering is all about deploying ML models. This covers everything from standing up APIs where teams & individuals can interact with models, to setting up jobs that re-train models and re-run predictions, to storing these models and making them accessible.
Data science is a new and exciting field, but can often appear overwhelming. Thinking of it through this lens will help you make the difficult decisions about which resources to use and in which order as you work on your data science education.
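A stripped-down sketch of two of those pieces, storing a model and answering a prediction request, might look like the following. The model parameters, file name, and the handle_request function are all hypothetical; a real deployment would sit behind a web framework and a proper model store.

```python
import json
import math
import os
import tempfile

# Hypothetical trained model: logistic regression parameters (values invented).
model = {"weight": 1.8, "bias": -4.5, "version": "2024-01-01"}

# Storing the model: write its parameters somewhere a serving process can read them.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(model_path, "w") as f:
    json.dump(model, f)

def load_model(path):
    """Read stored model parameters back from disk."""
    with open(path) as f:
        return json.load(f)

def handle_request(body, path=model_path):
    """Roughly what an API endpoint does: parse input, score it, return JSON."""
    params = load_model(path)
    x = json.loads(body)["x"]
    score = 1.0 / (1.0 + math.exp(-(params["weight"] * x + params["bias"])))
    return json.dumps({"score": score, "model_version": params["version"]})

response = json.loads(handle_request('{"x": 2.5}'))
print(response)
```

A re-training job would follow the same pattern in reverse: fit new parameters, then overwrite the stored file so the next request picks them up.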
The Data Science Umbrella
- Data Collection
- Data Cleansing
- Data Wrangling
- Exploratory Data Analysis
- Statistical Analysis
- Machine Learning
- Machine Learning Engineering
Thinking of your data science education in a linear fashion with these specific skills in mind should help inform your next step on your journey to become a data scientist.
Best of luck! Happy data science-ing!