Leverage Semi-joins in R

Introduction: Assuming you already have some background with the more common types of joins (inner, left, right, and outer), adding semi and anti joins can prove incredibly useful, saving you what could otherwise have taken multiple steps. In this post, I’ll be focusing on just semi-joins; that said, there is a lot of overlap …

Kmeans clustering

Introduction: Clustering is a machine learning technique that falls into the unsupervised learning category. Without going into a ton of detail on different machine learning categories, I’ll give a high-level description of unsupervised learning. To put it simply, rather than pre-determining what we want our algorithm to find, we provide the algorithm little to …

What Every Data Scientist Needs to Know About Clustering

Introduction to Machine Learning: Machine learning is a frequently buzzed-about term, yet there is often a lack of understanding of its different areas. One of the first distinctions made within machine learning is between what’s called supervised and unsupervised learning. Having a basic understanding of this distinction and the purposes and applications of each will be …

Building a Regression Model with Categorical Factors

Introduction: Regression is a staple in the world of data science, and as such it’s useful to understand it in its simplest form. I recently wrote a post that went into more detail on regression. You can find that here. To follow up on the ideas that we explored there, today we will be exploring the …

Build, Evaluate, and Interpret a Linear Regression Model in Minutes

Intro: Regression is central to so many of the statistical analysis and machine learning tools that we leverage as data scientists. Stated simply, we use regression techniques to model Y through some function of X. We’ll take a look at some additional ideas to set up the premise of regression, and then we’ll take a …

Understanding The General Modeling Framework

When it comes to building statistical models, we do so with the purpose of understanding or approximating some aspect of our world. The concept of the general modeling framework lends itself well to breaking down the purposes and approaches that we might take to generate that understanding. What is the General Modeling Framework? Take a look …

GIT Essentials for a Data Scientist

Version Control 101: Version control is all about managing changes to files and directories by one or many contributors. Git is an incredibly popular system for version control and the one we will be running through for this course. There are many benefits to version control, and to Git specifically, including a view of historical changes …

Don’t Miss The Bias-Variance Tradeoff Question in Your Next Interview

Why Do Interviewers Ask About It? Questions about the bias-variance tradeoff come up very frequently in interviews for data scientist positions. They often serve to distinguish a data scientist who is seasoned and knows their stuff from one who is junior and, more specifically, unfamiliar with their options for mitigating prediction …

You’re Not a Data Scientist Until You Understand the Binomial Distribution

Inference at the heart of data analysis: What is the point of inference? Inference is about drawing conclusions about a greater population via some sample of observed data. For example, you have some sample of the country’s opinion on the president and you’d like to draw some conclusions about the population at large. Obviously you …

Machine Learning, Simplified. Be a Part of the Conversation.

What’s all the buzz about? Machine learning is a concept and frequently dropped buzzword in today’s tech environment that leaves a lot to be desired as far as explanation goes. People often refer to machine learning algorithms as a black box, and while certain aspects of machine learning may lack …