Proportion data examples
Whatever your application of data analytics & data science, there are proportions everywhere. Proportions are all about understanding the different parts that make up a whole.
Proportions are pretty much just a count of something across a given categorical variable. That could be number of customers across different industries, number of sales calls in different geographies, the number of activities across various activity types, or the number of ice cream cones sold of various flavors. If you can count it, and break it into groups, then you’ve got proportion data!
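As a quick illustration of "count it and break it into groups", here is a minimal sketch in base R using the built-in mtcars dataset (the same one used later in this post). The names cyl_counts and cyl_props are just illustrative.

```r
# Count how many cars fall into each cylinder class
cyl_counts <- table(mtcars$cyl)

# Convert the counts into proportions of the whole (they sum to 1)
cyl_props <- prop.table(cyl_counts)

print(cyl_counts)
print(round(cyl_props, 3))
```

The counts are the raw material; the proportions are what each chart below tries to show.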
Basic proportion visualization
Whether or not you’re familiar with the idea of ‘exploratory data analysis’, simple plotting of basic statistics is very helpful to any analysis, especially as you establish a foundation of understanding that will inform your more complex analysis.
I am going to break down three visualization types for analyzing proportions that will prove very useful: pie charts, waffle charts, and bar charts (imagine that they’re actually maple bar or candy bar charts for the sake of the ‘sweets’ theme).
Pitfalls of a pie chart
- Displaying proportions as angles, and offset angles at that, can make pie charts pretty tough to interpret
- Once you get more than 3-5 classes in a given pie, it is pretty difficult to compare relative proportions, which is the whole purpose here…
- Ok, let’s say you can get an idea of the general allotment for any given level or value of your categorical variable… you still often lack precision, or a precise sense of the disparity between any given set of values.
Redeeming qualities of using pie charts
- Conversely, pie charts are amazing for screen real estate. Rather than taking up a ton of space, they pack a lot of information into a small area.
- Depending on your audience, a pie chart can make it very easy for less technical groups to quickly absorb a given idea.
Let’s get to it!
From the ground up using the mtcars dataset, let’s build a pie chart.
First things first, install and load ggplot2: run install.packages('ggplot2'), then library(ggplot2), and you’re off to the races.
Quick breakdown of the syntax:
- you first include the dataframe you’re working with, in this case mtcars
- then specify aes()-thetics… which is pretty much where you want different variables to show up on a plot
- The first here is x, so whatever your categorical variable, your bucket, your container, your ice cream flavor; add it there.
From here, throw geom_bar() at the bottom to let ggplot know exactly what type of chart you’d like to see. We’ll jump into the syntax, but with ggplot, you effectively create the visualization object, and then tell that object how you want it drawn.
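The "create the object, then tell it what to draw" idea can be sketched in two explicit steps. This is only an illustration on mtcars; the names p and p_bars are placeholders, and a plain bar chart stands in for whatever chart you are building.

```r
library(ggplot2)

# Step 1: create the plot object from the data frame and the aesthetic mapping.
# factor(cyl) treats the cylinder count as a category rather than a number.
p <- ggplot(mtcars, aes(x = factor(cyl)))

# Step 2: add a geom to say which kind of chart to draw.
# geom_bar() counts the rows in each category for us.
p_bars <- p + geom_bar()

print(p_bars)
```

Printing p on its own gives an empty panel with axes, which is a handy way to see that nothing is drawn until a geom is added.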
First, to give you a quick idea of the data: below you can see that we’re grouping by the cylinders variable and counting the number of records in each.
library(dplyr)  # provides %>%, group_by(), summarise(), and n()
counts <- mtcars %>% group_by(cyl) %>% summarise(n = n())
Let’s throw this into a pie!
ggplot(counts, aes(x = 1, y = n, fill = factor(cyl))) + geom_col() + coord_polar(theta = 'y')
Boom! There’s your first pie chart. You’ll see that whatever categorical variable you’re grouping by goes into the fill aesthetic (wrapped in factor() so each class gets its own color), and the count, or n as I’ve written it, goes into the y aesthetic.
You may also notice the geom_col() command as well as coord_polar(). To give an idea of the purpose of coord_polar(), I’ll run this with only geom_col().
ggplot(counts, aes(x = 1, y = n, fill = factor(cyl))) + geom_col()
As you can see, this is a stacked bar showing the relative portions. Adding coord_polar(theta = 'y') wraps this bar into a pie chart.
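One way to soften the precision pitfall mentioned earlier is to print the percentage on each slice. The sketch below is one possible approach, not the only one: it rebuilds the counts in base R so the block stands alone, and the label column and the pie object name are my own additions.

```r
library(ggplot2)

# Same counts as above, built in base R so this block is self-contained
counts <- as.data.frame(table(cyl = mtcars$cyl), responseName = "n")

# Share of the whole for each class, formatted as a text label
counts$label <- paste0(round(100 * counts$n / sum(counts$n)), "%")

pie <- ggplot(counts, aes(x = 1, y = n, fill = cyl)) +
  geom_col() +
  # position_stack(vjust = 0.5) centers each label inside its slice
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme_void()  # drop the axes and grid, which mean little on a pie

print(pie)
```

With the labels in place, readers no longer have to estimate angles to compare two slices.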
A great alternative to pie? Waffles!
Ok so you don’t love pie…. Waffle charts are an excellent alternative. While waffle charts are similar to pie charts, they actually encode each level, class or value of a categorical variable as a proportion of squares.
Pitfalls of a waffle chart
- Similar to pie charts, waffle charts can quickly be bogged down with the inclusion of too many classes
- Definitely don’t try to facet waffle or pie charts. Faceting does not lend itself to a reasonable comparison of the ‘relative proportion’, which is the whole purpose.
To prep your data for a waffle chart, you need to scale the values to whole numbers between 1 and 100 that add up to 100. For this we’ll use a short dplyr pipeline.
What you’ll see below is that we group our dataset by our categorical variable, then summarise according to the counts, or n(). From there, we create a new variable called percent using mutate(). The big thing here is that inside mutate() we create the value scaled to 100.
We’ll set up the names for case_counts and then run waffle().
library(dplyr)
library(waffle)
count <- mtcars %>% group_by(cyl) %>% summarise(n = n()) %>% mutate(percent = round(n/sum(n)*100))
case_counts <- count$percent
names(case_counts) <- count$cyl
waffle(case_counts)
Ok we’re on our way!
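One caveat on the scaling step: plain round() happens to total 100 for the mtcars cylinder counts, but that is not guaranteed in general (three equal classes round to 33 each, for a total of 99). A minimal sketch of one common fix, the largest-remainder method, is below. The helper name round_to_100 is hypothetical and is not part of dplyr or waffle.

```r
# Hypothetical helper: whole-number percentages that always total 100,
# using the largest-remainder method.
round_to_100 <- function(n) {
  raw <- n / sum(n) * 100          # exact percentages
  floored <- floor(raw)            # start from the rounded-down values
  leftover <- 100 - sum(floored)   # squares still to hand out
  # Give one extra square to the classes with the largest fractional parts
  bump <- order(raw - floored, decreasing = TRUE)[seq_len(leftover)]
  floored[bump] <- floored[bump] + 1
  floored
}

round_to_100(c(1, 1, 1))         # plain round() would total 99 here
round_to_100(table(mtcars$cyl))  # keeps the class names from table()
```

The result can be used in place of the percent column when building the named vector for the waffle chart.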
Let’s wrap it up with bar charts
For a lot of things, bars just work better for comparing one value to another. Let’s unroll our pie and throw it into bars. Also take note that this is not a histogram: we are treating the cylinder count as a categorical variable.
library(ggplot2)
ggplot(mtcars, aes(x = as.factor(cyl))) + geom_bar()
A best practice for stacked bars: don’t make them in isolation. A single stacked bar is not nearly as useful once you get past three classes.
The key is that the wholes being compared all share the same y axis.
Something to keep in mind for bars is that anything far beyond three variables will be a lot more difficult to interpret.
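To make the shared y axis concrete, here is a small sketch that compares the cylinder mix across the two transmission types in mtcars. The choice of am as the comparison variable and the axis labels are my own assumptions for illustration.

```r
library(ggplot2)

# Compare the cylinder mix across transmission types (am: 0 = automatic, 1 = manual).
# position = "fill" rescales every bar to the same 0-1 height, so the wholes
# share one y axis and the segments can be compared directly.
stacked <- ggplot(mtcars, aes(x = factor(am), fill = factor(cyl))) +
  geom_bar(position = "fill") +
  labs(x = "Transmission (0 = automatic, 1 = manual)",
       y = "Proportion of cars",
       fill = "Cylinders")

print(stacked)
```

Because each bar is a whole of its own, the chart answers "how does the mix differ between groups?" rather than "which group is bigger?".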
To reorder the bars of your bar chart, make sure the categorical variable is a factor (as.factor() or factor()), then set the levels in the order you want them displayed.
ggplot orders the bars and the legend by the factor’s levels, which default to sorted order. To override this, turn the cyl column into a factor with the levels in the order we want our plot to use.
mtcars %>% mutate(cyl = factor(cyl, levels = c('8', '6', '4')))
This can often play a big part in organizing your plots to optimize for interpretability.
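A common ordering is most frequent first. The sketch below shows one way to do that in base R; the names mtcars_ordered, by_freq, and ordered_bars are illustrative, and working on a copy is just a precaution so the built-in dataset stays untouched.

```r
library(ggplot2)

# Work on a copy so the built-in dataset is left alone
mtcars_ordered <- mtcars

# Class names sorted from most to least common
by_freq <- names(sort(table(mtcars_ordered$cyl), decreasing = TRUE))

# Rebuild the column as a factor with the levels in that order
mtcars_ordered$cyl <- factor(mtcars_ordered$cyl, levels = by_freq)

# The bars now follow the level order instead of the default sorted order
ordered_bars <- ggplot(mtcars_ordered, aes(x = cyl)) + geom_bar()
print(ordered_bars)
```

Sorting by frequency lets the eye read the ranking straight off the chart.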
Enjoy getting your hands dirty with proportion charts and categorical related data visualization. As you familiarize yourself with different charting techniques it will do you well to think about different charting tools as tools you might use for a given datatype and situation.
Happy Data science-ing! And don’t forget to follow my blog to get more blogs related to machine learning, data visualization, data wrangling, and all things data science! datasciencelessons.com.