# Plotting mixture models in ggplot2

This post deals with something niche but practical - getting ggplot2 to plot multiple fitted gaussians from a model with different amplitudes. Google failed to provide the answer I was looking for - if you can’t find it with Google, it must need a blog post.

We have a couple of different projects running at the moment that need us to
think about mixture models - these are common in single molecule biophysics work
as you frequently end up with two or more populations of molecules stretched out
along a continuous variable; could be velocity, step size or something else. A
common solution to figuring out what the typical representative values are in
the mixed population is to fit multiple gaussians to the data and describe the
mean and standard deviation of each. Doing this also allows you to describe the
relative size of each population if this changes with conditions. The `mixtools`

package^{1} for R
does all this and more - it’s a fantastic resource (and the
vignette^{2}
is really well written too).

There are already some great blog posts^{3} ^{4} that go into using the
`normalmixEM`

function to find a specified number (*k*) of best fit gaussians
for your data. And also some that do a far better job than this biologist is
ever going to do of explaining the Expectation-Maximization Algorithm^{5} that’s
at work. It’s worth noting that `mixtools`

has a whole suite of tools for
estimating the number of gaussians in your sample (not trivial) and can deal
with non-normal distributions.

### set up and getting the model

```
#libraries needed
library(tidyverse)
library(mixtools)
#just a plotting theme
ggplot2::theme_set(
theme_bw() +
theme(
plot.title = element_text(size = 14),
axis.title = element_text(size = 14, colour = "grey30"),
axis.text = element_text(size = 12, colour = "grey30"),
panel.grid = element_blank()
)
)
```

First let’s simulate a data set comprised of two groups with different means and the same standard deviation, but with half the observations in one group compared to the other:

```
set.seed(12) #better make this reproducible
observations <- tibble(value = c(
rnorm(n = 125, mean = 0.1, sd = 0.2), #the first normal distribution
rnorm(n = 250, mean = 0.8, sd = 0.2) #this second distribution is double the size of the first
)
)
#and a quick gander...
ggplot(observations, aes(x = value)) +
geom_histogram(binwidth = 0.05)
```

Next let’s get `mixtools`

to figure out where it thinks those gaussians are:

`my_mix <- normalmixEM(observations$value, k = 2) #telling it to find two gaussians in the observations`

`## number of iterations= 38`

`plot(my_mix, which = 2) #looking at the model produced with base graphics`

### the problem

For publication ready plots we need to move over to ggplot2 rather than base
graphics (some will argue no doubt, but there’s value in making graphics as
human friendly as possible for communicating science).
`normalmixEM`

provides a list of 9 outputs for the gaussians it calculates,
including the mean, standard deviation and amplitude:

`my_mix[["mu"]] #means`

`## [1] 0.08612553 0.79291123`

`my_mix[["sigma"]] #standard deviations`

`## [1] 0.1638908 0.1950395`

`my_mix[["lambda"]] #amplitudes`

`## [1] 0.3262176 0.6737824`

It’s straight forward enough to use `dnorm`

with `stat_function`

to plot a
gaussian line on top of a histogram. Two gaussians requires two `stat_functions`

- not a big deal, but if you’ve got a data set with four or five, or you want to
insert this more programmatically into an analysis script you quickly hit
problems.

```
ggplot(observations, aes(x = value)) +
geom_histogram(binwidth = 0.05) +
stat_function(fun = dnorm,
args = list(mean = my_mix[["mu"]][[1]],
sd = my_mix[["sigma"]][[1]])) +
stat_function(fun = dnorm,
args = list(mean = my_mix[["mu"]][[2]],
sd = my_mix[["sigma"]][[2]]))
```

There’s another couple of problems immediately. The gaussians are expressed as
densities and the histogram as counts. Adding the aes `y = ..density..`

to the
histogram can fix this problem, but only if you want to express the data as a
density instead of event counts. More difficult, `dnorm`

doesn’t have an
argument for the relative amplitude, so both gaussians have the same area. Three
things need fixing - plot all the gaussians in a single call, convert them to be
proportional to counts and incorporate the relative amplitude. Plotting multiple
instances of a function requires `mapply`

- the above plot can also be made like
this:

```
ggplot(observations, aes(x = value)) +
geom_histogram(binwidth = 0.05) +
mapply(function(mean, sd) {
stat_function(fun = dnorm, args = list(mean = mean, sd = sd))
},
#the argument definitions hang out down here
mean = my_mix[["mu"]],
sd = my_mix[["sigma"]]
)
```

### the solution

Scaling the gaussian density to the count number requires taking into account
the number of observations in the data set and the bin width of the histogram.
If we’re creating a function to modify `dnorm`

then it can also incorporate the
amplitude, `my_mix[["lambda"]]`

. I need to nest a new function with extra
arguments inside `stat_function`

to modify `dnorm`

.

```
ggplot(observations, aes(x = value)) +
geom_histogram(binwidth = 0.05) +
mapply(
function(mean, sd, lambda, n, binwidth) {
stat_function(
fun = function(x) {
(dnorm(x, mean = mean, sd = sd)) * n * binwidth * lambda
}
)
},
mean = my_mix[["mu"]], #mean
sd = my_mix[["sigma"]], #standard deviation
lambda = my_mix[["lambda"]], #amplitude
n = length(observations$value), #sample size
binwidth = 0.05 #binwidth used for histogram
)
```

That’s more like it! And just to check it works for a more complicated data set with more than two distributions:

```
observations <- tibble(value = c(
rnorm(n = 125, mean = 0.1, sd = 0.2),
rnorm(n = 175, mean = 0.4, sd = 0.05),
rnorm(n = 250, mean = 0.6, sd = 0.05),
rnorm(n = 250, mean = 0.8, sd = 0.2)
)
)
my_mix <- normalmixEM(observations$value, k = 4)
```

`## number of iterations= 167`

```
ggplot(observations, aes(x = value)) +
geom_histogram(binwidth = 0.05) +
mapply(function(mean, sd, lambda, n, binwidth) {
stat_function(
fun = function(x) {
(dnorm(x, mean = mean, sd = sd)) * n * binwidth * lambda
}
)
},
mean = my_mix[["mu"]],
sd = my_mix[["sigma"]],
lambda = my_mix[["lambda"]],
n = length(observations$value),
binwidth = 0.05
)
```

### R session

`devtools::session_info()`

```
## ─ Session info ──────────────────────────────────────────────────────────
## setting value
## version R version 3.6.0 (2019-04-26)
## os macOS Mojave 10.14
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_GB.UTF-8
## ctype en_GB.UTF-8
## tz Europe/London
## date 2019-06-12
##
## ─ Packages ──────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.4 2019-04-10 [1] CRAN (R 3.6.0)
## blogdown 0.13 2019-06-11 [1] CRAN (R 3.6.0)
## bookdown 0.11 2019-05-28 [1] CRAN (R 3.6.0)
## broom 0.5.2 2019-04-07 [1] CRAN (R 3.6.0)
## callr 3.2.0 2019-03-15 [1] CRAN (R 3.6.0)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
## cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
## devtools 2.0.2 2019-04-08 [1] CRAN (R 3.6.0)
## digest 0.6.19 2019-05-20 [1] CRAN (R 3.6.0)
## dplyr * 0.8.1 2019-05-14 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## forcats * 0.4.0 2019-02-17 [1] CRAN (R 3.6.0)
## fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
## ggplot2 * 3.1.1 2019-04-07 [1] CRAN (R 3.6.0)
## glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## haven 2.1.0 2019-02-19 [1] CRAN (R 3.6.0)
## hms 0.4.2 2018-03-10 [1] CRAN (R 3.6.0)
## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.6.0)
## httr 1.4.0 2018-12-11 [1] CRAN (R 3.6.0)
## jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.0)
## knitr 1.23 2019-05-18 [1] CRAN (R 3.6.0)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
## lattice 0.20-38 2018-11-04 [1] CRAN (R 3.6.0)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.0)
## lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.0)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## MASS 7.3-51.4 2019-03-31 [1] CRAN (R 3.6.0)
## Matrix 1.2-17 2019-03-22 [1] CRAN (R 3.6.0)
## memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
## mixtools * 1.1.0 2017-03-10 [1] CRAN (R 3.6.0)
## modelr 0.1.4 2019-02-18 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## nlme 3.1-140 2019-05-12 [1] CRAN (R 3.6.0)
## pillar 1.4.1 2019-05-28 [1] CRAN (R 3.6.0)
## pkgbuild 1.0.3 2019-03-20 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.6.0)
## pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.6.0)
## prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
## processx 3.3.1 2019-05-08 [1] CRAN (R 3.6.0)
## ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
## purrr * 0.3.2 2019-03-15 [1] CRAN (R 3.6.0)
## R6 2.4.0 2019-02-14 [1] CRAN (R 3.6.0)
## Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.6.0)
## readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
## remotes 2.0.4 2019-04-10 [1] CRAN (R 3.6.0)
## rlang 0.3.4 2019-04-07 [1] CRAN (R 3.6.0)
## rmarkdown 1.13 2019-05-22 [1] CRAN (R 3.6.0)
## rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
## rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
## rvest 0.3.4 2019-05-15 [1] CRAN (R 3.6.0)
## scales 1.0.0 2018-08-09 [1] CRAN (R 3.6.0)
## segmented 0.5-4.0 2019-05-13 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## survival 2.44-1.1 2019-04-01 [1] CRAN (R 3.6.0)
## tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
## tidyr * 0.8.3 2019-03-01 [1] CRAN (R 3.6.0)
## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.0)
## tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.6.0)
## usethis 1.5.0 2019-04-07 [1] CRAN (R 3.6.0)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
## xfun 0.7 2019-05-14 [1] CRAN (R 3.6.0)
## xml2 1.2.0 2018-01-24 [1] CRAN (R 3.6.0)
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
##
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
```