In the previous post I plotted multiple fit gaussians onto a data set with ggplot2 - the final model is actually the sum of these fitted gaussians though, so how do I plot that? This is a more tricksy problem. I know I need to nest the mapply function inside another function, but my mind doesn’t immediately leap to the answer - I tried to come up with something quickly, but it inevitably failed.

My approach to fixing analysis problems like this is to break it down and build up the complexity slowly until I get something that works. Rather than working over the same code until it was right, this time I kept track of my iterations so I had something I could use to demonstrate my approach to my students.

An aside on why I think it’s important: I really want to create an environment in my lab where folks are comfortable with coding (so far it’s working!); I share Steve Royle’s view that “[computer-based methods] have now permeated mainstream […] cell biology to such an extent that any groups that want to do cell biology in the future have to adapt”. We’re wet lab scientists, and almost all my students coming in don’t have huge experience in analyzing data with a programming language. I didn’t learn R until I self-taught as a postdoc. One of the things that made it difficult to learn was when you did figure out an answer with Google and stack overflow, it was usually just presented to you - “Ta da! I’m smart look at my solution!”. This gives the impression that if you know R/python/whatever you should just know the answer - but really getting things to work requires the same sort of focus and optimizing of a protocol as bench work. You can do it in a structured way and with more experience you get better, just like bench work. For me, part of fostering the right environment to encourage people to pick up R and python and be comfortable not always getting it right, is to show how I work through some problems.

where we left off

I kick off where I left off, plotting the gaussians on a simulated data set

#libraries needed
library(tidyverse)
library(mixtools)
library(ghibli) #' if I want to look less professional by having 
                #' studio ghibli themed plots on my blog that's my choice, okay?

#just a plotting theme
ggplot2::theme_set(
  theme_bw() +
    theme(
      plot.background = element_rect(fill = ghibli_palettes$PonyoLight[7]),
      panel.background = element_rect(fill = ghibli_palettes$PonyoLight[7]),
      plot.title = element_text(size = 14),
      axis.title = element_text(size = 14, colour = ghibli_palettes$PonyoDark[7]),
      axis.text = element_text(size = 12, colour = ghibli_palettes$PonyoDark[7]),
      panel.grid = element_blank()
    )
)

#simulate a dataset
set.seed(12) #better make this reproducible
observations <- tibble(value = c(
  rnorm(n = 1000, mean = 0.1, sd = 0.1),
  rnorm(n = 4000, mean = 0.5, sd = 0.4),
  rnorm(n = 3000, mean = 0.9, sd = 0.1)
  )
  )

#model the dataset with multiple gaussians
my_mix <- normalmixEM(observations$value, k = 3)
## number of iterations= 374
#plot the data and the model
ggplot(observations, aes(x = value)) + 
  geom_histogram(binwidth = 0.05, fill = ghibli_palettes$PonyoMedium[2]) +
  mapply(function(mean, sd, lambda, n, binwidth, colour) {
    stat_function(
      fun = function(x) {
        (dnorm(x, mean = mean, sd = sd)) * n * binwidth * lambda
        },
      colour = colour, #this is just to make it pretty
      size = 1
      ) 
    },
    mean = my_mix[["mu"]], 
    sd = my_mix[["sigma"]],
    lambda = my_mix[["lambda"]],
    n = length(observations$value),
    binwidth = 0.05,
    colour = c(ghibli_palettes$PonyoMedium[3], 
               ghibli_palettes$PonyoMedium[4], 
               ghibli_palettes$PonyoMedium[5])
  )

where do I start?

[warning - this really does go at a snails pace, skip to the end if you want the finished function]

Define the problem. I know what I want to do - at each point along x I want to sum the individual guassians together to get the final model. I take this pretty literally and start there. To make my life simpler while I focus on the problem, I move out of plotting in ggplot for the time being and focus on the numbers. The first thing I do is make the dnorm modification inside mapply into a stand alone function so I can call it more easily.

#my function
proportional_gauss <- function(x, mean, sd, lambda, n, binwidth) {
        (dnorm(x, mean = mean, sd = sd)) * n * binwidth * lambda
        }
#generate a vector of x values to work with
x_values <- seq(from = -1, to = 2, length.out = 1000)

I’ll make a new vector which should be the y values at any given x value for the first gaussian function from my model and then check it quickly:

model_1 <- proportional_gauss(
  x = x_values,
  mean = my_mix[["mu"]][[1]],
  sd = my_mix[["sigma"]][[1]],
  lambda = my_mix[["lambda"]][[1]],
  n = length(observations$value),
  binwidth = 0.05
)

plot(model_1)

Great - that worked. I’ll make the other two then assemble them all into one table (tibble). With dplyr I’ll add a column with mutate to have the sum of all the models, then quickly check if this give me the data I’m after.

model_2 <- proportional_gauss(
  x = x_values,
  mean = my_mix[["mu"]][[2]],
  sd = my_mix[["sigma"]][[2]],
  lambda = my_mix[["lambda"]][[2]],
  n = length(observations$value),
  binwidth = 0.05
)

model_3 <- proportional_gauss(
  x = x_values,
  mean = my_mix[["mu"]][[3]],
  sd = my_mix[["sigma"]][[3]],
  lambda = my_mix[["lambda"]][[3]],
  n = length(observations$value),
  binwidth = 0.05
)

all_models <- tibble(x_values = x_values,
                     first = model_1,
                     second = model_2,
                     third = model_3) %>%
  mutate(., sum = sum(first, second, third))

plot(all_models$sum) #checking the product of the data

Eeeer nope… definitely not what I was hoping for. Using the sum function like this has literally just summed the whole lot together, rather than at each x point. Try again…

#this time I'll manually add the columns together

all_models <- tibble(
  x_values = x_values, 
  first = model_1, 
  second = model_2, 
  third = model_3) %>% 
  mutate(., sum = (first + second + third))

plot(all_models$sum) #checking the product of the data

Okay - Better. Seems relatively simple. But I don’t want to manually call each model, I want to have a function call all the relevant models at once. I Google a bit and then read the apply help file for the thousandth time and try this:

test_sum <- apply(all_models[, 2:4], 1, sum) #row sums using apply across the 2nd to fourth columns
plot(test_sum)

Sweet! that totally worked. Okay, now do it in the table -

#replacing the manual sum with the apply function
all_models <- tibble(
  x_values = x_values,
  first = model_1,
  second = model_2,
  third = model_3) %>%
  mutate(., sum = apply(all_models[, 2:4], 1, sum)) 

plot(all_models$sum)

Get in! okay, next task - change the absolute references to columns, all_models[, 2:4], to relative ones with ncol():

#making the apply function in mutate generic
all_models <- tibble(
  x_values = x_values,
  first = model_1,
  second = model_2,
  third = model_3) %>%
  mutate(., sum = apply(.[, 2:ncol(.)], 1, sum))

plot(all_models$sum)

#It still works, and overlaying this manual data on the histrogram confirms it:
ggplot(observations, aes(x = value)) +
  geom_histogram(binwidth = 0.05, aes(y = ..count..), 
                 fill = ghibli_palettes$PonyoMedium[2]) +
  geom_line(data = all_models, aes(x = x_values, y = sum), 
            colour = ghibli_palettes$PonyoMedium[5], size = 1)

Is it possible to nest the whole function inside ggplot or is that asking for trouble? I can only try. First I’ll nest my proportional_gauss function inside mapply pretty much the same as I did before when I was plotting.

#this should do the same thing as before...
test <- mapply(
  proportional_gauss, 
  x = x_values, 
  MoreArgs = list(
    mean = my_mix[["mu"]],
    sd = my_mix[["sigma"]],
    lambda = my_mix[["lambda"]],
    n = length(observations$value),
    binwidth = 0.05
    )
  )

When I use the function on the vector x_values, I can see the product test becomes a matrix in my environment. The model data is along the row instead of down the column. You can see this by using the row selector, [row_number, column_number], to plot the data:

#we've got a matrix, so the data is along the row instead of down the column
plot(test[1,])

So now I need to check that apply will still work and change the MARGIN command from 1 to 2:

test2 <- apply(test, 2, sum) #which means using apply with MARGIN set to 2 not 1 to sum across the guassians
plot(test2)

#And putting it all together...
test3 <- apply(mapply(proportional_gauss,
    x = x_values,
    MoreArgs = list(
      mean = my_mix[["mu"]],
      sd = my_mix[["sigma"]],
      lambda = my_mix[["lambda"]],
      n = length(observations$value),
      binwidth = 0.05
      )
    ),
  2, sum
  )
#I guess I should plot this one too, just to be sure...

plot(test3)

Amazeballs! Happy with that. Okay, one last step; nest the proportional_gauss() function inside a new proportional_gauss_sum() function, so I can use that inside stat_function(). I’ll test it outside ggplot first:

proportional_gauss_sum <- function(x, mean, sd, lambda, n, binwidth) {
  apply(mapply(proportional_gauss,
                      x = x,
                      MoreArgs = list( mean, sd, lambda, n, binwidth)
                      ), 
               2, sum)
}

test4 <- proportional_gauss_sum(
  x = x_values,
  mean = my_mix[["mu"]],
  sd = my_mix[["sigma"]],
  lambda = my_mix[["lambda"]],
  n = length(observations$value),
  binwidth = 0.05
)

plot(test4)

Okay - that means I’m ready for the final reveal. Ta da!

proportional_gauss <- function(x, mean, sd, lambda, n, binwidth) {
        (dnorm(x, mean = mean, sd = sd)) * n * binwidth * lambda
}


proportional_gauss_sum <- function(x, mean, sd, lambda, n, binwidth) {
  apply(mapply(proportional_gauss,
               x = x,
               MoreArgs = list( mean, sd, lambda, n, binwidth)
               ), 
        2, sum)
}

ggplot(observations, aes(x = value)) +
  geom_histogram(binwidth = 0.05, fill = ghibli_palettes$PonyoMedium[2]) +
  stat_function(
    fun = proportional_gauss_sum,
    args = list(
      mean = my_mix[["mu"]],
      sd = my_mix[["sigma"]],
      lambda = my_mix[["lambda"]],
      n = length(observations$value),
      binwidth = 0.05
    ),
    colour = ghibli_palettes$PonyoMedium[5], size = 1
  )

a few comments on the process

Just as there’s more than one way to do a western blot, there’s more than one way to work through a bump in analysis. Putting all the models into a table/tibble in the begining was a bit of a wrong turn - it didn’t really help get to the solution. I ended up there because I use dplyr a lot and it’s where I’m comfortable. I do tend to plot a lot though, for me it’s the easiest way to get a sense of the shape of the data and whether I’m on the right track. I’m happy to make lot’s of incremental changes when I’m less familiar with the functions I’m using - it’s easier to keep a handle on what’s going on.

R session

devtools::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.0 (2019-04-26)
##  os       macOS Mojave 10.14          
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_GB.UTF-8                 
##  ctype    en_GB.UTF-8                 
##  tz       Europe/London               
##  date     2019-06-19                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version  date       lib source        
##  assertthat    0.2.1    2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.4    2019-04-10 [1] CRAN (R 3.6.0)
##  blogdown      0.13     2019-06-11 [1] CRAN (R 3.6.0)
##  bookdown      0.11     2019-05-28 [1] CRAN (R 3.6.0)
##  broom         0.5.2    2019-04-07 [1] CRAN (R 3.6.0)
##  callr         3.2.0    2019-03-15 [1] CRAN (R 3.6.0)
##  cellranger    1.1.0    2016-07-27 [1] CRAN (R 3.6.0)
##  cli           1.1.0    2019-03-19 [1] CRAN (R 3.6.0)
##  colorspace    1.4-1    2019-03-18 [1] CRAN (R 3.6.0)
##  crayon        1.3.4    2017-09-16 [1] CRAN (R 3.6.0)
##  desc          1.2.0    2018-05-01 [1] CRAN (R 3.6.0)
##  devtools      2.0.2    2019-04-08 [1] CRAN (R 3.6.0)
##  digest        0.6.19   2019-05-20 [1] CRAN (R 3.6.0)
##  dplyr       * 0.8.1    2019-05-14 [1] CRAN (R 3.6.0)
##  evaluate      0.14     2019-05-28 [1] CRAN (R 3.6.0)
##  forcats     * 0.4.0    2019-02-17 [1] CRAN (R 3.6.0)
##  fs            1.3.1    2019-05-06 [1] CRAN (R 3.6.0)
##  generics      0.0.2    2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.1.1    2019-04-07 [1] CRAN (R 3.6.0)
##  ghibli      * 0.2.0    2019-03-21 [1] CRAN (R 3.6.0)
##  glue          1.3.1    2019-03-12 [1] CRAN (R 3.6.0)
##  gtable        0.3.0    2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.1.0    2019-02-19 [1] CRAN (R 3.6.0)
##  hms           0.4.2    2018-03-10 [1] CRAN (R 3.6.0)
##  htmltools     0.3.6    2017-04-28 [1] CRAN (R 3.6.0)
##  httr          1.4.0    2018-12-11 [1] CRAN (R 3.6.0)
##  jsonlite      1.6      2018-12-07 [1] CRAN (R 3.6.0)
##  knitr         1.23     2019-05-18 [1] CRAN (R 3.6.0)
##  labeling      0.3      2014-08-23 [1] CRAN (R 3.6.0)
##  lattice       0.20-38  2018-11-04 [1] CRAN (R 3.6.0)
##  lazyeval      0.2.2    2019-03-15 [1] CRAN (R 3.6.0)
##  lubridate     1.7.4    2018-04-11 [1] CRAN (R 3.6.0)
##  magrittr      1.5      2014-11-22 [1] CRAN (R 3.6.0)
##  MASS          7.3-51.4 2019-03-31 [1] CRAN (R 3.6.0)
##  Matrix        1.2-17   2019-03-22 [1] CRAN (R 3.6.0)
##  memoise       1.1.0    2017-04-21 [1] CRAN (R 3.6.0)
##  mixtools    * 1.1.0    2017-03-10 [1] CRAN (R 3.6.0)
##  modelr        0.1.4    2019-02-18 [1] CRAN (R 3.6.0)
##  munsell       0.5.0    2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-140  2019-05-12 [1] CRAN (R 3.6.0)
##  pillar        1.4.1    2019-05-28 [1] CRAN (R 3.6.0)
##  pkgbuild      1.0.3    2019-03-20 [1] CRAN (R 3.6.0)
##  pkgconfig     2.0.2    2018-08-16 [1] CRAN (R 3.6.0)
##  pkgload       1.0.2    2018-10-29 [1] CRAN (R 3.6.0)
##  plyr          1.8.4    2016-06-08 [1] CRAN (R 3.6.0)
##  prettyunits   1.0.2    2015-07-13 [1] CRAN (R 3.6.0)
##  processx      3.3.1    2019-05-08 [1] CRAN (R 3.6.0)
##  ps            1.3.0    2018-12-21 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.2    2019-03-15 [1] CRAN (R 3.6.0)
##  R6            2.4.0    2019-02-14 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.1    2019-03-17 [1] CRAN (R 3.6.0)
##  readr       * 1.3.1    2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1    2019-03-13 [1] CRAN (R 3.6.0)
##  remotes       2.0.4    2019-04-10 [1] CRAN (R 3.6.0)
##  rlang         0.3.4    2019-04-07 [1] CRAN (R 3.6.0)
##  rmarkdown     1.13     2019-05-22 [1] CRAN (R 3.6.0)
##  rprojroot     1.3-2    2018-01-03 [1] CRAN (R 3.6.0)
##  rstudioapi    0.10     2019-03-19 [1] CRAN (R 3.6.0)
##  rvest         0.3.4    2019-05-15 [1] CRAN (R 3.6.0)
##  scales        1.0.0    2018-08-09 [1] CRAN (R 3.6.0)
##  segmented     0.5-4.0  2019-05-13 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.3    2019-03-12 [1] CRAN (R 3.6.0)
##  stringr     * 1.4.0    2019-02-10 [1] CRAN (R 3.6.0)
##  survival      2.44-1.1 2019-04-01 [1] CRAN (R 3.6.0)
##  tibble      * 2.1.3    2019-06-06 [1] CRAN (R 3.6.0)
##  tidyr       * 0.8.3    2019-03-01 [1] CRAN (R 3.6.0)
##  tidyselect    0.2.5    2018-10-11 [1] CRAN (R 3.6.0)
##  tidyverse   * 1.2.1    2017-11-14 [1] CRAN (R 3.6.0)
##  usethis       1.5.0    2019-04-07 [1] CRAN (R 3.6.0)
##  withr         2.1.2    2018-03-15 [1] CRAN (R 3.6.0)
##  xfun          0.7      2019-05-14 [1] CRAN (R 3.6.0)
##  xml2          1.2.0    2018-01-24 [1] CRAN (R 3.6.0)
##  yaml          2.2.0    2018-07-25 [1] CRAN (R 3.6.0)
## 
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library