Dirichlet Multinomial Mixtures

Community typing with Dirichlet Multinomial Mixtures

Dirichlet Multinomial Mixtures (DMM) 是一种用于对微生物群落分析数据进行群落分型(或聚类)的概率方法。 这是一个无限的混合模型,这意味着该方法可以推断出最佳数量的群落类型。 请注意,群落类型的数量可能会随数据大小而增长。

library(microbiome)
library(DirichletMultinomial)
library(reshape2)
library(magrittr)
library(dplyr)
# Load example data
data(dietswap)
pseq <- dietswap

# To speed up, only consider the core taxa
# that are prevalent at 0.1% relative abundance in 50% of the samples
# (note that this is not strictly correct as information is
# being discarded; one alternative would be to aggregate rare taxa)
pseq.comp <- microbiome::transform(pseq, "compositional")
taxa <- core_members(pseq.comp, detection = 0.1/100, prevalence = 50/100)
pseq <- prune_taxa(taxa, pseq)

# Pick the OTU count matrix
# and convert it into samples x taxa format
dat <- abundances(pseq)
count <- as.matrix(t(dat))

拟合 DMM 模型.,让我们将群落类型的最大允许数量设置为3,以加速示例。

fit <- lapply(1:3, dmn, count = count, verbose=TRUE)
##   Soft kmeans
##   Expectation Maximization setup
##   Expectation Maximization
##   Hessian
##   Soft kmeans
##     iteration 10 change 0.000029
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 0.000000
##   Hessian
##   Soft kmeans
##     iteration 10 change 0.030731
##     iteration 20 change 0.000110
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 0.000063
##   Hessian

判断拟合效果

lplc <- sapply(fit, laplace) # AIC / BIC / Laplace
aic  <- sapply(fit, AIC) # AIC / BIC / Laplace
bic  <- sapply(fit, BIC) # AIC / BIC / Laplace
#plot(lplc, type="b", xlab="Number of Dirichlet Components", ylab="Model Fit")
#lines(aic, type="b", lty = 2)
#lines(bic, type="b", lty = 3)

选择最佳模型

best <- fit[[which.min(unlist(lplc))]]

参数pi及theta

mixturewt(best)
##          pi     theta
## 1 0.3738027 159.10473
## 2 0.3188891  81.91265
## 3 0.3073082  64.24696

元素(otu)分配给不同cluster

ass <- apply(mixture(best), 1, which.max)

每个otu对每个组成群落的贡献

for (k in seq(ncol(fitted(best)))) {
  d <- melt(fitted(best))
  colnames(d) <- c("OTU", "cluster", "value")
  d <- subset(d, cluster == k) %>%
     # Arrange OTUs by assignment strength
     arrange(value) %>%
     mutate(OTU = factor(OTU, levels = unique(OTU))) %>%
     # Only show the most important drivers
     filter(abs(value) > quantile(abs(value), 0.8))     

  p <- ggplot(d, aes(x = OTU, y = value)) +
       geom_bar(stat = "identity") +
       coord_flip() +
       labs(title = paste("Top drivers: community type", k))
  print(p)
}

Logo

瓜分20万奖金 获得内推名额 丰厚实物奖励 易参与易上手

更多推荐