单细胞芯片转录组测序的数据挖掘文章一比一复现

站长网2023-07-27 12:40:220阅

自己造轮子

(生信技能树优秀学员“细心的网友”)

马拉松授课线上直播课程，最近的一期是5月8号（今晚七点半）开课哦：

生信入门&数据挖掘线上直播课5月班

文章复现：Integration of Single-Cell RNA Sequencing and Bulk RNA Sequencing Data to Establish and Validate a Prognostic Model for Patients With Lung Adenocarcinoma

全文数据分析，没有湿实验。在GEO数据库下载了一份scRNA的数据，两份基因芯片的数据，还下载了TCGA的数据。研究肺腺癌（LUAD），整合scRNA-seq和传统的RNA-seq数据来构建LUAD患者的预后模型，并采用两个外部验证队列来验证其风险分层能力。

文章整体思路如下：

单细胞降维分群注释，找出tumor和normal间关键细胞类型。单细胞功能富集，拟时序分析，细胞通讯分析TCGA数据进行差异分析，GO，KEGG功能富集TCGA数据进行WGCNA分析，找到hub gene后，取hub gene和TCGA DEGs的交集，进行后续分析单因素Cox回归鉴定潜在的预后DEGs，非负矩阵分解进行sample clustering（亚型），不同cluster的生存分析，免疫浸润分析单因素Cox回归筛选出的gene用LASSO模型进一步筛选，筛选完后进行多因素Cox回归，计算每个样本的risk score，根据risk score分成两类，进行生存分析，将构建的模型应用于两个GEO芯片数据，进行同样分析高危低危人群与临床信息相关性分析，分析risk score是否可作为预后指标，基因突变分析ssGSEA分析，免疫检查点与肿瘤突变负荷分析

本次复现只是为了复现图表，数据不一定具有生物学意义（大家可以理解为凑图或者灌水）

Step 1.单细胞数据处理

先下载好数据，一共4个样本就新建4个文件夹，名字命名成样本名，文件夹内带'GSM'开头的数据是下载的原始数据，重新复制一份，把文件名改成标准的read10x函数读入格式，一定注意名字不要写错。

一定要保证这3个文件同时存在，而且在同一个文件夹下面。

示例代码是：

rm(list=ls())options(stringsAsFactors = F)library(Seurat)sce1 <- CreateSeuratObject(Read10X('../10x-results/WT/'), "wt")

重点就是 Read10X 函数读取文件夹路径，比如：../10x-results/WT/ ，保证文件夹下面有3个文件。

1.1 QC

library(Seurat)

library(harmony)

library(dplyr)

library(stringr)

rm(list = ls())

# 多个数据读取与合并

rawdata_path <- './rawdata'

filename <- list.files(rawdata_path)

rawdata_path <- paste(rawdata_path,filename,sep = '/')

sceList <- lapply(rawdata_path, function(x){

obj <- CreateSeuratObject(counts = Read10X(x),

project = str_split(x,'/')[[1]][3])

})

names(sceList) <- filename

这里sceList是一个列表，列表里共4个元素，每个元素都是一个seurat对象，每个元素的名字就是上面的样本名。rawdata_path就是上面下载的.gz文件所在文件夹的路径，Read10X函数只需要传入文件夹路径就可以自动读入数。

project参数传入的是一个字符串，一般用样本名即可，这里用str_split函数分割路径字符串得到样本名

sce <- merge(sceList[[1]],sceList[-1],add.cell.ids = names(sceList),project = 'luad')

sce@meta.data$group <- str_split(sce@meta.data$orig.ident,'_',simplify = T)[,2]

把4个seurat对象用merge函数合并后，看一下每个细胞的基本信息，这里我根据作者给出的信息，把疾病和正常分成两组，又增加了一列，ade1是疾病，nor是正常

# 查看线粒体(MT开头)、核糖体(RPS/RPL开头)、血红细胞所占比例

grep('^MT',x=rownames(sce@assays$RNA@data),value = T)

grep('^RP[SL]',x=rownames(sce@assays$RNA@data),value = T)

grep('^HB[^(P)]',x=rownames(sce@assays$RNA@data),value = T)

sce <- PercentageFeatureSet(sce,'^MT',col.name = 'percent_MT')

sce <- PercentageFeatureSet(sce,'^RP[SL]',col.name = 'percent_RP')

sce <- PercentageFeatureSet(sce,'^HB[^(P)]',col.name = 'percent_HB')

VlnPlot(sce,features = "nCount_RNA",pt.size = 0,y.max = 10000)

VlnPlot(sce,features = "nFeature_RNA",pt.size = 0,y.max = 2500)

VlnPlot(sce,features = "percent_MT",pt.size = 0)

VlnPlot(sce,features = "percent_RP",pt.size = 0)

VlnPlot(sce,features = "percent_HB",pt.size = 0,y.max = 0.1)

VlnPlot(sce,features = c("nCount_RNA","nFeature_RNA","percent_MT"),pt.size = 0,group.by = 'orig.ident')

对于每个细胞，我又添加了三列信息：线粒体基因占比，核糖体基因占比，红细胞基因占比。这几列用来过滤数据，这里重新给sce赋值是因为，PercentageFeatureSet函数的返回值还是一个seurat对象，和原来的sce相比，就只是在meta data里多了一列而已，所以直接重新赋值，把原来的seurat对象sce给更新掉

过滤前先看下数据，32538个gene，13060个细胞

# 过滤细胞

sce <- subset(sce,subset = nCount_RNA>1000 & nFeature_RNA>300 & percent_MT<25)

# 过滤基因

sce <- sce[rowSums(sce@assays$RNA@counts>0)>3,]

# 过滤线粒体、核糖体、血红细胞进占比高的细胞

sce <- subset(sce,subset = percent_MT<25 & percent_RP<30 & percent_HB<0.1)

过滤完后22249个gene，7892个细胞

# 细胞周期评分

# S.Score较高的为S期，G2M.Score较高的为G2M期，都比较低的为G1期

s_feature <- cc.genes.updated.2019$s.genes

g2m_feature <- cc.genes.updated.2019$g2m.genes

sce <- CellCycleScoring(sce,

s.features = s_feature,

g2m.features = g2m_feature,

set.ident = T)

VlnPlot(sce,features = c('S.Score','G2M.Score'),group.by = 'orig.ident',pt.size = 0)

saveRDS(sce,'sce_qc.rds')

再看一下细胞周期，把细胞周期两个评分加到meta data里，更新sce

1.2 用harmony去除批次效应

library(harmony)

library(dplyr)

library(Seurat)

library(clustree)

rm(list = ls())

sce <- readRDS('sce_qc.rds')

sce <- NormalizeData(sce,

normalization.method = 'LogNormalize',

scale.factor = 10000)

sce <- FindVariableFeatures(sce,

selection.method = "vst",

nfeatures = 2000)

# 默认用variableFeature做scale

sce <- ScaleData(sce)

sce <- RunPCA(sce,features = VariableFeatures(sce))

DimPlot(sce,reduction = 'pca',group.by = 'group')

从PCA降维的结果看，两个组的批次效应不大，均匀的混合在一起，但是这里还是用harmony试试

sce <- RunHarmony(sce,group.by.vars = 'orig.ident')

ElbowPlot(sce,reduction = 'harmony')

sce <- RunUMAP(sce,dims = 1:10,reduction = 'harmony')

sce <- RunTSNE(sce,dims = 1:10,reduction = 'harmony')

DimPlot(sce,reduction = 'umap',label = T,group.by = 'group')

DimPlot(sce,reduction = 'tsne',label = T,group.by = 'group')

选择orig.ident这一列，去除4个样本之间的批次效应。根据肘部图确定降维后的维度。

sce <- FindNeighbors(sce,reduction = 'harmony',dims = 1:10)

sce_res <- FindClusters(sce_res,resolution = 0.8)

# 设置不同的分辨率查看分组情况

sce_res <- sce

for (i in c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5,0.8,1)){

sce_res <- FindClusters(sce_res,resolution = i)

}

clustree(sce_res,prefix = 'RNA_snn_res.')

FindNeighbors函数计算细胞间的距离，FindClusters函数确定分群结果。上面的UMAP看着大概分了7坨，但实际上并不是7个分群，实际分群数量还是要看FindClusters的结果。FindClusters函数分辨率不同，分群数量会不同，具体分多少需要摸索。这里分辨率选了0.5，对应第六行蓝色，13个分群，如果选0.8的话，感觉分群数量会突然增加好多，不太合适。

sce <- FindClusters(sce,resolution = 0.5)

saveRDS(sce,file = 'step2_harmony.rds')

DimPlot(sce,reduction = 'umap',group.by = 'orig.ident')

DimPlot(sce,reduction = 'umap',group.by = 'seurat_clusters')

确定好分辨率后，分群赋值给sce，在meta data里会多出来两列，这两列是一样的，表示每个细胞分在哪个cluster里，最后保存一下更新后的sce对象，再来看下分群结果

1.3 细胞注释

library(SingleR)

library(celldex)

library(Seurat)

library(dplyr)

library(stringr)

library(pheatmap)

library(ReactomeGSA)

library(monocle)

library(limma)

library(ggplot2)

library(msigdbr)

library(singleseqgset)

rm(list = ls())

gc()

sce <- readRDS('step2_harmony.rds')

# singleR注释

hpca.se <- HumanPrimaryCellAtlasData()

bpe.se <- BlueprintEncodeData()

anno <- SingleR(sce@assays$RNA@data,

ref = list(BP=bpe.se,HPCA=hpca.se),

labels = list(bpe.se$label.main,hpca.se$label.main),

clusters = sce@meta.data$seurat_clusters

)

用SingleR自动注释，HumanPrimaryCellAtlasData和BlueprintEncodeData是SingleR自带的两个数据库，用两个数据库注释效果好像比一个会好点。得到的anno也是一个对象，每个细胞群的注释信息就藏在里面，13个分群被注释成7种细胞，其中有6个分群都是上皮细胞，也许上皮细胞还能继续细分。想继续细分的话，对sce取子集，把标签为上皮细胞的细胞取出来，降维分群，丢到SingleR()函数里再注释

plotScoreHeatmap(anno,clusters = anno@rownames,show_colnames = T)

再看下注释结果的打分，不满意的话也可以手动注释

sce@meta.data$singleR_label <- unlist(lapply(sce@meta.data$seurat_clusters, function(x){anno$labels[x]}))

DimPlot(sce,reduction = 'umap',group.by = 'singleR_label',label = T)

DimPlot(sce,reduction = 'tsne',group.by = 'singleR_label',label = T)

anno里面有每个分群对应的注释信息，把这个信息添加到sce的meta data里，然后就可以在分群的图上添加上注释信息了

1.4 鉴定关键细胞类型

cell_list <- split(colnames(sce), sce$singleR_label)

deg <- c()

for (i in names(cell_list)){

sce_temp <- sce[,colnames(sce) %in% cell_list[[i]]]

sce_temp <- CreateSeuratObject(counts = sce_temp@assays$RNA@counts,

meta.data = sce_temp@meta.data,

min.cells = 3,

min.features = 200)

sce_temp <- NormalizeData(sce_temp)

sce_temp <- FindVariableFeatures(sce_temp)

sce_temp <- ScaleData(sce_temp,vars.to.regress = c('nFeature_RNA','"percent_MT"'))

Idents(sce_temp) <- sce_temp$group

sce_temp_markers <- FindAllMarkers(object=sce_temp,

only.pos = T,

logfc.threshold = 2,

min.pct = 0.1)

sce_temp_markers <- sce_temp_markers[order(sce_temp_markers$cluster,sce_temp_markers$avg_log2FC),]

deg_gene <- sce_temp_markers[sce_temp_markers$p_val_adj<0.05,'gene']

deg <- c(deg,length(deg_gene))

}

df <- data.frame(cell_type = names(cell_list),deg_gene = deg)

把每种类型的细胞分别拿出来，用FindAllMarkers找出ade1和nor组之间的差异基因

1.5 单细胞通路富集

temp <- sce

Idents(temp) <- temp@meta.data$singleR_label

reactome_obj <- analyse_sc_clusters(temp,create_reactome_visualization=T,use_interactors=F)

path_ways <- pathways(reactome_obj)

path_ways$diff <- rowMax(as.matrix(path_ways[,2:ncol(path_ways)])) - rowMin(as.matrix(path_ways[,2:ncol(path_ways)]))

path_ways <- path_ways[order(path_ways$diff,decreasing = T),]

plot_gsva_heatmap(reactome_obj,rownames(path_ways)[1:10],margins = c(10,20))

saveRDS(sce,file = 'step3_sce.rds')

用pathway()提取富集出来的结果，每行是一个通路，每列是一种细胞，计算每个通路在7种细胞中表达差异的diff，根据diff排序，画出前10个通路

1.6 拟时序分析

library(Seurat)

library(monocle)

sce <- readRDS('step3_sce.rds')

# 1. 将seurat对象转化为monocle的CDS对象

sparse_data <- as(as.matrix(sce@assays$RNA@counts),'sparseMatrix')

mdata <- new('AnnotatedDataFrame',data=sce@meta.data) # 行为cell

fData <- data.frame(gene_short_name=row.names(sparse_data),row.names = row.names(sparse_data))

fd <- new('AnnotatedDataFrame',data=fData) # 行为gene 列为gene description

monocle_cds <- newCellDataSet(cellData = sparse_data,

phenoData = mdata,

featureData = fd,

lowerDetectionLimit = 0.5,

expressionFamily = negbinomial.size())

# 2. 计算size factor和离散度

monocle_cds <- estimateSizeFactors(monocle_cds)

monocle_cds <- estimateDispersions(monocle_cds)

# 3.过滤低质量细胞

monocle_cds <- detectGenes(monocle_cds,min_expr = 0.1)

# 4. 找高变基因、降维分群

# 用seurat找的高变基因

sce_var_gene <- VariableFeatures(sce)

monocle_cds <- setOrderingFilter(monocle_cds,sce_var_gene)

monocle_cds <- reduceDimension(monocle_cds,num_dim=10,norm_method = 'log',reduction_method = 'tSNE')

monocle_cds <- clusterCells(monocle_cds,num_clusters = 10)

plot_cell_clusters(monocle_cds,color_by = 'singleR_label')

diff_test_gene <- differentialGeneTest(monocle_cds[sce_var_gene,],fullModelFormulaStr = '~singleR_label')

diff_gene <- row.names(subset(diff_test_gene,qval<0.01))

monocle_cds <- setOrderingFilter(monocle_cds,diff_gene)

# 5. 用reversed graph embedding降维

monocle_cds <- reduceDimension(monocle_cds,reduction_method = 'DDRTree')

# 6. 计算细胞拟时间

monocle_cds <- orderCells(monocle_cds)

这里运行到第六步，orderCells()函数大概率会报错，详细的解决办法可以参考知乎上的贴子https://zhuanlan.zhihu.com/p/589134519 ，这里说一下大概解决步骤

下载一个包，'igraph’下载一个R文件，order_cells.R，链接在知乎贴子里，然后source调用把orderCells()中用DDRTree计算的那部分代码单独拿出来，写成一个函数，后面直接调用，记得给出返回值。或者不封装成函数的形式也行，那就不需要返回值了# 6. 计算细胞拟时间

source('order_cells.R')

library('igraph')

my_ordercell <- function(cds){

root_state = NULL

num_paths = NULL

reverse = NULL

root_cell <- select_root_cell(cds, root_state, reverse)

cds@auxOrderingData <- new.env(hash = TRUE)

if (cds@dim_reduce_type == "DDRTree") {

if (is.null(num_paths) == FALSE) {

message("Warning: num_paths only valid for method 'ICA' in reduceDimension()")

}

cc_ordering <- extract_ddrtree_ordering(cds, root_cell)

pData(cds)$Pseudotime <- cc_ordering[row.names(pData(cds)), ]$pseudo_time

K_old <- reducedDimK(cds)

old_dp <- cellPairwiseDistances(cds)

old_mst <- minSpanningTree(cds)

old_A <- reducedDimA(cds)

old_W <- reducedDimW(cds)

cds <- project2MST(cds, project_point_to_line_segment)

minSpanningTree(cds) <- cds@auxOrderingData[[cds@dim_reduce_type]]$pr_graph_cell_proj_tree

root_cell_idx <- which(V(old_mst)$name == root_cell, arr.ind = T)

cells_mapped_to_graph_root <- which(cds@auxOrderingData[["DDRTree"]]$pr_graph_cell_proj_closest_vertex == root_cell_idx)

if (length(cells_mapped_to_graph_root) == 0) {

cells_mapped_to_graph_root <- root_cell_idx

}

cells_mapped_to_graph_root <- V(minSpanningTree(cds))[cells_mapped_to_graph_root]$name

tip_leaves <- names(which(degree(minSpanningTree(cds)) == 1))

root_cell <- cells_mapped_to_graph_root[cells_mapped_to_graph_root %in% tip_leaves][1]

if (is.na(root_cell)) {

root_cell <- select_root_cell(cds, root_state, reverse)

}

cds@auxOrderingData[[cds@dim_reduce_type]]$root_cell <- root_cell

cc_ordering_new_pseudotime <- extract_ddrtree_ordering(cds, root_cell)

pData(cds)$Pseudotime <- cc_ordering_new_pseudotime[row.names(pData(cds)), ]$pseudo_time

if (is.null(root_state) == TRUE) {

closest_vertex <- cds@auxOrderingData[["DDRTree"]]$pr_graph_cell_proj_closest_vertex

pData(cds)$State <- cc_ordering[closest_vertex[, 1], ]$cell_state

}

cds

}

monocle_cds <- my_ordercell(monocle_cds)

# 可以画图了

plot_cell_trajectory(monocle_cds,color_by = 'Pseudotime')

plot_cell_trajectory(monocle_cds,color_by = 'State')

plot_cell_trajectory(monocle_cds,color_by = 'singleR_label')

plot_cell_trajectory(monocle_cds,color_by = 'singleR_label') facet_wrap(~singleR_label,nrow = 3)

saveRDS(monocle_cds,file = 'monocle.rds')

1.7 细胞通讯

library(CellChat)

library(patchwork)

library(Seurat)

library(dplyr)

library(ggalluvial)

library(NMF)

options(stringsAsFactors = FALSE)

rm(list = ls())

gc()

sce <- readRDS('step3_sce.rds')

# Step1.创建cellchat对象

data.input <- sce@assays$RNA@data

meta.data <- sce@meta.data

cellchat <- createCellChat(object=data.input,

meta = meta.data,

group.by='singleR_label')

cellchat <- addMeta(cellchat,meta = meta.data)

# Step2.加载CellChatDB数据库

cellchatDB <- CellChatDB.human

cellchat@DB <- cellchatDB

# Step3.处理表达数据

cellchat <- subsetData(cellchat)

cellchat <- identifyOverExpressedGenes(cellchat)

cellchat <- identifyOverExpressedInteractions(cellchat)

# Step4.计算通讯概率，推断细胞通讯网络这一步很耗时

cellchat <- computeCommunProb(cellchat,population.size = F)

cellchat <- filterCommunication(cellchat,min.cells = 10)

# Step5. 提取预测的细胞通讯网络为data frame

df.net <- subsetCommunication(cellchat)

df.pathway <- subsetCommunication(cellchat,slot.name = 'netP')

# Step6. 在信号通路水平推断细胞通讯

cellchat <- computeCommunProbPathway(cellchat)

# Step7. 计算加和的cell-cell通讯网络

cellchat <- aggregateNet(cellchat)

par(mfrow = c(1,2))

netVisual_circle(cellchat@net$count,vertex.weight = groupSize,

weight.scale = T,label.edge = F,

title.name = 'number of Interaction')

netVisual_circle(cellchat@net$weight,vertex.weight = groupSize,

weight.scale = T,label.edge = F,

title.name = 'Interaction Weight')

mat <- cellchat@net$weight

par(mfrow = c(3,3),xpd=T)

for (i in 1:nrow(mat)){

mat2 <- matrix(0,nrow = nrow(mat),ncol = ncol(mat),

dimnames = dimnames(mat))

mat2[i,] <- mat[i,]

netVisual_circle(mat2,vertex.weight = groupSize,

weight.scale = T,edge.weight.max = max(mat),

title.name = rownames(mat)[i])

}

# Step8. 使用层次图（Hierarchical plot），圆圈图（Circle plot）或和弦图（Chord diagram）可视化每个信号通路

cellchat@netP$pathways

pathway.show <- c('SEMA4')

vertex.receiver <- c(1,2,3,4)

netVisual_aggregate(cellchat,signaling = pathway.show,

vertex.receiver = vertex.receiver,

layout = 'hierarchy')

par(mfrow=c(1,2),xpd=T)

netVisual_aggregate(cellchat,signaling = pathway.show,

layout = 'circle')

netVisual_aggregate(cellchat,signaling = pathway.show,

layout = 'circle',label.edge=T)

pairLR.CXCL <- extractEnrichedLR(cellchat,signaling = pathway.show,

geneLR.return = F)

LR.show <- pairLR.CXCL[1,]

netVisual_individual(cellchat,signaling = pathway.show,

pairLR.use = 'SEMA4A_PLXNB2',

layout = 'circle')

netVisual_aggregate(cellchat,signaling = pathway.show,layout = 'chord',

title.name = "Chord diagram 1")

Step 2. TCGA差异分析

library(TCGAbiolinks)

library(SummarizedExperiment)

library(AnnoProbe)

library(stringr)

library(limma)

library(tinyarray)

library(data.table)

library(ggplot2)

library(dplyr)

library(tibble)

library(pheatmap)

library(clusterProfiler)

library(org.Hs.eg.db)

rm(list = ls())

# 下载LUAD表达矩阵

query <- GDCquery(project = 'TCGA-LUAD',

data.category = 'Transcriptome Profiling',

data.type = 'Gene Expression Quantification',

workflow.type = 'STAR - Counts')

GDCdownload(query)

dat <- GDCprepare(query)

exp <- assay(dat)

save(exp,file='TCGA_LUAD_exp.Rdata')

# 下载LUAD临床信息

query <- GDCquery(project = 'TCGA-LUAD',

data.category = 'Clinical',

data.type = 'Clinical Supplement',

file.type = 'xml')

GDCdownload(query)

dat <- GDCprepare_clinic(query,clinical.info = 'patient')

k = apply(dat, 2, function(x){!all(is.na(x))});table(k)

clinical <- dat[,k]

save(clinical,file = 'TCGA_LUAD_Clinical.Rdata')

注意一下TCGA的样本命名含义：

以这个为例：TCGA-A6-6650-01A-11R-1774-07

A6：Tissue source site，组织来源编码

6650：Participant, 参与者编号

01：Sample, 编号01~09表示肿瘤，10~19表示正常对照

A：Vial, 在一系列患者组织中的顺序，绝大多数样本该位置编码都是A;

很少数的是B，表示福尔马林固定石蜡包埋组织，已被证明用于测序分析的效果不佳，所以不建议使用-01B的样本数据

11：Portion, 同属于一个患者组织的不同部分的顺序编号，同一组织会分割为100-120mg的部分，分别使用

R：RNA, 分析的分子类型，对应关系如下所示：

1774：Plate, 在一系列96孔板中的顺序，值大表示制板越晚

07：Center, 测序或鉴定中心编码

# count 转换成TPM

gene_length <- fread('gene_len.txt')

colnames(gene_length) <- c('id','length')

gene_length <- as.data.frame(gene_length)

gene_length <- gene_length[!duplicated(gene_length$id),]

exp <- as.data.frame(exp)

exp$id <- unlist(lapply(rownames(exp),function(x){str_split(x,'\\.',simplify = T)[,1]}))

merge <- left_join(x=exp,y=gene_length,by='id')

merge <- na.omit(merge)

merge <- merge[!duplicated(merge$id),]

rownames(merge) <- merge$id

merge <- merge[,-601]

# 计算TPM

kb <- merge$length / 1000

rpk <- merge[,-601] / kb

tpm <- t(t(rpk) / colSums(rpk) * 1E6)

exp <- tpm

采用limma做差异分析，先把counts转换成TPM。不转也可以用DEseq2做差异分析。gene_length是每个基因的长度，这个文件是把gtf文件中，基因的终止位置-起始位置得到的。

# 更改行名：ENSEMBL -> SYMBOL

a <- annoGene(str_split(rownames(exp),'\\.',simplify = T)[,1],'ENSEMBL')

a <- a[!duplicated(a$ENSEMBL),]

rownames(a) <- a$ENSEMBL

rownames(exp) <- a[str_split(rownames(exp),'\\.',simplify = T)[,1],'SYMBOL']

# 给样本分组 tumor/normal

group <- ifelse(sapply(str_split(colnames(exp),'-',simplify = T)[,4],function(x){as.integer(substr(x,1,2))})>=10,

'normal','tumor')

sample_group <- data.frame(sample=colnames(exp),group=group)

a是annoGene函数返回的一个ID转换的结果，是一个dataframe，多个ENSEMBL ID可能对应同一个SYMBOL ID，对symbol去重后，再映射成exp的行名。

按照TCGA名字中第四部分，把样本分成tumor和normal

# 过滤

# 1. 去除FFPE样本

sample_group$sample_type <- ifelse(sapply(str_split(colnames(exp),'-',simplify = T)[,4],function(x){substr(x,3,3)})=='A',

'tissue','ffpe')

exp <- exp[,sample_group$sample_type!='ffpe']

sample_group <- sample_group[sample_group$sample_type=='tissue',]

# 2. 保留在一半以上样本中有值的gene

exp <- exp[rowSums(exp>0)>as.integer(ncol(exp)/2),]

dim(exp)

save(exp,file = 'TCGA_LUAD_TPM_exp.Rdata')

FFPE样本无论做测序还是蛋白组，代谢组，效果都不怎么好

# 对表达矩阵取log2

for (i in 1:ncol(exp)){

exp[,i] <- log2(exp[,i] 0.0001)

}

# 差异筛选

design <- model.matrix(~factor(sample_group$group,levels = c('tumor','normal')))

fit <- lmFit(exp,design = design)

fit <- eBayes(fit)

deg <- topTable(fit, coef = 2,number = Inf)

deg$change <- ifelse((deg$logFC>log2(2))&(deg$adj.P.Val<0.05),'up',

ifelse((deg$logFC<log2(0.5))&(deg$adj.P.Val<0.05),'down','nochange'))

deg <- na.omit(deg)

table(deg$change)

做完差异筛选，有10个gene含空值，删去有空值的行即可。更新下exp并保存好

# deg比exp少了几个gene

exp <- exp[intersect(rownames(exp),deg$ID),]

save(exp,file = 'TCGA_exp.Rdata')

save(deg,file = 'TCGA_DEGS.Rdata')

# volcano plot

ggplot(deg,aes(x=deg$logFC,y=-log10(deg$adj.P.Val),color=change))

geom_point()

geom_hline(yintercept = -log10(0.05),linetype=2,linewidth=1)

geom_vline(xintercept = log2(0.5),linetype=2,linewidth=1)

geom_vline(xintercept = log2(2),linetype=2,linewidth=1)

scale_color_manual(values=c('blue','gray','red'))

# heatmap

deg_heatmap <- deg[order(deg$logFC,decreasing = T),]

deg_gene <- deg_heatmap[deg_heatmap$change!='nochange','ID']

heatmap_deg_gene <- as.data.frame(exp)[deg_gene,]

heatmap_deg_gene <- na.omit(heatmap_deg_gene)

anno_col <- data.frame(group=sample_group$group)

rownames(anno_col) <- sample_group$sample

pheatmap(heatmap_deg_gene,show_rownames = F,cluster_rows = F,scale = 'row',

show_colnames = F,annotation_col=anno_col ,

breaks = seq(-2.5,2.5,length.out=100))

dim(heatmap_deg_gene)

# pca plot

exp <- as.data.frame(exp)

pca <- prcomp(t(exp))

ggplot(data.frame(pca$x),aes(x=PC1,y=PC2,color=sample_group$group))

geom_point(size = 3) stat_ellipse(level = 0.95, show.legend = F)

labs(title = 'PCA Analysis')

theme(plot.title = element_text(hjust = 0.5))

guides(color = guide_legend(title = 'Group'))

# GO enrich

go_bp <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL','BP')

go_mf <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL','MF')

go_cc <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL','CC')

dotplot(go_bp)

barplot(go_bp,showCategory = 4)

# KEGG enrich

entrez_id <- bitr(deg_gene,fromType = 'SYMBOL',toType = 'ENTREZID',OrgDb = org.Hs.eg.db)

kegg <- enrichKEGG(entrez_id$ENTREZID,organism = 'hsa',keyType = 'ncbi-geneid')

dotplot(kegg)

barplot(kegg,showCategory = 4)

Step 3. WGCNA分析

library(WGCNA)

library(reshape2)

library(stringr)

library(dplyr)

library(patchwork)

library(ggplot2)

rm(list = ls())

gc()

load('TCGA_DEGS.Rdata')

load('TCGA_exp.Rdata')

load('TCGA_LUAD_Clinical.Rdata')

options(stringsAsFactors = F) # 开启多线程

enableWGCNAThreads()

# 1. 数据处理

# >>> 选取表达量前10000的gene，用全部gene的话，计算后面的邻接矩阵和TOM矩阵太慢了

temp <- as.data.frame(exp)

temp$avg_exp <- rowMeans(temp)

temp <- temp[order(temp$avg_exp,decreasing = T),]

top10000_gene <- rownames(temp[1:10000,])

exp <- as.data.frame(exp[top10000_gene,])

# >>> 查找并删除异常值

temp <- exp

colnames(temp) <- 1:ncol(exp)

plot(hclust(dist(t(temp)))) # 没有异常样本

# 2. 构建基因共表达网络

exp <- as.data.frame(t(exp))

# >>> 2.1 选择合适的power

power_vec <- c(seq(1, 10, by = 1), seq(12, 20, by = 2))

sft <- pickSoftThreshold(exp,powerVector = power_vec)

a <- ggplot(sft$fitIndices,aes(x=Power,y=-sign(slope)*SFT.R.sq))

geom_text(label=sft$fitIndices$Power,color='red')

geom_hline(yintercept = 0.9,color='red')

xlab("Soft Threshold (power)") ylab("Scale Free Topology Model Fit,signed R^2")

ggtitle("Scale independence") theme(plot.title = element_text(hjust = 0.5))

b <- ggplot(sft$fitIndices,aes(x=Power,y=mean.k.))

geom_text(label=sft$fitIndices$Power,color='red')

xlab("Soft Threshold (power)") ylab("Mean Connectivity")

ggtitle('Mean connectivity') theme(plot.title = element_text(hjust = 0.5))

a b

这里power=12，达到0.8即可

power <- sft$powerEstimate

# >>> 2.2 检验所选power是否符合无尺度网络

k <- softConnectivity(exp,power = power)

scaleFreePlot(k)

power=12确实也符合无尺度网络，R2较高

# >>> 2.3 计算gene在sample间的corr或者distance，构建临近矩阵。默认用dist函数计算（欧氏距离）

adj <- adjacency(exp,power = power)

# >>> 2.4 根据临近矩阵计算TOM（拓扑重叠矩阵）

TOM <- TOMsimilarity(adj)

# >>> 2.5 计算gene间相异性

dissTOM <- 1-TOM

# 3. 对gene进行聚类并可视化

geneTree <- hclust(as.dist(dissTOM),method = 'average')

sizeGrWindow(12,9)

plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",

labels = FALSE, hang = 0.04)

# >>> 3.2 将gene聚类树进行修剪后，把gene归到不同的模块里

dynamic_modules <- cutreeDynamic(dendro = geneTree,distM = dissTOM,minClusterSize = 50,

deepSplit = 2,pamRespectsDendro=F,minSplitHeight = NULL)

table(dynamic_modules)

module_color <- labels2colors(dynamic_modules)

plotDendroAndColors(geneTree,module_color,dendroLabels = F,hang=0.03,addGuide = T,

guideHang = 0.05,groupLabels='Dynamic color tree',

main='gene dendrogram and module colors')

# >>> 3.3 计算每个模块的特征基因向量(对每个模块的gene表达量降维后，只保留PC1，即MEs)

MEList <- moduleEigengenes(exp,colors = dynamic_modules)

MEs <- MEList$eigengenes

# >>> 3.4 计算特征基因的相异度，基于相异度对原有模块进行层次聚类

dissME <- 1-cor(MEs)

MEtree <- hclust(as.dist(dissME),method = 'average')

plotEigengeneNetworks(MEs,

"Eigengene adjacency heatmap",

marHeatmap = c(3,4,2,2),

plotDendrograms = FALSE,

xLabelsAngle = 90)

# >>> 3.5 如果模块太多，可以对模块进行合并.设置剪切高度,剪切高度以下的模块进行合并

cut_height <- 0.2

sizeGrWindow(8,6)

plot(MEtree,main = "Clustering of module eigengenes",

xlab = "", sub = "")

abline(h=cut_height,col='red')

把高度0.2以下的模块进行合并，再看看合并前后模块的变化

module_merge <- mergeCloseModules(exp,module_color,cutHeight = cut_height)

merged_color <- module_merge$colors

merged_MEs <- module_merge$newMEs

sizeGrWindow(8,6)

plotDendroAndColors(geneTree,cbind(module_color,merged_color),dendroLabels = F,hang=0.03,

addGuide = T,groupLabels = c("Dynamic Tree Cut", "Merged dynamic"))

# 4. 计算每个模块与临床信息的相关性、pvalue。找出与临床信息相关性高的模块

nGenes <- ncol(exp)

nSamples <- nrow(exp)

################################准备临床信息####################################

clinical <- clinical[,c('bcr_patient_barcode','gender','residual_tumor')]

traits <- data.frame(row.names = rownames(exp),ID=rownames(exp))

traits$ID <- substr(traits$ID,1,12)

index <- match(traits$ID,clinical$bcr_patient_barcode)

traits$gender <- as.numeric(as.factor(clinical[index,'gender']))

traits$residual_tumor <- as.numeric(as.factor(clinical[index,'residual_tumor']))

traits$type <- ifelse(sapply(str_split(rownames(traits),'-',simplify = T)[,4],

function(x){as.integer(substr(x,1,2))})>=10,'normal','tumor')

traits$type <- as.numeric(as.factor(traits$type))

traits <- traits[,-1]

################################准备临床信息####################################

clinical是前面下载的TCGA临床数据，这里随便取了两个比较关心的临床信息。clinical的每一行是一个病人的ID，而表达矩阵每一行是一个样本，一个病人身上可能有多个样本。在把临床信息和表达矩阵一一对应的时候，如果直接把exp的行名取前12位，就会产生重复值。在做的时候，先准备好样本和病人ID的对应关系，再把clinical里病人的信息给匹配过来

最终的临床信息是这样，要把字符串变成数值型变量

module_trait_cor <- cor(merged_MEs,traits,use = 'p')

module_trait_pvalue <- corPvalueStudent(module_trait_cor,nSamples)

trait_colors <- numbers2colors(traits,signed = T,centered = T)

a <- paste(signif(module_trait_cor, 2), "\n(", signif(module_trait_pvalue, 1), ")", sep = "")

sizeGrWindow(8,6)

labeledHeatmap(module_trait_cor,xLabels = colnames(traits),yLabels = colnames(merged_MEs),colorLabels = FALSE,

colors = blueWhiteRed(50),textMatrix = a,

setStdMargins = FALSE,cex.text = 0.65,zlim = c(-1,1),

main = paste("Module-trait relationships"))

模块与tumor/normal之间的相关性并不是很高。。。。。就拿粉色模块当做感兴趣的模块吧

# 5. 找出感兴趣的模块内的枢纽gene

# >>> 5.1 计算gene表达量与临床性状的相关性

gs <- apply(exp,2,function(x){cor(x,traits$type)})

# >>> 5.2 基于网络计算gene与临床性状的相关性

ns <- networkScreening(y=traits$type,datME = merged_MEs,datExpr = exp)

# >>> 5.3 计算每个gene与特征gene的相关性

kme <- signedKME(datExpr = exp,datME = merged_MEs)

# >>> 5.4 根据gs, ns, kme 找出hub gene

hubgene_filter <- (gs>0.2 & ns$q.Weighted<0.01 & abs(kme$kMEmagenta)>0.4)

hub_gene <- colnames(exp)[hubgene_filter]

plotNetworkHeatmap(exp,

plotGenes = hub_gene,

networkType = "unsigned",

useTOM = TRUE,

power=8,

main="unsigned correlations")

看下kme，其中粉色模块里，gene的相关性并不是很高，这里筛选了>0.4的，最后找到25个hub gene。如果阈值再设高一点，hub gene就全卡掉了。。。这跟没有用所有的3w多个gene去计算有关，也跟一些参数有关，我没有一个个去调整优化，只是展示一下如何做图和分析思路

hub_gene <- intersect(hub_gene,deg$ID)

save(hub_gene,file = 'TCGA_WGCNA_hub_gene.Rdata')

最后找出hub gene和前面deg的交集部分，作为hub gene，存档

Step 4. 非负矩阵分解算法

4.1 NMF算法对样本分类

library(NMF)

library(survival)

library(stringr)

library(data.table)

library(survminer)

library(dplyr)

library(AnnoProbe)

library(CIBERSORT)

library(tidyr)

library(tibble)

library(RColorBrewer)

library(ggalluvial)

library(ggplot2)

options(stringsAsFactors = F)

rm(list = ls())

gc()

load('TCGA_LUAD_TPM_exp.Rdata')

load('TCGA_DEGS.Rdata')

load('TCGA_WGCNA_hub_gene.Rdata')

dim(exp)

# 对表达矩阵取log2

exp <- log2(exp 1)

ranks <- seq(2,10)

hubgene_exp <- exp[hub_gene,]

estimate <- nmf(hubgene_exp,2:10,nrun=50,method = 'brunet')

plot(estimate)

注意hubgene_exp里不能有负数，不然会报错，rank为2时，波动最大

rank <- 2

nmf.rank4 <- nmf(hubgene_exp,rank,nrun=50,method = 'brunet')

index <- extractFeatures(nmf.rank4,'max')

index <- unlist(index)

nmf.exp <- hubgene_exp[index,]

nmf.exp <- na.omit(nmf.exp)

group <- predict(nmf.rank4)

jco <- c("#2874C5","#EABF00","#C6524A","#868686")

consensusmap(nmf.rank4,labRow=NA,labCol=NA,

annCol = data.frame("cluster"=group[colnames(nmf.exp)]),

annColors = list(cluster=c("1"=jco[1],"2"=jco[2],"3"=jco[3],"4"=jco[4])))

cluster_df <- data.frame(row.names = colnames(exp),ID=substr(colnames(exp),1,12),cluster=group)

NMF算法把所有样本分成两类，顺便看下最后的clueter_df

4.2 生存分析

load('TCGA_LUAD_TPM_exp.Rdata')

load('TCGA_LUAD_Clinical.Rdata')

# 1. 取log（TPM）

for (i in 1:ncol(exp)){

exp[,i] <- log2(exp[,i] 0.0001)

}

# 2. 以病人为中心，去重

exp <- as.data.frame(exp)

exp <- exp[,sort(colnames(exp))]

index <- !duplicated(substr(colnames(exp),1,12))

exp <- exp[,index]

# 3.读取从xena下载的生存数据

sur <- fread('TCGA-LUAD.survival.tsv')

sur <- as.data.frame(sur)

patient_id <- substr(colnames(exp),1,12)

clin <- data.frame(row.names = patient_id,ID=patient_id)

index <- match(clin$ID,sur[,'_PATIENT'])

clin <- cbind(clin,sur[index,c('OS','OS.time')])

clin <- na.omit(clin)

index2 <- match(clin$ID,clinical$bcr_patient_barcode)

clin <- cbind(clin,clinical[index2,'gender'])

colnames(clin) <- c(colnames(clin)[1:ncol(clin)-1],'gender')

clin$gender <- as.numeric(as.factor(clin$gender))

下载下来的生存数据长这样

之前从TCGA下载的临床数据长这样

合并后的clin是这样，所有变量类型都要转成数值型

# 合并NMF结果

cluster_type_group <- ifelse(sapply(str_split(rownames(cluster_df),'-',simplify = T)[,4],

function(x){as.integer(substr(x,1,2))}

)>=10,

'normal','tumor')

cluster_df <- cluster_df[which(cluster_type_group=='tumor'),]

clin <- left_join(clin,cluster_df,'ID',multiple='first')

再把NMF分类的结果添加进来，就是cluster那一列

# 7. 过滤生存数据，不要<30天的，再把天数转换成月

clin <- clin[clin$OS.time>30,]

clin$OS.time <- clin$OS.time / 30

# 8. exp的列名和临床信息的行名一一对应。NMF将样本分成若干个亚型，画出亚型的生存曲线

colnames(exp) <- substr(colnames(exp),1,12)

index <- match(clin$ID,colnames(exp))

exp <- exp[,index]

sfit <- survfit(Surv(OS.time,OS)~cluster,data=clin)

ggsurvplot(sfit,pval = T)

save(clin,file = 'clin.Rdata')

4.3 免疫浸润

my_pallet <- colorRampPalette(brewer.pal(8,'Set1'))

lm22f = system.file("extdata", "LM22.txt", package = "CIBERSORT")

cluster1 <- exp[,rownames(cluster_df[cluster_df$cluster==1,])]

cluster2 <- exp[,rownames(cluster_df[cluster_df$cluster==2,])]

TME.result1 <- cibersort(lm22f,as.matrix(cluster1),perm = 100,QN=T) # 耗时很久

TME.result2 <- cibersort(lm22f,as.matrix(cluster2),perm = 100,QN=T)

原文中免疫浸润用MCPcounter做的，这里我用CIBERSORT做，运行完的结果就是各种免疫细胞在每个样本中的含量

result1 <- TME.result1[,-c(23:25)] %>% as.data.frame() %>%

rownames_to_column('Sample') %>% gather(key = cell_type,value = proportion,-Sample)

result2 <- TME.result2[,-c(23:25)] %>% as.data.frame() %>%

rownames_to_column('Sample') %>% gather(key = cell_type,value = proportion,-Sample)

result1$cluster <- rep('cluster1', times=nrow(result1))

result2$cluster <- rep('cluster2', times=nrow(result2))

dat <- rbind(result1,result2)

ggplot(dat,aes(axis1 = cluster, axis2 = cell_type))

geom_alluvium(aes(fill = as.factor(cell_type)),width = 2/5, discern = FALSE)

geom_stratum(width = 2/5, discern = FALSE)

geom_text(stat = "stratum", discern = FALSE,aes(label = after_stat(stratum)))

theme(legend.position = "none",#去除刻度线和背景颜色

panel.background = element_blank(),

axis.ticks = element_blank(),

axis.text.y = element_blank(),

axis.text.x = element_text(size =15,colour = "black"),#坐标轴名

axis.title = element_blank())

scale_x_discrete(position = "top")

把上面的透视表转换成长表，添加上cluster的信息，再把cluster1和cluster2对应的长表合并起来，就可以画图了

cell <- 'Plasma cells'

ggviolin(dat[dat$cell_type==cell,],x='cluster',y='proportion',fill = 'cluster',

add = 'boxplot',add.params = list(fill = 'white',width=0.1,linetype=1),title = cell)

stat_compare_means(method = 't.test',label = 'p.format',aes(label = 'p.format'),

label.x.npc = 'center',label.y = 0.5,size=5)

4.4 批量单因素Cox回归

load('TCGA_WGCNA_hub_gene.Rdata')

load('TCGA_LUAD_TPM_exp.Rdata')

load('clin.Rdata')

hubgene_exp <- exp[hub_gene,]

hubgene_exp <- as.data.frame(t(hubgene_exp))

hubgene_exp$ID <- substr(rownames(hubgene_exp),1,12)

clin <- left_join(clin,hubgene_exp,by='ID')

把WGCNA中找到的hub gene挑出来，拿到他们在所有样本中的表达值。把表达值和生存信息合并到一起，最后的clin长这样。在合并之前要注意，删掉normal样本，我这里没有做。

res.cox <- coxph(Surv(OS.time,OS)~SAPCD2,data=clin)

sumres.cox <- summary(res.cox)

挑一个基因SAPCD2做单因素cox回归，最后需要的信息都藏在sumres.cox里

# 批量单因素cox回归

gene <- c()

p_value <- c()

HR <- c()

lower95 <- c()

upper95 <- c()

for (i in 6:ncol(clin)){

res <- coxph(Surv(OS.time,OS)~clin[,i],data=clin)

sum_res <- summary(res)

p <- sum_res$coefficients[,'Pr(>|z|)']

if (p<0.05){

gene <- c(gene,colnames(clin)[i])

p_value <- c(p_value,p)

HR <- c(HR,sum_res$conf.int[,'exp(coef)'])

lower95 <- c(lower95,sum_res$conf.int[,'lower .95'])

upper95 <- c(upper95,sum_res$conf.int[,'upper .95'])

}

cox_df <- data.frame(row.names = gene,pvalue=p_value,HR=HR,lower.95=lower95,upper.95=upper95)

lasso_input_gene <- rownames(cox_df[cox_df$pvalue<0.01,])

save(lasso_input_gene,file = 'Lasso_input_gene.Rdata')

批量单因素cox回归得到每个gene的pvalue，挑出pvalue<0.01的再进行后续筛选

Step 5. 预后模型构建和验证

load('Lasso_input_gene.Rdata')

load('TCGA_LUAD_TPM_exp.Rdata')

load('clin.Rdata')

# 对表达矩阵取log2

exp <- log2(exp 0.00001)

exp <- as.data.frame(exp)

# Lasso回归进一步筛选gene

exp2 <- exp[lasso_input_gene,]

exp2 <- as.data.frame(t(exp2))

exp2$ID <- substr(rownames(exp2),1,12)

exp2 <- exp2[!duplicated(exp2$ID),]

clin <- clin[,1:3]

exp3 <- left_join(clin,exp2,by='ID')

rownames(exp3) <- exp3$ID

exp3 <- exp3[,-1]

exp3 <- na.omit(exp3)

exp3是生存数据和上一步批量单因素Cox回归筛选出来的基因的表达量合并起来的结果

x <- as.matrix(exp3[,3:(ncol(exp3)-1)])

y <- Surv(exp3$OS.time,exp3$OS)

fit1 <- glmnet(x,y,alpha = 1,family = 'cox')

plot(fit1)

cv_fit <- cv.glmnet(x,y,alpha = 1,nfolds = 10,family="cox")

plot(cv_fit)

best_lambda <- cv_fit$lambda.min

fit2 <- glmnet(x,y,lambda = best_lambda,alpha = 1,family = 'cox')

lasso_filter_gene <- names(fit2$beta[fit2$beta[,1]!=0,1])

由于之前筛选出来的gene效果不怎么好，经过lasso过滤后，只剩2个gene了，就用这两个来做多因素cox回归

# 多因素cox回归构建预后模型

exp4 <- exp3 %>% dplyr::select(OS,OS.time,lasso_filter_gene)

multicox <- coxph(Surv(OS.time,OS)~.,data = exp4)

sum_multicox <- summary(multicox)

riskScore <- predict(multicox,type = 'risk',newdata = exp4)

riskScore <- as.data.frame(riskScore)

riskScore$ID <- rownames(riskScore)

riskScore$risk <- ifelse(riskScore$riskScore>median(riskScore$riskScore),'high','low')

做完多因素回归后，计算每个样本的riskscore，用riskScore把样本分成两类

# KM plot

km_data <- left_join(riskScore,clin,by='ID')

fit <- survfit(Surv(OS.time,OS)~risk,data=km_data)

ggsurvplot(fit,data = km_data,pval = T,risk.table = T,surv.median.line = 'hv',

legend.title = 'RiskScore',title = 'Overall survival',

ylab='Cummulative survival',xlab='Time(Days)')

# 绘制ROC曲线

roc_data <- data.frame(x = roc$FP[,1],

y = roc$TP[,1],

time = roc$times[1])

for (i in 2:length(roc$times)) {

temp <- data.frame(x = roc$FP[,i],

y = roc$TP[,i],

time = roc$times[i])

roc_data <- rbind(roc_data, temp)

}

ggplot(roc_data, aes(x = x, y = y, color = as.factor(time)))

geom_line()

scale_color_manual(values = c("gray", "blue", "black"))

geom_abline(slope = 1, intercept = 0, linetype = "dashed")

xlab("1 - Specificity")

ylab("Sensitivity")

ggtitle("ROC Curve")

theme_minimal()

# 绘制风险图

ggrisk(multicox)

# risk score与临床病理特征的相关性

load('TCGA_LUAD_Clinical.Rdata')

clinical_df <- clinical[,c('bcr_patient_barcode','gender','stage_event_pathologic_stage')]

clinical_df <- merge(km_data,clinical_df,by.x = 'ID',by.y = 'bcr_patient_barcode')

clinical_df <- clinical_df[!duplicated(clinical_df$ID),]

clinical_df$stage_event_pathologic_stage <- case_when(clinical_df$stage_event_pathologic_stage=='Stage IA'~'Stage I',

clinical_df$stage_event_pathologic_stage=='Stage IB'~'Stage I',

clinical_df$stage_event_pathologic_stage=='Stage IIA'~'Stage II',

clinical_df$stage_event_pathologic_stage=='Stage IIB'~'Stage II',

clinical_df$stage_event_pathologic_stage=='Stage IIIA'~'Stage III',

clinical_df$stage_event_pathologic_stage=='Stage IIIB'~'Stage III',

)

ggviolin(clinical_df,x='gender',y='riskScore',fill = 'gender',

add = 'boxplot',add.params = list(fill = 'white',width=0.1,linetype=1),)

stat_compare_means(method = 'wilcox.test',label = 'p.format',aes(label = 'p.format'),

label.x.npc = 'center',label.y = 20,size=5)

ggviolin(clinical_df,x='stage_event_pathologic_stage',y='riskScore',fill = 'stage_event_pathologic_stage',

add = 'boxplot',add.params = list(fill = 'white',width=0.1,linetype=1),)

stat_compare_means(method = 'wilcox.test',label = 'p.signif',aes(label = 'p.format'),

label.x = 1,label.y = 20,size=5)

Step 6. riskScore和临床信息的相关性

# 森林图

# 批量单因素回归

df_forest <- clinical_df

df_forest$gender <- as.numeric(as.factor(df_forest$gender))

df_forest$stage_event_pathologic_stage <- as.numeric(as.factor(df_forest$stage_event_pathologic_stage))

gene <- c()

p_value <- c()

HR <- c()

lower95 <- c()

upper95 <- c()

means <- c()

for (i in c('riskScore','gender','stage_event_pathologic_stage')){

res <- coxph(Surv(OS.time,OS)~df_forest[,i],data=df_forest)

sum_res <- summary(res)

p <- sum_res$coefficients[,'Pr(>|z|)']

gene <- c(gene,i)

p_value <- c(p_value,p)

HR <- c(HR,sum_res$conf.int[,'exp(coef)'])

lower95 <- c(lower95,sum_res$conf.int[,'lower .95'])

upper95 <- c(upper95,sum_res$conf.int[,'upper .95'])

means <- c(means,res$means)

}

cox_df <- data.frame(row.names = gene,pvalue=p_value,HR=HR,lower.95=lower95,upper.95=upper95,means=means)

cox_df$pvalue <- round(cox_df$pvalue,3)

cox_df$HR <- round(cox_df$HR,3)

cox_df <- rownames_to_column(cox_df,var = 'Variable')

cox_df <- rbind(colnames(cox_df),cox_df)

forestplot(labeltext=as.matrix(cox_df)[,1:3],lower=as.numeric(cox_df[,4]),

upper=as.numeric(cox_df[,5]),mean=as.numeric(cox_df[,6]),

is.summary=c(T,F,F,F),

zero=0,lineheight=unit(0.5,'cm'),graphwidth=unit(35,'mm'),

colgap=unit(2,'mm'),lwd.zero=2,lwd.ci=3,

col=fpColors(box = '#7AC5CD',lines = 'black',zero = 'purple'),

lwd.xaxis=2,lty.ci='solid',ci.vertices.height=0.05,graph.pos=2,

xlim=c(0,2.5),box.size=0.5,

hrzl_lines=list('2'=gpar(lwd=3,columns=1:4,col='dark green'),

'5'=gpar(lwd=3,columns=1:4,col='dark green'))

)

# 多因素回归

res <- coxph(Surv(OS.time,OS)~riskScore gender stage_event_pathologic_stage,data=df_forest)

sum_res <- summary(res)

df <- data.frame(row.names = rownames(sum_res$conf.int),pvalue=sum_res$coefficients[,'Pr(>|z|)'],

HR=sum_res$conf.int[,'exp(coef)'],lower=sum_res$conf.int[,'lower .95'],

upper=sum_res$conf.int[,'upper .95'],means=res$means)

df$pvalue <- round(df$pvalue,3)

df$HR <- round(df$HR,3)

df <- rownames_to_column(df,var='Variable')

df <- rbind(colnames(df),df)

forestplot(labeltext=as.matrix(df)[,1:3],lower=as.numeric(df[,4]),

upper=as.numeric(df[,5]),mean=as.numeric(df[,6]),

is.summary=c(T,F,F,F),

zero=0,lineheight=unit(0.5,'cm'),graphwidth=unit(35,'mm'),

colgap=unit(2,'mm'),lwd.zero=2,lwd.ci=3,

col=fpColors(box = '#7AC5CD',lines = 'black',zero = 'purple'),

lwd.xaxis=2,lty.ci='solid',ci.vertices.height=0.05,graph.pos=2,

xlim=c(0,2.5),box.size=0.5,

hrzl_lines=list('2'=gpar(lwd=3,columns=1:4,col='dark green'),

'5'=gpar(lwd=3,columns=1:4,col='dark green'))

)

# GSEA

df_exp <- as.data.frame(t(exp))

df_exp <- rownames_to_column(df_exp,var = 'sample_ID')

df_exp$sample_ID <- substr(df_exp$sample_ID,1,12)

df_exp <- df_exp[!duplicated(df_exp$sample_ID),]

rownames(df_exp) <- df_exp$sample_ID

df_exp <- as.data.frame(t(df_exp[,-1]))

df_exp <- df_exp[,km_data$ID]

group <- as.factor(km_data$risk)

design <- model.matrix(~km_data$risk)

fit <- lmFit(df_exp,design)

fit <- eBayes(fit)

df_deg <- topTable(fit,coef = 2,number = Inf)

df_deg$change <- ifelse(df_deg$logFC>log2(2) & df_deg$adj.P.Val<0.05,'up',

ifelse(df_deg$logFC<log2(0.5) & df_deg$adj.P.Val<0.05,'down','nochange'))

df_deg <- df_deg[order(df_deg$logFC,decreasing = T),]

symbol_2_entrez <- bitr(rownames(df_deg),fromType = 'SYMBOL',toType = 'ENTREZID',OrgDb = org.Hs.eg.db)

symbol_2_entrez <- symbol_2_entrez[!duplicated(symbol_2_entrez$SYMBOL),]

rownames(symbol_2_entrez) <- symbol_2_entrez$SYMBOL

df_deg$entrezid <- symbol_2_entrez[rownames(df_deg),'ENTREZID']

df_deg <- df_deg[!duplicated(df_deg$entrezid),]

df_deg <- na.omit(df_deg)

a <- df_deg[(df_deg$change=='up'),'logFC']

names(a) <- df_deg[(df_deg$change=='up'),'entrezid']

res_a <- gseKEGG(a,'hsa')

b <- df_deg[(df_deg$change=='down'),'logFC']

names(b) <- df_deg[(df_deg$change=='down'),'entrezid']

res_b <- gseKEGG(b,'hsa')

gseaplot2(res_a,geneSetID = 1:5)

gseaplot2(res_b,geneSetID = 1:5)

先做差异分析，找到high risk vs low risk的差异基因，上调和下调的gene分别拿去做GSEA再画图

# waterfall plot

query <- GDCquery(project = 'TCGA-LUAD',

data.category = "Simple Nucleotide Variation",

data.type = "Masked Somatic Mutation",

access = 'open')

GDCdownload(query)

GDCprepare(query,save = T,save.filename = 'TCGA-LUAD-SNP.Rdata')

load('TCGA-LUAD-SNP.Rdata')

luad_snp <- read.maf(maf=data)

plotmafSummary(luad_snp,rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

oncoplot(luad_snp,top = 15)

Step 7. 高低风险组间的免疫浸润

library(GSVA)

library(GSEABase)

library(ggplot2)

library(tidyr)

rm(list = ls())

#1. 获取geneSets

# 下载28种免疫细胞的参考基因集 </TISIDB/data/download/CellReports.txt>

geneSet <- read.csv("cellReport.txt",header = F,sep = "\t",)

geneSet[1:5,1:5]

geneSet <- t(tibble::column_to_rownames(geneSet,var = 'V1'))

gene_list <- list()

for (i in colnames(geneSet)){

gene_list[[i]] <- geneSet[,i]

gene_list[[i]] <-gene_list[[i]][gene_list[[i]]!='']

}

直接读取的geneSet长这样，有空值，需要转变成gene_list，list长度为28，其中每个元素是某个细胞所有基因的向量，名字是对应的细胞

# 载入表达矩阵,把表达矩阵分成high risk 和low risk组

load('TCGA_LUAD_TPM_exp.Rdata')

load('risk.Rdata')

# 对表达矩阵取log2

exp <- log2(exp 0.00001)

exp <- as.data.frame(exp)

colnames(exp) <- substr(colnames(exp),1,12)

exp <- exp[,!duplicated(colnames(exp))]

high_risk_sample <- km_data[km_data$risk=='high','ID']

low_risk_sample <- km_data[km_data$risk=='low','ID']

high_risk_exp <- exp[,high_risk_sample]

low_risk_exp <- exp[,low_risk_sample]

# ssGSEA

high_risk_gsea <- gsva(high_risk_exp,gene_list,method='ssgsea',kcdf='Gaussian',abs.ranking=T)

low_risk_gsea <- gsva(low_risk_exp,gene_list,method='ssgsea',kcdf='Gaussian',abs.ranking=T)

要注意gsva函数的输入得是matrix，不能是data.frame，不然会报错

hr_gsea_norm <- high_risk_gsea

for (i in colnames(hr_gsea_norm)) {

diff <- max(hr_gsea_norm[,i] )-min(hr_gsea_norm[,i] )

hr_gsea_norm[,i] <- (hr_gsea_norm[,i] -min(hr_gsea_norm[,i]))/diff

}

hr_gsea_norm <- as.data.frame(t(hr_gsea_norm))

hr_gsea_norm$type <- rep('high',times=nrow(hr_gsea_norm))

lr_gsea_norm <- low_risk_gsea

for (i in colnames(lr_gsea_norm)) {

diff <- max(lr_gsea_norm[,i] )-min(lr_gsea_norm[,i])

lr_gsea_norm[,i] <- (lr_gsea_norm[,i] -min(lr_gsea_norm[,i]))/diff

}

lr_gsea_norm <- as.data.frame(t(lr_gsea_norm))

lr_gsea_norm$type <- rep('low',times=nrow(lr_gsea_norm))

gsea_norm <- rbind(hr_gsea_norm,lr_gsea_norm)

longer <- pivot_longer(gsea_norm,cols = !type,names_to = 'cell_type',values_to = 'Score')

ggplot(longer,aes(x=cell_type,y=Score,color=type)) geom_boxplot()

theme(axis.text.x = element_text(angle = 45,hjust = 1))

stat_compare_means(label = "p.signif",size=3,method = 'wilcox.test')

先把high risk gsea和low risk gsea的数据分别进行min max scale，新增一列标记上high和low，然后合并起来，行是样本，列是细胞。再把透视表转成长表，就可以画图了

# 免疫检查点&TBM

risk_exp <- as.data.frame(t(exp))

risk_exp <- risk_exp[km_data$ID,]

risk_exp$type <- km_data$risk

risk_exp <- risk_exp[,c('CD274','PDCD1','CTLA4','LAG3','TIGIT','type')]

risk_long <- pivot_longer(risk_exp,cols = !type,names_to = 'ICIs',values_to = 'Expression')

ggplot(risk_long,aes(x=ICIs,y=Expression,fill=type)) geom_boxplot()

stat_compare_means(label = "p.format",size=3,method = 'wilcox.test')

文末友情宣传

强烈建议你推荐给身边的博士后以及年轻生物学PI，多一点数据认知，让他们的科研上一个台阶：

生物信息学马拉松授课（买一得五），你的生物信息学入门课144线程640Gb内存服务器共享一年仍然是仅需800千呼万唤始出来的独享生物信息学云服务器

单细胞芯片 转录组测序的数据挖掘文章一比一复现

0000

评论列表

共(0)条

单细胞 芯片 转录组测序的数据挖掘文章一比一复现

Step 1.单细胞数据处理

1.1 QC

1.2 用harmony去除批次效应

1.3 细胞注释

1.4 鉴定关键细胞类型

1.5 单细胞通路富集

1.6 拟时序分析

1.7 细胞通讯

Step 2. TCGA差异分析

Step 3. WGCNA分析

Step 4. 非负矩阵分解算法

4.1 NMF算法对样本分类

4.2 生存分析

4.3 免疫浸润

4.4 批量单因素Cox回归

Step 5. 预后模型构建和验证

Step 6. riskScore和临床信息的相关性

Step 7. 高低风险组间的免疫浸润

文末友情宣传

单细胞芯片转录组测序的数据挖掘文章一比一复现