How can I retrieve the significant genes mentionned in the results of a GO enrichment analysis in R? I run the following with the topGO package:
library("topGO")
#list with transcripts ID and one or more associated GO terms
contig2GO=readMappings(file="C:/Users/Documents/transcripts_topGO.txt")
str(head(contig2GO))
names_contig=names(contig2GO)
my_list <- factor(as.integer(names_contig %in% list_of_interest$id))
names(my_list)=names_contig
GOdata_BP = new("topGOdata",ontology="BP",allGenes = my_list,annot=annFUN.gene2GO,gene2GO=contig2GO)
> GOdata_BP
------------------------- topGOdata object -------------------------
Description:
-
Ontology:
- BP
12532 available genes (all genes from the array):
- symbol: tr1 tr2 tr3 ...
- 73 significant genes.
5255 feasible genes (genes that can be used in the analysis):
- symbol: tr1 tr2 tr3 ...
- 38 significant genes.
GO graph (nodes with at least 1 genes):
- a graph with directed edges
- number of nodes = 2340
- number of edges = 4702
------------------------- topGOdata object -------------------------
resultsFisher_BP=runTest(GOdata_BP,algorithm = "classic",statistic = "fisher")
> print(resultsFisher_BP)
Description:
Ontology: BP
'classic' algorithm with the 'fisher' test
2340 GO terms scored: 35 terms with p < 0.01
Annotation data:
Annotated genes: 5255
Significant genes: 38
Min. no. of genes annotated to a GO: 1
Nontrivial nodes: 228
topGO_results_BP <- GenTable(GOdata_BP, classicFisher = resultsFisher_BP, topNodes = 100)
So I have 35 GO terms with p < 0.01. Now I want to find which are the genes in my_list <- factor(as.integer(names_contig %in% list_of_interest$id))
that are associated with those GO terms.
Is there a direct way to get them?
For now, I am cross-comparing with this:
topGO_results_BP$classicFisher <- as.numeric(topGO_results_BP$classicFisher)
topGO_results_BP$classicFisher <- round(topGO_results_BP$classicFisher,10)
significant_GO_BP <- topGO_results_BP %>% filter(classicFisher < 0.01)
list_of_interest_filt <- list_of_interest %>% filter(!is.na(GO)) %>% mutate(GO_list = str_split(GO, ",")) %>% filter(
sapply(GO_list, function(go_terms) any(go_terms %in% significant_GO_BP$GO.ID))) %>% dplyr::select(-GO_list)
Which is quite indirect. I wonder if there is something with genesInTerm
to do the job better?
I understand I provide the code and not a reproducible example. transcripts_topGO is like:
> set.seed(42)
> transcripts <- paste0("tr", 1:10)
> go_terms <- c("GO:0005515", "GO:0007616", "GO:0016021", "GO:0043226", "GO:46872")
> go_column <- sapply(1:10, function(x) paste(sample(go_terms, sample(1:4, 1), replace = TRUE), collapse = ","))
> transcripts_topGO <- data.frame(Sample = samples, GO = go_column, stringsAsFactors = FALSE)
> transcripts_topGO
transcripts GO
1 tr1 GO:46872
2 tr2 GO:0005515
3 tr3 GO:0043226,GO:0007616
4 tr4 GO:0005515,GO:0043226
5 tr5 GO:46872
6 tr6 GO:0043226,GO:0007616
7 tr7 GO:0016021,GO:0005515
8 tr8 GO:0016021
9 tr9 GO:46872,GO:46872,GO:46872,GO:0043226
10 tr10 GO:0043226,GO:0016021
And list_of_interest is a larger dataframe but that contains a subset of transcripts_topGO, with transcripts names and associated GO as well.
How can I retrieve the significant genes mentionned in the results of a GO enrichment analysis in R? I run the following with the topGO package:
library("topGO")
#list with transcripts ID and one or more associated GO terms
contig2GO=readMappings(file="C:/Users/Documents/transcripts_topGO.txt")
str(head(contig2GO))
names_contig=names(contig2GO)
my_list <- factor(as.integer(names_contig %in% list_of_interest$id))
names(my_list)=names_contig
GOdata_BP = new("topGOdata",ontology="BP",allGenes = my_list,annot=annFUN.gene2GO,gene2GO=contig2GO)
> GOdata_BP
------------------------- topGOdata object -------------------------
Description:
-
Ontology:
- BP
12532 available genes (all genes from the array):
- symbol: tr1 tr2 tr3 ...
- 73 significant genes.
5255 feasible genes (genes that can be used in the analysis):
- symbol: tr1 tr2 tr3 ...
- 38 significant genes.
GO graph (nodes with at least 1 genes):
- a graph with directed edges
- number of nodes = 2340
- number of edges = 4702
------------------------- topGOdata object -------------------------
resultsFisher_BP=runTest(GOdata_BP,algorithm = "classic",statistic = "fisher")
> print(resultsFisher_BP)
Description:
Ontology: BP
'classic' algorithm with the 'fisher' test
2340 GO terms scored: 35 terms with p < 0.01
Annotation data:
Annotated genes: 5255
Significant genes: 38
Min. no. of genes annotated to a GO: 1
Nontrivial nodes: 228
topGO_results_BP <- GenTable(GOdata_BP, classicFisher = resultsFisher_BP, topNodes = 100)
So I have 35 GO terms with p < 0.01. Now I want to find which are the genes in my_list <- factor(as.integer(names_contig %in% list_of_interest$id))
that are associated with those GO terms.
Is there a direct way to get them?
For now, I am cross-comparing with this:
topGO_results_BP$classicFisher <- as.numeric(topGO_results_BP$classicFisher)
topGO_results_BP$classicFisher <- round(topGO_results_BP$classicFisher,10)
significant_GO_BP <- topGO_results_BP %>% filter(classicFisher < 0.01)
list_of_interest_filt <- list_of_interest %>% filter(!is.na(GO)) %>% mutate(GO_list = str_split(GO, ",")) %>% filter(
sapply(GO_list, function(go_terms) any(go_terms %in% significant_GO_BP$GO.ID))) %>% dplyr::select(-GO_list)
Which is quite indirect. I wonder if there is something with genesInTerm
to do the job better?
I understand I provide the code and not a reproducible example. transcripts_topGO is like:
> set.seed(42)
> transcripts <- paste0("tr", 1:10)
> go_terms <- c("GO:0005515", "GO:0007616", "GO:0016021", "GO:0043226", "GO:46872")
> go_column <- sapply(1:10, function(x) paste(sample(go_terms, sample(1:4, 1), replace = TRUE), collapse = ","))
> transcripts_topGO <- data.frame(Sample = samples, GO = go_column, stringsAsFactors = FALSE)
> transcripts_topGO
transcripts GO
1 tr1 GO:46872
2 tr2 GO:0005515
3 tr3 GO:0043226,GO:0007616
4 tr4 GO:0005515,GO:0043226
5 tr5 GO:46872
6 tr6 GO:0043226,GO:0007616
7 tr7 GO:0016021,GO:0005515
8 tr8 GO:0016021
9 tr9 GO:46872,GO:46872,GO:46872,GO:0043226
10 tr10 GO:0043226,GO:0016021
And list_of_interest is a larger dataframe but that contains a subset of transcripts_topGO, with transcripts names and associated GO as well.
Share Improve this question asked Mar 12 at 16:38 MataMata 6735 silver badges19 bronze badges1 Answer
Reset to default 1Yes, there is a more direct way to extract the significant genes associated with the enriched GO terms using genesInTerm(). This function allows you to retrieve the genes mapped to specific GO terms in your topGOdata object.
Try:
# Which are the significant GO terms?
significantGO <- function(result_test, alpha = 0.01) {
if (!inherits(result_test, "topGOresult")) {
stop("result_test must be a topGOresult object.")
}
significant_terms <- names(score(result_test)[score(result_test) < alpha])
if (length(significant_terms) == 0) {
message("No significant GO terms found at alpha = ", alpha)
}
return(significant_terms)
}
significant_GO_terms <- significantGO(resultsFisher_BP, alpha = 0.01)
# Which genes are associated with these GO terms?
significant_genes_list <- genesInTerm(GOdata_BP, whichGO = significant_GO_terms)
# Format nicely to Gene - GO Term pairs
significant_genes_df <- stack(significant_genes_list)
colnames(significant_genes_df) <- c("Gene", "GO.ID")
# Keep only the significant genes
significant_genes_df <- significant_genes_df %>% filter(Gene %in% names(my_list[my_list == 1]))
# This is the result
print(significant_genes_df)
Now, significant_genes_df contains the list of genes that are significant in your enrichment analysis and the GO terms they are associated with.
To be honest, I am not a big fan of GO terms. They are too simple. They don't tell us usually anything which we don't know already.
发布者:admin,转转请注明出处:http://www.yc00.com/questions/1744739700a4590954.html
评论列表(0条)