bioconductor - How to extract the significant genes from a GO analysis with topGO in R?

How can I retrieve the significant genes mentionned in the results of a GO enrichment analysis in R? I run the following with the topGO package:

library("topGO")  
#list with transcripts ID and one or more associated GO terms    
contig2GO=readMappings(file="C:/Users/Documents/transcripts_topGO.txt")
str(head(contig2GO))
names_contig=names(contig2GO)
        
my_list <- factor(as.integer(names_contig %in% list_of_interest$id))
names(my_list)=names_contig
GOdata_BP = new("topGOdata",ontology="BP",allGenes = my_list,annot=annFUN.gene2GO,gene2GO=contig2GO)
> GOdata_BP

------------------------- topGOdata object -------------------------
    
Description:
   -   
    
 Ontology:
   -  BP 
    
 12532 available genes (all genes from the array):
   - symbol:  tr1 tr2 tr3  ...
   - 73  significant genes. 
    
 5255 feasible genes (genes that can be used in the analysis):
   - symbol:  tr1 tr2 tr3  ...
   - 38  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 2340 
   - number of edges = 4702 
    
------------------------- topGOdata object -------------------------

resultsFisher_BP=runTest(GOdata_BP,algorithm = "classic",statistic = "fisher")
> print(resultsFisher_BP)

Description:  
Ontology: BP 
'classic' algorithm with the 'fisher' test
2340 GO terms scored: 35 terms with p < 0.01
Annotation data:
    Annotated genes: 5255 
    Significant genes: 38 
    Min. no. of genes annotated to a GO: 1 
    Nontrivial nodes: 228

topGO_results_BP <- GenTable(GOdata_BP, classicFisher = resultsFisher_BP, topNodes = 100)

So I have 35 GO terms with p < 0.01. Now I want to find which are the genes in my_list <- factor(as.integer(names_contig %in% list_of_interest$id)) that are associated with those GO terms. Is there a direct way to get them?

For now, I am cross-comparing with this:

topGO_results_BP$classicFisher <- as.numeric(topGO_results_BP$classicFisher)
topGO_results_BP$classicFisher <- round(topGO_results_BP$classicFisher,10)
significant_GO_BP <- topGO_results_BP %>% filter(classicFisher < 0.01)

list_of_interest_filt <- list_of_interest %>% filter(!is.na(GO)) %>% mutate(GO_list = str_split(GO, ",")) %>% filter(
                                  sapply(GO_list, function(go_terms) any(go_terms %in% significant_GO_BP$GO.ID))) %>% dplyr::select(-GO_list)

Which is quite indirect. I wonder if there is something with genesInTerm to do the job better?

I understand I provide the code and not a reproducible example. transcripts_topGO is like:

> set.seed(42)
> transcripts <- paste0("tr", 1:10)
> go_terms <- c("GO:0005515", "GO:0007616", "GO:0016021", "GO:0043226", "GO:46872")
> go_column <- sapply(1:10, function(x) paste(sample(go_terms, sample(1:4, 1), replace = TRUE), collapse = ","))
> transcripts_topGO <- data.frame(Sample = samples, GO = go_column, stringsAsFactors = FALSE)
> transcripts_topGO
   transcripts                                   GO
    1     tr1                              GO:46872
    2     tr2                            GO:0005515
    3     tr3                 GO:0043226,GO:0007616
    4     tr4                 GO:0005515,GO:0043226
    5     tr5                              GO:46872
    6     tr6                 GO:0043226,GO:0007616
    7     tr7                 GO:0016021,GO:0005515
    8     tr8                            GO:0016021
    9     tr9 GO:46872,GO:46872,GO:46872,GO:0043226
    10   tr10                 GO:0043226,GO:0016021

And list_of_interest is a larger dataframe but that contains a subset of transcripts_topGO, with transcripts names and associated GO as well.

How can I retrieve the significant genes mentionned in the results of a GO enrichment analysis in R? I run the following with the topGO package:

library("topGO")  
#list with transcripts ID and one or more associated GO terms    
contig2GO=readMappings(file="C:/Users/Documents/transcripts_topGO.txt")
str(head(contig2GO))
names_contig=names(contig2GO)
        
my_list <- factor(as.integer(names_contig %in% list_of_interest$id))
names(my_list)=names_contig
GOdata_BP = new("topGOdata",ontology="BP",allGenes = my_list,annot=annFUN.gene2GO,gene2GO=contig2GO)
> GOdata_BP

------------------------- topGOdata object -------------------------
    
Description:
   -   
    
 Ontology:
   -  BP 
    
 12532 available genes (all genes from the array):
   - symbol:  tr1 tr2 tr3  ...
   - 73  significant genes. 
    
 5255 feasible genes (genes that can be used in the analysis):
   - symbol:  tr1 tr2 tr3  ...
   - 38  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 2340 
   - number of edges = 4702 
    
------------------------- topGOdata object -------------------------

resultsFisher_BP=runTest(GOdata_BP,algorithm = "classic",statistic = "fisher")
> print(resultsFisher_BP)

Description:  
Ontology: BP 
'classic' algorithm with the 'fisher' test
2340 GO terms scored: 35 terms with p < 0.01
Annotation data:
    Annotated genes: 5255 
    Significant genes: 38 
    Min. no. of genes annotated to a GO: 1 
    Nontrivial nodes: 228

topGO_results_BP <- GenTable(GOdata_BP, classicFisher = resultsFisher_BP, topNodes = 100)

For now, I am cross-comparing with this:

topGO_results_BP$classicFisher <- as.numeric(topGO_results_BP$classicFisher)
topGO_results_BP$classicFisher <- round(topGO_results_BP$classicFisher,10)
significant_GO_BP <- topGO_results_BP %>% filter(classicFisher < 0.01)

list_of_interest_filt <- list_of_interest %>% filter(!is.na(GO)) %>% mutate(GO_list = str_split(GO, ",")) %>% filter(
                                  sapply(GO_list, function(go_terms) any(go_terms %in% significant_GO_BP$GO.ID))) %>% dplyr::select(-GO_list)

Which is quite indirect. I wonder if there is something with genesInTerm to do the job better?

I understand I provide the code and not a reproducible example. transcripts_topGO is like:

> set.seed(42)
> transcripts <- paste0("tr", 1:10)
> go_terms <- c("GO:0005515", "GO:0007616", "GO:0016021", "GO:0043226", "GO:46872")
> go_column <- sapply(1:10, function(x) paste(sample(go_terms, sample(1:4, 1), replace = TRUE), collapse = ","))
> transcripts_topGO <- data.frame(Sample = samples, GO = go_column, stringsAsFactors = FALSE)
> transcripts_topGO
   transcripts                                   GO
    1     tr1                              GO:46872
    2     tr2                            GO:0005515
    3     tr3                 GO:0043226,GO:0007616
    4     tr4                 GO:0005515,GO:0043226
    5     tr5                              GO:46872
    6     tr6                 GO:0043226,GO:0007616
    7     tr7                 GO:0016021,GO:0005515
    8     tr8                            GO:0016021
    9     tr9 GO:46872,GO:46872,GO:46872,GO:0043226
    10   tr10                 GO:0043226,GO:0016021

And list_of_interest is a larger dataframe but that contains a subset of transcripts_topGO, with transcripts names and associated GO as well.

Share Improve this question asked Mar 12 at 16:38 Mata 6735 silver badges19 bronze badges

Add a comment |

1 Answer 1

Sorted by: Reset to default 1

Yes, there is a more direct way to extract the significant genes associated with the enriched GO terms using genesInTerm(). This function allows you to retrieve the genes mapped to specific GO terms in your topGOdata object.

Try:

# Which are the significant GO terms?
significantGO <- function(result_test, alpha = 0.01) {
  if (!inherits(result_test, "topGOresult")) {
    stop("result_test must be a topGOresult object.")
  }
  
  significant_terms <- names(score(result_test)[score(result_test) < alpha])
  
  if (length(significant_terms) == 0) {
    message("No significant GO terms found at alpha = ", alpha)
  }
  
  return(significant_terms)
}

significant_GO_terms <- significantGO(resultsFisher_BP, alpha = 0.01) 

# Which genes are associated with these GO terms?
significant_genes_list <- genesInTerm(GOdata_BP, whichGO = significant_GO_terms)

# Format nicely to Gene - GO Term pairs
significant_genes_df <- stack(significant_genes_list)
colnames(significant_genes_df) <- c("Gene", "GO.ID")

# Keep only the significant genes
significant_genes_df <- significant_genes_df %>% filter(Gene %in% names(my_list[my_list == 1]))


# This is the result
print(significant_genes_df)

Now, significant_genes_df contains the list of genes that are significant in your enrichment analysis and the GO terms they are associated with.

To be honest, I am not a big fan of GO terms. They are too simple. They don't tell us usually anything which we don't know already.

发布者：admin，转转请注明出处：http://www.yc00.com/questions/1744739700a4590954.html

bioconductor - How to extract the significant genes from a GO analysis with topGO in R? - Stack Overflow

1 Answer 1

发表回复

评论列表（0条）

联系我们

400-800-8888

bioconductor - How to extract the significant genes from a GO analysis with topGO in R? - Stack Overflow

1 Answer 1

相关推荐