This Data is a bigger data set with 96 micro arrays files for analysis. The experiment conditions are also different. - There is no WT strain - Plus serves as a within strain control - There are unequal number of samples per group
‘Array quality metrics’ package provides us with a repeatable qc metrics. It looks at the quality of our files in terms of bias from experiment and confounding variables.
The output is saved as an HTML file in the working directory.
# arrayQualityMetrics(expressionset = rawData,
# outdir = "/Users/hp/R projects/LupusProject",
# force = TRUE, do.logtransform = TRUE,
# intgroup = c("Strain", "Tissue", "Weeks"))
Once we’ve loaded up the data we can look at the distribution of the expressions across all samples using a box plot. The arrayqualitymetrics package provides us with insight into what celfiles are not within the range of the majority.
boxplot(rawData,target="core") # choose the level of summarization: "core"
After normalizing the data set using RMA algorithm we can see that the expressions follow a more normal distribution.
boxplot(norm.data, target= "core")
We will annotate the probesets using the ‘mogene11sttranscriptcluster’ database. This database was created using the probset information from affymetirix and can be downloaded from Bioconductor. We will annotate the data with the Gene Symbol, Name, Entrez ID, Gene Ontology & Description. And save those with the featuredata.
We choose not to remove probes with no annotation at this stage. However, in further analysis we will remove them. If you wanted to remove them now the chunck below does so.
There is a considerable amount of batch effect. But it is hard to determine if this is from experiment error or if it is from the difference in tissues. i.e, it could be batch effect due to biological difference.
plotMDS(norm.data) #checking batch effect
In the pca from the array quality metrics we noticed that there was significant groping of the celfiles based on pc1 and pc2 we will replicate the PCA plot and compare it with the normalized data.
For the normalized Data:
## index filename Strain Tissue Subject Weeks
## 1_MoGene.CEL 1 1_MoGene.CEL plus Amygdala 537 24
## 101_MoGene.CEL 2 101_MoGene.CEL lpr Kidney 1127 12
## 102_MoGene.CEL 3 102_MoGene.CEL lpr Kidney 724 12
## 103_MoGene.CEL 4 103_MoGene.CEL lpr Kidney 738 12
## 104_MoGene.CEL 5 104_MoGene.CEL lpr Kidney 874 6
## 105_MoGene.CEL 6 105_MoGene.CEL lpr Kidney 992 6
## 106_MoGene.CEL 7 106_MoGene.CEL lpr Kidney 993 6
## 107_MoGene.CEL 8 107_MoGene.CEL lpr Kidney 994 6
## 108_MoGene.CEL 9 108_MoGene.CEL lpr2 Kidney 1104 18
## 109_MoGene.CEL 10 109_MoGene.CEL lpr2 Kidney 1018 18
## 11_MoGene.CEL 11 11_MoGene.CEL lpr Amygdala 439 24
## 110_MoGene.CEL 12 110_MoGene.CEL lpr2 Kidney 1003 12
## 112_MoGene.CEL 13 112_MoGene.CEL lpr2 Kidney 1005 12
## 113_MoGene.CEL 14 113_MoGene.CEL lpr2 Kidney 1137 6
## 114_MoGene.CEL 15 114_MoGene.CEL lpr2 Kidney 1181 6
## 115_MoGene.CEL 16 115_MoGene.CEL lpr2 Kidney 1136 6
## 116_MoGene.CEL 17 116_MoGene.CEL plus Kidney 869 24
## 117_MoGene.CEL 18 117_MoGene.CEL plus Kidney 752 12
## 118_MoGene.CEL 19 118_MoGene.CEL plus Kidney 751 12
## 119_MoGene.CEL 20 119_MoGene.CEL plus Kidney 1164 6
## 12_MoGene.CEL 21 12_MoGene.CEL lpr Amygdala 472 24
## 120_MoGene.CEL 22 120_MoGene.CEL plus Kidney 1165 6
## 125_MoGene.CEL 23 125_MoGene.CEL lpr Amygdala 1118 18
## 129_MoGene.CEL 24 129_MoGene.CEL lpr Amygdala 1119 18
## 13_MoGene.CEL 25 13_MoGene.CEL lpr Amygdala 473 24
## 132_MoGene.CEL 26 132_MoGene.CEL lpr Amygdala 700 12
## 134_MoGene.CEL 27 134_MoGene.CEL lpr Amygdala 699 12
## 136_MoGene.CEL 28 136_MoGene.CEL lpr2 Amygdala 622 24
## 137_MoGene.CEL 29 137_MoGene.CEL lpr2 Amygdala 796 24
## 14_MoGene.CEL 30 14_MoGene.CEL lpr Amygdala 551 24
## 142_MoGene.CEL 31 142_MoGene.CEL lpr2 Amygdala 1138 6
## 146_MoGene.CEL 32 146_MoGene.CEL plus Amygdala 604 18
## 147_MoGene.CEL 33 147_MoGene.CEL plus Amygdala 592 18
## 15_MoGene.CEL 34 15_MoGene.CEL lpr Amygdala 764 24
## 155_MoGene.CEL 35 155_MoGene.CEL lpr Hippocampus 1118 18
## 156_MoGene.CEL 36 156_MoGene.CEL lpr Hippocampus 700 12
## 158_MoGene.CEL 37 158_MoGene.CEL lpr Hippocampus 701 12
## 159_MoGene.CEL 38 159_MoGene.CEL lpr Hippocampus 724 12
## 16_MoGene.CEL 39 16_MoGene.CEL lpr Amygdala 1024 24
## 162_MoGene.CEL 40 162_MoGene.CEL lpr Hippocampus 849 6
## 163_MoGene.CEL 41 163_MoGene.CEL lpr Hippocampus 981 6
## 165_MoGene.CEL 42 165_MoGene.CEL lpr2 Hippocampus 794 24
## 166_MoGene.CEL 43 166_MoGene.CEL lpr2 Hippocampus 622 24
## 167_MoGene.CEL 44 167_MoGene.CEL lpr2 Hippocampus 834 18
## 168_MoGene.CEL 45 168_MoGene.CEL lpr2 Hippocampus 863 18
## 17_MoGene.CEL 46 17_MoGene.CEL lpr Amygdala 1025 24
## 174_MoGene.CEL 47 174_MoGene.CEL lpr Kidney 997 18
## 19_MoGene.CEL 48 19_MoGene.CEL lpr Amygdala 844 18
## 2_MoGene.CEL 49 2_MoGene.CEL plus Amygdala 538 24
## 20_MoGene.CEL 50 20_MoGene.CEL lpr Amygdala 997 18
## 21_MoGene.CEL 51 21_MoGene.CEL lpr Amygdala 998 18
## 22_MoGene.CEL 52 22_MoGene.CEL lpr Amygdala 1110 18
## 23_MoGene.CEL 53 23_MoGene.CEL lpr Amygdala 1117 18
## 24_MoGene.CEL 54 24_MoGene.CEL lpr Amygdala 693 12
## 25_MoGene.CEL 55 25_MoGene.CEL lpr Amygdala 694 12
## 26_MoGene.CEL 56 26_MoGene.CEL lpr Amygdala 728 12
## 27_MoGene.CEL 57 27_MoGene.CEL lpr Amygdala 738 12
## 29_MoGene.CEL 58 29_MoGene.CEL lpr Amygdala 1128 12
## 3_MoGene.CEL 59 3_MoGene.CEL plus Amygdala 544 24
## 30_MoGene.CEL 60 30_MoGene.CEL lpr Amygdala 849 6
## 31_MoGene.CEL 61 31_MoGene.CEL lpr Amygdala 844 18
## 32_MoGene.CEL 62 32_MoGene.CEL lpr Amygdala 998 18
## 33_MoGene.CEL 63 33_MoGene.CEL lpr Amygdala 1110 18
## 34_MoGene.CEL 64 34_MoGene.CEL lpr Amygdala 994 6
## 35_MoGene.CEL 65 35_MoGene.CEL lpr Hippocampus 809 18
## 36_MoGene.CEL 66 36_MoGene.CEL lpr2 Amygdala 1137 6
## 38_MoGene.CEL 67 38_MoGene.CEL lpr2 Amygdala 1083 6
## 45_MoGene.CEL 68 45_MoGene.CEL lpr Hippocampus 1110 18
## 46_MoGene.CEL 69 46_MoGene.CEL lpr Hippocampus 600 24
## 47_MoGene.CEL 70 47_MoGene.CEL lpr Hippocampus 764 24
## 5_MoGene.CEL 71 5_MoGene.CEL plus Amygdala 582 18
## 50_MoGene.CEL 72 50_MoGene.CEL lpr Hippocampus 844 18
## 51_MoGene.CEL 73 51_MoGene.CEL lpr Amygdala 738 12
## 53_MoGene.CEL 74 53_MoGene.CEL lpr Hippocampus 1111 18
## 54_MoGene.CEL 75 54_MoGene.CEL lpr Hippocampus 694 12
## 57_MoGene.CEL 76 57_MoGene.CEL lpr Hippocampus 1127 12
## 58_MoGene.CEL 77 58_MoGene.CEL lpr Hippocampus 1128 12
## 59_MoGene.CEL 78 59_MoGene.CEL lpr Hippocampus 874 6
## 6_MoGene.CEL 79 6_MoGene.CEL plus Amygdala 590 18
## 60_MoGene.CEL 80 60_MoGene.CEL lpr Hippocampus 992 6
## 61_MoGene.CEL 81 61_MoGene.CEL lpr Hippocampus 993 6
## 62_MoGene.CEL 82 62_MoGene.CEL lpr Hippocampus 994 6
## 63_MoGene.CEL 83 63_MoGene.CEL lpr Amygdala 1128 12
## 64_MoGene.CEL 84 64_MoGene.CEL lpr2 Hippocampus 1005 12
## 65_MoGene.CEL 85 65_MoGene.CEL lpr2 Hippocampus 1186 12
## 66_MoGene.CEL 86 66_MoGene.CEL lpr2 Hippocampus 1136 6
## 67_MoGene.CEL 87 67_MoGene.CEL lpr Amygdala 849 6
## 68_MoGene.CEL 88 68_MoGene.CEL lpr2 Hippocampus 1181 6
## 7_MoGene.CEL 89 7_MoGene.CEL plus Amygdala 752 12
## 8_MoGene.CEL 90 8_MoGene.CEL lpr Amygdala 809 18
## 84_MoGene.CEL 91 84_MoGene.CEL lpr Hippocampus 598 24
## 86_MoGene.CEL 92 86_MoGene.CEL lpr Hippocampus 998 18
## 9_MoGene.CEL 93 9_MoGene.CEL plus Amygdala 1173 6
## 97_MoGene.CEL 94 97_MoGene.CEL lpr Kidney 1024 24
## 98_MoGene.CEL 95 98_MoGene.CEL lpr Kidney 1117 18
## 99_MoGene.CEL 96 99_MoGene.CEL lpr Kidney 843 18
LIMMA (Linear Models for Microarray Analysis) is a statistical method for identifying differentially expressed genes in microarray experiments. It uses linear models to analyze the intensity values of the genes measured in the microarray, taking into account factors such as the experimental design and the inherent variability of the data. The result of a LIMMA analysis is a list of genes that are significantly differentially expressed between the experimental conditions being compared.
I’m renaming the columns because, make.contrasts throws an error when using ‘:’ and ‘.’.
Remove the samples that do not have a value. therefore, the design becomes a full rank.
design<-design[,-c(12,20,24,30)]
In this code, lmFit is used to fit a linear model to the data in norm.data using the design matrix design. This model is then passed to the eBayes function, which applies empirical Bayes moderation to the standard errors of the model.
fit1 <- lmFit(norm.data, design)
fit1 <- eBayes(fit1, trend=TRUE, robust=TRUE) # the compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a global value.
results <- decideTests(fit1)
summary(results)
## lprAmygdalaW6 lprHippocampusW6 lprKidneyW6 lpr2AmygdalaW6
## Down 0 0 0 0
## NotSig 0 0 0 0
## Up 35556 35556 35556 35556
## lpr2HippocampusW6 lpr2KidneyW6 plusAmygdalaW6 plusKidneyW6
## Down 0 0 0 0
## NotSig 0 0 11 0
## Up 35556 35556 35545 35556
## lprAmygdalaW12 lprHippocampusW12 lprKidneyW12 lpr2HippocampusW12
## Down 0 0 0 0
## NotSig 0 0 0 1
## Up 35556 35556 35556 35555
## lpr2KidneyW12 plusAmygdalaW12 plusKidneyW12 lprAmygdalaW18
## Down 0 0 0 0
## NotSig 0 27 1 0
## Up 35556 35529 35555 35556
## lprHippocampusW18 lprKidneyW18 lpr2HippocampusW18 lpr2KidneyW18
## Down 0 0 0 0
## NotSig 0 0 0 0
## Up 35556 35556 35556 35556
## plusAmygdalaW18 lprAmygdalaW24 lprHippocampusW24 lprKidneyW24
## Down 0 0 0 0
## NotSig 0 0 0 13
## Up 35556 35556 35556 35543
## lpr2AmygdalaW24 lpr2HippocampusW24 plusAmygdalaW24 plusKidneyW24
## Down 0 0 0 0
## NotSig 1 0 0 6
## Up 35555 35556 35556 35550
table(targets$Strain, targets$Tissue, targets$Weeks)
## , , = 6
##
##
## Amygdala Hippocampus Kidney
## lpr 3 6 4
## lpr2 3 2 3
## plus 1 0 2
##
## , , = 12
##
##
## Amygdala Hippocampus Kidney
## lpr 9 6 3
## lpr2 0 2 2
## plus 1 0 2
##
## , , = 18
##
##
## Amygdala Hippocampus Kidney
## lpr 11 6 3
## lpr2 0 2 2
## plus 4 0 0
##
## , , = 24
##
##
## Amygdala Hippocampus Kidney
## lpr 7 3 1
## lpr2 2 2 0
## plus 3 0 1
Based on the table above the possible comparisions are between 1 - Amygdala and Hippocampus for LPR vs LPR 2 throughout the weeks 6, 24 2 - LPR Amygdala at Weeks 6, 12, 18, 24 + Vs. Plus 3 - LPR2 Amygdala at Weeks 6, 24 + Vs. Plus 4 - LPR Hippocampus at Weeks 6, 12, 18, 24 5 - LPR2 Hippocampus at Weeks 6, 12, 18, 24 6 - Change in Plus Amygdala throughout the weeks 7 - Kidney LPR Vs LPR2 at Weeks 6, 12, 18 8 - Kidney LPR for the Week 6, 12, 18, 24 9 - Kidney LPR2 for the Week 6, 12, 18 10 - Kidney Plus for the Week 6, 12, 24
## lprAmygdalaW6 - lprHippocampusW6 lprAmygdalaW6 - lpr2AmygdalaW6
## Down 1526 4
## NotSig 32856 35538
## Up 1174 14
## lprHippocampusW6 - lpr2HippocampusW6 lpr2AmygdalaW6 - lpr2HippocampusW6
## Down 5 869
## NotSig 35547 34058
## Up 4 629
## lprAmygdalaW24 - lprHippocampusW24 lprAmygdalaW24 - lpr2AmygdalaW24
## Down 1249 12933
## NotSig 33303 11728
## Up 1004 10895
## lprHippocampusW24 - lpr2HippocampusW24
## Down 1
## NotSig 35546
## Up 9
## lpr2AmygdalaW24 - lpr2HippocampusW24
## Down 9835
## NotSig 14397
## Up 11324
## lprAmygdalaW6 - lprAmygdalaW12 lprAmygdalaW6 - lprAmygdalaW18
## Down 123 10
## NotSig 35344 35530
## Up 89 16
## lprAmygdalaW6 - lprAmygdalaW24 lprAmygdalaW6 - plusAmygdalaW6
## Down 14 15133
## NotSig 35496 8348
## Up 46 12075
## lprAmygdalaW12 - lprAmygdalaW18 lprAmygdalaW12 - lprAmygdalaW24
## Down 951 1513
## NotSig 34068 32733
## Up 537 1310
## lprAmygdalaW12 - plusAmygdalaW12 lprAmygdalaW18 - lprAmygdalaW24
## Down 1344 0
## NotSig 32936 35554
## Up 1276 2
## lprAmygdalaW18 - plusAmygdalaW18 lprAmygdalaW24 - plusAmygdalaW24
## Down 2645 4
## NotSig 30012 35530
## Up 2899 22
## lpr2AmygdalaW6 - lpr2AmygdalaW24 lpr2AmygdalaW6 - plusAmygdalaW6
## Down 12310 14987
## NotSig 12782 8293
## Up 10464 12276
## lprAmygdalaW24 - plusAmygdalaW24
## Down 4
## NotSig 35530
## Up 22
## lprHippocampusW6 - lprHippocampusW12
## Down 83
## NotSig 35471
## Up 2
## lprHippocampusW6 - lprHippocampusW18
## Down 10
## NotSig 35543
## Up 3
## lprHippocampusW6 - lprHippocampusW24
## Down 113
## NotSig 35422
## Up 21
## lprHippocampusW12 - lprHippocampusW18
## Down 0
## NotSig 35556
## Up 0
## lprHippocampusW12 - lprHippocampusW24
## Down 0
## NotSig 35555
## Up 1
## lprHippocampusW18 - lprHippocampusW24
## Down 0
## NotSig 35556
## Up 0
## lpr2HippocampusW6 - lpr2HippocampusW12
## Down 0
## NotSig 35556
## Up 0
## lpr2HippocampusW6 - lpr2HippocampusW18
## Down 6699
## NotSig 22200
## Up 6657
## lpr2HippocampusW6 - lpr2HippocampusW24
## Down 0
## NotSig 35556
## Up 0
## lpr2HippocampusW12 - lpr2HippocampusW18
## Down 5454
## NotSig 24332
## Up 5770
## lpr2HippocampusW12 - lpr2HippocampusW24
## Down 0
## NotSig 35556
## Up 0
## lpr2HippocampusW18 - lpr2HippocampusW24
## Down 5755
## NotSig 24112
## Up 5689
## plusAmygdalaW6 - plusAmygdalaW12 plusAmygdalaW6 - plusAmygdalaW18
## Down 11119 12043
## NotSig 11148 8742
## Up 13289 14771
## plusAmygdalaW6 - plusAmygdalaW24 plusAmygdalaW12 - plusAmygdalaW18
## Down 12205 9
## NotSig 8139 35531
## Up 15212 16
## plusAmygdalaW12 - plusAmygdalaW24 plusAmygdalaW18 - plusAmygdalaW24
## Down 366 149
## NotSig 34713 35157
## Up 477 250
## lprKidneyW6 - lpr2KidneyW6 lprKidneyW12 - lpr2KidneyW12
## Down 23 7
## NotSig 35517 35538
## Up 16 11
## lprKidneyW18 - lpr2KidneyW18
## Down 125
## NotSig 35194
## Up 237
## lprKidneyW6 - lprKidneyW12 lprKidneyW6 - lprKidneyW18
## Down 487 847
## NotSig 35009 34382
## Up 60 327
## lprKidneyW6 - lprKidneyW24 lprKidneyW12 - lprKidneyW18
## Down 637 11
## NotSig 34880 35536
## Up 39 9
## lprKidneyW12 - lprKidneyW24 lprKidneyW18 - lprKidneyW24
## Down 32 5
## NotSig 35524 35549
## Up 0 2
## lpr2KidneyW6 - lpr2KidneyW12 lpr2KidneyW6 - lpr2KidneyW18
## Down 59 751
## NotSig 35461 34081
## Up 36 724
## lpr2KidneyW12 - lpr2KidneyW18
## Down 41
## NotSig 35495
## Up 20
## plusKidneyW6 - plusKidneyW12 plusKidneyW6 - plusKidneyW24
## Down 7 74
## NotSig 35543 35455
## Up 6 27
## plusKidneyW12 - plusKidneyW24
## Down 14
## NotSig 35537
## Up 5
To extract the significantly expressed genes we create a for loop to go through each coefficient (comparison) for each contrast matrix. We then apply the topTable function to extract genes with Log Fold Change above 1.5.
Volcano plots plot the negative logarithm of the p-value on the y-axis and the logarithm of the fold change on the x-axis, with genes that are significantly differentially expressed appearing as points above a threshold line on the plot. This allows for quick visualization of which genes are most significantly differentially expressed, as well as their direction of expression change.
We use the cutoff LFC 1.5 and P-value < 0.1
## ERROR : non-numeric argument to mathematical function
## ERROR : non-numeric argument to mathematical function
## ERROR : non-numeric argument to mathematical function
## ERROR : non-numeric argument to mathematical function
## ERROR : non-numeric argument to mathematical function
## ERROR : non-numeric argument to mathematical function
Heatmaps are often used to visualize the expression levels of genes across a set of samples. This can be useful for identifying patterns and trends in the data, such as which genes are co-expressed or differentially expressed between groups of samples.
topGO is an R package used to perform Gene Ontology (GO) enrichment analysis. GO enrichment analysis can be used to identify which GO terms are significantly overrepresented in a set of genes of interest. The output from the topGO analysis has been saved in a PDFs and is available in the data folder. Rectangle color represents the relative significant, ranging from dark red (most significant) to bright yellow (least significant).
Reactome is a database of biological pathways, which provides detailed information about the molecular events that take place in cells. We use the ‘ReactomePA’ package for performing enrichment analysis on sets of genes or proteins to identify pathways that are significantly enriched in the data. The output from the Reactome analysis has been saved in a PDFs and is available in the data folder.
The Following code runs through the genes extracted and grabs the NCBI summary/abstract from the NCBI website. However, it is extremely slow taking about 2 seconds per gene and there are close to 80,000 Genes to be referenced. I have left it commented out. You can run it bu removing the #s (ctrl-shift-c).
In the meantime, I’ve explored running the loop in parallel using the NCBI api. It is faster and more efficient. Average run time is 50 seconds per 200 genes using parlapply (parallel version of lapply). You can run the code below by switching which cgene_comp to use.
# # Set the number of cores to use for parallel processing
# # Don't use all your core's your computer WILL crash. I've left 2 for the OS.
# par_data<- cgene_comp9 #change to comparison of interest
# ncores <- (detectCores()-2)
#
# # Initialize a parallel backend
# cl <- makeCluster(ncores)
# entrez_ids <- par_data$Entrez_IDs
#
# # Use parLapply to run the API calls in parallel
# summary_text <- parLapply(cl, entrez_ids, function(entrez_id) {
# #pass the required package to the worker nodes
# library("httr")
# library("RCurl")
# library("XML")
# #api_key generated from NCBI. If you don't have one, you can create an account with NCBI and generate one
# api_key <- readLines("api.txt")
# url <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=",
# entrez_id, "&api_key=", api_key)
# response <- getURL(url, ssl.verifypeer = FALSE)
# # Pause for 1 second after each API request limit of 10 requests per second or your API gets banned
# Sys.sleep(1)
# # Parse the XML data in the response
# gene_summary <- xmlParse(response)
# # Extract the summary information from the XML data
# # (Note: the exact method for extracting the information will depend on the structure of the XML data)
# summary_text_list <- xmlToList(gene_summary)
# summary_text_comp9<- summary_text_list[["DocumentSummarySet"]][["DocumentSummary"]][["Summary"]] #change name to match comparison of interest
# })
#
# # Stop the parallel backend when done
# stopCluster(cl)
The gene ontology annotation takes a long time as well. I have not found an API to make it faster, but a parallel version is a work in progress.
The data extracted from the analysis was exported as excel files for easier visualization outside of R Studio.