Tutorial 2: Characterizing the Rett Transcriptome

Introduction

In this tutorial, we will analyze an RNA sequencing (RNA-seq) dataset generated by Aldinger et al. from postmortem brain tissue of four female Rett syndrome patients and age-matched female controls (https://doi.org/10.1038/s41597-020-0527-2). The dataset is already available as an example in the ArrayAnalysis app. Rett syndrome is a rare neurodevelopmental disorder caused by mutations in the X-linked MECP2 gene.

Data upload

To start the analysis, select "RNA-Seq Analysis" and "Raw count", then click "Start Analysis". In the next tab, click "Run example".

Once the data are loaded, the metadata and a preview of the count table are shown. The metadata show four experimental groups: the cingulate and temporal cortex of controls and the cingulate and temporal cortex of Rett syndrome patients. Each experimental group includes four samples.

Question 1: How old are the youngest and oldest Rett syndrome (RTT) patients in the dataset?

Show answer

16 and 31 years, respectively. You can find this in the metadata tab on the data upload page, where you can sort the metadata by age.

Pre-processing

Proceed to the pre-processing step by clicking "Next" (bottom left of the page) or the "pre-processing" tab in the navigation bar. In this step, you can remove samples, remove lowly expressed genes, and perform normalization. Select "Genotype" and "Tissue" as the experimental group and start the data pre-processing by clicking "Calculate".

Question 2: Why is normalization needed? What would the boxplots and density plots look like if you did not perform any normalization?

Show answer

Normalization is needed to adjust for sample-to-sample variability and make the data comparable between samples. Without normalization the boxplots and density plots would show notable differences in the distribution between samples (see figure below).
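To make the idea concrete, here is a minimal sketch of one simple normalization, counts per million (CPM), applied to a made-up count table. The gene names and counts are hypothetical, and CPM is only an illustration: the app may well use a different normalization method.

```python
import math

# Hypothetical raw counts: gene -> counts per sample (not from the Rett dataset).
raw_counts = {
    "geneA": [100, 200, 50],
    "geneB": [300, 600, 150],
    "geneC": [600, 1200, 300],
}

def counts_per_million(counts):
    """Scale each sample (column) so that its counts sum to one million."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total number of reads counted in each sample.
    library_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / library_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }

cpm = counts_per_million(raw_counts)

# Log-transform with a pseudocount of 1 so that zero counts stay defined.
log_cpm = {gene: [math.log2(v + 1) for v in row] for gene, row in cpm.items()}
```

In this toy table the three samples differ only in sequencing depth, so after scaling every gene has the same value in all samples. The raw counts, by contrast, would produce visibly shifted boxplots.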

Question 3: Why is filtering needed? What do we filter for? How many genes passed the applied gene filtering?

Show answer

Filtering is needed to remove lowly expressed genes, whose measurements are dominated by noise and cannot be analyzed reliably. We therefore filter on a minimal expression level (10 reads in at least a substantial subset of the samples). With a threshold of 10, 18,311 genes pass the filtering (see the bottom of the normalized count table).
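The filtering rule can be sketched as follows. This is a minimal illustration on hypothetical counts; the `min_samples` value is an assumption, since the exact size of the "substantial subset" used by the app is not stated here.

```python
# Hypothetical raw counts: gene -> counts per sample (not from the Rett dataset).
raw_counts = {
    "geneA": [0, 2, 1, 0],
    "geneB": [15, 30, 8, 22],
    "geneC": [12, 3, 0, 5],
}

def filter_low_expression(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    `min_samples=2` is an assumed value chosen for this toy example.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if sum(value >= min_count for value in row) >= min_samples
    }

kept = filter_low_expression(raw_counts)
print(sorted(kept))  # ['geneB']
```

Only geneB reaches 10 reads in two or more samples; geneC does so in a single sample and is therefore removed along with geneA.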

Question 4: Look at the correlation plot and PCA plot. Do the samples cluster as expected?

Show answer

In the PCA plot and correlation plot, the samples cluster partly by genotype, as expected, but not perfectly: two of the clear subclusters contain a single genotype, while a third is mixed. The two tissue samples from the same individual also cluster together in most cases, which may likewise be expected, but again not in all.
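A correlation plot is built from pairwise correlations between the samples' expression profiles. The sketch below computes such a matrix for three hypothetical samples; the sample names and values are invented, and the use of Pearson correlation is an assumption rather than a statement about what the app computes.

```python
import math

# Hypothetical log-expression profiles (one value per gene) for three samples.
profiles = {
    "control_1": [5.0, 7.0, 2.0, 9.0],
    "control_2": [5.2, 6.8, 2.1, 9.1],
    "rtt_1": [7.5, 4.0, 6.0, 3.0],
}

def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Pairwise correlation for every combination of samples.
correlation = {
    (s1, s2): pearson(p1, p2)
    for s1, p1 in profiles.items()
    for s2, p2 in profiles.items()
}
```

Samples that belong together should show a higher mutual correlation than samples from different groups, which is the pattern to look for in the plot.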

Differential gene expression analysis

After we have pre-processed the data, we want to compare the expression profiles of Rett syndrome patients with the controls in the temporal cortex. Furthermore, we would like to correct for age and add Ensembl IDs to the output. Please select the correct options on the statistical analysis page and click "Calculate".

Question 5: What is the difference between the p-value and adj. p-value? Which one should you use to find differentially expressed genes?

Show answer

The adj. p-value is adjusted for multiple testing. If you test many genes, you will end up with many false positives if you do not correct the p-value. For instance, with a 5% significance level for each test, performing 10 independent tests raises the chance of at least one false positive to about 40% (1 - 0.95^10 ≈ 0.40). Hence, the adj. p-value should be used to limit the number of false positives and find truly differentially expressed genes.
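The arithmetic behind the 40% figure, together with one common way of adjusting p-values, can be sketched as below. The Benjamini-Hochberg procedure is used here as an assumption; which adjustment method the app applies is not stated in this tutorial, and the example p-values are made up.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print(round(familywise_error_rate(0.05, 10), 3))  # 0.401

# Hypothetical raw p-values for four genes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Adjusted p-values are never smaller than the raw ones, which is why fewer genes pass a given cutoff after correction.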

Question 6: Is MECP2 differentially expressed between RTT and IC?

Show answer

You can look up genes using the search bar (top right of the top table). There, you can see that MECP2 is not differentially expressed.

Question 7: How many genes are downregulated (adj. p-value < 0.05 and log₂FC < -1)?

Show answer

In the Summary tab, you can calculate how many genes are up- and downregulated for different p-value and log₂FC thresholds. Here, you can see that 110 genes are downregulated.
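The counting done in the Summary tab amounts to applying two thresholds to the top table. The sketch below uses a small hypothetical table (gene names and values are invented) and treats a gene as downregulated when its log₂FC lies below the negative of the cutoff.

```python
# Hypothetical results: (gene, log2 fold change, adjusted p-value).
top_table = [
    ("gene1", -1.8, 0.001),
    ("gene2", 2.3, 0.010),
    ("gene3", -0.4, 0.020),
    ("gene4", -1.2, 0.300),
    ("gene5", 1.1, 0.040),
]

def count_regulated(rows, p_cutoff=0.05, lfc_cutoff=1.0):
    """Count significantly up- and downregulated genes.

    Up:   adj. p-value < p_cutoff and log2FC >  lfc_cutoff
    Down: adj. p-value < p_cutoff and log2FC < -lfc_cutoff
    """
    up = sum(1 for _, lfc, padj in rows if padj < p_cutoff and lfc > lfc_cutoff)
    down = sum(1 for _, lfc, padj in rows if padj < p_cutoff and lfc < -lfc_cutoff)
    return up, down

up, down = count_regulated(top_table)
print(up, down)  # 2 1
```

Here gene3 is significant but changes too little, and gene4 changes enough but is not significant, so neither is counted.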

Gene set analysis

You can now perform gene set analysis to identify which biological processes and pathways are altered.

Question 8: What processes and pathways are altered in Rett syndrome?

Show answer

The exact processes and pathways will depend on which gene set collection is used and whether ORA or GSEA is performed. However, with both methods, you will find processes and pathways related to inflammation/immune response and synaptic signaling.
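For intuition about ORA (overrepresentation analysis), the sketch below computes the one-sided hypergeometric p-value that underlies it: the probability of seeing at least the observed overlap between a gene set and the list of differentially expressed genes by chance. All numbers are hypothetical, and the app's actual implementation may differ.

```python
from math import comb

def ora_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe: number of genes tested (background)
    n_in_set:   number of background genes belonging to the gene set
    n_selected: number of differentially expressed genes
    n_overlap:  number of differentially expressed genes in the gene set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 1000 genes tested, 50 in the gene set,
# 100 differentially expressed, 15 of which are in the gene set.
p = ora_p_value(1000, 50, 100, 15)
```

By chance one would expect about 5 of the 100 selected genes to fall in this set, so an overlap of 15 yields a small p-value. GSEA works differently: it uses a ranking of all genes rather than a thresholded list.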
