Pipeline description

Overview of the Affymetrix Quality Control workflow

The quality control of an Affymetrix array dataset follows this workflow:

The dark blue boxes represent the main steps of the automated QC analysis. The violet box is handled by the user. The main steps of the workflow are the quality control of the raw data, the pre-processing of the raw data, the evaluation of the pre-processing, and the export of the results. If the QC report shows no major quality problem, the user can download the pre-processed data for further analysis. Otherwise, the user can remove one or more arrays from the dataset and re-run the whole analysis on the modified dataset, repeating this until the QC gives positive results.

The automated QC analysis includes the following steps:

Raw data Quality Control:
Each white box represents a plot dedicated to one quality indicator. These plots are organized in groups: sample quality, hybridization quality, signal comparability and biases, and array correlation. The plots marked with a green cross are those proposed by default; the other plots are optional. Some analyses are only available for arrays containing mismatch (MM) probes. A summary QC table presenting the QC values is built specifically for these plots.

Data pre-processing:
A pre-processing strategy is applied by default, including background correction and intensity normalization. The default strategy is RMA for arrays containing only perfect match (PM) probes and GC-RMA for arrays containing both PM and MM probes. The pre-processing step also includes the re-annotation of the probes using a public database. This modification of the CDF file is proposed by default because it notably improves the interpretation of the results in further analyses. The annotation database proposed by default is Ensembl. Note that some databases are only available for particular species.

Evaluation of the pre-processing:
All graphs selected for this step are proposed by default and are available for all types of arrays.


Graphs description

Control of the Sample Quality

Sample prep controls: Dap, Thr, Phe and Lys -PolyA unlabeled spikes

These probe sets are designed from several B. subtilis genes (dap, thr, phe and lys). They are spiked in at the beginning of the chip processing and are used to assess the overall success of the target preparation steps.
The poly-A controls Dap, Thr, Phe and Lys should be called present, with decreasing intensity in that order, to verify that there was no bias during reverse transcription between highly expressed and lowly expressed genes. The linearity for lys, phe and thr (dap is present at a much higher concentration) is affected by a double amplification.

In this example, the intensity increases from Lys to Dap for all arrays, and the control with the lowest expected intensity (Lys) is called present for 11 of the 12 arrays. The Lys probeset of array ERT1 is called Absent (red 'A' on the graph).

[Technical documentation of the function]

RNA quality control: 3'/5' ratio for beta-actin and GAPDH

Because beta-actin and GAPDH are expressed in most cell types and are relatively long genes, Affymetrix chips use them as controls of RNA quality. Three probe-sets are designed on 3 regions of these genes (the 5' end, the middle (called M) and the 3' end). Similar intensities for the 3 regions indicate that the transcripts were not truncated and were labeled equally along the sequence. Since RNA degradation starts from the 5' end of the molecule, it is common for the probeset intensity at that end to be slightly lower.

For an array of good quality, Affymetrix recommends that the 3'/5' ratio should not exceed:
  - 3 for beta-actin
  - 1.25 for GAPDH
These values were set for human samples so the ratio may be slightly higher for other species.
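
The check against these recommended cut-offs can be sketched as follows. This is only an illustration and not the pipeline's own code: the function name `check_35_ratio` is hypothetical, and the intensities are assumed to be on a linear (not log) scale.

```python
# Recommended upper limits quoted above (determined on human samples).
MAX_RATIO = {"beta-actin": 3.0, "GAPDH": 1.25}

def check_35_ratio(gene, intensity_3p, intensity_5p):
    """Return (ratio, ok) for one array, from linear-scale intensities
    of the 3' and 5' probe-sets of the given control gene."""
    ratio = intensity_3p / intensity_5p
    return ratio, ratio <= MAX_RATIO[gene]

ratio, ok = check_35_ratio("GAPDH", 1200.0, 1000.0)
print(f"GAPDH 3'/5' ratio = {ratio:.2f}, within recommendation: {ok}")
```

The same function can be applied to the 3'/M ratio when the 5' intensity is not available.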

The QC pipeline proposes the following graphs to evaluate these controls:

The left graph plots, for each array, the ratios between the 3' and 5' ends of the genes (filled colored triangles or circles) and between the 3' end and M, the middle region (unfilled black triangles or circles). The 3'/M ratios are given in case the 5' intensity is not available (in case of strong degradation). The recommended cut-off is represented by a grey rectangle so that a faulty array is rapidly diagnosed. Be aware that these cut-offs were determined from human samples. The legend at the upper left gives the maximal ratios found. The right graph is a boxplot of all 3'/5' and 3'/M ratios. The legend under the title indicates whether all ratios are below the recommended threshold or whether some are above it.

In this example, all arrays give good quality results for these indicators.

[Technical documentation of the function]

Overall RNA quality control: RNA degradation plot

In Affymetrix arrays, a probe-set is dedicated to each target. A probe-set is composed of several probes (classically 11), all targeting the probe-set target sequence. The RNA degradation plot shows the average intensity of each probe position across all probe-sets, ordered from the 5' to the 3' end. Since RNA degradation starts from the 5' end of the molecule, we would expect probe intensities to be globally lower at that end of a probe set than at the 3' end. The RNA degradation plot aims at visualizing this trend.
RNA that is too degraded will show a very steep slope from 5' to 3'. The standardized slope of the curves is thus used as a quantitative indicator of RNA degradation. An array with unexpected degradation has a steeper slope and should stand out.
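
As a rough illustration of the idea, the sketch below averages the intensities per probe position and fits a straight line through the resulting curve. It is a simplified assumption of how such a slope could be obtained: the function names are hypothetical, and the slope here is a plain least-squares slope, not the standardized slope used by the pipeline.

```python
def mean_by_position(probesets):
    """Average intensity at each probe position (5' to 3') across probe-sets.
    Each probe-set is a list of intensities of equal length."""
    n_positions = len(probesets[0])
    return [sum(ps[i] for ps in probesets) / len(probesets)
            for i in range(n_positions)]

def slope(values):
    """Ordinary least-squares slope of values against positions 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

profile = mean_by_position([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
print(profile, slope(profile))
```

Computed per array, the slope gives one number per array; an array whose value is much larger than the others would be the one to inspect.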

Each line corresponds to an array. In this example, all arrays give a curve with a regular slope, except between the 10th and the 11th probes.

[Technical documentation of the function]


Hybridization and overall signal quality

Hybridization spike-in controls: BioB, BioC, BioD and CreX

Affymetrix arrays include spike-in hybridization controls: 4 targets are spiked in before the labeling step at 4 different concentrations; from the lowest to the highest: BioB, BioC, BioD and CreX. bioB, bioC and bioD are genes of the biotin synthesis pathway of E. coli, and cre is the recombinase gene from the P1 bacteriophage; they are not expected to cross-hybridize with non-bacterial and non-viral samples. The intensity pattern for these 4 controls should reflect the increase in target concentration. Other patterns would be a sign of bad hybridization.
If BioB is not flagged "present", this would also be a sign of bad quality, indicating that the sensitivity of the array may not be sufficient.
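
The two criteria described above (increasing intensities and a present call for BioB) can be expressed as a small check. This is a minimal sketch under assumptions: the function name and the dictionary layout are hypothetical, and the detection call is assumed to be coded as "P" for present.

```python
# Controls ordered from the lowest to the highest spiked concentration.
EXPECTED_ORDER = ["BioB", "BioC", "BioD", "CreX"]

def hybridization_ok(intensities, calls):
    """True if intensities strictly increase from BioB to CreX
    and BioB is called present ("P") on this array."""
    values = [intensities[name] for name in EXPECTED_ORDER]
    increasing = all(a < b for a, b in zip(values, values[1:]))
    return increasing and calls["BioB"] == "P"

intensities = {"BioB": 150.0, "BioC": 400.0, "BioD": 1800.0, "CreX": 6000.0}
calls = {"BioB": "P", "BioC": "P", "BioD": "P", "CreX": "P"}
print(hybridization_ok(intensities, calls))  # True
```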

Each line corresponds to an array. The legend at the top left corner indicates the number of present calls among all BioB probe-sets. The text below the graph gives the conclusion drawn from these calls. In this example, all arrays have a BioB probe-set called present.

[Technical documentation of the function]

Background intensity

The background intensity is defined for each array from the MM (mismatch) probe values. Average, minimum and maximum background should be comparable between arrays; an array with a significantly higher (or lower) background value may be a sign of bad quality.

Dataset #1: Good result for this indicator; background intensities are similar between arrays
and the overall intensity is quite low (around 40):

Dataset #2: Bad result for this indicator; the values are different between arrays (sd = 80)
and globally the intensities are quite high (between 300 and 600):

The left graph shows the background intensity values for all arrays. The legend gives the array group names (here experimental factors). The grey rectangle represents the 10% range centered around the mean value. The right graph is a boxplot of all values. The top right legend gives the maximal distance between arrays (max - min).

[Technical documentation of the function]

Percent present

The Affymetrix MAS5 algorithm flags probe-sets as "present", indicating that their targeted transcript was detected. Present calls are made when the PM (perfect match) values are significantly higher than the MM (mismatch) values. This indicator is thus meaningful only if MM probes are present on the array. The percentage of present calls should be similar for replicate arrays and within a range of 10% over the arrays. If this is not the case, the quality of one of the replicates may be bad.
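
The 10% criterion can be illustrated with a short sketch that starts from detection calls that have already been computed. The function names are hypothetical, the calls are assumed to be coded "P", "M" and "A", and the range is taken as max - min in percentage points.

```python
def percent_present(calls):
    """Percentage of probe-sets called present ("P") on one array."""
    return 100.0 * sum(c == "P" for c in calls) / len(calls)

def within_range(percentages, max_spread=10.0):
    """True if all arrays fall within max_spread percentage points."""
    return max(percentages) - min(percentages) <= max_spread

per_array = [percent_present(["P", "A", "P", "M"]),
             percent_present(["P", "P", "A", "A"])]
print(per_array, within_range(per_array))
```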

Dataset #1: Good result for this indicator; all arrays are within a 10% range:

Dataset #2: Bad result for this indicator; arrays S2E2-1, NoCT-1 and NoE2-2 have
a particularly low percentage of present calls:

The left graph shows the percent present values for all arrays. The legend gives the array group names (here experimental factors). The grey rectangle represents the 10% range centered around the mean value. The right graph is a boxplot of all values. The top right legend gives the maximal distance between arrays (max - min).

[Technical documentation of the function]

Present/Marginal/Absent (PMA) table

Optionally, and for arrays containing perfect match (PM) and mismatch (MM) probes, we propose to create the table of the probeset detection calls: "P" for Present, "M" for Marginal and "A" for Absent. See the description of the Percent present plot for explanations about the computation of these values. The output table is a text file (.txt) with tab-separated columns, so it can easily be opened with any spreadsheet editor such as Excel:

This table may be useful for further investigation on particular genes.
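
To make the layout concrete, here is a minimal sketch of how such a tab-separated table could be built from a set of calls. The function name, the "probeset" header label and the input structure are assumptions made for the illustration; the actual column names of the pipeline's output file may differ.

```python
import csv
import io

def pma_table(calls_by_array):
    """Build tab-separated text from {array_name: {probeset_id: call}},
    one row per probe-set and one column per array."""
    arrays = list(calls_by_array)
    probesets = sorted(next(iter(calls_by_array.values())))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["probeset"] + arrays)
    for ps in probesets:
        writer.writerow([ps] + [calls_by_array[a][ps] for a in arrays])
    return out.getvalue()

print(pma_table({"A1": {"ps1": "P", "ps2": "A"},
                 "A2": {"ps1": "M", "ps2": "A"}}))
```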

[Technical documentation of the function]

Positive and Negative control distribution

Affymetrix arrays contain border elements, positioned on the outer edge of the array. They serve for the automatic setting of the grids but also as controls for signal intensity. For each array, the intensities of all border elements are collected. Elements with an intensity greater than 1.2 times the mean for that group are assumed to be positive controls. Elements with a signal less than 0.8 times the mean are assumed to be negative controls. Elements falling between these cut-offs are not used in further calculations. This graph presents boxplots of the positive and negative distributions. The means and spread of the positive elements should be comparable between arrays. Dissimilarity can arise either from non-uniform hybridization or from gridding problems. The negative elements represent spots with no hybridization signal, so they are expected to be close to 0. Boxplots that are strongly elevated relative to all the others reflect a higher background level.
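
The classification rule can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and the mean is assumed here to be taken over all border elements of one array.

```python
def split_border_elements(intensities):
    """Split the border elements of one array into positive and negative
    controls using the 1.2x and 0.8x mean cut-offs; elements in between
    are discarded."""
    mean = sum(intensities) / len(intensities)
    positive = [v for v in intensities if v > 1.2 * mean]
    negative = [v for v in intensities if v < 0.8 * mean]
    return positive, negative

positive, negative = split_border_elements([10, 20, 90, 100, 180, 200])
print(positive, negative)  # [180, 200] [10, 20]
```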

Dataset #1: Good result for this indicator; positive and negative controls are in separate
intensity ranges. Negative controls have similar distributions between arrays:

Dataset #2: Bad result for this indicator; positive controls reach the saturation level (50,000) and negative
controls are spread between 0 and 50,000. Array 2 has significantly higher negative controls.

[Technical documentation of the function]

Profiles and boxplots of all controls (AFFX, INTRON, EXON)

Affymetrix arrays contain several control probesets, most of them annotated with the "AFFX" prefix. This is mainly the case for the spike-in sample prep and hybridization controls and for the GAPDH and beta-actin 3'/5' controls, but other control probesets may be present, depending on your array type. The first image contains a list of graphs, one per "AFFX"-annotated probeset, plotting the log-intensity profiles of all probes for all arrays. Outlier arrays may have intensity profiles that differ from those of the other arrays. The general trend is represented by an average curve, plotted in black. Some control probesets contain too many probes for the profile plot to be read properly and used for QC.

The second image is a boxplot summarizing all AFFX control log-intensities for each array. Other boxplots may be generated, depending on your array type, to present EXON and INTRON controls.

[Technical documentation of the function]


Signal comparability and biases diagnostic

Signal distribution

Scale factor

A main assumption behind most normalization methods for high-throughput expression arrays is that most genes are unchanged. The proportion of up- and down-regulated genes should not disturb the average signal intensity, which should be comparable between arrays. In this context, the Affymetrix MAS5 algorithm applies a scale factor to each array in order to equalize their mean intensities. A dataset of good-quality arrays should not have very different scale factors. Affymetrix recommends that the scale factors be within 3-fold of one another.
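
The 3-fold criterion can be illustrated as follows. This is a simplified sketch and not the MAS5 implementation: the scale factor is computed here from a plain mean intensity, the target value of 100 is an arbitrary assumption, and the function names are hypothetical.

```python
def scale_factors(mean_intensities, target=100.0):
    """Factor that brings each array's mean intensity to the target value."""
    return [target / m for m in mean_intensities]

def within_three_fold(factors):
    """True if the largest and smallest scale factors differ by at most 3-fold."""
    return max(factors) / min(factors) <= 3.0

factors = scale_factors([50.0, 100.0, 200.0])
print(factors, within_three_fold(factors))  # [2.0, 1.0, 0.5] False
```

Note that the max / min ratio does not depend on the chosen target, so the conclusion is the same for any target value.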

Dataset #1: Good result for this indicator; all scale factor values are similar between arrays:

Dataset #2: Bad result for this indicator; scale factor values are very dissimilar,
especially for three arrays (S2E2-1, NoCT-1 and NoE2-2):

The left graph shows the scale factor values on a log scale for all arrays. The legend gives the array group names (here experimental factors). The grey rectangle represents the 3-fold range centered around the mean value. The right graph is a boxplot of all values. The top right legend gives the maximal ratio between arrays (max / min).

[Technical documentation of the function]

Boxplots of log-intensities

Boxplots of the log-intensity distributions are plotted for between-array comparison. The distributions of raw PM (perfect match) probe log-intensities are not expected to be identical, but should not be totally different either, while the distributions of normalized (and summarized) probe-set log-intensities are expected to be more comparable, if not identical (some normalization methods make the distributions equal). Drawing these boxplots before and after normalization also allows the normalization step to be checked.

Left : Raw data // Right : Normalized data.

Dataset #1: Raw signal distributions are similar even before normalization:

Dataset #2: Four arrays have particularly low intensities (S1TAM-1, S2E2-1, NoCT-1 and NoE2-2).
The normalization step does not fully correct these differences:

[Technical documentation of the function]

Density histogram of log-intensities

Density plots of the log-intensity distribution of each array are superimposed on a single graph, to make the comparison between arrays easier and to identify arrays with an unusual distribution. As for the boxplots, the density distributions of raw PM (perfect match) probe log-intensities are not expected to be identical, but should not be totally different either, while the distributions of normalized probe-set log-intensities are expected to be more comparable. Drawing these plots before and after normalization also allows the normalization step to be checked.

Left : Raw data // Right : Normalized data.

Dataset #1: Raw signal distributions are similar even before normalization:

Dataset #2: Raw distributions are different when compared between arrays.
After the normalization, array NoE2-2 still presents a different distribution:

[Technical documentation of the function]

Unsmoothed density histogram

We propose an alternative function to draw the density histogram, using an unsmoothed density curve of the intensities for all arrays in the raw or normalized dataset. This function is not called by default but is implemented. You may modify the function calls to be able to use this particular version.

Raw signal distributions shown on an unsmoothed density histogram

[Technical documentation of the function]


Intensity-dependent biases

MA plot

The MA plots allow pairwise comparison of the log-intensities of each array to a reference array and the identification of intensity-dependent biases. The Y axis of the plot shows the log-ratio of the intensities of one array to the reference median array, which is called 'M', while the X axis shows the average log-intensity of both arrays, called 'A'. Within a group of replicates, the probe levels are not likely to differ much, so we expect an MA plot centered on the Y=0 axis from low to high intensities. When the MA plots are computed for each replicate group separately, the reference array is the median array of each group. A smooth Loess regression curve is plotted to facilitate the comparison with the Y=0 axis. The normalization is expected to correct for intensity-dependent biases: plotting these graphs before and after normalization allows the efficiency of this correction to be checked.
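
The M and A values described above can be computed as in the following sketch. It assumes linear-scale intensities and a base-2 logarithm, and the function names are hypothetical; the Loess curve itself is not computed here.

```python
import math
from statistics import median

def median_array(arrays):
    """Reference array: per-probe median across a list of arrays."""
    return [median(values) for values in zip(*arrays)]

def ma_values(array, reference):
    """Return (M, A): log2 ratio and mean log2 intensity, per probe."""
    m = [math.log2(a) - math.log2(r) for a, r in zip(array, reference)]
    a_mean = [(math.log2(a) + math.log2(r)) / 2 for a, r in zip(array, reference)]
    return m, a_mean

arrays = [[1.0, 2.0], [4.0, 8.0], [16.0, 32.0]]
reference = median_array(arrays)
print(ma_values(arrays[2], reference))  # ([2.0, 2.0], [3.0, 4.0])
```

For an array without intensity-dependent bias, the M values scatter around 0 over the whole range of A.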

Left : Raw data // Right : Normalized data.

Dataset #1: Raw signal distributions do not show any intensity-dependent bias:

Dataset #2.1: There is an intensity-bias in the raw data that is corrected after normalization

Dataset #2.2: The intensity-bias in the raw data is so strong that the normalization
cannot correct it totally

[Technical documentation of the function]


Spatial biases

Array reference layout

This graph presents the layout of the grid by color-coding a 2D image of one array according to the probe type (mainly regular probes, control probes and control regions), using available annotation libraries. Thus no data are plotted; the plot only shows the position of control and regular probes on the array. If applicable, a distinction is made between perfect match (PM) and mismatch (MM) probes.

[Technical documentation of the function]

Spatial positive and negative border elements comparison

The control elements are separated according to the edge of the array on which they are located. The mean values of the left, right, top and bottom elements are calculated for positive and negative controls, and a "center of intensity" (COI) is calculated for the controls. If the hybridization is uniform across the array, the COI will be located at the physical center of the array; otherwise, this may indicate the presence of spatial biases. The COI is plotted on a relative scale where the point (0,0) is the center and 1 and -1 represent the edges of the array. Arrays where the COI has coordinates with a magnitude greater than 0.5 may be considered as having spatial biases.
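
The text does not give the exact COI formula, so the sketch below uses one plausible definition as an assumption: each coordinate is the normalized difference between the mean intensities of two opposite edges, which yields values between -1 and 1 with (0,0) for a uniform array. The function names are hypothetical.

```python
def center_of_intensity(left, right, top, bottom):
    """Assumed COI: normalized difference of opposite edge means.
    Inputs are the mean control intensities of the four edges."""
    x = (right - left) / (right + left)
    y = (top - bottom) / (top + bottom)
    return x, y

def has_spatial_bias(coi, limit=0.5):
    """True if either coordinate has a magnitude greater than the limit."""
    return abs(coi[0]) > limit or abs(coi[1]) > limit

coi = center_of_intensity(left=20.0, right=180.0, top=100.0, bottom=100.0)
print(coi, has_spatial_bias(coi))  # (0.8, 0.0) True
```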

Dataset #1: The Center Of Intensity is well centered around (0,0) for all arrays,
for both positive and negative controls:

Dataset #2: Arrays with a COI out of the recommended range are tagged on the plot.
In this dataset, arrays 2, 9, 12, 13 and 16 have negative controls COI out of range:

[Technical documentation of the function]

2D images for spatial bias diagnostic

Plotting the characteristics of the expression estimate at the array positions makes it possible to see spatial trends or biases that cannot be distinguished in the raw data. The expression measure may be estimated by a Probe Level Model (PLM) using an M-estimator robust regression, or directly from the raw probe intensity measurement.
The 2D image proposed by default first tries to use the PLM estimate and plots the model residuals. See spatialImages for more technical details on the function. If the image cannot be built, which is the case for certain types of arrays, an expression estimate is calculated from the raw data. It uses a median array and plots intensities relative to the other arrays if there are more than 6 arrays in the dataset. Otherwise, it adapts the color-set so that the spurious areas can be optimally diagnosed. See array.image for more technical details on the function.
The normalization step includes a spatial correction, but as this step also summarizes probe intensities into probe-set intensities, the graph is not re-computed after normalization: there is no spatial position associated with probe-sets.

Dataset #1: No spatial bias; the color-coded values are homogeneous:

Dataset #2: We can see several spurious areas:

We also propose to optionally compute a set of four 2D images for each array, including:
- 2D raw data image
- 2D image of PLM weights
- 2D image of PLM residuals (computed by default, allows a good view of spurious areas)
- 2D image of PLM residual signs (allows a better view of general gradients)

Note that the calculation and building of the complete set of images is quite time-consuming. If the array type does not allow the PLM-based images to be built, these plots will be skipped.
The 2D images will be presented by type, optimizing the number of arrays presented per page according to the number of arrays in the dataset.

[Technical documentation of the function]


Probe-sets homogeneity

NUSE plot

The Normalized Unscaled Standard Error (NUSE) is the individual probe error obtained when fitting the Probe-Level Model (the PLM models expression measures using an M-estimator robust regression). The NUSE values are standardized at the probe-set level across the arrays: median values for each probe-set are set to 1. The boxplots allow checking (1) whether all distributions are centered near 1 (typically, an array with a boxplot centered around 1.1 shows bad quality) and (2) whether one array has a globally higher spread of its NUSE distribution than the others, which may also be a sign of low quality.
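
The standardization step can be sketched as follows, assuming the standard errors from the PLM fit are already available as one value per probe-set and per array. The function name and the input layout are hypothetical; fitting the PLM itself is outside the scope of this sketch.

```python
from statistics import median

def nuse(standard_errors):
    """standard_errors[p][a] is the standard error of probe-set p on array a.
    Each row is divided by its median, so the median of every probe-set is 1."""
    return [[se / median(row) for se in row] for row in standard_errors]

print(nuse([[1.0, 2.0, 4.0], [0.5, 0.5, 1.0]]))
# [[0.5, 1.0, 2.0], [1.0, 1.0, 2.0]]
```

The boxplot of one array is then drawn from the corresponding column of the result.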

[Technical documentation of the function]

RLE plot

The Relative Log Expression (RLE) values are computed by calculating for each probe-set the ratio between the expression of a probe-set and the median expression of this probe-set across all arrays of the experiment. It is assumed that most probe-sets are not changed across the arrays, so it is expected that these ratios are around 0 on a log scale. The boxplots presenting the distribution of these log-ratios should then be centered near 0 and have similar spread. Other behavior would be a sign of low quality.
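
On log-scale data, the ratio to the median becomes a difference, which the following sketch illustrates. It assumes the expression values are already on a log2 scale; the function name and the input layout are hypothetical.

```python
from statistics import median

def rle(log_expression):
    """log_expression[p][a] is the log2 expression of probe-set p on array a.
    Returns, per probe-set, the deviation of each array from the median."""
    return [[value - median(row) for value in row] for row in log_expression]

print(rle([[7.0, 8.0, 10.0], [5.0, 5.0, 6.0]]))
# [[-1.0, 0.0, 2.0], [0.0, 0.0, 1.0]]
```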

[Technical documentation of the function]


Correlation between arrays

Correlation plot

A correlation coefficient is computed for each pair of arrays in the dataset and is presented qualitatively on a colored matrix. The minimal value of this coefficient (given in the legend) gives a good idea of the homogeneity of the dataset: low coefficients indicate important differences between array intensities. We suggest plotting it before and after normalization: as the normalization makes the arrays more comparable, the correlation should be higher after this step.
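
The minimal pairwise coefficient can be obtained as in this sketch. The text does not say which correlation coefficient is used, so the Pearson coefficient is an assumption here, as are the function names and the input layout.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def min_correlation(arrays):
    """arrays maps array names to intensity lists; returns the lowest
    correlation found over all pairs of arrays."""
    return min(pearson(arrays[i], arrays[j]) for i, j in combinations(arrays, 2))

arrays = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.5], "c": [1.5, 1.0, 4.0]}
print(round(min_correlation(arrays), 3))
```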

Left : Raw data // Right : Normalized data.

Dataset #1: Signals are highly correlated, even before normalization. The lowest
correlation coefficient is 0.955 before and 0.99 after normalization:

Dataset #2: Array 16 has a correlation coefficient <50% with all other arrays, even after normalization.
Arrays 9 and 13 are correlated with a coefficient <90%.

[Technical documentation of the function]

PCA analysis

The PCA (Principal Component Analysis) gives another view of the correlations of expression between arrays: the data are projected on several axes (or components), ordered by decreasing significance; the first principal component (PC1) explains most of the variation in expression (in the following example, PC1 explains almost 47% of the variance). Clusters of arrays on a PCA plot indicate a strong correlation of expression signals. This analysis is proposed before and after normalization.

The PCA graph presents 4 plots: in the first three, the array data are projected on PC1 versus PC2, PC1 versus PC3 and PC2 versus PC3, respectively. The fourth plot is a histogram of the percentage of variance explained by each component, in decreasing order of significance: PC1, PC2, PC3...
We clearly see in this example that PC1 projects the outlier array NoE2-2 very far from the other arrays, because this variance is the most important in this dataset. PC2 and PC3, however, give interesting results that are not spoiled by the outlier array: arrays from groups S1CT and S1TAM are all grouped together in the PC2 versus PC3 plot.

[Technical documentation of the function]

Hierarchical clustering

The Hierarchical Clustering plot is computed in two steps: first, an expression measure distance is computed between all pairs of arrays; then, the tree is built from these distances. The absolute values of the distances are of interest, as well as the groups of arrays that emerge from this analysis.

Left : Raw data // Right : Normalized data.

Dataset #1: The distance value ranges are very low (0.02 for raw data and less than 0.1
after normalization). The expression values are very homogeneous between arrays.

Dataset #2: The distance ranges are higher for this dataset (1.2 for raw data and around
0.8 after normalization): we clearly see here that array NoE2-2 is very different from all other
arrays even after normalization, which confirms what most of the quality control graphs concluded.

[Technical documentation of the function]