Understanding TCGA mRNA Level3 analysis results files from FireBrowse

The corresponding gitlab repo: understanding-firebrowse-data-format.

A useful quote from http://seqanswers.com/forums/showthread.php?t=42911.

It took me a while to get my head around this, since the column names in the rsem.genes/isoforms.results files don’t match the default output of RSEM, neither the version they claim to have used nor the most current version.

The (first) RSEM paper explains that the program calculates two values. One represent the (estimated) number of reads that aligned to a transcript. This value is not an integer because RSEM only reports a guess of how many ambiguously mapping reads belong to a transcript/gene. This number is what the TCGA slightly misleadingly calls raw counts.

The scaled estimate value on the other hand is the estimated frequency of the gene/transcript amongst the total number of transcripts that were sequenced. Newer versions of RSEM call this value (multiplied by 1e6) TPM - Transcripts Per Million. It’s closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is independent of transcript length, whereas “raw” counts are not!

The *.normalized_results files on the other hand just contain a scaled version of the raw_counts column. The values are divided by the 75-percentile and multiplied by 1000. This should make the values a bit more comparable between experiments. The Perl code for this quantile normalisation can be found here.

In conclusion, I would strongly recommend using the TPM/scaled_estimate values for all intents and purposes. It seems to me to be the more robust and mathematically sound value.

I am trying to figure out which mRNA analysis results file from Firehose I should use for differential gene expression (e.g. with DESeq2). However, I find the documentation relatively poor. As I have spent quite some time figuring out all the details, I decided to share what I’ve found here so future users don’t need to.

I am currently looking at KICH. If you click KICH at http://firebrowse.org/, and then click mRNA, you see the following pop up menu.

I tried to understand all the difference, and found the following pages somewhat helpful.

confluence.broadinstitute.org
wiki.nci.nih.gov (This is about the RNASeqV2 pipeline in general)

Based on the description of the second link, I thought I would be interested in the illuminahiseq_rnaseqv2-RSEM_genes file as DESeq2 require expression values as close to raw count as possible.

However, when I download it and look inside, it contains headers as such with very long lines.

Hybridization REF TCGA-KL-8323-01A-21R-2315-07 TCGA-KL-8323-01A-21R-2315-07.1 TCGA-KL-8323-01A-21R-2315-07.2 TCGA-KL-8324-11A-01R-2315-07 TCGA-KL-8324-11A-01R-2315-07.1 TCGA-KL-8324-11A-01R-2315-07.2 ...
         gene_id                    raw_count                scaled_estimate                  transcript_id                    raw_count                scaled_estimate                  transcript_id ...
     ?|100130426                         0.00                              0                     uc011lsn.1                         0.00                              0                     uc011lsn.1 ...
     ?|100133144                        22.18           5.56262138762192e-07          uc010unu.1,uc010uoa.1                        20.92           5.66666965654224e-07          uc010unu.1,uc010uoa.1 ...
     ?|100134869                        13.82           2.50584617744958e-07          uc002bgz.2,uc002bic.2                        17.08           3.34550559455013e-07          uc002bgz.2,uc002bic.2 ...
         ?|10357                       247.31           1.44524322577865e-05                     uc010zzl.1                       328.97           2.06751089262803e-05                     uc010zzl.1 ...
         ?|10431                      1814.00           4.74864023757294e-05          uc001jiu.2,uc010qhg.1                      2691.00           7.60539144597261e-05          uc001jiu.2,uc010qhg.1 ...
...

So for each sample, there are three columns (barcode, barcode.1, barcode.2) corresponding to raw_count, scaled_estimates, and transcript_id(s). I find the file format is not very intuitive, and creates inconvenience for future parsing. Such nested structure would’ve been much better handled by XML/JSON/YAML, etc instead.

The normalization method is briefly described in the first link

RSEM expression estimates are normalized to set the upper quartile count at 1000 for gene level and 300 for isoform level estimates. These files have two columns, feature name and normalized count

I also looked into the mRNAseq_Preprocess link, which was not described in the above links. I found an overly short description here:

mRNAseq_preprocessor: Pick the “normalized_count”(quantile normalized RSEM) value from illumina hiseq/ga2 mRNAseq level_3 (v2) data set and make the mRNAseq matrix with log2 transformed for the downstream analysis. To maximize sample counts we include both HiSeq and GA2 aliquots in each cohort dataset, but if a given patient has both HiSeq and GA2 aliquots the HiSeq aliquot will take precedence (to avoid double-counting a patient during analysis). The pipeline also will create the matrix with RPKM and log2 transform from hiseq/ag2 mRNAseq level 3 (v1) data set.

Z score calculation of RSEM/RPKM data:

Z = (expression in single tumor sample) - (mean expression in all tumor samples ) / (standard deviation of expression in all tumor samples)

The Preprocess files after download look like this

KICH.uncv2.mRNAseq_RSEM_Z_Score.txt
KICH.uncv2.mRNAseq_RSEM_normalized_log2.txt
KICH.uncv2.mRNAseq_RSEM_normalized_log2_PARADIGM.txt
KICH.uncv2.mRNAseq_RSEM_all.txt
KICH.uncv2.mRNAseq_raw_counts.txt
KICH.uncv2.mRNAseq_scaled_estimate.txt
KICH.uncv2.mRNAseq_geneid_transcriptid_mapping.txt
MANIFEST.txt

File 1 and 2 are briefly explained in the above quote. I don’t quite understand why that is called a Z score, the formula is equal to standardization (i.e. removing mean, and scaled to unit variance). Also, it does not say if how normal/metastatic samples’ Z scores are calculated. File 3 is not described, but PARADIGM seems to be related to pathway analysis (keyword search within that linked page). File 4 and 5 are likely extracted from the raw_count and scaled_estimate columns of the RSEM_genes (unnormalized) file mentioned before. I checked a few genes and File 7 seems to the same as the RSEM_genes_normalized file (but slightly different file formats), consistent with the description of mRNAseq_preprocessor part under RNAseq Pipelines section.

I think the documentation could’ve been much improved so that no one needs to hack around file formats, which bioinformaticians have already spent too much time doing so in general.

Back to my own problem, I prefer to use RSEM_genes (unnormalized) file over the raw_count file for differential expression analysis with DESeq2, as the later uses short 12-letter barcode (e.g TCGA-KL-8323-01), which may be unspecific enough for later analysis.