Jiaqi Wu

Prepping TCGA data for machine learning projects

05 Jan 2019

Last semester, I wanted to explore the role of epigenetic markers in differentiating types of cancers for a course project. I turned to The Cancer Genome Atlas’s (TCGA) data portal, which contains 7 types of genomic data for 33 types of cancer.

Processing and organizing the data from TCGA to a form that I could use for Tensorflow turned out to be the most difficult part of this project, so here are the steps I followed, and resources I found helpful along the way.

1: MicroRNA (miRNA) expression profiles

I started by reading discussions on Biostars. I was surprised by the lack of detailed documentation for handling TCGA data for beginners.

1.1: Bulk downloading raw data with GDC

I downloaded 17 datasets from TCGA. TCGA has two tiers of data: free and controlled. The controlled set requires special access (you need to apply for it). So instead, I chose only the TXT files (consisting of transcriptome profiles). This sped up my processing a bit, since the controlled tier contains BAM sequencing reads of ~60-200MB.

Once I had all my TXT files for one tissue origin site in the cart, I downloaded the GDC Manifest. The manifest is just a file containing a list of the files you have requested. Since I was downloading to a cluster, I used the GDC-Client:

./gdc-client download -m $MANIFEST

Note: The expression profiles only give you the miRNA_ID and sequencing read information; if you want more information about the sample (i.e., metastatic, primary, or normal tissue), you have to download the Sample Sheet from TCGA as well.

1.2: Preparing the data using R

Each sample downloaded creates a new folder in your download directory, and inside the folder are two files: *.mirbase21.mirnas.quantification.txt and *.mirbase21.isoforms.quantification.txt. I will be working with miRNA isoform expression profiles.

More on miRNA isoforms

Each row in the isoforms file contains the calculated expression for an individual miRNA isoform observed, per sample. A common way to process this data is by taking the max or sum of all isoform counts associated with a specific miRNA. In this project, when there are multiple isoforms for a given miRNA in each sample, then the expression is summed.

Now, I move all the files out of their folders and into one directory. We can use rptashkin’s TCGA_miRNAseq_matrix R script to obtain a complete matrix of summed miRNA isoforms from the individual samples. The rows represent miRNAs (our features), and the columns represent samples.

I add a row called “label” and manually label it with the corresponding cancer type (i.e., BRCA, ESCA, SKCM). Then, I repeat this process with all the cancer types (I had 17). Once I have 17 labeled matrices, I merge them in R, and transpose. Now, the final matrix has samples from all 17 cancer types, where the columns represent miRNAs, and the rows are samples. I randomize, split the label column from the real data, and import into Python for machine learning analysis.

Importing R dataframe to Python

There are several ways to do this. I wrote my R matrix to disc (in a .txt file), and then imported it into Python using Pandas.

After splitting my data into a train and test set, and isolating the labels columns, I converted the data frame to numpy array to use for further analysis.

This could have been done entirely in Python, but I wanted to utilize the pre-existing scripts, which were in R. Choose your battles!

2: DNA methylation expression profiles

DNA methylation was a little more complicated than miRNAs, and I could no longer use the same script to turn individual samples into one big matrix. So, I had to do some more digging.

2.1: Downloading TCGA data using TCGA2STAT

I found an R package TCGA2STAT that enables users to directly download TCGA data into an analysis-ready format, which is exactly what I was looking for. (Turns out, we could’ve used this for miRNAs as well…)

Because I only needed it for methylation, I extracted the code relevant to DNA methylation. The altered version is available in my repository.

TCGA2STAT allows you to specify the type of data you wish to download, but it is 27K by default. This type stands for the number of probes, or the number of features; 27K means 27,000 features. As this was the most commonly available type for TCGA (and conveniently smaller), I went ahead and downloaded the 27K types.

# 27K is selected by default
methyl <- getTCGA(disease="LUAD", data.type="Methylation")

# 450K example—450K isn't available for every dataset. Check in TCGA before downloading 
# methyl <- getTCGA(disease="LUAD", data.type="Methylation", type="450K")

# Table of samples vs. CpG site beta-values is found in $dat 
data = methyl$dat

# Script to see distribution of primary/metastatic/normal tissue. 

samples <- colnames(data)
parsed <- strsplit(samples,"-")
parsed <- lapply(parsed, `[[`, 4)
parsed <- unlist(parsed, recursive=FALSE)
print(table(parsed))

# Replace sample names, transform, and add label.  
colnames(data) <- gsub(pattern='-', replacement='_', x=colnames(data))
t_data <- t(data)
t_data = as.data.frame(t_data)
t_data$type = c(rep(CODE,nrow(t_data)))

# Add row names to make sure table prints out correctly in .txt format 
t_data <- cbind(Row.names = rownames(t_data), t_data)

write.table(t_data,file=paste0("LUAD-methyl-labeled",".txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

The $dat attribute of the returned data structure gives us a matrix, where, again, the rows represent a CpG site and columns represent a sample. I transform the matrix, add a label for the origin site/disease type (i.e., LUAD), and store as a .txt file. This is repeated for each origin site I wanted to look at, and then all the .txt files are merged, resulting in a matrix with their labeled origin sites.

2.2: Feature selection in R

Since DNA methylation yields a much larger dataset, I did some feature selection before transforming the data matrix. I removed all features (CpG sites) with more than 30% “NA” values. If you want to go even further with the feature selection at this stage, you can consider filtering the matrix based on differential expression using variance cut-offs or limma. You can filter for differential expression between different origin sites, between tumor vs. normal tissue samples, etc.

3: Conclusion

Now, we have two datasets ready to be analyzed in Python. Using just an out-of-the-box random forest classifier from SKLearn already yields decent results for most cancer types. However, I wanted to take it a step further to reduce the false negative rates for cancer types that have been traditionally difficult to classify (i.e., esophageal, pancreatic, and lung).

If you are interested in the Tensorflow model I built to predict cancer type based on miRNAs and methylation, the code and results are available at GitHub.

The main effort and largest learning curve was processing TCGA data, a step that I could only find very scattered documentation for. I hope this post helps others who are trying to work with microRNA and DNA methylation data for the first time. I would love to hear about what you did with this data!

Here’s a list of resources I referenced for data processing:

Tweet me @thejiaqiwu if you like this post.