Celligner Data Release

--------------------------------------------------------
OUTPUTS
--------------------------------------------------------

Results of the Celligner method applied to the TCGA, TARGET, Treehouse, and DepMap RNA-Seq data. Files include:

1) Celligner_aligned_data: The primary output of the Celligner method. It is a matrix of samples (13,485) by genes (19,188), where the data have been aligned between cell lines and tumors using the Celligner method.
Genes are identified using Ensembl identifiers and samples are tumors (n=12,236) and cell lines (n=1,249).

2) Celligner_info: Table of metadata per sample. Columns include:
    - sampleID: ID for each tumor and cell line. It matches with the identifiers used in the other data files.
     Sample IDs for tumor samples begin with either 'TCGA' (ex. TCGA-29-1702-01), 'TARGET' (ex. TARGET-40-0A4I65-01A-01R), or 'TH' (ex. TH27_2189_S01),
     depending on whether the sample is from the TCGA, TARGET, or Treehouse data, respectively. Sample IDs for cell lines use DepMap IDs and begin with 'ACH' (ex. ACH-000493).
    - sampleID_CCLE_Name: ID for each tumor and cell line, where cell line IDs use the CCLE Name (ex. SNU423_LIVER)
    - UMAP_1, UMAP_2: UMAP coordinates calculated from the Celligner-aligned data
    - lineage: cancer lineage type annotation for each sample
    - subtype: Refined disease subcategory for each sample
    - type: either tumor or CL (for cell lines)
    - disease: the disease type annotation for each sample
    - Primary/Metastasis: annotation as a primary or metastatic sample, where available
    - cluster: clusters calculated using shared nearest neighbor based clustering in the 70-dimensional PCA space from the Celligner-aligned data
    - purity: tumor purity estimates based on a consensus measurement of tumor purity from Aran et al.
    - undifferentiated_cluster: whether the sample falls within the cluster of cell lines that show undifferentiated characteristics and up-regulation of the EMT pathway.
    - uncorrected_tumor_UMAP_1, uncorrected_tumor_UMAP_2: UMAP coordinates calculated on the uncorrected tumor expression data
    - uncorrected_CL_UMAP_1, uncorrected_CL_UMAP_2: UMAP coordinates calculated on the uncorrected cell line expression data
    - uncorrected_tumor_cluster: clusters calculated using shared nearest neighbor based clustering in the 70-dimensional PCA space from the uncorrected tumor expression data
    - uncorrected_CL_cluster: clusters calculated using shared nearest neighbor based clustering in the 70-dimensional PCA space from the uncorrected cell line expression data
    - CL_tumor_class: inferred tumor class for each cell line, identified using nearest neighbors classification within the Celligner-aligned data (see Methods)
    - age: age of the patient from which the sample was derived
    - sex: sex of the patient from which the sample was derived
    - sampling year: approximate year in which the cell line was derived.
    - Yu_et_al_annotation: Annotation from 'Comprehensive transcriptomic analysis of cell lines as models of primary tumors across 22 tumor types' (Yu, et al.) of the cancer type of each cell line.
    - CancerCellNet_annotation: Annotation from 'Evaluating the transcriptional fidelity of cancer models' (Peng, et al.) of the cancer type of each cell line.
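
The sampleID prefix conventions described above can be used to recover the source dataset of each sample. The following is a minimal sketch that assumes only those prefix conventions; the function name and the returned labels are illustrative and are not part of this release.

```python
# Hypothetical helper (not part of the release): maps a sampleID to its
# source dataset using the prefixes described for the Celligner_info table.
PREFIX_TO_SOURCE = (
    ("TCGA", "TCGA"),
    ("TARGET", "TARGET"),
    ("TH", "Treehouse"),
    ("ACH", "DepMap"),
)


def source_of(sample_id):
    """Return the source dataset for a sample ID, or None if unrecognized."""
    for prefix, source in PREFIX_TO_SOURCE:
        if sample_id.startswith(prefix):
            return source
    return None
```

For example, source_of("ACH-000493") returns "DepMap", which corresponds to a 'CL' entry in the type column.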


3) tumor_CL_cor: Matrix of tumor samples by cell lines, where each value is the correlation between that tumor sample and cell line in the Celligner-aligned data.
The correlation is the Pearson correlation, calculated on the complete Celligner-aligned data.
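
As an illustration of how such a matrix can be built, the sketch below computes the Pearson correlation between every tumor profile and every cell line profile. It is a toy version under stated assumptions: profiles are plain Python lists of aligned expression values over the same genes, and the function names and dictionary layout are hypothetical rather than taken from the Celligner code.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def tumor_cl_correlations(tumors, cell_lines):
    """Build a tumor-by-cell-line table of correlations.

    tumors, cell_lines: dicts mapping sample ID -> list of gene values
    (assumed layout; genes must be in the same order in every profile).
    """
    return {
        tumor_id: {
            cl_id: pearson(tumor_values, cl_values)
            for cl_id, cl_values in cell_lines.items()
        }
        for tumor_id, tumor_values in tumors.items()
    }
```

With real data, a vectorized routine from a numerical library would be preferable to these pure-Python loops given the size of the matrix.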

4) cPCs: Matrix of contrastive principal components (cPCs). The matrix is genes (19,188) by cPCs (5), where only the cPCs analyzed in the manuscript are included.
This includes cPCs 1-4, which are components that have higher variance in the tumor data and which were regressed out of the tumor and cell line data as part of the Celligner method,
and cPC 19188, which is a component that has higher variance in the cell line data. To avoid identifying signatures related to differences in the cancer type or subtype compositions of the datasets,
we first clustered the tumor and cell line data separately and subtracted the average expression of each cluster from all samples in the cluster to estimate the average intra-cluster covariance for tumors and cell lines.
The cPCs were then calculated on this data using the method described by Abid et al., with the cell line data treated as the background data and the tumor data treated as the target data.
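
The within-cluster centering step described above can be sketched as follows. This is only an illustration of subtracting each cluster's average expression from its member samples; the function name and the dictionary layout are assumptions, and the subsequent covariance estimation and cPCA decomposition (Abid et al.) are not shown.

```python
def center_within_clusters(profiles, clusters):
    """Subtract each cluster's mean expression from its member samples.

    profiles: dict mapping sample ID -> list of gene values (assumed layout)
    clusters: dict mapping sample ID -> cluster label (assumed layout)
    """
    # Group sample IDs by their cluster label.
    members = {}
    for sample_id, label in clusters.items():
        members.setdefault(label, []).append(sample_id)

    centered = {}
    for sample_ids in members.values():
        n_genes = len(profiles[sample_ids[0]])
        # Per-gene mean across the samples in this cluster.
        means = [
            sum(profiles[s][g] for s in sample_ids) / len(sample_ids)
            for g in range(n_genes)
        ]
        for s in sample_ids:
            centered[s] = [v - m for v, m in zip(profiles[s], means)]
    return centered
```

Under this sketch the function would be applied separately to the tumor data and to the cell line data, so that differences in mean expression between clusters do not contribute to the covariance estimated for either dataset.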

5) hgnc_complete_set_7.24.2018.txt : matrix of gene IDs downloaded from ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt. This is the version used to convert between HGNC symbols and select functional genes in the paper.

6) cPCA_values.csv : matrix of cPCA eigenvalues ordered by rank.

--------------------------------------------------------
References
--------------------------------------------------------
Abid, A., Zhang, M. J., Bagaria, V. K. & Zou, J. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nat. Commun. 9, 2134 (2018).

Aran, D., Sirota, M. & Butte, A. J. Systematic pan-cancer analysis of tumour purity. Nat. Commun. 6, 8971 (2015).

Peng, D. et al. Evaluating the transcriptional fidelity of cancer models. BioRxiv (2020). doi:10.1101/2020.03.27.012757

Yu, K. et al. Comprehensive transcriptomic analysis of cell lines as models of primary tumors across 22 tumor types. Nat. Commun. 10, 3574 (2019).

--------------------------------------------------------
Input data source:
--------------------------------------------------------
Cell line gene expression data for 1,249 samples were taken from the DepMap Public 19Q4 file: CCLE_expression_full.csv downloaded from the Cancer Dependency Map portal (https://depmap.org/portal/download/).  

Tumor gene expression data for 12,236 samples were taken from Treehouse Public Expression Dataset v10 obtained from Xena browser (https://xenabrowser.net/datapages/?cohort=Treehouse%20Tumor%20Compendium%20v10%20Public%20PolyA).


Goldman, M., Craft, B., Brooks, A. N., Zhu, J. & Haussler, D. The UCSC Xena Platform for cancer genomics data visualization and interpretation. BioRxiv (2018). doi:10.1101/326470

DepMap, Broad. DepMap 19Q4 Public. Figshare (2020). doi:10.6084/m9.figshare.11384241.v2

HGNC. 7/24/2019. https://www.genenames.org/. 

-----------------------------------------------------------
Version history:
-----------------------------------------------------------
v1: Initial data release
v2: added the HGNC gene symbol file used to make the dataset
v3: added cell line and tumor expression files so that the Celligner methods (https://github.com/broadinstitute/Celligner_ms) can be run simply by downloading this directory.
v4: added additional annotations to the Celligner_info file and updated the tumor_CL_distances matrix to tumor_CL_cor to reflect changes to the analysis run in the manuscript.