# DepMap Public 22Q2

##############
## Overview ##
##############

This DepMap release contains data from CRISPR knockout screens from Project Achilles, as well as genomic characterization data from the CCLE project.

###############
## Pipelines ##
###############

### Achilles

This Achilles dataset contains the results of genome-scale CRISPR knockout screens for Achilles and Achilles combined with Project SCORE screens. The dataset was processed using the following steps:

- Sum raw readcounts by replicate and guide
- Remove the list of guides with suspected off-target activity
- Remove guides with pDNA counts less than one millionth of the pDNA pool
- Remove replicates that fail fingerprinting match to parent or derivative lines
- Remove replicates with total reads less than 15 million
- Calculate reads per million, add pseudo-count of 1, then log2-fold-change from pDNA counts for each replicate
- Calculate the NNMD for each replicate using genes targeting the Hart reference non-essentials and the intersection of the Hart and Blomen essentials, and remove those with values more positive than -1.0. See Hart et al., Mol. Syst. Biol, 2014 and Blomen et al., Science, 2015
- Remove replicates that do not have a Pearson coefficient > 0.61 with at least one other replicate for the line when looking at genes with the highest variance (top 3%) in gene effect across cell lines. This is equivalent to excluding lines whose replicates have a p-value greater than 0.05 of having at least as high a correlation with a random line as with each other.
- Calculate the NNMD for each cell line after averaging remaining replicates, and remove those more positive than -1.0
- NA out readcounts for guides which correspond with a SNP in a given cell line (to correct ancestry bias)
- Run Chronos to generate gene-level scores
- Scale Chronos output so that median of common essentials over the whole dataset is -1.0
- Combine and batch correct the Achilles and Sanger SCORE gene scores into merged CRISPR datasets
- Identify pan-dependent genes as those for which 90% of cell lines rank the gene above a given dependency cutoff. The cutoff is determined from the central minimum in a histogram of gene ranks in their 90th percentile least dependent line
- For each Chronos/CRISPR gene score, infer the probability that the score represents a true dependency or not. This is done using an EM step until convergence independently in each cell line. The dependent distribution is given by the list of essential genes. The null distribution is determined from unexpressed gene scores in those cell lines that have expression data available, and from the Hart non-essential gene list in the remainder
- Replace all genes on X chromosome for cell lines that only have Broad SNP copy number data with NA in gene_effect.csv, gene_effect_unscaled.csv, and gene_dependency.csv
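
The log2-fold-change step in the list above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name `log2_fold_change` is hypothetical, and it assumes that reads per million are computed against each sample's own total and that the pseudo-count of 1 is added to both the replicate and the pDNA values.

```python
import math


def log2_fold_change(rep_counts, pdna_counts, pseudo_count=1.0):
    """Return {guide: log2 fold change} for one replicate against its pDNA reference.

    Both arguments map guide sequence -> summed raw read count.
    Assumption: every guide in rep_counts is also present in pdna_counts
    (guides with very low pDNA counts were removed in an earlier step).
    """
    rep_total = sum(rep_counts.values())
    pdna_total = sum(pdna_counts.values())
    lfc = {}
    for guide, count in rep_counts.items():
        # Reads per million, then the pseudo-count (assumed order of operations).
        rep_rpm = count / rep_total * 1e6 + pseudo_count
        pdna_rpm = pdna_counts[guide] / pdna_total * 1e6 + pseudo_count
        lfc[guide] = math.log2(rep_rpm / pdna_rpm)
    return lfc
```

A guide whose share of reads halves between pDNA and the replicate gets a value close to -1; the pseudo-count keeps the ratio finite for guides with zero reads.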

The source for copy number data varies by cell line. Copy number data indicated as "Sanger WES" are based on the Sanger Institute whole exome sequencing data (COSMIC: http://cancer.sanger.ac.uk/cell_lines, EGA accession number: EGAD00001001039) reprocessed using CCLE pipelines. Copy number source was chosen according to the following logic:

- Broad WES for lines where available
- Broad SNP when Broad WES is not available and either Sanger WES is not available or Sanger WES copy number correlates less with log-fold change than Broad SNP does
- Sanger WES in all other cases
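
The selection logic above can be written as a small decision function. This is only a sketch: `choose_cn_source` and its parameters are hypothetical names, and how the correlation with log-fold change is computed is not specified here, so the two correlation values are simply taken as inputs.

```python
def choose_cn_source(has_broad_wes, has_sanger_wes, sanger_corr=None, snp_corr=None):
    """Pick the copy number source for one cell line.

    sanger_corr / snp_corr: correlation of each source's copy number with
    log-fold change (assumed to be precomputed elsewhere; None if unknown).
    """
    if has_broad_wes:
        return "Broad WES"
    if not has_sanger_wes:
        return "Broad SNP"
    # Sanger WES exists: prefer Broad SNP only if Sanger WES correlates less.
    if sanger_corr is not None and snp_corr is not None and sanger_corr < snp_corr:
        return "Broad SNP"
    return "Sanger WES"
```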

Details about data processing are published on bioRxiv here: https://www.biorxiv.org/content/10.1101/720243v1

### Expression

CCLE expression data is quantified from RNAseq files using the GTEx pipelines. A detailed description of the pipelines and tool versions can be found here: https://github.com/broadinstitute/ccle_processing#rnaseq. We provide a subset of the data files output by this pipeline that are available on FireCloud. These are aligned to hg38.

### Copy number

CCLE WES copy number data is generated by running the GATK copy number pipeline aligned to hg38. Tutorials and descriptions of this method can be found here: https://software.broadinstitute.org/gatk/documentation/article?id=11682 and https://software.broadinstitute.org/gatk/documentation/article?id=11683. WES samples have been realigned to hg38 and run through this pipeline.

### Mutations

CCLE mutation calls are aggregated from several different sources and sequencing technologies.

### Fusions

CCLE generates RNAseq based fusion calls using the STAR-Fusion pipeline. A comprehensive overview of how the STAR-Fusion pipeline works can be found here: https://github.com/STAR-Fusion/STAR-Fusion/wiki. We run STAR-Fusion version 1.6.0 using the plug-n-play resources available in the STAR-Fusion docs for gencode v29. We run the fusion calling with default parameters except we add the --no_annotation_filter and --min_FFPM 0 arguments to prevent filtering.

### Omics Updates

- [Copy Number] Previously, copy number data for ACH-002345 and ACH-002291 came from WES, and copy number data for ACH-000901 came from SNP array. This quarter, we processed WGS data for these three lines, so the source of their copy number data is now WGS.
- [Copy Number and Mutation] HS571T (ACH-001092) and TOV112D (ACH-000048) are not isogenic according to Cellosaurus, but they are identified as isogenic in DepMap. Therefore, we decided to remove data for ACH-001092, which includes mutation calls from hybrid capture data and copy number from SNP array data.
- [Expression] We updated STAR to v2.7.10a, RSEM to v1.3.3, and GENCODE to version 38 (used for generating the index for STAR and the reference for RSEM) in our expression pipeline. As a QC check, we computed Spearman correlations between data generated by the old and updated versions of the expression pipeline. At the gene level, correlation values for all cell lines are above 0.98. At the transcript level, correlation values for all cell lines are above 0.91 (averaging around 0.94). These results are reasonable and expected.

###########
## Files ##
###########

### README.txt

Description of all files contained in this release

### Achilles_gene_effect.csv

Pipeline: Achilles

_Post-Chronos_

Chronos data, copy number corrected. 

- Columns: genes in the format "HUGO (Entrez)"
- Rows: cell lines (Broad IDs)
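
Because many matrices in this release label genes as "HUGO (Entrez)", a small parser for that label is often handy. The sketch below uses only the standard library; the helper name `parse_gene_label` is hypothetical, and it assumes the Entrez ID is a plain integer inside the final pair of parentheses.

```python
import re

# "SYMBOL (12345)": symbol, one space, Entrez ID in parentheses (assumed layout).
_GENE_LABEL = re.compile(r"^(?P<symbol>.+) \((?P<entrez>\d+)\)$")


def parse_gene_label(label):
    """Split a 'HUGO (Entrez)' column label into (symbol, entrez_id)."""
    match = _GENE_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a 'HUGO (Entrez)' label: {label!r}")
    return match.group("symbol"), int(match.group("entrez"))
```

For example, `parse_gene_label("TP53 (7157)")` returns the symbol and the integer ID as a tuple, which makes it straightforward to join against tables keyed by either identifier.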

### Achilles_gene_effect_uncorrected.csv

Pipeline: Achilles

_Post-Chronos_

Raw Chronos data, prior to copy number correction and scaling. 

- Columns: genes in the format "HUGO (Entrez)"
- Rows: cell lines (Broad IDs)

### Achilles_gene_dependency.csv

Pipeline: Achilles

_Post-Chronos_

Probability that knocking out the gene has a real depletion effect, inferred from gene_effect.

- Columns: genes in the format "HUGO (Entrez)"
- Rows: cell lines (Broad IDs)

### Achilles_common_essentials.csv

Pipeline: Achilles

_Post-Chronos_

List of genes identified as pan-essentials using Chronos

### Achilles_guide_efficacy.csv

Pipeline: Achilles

_Post-Chronos_

Columns:

- sgrna (nucleotides)
- efficacy - Chronos inferred efficacy for the guide

### Achilles_cell_line_efficacy.csv

Pipeline: Achilles

_Post-Chronos_

Columns:

- cell lines (Broad IDs)
- avana - Chronos inferred efficacy for the cell line in Avana (Achilles) screens.

### Achilles_cell_line_growth_rate.csv

Pipeline: Achilles

_Post-Chronos_

Columns:

- cell lines (Broad IDs)
- avana - Chronos inferred unperturbed growth rate for the cell line in Avana (Achilles) screens

### CRISPR_dataset_sources.csv

Pipeline: Achilles

_Post-Chronos_

Columns:

- cell lines (Broad IDs)
- dataset - source dataset for combined data (Achilles, Score, or Both)

### CRISPR_gene_effect.csv

Pipeline: Achilles

Gene Effect scores derived from CRISPR knockout screens published by Broad’s Achilles and Sanger’s SCORE projects. 

Negative scores imply cell growth inhibition and/or death following gene knockout. Scores are normalized such that nonessential genes have a median score of 0 and independently identified common essentials have a median score of -1.
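
One way to picture this normalization is as a shift and rescale anchored on the two control sets. The sketch below is an illustration under that assumption, not the Chronos code; `make_scaler` is a hypothetical helper, and it takes flat lists of control-gene scores pooled over the whole dataset.

```python
from statistics import median


def make_scaler(nonessential_scores, essential_scores):
    """Return a function mapping raw scores so that the nonessential median
    becomes 0 and the common-essential median becomes -1."""
    ne_median = median(nonessential_scores)
    ce_median = median(essential_scores)
    span = ne_median - ce_median
    if span <= 0:
        raise ValueError("essential median must lie below nonessential median")

    def scale(score):
        # Shift so nonessentials center on 0, then divide by the control gap.
        return (score - ne_median) / span

    return scale
```

Under this convention, a scaled score of -0.5 sits halfway between the typical nonessential gene and the typical common essential gene.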

Gene Effect scores were inferred by Chronos (<https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02540-7>).

Integration of the Broad and Sanger datasets was performed as described in <https://doi.org/10.1038/s41467-021-21898-7>, except that quantile normalization was not performed.


### CRISPR_gene_dependency.csv

Pipeline: Achilles

Gene Dependency Probabilities represent the likelihood that knocking out the gene has a cell growth inhibition or death effect. These probabilities are derived from the scores in CRISPR_gene_effect.csv as described in <https://doi.org/10.1101/720243>.


### CRISPR_common_essentials.csv

Pipeline: Achilles

_Post-Chronos_

List of genes identified as dependencies in all lines, one per line.

### common_essentials.csv

Pipeline: Achilles

_Pre-Chronos file_

List of genes used as positive controls, intersection of Blomen (2015) and Hart (2014) essentials in the format "HUGO (Entrez)". Each entry is separated by a newline.
The scores of these genes are used as the dependent distribution for inferring dependency probability.

### nonessentials.csv

Pipeline: Achilles

_Pre-Chronos file_

List of genes used as negative controls (Hart (2014) nonessentials) in the format "HUGO (Entrez)". Each entry is separated by a newline.

### Achilles_raw_readcounts.csv

Pipeline: Achilles

_Pre-Chronos file_

Summed counts for each replicate/pDNA

- Columns: replicate/pDNA IDs
- Rows: Guides (nucleotides)

### Achilles_raw_readcounts_failures.csv

Pipeline: Achilles

_Pre-Chronos file_

Summed counts for each replicate failing quality control checks

- Columns: replicate IDs
- Rows: Guides (nucleotides)

### Achilles_logfold_change.csv

Pipeline: Achilles

_Pre-Chronos file_

Post-QC log2-fold change (not ZMADed)

- Columns: replicate IDs
- Rows: Guides (nucleotides)

### Achilles_logfold_change_failures.csv

Pipeline: Achilles

_Pre-Chronos file_

Post-QC log2-fold change (not ZMADed) for cell lines failing quality control checks

- Columns: replicate IDs
- Rows: Guides (nucleotides)

### Achilles_guide_map.csv

Pipeline: Achilles

_Pre-Chronos file_

Columns:

- sgrna (nucleotides) - appears more than once
- genome_alignment
- gene ("HUGO (Entrez)")
- n_alignments (integer number of perfect matches for that guide)

### Achilles_replicate_map.csv

Pipeline: Achilles

_Pre-Chronos file_

Columns:

- replicate_ID (str)
- Broad_ID
- pDNA_batch (int): indicates which processing batch the replicate belongs to and therefore which pDNA reference it should be compared with.
- passes_QC (str): indicates if the replicate was included in Chronos calculations

### Achilles_replicate_QC_report_failing.csv

Pipeline: Achilles

_Pre-Chronos file_

Rows: replicate IDs

Columns:

- failure_mode (reason replicate failed or NA)
- total_reads
- Pearson_corr_with_rep_A/B/C/D (Pearson correlation with sibling replicates)
- num_sibling_replicates_passing_QC (count of sibling replicates that passed)
- replicate_level_NNMD_pass (boolean indicating whether replicate passed NNMD threshold)
- replicate_level_NNMD (float)
- excluded_from_processing (boolean indicating if replicate was excluded from further QC processing)
- DepMap_ID
- FP_unknown (boolean indicating if fingerprinting status was unknown)
- can_include_in_dataset (boolean)

### Achilles_dropped_guides.csv

Pipeline: Achilles

_Pre-Chronos file_

Columns:

- sgrna (nucleotides) - appears more than once
- genome_alignment
- gene ("HUGO (Entrez)")
- n_alignments (integer number of perfect matches for that guide)
- fail_reason (why this guide is not used for gene effect/dependency calculation)

Note: in_dropped_guides = guide dropped for suspected off-target activity

### Achilles_high_variance_genes.csv

Pipeline: Achilles

_Pre-Chronos file_

List of genes with top 3% most variable scores across cell lines in 18Q4 gene_effect. Used for replicate correlation in quality control step.

### CCLE_RNAseq_reads.csv

Pipeline: Expression

RNAseq read count data from RSEM.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Ensembl ID)

### CCLE_expression_full.csv

Pipeline: Expression

RNAseq TPM gene expression data for all genes using RSEM. Log2 transformed, using a pseudo-count of 1.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Ensembl ID)

### CCLE_expression.csv

Pipeline: Expression

Gene expression TPM values of the protein coding genes for DepMap cell lines. Values are inferred from RNA-seq data using the RSEM tool and are reported after log2 transformation, using a pseudo-count of 1; log2(TPM+1).
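
Since values are stored as log2(TPM+1), recovering linear TPM is a one-line inverse. The helper names below are hypothetical and shown only to make the transform explicit.

```python
import math


def tpm_to_log(tpm, pseudo_count=1.0):
    """Forward transform used for the stored values: log2(TPM + 1)."""
    return math.log2(tpm + pseudo_count)


def log_to_tpm(value, pseudo_count=1.0):
    """Inverse transform: recover linear TPM from a stored log2(TPM + 1) value."""
    return 2.0 ** value - pseudo_count
```

A stored value of 0 therefore corresponds to a TPM of 0, and a stored value of 3 corresponds to a TPM of 7.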

Additional RNA-seq-based expression measurements are available for download as part of the full DepMap Data Release

More information on the DepMap Omics processing pipeline is available at <https://github.com/broadinstitute/depmap_omics>.


### CCLE_expression_transcripts_expected_count.csv

Pipeline: Expression

RNAseq read count data from RSEM.

- Rows: cell lines (Broad IDs)
- Columns: transcripts (HGNC symbol and ensembl transcript ID)

### CCLE_expression_proteincoding_genes_expected_count.csv

Pipeline: Expression

RNAseq read count data from RSEM for just protein coding genes.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Entrez ID)

### CCLE_RNAseq_transcripts.csv

Pipeline: Expression

RNAseq transcript TPM data using RSEM. Log2 transformed, using a pseudo-count of 1.

- Rows: cell lines (Broad IDs)
- Columns: transcripts (HGNC symbol and ensembl transcript ID)

### CCLE_segment_cn.csv

Pipeline: Copy number


Segment level copy number data

- DepMap_ID
- Chromosome
- Start (bp start of the segment)
- End (bp end of the segment)
- Num_Probes (the number of targeting probes that make up this segment)
- Segment_Mean (relative copy ratio for that segment)
- amplification status (+,-,0)

### CCLE_wes_segment_cn.csv

Pipeline: Copy number


Segment level copy number data from whole exome sequencing

- DepMap_ID
- Chromosome
- Start (bp start of the segment)
- End (bp end of the segment)
- Num_Probes (the number of targeting probes that make up this segment)
- Segment_Mean (relative copy ratio for that segment)
- amplification status (+,-,0)

### CCLE_gene_cn.csv

Pipeline: Copy number

Gene-level copy number data that is log2 transformed with a pseudo-count of 1; log2(CN ratio + 1). Inferred from WGS, WES or SNP array depending on the availability of the data type. Values are calculated by mapping genes onto the segment level calls and computing a weighted average along the genomic coordinate.
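
As a rough illustration of the gene-level calculation, the sketch below weights each overlapping segment by the number of base pairs it shares with the gene and then applies log2(ratio + 1). The exact weighting used by the pipeline is not spelled out here, so this is an assumption, as are the function name `gene_copy_number`, the half-open coordinate convention, and the treatment of segment values as linear copy ratios.

```python
import math


def gene_copy_number(gene_start, gene_end, segments):
    """Overlap-weighted mean copy ratio for one gene, reported as log2(ratio + 1).

    segments: iterable of (start, end, copy_ratio) on the gene's chromosome,
    with half-open [start, end) coordinates (assumed convention).
    Returns None when no segment overlaps the gene.
    """
    weighted_sum = 0.0
    covered_bp = 0
    for seg_start, seg_end, ratio in segments:
        overlap = min(gene_end, seg_end) - max(gene_start, seg_start)
        if overlap > 0:
            weighted_sum += overlap * ratio
            covered_bp += overlap
    if covered_bp == 0:
        return None
    return math.log2(weighted_sum / covered_bp + 1.0)
```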

Additional copy number datasets are available for download as part of the full DepMap Data Release.

More information on the DepMap Omics processing pipeline is available at <https://github.com/broadinstitute/depmap_omics>.


### CCLE_wes_gene_cn.csv

Pipeline: Copy number

Gene-level copy number data that is log2 transformed with a pseudo-count of 1. Inferred from only WES data by mapping genes onto the segment level calls.

Additional copy number datasets are available for download as part of the full DepMap Data Release.

More information on the DepMap Omics processing pipeline is available at <https://github.com/broadinstitute/depmap_omics>.


### CCLE_fusions.csv

Pipeline: Fusions

Gene fusion data derived from RNAseq data. Data is filtered by performing the following:

- Removing fusions involving mitochondrial chromosomes or HLA genes
- Removing common false positive fusions (red herring annotations as described in the STAR-Fusion docs)
- Removing recurrent fusions observed in CCLE across cell lines (in 10% or more of the samples)
- Removing fusions where SpliceType='INCL_NON_REF_SPLICE' and LargeAnchorSupport='NO_LDAS' and FFPM < 0.1
- Removing fusions with FFPM < 0.05

Column descriptions can be found in the STAR-Fusion wiki, except for CCLE_count, which indicates the number of CCLE samples that have this fusion.
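
The numeric filters above can be sketched as a predicate over one fusion record. This is a partial illustration only: `keep_fusion` is a hypothetical name, each record is assumed to be a dict keyed by the SpliceType, LargeAnchorSupport, FFPM, and CCLE_count columns, and the mitochondrial/HLA and red-herring steps are omitted because they depend on annotation columns not described here.

```python
def keep_fusion(row, n_samples, recurrence_cutoff=0.10):
    """Return True if a fusion record survives the numeric filters.

    row: dict with SpliceType, LargeAnchorSupport, FFPM, and CCLE_count.
    n_samples: total number of CCLE samples with fusion calls (assumed input).
    """
    # Recurrent across cell lines: seen in 10% or more of the samples.
    if row["CCLE_count"] / n_samples >= recurrence_cutoff:
        return False
    # Weakly supported non-reference splice junctions.
    if (
        row["SpliceType"] == "INCL_NON_REF_SPLICE"
        and row["LargeAnchorSupport"] == "NO_LDAS"
        and row["FFPM"] < 0.1
    ):
        return False
    # Low overall expression support.
    if row["FFPM"] < 0.05:
        return False
    return True
```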

### CCLE_fusions_unfiltered.csv

Pipeline: Fusions

Gene fusion data derived from RNAseq data. Data is unfiltered. Column descriptions can be found in the STAR-Fusion wiki.

### CCLE_mutations.csv

Pipeline: Mutations

MAF file containing information on all the somatic point mutations and indels called in the DepMap cell lines. The calls are an ensemble of calls from MuTect1, MuTect2, and Strelka. A description of the various columns is in the DepMap Release README file.

Additional processed mutation datasets containing binary mutation calls are available for download as part of the full DepMap Data Release.

More information on the DepMap Omics processing pipeline is available at <https://github.com/broadinstitute/depmap_omics>.

Columns: 

For all columns with AC, the allelic ratio is presented as [ALTERNATE:REFERENCE].

- CGA_WES_AC: the allelic ratio for this variant in all our WES/WGS(exon only)
using a cell line adapted version of the 2019 CGA pipeline that includes germline
filtering.

- SangerWES_AC: in Sanger WES (called by Sanger) (legacy)

- SangerRecalibWES_AC: in Sanger WES after realignment at Broad (legacy)

- RNAseq_AC: in Broad RNAseq data from the CCLE2 project (legacy)

- HC_AC: in Broad Hybrid capture data from the CCLE2 project (legacy)

- RD_AC: in Broad Raindance data from the CCLE2 project (legacy)

- legacy_wgs_exon_only: in Broad WGS data from the CCLE2 project (legacy)
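
Given the [ALTERNATE:REFERENCE] layout of the AC columns above, an alternate allele fraction can be derived per variant. The sketch below assumes each non-empty cell is two integer read counts separated by a colon (for example "12:36") and that anything else means the variant was not observed in that source; both helper names are hypothetical.

```python
def parse_allele_counts(value):
    """Parse an 'ALT:REF' cell into (alt_count, ref_count), or None if absent."""
    if value is None:
        return None
    text = str(value).strip()
    if ":" not in text:
        return None
    alt_text, ref_text = text.split(":", 1)
    try:
        return int(alt_text), int(ref_text)
    except ValueError:
        return None


def alt_allele_fraction(value):
    """Fraction of reads supporting the alternate allele, or None if unknown."""
    counts = parse_allele_counts(value)
    if counts is None:
        return None
    alt, ref = counts
    total = alt + ref
    return alt / total if total > 0 else None
```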

Additional columns:

- isTCGAhotspot: is this mutation commonly found in TCGA

- TCGAhsCnt: number of times this mutation is observed in TCGA

- isCOSMIChotspot: is this mutation commonly found in COSMIC

- COSMIChsCnt: number of samples in COSMIC with this mutation

- ExAC_AF: the allelic frequency in the Exome Aggregation Consortium (ExAC)

Descriptions of the remaining columns in the MAF can be found here: <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>


### CCLE_mutations_bool_hotspot.csv

Pipeline: Mutations

Boolean matrix determining for each cell line whether each gene has at least one hotspot mutation

### CCLE_mutations_bool_damaging.csv

Pipeline: Mutations

Boolean matrix determining for each cell line whether each gene has at least one damaging mutation

### CCLE_mutations_bool_nonconserving.csv

Pipeline: Mutations

Boolean matrix determining for each cell line whether each gene has at least one other nonconserving mutation

### CCLE_mutations_bool_otherconserving.csv

Pipeline: Mutations

Boolean matrix determining for each cell line whether each gene has at least one other conserving mutation

### sample_info.csv

Metadata for all of DepMap’s cancer models/cell lines. A full description of each column is available in the DepMap Release README file.

Columns:

- DepMap_ID: Static primary key assigned by DepMap to each cell line

- cell_line_name: Original cell line name, including punctuation

- stripped_cell_line_name: Cell line name with alphanumeric characters only

- CCLE_Name: Previous naming system that used the stripped cell line name followed
by the lineage; no longer assigned to new cell lines

- alias: Additional cell line identifiers (not a comprehensive list)

- COSMICID: Cell line ID used in the COSMIC cancer database

- sex: Sex of tissue donor if known

- source: Source of cell line vial used by DepMap

- RRID: Cellosaurus research resource identifier

- WTSI_Master_Cell_ID: ID of corresponding record in Sanger Drug dataset

- sample_collection_site: Tissue collection site

- primary_or_metastasis: Indicates whether tissue sample is from primary or metastatic
site

- primary_disease: General cancer lineage category

- Subtype: Subtype of disease; specific disease name

- age: If known, age of tissue donor at time of sample collection

- Sanger_Model_ID: Sanger Institute Cell Model Passport ID

- depmap_public_comments: Further information about the cell line   

- lineage, lineage_subtype, lineage_sub_subtype, lineage_molecular_subtype: Cancer
type classifications in a standardized form

- default_growth_pattern: Typical growth pattern of the cell line

- model_manipulation: Cell line modifications including drug resistance and gene knockout

- model_manipulation_details: Additional information about the model manipulation

- patient_id: Identifier indicating which cell lines come from the same patient

- parent_depmap_id: If known, DepMap ID of parental cell line

- Cellosaurus_NCIt_disease: From Cellosaurus, NCI thesaurus disease term

- Cellosaurus_NCIt_id: From Cellosaurus, NCI thesaurus code

- Cellosaurus_issues: From Cellosaurus, documented issues with cell line


### Achilles_metadata.csv

Metadata for DepMap’s cancer models/cell lines, specific to Project Achilles’ CRISPR screens.

Columns: 

- Achilles_n_replicates: Number of replicates used in Achilles CRISPR screen passing
QC
- cell_line_NNMD: Difference in the means of positive and negative controls normalized
by the standard deviation of the negative control distribution
- culture_type: Growth pattern of cell line (Adherent, Suspension, Mixed adherent
and suspension, 3D, or Adherent (requires laminin coating))
- culture_medium: Medium used to grow cell line    
- cas9_activity: Percentage of cells remaining GFP negative on days 12-14 of the Cas9
activity assay as measured by FACS
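
The cell_line_NNMD definition above translates directly into a few lines of code. This is a minimal sketch following that wording (means and standard deviation); the function names are hypothetical, the use of the sample standard deviation is an assumption, and the -1.0 cutoff is the one used in the Achilles QC steps, where values more positive than -1.0 are removed.

```python
from statistics import mean, stdev

NNMD_THRESHOLD = -1.0  # QC cutoff stated in the Achilles pipeline steps


def nnmd(positive_control_scores, negative_control_scores):
    """Difference in control means, normalized by the negative-control SD.

    Uses the sample standard deviation (assumption; the estimator is not
    specified in this README).
    """
    difference = mean(positive_control_scores) - mean(negative_control_scores)
    return difference / stdev(negative_control_scores)


def passes_nnmd(value, threshold=NNMD_THRESHOLD):
    """A screen passes when its NNMD is not more positive than the threshold."""
    return value <= threshold
```

More negative values indicate better separation between essential and nonessential controls.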