Software

Driver analysis

Overview

Driver analysis is a statistical approach for identifying significantly altered cancer driver genes in Sleeping Beauty (SB) transposon datasets, specifically those from SB mouse models of cancer. It is used by the SBCDDB and is detailed in a manuscript currently in preparation.

To run driver analysis, you will need Python 2.7 or 3.6 with the numpy and scipy packages. To generate custom annotations, you will also need SAMtools. Download the driver analysis Python scripts and make sure their location is added to your system's PATH environment variable.
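The requirements above can be checked before a first run. The following is a minimal sketch, not part of the driver analysis distribution: the function name `check_environment` is hypothetical, and it assumes a Python 3 interpreter (it relies on `importlib.util.find_spec` and `shutil.which`, which are not available in Python 2.7).

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("numpy", "scipy"),
                      tools=("samtools", "driverAnalysis.py")):
    """Return a dict mapping each requirement name to True (found) or False."""
    # Hypothetical helper: the names checked here come from the text above.
    status = {"python>=2.7": sys.version_info[:2] >= (2, 7)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be imported.
        status[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which searches the directories listed in PATH.
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for requirement, ok in sorted(check_environment().items()):
        print("%-20s %s" % (requirement, "found" if ok else "MISSING"))
```

A line reporting `driverAnalysis.py` as missing most likely means the script directory is not on PATH or the script is not marked executable.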

Driver analysis is mostly encapsulated in a self-contained script (driverAnalysis.py). Multiple driver analysis reports can be merged with mergeDriversets.py, a step we use when looking for drivers with upstream activating insertions.

By default, driver analysis works with reads mapped to mouse genome build mm9. It can also work with reads mapped to mm10. Driver analysis contains embedded gene annotations derived from UCSC refGene.txt files (for mm9 and mm10) downloaded on 3 May 2015.

If you wish to use a different reference or different annotations, you can use the uniqueTA.py script to identify unique TA sites in your genome of choice. Note that this step is very memory intensive: for mm9/mm10 it requires between 120 and 150 GB of memory and takes a few hours to run. Once the unique TA sites are extracted, you can run the chromInfo.py and featureInfo.py scripts with a custom annotation file (in genePred format) to generate tab-separated files that tally unique TA sites in chromosomes and genes. You can then point driverAnalysis.py to these TSV files to run driver analysis with your custom annotations.

Code

Examples

The following examples can be run in Bash 3.2. Other shells may suffice, but these commands have only been tested in Bash.

Run progression driver analysis

Your insertions should be stored in six-column BED format (chrom, chromStart, chromEnd, name, score, strand). Note that the fourth column (name column) should be the name of a tumor. To perform trunk driver analysis or driver analysis on tumors with different donor chromosome sites, you will need to add a seventh column, which contains information about the donor transposon/donor chromosome and the trunk driver cutoff to be used for the associated record. See the manuscript for explanations of donor chromosomes and driver cutoffs, or refer to the toy bedfile for an example of what the extra column looks like. Once you have downloaded the toy bedfile, run driver analysis as follows:

$ mkdir -p results
$ driverAnalysis.py -x offset toy.mm9.bed > results/toy.progression.0
$ driverAnalysis.py -p 15000 -x offset toy.mm9.bed > results/toy.progression.15
$ mergeDriversets.py results/toy.progression.0 results/toy.progression.15 > results/toy.progression.txt

Note that the first command above runs driver analysis with no promoter approximation. The second runs it with a 15 kb upstream promoter approximation (-p 15000). The third merges the two result sets into a single report.
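Before running the commands above on your own data, it can help to sanity-check the input bedfile. The sketch below parses the six-column layout described earlier and counts insertions per tumor (the name column). It is an illustration only, not code from driverAnalysis.py: the helper names are hypothetical, fields are assumed to be tab-separated, and an optional seventh column is kept verbatim without being interpreted.

```python
from collections import Counter

BED_COLUMNS = ("chrom", "chromStart", "chromEnd", "name", "score", "strand")


def parse_bed_line(line):
    """Parse one six-column BED record into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")  # assumption: tab-separated fields
    if len(fields) < 6:
        raise ValueError("expected at least 6 columns, got %d" % len(fields))
    record = dict(zip(BED_COLUMNS, fields[:6]))
    record["chromStart"] = int(record["chromStart"])
    record["chromEnd"] = int(record["chromEnd"])
    if record["chromStart"] > record["chromEnd"]:
        raise ValueError("chromStart is greater than chromEnd: %r" % line)
    if record["strand"] not in ("+", "-", "."):
        raise ValueError("unexpected strand value: %r" % record["strand"])
    # Any seventh column is stored as-is; its format is not assumed here.
    record["extra"] = fields[6:]
    return record


def insertions_per_tumor(lines):
    """Count records per tumor, using the name column as the tumor name."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        counts[parse_bed_line(line)["name"]] += 1
    return counts


if __name__ == "__main__":
    # Made-up records for illustration; not taken from the toy bedfile.
    demo = [
        "chr1\t100\t102\ttumorA\t5\t+\n",
        "chr1\t200\t202\ttumorA\t3\t-\n",
        "chr2\t50\t52\ttumorB\t1\t+\n",
    ]
    for tumor, n in sorted(insertions_per_tumor(demo).items()):
        print(tumor, n)
```

To check a real file, pass an open file handle instead of the demo list, for example `insertions_per_tumor(open("toy.mm9.bed"))`.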

Driver analysis can be run in a variety of ways. Execute the script with no arguments to see the available options. For example, suppose you did not add a seventh column to your bedfile because all of the samples have the same donor chromosome (see the example bedfile). You could then run driver analysis as follows:

$ driverAnalysis.py -d chr1 toy.mm9.6113.bed > results/toy.6113.txt

Run trunk driver analysis

You can run trunk driver analysis in a fashion similar to progression driver analysis:

$ driverAnalysis.py --trunk-mode -x offset toy.mm9.bed > results/toy.trunk.0
$ driverAnalysis.py --trunk-mode -p 15000 -x offset toy.mm9.bed > results/toy.trunk.15
$ mergeDriversets.py results/toy.trunk.0 results/toy.trunk.15 > results/toy.trunk.txt

(Optional) identify unique TA sites and generate custom annotations

Obtain the FASTA file of the reference against which your transposon insertions have been mapped. For this example, chr16 from mm9 is used. Then run the commands listed below. Note that the uniqueTA.py script takes two arguments: the first is the length of the sequences (including the TA dinucleotide) used to determine uniqueness, and the second is the FASTA file, or - to read from STDIN.

$ mkdir -p cache
$ gzip -dc mm9.chr16.fa.gz | uniqueTA.py 20 - | gzip - > cache/ta20.txt.gz
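To illustrate the idea of a "unique TA site", here is a toy sketch. It is not the algorithm used by uniqueTA.py, whose details are not described here. It assumes, purely for illustration, that each candidate sequence starts at the TA dinucleotide, that only the forward strand is considered, and that windows containing N are skipped. The function name `unique_ta_sites` is hypothetical.

```python
from collections import Counter


def unique_ta_sites(sequence, length):
    """Return 0-based positions of TA sites whose window occurs exactly once.

    Assumptions (illustrative only): the window of `length` bases begins at
    the T of the TA dinucleotide, and only the forward strand is examined.
    """
    seq = sequence.upper()
    windows = {}
    start = seq.find("TA")
    while start != -1:
        window = seq[start:start + length]
        # Ignore windows truncated by the sequence end or containing N.
        if len(window) == length and "N" not in window:
            windows[start] = window
        start = seq.find("TA", start + 1)
    # A site is unique when no other TA site shares the same window.
    counts = Counter(windows.values())
    return [pos for pos, window in sorted(windows.items())
            if counts[window] == 1]


if __name__ == "__main__":
    # Made-up sequence: TACG occurs at two TA sites, TAGG at one.
    print(unique_ta_sites("TACGTTACGTTAGG", 4))
```

A naive approach like this one holds every candidate window in memory at once, which gives some intuition for why this kind of step can become memory intensive on a whole genome.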

With the list of unique TA sites, you can then tally the number of bases and TA sites in each chromosome and each gene. For this you will need a gene/transcript annotation file in genePred format, as well as a FASTA index (.fai) of your reference genome generated by SAMtools (for example, with samtools faidx).

$ gzip -dc cache/ta20.txt.gz | cut -f1 | chromInfo.py refGene.mm9.chr16.txt mm9.chr16.fa.fai - > cache/chromInfo.tsv
$ gzip -dc cache/ta20.txt.gz | featureInfo.py refGene.mm9.chr16.txt - > cache/transcriptInfo.tsv 2> cache/geneInfo.tsv

(Optional) run driver analysis on custom annotations

Once the chromInfo.tsv and geneInfo.tsv files have been generated, you can perform driver analysis using your custom annotations. Just make sure that the chromosome names in the bedfile match the chromosome names used in the FASTA reference. (Example bedfile.)

$ driverAnalysis.py -cfilepath cache/chromInfo.tsv -gfilepath cache/geneInfo.tsv toy.mm9.chr16.bed > results/toy.custom.txt
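The chromosome-name requirement mentioned above can be checked with a short script. The sketch below compares the first column of a bedfile with the sequence names in a SAMtools FASTA index, whose first two tab-separated columns are the sequence name and its length. The helper names are hypothetical, and the sketch is independent of the driver analysis scripts.

```python
def read_fai_lengths(fai_lines):
    """Map sequence name to length from the lines of a .fai index."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lengths[fields[0]] = int(fields[1])
    return lengths


def bed_chromosomes(bed_lines):
    """Collect the distinct chromosome names (first column) of a bedfile."""
    chroms = set()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chroms.add(line.split("\t", 1)[0].strip())
    return chroms


def missing_chromosomes(bed_lines, fai_lines):
    """Return bedfile chromosome names that are absent from the index."""
    known = set(read_fai_lengths(fai_lines))
    return sorted(bed_chromosomes(bed_lines) - known)


if __name__ == "__main__":
    # Made-up index and BED records for illustration only.
    fai = ["chr16\t1000\t7\t50\t51\n"]
    bed = ["chr16\t10\t12\ttumorA\t1\t+\n", "16\t20\t22\ttumorB\t1\t-\n"]
    print(missing_chromosomes(bed, fai))
```

An empty result means every chromosome in the bedfile is present in the index. A non-empty result often points to a naming mismatch such as `16` versus `chr16`.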