Driver analysis is a statistical approach for identifying significantly altered cancer driver genes in Sleeping Beauty (SB) transposon datasets, specifically those from SB mouse models of cancer. Driver analysis is used by the SBCDDB. It is detailed in a manuscript currently in preparation.
To run driver analysis, you will need Python 2.7 or 3.6 with the NumPy and SciPy packages. To generate custom annotations, you will also need either SAMtools or a custom script to generate FASTA indexes (.fai files) from FASTA files. Download the SB Driver Analysis Python scripts and make sure their directory is on your system's PATH environment variable. Note that this package contains benchmark data and an example Bash script (run.sh) that shows how to run SB Driver Analysis on a dataset.
SB Driver Analysis is a self-contained script (sbdriver.py). A few helper scripts are also included for tasks such as pre-processing SB insertion data (mainly formatting the tumor name and associating transposon information with tumors) or setting custom gene annotations.
By default, SB Driver Analysis works with reads mapped to mouse genome build mm9. It can also work with reads mapped to mm10. SB Driver Analysis contains embedded gene annotations derived from UCSC refGene.txt files (for mm9 and mm10) downloaded on 3 May 2015.
If you wish to use a different reference or different annotations, follow the tutorial below, which uses the uniqueTA.py script to identify unique TA sites in your genome of choice. A portion of mm9 is included here for demonstration purposes. Note that running uniqueTA.py on a mouse- or human-sized genome can require 120-150 GB of memory and can take a few hours. Once the unique TA sites are extracted, you can run the chromInfo.py and featureInfo.py scripts with a custom annotation file (in genePred format) to generate tab-separated files that tally unique TA sites in chromosomes and genes. You can then point sbdriver.py to these files to run SB Driver Analysis with your custom annotations.
The following examples can be run in Bash 3.2. Other shells may suffice, but these commands have only been tested in Bash.
Your insertions should be stored in six-column BED format (chrom, chromStart, chromEnd, name, score, strand). Note that the fourth column (the name column) should be the name of a tumor. To perform Trunk Driver Analysis, or Driver Analysis on tumors with different donor chromosome sites, you will need to add a seventh column or supply a separate annotation file. Either one carries information about the donor transposon/donor chromosome and the trunk driver cutoff to be used for the associated record. The details are given in the sbdriver.py documentation, which you can view by running the program without any input arguments. See the manuscript for explanations of donor chromosomes and driver cutoffs, or refer to the toy bedfile for an example of what the extra column looks like. Once you have downloaded the toy bedfile, run Driver Analysis as follows:
$ mkdir -p results
$ sbdriver.py toy.mm9.bed > results/toy.txt
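Before running sbdriver.py on your own data, it can help to confirm that the bedfile really follows the six-column layout described above. The following is a minimal Python 3 sketch of such a check. It is not part of the SB Driver Analysis package. The function name check_bed6 and the sample records are hypothetical, and the rules it applies (tab-separated fields, integer coordinates, a non-empty tumor name, a strand of +, - or .) are assumptions taken from the general BED conventions rather than from sbdriver.py itself.

```python
import io


def check_bed6(handle):
    """Yield (line_number, message) for records that break the six-column layout.

    Hypothetical helper: the checks follow general BED conventions and are
    not taken from sbdriver.py.
    """
    for number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            yield number, "expected at least 6 tab-separated columns, found %d" % len(fields)
            continue
        chrom, start, end, name, score, strand = fields[:6]
        if not (start.isdigit() and end.isdigit()) or int(start) > int(end):
            yield number, "chromStart/chromEnd must be integers with start <= end"
        if not name:
            yield number, "name column (tumor name) is empty"
        if strand not in ("+", "-", "."):
            yield number, "strand must be +, - or ."


# Made-up records for illustration; the second one has start > end.
sample = io.StringIO(
    "chr1\t100\t102\ttumorA\t5\t+\n"
    "chr1\t200\t198\ttumorB\t3\t-\n"
)
for number, message in check_bed6(sample):
    print("line %d: %s" % (number, message))
```

To check a real file, pass an open file handle instead of the in-memory sample, for example `check_bed6(open("toy.mm9.bed"))`. A seventh column, if present, is ignored by this sketch.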
sbdriver.py can be run in a variety of ways. Execute the script with no arguments to see them. For example, suppose you did not add a seventh column to your bedfile because all of the samples have the same donor chromosome (as in the example bedfile toy.mm9.6113.bed). You could then run Driver Analysis as follows:
$ sbdriver.py -d chr1 toy.mm9.6113.bed > results/toy.6113.txt
You can run Trunk Driver Analysis in a fashion similar to Progression Driver Analysis:
$ sbdriver.py --trunk -f 0.015 toy.mm9.bed > results/toy.trunk.txt
Trunk Driver Analysis requires the --trunk flag so that the routine applies a per-tumor read depth cutoff. The -f flag tells the routine to relax the recurrence criterion so that a gene need only have insertions in at least 1.5% of tumors (as opposed to 5% by default).
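To make the recurrence criterion concrete, the sketch below filters genes by the fraction of tumors that carry at least one insertion in them. It is an illustration only, written in Python 3. It assumes that "fraction of tumors" means distinct tumors with an insertion in the gene divided by the total number of tumors, which may differ in detail from what sbdriver.py does internally, and it does not perform the statistical test itself. The function name recurrent_genes and the gene and tumor names are hypothetical.

```python
from collections import defaultdict


def recurrent_genes(hits, n_tumors, min_fraction=0.05):
    """Return {gene: fraction_of_tumors} for genes meeting the recurrence cutoff.

    hits is an iterable of (gene, tumor) pairs, one per insertion.
    Assumption: a tumor counts once per gene, however many insertions it has.
    """
    tumors_by_gene = defaultdict(set)
    for gene, tumor in hits:
        tumors_by_gene[gene].add(tumor)
    passing = {}
    for gene, tumors in tumors_by_gene.items():
        fraction = len(tumors) / float(n_tumors)
        if fraction >= min_fraction:
            passing[gene] = fraction
    return passing


# Made-up data: 200 tumors, GeneA hit in 4 of them (2%), GeneB in 12 (6%).
hits = [("GeneA", "tumor%d" % i) for i in range(4)]
hits += [("GeneB", "tumor%d" % i) for i in range(12)]

print(sorted(recurrent_genes(hits, 200)))         # default 5% cutoff
print(sorted(recurrent_genes(hits, 200, 0.015)))  # relaxed cutoff, as with -f 0.015
```

With the default cutoff only GeneB passes. With the relaxed 1.5% cutoff both genes pass.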
Obtain the FASTA file against which your transposon insertions were mapped. For this example, chr16 from mm9 is used. Then run the commands listed below. Note that the uniqueTA.py script takes two arguments: the first is the length of the sequences (including the TA dinucleotide) that will be used to determine uniqueness, and the second is the FASTA file, or - to read from STDIN.
$ mkdir -p cache
$ gzip -dc mm9.chr16.fa.gz | uniqueTA.py 20 - | gzip - > cache/ta20.txt.gz
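The idea of a "unique TA site" can be illustrated on a toy sequence. The Python 3 sketch below is not the uniqueTA.py implementation. It assumes that the k-base window starts at the TA dinucleotide, that only the forward strand is considered, and that windows containing N are skipped. How uniqueTA.py actually positions the window and handles strands is not described here, so treat these choices as assumptions. The function name unique_ta_sites is hypothetical.

```python
from collections import Counter


def unique_ta_sites(sequences, k):
    """Return sorted (chrom, position) pairs for TA sites whose k-mer occurs once.

    sequences maps chromosome name -> sequence string; positions are 0-based.
    Assumptions: the window starts at the TA and only the forward strand is used.
    """
    counts = Counter()
    sites = []
    for chrom, seq in sequences.items():
        seq = seq.upper()
        pos = seq.find("TA")
        while pos != -1:
            kmer = seq[pos:pos + k]
            # Ignore windows that run off the end or contain unknown bases.
            if len(kmer) == k and "N" not in kmer:
                counts[kmer] += 1
                sites.append((chrom, pos, kmer))
            pos = seq.find("TA", pos + 1)
    return sorted((chrom, pos) for chrom, pos, kmer in sites if counts[kmer] == 1)


# Toy input: TACG occurs twice (positions 0 and 4), TAGG once (position 8).
toy = {"chrA": "TACGTACGTAGG"}
print(unique_ta_sites(toy, 4))
```

Because every candidate window is held in memory, an approach like this is only practical for small inputs, which is consistent with the large memory footprint noted above for whole genomes.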
With the list of unique TA sites, you can then tally the number of bases and TA sites in each chromosome and each gene. For this you will need a gene/transcript annotation file in genePred format as well as a FASTA index of your reference genome generated by samtools.
$ gzip -dc cache/ta20.txt.gz | cut -f1 | chromInfo.py refGene.mm9.chr16.txt mm9.chr16.fa.fai - > cache/chromInfo.tsv
$ gzip -dc cache/ta20.txt.gz | featureInfo.py refGene.mm9.chr16.txt - > cache/transcriptInfo.tsv 2> cache/geneInfo.tsv
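If SAMtools is not available, the FASTA index can be produced by a short script instead. The following Python 3 sketch writes the five standard .fai columns (sequence name, length, byte offset of the first base, bases per line, bytes per line). It is a simplified stand-in rather than a replacement for samtools faidx. It assumes an uncompressed FASTA file with uniformly wrapped lines and does not validate that assumption. The function name fasta_index is hypothetical.

```python
def fasta_index(handle):
    """Return a list of (name, length, offset, linebases, linewidth) tuples.

    handle must be opened in binary mode so that offsets are byte offsets.
    Assumption: all sequence lines of a record (except the last) have equal width.
    """
    records = []
    current = None
    position = 0  # byte offset of the start of the current line
    for raw in handle:
        if raw.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            name = raw[1:].split()[0].decode("ascii")
            # The first base sits on the byte right after the header line.
            current = [name, 0, position + len(raw), 0, 0]
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if current[3] == 0 and bases:
                current[3] = bases      # bases per line
                current[4] = len(raw)   # bytes per line, including the newline
            current[1] += bases
        position += len(raw)
    if current is not None:
        records.append(tuple(current))
    return records


def format_fai(records):
    """Render index records as tab-separated .fai text."""
    return "".join("\t".join(str(field) for field in rec) + "\n" for rec in records)
```

For the example above, one possible use is `open("mm9.chr16.fa.fai", "w").write(format_fai(fasta_index(open("mm9.chr16.fa", "rb"))))`, run on the decompressed FASTA file.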
Once the chromInfo.tsv and geneInfo.tsv files have been generated, you can perform Driver Analysis using your custom annotations. Just make sure that the chromosome names in the bedfile match the chromosome names used in the FASTA reference (see the example bedfile, toy.mm9.chr16.bed).
$ sbdriver.py -cfilepath cache/chromInfo.tsv -gfilepath cache/geneInfo.tsv toy.mm9.chr16.bed > results/toy.custom.txt
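A mismatch between chromosome names in the bedfile and in the reference (for example "16" versus "chr16") is easy to miss. The Python 3 sketch below lists chromosome names that appear in a bedfile but not in the FASTA index. It relies only on the fact that the first column of both a BED file and a .fai file is the sequence name. The function name unmatched_chromosomes and the sample values are hypothetical, and the check is not part of the SB Driver Analysis package.

```python
def unmatched_chromosomes(bed_lines, fai_lines):
    """Return a sorted list of chromosome names used in the BED but absent from the .fai."""
    reference = set()
    for line in fai_lines:
        if line.strip():
            reference.add(line.rstrip("\n").split("\t")[0])
    seen = set()
    for line in bed_lines:
        # Skip blank lines and optional BED header lines.
        if line.strip() and not line.startswith(("#", "track", "browser")):
            seen.add(line.rstrip("\n").split("\t")[0])
    return sorted(seen - reference)


# Made-up records: the second BED line uses "16" instead of "chr16".
bed = [
    "chr16\t100\t102\ttumorA\t5\t+\n",
    "16\t200\t202\ttumorB\t3\t-\n",
]
fai = ["chr16\t1000\t7\t50\t51\n"]  # illustrative values, not the real mm9 index
print(unmatched_chromosomes(bed, fai))
```

Both arguments can be open file handles, for example `unmatched_chromosomes(open("toy.mm9.chr16.bed"), open("mm9.chr16.fa.fai"))`. An empty result means every chromosome in the bedfile is present in the reference.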
If you use SB Driver Analysis in your work, please cite: