This is adapted and expanded from a talk given by Yan Kou of the Ma’ayan Lab at NASB 2015.
You can download compiled STAR releases from Alexander Dobin's GitHub repo. You can find the precompiled binaries in the bin/ directory of the repository.
Annoyingly, it seems like STAR only runs on Linux and macOS, but it slurps up so much RAM that you probably want to run it on a compute cluster anyhow.
If you're a student or researcher at a Canadian institution with a Compute Canada account, you can just run module load star to load STAR.
Build Reference Genome Index
The first step is to build a reference genome index that can be used with STAR.
To do this, we need reference sequences and annotations. You can download the latest FASTA sequences and annotations from ENSEMBL. If you're interested in the human or mouse genome, you can grab these from GENCODE, which is a superset of the automated ENSEMBL annotations1.
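As a rough sketch, fetching the GENCODE release 35 human files used in the command below might look like this. The exact URLs and release number are assumptions on my part; check the GENCODE downloads page for the current paths:

```shell
# Hypothetical GENCODE v35 download paths -- verify against the GENCODE site
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.primary_assembly.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz

# The genome-generate step expects uncompressed FASTA/GTF inputs
gunzip GRCh38.primary_assembly.genome.fa.gz gencode.v35.annotation.gtf.gz
```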
Once downloaded, the CLI command for creating the genome index is:
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir ./out --genomeFastaFiles ./GRCh38.primary_assembly.genome.fa --sjdbGTFfile ./gencode.v35.annotation.gtf --sjdbOverhang 100
Here are details about some of the arguments:
runThreadN is the number of threads STAR will use to compute the index
genomeDir is the path where the generated genome index will be written
genomeFastaFiles is the path to the FASTA file(s) containing the genome sequences
sjdbGTFfile is the path to the GTF annotation file
sjdbOverhang is described in the User Manual better than I ever could2.
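Since sjdbOverhang should ideally be the read length minus one, you can sanity-check the value straight from your FASTQ file. A minimal sketch, using a toy 50 bp read made up for illustration:

```shell
# Write a toy FASTQ record (4 lines: header, sequence, separator, qualities).
# The 50 bp read below is made up for illustration.
printf '@read1\nACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n' > toy.fastq

# Read length = number of characters on the sequence (second) line
READ_LEN=$(awk 'NR==2 { print length($0); exit }' toy.fastq)

# sjdbOverhang is ideally read length - 1
echo $(( READ_LEN - 1 ))   # → 49 for this 50 bp read
```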
Important note about CPU memory
I ran into a problem where I needed to increase the maximum amount of memory STAR allows itself to use via the limitGenomeGenerateRAM flag. I had to append --limitGenomeGenerateRAM=160488777770 to compute the genome index from the GENCODE files above, raising the memory limit to roughly 149.5 GB. That also means you need access to at least a whopping 149.5 GB of memory for this H. sapiens reference genome.
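For context, the flag takes its value in bytes, so you can check that the number above really does correspond to roughly 149.5 GB with a bit of shell arithmetic:

```shell
# Convert the byte limit passed to limitGenomeGenerateRAM into whole GiB
BYTES=160488777770
echo $(( BYTES / (1024 * 1024 * 1024) ))   # → 149 (i.e. ~149.5 GiB)
```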
This took about an hour on Compute Canada's Beluga cluster. You should see output like the following:
Nov 21 15:48:06 ..... started STAR run
Nov 21 15:48:06 ... starting to generate Genome files
Nov 21 15:49:17 ... starting to sort Suffix Array. This may take a long time...
Nov 21 15:49:43 ... sorting Suffix Array chunks and saving them to disk...
Nov 21 16:34:38 ... loading chunks from disk, packing SA...
Nov 21 16:36:40 ... finished generating suffix array
Nov 21 16:36:40 ... generating Suffix Array index
Nov 21 16:40:05 ... completed Suffix Array index
Nov 21 16:40:05 ..... processing annotations GTF
Nov 21 16:40:24 ..... inserting junctions into the genome indices
Nov 21 16:44:18 ... writing Genome to disk ...
Nov 21 16:44:23 ... writing Suffix Array to disk ...
Nov 21 16:44:57 ... writing SAindex to disk
Nov 21 16:45:03 ..... finished successfully
The resulting genome index weighs in at nearly 28 GB in my case, which is nothing to sneeze at.
Finally, we’re ready to run the alignment. To create a SAM file, simply run:
STAR --runThreadN 8 --runMode alignReads --genomeDir pathToGenomeIndex --readFilesIn pathToRNAseqFile/myRNAseqReads_mate1.fastq pathToRNAseqFile/myRNAseqReads_mate2.fastq
--genomeDir is the path to the genome index we generated in the previous step. Here we have paired-end reads labelled with the _mate1 and _mate2 suffixes, passed to --readFilesIn in that order.
To generate a sorted BAM file instead, just append the flag
--outSAMtype BAM SortedByCoordinate
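Putting it together, the full paired-end run that writes a coordinate-sorted BAM looks like this (the paths are the same placeholders as in the command above):

```shell
STAR --runThreadN 8 \
     --runMode alignReads \
     --genomeDir pathToGenomeIndex \
     --readFilesIn pathToRNAseqFile/myRNAseqReads_mate1.fastq pathToRNAseqFile/myRNAseqReads_mate2.fastq \
     --outSAMtype BAM SortedByCoordinate
```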
And that should result in the following CLI output:
Nov 24 12:39:51 ..... started STAR run
Nov 24 12:39:51 ..... loading genome
Nov 24 12:40:56 ..... started mapping
Nov 24 12:48:52 ..... finished mapping
Nov 24 12:48:53 ..... started sorting BAM
Nov 24 12:49:30 ..... finished successfully
“length of the donor/acceptor sequence on each side of the junctions, ideally = (matelength - 1)” ↩︎