Multiple Sequence Aligning with STAR

2020-11-19

This is adapted/expanded from a talk given by Yan Kou of the Ma’ayan Lab given at NASB 2015.

Requirements

You can download compiled STAR releases from Alexander Dobin's GitHub repo. You can find the binaries in the bin folder.

Annoyingly, it seems like STAR only runs on Linux and MacOS, but it slurps up RAM so you probably want to run it on a compute cluster any how.

If you're a student/researcher at a Canadian institute with a Compute Canada account, you can just run module load star to load STAR.

Build Reference Genome Index

First step is to build a reference genome index that can be used with STAR.

To do this, we need reference sequences and annotations. You can download the latest FASTA sequences and annotations from ENSEMBL. If you're interested in the Human or Mouse genome, you can grab these from GENCODE, which is a superset of automated ENSEMBL genomes1.

I'll be using this gencode annotations and sequences for H. sapiens. I'll be referring to these files as gencode.v35.annotation.gtf and GRCh38.primary_assembly.genome.fa respectively going forward.

Once downloaded, the CLI command for creating the genome index is:

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir ./out --genomeFastaFiles ./GRCh38.primary_assembly.genome.fa --sjdbGTFfile ./gencode.v35.annotation.gtf --sjdbOverhang 100 

Here are details about some of the arguments:

  • runThreadN is the number of threads STAR will use to compute the index
  • genomeDir is the path where the generated genome index is to be outputted
  • genomeFastaFiles is the path to where the FASTA file(s) containing the genome sequences is/are
  • sjdbGTFfile is the path to where the GTF annotations is
  • sjdbOverhang is described in the User Manual better than I ever can 2.

Important note about CPU memory

I ran into a problem where I needed increase the max amount of memory STAR allows itself to use through the limitGenomeGenerateRAM flag. I had to append --limitGenomeGenerateRAM=160488777770 to compute the genome index from the gencode files above, raising the memory limit to 149.5 GB. That also means you need access to at least a whopping 149.5 Gb of memory for this H. sapiens reference genome.

This took about an hour on the Compute Canada's Beluga cluster. You should see the output like the following:

Nov 21 15:48:06 ..... started STAR run
Nov 21 15:48:06 ... starting to generate Genome files
Nov 21 15:49:17 ... starting to sort Suffix Array. This may take a long time...
Nov 21 15:49:43 ... sorting Suffix Array chunks and saving them to disk...
Nov 21 16:34:38 ... loading chunks from disk, packing SA...
Nov 21 16:36:40 ... finished generating suffix array
Nov 21 16:36:40 ... generating Suffix Array index
Nov 21 16:40:05 ... completed Suffix Array index
Nov 21 16:40:05 ..... processing annotations GTF
Nov 21 16:40:24 ..... inserting junctions into the genome indices
Nov 21 16:44:18 ... writing Genome to disk ...
Nov 21 16:44:23 ... writing Suffix Array to disk ...
Nov 21 16:44:57 ... writing SAindex to disk
Nov 21 16:45:03 ..... finished successfully

The resulting genome index weighs in at nearly 28Gb in my case, which is nothing to sneeze at.

Aligning Sequences

Finally, we’re ready to run the alignment. To create a SAM file, simply run:

STAR --runThreadN 8 --runMode alignReads --genomeDir pathToGenomeIndex --readFilesIn pathToRNAseqFile/myRNAseqReads_mate1.fastq pathToRNAseqFile/myRNAseqReads_mate2.fastq

where --genomeDir is the path to the genome index we generated in the previous step. Here we have paired-end reads labelled with the _mate1 and _mate2 suffix.

To generate BAM files, just append the flags

--outSAMtype BAM SortedByCoordinate

And that should result in the following CLI output:

Nov 24 12:39:51 ..... started STAR run
Nov 24 12:39:51 ..... loading genome
Nov 24 12:40:56 ..... started mapping
Nov 24 12:48:52 ..... finished mapping
Nov 24 12:48:53 ..... started sorting BAM
Nov 24 12:49:30 ..... finished successfully

  1. “Gencode is made by merging the manual gene annotation produced by the Ensembl-Havana team and the Ensembl-genebuild automated gene annotation” ↩︎

  2. “length of the donor/acceptor sequence on each side of the junctions, ideally = (matelength - 1)” ↩︎

Bioinformaticsbioinformaticssequence alignment

Setting Up a Windows System to Protect Untechnical Users from Themselves

Semi-Supervised Classification with Graph Convolutional Networks