Spliced Transcripts Alignment to Reference (STAR)

What is STAR?

STAR is an RNA-seq alignment tool developed by Alexander Dobin et al. [1]. It is implemented as standalone C++ code and is free, open-source software distributed under the GPLv3 license. It can be downloaded from http://code.google.com/p/rna-star/. The latest version is 2.3.0.

How does STAR work?

STAR first builds its own index of the reference genome. From a standard genome sequence (e.g., hg19 or mm10), STAR produces a suffix array index that accelerates the subsequent alignment step. The index only needs to be built once per reference genome and can then be reused for every sample aligned to that genome.
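To illustrate why a suffix array speeds up lookups, here is a minimal toy sketch in Python. It is not STAR's implementation: the function names, the naive construction, and the 12-base "genome" are assumptions made purely for illustration. It only shows the general idea that an index built once can answer many exact-match queries by binary search.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text, in sorted order."""
    # Naive construction: fine for a toy string, far too slow for a real genome.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Binary-search the suffix array for every position where pattern occurs."""
    m = len(pattern)
    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes in sa[first:lo] start with pattern.
    return sorted(sa[first:lo])


genome = "ACGTACGGACGT"                    # made-up toy sequence
index = build_suffix_array(genome)         # built once
print(find_occurrences(genome, index, "ACG"))   # [0, 4, 8]
print(find_occurrences(genome, index, "ACGT"))  # [0, 8]
```

Each query costs a number of comparisons that grows with the logarithm of the genome length, which is why building the index up front pays off when millions of reads are aligned against the same reference.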

STAR can identify splice junctions in RNA-seq reads, including those produced by alternative splicing. Its main advantage is speed: alignment time can be shortened to a few hours, or even under an hour, with alignment accuracy similar to that of TopHat.

Using STAR

The input to STAR can be single-end or paired-end FASTQ files. Providing an annotation file (.gtf) can increase alignment accuracy. After the alignment step, STAR writes its output in SAM format. The SAM file should be converted to BAM, the binary version of SAM, to shorten computation time in downstream differential expression analysis of RNA-seq data. Selected output files are described below:

  • Aligned.out.sam: the alignments in SAM format
  • SJ.out.tab: the detected splice junctions in tab-delimited format, with the number of uniquely mapped reads crossing each junction
  • Log.progress.out: the log of alignment progress
  • Log.final.out: summary statistics of the alignment
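The indexing, alignment, and SAM-to-BAM steps described above can be sketched as command lines. The Python helpers below only assemble the argument lists; nothing is executed. All paths, the thread count, and the helper names are hypothetical. The STAR options used (--runMode genomeGenerate, --genomeDir, --genomeFastaFiles, --sjdbGTFfile, --sjdbOverhang, --readFilesIn, --outFileNamePrefix, --runThreadN) come from the STAR manual, but their availability may differ between versions, so check the manual for the version you have installed. The samtools call assumes a samtools release that accepts `view -b -o`.

```python
from typing import List, Optional, Sequence


def star_index_command(genome_dir: str, fasta: str, threads: int = 4,
                       gtf: Optional[str] = None,
                       sjdb_overhang: Optional[int] = None) -> List[str]:
    """Assemble the one-time genome indexing command."""
    cmd = ["STAR", "--runMode", "genomeGenerate",
           "--genomeDir", genome_dir,
           "--genomeFastaFiles", fasta,
           "--runThreadN", str(threads)]
    if gtf is not None:
        # Optional annotation; assumed to be supported by the installed version.
        cmd += ["--sjdbGTFfile", gtf]
    if sjdb_overhang is not None:
        cmd += ["--sjdbOverhang", str(sjdb_overhang)]
    return cmd


def star_align_command(genome_dir: str, reads: Sequence[str],
                       out_prefix: str, threads: int = 4) -> List[str]:
    """Assemble the alignment command for one (single-end) or two (paired-end) FASTQ files."""
    if len(reads) not in (1, 2):
        raise ValueError("expected one FASTQ file (single-end) or two (paired-end)")
    return ["STAR", "--genomeDir", genome_dir,
            "--readFilesIn", *reads,
            "--outFileNamePrefix", out_prefix,
            "--runThreadN", str(threads)]


def sam_to_bam_command(sam: str, bam: str) -> List[str]:
    """Assemble the samtools command that converts SAM to BAM."""
    return ["samtools", "view", "-b", "-o", bam, sam]


# Hypothetical paths, for illustration only.
prefix = "sample1_"
index_cmd = star_index_command("star_index", "hg19.fa", gtf="genes.gtf", sjdb_overhang=99)
align_cmd = star_align_command("star_index", ["sample1_R1.fastq", "sample1_R2.fastq"], prefix)
# STAR prepends the prefix to its output file names.
bam_cmd = sam_to_bam_command(prefix + "Aligned.out.sam", prefix + "Aligned.out.bam")

for cmd in (index_cmd, align_cmd, bam_cmd):
    print(" ".join(cmd))
```

To actually run a step, each list can be passed to `subprocess.run(cmd, check=True)`. The index command is run once; the alignment and conversion commands are repeated per sample.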

Output of STAR

  • Number of input reads: the total number of reads in the input; it reflects sequencing depth and coverage
  • Average input read length: the average length of the input sequencing reads
Unique reads
  • Uniquely mapped reads number: the number of reads that map to exactly one location in the genome.
  • Uniquely mapped reads %: the percentage of input reads that are uniquely mapped. As a rule of thumb, this value should not be lower than 80%.
  • Average mapped length: the average length of the aligned portion of uniquely mapped reads. Bases that are missing or of bad quality may be left out of the alignment, so this value can be shorter than the average input read length; it should be as close to it as possible.
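  • Number of splices: Total: the total number of splice junctions crossed by the mapped reads.
  • Number of splices: Annotated (sjdb): the number of those splices that match junctions annotated in the .gtf file
  • Number of splices: Non-canonical: the number of splices whose intron motif is not one of the canonical motifs (GT/AG, GC/AG, AT/AC)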
  • Mismatch rate per base: the average mismatch rate per aligned nucleotide. With a good library, this value should be 0.5%-0.8%.
  • Deletion rate per base: the average deletion rate per aligned nucleotide
  • Deletion average length: the average length of a deletion
  • Insertion rate per base: the average insertion rate per aligned nucleotide
  • Insertion average length: the average length of an insertion
Multi-mapping reads
  • Number of reads mapped to multiple loci: the number of reads that map to more than one location, e.g., reads from poly-A tails
  • % of reads mapped to multiple loci: the percentage of multi-mapping reads. Typically, this value is around 5%.
  • Number of reads mapped to too many loci: the number of reads that map to more loci than the allowed maximum. The default maximum is 10; a read that maps to more than 10 loci is categorized as “too many loci”.
  • % of reads mapped to too many loci: the percentage of reads mapped to too many loci
Unmapped reads
  • % of reads unmapped: too many mismatches: the percentage of reads left unmapped because they contain too many mismatches. By default, a read with more than 0.3*(read length) or more than 10 mismatched bases is categorized as “too many mismatches.”
  • % of reads unmapped: too short: the percentage of reads left unmapped because the best alignment covers too little of the read. By default, “too short” means the mapped length is less than 2/3*(read length). This value is usually associated with sequencing quality.
  • % of reads unmapped: other: the percentage of reads left unmapped for reasons other than the two above. One possible cause is contamination with other samples; running BLAST on the unmapped reads is advised for troubleshooting.
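The summary statistics listed above can also be checked programmatically. The following is a minimal Python sketch that assumes the summary file stores each statistic as a "name | value" line, with section headers on lines that contain no "|". The sample text, its numbers, and the helper names are made up for illustration; the 80% threshold is the rule of thumb given above for uniquely mapped reads.

```python
from typing import Dict

# Made-up excerpt in the assumed "name |<TAB>value" layout;
# the numbers are illustrative, not real results.
SAMPLE_LOG = """\
                          Number of input reads |\t1000000
                      Average input read length |\t100
                                    UNIQUE READS:
                   Uniquely mapped reads number |\t850000
                        Uniquely mapped reads % |\t85.00%
                             MULTI-MAPPING READS:
             % of reads mapped to multiple loci |\t5.20%
"""


def parse_summary(text: str) -> Dict[str, str]:
    """Map each statistic name to its raw string value."""
    stats: Dict[str, str] = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip section headers and blank lines
        name, value = line.split("|", 1)
        stats[name.strip()] = value.strip()
    return stats


def percent(value: str) -> float:
    """Convert a value such as '85.00%' to the float 85.0."""
    return float(value.rstrip("%"))


def low_unique_rate(stats: Dict[str, str], threshold: float = 80.0) -> bool:
    """Return True if the uniquely mapped percentage is below the threshold."""
    return percent(stats["Uniquely mapped reads %"]) < threshold


stats = parse_summary(SAMPLE_LOG)
print(stats["Number of input reads"])      # 1000000
print(low_unique_rate(stats))              # False
```

For a real run, the text would be read from the summary file in the output directory instead of from `SAMPLE_LOG`, and further checks (for example on the multi-mapping percentage) can be added in the same way.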

References

[1] Alexander Dobin et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics (2012).