Long-reads-based Alternative Splicing Estimation and Recognition: LASER

Running LASER

Input files

  • Genome Alignment bam files minimap2 using parameters minimap2 -ax splice -u f annotation/genome.fa long_read.fastq.gz | samtools sort -@ 4 -o output.bam - samtools index output.bam
  • Reference annotation in gtf format. Example file here
  • Short read sequencing SJ.out files STAR. Example file in here. We recommend to pull SJ.out into a single SJ.out from many experiments and filter by min counts.

Building junction reference.

LASER requires a reference junction set to identify couplings. It only analyzes junctions that occur outside the influence of 5’ selection (alternative transcription start sites) or 3’ selection (alternative polyadenylation). LASER can build the reference using both a reference annotation and short read sequence data junctions from STAR.

## GRanges object with 61994 ranges and 2 metadata columns:
##           seqnames            ranges strand |     gene_id               juncID
##              <Rle>         <IRanges>  <Rle> | <character>          <character>
##       [1]       2L       11345-11409      - | FBgn0002121       2L:11345-11409
##       [2]       2L       11519-11778      - | FBgn0002121       2L:11519-11778
##       [3]       2L       12222-12285      - | FBgn0002121       2L:12222-12285
##       [4]       2L       12929-13519      - | FBgn0002121       2L:12929-13519
##       [5]       2L       13626-13682      - | FBgn0002121       2L:13626-13682
##       ...      ...               ...    ... .         ...                  ...
##   [61990]       3R 31979620-31985510      - | FBgn0011224 3R:31979620-31985510
##   [61991]       3R 31979620-31993207      - | FBgn0011224 3R:31979620-31993207
##   [61992]       3R 31979620-32014359      - | FBgn0011224 3R:31979620-32014359
##   [61993]       3R 31981927-31985711      - | FBgn0011224 3R:31981927-31985711
##   [61994]       3R 31984186-31985711      - | FBgn0011224 3R:31984186-31985711
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths

Read to junctions assignments

Full-length read filtering.

LASER generates a database with read-to-feature assignments. This process involves three steps occurring within read_to_junctions(). The steps include: 1) Identifying only 5’-3’ full length read isoforms. This filtering is done with get_LASER_full_lengths(), considering only reads spanning an annotated TSS and 3’ end. 2) Correcting the junctions found in long reads using reference annotation and short read data in read_refjunction_correction(). 3) Last, make_junction_database() takes each read and assigns the features it contains: TSS, TES, and the junctions it contains.

Calculate exon-junction couplings to 5’/3’

LASER calculates exon couplings by generating contingency matrices from exon junctions to promoters with compute_exon_tss_couplings() and exon junctions to 3’ ends with compute_exon_3end_couplings(). It uses multinomial testing with the chi-squared statistic to test the significance of the couplings. To account for possible biases due to low counts, the p-value is simulated using Monte Carlo methods.

Exploring results

Transcription start site couplings with exon usage

Looking closely at Stai gene [FBgn0266521] (https://flybase.org/reports/FBgn0266521). We can observed the couplings of the different junctions and their residuals values. Residual values represent the difference between observed and expected junction of a giving pair in the gene.

Looking into individual promoters we can notice how the first TSS of Stai FBgn0266521:P01 it’s strongly associated with junctions 6122411-6122697 and 6122854-6122977, while junctions 6121944-6122004 and 6122191-6122263 are depleted from the same promoter but highly enriched in the 2nd TSS FBgn0266521:P02.

Session information

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.6    dplyr_1.0.9      LASER_0.0.0.9000
## 
## loaded via a namespace (and not attached):
##   [1] bitops_1.0-7                matrixStats_0.62.0         
##   [3] bit64_4.0.5                 filelock_1.0.2             
##   [5] progress_1.2.2              httr_1.4.3                 
##   [7] GenomeInfoDb_1.30.1         tools_4.1.2                
##   [9] bslib_0.3.1                 utf8_1.2.2                 
##  [11] R6_2.5.1                    DBI_1.1.2                  
##  [13] BiocGenerics_0.40.0         colorspace_2.0-3           
##  [15] withr_2.5.0                 tidyselect_1.1.2           
##  [17] prettyunits_1.1.1           bit_4.0.4                  
##  [19] curl_4.3.2                  compiler_4.1.2             
##  [21] cli_3.6.0                   SGSeq_1.28.0               
##  [23] Biobase_2.54.0              xml2_1.3.3                 
##  [25] DelayedArray_0.20.0         labeling_0.4.2             
##  [27] rtracklayer_1.54.0          sass_0.4.1                 
##  [29] scales_1.2.0                readr_2.1.2                
##  [31] rappdirs_0.3.3              stringr_1.4.0              
##  [33] digest_0.6.29               Rsamtools_2.10.0           
##  [35] rmarkdown_2.14              XVector_0.34.0             
##  [37] pkgconfig_2.0.3             htmltools_0.5.2            
##  [39] MatrixGenerics_1.6.0        highr_0.9                  
##  [41] limma_3.50.3                dbplyr_2.1.1               
##  [43] fastmap_1.1.0               rlang_1.0.6                
##  [45] rstudioapi_0.13             RSQLite_2.2.13             
##  [47] prettydoc_0.4.1             farver_2.1.0               
##  [49] jquerylib_0.1.4             BiocIO_1.4.0               
##  [51] generics_0.1.2              jsonlite_1.8.0             
##  [53] BiocParallel_1.28.3         RCurl_1.98-1.6             
##  [55] magrittr_2.0.3              GenomeInfoDbData_1.2.7     
##  [57] Matrix_1.4-1                Rcpp_1.0.8.3               
##  [59] munsell_0.5.0               S4Vectors_0.32.4           
##  [61] fansi_1.0.3                 lifecycle_1.0.1            
##  [63] edgeR_3.36.0                stringi_1.7.6              
##  [65] yaml_2.3.5                  SummarizedExperiment_1.24.0
##  [67] zlibbioc_1.40.0             plyr_1.8.7                 
##  [69] BiocFileCache_2.2.1         grid_4.1.2                 
##  [71] blob_1.2.3                  parallel_4.1.2             
##  [73] crayon_1.5.1                lattice_0.20-45            
##  [75] Biostrings_2.62.0           GenomicFeatures_1.46.5     
##  [77] hms_1.1.1                   KEGGREST_1.34.0            
##  [79] locfit_1.5-9.5              knitr_1.39                 
##  [81] pillar_1.7.0                igraph_1.3.5               
##  [83] RUnit_0.4.32                GenomicRanges_1.46.1       
##  [85] rjson_0.2.21                reshape2_1.4.4             
##  [87] biomaRt_2.50.3              stats4_4.1.2               
##  [89] XML_3.99-0.9                glue_1.6.2                 
##  [91] evaluate_0.15               data.table_1.14.2          
##  [93] tzdb_0.3.0                  png_0.1-7                  
##  [95] vctrs_0.4.1                 tidyr_1.2.0                
##  [97] gtable_0.3.0                purrr_0.3.4                
##  [99] reshape_0.8.9               assertthat_0.2.1           
## [101] cachem_1.0.6                maditr_0.8.3               
## [103] xfun_0.30                   restfulr_0.0.13            
## [105] tibble_3.1.7                plyranges_1.14.0           
## [107] GenomicAlignments_1.30.0    AnnotationDbi_1.56.2       
## [109] memoise_2.0.1               IRanges_2.28.0             
## [111] ellipsis_0.3.2