Running LASER
Input files
- Genome Alignment bam files minimap2 using parameters
minimap2 -ax splice -u f annotation/genome.fa long_read.fastq.gz | samtools sort -@ 4 -o output.bam - samtools index output.bam
- Reference annotation in gtf format. Example file here
- Short read sequencing SJ.out files STAR. Example file in here. We recommend to pull SJ.out into a single SJ.out from many experiments and filter by min counts.
Building junction reference.
LASER requires a reference junction set to identify couplings. It only analyzes junctions that occur outside the influence of 5’ selection (alternative transcription start sites) or 3’ selection (alternative polyadenylation). LASER can build the reference using both a reference annotation and short read sequence data junctions from STAR.
# load reference annotation
annot_path <- system.file("exdata/dm6.annot.gtf.gz", package="LASER")
ref_annot <- rtracklayer::import.gff(annot_path)
# load junctions from STAR
junction_path <- system.file("exdata/short_read_junctions.SJ.out.tab", package = 'LASER')
# create txf file of reference junctions
referenece_junctions <- create_reference_junctions( junction_path, min.jcounts = 2 , ref_annot, type="short")
referenece_junctions
## GRanges object with 61994 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id juncID
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] 2L 11345-11409 - | FBgn0002121 2L:11345-11409
## [2] 2L 11519-11778 - | FBgn0002121 2L:11519-11778
## [3] 2L 12222-12285 - | FBgn0002121 2L:12222-12285
## [4] 2L 12929-13519 - | FBgn0002121 2L:12929-13519
## [5] 2L 13626-13682 - | FBgn0002121 2L:13626-13682
## ... ... ... ... . ... ...
## [61990] 3R 31979620-31985510 - | FBgn0011224 3R:31979620-31985510
## [61991] 3R 31979620-31993207 - | FBgn0011224 3R:31979620-31993207
## [61992] 3R 31979620-32014359 - | FBgn0011224 3R:31979620-32014359
## [61993] 3R 31981927-31985711 - | FBgn0011224 3R:31981927-31985711
## [61994] 3R 31984186-31985711 - | FBgn0011224 3R:31984186-31985711
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
Read to junctions assignments
Full-length read filtering.
LASER generates a database with read-to-feature assignments. This process involves three steps occurring within read_to_junctions()
. The steps include: 1) Identifying only 5’-3’ full length read isoforms. This filtering is done with get_LASER_full_lengths()
, considering only reads spanning an annotated TSS and 3’ end. 2) Correcting the junctions found in long reads using reference annotation and short read data in read_refjunction_correction()
. 3) Last, make_junction_database()
takes each read and assigns the features it contains: TSS, TES, and the junctions it contains.
Calculate exon-junction couplings to 5’/3’
LASER calculates exon couplings by generating contingency matrices from exon junctions to promoters with compute_exon_tss_couplings()
and exon junctions to 3’ ends with compute_exon_3end_couplings()
. It uses multinomial testing with the chi-squared statistic to test the significance of the couplings. To account for possible biases due to low counts, the p-value is simulated using Monte Carlo methods.
Exploring results
Transcription start site couplings with exon usage
Looking closely at Stai gene [FBgn0266521] (https://flybase.org/reports/FBgn0266521). We can observed the couplings of the different junctions and their residuals values. Residual values represent the difference between observed and expected junction of a giving pair in the gene.
Looking into individual promoters we can notice how the first TSS of Stai FBgn0266521:P01
it’s strongly associated with junctions 6122411-6122697
and 6122854-6122977
, while junctions 6121944-6122004
and 6122191-6122263
are depleted from the same promoter but highly enriched in the 2nd TSS FBgn0266521:P02
.
couplings$TSS.couplingsPerJunction %>%
filter(grepl("FBgn0266521", pairs_id)) %>% mutate(
promoter_id = stringr::str_split_fixed(.data$pairs_id, ":" , n = 3)[, 3],
junction = stringr::str_split_fixed(.data$pairs_id, ":" , n = 3)[, 2]
) %>%
ggplot(., aes(x = junction,
y = residuals,
color = promoter_id)) +
geom_point(size =3) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45,vjust = 1,hjust = 1)) +
facet_grid(cols = vars(promoter_id)) +
scale_color_manual(values =c("#FF355E", "#004F98", "#679267")) +
geom_hline(yintercept = 0)
3’end couplings with exon usage
In contrast tacc gene [FBgn0026620] (https://flybase.org/reports/FBgn0026620). We can observed the couplings related to the usage of specific exon-junction to the 3’end selection of the gene.
couplings$TES.couplingsPerJunction %>%
filter(grepl("FBgn0026620", pairs_id)) %>%
mutate(tes_id=stringr::str_split_fixed(.data$pairs_id,":" ,n = 3)[,3],
junction=stringr::str_split_fixed(.data$pairs_id,":" ,n = 3)[,2]) %>%
ggplot(.,
aes(x=junction,
y=residuals,
color=tes_id)) +
geom_point(size=3) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
facet_grid(cols=vars(tes_id)) +
scale_color_manual(values=c("#FF355E","#004F98", "#679267", "blue", "lavender")) +
geom_hline(yintercept=0)
Session information
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.6 dplyr_1.0.9 LASER_0.0.0.9000
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 matrixStats_0.62.0
## [3] bit64_4.0.5 filelock_1.0.2
## [5] progress_1.2.2 httr_1.4.3
## [7] GenomeInfoDb_1.30.1 tools_4.1.2
## [9] bslib_0.3.1 utf8_1.2.2
## [11] R6_2.5.1 DBI_1.1.2
## [13] BiocGenerics_0.40.0 colorspace_2.0-3
## [15] withr_2.5.0 tidyselect_1.1.2
## [17] prettyunits_1.1.1 bit_4.0.4
## [19] curl_4.3.2 compiler_4.1.2
## [21] cli_3.6.0 SGSeq_1.28.0
## [23] Biobase_2.54.0 xml2_1.3.3
## [25] DelayedArray_0.20.0 labeling_0.4.2
## [27] rtracklayer_1.54.0 sass_0.4.1
## [29] scales_1.2.0 readr_2.1.2
## [31] rappdirs_0.3.3 stringr_1.4.0
## [33] digest_0.6.29 Rsamtools_2.10.0
## [35] rmarkdown_2.14 XVector_0.34.0
## [37] pkgconfig_2.0.3 htmltools_0.5.2
## [39] MatrixGenerics_1.6.0 highr_0.9
## [41] limma_3.50.3 dbplyr_2.1.1
## [43] fastmap_1.1.0 rlang_1.0.6
## [45] rstudioapi_0.13 RSQLite_2.2.13
## [47] prettydoc_0.4.1 farver_2.1.0
## [49] jquerylib_0.1.4 BiocIO_1.4.0
## [51] generics_0.1.2 jsonlite_1.8.0
## [53] BiocParallel_1.28.3 RCurl_1.98-1.6
## [55] magrittr_2.0.3 GenomeInfoDbData_1.2.7
## [57] Matrix_1.4-1 Rcpp_1.0.8.3
## [59] munsell_0.5.0 S4Vectors_0.32.4
## [61] fansi_1.0.3 lifecycle_1.0.1
## [63] edgeR_3.36.0 stringi_1.7.6
## [65] yaml_2.3.5 SummarizedExperiment_1.24.0
## [67] zlibbioc_1.40.0 plyr_1.8.7
## [69] BiocFileCache_2.2.1 grid_4.1.2
## [71] blob_1.2.3 parallel_4.1.2
## [73] crayon_1.5.1 lattice_0.20-45
## [75] Biostrings_2.62.0 GenomicFeatures_1.46.5
## [77] hms_1.1.1 KEGGREST_1.34.0
## [79] locfit_1.5-9.5 knitr_1.39
## [81] pillar_1.7.0 igraph_1.3.5
## [83] RUnit_0.4.32 GenomicRanges_1.46.1
## [85] rjson_0.2.21 reshape2_1.4.4
## [87] biomaRt_2.50.3 stats4_4.1.2
## [89] XML_3.99-0.9 glue_1.6.2
## [91] evaluate_0.15 data.table_1.14.2
## [93] tzdb_0.3.0 png_0.1-7
## [95] vctrs_0.4.1 tidyr_1.2.0
## [97] gtable_0.3.0 purrr_0.3.4
## [99] reshape_0.8.9 assertthat_0.2.1
## [101] cachem_1.0.6 maditr_0.8.3
## [103] xfun_0.30 restfulr_0.0.13
## [105] tibble_3.1.7 plyranges_1.14.0
## [107] GenomicAlignments_1.30.0 AnnotationDbi_1.56.2
## [109] memoise_2.0.1 IRanges_2.28.0
## [111] ellipsis_0.3.2