ciaTaiLoR

3’end analysis and processing of CIA Transcriptome

About the Documentation

All functions required for 3’processing as well as additonal helper functions for Long-read data processing an exploration of CIA transcriptome are available are part of the ciaTaiLoR package. ciaTaiLoR can be easily installed by:

install.packages("devtools")
devtools::install_github("hilgers-lab/ciaTaiLoR")

poly(A) Database

polyA database was created using data from FLAM-seq and Oxford Nanopore RNA Direct sequencing methods. For each sequencing methods reads with poly(A) tails were identified using their already published pipelines. For FLAM-seq: repository available here and for Nanopore RNA Direct sequencing we used oxford nanopore pipeline for poly(A) tail calling: repository available here. Both pipelines return mapped bam files containing only reads with an identifiable 3’end.

For simplification in this vignette we’ll explore the processing of FLAM-seq data from the Drosophila Head tissue.

Loading data

library(ciaTaiLoR)
# load bam file of genomic alignments 
flamSeqData <- GenomicAlignments::readGAlignments(
  paste0(mainDir, 
         "flam_seq_pipeline/mapQuantGeneDir/w1118_heads_flam-seq_pooled.bam")) %>% 
  GenomicRanges::GRanges(.) 
# Invert strand 
strand(flamSeqData) <- ifelse(strand(flamSeqData) == '+', '-', '+') 

3’end extraction and filtering

Given the high accuracy of FLAM-seq and RNA Direct methods, the most 3’end of every read with a poly(A) tail represents a highly accurate 3’end of each transcript, therefore we trimmed the reads for at their 3’ends leaving a single nucleotide.

flamSeq_singlent <- trim_single_nucleotide(flamSeqData)

Clustering

Using this data we can now create clusters an annotate them with a reference annotation using create_clusters() function and retrieve the supported read counts associated with each cluster

refAnnot <- "/data/repository/organisms/dm6_ensembl/ensembl/release-96/genes.gtf"
geneAnnot <- rtracklayer::import.gff(refAnnot, feature.type = "gene")
flamSeq_clusters <- create_clusters(flamSeq_singlent,
                geneAnnot,
                window = 50
                )
flamSeq_clusters$Counts <- countOverlaps(flamSeq_clusters, 
                                 flamSeq_singlent)

Feature extraction

Features can be easily extracted from clusters using extract_features() function. This function retrieves several parameters that are useful to identify 3’ends from any time of data.

flamSeq_clusters.features <- extract_features(flamSeq_clusters, 
                 genomeSeq = BSgenome.Dmelanogaster.UCSC.dm6,
                 dwindow = 50,
                 upwindow = 50)

Distribution of poly(A) signals

When scanning for poly(A) signal hexamers in FLAM-seq data we can observe high proportion of cannonical poly(A) sites.

flamSeq_clusters.features %>% 
  group_by(polyAsignalClass) %>% tally() %>% 
  mutate(percentage = n/sum(n)) %>% 
  ggplot(., aes(fill=polyAsignalClass, 
                y=percentage, 
                x="")) + 
    geom_bar(position="stack", stat="identity") + 
  theme_classic() + 
  xlab("FLAM-seq clusters") + 
  ylab("Fraction of clusters")

Nucleotide composition

An additional read out is the nucleotide composition of the 3’ends, this can be easily plot using the cluster regions and the function plotNucleotides(). As expected due to the high quality definion of FLAM-seq the reads show the expected nucleotide composition without any possible internal priming bias in the regions flanking the polyA site.

ciaTaiLoR::plotNucleotides(flamSeq_clusters,
                           genome = BSgenome.Dmelanogaster.UCSC.dm6)

Correction of Long-read sequencing assemblies

Rationale

Long-read sequencing correction is based on the fact that Long-read assemblies, usually contain artifact isoforms, due to internal primming at the 3’ends. Using our method overcomes this problems by correcting the reads in the following steps:

  1. Creation of scaffold blocks. We used our reference database to create scaffold blocks representing the bins between the 3’ends found by FLAM-seq and RNA Direct data.

  2. Keep all 3’ends in the assembly that match the reference database

  3. For those that do not match find keep those that overlap within the scaffold block if they overlap with more than 25% of the scaffold block the 3’ends it then corrected and assigned to the most distal part of the 3’end of the scaffold block.

  4. 3’ends that fall downstream the 3’end of the poly(A) are kept only if they have a recognizable cannonical poly(A) signal at their 3’end.

Scaffoldbloks

Data

For this vignette we’ll 3 files from the Nanopore 1D cDNA from Drosophila heads : 1) The uncorrected assembly out from FLAIR-collapse. 2) The comprehensive poly(A) database of the Drosophila transcriptome available Here 3) the output of sqanti QC _classification.txt file. That contains a detailled annotation of ORF which we’ll use to define the stop coding taked as the 1st nucletide of the 3’UTR. 4) Reference annotation in gff format.

flairAssembly  <- rtracklayer::import.gff(
  paste0(flair_outputDir, 
         "cia_heads.all.seqs.gtf"))
polyaDatabase  <- rtracklayer::import.gff(
  paste0(polyADatabase_outputDir, 
         "combined.rds.clusters.new.gff"))
sqantiClassification  <- readr::read_tsv(
  paste0(SQANTI_classification_outputDir, 
         "cia_heads.all.seqs.sqanti_classification.txt"))
ensemble_Annotation  <- rtracklayer::import.gff(ensAnnotFile)

Create Scaffold blocks

Scaffold blocks are created using SQANTI reference stop codons that allows to make the 3’UTR bin based on the isoforms found in the data. Function createScaffoldBlocks() included in ciaTaiLoR which retrieves the regions that we’ll use to correct the 3’UTR of the assemblies.

blocks <- createScaffoldBlocks(sqantiClassification, polyaDatabase) 
blocks
## GRanges object with 16594 ranges and 0 metadata columns:
##               seqnames            ranges strand
##                  <Rle>         <IRanges>  <Rle>
##   FBgn0000014       3R 16807097-16807621      -
##   FBgn0000014       3R 16807622-16809687      -
##   FBgn0000015       3R 16927203-16927664      -
##   FBgn0000015       3R 16927665-16929389      -
##   FBgn0000017       3L 16615887-16616257      -
##           ...      ...               ...    ...
##   FBgn0286027       2L 15011358-15011662      +
##   FBgn0286027       2L 15011663-15011775      +
##   FBgn0286051       3L   6608418-6608792      +
##   FBgn0286213       3L 13024637-13024851      +
##   FBgn0286222        X   6681699-6681861      +
##   -------
##   seqinfo: 9 sequences from an unspecified genome; no seqlengths

Scaffold block guided correction

The function blockGuidedcorrection(blocks, uncor.annot, 300, 50). Windows can be defined for the assignment of scaffold blocks to determine if an isoforms is “longer” or shorter than the reference. In this case we’ll usee 300 and 50 nt respectively. The object contains the assembly of all the isoforms that were “corrected” and extended to their most distal 3’end in res.correction$extended.assembly. The all transcripts ends that were processed as their original coordinate are deposited in res.correction$$corrected.ends, the column end_quality describes the type of 3’end that was found in comparison with the scaffold blocks, the category shorter represents those transcripts that can be corrected as their in the threshold defined, transcripts classified as longer will be used later.

res.correction <- blockGuidedcorrection(blocks, 
                                        flairAssembly, 
                                        300, 
                                        50)
## [1] 32914
## Identify 3'ends for correction
## [1] 32914
## LongEnd read correction for: 5847 transcripts
## Sucessfully corrected transcript 3'ends: 5847  transcripts

Longer than the reference

Isoforms with a 3’end are identified with the function classifyLonger. Using the output from blockGuidedcorrection(). This outputs a data.frame containing the distance to the closest 3’end in the poly(A) database as well as other relevant information for filtering like poly(A) signals, the distance of the poly(A) signals, if the 3’end is found in ensembl etc.

longer_than_reference_ends  <- classifyLonger(res.correction$ends.data, 
                                    ensemble_Annotation) 

Filtering longer than reference

To our knowledge most of the longer than reference 3’ends come form “Ultra-long 3’UTR”, according to literature this isoforms usually harbor a strong poly(A) signal as they represent the hard stop for transcription termination, we decieded to keep only those transcripts as strict cut-offf.

transcriptsUtraLong <- subset(longer_than_reference_ends, 
                              utr_class=="ultra.distal" & 
                                signal_short %in% c("AATAAA") | 
                                utr_class=="ultra.distal" & 
                                is.Ens.End=="TRUE"  ) %>% 
  dplyr::select(transcript_id)

transcriptsSecondDistal<- subset(longer_than_reference_ends,
                                 utr_class=="second.distal" & 
                                   signal_short %in% c("AATAAA")) %>% 
  dplyr::select(transcript_id)
longToSelect <- c(transcriptsUtraLong$transcript_id, transcriptsSecondDistal$transcript_id )

Filtering and integration

To identify transcripts with a 3’end found in the database we use strandAwareFindOverlapEnd() function.

# retrieve transcripts 
flairAssembly_transcripts <- flairAssembly[flairAssembly$type=="transcript"]
# identify transcripts with a 3'end found in the reference 
flairAssembly_transcript_InRef <- strandAwareFindOverlapEnd(flairAssembly_transcripts, 
                                                          polyaDatabase , 
                                                          250) 
# Identify those transcripts found reference AND corrected in the previous steps 
flairAssembly_transcript_InRef <- c(flairAssembly_transcript_InRef$transcript_id, 
                                  sub('^([^.]+.[^.]+).*', '\\1', 
                                      res.correction$corrected.ends$neu.id))
transcriptsToKeep <- c(longToSelect , flairAssembly_transcript_InRef)
# then we subset the previous "extended assembly for only those transcripts matching the previously described parameters 
# Transcripts corrected OR found in reference 
corrected.filtered.assembly <- subset(res.correction$extended.assembly, 
                                      transcript_id %in% transcriptsToKeep ) 
# Transcripts that were not correct OR found in the reference 
corrected.removed.assembly <- subset(res.correction$extended.assembly, 
                                     !transcript_id %in% transcriptsToKeep ) 

We can now assess the quality of the 3’ends that were corrected with the assembly and compare it to the those discarded. Retriving the 3’UTRs of each transcript with getLastExonperTranscript() and using the plotNucleotides() function.

# keep exons only 
corrected.filtered.assembly <- corrected.filtered.assembly[corrected.filtered.assembly$type == "exon"]
corrected.removed.assembly <- corrected.removed.assembly[corrected.removed.assembly$type == "exon"]
# get3UTR per transcript 
utr3.corrected <- getLastExonperTranscript(corrected.filtered.assembly)
utr3.removed <- getLastExonperTranscript(corrected.removed.assembly)
# trim to single nucleotide end 
utr3.corrected <- trim_single_nucleotide(utr3.corrected)
utr3.removed <- trim_single_nucleotide(utr3.removed)

Quality of Corrected 3’ends

This group corresponds to those corrected 3’ends and found in poly(A) database 3’ends. As expected the nucleotide probabilites from the 3’end found in the reference correspond with those previously reported.

ciaTaiLoR::plotNucleotides( utr3.corrected,
                               genome = BSgenome.Dmelanogaster.UCSC.dm6) 

Quality of discarded 3’ends

While the non corrected not found show a huge enrichment of A nts upstream the detected cleavage site suggesting they most likely come form internal priming events.

ciaTaiLoR::plotNucleotides( utr3.removed,
                               genome = BSgenome.Dmelanogaster.UCSC.dm6)