🧬

AD RNAPII Speed Profiles

MCF-7 · E2 time course · GSE41324 · Danko et al. 2013
Best Pearson r: 0.557 (AD spectral ψ)
Best Spearman ρ: 0.603 (Method A / groHMM)
Genes analyzed: 81 (hg19 · MCF-7 · E2)
AD Processing Gain: ≤ 50 dB (10·log₁₀(M) per gene)
Active TODOs
Strategy Log: Recent Sessions
Current Benchmark Results vs Danko et al. 2013
Method Pearson r Spearman ρ Notes
Pipeline Status

The Big Idea: It's Like Highway Traffic

Imagine a long highway. Helicopters hover above and count cars at every mile marker. If there are many cars per mile, that stretch is congested and cars are moving slowly. If there are few cars per mile, the road is clear and cars are moving fast.

[Figure: a slow zone (dense traffic, density = HIGH, v = slow) next to a fast zone (sparse traffic, density = LOW, v = fast).]
Speed ∝ 1 / density

This simple relationship is the core of our analysis. When the same number of entities passes every point per unit time, more entities per unit length means each one is moving slower (each spends more time there). This is true for cars on a highway and it is true for RNA Polymerases on DNA.

In Our Lab: RNA Polymerases on DNA

In our experiment, instead of cars on a highway, we have RNA Polymerases (RNAPs) moving along a gene (a stretch of DNA). GRO-Seq (Global Run-On Sequencing) is our "helicopter": it counts how many polymerases are at each base-pair position at a snapshot in time.

[Figure: GRO-Seq reads along a gene from the TSS to the 3' end. The pause peak is a slow zone (high reads = slow); the gene body is a fast zone (low reads = fast).]
v(p) = C / ρ(p) where ρ(p) = read density at position p

More reads at position p means polymerases spend more time there, which means they are moving slower. By computing 1/density at every position along the gene, we get a 1 bp resolution speed profile. The constant C is calibrated so that the mean speed matches the expected ~2 kb/min.
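
A minimal sketch of the calculation above, in plain Python. The function name `speed_profile`, the pseudocount (added so positions with zero reads do not divide by zero), the toy read counts, and the use of an arithmetic mean over positions to calibrate C are all assumptions of this sketch, not details taken from the pipeline.

```python
def speed_profile(density, mean_speed_kb_per_min=2.0, pseudocount=1.0):
    """Return (v, C) where v[p] = C / (density[p] + pseudocount).

    C is chosen so that the arithmetic mean of v over all positions
    equals mean_speed_kb_per_min (assumed calibration rule).
    """
    # Inverse density at each position: high coverage gives a small value.
    inv = [1.0 / (d + pseudocount) for d in density]
    mean_inv = sum(inv) / len(inv)
    # Calibrate C so that mean(C * inv) == target mean speed.
    c = mean_speed_kb_per_min / mean_inv
    return [c * x for x in inv], c


# Toy read counts: a pause peak on the left, a sparse gene body on the right.
density = [40, 35, 10, 4, 5, 3, 4, 6]
v, c = speed_profile(density)
for p, (d, speed) in enumerate(zip(density, v)):
    print(f"p={p}  reads={d:2d}  v={speed:.2f} kb/min")
```

The position with the most reads gets the lowest speed, and the profile averages to the 2 kb/min target by construction.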

Why 47 dB? The Algebraic Diversity Processing Gain

Each individual position p gives us one noisy measurement of speed. Noise from sequencing, mapping, and biological variability limits accuracy.

Key insight: A gene 50,000 bp long gives us 50,000 measurements simultaneously. Averaging N independent noisy measurements reduces noise variance by a factor of N (noise amplitude by √N). In decibels: 10·log₁₀(N).

SNR Improvement from Averaging (noise level in dB, relative to a single measurement)
N = 1: 0 dB
N = 100: −20 dB
N = 1,000: −30 dB
N = 50,000: −47 dB
AD gain = 10 · log₁₀(M) dB (M = gene length in bp)

For a 50 kb gene: 10 · log₁₀(50,000) ≈ 47 dB. With independent noise at each position, this is a 50,000× reduction in noise variance (about 224× in noise amplitude) compared to measuring at just one position. This is the Algebraic Diversity (AD) processing gain: the mathematical benefit of using all M positions simultaneously via the cyclic group Z_M.
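
The arithmetic can be checked with a short sketch. It computes the gain formula and then simulates averaging N independent unit-variance Gaussian samples to confirm that the variance of the average falls by about a factor of N. The function names, the Gaussian noise model, and the trial count are assumptions made for illustration; real coverage noise at adjacent base pairs need not be independent.

```python
import math
import random
import statistics


def ad_gain_db(m):
    """Processing gain 10*log10(M) in dB for M averaged positions."""
    return 10.0 * math.log10(m)


def variance_of_mean(n, trials=2000, seed=0):
    """Empirical variance of the mean of n independent N(0, 1) samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


gain_50kb = ad_gain_db(50_000)
var_100 = variance_of_mean(100)
print(f"gain for M = 50,000: {gain_50kb:.2f} dB")
print(f"variance of mean, N = 100: {var_100:.4f} (theory: 0.01)")
```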

The Z_M Group Average β€” What It Actually Computes

The cyclic group Z_M consists of M cyclic shifts of our coverage vector. Averaging the outer products of all M shifted vectors gives a circulant group-averaged covariance matrix whose eigenvalues are the DFT power spectrum |X[f]|², scaled by 1/M.

Coverage vector: x = (x₀, x₁, …, x_{M−1}); cyclic shifts: P⁰x, P¹x, P²x, P³x, …, P^{M−1}x
R̂ = (1/M) · Σₖ (Pᵏx)(Pᵏx)ᵀ
FFT(x)[0] = DC component = M × mean coverage (the group average, up to the factor M)
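
As a numerical check of the relationship between the group-averaged covariance and the DFT, the sketch below builds R̂ from all M cyclic shifts of a small toy vector and verifies that each DFT basis vector is an eigenvector with eigenvalue |X[f]|²/M. It uses a naive O(M²) DFT in plain Python so it has no dependencies; the toy vector and all function names are assumptions of this sketch.

```python
import cmath


def dft(x):
    """Naive unnormalised DFT: X[f] = sum_n x[n] * exp(-2*pi*i*f*n/M)."""
    m = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * f * n / m) for n in range(m))
        for f in range(m)
    ]


def group_averaged_covariance(x):
    """R[i][j] = (1/M) * sum_k x[(i+k) % M] * x[(j+k) % M]."""
    m = len(x)
    return [
        [sum(x[(i + k) % m] * x[(j + k) % m] for k in range(m)) / m
         for j in range(m)]
        for i in range(m)
    ]


x = [5.0, 9.0, 4.0, 2.0, 3.0, 1.0, 2.0, 2.0]  # toy coverage vector
m = len(x)
R = group_averaged_covariance(x)
X = dft(x)

# For each frequency f, check R @ w_f == (|X[f]|^2 / M) * w_f,
# where w_f[j] = exp(2*pi*i*f*j/M) is the f-th DFT basis vector.
residuals = []
for f in range(m):
    w = [cmath.exp(2j * cmath.pi * f * j / m) for j in range(m)]
    Rw = [sum(R[i][j] * w[j] for j in range(m)) for i in range(m)]
    lam = abs(X[f]) ** 2 / m
    residuals.append(max(abs(Rw[i] - lam * w[i]) for i in range(m)))

mean_coverage = X[0].real / m  # DC bin divided by M
print("largest eigen-equation residual:", max(residuals))
print("mean coverage from DC bin:", mean_coverage)
```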

Each shift P^k x represents "what would the coverage profile look like if we could observe the gene starting from position k?" Averaging all M shifts gives an unbiased estimate with variance reduced by M. The DFT computes all of this in O(M log M) time via the Fast Fourier Transform, effectively "for free." The spectral concentration ψ = max(|X[f]|²) / Σ|X[f]|² measures how coherent the coverage wave is, and correlates with elongation rate (Pearson r = 0.557).
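
A small sketch of the spectral concentration ψ as defined above. Whether the DC bin (f = 0) is included in the maximum and the sum is not stated here, so the `include_dc` flag is an assumption of this sketch, as are the function names and the toy signals. A naive DFT is used to keep the example dependency-free; an FFT would be used at real gene lengths.

```python
import cmath
import math


def power_spectrum(x):
    """|X[f]|^2 for every DFT bin f, computed with a naive O(M^2) DFT."""
    m = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * cmath.pi * f * n / m)
                for n in range(m))) ** 2
        for f in range(m)
    ]


def spectral_concentration(x, include_dc=True):
    """psi = max(|X[f]|^2) / sum(|X[f]|^2), optionally skipping f = 0."""
    power = power_spectrum(x)
    if not include_dc:
        power = power[1:]
    total = sum(power)
    return max(power) / total if total > 0 else 0.0


flat = [3.0] * 16                                   # constant coverage
wave = [math.cos(2 * math.pi * 2 * n / 16) for n in range(16)]  # pure wave

psi_flat = spectral_concentration(flat)
psi_wave = spectral_concentration(wave, include_dc=False)
print(f"psi (flat, DC included): {psi_flat:.3f}")
print(f"psi (pure wave, DC excluded): {psi_wave:.3f}")
```

For a real-valued signal every nonzero frequency has a mirror image at M − f, so a single pure wave reaches ψ = 0.5 over the full spectrum, while constant coverage puts all power in the DC bin.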

Research Roadmap: Four Phases

1 Speed ✓ · 2 Rephase ⏳ · 3 Epigenomics ⏳ · 4 RNA Folding ⏳
1. Estimate v(p) at every base pair (DONE ✓)
Three methods: A (C/ρ₄₀ₘ), B (C/(ρ₄₀ₘ−ρ₀ₘ)), groHMM wave-front. Best Pearson r = 0.557 (ψ), best Spearman ρ = 0.603 (Method A).
2. Temporal rephasing: genomic position → transcription time
Compute t(p) = ∫₀ᵖ dp′/v(p′), which maps the genomic coordinate axis to a time coordinate axis. This enables direct comparison of dynamics across genes of different lengths and speeds. Run: scripts/11_temporal_rephase.py
3. Epigenomic overlay: methylation, H3K36me3, CTCF, …
Download ENCODE ChIP-seq/WGBS data for MCF-7. Correlate each mark with the v(p) profile. Do slow regions have more H3K27me3? Do CTCF sites cause speed bumps? Run: scripts/10_download_epigenomics.py
4. RNA folding: does slow speed give more co-transcriptional folding time?
Slower elongation at a given position means the nascent RNA has more time to fold before the next nucleotide is added. Test whether known RNA structure elements co-locate with slow-speed regions.
⚑
Select a gene from the sidebar
Choose any gene to view its instantaneous RNAPII speed profile at 1 bp resolution, computed by three independent methods.
🧬 Epigenomic Marks: Awaiting Data
⏳
Epigenomic data not yet downloaded
Six epigenomic marks (DNA methylation, four histone modifications, and CTCF binding) will be overlaid with speed profiles once downloaded. These marks can reveal whether slow RNAPII speed correlates with repressive marks or CTCF binding.
python3 scripts/10_download_epigenomics.py
DNA methylation · H3K36me3 · H3K27ac · H3K4me3 · H3K27me3 · CTCF
📈
Select a gene from the sidebar
View raw GRO-Seq coverage at 0m and 40m time points, plus per-timepoint speed curves from Method C.
Method Comparison: Pearson r & Spearman ρ vs Danko et al. 2013
Method Pearson r p-value Spearman ρ n
All Genes: groHMM rate vs Danko et al. (colored by ψ-A)
Pearson r = 0.514 · Spearman ρ = 0.601 · n = 81 genes | Points colored by AD spectral concentration ψ (Method A)
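
For readers unfamiliar with the two statistics used in these comparisons, here is a dependency-free sketch of Pearson r and Spearman ρ (Pearson r on average ranks). The input values are synthetic and chosen only to show how the two statistics differ; they are not rates from this benchmark, and the function names are illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson r of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Synthetic example: a monotonic but non-linear relationship.
estimated = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = [1.0, 4.0, 9.0, 16.0, 25.0]
r = pearson(estimated, reference)
rho = spearman(estimated, reference)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Spearman ρ depends only on rank order, so it reaches 1 for any monotonic relationship, while Pearson r is below 1 unless the relationship is linear.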
All Genes: Method A vs Method B speed (colored by ψ-A)
Methods A and B give nearly identical speeds, consistent with E2 uniformly amplifying coverage (ρ₀ₘ ≈ const · ρ₄₀ₘ within the wave). Both achieve Pearson r = 0.288 vs Danko.
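
The reasoning can be made concrete with a toy example: if ρ₀ₘ is exactly a constant multiple of ρ₄₀ₘ, then ρ₄₀ₘ − ρ₀ₘ is also a constant multiple of ρ₄₀ₘ, and the constant is absorbed when C is recalibrated. The sketch assumes both methods calibrate C to the same mean speed; the densities, the factor `k`, and the function name are made up for illustration.

```python
def calibrated_speed(density, mean_speed=2.0):
    """v = C / density, with C set so the mean of v equals mean_speed."""
    inv = [1.0 / d for d in density]
    c = mean_speed * len(inv) / sum(inv)
    return [c * x for x in inv]


rho_40m = [50.0, 20.0, 8.0, 5.0, 4.0, 6.0]  # toy 40 min coverage
k = 0.3                                      # assumed uniform scaling factor
rho_0m = [k * r for r in rho_40m]            # 0 min coverage = k * 40 min

speed_a = calibrated_speed(rho_40m)                       # Method A: C / rho_40m
speed_b = calibrated_speed(
    [r40 - r0 for r40, r0 in zip(rho_40m, rho_0m)]        # Method B: C / (rho_40m - rho_0m)
)

max_diff = max(abs(a - b) for a, b in zip(speed_a, speed_b))
print("largest A vs B difference:", max_diff)
```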
🎬 RNAPII Traversal Simulator: elongation at estimated v(p) with co-transcriptional RNA folding
Select gene in sidebar → click ▶ Play
🔬 RNA FOLD VIEW
How to read this animation
🔵 Blue circle = RNA Pol II
Moves along the gene body at the instantaneous speed v(p) = C / ρ₄₀ₘ(p) estimated from GRO-seq read coverage. Faster where coverage is low (polymerase zipped through), slower where coverage piled up (it dwelled there).
🟣 Wavy purple line = Nascent RNA transcript
The RNA chain being synthesised behind the polymerase. Hairpin stem-loops (double lines + circle) appear at positions where v(p) was lowest: the polymerase lingered there, giving the RNA time to fold before being pulled into the exit channel.
🟠 Gene body colour = speed heatmap
Blue = slow elongation, orange/warm = fast. Derived directly from v(p). Epigenomic peaks (H3K36me3, H3K27ac, CTCF…) float above the bar if ENCODE data is loaded for this gene.
📊 Speed chart (bottom panel)
Full v(p) curve across the gene. The amber cursor tracks the polymerase in real time. Use the speed slider to compress 40 min of transcription into seconds. Stem-loop count and position stats shown in the stats box.
⏱ Temporal Rephasing: RNAPII Speed in Transcription Time
Coordinate transform t(p) = ∫₀ᵖ dp′/v(p′). All genes aligned to [0, 40 min].
Multi-Gene Speed Profiles: Transcription Time Axis
Each trace = one gene. Speed normalized within gene (0–1). Selected gene highlighted in amber. Colour gradient: blue = slow gene, red = fast gene (by Danko rate).
About Temporal Rephasing
Why rephase? Genomic position is not time: a fast gene covers 100 kb in 40 min while a slow gene covers 60 kb. Aligning by transcription time t(p) = ∫dp/v(p) removes this compression artifact and enables direct comparison of molecular events (nucleosome encounters, co-transcriptional folding onset, pause sites) across genes of different speeds.

Method: v(p) from Method A (C/ρ₄₀ₘ), sparse grid integrated via trapezoidal rule, normalised so t(wave_40m) = 40 min. Epigenomic marks interpolated onto uniform 500-point time grid [0, 40] min.
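
The steps in the method note can be sketched as follows. This is not the code in scripts/11_temporal_rephase.py: the function names, the toy grid, and the assumption that the last grid point is the 40 min wave front are all illustrative. Speeds must be strictly positive so that time increases monotonically with position.

```python
def rephase(positions, speeds, total_minutes=40.0):
    """Map positions to transcription time t(p) = integral of dp / v(p).

    Uses the trapezoidal rule on a sparse grid, then rescales so the
    last grid point (assumed to be the 40 min wave front) lands exactly
    at total_minutes.
    """
    inv = [1.0 / v for v in speeds]
    t = [0.0]
    for i in range(1, len(positions)):
        dp = positions[i] - positions[i - 1]
        t.append(t[-1] + 0.5 * (inv[i] + inv[i - 1]) * dp)
    scale = total_minutes / t[-1]
    return [ti * scale for ti in t]


def resample_uniform(t, values, n_points=500, total_minutes=40.0):
    """Linearly interpolate values(t) onto a uniform time grid."""
    grid = [total_minutes * k / (n_points - 1) for k in range(n_points)]
    out = []
    j = 0
    for g in grid:
        # Advance to the segment [t[j], t[j+1]] that contains g.
        while j + 1 < len(t) - 1 and t[j + 1] < g:
            j += 1
        w = (g - t[j]) / (t[j + 1] - t[j])
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return grid, out


# Toy gene: slow first quarter, fast remainder.
positions = [0, 1000, 2000, 3000, 4000]   # bp
speeds = [0.5, 0.5, 2.0, 2.0, 2.0]        # kb/min (only ratios matter here)

t = rephase(positions, speeds)
grid, speed_on_grid = resample_uniform(t, speeds)
print("t(p) in minutes:", [round(ti, 2) for ti in t])
```

In this toy gene the first quarter of the genomic length takes almost half of the 40 minutes, which is the stretching that rephasing is meant to expose.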