AD RNAPII Speed Profiles

Best Pearson r: 0.557 (AD spectral ψ)

Best Spearman ρ: 0.603 (Method A / groHMM)

Genes analyzed: hg19 · MCF-7 · E2

AD Processing Gain: ≤ 50 dB (10·log₁₀(M) per gene)

The Big Idea: It's Like Highway Traffic

Imagine a long highway. Helicopters hover above and count cars at every mile marker. If there are many cars per mile, that stretch is congested — cars are moving slowly. If there are few cars per mile, the road is clear and cars are moving fast.

Speed ∝ 1 / density

This simple relationship is the core of our analysis. At a steady flow rate, more entities per unit length means each entity is moving slower (each one spends more time in that stretch). This holds for cars on a highway and for RNA polymerases on DNA.

In Our Lab: RNA Polymerases on DNA

In our experiment, instead of cars on a highway, we have RNA Polymerases (RNAPs) moving along a gene (a stretch of DNA). GRO-Seq (Genomic Run-On Sequencing) is our "helicopter" — it counts how many polymerases are at each base-pair position at a snapshot in time.

v(p) = C / ρ(p) where ρ(p) = read density at position p

More reads at position p means polymerases spend more time there, which means they are moving slower. By computing 1/density at every position along the gene, we get a 1 bp resolution speed profile. The constant C is calibrated so that the mean speed matches the expected ~2 kb/min.
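The calibration step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function name `speed_profile`, the pseudocount that guards against division by zero at uncovered positions, and the choice to calibrate C on the arithmetic mean of v(p) are all assumptions made for the example.

```python
def speed_profile(density, mean_speed_bp_per_min=2000.0, pseudocount=1.0):
    """Return v(p) = C / rho(p), with C chosen so that mean(v) equals the target.

    density: read counts per position (hypothetical input).
    pseudocount: added to every position so zero coverage does not divide by zero
                 (an assumption of this sketch, not stated in the text above).
    """
    inverse = [1.0 / (d + pseudocount) for d in density]
    # Calibrate C so the arithmetic mean of v(p) matches the expected ~2 kb/min.
    c = mean_speed_bp_per_min * len(inverse) / sum(inverse)
    return [c * x for x in inverse]


# Toy coverage vector: a pile-up at index 3, sparse reads at indices 6 and 7.
coverage = [4, 4, 9, 19, 9, 4, 1, 1]
v = speed_profile(coverage)
slowest = v.index(min(v))  # position with the highest coverage
```

In this toy vector the slowest position is the one with the largest read count, and the mean of `v` equals the calibration target by construction.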

Why 47 dB? The Algebraic Diversity Processing Gain

Each individual position p gives us one noisy measurement of speed. Noise from sequencing, mapping, and biological variability limits accuracy.

Key insight: A gene 50,000 bp long gives us 50,000 measurements simultaneously. Averaging M independent noisy measurements reduces noise variance by a factor of M (noise amplitude by √M). In decibels: 10·log₁₀(M).

AD gain = 10 · log₁₀(M) dB (M = gene length in bp)

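For a 50 kb gene: 10 · log₁₀(50,000) ≈ 47 dB. This corresponds to a 50,000× reduction in noise variance (about 224× in noise amplitude) compared to measuring at just one position, provided the per-position noise is independent; noise that is correlated between neighbouring positions lowers the achievable gain, so the figure is an upper bound. This is the Algebraic Diversity (AD) processing gain: the mathematical benefit of using all M positions simultaneously via the cyclic group Z_M.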
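The arithmetic can be checked numerically. The sketch below computes the gain in decibels and compares the variance of single noisy measurements with the variance of means over M independent ones. The function names and the unit-variance Gaussian noise model are assumptions chosen for illustration; real GRO-Seq noise need not be independent or Gaussian.

```python
import math
import random
import statistics


def ad_gain_db(m):
    """Processing gain 10*log10(M) for M independent measurements."""
    return 10.0 * math.log10(m)


def empirical_variance_ratio(m, trials=2000, seed=0):
    """Variance of single draws divided by variance of means of m draws."""
    rng = random.Random(seed)
    single = [rng.gauss(0.0, 1.0) for _ in range(trials)]
    means = [sum(rng.gauss(0.0, 1.0) for _ in range(m)) / m for _ in range(trials)]
    return statistics.pvariance(single) / statistics.pvariance(means)


gain_50kb = ad_gain_db(50_000)         # about 46.99 dB
ratio = empirical_variance_ratio(100)  # close to 100, up to sampling error
```

A smaller M is used in the simulation only to keep the run short; the ratio should scale with M in the same way under the independence assumption.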

The Z_M Group Average — What It Actually Computes

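The cyclic group Z_M consists of the M cyclic shifts of our coverage vector. Averaging the outer products of all M shifted copies gives a group-averaged covariance matrix that is circulant, so its eigenvalues are the DFT power spectrum |X[f]|², up to a constant normalization factor.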

FFT(x)[0] = Σₚ x[p] = M · (mean coverage), so FFT(x)[0] / M = DC component = group average = mean coverage

Each shift P^k x represents "what would the coverage profile look like if we could observe the gene starting from position k?" Averaging all M shifts gives an unbiased estimate with variance reduced by M. The DFT computes all of this in O(M log M) time via the Fast Fourier Transform — effectively "for free." The spectral concentration ψ = max(|X[f]|²) / Σ|X[f]|² measures how coherent the coverage wave is, and correlates with elongation rate (Pearson r = 0.557).
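The link between the shift average and the power spectrum can be verified on a small vector. The sketch below uses a naive O(M²) DFT from the standard library so that it runs without dependencies; in practice an FFT routine such as `numpy.fft.fft` would take its place. The helper names and the optional `drop_dc` flag are hypothetical, and whether ψ should include the DC bin is not specified above.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform, X[f] = sum_n x[n] * exp(-2*pi*i*f*n/M)."""
    m = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * f * n / m) for n in range(m))
        for f in range(m)
    ]


def group_averaged_autocorr(x):
    """First column of the circulant matrix (1/M) * sum_k (P^k x)(P^k x)^T."""
    m = len(x)
    return [sum(x[n] * x[(n + j) % m] for n in range(m)) / m for j in range(m)]


def spectral_concentration(x, drop_dc=False):
    """psi = max |X[f]|^2 / sum |X[f]|^2; drop_dc is an illustrative option."""
    power = [abs(c) ** 2 for c in dft(x)]
    if drop_dc:
        power = power[1:]
    return max(power) / sum(power)


x = [3.0, 5.0, 2.0, 7.0, 4.0, 6.0, 1.0, 8.0]
X = dft(x)
# Eigenvalues of a circulant matrix are the DFT of its first column.
eigenvalues = [c.real for c in dft(group_averaged_autocorr(x))]
psi = spectral_concentration(x)
```

Each entry of `eigenvalues` agrees with `abs(X[f])**2 / len(x)`, and `X[0] / len(x)` equals the mean coverage.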

Research Roadmap — Four Phases

Estimate v(p) at every base pair — DONE ✓

Three methods: A (C/ρ₄₀ₘ), B (C/(ρ₄₀ₘ−ρ₀ₘ)), groHMM wave-front. Best Pearson r = 0.557 (ψ), best Spearman ρ = 0.603 (Method A).

Temporal rephasing: genomic position → transcription time

Compute t(p) = ∫₀ᵖ dp'/v(p'), which maps the genomic coordinate axis to a time coordinate axis. This enables direct comparison of dynamics across genes of different lengths and speeds. Run: scripts/11_temporal_rephase.py

Epigenomic overlay: methylation, H3K36me3, CTCF, …

Download ENCODE ChIP-seq/WGBS data for MCF-7. Correlate each mark with v(p) profile. Do slow regions have more H3K27me3? Do CTCF sites cause speed bumps? Run: scripts/10_download_epigenomics.py
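One way to put a number on "do slow regions have more of a given mark?" is a rank correlation between the mark signal and v(p) sampled at the same positions. The sketch below is a standard-library Spearman implementation with average ranks for ties. The function names and the toy values are hypothetical and are not taken from the ENCODE data or the project scripts.

```python
import statistics


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = (sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)) ** 0.5
    return num / den


def spearman(a, b):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy example: mark signal is highest exactly where speed is lowest.
speed = [2500.0, 2100.0, 900.0, 600.0, 1800.0, 2400.0]
mark = [0.1, 0.3, 2.0, 3.5, 0.6, 0.2]
rho = spearman(speed, mark)
```

A strongly negative ρ in real data would be consistent with the mark being enriched in slow regions; it would not by itself show that the mark causes the slowdown.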

RNA folding: does slow speed give more co-transcriptional folding time?

Slower elongation at a given position means the nascent RNA has more time to fold before the next nucleotide is added. Test whether known RNA structure elements co-locate with slow-speed regions.

⚡

Select a gene from the sidebar

Choose any gene to view its instantaneous RNAPII speed profile at 1 bp resolution, computed by three independent methods.

🧬 Epigenomic Marks — Awaiting Data

⏳

Epigenomic data not yet downloaded

Six histone and chromatin marks will be overlaid with speed profiles once downloaded. These marks can reveal whether slow RNAPII speed correlates with repressive marks or CTCF binding.

python3 scripts/10_download_epigenomics.py

DNA methylation · H3K36me3 · H3K27ac · H3K4me3 · H3K27me3 · CTCF

📈

Select a gene from the sidebar

View raw GRO-Seq coverage at 0m and 40m time points, plus per-timepoint speed curves from Method C.

Method Comparison — Pearson r & Spearman ρ vs Danko et al. 2013

Method	Pearson r	p-value	Spearman ρ	n

All Genes — groHMM rate vs Danko et al. (colored by ψ-A)

Pearson r = 0.514 · Spearman ρ = 0.601 · n = 81 genes | Points colored by AD spectral concentration ψ (Method A)

All Genes — Method A vs Method B speed (colored by ψ-A)

Shows that A and B are nearly identical, confirming E2 uniformly amplifies coverage (ρ₀ₘ ≈ const · ρ₄₀ₘ within wave). Both achieve Pearson r = 0.288 vs Danko.
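Why proportional coverage makes the two methods agree can be seen directly: if ρ₀ₘ = k · ρ₄₀ₘ, then ρ₄₀ₘ − ρ₀ₘ = (1 − k) · ρ₄₀ₘ, and the factor (1 − k) is absorbed into C when the mean speed is calibrated. The sketch below demonstrates this on toy numbers. It assumes that Method B calibrates C on the mean speed in the same way as Method A, which the text above does not state; the helper name and the values are illustrative.

```python
def calibrated_inverse(density, mean_speed=2000.0):
    """v(p) = C / density(p), with C set so that mean(v) equals mean_speed."""
    inverse = [1.0 / d for d in density]
    c = mean_speed * len(inverse) / sum(inverse)
    return [c * x for x in inverse]


rho_40 = [12.0, 8.0, 20.0, 5.0, 10.0]  # toy 40 min coverage
k = 0.25                               # hypothetical proportionality constant
rho_0 = [k * r for r in rho_40]        # 0 min coverage, proportional to 40 min

v_a = calibrated_inverse(rho_40)                                     # Method A
v_b = calibrated_inverse([a - b for a, b in zip(rho_40, rho_0)])     # Method B
max_difference = max(abs(a - b) for a, b in zip(v_a, v_b))
```

Any departure of the real ρ₀ₘ from strict proportionality would show up as a nonzero difference between the two profiles.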

🎬 RNAPII Traversal Simulator — elongation at estimated v(p) with co-transcriptional RNA folding

How to read this animation

🔵 Blue circle = RNA Pol II
Moves along the gene body at the instantaneous speed v(p) = C / ρ₄₀ₘ(p) estimated from GRO-seq read coverage. Faster where coverage is low (polymerase zipped through), slower where coverage piled up (it dwelled there).

🟣 Wavy purple line = Nascent RNA transcript
The RNA chain being synthesised behind the polymerase. Hairpin stem-loops (double lines + circle) appear at positions where v(p) was lowest: the polymerase lingered there, giving the RNA emerging from the exit channel more time to fold before further sequence was added.

🟠 Gene body colour = speed heatmap
Blue = slow elongation, orange/warm = fast. Derived directly from v(p). Epigenomic peaks (H3K36me3, H3K27ac, CTCF…) float above the bar if ENCODE data is loaded for this gene.

📊 Speed chart (bottom panel)
Full v(p) curve across the gene. The amber cursor tracks the polymerase in real time. Use the speed slider to compress 40 min of transcription into seconds. Stem-loop count and position stats are shown in the stats box.

⏱ Temporal Rephasing — RNAPII Speed in Transcription Time

Coordinate transform t(p) = ∫₀ᵖ dp′/v(p′). All genes aligned to [0, 40 min].

Multi-Gene Speed Profiles — Transcription Time Axis

Each trace = one gene. Speed normalized within gene (0–1). Selected gene highlighted in amber. Colour gradient: blue = slow gene, red = fast gene (by Danko rate).

About Temporal Rephasing

Why rephase? Genomic position is not time: in 40 min the polymerase wave covers 100 kb on a fast gene but only 60 kb on a slow one. Aligning by transcription time t(p) = ∫dp/v(p) removes this compression artifact and enables direct comparison of molecular events (nucleosome encounters, co-transcriptional folding onset, pause sites) across genes of different speeds.

Method: v(p) from Method A (C/ρ₄₀ₘ), sparse grid integrated via trapezoidal rule, normalised so t(wave_40m) = 40 min. Epigenomic marks interpolated onto uniform 500-point time grid [0, 40] min.
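The method described above can be sketched as follows. This is an illustrative reimplementation, not the contents of scripts/11_temporal_rephase.py: the function names are hypothetical, the last grid position is assumed to be the 40 min wave front, and the interpolation is plain linear.

```python
def rephase(positions, speeds, total_minutes=40.0):
    """Integrate dp / v(p) with the trapezoidal rule, then rescale to total_minutes."""
    t = [0.0]
    for i in range(1, len(positions)):
        dp = positions[i] - positions[i - 1]
        t.append(t[-1] + 0.5 * dp * (1.0 / speeds[i - 1] + 1.0 / speeds[i]))
    scale = total_minutes / t[-1]  # enforce t(wave front) = total_minutes
    return [ti * scale for ti in t]


def resample(times, values, n_points=500, total_minutes=40.0):
    """Linearly interpolate values(t) onto a uniform time grid [0, total_minutes]."""
    grid = [total_minutes * i / (n_points - 1) for i in range(n_points)]
    out = []
    j = 0
    for g in grid:
        # Advance to the segment [times[j], times[j + 1]] that contains g.
        while j + 2 < len(times) and times[j + 1] < g:
            j += 1
        w = (g - times[j]) / (times[j + 1] - times[j])
        w = min(max(w, 0.0), 1.0)  # clamp against rounding at the endpoints
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return grid, out


# Toy sparse grid: positions in bp, speeds in bp/min (hypothetical values).
positions = [0, 20_000, 40_000, 60_000, 80_000]
speeds = [1000.0, 2000.0, 4000.0, 2000.0, 1000.0]
t = rephase(positions, speeds)
grid, speed_on_grid = resample(t, speeds)
```

Because the toy speed profile is symmetric, the midpoint of the gene maps to t = 20 min, while the slow first quarter of the gene takes up the first third of the time axis.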