The Big Idea: It's Like Highway Traffic
Imagine a long highway. Helicopters hover above and count cars at every mile marker. If there are many cars per mile, that stretch is congested: cars are moving slowly. If there are few cars per mile, the road is clear and cars are moving fast.
This simple relationship is the core of our analysis. When the flow is steady, more entities per unit length means each entity is moving more slowly (they spend more time there). This is true for highway cars and it's true for RNA Polymerases on DNA.
In Our Lab: RNA Polymerases on DNA
In our experiment, instead of cars on a highway, we have RNA Polymerases (RNAPs) moving along a gene (a stretch of DNA). GRO-seq (Global Run-On Sequencing) is our "helicopter": it counts how many polymerases are at each base-pair position at a snapshot in time.
More reads at position p mean polymerases spend more time there, which means they are moving more slowly. By computing v(p) = C / density(p) at every position along the gene, we get a speed profile at 1 bp resolution. The constant C is calibrated so that the mean speed matches the expected ~2 kb/min.
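The 1/density calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `speed_profile`, the toy coverage vector, and the `pseudocount` added to avoid dividing by zero at uncovered positions are all hypothetical, and "mean speed" is taken here as the arithmetic mean over positions.

```python
import numpy as np

def speed_profile(coverage, mean_speed_bp_per_min=2000.0, pseudocount=1.0):
    """Return v(p) = C / density(p), with C set so that mean(v) equals the target.

    The pseudocount is an assumption of this sketch (not from the source);
    it keeps positions with zero reads from producing an infinite speed.
    """
    density = np.asarray(coverage, dtype=float) + pseudocount
    inverse_density = 1.0 / density
    # Calibrate C so the arithmetic mean of the profile is ~2 kb/min.
    c = mean_speed_bp_per_min / inverse_density.mean()
    return c * inverse_density

# Toy read counts per position (made-up numbers, for illustration only).
coverage = np.array([0, 2, 8, 8, 1, 0, 3, 12, 4, 1])
v = speed_profile(coverage)
print(np.round(v, 1))
```

The position with the most reads (index 7) gets the lowest speed and the empty positions get the highest, which is the behaviour the highway analogy describes.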
Why 47 dB? The Algebraic Diversity Processing Gain
Each individual position p gives us one noisy measurement of speed.
Noise from sequencing, mapping, and biological variability limits accuracy.
Key insight: a gene 50,000 bp long gives us 50,000 measurements simultaneously.
Averaging N independent noisy measurements reduces noise variance by a factor of N (noise amplitude by √N). In decibels: 10·log₁₀(N).
For a 50 kb gene: 10 · log₁₀(50,000) ≈ 47 dB. This corresponds to a 50,000× reduction in noise variance (about 224× in noise amplitude) compared with measuring at just one position, provided the per-position noise is independent. This is the Algebraic Diversity (AD) processing gain: the mathematical benefit of using all M positions simultaneously via the cyclic group Z_M.
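As a quick check of the arithmetic, the gain for a given number of positions can be computed directly. The helper name `processing_gain_db` is hypothetical; the formula is the 10·log₁₀(N) expression from the text.

```python
import math

def processing_gain_db(n_positions):
    """Gain in dB from averaging n independent measurements: 10 * log10(n)."""
    return 10.0 * math.log10(n_positions)

gain_db = processing_gain_db(50_000)       # variance (power) gain in decibels
amplitude_factor = math.sqrt(50_000)       # corresponding noise-amplitude reduction

print(round(gain_db, 2))           # 46.99
print(round(amplitude_factor, 1))  # 223.6
```

The two numbers describe the same improvement: a factor of 50,000 in variance is a factor of about 224 in amplitude.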
The Z_M Group Average: What It Actually Computes
The cyclic group Z_M consists of M cyclic shifts of our coverage vector. Applying it gives a group-averaged covariance matrix whose eigenvalues are, up to normalisation, exactly the DFT power spectrum.
Each shift P^k x represents "what would the coverage profile look like if we could observe the gene starting from position k?" Averaging all M shifts gives an unbiased estimate with variance reduced by M. The DFT computes all of this in O(M log M) time via the Fast Fourier Transform, effectively "for free." The spectral concentration ρ = max(|X[f]|²) / Σ|X[f]|² measures how coherent the coverage wave is, and correlates with elongation rate (Pearson r = 0.557).
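The link between the shift average and the DFT can be checked numerically on a small vector. The sketch below is illustrative only: the toy vector `x`, the 1/M normalisation of the averaged matrix, and the function name `spectral_concentration` are assumptions, and the concentration is computed literally from the formula above, with the zero-frequency (DC) bin included because the text does not say whether it is removed.

```python
import numpy as np

def spectral_concentration(x):
    """max |X[f]|^2 / sum |X[f]|^2, computed over all DFT bins (DC included)."""
    power = np.abs(np.fft.fft(np.asarray(x, dtype=float))) ** 2
    total = power.sum()
    return power.max() / total if total > 0 else 0.0

# Toy coverage vector (made-up values).
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
M = len(x)

# Rows are the M cyclic shifts P^k x, k = 0..M-1.
shifts = np.stack([np.roll(x, k) for k in range(M)])

# Group average (1/M) * sum_k (P^k x)(P^k x)^T, which is a circulant matrix.
group_avg = shifts.T @ shifts / M

# Its eigenvalues match the DFT power spectrum divided by M.
eigenvalues = np.sort(np.linalg.eigvalsh(group_avg))
power_over_m = np.sort(np.abs(np.fft.fft(x)) ** 2 / M)
print(np.allclose(eigenvalues, power_over_m))

# A pure cosine splits its power between two mirror-image bins.
wave = np.cos(2 * np.pi * 3 * np.arange(64) / 64)
print(spectral_concentration(wave))
```

Building the M×M matrix is only for verification; in practice the FFT gives the same spectrum without ever forming it, which is where the O(M log M) cost comes from.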
Research Roadmap: Four Phases
scripts/11_temporal_rephase.py
scripts/10_download_epigenomics.py
Moves along the gene body at the instantaneous speed v(p) = C / ρ(p), where ρ(p) is the read density estimated from GRO-seq coverage. Faster where coverage is low (polymerase zipped through), slower where coverage piled up (it dwelled there).
The RNA chain being synthesised behind the polymerase. Hairpin stem-loops (double lines + circle) appear at positions where v(p) was lowest: the polymerase lingered there, giving the RNA time to fold before being pulled into the exit channel.
Blue = slow elongation, orange/warm = fast. Derived directly from v(p). Epigenomic peaks (H3K36me3, H3K27ac, CTCF…) float above the bar if ENCODE data is loaded for this gene.
Full v(p) curve across the gene. The amber cursor tracks the polymerase in real time. Use the speed slider to compress 40 min of transcription into seconds. Stem-loop count and position stats shown in the stats box.
Method: v(p) from Method A (C/ρ), sparse grid integrated via the trapezoidal rule, normalised so t(wave_40m) = 40 min. Epigenomic marks interpolated onto a uniform 500-point time grid over [0, 40] min.
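The position-to-time mapping described in this method note can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function names, the toy positions, speeds and mark values are hypothetical, and the last grid position is assumed to be the 40-minute wavefront (wave_40m).

```python
import numpy as np

def position_to_time(positions, speeds, total_minutes=40.0):
    """Cumulative transit time t(p) = integral of dp / v(p), trapezoidal rule.

    The result is rescaled so that the last position maps to total_minutes.
    Assumes positions are increasing and all speeds are positive.
    """
    positions = np.asarray(positions, dtype=float)
    slowness = 1.0 / np.asarray(speeds, dtype=float)
    # Trapezoid area for each interval between neighbouring grid points.
    steps = np.diff(positions) * 0.5 * (slowness[1:] + slowness[:-1])
    t = np.concatenate(([0.0], np.cumsum(steps)))
    return t * (total_minutes / t[-1])

def marks_on_time_grid(times, mark_signal, n_points=500, total_minutes=40.0):
    """Linearly interpolate a per-position signal onto a uniform time grid."""
    grid = np.linspace(0.0, total_minutes, n_points)
    return grid, np.interp(grid, times, mark_signal)

# Toy sparse grid: positions in bp, speeds in bp/min, one mark value per position.
positions = np.array([0, 10_000, 30_000, 60_000, 80_000])
speeds = np.array([2000.0, 1000.0, 2500.0, 1500.0, 2000.0])
marks = np.array([0.0, 1.0, 0.0, 2.0, 0.0])

t = position_to_time(positions, speeds)
grid, marks_t = marks_on_time_grid(t, marks)
print(np.round(t, 2))
```

The trapezoid is written out by hand rather than calling a library integrator so the sketch does not depend on a particular NumPy version. Because every speed is positive, `t` is strictly increasing, which is what `np.interp` needs.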