Genome Interval Arithmetic in R

Read and manipulate genome intervals and signals. Provides functionality similar to command-line tool suites within R, enabling interactive analysis and visualization of genome-scale data.



valr provides tools to read and manipulate genome intervals and signals, similar to the BEDtools suite. valr enables analysis in the R/RStudio environment, leveraging modern R tools in the tidyverse for a terse, expressive syntax. Compute-intensive algorithms are implemented in Rcpp/C++, and many methods take advantage of the speed and grouping capability provided by dplyr.

Installation

The latest stable version can be installed from CRAN:

install.packages('valr')

The latest development version can be installed from GitHub:

devtools::install_github('rnabioco/valr')

Why valr?

Why another tool set for interval manipulations? Based on our experience teaching genome analysis, we were motivated to develop interval arithmetic software that facilitates genome analysis in a single environment (RStudio), eliminating the need to master both command-line and exploratory analysis tools.

Note: valr can currently be used for analysis of pre-processed data in BED and related formats. We plan to support BAM and VCF files soon via tabix indexes.

Familiar tools, natively in R

The functions in valr have similar names to their BEDtools counterparts, and so will be familiar to users coming from the BEDtools suite. Unlike other tools that wrap BEDtools and write temporary files to disk, valr tools run natively in memory. Similar to pybedtools, valr has a terse syntax:

library(valr)
library(dplyr)
 
snps <- read_bed(valr_example('hg19.snps147.chr22.bed.gz'), n_fields = 6)
genes <- read_bed(valr_example('genes.hg19.chr22.bed.gz'), n_fields = 6)
 
# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# find distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)
 
nearby %>%
  select(starts_with('name'), .overlap, .dist) %>%
  filter(abs(.dist) < 5000)
#> # A tibble: 1,047 x 4
#>    name.x      name.y   .overlap .dist
#>    <chr>       <chr>       <int> <int>
#>  1 rs530458610 P704P           0  2579
#>  2 rs2261631   P704P           0  -268
#>  3 rs570770556 POTEH           0  -913
#>  4 rs538163832 POTEH           0  -953
#>  5 rs190224195 POTEH           0 -1399
#>  6 rs2379966   DQ571479        0  4750
#>  7 rs142687051 DQ571479        0  3558
#>  8 rs528403095 DQ571479        0  3309
#>  9 rs555126291 DQ571479        0  2745
#> 10 rs5747567   DQ571479        0 -1778
#> # ... with 1,037 more rows

Visual documentation

valr includes helpful glyphs to illustrate the results of specific operations, similar to those found in the BEDtools documentation. For example, bed_glyph() illustrates the result of intersecting x and y intervals with bed_intersect():

library(valr)
 
x <- trbl_interval(
  ~chrom, ~start, ~end,
  'chr1', 25,     50,
  'chr1', 100,    125
)
 
y <- trbl_interval(
  ~chrom, ~start, ~end,
  'chr1', 30,     75
)
 
bed_glyph(bed_intersect(x, y))
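
To make the arithmetic behind the glyph concrete, here is a minimal base R sketch of the overlap computed for the x and y intervals above. It assumes half-open BED-style coordinates, and `intersect_sketch()` is a hypothetical helper written for illustration; it is not valr's implementation and does not reproduce the columns returned by bed_intersect().

```r
# Base R sketch (not valr's implementation): pairwise overlap of
# half-open intervals [start, end) that share a chromosome.
x <- data.frame(chrom = "chr1", start = c(25, 100), end = c(50, 125))
y <- data.frame(chrom = "chr1", start = 30, end = 75)

# Hypothetical helper: pair every x interval with every y interval on the
# same chromosome, then keep the pairs whose overlap is positive.
intersect_sketch <- function(x, y) {
  pairs <- merge(x, y, by = "chrom", suffixes = c(".x", ".y"))
  pairs$start <- pmax(pairs$start.x, pairs$start.y)  # overlap begins at the later start
  pairs$end <- pmin(pairs$end.x, pairs$end.y)        # overlap ends at the earlier end
  pairs$overlap <- pairs$end - pairs$start
  pairs[pairs$overlap > 0, c("chrom", "start", "end", "overlap")]
}

res <- intersect_sketch(x, y)
print(res)
```

Only the first x interval overlaps y; the second x interval yields a negative overlap and is dropped.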

Reproducible reports

valr can be used in RMarkdown documents to generate reproducible workflows for data processing. Because computations in valr are fast, it can be used for exploratory analysis with RMarkdown, and for interactive analysis using shiny.

Remote databases

Remote databases can be accessed with db_ucsc() (to access the UCSC Browser) and db_ensembl() (to access Ensembl databases).

library(valr)
library(dplyr) # provides tbl()

# access the `refGene` tbl on the `hg38` assembly
ucsc <- db_ucsc('hg38')
tbl(ucsc, 'refGene')

API

Function names are similar to their BEDtools counterparts, with some additions.

Data types

  • Create new interval sets with tbl_interval() and tbl_genome(). Coerce existing GenomicRanges::GRanges objects with as.tbl_interval().

Reading data

  • Read BED and related files with read_bed(), read_bed12(), read_bedgraph(), read_narrowpeak() and read_broadpeak().

  • Read genome files containing chromosome name and size information with read_genome().

  • Load VCF files with read_vcf().

  • Access remote databases with db_ucsc() and db_ensembl().

Transforming single interval sets

  • Adjust interval coordinates with bed_slop() and bed_shift(), and create new flanking intervals with bed_flank().

  • Combine nearby intervals with bed_merge() and identify nearby intervals with bed_cluster().

  • Generate intervals not covered by a query with bed_complement().

  • Order intervals with bed_sort().
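
As an illustration of what combining nearby intervals involves, the following base R sketch merges overlapping or touching intervals on a single chromosome. `merge_sketch()` is a hypothetical helper, and treating touching intervals as mergeable is an assumption of this sketch; bed_merge() itself is implemented differently and handles chromosomes and grouping.

```r
# Base R sketch (not valr's implementation): merge overlapping or touching
# half-open intervals that all lie on one chromosome.
merge_sketch <- function(start, end) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # Largest end coordinate seen so far, in sorted order
  run_end <- cummax(end)
  # A new merged group begins where a start lies beyond every previous end
  new_group <- c(TRUE, start[-1] > run_end[-length(run_end)])
  grp <- cumsum(new_group)
  data.frame(
    start = as.vector(tapply(start, grp, min)),
    end = as.vector(tapply(end, grp, max))
  )
}

merged <- merge_sketch(start = c(1, 40, 100, 120), end = c(50, 60, 125, 130))
print(merged)
```

The four input intervals collapse into two merged intervals, one per run of overlapping input.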

Comparing multiple interval sets

  • Find overlaps between sets of intervals with bed_intersect().

  • Apply functions to overlapping sets of intervals with bed_map().

  • Remove intervals based on overlaps with bed_subtract().

  • Find overlapping intervals within a window with bed_window().

  • Find closest intervals independent of overlaps with bed_closest().

Randomizing intervals

  • Generate random intervals with bed_random().

  • Shuffle the coordinates of intervals with bed_shuffle().

  • Sample input intervals with dplyr::sample_n() and dplyr::sample_frac().

Interval statistics

  • Calculate significance of overlaps between sets of intervals with bed_fisher() and bed_projection().

  • Quantify relative and absolute distances between sets of intervals with bed_reldist() and bed_absdist().

  • Quantify extent of overlap between sets of intervals with bed_jaccard().
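
For intuition about the overlap statistic, the Jaccard index of two interval sets is the number of base pairs they share divided by the number of base pairs covered by either set. The base R sketch below computes it by enumerating covered positions, which is only practical for toy examples; `covered()` is a hypothetical helper and this is not how bed_jaccard() is implemented.

```r
# Base R sketch (not valr's implementation): Jaccard index of two small
# interval sets on one chromosome, using half-open [start, end) coordinates.

# Hypothetical helper: expand intervals into the set of covered positions.
covered <- function(start, end) {
  unique(unlist(Map(function(s, e) seq(s, e - 1), start, end)))
}

x_pos <- covered(start = c(25, 100), end = c(50, 125))
y_pos <- covered(start = 30, end = 75)

n_int <- length(intersect(x_pos, y_pos))   # base pairs covered by both sets
n_union <- length(union(x_pos, y_pos))     # base pairs covered by either set
jaccard <- n_int / n_union
print(jaccard)
```

Here 20 of the 75 covered base pairs are shared between the two sets.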

Utilities

  • Create features from BED12 files with create_introns(), create_tss(), create_utrs5(), and create_utrs3().

  • Visualize the actions of valr functions with bed_glyph().

  • Constrain intervals to a genome reference with bound_intervals().

  • Subdivide intervals with bed_makewindows().

  • Convert BED12 to BED6 format with bed12_to_exons().

  • Calculate spacing between intervals with interval_spacing().
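
Subdividing an interval can be pictured with a short base R sketch that cuts one half-open interval into fixed-size windows. `window_sketch()` and its `win_size` argument are hypothetical names, and truncating the final window at the interval end is a choice made for this sketch; it may not match the behavior of bed_makewindows().

```r
# Base R sketch (not valr's implementation): split one half-open interval
# [start, end) into consecutive windows of at most win_size bases.
window_sketch <- function(start, end, win_size) {
  win_start <- seq(start, end - 1, by = win_size)
  data.frame(
    start = win_start,
    end = pmin(win_start + win_size, end)  # last window is cut at the interval end
  )
}

wins <- window_sketch(start = 0, end = 10, win_size = 4)
print(wins)
```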

Related work

News

valr 0.4.0

Minor changes

  • All relevant tests from bedtools2 were ported into valr. Bugs identified in corner cases by new tests were fixed (#328 @raysinesis).

  • bed_jaccard() now works with grouped inputs (#216)

  • Update dplyr header files to v0.7

  • bed_intersect() and internal intersect_impl were refactored to enable return of non-intersecting intervals.

  • The genome argument to bed_makewindows() was deprecated and will produce a warning if used. Also error handling was added to check and warn if there are intervals smaller than the requested window size in makewindows_impl() (#312 @kriemo)

Bug fixes

  • Fixed off-by-one error in reported distances from bed_closest(). Reported distances now match the behavior of bedtools closest (#311).

  • bed_glyph() accepts trbl_intervals named other than x and y (#318).

  • bed_makewindows() now returns the number of windows specified by num_win when the input intervals are not evenly divisible into num_win, consistent with bedtools behavior.

  • The output of findOverlaps() is now sorted in subtract_impl() to prevent reporting intervals that should have been dropped when calling bed_subtract() (#316 @kriemo)

valr 0.3.1

Enhancements

  • A manuscript describing valr has been published in F1000Research.

  • New S3 generic as.tbl_interval() converts GenomicRanges::GRanges objects to tbl_interval.

  • New create_tss() for creating transcription start sites.

  • Improve documentation of interval statistics with more complex examples.

Minor changes

  • bed_sort() has been de-deprecated to reduce arrange calls in library code.

Bug fixes

  • bed_merge() now reports start/end columns if spec is provided (#288)

valr 0.3.0

Enhancements

  • New create_introns(), create_utrs5() and create_utrs3() functions for generating features from BED12 files.

  • Speed-ups in bed_makewindows() (~50x), bed_merge() (~4x), and bed_flank() (~4x) (thanks to @kriemo and @sheridar). Thanks to the sponsors of the Biofrontiers Hackathon for the caffeine underlying these improvements.

Bug fixes

  • Intervals from bed_random() are now sorted properly.

valr 0.2.0

Major changes

  • Package dplyr v0.5.0 headers with valr to remove dplyr LinkingTo dependency.

  • bed_intersect() now accepts multiple tbls for intersection (#220 @kriemo).

  • New tbl_interval() and tbl_genome() that wrap tibbles and enforce strict column naming. trbl_interval() and trbl_genome() are constructors that take tibble::tribble() formatting, and is.tbl_interval() and is.tbl_genome() are used to check for valid classes.

Minor changes

  • Intervals returned from bed_random() are sorted by chrom and start by default.

Bug fixes

  • Merge intervals in bed_jaccard() and use numeric values for calculation (fixes #204).

valr 0.1.2

Major changes

  • Deprecate bed_sort() in favor of using dplyr::arrange() explicitly (fixes #134).

Minor changes

  • add src/init.c that calls R_registerRoutines and R_useDynamicSymbols to address NOTE in r-devel

  • Deprecate dist parameter in bed_closest() in favor of using user supplied functions (#182 @kriemo)

  • Make .id values sequential across chroms in bed_cluster() output (#171)

  • Transfer repository to http://github.com/rnabioco/valr, update links and docs.

  • Move shiny app to new repo (http://github.com/rnabioco/valrdata).

  • Add Kent Riemondy to LICENSE file.

Bug fixes

  • bed_merge() now merges contained intervals (#177)

valr 0.1.1

Minor changes

  • test / vignette guards for Suggested RMySQL

  • fixed memory leak in absdist.cpp

  • fixed vignette entry names

valr 0.1.0

Major changes

  • initial release on CRAN


0.4.1 by Jay Hesselberth, a month ago


http://github.com/rnabioco/valr, http://rnabioco.github.io/valr


Report a bug at https://github.com/rnabioco/valr/issues


Browse source code at https://github.com/cran/valr


Authors: Jay Hesselberth [aut, cre] (<https://orcid.org/0000-0002-6299-179X>), Kent Riemondy [aut] (<https://orcid.org/0000-0003-0750-1273>)




MIT + file LICENSE license


Imports dplyr, rlang, readr, stringr, tibble, broom, ggplot2

Suggests knitr, rmarkdown, testthat, microbenchmark, covr, curl, RMySQL, purrr, tidyr, devtools, DT, cowplot, dbplyr, GenomicRanges, IRanges, S4Vectors

Linking to Rcpp, BH, plogr, bindrcpp

System requirements: C++11

