Read and manipulate genome intervals and signals. Provides functionality similar to command-line tool suites within R, enabling interactive analysis and visualization of genome-scale data.
valr provides tools to read and manipulate genome intervals and signals, similar to the
valr enables analysis in the R/RStudio environment, leveraging modern R tools in the
tidyverse for a terse, expressive syntax. Compute-intensive algorithms are implemented in
Rcpp/C++, and many methods take advantage of the speed and grouping capability provided by
The latest stable version can be installed from CRAN:
The latest development version can be installed from github:
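The two installation routes above can be sketched as follows. This is a minimal example, not necessarily the project's own instructions: the use of the `remotes` package for the GitHub install and the `use_dev` switch are assumptions made for illustration. The repository name `rnabioco/valr` is taken from the project links in this document.

```r
# Hypothetical switch: set to TRUE to install the development version instead
use_dev <- FALSE

if (use_dev) {
  # assumes the remotes package is already installed
  remotes::install_github("rnabioco/valr")
} else if (!requireNamespace("valr", quietly = TRUE)) {
  # stable release from CRAN, installed only if valr is not yet available
  install.packages("valr")
}
```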
Why another tool set for interval manipulations? Based on our experience teaching genome analysis, we were motivated to develop interval arithmetic software that facilitates genome analysis in a single environment (RStudio), eliminating the need to master both command-line and exploratory analysis tools.
valr can currently be used for analysis of pre-processed data in BED and related formats. We plan to support BAM and VCF files soon via tabix indexes.
The functions in
valr have similar names to their
BEDtools counterparts, and so will be familiar to users coming from the
BEDtools suite. Unlike other tools that wrap
BEDtools and write temporary files to disk,
valr tools run natively in memory. Similar to
valr has a terse syntax:
library(valr)
library(dplyr)

snps <- read_bed(valr_example('hg19.snps147.chr22.bed.gz'), n_fields = 6)
genes <- read_bed(valr_example('genes.hg19.chr22.bed.gz'), n_fields = 6)

# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# find distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)

nearby %>%
  select(starts_with('name'), .overlap, .dist) %>%
  filter(abs(.dist) < 5000)
#> # A tibble: 1,047 x 4
#>    name.x      name.y   .overlap .dist
#>    <chr>       <chr>       <int> <int>
#>  1 rs530458610 P704P           0  2579
#>  2 rs2261631   P704P           0  -268
#>  3 rs570770556 POTEH           0  -913
#>  4 rs538163832 POTEH           0  -953
#>  5 rs190224195 POTEH           0 -1399
#>  6 rs2379966   DQ571479        0  4750
#>  7 rs142687051 DQ571479        0  3558
#>  8 rs528403095 DQ571479        0  3309
#>  9 rs555126291 DQ571479        0  2745
#> 10 rs5747567   DQ571479        0 -1778
#> # ... with 1,037 more rows
valr includes helpful glyphs to illustrate the results of specific operations, similar to those found in the
BEDtools documentation. For example,
bed_glyph() illustrates the result of intersecting
y intervals with
library(valr)

x <- trbl_interval(
  ~chrom, ~start, ~end,
  'chr1', 25,     50,
  'chr1', 100,    125
)

y <- trbl_interval(
  ~chrom, ~start, ~end,
  'chr1', 30,     75
)

bed_glyph(bed_intersect(x, y))
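To make the glyph concrete, the same intersection can also be inspected as a table. This is a minimal sketch that assumes plain tibbles built with `tibble::tribble()` are accepted in place of `trbl_interval()`. That holds for recent valr releases but is an assumption for the version described here.

```r
library(valr)
library(tibble)

x <- tribble(
  ~chrom, ~start, ~end,
  'chr1', 25,     50,
  'chr1', 100,    125
)

y <- tribble(
  ~chrom, ~start, ~end,
  'chr1', 30,     75
)

# each output row pairs an x interval with a y interval it overlaps;
# .overlap holds the number of bases the two intervals share
res <- bed_intersect(x, y)
res
```

Only the first `x` interval (25 to 50) overlaps `y` (30 to 75), so a single row is expected, with an `.overlap` of 20. The `x` interval at 100 to 125 has no overlapping partner and is not reported.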
valr can be used in RMarkdown documents to generate reproducible work-flows for data processing. Because computations in
valr are fast, it can be used for exploratory analysis with
RMarkdown, and for interactive analysis using
Remote databases can be accessed with
db_ucsc() (to access the UCSC Browser) and
db_ensembl() (to access Ensembl databases).
# access the `refGene` tbl on the `hg38` assembly
ucsc <- db_ucsc('hg38')
tbl(ucsc, 'refGene')
Function names are similar to their BEDtools counterparts, with some additions.
tbl_genome(). Coerce existing
Read BED and related files with
Read genome files containing chromosome name and size information with
Load VCF files with
Access remote databases with
Adjust interval coordinates with
bed_shift(), and create new flanking intervals with
Combine nearby intervals with
bed_merge() and identify nearby intervals with
Generate intervals not covered by a query with
Order intervals with
Find overlaps between sets of intervals with
Apply functions to overlapping sets of intervals with
Remove intervals based on overlaps with
Find overlapping intervals within a window with
Find closest intervals independent of overlaps with
Generate random intervals with
Shuffle the coordinates of intervals with
Sample input intervals with
Calculate significance of overlaps between sets of intervals with
Quantify relative and absolute distances between sets of intervals with
Quantify extent of overlap between sets of intervals with
Create features from BED12 files with
Visualize the actions of valr functions with
Constrain intervals to a genome reference with
Subdivide intervals with
Convert BED12 to BED6 format with
Calculate spacing between intervals with
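As a small illustration of how one of these verbs behaves, the sketch below merges overlapping intervals with `bed_merge()`. The toy coordinates are invented for illustration, and building the input with `tibble::tribble()` rather than `trbl_interval()` is an assumption.

```r
library(valr)
library(tibble)

x <- tribble(
  ~chrom, ~start, ~end,
  'chr1', 1,      50,
  'chr1', 40,     100,
  'chr1', 200,    300
)

# the first two intervals overlap and collapse into one;
# the third is separate and is returned unchanged
merged <- bed_merge(x)
merged
```

With this input, two intervals are expected: 1 to 100 and 200 to 300.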
All relevant tests from bedtools2 were ported into valr. Bugs identified in corner cases by new tests were fixed (#328 @raysinesis).
bed_jaccard() now works with grouped inputs (#216)
Update dplyr header files to v0.7
bed_intersect() and internal
intersect_impl were refactored to enable return of non-intersecting intervals.
The genome argument to
bed_makewindows() was deprecated and will produce a warning if used. Error handling was also added to check and warn if there are intervals smaller than the requested window size in
makewindows_impl() (#312 @kriemo)
Fixed off-by-one error in reported distances from
bed_closest(). Distances reported now are the same as
bedtools closest behavior (#311).
trbl_intervals named other than
bed_makewindows() now returns the number of windows specified by
num_win when the input intervals are not evenly divisible into
num_win, consistent with
The output of
findOverlaps() is now sorted in
subtract_impl() to prevent reporting intervals that should have been dropped when calling
bed_subtract() (#316 @kriemo)
A manuscript describing valr has been published in F1000Research.
New S3 generic
GenomicRanges::GRanges objects to
create_tss() for creating transcription start sites.
Improve documentation of interval statistics with more complex examples.
bed_sort() has been de-deprecated to reduce
arrange calls in library code.
bed_merge() now reports start/end columns if spec is provided (#288)
create_utrs3() functions for generating features from BED12 files.
bed_merge() (~4x), and
bed_flank() (~4x) (thanks to @kriemo and @sheridar). Thanks to the sponsors of the Biofrontiers Hackathon for the caffeine underlying these improvements.
bed_random() are now sorted properly.
Package dplyr v0.5.0 headers with valr to remove dplyr LinkingTo dependency.
bed_intersect() now accepts multiple tbls for intersection (#220 @kriemo).
tbl_genome() that wrap tibbles and enforce strict column naming.
trbl_genome() are constructors that take
tibble::tribble() formatting and
is.tbl_genome() are used to check for valid classes.
bed_random() are sorted by
bed_jaccard() and use numeric values for calculation (fixes #204).
bed_sort() in favor of using
dplyr::arrange() explicitly (fixes #134).
src/init.c that calls
R_useDynamicSymbols to address NOTE in r-devel
dist parameter in
bed_closest() in favor of using user supplied functions (#182 @kriemo)
.id values sequential across chroms in
bed_cluster() output (#171)
Transfer repository to http://github.com/rnabioco/valr, update links and docs.
Move shiny app to new repo (http://github.com/rnabioco/valrdata).
Add Kent Riemondy to LICENSE file.
bed_merge() now merges contained intervals (#177)
test / vignette guards for Suggested RMySQL
fixed memory leak in absdist.cpp
fixed vignette entry names