Provides a computational framework for Bayesian estimation of antigen-driven selection in immunoglobulin (Ig) sequences, providing an intuitive means of analyzing selection by quantifying the degree of selective pressure. Also provides tools to profile mutations in Ig sequences, build models of somatic hypermutation (SHM) in Ig sequences, and make model-dependent distance comparisons of Ig repertoires.

SHazaM is part of the Immcantation
analysis framework for Adaptive Immune Receptor Repertoire sequencing
(AIRR-seq) and provides tools for advanced analysis of somatic hypermutation
(SHM) in immunoglobulin (Ig) sequences. Shazam focuses on the following

analysis topics:

**Quantification of mutational load**

SHazaM includes methods for determine the rate of observed and expected mutations under various criteria. Mutational profiling criteria include rates under SHM targeting models, mutations specific to CDR and FWR regions, and physicochemical property dependent substitution rates.**Statistical models of SHM targeting patterns**

Models of SHM may be divided into two independent components: (a) a mutability model that defines where mutations occur and (b) a nucleotide substitution model that defines the resulting mutation. Collectively these two components define an SHM targeting model. SHazaM provides empirically derived SHM 5-mer context mutation models for both humans and mice, as well tools to build SHM targeting models from data.**Analysis of selection pressure using BASELINe**

The Bayesian Estimation of Antigen-driven Selection in Ig Sequences (BASELINe) method is a novel method for quantifying antigen-driven selection in high-throughput Ig sequence data. BASELINe uses SHM targeting models can be used to estimate the null distribution of expected mutation frequencies, and provide measures of selection pressure informed by known AID targeting biases.**Model-dependent distance calculations**

SHazaM provides methods to compute evolutionary distances between sequences or set of sequences based on SHM targeting models. This information is particularly useful in understanding and defining clonal relationships.

General

- Corrected several functions so that they accept both tibbles and data.frames.

Distance Calculation:

- Adding new fitting procedures to the
`"gmm"`

method of`findThreshold()`

that allows users to choose a mixture of two univariate density distribution functions among four available combinations:`"norm-norm"`

,`"norm-gamma"`

,

`"gamma-norm"`

, or`"gamma-gamma"`

. - Added the ability to choose the threshold selection criteria in the
`"gmm"`

method of`findThreshold()`

from the best average sensitivity and specificity, the curve intersection or user defined sensitivity or specificity. - Renamed the
`cutEdge`

argument of`findThreshold()`

to`edge`

.

Mutation Profiling:

- Redesigned
`collapseClones()`

, adding various deterministic and stochastic methods to obtain effective clonal sequences, support for including ambiguous IUPAC characters in output, as well as extensive documentation. Removed`calcClonalConsensus()`

from exported functions. - Added support for including ambiguous IUPAC characters in input for
`observedMutations()`

and`calcObservedMutations()`

. - Fixed a minor bug in calculating the denominator for mutation frequency in
`calcObservedMutations()`

for sequences with non-triplet overhang at the tail. - Renamed column names of observed mutations (previously
`OBSERVED`

) and expected mutations (previously`EXPECTED`

) returned by`observedMutations()`

and`expectedMutations()`

to`MU_COUNT`

and`MU_EXPECTED`

respectively.

Selection Analysis:

`calcBaseline()`

no longer calls`collapseClones()`

automatically if a`CLONE`

column is present. As indicated by the documentation for`calcBaseline()`

users are advised to obtain effective clonal sequences (for example, calling`collapseClones()`

) before running`calcBaseline()`

.- Updated vignette to reflect changes in
`calcBaseline()`

.

Mutation Profiling:

- Fixed a bug in
`collapseClones()`

that prevented it from running when`nproc`

is greater than 1.

General:

- Internal changes for compatibility with dplyr v0.6.0.
- Removed data.table dependency.

Mutation Profiling:

- Fixed a bug in
`collapseClones()`

that resulted in erroneous`CLONAL_SEQUENCE`

and`CLONAL_GERMLINE`

being returned. - Added a vignette describing basic mutational analysis.
- Remove console notification that
`observedMutations`

was running.

General:

- License changed to Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Selection Analysis:

- Fixed a bug in p-value calculation in
`summarizeBaseline()`

. The returned p-value can now be either positive or negative. Its magnitude (without the sign) should be interpreted as per normal. Its sign indicates the direction of the seLicense chalection detected. A positive p-value indicates positive selection, whereas a negative p-value indicates negative selection. - Added
`editBaseline()`

to exported functions, and a corresponding section in the vignette. - Fixed a bug in counting the total number of observed mutations when performing
a local test for codon-by-codon selection analysis in
`calcBaseline()`

.

Targeting Models:

- Added
`numMutationsOnly`

argument to`createSubstitutionMatrix()`

, enabling parameter tuning for`minNumMutations`

. - Added functions
`minNumMutationsTune()`

and`minNumSeqMutationsTune()`

to tune for parameters`minNumMutations`

and`minNumSeqMutations`

in functions`createSubstitutionMatrix()`

and`createMutabilityMatrix()`

respectively. Also added function`plotTune()`

which helps visualize parameter tuning using the abovementioned two new functions. - Added human kappa and lambda light chain, silent, 5-mer, functional targeting
model (
`HKL_S5F`

). - Renamed
`HS5FModel`

as`HH_S5F`

,`MRS5NFModel`

as`MK_RS5NF`

, and`U5NModel`

as`U5N`

. - Added human heavy chain, silent, 1-mer, functional substitution model (
`HH_S1F`

), human kappa and lambda light chain, silent, 1-mer, functional substitution model (`HKL_S1F`

), and mouse kappa light chain, replacement and silent, 1-mer, non-functional substitution model (`MK_RS1NF`

). - Added
`makeDegenerate5merSub`

and`makeDegenerate5merMut`

which make degenerate 5-mer substitution and mutability models respectively based on the 1-mer models. Also added`makeAverage1merSub`

and`makeAverage1merMut`

which make 1-mer substitution and mutability models respectively by averaging over the 5-mer models.

Mutation Profiling:

- Added
`returnRaw`

argument to`calcObservedMutations()`

, which if true returns the positions of point mutations and their corresponding mutation types, as opposed to counts of mutations (hence "raw"). - Added new functions
`slideWindowSeq()`

and`slideWindowDb()`

which implement a sliding window approach towards filtering a single sequence or sequences in a data.frame which contain(s) equal to or more than a given number of mutations in a given number of consecutive nucleotides. - Added new function
`slideWindowTune()`

which allows for parameter tuning for using`slideWindowSeq()`

and`slideWindowDb()`

. - Added new function
`slideWindowTunePlot()`

which visualizes parameter tuning by`slideWindowTune()`

.

Distance Calculation:

- Fixed a bug in
`distToNearest`

wherein`normalize="length"`

for 5-mer models was resulting in distances normalized by junction length squared instead of raw junction length. - Fixed a bug in
`distToNearest`

wherein`symmetry="min"`

was calculating the minimum of the total distance between two sequences instead of the minimum distance at each mutated position. - Added
`findThreshold`

function to infer clonal distance threshold from nearest neighbor distances returned by`distToNearest`

. - Renamed the
`length`

option for the`normalize`

argument of`distToNearest`

to`len`

so it matches Change-O. - Deprecated the
`HS1FDistance`

and`M1NDistance`

distance models, which have been renamed to`hs1f_compat`

and`m1n_compat`

in the`model`

argument of`distToNearest`

. These deprecated models should be used for compatibility with DefineClones in Change-O v0.3.3. These models have been replaced by replaced by`hh_s1f`

and`mk_rs1nf`

, which are supported by Change-O v0.3.4. - Renamed the
`hs5f`

model in`distToNearest`

to`hh_s5f`

. - Added support for
`MK_RS5NF`

models to`distToNearest`

. - Updated
`calcTargetingDistance()`

to enable calculation of a symmetric distance matrix given a 1-mer substitution matrix normalized by row, such as`HH_S1F`

. - Added a Gaussian mixture model (GMM) approach for threshold determination to
`findThreshold`

. The previous smoothed density method is available via the`method="density"`

argument and the new GMM method is available via`method="gmm"`

. - Added the functions
`plotGmmThreshold`

and`plotDensityThreshold`

to plot the threshold detection results from`findThreshold`

for the`"gmm"`

and`"density"`

methods, respectively.

Region Definition:

- Deleted
`IMGT_V_NO_CDR3`

and`IMGT_V_BY_REGIONS_NO_CDR3`

. Updated`IMGT_V`

and`IMGT_V_BY_REGIONS`

so that neither includes CDR3 now.

Selection Analysis:

- Fixed a bug in calcBaseline wherein the germline column was incorrected hardcoded, leading to erroneous mutation counts for some clonal consensus sequences.

Targeting Models:

- Added
`numSeqMutationsOnly`

argument to`createMutabilityMatrix()`

, enabling parameter tuning for`minNumSeqMutations`

.

General:

- Added ape and igraph dependency
- Removed the
`InfluenzaDb`

data object, in favor of the updated`ExampleDb`

provided in alakazam 0.2.4. - Added conversion of sequence to uppercase for several functions to support data that was not generated via Change-O.

Distance Calculation:

- Added the
`cross`

argument to`distToNearest()`

which allows restriction of distances to only distances across samples (ie, excludes within-sample distances). - Added
`mst`

flag to`distToNearest()`

, which will return all distances to neighboring nodes in a minimum spanning tree. - Updated single nucleotide distance models to use the new C++ distance methods in alakazam 0.2.4 for better performance.
- Fixed a bug leading to failed distance calculations for the
`aa`

model of`distToNearest()`

. - Fixed a bug wherein gap characters where being translated into Ns (Asn)
rather than Xs within the
`aa`

model of`distToNearest()`

.

Mutation Profiling:

- Added the
`MutationDefinition`

`VOLUME_MUTATIONS`

. - Added the functions
`shmulateSeq()`

and`shmulateTree()`

to simulate mutations on sequences and lineage trees, respectively, using a 5-mer targeting model. - Renamed
`collapseByClone`

,`calcDbExpectedMutations`

and`calcDbObservedMutations`

to`collapseClones`

,`expectedMutations`

, and`observedMutations`

, respectively.

Selection Analysis:

- Fixed a bug wherein passing a
`Baseline`

object through`groupBaseline()`

multiple times resulted in incorrect normalization. - Added
`title`

options to`plotBaselineSummary()`

and`plotBaselineDensity()`

. - Added more control over colors and group ordering to
`plotBaselineSummary()`

and`plotBaselineDensity()`

. - Added the
`testBaseline()`

function to test the significance of differences between two selection distributions. - Improved selection analysis vignette.

General:

- Renamed package from shm to shazam.
- Internal changes to conform to CRAN policies.
- Compressed and moved example database to the data object
`InfluenzaDb`

. - Fixed several bugs where functions would not work properly when passed
a
`dplyr::tbl_df`

object instead of a`data.frame`

. - Changed R dependency to R >= 3.1.2.
- Added stringi dependency.

Distance Calculation:

- Fixed a bug wherein
`distToNearest()`

did not return the nearest neighbor with a non-zero distance.

Targeting Models:

- Performance improvements to
`createSubstitutionMatrix()`

,

`createMutabilityMatrix()`

, and`plotMutability()`

. - Modified color scheme in
`plotMutability()`

. - Fixed errors in the targeting models vignette.

Mutation Profiling:

- Added the
`MutationDefinition`

objects`MUTATIONS_CHARGE`

,`MUTATIONS_HYDROPATHY`

,`MUTATIONS_POLARITY`

providing alternate approaches to defining replacement and silent annotations to mutations when calling`calcDBObservedMutations()`

and`calcDBExpectedMutations()`

. - Fixed a few bugs where column names, region definitions or mutation models were not being recognized properly when non-default values were used.
- Made the behavior of
`regionDefinition=NULL`

consistent for all mutation profiling functions. Now the entire sequence is used as the region and calculations are made accordingly. `calcDBObservedMutations()`

returns R and S mutations also when`regionDefinition=NULL`

. Older versions reported the sum of R and S mutations. The function will add the columns`OBSERVED_SEQ_R`

and`OBSERVED_SEQ_S`

when`frequency=FALSE`

, and`MU_FREQ_SEQ_R`

and`MU_FREQ_SEQ_R`

when`frequency=TRUE`

.

General:

- Swapped dependency on doSNOW for doParallel.
- Swapped dependency on plyr for dplyr.
- Swapped dependency on reshape2 for tidyr.
- Documentation clean up.

Distance Calculation:

- Changed underlying method of calcTargetingDistance to be negative log10 of the probability that is then centered at one by dividing by the mean distance.
- Added
`symmetry`

parameter to distToNearest to change behavior of how asymmetric distances (A->B != B->A) are combined to get distance between A and B. - Updated error handling in distToNearest to issue warning when unrecognized character is in the sequence and return an NA.
- Fixed bug in 'aa' model in distToNearest that was calculating distance incorrectly when normalizing by length.
- Changed behavior to return nearest nonzero distance neighbor.

Mutation Profiling:

- Renamed calcDBClonalConsensus to collapseByClone Also, renamed argument collapseByClone to expandedDb.
- Fixed a (major) bug in calcExpectedMutations. Previously, the targeting calculation was incorrect and resulted in incorrect expected mutation frequencies. Note, that this also resulted in incorrect BASELINe Selection (Sigma) values.
- Changed denominator in calcObservedMutations to be based on informative (unambiguous) positions only.
- Added nonTerminalOnly parameter to calcDBClonalConsensus indicating whether to consider mutations at leaves or not (defaults to false).

Selection Analysis:

- Updated groupBaseline. Now when regrouping a Baseline object (i.e. grouping previously grouped PDFs) weighted convolution is performed.
- Added "imbalance" test statistic to the Baseline selection calculation.
- Extended the Baseline Object to include binomK, binomN and binomP Similar to numbOfSeqs, each of these are a matrix. They contain binomial inputs for each sequence and region.

Targeting Models:

- Added
`minNumMutations`

parameter to createSubstitutionMatrix. This is the minimum number of observed 5-mers required for the substituion model. The substitution rate of 5-mers with fewer number of observed mutations will be inferred from other 5-mers. - Added
`minNumSeqMutations`

parameter to createMutabilityMatrix. This is the minimum number of mutations required in sequences containing the 5-mers of interest. The mutability of 5-mers with fewer number of observed mutations in the sequences will be inferred. - Added
`returnModel`

parameter to createSubstitutionMatrix. This gives user the option to return 1-mer or 5-mer model. - Added
`returnSource`

parameter to createMutabilityMatrix. If TRUE, the code will return a data frame indicating whether each 5-mer mutability is observed or inferred. - In createSubstitutionMatrix and createMutabilityMatrix, fixed a bug when multipleMutation is set to "ignore".
- Changed inference procedure for the 5-mer substitution model.
- Added inference procedure for 5-mers without enough observed mutations in the mutability model.
- Fixed a bug in background 5-mer count for the RS model.
- Fixed a bug in IMGT gap handling in createMutabilityMatrix.
- Fixed a bug that occurs when sequences are in lower cases.

Initial public release.

General:

- Restructured the S4 class documentation.
- Fixed bug wherein example
`Influenza.tab`

file did not load on Mac OS X. - Added citations for
`citation("shazam")`

command. - Added dependency on data.table >= 1.9.4 to fix bug that occured with earlier versions of data.table.

Distance Calculation:

- Added a human 1-mer substitution matrix,
`HS1FDistance`

, based on the Yaari et al, 2013 data. - Set the
`hs1f`

as the default distance model for`distToNearest()`

. - Added conversion of sequences to uppercase in
`distToNearest()`

. - Fixed a bug wherein unrecongized (including lowercase) characters would lead to silenting returning a distance of 0 to the neared neighbor. Unrecognized characters will now raise an error.

Mutation Profiling:

- Fixed bug in
`calcDBClonalConsensus()`

so that the function now works correctly when called with the argument`collapseByClone=FALSE`

. - Added the
`frequency`

argument to`calcObservedMutations()`

and`calcDBObservedMutations()`

, which enables return of mutation frequencies rather the default of mutation counts.

Targeting Models:

- Removed
`M3NModel`

and all options for using said model. - Fixed bug in
`createSubstitutionMatrix()`

and`createMutabilityMatrix()`

where IMGT gaps were not being handled.

General:

- Added more error checking.

Targeting Models:

- Updated the targeting model workflow to include a clonal consensus step.

Targeting Models:

- Added the
`U5NModel`

, which is a uniform 5-mer model. - Improvements to
`plotMutability()`

output.

Prerelease for review.