Module Documentation

Looking for documentation of our functions? You are in the right place.

flintypy.v_stat module

flintypy.v_stat._block_cov(X, blocks, p)

Covariance Computations between Pairs of Distances (Block Dependencies Case)

Computes covariance matrix entries and associated \(\alpha\), \(\beta\) and \(\gamma\) quantities, for partitionable features that are grouped into blocks. Computes the unique entries of the asymptotic covariance matrix of the pairwise distances in \(O(N^2)\) time.

This is used in the large \(B\) asymptotics of the permutation test.

Depends on: _hamming_distances, scipy.spatial.distance.pdist

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

The three distinct entries of covariance matrix \((\alpha,\beta,\gamma)\), all floats.

Return type

float 1D array

flintypy.v_stat._block_large_p(X, blocks, p)

Approximate p-value for Exchangeability Test (Assuming Large P with Block Dependencies)

Computes the large \(P\) asymptotic p-value for \(\mathbf{X}\), assuming its \(P\) features are independent within specified blocks.

Depends on: _chi2_weights, _block_cov, _calculate_bin_v_stat, _calculate_real_v_stat, _convolution_of_chi2

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

The approximate p-value for \(\mathbf{X}\).

Return type

float

flintypy.v_stat._block_large_p_large_n(X, blocks, p)

Approximate p-value for Exchangeability Test (Large P, Large N, Block Dependency)

Computes the large \(P\), large \(N\) asymptotic p-value for \(\mathbf{X}\), assuming its \(P\) features are independent within specified blocks.

Depends on: _chi2_weights, _block_cov, _calculate_bin_v_stat, _calculate_real_v_stat, scipy.stats.norm

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

The approximate p-value for \(\mathbf{X}\).

Return type

float

flintypy.v_stat._block_permute(X, blocks, nruns, p)

p-value Computation for Test of Exchangeability with Block Dependencies

Generates a block permutation p-value. Uses a heuristic to decide whether to use distance caching or simple block permutations.

Depends on: _calculate_bin_v_stat, _calculate_real_v_stat, _naive_block_permute, _cache_block_permute

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
nruns (int) – The number of permutations to perform / resampling number.
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

The block permutation p-value

Return type

float

flintypy.v_stat._build_forward(n)

Map from Label Pairs to Indices

Builds a map from pairs of labels to indices. This is for caching distances, to avoid recomputing them, which matters most for high-dimensional (large \(P\)) arrays.

Depends on: numba.njit

Parameters: n (int) – Sample size (i.e., \(\mathbf{X}\).shape[0])
Returns: forward – An \(N\times N\) array, whose entries record the index corresponding to the pair of labels (indexed by the matrix dimensions)
Return type: int64 2D array

flintypy.v_stat._build_reverse(n)

Map from Indices to Label Pairs

Builds a map from indices to pairs of labels. This is for caching distances, to avoid recomputing them, which matters most for high-dimensional (large \(P\)) arrays.

Depends on: numba.njit

Parameters: n (int) – Sample size (i.e., \(\mathbf{X}\).shape[0])
Returns: reverse – An \({N \choose 2} \times 2\) array whose row \(k\), with entries \((k,0)\) and \((k,1)\), holds the two labels that make up the \(k\) th pair in the list \(((1,2), (1,3), \ldots, (1,N), (2,3), \ldots)\)
Return type: int64 2D array
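
The two index maps described above can be sketched in plain NumPy. This is only an illustration of the idea, not the library's numba implementation: the names build_forward and build_reverse (without the leading underscore), the 0-based labels, and the -1 placeholder on the diagonal are all assumptions made for this example.

```python
import numpy as np

def build_reverse(n):
    """Return an (n choose 2) x 2 array; row k holds the labels (i, j), i < j, of pair k."""
    # triu_indices walks the strict upper triangle row by row:
    # (0,1), (0,2), ..., (0,n-1), (1,2), ...
    i, j = np.triu_indices(n, k=1)
    return np.column_stack((i, j)).astype(np.int64)

def build_forward(n):
    """Return an n x n array whose (i, j) entry is the index of the pair {i, j}."""
    reverse = build_reverse(n)
    # Assumption: the diagonal is filled with -1 because there are no self-pairs.
    forward = np.full((n, n), -1, dtype=np.int64)
    k = np.arange(reverse.shape[0])
    forward[reverse[:, 0], reverse[:, 1]] = k
    forward[reverse[:, 1], reverse[:, 0]] = k  # make the lookup symmetric
    return forward

n = 5
reverse = build_reverse(n)
forward = build_forward(n)
```

With these two arrays, looking up the index of a pair and recovering the pair from an index are both constant-time operations, which is what makes relabelling cached distances cheap.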

flintypy.v_stat._cache_block_permute(X, blocks, nruns, p)

Resampling Many V Statistics

Generates a block permutation distribution of \(V\). Precomputes distances and some indexing arrays to quickly generate samples from the block permutation distribution.

Depends on: _hamming_distances, scipy.spatial.distance.pdist, _build_forward, _build_reverse, _numba_permute_dists

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
nruns (int) – The number of permutations to perform / resampling number.
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

Array of floats storing the permutation distribution of the \(V\) statistic

Return type

array

flintypy.v_stat._calculate_bin_v_stat(X)

V Statistic for Binary Arrays

Computes \(V(\mathbf{X})\) for a binary matrix \(\mathbf{X}\).

Depends on: _hamming_distances

Parameters: X (int64 numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
Returns: The \(V\) statistic, a scalar equal to the variance of the pairwise Hamming distances between individuals.
Return type: float

flintypy.v_stat._calculate_real_v_stat(X, p)

V Statistic for Real Arrays

Computes \(V(\mathbf{X})\) for a real matrix \(\mathbf{X}\), where \(V(\mathbf{X})\) is the scaled variance of \(l_p^p\) distances between the rows of \(\mathbf{X}\).

Depends on: scipy.spatial.distance.pdist

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

The \(V\) statistic, a scalar equal to the (scaled) variance of the pairwise \(l_p^p\) distances between individuals.

Return type

float
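
As a rough illustration of the two \(V\) statistics above, the sketch below computes the variance of pairwise distances with scipy.spatial.distance.pdist. The function names v_stat_real and v_stat_binary are hypothetical, and the sketch uses the plain population variance (np.var with its default ddof=0); any additional scaling applied by the library is not documented here and is not reproduced.

```python
import numpy as np
from scipy.spatial.distance import pdist

def v_stat_real(X, p=2.0):
    """Variance of pairwise l_p^p distances between the rows of X (unscaled sketch)."""
    # pdist returns the l_p distance, so raise it to the p-th power to get l_p^p.
    d = pdist(X, metric="minkowski", p=p) ** p
    return float(np.var(d))

def v_stat_binary(X):
    """Variance of pairwise Hamming distances between the rows of a 0/1 array."""
    # pdist's "hamming" metric is the fraction of differing entries,
    # so multiply by the number of features to count differing positions.
    d = pdist(X, metric="hamming") * X.shape[1]
    return float(np.var(d))

X_bin = np.array([[0, 0], [1, 0], [1, 1]])
v_bin = v_stat_binary(X_bin)                  # distances are 1, 2, 1
v_real = v_stat_real(X_bin.astype(float), 2)  # squared Euclidean distances are also 1, 2, 1
```

For a binary array the squared Euclidean distance and the Hamming distance coincide, so the two functions agree on this input.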

flintypy.v_stat._chi2_weights(alpha, beta, gamma, n)

Get Chi Square Weights

Computes convolution weights for the asymptotic random variable from the covariance entries \(\alpha\), \(\beta\) and \(\gamma\) obtained from \(\mathbf{X}\).

Parameters

alpha (float) – The variance of \(d(X_1, X_2)\)
beta (float) – The covariance of \(d(X_1,X_2)\) and \(d(X_1,X_3)\)
gamma (float) – The covariance of \(d(X_1,X_2)\) and \(d(X_3,X_4)\)
n (int) – Sample size (i.e., \(\mathbf{X}\).shape[0])

Returns

Two floats, \(w_1\) and \(w_2\), where \(w_1\) is the weight for the chi square distribution with \(n-1\) degrees of freedom and \(w_2\) is the weight for the chi square distribution with \({n-1 \choose 2} - 1\) degrees of freedom.

Return type

float 1D array
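
The degrees of freedom quoted above, \(n-1\) and \({n-1 \choose 2} - 1\), coincide with the eigenvalue multiplicities of a covariance matrix over pairs whose entries take only the three values \(\alpha\), \(\beta\) and \(\gamma\). The numerical check below illustrates this under stated assumptions: pair_covariance is a hypothetical helper, the values of \(\alpha\), \(\beta\), \(\gamma\) are arbitrary, and whether \(w_1\) and \(w_2\) equal these eigenvalues or a rescaling of them is not stated in this documentation.

```python
import itertools
import numpy as np

def pair_covariance(n, alpha, beta, gamma):
    """Covariance matrix indexed by pairs {i, j}: alpha, beta or gamma by overlap size."""
    pairs = list(itertools.combinations(range(n), 2))
    m = len(pairs)
    cov = np.empty((m, m))
    for a, s in enumerate(pairs):
        for b, t in enumerate(pairs):
            shared = len(set(s) & set(t))  # 2: same pair, 1: one shared label, 0: disjoint
            cov[a, b] = (gamma, beta, alpha)[shared]
    return cov

n, alpha, beta, gamma = 6, 3.0, 1.0, 0.2
eigvals = np.linalg.eigvalsh(pair_covariance(n, alpha, beta, gamma))

# Closed-form eigenvalues of this three-value structure (besides the all-ones direction).
lam1 = alpha + (n - 4) * beta - (n - 3) * gamma
lam2 = alpha - 2 * beta + gamma

mult1 = int(np.sum(np.isclose(eigvals, lam1)))
mult2 = int(np.sum(np.isclose(eigvals, lam2)))
print(mult1, mult2)  # 5 9, i.e. n - 1 and C(n - 1, 2) - 1 for n = 6
```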

flintypy.v_stat._convolution_of_chi2(val, w1, w2, d1, d2)

Tail Probability for Chi Square Convolution Random Variable

Computes \(P(X > val)\) where \(X = w_1 Y + w_2 Z\), with \(Y\) chi square distributed with \(d_1\) degrees of freedom and \(Z\) chi square distributed with \(d_2\) degrees of freedom. The probability is computed by numerically integrating the densities of the two chi square distributions using the trapezoidal rule.

Depends on: scipy.stats.chi2

Parameters

val (float) – The point at which to evaluate the survival function \(1 - CDF\) (i.e., the observed statistic).
w1 (float) – The weight of the first chi square rv
w2 (float) – The weight of the second chi square rv
d1 (int) – The degrees of freedom of first chi square rv
d2 (int) – The degrees of freedom of second chi square rv

Returns

\(1 - CDF = P(X > val)\), the probability that the rv exceeds val

Return type

float
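
One way to carry out the trapezoidal-rule computation described above is to condition on \(Y\) and integrate \(f_Y(y)\,P(Z > (val - w_1 y)/w_2)\) over \(y\). The sketch below does this with scipy.stats.chi2. It is not the library's code: the function name chi2_convolution_tail, the grid size, and the truncation point are assumptions, and it further assumes \(w_1, w_2 > 0\) and \(d_1 \ge 2\) so the density of \(Y\) is finite at zero.

```python
import numpy as np
from scipy.stats import chi2

def chi2_convolution_tail(val, w1, w2, d1, d2, num_points=20001):
    """P(w1*Y + w2*Z > val) for independent Y ~ chi2(d1), Z ~ chi2(d2).

    Sketch only: assumes w1 > 0, w2 > 0 and d1 >= 2.
    """
    # Truncate the integral where the upper tail of Y is negligible (mass 1e-12).
    upper = chi2.isf(1e-12, d1)
    y = np.linspace(0.0, upper, num_points)
    # Given Y = y, the event is Z > (val - w1*y) / w2; sf returns 1 for negative arguments.
    integrand = chi2.pdf(y, d1) * chi2.sf((val - w1 * y) / w2, d2)
    # Trapezoidal rule on a uniform grid.
    dy = y[1] - y[0]
    return float(np.sum((integrand[:-1] + integrand[1:]) * 0.5 * dy))

# Sanity check: with unit weights the sum is chi square with d1 + d2 degrees of freedom.
approx = chi2_convolution_tail(4.0, 1.0, 1.0, 2, 3)
exact = chi2.sf(4.0, 5)
```

The unit-weight case is a convenient check because the exact answer is available in closed form through chi2.sf.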

flintypy.v_stat._dist_data_large_p(dist_list)

Asymptotic p-value Computation for Test of Exchangeability Using Distance Data

Generates an asymptotic distribution of \(V\), by storing the provided list of distance data as a \(B\times {N \choose 2}\) array, and then using large-\(P\) theory to generate the asymptotic null distribution. The observed \(V\) statistic is also computed from the distance data.

Each element of dist_list should be the same type. They can either all be distance matrices (shape \((N,N)\)), or all be distance vectors (shape \(({N \choose 2},)\)).

Depends on: scipy.spatial.distance.squareform, _build_forward, _build_reverse, _numba_permute_dists

Parameters: dist_list (List of numpy arrays (matrix / vector)) – The list of pairwise distances
Returns: The asymptotic p-value.
Return type: float

flintypy.v_stat._dist_data_permute(dist_list, nruns)

p-value Computation for Test of Exchangeability using Distance Data

Generates a permutation null distribution of \(V\) by storing the provided list of distance data as a \(B \times {N \choose 2}\) array, and then permuting the underlying indices of each individual to generate resampled arrays. The observed \(V\) statistic is also computed from the distance data. The p-value is computed from the null distribution and the observed \(V\) statistic.

Each element of dist_list should be the same type. They can either all be distance matrices (shape \((N,N)\)), or all be distance vectors (shape \(({N \choose 2},)\)).

Depends on: scipy.spatial.distance.squareform, _build_forward, _build_reverse, _numba_permute_dists

Parameters

dist_list (List of numpy arrays (matrix / vector)) – The list of pairwise distances
nruns (int) – The number of permutations to perform / resampling number.

Returns

The block permutation p-value.

Return type

float

flintypy.v_stat._hamming_distance_gmpy(X)

Bit-Computation of Pairwise Hamming Distances

Uses bit operations to quickly compute pairwise Hamming distances for a two dimensional array \(\mathbf{X}\). Incurs some overhead in packing the bits, so performance gains are only for sufficiently large arrays.

Depends on: gmpy2.pack, gmpy2.hamdist

Parameters: X (int64 numpy array) – An \(N\times P\) array recording \(P\) features in \(N\) individuals
Returns: A length \({N \choose 2}\) array containing the pairwise Hamming distances in order \(((1, 2), (1, 3), \ldots, (1, N), (2, 3),\ldots)\).
Return type: float64 numpy array
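
The bit-packing idea can be illustrated without gmpy2 by using Python's arbitrary-precision integers: each row becomes one integer, XOR marks the differing positions, and counting set bits gives the Hamming distance. This is a stand-in for gmpy2.pack and gmpy2.hamdist, not the library's code; pack_row and hamming_distances_bits are hypothetical names.

```python
from itertools import combinations
import numpy as np

def pack_row(row):
    """Pack a 0/1 row into a single Python int, one bit per feature."""
    value = 0
    for bit in row:
        value = (value << 1) | int(bit)
    return value

def hamming_distances_bits(X):
    """Pairwise Hamming distances between rows of a 0/1 array, in pdist order."""
    packed = [pack_row(row) for row in X]
    # XOR leaves a 1 exactly where two rows differ; counting the 1-bits gives the distance.
    return np.array(
        [bin(a ^ b).count("1") for a, b in combinations(packed, 2)],
        dtype=np.float64,
    )

X = np.array([[0, 0, 1],
              [1, 0, 1],
              [1, 1, 0]])
d = hamming_distances_bits(X)  # pairs (0,1), (0,2), (1,2) -> 1, 3, 2
```

Packing costs one pass over every row, which is the overhead the description above refers to; the payoff comes when many pairs are compared.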

flintypy.v_stat._hamming_distances(X)

A Hamming Distance Vector Calculator

Uses a heuristic to decide whether to use bit operations or operate directly on a two-dimensional array \(\mathbf{X}\) to compute the pairwise Hamming distances.

Depends on: _hamming_distance_gmpy, scipy.spatial.distance.pdist

Parameters: X (int64 numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
Returns: A length \({N \choose 2}\) array containing the pairwise Hamming distances in order \(((1, 2), (1, 3), \ldots, (1, N), (2, 3),\ldots)\).
Return type: float64 numpy array

flintypy.v_stat._ind_cov(X, p)

Covariance Computation Between Pairs of Distances (Independent Case)

Computes covariance matrix entries and associated \(\alpha\), \(\beta\) and \(\gamma\) quantities defined in Aw, Spence and Song (2021), assuming the \(P\) features of dataset \(\mathbf{X}\) are independent.

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

The three distinct entries of covariance matrix \((\alpha,\beta,\gamma)\), all floats.

Return type

float 1D array

flintypy.v_stat._ind_large_p(X, p)

Approximate p-value for Test of Exchangeability (Assuming Large P)

Computes the large \(P\) asymptotic p-value for dataset \(\mathbf{X}\), assuming its \(P\) features are independent.

Depends on: scipy.stats.chi2, _calculate_bin_v_stat, _convolution_of_chi2

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

The approximate p-value for \(\mathbf{X}\), a float.

Return type

float

flintypy.v_stat._ind_large_p_large_n(X, p)

Approximate p-value for Test of Exchangeability (Assuming Large N and P)

Computes the large \(N\), large \(P\) asymptotic p-value for \(\mathbf{X}\) assuming its \(P\) features are independent.

Depends on: _calculate_bin_v_stat, _calculate_real_v_stat, _ind_cov, _chi2_weights, scipy.stats.norm

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

The approximate p-value for \(\mathbf{X}\), a float.

Return type

float

flintypy.v_stat._naive_block_permute(X, block_labels, p)

Resampling V Statistic

Generates a new array \(\mathbf{X}'\) under the permutation null, and then returns the \(V\) statistic computed for \(\mathbf{X}'\).

Depends on: _calculate_bin_v_stat, _calculate_real_v_stat, _numba_permute

Parameters

X (numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
block_labels (int64 numpy array) – A length \(P\) array, with entry \(i\) containing the label of the block that contains feature \(i\). Blocks are assumed to be labeled 0 to number of blocks - 1.
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)

Returns

\(V(\mathbf{X}')\), where \(\mathbf{X}'\) is a resampled version of \(\mathbf{X}\).

Return type

float

flintypy.v_stat._numba_permute(X, block_labels)

Resampling Arrays

Generates a new array \(\mathbf{X}'\) under the permutation null.

Depends on: numba.njit

Parameters

X (numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
block_labels (int64 numpy array) – A length \(P\) array, with entry \(i\) containing the label of the block that contains feature \(i\). Blocks are assumed to be labeled 0 to number of blocks - 1.

Returns

An \(N \times P\) matrix \(\mathbf{X}'\), which is a permuted version of \(\mathbf{X}\).

Return type

numpy array
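
A minimal sketch of block-wise resampling follows, assuming that "the permutation null" means the rows are permuted independently for each block while the columns inside a block move together. The helper names blocks_to_labels and block_permute are hypothetical; the first only shows how a blocks list (as used elsewhere in this module) relates to a block_labels array.

```python
import numpy as np

def blocks_to_labels(blocks, num_features):
    """Convert a list of feature-index arrays into a length-P array of block labels."""
    labels = np.empty(num_features, dtype=np.int64)
    for k, idx in enumerate(blocks):
        labels[idx] = k
    return labels

def block_permute(X, block_labels, rng):
    """Permute the rows independently for each block; columns in a block stay aligned."""
    X_new = np.empty_like(X)
    for k in range(block_labels.max() + 1):
        cols = np.flatnonzero(block_labels == k)
        perm = rng.permutation(X.shape[0])  # a fresh row permutation for this block
        X_new[:, cols] = X[perm][:, cols]
    return X_new

rng = np.random.default_rng(0)
X = np.arange(12).reshape(4, 3)
blocks = [np.array([0, 2]), np.array([1])]   # features 0 and 2 form one block
labels = blocks_to_labels(blocks, X.shape[1])
X_perm = block_permute(X, labels, rng)
```

Because features 0 and 2 share a block, each row of X_perm still pairs a value from column 0 with the value that sat beside it in column 2 of the original array.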

flintypy.v_stat._numba_permute_dists(dists, forward, reverse)

Permutation by Caching Distances

What do you do when you have to compute pairwise distances many times, and those damn distances take a long time to compute? Answer: You cache the distances and permute the underlying sample labels!

Permutes pairwise distances (Hamming, \(l_p^p\), etc.) within blocks. Permutations respect the fact that we are actually permuting the underlying labels. Arguments forward and reverse should be precomputed using _build_forward and _build_reverse.

Depends on: numba.njit

Parameters

dists (float64 numpy array) – A \(B \times {N \choose 2}\) array, where each column records the distance for one pair of individuals across all \(B\) blocks, with columns ordered as \(((1, 2), (1, 3), \ldots, (1, N), (2, 3),\ldots)\)
forward (numpy 2D array) – An \(N \times N\) mapping from labels \(i, j\) to their corresponding index in dists. Should be precomputed using _build_forward.
reverse (numpy 2D array) – A \({N \choose 2}\times 2\) mapping from an index in dists to the corresponding labels i and j. Should be precomputed using _build_reverse.

Returns

A \(B \times {N \choose 2}\) array containing the block-permuted pairwise distances

Return type

float64 2D array
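
The key point above is that permuting the sample labels only reorders distances that are already cached, so nothing has to be recomputed. The sketch below shows this with scipy.spatial.distance.squareform in place of the precomputed forward and reverse maps; relabel_dists and permute_cached_dists are hypothetical names, and drawing an independent permutation per block (row) is an assumption about how blocks are treated.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def relabel_dists(d, perm):
    """Reorder a condensed distance vector as if the individuals were relabelled by perm."""
    D = squareform(d)                         # N x N symmetric matrix, zero diagonal
    return squareform(D[np.ix_(perm, perm)])  # back to condensed (N choose 2) form

def permute_cached_dists(dists, rng):
    """dists: B x (N choose 2). Relabel the individuals independently in each row."""
    out = np.empty_like(dists)
    n = squareform(dists[0]).shape[0]
    for b in range(dists.shape[0]):
        out[b] = relabel_dists(dists[b], rng.permutation(n))
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
perm = rng.permutation(6)
cached = pdist(X)
relabelled = relabel_dists(cached, perm)   # no distances are recomputed here
recomputed = pdist(X[perm])                # the expensive route, for comparison
```

The relabelled vector matches the distances recomputed from the permuted rows, which is the property that lets the cached version replace repeated distance computations.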

flintypy.v_stat.dist_data_p_value(dist_list, large_p=False, num_perms=1000)

A Non-parametric Test for Exchangeability and Homogeneity (Distance List Version)

Computes the p-value of a multivariate dataset, which informs the user if the sample is exchangeable at a given significance level, while simultaneously accounting for feature dependencies.

This version takes in a list of distances recording pairwise distances between individuals across either \(P\) independent features or \(B\) independent blocks of features.

Each element of dist_list should be the same type. They can either all be distance matrices (shape \((N,N)\)), or all be distance vectors (shape \(({N \choose 2},)\)).

Depends on: _dist_data_large_p, _dist_data_permute

Parameters

dist_list (List of numpy arrays (matrix / vector)) – The list of pairwise distances
large_p (boolean, optional) – Indicates whether large P asymptotics are used to determine the approximate null distribution. The default is False.
num_perms (int, optional) – The number of permutations to perform / resampling number. The default is 1000.

Returns

The p-value.

Return type

float

flintypy.v_stat.get_p_value(X, blocks=None, large_p=False, large_n=False, num_perms=1000, p=2)

A Non-parametric Test of Exchangeability and Homogeneity

Computes the p-value of a multivariate dataset \(\mathbf{X}\), which informs the user if the sample is exchangeable at a given significance level, while simultaneously accounting for feature dependencies. See Aw, Spence and Song (2021) for details.

Automatically detects if dataset is binary, and runs the Hamming distance version of test if so. Otherwise, computes the squared Euclidean distance between individuals and evaluates whether the variance of Euclidean distances, \(V\), is atypically large under the null hypothesis of exchangeability.

Note the user may tweak the choice of power \(p\) if they prefer an \(l_p^p\) distance other than Euclidean (\(p=2\)).

Under the hood, the variance statistic \(V\) is computed efficiently. Moreover, the user can specify their choice of block permutations, large \(P\) asymptotics, or large \(P\) and large \(N\) asymptotics. The latter two return reasonably accurate p-values for moderately large dimensionalities.

User recommendations: When the number of independent blocks \(B\) or number of independent features \(P\) is at least \(50\), it is safe to use large \(P\) asymptotics. If \(P\) or \(B\) is small, stick with permutations.

Depends on: _ind_large_p, _ind_large_p_large_n, _block_large_p, _block_large_p_large_n, _block_permute

Parameters

X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays, optional) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\). The default is None.
large_p (boolean, optional) – Indicates whether large \(P\) asymptotics are used to determine the approximate null distribution. The default is False.
large_n (boolean, optional) – Indicates whether large \(P\), large \(N\) asymptotics are used to determine the approximate null distribution. The default is False.
num_perms (int, optional) – Determines the number of permutations to perform if not using an asymptotic test. If either large_p or large_n is true, num_perms is ignored. The default is 1000.
p (float, optional) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\). The default is 2.

Raises

IOError – These are errors associated with assertions and unit tests.
NotImplementedError – This error corresponds to the large \(N\), small \(P\) asymptotics, which is currently not implemented.

Returns

The p-value computed for the \(V\) statistic of \(\mathbf{X}\).

Return type

float
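
To make the permutation variant of the test concrete, here is a self-contained sketch for the independent-features case. It does not call flintypy and is not its implementation: permutation_p_value and v_stat are hypothetical names, the statistic is the unscaled variance of pairwise \(l_p^p\) distances, each column is shuffled independently to simulate the null, and the add-one p-value formula is an assumed convention.

```python
import numpy as np
from scipy.spatial.distance import pdist

def v_stat(X, p=2.0):
    """Unscaled variance of pairwise l_p^p distances between the rows of X."""
    return float(np.var(pdist(X, metric="minkowski", p=p) ** p))

def permutation_p_value(X, num_perms=1000, p=2.0, seed=0):
    """Permutation p-value for atypically large V, assuming independent features."""
    rng = np.random.default_rng(seed)
    v_obs = v_stat(X, p)
    exceed = 0
    for _ in range(num_perms):
        # Shuffle every feature (column) independently across individuals.
        X_perm = np.column_stack([rng.permutation(col) for col in X.T])
        if v_stat(X_perm, p) >= v_obs:
            exceed += 1
    # Assumed convention: add one to numerator and denominator so p is never zero.
    return (exceed + 1) / (num_perms + 1)

# Toy data: two groups of 10 individuals whose feature means differ by 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 30))
X[10:] += 2.0
p_val = permutation_p_value(X, num_perms=200)
```

In this toy dataset the two groups make the pairwise distances strongly bimodal, so the observed variance is far larger than what column-wise shuffling produces and the p-value comes out small.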