Module Documentation¶
Looking for documentation of our functions? You are in the right place.
flintypy.v_stat module¶
- flintypy.v_stat._block_cov(X, blocks, p)¶
Covariance Computations between Pairs of Distances (Block Dependencies Case)
Computes covariance matrix entries and associated \(\alpha\), \(\beta\) and \(\gamma\) quantities, for partitionable features that are grouped into blocks. Computes the unique entries of the asymptotic covariance matrix of the pairwise distances in \(O(N^2)\) time.
This is used in the large \(B\) asymptotics of the permutation test.
Depends on: _hamming_distances, scipy.spatial.distance.pdist
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
The three distinct entries of covariance matrix \((\alpha,\beta,\gamma)\), all floats.
- Return type
float 1D array
- flintypy.v_stat._block_large_p(X, blocks, p)¶
Approximate p-value for Exchangeability Test (Assuming Large P with Block Dependencies)
Computes the large \(P\) asymptotic p-value for \(\mathbf{X}\), assuming its \(P\) features are grouped into the specified blocks, with distinct blocks independent of one another.
Depends on: _chi2_weights, _block_cov, _calculate_bin_v_stat, _calculate_real_v_stat, _convolution_of_chi2
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
The approximate p-value for \(\mathbf{X}\).
- Return type
float
- flintypy.v_stat._block_large_p_large_n(X, blocks, p)¶
Approximate p-value for Exchangeability Test (Large P, Large N, Block Dependency)
Computes the large \(P\), large \(N\) asymptotic p-value for \(\mathbf{X}\), assuming its \(P\) features are grouped into the specified blocks, with distinct blocks independent of one another.
Depends on: _chi2_weights, _block_cov, _calculate_bin_v_stat, _calculate_real_v_stat, scipy.stats.norm
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
The approximate p-value for \(\mathbf{X}\).
- Return type
float
- flintypy.v_stat._block_permute(X, blocks, nruns, p)¶
p-value Computation for Test of Exchangeability with Block Dependencies
Generates a block permutation p-value. Uses a heuristic to decide whether to use distance caching or simple block permutations.
Depends on: _calculate_bin_v_stat, _calculate_real_v_stat, _naive_block_permute, _cache_block_permute
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
nruns (int) – The number of permutations to perform / resampling number.
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
The block permutation p-value
- Return type
float
- flintypy.v_stat._build_forward(n)¶
Map from Label Pairs to Indices
Builds a map from pairs of labels to indexes. This is for caching distances, to avoid recomputing distances especially when dealing with high-dimensional (large \(P\)) arrays.
Depends on: numba.njit
- Parameters
n (int) – Sample size (i.e., \(\mathbf{X}\).shape[0])
- Returns
forward – An \(N\times N\) array, whose entries record the index corresponding to the pair of labels (indexed by the matrix dimensions)
- Return type
int64 2D array
- flintypy.v_stat._build_reverse(n)¶
Map from Indices to Label Pairs
Builds a map from indexes to pairs of labels. This is for caching distances, to avoid recomputing distances especially when dealing with high-dimensional (large \(P\)) arrays.
Depends on: numba.njit
- Parameters
n (int) – Sample size (i.e., \(\mathbf{X}\).shape[0])
- Returns
reverse – An \({N \choose 2} \times 2\) array, whose entries at row \(k\), \((k,0)\) and \((k,1)\), are the indices that make up the \(k\) th pair in the list \(((1,2), (1,3), \ldots, (1,N), (2,3),\ldots)\)
- Return type
int64 2D array
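The two index maps can be pictured with a small plain-Python sketch. This is an illustration only, not the library's numba code: it uses 0-based labels, nested lists instead of int64 arrays, and the names build_forward_sketch and build_reverse_sketch are hypothetical. Filling the unused diagonal with -1 is also an assumption.

```python
from itertools import combinations


def build_reverse_sketch(n):
    """Row k holds the label pair (i, j), i < j, stored at condensed index k."""
    return [list(pair) for pair in combinations(range(n), 2)]


def build_forward_sketch(n):
    """n x n table whose (i, j) entry is the condensed index of the pair {i, j}."""
    # Assumption: the diagonal is never looked up, so it is filled with -1 here.
    forward = [[-1] * n for _ in range(n)]
    for k, (i, j) in enumerate(combinations(range(n), 2)):
        forward[i][j] = k
        forward[j][i] = k  # the pair is unordered, so the table is symmetric
    return forward


n = 4
reverse = build_reverse_sketch(n)
forward = build_forward_sketch(n)
```

With n = 4 there are six pairs, and looking up a pair in forward and then reading that row of reverse returns the original pair.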
- flintypy.v_stat._cache_block_permute(X, blocks, nruns, p)¶
Resampling Many V Statistics
Generates a block permutation distribution of \(V\). Precomputes distances and some indexing arrays to quickly generate samples from the block permutation distribution.
Depends on: _hamming_distances, scipy.spatial.distance.pdist, _build_forward, _build_reverse, _numba_permute_dists
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\).
nruns (int) – The number of permutations to perform / resampling number.
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
Array of floats storing the permutation distribution of the \(V\) statistic
- Return type
array
- flintypy.v_stat._calculate_bin_v_stat(X)¶
V Statistic for Binary Arrays
Computes \(V(\mathbf{X})\) for a binary matrix \(\mathbf{X}\).
Depends on: _hamming_distances
- Parameters
X (int64 numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
- Returns
The \(V\) statistic, a scalar computing the variance of the pairwise Hamming distance between individuals.
- Return type
float
- flintypy.v_stat._calculate_real_v_stat(X, p)¶
V Statistic for Real Arrays
Computes \(V(\mathbf{X})\) for a real matrix \(\mathbf{X}\), where \(V(\mathbf{X})\) is the scaled variance of \(l_p^p\) distances between the rows of \(\mathbf{X}\).
Depends on: scipy.spatial.distance.pdist
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
The \(V\) statistic, a scalar computing the variance of the pairwise \(l_p^p\) distance between individuals.
- Return type
float
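As a rough illustration of what the two V statistic functions measure, the sketch below computes all pairwise \(l_p^p\) distances between rows and takes their variance. It is a plain-Python sketch under stated assumptions: the helper names are hypothetical, the population variance is used, and whatever scaling the library applies to the variance is not reproduced. For binary rows the \(l_p^p\) distance coincides with the Hamming distance.

```python
from itertools import combinations
from statistics import pvariance


def pairwise_lpp(X, p=2):
    """l_p^p distance, sum_k |x_k - y_k|^p, for every pair of rows of X."""
    return [
        sum(abs(a - b) ** p for a, b in zip(x, y))
        for x, y in combinations(X, 2)
    ]


def v_stat_sketch(X, p=2):
    """Unscaled variance of the pairwise distances (scaling is an assumption)."""
    return pvariance(pairwise_lpp(X, p))


X = [[0, 0], [1, 0], [0, 2]]
dists = pairwise_lpp(X, p=2)   # squared Euclidean distances: 1, 4, 5
v = v_stat_sketch(X, p=2)      # variance of those three distances
```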
- flintypy.v_stat._chi2_weights(alpha, beta, gamma, n)¶
Get Chi Square Weights
Computes convolution weights for the asymptotic random variable from the covariance entries \(\alpha\), \(\beta\) and \(\gamma\) obtained from \(\mathbf{X}\).
- Parameters
alpha (float) – The variance of \(d(X_1, X_2)\)
beta (float) – The covariance of \(d(X_1,X_2)\) and \(d(X_1,X_3)\)
gamma (float) – The covariance of \(d(X_1,X_2)\) and \(d(X_3,X_4)\)
n (int) – Sample size (i.e., \(\mathbf{X}\).shape[0])
- Returns
Two floats, \(w_1\) and \(w_2\), where \(w_1\) is the weight for the chi square distribution with \(n-1\) degrees of freedom and \(w_2\) is the weight for the chi square distribution with \({n-1 \choose 2} - 1\) degrees of freedom.
- Return type
float 1D array
- flintypy.v_stat._convolution_of_chi2(val, w1, w2, d1, d2)¶
Tail Probability for Chi Square Convolution Random Variable
Computes \(P(X > val)\) where \(X = w_1 Y + w_2 Z\), \(Y\) is chi square distributed with \(d_1\) degrees of freedom, and \(Z\) is chi square distributed with \(d_2\) degrees of freedom. The probability is computed by numerical integration of the densities of the two chi square distributions, using the trapezoidal rule.
Depends on: scipy.stats.chi2
- Parameters
val (float) – The point at which to evaluate the anti-CDF (aka, observed statistic).
w1 (float) – The weight of the first chi square rv
w2 (float) – The weight of the second chi square rv
d1 (int) – The degrees of freedom of first chi square rv
d2 (int) – The degrees of freedom of second chi square rv
- Returns
\(1 - CDF = P(X > val)\), the probability that the rv is at least val
- Return type
float
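The tail probability can be written as \(P(X > val) = 1 - \int_0^{val/w_1} f_Y(y)\, F_Z((val - w_1 y)/w_2)\, dy\), and the sketch below evaluates it with the trapezoidal rule in plain Python. It is an illustration rather than the library's routine: the function names and the grid size n_grid are hypothetical, and it assumes positive weights and at least 2 degrees of freedom so the densities stay finite at zero.

```python
import math


def chi2_pdf(x, d):
    """Chi square density with d degrees of freedom (sketch assumes d >= 2)."""
    if x < 0:
        return 0.0
    if x == 0:
        return 0.5 if d == 2 else 0.0
    log_pdf = (d / 2 - 1) * math.log(x) - x / 2 - (d / 2) * math.log(2) - math.lgamma(d / 2)
    return math.exp(log_pdf)


def conv_chi2_tail(val, w1, w2, d1, d2, n_grid=2000):
    """P(w1*Y + w2*Z > val) via the trapezoidal rule (assumes w1, w2 > 0)."""
    if val <= 0:
        return 1.0
    hy = val / w1 / n_grid  # step for y on [0, val / w1]
    hz = val / w2 / n_grid  # step for z on [0, val / w2]
    # Cumulative trapezoidal CDF of Z on the grid z_j = j * hz.
    fz = [chi2_pdf(j * hz, d2) for j in range(n_grid + 1)]
    Fz = [0.0]
    for j in range(1, n_grid + 1):
        Fz.append(Fz[-1] + 0.5 * hz * (fz[j - 1] + fz[j]))
    # At y_k = k * hy the remaining budget (val - w1 * y_k) / w2 equals z_{n_grid - k}.
    g = [chi2_pdf(k * hy, d1) * Fz[n_grid - k] for k in range(n_grid + 1)]
    cdf = hy * (sum(g) - 0.5 * (g[0] + g[-1]))
    return 1.0 - cdf


# Check against a closed form: chi2(2) + chi2(4) is chi2(6).
tail = conv_chi2_tail(5.0, 1.0, 1.0, 2, 4)
exact = math.exp(-2.5) * (1 + 2.5 + 2.5 ** 2 / 2)
```

With unit weights the sum of the two variables is again chi square distributed, which gives a closed-form value to compare the numerical result against.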
- flintypy.v_stat._dist_data_large_p(dist_list)¶
Asymptotic p-value Computation for Test of Exchangeability Using Distance Data
Generates an asymptotic distribution of \(V\), by storing the provided list of distance data as a \(B\times {N \choose 2}\) array, and then using large-\(P\) theory to generate the asymptotic null distribution. The observed \(V\) statistic is also computed from the distance data.
Each element of dist_list should be the same type. They can either all be distance matrices (shape \((N,N)\)), or all be distance vectors (shape \(({N \choose 2},)\)).
Depends on: scipy.spatial.distance.squareform, _build_forward, _build_reverse, _numba_permute_dists
- Parameters
dist_list (List of numpy arrays (matrix / vector)) – The list of pairwise distances
- Returns
The asymptotic p-value.
- Return type
float
- flintypy.v_stat._dist_data_permute(dist_list, nruns)¶
p-value Computation for Test of Exchangeability using Distance Data
Generates a permutation null distribution of \(V\) by storing the provided list of distance data as a \(B \times {N \choose 2}\) array, and then permuting the underlying indices of each individual to generate resampled arrays. The observed \(V\) statistic is also computed from the distance data. The p-value is computed from the null distribution and the observed \(V\) statistic.
Each element of dist_list should be the same type. They can either all be distance matrices (shape \((N,N)\)), or all be distance vectors (shape \(({N \choose 2},)\)).
Depends on: scipy.spatial.distance.squareform, _build_forward, _build_reverse, _numba_permute_dists
- Parameters
dist_list (List of numpy arrays (matrix / vector)) – The list of pairwise distances
nruns (int) – The number of permutations to perform / resampling number.
- Returns
The block permutation p-value.
- Return type
float
- flintypy.v_stat._hamming_distance_gmpy(X)¶
Bit-Computation of Pairwise Hamming Distances
Uses bit operations to quickly compute pairwise Hamming distances for a two dimensional array \(\mathbf{X}\). Incurs some overhead in packing the bits, so performance gains are only for sufficiently large arrays.
Depends on: gmpy2.pack, gmpy2.hamdist
- Parameters
X (int64 numpy array) – An \(N\times P\) array recording \(P\) features in \(N\) individuals
- Returns
A length \({N \choose 2}\) array containing the pairwise Hamming distances in order \(((1,2), (1,3), \ldots, (1,N), (2,3),\ldots)\).
- Return type
float64 numpy array
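The bit-packing idea can be sketched without gmpy2 by packing each binary row into a Python integer and counting the set bits of the XOR of two rows. This is a plain-Python stand-in for illustration, not the gmpy2-based routine itself, and the helper names are hypothetical.

```python
from itertools import combinations


def pack_row(row):
    """Pack a sequence of 0/1 values into a single integer, one bit per feature."""
    value = 0
    for bit in row:
        value = (value << 1) | int(bit)
    return value


def hamming_bitpacked(X):
    """Pairwise Hamming distances: popcount of the XOR of the packed rows."""
    packed = [pack_row(row) for row in X]
    return [float(bin(a ^ b).count("1")) for a, b in combinations(packed, 2)]


X = [[0, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 1]]
dists = hamming_bitpacked(X)
```

Packing costs one pass over every row, which is the overhead mentioned above; the payoff is that each pairwise comparison becomes a single XOR followed by a bit count.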
- flintypy.v_stat._hamming_distances(X)¶
A Hamming Distance Vector Calculator
Uses a heuristic to decide whether to use bit operations or operate directly on a two-dimensional array \(\mathbf{X}\) to compute the pairwise Hamming distances.
Depends on: _hamming_distance_gmpy, scipy.spatial.distance.pdist
- Parameters
X (int64 numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
- Returns
A length \({N \choose 2}\) array containing the pairwise Hamming distances in order \(((1,2), (1,3), \ldots, (1,N), (2,3),\ldots)\).
- Return type
float64 numpy array
- flintypy.v_stat._ind_cov(X, p)¶
Covariance Computation Between Pairs of Distances (Independent Case)
Computes covariance matrix entries and associated \(\alpha\), \(\beta\) and \(\gamma\) quantities defined in Aw, Spence and Song (2021), assuming the \(P\) features of dataset \(\mathbf{X}\) are independent.
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
The three distinct entries of covariance matrix \((\alpha,\beta,\gamma)\), all floats.
- Return type
float 1D array
- flintypy.v_stat._ind_large_p(X, p)¶
Approximate p-value for Test of Exchangeability (Assuming Large P)
Computes the large \(P\) asymptotic p-value for dataset \(\mathbf{X}\), assuming its \(P\) features are independent.
Depends on: scipy.stats.chi2, _calculate_bin_v_stat, _convolution_of_chi2
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
The approximate p-value for \(\mathbf{X}\), a float.
- Return type
float
- flintypy.v_stat._ind_large_p_large_n(X, p)¶
Approximate p-value for Test of Exchangeability (Assuming Large N and P)
Computes the large \(N\), large \(P\) asymptotic p-value for \(\mathbf{X}\) assuming its \(P\) features are independent.
Depends on: _calculate_bin_v_stat, _calculate_real_v_stat, _ind_cov, _chi2_weights, scipy.stats.norm
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
The approximate p-value for \(\mathbf{X}\), a float.
- Return type
float
- flintypy.v_stat._naive_block_permute(X, block_labels, p)¶
Resampling V Statistic
Generates a new array \(\mathbf{X}'\) under the permutation null, and then returns the \(V\) statistic computed for \(\mathbf{X}'\).
Depends on: _calculate_bin_v_stat, _calculate_real_v_stat, _numba_permute
- Parameters
X (numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
block_labels (int64 numpy array) – A length \(P\) array, with entry \(i\) containing the label of the block that contains feature \(i\). Blocks are assumed to be labeled 0 to number of blocks - 1.
p (float) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\)
- Returns
\(V(\mathbf{X}')\), where \(\mathbf{X}'\) is a resampled version of \(\mathbf{X}\).
- Return type
float
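The blocks argument used elsewhere in this module (a list of index arrays) and the block_labels argument used here (one label per feature) describe the same partition. A minimal conversion sketch, with a hypothetical helper name and plain lists in place of numpy arrays, might look like this:

```python
def blocks_to_labels(blocks, n_features):
    """Turn a list of feature-index lists into a length-P list of block labels."""
    labels = [-1] * n_features  # -1 would flag a feature that no block covers
    for k, block in enumerate(blocks):
        for j in block:
            labels[j] = k  # feature j belongs to block k
    return labels


blocks = [[0, 2], [1, 3, 4]]
block_labels = blocks_to_labels(blocks, n_features=5)
```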
- flintypy.v_stat._numba_permute(X, block_labels)¶
Resampling Arrays
Generates a new array \(\mathbf{X}'\) under the permutation null.
Depends on: numba.njit
- Parameters
X (numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
block_labels (int64 numpy array) – A length \(P\) array, with entry \(i\) containing the label of the block that contains feature \(i\). Blocks are assumed to be labeled 0 to number of blocks - 1.
- Returns
An \(N \times P\) matrix \(\mathbf{X}'\), which is a permuted version of \(\mathbf{X}\).
- Return type
numpy array
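One way to picture resampling under the permutation null with blocks is sketched below. It assumes that individuals are shuffled independently for each block while the features inside a block move together; that reading, the helper name, and the explicit rng argument are assumptions for illustration, not a description of the numba implementation.

```python
import random


def block_permute_sketch(X, block_labels, rng):
    """Shuffle individuals independently per block, keeping each block's columns aligned."""
    n = len(X)
    n_blocks = max(block_labels) + 1
    perms = []
    for _ in range(n_blocks):
        perm = list(range(n))
        rng.shuffle(perm)  # one permutation of individuals per block
        perms.append(perm)
    # Entry (i, j) is taken from the individual that block(j)'s permutation sends to row i.
    return [
        [X[perms[block_labels[j]][i]][j] for j in range(len(block_labels))]
        for i in range(n)
    ]


X = [[0, 10, 100],
     [1, 11, 101],
     [2, 12, 102],
     [3, 13, 103]]
block_labels = [0, 0, 1]  # features 0 and 1 share a block, feature 2 is alone
X_perm = block_permute_sketch(X, block_labels, random.Random(0))
```

Because features 0 and 1 share a block, every resampled row still pairs a value k in column 0 with 10 + k in column 1, while column 2 is shuffled on its own.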
- flintypy.v_stat._numba_permute_dists(dists, forward, reverse)¶
Permutation by Caching Distances
What do you do when you have to compute pairwise distances many times, and those damn distances take a long time to compute? Answer: You cache the distances and permute the underlying sample labels!
Permutes pairwise distances (Hamming, \(l_p^p\), etc.) within blocks. Permutations respect the fact that we are actually permuting the underlying labels. Arguments forward and reverse should be precomputed using _build_forward and _build_reverse.
Depends on: numba.njit
- Parameters
dists (float64 numpy array) – A \(B \times {N \choose 2}\) array, where each column records the pairwise distances across all blocks, ordered as \(((1,2), (1,3), \ldots, (1,N), (2,3),\ldots)\)
forward (numpy 2D array) – An \(N \times N\) mapping from labels \(i, j\) to their corresponding index in dists. Should be precomputed using _build_forward.
reverse (numpy 2D array) – An \({N \choose 2}\times 2\) mapping from an index in dists to the corresponding labels \(i\) and \(j\). Should be precomputed using _build_reverse.
- Returns
A \(B \times {N \choose 2}\) array containing the block-permuted pairwise distances
- Return type
float64 2D array
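The caching trick can be sketched as follows: once the distances for a block are stored, permuting the individuals only requires re-indexing that block's row of dists through the two maps. The code is a plain-Python illustration with 0-based labels. The helper names are hypothetical, and the explicit perms argument is an assumption made so that the result can be checked against recomputed distances.

```python
from itertools import combinations


def build_maps(n):
    """Index -> pair (reverse) and pair -> index (forward) for the condensed order."""
    reverse = [list(pair) for pair in combinations(range(n), 2)]
    forward = [[-1] * n for _ in range(n)]
    for k, (i, j) in enumerate(reverse):
        forward[i][j] = forward[j][i] = k
    return forward, reverse


def permute_cached_dists(dists, forward, reverse, perms):
    """Re-index each block's cached distances under that block's label permutation."""
    out = []
    for row, perm in zip(dists, perms):
        # New distance for pair (i, j) is the cached distance of (perm[i], perm[j]).
        out.append([row[forward[perm[i]][perm[j]]] for i, j in reverse])
    return out


x = [0.0, 1.0, 3.0, 7.0]                 # one block with a single feature
forward, reverse = build_maps(len(x))
dists = [[abs(x[i] - x[j]) for i, j in reverse]]
perm = [2, 0, 3, 1]
permuted = permute_cached_dists(dists, forward, reverse, [perm])

# Recomputing distances on the permuted data gives the same row.
x_perm = [x[k] for k in perm]
recomputed = [abs(x_perm[i] - x_perm[j]) for i, j in reverse]
```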
- flintypy.v_stat.dist_data_p_value(dist_list, large_p=False, num_perms=1000)¶
A Non-parametric Test for Exchangeability and Homogeneity (Distance List Version)
Computes the p-value of a multivariate dataset, which informs the user if the sample is exchangeable at a given significance level, while simultaneously accounting for feature dependencies.
This version takes in a list of distances recording pairwise distances between individuals across either \(P\) independent features or \(B\) independent blocks of features.
Each element of dist_list should be the same type. They can either all be distance matrices (shape \((N,N)\)), or all be distance vectors (shape \(({N \choose 2},)\)).
Depends on: _dist_data_large_p, _dist_data_permute
- Parameters
dist_list (List of numpy arrays (matrix / vector)) – The list of pairwise distances
large_p (boolean, optional) – Indicates whether large P asymptotics are used to determine the approximate null distribution. The default is False.
num_perms (int) – The number of permutations to perform / resampling number. The default is 1000.
- Returns
The p-value.
- Return type
float
- flintypy.v_stat.get_p_value(X, blocks=None, large_p=False, large_n=False, num_perms=1000, p=2)¶
A Non-parametric Test of Exchangeability and Homogeneity
Computes the p-value of a multivariate dataset \(\mathbf{X}\), which informs the user if the sample is exchangeable at a given significance level, while simultaneously accounting for feature dependencies. See Aw, Spence and Song (2021) for details.
Automatically detects if dataset is binary, and runs the Hamming distance version of test if so. Otherwise, computes the squared Euclidean distance between individuals and evaluates whether the variance of Euclidean distances, \(V\), is atypically large under the null hypothesis of exchangeability.
Note the user may tweak the choice of power \(p\) if they prefer an \(l_p^p\) distance other than Euclidean (\(p=2\)).
Under the hood, the variance statistic \(V\) is computed efficiently. Moreover, the user can specify their choice of block permutations, large \(P\) asymptotics, or large \(P\) and large \(N\) asymptotics. The latter two return reasonably accurate p-values for moderately large dimensionalities.
User recommendations: When the number of independent blocks \(B\) or number of independent features \(P\) is at least \(50\), it is safe to use large \(P\) asymptotics. If \(P\) or \(B\) is small, stick with permutations.
Depends on: _ind_large_p, _ind_large_p_large_n, _block_large_p, _block_large_p_large_n, _block_permute
- Parameters
X (float numpy array) – An \(N \times P\) array recording \(P\) features in \(N\) individuals
blocks (List of int64 numpy arrays, optional) – List of arrays, with the \(k\) th array containing the indices of features in block \(k\). The default is None.
large_p (boolean, optional) – Indicates whether large \(P\) asymptotics are used to determine the approximate null distribution. The default is False.
large_n (boolean, optional) – Indicates whether large \(P\), large \(N\) asymptotics are used to determine the approximate null distribution. The default is False.
num_perms (int, optional) – Determines the number of permutations to perform if not using an asymptotic test. If either large_p or large_n is true, num_perms is ignored. The default is 1000.
p (float, optional) – The Minkowski power, \(l_p^p = (x_1^p+\ldots+x_n^p)\). The default is 2.
- Raises
IOError – These are errors associated with assertions and unit tests.
NotImplementedError – This error corresponds to the large \(N\), small \(P\) asymptotics, which is currently not implemented.
- Returns
The p-value computed for the \(V\) statistic of \(\mathbf{X}\).
- Return type
float
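To make the permutation version of the test concrete, here is a self-contained plain-Python sketch for the case of independent features (no blocks). It is not the library's implementation. The function names are hypothetical, the variance is left unscaled, and taking the p-value as the fraction of resampled statistics at least as large as the observed one is an assumed convention.

```python
import random
from itertools import combinations
from statistics import pvariance


def v_stat_sketch(X, p=2):
    """Unscaled variance of the pairwise l_p^p distances between rows of X."""
    dists = [
        sum(abs(a - b) ** p for a, b in zip(x, y))
        for x, y in combinations(X, 2)
    ]
    return pvariance(dists)


def permutation_p_value_sketch(X, num_perms=200, p=2, seed=0):
    """Shuffle each feature independently and compare resampled V to observed V."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    observed = v_stat_sketch(X, p)
    hits = 0
    for _ in range(num_perms):
        columns = []
        for j in range(n_features):
            column = [row[j] for row in X]
            rng.shuffle(column)  # independent features: one shuffle per column
            columns.append(column)
        X_perm = [[columns[j][i] for j in range(n_features)] for i in range(n)]
        if v_stat_sketch(X_perm, p) >= observed:
            hits += 1
    return hits / num_perms


# Two clearly separated groups of four individuals: a non-exchangeable sample.
X = [[0] * 12 for _ in range(4)] + [[1] * 12 for _ in range(4)]
p_value = permutation_p_value_sketch(X)
```

In this toy dataset the between-group distances are all 12 and the within-group distances are all 0, so the observed variance is far larger than what independent shuffling of the features typically produces, and the sketch returns a small p-value.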