Fast weighted Hamming distance in R

Upon request, I am posting a row-weighted version of the fast Hamming distance function in my previous posts (here and here).

The fast row-weighted binary Hamming distance:

hamming_binary <- function(X, w = rep(1, nrow(X))) {
	# X: binary matrix with features in rows and observations in columns
	# w: one non-negative weight per feature (row)
	# Split each weight over the two factors of the product below
	w <- sqrt(w)
	# D[i, j]: weighted count of positions where column i is 0 and column j is 1
	D <- t(w * (1 - X)) %*% (w * X)
	# Adding the transpose also counts positions where i is 1 and j is 0
	D + t(D)
}
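
For illustration, a minimal usage sketch (the toy matrix and weights below are made up for this example): observations are in columns, features in rows, and w holds one weight per feature.

# Toy data: 4 binary features (rows), 3 observations (columns)
X <- matrix(c(0, 1, 1, 0,
              1, 1, 0, 0,
              0, 1, 0, 1), nrow = 4)
w <- c(2, 1, 1, 1)  # double weight for the first feature
hamming_binary(X, w)
# Sanity check against a direct computation for columns 1 and 2:
sum(w * (X[, 1] != X[, 2]))  # equals hamming_binary(X, w)[1, 2]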

The fast row-weighted “multi-level” Hamming distance:

hamming <- function(X, w = rep(1, nrow(X))) {
	# All distinct values occurring anywhere in X
	uniqs <- unique(as.vector(X))
	# Sum the binary Hamming distances of the indicator matrices X == value
	H <- hamming_binary(X == uniqs[1], w)
	for ( uniq in uniqs[-1] ) {
		H <- H + hamming_binary(X == uniq, w)
	}
	# Each mismatch is counted twice (once for each of the two values), so halve
	H / 2
}

Note that the last function does not depend on the mode or the number of unique elements of the matrix. In other words, the matrix can consist of any number of different characters, integers, or any other type of element you can store in an R matrix. Also, because all looping over matrix entries happens inside matrix multiplication rather than at the R level, it is extremely fast.
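
To illustrate, a small sketch with a character matrix (the data and weights are made up for this example):

# Toy data: 5 categorical features, 100 observations, 3 levels
X <- matrix(sample(c("A", "C", "G"), 5 * 100, replace = TRUE), nrow = 5)
w <- c(1, 1, 1, 5, 5)  # up-weight the last two features
D <- hamming(X, w)
# Sanity check against a direct computation for one pair of columns:
all.equal(D[1, 2], sum(w * (X[, 1] != X[, 2])))  # TRUE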

Responses to “Fast weighted Hamming distance in R”

  1. Fast Hamming distance in R using covariance – Johann de Jong

    […] in base R using matrix multiplication. e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and shown that a multi-level Hamming distance matrix can be computed by adding multiple binary […]

  2. William

    I’ve come to this post rather late, but I’m looking to do something similar to what you’ve discussed here. This is an elegant solution, but I have one question: would this benefit from being parallelized at all? Could the for loop be turned into a foreach or a future_map, or would parallelization not play nicely here?

    Right now, when I run your function, I’m only using one of my computer’s cores, which seems like a lot of processing power to leave on the table for larger matrices.

  3. johanndejong

    Yes, parallelization can make the function faster. However, you might consider parallelizing at the matrix multiplication level (instead of at the for loop level). For example, note that parallelization at the for loop level would make no difference for binary matrices. Maybe have a look at this package: https://cran.r-project.org/web/packages/Rfast/index.html
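
    For instance, something along these lines might work (untested sketch; it assumes Rfast exposes mat.mult for a multithreaded matrix product):

    library(Rfast)

    hamming_binary_mt <- function(X, w = rep(1, nrow(X))) {
        w <- sqrt(w)
        # Same computation as hamming_binary, but with a
        # (possibly) multithreaded matrix product
        D <- mat.mult(t(w * (1 - X)), w * X)
        D + t(D)
    }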

    1. William

      Thank you so much for the advice!

  4. Guangzhi

    I have a dataset of 40 rows (features) and 3500 columns (observations). I used your Faster Hamming distance in R (2) and got the pairwise Hamming distance matrix (3500×3500) for the 3500 observations. I would also like to get a pairwise distance matrix with weighted rows. I ran the functions in this post, and it did return a matrix. I am just wondering if I used your function correctly?

    Thanks

    1. Anonymous

      it did not return a matrix

      1. johanndejong

        Hi! Can you give me some more details? Some code to reproduce would be helpful.

      2. Guangzhi

        Thanks for your reply. It was my mistake: I did not run the hamming_binary function before I ran the hamming function. I made it work by running hamming_binary first, and got the row-weighted distance matrix as output. Your code worked beautifully for a few smaller datasets (40 rows and 3000 columns); the running time was within several minutes.

        However, I did run into a few other problems:
        1. I got an empty weighted distance matrix with a dataset (40 rows, 1000 columns) when I used a weight vector containing negative numbers. Does that mean that I cannot use negative values for weights?

        2. I would like to run a much larger dataset to get weighted and unweighted Hamming distances, for example a dataset with over 1000 rows (features, categorical data) and over 10,000 columns (observations). I tried the Hamming distance with a dataset of 1000 rows and 4000 columns, but did not get the output after an hour, so I stopped the calculation. My question is: is it possible to run a large dataset (1000 rows × 10,000 columns), and how long would it take?

        Thanks for your time and any help is highly appreciated.

      3. johanndejong

        A few minutes for a 40×1000 matrix is very long, so that probably means you have a large number of unique values in your matrix. In that case, you could consider parallelizing the for loop in the hamming function. Alternatively, you can parallelize the matrix multiplication step itself using e.g. the Rfast package: https://cran.r-project.org/web/packages/Rfast/index.html
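
        For instance, a minimal sketch of parallelizing the loop with the base parallel package (mclapply is fork-based, hence Unix-only; on Windows, parLapply would be the analogue):

        library(parallel)

        hamming_parallel <- function(X, w = rep(1, nrow(X)), cores = 2) {
            uniqs <- unique(as.vector(X))
            # One binary Hamming distance per unique value, in parallel
            H <- mclapply(uniqs, function(u) hamming_binary(X == u, w),
                          mc.cores = cores)
            # Sum the per-value matrices; each mismatch is counted twice
            Reduce(`+`, H) / 2
        }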

        Hope that helps.

  5. Anonymous

    Thanks
