Fast weighted Hamming distance in R

Upon request, I am posting a row-weighted version of the fast Hamming distance function in my previous posts (here and here).

The fast row-weighted binary Hamming distance:

hamming_binary <- function(X, w = rep(1, nrow(X))) {
	# X: binary matrix with features in rows and observations in columns
	# w: one non-negative weight per feature (row)
	# Split each weight over the two factors of the product below
	w <- sqrt(w)
	# D[i, j]: weighted count of positions where column i is 0 and column j is 1
	D <- t(w * (1 - X)) %*% (w * X)
	# Adding the transpose also counts positions where i is 1 and j is 0
	D + t(D)
}
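
For illustration, a minimal usage sketch (the toy matrix and weights below are made up for this example): observations are in columns, features in rows, and w holds one weight per feature.

# Toy data: 4 binary features (rows), 3 observations (columns)
X <- matrix(c(0, 1, 1, 0,
              1, 1, 0, 0,
              0, 1, 0, 1), nrow = 4)
w <- c(2, 1, 1, 1)  # double weight for the first feature
hamming_binary(X, w)
# Sanity check against a direct computation for columns 1 and 2:
sum(w * (X[, 1] != X[, 2]))  # equals hamming_binary(X, w)[1, 2]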

The fast row-weighted “multi-level” Hamming distance:

hamming <- function(X, w = rep(1, nrow(X))) {
	# All distinct values occurring anywhere in X
	uniqs <- unique(as.vector(X))
	# Sum the binary Hamming distances of the indicator matrices X == value
	H <- hamming_binary(X == uniqs[1], w)
	for ( uniq in uniqs[-1] ) {
		H <- H + hamming_binary(X == uniq, w)
	}
	# Each mismatch is counted twice (once for each of the two values), so halve
	H / 2
}

Note that the last function does not depend on the mode or the number of unique elements of the matrix. In other words, the matrix can consist of any number of different characters, integers, or any other type of element you can store in an R matrix. Also, because all looping over matrix entries happens inside matrix multiplication rather than at the R level, it is extremely fast.
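
To illustrate, a small sketch with a character matrix (the data and weights are made up for this example):

# Toy data: 5 categorical features, 100 observations, 3 levels
X <- matrix(sample(c("A", "C", "G"), 5 * 100, replace = TRUE), nrow = 5)
w <- c(1, 1, 1, 5, 5)  # up-weight the last two features
D <- hamming(X, w)
# Sanity check against a direct computation for one pair of columns:
all.equal(D[1, 2], sum(w * (X[, 1] != X[, 2])))  # TRUE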

Responses to “Fast weighted Hamming distance in R”

  1. Fast Hamming distance in R using covariance – Johann de Jong

    […] in base R using matrix multiplication. e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and shown that a multi-level Hamming distance matrix can be computed by adding multiple binary […]

  2. William

    I’ve come to this post rather late, but I’m looking to do something similar to what you’ve discussed here. This is an elegant solution, but I have one question: would this benefit from being parallelized at all? Could the for loop be turned into a foreach or a future_map, or would parallelization not play nicely here?

    Right now, when I run your function, I’m only using one of my computer’s cores, which seems like a lot of processing power to leave on the table for larger matrices.

  3. johanndejong

    Yes, parallelization can make the function faster. However, you might consider parallelizing at the matrix multiplication level (instead of at the for loop level). For example, note that parallelization at the for loop level would make no difference for binary matrices. Maybe have a look at this package: https://cran.r-project.org/web/packages/Rfast/index.html
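
    For instance, something along these lines might work (untested sketch; it assumes Rfast exposes mat.mult for a multithreaded matrix product):

    library(Rfast)

    hamming_binary_mt <- function(X, w = rep(1, nrow(X))) {
        w <- sqrt(w)
        # Same computation as hamming_binary, but with a
        # (possibly) multithreaded matrix product
        D <- mat.mult(t(w * (1 - X)), w * X)
        D + t(D)
    }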

    1. William

      Thank you so much for the advice!

  4. Guangzhi

    I have a dataset of 40 rows (features) and 3500 columns (observations). I used your Faster Hamming distance in R (2) and got the pairwise Hamming distance matrix (3500×3500) for the 3500 observations. I would also like to get a pairwise distance matrix with weighted rows. I ran the functions in this post, and it did return a matrix. I am just wondering if I used your function correctly?

    Thanks

    1. Anonymous

      it did not return a matrix

      1. johanndejong

        Hi! Can you give me some more details? Some code to reproduce would be helpful.

      2. Guangzhi

        Thanks for your reply. It was my mistake: I did not run the hamming_binary function before I ran the hamming function. I made it work by running hamming_binary first, and got the row-weighted distance matrix as output. Your code worked beautifully for a few smaller datasets (40 rows and 3000 columns); the running time was within several minutes.

        However, I did run into a few other problems:
        1. I got an empty weighted distance matrix with a dataset (40 rows, 1000 columns) when I used a weight vector containing negative numbers. Does that mean that I cannot use negative values for weights?

        2. I would like to run a much larger dataset to get weighted and unweighted Hamming distances, for example a dataset with over 1000 rows (features, categorical data) and over 10,000 columns (observations). I tried the Hamming distance with a dataset of 1000 rows and 4000 columns, but did not get the output after an hour, so I stopped the calculation. My question is: is it possible to run a large dataset (1000 rows × 10,000 columns), and how long would it take?

        Thanks for your time and any help is highly appreciated.

      3. johanndejong

        A few minutes for a 40×1000 matrix is very long, so that probably means you have a large number of unique values in your matrix. In that case, you could consider parallelizing the for loop in the hamming function. Alternatively, you can parallelize the matrix multiplication step itself using e.g. the Rfast package: https://cran.r-project.org/web/packages/Rfast/index.html
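
        For instance, a minimal sketch of parallelizing the loop with the base parallel package (mclapply is fork-based, hence Unix-only; on Windows, parLapply would be the analogue):

        library(parallel)

        hamming_parallel <- function(X, w = rep(1, nrow(X)), cores = 2) {
            uniqs <- unique(as.vector(X))
            # One binary Hamming distance per unique value, in parallel
            H <- mclapply(uniqs, function(u) hamming_binary(X == u, w),
                          mc.cores = cores)
            # Sum the per-value matrices; each mismatch is counted twice
            Reduce(`+`, H) / 2
        }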

        Hope that helps.

  5. Anonymous

    Thanks
