Column rescaling for a very large sparse matrix in R


I have a large (~500,000 x ~500,000) sparse matrix in R, and I am trying to divide each column by its sum:

sm = t(t(sm) / colSums(sm))

However, when I do this I get the following error:

# Error in evaluating the argument 'x' in selecting a method for function 't':
# Error: cannot allocate vector of size 721.1 Gb

Is there a better way to do this in R? I am able to store colSums fine, and to compute and store the transpose of the sparse matrix, but the problem seems to arise when trying to perform "/". It looks like the sparse matrix is converted to a full dense matrix here.

Any help would be appreciated. Thank you!

This is what we can do, assuming A is a dgCMatrix:

A@x <- A@x / rep.int(colSums(A), diff(A@p))

This requires some understanding of the dgCMatrix class.

  1. @x stores the non-zero matrix values, in a packed 1D array;
  2. @p stores the cumulative number of non-zero elements by column, hence diff(A@p) gives the number of non-zero elements in each column.

We repeat each element of colSums(A) by the number of non-zero elements in that column, then divide A@x by this vector. In the end, we update A@x with the rescaled values. In this way, column rescaling is done in a sparse manner.
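To make the slot layout concrete, here is a minimal sketch on a tiny matrix. The 3 x 3 matrix `M` and the helper name `denom` are made up for illustration; `sparseMatrix()` is used only because it reliably returns a dgCMatrix.

```r
library(Matrix)

# Hypothetical 3 x 3 matrix with 4 non-zero entries:
#   [1,] 1 . 4
#   [2,] . . .
#   [3,] 3 2 .
M <- sparseMatrix(i = c(1, 3, 3, 1), j = c(1, 1, 2, 3),
                  x = c(1, 3, 2, 4), dims = c(3, 3))

M@x        # non-zero values, stored column by column: 1 3 2 4
M@p        # cumulative non-zero counts per column:    0 2 3 4
diff(M@p)  # non-zeros in each column:                 2 1 1

# One divisor per stored value: each column sum repeated
# once for every non-zero in that column
denom <- rep.int(colSums(M), diff(M@p))   # 4 4 2 4

M@x <- M@x / denom   # only the stored values are touched
colSums(M)           # every column now sums to 1
```

Only vectors whose length is the number of non-zeros are created here, so nothing of the full matrix dimension is ever allocated.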


Example:

library(Matrix)
set.seed(2); A <- Matrix(rbinom(100,10,0.05), nrow = 10)

#10 x 10 sparse matrix of class "dgCMatrix"
#
# [1,] . . 1 . 2 . 1 . . 2
# [2,] 1 . . . . . 1 . 1 .
# [3,] . 1 1 1 . 1 1 . . .
# [4,] . . . 1 . 2 . . . .
# [5,] 2 . . . 2 . 1 . . .
# [6,] 2 1 . 1 1 1 . 1 1 .
# [7,] . 2 . 1 2 1 . . 2 .
# [8,] 1 . . . . 3 . 1 . .
# [9,] . . 2 1 . 1 . . 1 .
#[10,] . . . . 1 1 . . . .

diff(A@p)    ## number of non-zeros per column
# [1] 4 3 3 5 5 7 4 2 4 1

colSums(A)   ## column sums
# [1]  6  4  4  5  8 10  4  2  5  2

A@x <- A@x / rep.int(colSums(A), diff(A@p))    ## sparse column rescaling

#10 x 10 sparse matrix of class "dgCMatrix"
#
# [1,] .         .    0.25 .   0.250 .   0.25 .   .   1
# [2,] 0.1666667 .    .    .   .     .   0.25 .   0.2 .
# [3,] .         0.25 0.25 0.2 .     0.1 0.25 .   .   .
# [4,] .         .    .    0.2 .     0.2 .    .   .   .
# [5,] 0.3333333 .    .    .   0.250 .   0.25 .   .   .
# [6,] 0.3333333 0.25 .    0.2 0.125 0.1 .    0.5 0.2 .
# [7,] .         0.50 .    0.2 0.250 0.1 .    .   0.4 .
# [8,] 0.1666667 .    .    .   .     0.3 .    0.5 .   .
# [9,] .         .    0.50 0.2 .     0.1 .    .   0.2 .
#[10,] .         .    .    .   0.125 0.1 .    .   .   .

@thelatemail mentioned another method, by first converting the dgCMatrix to a dgTMatrix:

AA <- as(A, "dgTMatrix")
A@x <- A@x / colSums(A)[AA@j + 1L]

For the dgTMatrix class there is no @p but @j, giving the column index (0-based) of the non-zero matrix elements.
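As a quick sanity check, the sketch below compares the two routes on a small made-up matrix. The names `B1` and `B2` are hypothetical. It converts with `as(A, "TsparseMatrix")`, which I am assuming is an acceptable stand-in for the `"dgTMatrix"` coercion above (newer versions of Matrix may warn about the latter). The triplet route relies on the converted object listing its entries in the same column-by-column order as `A@x`.

```r
library(Matrix)

# Hypothetical small dgCMatrix
A <- sparseMatrix(i = c(1, 3, 3, 1), j = c(1, 1, 2, 3),
                  x = c(1, 3, 2, 4), dims = c(3, 3))

# Route 1: expand the column sums using the column pointers @p
B1 <- A
B1@x <- A@x / rep.int(colSums(A), diff(A@p))

# Route 2: look up the column sum via the 0-based column index @j
AA <- as(A, "TsparseMatrix")   # triplet form, used only for its @j slot
B2 <- A
B2@x <- A@x / colSums(A)[AA@j + 1L]

# Both routes should give the same rescaled values
all.equal(B1@x, B2@x)
```

The first route avoids the extra conversion, so it is probably the lighter option when the matrix is already a dgCMatrix.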

