July 1, 2015

Activity-based microsimulation models

  • Input: register data, demographic and mobility surveys
  • Survey calibration: Compute weights
  • Sampling without replacement: Transform to integer weights
  • Statistical matching: Combine datasets

Overview

  1. Weighted random sampling without replacement
  2. Similarity-based statistical matching
  3. Survey calibration

Weighted random sampling without replacement

Random sampling

sample.int
function (n, size = n, replace = FALSE, prob = NULL) 
{
    if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) 
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}
  • n: Number of items to choose from
  • size: Number of items to choose
    • optimized code path for large n and size ≤ n/2 (see condition above)
  • replace: With or without replacement
  • prob: Uniform or non-uniform probabilities
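Putting the arguments together, a small weighted draw without replacement with base R:

```r
# Weighted sampling without replacement with base R:
# pick 3 out of 10 items, item i drawn proportionally to weight i
set.seed(42)  # for reproducibility
x <- sample.int(10, size = 3, replace = FALSE, prob = 1:10)
x  # three distinct values from 1:10, larger values more likely
```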

Implementation of random sampling

Common framework:

  • Subdivide an interval according to probabilities
  • Repeat
    • Uniformly sample point from interval
    • Choose sub-interval covered by point
    • If sampling without replacement, remove sub-interval
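The loop above can be sketched in a few lines of R (a hypothetical helper, not taken from any package; recomputing the interval breaks on every draw is exactly the trivial O(n)-update variant):

```r
# Interval framework for weighted sampling without replacement.
# Each item owns a sub-interval of length prob[i]; removing an item
# sets its length to zero.
sample_interval <- function(prob, size) {
  out <- integer(size)
  for (i in seq_len(size)) {
    breaks <- cumsum(prob)                    # subdivide the interval
    u <- runif(1, 0, breaks[length(breaks)])  # uniform point in interval
    j <- findInterval(u, breaks) + 1L         # sub-interval covering u
    out[i] <- j
    prob[j] <- 0                              # remove the sub-interval
  }
  out
}
sort(sample_interval(rep(1, 5), 5))  # drawing all items yields a permutation: 1 2 3 4 5
```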


Random sampling w/o replacement

Non-uniform probabilities

  • R uses a trivial algorithm with updates in \(O(n)\)
  • A heap-like data structure (Partial Sum Trees) supports updates in \(O(\log n)\) (Wong and Easton, 1980)
  • Alternative approaches

Alternative approaches

  • Rejection sampling
    • Sample with replacement, throw away duplicates
  • One-pass sampling (Efraimidis and Spirakis, 2006)
wrswoR::sample_int_rank
function (n, size, prob) 
{
    .check_args(n, size, prob)
    head(order(rexp(n)/prob), size)
}
Equivalently, with magrittr:

n %>% rexp %>% divide_by(prob) %>% order %>% head(size)
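Both alternatives can be sketched in base R. The rejection sampler below is a hypothetical helper, not the package implementation; the second snippet checks the exponential-rank rule empirically for a single draw, where item i should be picked with probability prob[i] / sum(prob):

```r
# Alternative 1, rejection sampling: draw with replacement,
# drop duplicates, top up until `size` distinct items remain.
sample_int_rejection <- function(n, size, prob) {
  stopifnot(size <= n)
  picked <- integer(0)
  while (length(picked) < size) {
    extra <- sample.int(n, size - length(picked), replace = TRUE, prob = prob)
    picked <- unique(c(picked, extra))
  }
  picked
}

# Alternative 2, one-pass exponential ranks: the first element of
# order(rexp(n) / prob) is item i with probability prob[i] / sum(prob).
set.seed(1)
prob <- c(1, 2, 3)
first <- replicate(10000, order(rexp(3) / prob)[1])
round(table(first) / 10000, 2)  # roughly 1/6, 1/3, 1/2
```

Rejection sampling works well when size is small relative to n or weights are near-uniform; it degenerates when a few heavy items dominate the draws.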

Run time

[Benchmark plots: run time as a function of n, for size = n and
size = n / 10 with uniform weights (prob = 1), and for
size = n / 100, n / 10 and n with linear weights (prob = 1:n)]

Modelling run time

lm(log10(time) ~ a. + a.:log10(n) + a.:log10(size) - 1, weights = n)
                    Estimate Std. Error
a.crank               1.8789     0.0186
a.crank:log10(n)      0.7809     0.0043
a.crank:log10(size)   0.3106     0.0019
a.rej                 3.1159     0.0186
a.rej:log10(n)        0.4327     0.0043
a.rej:log10(size)     0.4296     0.0019
a.R                   0.2789     0.0186
a.R:log10(n)          1.0722     0.0043
a.R:log10(size)       0.8159     0.0019
10 ^ 0.02
[1] 1.047129
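Reading the coefficients as power laws (values copied from the model output above, my arithmetic), the fit predicts, for example, the speed-up of the one-pass method (crank) over base R at n = 10^6, size = 10^5:

```r
# log10(time) = a + a_n * log10(n) + a_size * log10(size),
# with coefficients taken from the fitted model above
pred_log10_time <- function(a, a_n, a_size, n, size) {
  a + a_n * log10(n) + a_size * log10(size)
}
crank  <- pred_log10_time(1.8789, 0.7809, 0.3106, n = 1e6, size = 1e5)
base_r <- pred_log10_time(0.2789, 1.0722, 0.8159, n = 1e6, size = 1e5)
10 ^ (base_r - crank)  # predicted speed-up factor, roughly 470
```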

Correctness


devtools::install_github("krlmlr/wrswoR")

Similarity-based statistical matching

Statistical matching (data fusion)

  • Input: two joint distributions \((X, Y)\) and \((X, Z)\)
  • Output: a joint distribution \((X, Y, Z)\)

Hot deck: Combine observations from \((X, Y)\) with "matching" observations from \((X, Z)\) to create a realization of \((X, Y, Z)\).

\[(x, y) ⋈ (x', z) = (x, y, z)\]

  • Precondition: \(Y\) and \(Z\) independent given \(X\)

  • Variants:
    • Exact: \(x' = x\)
    • Approximate: \(x'\) similar to \(x\)
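A toy sketch of approximate hot-deck matching on a single common variable \(x\) (made-up data, not from the talk):

```r
# Donors carry z, recipients carry y; x is the common variable.
donors     <- data.frame(x = c(1.0, 2.0, 3.0), z = c("p", "q", "r"),
                         stringsAsFactors = FALSE)
recipients <- data.frame(x = c(1.1, 2.9), y = c(10, 20))
# Approximate match: nearest donor in x for each recipient
idx <- vapply(recipients$x, function(x) which.min(abs(donors$x - x)), integer(1))
# Join each recipient (x, y) with its donor's z to realize (x, y, z)
(fused <- cbind(recipients, z = donors$z[idx]))
```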

Gower's distance

Distance metric for multivariate distributions (Gower, 1971)

Weighted mean of per-variable distances, each in \([0, 1]\)

  • Interval-scaled variables: Absolute difference relative to the range
  • Ordinal variables: Rank difference relative to the number of levels minus one
  • Nominal variables: 0 if equal, 1 otherwise
(data <- cbind(data_int, data_ord, data_nm2, data_nm3))
  int ord nm2 nm3
1   1   a   A   C
2   3   b   B   D
3   6   c   B   E

From Gower's to Manhattan distance

data_int
  int
1   1
2   3
3   6
daisy(data_int, "gower")
    1   2
2 0.4    
3 1.0 0.6
(6 - 3) / 5
[1] 0.6
(mdata_int <- mangow(data_int))
     int
[1,] 0.0
[2,] 0.4
[3,] 1.0
daisy(mdata_int, "manhattan")
    1   2
2 0.4    
3 1.0 0.6
Distance between second and third observation

From Gower's to Manhattan distance

data_ord
  ord
1   a
2   b
3   c
daisy(data_ord, "gower")
    1   2
2 0.5    
3 1.0 0.5
(3 - 2) / 2
[1] 0.5
(mdata_ord <- mangow(data_ord))
     ord
[1,] 0.0
[2,] 0.5
[3,] 1.0
daisy(mdata_ord, "manhattan")
    1   2
2 0.5    
3 1.0 0.5
Distance between second and third observation

From Gower's to Manhattan distance

data_nm2
  nm2
1   A
2   B
3   B
daisy(data_nm2, "gower")
  1 2
2 1  
3 1 0
ifelse("B" == "B", 0, 1)
[1] 0
(mdata_nm2 <- mangow(data_nm2))
     nm2.A.B
[1,]     0.5
[2,]    -0.5
[3,]    -0.5
daisy(mdata_nm2, "manhattan")
  1 2
2 1  
3 1 0
Distance between second and third observation

From Gower's to Manhattan distance

data_nm3
  nm3
1   C
2   D
3   E
daisy(data_nm3, "gower")
  1 2
2 1  
3 1 1
ifelse("D" == "E", 0, 1)
[1] 1
(mdata_nm3 <- mangow(data_nm3))
     nm3.C.D nm3.E
[1,]     0.5   0.0
[2,]    -0.5   0.0
[3,]     0.0   0.5
daisy(mdata_nm3, "manhattan")
  1 2
2 1  
3 1 1
Distance between second and third observation

From Gower's to Manhattan distance

data
  int ord nm2 nm3
1   1   a   A   C
2   3   b   B   D
3   6   c   B   E
daisy(data, "gower")
      1     2
2 0.725      
3 1.000 0.525
(0.6 + 0.5 + 0 + 1) / 4
[1] 0.525
(mdata <- mangow(data))
      int   ord nm2.A.B nm3.C.D nm3.E
[1,] 0.00 0.000   0.125   0.125 0.000
[2,] 0.10 0.125  -0.125  -0.125 0.000
[3,] 0.25 0.250  -0.125   0.000 0.125
daisy(mdata, "manhattan")
      1     2
2 0.725      
3 1.000 0.525
Distance between second and third observation

Run time improvements

  • Statistical matching with \(8 \times 10^6\) recipients vs. \(5 \times 10^4\) donors
    • Naïve approach: 20 hours
      • Compute all-pairs distances
    • Transform the problem: 20 minutes
      • Gower's → Manhattan distance
      • 4 variables → 6 numeric columns
      • Use RANN.L1 to find the \(k\) nearest donors for each recipient
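On toy data the transformed pipeline looks like this; the hand-rolled interval transform is a simplified stand-in for mangow, and the brute-force L1 search in the last step is what RANN.L1::nn2() replaces at scale:

```r
# Embed an interval-scaled variable into [0, 1] (Gower-style),
# then find the nearest donor in Manhattan (L1) distance.
embed_interval <- function(x) (x - min(x)) / diff(range(x))
donor_x     <- c(1, 3, 6)
recipient_x <- c(2.5, 5)
emb <- embed_interval(c(donor_x, recipient_x))  # joint scaling
donor_emb <- emb[1:3]
recip_emb <- emb[4:5]
# Brute-force 1-nearest donor per recipient (RANN.L1::nn2 at scale)
nn <- vapply(recip_emb, function(r) which.min(abs(donor_emb - r)), integer(1))
nn  # donor index for each recipient
```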

devtools::install_github("krlmlr/mangow")

Survey calibration

http://krlmlr-user15.github.io/