July 1, 2015

Activity-based microsimulation models

  • Input: register data, demographic and mobility surveys
  • Survey calibration: Compute weights
  • Sampling without replacement: Transform to integer weights
  • Statistical matching: Combine datasets

Overview

  1. Weighted random sampling without replacement
  2. Similarity-based statistical matching
  3. Survey calibration

Weighted random sampling without replacement

Random sampling

sample.int
function (n, size = n, replace = FALSE, prob = NULL) 
{
    if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) 
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}
  • n: Number of items to choose from
  • size: Number of items to choose
    • optimized code path for large n and size ≤ n/2 (see condition above)
  • replace: With or without replacement
  • prob: Uniform or non-uniform probabilities
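Putting the arguments together, a small weighted draw without replacement with base R:

```r
# Weighted sampling without replacement with base R:
# pick 3 out of 10 items, item i drawn proportionally to weight i
set.seed(42)  # for reproducibility
x <- sample.int(10, size = 3, replace = FALSE, prob = 1:10)
x  # three distinct values from 1:10, larger values more likely
```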

Implementation of random sampling

Common framework:

  • Subdivide an interval according to probabilities
  • Repeat
    • Uniformly sample point from interval
    • Choose sub-interval covered by point
    • If sampling without replacement, remove sub-interval
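The loop above can be sketched in a few lines of R (a hypothetical helper, not taken from any package; recomputing the interval breaks on every draw is exactly the trivial O(n)-update variant):

```r
# Interval framework for weighted sampling without replacement.
# Each item owns a sub-interval of length prob[i]; removing an item
# sets its length to zero.
sample_interval <- function(prob, size) {
  out <- integer(size)
  for (i in seq_len(size)) {
    breaks <- cumsum(prob)                    # subdivide the interval
    u <- runif(1, 0, breaks[length(breaks)])  # uniform point in interval
    j <- findInterval(u, breaks) + 1L         # sub-interval covering u
    out[i] <- j
    prob[j] <- 0                              # remove the sub-interval
  }
  out
}
sort(sample_interval(rep(1, 5), 5))  # drawing all items yields a permutation: 1 2 3 4 5
```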


Random sampling w/o replacement

Non-uniform probabilities

  • R uses a trivial algorithm with updates in \(O(n)\)
  • A heap-like data structure (Partial Sum Trees) supports updates in \(O(\log n)\) (Wong and Easton, 1980)
  • Alternative approaches

Alternative approaches

  • Rejection sampling
    • Sample with replacement, throw away duplicates
  • One-pass sampling (Efraimidis and Spirakis, 2006)
wrswoR::sample_int_rank
function (n, size, prob) 
{
    .check_args(n, size, prob)
    head(order(rexp(n)/prob), size)
}
Equivalently, with magrittr:

n %>% rexp %>% divide_by(prob) %>% order %>% head(size)
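Both alternatives can be sketched in base R. The rejection sampler below is a hypothetical helper, not the package implementation; the second snippet checks the exponential-rank rule empirically for a single draw, where item i should be picked with probability prob[i] / sum(prob):

```r
# Alternative 1, rejection sampling: draw with replacement,
# drop duplicates, top up until `size` distinct items remain.
sample_int_rejection <- function(n, size, prob) {
  stopifnot(size <= n)
  picked <- integer(0)
  while (length(picked) < size) {
    extra <- sample.int(n, size - length(picked), replace = TRUE, prob = prob)
    picked <- unique(c(picked, extra))
  }
  picked
}

# Alternative 2, one-pass exponential ranks: the first element of
# order(rexp(n) / prob) is item i with probability prob[i] / sum(prob).
set.seed(1)
prob <- c(1, 2, 3)
first <- replicate(10000, order(rexp(3) / prob)[1])
round(table(first) / 10000, 2)  # roughly 1/6, 1/3, 1/2
```

Rejection sampling works well when size is small relative to n or weights are near-uniform; it degenerates when a few heavy items dominate the draws.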

Run time

[Benchmark plots: run time as a function of n, for size = n and
size = n / 10 with uniform weights (prob = 1), and for
size = n / 100, n / 10 and n with linear weights (prob = 1:n)]

Modelling run time

lm(log10(time) ~ a. + a.:log10(n) + a.:log10(size) - 1, weights = n)
                    Estimate Std. Error
a.crank               1.8789     0.0186
a.crank:log10(n)      0.7809     0.0043
a.crank:log10(size)   0.3106     0.0019
a.rej                 3.1159     0.0186
a.rej:log10(n)        0.4327     0.0043
a.rej:log10(size)     0.4296     0.0019
a.R                   0.2789     0.0186
a.R:log10(n)          1.0722     0.0043
a.R:log10(size)       0.8159     0.0019
10 ^ 0.02
[1] 1.047129
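Reading the coefficients as power laws (values copied from the model output above, my arithmetic), the fit predicts, for example, the speed-up of the one-pass method (crank) over base R at n = 10^6, size = 10^5:

```r
# log10(time) = a + a_n * log10(n) + a_size * log10(size),
# with coefficients taken from the fitted model above
pred_log10_time <- function(a, a_n, a_size, n, size) {
  a + a_n * log10(n) + a_size * log10(size)
}
crank  <- pred_log10_time(1.8789, 0.7809, 0.3106, n = 1e6, size = 1e5)
base_r <- pred_log10_time(0.2789, 1.0722, 0.8159, n = 1e6, size = 1e5)
10 ^ (base_r - crank)  # predicted speed-up factor, roughly 470
```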

Correctness


devtools::install_github("krlmlr/wrswoR")

Similarity-based statistical matching

Statistical matching (data fusion)

  • Input: two joint distributions \((X, Y)\) and \((X, Z)\)
  • Output: a joint distribution \((X, Y, Z)\)

Hot deck: Combine observations from \((X, Y)\) with "matching" observations from \((X, Z)\) to create a realization of \((X, Y, Z)\).

\[(x, y) ⋈ (x', z) = (x, y, z)\]

  • Precondition: \(Y\) and \(Z\) independent given \(X\)

  • Variants:
    • Exact: \(x' = x\)
    • Approximate: \(x'\) similar to \(x\)
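A toy sketch of approximate hot-deck matching on a single common variable \(x\) (made-up data, not from the talk):

```r
# Donors carry z, recipients carry y; x is the common variable.
donors     <- data.frame(x = c(1.0, 2.0, 3.0), z = c("p", "q", "r"),
                         stringsAsFactors = FALSE)
recipients <- data.frame(x = c(1.1, 2.9), y = c(10, 20))
# Approximate match: nearest donor in x for each recipient
idx <- vapply(recipients$x, function(x) which.min(abs(donors$x - x)), integer(1))
# Join each recipient (x, y) with its donor's z to realize (x, y, z)
(fused <- cbind(recipients, z = donors$z[idx]))
```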

Gower's distance

Distance metric for multivariate distributions (Gower, 1971)

Weighted mean of per-variable distances, each in \([0, 1]\)

  • Interval-scaled variables: Absolute difference relative to the range
  • Ordinal variables: Rank difference relative to the number of levels minus one
  • Nominal variables: 0 if equal, 1 otherwise
(data <- cbind(data_int, data_ord, data_nm2, data_nm3))
  int ord nm2 nm3
1   1   a   A   C
2   3   b   B   D
3   6   c   B   E

From Gower's to Manhattan distance

data_int
  int
1   1
2   3
3   6
daisy(data_int, "gower")
    1   2
2 0.4    
3 1.0 0.6
(6 - 3) / 5
[1] 0.6
(mdata_int <- mangow(data_int))
     int
[1,] 0.0
[2,] 0.4
[3,] 1.0
daisy(mdata_int, "manhattan")
    1   2
2 0.4    
3 1.0 0.6
Distance between second and third observation

From Gower's to Manhattan distance

data_ord
  ord
1   a
2   b
3   c
daisy(data_ord, "gower")
    1   2
2 0.5    
3 1.0 0.5
(3 - 2) / 2
[1] 0.5
(mdata_ord <- mangow(data_ord))
     ord
[1,] 0.0
[2,] 0.5
[3,] 1.0
daisy(mdata_ord, "manhattan")
    1   2
2 0.5    
3 1.0 0.5
Distance between second and third observation

From Gower's to Manhattan distance

data_nm2
  nm2
1   A
2   B
3   B
daisy(data_nm2, "gower")
  1 2
2 1  
3 1 0
ifelse("B" == "B", 0, 1)
[1] 0
(mdata_nm2 <- mangow(data_nm2))
     nm2.A.B
[1,]     0.5
[2,]    -0.5
[3,]    -0.5
daisy(mdata_nm2, "manhattan")
  1 2
2 1  
3 1 0
Distance between second and third observation

From Gower's to Manhattan distance

data_nm3
  nm3
1   C
2   D
3   E
daisy(data_nm3, "gower")
  1 2
2 1  
3 1 1
ifelse("D" == "E", 0, 1)
[1] 1
(mdata_nm3 <- mangow(data_nm3))
     nm3.C.D nm3.E
[1,]     0.5   0.0
[2,]    -0.5   0.0
[3,]     0.0   0.5
daisy(mdata_nm3, "manhattan")
  1 2
2 1  
3 1 1
Distance between second and third observation

From Gower's to Manhattan distance

data
  int ord nm2 nm3
1   1   a   A   C
2   3   b   B   D
3   6   c   B   E
daisy(data, "gower")
      1     2
2 0.725      
3 1.000 0.525
(0.6 + 0.5 + 0 + 1) / 4
[1] 0.525
(mdata <- mangow(data))
      int   ord nm2.A.B nm3.C.D nm3.E
[1,] 0.00 0.000   0.125   0.125 0.000
[2,] 0.10 0.125  -0.125  -0.125 0.000
[3,] 0.25 0.250  -0.125   0.000 0.125
daisy(mdata, "manhattan")
      1     2
2 0.725      
3 1.000 0.525
Distance between second and third observation

Run time improvements

  • Statistical matching with \(8 \times 10^6\) recipients vs. \(5 \times 10^4\) donors
    • Naïve approach: 20 hours
      • Compute all-pairs distances
    • Transform the problem: 20 minutes
      • Gower's → Manhattan distance
      • 4 variables → 6 numeric columns
      • Use RANN.L1 to find the \(k\) nearest donors for each recipient
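On toy data the transformed pipeline looks like this; the hand-rolled interval transform is a simplified stand-in for mangow, and the brute-force L1 search in the last step is what RANN.L1::nn2() replaces at scale:

```r
# Embed an interval-scaled variable into [0, 1] (Gower-style),
# then find the nearest donor in Manhattan (L1) distance.
embed_interval <- function(x) (x - min(x)) / diff(range(x))
donor_x     <- c(1, 3, 6)
recipient_x <- c(2.5, 5)
emb <- embed_interval(c(donor_x, recipient_x))  # joint scaling
donor_emb <- emb[1:3]
recip_emb <- emb[4:5]
# Brute-force 1-nearest donor per recipient (RANN.L1::nn2 at scale)
nn <- vapply(recip_emb, function(r) which.min(abs(donor_emb - r)), integer(1))
nn  # donor index for each recipient
```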

devtools::install_github("krlmlr/mangow")

Survey calibration

http://krlmlr-user15.github.io/