Some basic Caching lessons

Caching is the ability to save some sort of output from an operation, and then retrieve these outputs when the operation is repeated in the same way - meaning the inputs of this operation and the actual tasks it performs are unchanged.

Caching becomes fundamental when we expect to re-run operations several times, particularly if they take a while to compute each time. Some examples of these operations are:
- downloading data
- (spatial) data processing/munging
- fitting statistical models that are complex or that use large datasets
- running simulations with no stochasticity
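
To make the idea concrete, here is a minimal base-R sketch of caching: the output is stored under a key built from the function and its inputs, and it is only recomputed when that key changes. The names toyCache, toyCacheStore and slowSquare are made up for illustration, and this is not how reproducible::Cache is implemented.

```r
## A toy in-memory cache (illustration only, not the reproducible package)
toyCacheStore <- new.env()

toyCache <- function(FUN, ...) {
  args <- list(...)
  ## the key combines what the function does (its body) with its inputs
  key <- paste(deparse(list(body(FUN), args)), collapse = "")
  if (exists(key, envir = toyCacheStore, inherits = FALSE)) {
    message("loading cached result")
    return(get(key, envir = toyCacheStore))
  }
  message("computing result")
  out <- do.call(FUN, args)
  assign(key, out, envir = toyCacheStore)
  out
}

slowSquare <- function(x) {
  Sys.sleep(0.5)  ## pretend this is an expensive operation
  x^2
}

toyCache(slowSquare, 4)  ## computed
toyCache(slowSquare, 4)  ## same function, same input: retrieved
toyCache(slowSquare, 5)  ## different input: computed again
```

Changing either the input or the body of slowSquare produces a new key, which is the "repeated in the same way" condition described above.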

SpaDES (via the reproducible package) offers a number of functions that make caching these operations a lot easier for non-programmers. Two fundamental ones are Cache and prepInputs.

library(reproducible)
library(sp)
library(raster)

## from ?prepInputs
dPath <- file.path("modules/Biomass_borealDataPrep/data")
url <- file.path("ftp://ftp.ccrs.nrcan.gc.ca/ad/NLCCLandCover",
                 "LandcoverCanada2005_250m/LandCoverOfCanada2005_V1_4.zip")
landCover <- prepInputs(url = url,
                        destinationPath = asPath(dPath))

Now do it again. Notice any difference?

landCover <- prepInputs(url = url,
                        destinationPath = asPath(dPath))

Now try wrapping the previous operation in a Cache call, and run it twice. Notice the difference in speed.

landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath))


landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath))

The previous code is great but we don’t have as much control as we’d like on where Cache is storing cached objects. To do that, we can explicitly provide a cache folder and add tags to the object so that we can find it more easily if we ever need to “clean it”.

cPath <- file.path(tempdir(), "cache")
## run this twice
landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath),
                   cacheRepo = cPath,
                   userTags = "landCover")

showCache(x = cPath, userTags = "landCover")
reproducible::clearCache(x = cPath, userTags = "landCover")

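## notice how Cache needs to re-do things now that the cached entry was cleared
landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath),
                   cacheRepo = cPath,  ## same cache folder that was just cleared
                   userTags = "landCover")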

We can also force Cache to redo operations and re-cache, or simply to ignore caching altogether. See more options for Cache(useCache) in ?Cache

landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath),
                   cacheRepo = cPath,
                   userTags = "landCover",
                   useCache = "overwrite")
## switch caching off altogether via the global option
options("reproducible.useCache" = FALSE)
landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath),
                   cacheRepo = cPath,
                   userTags = "landCover")  ## useCache is not passed, so the option applies

Try to provide a study area now. Hints:

check out ?reproducible::prepInputs and ?reproducible::postProcess
check out ?SpaDES.tools::randomStudyArea
try with an area of 1 ha
what happens when you run Cache(prepInputs(...)) with the new study area(s)?

library(SpaDES.tools)
StudyArea <- randomStudyArea(size = 100000^2)

## cheating to visualise beforehand
## reproject the study area only if its CRS differs from the raster's
if (!identical(crs(StudyArea), crs(landCover)))
  StudyArea <- spTransform(StudyArea, CRSobj = crs(landCover))
plot(landCover)
plot(StudyArea, add = TRUE, col = "red")

options(reproducible.useCache = TRUE)
landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath),
                   useSAcrs = TRUE,
                   overwrite = TRUE,
                   studyArea = StudyArea,
                   cacheRepo = cPath,
                   userTags = "landCover")
plot(landCover)

What if my study area is a raster? Assuming you don’t have a raster at hand, try using a raster from the SpaDESInAction example (“inputs/rasterToMatch.rds”).

## assuming you're in the SpaDESInAction2 project folder
templateRaster <- readRDS(file.path(getwd(), "inputs/rasterToMatch.rds"))

landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath),
                   # useSAcrs = TRUE,  ## we don't need this anymore
                   rasterToMatch = templateRaster, ## a reproducible::postProcess argument
                   maskWithRTM = TRUE,  ## a reproducible::maskInputs argument 
                   overwrite = TRUE, 
                   cacheRepo = cPath,
                   userTags = "landCover",
                   useCache = TRUE)
plot(landCover)

What if I have both? In some cases your study area may be defined by a polygon, but you may have a raster that dictates, for example, the projection and resolution of the output (remember, polygons have no resolution).

landCover <- Cache(prepInputs,
                   url = url,
                   destinationPath = asPath(dPath),
                   studyArea = StudyArea,
                   useSAcrs = FALSE,  ## use the template raster projection
                   rasterToMatch = templateRaster,
                   maskWithRTM = FALSE,  ## mask using the study area
                   overwrite = TRUE, 
                   cacheRepo = cPath,
                   userTags = "landCover",
                   useCache = TRUE)

Now imagine someone told you there is a more up-to-date land cover map for Canada, and where to look for the .zip file: the Canadian Gov. open data portal.

Try right-clicking on “Access” for the TIF file, copy the link and replace the “old” URL.
Had an error? The messages are helpful!

url <- "http://ftp.maps.canada.ca/pub/nrcan_rncan/Land-cover_Couverture-du-sol/canada-landcover_canada-couverture-du-sol/CanadaLandcover2010.zip"
landCover <- Cache(prepInputs,
                   targetFile = "CAN_LC_2010_CAL.tif",
                   url = url,
                   destinationPath = asPath(dPath),
                   studyArea = StudyArea,
                   overwrite = TRUE, 
                   cacheRepo = cPath,
                   userTags = "landCover",
                   useCache = TRUE)
plot(landCover)

Learn more about caching

Some basic Debugging lessons

Now that you have a flavour of caching, we’re going to explore debugging a bit and put our new “caching skills” into practice in a SpaDES modelling context.

We’re going to run the caribouRSF module by itself.

## Restart your R session if not running from a "clean" environment
library("reproducible")
library(SpaDES)
library(LandR)
library(raster)
library(data.table)

options(
  "spades.recoveryMode" = 2,
  "spades.lowMemory" = TRUE,
  "LandR.assertions" = FALSE,
  "LandR.verbose" = 1,
  "reproducible.useMemoise" = TRUE, # Brings cached stuff to memory during the second run
  "reproducible.useNewDigestAlgorithm" = TRUE,  # use the new less strict hashing algo
  "reproducible.useCache" = TRUE,
  "pemisc.useParallel" = FALSE
)

## assuming you're in the SpaDESInAction2 project folder
inputDirectory <- checkPath(file.path(getwd(), "inputs"), create = TRUE)
outputDirectory <- checkPath(file.path(getwd(), "outputs"), create = TRUE)
modulesDirectory <- checkPath(file.path(getwd(), "modules"), create = TRUE)
cacheDirectory <- checkPath(file.path(getwd(), "cache"), create = TRUE)

setPaths(cachePath = cacheDirectory,
         modulePath = c(modulesDirectory, 
                        file.path(modulesDirectory, "scfm/modules")),
         inputPath = inputDirectory,
         outputPath = outputDirectory)

times <- list(start = 0, end = 10)

successionTimestep <- 1L
parameters <- list(
  caribouRSF = list(
    "decidousSp" = c("Betu_Pap", "Popu_Tre", "Popu_Bal"),
    "predictionInterval" = 20
  )
)
# load studyArea
studyArea <- readRDS(file.path(getPaths()$inputPath, "studyArea.rds"))

objects <- list(
  "studyArea" = studyArea
)

caribou <- simInitAndSpades(times = times,
                            objects = objects,
                            params = parameters,
                            modules = as.list("caribouRSF"),
                            paths = getPaths(),
                            debug = 1)

Oops, something doesn’t seem to be right! We start by looking carefully at the printed output, then we use traceback to help us locate the problem. In this case, it seems to be a particular line of caribouRSF.R

traceback()
# 11: stop("This module does not work without data. Please provide the necessary layers") at caribouRSF.R#154
# 10: get(moduleCall, envir = fnEnv)(sim, cur[["eventTime"]], cur[["eventType"]])
# 9: eval(fnCallAsExpr)
# 8: eval(fnCallAsExpr)
# (...)

file.edit("modules/caribouRSF/caribouRSF.R")  ## go to caribouRSF.R#154
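
traceback() only works interactively, right after an uncaught error. As a rough base-R sketch of the same idea, the call stack can also be captured programmatically at the moment an error is raised. The functions runModule and checkData are invented for this example and have nothing to do with the caribouRSF module code.

```r
## Toy functions standing in for nested module code (hypothetical names)
checkData <- function(layers) {
  if (is.null(layers)) stop("This toy module does not work without data")
  length(layers)
}
runModule <- function(layers = NULL) checkData(layers)

callStack <- NULL
result <- tryCatch(
  withCallingHandlers(
    runModule(),
    ## runs while the failing calls are still on the stack
    error = function(e) callStack <<- sys.calls()
  ),
  ## then the error is caught so the script can continue
  error = function(e) conditionMessage(e)
)

result
## first line of each call that was active when the error happened
stackText <- vapply(as.list(callStack), function(cl) deparse(cl)[1], character(1))
stackText
```

Reading stackText from the outermost call inwards leads to checkData(), the function that called stop(), which is the kind of information traceback() prints.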

Insert a browser() before the line with the stop(). Save and re-run. OR
Use the debug option in simInitAndSpades or spades to go into “browser mode”

Check ?browser, while you’re at it ;)
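
As a small sketch of the first option, this is what a browser() call placed just before a stop() might look like. checkLayers is a made-up function, not code from caribouRSF.R, and the interactive() guard is an addition so that the example also runs in a script.

```r
## Hypothetical function with a browser() inserted before the stop()
checkLayers <- function(layers = NULL) {
  if (is.null(layers)) {
    ## pauses here so you can inspect `layers` and the surrounding environment;
    ## skipped when R is not running interactively (e.g. via Rscript)
    if (interactive()) browser()
    stop("no layers supplied")
  }
  names(layers)
}

## with data supplied, the browser() line is never reached
checkLayers(list(pixelGroupMap = 1, cohortData = 2))
```

Calling checkLayers() with no arguments in an interactive session opens the Browse[1]> prompt, where objects can be printed just as in the module example.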

What is P(sim)$.useDummyData? Where does its value come from?
Which data objects are missing? Why?

# Browse[1]> P(sim)$.useDummyData
# [1] TRUE

# Browse[1]> mod$pixelGroupMap
# NULL

# Browse[1]> mod$cohortData
# NULL

You can also try to enter the “browser mode” in specific events or all events of a module via the simInitAndSpades or spades functions.

## browse at the init event(s) - if you had more than one module it would stop at each
caribou <- simInitAndSpades(times = times,
                            objects = objects,
                            params = parameters,
                            modules = as.list("caribouRSF"),
                            paths = getPaths(),
                            debug = "init")

## browse at the event that triggered the error
caribou <- simInitAndSpades(times = times,
                            objects = objects,
                            params = parameters,
                            modules = as.list("caribouRSF"),
                            paths = getPaths(),
                            debug = "lookingForCaribou")

## "browse" at each event of the caribouRSF module
caribou <- simInitAndSpades(times = times,
                            objects = objects,
                            params = parameters,
                            modules = as.list("caribouRSF"),
                            paths = getPaths(),
                            debug = "caribouRSF")

We are going to supply these objects - note that the dynamic part will not be simulated.

Check the .inputObjects function and the metadata for inputs.

Can you see a pattern in how prepInputs gets data from online sources? Try to do the same for pixelGroupMap and cohortData
How are sources for objects given? Try adding the following sources:
for pixelGroupMap: “https://drive.google.com/open?id=1IUEuH55su8X7JCWt8LXy_hTAQz0cfCmU”
for cohortData: “https://drive.google.com/open?id=1R_wGGvzUI0gGZ5NOs2KmT2KrmXaTm4NS”

## add sourceURL to pixelGroupMap and cohortData
expectsInput(objectName = "pixelGroupMap", objectClass = "RasterLayer",
             desc = paste0("Map of groups of pixels that share the same info from cohortData (sp, age, biomass, etc). ",
                           "Here is mainly used to determine old and recent burns based on tree age,",
                           " and if deciduous by species"),
             sourceURL = "https://drive.google.com/open?id=1IUEuH55su8X7JCWt8LXy_hTAQz0cfCmU")
expectsInput(objectName = "cohortData", objectClass = "data.table",
             desc = paste0("data.table with information by pixel group of sp, age, biomass, etc"),
             sourceURL = "https://drive.google.com/open?id=1R_wGGvzUI0gGZ5NOs2KmT2KrmXaTm4NS")

## add defaults for these objects in .inputObjects, so that the module can get them if they are not supplied
if (!suppliedElsewhere("pixelGroupMap", sim = sim, where = "sim")) {
  sim$pixelGroupMap <- Cache(prepInputs, targetFile = "pixelGroupMapCaribouEg.rds",
                             fun = "readRDS",
                             url = extractURL("pixelGroupMap"), studyArea = sim$studyArea,
                             destinationPath = dataPath(sim), filename2 = NULL,
                             rasterToMatch = sim$rasterToMatch)
}

if (!suppliedElsewhere("cohortData", sim = sim, where = "sim")) {
  sim$cohortData <- Cache(prepInputs, targetFile = "cohortDataCaribouEg.rds",
                          fun = "readRDS",
                          url = extractURL("cohortData"),
                          destinationPath = dataPath(sim))
}

You can now use restartSpades or simply re-run simInitAndSpades

caribou <- restartSpades()

caribou <- simInitAndSpades(times = times,
                            objects = objects,
                            params = parameters,
                            modules = as.list("caribouRSF"),
                            paths = getPaths(),
                            debug = 1)

There seem to be no more issues. Maybe we don’t need to print all those debug messages anymore - less verbose - next time we run.

caribou <- simInitAndSpades(times = times,
                            objects = objects,
                            params = parameters,
                            modules = as.list("caribouRSF"),
                            paths = getPaths(),
                            debug = FALSE)

Or maybe, we don’t want to see them printed, but want to keep a log of all messages to check later.

Check the “debug” section of ?spades

caribou <- simInitAndSpades(times = times,
                            objects = objects,
                            params = parameters,
                            modules = as.list("caribouRSF"),
                            paths = getPaths(),
                            debug = list(console = list(level = 40), 
                                         file = list(append = FALSE,
                                                     file = "logCaribou.txt",
                                                     level = 0), 
                                         debug = TRUE)
)

## notice how different the file is from what was printed on the console.
file.edit("logCaribou.txt")
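
The general idea of keeping messages out of the console while still recording them can be sketched in base R. This is only an illustration of the concept and is not how SpaDES implements its logging; the file name toyLog.txt is arbitrary.

```r
## Send messages to a log file instead of the console (base R sketch)
logFile <- file.path(tempdir(), "toyLog.txt")
logCon <- file(logFile, open = "wt")

out <- withCallingHandlers({
  message("step 1")
  message("step 2")
  42
}, message = function(m) {
  cat(conditionMessage(m), file = logCon)  ## message text already ends in a newline
  invokeRestart("muffleMessage")           ## stop it from printing on the console
})
close(logCon)

out
readLines(logFile)
```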

What about all that purple text? That’s the module code checking. It’s helpful, but not always accurate - meant to be informative rather than enforced.

options("spades.moduleCodeChecks" = FALSE)
caribou <- simInitAndSpades(times = times,
                            objects = objects,
                            params = parameters,
                            modules = as.list("caribouRSF"),
                            paths = getPaths(),
                            debug = list(console = list(level = 40), 
                                         file = list(append = FALSE,
                                                     file = "logCaribou.txt",
                                                     level = 0), 
                                         debug = TRUE)
)

## notice how different the file is from what was printed on the console.
file.edit("logCaribou.txt")

Learn about debugging in SpaDES and with RStudio

Caching and debugging

Ceres Barros

January 2020
