CBII FAQs | Working with Large Files


Given the large size of the Administrative Data, users who are linking or appending files should carefully consider which variables and observations to retain. Large files can slow data processing and may even prevent analytical tasks from completing. When working with large data files, particularly the Administrative Data, we recommend:

  • Only read in and keep the variables you need (see the sketch after this list).
  • Convert string variables to numeric.
  • Avoid merges that retain all variables from multiple files.
  • Regularly delete old or unused versions of large intermediate data files, and rely instead on archived code that can recreate them.
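
As an illustration of the first two recommendations, the sketch below reads only a handful of columns from a hypothetical course-level file and converts string variables to numeric or categorical formats. It assumes the data.table package is available; the file path and variable names are placeholders rather than actual CBII file or variable names.

    # Minimal sketch in R, assuming the data.table package is available.
    # The file path and variable names are hypothetical.
    library(data.table)

    # Read in only the variables needed for the analysis, not the full file.
    course <- fread(
      "course_file.csv",
      select = c("student_id", "term", "course_id", "grade_points")
    )

    # Convert string variables to numeric or categorical equivalents.
    course[, student_id := as.integer(student_id)]
    course[, term := as.factor(term)]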

When working with data at the scale of the Administrative Data’s Term and Course files, workflows that are typical for smaller datasets can be inefficient or even infeasible due to time or memory constraints. We advise users to consider how to optimize memory usage and apply operations efficiently. Recognizing that optimizing code may be unfamiliar to many users, we provide some guiding principles below.

To optimize code:

  1. Experiment with code on subsets of the data and estimate total run times before attempting to run code on full datasets.
  2. Instead of beginning with the full CBII files for analysis, consider writing out minimal working versions of the data to disk that are tailored to the scope of the research questions (see the sketch after this list).
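
The sketch below illustrates both principles, continuing the hypothetical course table from the earlier example: prototype and time code on a small random subset before committing to a full run, then save a minimal working version of the data to disk for later analysis. Object, variable, and file names are assumptions.

    # Sketch, assuming `course` is the (hypothetical) table read in above.

    # 1. Prototype on a subset and roughly estimate the full run time,
    #    assuming run time scales approximately linearly with rows.
    sub <- course[sample(nrow(course), 1e5)]   # 100,000-row sample
    t_sub <- system.time(
      sub[, .(mean_grade = mean(grade_points, na.rm = TRUE)), by = student_id]
    )
    t_sub["elapsed"] * nrow(course) / nrow(sub)   # rough full-data estimate

    # 2. Write a minimal working version of the data to disk, keeping only
    #    the terms and variables the research questions require.
    minimal <- course[term %in% c("Fall 2015", "Spring 2016"),
                      .(student_id, course_id, grade_points)]
    saveRDS(minimal, "course_minimal.rds")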

To optimize memory:

  1. Keep only the variables that you plan to use in your analysis (see the sketches after this list). This is especially important when merging files across levels of granularity: attempting to merge all variables from the Student file with all variables from the Course file will consume an unreasonable amount of memory and may cause the VDE machines to crash.
  2. Convert strings to numeric or categorical values where possible.
  3. Store data in appropriate data formats. For instance, when working with text data derived from course descriptions or networks derived from student transcripts, consider storing data in sparse matrices rather than memory-intensive dense matrices.
  4. If necessary, break tasks into intermediate parts and save the results. For instance, if performing a memory-intensive computation on the Student file, one might process the file one term at a time, saving a separate file for each term. Note that this trades computational efficiency for lower memory usage, since it requires many I/O (input/output) operations.
  5. Make use of built-in functionality. For instance, the gc() function in R triggers garbage collection to free memory that is no longer in use, and the memory usage widget in RStudio can help you track how close you are to memory limits.
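
A minimal sketch of points 1, 2, and 5 above, assuming the dplyr package is available and using hypothetical student- and course-level tables (student, course) with placeholder variable names:

    library(dplyr)

    # Point 1: keep only needed variables before merging across levels of
    # granularity.
    student_small <- select(student, student_id, cohort, gpa)
    course_small  <- select(course, student_id, course_id, grade_points)

    # Point 2: convert string identifiers to categorical (factor) values.
    course_small <- mutate(course_small, course_id = as.factor(course_id))

    merged <- left_join(course_small, student_small, by = "student_id")

    # Point 5: drop intermediate objects and release unused memory.
    rm(student_small, course_small)
    gc()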
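
A sketch of points 3 and 4 above. The sparse-matrix example assumes the Matrix package is available; the per-term loop keeps only one term in memory at a time and writes each intermediate result to disk. Variable and file names are hypothetical.

    # Point 3: store a student-by-course network as a sparse matrix rather
    # than a memory-intensive dense matrix (assumes the Matrix package).
    library(Matrix)
    enrollment <- sparseMatrix(
      i = as.integer(factor(course$student_id)),
      j = as.integer(factor(course$course_id)),
      x = 1
    )

    # Point 4: break a memory-intensive computation into per-term pieces,
    # saving each intermediate result before moving on to the next term.
    for (trm in unique(as.character(course$term))) {
      piece <- course[course$term == trm, ]
      out   <- aggregate(grade_points ~ student_id, data = piece, FUN = mean)
      saveRDS(out, paste0("mean_grade_by_student_", trm, ".rds"))
      rm(piece, out)
      gc()
    }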

To optimize computational efficiency:

  1. When possible, use vectorized operations and avoid loops. If loops are necessary, include only the minimum amount of work inside the loop and examine how much time each function call within the loop requires. Functions that produce similar outputs can differ substantially in computational efficiency, so it is worth seeking out the more efficient alternative: for instance, base R’s merge function may be less computationally efficient than dplyr’s join functions for achieving the same end (see the sketches after this list).
  2. If merging files across multiple levels of aggregation, perform as many operations as possible on the individual files before merging. For instance, if working with variables from the Student and Course files, derive all student-level variables before merging the files. Similarly, if you intend to combine aggregated measures from the Term or Course files with the Student file, perform those aggregations first and then merge.
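
A sketch of point 1 above: the same indicator can be computed with a row-by-row loop or with a single vectorized comparison, and the same merge can be timed under two implementations to find the faster one. Table and variable names are hypothetical.

    # Slow: looping over rows one at a time.
    high_grade <- logical(nrow(course))
    for (i in seq_len(nrow(course))) {
      high_grade[i] <- course$grade_points[i] >= 3
    }

    # Fast: a single vectorized comparison.
    high_grade <- course$grade_points >= 3

    # Time alternative implementations of the same merge.
    system.time(m_base  <- merge(course, student, by = "student_id"))
    system.time(m_dplyr <- dplyr::left_join(course, student, by = "student_id"))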
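
A sketch of point 2 above, assuming dplyr is available: derive a student-level summary from the hypothetical course-level table first, then merge a single row per student onto the student-level table rather than merging the full course-level file.

    library(dplyr)

    # Aggregate the finer-grained course-level data to the student level first ...
    course_summary <- course %>%
      group_by(student_id) %>%
      summarise(
        n_courses  = n(),
        mean_grade = mean(grade_points, na.rm = TRUE)
      )

    # ... then merge one row per student onto the student-level table.
    student_merged <- left_join(student, course_summary, by = "student_id")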