Spreadsheets

Misc

  • Resources

  • Some Excel files are binaries and in order to use download.file, you must set mode = “wb”

    download.file(url, 
                  destfile = glue("{rprojroot::find_rstudio_root_file()}/data/cases-age.xlsx"), 
                  mode = "wb")
  • Industry studies show that 90 percent of spreadsheets containing more than 150 rows have at least one major mistake.

  • Packages

    • {readxl}

      • Read: read_excel("your_file.xlsx", range = "C1:E4", sheet = "sheet_name", n_max = 3)
      • Doesn’t used Java, like {xlsx}, which can be a pain if you don’t have the correct Java runtime installed. Although there’s {rJavaEnv} now to help that. (see Misc >> R)
    • {pandas}

      • Read: pd.read_excel('your_file.xlsx, skiprows = 7, usecols = 'C:D')
  • Create workbook from labelled columns (source)

    # devtools::install_github("pcctc/croquet")
    library(croquet)
    library(openxlsx)
    
    wb <- createWorkbook() |> 
      add_labelled_sheet(penguins_labelled)
    
    saveWorkbook(wb, "penguins_labelled.xlsx")
  • Save and view spreadsheet (Thread)

    kview = function(df) {
      # fn <- paste0(tempfile(), ".tsv")
      # write_tsv(df,
      #           fn,
      #           na = "")
      # system(sprintf("libreoffice --calc %s", fn))
    
      fn <- paste0(tempfile(), ".xlsx")
      writexl::write_xlsx(df, fn)
      system(sprintf("wslview %s", fn))
    }

Catastrophes

  • Releasing confidential information
    • Irish police accidently handed out officers private information when sharing sheets with statistics due to a freedom of information request. (link)
  • Errors when combining sheets
    • Wales dismissed anaesthesiologists after mistakenly deeming them “unappointable.” Spreadsheets from different areas lacked standardization in formatting, naming conventions, and overall structure. To make matters worse, data was manually copied and pasted between various spreadsheets, a time-consuming and error-prone process. (link)
    • When consolidating assets from different spreadsheets, the spreadsheet data was not “cleaned” and formatted properly. The Icelandic bank’s shares were subsequently undervalued by as much as £16 million. (link)
  • Data entry errors
    • Cryto.com accidentally transferred $10.5 million instead of $100 into the account of an Australian customer due to an incorrect number being entered on a spreadsheet. (link)
    • Norway’s $1.5tn sovereign wealth fund lost $92M, on an error relating to how it calculated its mandated benchmark. A person used the wrong date, December 1st instead of November 1st. (link)

Best Practices

  • Better Spreadsheets - Workshop materials covering Woo and Broman paper
  • Data organization in spreadsheets (Woo and Broman
    • Be consistent
    • Write dates like YYYY-MM-DD
    • Don’t leave any cells empty
    • Put just one thing in a cell
    • Organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row)
    • Create a data dictionary
    • Don’t include calculations in the raw data files
    • Don’t use font color or highlighting as data
    • Choose good names for things
    • Make backups
    • Use data validation to avoid data entry errors
    • Save the data in plain text files.
  • Tips for data entry in Excel
    • Adding data validation to improve data quality
      • Excel has data validation?!
    • Using forms to improve data entry security
    • Linking information across sheets to reduce redundancy
      • Uses XLOOKUP to emulate primary/foreign key functionality in relational databases
    • Double data entry to reduce errors
      • A designated team member creates two identical entry forms. One person enters forms in the first entry screen, a different person enters forms in the second entry screen.

Transitioning from Spreadsheet to DB

  • Misc
  • DB advantages over spreadsheets:
    • Efficient analysis: Relational databases allow information to be retrieved quicker to then be analyzed with SQL (Structured Query Language), to then run queries.
      • Once spreadsheets get large, they can lag or freeze when opening, editing, or performing simple analyses in them.
    • Centralized data management: Since relational databases often require a certain type or format of data to be input into each column of a table, it’s less likely that you’ll end up with duplicate or inconsistent data.
    • Scalability: If your business is experiencing high growth, this means that the database will expand, and a relational database can accommodate an increased volume of data.
  • Start documenting the spreadsheets
    • File Names, File Paths
    • Understand where values are coming from
      • Source (e.g. department, store, sensor), Owner
    • How rows of data are being generated
      • Who/What is inputting the data
    • How does each spreadsheet/notebooks/set of spreadsheets fit in the company’s business model
      • How are they being used and by whom
    • Map the spreadsheets relationships to one another