General

Misc

  • Resources
  • Tools
    • direnv - Augments existing shells with a new feature that can load and unload environment variables depending on the current directory (see the sketch at the end of this list)
  • Ctrl+R shell command history search
    • McFly - intelligent command history search engine that takes into account your working directory and the context of recently executed commands. McFly’s suggestions are prioritized in real time with a small neural network
  • Path to a folder above the current folder:
    • 1 level up: ../desired-folder
    • 2 levels up: ../../desired-folder
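  • A minimal direnv sketch (the variable name is hypothetical; direnv must already be hooked into your shell):

    echo 'export DATABASE_URL=postgres://localhost/dev' > .envrc
    direnv allow   # direnv loads .envrc only after it has been approved
    # DATABASE_URL is now set while inside this directory and unset after leaving it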

R

  • Misc

    • The “shebang” line starting #! allows a script to be run directly from the command line without explicitly passing it through Rscript or r. It’s not required but is a helpful convenience on Unix-like systems.

      #!/usr/bin/env -S Rscript --vanilla
      • The shebang uses /usr/bin/env to locate the Rscript executable; the -S flag makes env split the rest of the line into separate arguments, so --vanilla is passed along to Rscript
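      • A minimal end-to-end sketch — save the following as hello.R (hypothetical name; env's -S flag needs GNU coreutils 8.30+), then chmod +x hello.R and run it as ./hello.R world to print "Hello, world":

        #!/usr/bin/env -S Rscript --vanilla
        args <- commandArgs(trailingOnly = TRUE)
        cat("Hello,", args[1], "\n")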
    • Convert line endings when an R script written on Windows will be executed on Linux

      • Windows uses \r\n (carriage return + newline) as line endings.

      • Linux/Unix uses \n (newline only) as line endings.

      • Command that makes the script compatible with Linux systems

        sed -i 's/\r//' my-script.R
        • sed: Stream editor for filtering and transforming text.
        • -i: Edits the file "in place."
        • 's/\r//': Substitutes away (s///) the carriage return at the end of each line.
        • my-script.R: The file to process.
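      • To confirm the conversion, file reports the line terminators (illustrative before/after output):

        file my-script.R
        #> my-script.R: ASCII text, with CRLF line terminators
        sed -i 's/\r//' my-script.R
        file my-script.R
        #> my-script.R: ASCII text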
  • Resources

  • Packages

    • {ps} - List, Query, Manipulate System Processes
    • {fs} - A cross-platform, uniform interface to file system operations
    • {littler} - A scripting and command-line front-end for GNU R
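    • Quick smoke tests from the shell (a sketch, assuming the packages are installed; {littler} provides the lightweight r binary):

      Rscript -e 'fs::dir_info(".")[, c("path", "size")]'   # files as a tibble
      Rscript -e 'print(ps::ps())'                          # running processes as a tibble
      r -e 'cat("hello from littler\n")'                    # littler's r front-end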
  • Rscript needs to be on the PATH

  • Run R (default version) in the shell:

    RS
    # or 
    rig run
    • RS might require rig (the R Installation Manager) to be installed
    • To run a specific R version that’s already installed: R-4.2
  • Run an R script:

    Rscript "path\to\my-script.R"
    # or
    rig run -f <script-file>
    # or
    chmod +x my-script.R
    ./my-script.R
  • Evaluate an R expression:

    Rscript -e <expression> 
    # or 
    rig run -e <expression>
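    • For example:

      Rscript -e 'mean(c(1, 2, 3))'
      #> [1] 2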
  • Run an R app: rig run <path-to-app>

    • Plumber APIs
    • Shiny apps
    • Quarto documents (also with embedded Shiny apps)
    • Rmd documents (also with embedded Shiny apps)
    • Static web sites
  • Make an R script pipeable (From link)

    # shell: stream each bin's chunks through the upload script
    parallel "echo 'zipping bin {}'; cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R '$S3_DEST'/chr_'$DESIRED_CHR'_bin_{}.rds"

    # upload_as_rds.R
    #!/usr/bin/env Rscript
    library(readr)
    library(aws.s3)
    
    # Read first command line argument
    data_destination <- commandArgs(trailingOnly = TRUE)[1]
    
    data_cols <- list(SNP_Name = 'c', ...)
    
    s3saveRDS(
      read_csv(
            file("stdin"), 
            col_names = names(data_cols),
            col_types = data_cols 
        ),
      object = data_destination
    )
    • By passing the connection file("stdin") to readr::read_csv, the data piped to the script is loaded into a data frame, which then gets written as an .rds file directly to S3 using {aws.s3}.
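    • A stripped-down sketch of the same stdin pattern — a hypothetical n-rows.R that counts the rows piped into it (chmod +x n-rows.R, then e.g. head -n 100 data.csv | ./n-rows.R prints 99, the header excluded):

      #!/usr/bin/env Rscript
      # read whatever arrives on stdin and report the number of data rows
      df <- read.csv(file("stdin"))
      cat(nrow(df), "\n")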
  • Killing a process

    # Windows: /im selects processes by image name, /f forces termination
    system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
  • Starting a process in the background

    # start MLflow server in the background; returns the PID of the new process
    # ({sys} executes the program directly, without a shell, so the command and
    # its arguments are passed separately)
    pid <- sys::exec_background("mlflow", "server")
  • Check file sizes in a directory

    file.info(Sys.glob("*.csv"))["size"]
    #>                                size
    #> Data8277.csv              857672667
    #> DimenLookupAge8277.csv         2720
    #> DimenLookupArea8277.csv       65400
    #> DimenLookupEthnic8277.csv       272
    #> DimenLookupSex8277.csv           74
    #> DimenLookupYear8277.csv          67
    • The first one is about 800 MB
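    • A rough {fs} equivalent from the shell (a sketch, assuming {fs} is installed; sizes print human-readably):

      Rscript -e 'fs::dir_info(glob = "*.csv")[, c("path", "size")]'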
  • Read first ten lines of a file

    cat(paste(readLines("Data8277.csv", n=10), collapse="\n"))
    #> Year,Age,Ethnic,Sex,Area,count
    #> 2018,000,1,1,01,795
    #> 2018,000,1,1,02,5067
    #> 2018,000,1,1,03,2229
    #> 2018,000,1,1,04,1356
    #> 2018,000,1,1,05,180
    #> 2018,000,1,1,06,738
    #> 2018,000,1,1,07,630
    #> 2018,000,1,1,08,1188
    #> 2018,000,1,1,09,2157
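    • The shell equivalent:

      head -n 10 Data8277.csv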
  • Delete an opened file in the same R session

    • You **MUST** unlink it before any kind of manipulation of the object

      • I think this works because readr loads files lazily by default
    • Example:

      wisc_csv_filename <- "COVID-19_Historical_Data_by_County.csv"
      download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
      wisc_file_path <- file.path(download_location, wisc_csv_filename)
      wisc_tests_new <- readr::read_csv(wisc_file_path)
      # key part, must unlink before any kind of code interaction
      # supposedly need recursive = TRUE for Windows, but I didn't need it
      # Throws an error (hence safely) but still works
      safe_unlink <- purrr::safely(unlink)
      safe_unlink(wisc_tests_new)
      
      # manipulate obj
      wisc_tests_clean <- wisc_tests_new %>%
            janitor::clean_names() %>%
            select(date, geo, county = name, negative, positive) %>%
            filter(geo == "County") %>%
            mutate(date = lubridate::as_date(date)) %>%
            select(-geo)
      # clean-up
      fs::file_delete(wisc_file_path)
  • Find out which process is locking or using a file

    • Open Resource Monitor, which can be found
      • By searching for Resource Monitor or resmon.exe in the start menu, or
      • As a button on the Performance tab in your Task Manager
    • Go to the CPU tab
    • Use the search field in the Associated Handles section
      • Type the name of the file in the search field and it searches automatically

Python

  • Notes from Python’s many command-line utilities

    • Lists and describes all CLI utilities that are available through Python’s standard library
  • Linux utilities through Python in CLI

    Command                 Purpose                                 More
    python3.12 -m uuid      Like the uuidgen CLI utility            Docs
    python3.12 -m sqlite3   Like the sqlite3 CLI utility            Docs
    python -m zipfile       Like the zip & unzip CLI utilities      Docs
    python -m gzip          Like the gzip & gunzip CLI utilities    Docs
    python -m tarfile       Like the tar CLI utility                Docs
    python -m base64        Like the base64 CLI utility
    python -m ftplib        Like the ftp utility
    python -m smtplib       Like the sendmail utility
    python -m poplib        Like using curl to read email
    python -m imaplib       Like using curl to read email
    python -m telnetlib     Like the telnet utility
    • uuid and sqlite3 require Python 3.12 or above (hence python3.12 above).
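    • A few in action (file names hypothetical; the flags are the modules' documented CLI options):

      python -m zipfile -c archive.zip notes.txt   # create a ZIP
      python -m zipfile -l archive.zip             # list its contents
      python -m gzip notes.txt                     # writes notes.txt.gz
      python3.12 -m uuid                           # prints a random UUID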
  • Code Utilities

    Command                Purpose                                 More
    python -m pip          Install third-party Python packages     Docs
    python -m venv         Create a virtual environment            Docs
    python -m pdb          Run the Python Debugger                 Docs
    python -m unittest     Run unittest tests in a directory       Docs
    python -m pydoc        Show documentation for given string     Docs
    python -m doctest      Run doctests for a given Python file    Docs
    python -m ensurepip    Install pip if it's not installed       Docs
    python -m idlelib      Launch Python's IDLE graphical REPL     Docs
    python -m zipapp       Turn Python module into runnable ZIP    Docs
    python -m compileall   Pre-compile Python files to bytecode    Docs
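    • For example (the installed package name is illustrative):

      python -m venv .venv               # create a virtual environment
      source .venv/bin/activate          # activate it (POSIX shells)
      python -m pip install requests     # install a package into it
      python -m pydoc str.split          # show docs for a given name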

AWK

  • Misc

  • Print first few rows of columns 1 and 2

    awk -F, '{print $1, $2}' adult_t.csv | head
  • Filter lines where the number of hours/week (13th column) > 98

    awk -F, '$13 > 98' adult_t.csv | head
  • Filter lines with “Doctorate” and print first 3 columns

    awk '/Doctorate/{print $1, $2, $3}' adult_t.csv
  • Random sample 8% of the total lines from a .csv (keeps header)

    awk 'BEGIN {srand()} !/^$/ {if (rand() <= 0.08 || FNR == 1) print > "rand.samp.csv"}' big_fn.csv
  • Decompresses, chunks, sorts, and writes back to S3 (From link)

    # Let S3 use as many threads as it wants
    aws configure set default.s3.max_concurrent_requests 50
    
    for chunk_file in $(aws s3 ls $DATA_LOC | awk '{print $4}' | grep 'chr'$DESIRED_CHR'.csv') ; do
    
            aws s3 cp s3://$batch_loc$chunk_file - |
            pigz -dc |
        parallel --block 100M --pipe  \
        "awk -F '\t' '{print \$1\",...\"\$30 > \"chunked/{#}_chr\"\$15\".csv\"}'"
    
            # Combine all the parallel process chunks to single files
            ls chunked/ |
            cut -d '_' -f 2 |
            sort -u |
            parallel 'cat chunked/*_{} | sort -k5 -n -S 80% -t, | aws s3 cp - '$s3_dest'/batch_'$batch_num'_{}'
    
            # Clean up intermediate data
            rm chunked/*
    done
    • Uses pigz to parallelize decompression
    • Uses GNU Parallel (site, docs, tutorial1, tutorial2) to parallelize chunking (100MB chunks in the 1st section); a minimal usage pattern is sketched after this list
    • Chunks data into smaller files and sorts them into directories based on a chromosome column (I think)
    • Avoids writing to disk
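    • A minimal GNU Parallel pattern for reference ({} is the current input, {#} the job number; file names hypothetical):

      ls *.csv | parallel 'gzip -k {}'                  # compress each file, keep originals
      ls *.csv | parallel 'wc -l {} > counts_{#}.txt'   # one output file per job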

Vim

  • Command-line based text editor
  • Common Usage
    • Edit text files while in CLI
    • Logging into a remote machine and needing to make a code change there. vim is a standard program and therefore usually available on any machine you work on.
    • When running git commit, by default git opens vim for writing a commit message. So at the very least you’ll want to know how to write, save, and close a file.
  • Resources
  • 2 modes: Navigation (Normal) mode and Edit (Insert) mode
    • When Vim is launched you’re in Navigation mode
    • Press i to start edit mode, in which you can make changes to the file.
    • Press Esc key to leave edit mode and go back to navigation mode.
  • Commands
    • x deletes a character
    • dd deletes an entire row
    • b (back) goes to the previous word
    • w (word) goes to the next word
    • :wq saves your changes and closes the file
    • :q! ignores your changes and closes the file
    • h is ← (left)
    • j is ↓ (down)
    • k is ↑ (up)
    • l (i.e. lowercase L) is → (right)