General

Misc

Resources
- Data Science at the Command Line
Tools
- direnv - Augments existing shells with a new feature that can load and unload environment variables depending on the current directory
ctrl-rshell command history search
- McFly - intelligent command history search engine that takes into account your working directory and the context of recently executed commands. McFly’s suggestions are prioritized in real time with a small neural network
Path to a folder that’s above root folder:
- 1 level up: ../desired-folder
- 2 levels up: ../../desired-folder
Glob Patterns
- Notes from A Beginner’s Guide: Glob Patterns
- Patterns that can expand a wildcard pattern into a list of pathnames that match the given pattern.
- Wildcard Characters: *, ?, and [ ]
- Negation of a Pattern: !
  - e.g. [!CBR]at
- Escape a character: /
- Asterisk
  - * - On Linux, will match everything except slashes. On Windows, it will avoid matching backslashes as well as slashes.
  - ** - Recursively matches zero or more directories that fall under the current directory.
  - *(pattern_list) - Only matches if zero or one occurrence of any pattern is included in the pattern-list above
  - Example: To find all Markdown files recursively that end with .md, the pattern would be **/*.md
    - *.md would only return the file paths in the current directory
- Question Mark
  - Used to match any single character
  - Case insensitive
  - Useful for finding files with dates in the file name
  - Example: To match files with the pattern “at” (e.g. Cat.png), use ?at
    - The ? is for the “C” or whatever the charater is preceding “at”
- Square Brackets
  - Character Classes
    - Used to denote a pattern that should match a single character that is enclosed inside of the brackets. (“Character Classes” seems like a grand name for somethikng so simple.)
    - The inside of the bracket can’t be empty (i.e. attempting to match a space)
    - Case sensitive
    - Example: To match Cat.png, Bat.png, and Rat.png, use [CBR]at
  - Ranges
    - Two characters that are separated by a dash.
    - Example: Match all alpha numerical strings: [A-Za-z0-9]

R

Misc
- The “shebang” line starting #! allows a script to be run directly from the command line without explicitly passing it through Rscript or r. It’s not required but is a helpful convenience on Unix-like systems.
```
#!/usr/bin/env -S Rscript --vanilla
```
  - The shebang attempts to use /usr/bin/env to locate the Rscript executable and then passes –vanilla as an argument to Rscript
- Alter line endings when writing an R script on Windows but executing it on Linux
  - Windows uses \r\n (carriage return + newline) as line endings.
  - Linux/Unix uses \n (newline only) as line endings.
  - Command that makes the script compatible with Linux systems
```
sed -i 's/\r//' my-script.R
```
    - sed: Stream editor for filtering and transforming text.
    - -i: Edits the file “in place.”
    - ‘s/\\r//’: Removes (s///) all occurrences of \r (carriage return).
    - filename: The file to process.
- Exiting R in bash without annoying “Save Workspace” prompt (source)
```
alias R="$(/usr/bin/which R) --no-save"
```
  - Add to ~/.bashrc
  - For windows (source, code):
    1. Find your Rstats executable
    2. Create a new folder
    3. Inside folder add a batch script to launch R with –no-save arg
    4. Add folder to path
Resources
- Invoking R from the command line for using R and R CMD
Packages
- {ps} - List, Query, Manipulate System Processes
- {fs} - A cross-platform, uniform interface to file system operations
  - Navarro deep dive, For fs
- {littler} - A scripting and command-line front-end for GNU R
- {sys} - Offers drop-in replacements for the system2 with fine control and consistent behavior across platforms
  - Supports clean interruption, timeout, background tasks, and streaming STDIN / STDOUT / STDERR over binary or text connections.
  - Arguments on Windows automatically get encoded and quoted to work on different locales.
- {seekr} - Recursively list files from a directory, filter them using a regular expression, read their contents, and extract lines that match a user-defined pattern
  - Designed for quick code base exploration, log inspection, or any use case involving pattern-based file and line filtering.

File Paths

Various functions to make file paths easier to work with (source)
```
(path <- here::here())
#> [1] "C:/Users/ercbk/Documents"

base::normalizePath(path)
#> [1] "C:\\Users\\ercbk\\Documents"

user <- "ercbk"
base::file.path("C:Users", user, "Documents")
#> [1] "C:Users/ercbk/Documents"
```
- here - A drop-in replacement for file.path that will always locate the files relative to your project root.
- normalizePath - Convert file paths to canonical form for the platform
- file.path - (faster than paste) Construct the path to a file from components in a platform-independent way.

basename, dirname, tools::file_path_sans_ext

basename("~/me/path/to/file.R")
#> [1] "file.R"
basename("C:\\Users\\me\\Documents\\R\\Projects\\file.R")
#> [1] "file.R"
dirname("~/me/path/to/file.R")
#> [1] "~me/path/to"
tools::file_path_sans_ext("~/me/path/to/file.R")
#> [1] "~/me/path/to/file"

Rscript need to be on PATH
Run R (default version) in the shell:
```
RS
# or 
rig run
```
- RS might require {rig} to be installed
- To run a specific R version that’s already installed: R-4.2

Run an R script:

Rscript "path\to\my-script.R"
# or
rig run -f <script-file>
# or
chmod +x my-script.R
./my-script.R

Evaluate an R expression:

Rscript -e <expression> 
# or 
rig run -e <expression>

Run an R app: rig run <path-to-app>
- Plumber APIs
- Shiny apps
- Quarto documents (also with embedded Shiny apps)
- Rmd documents (also with embedded Shiny apps)
- Static web sites

Make an R script pipeable (From link)

parallel "echo 'zipping bin {}'; cat chunked/*_bin_{}_*.csv | ./upload_as_rds.R '$S3_DEST'/chr_'$DESIRED_CHR'_bin_{}.rds"

#!/usr/bin/env Rscript
library(readr)
library(aws.s3)

# Read first command line argument
data_destination <- commandArgs(trailingOnly = TRUE)[1]

data_cols <- list(SNP_Name = 'c', ...)

s3saveRDS(
  read_csv(
        file("stdin"), 
        col_names = names(data_cols),
        col_types = data_cols 
    ),
  object = data_destination
)

By passing readr::read_csv the function, file("stdin"), it loads the data piped to the R script into a dataframe, which then gets written as an .rds file directly to s3 using {aws.s3}.

Killing a process

system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)

Kill all processes that are using a particular port (source)
```
pids <- 
  system("lsof -i tcp:3000", intern = TRUE) |> 
  readr::read_table() |> 
  subset(select = "PID", drop = TRUE) |> 
  unique()

lapply(paste0("kill -9", pids), system)
```
- lsof is “List Open Files” by processes, users, and process IDs (source)
- -i lists open network connections and ports
- tcp:3000 specifies the type (TCP) and the port number (3000)
- intern indicates whether to capture the output of the command as a character vector.
- fuser -k 3000/tcp also accomplishes this

Starting a process in the background

# start MLflow server
sys::exec_background("mlflow server")

Check file sizes in a directory

file.info(Sys.glob("*.csv"))["size"]
#>                                size
#> Data8277.csv              857672667
#> DimenLookupAge8277.csv         2720
#> DimenLookupArea8277.csv       65400
#> DimenLookupEthnic8277.csv       272
#> DimenLookupSex8277.csv           74
#> DimenLookupYear8277.csv          67

First one is about 800MB

Read first ten lines of a file

cat(paste(readLines("Data8277.csv", n=10), collapse="\n"))
#> Year,Age,Ethnic,Sex,Area,count
#> 2018,000,1,1,01,795
#> 2018,000,1,1,02,5067
#> 2018,000,1,1,03,2229
#> 2018,000,1,1,04,1356
#> 2018,000,1,1,05,180
#> 2018,000,1,1,06,738
#> 2018,000,1,1,07,630
#> 2018,000,1,1,08,1188
#> 2018,000,1,1,09,2157

Delete an opened file in the same R session

You **MUST** unlink it before any kind of manipulation of object
- I think this works because readr loads files lazily by default

Example:

wisc_csv_filename <- "COVID-19_Historical_Data_by_County.csv"
download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
wisc_file_path <- file.path(download_location, wisc_csv_filename)
wisc_tests_new <- readr::read_csv(wisc_file_path)
# key part, must unlink before any kind of code interaction
# supposedly need recursive = TRUE for Windows, but I didn't need it
# Throws an error (hence safely) but still works
safe_unlink <- purrr::safely(unlink)
safe_unlink(wisc_tests_new)

# manipulate obj
wisc_tests_clean <- wisc_tests_new %>%
      janitor::clean_names() %>%
      select(date, geo, county = name, negative, positive) %>%
      filter(geo == "County") %>%
      mutate(date = lubridate::as_date(date)) %>%
      select(-geo)
# clean-up
fs::file_delete(wisc_file_path)

Find out which process is locking or using a file
- Open Resource Monitor, which can be found
  - By searching for Resource Monitor or resmon.exe in the start menu, or
  - As a button on the Performance tab in your Task Manager
- Go to the CPU tab
- Use the search field in the Associated Handles section
  - type the name of file in the search field and it’ll search automatically
  - 35548

List all file based on file names, directories, or glob patterns. (source)

list_all_files <- function(include = "*") {
  # list all files in the current directory recursing
  all_files <- fs::dir_ls(recurse = TRUE, all = TRUE, type = "any")

  # Identify directory patterns explicitly mentioned
  matched_dirs <- include[fs::dir_exists(include)]

  # Get *all* files inside matched directories
  extra_files <- unlist(
    lapply(
      matched_dirs,
      function(d) fs::dir_ls(d, recurse = TRUE, all = TRUE)
    )
  )

  # Convert glob patterns to regex
  regex_patterns <- sapply(include, function(p) {
    if (grepl("\\*", p)) {
      utils::glob2rx(p) # Convert glob to regex
    } else {
      paste0("^", p, "$") # Match exact filenames
    }
  })

  # Match files against regex patterns
  matched_files <- all_files[sapply(all_files, function(file) {
    any(stringr::str_detect(file, regex_patterns))
  })]

  # Combine matched files and all directory contents
  unique(c(matched_files, extra_files))
}

include <- c("*.Rmd", "*.tex", "_book", "_output.yml", "\\.R$", "thumbnail.jpg", 'README.*')
list_all_files(include = include)

Python

Notes from Python’s many command-line utilities
- Lists and describes all CLI utilities that are available through Python’s standard library
Packages
- {{paramiko}} - Low level SSH interface
- {{fabric}} - High level SSH interface

Linux utilities through Python in CLI

Command	Purpose	More
`python3.12 -m uuid`	Like `uuidgen` CLI utility	Docs
`python3.12 -m sqlite3`	Like `sqlite3` CLI utility	Docs
`python -m zipfile`	Like `zip` & `unzip` CLI utilities	Docs
`python -m gzip`	Like `gzip` & `gunzip` CLI utilities	Docs
`python -m tarfile`	Like the `tar` CLI utility	Docs
`python -m base64`	Like the `base64` CLI utility
`python -m ftplib`	Like the `ftp` utility
`python -m smtplib`	Like the `sendmail` utility
`python -m poplib`	Like using `curl` to read email
`python -m imaplib`	Like using `curl` to read email
`python -m telnetlib`	Like the `telnet`utility

uuid and sqlite3 require version 3.12 or above.

Code Utilities

Command	Purpose	More
`python -m pip`	Install third-party Python packages	Docs
`python -m venv`	Create a virtual environment	Docs
`python -m pdb`	Run the Python Debugger	Docs
`python -m unittest`	Run `unittest` tests in a directory	Docs
`python -m pydoc`	Show documentation for given string	Docs
`python -m doctest`	Run doctests for a given Python file	Docs
`python -m ensurepip`	Install `pip` if it’s not installed	Docs
`python -m idlelib`	Launch Python’s IDLE graphical REPL	Docs
`python -m zipapp`	Turn Python module into runnable ZIP	Docs
`python -m compileall`	Pre-compile Python files to bytecode	Docs

pathlib

/= Operator for joining paths

from pathlib import Path
directory = Path.home()
directory /= "Documents"
directory
#> PosixPath('/home/trey/Documents')

AWK

Misc
- Resources
  - Docs
  - Awk - A Tutorial and Introduction

Print first few rows of columns 1 and 2

awk -F, '{print $1,$2}' adult_t.csv|head

Extract every 4th line starting from the line 1 (i.e. 1, 5, 9, 13, …)
```
awk '(NR%4==1)' file.txt
```
Filter lines where no of hours/ week (13th column) > 98
```
awk -F, ‘$13 > 98’ adult_t.csv|head
```
Filter lines with “Doctorate” and print first 3 columns
```
awk '/Doctorate/{print $1, $2, $3}' adult_t.csv
```

Random sample 8% of the total lines from a .csv (keeps header)

'BEGIN {srand()} !/^$/ {if(rand()<=0.08||FNR==1) print > "rand.samp.csv"}' big_fn.csv

Decompresses, chunks, sorts, and writes back to S3 (From link)

# Let S3 use as many threads as it wants
aws configure set default.s3.max_concurrent_requests 50

for chunk_file in $(aws s3 ls $DATA_LOC | awk '{print $4}' | grep 'chr'$DESIRED_CHR'.csv') ; do

        aws s3 cp s3://$batch_loc$chunk_file - |
        pigz -dc |
        parallel --block 100M --pipe  \
        "awk -F '\t' '{print \$1\",...\"$30\">\"chunked/{#}_chr\"\$15\".csv\"}'"

        # Combine all the parallel process chunks to single files
        ls chunked/ |
        cut -d '_' -f 2 |
        sort -u |
        parallel 'cat chunked/*_{} | sort -k5 -n -S 80% -t, | aws s3 cp - '$s3_dest'/batch_'$batch_num'_{}'

        # Clean up intermediate data
        rm chunked/*
done

Uses pigz to parallelize decompression
Uses GNU Parallel (site, docs, tutorial1, tutorial2) to parallelize chunking (100MB chunks in 1st section)
Chunks data into smaller files and sorts them into directories based on a chromosome column (I think)
Avoids writing to disk

Vim

Command-line based text editor
Common Usage
- Edit text files while in CLI
- Logging into a remote machine and need to make a code change there. vim is a standard program and therefore usually available on any machine you work on.
- When running git commit, by default git opens vim for writing a commit message. So at the very least you’ll want to know how to write, save, and close a file.
Resources
2 modes: Navigation Mode; Edit Mode
- When Vim is launched you’re in Navigation mode
- Press i to start edit mode, in which you can make changes to the file.
- Press Esc key to leave edit mode and go back to navigation mode.
Commands
- x deletes a character
- dd deletes an entire row
- b (back) goes to the previous word
- n (next) goes to the next word
- :wq saves your changes and closes the file
- :q! ignores your changes and closes the file
- h is $\leftarrow$
- j is $\downarrow$
- k is $\uparrow$
- l (i.e. lower L) is $\rightarrow$