General

Misc

  • Resource

  • Tools (see article, article for installation and usage)

    • ruff - Rust-based linter that also sorts imports
      • Fast, sensible default settings, focuses on more important things out of the box, and has less legacy burden
    • pydocstyle - tool for checking compliance with Python docstring conventions
    • black - code formatter
    • isort - sorts your imports
    • uv - Package Manager
    • pytest, pytest-watch - unit tests
    • commitizen - guides you through a series of steps to create a commit message that conforms to the structure of a Conventional Commit
    • nbQA - linting in jupyter notebooks
    • mypy - type checker; good support and docs
    • pylance - checks type hinting in VSCode (see Functions >> Documentation >> Type Hinting)
    • doit - task runner; {targets}-like tool; tutorial
    • pre-commit (Video) - Specifies which checks you want to run against your code before committing changes to your git repository
    • README templates - link
  • Project contents

    • Structured code directories
    • .gitignore, README, LICENSE, .github
      • Evidently, pull request and issue templates go into the .github folder as well
    • tests directory
    • pyproject.toml
      • Put as much config as possible into it. A lot of configuration tools will happily read from it, and it will give you one source of truth.
    • .pre-commit-config.yaml
      • ruff, isort, etc.
    • requirements.txt
      • Dependencies are also in pyproject.toml, but it’s good to have this file too
      • requirements.in if using pip tools
    • contributing.md
      • Instructs on how to contribute to your project — format to submit pull requests, code of conduct, coding standards, etc.
    • docs directory
  • An underscore _ at the beginning is used to denote private variables in Python.

    class Thermometer:
        def set_temperature(self, value):
            if value < -273.15:
                raise ValueError("Temperature below -273.15 is not possible.")
            self._temperature = value
    • You can still access “_temperature” but it’s just meant for internal use by the class and the underscore indicates this
  • {{warnings::warnings.filterwarnings('ignore')}}

  • sys.getsizeof(obj) to get the size of an object in memory.

Terms

  • classes - code template for creating objects, we can think of it as a blueprint. It describes the possible states and behaviors that every object of a certain type could have.
  • object - data structure storing information about the state and behavior of a certain entity and is an instance of a class
  • stub file - a file containing a skeleton of the public interface of that Python module, including classes, variables, functions – and most importantly, their types. (Source)

Base

  • Info method

    X.info()
  • Remove an object: del

  • Check object type

    • type() : outputs the type of an object
    • isinstance() : checks whether an object is an instance of a class or one of its subclasses (so it accounts for inheritance)
    • See article for details on differences
  • Import Libraries

    import logging
    import bentoml
    from transformers import (
        SpeechT5Processor,
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        WhisperForConditionalGeneration,
        WhisperProcessor,
    )

Fundamentals

Slicing

  • Format my_list[start:stop:step]

    • ** start value is inclusive and the end value is exclusive
  • 1 element and more than 1 element

    "Python"[0] # P
    "Python"[0:1] # P
    "Python"[0:5] # Pytho
  • illustrates that when slicing with a range, the stop index is exclusive

  • Negative indexing my_list[0:-1]

    • Everything but the last object
  • Skip every second element

    my_list = list("Python")
    my_list[0:len(my_list):2]
    >> ['P', 't', 'o']
  • start at 0, end at len(my_list), step = 2

  • Shortcuts

    my_list[0:-1] == my_list[:-1]
    my_list[0:len(my_list):2] == my_list[::2]
    "Python"[::-1] == "Python"[-1:-7:-1]
    • Defaults
      • 0 for the start value
      • len(list) for the stop value
      • 1 for the step value
    • Defaults for negative step value
      • -1 for the start value
      • -len(list) - 1 for the stop value
  • Alias vs new object

    b = a # alias
    b = a[:] # new object
    • With the alias, changes to a will happen to b as well
  • Common use cases

    Every element but the first and the last one [1:-1]
    Every element in reverse order [::-1]
    Every element but the first and the last one in reverse order [-2:0:-1]
    Every second element but the first and the last one in reverse order [-2:0:-2]
  • Using slice function

    sequence = list("Python")
    my_slice = slice(None, None, 2) # equivalent to [::2]
    my_slice.indices(len(sequence))
    >> (0, 6, 2)
    • Shows start = 0, stop = 6, step = 2

F-Strings

  • Cheatsheet

  • Parameterize with {}

    >> x = 5
    >> f"One icecream is worth [{x}]{style='color: #990000'} dollars"
    'One icecream is worth 5 dollars'
  • ! - functions

    • !r — Shows the string delimiter, calls the repr() method.
      • repr’s goal is to be unambiguous and str’s is to be readable. For example, if we suspect a float has a small rounding error, repr will show us while str may not
    • !a — Shows the ASCII representation of the value, calls the ascii() function.
    • !s — Converts the value to a string.
      • This calls the str() method (see !r for details)
    food2brand = "Mcdonalds"
    food2 = "French fries"
    f"I like eating {food2brand} {food2!r}"
    "I like eating Mcdonalds 'French fries'"
  • Change format with “:”

    >> import datetime
    >> date = datetime.datetime.utcnow()
    >> f"The date is {date:%m-%Y %d}"
    'The date is 02-2022 15'
  • Formatting with “>” and “<”

    • <6 says width is 6 characters and text starts at the left edge
    • >10.2f says width is 10 characters, text is aligned to the right edge, and the number is rounded to 2 decimal places
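
    • A quick sketch of these format specs:

      price = 3.14159
      print(f"{'Item':<6}{price:>10.2f}")
      # Output: Item        3.14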

Operators

  • (docs)

  • Exponential: 5**3

  • Integer division: 5//3

  • Modulo: 5%3

  • Identity: is

    x = 5
    y = 3
    print("The result for x is y is", x is y)
    The result for x is y is False
    • is checks identity (same object in memory); == checks equality of values
  • Logical: and and or

    print("The result for 5 > 3 and 6 > 8 is", 5 > 3 and 6 > 8)
    print("The result for 5 > 3 or 6 > 8 is", 5 > 3 or 6 > 8)
    The result for 5 > 3 and 6 > 8 is False
    The result for 5 > 3 or 6 > 8 is True
  • Subset: in and not in

    print("Is the number 3 in the list [1,2,3]?", 3 in [1,2,3])
    Is the number 3 in the list [1,2,3]? True
    
    print("Is the number 3 not in the list [1,2,3]?", 3 not in [1,2,3])
    Is the number 3 not in the list [1,2,3]? False
  • Assignment
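
    • A quick sketch of plain and augmented assignment (comments show resulting values):

    x = 5    # plain assignment
    x += 2   # 7
    x -= 1   # 6
    x *= 3   # 18
    x //= 4  # 4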

Types

  • Scalars
    • Create scalars by subsetting a list

      inputs = [1, 0.04, 0.9]
      # 1 numeric 
      rmse = inputs[0] # rmse = 1 and is type 'int'
      # multiple numerics
      rmse, mape, rsq = inputs
  • Tuples
    • Lists are mutable and tuples are not
      • i.e. we can add or remove elements to a list after we create it but we cannot do such thing to a tuple
    • Syntax: name_of_tuple = (a, b)
  • Lists
    • Create list of objects (e.g. floats)

      acc_values = [rmse, mape, rsq]
      • alt method: asterisk-comma notation

        *acc_names, = "RMSE", "MAPE", "R-SQ"
        • asterisk is the unpacking operator
    • Make a copy

      old_list = [2, 3, 4]
      new_list = list(old_list)
  • Dictionaries
    • ** if creating a simple dict, more performant to use curly braces **

      • Avoid d = dict(a=1, x='x'); prefer d = {'a': 1, 'x': 'x'}
    • Merge 2 dicts (source)

      # old way
      merged_dict = {**dict1, **dict2}
      
      # Since 3.9 (less bytecode, so faster in some cases)
      merged_dict = dict1 | dict2
      
      # In-Place Updating
      dict1 |= dict2
    • Join 2 dicts - d.update(d2)

      • If d and d2 share keys, d2’s values for those keys will be used
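
      • A quick sketch (values are illustrative):

        d = {'a': 1, 'b': 2}
        d2 = {'b': 3, 'c': 4}
        d.update(d2)
        # d is now {'a': 1, 'b': 3, 'c': 4} -- d2's value wins for the shared key 'b'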
    • Access a value from a key: sample_dict['key_name']

    • Make a copy

      old_dict = {'stuff': 2, 'more_stuff': 3}
      new_dict = dict(old_dict)
    • Convert list of tuples to a dict

      acc_dict = dict(acc_list)
    • Add key, value pair to a dict

      transaction_data['user_address'] = '221b Baker Street, London - UK'
      # or
      transaction_data.update(user_address='221b Baker Street, London - UK')
    • Unpack dict into separate tuples for key:value pairs

      rmse, mape, rsq = acc_dict.items()
      rmse
      ('RMSE', 1)
      • ** fastest way to iterate over both keys and values in a dict **
      • can also use zip to unpack pairs into a list (see loops >> zip)
    • Unpack dict into separate lists for keys and values

      acc_keys = list(acc_dict.keys()) 
      acc_values = list(acc_dict.values())
      • ** fastest way to iterate over a dict’s keys or values **
    • Unpack values from dicts into separate scalars

      rmse, mape, rsq = acc_dict.values()
      rmse
      1
    • Pull the value for a key (e.g. k) or return the default value - d.get(k, default)

      • Default is “None”. d.setdefault(k, default) is similar, but it also inserts k with the default value if the key is missing
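
      • A quick sketch of get vs setdefault:

        d = {'a': 1}
        d.get('b')            # None
        d.get('b', 0)         # 0; d is unchanged
        d.setdefault('b', 0)  # 0, and 'b': 0 is now stored in d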
    • Check for specific key (logical)

      'send_currency' in transaction_data
      'send_currency' in transaction_data.keys()
      'send_currency' not in transaction_data.keys()
      • Like %in% in R
    • Check for specific value (logical)

      'GBP' in transaction_data.values()
    • Check for key, value pair

      ('send_currency', 'GBP') in transaction_data.items()
    • Pretty printing of dictionaries

          _ = [print(k, ":", f'{v:.2f}') for k, v in acc_dict.items()]
          RMSE : 1.00
          MAPE : 0.04
          R-sq : 0.90
      • for-in loop format (see Loops >> Comprehension)
      • print returns None, so the comprehension evaluates to a list of Nones that the REPL echoes at the bottom of the output. Assigning the result to _ suppresses the echo.
    • defaultdict

      • Creates a key from a list element and groups the properties into a list of values where the value may also be a dict.
      • From {{collections}}
      • Also see
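
      • A minimal sketch of grouping with defaultdict (data is illustrative):

        from collections import defaultdict

        pairs = [("fruit", "apple"), ("veg", "carrot"), ("fruit", "pear")]
        grouped = defaultdict(list)  # missing keys start as an empty list
        for category, item in pairs:
            grouped[category].append(item)
        # defaultdict(<class 'list'>, {'fruit': ['apple', 'pear'], 'veg': ['carrot']})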
  • Sets
    • If performing set logic, always more performant to use sets instead of dicts or lists
      • If using numpy/pandas, the .unique() method is more efficient for arrays/Series with numeric values
      • If using strings, it’s more efficient to use list(set(my_array))
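
      • A quick sketch of set logic (lists are illustrative):

        a = [1, 2, 3, 4]
        b = [3, 4, 5]
        set(a) & set(b)  # {3, 4} intersection
        set(a) | set(b)  # {1, 2, 3, 4, 5} union
        set(a) - set(b)  # {1, 2} difference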
  • Strings
    • Operators

      Operator  Description
      %d        Signed decimal integer
      %u        Unsigned decimal integer
      %c        Character
      %s        String
      %f        Floating-point real number
    • Example

      name = "india"
      age = 19
      marks = 20.56
      string1 = 'Hey %s' % (name)
      print(string1)
      string2 = 'my age is %d' % (age)
      print(string2)
      string3= 'Hey %s, my age is %d' % (name, age)
      print(string3)
      string3= 'Hey %s, my subject mark is %f' % (name, marks)
      print(string3)

uv

  • Github, Docs

  • Package Manager

    • Written in Rust so it’s fast
    • By writing it in Rust, it removes by design half of the bootstrapping problems plaguing Python tooling. To install uv, you don’t need to worry about which Python you use, where your venv is, or how you are going to deal with the PATH.
    • Drop-In replacement for common pip workflows
  • Environment Manager

    • Finds direct dependencies from a pyproject.toml file, creates a cross-platform lock file, adds & removes deps from the config, and syncs the whole project instantly. This turns it into a serious alternative to pip-tools, Poetry, or Pipenv.
    • Can install Python itself, on Windows, Mac and Linux. This makes it a replacement for Pyflow, Pyenv or Rye.
    • Can run scripts with inline dependencies, like hatch. This means stand-alone scripts can use PyPI deps without worries.
    • Provides uvx, a pipx equivalent.
  • Example: Basic

    $ cd /tmp
    $ curl -LsSf https://astral.sh/uv/install.sh | sh
    $ uv init -p "<3.8,>=3.7" # forcing 3.7
    Initialized project `tmp`
    $ uv run hello.py
    Using Python 3.7.9
    Creating virtualenv at: .venv
    Hello from tmp!
    • Install uv globally.
    • Creates a project with hello.py and pyproject.toml files.
    • Install Python with the required version.
    • Create a venv with this Python.
    • Activate the venv.
    • Run the script.

pip

  • Looks for packages on https://pypi.org, downloads, and installs them

  • See Dependencies for pip extensions

  • Misc

    • If you installed python using the app-store, replace python with python3.

    • Don’t use sudo to install libraries, since it will install things outside of the virtual environment.

    • Nor should you use --user, since it’s made to install things outside of the virtual environment.

    • Don’t mix pip, venv and Anaconda. Avoid Anaconda if you can. If you have to use Anaconda, don’t use pip and venv. Limit yourself to Anaconda tools.

    • If you get SSL errors (common if you are in a hotel or a company network), use the --trusted-host pypi.org --trusted-host files.pythonhosted.org options with pip to work around the problem.

      • e.g. python -m pip install pendulum --trusted-host pypi.org --trusted-host files.pythonhosted.org
    • If you are behind a corporate proxy that requires authentication (common if you are in a company network), you can use the --proxy option with pip to give the proxy address and your credentials.

      • e.g. python -m pip install pendulum --proxy http://your_username:yourpassword@proxy_address
      • It also works with the https_proxy environment variables
    • pip config set global.require-virtualenv True will only allow pip to install packages while a virtual environment is activated.

      • Probably needs to be set back to False when updating pip.
      • May become the default in Python 3.13
  • Install library

    $ python -m pip install <library_name>
    
    # inside ipython or a colab notebook, "!" signifies a shell command
    !pip install <library_name>
  • Install library from github

    python -m pip install git+https://github.com/bbalasub1/glmnet_python.git@1.0
    • “@1.0” is the version number
  • Uninstall library

    $ python -m pip uninstall <library_name>
    • Won’t uninstall the dependencies of this library.
    • If you wish to also uninstall the unused dependencies as well, take a look at pip-autoremove
  • Remove all packages in environment

    $ python -m pip uninstall -y -r <(pip freeze)
  • Remove all packages in environment but write the names of the packages to a requirements.txt file first

    $ python -m pip freeze > requirements.txt && python -m pip uninstall -r requirements.txt -y
  • Install requirements.txt

    $ python -m pip install -r requirements.txt
  • Write names of all the packages in your environment to a requirements.txt file

    $ python -m pip freeze > requirements.txt
    • Writes the specific version of the packages that you have installed in your environment (e.g. pandas==1.0.0)
      • This may not be what you always want, so you’ll need to manually change to just the library name in that case (e.g. pandas)
    • Only aware of the packages installed using the pip install command
      • i.e. any packages installed using a different approach such as poetry, setuptools, conda, etc. won’t be included in the final requirements.txt file.
    • Does not account for dependency versioning conflicts
    • Saves all packages in the environment, including those that are not relevant to the project
    • If you are not using a virtual environment, pip freeze generates a requirements file containing all the libraries installed, including those beyond the scope of your project.
  • List your installed libraries

    $ python -m pip list
  • See if you have a particular library installed

    $ python -m pip list | grep <library_name>
  • Get library info (name, version, summary, license, dependencies and other)

    $ python -m pip show <library_name>
  • Check that all installed packages are compatible

    $ python -m pip check
  • Update package

    $ python -m pip install package_name --upgrade
  • Search for PyPI libraries (pip source for libraries)

    $ python -m pip search <search_term>
    • returns all libraries matching the search term
    • Note: PyPI has disabled its search API, so this command errors on recent versions; search https://pypi.org in a browser instead
  • Download a package without installing it

    python -m pip download <library name>
    • It will download the package and all its dependencies into the current directory (the files, called wheels, have a .whl extension).

    • You can then install them offline by doing python -m pip install on the wheels.

  • Build Wheel archives for the libraries and dependencies in your environment

    $ python -m pip wheel -r requirements.txt
    • I think these are pre-built binaries, so they don’t need to be compiled when installed in a future environment
    • Real Python Tutorial
  • Manage configuration

    $ python -m pip config <action name>
    • Actions: edit, get, list, set or unset

    • Example

      $ python -m pip config set global.index-url https://your.global.index
  • Display debug information specific to pip

    $ python -m pip debug

Anaconda

  • Check configuration of current environment

    conda list
    • Shows python version used, package names installed and their versions
  • Install packages

    conda install <package1> <package2>
  • Install a package from a specific channel

    conda install <package_name> -c <channel_name> -y # Short form
    conda install <package_name> --channel <channel_name> -y # Long form
  • Package installation channels (some packages not available in default channel)

    • Check current channels

      conda config --show channels
      • The order in which these channels are displayed shows the channel priority.
        • When a package is installed, conda will check the channel at the top of the list first, then work its way down
    • Add a channel

      conda config --add channels conda-forge
      • Adds “conda-forge” to list of available channels
    • Remove a channel

      conda config --remove channels conda-forge
      • Removes the “conda-forge” channel

Environments

  • Misc
    • When you’re in a virtual environment
      • Anytime you use the “python” command while your virtual environment is activated, it will be only the one from this env.

      • If you start a Python shell now, it will see only the things in the current directory, and the things installed in the virtual environment.

      • If you run a script, it will see only the things in the current directory, and the things installed in the virtual environment.

      • If you run a command, the command will be taken from the virtual environment.

        And they will only use exactly the version of Python of the virtual environment.

    • Store environment files with project
    • Breakage
      • You cannot move a virtual environment, it will stop working. Create a “requirements.txt” file, delete the virtual environment and create a new one.
      • Don’t rename a directory containing a virtual environment. Or if you do, prepare yourself to create a “requirements.txt” file, delete the virtual environment and create a new one.
      • If you change the Python used in the virtual environment, such as when uninstalling it, it will stop working.
    • Create one big virtual environment for all small scripts.
      • If you make a lot of venv, you may be tempted to install everything at the system level for convenience. After all, it’s a bore to create and activate a virtual environment each time you want to write a five liner. A good balance is one single virtual environment you use for all things quick and dirty.
    • Create several virtual environments per versions of python if your project needs to support several versions. You may need several requirements.txt files as well, one for each env.
    • Recommendations for a stable dependency environment for your project (article)
      • Don’t install the latest major version of Python
        • Maximum: 1 version under the latest version
        • Minimum: 4 versions under the latest version (e.g. latest = 3.11, min = 3.7)
      • Use only the python.org installer on Windows and Mac, or official repositories on Linux.
      • Never install or run anything outside of a virtual environment
      • Limit yourself to the basics: “pip” and “venv”
      • If you run a command, use “-m”
        • It lets you run any importable Python module, no matter where you are. Because most commands are Python modules, we can use this to say, “run the module X of this particular python”.

        • There is currently no way for you to run any python command reliably without “-m”.

        • Examples:

          # Don't do :
          pip install
          # Do:
          python -m pip install
          
          # Don't do :
          black
          # Do:
          python -m black
          
          # Don't do :
          jupyter notebook
          # Do:
          python -m jupyter notebook
      • When creating a virtual environment, be explicit about which Python you use
        • Get current python versions installed: py --list-paths (windows)

pyenv

  • Just a simple Python version manager — think {rig}
  • {{pyenv}}, {{pyenv-win}}
  • Set-up in RStudio (article)
  • Compiles Python under the hood when you install it. But compiling can fail in a thousand ways
  • pyenv install --list: To see what python versions are available to install
  • pyenv install <version number>: To install a specific version
  • pyenv versions: To see what python versions are installed on your system
  • pyenv global <version number>: To set one python version as the global default
  • pyenv local <version number>: To set a python version to be used within a specific directory/project

pdm

  • Docs
  • Package and dependency manager similar to npm. Doesn’t require virtual environments.
  • Features: auto-updating pyproject.toml, isolating dependencies from dependencies-of-dependencies, active development and error handling

pvenv

  • Easy python virtual environment management
    • Basically just sets up nice aliases for venv commands
  • Docs
  • venv activate (aliased to avenv) to activate a virtual environment
  • venv deactivate (aliased to dvenv) to deactivate a virtual environment
  • venv list (aliased to lsvenv) to list all the virtual environments
  • venv make (aliased to mkvenv) to create a new virtual environment
  • venv rm (aliased to rmvenv) to remove a virtual environment

venv

  • Misc
    • Shipped with Python
    • Don’t mix pip, venv and Anaconda. Avoid Anaconda if you can. If you have to use Anaconda, don’t use pip and venv. Limit yourself to Anaconda tools.
  • Create
    • Windows: py -<py version> -m venv <env name>
    • Mac/Linux: python3.8 -m venv .venv
      • Where the python version is 3.8 and the environment name is “.venv”
      • Mac and Linux hide folders with names that have a preceding “.” by default, so make sure you have “display hidden folders” activated or you won’t see it.
    • Naming Environments
      • Name your environment directory “.venv”, because:
        • Some editors check for this name and automatically load it.

        • It contains “venv”, so it’s explicit.

        • It has a dot, so it’s hidden on Linux.

      • If you have more than one environment directory, use a suffix to distinguish them.
        • e.g. A project that must work with two different versions of Python (3.9 and 3.10), I will have a “.venv39” and a “.venv310”
      • Naming environments for misc uses
        • “.venv_test”: located in a personal directory to install new tools you want to play with. It’s disposable, and often broken, so you can delete it and recreate it regularly.
        • “.venv_scripts”: Used for all the small scripts. You don’t want to create one virtual environment for each script, so centralize everything. It’s better than installing outside of a virtual environment, and it’s not a big constraint.
  • Activate
    • Windows: .venv\Scripts\activate
      • Where .venv is the name of the virtual environment
      • May need .bat as extension to activate
    • Mac/Linux: source .venv/bin/activate
  • After that, you can use python -m pip install to install packages.
  • Deactivate: deactivate

virtualenv

  • Docs

  • Create a virtual environment

     python3 -m venv <env_name>
    • -m venv tells python to run the virtual environment module, venv
    • Make sure you’re in your projects directory
    • Remember to add “<env_name>/” to .gitignore
  • Activate environment

    source <env_name>/bin/activate # Mac or Linux
    <env_name>\Scripts\activate # Windows
    
    >> (<env_name>) $
    • Prompt should change if the environment is activated
    • All pip installed packages will now be installed into the “<env_name>/lib/python3.9/site-packages” directory
  • Use the python contained within your virtual environment

    <env_name>/bin/python main.py
    • Calling the venv’s interpreter by its path uses it without activating the environment (handy in scripts and cron jobs)
  • Deactivate environment

    deactivate
    • No python or env_name prefix needed; deactivate is defined by the activate script
  • Reproducing environment

    • Done using requirements.txt (see pip section for details on writing and installing)
      • I don’t think the python version is included, so that will need to be communicated manually

Anaconda

  • List environments

    conda env list # method 1
    conda info --envs # method 2
    • default environment is called “base”
    • Active environment will be in parentheses
    • Active environment will be the one in the list with an asterisk
  • Create a new conda environment

    conda create -n <env name>
    conda activate <env name>
  • Create a new conda environment with a specific python version

    conda create -n py310 python=3.10
    conda activate py310
    conda install jupyter jupyterlab
    jupyter lab
    • Also install and launch jupyter lab
  • Create an environment from a yaml file

    conda env create -f environment.yml # Short form
    conda env create --file environment.yml # Long form
  • Remove an environment

    conda deactivate # Need to deactivate the environment first; takes no arguments
    conda env remove -n <env_name>
    • Should also delete environment folders (conda env list shows path to folders)
  • Clone an existing environment

    conda create -n testclone --clone test # Short form
    conda create --name testclone --clone test # Long form
    • “testclone” is a copy of “test”
  • Activate an environment

    conda activate <env_name>
  • Activate environment with reticulate in R

    reticulate::use_python("/usr/local/bin/python")   
    reticulate::use_condaenv("<env name>", "/home/jtimm/anaconda3/bin/conda")
  • Deactivate an environment

    conda activate # Option 1: activates base
    conda deactivate # Option 2
  • Export the specifications of the current environment into a YAML file into the current directory

    conda env export > environment.yml # Option 1
    conda env export -f environment.yml # Option 2
  • Example: Conda workflow

    • Create an environment that uses a specific python version

      • Without a specified python version, the environment will use the same version as “base”
      conda create -n anothertest python=3.9.7 -y
      • -n is the name flag and “anothertest” is the name of the environment
      • Uses Python 3.9.7
      • Without the -y flag, there’d be a prompt you’d have to answer “yes” to
    • Activate the environment

      conda activate anothertest
    • Install packages

      • Installing packages one at time can lead to dependency conflicts.

      • Conda’s official documentation recommends to install all packages at the same time so that the dependency conflicts are resolved

        conda install "numpy>=1.11" nltk==3.6.2 jupyter -y # install specific versions
        conda install numpy nltk jupyter -y # install all latest versions
    • Do work and deactivate environment

      conda deactivate
  • Example Raschka workflow

    # create & activate
    conda create --prefix ~/code/myproj python=3.8
    conda activate ~/code/myproj
    # export env
    conda env export > myproj.yml
    # create new env from yaml
    conda env create --file myproj.yml --prefix ~/code/myproj2

Poetry

  • Docs (like renv)

  • Apparently buggy (article)

  • pip’s dependency resolver is more flexible and won’t die on you if the package specifies bad metadata, while poetry’s strictness may mean you can’t install some packages at all.

  • Create project

    poetry new <project-dir-name>
    • automatically creates a directory for your project with a skeleton
    • “pyproject.toml” maintains dependencies for the project with the following sections:
      • tool.poetry provides an area to capture information about your project such as the name, version and author(s).
      • tool.poetry.dependencies lists all dependencies for your project.
      • tool.poetry.dev-dependencies lists dependencies your project needs for development that should not be present in any version deployed to a production environment.
      • build-system references the fact that Poetry has been used to manage the project.
  • Add library and create lock file: poetry add <library name>

    • When the first library is added, a “poetry.lock” file will be generated
  • Activate environment: poetry shell

    • Deactivate environment: exit
  • Run script: poetry run python my_script.py

  • Package the project: poetry build

    • Creates tar.gz and wheel files (.whl) in “dist” dir
  • Example: poetry workflow (+pyenv, virtualenv)

    # Create a virtual environment called "my-new-project"
    # using Python 3.8.8
    pyenv virtualenv 3.8.8 my-new-project
    # Activate the virtual environment
    pyenv activate my-new-project
    • {{pyenv}} - For managing the exact version of Python and activating the environment

    • Name your package the same name as the directory which is the same name as the virtual environment.

      • Dashes for the latter two and underscores for the package
    • Initialize the project and add packages (similar to renv)

      poetry init
      poetry add numpy

    • Reinstall dependencies

      # navigate to my project directory and run
      poetry install
    • Turn off virtualenv management

      # right after installing poetry, run:
      poetry config virtualenvs.create false
      • Default poetry behavior is that it will manage your virtual environments for you. This may not be desirable because:
        • Can’t just run a script from the command line. Instead, have to run poetry run my-script
          • Awkward when you want to dockerize your code
        • Enforces a virtual environment management framework on everybody in a shared codebase
        • Your Makefile now needs to know about poetry

Dependencies

  • Misc

    • After mastering pip, {{pip-tools}} is recommended.
  • Get a complete list of dependencies (e.g. dependencies of dependencies) with {{deptree}}

    deptree
    # output
    Flask==2.2.2  # flask
      Werkzeug==2.2.2  # Werkzeug>=2.2.2
        MarkupSafe==2.1.1  # MarkupSafe>=2.1.1
      Jinja2==3.1.2  # Jinja2>=3.0
        MarkupSafe==2.1.1  # MarkupSafe>=2.0
      itsdangerous==2.1.2  # itsdangerous>=2.0
      click==8.1.3  # click>=8.0
    • Flask depends on Werkzeug which depends on MarkupSafe
      • Werkzeug and MarkupSafe qualify as transitive dependencies for this project
    • Commented part on the right is the compatible range
  • requirements.txt format

    # comment
    pandas==1.0.0
    pyspark
  • pip: write names of all the packages in your environment to a requirement.txt file

    $ python -m pip freeze > requirements.txt
    • See pip section for issues with this method
  • {{pip-tools}} (Video)


    • Creates a requirements.txt with “pinned” dependencies (i.e. with versions) and shows how those packages depend on each other by naming the package in a comment.

    • Less extensive dependency management than something like poetry

    • Workflow

      • Create and activate a virtual environment
      • Install pip-tools in venv
      • Open requirements file in vim: vi requirements.in
      • List 1 package per row without any pinning (i.e. versions), save, and exit vim
      • Compile requirements file: pip-compile
      • requirements.txt file is generated with package pinning and dependencies in the comments
      • Install requirements: pip install -r requirements.txt
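
      • A minimal sketch of the flow (package names are just examples):

        # requirements.in
        pandas
        requests

        $ pip-compile
        # requirements.txt now contains pinned versions with "via" comments, e.g.:
        #   pandas==2.2.2
        #       # via -r requirements.in
        #   numpy==1.26.4
        #       # via pandas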
  • {{pipx}} (Video)

    • For dependencies that you use everywhere

    • A tool for installing Python CLI utilities that gives them their own hidden virtual environment for their dependencies

    • Adds the tool itself to your PATH - so you can install stuff without worrying about it breaking anything else

    • Install

      pipx install datasette
    • Update package and dependencies

      pipx upgrade datasette
    • List packages

      pipx list
  • {{pipreqs}}

    • Scans all the python files (.py) in your project, then generates the requirements.txt file based on the import statements in each python file of the project

    • Set-up: pip install pipreqs

    • Generate requirements.txt file: pipreqs /<your_project_root_path>/

    • Update requirements.txt:  pipreqs --force /<your_project_root_path>/ 

    • Ignore the libraries of some python files from a specific subfolder

      pipreqs /<your_project_root_path>/ --ignore  /<your_project_root_path>/folder_to_ignore/
  • {{pip-compile-multi}}

    • Notes from:
    • Creates and nests multiple requirement files
      • e.g. Able to keep dev environment from production environment separate
    • Autoresolution of cross-requirement file conflicts
      • Dependency DAG (how all requirement files are connected) must have exactly one “sink” node
    • Organize your most ubiquitous dependencies into a single “core” set of dependencies that all other nodes require (a source node), and all of your development dependencies in a node that requires all others (directly or indirectly) require (a sink).
      • Simplifies and allows use of autoresolution functionality
    • Example: DAG (directionality of the arrows is opposite compared to library docs)
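
    • A sketch of a typical multi-file layout (file names assumed; pip-compile-multi compiles each .in into a matching .txt):

      requirements/base.in   # "core" deps every env needs (the source node)
      requirements/test.in   # contains: -r base.in
      requirements/dev.in    # contains: -r test.in  (the sink node)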

Loading/Saving

File paths

  • Misc
    • {{pathlib}} is recommended
  • {{os}}
    • Get current working directory: os.getcwd()

    • List all files and directories in working directory: os.listdir()

    • List all files and directories from a subdirectory: os.listdir(os.getcwd()+'\\01_Main_Directory')

    • Using os.walk(): gathers paths, folders, and files

      • Paths
      path = os.getcwd()+'\\01_Main_Directory'
      for folder_path, folders, files in os.walk(path):
          print(folder_path)
    • Folders

      • Similar code, just replace print(folder_path) with print(folders)
    • Files

      • Similar code, just replace print(folder_path) with print(files)

  • {{glob}}
    • Get a file path string

      import glob
      path = os.getcwd()+'\\01_Main_Directory'
      for filepath in glob.glob(path):
          print(filepath)
      # C:\Users\Suraj\Challenges\01_Main_Directory
    • List all files and subdirectories from a path

      path = os.getcwd()+'\\01_Main_Directory\\*'
      for filepath in glob.glob(path):
          print(filepath)
      • Note the * wildcard
    • List all files and subdirectories with a “1” in the name

      path = os.getcwd()+'\\01_Main_Directory\\*1.*'
      for filepath in glob.glob(path):
          print(filepath)
    • Get a list of csv file paths from a directory: all_files = glob.glob("C:/Users/path/to/dir/*.csv")

      • Note that you don’t need a loop to save to an object
    • List all files and subdirectories and files in those subdirectories

      path = os.getcwd()+'\\01_Main_Directory\\**\\*.txt'
      for filepath in glob.glob(path, recursive=True):
          print(filepath)
      #Output
      C:\Users\Suraj\Challenges\01_Main_Directory\ABCD_1.txt
      C:\Users\Suraj\Challenges\01_Main_Directory\ABCD_2.txt
      C:\Users\Suraj\Challenges\01_Main_Directory\ABCD_3.txt
      C:\Users\Suraj\Challenges\01_Main_Directory\ABCD_4.txt
      C:\Users\Suraj\Challenges\01_Main_Directory\ABCD_5.txt
      C:\Users\Suraj\Challenges\01_Main_Directory\Sub_Dictionary_1\File_1_in_SubDict_1.txt
      C:\Users\Suraj\Challenges\01_Main_Directory\Sub_Dictionary_1\File_2_in_SubDict_1.txt
      C:\Users\Suraj\Challenges\01_Main_Directory\Sub_Dictionary_2\File_1_in_SubDict_2.txt
      C:\Users\Suraj\Challenges\01_Main_Directory\Sub_Dictionary_2\File_2_in_SubDict_2.txt
      • Components: “**” and “recursive=True”
  • {{pathlib}}
    • Provides a single Path class with a range of methods (instead of separate functions) that can be used to perform various operations on a path.

    • Create a path object for a directory

      from pathlib import Path
      path = Path('origin/data/for_arli')
    • Check if a folder or a file is available in a given path

      if path.exists():
          print(f"{path} exists.")
          if path.is_file():
              print(f"{path} is a file.")
          elif path.is_dir():
              print(f"{path} is a directory.")
      else:
          raise ValueError(f"{path} does not exist")
      • Checks if the path ‘origin/data/for_arli’ exists
        • if it does, it will check whether it is a file or a directory.
        • If the path does not exist, it raises a ValueError indicating that the path does not exist.
    • List all files/folders in a path

      for f in path.iterdir():
          print(f)
      • Use it in combination with the previous is_dir() and is_file() methods to list either files or directories.
    • Delete files/folders in a path

      for f in path.iterdir():
          f.unlink()
      
      path.rmdir()
      • unlink deletes each file in the path
      • rmdir deletes the directory.
        • directory must be empty
    • Create a sequence of directories

      # existing directory: D:\scripts\myfolder
      p = Path(r"D:\scripts\myfolder\logs\newfolder") # raw string so backslashes aren't treated as escapes
      p.mkdir(parents=True, exist_ok=True)
      • Create path object with desired sequence of directories (e.g. logs\newfolder)
      • mkdir with parents=True creates the sequence of directories
        • With exist_ok=True, no error will occur if the directory already exists
    • Rename directory: path.rename('origin/data/new_name')

    • Concatenate a path with string

      path = Path("/origin/data/for_arli")
      # Join another path to the original path
      new_path = path.joinpath("bla")
      print(new_path) # prints '/origin/data/for_arli/bla'
      • It also handles the join between two Path objects
    • Directory stats

      print(path.stat()) # print statistics 
      print(path.owner()) # print owner
      • e.g. creation time, modification time, etc.
    • Write to a file

      # Open a file for writing
      path = Path('origin/data/for_arli/example.txt')
      with path.open(mode='w') as f:
          # Write to the file
          f.write('Hello, World!')
      • You do not need to manually create example.txt first.
    • Read a file

      path = Path('example.txt')
      with path.open(mode='r') as f:
          # Read from the file
          contents = f.read()
          print(contents) # Output: Hello, World!

Models

  • Saving and Loading an estimator as a binary using {{joblib}} (aside: pipelines are estimators)

    import joblib
    #saving the pipeline into a binary file
    joblib.dump(pipe, 'wine_pipeline.bin')
    #loading the saved pipeline from a binary file
    pipe = joblib.load('wine_pipeline.bin')
  • Saving and loading a trained model as a pickle file

    import pickle
    # open file connection
    pickle_file = open('model.pkl', 'wb') # 'wb' opens the file for binary writing
    # save the model
    pickle.dump(model_obj, pickle_file)
    # close file connection                     
    pickle_file.close()
    
    # Open conn and save
    test_dict = {"Hello": "World!"}
    with open("test.pickle", "wb") as outfile:
    # "wb" argument opens the file in binary mode
    pickle.dump(test_dict, outfile)
    
    # open file connection
    pickle_file = open('model.pkl', 'rb')
    # load saved model
    model = pickle.load(pickle_file)
    
    # open conn and load
    # Deserialization
    with open("test.pickle", "rb") as infile:
        test_dict_reconstructed = pickle.load(infile)
    • Can serialize almost everything including classes and functions

Environment Variables

  • {{os}}
    • Check existence

      env_var_exists = 'ENV' in os.environ
      # note: os.environ.has_key('ENV') was Python 2 only; in Python 3 use the in operator above
    • List environment variables: print(os.environ)

    • Loading

      import os
      # Errors when not present
      env_var = os.environ['ENV'] # where ENV is the name of the environment variable
      # Returns a default instead of erroring when not present
      env_var = os.environ.get('ENV', 'DEFAULT_VALUE') # the default value is optional; returns None if omitted and not present
    • Set/Export or overwrite

      os.environ['ENV'] = 'dev'
    • Load or create if not present

      try:
          env_var = os.environ['ENV']
      except KeyError:
          os.environ['ENV'] = 'dev'
          env_var = 'dev'
    • Delete

      if 'ENV' in os.environ:
          del os.environ['ENV']
  • {{python-decouple}}
    • Access environment variables from whatever environment it is running in.

    • Create a .env file in the project root directory: touch .env

    • Open .env in nano text editor: nano .env

      • Nano text editor is pre-installed on macOS and most Linux distros
      • Check if installed/version: nano --version
      • Basic usage tutorial
    • Add environment variables to file

      USER=alex
      KEY=hfy92kadHgkk29fahjsu3j922v9sjwaucahf
      • Save: Ctrl+o
      • Exit: Ctrl+x
    • * Add .env to your .gitignore file *

    • Access

      from decouple import config
      API_USERNAME = config('USER')
      API_KEY = config('KEY')
  • {{python-dotenv}}
    • Reads .env files
    • Probably more popular than {{python-decouple}}
    • Has a companion R package, {dotenv}, so .env files can be used in projects that use both R and Python.

Functions

  • Misc
    • Benchmarking a function

      • Using IPython function

        %time dat['col1001'] = some_function(dat['col1'], dat['col2'], dat['col3'])
        • %%time prints the wall time for the entire cell, whereas %time times only the statement on its own line
      • Using a decorator
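
        A minimal sketch of a timing decorator (timeit/slow_sum are illustrative names):

        import time
        from functools import wraps

        def timeit(func):
            @wraps(func)  # preserve the wrapped function's name/docstring
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                result = func(*args, **kwargs)
                print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
                return result
            return wrapper

        @timeit
        def slow_sum(n):
            return sum(range(n))

        slow_sum(10_000_000)  # prints something like "slow_sum took 0.2134s"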

    • Assigning functions based on arg type

      def process_data(data):
          if isinstance(data, dict):
              process_dict(data) 
          else:
              process_list(data) 
      def process_dict(data: dict):
          print("Dict is processed")
      def process_list(data: list):
          print("List is processed")
      • Assigns data to a particular function depending on whether it’s a dict or a list
      • isinstance checks that the passed argument is of the proper type or a subclass
    • Wrapping functions

      from functools import partial
      get_count_df = partial(get_count, df=df)
      • Wraps function to make df the default value for df arg

Documentation

  • Functions should at least include docstrings and type hinting
  • Docstrings
    • Types: Google-style, Numpydoc, reStructured Text, EpyTex

    • Information to include

      • Function description, arg description, return value description, Description of errors, Optional extra notes or examples of usage.
    • Access functions docstring:

      • print(func_name.__doc__)

      • For large docstrings

        import inspect
        print(inspect.getdoc(func_name))
    • Example: Google-style

      def send_request(key: str, lat: float = 0, lon: float = 0):
          """Send a request to Climacell Weather API
          to get weather info based on lat/lon.
      
          Climacell API provides realtime weather
          information which can be accessed using
          their 'Realtime Endpoint'.
      
          Args:
            key (str): an API key with length of 32 chars.
            lat (float, optional): value for latitude.
              Default=0
            lon (float, optional): value for longitude.
              Default=0
      
          Returns:
            int: status code of the result 
            dict: Result of the call as a dict
      
          Notes:
            See https://www.climacell.co/weather-api/ 
            for more info on Weather API. You can get
            API key from there, too.
          """
      • First sentence should contain the purpose of the function
    • Example: Numpydoc

      def send_request(key: str, lat: float = 0, lon: float = 0):
          """
          Send a request to Climacell Weather API
          to get weather info based on lat/lon.
      
          Climacell API provides realtime weather
          information which can be accessed using
          their 'Realtime Endpoint'.
      
          Parameters
          ----------
            key (str): an API key with length of 32 chars.
            lat (float, optional): value for latitude.
              Default=0
            lon (float, optional): value for longitude.
              Default=0
      
          Returns
          -------
            int: status code of the result 
            dict: Result of the call as a dict
      
          Notes
          -----
            See https://www.climacell.co/weather-api/ 
            for more info on Weather API. You can get
            API key from there, too.
          """
  • Type Hinting
    • This doesn’t check the type; it’s just metadata

      • See isinstance (see below), NotImplementedError (see below), or {{typecheck}} and {{mypy}} (see bkmks) for type checking that will throw errors
      • Posit uses {{Pyright}} (article)
    • Using type hints enables you to perform type checking. If you use an IDE like PyCharm or Visual Studio Code, you’ll get visual feedback if you’re using unexpected types:

    • Variables: my_variable_name: tuple[int, ...]

      • variable should be a tuple that contains only integers. The ellipsis says the total quantity is unimportant.
    • Functions

      def get_count(threshold: str, column: str, df: pd.DataFrame) -> int:
          return (df[column] > threshold).sum()
      • “threshold”, “column” should be strings (str)
      • “df” should be a pandas dataframe (pd.DataFrame)
      • Output should be an integer (int)
    • Function as an arg: Callable[[Arg1Type, Arg2Type], ReturnType]

      • Example:

        from collections.abc import Callable
        def foo(bar: Callable[[int, int], int], a: int, b: int) -> int:
            return bar(a, b)
        • “bar” is a function arg for the function, “foo”
        • “bar” is supposed to take: 2 integer args ([int, int]) and return an integer (int)
      • Example:

        def calculate(i: int, action: Callable[..., int], *args: int) -> int:
            return action(i, *args)
        • “action” takes any number and type of arguments but must return an integer.
        • With *args: int, you also allow a variable number of optional arguments, as long as they’re integers.
      • Example: Lambda

        f: Callable[[int, int], int] = lambda x, y: 3*x + y
        • May not work

Args and Operators

  • Misc

    • ** Args are not reset to default values after each call **

      • Example:

        def func(list1=[]):      # the default [] is created once, at definition time
            list1.append("Temp")
            return list1

        func() # ['Temp']
        func() # ['Temp', 'Temp'] -- the default list persisted between calls
    • “None” + conditional must be used to get the arg to reset back to the default value

      def func(l1=None):
          if l1 is None:
              l1 = []
          l1.append("Temp")
          return l1
  • *

    • Unpacks Lists
    num_list = [1,2,3,4,5]
    num_list_2 = [6,7,8,9,10]
    
    print(*num_list)
    # 1 2 3 4 5
    new_list = [*num_list, *num_list_2] # merge multiple lists
    # [1,2,3,4,5,6,7,8,9,10]
  • *args

    • Functions that can accept a varying number of values
    def names_tuple(*args):
        return args
    
    names_tuple('Michael', 'John', 'Nancy')
    # ('Michael', 'John', 'Nancy')
    names_tuple('Jennifer', 'Nancy')
    # ('Jennifer', 'Nancy')
  • **

    • Unpacks Dictionaries
    num_dict = {'a': 1, 'b': 2, 'c': 3}
    num_dict_2 = {'d': 4, 'e': 5, 'f': 6}
    
    print(*num_dict) # only keys printed
    # a b c
    new_dict = {**num_dict, **num_dict_2} # merge dictionaries
    # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}
  • **kwargs

    • Functions that can accept a varying number of variable/value pairs (like a … in R)
    def names_dict(**kwargs):
        return kwargs
    
    names_dict(Jane = 'Doe')
    # {'Jane': 'Doe'}
    names_dict(Jane = 'Doe', John = 'Smith')
    # {'Jane': 'Doe', 'John': 'Smith'}
  • Function as an arg

    def classic_boot(df, estimator, seed=1):
        df_boot = df.sample(n=len(df), replace=True, random_state=seed)
        estimate = estimator(df_boot)
        return estimate
    • Bootstrap function with an “estimator” function (e.g. mean) as arg
    • Using a Callable
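
      A sketch of the same function with a Callable type hint (assumes a pandas DataFrame):

      from collections.abc import Callable
      import pandas as pd

      def classic_boot(df: pd.DataFrame,
                       estimator: Callable[[pd.DataFrame], float],
                       seed: int = 1) -> float:
          df_boot = df.sample(n=len(df), replace=True, random_state=seed)
          return estimator(df_boot)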
  • Class as an arg

    from dataclasses import dataclass
    
    @dataclass
    class Person:
        name: str
        age: int
        address: str
        phone: str
        email: str
    
    def process_data(person: Person):
        print(f"Processing data for {person.name}, {person.age}, living at {person.address}. Contact info: {person.phone}, {person.email}")
    
    person = Person("Alice", 30, "123 Main St", "555-1234", "alice@example.com")
    process_data(person)
    • Makes function more readable when the function requires a bunch of args
  • Class as an arg (safer alternative)

    from typing import NamedTuple
    
    class Person(NamedTuple):
        name: str
        age: int
        address: str
        phone: str
        email: str
    
    def process_data(person: Person):
        print(f"Processing data for {person.name}, {person.age}, living at {person.address}. Contact info: {person.phone}, {person.email}")
    
    person = Person("Alice", 30, "123 Main St", "555-1234", "alice@example.com")
    process_data(person)
    • Using NamedTuple means that the attributes cannot be overridden

      • e.g. Executing person.name = "Bob" will result in an error because tuples can’t be modified.
  • Make an arg optional

    class Address:
        def __init__(self, street, city, state, zipcode, street2=''):
            self.street = street
            self.street2 = street2
            self.city = city
            self.state = state
            self.zipcode = zipcode
    • “street2” has default value of an empty string, so it’s optional

Best Practices

  • Don’t mutate dataframes inside functions. Create new objects instead.
    • Bad: Mutates df
      From Article

      • The value of df has been assigned to the variable in_df when it was passed to modify_df as an argument.
      • Both the original df and the in_df inside the function point to the same memory location, even if they go by different variable names. During the modification of its attributes, the location of the mutable object remains unchanged.
      • Since the original instance has been modified, it’s redundant to return the DataFrame and assign it to the variable.
      • Note: This situation is not the case for immutable objects like Strings.
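
    • A minimal sketch of the “bad” pattern (modify_df is a hypothetical name):

      import pandas as pd

      def modify_df(in_df: pd.DataFrame) -> pd.DataFrame:
          in_df["name_len"] = in_df["name"].apply(len)  # mutates the caller's df too
          return in_df  # redundant: the original df has already changed

      df = pd.DataFrame({"name": ["Alice", "Bob"]})
      modify_df(df)
      # df now has the "name_len" column; df and in_df point to the same object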
    • Good: Utilizes a deep copy of the df before mutating

      def compute_length(word: str) -> int:
          return len(word)
      
      def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
          return pd.concat([
              df.copy(deep=True),  # deep copy
              df.name.apply(compute_length).rename("name_len"),
              ...
          ], axis=1)

Lambda

  • Useful if you just have 1 expression that you need to execute.

  • Best Practices

    • lambda is an anonymous function, so it is not a good idea to store it in a variable for future use
    • Don’t use lambdas to wrap a single existing function (e.g. math.sqrt). Make sure it’s an actual expression.
      • Example

        # bad
        sqrt_list = list(map(lambda x: math.sqrt(x), mylist))
        # good
        sqrt_list = list(map(math.sqrt, mylist))
        • Affects performance
    • Don’t use for complex expressions that require more than 1 line (meh)
      • Per PEP8 guidelines, Limit all lines to a maximum of 79 characters

      • Example

        # bad (118 characters)
        df["FinalStatus"] = df["Status"].map(lambda x: 'Completed' if x ==
        'Delivered' or x == 'Shipped' else 'Not Completed')
        # instead
        df["FinalStatus"] = ''
        df.loc[(df["Status"] == 'Delivered') |
              (df["Status"] == 'Shipped'),
              'FinalStatus'] = 'Completed'
        df.loc[(df["Status"] == 'Not Delivered') |
              (df["Status"] == 'Not Shipped'),
              'FinalStatus'] = 'Not Completed'
  • Example: 1 arg

    # py
    lambda x: np.sin(x / period * 2 * np.pi)
    # r
    ~sin(.x / period * 2 * pi)
    # r
    \(x) {sin(x / period * 2 * pi)}
  • Example: 2 args

    Greater = lambda x, y : x if(x > y) else y
    Greater(0.002, 0.5897)
  • Lambda-Filter

    • Can be faster than a comprehension, especially when passing a built-in function rather than a lambda

    • Format: filter(function, data_object)

      • Returns a filter object, which needs to be converted into data structure such as list or set
    • Example: Basic

      yourlist = list(np.arange(2,50,3))
      list(filter(lambda x:x**2<100, yourlist))
      # Output 
      [2, 5, 8]
    • Example: Filter w/logical

      import pandas as pd
      import datetime as dt
      # create a list of 10,000 dates
      datlist = pd.date_range(dt.datetime.today(), periods=10000).tolist() 
      # convert the dates to strings via list comprehension
      datstrlist = [d.strftime("Day %d in %B of year %Y is a %A") for d in datlist]
      datstrlist[:4]
      ['Day 21 in October of year 2021 is a Thursday', 'Day 22 in October of year 2021 is a Friday', 'Day 23 in October of year 2021 is a Saturday', 'Day 24 in October of year 2021 is a Sunday']
      
      strLamb = filter(lambda d: ((d.endswith("urday") or d.endswith("unday")) and "Oc" in d), datstrlist)
      • Searches for Saturdays and Sundays in the month of October of all years in list of strings
    • Example: Nested Lists

      group1 = [1,2,3,43,23,42,8,3,7]
      group2 = [[3, 34, 23, 32, 42], [6, 11, 9], [1, 3,9,7,2,8]]
      [list(filter(lambda x: x in group1, sublist)) for sublist in group2]
      >> [[3, 23, 42], [], [1, 3, 7, 2, 8]]
      • Probably useful for json
      • for-loop attached to the end of the list-filter combo
      • Each sublist of group 2 is fed into the lambda-filter and compared to the group 1 list
  • Iterating over each element of a list

    • Example: map

      list(map(lambda x: x**2+x**3, yourlist))
      • map returns a map object that needs to be converted
    • Example: 2 Lists

      mylist = list(np.arange(4,52,3))
      yourlist = list(np.arange(2,50,3))
      list(map(lambda x,y: x**2+y**2, yourlist, mylist))
      • Like a pmap
  • Nested lambdas

    • Example: map

      arr = [1,2,3,4,5]
      list(map(lambda x: x*2, filter(lambda x: x%2 == 0, arr)))
      >> [4,8]
      • Work inside out (locate where the data object, arr, appears)
      • “arr” is filtered by the inner lambda function for even numbers, then iterated by map to be doubled by the outer lambda function
  • Iterate over rows of a column in a df

    • Example: Using formula over rows

      grade['evaluate']=grade['MathScore'].apply(lambda x: round((x**x)/2,2))
      • “grade” is the df; “MathScore” is a numeric column; “evaluate” is the new column in the df
      • Formula applied to each value of “MathScore” to generate each value of evaluate
    • Example: Conditional over rows

      grade['group']=grade['MathScore'].apply(lambda x: 'Excellent' if x>=3.0 else 'Average')
      • “grade” is the df; “MathScore” is a numeric column; “group” is the new column in the df
      • Conditional applied to each value of “MathScore” to generate each value of “group”
    • Using {{swifter}} for parallelization

      import swifter
      df['e'] = df.swifter.apply(lambda x: infer(x['a'], x['b'], x['c'], x['d']), axis = 1)
  • In a Pivot Table (like a crosstab)

    • Example

      • grades_df
        • 2 names (“name”)
        • 6 scores (“score”)
        • Only 2 letter grades associated with these scores (“letter grade”)
      • Task: drop lowest score for each letter grade, then calculate the average score for each letter grade
      grades_df.pivot_table(index='name',
                            columns='letter grade',
                            values='score',
                            aggfunc = lambda series : (sorted(list(series))[-1] + sorted(list(series))[-2]) / 2)
      
      letter grade    A    B
      name
      Arif          96.5  87.0
      Kayla         95.5  84.0
      • index: each row will be a “name”
      • columns: each column will be a “letter grade”
      • values: cell values come from the “score” column, one cell for each combination of the index and columns args
      • aggfunc: uses a lambda to compute the aggregated values
        • “series” is the variable name used in the lambda function
        • sorts series (ascending), takes the top two values (using negative list indexing), and averages them

Scope

  • Mutable objects instantiated in a function’s argument defaults persist across calls (the default is created once, at definition time)
    • “all_numbers” retained its previous value when the 2nd call to the function was made (see the sketch below)
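
  • A minimal reconstruction of the referenced example (the classic mutable-default-argument pitfall; the name collect_numbers is invented for illustration):

    def collect_numbers(number, all_numbers=[]):  # the default list is created once
        all_numbers.append(number)
        return all_numbers

    collect_numbers(1)  # [1]
    collect_numbers(2)  # [1, 2]; "all_numbers" kept its previous value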

Closures

  • Inner functions that can access values in the outer function, even after the outer function has finished its execution

  • Example

    
    # closure way
    def balanceOwed(roomN, rate, nights):
        def increaseByMeals(extra):
            amountOwed = rate*nights + extra
            print(f"Dear Guest of Room {roomN}, you have "
                  "a due balance:", "${:.2f}".format(amountOwed))
            return amountOwed
        return increaseByMeals
    
    ba = balanceOwed(201, 400, 3)
    ba(200)
    ba(150)
    ba(180)
    ba(190)
    Dear Guest of Room 201, you have a due balance: $1400.00
    Dear Guest of Room 201, you have a due balance: $1350.00
    Dear Guest of Room 201, you have a due balance: $1380.00
    Dear Guest of Room 201, you have a due balance: $1390.00
    • Tedious way: For each value of “extra” (e.g. meals), the function needs to be called even if the other values of the arguments don’t change.
    • Closure way:
      • increaseByMeals() is a closure function, because it remembers the values of the outer function balanceOwed(), even after the execution of the latter
      • balanceOwed() is called with its three arguments only once and then after its execution, we call it four times with the meal expenses (“extra”).

Modules

  • .py files are called “modules.”

  • A directory with .py files in which one of the files is an “__init__.py” is called a package.

  • Misc

    • Resource: Make your Python life easier by learning how imports find things
    • sys.path contains the list of paths where Python is looking for things to import. Your virtual environment and the directory containing your entry point are automatically added to sys.path.
      • sys.path is a list. Which means you can .append(). Any directory you add there will have its content importable. It’s a useful hack, but use it as a last resort.
    • When using -m flag to run a script, if you pass a package instead of a module, the package must contain a “__main__.py” file for it to work. This __main__.py module will run.
    • If you have scripts in your projects, don’t run them directly. Run them using “-m”, and you can assume everything starts from the root.
      • Example:

        top_dir
        ├── foo
        │   ├── bar.py
        │   ├── __init__.py
        │   └── blabla.py
        └── blabla.py
        • Running python foo/bar.py, “top_dir” is the current working directory, but “foo” is added to the sys.path.
        • Running python -m foo.bar, “top_dir” is the current working directory and added to sys.path.
          • Imports can all start from the root of the project and opened file paths as well.
  • Project Structure

    ├── main.py
    └── packages
       ├── __init__.py
       ├── module_1.py
       ├── module_2.py
       ├── module_3.py
       └── module_4.py
    • “__init__.py” contains only 1 line, which declares all the functions (or classes?) that are in the modules (a fuller pattern is sketched below)

      __all__ = ["func1", "func2"]
      • If the module files contained classes with multiple functions, I think you’d just declare the classes and not every function in that class.
        • If using classes, each module should only have 1 class.
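
    • A sketch of a more typical “__init__.py” for the structure above; note that “__all__” only controls what “import *” exposes, so the names must also be imported into the package namespace:

      # packages/__init__.py
      from .module_1 import func1
      from .module_2 import func2

      __all__ = ["func1", "func2"]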
    • Scripts need to include a “__main__” guard in order to be used in other scripts

      # test_function.py
      def function1():
          print("Hello world")

      function1()  # module-level call: runs on every import

      # Define the __main__ guard
      if __name__ == '__main__':
          # execute only if run as a script
          function1()
      • Says if this file is being run non-interactively (i.e. as a script), run this chunk
      • Add else: chunk, then that chunk will be run only if the file is imported as a module
      • Lets you control which parts of the code run when the module is imported
      • Importing a module without a “__main__” guard (e.g. in a jupyter notebook) runs its top-level code immediately, so the bare function1() call above executes on import
  • Loading

    • DO NOT USE from <library> import *

      • This will import anything and everything from that library and causes several problems:

        • You don’t know what is in that package, so you have no idea what you just imported, or even if what you want is in there.

        • You just filled your local namespace with an unknown quantity of mysterious names, and you don’t know what they will shadow.

        • Your editor will have a hard time helping you since it doesn’t know what you imported.

        • Your colleague will hate you because they have no idea what variables come from where.

      • Exception: In the shell, it’s handy. Sometimes, you want to import all things in __init__.py and you have “__all__” defined (see above)

    • From the working directory, it’s like importing from a library: from file1 import function1

    • From a subdirectory, from subdirectory.file1 import function1

    • From a directory outside the project, add that directory to sys.path before importing the module

      import sys
      sys.path.append('/User/moduleDirectory')
      • When a module is imported, it first searches for built-in modules, then the paths listed in sys.path

      • This appends the new path to the end of the sys.path

        import sys
        sys.path.insert(1, '/User/moduleDirectory')
      • Puts this path near the front of sys.path (index 1, just after the script’s own directory at index 0).

        import sys
        sys.path.remove('/User/NewDirectory')
        • *delete the path from sys.path after you finish*
        • The change lasts for the rest of the session (e.g. a long-running notebook or shell), so remove the path when done to keep later imports from searching it

Conditionals

  • If-Else

    • Syntax

      if <expression>:
          do something
      else:
          do something else
    • Example

      regenerate = False
      if regenerate:
          concepts_list = df2Graph(df, model='zephyr:latest')
          dfg1 = graph2Df(concepts_list)
          if not os.path.exists(outputdirectory):
              os.makedirs(outputdirectory)
      
          dfg1.to_csv(outputdirectory/"graph.csv", sep="|", index=False)
          df.to_csv(outputdirectory/"chunks.csv", sep="|", index=False)
      else:
          dfg1 = pd.read_csv(outputdirectory/"graph.csv", sep="|")
  • Try-Except

    • Example

      import os
      try:
          env_var = os.environ['ENV']
      except KeyError:
          env_var = None  # do something, e.g. fall back to a default
      • If “ENV” is not present, a KeyError is raised and the except section executes
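
    • A related shortcut, assuming a fallback value is acceptable (the 'dev' default here is hypothetical):

      import os
      env_var = os.environ.get('ENV', 'dev')  # returns 'dev' if ENV is unset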
  • Match (Python 3.10+) (switch-like statement)

    match object:
        case <pattern_1>:
            <action_1>
        case <pattern_2>:
            <action_2>
        case <pattern_3>:
            <action_3>
        case _:
            <action_wildcard>
    • “object” is just a variable name; could be anything

    • “case _” is the wildcard used when none of the other cases match

    • Example: function input inside user function

      def http_error(status):
          match status:
              case 200:
                  return 'OK'
              case 400:
                  return 'Bad request'
              case 401 | 403 | 404:
                  return 'Not allowed'
              case _:
                  return 'Something is wrong'
    • Example: dict input inside a function

      def get_service_level(user_data: dict):
          match user_data:
              case {'subscription': _, 'msg_type': 'info'}:
                  print('Service level = 0')
              case {'subscription': 'free', 'msg_type': 'error'}:
                  print('Service level = 1')
              case {'subscription': 'premium', 'msg_type': 'error'}:
                  print('Service level = 2')
    • Example: inside a class

      class ServiceLevel:
          def __init__(self, subscription, msg_type):
              self.subscription = subscription
              self.msg_type = msg_type
      
          @staticmethod
          def get_service_level(user_data):
              match user_data:
                  case ServiceLevel(subscription=_, msg_type='info'):
                      print('Level = 0')
                  case ServiceLevel(subscription='free', msg_type='error'):
                      print('Level = 1')
                  case ServiceLevel(subscription='premium', msg_type='error'):
                      print('Level = 2')
                  case _:
                      print('Provide valid parameters')
      • Note that in the “case” patterns, the class name is followed by “()” and its attributes are matched with “=” instead of “:”
  • Assert

    • Used to confirm a condition while debugging
    • Because assert statements can be stripped (see below), production checks should raise explicitly:

      # debug-only check (removed when run with -O):
      assert condition, message

      # production-safe equivalent:
      if not condition:
          raise AssertionError(message)
    • assert is useful for debugging code because it lets you test if a condition in your code returns True, if not, the program will raise an AssertionError.
    • ** Do not use in production, because when code is executed with the -O (optimize) flag, the assert statements are removed from the bytecode. **

Loops

  • Misc
    • List Comprehensions vs Generators in terms of memory usage (see the sketch below)
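
    • A minimal sketch of the difference using sys.getsizeof (exact sizes vary by platform):

      import sys
      squares_list = [x * x for x in range(1_000_000)]  # materializes every value
      squares_gen = (x * x for x in range(1_000_000))   # produces values lazily
      print(sys.getsizeof(squares_list))  # several MB
      print(sys.getsizeof(squares_gen))   # ~100 bytes, regardless of the range size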

    • {{tqdm}} - progress bar for loops

      from tqdm import tqdm
      for i in tqdm(range(10000)):
          ...
    • break terminates the loop containing it

      • If in a nested loop, it will terminate the inner-most loop containing it
    • continue is used to skip the remaining code inside a loop for the current iteration only; forces the start of the next iteration of the loop

    • pass does nothing

      • used when a statement or a condition is required to be present in the program but we do not want any command or code to execute
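
    • A minimal loop exercising all three (prints 0, 1, 3, 5, 7):

      for n in range(10):
          if n == 0:
              pass        # placeholder; execution falls through to print
          elif n % 2 == 0:
              continue    # skip the rest of this iteration
          elif n > 7:
              break       # terminate the loop entirely
          print(n)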

Iterators

  • Remembers its place, so each next() call picks up where the previous one left off

  • Example

    D = {"123":"Y","111":"PT","313":"Y","112":"Y","201":"PT"}
    ff = filter(lambda e:e[1]=="Y", D.items())
    
    print(next(ff))
    >> ('123', 'Y')
    print(next(ff))
    >> ('313', 'Y')
  • apply

    • axis

      • 0 or ‘index’: apply function to each column.

      • 1 or ‘columns’: apply function to each row.

    • Example: Function applied to rows of a column of a dataframe (i.e. cells)

      def df2Graph(dataframe: pd.DataFrame, model=None) -> list:
          # dataframe.reset_index(inplace=True)
          results = dataframe.apply(
              lambda row: graphPrompt(row.text, {"chunk_id": row.chunk_id}, model),
              axis=1,
          )
          return list(results)
      • text and chunk_id are column names of the dataframe
      • row is the row of the dataframe since axis=1, and from that row, the columns text and chunk_id are subsetted in the arguments of user-defined function.

Generators

  • Generators are iterators, a kind of iterable you can only iterate over once (normal iterables like lists, strings, etc. can be iterated over repeatedly)

  • Generators do not store all the values in memory, they generate the values on the fly

    • yield - Pauses the function saving all its states and later continues from there on successive calls.
      • Allows you to consume one element at a time and work with it without requiring you to have every element in memory.
      • Produces a generator
  • Example: Using a comprehension 

    mygenerator = (x*x for x in range(3))
    for i in mygenerator:
        print(i)
    • [ ] produces a list; ( ) produces a generator
  • Example: Using a function

    def create_generator():
        mylist = range(3)
        for i in mylist:
            yield i*i
    
    mygenerator = create_generator()
    for i in mygenerator:
        print(i)
    0
    1
    4
    • The first time the for calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it’ll return the first value of the loop.
    • Then, each subsequent call will run another iteration of the loop you have written in the function and return the next value.
    • This will continue until the generator is considered empty, which happens when the function runs without hitting yield.
      • That can be because the loop has come to an end, or because you no longer satisfy an “if/else”
  • Example: Sending values to (yield)

    def grep(pattern):
        print(f"Looking for {pattern}")
        while True:
            line = (yield)
            if pattern in line:
                print(line)
    
    g = grep("python")  # instantiate with "python" pattern to search for
    
    next(g)  # Prime it
    >> Looking for python
    
    g.send("A series of tubes")  # "python" not present so prints nothing
    g.send("python generators rock!")  # "python" present so prints the line
    >> python generators rock!
    g.close()  # closes the coroutine
    • (yield) receives the value passed to .send() and assigns it to “line”; calling grep() is what creates the generator object
    • All coroutines must be “primed” by first calling next() (or .send(None))
      • This advances execution to the location of the first yield expression
  • Example: Sending values to (yield)

    def writer():
        """A coroutine that writes data *sent* to it to fd, socket, etc."""
        while True:
            w = (yield)
            print('>> ', w)
    def writer_wrapper(coro):
        # TBD - filled in under "Using yield from" below
        pass
    w = writer()
    wrap = writer_wrapper(w)
    wrap.send(None)  # "prime" the coroutine
    for i in range(4):
        wrap.send(i)
    >>  0
    >>  1
    >>  2
    >>  3
    • A more complex framework if you want to break the workflow into multiple functions

Using yield from

  • Allows for two-way usage (reading/sending) of generators

  • Example (reading from a generator)

    def reader():
        """A generator that fakes a read from a file, socket, etc."""
        for i in range(4):
            yield '<< %s' % i
    
    # with yield
    def reader_wrapper(g):
        # Manually iterate over data produced by reader
        for v in g:
            yield v
    # OR with yield from
    def reader_wrapper(g):
        yield from g
    wrap = reader_wrapper(reader())
    for i in wrap:
        print(i)
    • Basic; only eliminates 1 line of code
  • Example (sending to a generator)

    # with (yield)
    def writer_wrapper(coro):
        coro.send(None)  # prime the coro
        while True:
            try:
                x = (yield)  # Capture the value that's sent
                coro.send(x)  # and pass it to the writer
            except StopIteration:
                pass
    # OR with yield from
    def writer_wrapper(coro):
        yield from coro
    • See the writer() coroutine above (“sending values to (yield)”) for the code and use case
    • Shows the other advantage of “yield from”: it automatically includes the code to prime the coroutine and stop the loop.
  • Reusable generator

    • The reading example above using “yield from” already works with any generator passed to it
  • Slicing a generator

    from itertools import islice
    def gen():
        yield from range(1,11)
    g = gen()
    myslice = islice(g, 2)
    >> list(myslice)
    [1, 2]
    >> [i for i in g]
    [3,4,5,6,7,8,9,10]

For

  • Syntax - for <var> in <sequence>: <loop body>

  • Numeric Range

    # pseudocode in other languages:
    # for i = 1 to 10
    #     <loop body>
    
    # Python equivalent, from 0 to 519:
    for i in range(520):
        <loop body>
    
    res = 0
    for idx in np.arange(0, 100000):
      res += df.loc[idx, 'int']
    • np.arange() ran 8000 times faster than the same chunk using range()
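
    • For a reduction like this, the idiomatic (and fastest) option skips the Python loop entirely:

      res = df['int'].sum()  # assumes the loop was covering the whole column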
  • List

    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    even_numbers = []
    for item in numbers:
        if item % 2 == 0:
            even_numbers.append(item)
    print(even_numbers)
    
    # results: [2, 4, 6, 8]
  • List: index and value

    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    for index, element in enumerate(numbers):
        if element % 2 != 0:
            numbers[index] = element * 2
        else:
            continue
    print(numbers)
    # results: [2, 2, 6, 4, 10, 6, 14, 8, 18]
    • enumerate also gets the index of the respective element at the same time
    • enumerate(numbers, start=3) would start the index counter at 3 (it does not skip elements)
  • With three expressions

    # C-style pseudocode; Python has no three-expression for loop
    # for (i = 1; i <= 10; i += 1)
    #     <loop body>
    # Python equivalent:
    for i in range(1, 11):
        <loop body>
  • Collection-Based

    • If the collection is a dict, then this just iterates over the keys
    for i in <collection>:
        <loop body>
  • Iterate over a sliding window (sketch below)
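
    • A minimal sketch using zip for a window of size 2; itertools.pairwise does the same on Python 3.10+

      seq = [1, 2, 3, 4, 5]
      for a, b in zip(seq, seq[1:]):
          print(a, b)  # 1 2, 2 3, 3 4, 4 5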

  • Over dictionary keys and values of a dict

    for a, b in transaction_data.items():
        print(a, '~', b)
    • The .items method includes both key and value, so it iterates over the pairs.
  • Over nested dictionaries

    for k, v in transaction_data_n.items():
        if type(v) is dict:
            for nk, nv in v.items(): 
                print(nk, '-->', nv)
    • If the item of the dict is itself a dict then another loop iterates through its items.
    • nk and nv stand for nested key and nested value
  • Selecting a specific item in a nested dictionary

    for k, v in transaction_data_n.items():
        if type(v) is dict and k == 'transaction_2':
            for sk, sv in v.items():
                print(sk,'-->', sv)
    • Only transaction_2’s items are printed
  • Rows of a DataFrame

    res = 0
    for row in df.itertuples():
      res += getattr(row, 'int')
    • itertuples()  is 30x faster than iterrows()
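
    • The equivalent iterrows() loop for comparison; iterrows() yields an (index, Series) pair:

      res = 0
      for _, row in df.iterrows():
          res += row['int']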

zip

  • Combine lists into 1 list of tuples

    acc_values = [1, 0.04, 0.9]
    acc_names = ["RMSE", "MAPE", "R-sq"]
    acc_list = list(zip(acc_names, acc_values))
    acc_list
    [('RMSE', 1), ('MAPE', 0.04), ('R-sq', 0.9)]
    • zip accepts lists of different lengths but stops at the shortest one, pairing only the corresponding elements

    • Combine lists of unequal lengths but keep the non-paired elements

      from itertools import zip_longest
      acc_names3 = ["RMSE", "MAPE", "R-sq", "MSE"]
      acc_values3 = [1, 0.04, 0.9]  # one fewer value than acc_names3
      acc_list3 = list(zip_longest(acc_names3, acc_values3))
  • Unzip list of tuples into separate lists

    names, values = zip(*acc_list)
    • Asterisk is the “unzipping operator”
  • Unpack dict into a list of separate tuples for key:value pairs

    acc_tuples = list(zip(acc_dict.keys(), acc_dict.values()))
    acc_tuples
    [('RMSE', 1), ('MAPE', 0.04), ('R-sq', 0.9)]

Comprehensions

  • Misc
    • The for-in construct within comprehensions is faster than the traditional for-loops
      • not faster than (all?) lambda-filters (see functions >> lambda)
    • Returns lists, sets, or dicts (just change the bracket types)
  • Dicts
    • Syntax: mydict = {key:val for key, val in zip(keys_list, vals_list)}

    • Combine key:value lists into a dictionary

      acc_dict = {k:v for k,v in zip(acc_names, acc_values)}
    • Map each value to the result of an expression

      mydict = {v: v**2 for v in numberslist}
    • If numberslist =[1,2,3], then mydict = {1:1, 2:4, 3:9}

  • Lists
    • Syntax: newlist = [expression for item in iterable if condition == True]

    • With expression

      mylist = [x**2 for x in numberslist]
      • if numberslist =[1,2,3], then mylist = [1,4,9]
    • Set values in a list to uppercase

      newlist = [x.upper() for x in fruits]
    • With conditional expression (if-else)

      • Append to the comprehension to filter the dictionary or list
      • Syntax: mylist = [expressionA if (condition2==True) else expressionB for item in list if (condition1==True)]
      • Example: newlist = [x if x != "banana" else "orange" for x in fruits]
        • Return “orange” instead of “banana”
      • Example: new_list = [(x**2) if (x>90) else (x**3) for x in old_list if (x%2==0)]
        • Says
          • Square an argument if it exceeds 90, else cube it 
          • Return all the exponentiated results only if the argument was an even number
      • Example: c = [d for d in datstrlist if ((d.endswith("urday") or d.endswith("unday")) and "Oc" in d)]
        • String filter that looks for strings with Saturdays and Sundays in October
        • *Slower than a lambda-filter* (See Functions >> lambda)
  • Nested
    • Syntax: myset = {expression(itemA, itemB) for itemA in setA for itemB in setB}

    • Example: {j for i in range(2, int(N**0.5)+1) for j in range(i**2, N, i)}

      • N = 100000
      • Paces through all the integers i from 2 up to the square root of N
      • For each i, adds every multiple j of i (from i**2 up to N, stepping by i) to the set
      • The result is the set of composite (non-prime) numbers below N, as in the Sieve of Eratosthenes
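
      • A small follow-on sketch: the primes themselves are the full range minus the composite set

        N = 100000
        composites = {j for i in range(2, int(N**0.5)+1) for j in range(i**2, N, i)}
        primes = set(range(2, N)) - composites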
    • Example: From link

      # Function to get set labels
      def get_prediction_set_labels(prediction_sets, class_labels):
          # Get set of class labels for each instance in prediction sets
          prediction_set_labels = [
              set([class_labels[i] for i, x in enumerate(prediction_set) if x])
              for prediction_set in prediction_sets]
          return prediction_set_labels
      • Returns a list where each object in the list is a set object (e.g. {green}, {green, orange})
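
      • Hypothetical usage (data invented for illustration):

        class_labels = ['green', 'orange', 'red']
        prediction_sets = [[True, False, False], [True, True, False]]
        get_prediction_set_labels(prediction_sets, class_labels)
        # [{'green'}, {'green', 'orange'}]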

Debugging

  • Terms
    • Exception Errors - Raised when the syntax is correct but the program results in an error.
    • Syntax Errors - Occur when the interpreter detects invalid syntax (relatively easier to fix)
      • e.g. unmatched parenthesis
    • Traceback - A report that helps us understand the reason for an exception.
      • Contains function calls made in the code along with their line numbers
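
  • A minimal example of reading a traceback (file name and line numbers are illustrative):

    def divide(a, b):
        return a / b

    def main():
        return divide(1, 0)

    main()
    # Traceback (most recent call last):
    #   File "example.py", line 7, in <module>
    #     main()
    #   File "example.py", line 5, in main
    #     return divide(1, 0)
    #   File "example.py", line 2, in divide
    #     return a / b
    # ZeroDivisionError: division by zero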

Error Handling

  • try + except

    • Says try the main code snippet, but if an exception (error) occurs, run the secondary code snippet, the workaround.
    def pct_difference_error_handling(n1, n2):
      '''Function that takes two numbers and returns the percentage difference
      between n1 and n2, with n1 as the reference number'''
    
      # Try the main code
      try:
        pct_diff = (n1-n2)/n1
        return f'The difference between {n1} and {n2} is {n1-n2}, which is {pct_diff*100}% of {n1}'
    
      # If you find an error, use this code instead
      except:
        pct_diff = (int(n1)-int(n2))/int(n1)
        return f'The difference between {n1} and {n2} is {int(n1)-int(n2)}, which is {pct_diff*100}% of {n1}'
    
      # Optional
      finally:
        print("Code ended")
    • Assumes the likely error is the user passing strings instead of numerics; if the try block raises, the except block converts the strings to numerics and performs the calculation
    • finally: - This clause will always run, regardless of whether the try block raises an error, so it suits a completion message or cleanup, for example
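
  • A safer variant catches only the exceptions you expect, since a bare except: also swallows things like KeyboardInterrupt; a minimal sketch:

    def pct_difference(n1, n2):
        try:
            return (n1 - n2) / n1
        except TypeError:          # e.g. strings were passed in
            return (int(n1) - int(n2)) / int(n1)
        except ZeroDivisionError:  # n1 == 0 gives no reference value
            return float('nan')
        finally:
            print('Code ended')    # runs whether or not an exception occurred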