NumPy

Misc

  • Linear algebra resources

  • Optimization

    • {{numba}} - JIT compiler that translates a subset of Python and NumPy code into fast machine code.
  • Terms

    • Broadcasting is a mechanism that allows NumPy to handle (nd)arrays of different shapes during arithmetic operations.
      • See article for details on how this works and when it fails (ValueErrors)
      • The smaller (nd)array is “broadcast” to the shape of the larger (nd)array before the operation is applied.
      • Conceptually, the smaller (nd)array is repeated until it matches the shape of the larger (nd)array; NumPy does not actually make these copies in memory.
      • Fast, since it vectorizes array operations so that looping occurs in optimized C code
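A minimal sketch of the broadcasting behaviour described above. The array names are illustrative and not from the original notes. Shapes are compared from the trailing dimension backward, and two dimensions are compatible when they are equal or one of them is 1.

```python
import numpy as np

big = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])     # shape (2, 1)

row_sum = big + row   # row is stretched across both rows of big
col_sum = big + col   # col is stretched across all three columns

# np.broadcast_to exposes the stretched shape without copying the data
stretched = np.broadcast_to(row, big.shape)

# Incompatible trailing dimensions (3 vs 2) raise a ValueError
try:
    big + np.array([1, 2])
    failed = False
except ValueError:
    failed = True
```

The zero stride on the first axis of `stretched` is what lets NumPy reuse the same three values for every row.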
    • Memory Views: Working with views is often desirable, since it avoids unnecessary copies of arrays and saves memory
      • np.may_share_memory(new_array, old_array) - if the result is True, then new_array is likely a memory view of old_array (the check is conservative and can return false positives; np.shares_memory is exact)
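One way to check whether an array is a view, sketched with illustrative names (old_array, view and copy are not from the original notes). It contrasts the quick bounds check with the exact check and with the .base attribute.

```python
import numpy as np

old_array = np.arange(10)
view = old_array[2:6]          # basic slicing returns a view
copy = old_array[[2, 3, 4]]    # fancy indexing returns a copy

# may_share_memory only compares memory bounds, so it is fast but approximate
maybe_view = np.may_share_memory(view, old_array)

# shares_memory is exact, but can be slower on large or complex layouts
is_view = np.shares_memory(view, old_array)
copy_is_shared = np.shares_memory(copy, old_array)

# a view's .base attribute points at the array that owns the data
base_is_old = view.base is old_array
```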
    • ndarrays - multi-dimensional arrays of fixed-size items.
    • Pandas will typically outperform numpy ndarrays in cases that involve significantly larger volume of data (say >500K rows) (not sure if this is true)
  • Info (these are attributes, so no parentheses after the name)

    • Number of dimensions: ary.ndim
    • Shape: ary.shape
    • Number of elements: ary.size
    • Number of rows (i.e. 1st dim): len(ary)
  • Random Number Generator

    rng2 = np.random.default_rng(seed=123)
    rng2.random(3)
    
    array([0.68235186, 0.05382102, 0.22035987])
  • Sample w/replacement

    np.random.seed(3)
    # a parameter: an int n means sample from np.arange(n) (here, 0 to 11)
    # size parameter: how many samples we want (12)
    # replace = True: sample with replacement
    np.random.choice(a=12, size=12, replace=True)
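The same sampling can also be written with the Generator API used in the Random Number Generator entry. This is a sketch, not a drop-in replacement: a Generator seeded with 3 does not reproduce the values from np.random.seed(3), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Passing the int 12 is shorthand for sampling from np.arange(12), i.e. 0..11
with_repl = rng.choice(12, size=12, replace=True)

# replace=False draws each value at most once, so size cannot exceed 12
without_repl = rng.choice(12, size=12, replace=False)
```

With replace=False and size equal to the population size, the result is a random permutation of 0 to 11.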
  • Create a grid of values

    grid_q_low = np.linspace(number1,number2,num_vals).reshape(-1,1)
    grid_q_high = np.linspace(number3,number4,num_vals).reshape(-1,1)
    grid_q = np.concatenate((grid_q_low,grid_q_high),1)
    • linspace returns evenly spaced numbers over a specified interval.
      • “number1,2,3,4” are numeric values for args: start and stop
      • reshape coerces the results into m x 1 column arrays (-1 is a placeholder)
    • concatenate with axis = 1 stacks column-wise, so this results in an m x 2 array
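A runnable version of the grid recipe, with hypothetical values filled in for number1 to number4 and num_vals (the originals are placeholders). It also shows np.column_stack as an alternative that skips the reshape step.

```python
import numpy as np

num_vals = 5  # hypothetical number of grid points

# Each linspace result is reshaped into an m x 1 column
grid_q_low = np.linspace(0.0, 0.4, num_vals).reshape(-1, 1)
grid_q_high = np.linspace(0.6, 1.0, num_vals).reshape(-1, 1)

# Stack the two columns side by side to get an m x 2 array
grid_q = np.concatenate((grid_q_low, grid_q_high), axis=1)

# column_stack builds the same m x 2 array directly from 1-D inputs
grid_q_alt = np.column_stack((np.linspace(0.0, 0.4, num_vals),
                              np.linspace(0.6, 1.0, num_vals)))
```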

Create or Coerce

  • Comparison with R DataFrame

    >>> X = np.arange(6).reshape(3, 2)
    >>> X
    array([[0, 1],
          [2, 3],
          [4, 5]])
    # r
    X <- data.frame(x1 = c(0,2,4), x2 = c(1,3,5))
    • Variables are columns in the array
  • Create column-wise array

    # example 1
    a = np.array((1,2,3))
    b = np.array((2,3,4))
    np.column_stack((a,b))
    array([[1, 2],
          [2, 3],
          [3, 4]])
    
    # example 2
    np.column_stack([
        model.predict(X_cal, quantile=(alpha/2)*100), 
        model.predict(X_cal, quantile=(1-alpha/2)*100)])
  • Create a constant array

    constant_arr = np.full((other_array.shape), 5)
    # ** Don't really need this, since other_array + 5 works through broadcasting **
    • “other_array” the array we want the constant array to do arithmetic with
    • .shape attribute gives other_array’s dimensions
  • Coerce from list

    a = [1, 2, 3]
    np.array(a)
    
    a = [[1,2,3], [4,5,6]]
    np.array(a, dtype = np.float32)
    • dtype is optional
  • Convert pandas df to ndarray

    • new_array = pandas_df.values
    • pandas_df.to_numpy()
    • np.array(df)

Manipulation

  • Subsetting a row

    ary = np.array([[1, 2, 3],
                    [4, 5, 6]])
    
    first_row = ary[0]
    first_row = ary[0:1]  # slice form; keeps the row as a 1 x 3 array
    • Any changes to “first_row” also change “ary”
    • Produces a “memory view” which conserves memory and increases speed
    • Slices can only select regularly spaced indices (start:stop:step)
  • Subsetting columns using Fancy Indexing

    ary_copy = ary[:, [0, 2]] # first and last column
    • Uses tuple or list objects of non-contiguous integer indices to return desired array elements
    • ** produces a copy of the array. So takes-up more memory**
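The difference between the view returned by basic indexing and the copy returned by fancy indexing can be checked directly. A small sketch using the same ary as above; the names row_view and cols_copy are illustrative.

```python
import numpy as np

ary = np.array([[1, 2, 3],
                [4, 5, 6]])

row_view = ary[0]            # basic indexing returns a view
row_view[0] = 100            # this also changes ary[0, 0]

cols_copy = ary[:, [0, 2]]   # fancy indexing returns a copy
cols_copy[0, 0] = -1         # ary is left untouched

shares = np.shares_memory(cols_copy, ary)
```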
  • Boolean masking

    ary_bool1 = (ary > 3) & (ary % 2 == 0)
    ary_bool2 = ary > 3
    ary_bool2
    
    array([[False, False, False],
          [ True,  True,  True]])
  • Subsetting 1st elt of all dimensions using ellipsis

    # create an array with a random number of dimensions
    dimensions = np.random.randint(1,10)
    items_per_dimension = 2
    max_items = items_per_dimension**dimensions
    axes = np.repeat(items_per_dimension, dimensions)
    arr = np.arange(max_items).reshape(axes)
    
    arr[..., 0]
    array([[[[ 0,  2],
            [ 4,  6]],
    
            [[ 8, 10],
            [12, 14]]],
    
          [[[16, 18],
            [20, 22]],
    
            [[24, 26],
            [28, 30]]]])
    • ellipsis makes it so if you have a large (or unknown) number of dimensions, you don’t have to use a ton of colons to subset the array
    • In this run, “arr” has five dimensions, so the result has four
  • Filter by boolean mask

    ary[ary_bool2]
    
    array([4, 5, 6])
  • Reshaping

    • 1 dim to 2 dim

      ary1d = np.array([1, 2, 3, 4, 5, 6])
      ary2d_view = ary1d.reshape(2, 3)
      ary2d_view
      
      array([[1, 2, 3],
            [4, 5, 6]])
      • 2 x 3 array
    • Need 2 columns

      ary1d.reshape(-1, 2)
      • -1 is a placeholder
      • Useful if we don’t know the number of rows, but we know we want 2 columns
    • Flatten array

      ary = np.array([[[1, 2, 3],
                      [4, 5, 6]]])
      ary.reshape(-1)
      ary.ravel()
      ary.flatten()
      
      array([1, 2, 3, 4, 5, 6])
      • reshape and ravel return memory views when possible (a copy is made if the data are not contiguous); flatten always produces a copy in memory
      • -1 is a placeholder
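A short check of which flattening calls share memory with the original array. The variable names are illustrative. The last line shows a case where ravel has to fall back to a copy.

```python
import numpy as np

ary = np.array([[1, 2, 3],
                [4, 5, 6]])

flat_view = ary.ravel()      # a view here, because ary is contiguous
flat_copy = ary.flatten()    # always a copy

ravel_shares = np.shares_memory(flat_view, ary)
flatten_shares = np.shares_memory(flat_copy, ary)

# Raveling a transposed (non-contiguous) array forces a copy
transposed_ravel_shares = np.shares_memory(ary.T.ravel(), ary)
```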
  • Combine arrays

    ary = np.array([[1, 2, 3]])
    # stack along the first axis (here: rows)
    np.concatenate((ary, ary), axis=0)
    • axis=1 would be stack column-wise (i.e. side-by-side)
    • Computationally inefficient, since each call allocates a new array and copies the inputs into it, so avoid repeated calls if possible.
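A sketch of why repeated concatenation is costly and one common way around it. The chunks list is a hypothetical stand-in for pieces produced in a loop.

```python
import numpy as np

chunks = [np.full((1, 3), i) for i in range(4)]  # hypothetical row blocks

# Slow pattern: every call allocates a new array and copies everything so far
slow = chunks[0]
for chunk in chunks[1:]:
    slow = np.concatenate((slow, chunk), axis=0)

# Usually better: collect the pieces in a list and concatenate once
fast = np.concatenate(chunks, axis=0)

# axis=1 places arrays side by side instead
side_by_side = np.concatenate((chunks[0], chunks[1]), axis=1)
```

When the final size is known in advance, preallocating with np.empty and filling slices is another option.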
  • Sort vector (arrange)

    # asc
    >>> boris = np.maximum(moose, squirrel) # see “Find pairwise maximum” under Mathematics
    >>> np.sort(boris)
    array([-2, -1,  1,  4])
    
    # desc
    >>> np.sort(boris,0)[::-1]
    array([ 4,  1, -1, -2])
  • Sort array (arrange)

    >>> squirrel = np.array([-2,-2,-2,-2])
    >>> moose = np.array([-3,-1,4,1])
    >>> natascha = np.vstack((moose, squirrel))
    array([[-3, -1,  4,  1],
          [-2, -2, -2, -2]])
    
    # default (last axis): sorts the values within each row
    >>> np.sort(natascha)
    array([[-3, -1,  1,  4],
          [-2, -2, -2, -2]])
    # axis 0: sorts the values within each column
    >>> np.sort(natascha, 0)
    array([[-3, -2, -2, -2],
          [-2, -1,  4,  1]])
    # axis 0, descending
    >>> np.sort(natascha, 0)[::-1]
    array([[-2, -1,  4,  1],
          [-3, -2, -2, -2]])
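The axis argument controls which direction the values are sorted in. A minimal sketch using the same natascha array, with the axis spelled out by keyword. The argsort lines are an addition not in the original notes, showing how to reorder whole rows rather than sorting each column independently.

```python
import numpy as np

natascha = np.array([[-3, -1, 4, 1],
                     [-2, -2, -2, -2]])

within_rows = np.sort(natascha)                        # axis=-1: each row sorted
within_cols = np.sort(natascha, axis=0)                # each column sorted
within_rows_desc = np.sort(natascha, axis=1)[:, ::-1]  # each row, descending

# argsort returns an ordering; here whole rows are reordered by the last column
order = np.argsort(natascha[:, -1])
rows_by_last_col = natascha[order]
```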
  • Change values by condition

    ary = np.array([1, 2, 3, 4, 5])
    np.where(ary > 2, 1, 0)
    • Returns a new array in which values > 2 become 1 and the rest become 0; “ary” itself is unchanged
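Three variants of conditional replacement, sketched with illustrative names (flags, clipped, in_place). np.where builds a new array, while boolean-mask assignment changes an array in place.

```python
import numpy as np

ary = np.array([1, 2, 3, 4, 5])

flags = np.where(ary > 2, 1, 0)        # new array; ary is unchanged
clipped = np.where(ary > 2, ary, 0)    # keep values > 2, zero out the rest

in_place = ary.copy()
in_place[in_place > 2] = 1             # mask assignment modifies in_place directly
```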

Mathematics

  • Incrementing the values

    # ary_copy comes from the Fancy Indexing example: [[1, 3], [4, 6]]
    ary_copy += 99
    array([[100, 102], 
          [103, 105]])
  • Matrix multiplication

    matrix = np.array([[1, 2, 3], 
                      [4, 5, 6]])
    column_vector = np.array([1, 2, 3]).reshape(-1, 1)
    np.matmul(matrix, column_vector)
  • Dot product

    row_vector = np.array([1, 2, 3])
    np.matmul(row_vector, row_vector)
    np.dot(row_vector, row_vector)
    • One or the other can be slightly faster on specific machines and versions of BLAS
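The @ operator is shorthand for np.matmul. A brief sketch using the arrays from the two entries above, with the resulting shapes noted; the names product and dot are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
column_vector = np.array([1, 2, 3]).reshape(-1, 1)
row_vector = np.array([1, 2, 3])

product = matrix @ column_vector   # (2, 3) @ (3, 1) gives shape (2, 1)
dot = row_vector @ row_vector      # 1-D @ 1-D gives a scalar
```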
  • Transpose a matrix

    matrix = np.array([[1, 2, 3], 
                      [4, 5, 6]])
    matrix.transpose()
    
    array([[1, 4],
          [2, 5],
          [3, 6]])
  • Find pairwise maximum (pmax)

    >>> squirrel = np.array([-2,-2,-2,-2])
    >>> moose = np.array([-3,-1,4,1])
    >>> np.maximum(moose, squirrel)
    array([-2, -1,  4,  1])