Stochastic Processes
Misc
- Notes from
Terms
- Brownian Motion - A continuous-time process whose increments over any time interval are independent and drawn from a normal distribution
- Communication Class - consists of a set of states that all communicate with one another (i.e. A must be reachable from B and B must be reachable from A)
- Irreducible - A Markov chain is irreducible if it consists of a single communication class; otherwise it's reducible
- Limiting Distribution (aka Invariant Distribution) - Describes the long-term probability of being in any state, as well as the proportion of time spent in that state.
\[ \Pi = \begin{matrix} \pi_0 & \pi_1 & \pi_2 \\ \pi_0 & \pi_1 & \pi_2 \\ \pi_0 & \pi_1 & \pi_2 \end{matrix} \]This is like an asymptotic State Probability Distribution or an \(\infty\)-step State Probability Distribution
It can be calculated by raising the transition matrix to a very large power:
```r
matrixpower(printers, 100)
#>           [,0]      [,1]      [,2]
#> [0,] 0.5882353 0.2941176 0.1176471
#> [1,] 0.5882353 0.2941176 0.1176471
#> [2,] 0.5882353 0.2941176 0.1176471
```
- Markov Property - A system or variable is said to be Markov when knowing the current state of the process is just as good as knowing the entire history.
- Markov processes are said to be ‘memoryless’ because the history of their states is irrelevant.
- Recurrent States - States that have an infinite number of expected visits
- Absorbing State (Boundary) - (special recurrent case) States that only communicate with themselves. Once entered, an absorbing state can never be left.
- State Probability Distribution - Distribution of the Markov variable \(X_t\) at time, \(t\), and it is denoted by \(\pi_t\)
Notation for the \(n\) states at time \(t\) ("step \(t\)" means the same thing)
\[ \pi_t = [P(X_t = 1), P(X_t = 2), P(X_t = 3), \ldots, P(X_t = n)] \]Each element of \(\pi_t\) can be referenced using the notation
\[ \begin{align} &\pi_t = [\pi_{1t}, \pi_{2t}, \pi_{3t}, \ldots, \pi_{jt}, \ldots, \pi_{nt}]\\ &\text{where} \;\; \pi_{jt} = P(X_t = j) \end{align} \]- \(\sum_j \pi_{jt}=1\), i.e. the probabilities over all states at a given time step must sum to 1.
- Note that each row in a transition matrix sums to 1.
- Transient States - States that have a finite number of expected visits
- Examples:
- State 0 (above) - If state 0 is the starting point, then after the 1st transition, it is no longer a potential destination
- States 1,2,3 (Drunkard below) - Small values or zeros in the limiting distribution
- Transition Matrix
- Row index represents the initial or source state at time \(t\)
- Column index represents the destination state at time (\(t+n\))
- Each cell is the probability that, starting in source state \(i\), you end up in destination state \(j\) after \(n\) steps
Transitions
Steps and time steps are interchangeable in terms of transition matrices, state probability distributions, etc. (e.g. step \(n+2\) and time \(t+2\) mean the same thing).
1-Step Transition Matrix
- Example:
Y-Axis (rows) - Initial state (e.g. # of printers broken at the beginning of the day)
X-Axis (columns) - Final (for a 1-step) state (e.g. # of printers broken at the end of that day, which becomes the initial state of the following day)
\[ \begin{matrix} & 0 & 1 & 2 \\ 0 & 0.7 & 0.3 & 0 \\ 1 & 0.2 & 0.4 & 0.4 \\ 2 & 1 & 0 & 0 \end{matrix} \]- (0,0) says there's a 70% probability that if you start the day with 0 printers broken, you'll end the day with 0 printers broken.
- (1,2) says there's a 40% probability that if you start the day with 1 printer broken, you'll end the day with 2 printers broken.
- (2,0) says there's a 100% probability that if you start the day with 2 printers broken, you'll end the day with 0 printers broken.
- e.g. A repairman is contractually obligated to show up and fix or replace both printers by the end of the day if they break on the same day.
2-Step Transition Matrix
- Example: Same scenario as above, but now the transition of interest is from State 0 to State 1, (0,1), over two steps (a 2-day interval).
To compute that probability, we sum over all possible two-step paths that start at State 0 and end up in State 1 (verified with a short R check at the end of this subsection).
\[ \begin{align} \text{Transition}_{\text{2-step}} &= \begin{aligned}[t] &P(X_{n-1} = 0|X_{n-2} = 0) \cdot P(X_n = 1|X_{n-1} = 0) \\ &+ P(X_{n-1} = 1|X_{n-2} = 0) \cdot P(X_n = 1|X_{n-1} = 1) \\ &+ P(X_{n-1} = 2|X_{n-2} = 0) \cdot P(X_n = 1|X_{n-1} = 2) \end{aligned}\\ &= (0.7 \cdot 0.3) + (0.3 \cdot 0.4) + (0 \cdot 0) = 0.33 \end{align} \]First potential 2-step transition says:
- 1st step - The probability of moving to State 0 (\(n-1\)) given you were in State 0 (\(n-2\))
- 2nd step - the probability of moving to State 1 (\(n\)) given you were in State 0 (\(n-1\))
For this example, each 2-step path has the form \(p(x|0) \cdot p(1|x)\), where \(x\) is the intermediate state
In general, the probability that the Markov chain will be in State \(j\) at Step \(n\), given that it was in State \(i\) at Step \(n-1\), is
\[ P_{ij} = P(X_n = j \mid X_{n-1} = i) \]
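The 0.33 computed manually above can be double-checked in base R by squaring the 1-step matrix (R uses 1-based indexing, so the 0-indexed (0,1) cell is `[1, 2]`):

```r
P <- matrix(c(0.7, 0.3, 0.0,
              0.2, 0.4, 0.4,
              1.0, 0.0, 0.0),
            nrow = 3, byrow = TRUE)   # 1-step printer transition matrix

P2 <- P %*% P    # 2-step transition matrix
P2[1, 2]         # 0-indexed cell (0,1)
#> [1] 0.33
```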
N-step (in general)
\[ \pi_t = \pi_0 P^t \]
Where
- \(P\) is the transition matrix and \(t\) is the step (\(t\) is for time step instead of \(N\))
- \(\pi_0\) is like a prior probability vector for each state at \(t = 0\)
- This prior gives us the probabilities for which state we’ll likely start out at.
A little different from the other calculations in the Transitions section: here the \(t\)-step transition matrix is multiplied by a prior (see the sketch after the matrixpower notes below)
matrixpower (user-defined) performs this calculation for all \(n\)-step transitions

```r
matrixpower <- function(mat, k) {
  if (k == 0) return(diag(dim(mat)[1]))  # 3x3 identity matrix
  if (k == 1) return(mat)
  if (k > 1) return(mat %*% matrixpower(mat, k - 1))
}

printers <- matrix(c(.7, .2, 1, .3, .4, 0, 0, .4, 0), nrow = 3, ncol = 3)
printers
#>      [,0] [,1] [,2]
#> [0,]  0.7  0.3  0.0
#> [1,]  0.2  0.4  0.4
#> [2,]  1.0  0.0  0.0

# 2-step
matrixpower(printers, 2)
#>      [,0] [,1] [,2]
#> [0,] 0.55 0.33 0.12
#> [1,] 0.62 0.22 0.16
#> [2,] 0.70 0.30 0.00

# 3-step
matrixpower(printers, 3)
#>       [,0]  [,1]  [,2]
#> [0,] 0.571 0.297 0.132
#> [1,] 0.638 0.274 0.088
#> [2,] 0.550 0.330 0.120
```
- This is just matrix exponentiation by repeated multiplication; the square identity matrix (sized by the row dimension) is only returned for the \(k = 0\) case, since any matrix to the 0th power is the identity
- matrixpower is from Introduction to Stochastic Processes with R by Robert Dobrow
- In the 2-step transition output, 0.33 at (0,1) matches the manual calculation above
- There’s a manual change to start indexing from 0, as that’s the typical way of indexing matrices for stochastic processes. R indexing typically starts from 1.
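As a small illustration of \(\pi_t = \pi_0 P^t\) using the printers matrix and matrixpower from above (the prior \(\pi_0\) here is made up for illustration):

```r
pi_0 <- c(0.5, 0.3, 0.2)                    # hypothetical prior over the 3 states at t = 0
pi_5 <- pi_0 %*% matrixpower(printers, 5)   # state probability distribution at t = 5
pi_5
sum(pi_5)                                   # always 1, since pi_0 and each row of P sum to 1
```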
Communication
- Definition
Classes
- States in Class \(X\) can communicate (i.e. transition to one another) but Class \(X\) members do not communicate with members of Class \(Y\)
Reducibility
- Irreducible - A Markov chain is irreducible if it consists of a single communication class; otherwise it's reducible.
Example
```r
reducible <- matrix(c(0, 0, 0, 0, 0,
                      .4, .9, .7, 0, 0,
                      0, .1, .3, 0, 0,
                      .6, 0, 0, .5, .5,
                      0, 0, 0, .5, .5),
                    nrow = 5, ncol = 5)
reducible
#>      [,0] [,1] [,2] [,3] [,4]
#> [0,]    0  0.4  0.0  0.6  0.0
#> [1,]    0  0.9  0.1  0.0  0.0
#> [2,]    0  0.7  0.3  0.0  0.0
#> [3,]    0  0.0  0.0  0.5  0.5
#> [4,]    0  0.0  0.0  0.5  0.5

matrixpower(reducible, 100)
#>      [,0]  [,1]  [,2] [,3] [,4]
#> [0,]    0 0.350 0.050  0.3  0.3
#> [1,]    0 0.875 0.125  0.0  0.0
#> [2,]    0 0.875 0.125  0.0  0.0
#> [3,]    0 0.000 0.000  0.5  0.5
#> [4,]    0 0.000 0.000  0.5  0.5
```
2 Communication Classes
- Loop with States 1 and 2 (see rows 1 and 2)
- Loop with States 3 and 4 (see rows 3 and 4)
Starting at state 0,
- You can transition to either of the 2 classes through State 1 or State 3, but never return to State 0 (i.e. State 0 isn’t part of a class).
Starting in States 1 or 2, you remain in that class indefinitely
Starting in States 3 or 4, you remain in that class indefinitely
Limiting Distributions
- For a Reducible Chain
- Some or all the values for a column will be different
- Some or all the rows will be different
- The values are different because the starting state matters
- For an Irreducible Chain:
- Each column will have (essentially) the same value in each row
- Each row will be identical
- The values are the same because the starting state doesn’t matter
- Note that in the example above, each state can (eventually) be reached from State 0.
Random Walk
- Transient, Recurrent, and Absorbing States
Example: Drunkard who either makes it home (State 4) or falls into the ocean (State 0)
```r
random_walk <- matrix(c(1, .5, 0, 0, 0,
                        0, 0, .5, 0, 0,
                        0, .5, 0, .5, 0,
                        0, 0, .5, 0, 0,
                        0, 0, 0, .5, 1),
                      nrow = 5, ncol = 5)
random_walk
#>      [,0] [,1] [,2] [,3] [,4]
#> [0,]  1.0  0.0  0.0  0.0  0.0
#> [1,]  0.5  0.0  0.5  0.0  0.0
#> [2,]  0.0  0.5  0.0  0.5  0.0
#> [3,]  0.0  0.0  0.5  0.0  0.5
#> [4,]  0.0  0.0  0.0  0.0  1.0

matrixpower(random_walk, 100)
#>      [,0]         [,1]         [,2]         [,3] [,4]
#> [0,] 1.00 0.000000e+00 0.000000e+00 0.000000e+00 0.00
#> [1,] 0.75 4.440892e-16 0.000000e+00 4.440892e-16 0.25
#> [2,] 0.50 0.000000e+00 8.881784e-16 0.000000e+00 0.50
#> [3,] 0.25 4.440892e-16 0.000000e+00 4.440892e-16 0.75
#> [4,] 0.00 0.000000e+00 0.000000e+00 0.000000e+00 1.00
```
- States 0, 4 are absorbing states and states 1,2,3 are transient
- Eventually, with probability 1, there will be a transition to State 0 or 4, which can never be left. Therefore the number of visits to 1, 2, 3 will be finite, and by definition they are transient.
- Limiting distribution - notice the middle states have either very small numbers or zeros; this is a property of transient states.
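A minimal simulation sketch (base R, not from the book) showing that walks starting in a transient state get absorbed at State 0 or 4; starting from the middle State 2, about half end up home, matching the 0.50 in row 2 of the limiting matrix:

```r
set.seed(1)

simulate_walk <- function(start = 2, max_steps = 10000) {
  state <- start
  for (i in seq_len(max_steps)) {
    if (state == 0 || state == 4) return(state)   # absorbed: ocean (0) or home (4)
    state <- state + sample(c(-1, 1), 1)          # move left or right with prob 0.5
  }
  state
}

mean(replicate(5000, simulate_walk(start = 2)) == 4)   # ~0.5 = P(make it home | start at 2)
```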
Hidden Markov Models (HMM)
Misc
- Notes from A Math Lover's Guide to Hidden Markov Models
- Packages
- {MSTest} (Github, Vignette) - Hypothesis Testing for Markov Switching Models that can be used to identify regimes
- Includes model estimation procedures
- Includes datasets and functions that can be used to simulate: autoregressive, vector autoregressive, Markov switching autoregressive, and Markov switching vector autoregressive processes among others
- {MSwM} - Estimation (EM method), inference and diagnostics for Univariate Autoregressive Markov Switching Models for Linear and Generalized Models.
- Distributions for the series include Gaussian, Poisson, binomial, and gamma cases.
- {LaMa} (Paper) - Fits latent Markov models, including hidden Markov models (HMMs), hidden semi-Markov models (HSMMs), state-space models (SSMs) and continuous-time variants
- Used to study phenomena in which only a portion of the phenomenon can be directly observed while the rest of it cannot be directly observed, although its effect can be felt on what is observed. The effect of the unobserved portion (“hidden”) can only be estimated.
- Observed portion is modeled using arima, glm, rf, etc.
- Unobserved portion is modeled using Markov process
- Sort of like a latent variable whose values are states (aka regimes)
- The two models produce a final prediction by calculating an expected value at each step that uses predictions from the observed model and probabilities from the unobserved model.
- Use Cases
- Studying human workflows using computer event logs. Part of that is taking a user’s raw event log (keystrokes, URLs visited, etc.) and determining what they were doing at each time (background research, document review, etc.). Sometimes you can guess from one event in isolation, but usually you also need to look at the events before and after to have confidence.
- You have a photograph of a page from a novel, and you want to digitize the text. The photograph is already segmented into images of words. Most of them can be decoded with OCR, but some words are ambiguous because of dirt on the original page (is that “quick” or “quack”?). You want to use the surrounding words to pick the best option.
- You have a sequence of brain scans, taken every second while a person was playing a video game. During the game their avatar moves in and out of buildings, and your goal is to guess whether it is inside or outside using only the brain scans. Any given scan is very ambiguous, but the avatar usually stays inside/outside for many seconds at a time. You expect large spans of time where the avatar is in(out)side, and where the brain scans look slightly more in(out)side on average.
- Components
- An underlying markov chain that describes how likely you are to transition between different states (or stay in the same state). Typically this underlying state is the thing that you’re really interested in. If there are \(k\) states in the HMM then the markov chain consists of:
- \(k\times k\) transition matrix saying how likely you are to transition from a State \(S_a\) to a State \(S_b\)
- \(k\)-length vector saying how likely you are to start off in each of the states.
- A probability model that lets you compute \(Pr[O|S]\) — the probability of seeing observation \(O\) if we assume that the underlying State is \(S\).
- Unlike the markov chain, which has a fixed format, the model for \(Pr[O|S]\) can be arbitrarily complex. In many HMMs though \(Pr[O|S]\) is pretty simple: each State \(S\) is a different loaded die, and the \(Pr[O|S]\) are its probabilities of landing on each side.
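A small sketch of those components in base R, using a hypothetical 2-state "fair vs loaded die" setup (all numbers invented):

```r
states  <- c("Fair", "Loaded")
symbols <- as.character(1:6)

start_probs <- c(Fair = 0.9, Loaded = 0.1)    # k-length vector: how likely to start in each state

trans_probs <- matrix(c(0.95, 0.05,           # k x k transition matrix between hidden states
                        0.10, 0.90),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(from = states, to = states))

emission_probs <- rbind(Fair   = rep(1/6, 6),           # Pr[O | S]: each state is a different die
                        Loaded = c(rep(0.1, 5), 0.5))
colnames(emission_probs) <- symbols
```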
- Baum-Welch algorithm
- Specific case of the EM algorithm
- As with the EM algorithm, solutions are only locally optimal
- Meaning your results will vary for each model you run (i.e. it's a non-convex problem)
- Step summary
- Guess at what the state labels are and train an HMM using those guesses
- The guess is usually about the parameters of the HMM and not actually the states.
- Typically random values from a Dirichlet distribution (see the sketch after this list)
- Use the trained HMM to make better guesses at the states, and re-train the HMM on those better guesses
- Uses the Forward-Backward Algorithm
- Continue process until the trained HMM stabilizes.
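A base-R sketch of the random initialization mentioned above: each row of the initial transition matrix can be drawn from a Dirichlet distribution by normalizing independent Gamma draws (a standard construction; the sizes here are made up):

```r
# Draw one Dirichlet(alpha) vector by normalizing independent Gamma(alpha_i, 1) draws
rdirichlet_row <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(42)
k <- 2
init_trans <- t(sapply(seq_len(k), function(i) rdirichlet_row(rep(1, k))))  # random k x k guess
rowSums(init_trans)   # each row sums to 1, as a transition matrix must
```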
- Forward-Backward Algorithm
- Find the most likely state for each timestep
- Unlike the Viterbi algorithm, which computes the most likely sequence of states
- In practice, both algorithms only disagree about 5% of the time
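A minimal end-to-end sketch of Baum-Welch plus decoding, assuming the CRAN {HMM} package and its initHMM()/simHMM()/baumWelch()/viterbi()/posterior() interface (not from the source notes; argument names should be checked against the package docs):

```r
library(HMM)
set.seed(7)

# "True" 2-state model, used only to simulate a sequence of die rolls
true_hmm <- initHMM(States  = c("Fair", "Loaded"),
                    Symbols = as.character(1:6),
                    startProbs    = c(0.5, 0.5),
                    transProbs    = matrix(c(0.95, 0.05,
                                             0.10, 0.90), nrow = 2, byrow = TRUE),
                    emissionProbs = rbind(rep(1/6, 6),
                                          c(rep(0.1, 5), 0.5)))
obs <- simHMM(true_hmm, length = 500)$observation

# Baum-Welch (EM) from a rough, slightly asymmetric initial guess -> a local optimum
guess <- initHMM(States  = c("Fair", "Loaded"),
                 Symbols = as.character(1:6),
                 transProbs    = matrix(c(0.8, 0.2,
                                          0.3, 0.7), nrow = 2, byrow = TRUE),
                 emissionProbs = rbind(rep(1/6, 6),
                                       c(rep(0.12, 5), 0.4)))
fit <- baumWelch(guess, obs, maxIterations = 100)$hmm

# Forward-Backward: most likely state at each timestep; Viterbi: most likely state sequence
fb_states      <- fit$States[apply(posterior(fit, obs), 2, which.max)]
viterbi_states <- viterbi(fit, obs)
mean(fb_states != viterbi_states)   # the two decodings usually disagree only occasionally
```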
- Mitigating the local optimum issue
- You can simplify your HMM.
- The number of local optima can grow exponentially with the number of parameters in your HMM. If you reduce the parameter size you can reduce the number of local optima there are to fall into.
- Examples
- Using a 2-state rather than a 3-state HMM.
- If your target is multinomial, collapse categories so that the cardinality is smaller.
- Use business logic or domain expertise to have an initial HMM that is similar to what you expect the ultimate outcome to be.
- Different local optima often have large qualitative differences between each other, so on a gross level the HMM you ultimately converge on is likely to resemble the one you started with.
- Do make sure to have your guess be partly random though — there could be several local optima that are consistent with your business intuitions, and if so you want to know about it.
- Me: Sounds like formulating a prior
Markov Switching Dynamic Regression (MSDR)
- HMM for time series which uses dynamic regression as the observed model.
- Notes from A Worm’s Eye-View of the Markov Switching Dynamic Regression Model
- Basic formula for each time step, \(t\)
\[ y_t = \hat \mu_{tj} + \epsilon_t \quad \text{when} \;s_t = j \]- \(\hat \mu_{tj}\): final prediction for time step, \(t\)
- The \(j\) subscript is just saying that the hidden state variable \(s_t\) is part of the final prediction
- \(s_t\): hidden random variable that’s made-up of states (aka regimes) and helps determine the final prediction, \(\hat \mu\)
- Theoretically, a change in value of \(s_t\) impacts the distributional parameters (e.g. mean, variance) of the observed variable, \(y_t\)
- \(s_t\) is “hidden” because we do not know exactly when it changes its regime.
- If we knew which regime was in effect at each time step, we could simply make \(s_t\) a predictor variable.
- Poisson
\[ \hat \mu_{tj} = e^{x_t \cdot \hat\beta_j} \]
- Process
- Fit and get predictions from a dynamic regression model
\[ \begin{bmatrix} \hat y_{t1} & \hat y_{t2} & \cdots &\hat y_{tk} \end{bmatrix} = \begin{bmatrix} 1 & x_{1t} & x_{2t} & \cdots & x_{mt} \end{bmatrix} \cdot \begin{bmatrix} \hat \beta_{01} & \cdots & \hat \beta_{0k} \\ \vdots & \ddots & \vdots \\ \hat \beta_{m1} & \cdots & \hat \beta_{mk} \end{bmatrix} \]- \(m\) predictor variables and \(k\) states (i.e. a complete set of parameters is estimated for each state at each time step)
- Get the final prediction for time step, \(t\), by calculating the expected value
\[ \hat \mu_t = \sum_{j=1}^k \hat y_{tj} \cdot P(s_t = j) \]- \(\hat y\) are the predictions from the observed model
- \(P(s_t = j)\) are the state probabilities
\[ \pi_t = [\pi_{t1} = P(s_t = 1),\; \pi_{t2} = P(s_t = 2)] \]- e.g. 2 states
- Where each \(\pi_{tj}\) is a row of the transition matrix
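A toy numeric illustration of that expected-value step (the per-state predictions and probabilities are invented):

```r
y_hat  <- c(1.2, -0.4)      # per-state predictions y_hat_t1, y_hat_t2 from the regression
p_s    <- c(0.85, 0.15)     # P(s_t = 1), P(s_t = 2) from the hidden Markov chain
mu_hat <- sum(y_hat * p_s)  # final prediction for time step t
mu_hat
#> [1] 0.96
```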
- Uses Maximum Likelihood Estimation (MLE) or Expectation Maximization (EM) to maximize the joint probability density.
- Estimated parameters:
- Transition Matrix (for each time step)
- During estimation, cells are allowed to range from \(-\infty\) to \(\infty\), but after it’s completed, they’re normalized to \([0,1]\) (See link)
- Regression Coefficients (for each state)
- Unclear to me how you estimate multiple coefficients with the same data. Maybe the likelihood for a \(\beta\) is maximized for a given (estimated?) transition matrix.
- Residual Variance (for each state)
- Example: Using MLE for a Gaussian distribution
\[ \begin{align} f(y = y_t \mid X = x_t; \hat \beta_s) = &P(s_t = 1)\left(\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{y_t - \hat \mu_{t1}}{\sigma}\right)^2}\right) \\ &+ P(s_t = 2)\left(\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{y_t - \hat \mu_{t2}}{\sigma}\right)^2}\right) \\ &+ \;\; \cdots \\ &+ P(s_t = k)\left(\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{y_t - \hat \mu_{tk}}{\sigma}\right)^2}\right) \end{align} \] - Example: For a Poisson DGP
\[ P(y = y_t \mid \hat \mu_t) = \sum_{j=1}^k P(s_t = j) \frac{e^{-\hat \mu_{tj}} \hat \mu_{tj}^{y_t}}{y_t !} \]
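A base-R sketch of the Gaussian per-timestep density as a state-probability-weighted mixture (numbers invented; the per-state sigmas here are an assumption, since the formula above uses a single \(\sigma\)):

```r
# Density contribution of one time step in a k-state Gaussian MSDR:
# a mixture of per-state normal densities, weighted by the state probabilities.
# MLE maximizes the sum over t of log(lik_t(...)).
lik_t <- function(y_t, mu_t, sigma, p_s) {
  sum(p_s * dnorm(y_t, mean = mu_t, sd = sigma))
}

lik_t(y_t   = 0.8,
      mu_t  = c(1.2, -0.4),    # per-state predictions mu_t1, mu_t2
      sigma = c(0.5, 1.5),     # per-state residual sd (could also be shared)
      p_s   = c(0.85, 0.15))   # P(s_t = 1), P(s_t = 2)
```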
- Example
- Description
- Dependent variable: Personal Consumption Expenditures % Change Month over Month (M-o-M)
- Independent variable: Consumer Sentiment index % Change Month over Month (M-o-M)
- Model Equation
\[ \begin{align} &\hat \mu_{tj} = \hat \beta_{0j} + \hat \beta_{1j} x_{1t} \\ &\begin{aligned} \text{where} \;\; &\hat \beta_s = \begin{bmatrix} \beta_{00} & \beta_{01} \\ \beta_{10} & \beta_{11} \end{bmatrix} \\ & \epsilon_{ti} \sim \mathcal{N}(0, \sigma_j^2) \end{aligned} \end{align} \]- 2-states
- Observed model is a linear regression
- Model Summary (Python)
- Transition Matrix probabilities have pvals and CIs
- \(p[0\rightarrow 0]\) is cell \(p_{00}\) and \(p[1\rightarrow 0]\) is cell \(p_{10}\) in the transition matrix
- The remaining (unreported) cell in each row is 1 minus the reported cell, since each row sums to 1
- Interpretation
- Variance of States vs Transition Probabilities
- When Personal Consumption Expenditures are in a low variance regime, they tend to switch to a high variance regime less than 2% of the time.
- When Personal Consumption Expenditures are in a high variance regime, they tend to switch to a low variance regime with roughly 20% probability.
- This variance is “sigma2” in the model summary in the Regime Parameters sections.
- Me: This would be the residual variance which is usually a GOF thing (i.e. variance of the points around the regression line). In finance, variance is a volatility measure and I guess this can be thought of in that way too. I guess it would be the volatility around a trend line.
- Smoothed State Probabilities for each time step
- Bottom two charts' y-axes are probabilities, so chart 2's dips in probability are when the HMM is in state 1 (confirmed by looking at chart 3)
- Accessed via `msdr_model_results.smoothed_marginal_probabilities[0]` (`[1]` for state 1)
- After 2000, we see that the Markov state model is often in the high variance state towards the end of a recession (bottom chart)
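The summary above comes from Python (statsmodels). A rough R analogue using {MSwM} (listed under Misc) might look like the sketch below; the synthetic data, variable names, and msmFit() arguments are illustrative assumptions and should be checked against the package documentation (sw flags which parameters switch across regimes: each coefficient plus the residual variance).

```r
library(MSwM)   # assumption: msmFit()/plotProb() interface as documented in {MSwM}
set.seed(3)

# Synthetic stand-in for the % change series (the regimes here are drawn i.i.d.,
# not from a Markov chain, which is fine for a quick illustration)
n <- 300
x <- rnorm(n)
s <- rbinom(n, 1, 0.3)
y <- ifelse(s == 0, 0.2 + 0.1 * x + rnorm(n, sd = 0.2),
                   -0.5 + 0.8 * x + rnorm(n, sd = 1.0))
df <- data.frame(y = y, x = x)

ols_fit  <- lm(y ~ x, data = df)
msdr_fit <- msmFit(ols_fit, k = 2, sw = c(TRUE, TRUE, TRUE))  # intercept, slope, sigma all switch

summary(msdr_fit)               # regime-specific coefficients, sigma, and the transition matrix
plotProb(msdr_fit, which = 1)   # smoothed/filtered regime probabilities over time
```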
Markov Switching Auto Regression (MSAR)
- Model Equation
\[ y_t = \hat \mu_{ts_t} + \hat \phi (y_{t-1} - \hat \mu_{(t-1)s_{t-1}}) + \epsilon_t \]- Adds a fraction of the residual from the previous step to the MSDR model
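A tiny worked example of that update with invented numbers, just to make the bookkeeping concrete:

```r
mu_t    <- 0.30    # state-dependent mean at time t (for the regime active at t)
mu_tm1  <- 0.25    # state-dependent mean at time t-1 (for the regime active at t-1)
phi_hat <- 0.60    # AR(1) coefficient on the previous deviation
y_tm1   <- 0.45    # observed value at t-1

y_hat_t <- mu_t + phi_hat * (y_tm1 - mu_tm1)   # 0.30 + 0.6 * 0.20 = 0.42
y_hat_t
#> [1] 0.42
```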