This began as a small project to learn some TypeScript and work through implementing a few statistical analyses, but it also functions as a nice log of the various derivations and theorems I have been reading and working out in a notebook.
I have therefore decided to document this a little more thoroughly and use it as a proof of concept for my excursions into numerical computing. While thin at the moment (and not the most robust in its approaches), the idea is that this repository will continue to grow as I hand-implement more algorithms and make the existing ones more robust (for example, other optimization approaches, non-closed-form LR, etc.).
The original TypeScript univariate implementation still exists inside the original src directory, but it has been sidelined in favor of a Python backend and a JavaScript/TypeScript front end for visualization and data loading.
Outlined below are the different algorithms I have implemented so far. Mostly this is going to be regressions and GLMs, the bread and butter, working outward from there. More than likely I will add other things like small unsupervised algorithms (K-means clustering, PCA) and perhaps some supervised methods like decision trees, before moving on to more complex methods.
The code itself for the current algorithms lives inside of `Learning/Learning/`, and the previous implementations of univariate regression, as well as the front end that still needs to be worked out, live inside of `Learning/src/`. There are examples at the bottom of each Python module, inside the `if __name__ == "__main__":` block, that can be run to test out each implementation if anyone perusing this file is curious. There are, of course, unlisted dependencies (numpy/scipy/scikit-learn), but this is not really meant for public use at the moment, so for now I leave it up to the reader to pip/conda install their way to success.
I have implemented linear regression as a multiple regression, which also functions as a simple (univariate) regression when one predictor is provided. In linear regression we model a dependent variable (our target/response variable) as a function of one or more predictor variables:
(1): $\hat{y} = X\beta$

or

(2): $\hat{y}_i = \beta_0 + \beta_1 x_i$ (for a univariate regression)
This implementation uses the linear algebra form (1) for ease of implementation, where X is our design matrix. A design matrix is a matrix where each row is one of our observations i and each column is one of our features k. Importantly, the first column of the design matrix is all 1s, so that B0, our intercept, is constant across observations (also the reason it is different from a plain feature matrix).
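For concreteness, here is a minimal sketch (not the repository's actual code; the values are made up) of turning a feature matrix into a design matrix with numpy:

```python
import numpy as np

# Illustrative values only: three observations, two features each.
X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 2.0]])

# Prepend a column of ones so B0 (the intercept) multiplies a constant
# 1 for every observation.
X_design = np.column_stack([np.ones(X.shape[0]), X])
print(X_design)
# [[1. 2. 3.]
#  [1. 1. 5.]
#  [1. 4. 2.]]
```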
B is our coefficient vector, which contains a coefficient for each one of our predictors. Here the optimal coefficients can be determined with the closed-form solution

(3): $\hat{\beta} = (X^\top X)^{-1} X^\top y$
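As a rough sketch of what equation (3) can look like in code (the function name and synthetic data here are illustrative, not the repo's actual implementation):

```python
import numpy as np

def fit_ols(X_design, y):
    """Closed-form OLS from equation (3): beta_hat = (X^T X)^{-1} X^T y.

    np.linalg.solve is used rather than an explicit matrix inverse,
    which is cheaper and numerically better behaved.
    """
    return np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Sanity check on synthetic data generated from y = 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
X_design = np.column_stack([np.ones_like(x), x])
print(fit_ols(X_design, y))  # should be close to [1.0, 2.0]
```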
Binary logistic regression models the probability of an event happening as a function of a linear combination of its predictors. Here the "event" is the response/target variable taking the label it does, which is also the reason we use this method for classification.
The equation for a logistic regression is as follows:

(4): $\hat{y} = \sigma(X\beta) = \dfrac{1}{1 + e^{-X\beta}}$

Here y-hat is our predicted probability P(y = 1 | x). Ideally we want to maximize the probability of our particular outcomes occurring (and therefore have the most accurate model). Fortunately, since we are working with binary outcomes, we can assume a Bernoulli distribution and therefore frame our loss as maximum likelihood estimation, where the likelihood L(B) is:

(5): $L(\beta) = \prod_{i=1}^{n} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$
However, in practice, multiplying probabilities like that leads to very small numbers and numerical underflow, so we typically minimize the negative log-likelihood instead:
(6): $NLL(\beta) = -\sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

or

(7): $NLL(\beta) = \sum_{i=1}^{n} \left[ \log\!\left(1 + e^{x_i^\top \beta}\right) - y_i \, x_i^\top \beta \right]$

(this second form is the one I decided to implement; it is equivalent, just re-organized)
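A minimal sketch of how equation (7) might be computed stably (the function name is illustrative, not the repo's):

```python
import numpy as np

def negative_log_likelihood(beta, X_design, y):
    """Equation (7): sum over i of log(1 + e^{z_i}) - y_i * z_i, z = X @ beta.

    np.logaddexp(0, z) evaluates log(1 + e^z) without overflowing
    for large positive z.
    """
    z = X_design @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)
```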
Because of the sigmoid function we do not get a convenient closed-form solution, so we must use numerical methods to compute the optimal coefficients.
Fortunately the negative log-likelihood is convex in B and therefore has a unique global minimum, so I implemented gradient descent (steepest descent) with T iterations, where our gradient is defined as

(8): $\nabla_\beta \, NLL = X^\top (\hat{y} - y)$

and our update rule is

(9): $\beta^{(t+1)} = \beta^{(t)} - \eta \, X^\top (\hat{y} - y)$, for $t = 0, \dots, T-1$

where $\eta$ is the learning rate.
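Putting (8) and (9) together, a steepest-descent loop looks something like the sketch below (names and hyperparameters are illustrative; here the gradient is additionally averaged over observations, a common variant, so the step size does not have to shrink as the dataset grows):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X_design, y, learning_rate=0.1, n_iters=5000):
    """Steepest descent on the NLL.

    Gradient (8): X^T (y_hat - y); update (9): beta <- beta - eta * gradient.
    X_design is assumed to have a leading column of ones for the intercept.
    """
    n = X_design.shape[0]
    beta = np.zeros(X_design.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(X_design @ beta)
        grad = X_design.T @ (y_hat - y) / n  # averaged gradient
        beta -= learning_rate * grad
    return beta
```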
So far this has been tested on the mpg dataset found in scipy and the iris dataset inside of sklearn. These are pretty standard classroom datasets that I used throughout undergrad, so I thought it fitting to use them as the basis for testing functionality before a couple of real use cases.
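An illustrative way such a test might look, reusing `sigmoid`/`fit_logistic` from the sketch above (the repo's actual test code may differ; the "setosa vs. not-setosa" split is just one way to make iris binary):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X_design = np.column_stack([np.ones(len(iris.data)), iris.data])
y = (iris.target == 0).astype(float)  # setosa vs. everything else

beta = fit_logistic(X_design, y)
predictions = (sigmoid(X_design @ beta) >= 0.5).astype(float)
print("training accuracy:", (predictions == y).mean())
```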
I plan on also adding some kind of TypeScript front end GUI/chart displayer, mostly to get some practice using JavaScript/TypeScript.
There is actually already a univariate regression implemented in TypeScript, written before I realized that there aren't many good vectorized math packages for Node.js (aside from something like TensorFlow, which came with its own suite of problems) and that the runtime really is not meant for that anyway, but it gave me a solid foundation.
Perhaps I will also add in some MySQL functionality to pull in datasets, although it is hard to really exercise SQL without a proper database connection via something like Microsoft Azure.
I plan to continue implementing mostly GLMs for the purpose of growing my clinical-research-relevant skill set at the moment, but I will also try out things like K-means and SVD-based PCA as well as other generalizations.
In the near future:
- Poisson regression
- K-means
- PCA