Statistical Formulas for Programmers (2013)

211 points by Tomte a day ago

armanboyaci 16 hours ago

>Being able to apply statistics is like having a secret superpower.

I totally agree with this sentence. But if you ask for my opinion, merely knowing a list of statistical formulas is not very helpful. Most of the time, people don’t remember the underlying assumptions, so there is a fair chance they will use them in inappropriate situations.

I recommend watching these two YouTube videos. The presenters advocate using simulation/bootstrapping/shuffling methods instead of memorizing formulas.

Jake Vanderplas - Statistics for Hackers https://www.youtube.com/watch?v=Iq9DzN6mvYA

John Rauser - Statistics Without the Agonizing Pain https://www.youtube.com/watch?v=5Dnw46eC-0o
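
For a taste of the resampling approach, here is a minimal sketch of a percentile bootstrap confidence interval for a mean, using only the standard library. The function name, the sample values, and the fixed seed are my own illustrative choices and do not come from either talk.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    # Resample with replacement, same size as the data, and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up measurements, purely for illustration.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7, 10.1, 12.4]
low, high = bootstrap_mean_ci(sample)
print(f"mean = {statistics.fmean(sample):.2f}, 95% CI ~ ({low:.2f}, {high:.2f})")
```

No formula for the standard error is needed; the catch is that the data still have to be a reasonably representative, independent sample, which is exactly the kind of assumption people forget.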

asdff 8 hours ago

It makes no sense to memorize the formulas when most any statistical formula you'd actually use has a package or three that can run it in a way that's already probably reasonably benchmarked and not prone to you fat fingering some error rolling your own.
- dapperdrake 7 hours ago
  
  Assumptions are the part that matters.
mont_tag 15 hours ago

IIRC, Jake's video inspired the examples section in the Python random module docs. It takes about 15 minutes with those examples to learn how to put Jake's ideas into practice. https://docs.python.org/3/library/random.html#examples
Terr_ 15 hours ago

> The presenters advocate using simulation/bootstrapping/shuffling methods instead of memorizing formulas.
Yeah, I often find it much easier to make a little Python script to do 10,000 Monte Carlo trials, as opposed to "properly" working things out and then not even being confident enough in my result anyway.
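
As a rough illustration of that kind of script, here is the classic birthday problem done with 10,000 trials. The problem choice, function name, and seed are assumptions of mine, picked only because the textbook answer for 23 people (about 0.507) is easy to check against.

```python
import random

def shared_birthday_probability(people=23, trials=10_000, seed=0):
    """Monte Carlo estimate of P(at least two of `people` share a birthday)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(people)]
        if len(set(birthdays)) < people:  # a duplicate collapsed in the set
            hits += 1
    return hits / trials

estimate = shared_birthday_probability()
print(f"estimated probability: {estimate:.3f}")
```

With 10,000 trials the estimate is only good to about two decimal places, so more trials are needed if the third digit matters.
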
wodenokoto 16 hours ago

While I really liked the video by Vanderplas, I did return to it after a year or two and paused every time he presented a problem and then tried to solve it using for loops and thinking hard.
I barely succeeded in any of it. So at that point I might as well just look up the formula instead of bootstrapping.
I’ll give the second one a shot too.

Terr_ 17 hours ago

I think I avoid imposter syndrome in some areas, but Not Enough Real Math is definitely a weak spot.

When people start talking about eigenvalues, I'm just a business-rule caveman with a little discrete-math unga bunga.

This kind of statistical stuff falls somewhere in-between.

mportela an hour ago

Then definitely watch 3Blue1Brown's video on eigenvalues and eigenvectors. [1] That's when it clicked for me! His entire series on linear algebra is incredibly well produced.
[1] https://youtube.com/watch?v=PFDu9oVAE-g
MrLeap 17 hours ago

Eigenvalues are a topic in linear algebra. They're coefficients you can put in front of some matrices or vectors that change their magnitude.
Linear Algebra was the most useful and fun math class I took in college. Highly recommended if you ever wanna do gamedev. It's more approachable than you probably think.
For me, when people start talking about differential equations, specifically the symbols you'll see in a wikipedia article about Navier Stokes equations, I'm just a business-rule caveman with a little linear algebra zug zug.
- vector_spaces 11 hours ago
  
  > Eigenvalues are a topic in linear algebra. They're coefficients you can put in front of some matrices or vectors that change their magnitude.
  Multiplying a vector or a matrix by any nonunit scalar changes its magnitude (hence scalar!! i.e. something that scales). Not all scalars are eigenvalues. So this isn't quite right.
  Think about it geometrically instead. A linear operator transforms a space. Geometrically the transformation can be one or more of stretching, compressing, or rotating (taking shearing to be a kind of stretching). The directions in the space which remain the same other than having been scaled by some factor are the eigenvectors of the transformation. The scaling factor of one of those directions is its eigenvalue.
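
A tiny sketch of that geometric picture, in plain Python with a 2x2 matrix chosen by hand for illustration (the matrix and the helper name are hypothetical, not from the article):

```python
def apply(matrix, vector):
    """Multiply a 2x2 matrix (given as a list of rows) by a 2-vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

print(apply(A, [1, 1]))    # [3, 3]: same direction, scaled by 3
print(apply(A, [1, -1]))   # [1, -1]: same direction, scaled by 1
print(apply(A, [1, 0]))    # [2, 1]: direction changed, so not an eigenvector
```

So (1, 1) and (1, -1) are eigenvectors of this particular matrix with eigenvalues 3 and 1, while (1, 0) gets knocked off its own line.
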
- Elucalidavah 7 hours ago
  
  > when people start talking about differential equations
  It's not like you are going to solve those analytically.
  Implement a couple of numerical solvers for things like Navier–Stokes and you'll see that differential equations are just obscenely compressed code.
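
Navier–Stokes is far too big for a comment, but the "compressed code" idea shows up even in the simplest case. Below is a minimal sketch of explicit Euler integration for dy/dt = -y, a toy equation I picked because its exact solution exp(-t) is known; the function name and step count are illustrative assumptions.

```python
import math

def euler(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with fixed-step explicit Euler."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        y += h * f(t, y)  # follow the slope for one small step
        t += h
    return y

# dy/dt = -y with y(0) = 1 has the exact solution y(t) = exp(-t).
approx = euler(lambda t, y: -y, y0=1.0, t0=0.0, t1=1.0, steps=1000)
print(approx, math.exp(-1.0))
```

The one-line equation unpacks into a loop, a step size, and an accuracy trade-off. Real solvers use better schemes than Euler, but the shape is the same.
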
roenxi 11 hours ago

Studying more statistics is often clever. Although in this case Mr. Miller led with the most important part - if there are two numbers (like 7 and 5) in a statistical context they might be the same number. That throws a lot of people into such a tailspin that they never really recover after making the obvious mistake of thinking they are different.
The powerful heuristic for the less technically inclined is to say "well, this evidence isn't conclusive until someone who knows statistics has tried to shoot it down".

gpderetta 2 hours ago

Also: "Common statistical tests are linear models (or: how to teach stats)"[1]. Also also, bootstrapping is a superpower.

[1] https://lindeloev.github.io/tests-as-linear/

bob1029 20 hours ago

I'd add z-score (standard score) to your tool belt. The ability to identify or reject outliers is invaluable when trying to stabilize real-world business processes.

For example, if you are building heuristics that determine if a customer's bank account is "reasonably active", you may not want to consider very small transactions unless that is typical activity for a given customer.
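
A minimal sketch of the idea, assuming a cutoff of |z| <= 2 and some made-up transaction amounts (both are illustrative choices, not a recommendation):

```python
import statistics

def z_scores(values):
    """Standard score of each value: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return [(x - mean) / stdev for x in values]

def typical_values(values, threshold=2.0):
    """Keep values whose |z| is at most `threshold` (an assumed cutoff)."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) <= threshold]

# Hypothetical transaction amounts for one customer; 950.0 is the odd one out.
amounts = [42.0, 38.5, 45.0, 40.0, 39.5, 950.0, 41.0, 44.0]
print(typical_values(amounts))
```

One caveat: the outlier itself inflates the mean and standard deviation, so on small samples a single extreme value can only reach a limited z-score. The threshold has to be chosen with the sample size in mind.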

mont_tag 14 hours ago

Another simple tool that gives you superpowers is a Q-Q plot. https://en.wikipedia.org/wiki/Q–Q_plot
- Toenex 14 hours ago
  
  Personally always loved me a Bland-Altman plot (https://en.m.wikipedia.org/wiki/Bland-Altman_plot)

TheHideout 14 hours ago

FYI, using this stuff without understanding Test Power is dangerous and can lead to making bad decisions with false confidence.

mcphage 21 hours ago

The article "How Not To Sort By Average Rating" by the same author (and also linked in this article) is really good, and definitely changed my thinking about any kind of "sort by best to worst" list: https://www.evanmiller.org/how-not-to-sort-by-average-rating...

vismit2000 3 hours ago

Covered by 3b1b some years ago: https://youtu.be/8idr1WZ1A7Q
Joker_vD 20 hours ago

Hm. I wonder how well would "Score = [Positive ratings] / ([Total ratings] + 1)" fare.
- mcphage 20 hours ago
  
  It'll help some, but I don't think enough: it's way, way easier to get a good score on a small number of ratings than a large number. And the span on the number of ratings is several orders of magnitude. For instance, on Amazon you can do a search and get back products with fewer than 10 ratings alongside products with over 10,000 ratings.
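
To make the comparison concrete, here is a small sketch that puts the "+1" score next to the lower bound of the Wilson score interval, which, as I recall, is what the linked article recommends. The function names and the example counts are mine, and z = 1.96 (roughly a 95% interval) is an assumed default.

```python
import math

def plus_one_score(positive, total):
    """The proposal above: positive / (total + 1)."""
    return positive / (total + 1)

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion."""
    if total == 0:
        return 0.0
    p = positive / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / denom

# Hypothetical products: (positive ratings, total ratings).
for positive, total in [(10, 10), (9, 10), (900, 1000), (9500, 10000)]:
    print(f"{positive:>5}/{total:<6}"
          f" plus-one={plus_one_score(positive, total):.3f}"
          f" wilson={wilson_lower_bound(positive, total):.3f}")
```

With these numbers the "+1" score still puts 10/10 above 900/1000, while the Wilson lower bound puts it below, which is the small-sample effect described above.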

snitzr 20 hours ago

Why isn't 7 greater than 5?

DeepSeaTortoise 20 hours ago

Statistics gave him the superpower of predicting the future:
https://knowyourmeme.com/memes/fight-club-57-movie
- glitchc 19 hours ago
  
  That's hilarious!
  - avg_dev 18 hours ago
    
    yes, and informative. i was looking at the article and i thought everything made sense but i could tell i was missing something about this line...
senkora 20 hours ago

Treat them as two draws from possibly different, independent distributions.
The question is whether the distribution that drew 7 “stochastically dominates” the distribution that drew 5. You may or may not be able to conclude that based on the available data and assumptions about the distributions.
https://en.m.wikipedia.org/wiki/Stochastic_dominance
For example, if you assume that the two distributions are approximately normal with very small variances, then you can probably conclude that the distribution that drew 7 stochastically dominates the distribution that drew 5. But if you assume that the variances are large, then you probably can’t conclude that.
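
A quick simulation sketch of the small-variance versus large-variance point. It assumes both values actually come from the same normal distribution (mean 6, an arbitrary choice) and asks how often two single draws end up at least 2 apart purely by chance; the function name and parameters are illustrative.

```python
import random

def chance_of_gap(sigma, gap=2.0, trials=50_000, seed=0):
    """Estimate how often two draws from the SAME normal distribution
    differ by at least `gap`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.gauss(6.0, sigma)
        b = rng.gauss(6.0, sigma)
        if abs(a - b) >= gap:
            hits += 1
    return hits / trials

for sigma in (0.5, 1.0, 3.0):
    print(f"sigma={sigma}: P(|a - b| >= 2) ~ {chance_of_gap(sigma):.3f}")
```

When sigma is small, a gap like 7 versus 5 almost never appears unless the distributions really differ. When sigma is large it appears most of the time, so the gap by itself tells you very little.
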
dlivingston 20 hours ago

Sounds like you should read the article. :)
Kidding. The idea is that there may be some statistical uncertainty associated with the measurement of 7, and also of 5, and so the "real" value of 7 may actually be less than the "real" value of 5.

cmdrmac 18 hours ago

This is certainly a very useful resource - even for a seasoned data scientist!

curtisszmania 19 hours ago

[dead]

hmcamp 20 hours ago

[flagged]

extrememacaroni 21 hours ago

[flagged]

Jtsummers 20 hours ago

Converting the math in here to code isn't very hard.
- Hussell 20 hours ago
  
  The statisticians have a bunch of tricks to transform the formulas into more-easily computable forms, e.g. calculate both the average and the standard deviation in a single pass through the data instead of one pass to calculate the average and a second to calculate the standard deviation. Converting the math in here to efficient code isn't very easy.
  - glitchc 19 hours ago
    
    You mean Welford's algorithm. Since code was requested:
    https://jonisalonen.com/2013/deriving-welfords-method-for-co...
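
For reference, a minimal sketch of the single-pass update usually attributed to Welford. The class and method names are my own; this is an illustration, not the code from the linked post.

```python
import math

class RunningStats:
    """Single-pass mean and standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)  # uses the updated mean

    def stdev(self):
        """Sample standard deviation (divides by n - 1)."""
        if self.n < 2:
            raise ValueError("need at least two values")
        return math.sqrt(self._m2 / (self.n - 1))

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)
print(stats.mean, stats.stdev())
```

The point of the update is that it avoids subtracting two large, nearly equal sums, which is where the naive sum-of-squares shortcut loses precision.
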
- danhau 19 hours ago
  
  Someone should write a „math notation for programmers“ article. Certainly would help me anyway.
  - Jtsummers 19 hours ago
    
    https://news.ycombinator.com/item?id=28493031
    There are others like this out there.
- theWreckluse 20 hours ago
  
  And also it's something programmers need to be skilled at.
  - dlivingston 20 hours ago
    
    Yes. But also, probably just about every language will have a module with these functions. Python's NumPy and SciPy should have all of these built-in.
- kqr 20 hours ago
  
  ...except it depends on knowledge of the t distribution but has no information on how to approximate it.
  It is a good frequentist's toolbox, but it is not immediately translatable to code, no.