>Being able to apply statistics is like having a secret superpower.
I totally agree with this sentence. But if you ask for my opinion, merely knowing a list of statistical formulas is not very helpful. Most of the time, people don’t remember the underlying assumptions, so there is a fair chance they will use the formulas in inappropriate situations.
I recommend watching these two YouTube videos. The presenters advocate using simulation/bootstrapping/shuffling methods instead of memorizing formulas.
Jake Vanderplas - Statistics for Hackers https://www.youtube.com/watch?v=Iq9DzN6mvYA
John Rauser - Statistics Without the Agonizing Pain https://www.youtube.com/watch?v=5Dnw46eC-0o
It makes no sense to memorize the formulas when most any statistical formula you'd actually use has a package or three that can run it in a way that's already probably reasonably benchmarked and not prone to you fat fingering some error rolling your own.
Assumptions are the part that matters.
IIRC, Jake's video inspired the example section in the Python random module docs. It takes about 15 minutes with those examples to learn how to put Jake's ideas into practice. https://docs.python.org/3/library/random.html#examples .
> The presenters advocate using simulation/bootstrapping/shuffling methods instead of memorizing formulas.
Yeah, I often find it much easier to write a little Python script that runs 10,000 Monte Carlo trials than to work things out "properly" and then not even be confident in my result anyway.
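For anyone wondering what such a script looks like, here is a minimal sketch of a shuffling (permutation) test using only the standard library. The two groups and the name `permutation_p_value` are made up for illustration; this is not code from either video.

```python
import random

def permutation_p_value(a, b, trials=10_000, seed=0):
    """One-sided p-value: how often does a random relabeling of the pooled
    data give a difference in means at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if sum(left) / len(left) - sum(right) / len(right) >= observed:
            hits += 1
    return hits / trials

# Hypothetical measurements for two groups (illustrative numbers only).
treatment = [12.1, 14.3, 13.8, 15.0, 13.2, 14.9]
control = [11.8, 12.5, 13.0, 12.2, 13.1, 12.7]
print(permutation_p_value(treatment, control))
```

A small result means random relabeling rarely reproduces a gap as large as the observed one. The only assumption baked in is that the labels would be exchangeable if there were no real difference.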
While I really liked the video by VanderPlas, I did return to it after a year or two, paused every time he presented a problem, and tried to solve it using for loops and thinking hard.
I barely succeeded at any of it. So at that point, just look up the formula instead of bootstrapping.
I’ll give the second one a shot too.
I think I avoid imposter syndrome in some areas, but Not Enough Real Math is definitely a weak spot.
When people start talking about eigenvalues, I'm just a business-rule caveman with a little discrete-math unga bunga.
This kind of statistical stuff falls somewhere in-between.
Then definitely watch 3Blue1Brown's video on eigenvalues and eigenvectors. [1] That's when it clicked for me! His entire series on linear algebra is incredibly well produced.
[1] https://youtube.com/watch?v=PFDu9oVAE-g
Eigenvalues are a topic in linear algebra. They're coefficients you can put in front of some matrices or vectors that change their magnitude.
Linear Algebra was the most useful and fun math class I took in college. Highly recommended if you ever wanna do gamedev. It's more approachable than you probably think.
For me, when people start talking about differential equations, specifically the symbols you'll see in a wikipedia article about Navier Stokes equations, I'm just a business-rule caveman with a little linear algebra zug zug.
> Eigenvalues are a topic in linear algebra. They're coefficients you can put in front of some matrices or vectors that change their magnitude.
Multiplying a vector or a matrix by any nonunit scalar changes its magnitude (hence scalar!! i.e. something that scales). Not all scalars are eigenvalues. So this isn't quite right.
Think about it geometrically instead. A linear operator transforms a space. Geometrically the transformation can be one or more of stretching, compressing, or rotating (taking shearing to be a kind of stretching). The directions in the space which remain the same other than having been scaled by some factor are the eigenvectors of the transformation. The scaling factor of one such direction is its eigenvalue.
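A tiny numeric check of that geometric picture, as a sketch. The 2x2 matrix is an arbitrary example picked for illustration, and `apply` is a hypothetical helper, not a library function.

```python
def apply(matrix, vector):
    """Multiply a 2x2 matrix (nested lists) by a 2-vector."""
    (a, b), (c, d) = matrix
    x, y = vector
    return (a * x + b * y, c * x + d * y)

A = [[2, 1],
     [1, 2]]

# (1, 1) keeps its direction and is stretched by 3: eigenvector, eigenvalue 3.
print(apply(A, (1, 1)))    # (3, 3)
# (1, -1) keeps its direction and its length: eigenvector, eigenvalue 1.
print(apply(A, (1, -1)))   # (1, -1)
# (1, 0) is knocked off its original direction, so it is not an eigenvector.
print(apply(A, (1, 0)))    # (2, 1)
```

Scaling by 3 or by 1 only counts as an eigenvalue here because there is a direction that this particular matrix scales by exactly that factor.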
> when people start talking about differential equations
It's not like you are going to solve those analytically.
Implement a couple of numerical solvers for things like Navier–Stokes and you'll see that differential equations are just obscenely compressed code.
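To make the "compressed code" point concrete, here is a sketch with something far simpler than Navier–Stokes: explicit Euler steps for dy/dt = -k*y. The equation, the parameter values, and the name `euler_decay` are my own choices for illustration.

```python
import math

def euler_decay(y0, k, t_end, steps):
    """Explicit Euler for dy/dt = -k * y, starting from y(0) = y0."""
    dt = t_end / steps
    y = y0
    for _ in range(steps):
        y += dt * (-k * y)   # the differential equation lives in this one line
    return y

approx = euler_decay(y0=1.0, k=2.0, t_end=1.0, steps=10_000)
exact = math.exp(-2.0)       # analytic solution y(t) = y0 * exp(-k * t)
print(approx, exact)
```

The right-hand side of the equation becomes the loop body and everything else is bookkeeping. Real solvers add better time-stepping and spatial discretization on top of the same idea.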
Studying more statistics is often clever. Although in this case Mr. Miller led with the most important part: if there are two numbers (like 7 and 5) in a statistical context, they might be the same number. That throws a lot of people into such a tailspin that they never really recover after making the obvious mistake of thinking they are different.
The powerful heuristic for the less technically inclined is to say "well, this evidence isn't conclusive until someone who knows statistics has tried to shoot it down".
Also: "Common statistical tests are linear models (or: how to teach stats)"[1]. Also also, bootstrapping is a superpower.
[1] https://lindeloev.github.io/tests-as-linear/
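Since bootstrapping came up, a minimal sketch of a percentile bootstrap confidence interval for a mean, standard library only. The sample data and the name `bootstrap_ci` are hypothetical, and the percentile method is just one of several bootstrap interval flavors.

```python
import random
from statistics import fmean

def bootstrap_ci(data, stat=fmean, trials=10_000, alpha=0.10, seed=0):
    """Percentile bootstrap interval for stat(data) at level 1 - alpha."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recompute the statistic, sort the results.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(trials))
    lo = estimates[int(trials * alpha / 2)]
    hi = estimates[int(trials * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical response times in milliseconds (illustrative numbers only).
sample = [212, 187, 340, 198, 225, 410, 176, 260, 231, 205, 298, 190]
print(fmean(sample), bootstrap_ci(sample))
```

No formula for the standard error is needed. Swapping `stat` for `statistics.median` gives an interval for the median with the same code.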
I'd add z-score (standard score) to your tool belt. The ability to identify or reject outliers is invaluable when trying to stabilize real-world business processes.
For example, if you are building heuristics that determine if a customer's bank account is "reasonably active", you may not want to consider very small transactions unless that is typical activity for a given customer.
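A rough sketch of that idea, scoring each transaction against the same customer's own history. The amounts and the helper names are made up. The cutoff is set to 2.0 rather than the usual 3.0 because with only ten data points no z-score can exceed about 2.85.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standard score of each value relative to the sample it came from."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]

def flag_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical transaction amounts for one customer.
transactions = [120, 95, 130, 110, 105, 125, 98, 115, 0.5, 102]
print(flag_outliers(transactions))
```

For a customer whose typical activity really is tiny payments, the same 0.5 would sit near the mean and would not be flagged.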
Another simple tool that gives you superpowers is a Q-Q plot. https://en.wikipedia.org/wiki/Q–Q_plot
Personally always loved me a Bland-Altman plot (https://en.m.wikipedia.org/wiki/Bland-Altman_plot)
FYI, using this stuff without understanding Test Power is dangerous and can lead to making bad decisions with false confidence.
The article "How Not To Sort By Average Rating" by the same author (and also linked in this article) is really good, and definitely changed my thinking about any kind of "sort by best to worst" list: https://www.evanmiller.org/how-not-to-sort-by-average-rating...
Covered by 3b1b some years ago: https://youtu.be/8idr1WZ1A7Q
Hm. I wonder how well "Score = [Positive ratings] / ([Total ratings] + 1)" would fare.
It'll help some, but I don't think enough. It's way, way easier to get a good score on a small number of ratings than on a large number. And the span on the number of ratings is several orders of magnitude. For instance, on Amazon you can do a search and get back products with fewer than 10 ratings alongside products with over 10,000 ratings.
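To put numbers on it, here is a sketch comparing the "+1" score with the lower bound of the Wilson score interval, which is what the linked article recommends. The rating counts are invented, and z = 1.96 (roughly 95% confidence) is my choice.

```python
import math

def plus_one_score(positive, total):
    """The proposed score: positive / (total + 1)."""
    return positive / (total + 1)

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion."""
    if total == 0:
        return 0.0
    p = positive / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)
    return (centre - margin) / (1 + z2 / total)

# Hypothetical products: (positive ratings, total ratings).
for positive, total in [(10, 10), (9, 10), (9000, 10000)]:
    print(positive, total,
          round(plus_one_score(positive, total), 3),
          round(wilson_lower_bound(positive, total), 3))
```

With these counts the "+1" score still ranks the 10-out-of-10 product above the 9,000-out-of-10,000 one, while the Wilson lower bound puts it well below.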
Why isn't 7 greater than 5?
Statistics gave him the superpower of predicting the future:
https://knowyourmeme.com/memes/fight-club-57-movie
That's hilarious!
yes, and informative. i was looking at the article and i thought everything made sense but i could tell i was missing something about this line...
Treat them as two draws from possibly different, independent distributions.
The question is whether the distribution that drew 7 “stochastically dominates” the distribution that drew 5. You may or may not be able to conclude that based on the available data and assumptions about the distributions.
https://en.m.wikipedia.org/wiki/Stochastic_dominance
For example, if you assume that the two distributions are approximately normal with very small variances, then you can probably conclude that the distribution that drew 7 stochastically dominates the distribution that drew 5. But if you assume that the variances are large, then you probably can’t conclude that.
Sounds like you should read the article. :)
Kidding. The idea is that there may be some statistical uncertainty associated with the measurement of 7, and also of 5, and so the "real" value of 7 may actually be less than the "real" value of 5.
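A quick sketch of how much the noise matters. The big simplifying assumption is that both measurements carry independent normal noise with the same standard deviation; `prob_order_flips` is a made-up name.

```python
from statistics import NormalDist

def prob_order_flips(mean_a, mean_b, sigma):
    """Chance that a draw from N(mean_b, sigma) exceeds a draw from
    N(mean_a, sigma), i.e. that the difference a - b falls below zero."""
    diff = NormalDist(mean_a, sigma) - NormalDist(mean_b, sigma)
    return diff.cdf(0)

for sigma in (0.1, 1.0, 5.0):
    print(sigma, round(prob_order_flips(7, 5, sigma), 3))
```

With tiny noise the ordering essentially never flips. With noise as large as the gap itself, it flips a substantial fraction of the time, so 7 versus 5 tells you much less.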
This is certainly a very useful resource - even for a seasoned data scientist!
Converting the math in here to code isn't very hard.
The statisticians have a bunch of tricks to transform the formulas into more-easily computable forms, e.g. calculate both the average and the standard deviation in a single pass through the data instead of one pass to calculate the average and a second to calculate the standard deviation. Converting the math in here to efficient code isn't very easy.
You mean Welford's algorithm. Since code was requested:
https://jonisalonen.com/2013/deriving-welfords-method-for-co...
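For reference, a short sketch of Welford's single-pass update in Python. The function name and the sample data are mine, and it returns the sample standard deviation (dividing by n - 1).

```python
import math

def welford(values):
    """Mean and sample standard deviation in one pass over the data."""
    count = 0
    mean = 0.0
    m2 = 0.0                        # running sum of squared deviations
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)    # uses the already-updated mean
    if count < 2:
        return mean, float("nan")   # spread is undefined for fewer than 2 points
    return mean, math.sqrt(m2 / (count - 1))

print(welford([2, 4, 4, 4, 5, 5, 7, 9]))
```

Because it only needs the running count, mean, and `m2`, it works on a stream you cannot hold in memory, and it avoids the cancellation problems of the naive sum-of-squares formula.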
Someone should write a „math notation for programmers“ article. Certainly would help me anyway.
https://news.ycombinator.com/item?id=28493031
There are others like this out there.
And also it's something programmers need to be skilled at.
Yes. But also, probably just about every language will have a module with these functions. Python's NumPy and SciPy should have all of these built-in.
...except it depends on knowledge of the t distribution but has no information on how to approximate it.
It is a good frequentist's toolbox, but it is not immediately translatable to code, no.