The Art Of Statistical Thinking Detect Misinformation, Understand The World Deeper, And Make Better Decisions. By Albert Rutherford And Jae H. Kim, PhD AudioChapter
9th April 2024 • Voice over Work - An Audiobook Sampler • Russell Newton

Transcripts

Speaker:

The art of statistical thinking, detect misinformation, understand the world

Speaker:

deeper, and make better decisions. Advanced Thinking Skills, book 3, written by

Speaker:

Albert Rutherford, J. H. Kim, Ph.D., narrated by Russell Newton.

Speaker:

We make decisions every day - some can change our lives and those of our loved ones.

Speaker:

But it is not only the individuals who make decisions.

Speaker:

Companies, courts of law, governments and international organizations also make decisions,

Speaker:

often on a large scale, that can affect our jobs, the justice system, and everyday life

Speaker:

in a positive or negative way.

Speaker:

Such decisions usually are made under incomplete information and uncertainty.

Speaker:

The decision-makers often make correct decisions that will benefit our society, but they make

Speaker:

incorrect decisions too.

Speaker:

The cost of the latter can sometimes be devastating, ranging from personal tragedies to changing

Speaker:

the course of human history.

Speaker:

But let’s not run so far ahead.

Speaker:

Suppose you are making an investment decision for your retirement.

Speaker:

Investment funds report their average returns for the past 5 years; you read a media report

Speaker:

about the recent growth of the real estate market, and you hear about overnight millionaires

Speaker:

who have made it big from investing in cryptocurrency.

Speaker:

You also hear about those who lost their life savings because of wrong investments or scams.

Speaker:

And there is always a catch in the fine print - “Past performance is not necessarily indicative

Speaker:

of future performance."

Speaker:

This means you are facing uncertainty in your investment decisions, and you should learn

Speaker:

how to make a well-informed decision under these circumstances.

Speaker:

If you make a decision after you sampled a range of different funds, compared them with

Speaker:

those of real estate markets, and studied the future prospect of the world economy,

Speaker:

learned from investment gurus such as Warren Buffett and listened to your friends

Speaker:

and advisors, then it is most likely that you have made an informed decision that will

Speaker:

bring a handsome payoff eventually.

Speaker:

This is, in a way, “statistical thinking”; you sample the population and learn from it

Speaker:

to make an informed decision.

Speaker:

The more diverse and informative your sample’s elements are, the more likely it is that you

Speaker:

have made the right decision.

Speaker:

This book will show you how to understand statistics as a layman and make informed decisions

Speaker:

with the help of statistical thinking.

Speaker:

The problem is that statistics can easily be manipulated and misinterpreted.

Speaker:

If statistical findings were always presented and utilized in an honest and correct way,

Speaker:

the results wouldn’t always be as rosy.

Speaker:

We often see distorted and misguided numbers and outcomes, even though that was not the

Speaker:

intention of those who report statistics.

Speaker:

This book is intended to help readers gain better understanding and decision-making skills

Speaker:

– the kind that professional statisticians possess.

Speaker:

In the first chapter, we will review the definitions and basic concepts of statistics.

Speaker:

In a book on statistics, introducing some mathematical details is inevitable.

Speaker:

However, these details will only be presented when necessary, without providing the full

Speaker:

theoretical background.

Speaker:

Chapter 1 - Definition and Basic Concepts.

Speaker:

1.

Speaker:

Sample versus population.

Speaker:

An investor wishes to know the five-year average return from investing in the U. S. stock market.

Speaker:

There are nearly 2,400 stocks (as of August 2022) listed on the NYSE (New York Stock Exchange),

Speaker:

and they must select a manageable number of stocks to form a portfolio of stocks.

Speaker:

However, they don’t need to calculate the average return of all 2,400 stocks.

Speaker:

There are stocks not worth investing in – too low return or too risky.

Speaker:

Our investor will need to select a set of stocks that suits their investment style.

Speaker:

In this example, the collection of all stocks in the NYSE is called the population in statistical

Speaker:

jargon, and a subset of all stocks is called a sample.

Speaker:

Collecting the information from all the members of the population is too costly and time-consuming

Speaker:

and even unnecessary.

Speaker:

We can obtain a good indicator of average return by looking at a sample.

Speaker:

The way we select the sample is critically important, and it depends largely on the purpose

Speaker:

of the study or the aim of the statistical task at hand.

Speaker:

Suppose the investor’s aim is to achieve a steady return with relatively low risk by

Speaker:

investing in big and stable companies.

Speaker:

Then a good sample is the Dow Jones index, which comprises the stocks of 30 prominent

Speaker:

companies, such as Boeing, Coca-Cola, Microsoft, and Procter & Gamble.

Speaker:

If the investor’s goal is to achieve a higher return with higher growth, albeit taking a

Speaker:

higher risk, the NASDAQ-100 index is a good sample that mainly includes the top technology

Speaker:

and IT stocks, such as Amazon, Apple, eBay, and Google.

Speaker:

By looking at the average returns of these indices, the investor can get a clear indication

Speaker:

and impression of the performance of these stocks.

Speaker:

Seasoned investors can select their own sample based on their aim and risk-return preference.

Speaker:

The important point is that the sample should be a good representation of the target population.

Speaker:

If the investor wants safe and steady investment returns, but their sample represents high-risk

Speaker:

stocks, they may not effectively achieve the aim of their investment.

Speaker:

Hence, the target population should be determined in consideration of the aim of the statistical

Speaker:

study.

Speaker:

A sample that is a good representation of the population can be obtained by pure random

Speaker:

sampling.

Speaker:

The members of the population are selected randomly with an equal chance.

Speaker:

For example, in political polls, all eligible voters should be treated equally.

Speaker:

In this situation, the most effective way of selecting an unbiased and representative

Speaker:

sample is random sampling, where the members of the eligible voters are selected with equal

Speaker:

chance, with no pre-selection or exclusions.

Speaker:

In a later chapter, we will discuss an example of one of the most disastrous polling outcomes

Speaker:

in history, which occurred due to a violation of this random sampling principle.
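
The equal-chance selection described above can be sketched in a few lines of Python. This is only an illustration: the voter roll of 10,000 numbered IDs, the function name `simple_random_sample`, and the seed are assumptions made for the example, not part of the book.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct members without replacement, each with an equal chance."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, k)


# Hypothetical voter roll: IDs 1..10,000 stand in for all eligible voters.
voter_ids = list(range(1, 10_001))

# Poll 1,000 voters with no pre-selection and no exclusions.
poll = simple_random_sample(voter_ids, k=1000, seed=42)
print(len(poll), "voters selected;", len(set(poll)), "are distinct")
```

Because every ID sits in the same list and `random.Random.sample` draws without replacement, no voter is favoured and none can be picked twice.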

Speaker:

2.

Speaker:

Descriptive statistics.

Speaker:

Descriptive statistics is a branch of statistics where the sample features are presented with

Speaker:

a range of summary statistics and visualization methods.

Speaker:

The summary statistics include the mean and median, which describe the centre of the sample

Speaker:

values, and the variance and standard deviation, which measure the variability of the

Speaker:

sample values.

Speaker:

Visualization methods include plots, charts, and graphs, which are used to make a visual

Speaker:

impression about the distribution of the sample values.

Speaker:

1.1.

Speaker:

Mean and median.

Speaker:

The mean refers to the average of a set of values.

Speaker:

It is computed by adding the numbers and dividing the total by the number of observations.

Speaker:

The mean is the average of the sample values of size n, with each individual point given

Speaker:

the weight of 1/n.

Speaker:

The formula for the mean can be written as

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (1)$$

Speaker:

where (X1, X2,…, Xn) represent the data points and n is called the sample size.

Speaker:

That is, the sample mean is the sum of all sample points divided by the sample size.

Speaker:

Alternatively, it can be interpreted as a weighted sum of all data points with an equal

Speaker:

weight of 1/n.

Speaker:

The median is the middle number in a sequence of numbers.

Speaker:

To find the median, organize each number in order by size; the number in the middle is

Speaker:

the median. In statistical terms, the median is defined as the middle value of (X1, X2,

Speaker:

…, Xn) when sorted in ascending or descending order.

Speaker:

Consider a simple example of (X1, …, Xn) = (1, 2, 3, 4, 5) and n = 5.

Speaker:

The sum of all X’s is 15 (1+2+3+4+5=15), and the sample mean is 3 (15/5=3).

Speaker:

The middle value of (1, 2, 3, 4, 5) is 3.

Speaker:

In this case, the sample’s mean and median are the same.

Speaker:

In general, the mean and median values are different, and the median is widely used where

Speaker:

there are possible extreme values in the sample points.

Speaker:

Consider the sample points with an extreme observation (X1, …, Xn) = (1, 2, 3, 4, 20),

Speaker:

then the sample mean is 6 (1+2+3+4+20 = 30; 30/5=6), and the median is still 3 as the

Speaker:

middle value of the distribution (1, 2, 3, 4, 20).

Speaker:

If this extreme value is unusual and does not represent the target population, then

Speaker:

the sample mean of 6 can be a misleading value because it was distorted by the presence of

Speaker:

20.

Speaker:

In this case, the median should be preferred to the mean.
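
The two small samples above can be checked directly with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen only for this illustration.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]      # no extreme value
with_extreme = [1, 2, 3, 4, 20]  # same sample with 5 replaced by 20

# The mean is pulled up by the extreme value; the median is not.
print("symmetric:    mean =", mean(symmetric), " median =", median(symmetric))
print("with extreme: mean =", mean(with_extreme), " median =", median(with_extreme))
```

Changing one value from 5 to 20 doubles the mean from 3 to 6 while the median stays at 3, which is why the median is preferred when an unusual value is present.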

Speaker:

A practical example of using the median over the mean is the case for house prices.

Speaker:

For example, the researcher is interested in the average house price in a middle-class

Speaker:

suburb.

Speaker:

In such a suburb, there is still a chance that a big mansion or two in a large block

Speaker:

of land may be included in the sale.

Speaker:

However, these houses do not represent the general characteristics of the suburb, and

Speaker:

it is reasonable to use the median in this case to find the average value free from the

Speaker:

effect of these extreme values.

Speaker:

The choice between the mean and the median is closely related to the “skewness” of the distribution.

Speaker:

If the distribution of the numbers you have is (more or less) symmetric around the mean

Speaker:

as in (X1, …, Xn) = (1, 2, 3, 4, 5), the mean and median will be identical or practically

Speaker:

the same.

Speaker:

However, when the distribution of the numbers is asymmetric or skewed, then the mean and

Speaker:

median can be different.

Speaker:

For example, if the distribution is asymmetric, as in (X1, …, Xn) = (1, 2, 3, 4, 20), then

Speaker:

the two values can be different.

Speaker:

Photo source - Study.com.

Speaker:

Graphical illustrations of the different shapes of the distribution and the positions of the

Speaker:

mean and median are given above.

Speaker:

Suppose the above is the distribution of the performance of all salespeople in a company.

Speaker:

A symmetric distribution means the higher performers and lower performers are in the

Speaker:

same or similar proportion; in which case the mean and median are almost identical.

Speaker:

A positively skewed distribution means the presence of a small number of extremely capable performers.

Speaker:

In this case, the mean of the sales is inflated by their performance.

Speaker:

If the sales manager wants an average value that represents the performance of the “average

Speaker:

salesperson”, then the use of median is appropriate.

Speaker:

If she wants to know the average sales, including the performance of all salespeople in the

Speaker:

company, then the use of the mean is appropriate.

Speaker:

A similar interpretation can also be made from a negatively skewed distribution illustrated

Speaker:

above.

Speaker:

1.2.

Speaker:

Variance and standard deviation.

Speaker:

When analyzing or presenting a set of numbers, it is important to know the centre of the

Speaker:

distribution.

Speaker:

But understanding their dispersion and variability is also important.

Speaker:

Consider two salespeople with the same or a similar number of mean sales in the past

Speaker:

year.

Speaker:

In evaluating who was a more consistent performer, the manager will compare the dispersions in

Speaker:

their sales throughout the year.

Speaker:

The measures of variability, namely the variance and standard deviation, show how widely spread the sample

Speaker:

points are around the mean.

Speaker:

The distance of each sample point from the mean is calculated as $X_i - \bar{X}$, and these distances are squared

Speaker:

to make them all positive.

Speaker:

The average of all the squared distances from the mean is called the variance, which can

Speaker:

be written as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 \qquad (2)$$

Speaker:

How this formula works will be explained in the table below.

Speaker:

But it is, in a way, the average of the squared distance of the data points from the mean,

Speaker:

i.e., of $(X_i - \bar{X})^2$.

Speaker:

The standard deviation (s) is defined as the square root of the variance, namely,

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2} \qquad (3)$$

Speaker:

Since the variance is the distance of the sample points from the mean in squares, the

Speaker:

standard deviation converts the value into the same unit as the original value of the

Speaker:

sample points by taking the square root.

Speaker:

$X_i$ | $X_i - \bar{X}$ | $(X_i - \bar{X})^2$
1 | -2 (= 1 - 3) | (-2)^2 = 4
2 | -1 (= 2 - 3) | (-1)^2 = 1
3 | 0 (= 3 - 3) | 0^2 = 0
4 | 1 (= 4 - 3) | 1^2 = 1
5 | 2 (= 5 - 3) | 2^2 = 4
Sum | | 10

$\bar{X} = 3$

Speaker:

Using the example we used above as an illustration, X = (1, 2, 3, 4, 5) and $\bar{X} = 3$. The variance is the

Speaker:

sum of the numbers in the last column on the chart above divided by 4, which is 10/4 = 2.5.

Speaker:

The standard deviation is $s = \sqrt{2.5} \approx 1.58$.

Speaker:

The interpretation is that the sample points are, on average, 1.58 units away from the

Speaker:

mean value of 3.

Speaker:

Why the division (or weight) is by (n-1), not by n, is beyond the scope of this book,

Speaker:

but it is to make the calculation more accurate when the sample size is small.

Speaker:

When the sample size is large, the division by n or by (n-1) makes no practical difference.
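
The worked example and the remark about n versus (n-1) can be reproduced with Python's standard `statistics` module, where `variance` divides by (n-1) and `pvariance` divides by n. The larger data set of 10,000 values is an assumption added purely to show how the two divisors converge.

```python
from statistics import pvariance, stdev, variance

x = [1, 2, 3, 4, 5]

s2 = variance(x)           # sum of squared distances / (n - 1) = 10 / 4
s = stdev(x)               # square root of the sample variance
divided_by_n = pvariance(x)  # sum of squared distances / n = 10 / 5

print("variance (n-1):", s2)
print("standard deviation:", round(s, 2))
print("variance (n):", divided_by_n)

# With a large sample the choice of divisor hardly matters.
big = list(range(1, 10_001))  # hypothetical sample of size 10,000
ratio = variance(big) / pvariance(big)
print("ratio of the two variances for n = 10,000:", round(ratio, 5))
```

For the five-point sample the two divisors give 2.5 and 2, a visible gap; for 10,000 points the ratio is n/(n-1), which is within a hundredth of a percent of 1.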

Speaker:

There are other variability measures around the median (e.g., the interquartile range), and

Speaker:

they will be introduced in this book later.

Speaker:

3.

Speaker:

Sample statistics and population parameters.

Speaker:

The sample mean ($\bar{X}$) and standard deviation (s) are the statistics calculated from a sample.

Speaker:

The sample is a subset of the population, which also has the mean and standard deviation

Speaker:

(the median and variance as well).

Speaker:

When we use statistics, what we eventually want to know is the population values (also

Speaker:

called the population parameters), such as the mean and standard deviation.

Speaker:

The population mean and standard deviation are often written with Greek letters as $\mu$ and

Speaker:

$\sigma$, values that are never known.

Speaker:

Suppose you want to know the mean household income of California.

Speaker:

If you visit all the households in California to find their mean income, as in a census,

Speaker:

you are looking for the value of $\mu$.

Speaker:

However, such an exercise is often neither feasible nor necessary.

Speaker:

A good representative sample can tell us a lot about $\mu$, as we shall see later.

Speaker:

We can gather a random sample of 1,000 households to find their income, and this will give the

Speaker:

value of the sample mean ($\bar{X}$).

Speaker:

If the sample was a good representation of the population, it is likely the sample mean

Speaker:

is a good indicator for the population mean.

Speaker:

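The population mean and variance (and standard deviation) can be written formally as

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i \qquad (4)$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2 \qquad (5)$$

$$\sigma = \sqrt{\sigma^2} \qquad (6)$$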

Speaker:

where N is the population size and $(X_1, \ldots, X_N)$ represent the population values.

Speaker:

The formulae above are similar to their sample counterparts in (1) to (3), hence their interpretations

Speaker:

are similar, but they are the values of the population.

Speaker:

In our example, N is the total number of households in California, and $(X_1, \ldots, X_N)$ are their

Speaker:

incomes.

Speaker:

If 1,000 households are selected randomly and their mean income is found to be $75,000,

Speaker:

then the sample mean is $\bar{X} = 75{,}000$ (in dollars) with n = 1,000.

Speaker:

It is hoped that this value of the sample mean is in the close neighbourhood of the true

Speaker:

value of the population mean.
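
How close a sample mean can come to the population mean can be explored with a small simulation. Everything about the population here is an assumption made for illustration: 100,000 synthetic household incomes drawn from a bell curve centred on $75,000, which is a simplification since real incomes are skewed and cannot be negative.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 household incomes (in dollars).
population = [rng.gauss(75_000, 30_000) for _ in range(100_000)]
mu = fmean(population)  # population mean: unknown in real life

# A random sample of 1,000 households, each with an equal chance.
sample = rng.sample(population, 1000)
x_bar = fmean(sample)   # sample mean: what we can actually observe

print(f"population mean: {mu:,.0f}")
print(f"sample mean:     {x_bar:,.0f}")
print(f"difference:      {x_bar - mu:,.0f}")
```

In this setup the sample uses only 1% of the households, yet its mean typically lands within a few thousand dollars of the population mean, which is the sense in which a representative sample is a good indicator.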

Speaker:

Let us take another example.

Speaker:

Consider a fictitious country with 1 million (N) eligible voters who are voting for

Speaker:

their President.

Speaker:

A candidate should have the support rate of more than 0.5 to get elected.

Speaker:

The true value of the support rate ($\mu$) is unknown, and what matters is this value on

Speaker:

the election date.

Speaker:

A poll is conducted from a sample of 1000 (n) eligible voters, 10 days before the election

Speaker:

date.

Speaker:

The support rate found in this poll is the sample mean ($\bar{X}$).

Speaker:

Suppose this sample value ($\bar{X}$) is 50.1 per cent.

Speaker:

This value is called an estimate of the population parameter ($\mu$).

Speaker:

If the sample is a good representation of the population, this estimate of sample mean

Speaker:

is an indicator for the value of $\mu$, 10 days before the election date.

Speaker:

4.

Speaker:

Descriptive statistics for relative position.

Speaker:

Suppose your IQ score is 115.

Speaker:

A natural question is how smart are you (according to the IQ score only) relative to the other

Speaker:

people in the sample or population.

Speaker:

Suppose your annual income is $50,000.

Speaker:

You want to know how rich or how poor you are relative to the others in the sample or

Speaker:

population.

Speaker:

You ran a marathon, and you completed the race with a record of 3 hours.

Speaker:

You want to know your rank in the race and where your rank stands relative to all the

Speaker:

participants of the race.

Speaker:

These questions are asking for a relative position, another important question in statistics.

Speaker:

The popular measures of relative positions are percentiles (sometimes called quantiles)

Speaker:

and quartiles.

Speaker:

Percentiles (quantiles).

Speaker:

With percentiles, we divide the distribution of the numbers into 100 positions.

Speaker:

For example, the 90th percentile represents the value in the sample that has 10% of the

Speaker:

sample points higher and 90% of the values lower than it.

Speaker:

That is, if your IQ score of 115 is said to be the 90th percentile, this means you are

Speaker:

at the top 10% of the distribution of all IQ scores.

Speaker:

Suppose your income of $50,000 is the 40th percentile of the distribution, then it means

Speaker:

your income is at the bottom 40% of the distribution.

Speaker:

That is, if there were 1000 people in the sample, your income stands at the 400th position

Speaker:

when all incomes are sorted in ascending order.

Speaker:

Similarly, among the 100 runners who participated in the marathon event, suppose your record

Speaker:

of 3 hours is at the 75th percentile.

Speaker:

This means your record is at the top 25%, and there are 25 runners who finished the race

Speaker:

with a better record than yours, and 74 of them were behind you.
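
The income example can be turned into a small calculation. The helper `percentile_rank` and the list of 1,000 distinct incomes are hypothetical, introduced only for this sketch; it uses the simple "share of values at or below x" definition, and statistical packages may use slightly different conventions.

```python
def percentile_rank(values, x):
    """Percentage of the values that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)


# Hypothetical sample: 1,000 distinct incomes, already in ascending order.
incomes = list(range(1, 1001))

# The income sitting at the 400th position is at the 40th percentile.
income_at_400th_position = incomes[399]
print(percentile_rank(incomes, income_at_400th_position))
```

The same function applied to the highest income in the list returns 100, and applied to the lowest returns 0.1, so the rank always says what share of the sample sits at or below you.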

Speaker:

Quartiles.

Speaker:

Quartiles are similar to percentile, but instead of dividing the distribution of the numbers

Speaker:

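into 100 positions, they are based on the division into 4, as the following table shows.

Quartile | Percentile
1st quartile | 25th percentile
2nd quartile | 50th percentile (median)
3rd quartile | 75th percentile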

Speaker:

The first quartile is the value whose position is at the bottom 25%, and it is the same as

Speaker:

the 25th percentile.

Speaker:

The second quartile is the 50th percentile, which is also the median.

Speaker:

If we go back to your marathon record, your record of 3 hours is the third quartile of

Speaker:

the distribution.

Speaker:

Interquartile range.

Speaker:

An interquartile range is defined as the difference between the 3rd and 1st quartiles of the

Speaker:

distribution.

Speaker:

It is a measure of variability or dispersion of a distribution alternative to the standard

Speaker:

deviation.

Speaker:

As the difference between the 3rd and 1st quartiles, it is the length of the interval that contains

Speaker:

the (middle) 50% of the data points around the median.

Speaker:

Similarly to the median, the interquartile range is not sensitive to a few extreme values

Speaker:

in the distribution, while standard deviation can be inflated by extreme values.

Speaker:

More examples will follow for the interquartile range.

Speaker:

As an example, consider two suburbs whose median house prices are similar at 1 million

Speaker:

dollars.

Speaker:

The researcher finds the first suburb has the 1st quartile at $750,000 and the 3rd

Speaker:

quartile at $1.25 million, with the interquartile range of $500,000 ($1.25 million - $750,000).

Speaker:

The second suburb has the 1st quartile at $500,000 and the 3rd quartile at $1.5

Speaker:

million, with the interquartile range of 1 million dollars ($1.5 million – $500,000).

Speaker:

The interval that contains the middle 50% of the house prices is much longer in the

Speaker:

second suburb, which indicates the variability of house prices is substantially larger in

Speaker:

the second suburb.
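
The robustness of the interquartile range can be seen on the two small samples used earlier for the mean and median. This sketch relies on `statistics.quantiles` with `method="inclusive"`; other quartile conventions exist and give slightly different cut points, so the choice here is an assumption.

```python
from statistics import quantiles, stdev


def iqr(values):
    """Interquartile range: 3rd quartile minus 1st quartile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


symmetric = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 20]

for name, data in [("symmetric", symmetric), ("with extreme", with_extreme)]:
    print(f"{name}: IQR = {iqr(data)}, standard deviation = {stdev(data):.2f}")
```

The extreme value of 20 leaves the interquartile range unchanged while the standard deviation grows roughly fivefold.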

Speaker:

5.

Speaker:

Data Visualization.

Speaker:

Visualization is a powerful way of understanding the key features of a sample and making impressions.

Speaker:

It often makes a better and stronger impression about the data characteristics than a table

Speaker:

full of numbers.

Speaker:

Consider an investor who wishes to invest in U. S. stocks.

Speaker:

They gather the sample for NASDAQ-100 index and want to know how the index and its return

Speaker:

have performed in the last 5 years to December 2021.

Speaker:

Figure 1 presents the line charts (time plots) of the index and its return (growth rate) in percentage,

Speaker:

monthly from 2017 to 2021.

Speaker:

The index has been growing with an upward trend for the last 5 years, and the trend

Speaker:

gets steeper from early 2020.

Speaker:

The monthly return fluctuates around 0, with most values between -10% and 10%.

Speaker:

These plots provide a clear impression of how the index has performed in the last five

Speaker:

years.

Speaker:

Figure 1 - Time plots of NASDAQ-100 index and return.

Speaker:

Data source - Yahoo Finance.
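
The return series in a plot like Figure 1 is a growth rate computed from consecutive index levels. A minimal sketch of that calculation follows, assuming simple percentage returns; the three index levels are made-up numbers, not NASDAQ-100 data.

```python
def pct_returns(levels):
    """Percentage change from each period to the next."""
    return [100 * (cur - prev) / prev for prev, cur in zip(levels, levels[1:])]


# Hypothetical month-end index levels.
levels = [100.0, 105.0, 102.9]

returns = pct_returns(levels)
print([round(r, 2) for r in returns])  # one return fewer than there are levels
```

A rise from 100 to 105 is a return of 5%, and the fall from 105 to 102.9 is a return of about -2%, so a steadily rising index can still show negative monthly returns along the way.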

Speaker:

A histogram is another popular method of data visualization that presents the frequencies

Speaker:

of data points over the intervals of sample points.

Speaker:

It is a useful method of presenting the distributional shape of the sample points.

Speaker:

Figure 2 presents the histogram of the monthly returns, which shows the monthly returns are

Speaker:

centred between 0% and 5%, and most of the values are in the range of -10% and 10%.

Speaker:

The sample mean value of the monthly return is 2.02%, and their median is 2.68%, so the

Speaker:

index has been increasing at an average growth rate of just higher than 2%.

Speaker:

The standard deviation is 4.92%, which indicates the average deviation of the monthly returns

Speaker:

from the mean has been around 5%.

Speaker:

By combining the plots and summary statistics, the investor can learn about the performance

Speaker:

of the index in detail.

Speaker:

Figure 2 - Histogram of Returns from NASDAQ-100 index.

Speaker:

Data source - Yahoo Finance.
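
Behind a histogram is a simple count of how many data points fall into each interval. The sketch below shows that counting step with intervals five percentage points wide; the ten monthly returns are invented for illustration and are not the Yahoo Finance data.

```python
import math
from collections import Counter


def histogram_counts(values, width):
    """Count values per interval [start, start + width), keyed by interval start."""
    return Counter(math.floor(v / width) * width for v in values)


# Hypothetical monthly returns in percent.
returns = [2.7, -1.2, 4.9, 6.3, -5.8, 0.4, 3.1, 8.0, -11.5, 1.9]

counts = histogram_counts(returns, width=5)
for start in sorted(counts):
    print(f"[{start:>4}, {start + 5:>3}): {'#' * counts[start]}")
```

Printing one `#` per observation gives a sideways text histogram, with the tallest bar on the interval from 0% to 5% for this made-up sample.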

Speaker:

6.

Speaker:

Comparing alternative distributions.

Speaker:

Now suppose the investor wishes to compare the performance of the NASDAQ-100 with the

Speaker:

Apple stock (APPL) for the same period.

Speaker:

The following table compares the basic statistics discussed so far.

Speaker:

Monthly returns (% per month) for two alternative investments.

Statistic | NASDAQ-100 | APPL
Mean | 2.01 | 3.02
Median | 2.68 | 5.00
Standard Deviation | 4.92 | 8.34
1st Quartile | -0.18 | -1.66
3rd Quartile | 5.13 | 9.25
10th percentile | -5.89 | -7.35
90th percentile | 7.37 | 12.27

Speaker:

Data source - Yahoo finance.

Speaker:

The figures in this table reveal many details of the two investment alternatives -

Speaker:

•The average return from NASDAQ-100 is substantially lower than that of APPL. The mean and median of the

Speaker:

former are 2.01% and 2.68% per month, but those of APPL are 3.02% and 5.00%.

Speaker:

•For both cases, the median is larger than the mean, especially for APPL. This means

Speaker:

the distribution is skewed to the left, with the presence of extremely low returns.

Speaker:

This means, when they go down, they can go down deep!

Speaker:

(Especially APPL!).

Speaker:

•The variability is a lot higher for the returns from APPL. The standard deviation

Speaker:

of APPL (8.34) is nearly twice as large as that of NASDAQ-100 (4.92).

Speaker:

This means APPL has a lot larger variation around the mean.

Speaker:

•The interquartile range for APPL is 10.91 (9.25 + 1.66) and that of NASDAQ-100 is 5.31

Speaker:

(5.13+0.18).

Speaker:

The length of interval that contains the middle 50% of the returns around the median is again

Speaker:

nearly twice as large for APPL.

Speaker:

•The worst possible outcome with 10% chance for APPL has been -7.35%, and that for NASDAQ-100

Speaker:

has been -5.89%.

Speaker:

The best possible outcome with 10% chance for APPL has been 12.27% a month, and that

Speaker:

for NASDAQ-100 has been 7.37%.
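
The comparisons in the bullet points above follow from simple arithmetic on the table. The sketch below recomputes them from the quoted figures; the dictionary layout and variable names are just one convenient way to hold the numbers.

```python
# Figures quoted in the table (percent per month).
stats = {
    "NASDAQ-100": {"sd": 4.92, "q1": -0.18, "q3": 5.13},
    "APPL": {"sd": 8.34, "q1": -1.66, "q3": 9.25},
}

# Interquartile range = 3rd quartile - 1st quartile.
iqr = {name: round(s["q3"] - s["q1"], 2) for name, s in stats.items()}

sd_ratio = stats["APPL"]["sd"] / stats["NASDAQ-100"]["sd"]
iqr_ratio = iqr["APPL"] / iqr["NASDAQ-100"]

print("IQR:", iqr)
print("standard deviation ratio (APPL / NASDAQ-100):", round(sd_ratio, 2))
print("IQR ratio (APPL / NASDAQ-100):", round(iqr_ratio, 2))
```

Subtracting a negative first quartile is the same as adding its absolute value, which is why the text writes the ranges as 9.25 + 1.66 and 5.13 + 0.18.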

Speaker:

The comparison of these descriptive statistics reveals that monthly returns are a lot higher

Speaker:

for APPL investment, but it shows substantially higher variability or risk.

Speaker:

This is a well-known principle in finance - a higher return is compensation for taking

Speaker:

a higher risk.

Speaker:

The above plots present the histograms for the two investments.

Speaker:

A larger variability of the APPL with a heavier skew to the left of the distribution than

Speaker:

NASDAQ-100 is clear.

Speaker:

While the summary statistics tell the difference with the numbers, these histograms can make

Speaker:

a visual comparison.

Speaker:

To make a further visual comparison, another method of visualisation called the “Box-Whisker”

Speaker:

plot is introduced.

Speaker:

It plots the mean, the median, the 1st quartile, the 3rd quartile, maximum and minimum, along

Speaker:

with outliers.

Speaker:

The box in the middle is based on the 3rd quartile and 1st quartile, and the height

Speaker:

of the box represents the interquartile range.

Speaker:

Outliers are determined by a certain criterion (e.g., the outliers are defined as those lying

Speaker:

three standard deviations away from the mean).
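
The ingredients of a Box-Whisker plot can be computed without drawing anything. The sketch below returns the minimum, the three quartiles, and the maximum, and flags outliers with the three-standard-deviation criterion mentioned above. The function names and the sample are hypothetical, and plotting software often uses a different outlier rule (commonly 1.5 times the interquartile range beyond the box), so treat this as one possible convention.

```python
from statistics import mean, quantiles, stdev


def five_number_summary(values):
    """Minimum, 1st quartile, median, 3rd quartile, maximum."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def outliers_3sd(values):
    """Values lying more than three standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > 3 * s]


# Hypothetical sample: nineteen ordinary values and one extreme one.
sample = [0] * 19 + [100]

print("five-number summary:", five_number_summary(sample))
print("outliers:", outliers_3sd(sample))
```

The box would span the 1st to 3rd quartile, the line inside it marks the median, and any flagged values are drawn as separate points.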

Speaker:

Again, the APPL investment gives a substantially higher median return per month, but its monthly

Speaker:

variability is much higher than NASDAQ-100.

Speaker:

Which investment to choose depends on how risk-averse or risk-tolerant the investor

Speaker:

is.

Speaker:

If you are a Braveheart and enjoy a roller coaster ride, investing in APPL is not a bad

Speaker:

choice; otherwise, stick to the NASDAQ-100 for a safer option.

Speaker:

7.

Speaker:

Normal distribution.

Speaker:

Figure 2 presents a distribution of the sample points using a histogram.

Speaker:

In statistics, distribution is an important feature for both the sample and the population.

Speaker:

While we can observe a distribution of the sample as in Figure 2, that of the population

Speaker:

is often unknown and not observable.

Speaker:

Understanding the features of a distribution is one of the fundamental questions of statistics.

Speaker:

For example, what is the chance that investing in the NASDAQ-100 index will provide a return

Speaker:

greater than 2%?

Speaker:

What proportion of the households in California has a lower annual income than $50,000?

Speaker:

We can only guess using the distribution of the sample we observe.

Speaker:

Again, if the sample is a fair representation of the population, the distribution of the

Speaker:

sample can well reflect the distribution of the population.

Speaker:

On the other hand, there are several known distributions in statistics where the probability

Speaker:

can be calculated using the given values of the parameters, such as the mean and standard

Speaker:

deviation.

Speaker:

Among them, the most fundamental and popular is the normal distribution.

Speaker:

It is also a key distribution in the inferential statistics to be discussed in the next chapter.

Speaker:

Normal distribution is a bell-shaped distribution, symmetric around its mean (or median), and

Speaker:

the probability at any point of the distribution is known.

Speaker:

A normal distribution with a mean of $\mu$ and a standard deviation of $\sigma$ is written as $N(\mu, \sigma)$.

Speaker:

In the special case of the mean being zero and the standard deviation 1, it is called

Speaker:

standard normal distribution, and it is denoted as $N(0, 1)$.

Speaker:

Figure 3 is a screenshot from an online calculator.

Speaker:

Figure 3 - Standard normal distribution.

Speaker:

Given the values of the mean and standard deviation, any probability between an interval

Speaker:

can be calculated.

Speaker:

Figure 3 shows a normal distribution with zero mean and a standard deviation of 1 (called

Speaker:

the standard normal distribution).

Speaker:

Suppose your return (in percentage) from an investment follows the standard normal distribution.

Speaker:

The probability that your return is between -1.96% and 1.96% is calculated to be 0.95

Speaker:

(dark area on the bell illustration).

Speaker:

This also means the probability of the tail areas is 5% (white area on the bell illustration).

Speaker:

Your investment return can be lower than -1.96% with the probability of 0.025 and can take

Speaker:

a value greater than 1.96% with the probability 0.025.
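
These probabilities do not require an online calculator. As a sketch, Python's standard `statistics.NormalDist` class (available from Python 3.8) gives the same areas under the standard normal curve.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

inside = z.cdf(1.96) - z.cdf(-1.96)  # area between -1.96 and 1.96
lower_tail = z.cdf(-1.96)            # area below -1.96
upper_tail = 1 - z.cdf(1.96)         # area above 1.96

print(f"P(-1.96 < return < 1.96) = {inside:.3f}")
print(f"P(return < -1.96)        = {lower_tail:.3f}")
print(f"P(return > 1.96)         = {upper_tail:.3f}")
```

The `cdf` method returns the probability of a value at or below its argument, so an interval probability is the difference of two `cdf` values, and the three areas add up to 1.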

Speaker:

Let’s assume the household income in California follows a normal distribution with a mean of $75,000 and

Speaker:

a standard deviation of $30,000 (see Figure 4).

Speaker:

Then, the household income distribution of California is represented by the bell curve

Speaker:

in Figure 4.

Speaker:

The probability that a household income is less than $50,000 or the proportion of the

Speaker:

households with income less than $50,000 is represented by the dark area in the distribution,

Speaker:

which is 0.20 approximately.

Speaker:

In other words, if you pick a household at random, you have a 0.20 chance to bump into

Speaker:

one with an income less than $50,000.

Speaker:

This also means the chance of a randomly selected household having an income higher than $50,000

Speaker:

is around 0.80 (= 1 - 0.2023).
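
The household income figure can be reproduced the same way, under the stated assumption that income is normally distributed with a mean of $75,000 and a standard deviation of $30,000. The variable names are chosen for this sketch only.

```python
from statistics import NormalDist

income = NormalDist(mu=75_000, sigma=30_000)  # assumed income distribution

p_below = income.cdf(50_000)  # share of households under $50,000
p_above = 1 - p_below         # share of households over $50,000

# The same answer via a z-score on the standard normal distribution.
z_score = (50_000 - 75_000) / 30_000
p_below_via_z = NormalDist().cdf(z_score)

print(f"P(income < 50,000) = {p_below:.4f}")
print(f"P(income > 50,000) = {p_above:.4f}")
print(f"z-score = {z_score:.3f}, probability = {p_below_via_z:.4f}")
```

An income of $50,000 sits about 0.83 standard deviations below the mean, and the area to the left of that point is roughly 0.20.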

Speaker:

Figure 4 - Application of a normal distribution.

Speaker:

8.

Speaker:

Checking the normality of a distribution.

Speaker:

Normal distribution is the most fundamental and popular distribution in statistics, and

Speaker:

it is widely used as a “benchmark” distribution or as an “approximation” to the true distribution

Speaker:

when it is unknown.

Speaker:

Being a benchmark or approximation means it may be sometimes useful, but sometimes not,

Speaker:

depending on the context and situation.

Speaker:

Figure 5 is the histogram we have seen in Figure 2, the returns from NASDAQ-100 investment,

Speaker:

overlaid with the normal distribution with the same mean and standard deviation values

Speaker:

of the returns.

Speaker:

While the histogram shows a similar shape to the normal distribution, with near symmetry

Speaker:

and bell curve, the fine details are not impressively consistent with the normal distribution.

Speaker:

While an approximation by a normal distribution to a stock return distribution is sometimes

Speaker:

used, it is generally accepted that a stock return distribution shows a clear departure

Speaker:

from a normal distribution.

Speaker:

Figure 5 - Histogram of the NASDAQ-100 and APPL returns and a normal curve.

Speaker:

The Q-Q (quantile-quantile) plot provides a clearer way of checking the normality of

Speaker:

a sample distribution using a graphical method.

Speaker:

It connects the sample quantiles (or percentiles) with the (theoretical) quantiles from the

Speaker:

normal distribution.

Speaker:

If the sample follows the standard normal distribution, then its percentiles should

Speaker:

match the percentiles from the normal distribution with the same mean and standard deviation.

Speaker:

The 97.5th percentile from the sample distribution (which is normal) should match 1.96, and

Speaker:

the 50th percentile from the sample distribution should be 0, which is the 50th percentile

Speaker:

from the normal distribution.

Speaker:

An example of the Q-Q plot is given here.

Speaker:

The grid lines are at (-1.96, 0, 1.96) for both axes, which are the 2.5th, 50th, and

Speaker:

97.5th percentiles from the standard normal distribution.

Speaker:

The y-axis (vertical) represents the sample quantile, and the x-axis (horizontal) represents

Speaker:

the theoretical quantiles from the normal distribution.

Speaker:

The grid lines are at (-1.96, 0, 1.96) for both axes, which match exactly.

Speaker:

Hence, any sample that shows a Q-Q plot like the one above can be well approximated by

Speaker:

a normal distribution.
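
The points of a Q-Q plot can be computed by pairing the sorted, standardized sample values with the matching quantiles of the standard normal distribution. The sketch below does only the pairing, not the drawing. The function name `qq_points`, the plotting positions (i - 0.5)/n, and the ten sample returns are assumptions for illustration; statistical packages use several slightly different position formulas.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pairs of (theoretical normal quantile, standardized sample quantile)."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    standardized = sorted((x - m) / s for x in sample)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, standardized))


# Hypothetical monthly returns in percent.
returns = [2.7, -1.2, 4.9, 6.3, -5.8, 0.4, 3.1, 8.0, -11.5, 1.9]

for theo, obs in qq_points(returns):
    print(f"theoretical {theo:6.2f}   sample {obs:6.2f}")

# The grid value 1.96 is the 97.5th percentile of the standard normal.
print(round(NormalDist().inv_cdf(0.975), 2))
```

If the sample were close to normal, the two columns would be nearly equal row by row, which on a plot means the points lie near the 45-degree line.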

Speaker:

Figure 6.

Speaker:

Q-Q plots for NASDAQ-100 and APPL returns.

Speaker:

The grid lines are at (-1.96, 0, 1.96) for both axes.

Speaker:

Figure 6 presents the Q-Q plots for the NASDAQ-100 and APPL returns.

Speaker:

The NASDAQ-100 return shows a reasonable match with the normal quantiles,

Speaker:

while the quantiles of the APPL return show substantial departures from the normal quantiles.

Speaker:

This indicates that, while the NASDAQ-100 returns may be approximated by a normal distribution

Speaker:

with reasonable accuracy, a normal distribution will be a poor approximation to the APPL return

Speaker:

distribution.

Speaker:

9.

Speaker:

Concluding remarks.

Speaker:

As an opening chapter, the basic concepts and descriptive measures of statistics were

Speaker:

discussed with the following keywords -

Speaker:

•Sample and population.

Speaker:

•Mean and Median.

Speaker:

•Standard deviation and Inter-quartile range.

Speaker:

•Percentile or quartiles.

Speaker:

•Histogram, Time plots, Q-Q plot, Box-Whisker plot.

Speaker:

•Normal distribution.

Speaker:

If you understand the listed concepts and methods, and you can apply them to real-world

Speaker:

situations, you already have made big steps into the world of statistical thinking!

Speaker:

You can produce these statistics using popular tools such as Excel.

Speaker:

This has been the art of statistical thinking, detect misinformation, understand the world

Speaker:

deeper

Speaker:

and make better decisions. Advanced Thinking Skills Book 3. Written by Albert Rutherford,

Speaker:

J. H. Kim, PhD. Narrated by Russell Newton. Copyright 2022 by Albert Rutherford. Production

Speaker:

copyright by Albert Rutherford.
