The art of statistical thinking, detect misinformation, understand the world
Speaker:deeper, and make better decisions. Advanced Thinking Skills, book 3, written by
Speaker:Albert Rutherford, J. H. Kim, Ph.D., narrated by Russell Newton.
Speaker:We make decisions every day - some can change our lives and those of our loved ones.
Speaker:But it is not only the individuals who make decisions.
Speaker:Companies, courts of law, governments and international organizations also make decisions,
Speaker:often on a large scale, that can affect our jobs, the justice system, and everyday life
Speaker:in a positive or negative way.
Speaker:Such decisions usually are made under incomplete information and uncertainty.
Speaker:The decision-makers often make correct decisions that will benefit our society, but they make
Speaker:incorrect decisions too.
Speaker:The cost of the latter can sometimes be devastating, ranging from personal tragedies to changing
Speaker:the course of human history.
Speaker:But let’s not run so far ahead.
Speaker:Suppose you are making an investment decision for your retirement.
Speaker:Investment funds report their average returns for the past 5 years; you read a media report
Speaker:about the recent growth of the real estate market, and you hear about overnight millionaires
Speaker:who have made it big from investing in cryptocurrency.
Speaker:You also hear about those who lost their life savings because of wrong investments or scams.
Speaker:And there is always a catch in the fine print - “Past performance is not necessarily indicative
Speaker:of future performance."
Speaker:This means you are facing uncertainty in your investment decisions, and you should learn
Speaker:how to make a well-informed decision under this circumstance.
Speaker:If you make a decision after you sampled a range of different funds, compared them with
Speaker:those of real estate markets, and studied the future prospect of the world economy,
Speaker:learned from the investment gurus such as Warren Buffett and listened to your friends
Speaker:and advisors, then it is most likely that you have made an informed decision that will
Speaker:bring a handsome payoff eventually.
Speaker:This is, in a way, “statistical thinking”; you sample the population and learn from it
Speaker:to make an informed decision.
Speaker:The more diverse and informative your sample’s elements are, the more likely it is that you
Speaker:have made the right decision.
Speaker:This book will show you how to understand statistics as a layman and make informed decisions
Speaker:with the help of statistical thinking.
Speaker:The problem is that statistics can easily be manipulated and misinterpreted.
Speaker:If statistical findings were always presented and utilized in an honest and correct way,
Speaker:the results wouldn’t always be as rosy.
Speaker:We often see distorted and misguided numbers and outcomes, even though that was not the
Speaker:intention of those who report statistics.
Speaker:This book is intended to help readers gain better understanding and decision-making skills
Speaker:– the kind that professional statisticians possess.
Speaker:In the first chapter, we will review the definitions and basic concepts of statistics.
Speaker:As this is a book on statistics, introducing mathematical details is inevitable.
Speaker:However, these details will only be presented when necessary, without providing the full
Speaker:theoretical background.
Speaker:Chapter 1 - Definition and Basic Concepts.
Speaker:1.
Speaker:Sample versus population.
Speaker:An investor wishes to know the five-year average return from investing in the U. S. stock market.
Speaker:There are nearly 2,400 stocks (as of August 2022) listed on the NYSE (New York Stock Exchange),
Speaker:and they must select a manageable number of stocks to form a portfolio of stocks.
Speaker:However, they don’t need to calculate the average return of all 2,400 stocks.
Speaker:Some stocks are not worth investing in because their returns are too low or their risk is too high.
Speaker:Our investor will need to select a set of stocks that suits their investment style.
Speaker:In this example, the collection of all stocks in the NYSE is called the population in statistical
Speaker:jargon, and a subset of all stocks is called a sample.
Speaker:Collecting the information from all the members of the population is too costly and time-consuming
Speaker:and even unnecessary.
Speaker:We can obtain a good indicator of average return by looking at a sample.
Speaker:The way we select the sample is critically important, and it depends largely on the purpose
Speaker:of the study or the aim of the statistical task at hand.
Speaker:Suppose the investor’s aim is to achieve a steady return with relatively low risk by
Speaker:investing in big and stable companies.
Speaker:Then a good sample is the Dow Jones index, which comprises the stocks of 30 prominent
Speaker:companies, such as Boeing, Coca-Cola, Microsoft, and Procter & Gamble.
Speaker:If the investor’s goal is to achieve a higher return with higher growth, albeit taking a
Speaker:higher risk, the NASDAQ-100 index is a good sample that mainly includes the top technology
Speaker:and IT stocks, such as Amazon, Apple, eBay, and Google.
Speaker:By looking at the average returns of these indices, the investor can get a clear indication
Speaker:and impression of the performance of these stocks.
Speaker:Seasoned investors can select their own sample based on their aim and risk-return preference.
Speaker:The important point is that the sample should be a good representation of the target population.
Speaker:If the investor wants safe and steady investment returns, but their sample represents high-risk
Speaker:stocks, they may not effectively achieve the aim of their investment.
Speaker:Hence, the target population should be determined in consideration of the aim of the statistical
Speaker:study.
Speaker:A sample that is a good representation of the population can be obtained by pure random
Speaker:sampling.
Speaker:The members of the population are selected randomly with an equal chance.
Speaker:For example, in political polls, all eligible voters should be treated equally.
Speaker:In this situation, the most effective way of selecting an unbiased and representative
Speaker:sample is random sampling, where the members of the eligible voters are selected with equal
Speaker:chance, with no pre-selection or exclusions.
Speaker:In a later chapter, we will discuss an example of one of the most disastrous polling outcomes
Speaker:in history, which occurred due to a violation of this random sampling principle.
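The random sampling principle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the voter roll is a hypothetical list of ID numbers, and `draw_random_sample` is a helper name introduced here, not something from the book.

```python
import random


def draw_random_sample(population, sample_size, seed=None):
    """Simple random sampling without replacement.

    Every member of the population has the same chance of selection,
    with no pre-selection or exclusions.
    """
    rng = random.Random(seed)
    return rng.sample(population, sample_size)


# Hypothetical voter roll: IDs 1..10000 stand in for all eligible voters.
voters = list(range(1, 10001))
poll_sample = draw_random_sample(voters, 1000, seed=42)

# 1000 distinct voters were drawn.
print(len(poll_sample), len(set(poll_sample)))
```

Fixing the seed only makes the illustration repeatable; a real poll would not need it.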
Speaker:2.
Speaker:Descriptive statistics.
Speaker:Descriptive statistics is a branch of statistics where the sample features are presented with
Speaker:a range of summary statistics and visualization methods.
Speaker:The summary statistics include the mean and median, which describe the centre of the sample
Speaker:values, and the variance and standard deviation are the measures of the variability of the
Speaker:sample values.
Speaker:Visualization methods include plots, charts, and graphs, which are used to make a visual
Speaker:impression about the distribution of the sample values.
Speaker:1.1.
Speaker:Mean and median.
Speaker:The mean refers to the average of a set of values.
Speaker:It is computed by adding the numbers and dividing the total by the number of observations.
Speaker:The mean is the average of the sample values of size n, with each individual point given
Speaker:the weight of 1/n.
Speaker:The formula for the mean can be written as,
Speaker:\(\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\), (1)
Speaker:where (X1, X2,…, Xn) represent the data points and n is called the sample size.
Speaker:That is, the sample mean is the sum of all sample points divided by the sample size.
Speaker:Alternatively, it can be interpreted as a weighted sum of all data points with an equal
Speaker:weight of 1/n.
Speaker:The median is the middle number in a sequence of numbers.
Speaker:To find the median, organize each number in order by size; the number in the middle is
Speaker:the median. In statistical terms, the median is defined as the middle value of (X1, X2,
Speaker:…, Xn) when sorted in ascending or descending order.
Speaker:Consider a simple example of (X1, …, Xn) = (1, 2, 3, 4, 5) and n = 5.
Speaker:The sum of all X’s is 15 (1+2+3+4+5=15), and the sample mean is 3 (15/5=3).
Speaker:The middle value of (1, 2, 3, 4, 5) is 3.
Speaker:In this case, the sample’s mean and median are the same.
Speaker:In general, the mean and median values are different, and the median is widely used where
Speaker:there are possible extreme values in the sample points.
Speaker:Consider the sample points with an extreme observation (X1, …, Xn) = (1, 2, 3, 4, 20),
Speaker:then the sample mean is 6 (1+2+3+4+20 = 30; 30/5=6), and the median is still 3 as the
Speaker:middle value of the distribution (1, 2, 3, 4, 20).
Speaker:If this extreme value is unusual and does not represent the target population, then
Speaker:the sample mean of 6 can be a misleading value because it was distorted by the presence of
Speaker:20.
Speaker:In this case, the median should be preferred to the mean.
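The two five-number examples above can be checked with Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 20]

# Symmetric sample: mean and median are both 3.
print(mean(symmetric), median(symmetric))

# One extreme value pulls the mean up to 6, while the median stays at 3.
print(mean(with_extreme), median(with_extreme))
```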
Speaker:A practical example of using the median over the mean is the case for house prices.
Speaker:For example, the researcher is interested in the average house price in a middle-class
Speaker:suburb.
Speaker:In such a suburb, there is still a chance that a big mansion or two on a large block
Speaker:of land may be included in the sale.
Speaker:However, these houses do not represent the general characteristics of the suburb, and
Speaker:it is reasonable to use the median in this case to find the average value free from the
Speaker:effect of these extreme values.
Speaker:The mean vs. median is closely related to the “skewness” of the distribution.
Speaker:If the distribution of the numbers you have is (more or less) symmetric around the mean
Speaker:as in (X1, …, Xn) = (1, 2, 3, 4, 5), the mean and median will be identical or practically
Speaker:the same.
Speaker:However, when the distribution of the numbers is asymmetric or skewed, then the mean and
Speaker:median can be different.
Speaker:For example, if the distribution is asymmetric, as in (X1, …, Xn) = (1, 2, 3, 4, 20), then
Speaker:the two values can be different.
Speaker:Photo source - Study.com.
Speaker:Graphical illustrations of the different shapes of the distribution and the positions of the
Speaker:mean and median are given above.
Speaker:Suppose the above is the distribution of the performance of all salespeople in a company.
Speaker:A symmetric distribution means the higher performers and lower performers are in the
Speaker:same or similar proportion; in which case the mean and median are almost identical.
Speaker:A positively skewed distribution means the presence of a small number of extremely capable performers.
Speaker:In this case, the mean of the sales is inflated by their performance.
Speaker:If the sales manager wants an average value that represents the performance of the “average
Speaker:salesperson”, then the use of median is appropriate.
Speaker:If she wants to know the average sales, including the performance of all salespeople in the
Speaker:company, then the use of the mean is appropriate.
Speaker:A similar interpretation can also be made from a negatively skewed distribution illustrated
Speaker:above.
Speaker:1.2.
Speaker:Variance and standard deviation.
Speaker:When analyzing or presenting a set of numbers, it is important to know the centre of the
Speaker:distribution.
Speaker:But understanding their dispersion and variability is also important.
Speaker:Consider two salespeople with the same or a similar number of mean sales in the past
Speaker:year.
Speaker:In evaluating who was a more consistent performer, the manager will compare the dispersions in
Speaker:their sales throughout the year.
Speaker:The measures of variability, namely the variance and standard deviation, show how widely spread the sample
Speaker:points are around the mean.
Speaker:The distance of each sample point from the mean is calculated as \(X_i - \bar{X}\), and these distances are squared
Speaker:to make them all positive.
Speaker:The average of all the squared distances from the mean is called the variance, which can
Speaker:be written as,
Speaker:\(s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\). (2)
Speaker:How this formula works will be explained in the table below.
Speaker:But it is, in a way, the average of the squared distance of the data points from the mean,
Speaker:i.e., \((X_i - \bar{X})^2\).
Speaker:The standard deviation (s) is defined as the square root of the variance, namely,
Speaker:\(s = \sqrt{s^2}\). (3)
Speaker:Since the variance is the distance of the sample points from the mean in squares, the
Speaker:standard deviation converts the value into the same unit as the original value of the
Speaker:sample points by taking the square root.
Speaker:\(X_i\) | \(X_i - \bar{X}\) | \((X_i - \bar{X})^2\)
Speaker:1 | -2 (= 1 - 3) | \((-2)^2 = 4\)
Speaker:2 | -1 (= 2 - 3) | \((-1)^2 = 1\)
Speaker:3 | 0 (= 3 - 3) | \(0^2 = 0\)
Speaker:4 | 1 (= 4 - 3) | \(1^2 = 1\)
Speaker:5 | 2 (= 5 - 3) | \(2^2 = 4\)
Speaker:Sum | | 10
Speaker:\(\bar{X} = 3\).
Speaker:Using the example we used above as an illustration, X = (1, 2, 3, 4, 5) and \(\bar{X} = 3\). The variance is the
Speaker:sum of the numbers in the last column of the table above divided by 4 (= n - 1), which is 10/4 = 2.5.
Speaker:The standard deviation is \(s = \sqrt{2.5} \approx 1.58\).
Speaker:The interpretation is that the sample points are, on average, 1.58 units away from the
Speaker:mean value of 3.
Speaker:Why the division (or weight) is by (n-1), not by n, is beyond the scope of this book,
Speaker:but it is to make the calculation more accurate when the sample size is small.
Speaker:When the sample size is large, the division by n or by (n-1) makes no practical difference.
Speaker:There are other variability measures around the median (e.g., the interquartile range), and
Speaker:they will be introduced in this book later.
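The variance and standard deviation calculation above can be reproduced step by step. The function name `sample_variance` is an illustrative choice; it follows the (n-1) divisor described in the text.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    squared_distances = [(x - x_bar) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)


x = [1, 2, 3, 4, 5]
s2 = sample_variance(x)   # 10 / 4 = 2.5
s = sqrt(s2)              # square root of 2.5, about 1.58

print(s2, round(s, 2))
```

Dividing by n instead of (n-1) would give 10/5 = 2.0 for this small sample, which shows why the choice matters mainly when n is small.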
Speaker:3.
Speaker:Sample statistics and population parameters.
Speaker:The sample mean (\(\bar{X}\)) and standard deviation (s) are the statistics calculated from a sample.
Speaker:The sample is a subset of the population, which also has the mean and standard deviation
Speaker:(the median and variance as well).
Speaker:When we use statistics, what we eventually want to know is the population values (also
Speaker:called the population parameters), such as the mean and standard deviation.
Speaker:The population mean and standard deviation are often written with Greek letters as \(\mu\) and
Speaker:\(\sigma\), values that are typically unknown.
Speaker:Suppose you want to know the mean household income of California.
Speaker:If you visit all the households in California to find their mean income, as in a census,
Speaker:you are looking for the value of \(\mu\).
Speaker:However, such an exercise is often neither feasible nor necessary.
Speaker:A good representative sample can tell us a lot about \(\mu\), as we shall see later.
Speaker:We can gather a random sample of 1,000 households to find their income, and this will give the
Speaker:value of the sample mean (\(\bar{X}\)).
Speaker:If the sample was a good representation of the population, it is likely the sample mean
Speaker:is a good indicator for the population mean.
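Speaker:The population mean, variance, and standard deviation can be written formally as,
Speaker:\(\mu = \frac{1}{N}\sum_{i=1}^{N} X_i\), (4)
Speaker:\(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2\), (5)
Speaker:\(\sigma = \sqrt{\sigma^2}\), (6)
Speaker:where N is the population size and \((X_1, X_2, \ldots, X_N)\) represent the population values.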
Speaker:The formulae above are similar to their sample counterparts in (1) to (3), hence their interpretations
Speaker:are similar, but they are the values of the population.
Speaker:In our example, N is the number of the total households in California, and \((X_1, X_2, \ldots, X_N)\) are their
Speaker:incomes.
Speaker:If 1,000 households are selected randomly and their mean income is found to be $75,000,
Speaker:then \(\bar{X}\) = $75,000 with n = 1,000.
Speaker:It is hoped that this value of the sample mean is in the close neighbourhood of the true
Speaker:value of the population mean.
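The idea that a random sample's mean lands near the population mean can be illustrated with a small simulation. Everything below is an assumption made for the sketch: the population is 100,000 synthetic incomes drawn from a log-normal distribution, not real California data, and the parameter values are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(2022)

# Hypothetical population of household incomes (synthetic, for illustration only).
population = [rng.lognormvariate(11.0, 0.6) for _ in range(100_000)]
mu = mean(population)        # population mean, normally unknown in practice

# Random sample of n = 1,000 households.
sample = rng.sample(population, 1000)
x_bar = mean(sample)         # sample mean, used as an estimate of mu

print(round(mu), round(x_bar))
```

In practice only `x_bar` is observed; `mu` is computed here only because the population is simulated.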
Speaker:Let us take another example.
Speaker:Consider a fictitious country with 1 million (N) eligible voters who are voting for
Speaker:their President.
Speaker:A candidate should have the support rate of more than 0.5 to get elected.
Speaker:The true value of the support rate (\(\mu\)) is unknown, and what matters is this value on
Speaker:the election date.
Speaker:A poll is conducted from a sample of 1,000 (n) eligible voters, 10 days before the election
Speaker:date.
Speaker:The support rate found in the poll is the sample mean (\(\bar{X}\)).
Speaker:Suppose this sample value (\(\bar{X}\)) is 50.1 per cent.
Speaker:This value is called an estimate of the population parameter (\(\mu\)).
Speaker:If the sample is a good representation of the population, this estimate, the sample mean,
Speaker:is an indicator for the value of \(\mu\), 10 days before the election date.
Speaker:4.
Speaker:Descriptive statistics for relative position.
Speaker:Suppose your IQ score is 115.
Speaker:A natural question is how smart are you (according to the IQ score only) relative to the other
Speaker:people in the sample or population.
Speaker:Suppose your annual income is $50,000.
Speaker:You want to know how rich or how poor you are relative to the others in the sample or
Speaker:population.
Speaker:You ran a marathon, and you completed the race with a record of 3 hours.
Speaker:You want to know your rank in the race and where your rank stands relative to all the
Speaker:participants of the race.
Speaker:These questions are asking for a relative position, another important question in statistics.
Speaker:The popular measures of relative positions are percentiles (sometimes called quantiles)
Speaker:and quartiles.
Speaker:Percentiles (quantiles).
Speaker:With percentiles, we divide the distribution of the numbers into 100 positions.
Speaker:For example, the 90th percentile represents the value in the sample that has 10% of the
Speaker:sample points higher and 90% of the values lower than it.
Speaker:That is, if your IQ score of 115 is said to be the 90th percentile, this means you are
Speaker:at the top 10% of the distribution of all IQ scores.
Speaker:Suppose your income of $50,000 is the 40th percentile of the distribution, then it means
Speaker:your income is at the bottom 40% of the distribution.
Speaker:That is, if there were 1000 people in the sample, your income stands at the 400th position
Speaker:when all incomes are sorted in ascending order.
Speaker:Similarly, among the 100 runners who participated in the marathon event, suppose your record
Speaker:of 3 hours is at the 75th percentile.
Speaker:This means your record is in the top 25%, and there are 24 runners who finished the race
Speaker:with a better record than yours, and 75 of them were behind you.
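A percentile position like the ones above can be computed by counting how many values fall below a given value. The helper `percentile_rank` and the ten incomes are hypothetical, and "strictly below" is just one of several conventions in use.

```python
def percentile_rank(values, x):
    """Percentage of values that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical annual incomes for ten people, in thousands of dollars.
incomes = [20, 25, 30, 40, 50, 60, 70, 85, 100, 150]

# Four of the ten incomes are below 50, so 50 sits at the 40th percentile.
print(percentile_rank(incomes, 50))
```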
Speaker:Quartiles.
Speaker:Quartiles are similar to percentiles, but instead of dividing the distribution of the numbers
Speaker:into 100 positions, they are based on the division into 4, as the following table shows.
Speaker:The first quartile is the value whose position is at the bottom 25%, and it is the same as
Speaker:the 25th percentile.
Speaker:The second quartile is the 50th percentile, which is also the median.
Speaker:If we go back to your marathon record, your record of 3 hours is the third quartile of
Speaker:the distribution.
Speaker:Interquartile range.
Speaker:An interquartile range is defined as the difference between the 3rd and 1st quartiles of the
Speaker:distribution.
Speaker:It is a measure of variability or dispersion of a distribution alternative to the standard
Speaker:deviation.
Speaker:As the difference between the 3rd and 1st quartiles, it is the length of the interval that contains
Speaker:the (middle) 50% of the data points around the median.
Speaker:Similarly to the median, the interquartile range is not sensitive to a few extreme values
Speaker:in the distribution, while standard deviation can be inflated by extreme values.
Speaker:More examples will follow for the interquartile range.
Speaker:As an example, consider two suburbs whose median house prices are similar at 1 million
Speaker:dollars.
Speaker:The researcher finds the first suburb has the 1st quartile at $750,000 and the 3rd
Speaker:quartile at $1.25 million, with the interquartile range of $500,000 ($1.25 million - $750,000).
Speaker:The second suburb has the 1st quartile at $500,000 and the 3rd quartile at $1.5
Speaker:million, with the interquartile range of 1 million dollars ($1.5 million - $500,000).
Speaker:The interval that contains the middle 50% of the house prices is much longer in the
Speaker:second suburb, which indicates the variability of house prices is substantially larger in
Speaker:the second suburb.
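The interquartile range and its insensitivity to extreme values can be sketched with `statistics.quantiles`. The eleven house prices are hypothetical, chosen so that the quartiles match the first suburb in the example (in thousands of dollars); the default quantile method of the standard library is assumed.

```python
from statistics import quantiles, stdev


def interquartile_range(values):
    """3rd quartile minus 1st quartile."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical house prices in thousands of dollars.
prices = [600, 700, 750, 800, 900, 1000, 1100, 1200, 1250, 1300, 1400]
# Same suburb, but the most expensive house is replaced by a mansion.
with_mansion = prices[:-1] + [5000]

# The interquartile range is the same for both lists.
print(interquartile_range(prices), interquartile_range(with_mansion))
# The standard deviation is inflated by the mansion.
print(round(stdev(prices)), round(stdev(with_mansion)))
```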
Speaker:5.
Speaker:Data Visualization.
Speaker:Visualization is a powerful way of understanding the key features of a sample and making impressions.
Speaker:It often makes a better and stronger impression about the data characteristics than a table
Speaker:full of numbers.
Speaker:Consider an investor who wishes to invest in U. S. stocks.
Speaker:They gather the sample for NASDAQ-100 index and want to know how the index and its return
Speaker:have performed in the last 5 years to December 2021.
Speaker:Figure 1 presents the line charts (time plots) of the index and its return (growth rate) in percentage,
Speaker:monthly from 2017 to 2021.
Speaker:The index has been growing with an upward trend for the last 5 years, and the trend
Speaker:gets steeper from early 2020.
Speaker:The monthly return fluctuates around 0, with most values between -10% and 10%.
Speaker:These plots provide a clear impression of how the index has performed in the last five
Speaker:years.
Speaker:Figure 1 - Time plots of NASDAQ-100 index and return.
Speaker:Data source - Yahoo Finance.
Speaker:A histogram is another popular method of data visualization that presents the frequencies
Speaker:of data points over the intervals of sample points.
Speaker:It is a useful method of presenting the distributional shape of the sample points.
Speaker:Figure 2 presents the histogram of the monthly returns, which shows the monthly returns are
Speaker:centred between 0% and 5%, and most of the values are in the range of -10% and 10%.
Speaker:The sample mean value of the monthly return is 2.01%, and their median is 2.68%, so the
Speaker:index has been increasing at an average growth rate of just above 2% per month.
Speaker:The standard deviation is 4.92%, which indicates the average deviation of the monthly returns
Speaker:from the mean has been around 5%.
Speaker:By combining the plots and summary statistics, the investor can learn about the performance
Speaker:of the index in detail.
Speaker:Figure 2 - Histogram of Returns from NASDAQ-100 index.
Speaker:Data source - Yahoo Finance.
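What a histogram does, counting data points over intervals, can be sketched without any plotting library. The ten monthly returns below are hypothetical values, not the Yahoo Finance data, and `histogram_counts` is a name introduced for this sketch.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall in each interval [left, left + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical monthly returns in percent.
returns = [-6.1, -2.3, 0.4, 1.2, 2.7, 3.1, 3.8, 4.4, 5.6, 8.9]

# Print one row of '#' marks per 5-point interval.
for left_edge, count in histogram_counts(returns, 5).items():
    print(f"[{left_edge:>5}, {left_edge + 5:>5}) {'#' * count}")
```

With these made-up values, most points land in the interval from 0% to 5%, which is the kind of shape the histogram in the text describes.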
Speaker:6.
Speaker:Comparing alternative distributions.
Speaker:Now suppose the investor wishes to compare the performance of the NASDAQ-100 with the
Speaker:Apple stock (APPL) for the same period.
Speaker:The following table compares the basic statistics discussed so far.
Speaker:Monthly returns (%) for two alternative investments.
Speaker:Statistic | NASDAQ-100 | APPL
Speaker:Mean | 2.01 | 3.02
Speaker:Median | 2.68 | 5.00
Speaker:Standard Deviation | 4.92 | 8.34
Speaker:1st Quartile | -0.18 | -1.66
Speaker:3rd Quartile | 5.13 | 9.25
Speaker:10th percentile | -5.89 | -7.35
Speaker:90th percentile | 7.37 | 12.27
Speaker:Data source - Yahoo Finance.
Speaker:The figures in this table reveal many details of the two investment alternatives -
Speaker:•The average return from NASDAQ-100 is substantially lower than that from APPL. The mean and median of the
Speaker:former are 2.01% and 2.68% per month, but those of APPL are 3.02% and 5.00%.
Speaker:•In both cases, the median is larger than the mean, especially for APPL. This means
Speaker:the distribution is skewed to the left, with the presence of extremely low returns.
Speaker:In other words, when they go down, they can go down deep!
Speaker:(Especially APPL!).
Speaker:•The variability is a lot higher for the returns from APPL. The standard deviation
Speaker:of APPL (8.34) is about 1.7 times that of NASDAQ-100 (4.92).
Speaker:This means APPL has a lot larger variation around the mean.
Speaker:•The interquartile range for APPL is 10.91 (9.25 + 1.66) and that of NASDAQ-100 is 5.31
Speaker:(5.13 + 0.18).
Speaker:The interval that contains the middle 50% of the returns around the median is again
Speaker:about twice as long for APPL.
Speaker:•The worst outcome with a 10% chance (the 10th percentile) for APPL has been -7.35%, and that for NASDAQ-100
Speaker:has been -5.89%.
Speaker:The best outcome with a 10% chance (the 90th percentile) for APPL has been 12.27% a month, and that
Speaker:for NASDAQ-100 has been 7.37%.
Speaker:The comparison of these descriptive statistics reveals that monthly returns are a lot higher
Speaker:for APPL investment, but it shows substantially higher variability or risk.
Speaker:This is a well-known principle in finance - a higher return is compensation for taking
Speaker:a higher risk.
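The comparisons drawn from the table can be recomputed directly from its figures. The dictionary layout and the `summarize` helper are illustrative choices; the numbers are the ones reported in the table.

```python
# Monthly return statistics (%) copied from the table above.
nasdaq = {"mean": 2.01, "median": 2.68, "sd": 4.92, "q1": -0.18, "q3": 5.13}
appl = {"mean": 3.02, "median": 5.00, "sd": 8.34, "q1": -1.66, "q3": 9.25}


def summarize(stats):
    """Interquartile range, plus whether the median exceeds the mean (left skew)."""
    return {
        "iqr": round(stats["q3"] - stats["q1"], 2),
        "median_above_mean": stats["median"] > stats["mean"],
    }


print(summarize(nasdaq))   # IQR 5.31, median above mean
print(summarize(appl))     # IQR 10.91, median above mean

# Ratio of the two standard deviations, about 1.7.
sd_ratio = round(appl["sd"] / nasdaq["sd"], 1)
print(sd_ratio)
```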
Speaker:The above plots present the histograms for the two investments.
Speaker:A larger variability of the APPL with a heavier skew to the left of the distribution than
Speaker:NASDAQ-100 is clear.
Speaker:While the summary statistics tell the difference with the numbers, these histograms can make
Speaker:a visual comparison.
Speaker:To make a further visual comparison, another method of visualisation called the “Box-Whisker”
Speaker:plot is introduced.
Speaker:It plots the mean, the median, the 1st quartile, the 3rd quartile, maximum and minimum, along
Speaker:with outliers.
Speaker:The box in the middle is based on the 3rd quartile and 1st quartile, and the height
Speaker:of the box represents the interquartile range.
Speaker:Outliers are determined by a certain criterion (e.g., the outliers are defined as those lying
Speaker:more than three standard deviations away from the mean).
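The three-standard-deviation criterion mentioned above can be sketched as follows. The twenty monthly returns are hypothetical, and other criteria (such as ones based on the interquartile range) are also used in practice.

```python
from statistics import mean, stdev


def outliers_by_sd(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]


# Hypothetical monthly returns in percent, with one very bad month.
monthly_returns = [1, 2, 3, 2, 1, 3, 2, 4, 2, 3,
                   1, 2, 3, 2, 4, 3, 2, 1, 2, -25]

# Only the -25 month is flagged.
print(outliers_by_sd(monthly_returns))
```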
Speaker:Again, the APPL investment gives a substantially higher median return per month, but its monthly
Speaker:variability is much higher than NASDAQ-100.
Speaker:Which investment to choose depends on how risk-averse or risk-tolerant the investor
Speaker:is.
Speaker:If you are a Braveheart and enjoy a roller coaster ride, investing in APPL is not a bad
Speaker:choice; otherwise, stick to the NASDAQ-100 for a safer option.
Speaker:7.
Speaker:Normal distribution.
Speaker:Figure 2 presents a distribution of the sample points using a histogram.
Speaker:In statistics, distribution is an important feature for both the sample and the population.
Speaker:While we can observe a distribution of the sample as in Figure 2, that of the population
Speaker:is often unknown and not observable.
Speaker:Understanding the features of a distribution is one of the fundamental questions of statistics.
Speaker:For example, what is the chance that investing in the NASDAQ-100 index will provide a return
Speaker:greater than 2%?
Speaker:What proportion of the households in California has a lower annual income than $50,000?
Speaker:We can only guess using the distribution of the sample we observe.
Speaker:Again, if the sample is a fair representation of the population, the distribution of the
Speaker:sample can well reflect the distribution of the population.
Speaker:On the other hand, there are several known distributions in statistics where the probability
Speaker:can be calculated using the given values of the parameters, such as the mean and standard
Speaker:deviation.
Speaker:Among them, the most fundamental and popular is the normal distribution.
Speaker:It is also a key distribution in the inferential statistics to be discussed in the next chapter.
Speaker:Normal distribution is a bell-shaped distribution, symmetric around its mean (or median), and
Speaker:the probability at any point of the distribution is known.
Speaker:A normal distribution with a mean of \(\mu\) and a standard deviation of \(\sigma\) is written as \(N(\mu, \sigma)\).
Speaker:In the special case of the mean being zero and the standard deviation 1, it is called
Speaker:the standard normal distribution, and it is denoted as \(N(0, 1)\).
Speaker:Figure 3 is a screenshot from an online calculator.
Speaker:Figure 3 - Standard normal distribution.
Speaker:Given the values of the mean and standard deviation, any probability between an interval
Speaker:can be calculated.
Speaker:Figure 3 shows a normal distribution with zero mean and a standard deviation of 1 (called
Speaker:the standard normal distribution).
Speaker:Suppose your return (in percentage) from an investment follows the standard normal distribution.
Speaker:The probability that your return is between -1.96% and 1.96% is calculated to be 0.95
Speaker:(dark area on the bell illustration).
Speaker:This also means the probability of the tail areas is 5% (white area on the bell illustration).
Speaker:Your investment return can be lower than -1.96% with the probability of 0.025 and can take
Speaker:a value greater than 1.96% with the probability 0.025.
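The probabilities quoted above can be checked with `statistics.NormalDist` from the Python standard library, used here in place of the online calculator the text refers to.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of a return between -1.96% and 1.96% (about 0.95).
inside = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

# Probability of each tail (about 0.025).
lower_tail = standard_normal.cdf(-1.96)
upper_tail = 1 - standard_normal.cdf(1.96)

print(round(inside, 3), round(lower_tail, 3), round(upper_tail, 3))
```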
Speaker:Let’s assume the household income in California follows a normal distribution with a mean of $75,000 and
Speaker:a standard deviation of $30,000 (see Figure 4).
Speaker:Then, the household income distribution of California is represented by the bell curve
Speaker:in Figure 4.
Speaker:The probability that a household income is less than $50,000 or the proportion of the
Speaker:households with income less than $50,000 is represented by the dark area in the distribution,
Speaker:which is 0.20 approximately.
Speaker:In other words, if you pick a household at random, you have a 0.20 chance to bump into
Speaker:one with an income less than $50,000.
Speaker:This also means the chance of a randomly selected household having an income higher than $50,000
Speaker:is around 0.80 (= 1 - 0.2023).
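The household income example can be reproduced the same way, taking the assumed mean of $75,000 and standard deviation of $30,000 from the text.

```python
from statistics import NormalDist

# Assumed income distribution from the example above.
income = NormalDist(mu=75_000, sigma=30_000)

p_below = income.cdf(50_000)   # proportion with income below $50,000
p_above = 1 - p_below          # proportion with income above $50,000

print(round(p_below, 4), round(p_above, 4))
```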
Speaker:Figure 4 - Application of a normal distribution.
Speaker:8.
Speaker:Checking the normality of a distribution.
Speaker:Normal distribution is the most fundamental and popular distribution in statistics, and
Speaker:it is widely used as a “benchmark” distribution or as an “approximation” to the true distribution
Speaker:when it is unknown.
Speaker:Being a benchmark or approximation means it may be sometimes useful, but sometimes not,
Speaker:depending on the context and situation.
Speaker:Figure 5 is the histogram we have seen in Figure 2, the returns from NASDAQ-100 investment,
Speaker:overlaid with the normal distribution with the same mean and standard deviation values
Speaker:of the returns.
Speaker:While the histogram shows a similar shape to the normal distribution, with near symmetry
Speaker:and bell curve, the fine details are not impressively consistent with the normal distribution.
Speaker:While an approximation by a normal distribution to a stock return distribution is sometimes
Speaker:used, it is generally accepted that a stock return distribution shows a clear departure
Speaker:from a normal distribution.
Speaker:Figure 5 - Histogram of the NASDAQ-100 and APPL returns and a normal curve.
Speaker:The Q-Q (quantile-quantile) plot provides a clearer way of checking the normality of
Speaker:a sample distribution using a graphical method.
Speaker:It connects the sample quantiles (or percentiles) with the (theoretical) quantiles from the
Speaker:normal distribution.
Speaker:If the sample follows a normal distribution, then its percentiles should
Speaker:match the percentiles from the normal distribution with the same mean and standard deviation.
Speaker:For a sample that follows the standard normal distribution, the 97.5th percentile should match 1.96, and
Speaker:the 50th percentile from the sample distribution should be 0, which is the 50th percentile
Speaker:from the standard normal distribution.
Speaker:An example of the Q-Q plot is given here.
Speaker:The grid lines are at (-1.96, 0, 1.96) for both axes, which are the 2.5th, 50th, and
Speaker:97.5th percentiles from the standard normal distribution.
Speaker:The y-axis (vertical) represents the sample quantile, and the x-axis (horizontal) represents
Speaker:the theoretical quantiles from the normal distribution.
Speaker:In this plot, the sample quantiles and the theoretical quantiles match exactly at these grid lines.
Speaker:Hence, any sample that shows a Q-Q plot like the one above can be well approximated by
Speaker:a normal distribution.
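The points of a Q-Q plot can be computed by pairing sorted sample values with quantiles of a normal distribution that has the same mean and standard deviation. This is a minimal sketch: the `qq_points` helper, the five-value sample, and the plotting position (i + 0.5) / n are all assumptions made here, and other plotting positions are in common use.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Assumed plotting position: the i-th sorted value sits at probability (i + 0.5) / n.
    theoretical = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical sample values.
sample = [-1.2, 0.3, 2.1, 0.8, -0.5]
for theoretical_q, sample_q in qq_points(sample):
    print(round(theoretical_q, 2), sample_q)

# The 97.5th percentile of the standard normal distribution is about 1.96.
print(round(NormalDist().inv_cdf(0.975), 2))
```

If the pairs fall close to a straight 45-degree line, the normal approximation is reasonable; large departures suggest it is poor.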
Speaker:Figure 6 - Q-Q plots for NASDAQ-100 and APPL returns.
Speaker:The grid lines are at (-1.96, 0, 1.96) for both axes.
Speaker:Figure 6 presents the Q-Q plots for the NASDAQ-100 and APPL returns.
Speaker:The NASDAQ-100 return shows a reasonable match with the normal quantiles,
Speaker:while the quantiles of the APPL return show substantial departures from the normal quantiles.
Speaker:This indicates that, while the NASDAQ-100 returns may be approximated by a normal distribution
Speaker:with reasonable accuracy, a normal distribution will be a poor approximation to the APPL return
Speaker:distribution.
Speaker:9.
Speaker:Concluding remarks.
Speaker:As an opening chapter, the basic concepts and descriptive measures of statistics were
Speaker:discussed with the following keywords -
Speaker:•Sample and population.
Speaker:•Mean and Median.
Speaker:•Standard deviation and Inter-quartile range.
Speaker:•Percentiles and quartiles.
Speaker:•Histogram, Time plots, Q-Q plot, Box-Whisker plot.
Speaker:•Normal distribution.
Speaker:If you understand the listed concepts and methods, and you can apply them to real-world
Speaker:situations, you already have made big steps into the world of statistical thinking!
Speaker:You can produce these statistics using popular tools such as Excel.
Speaker:This has been the art of statistical thinking, detect misinformation, understand the world
Speaker:deeper
Speaker:and make better decisions. Advanced Thinking Skills Book 3. Written by Albert Rutherford,
Speaker:J. H. Kim, PhD. Narrated by Russell Newton. Copyright 2022 by Albert Rutherford. Production
Speaker:copyright by Albert Rutherford.