An introduction to statistics for dummies

13 minute read

Published:

A very short introduction to the field of statistics.

Introduction

Statistics is a field of mathematics focused on gathering, analyzing, interpreting, presenting, and organizing data. It plays a vital role in data science, enabling the exploration and comprehension of large datasets. Through statistical methods, data scientists can uncover patterns, identify trends, and make predictions. It also provides the tools to draw conclusions about an entire population based on a smaller sample and to test hypotheses about relationships between different variables.

In data science, statistics is applied across a variety of tasks, including data exploration, cleaning, feature selection, model evaluation, and hypothesis testing.

For instance, during data exploration, a data scientist might use descriptive statistics like mean, median, and standard deviation to understand the data’s distribution and key characteristics. While preparing and cleaning the data, statistical techniques such as detecting outliers, imputing missing values, and normalizing data are commonly employed.

In feature selection, statistical methods like correlation and regression analysis help identify the most relevant variables for solving the specific problem.

For model selection, techniques such as cross-validation and A/B testing are used to ensure the model performs reliably on new, unseen data and is capable of generalizing effectively.

In short, statistics is an essential tool for data scientists, as it provides the means to extract insights and make predictions from data.

Types of Statistics:

  • Theoretical Statistics
  • Applied Statistics

Inferential Statistics

Inferential statistics is a branch of statistics that focuses on drawing conclusions about a population based on data collected from a sample. It leverages probability and statistical models to make predictions, estimates, and generalizations about a larger group using a subset of data. Common techniques in inferential statistics include hypothesis testing, estimation, and correlation analysis, which are used to predict outcomes, estimate population parameters, and derive meaningful insights from sample data.

Example of Inferential Statistics in Action:

Suppose a researcher wants to determine the average income of all adults in a specific city. Since surveying every adult would be impractical, the researcher selects a sample of 1,000 adults and records their incomes. Using inferential statistics, the researcher can estimate the average income for the entire population based on the sample.

From the sample data, the researcher calculates:

  • The mean income of the sample: $50,000
  • The standard deviation of the sample: $10,000

Using these values and statistical methods like a z-test, the researcher can estimate the population’s average income. Assuming a normal distribution, the researcher might conclude, with 95% confidence, that the true mean income of all adults in the city falls between $49,500 and $50,500.

This approach highlights how inferential statistics enables researchers to draw meaningful conclusions about a population without needing to analyze every individual within it.

Descriptive Statistics

Descriptive statistics is a branch of statistics focused on summarizing and describing the main features of a dataset. It provides a way to organize and present sample data clearly but does not involve drawing conclusions or making inferences about a larger population.

Types of Descriptive Statistics:

  1. Measures of Central Tendency: These summarize the central point of a dataset and include metrics like the mean, median, and mode.
  2. Measures of Variability: These describe the spread or dispersion of data and include metrics like range, variance, and standard deviation.

Example of Descriptive Statistics in Action

Imagine a researcher wants to analyze the height of students in a particular school. To do this, they measure the heights of a sample of 100 students. Using descriptive statistics, the researcher summarizes the data by calculating key measures, such as the mean, median, and mode. They can also visualize the data using tools like a frequency distribution table and a histogram to better understand the overall distribution.

From their analysis, the researcher finds:

  • The mean height is 5.6 feet.
  • The median height is 5.5 feet.
  • The mode height is also 5.6 feet.

Additionally, they observe that most students’ heights fall within the range of 5.4 to 5.8 feet.

This example highlights how descriptive statistics help summarize and present data in a meaningful way, offering a clear understanding of the dataset without making inferences about a larger population.

Basic Statistics

  • Introduction to Probability
  • Addition Rule in Probability
  • Multiplication Rule in Probability
  • Understanding Population and Sample
  • Measures of Central Tendency (Mean, Median, Mode)
  • Measures of Dispersion (Variance, Standard Deviation)
  • Sampling Methods and Their Types
  • Population Mean vs. Sample Mean
  • Variables and Their Types
  • Frequency Distribution and Cumulative Frequency
  • Creating and Interpreting Histograms

Intermediate Statistics

  • Percentiles and Quantiles
  • Five-Number Summary
  • Interquartile Range (IQR)
  • Boxplots
  • Impact of Outliers and Methods for Their Removal
  • Probability Density Function (PDF)
  • Normal (Gaussian) Distribution and the Empirical Rule
  • Z-Score Calculation
  • Standardization vs. Normalization
  • Standard Normal Distribution
  • Central Limit Theorem
  • Chebyshev’s Inequality
  • Covariance
  • Pearson Correlation Coefficient

Advanced Statistics

  • QQ Plot (Quantile-Quantile Plot)
  • Bernoulli and Binomial Distributions
  • Log-Normal Distribution
  • Power-Law Distribution
  • Box-Cox Transformation
  • Overview of Data Transformation Techniques
  • Confidence Intervals in Statistics
  • Type I and Type II Errors
  • One-Tailed vs. Two-Tailed Tests
  • Hypothesis Testing Process
  • p-Value and Its Interpretation
  • Steps for Conducting Hypothesis Testing
  • T-Test
  • Z-Test
  • ANOVA (Analysis of Variance) Test
  • Chi-Square Test

In this article, we will only cover the basic of statistics.

Probability

Probability is a measure that quantifies the likelihood of an event happening. It is expressed as a value between 0 and 1, where 0 means the event is impossible, and 1 means the event is certain. For instance, when flipping a fair coin, the probability of it landing on heads is 0.5, or 50%. This is because there are two equally likely outcomes — heads and tails.

Addition Rule in Probability

The addition rule in probability explains how to calculate the probability of the union of two mutually exclusive events, which are events that cannot happen simultaneously. According to the rule, the probability of either event A or event B occurring is the sum of their individual probabilities:

Formula: P(A or B) = P(A) + P(B)

Example: When flipping a coin, the probability of landing on heads (H) is 0.5, and the probability of landing on tails (T) is also 0.5. These events are mutually exclusive because the coin cannot land on both heads and tails at the same time. Using the addition rule:

P(H or T) = P(H) + P(T) = 0.5 + 0.5 = 1

This shows that it is certain that the coin will land on either heads or tails.

Multiplication Rule in Probability

The multiplication rule, also known as the conditional probability formula, is used to determine the probability of two events occurring together (A and B). It states that the probability of both events happening is the probability of event A multiplied by the probability of event B occurring, given that A has already occurred.

Formula: P(A and B) = P(A) * P(BA)

Example: Consider a bag containing 6 red marbles and 4 blue marbles.

  • The probability of drawing a red marble is 6/10.
  • After removing a red marble, the probability of drawing a blue marble becomes 4/9​ (as one marble is already removed).

Using the multiplication rule, the probability of drawing a red marble first and then a blue marble is:

P(red and blue) = P(red) * P(bluered) P(red and blue) = 3/5∗4/9 = 12/45 = 4/15​.

Population and Sample

In statistics, population and sample are essential concepts. A population includes all individuals or objects that a researcher wants to study. For example, a population could consist of all adults in a city or all students in a school district.

A sample is a smaller subset of the population chosen to represent the whole. Researchers use samples to draw conclusions about the population without needing to study every individual.

Example: If a researcher is studying adults in a city, they might select a sample of 100 individuals from the population to gather insights.

Sampling Methods:

  1. Simple Random Sampling: Individuals are selected randomly, giving each member of the population an equal chance of being chosen.
  2. Stratified Sampling: The population is divided into subgroups (strata), and samples are drawn from each subgroup to ensure representation.

For reliable results, the sample must accurately represent the population.

Measures of Central Tendency

Mean: The mean, or arithmetic average, is calculated by summing all the values in a dataset and dividing by the total number of values. It is a widely used measure of central tendency but is sensitive to outliers or extreme values, which can significantly affect its value.

Median: The median is the middle value of a dataset when it is ordered from smallest to largest. Unlike the mean, the median is not influenced by outliers or extreme values, making it a robust measure of central tendency. If the dataset has an even number of observations, the median is calculated as the average of the two middle values.

Mode: The mode represents the value that occurs most frequently in a dataset. A dataset may have one mode (unimodal), multiple modes (multimodal), or no mode at all if no value repeats. The mode is particularly useful for categorical data, such as identifying the most common type of car in a parking lot.

Measures of Dispersion (Variance and Standard Deviation)

Variance: Variance measures how far individual data points are from the mean of the dataset. It is calculated as the average of the squared differences between each data point and the mean. A high variance indicates that the data points are widely spread out, while a low variance suggests that they are clustered closely around the mean.

Standard Deviation: The standard deviation is the square root of the variance and provides a measure of the dispersion of data in the same units as the dataset. A low standard deviation indicates that the data points are close to the mean, while a high standard deviation signifies a wider spread. It is commonly used to evaluate variability in areas like stock market performance or investment volatility.

Sampling Method and Its Types

Sampling is the process of selecting a subset of individuals or elements from a larger population to represent that population. The main types of sampling methods include:

Simple Random Sampling: A method where every member of the population has an equal chance of being chosen.

Systematic Sampling: A method where every nth member of the population is selected, with the value of n determined by the desired sample size.

Stratified Random Sampling: A method where the population is divided into subgroups (strata) based on specific characteristics, and members are randomly selected from each stratum.

Cluster Sampling: A method where the population is divided into groups (clusters), and entire clusters are randomly chosen to form the sample.

Convenience Sampling: A method where participants are selected based on their accessibility or willingness to take part.

Quota Sampling: A method where a specific number of participants are selected from certain subgroups to ensure proportional representation.

The choice of sampling method depends on the research objective and the characteristics of the population being studied.

Population Mean and Sample Mean

The population mean is the average of a variable across all individuals or elements in a population. It is represented by the symbol μ (mu) and is calculated by summing all values in the population and dividing by the total number of observations.

The sample mean is the average of a variable within a subset (sample) drawn from the population. It is denoted by x bar (x̄) and is calculated by summing the values in the sample and dividing by the number of observations in the sample. The sample mean serves as an estimate of the population mean.

Variables and Their Types

A variable represents a characteristic or attribute that can assume different values. Variables are categorized into:

Categorical Variables: Variables that represent groups or categories, such as gender, race, or political affiliation.

Numerical Variables: Variables that take on numerical values, measurable on a scale, such as age, income, or weight.

Ordinal Variables: Variables that have a specific order or ranking, though the differences between the values are not necessarily measurable, such as education levels or satisfaction ratings.

Continuous Variables: Variables that can take any value within a given range, such as height or temperature.

Discrete Variables: Variables that assume distinct values, typically counted in whole numbers, such as the number of siblings in a family.

Frequency Distribution and Cumulative Frequency

Frequency distribution organizes data by showing the number of occurrences (frequency) for each value or range of values. It is commonly displayed in a table or graph.

Cumulative frequency is the total count of frequencies accumulated up to a certain point. It is calculated by adding each frequency to the sum of all previous frequencies. A cumulative frequency distribution presents these totals in either a table or graph.

Histograms

A histogram is a graphical tool used to represent the frequency distribution of a continuous numerical variable. The x-axis displays intervals (bins) of the variable, while the y-axis shows the frequency of values within each interval. Each bar in the histogram represents the frequency of observations that fall within the corresponding bin, with the height of the bar indicating the magnitude of the frequency.