Statistical Concepts for Analytics

Mean

The mean is the average value of a data set, calculated by summing all values and dividing by the count. It is a common measure of central tendency, but it is sensitive to outliers.

Median

The median is the middle value in a sorted data set. It is less affected by outliers and skewed data than the mean.

Mode

The mode is the value that occurs most frequently in a data set. It is especially useful for categorical data, where it identifies the most common category.
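As a minimal sketch of the three measures of central tendency above, the following uses Python's standard `statistics` module. The `orders` values are hypothetical and include one deliberate outlier to show how the mean and median diverge.

```python
from statistics import mean, median, mode

# Hypothetical order values; 250 is a deliberate outlier.
orders = [12, 15, 15, 18, 20, 250]

avg = mean(orders)     # 55: pulled upward by the outlier
mid = median(orders)   # 16.5: midpoint of the two middle sorted values
most = mode(orders)    # 15: the most frequent value

print(avg, mid, most)
```

The outlier moves the mean far from the bulk of the data, while the median and mode stay put.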

Standard Deviation

Standard deviation measures the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, whereas a high standard deviation indicates the values are spread out.

Variance

Variance quantifies the spread of a data set as the average squared deviation from the mean; it is the square of the standard deviation. A high variance indicates a wide spread around the mean, and a low variance indicates a narrow spread.
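The relationship between variance and standard deviation can be sketched as follows. The `values` list is hypothetical; the population variance is computed by hand and compared with the `statistics` module, which also offers a sample version that divides by n - 1.

```python
from statistics import pstdev, pvariance, variance

# Hypothetical data set with mean 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

m = sum(values) / len(values)

# Population variance: average squared deviation from the mean.
pop_var = sum((x - m) ** 2 for x in values) / len(values)

# Standard deviation is the square root of the variance.
pop_sd = pop_var ** 0.5

# Sample variance divides by n - 1 instead of n.
sample_var = variance(values)

print(pop_var, pop_sd, sample_var)
```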

Skewness

Skewness measures the asymmetry of the probability distribution of a real-valued random variable. Positive skewness indicates a tail on the right side, and negative skewness indicates a tail on the left side.

Kurtosis

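Kurtosis is a measure of the 'tailedness' of a probability distribution. Excess kurtosis is kurtosis minus 3, the kurtosis of a normal distribution, so positive values indicate heavier tails than a normal distribution and negative values indicate lighter tails.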

Correlation

Correlation measures the strength and direction of a linear relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship.
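One way to compute a correlation coefficient by hand is the Pearson formula, sketched below. The function name `pearson_r` and the sample data are illustrative assumptions, not part of the original card.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: y rises exactly in step with x, then falls.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```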

Regression Analysis

Regression analysis estimates the relationship between a dependent variable and one or more independent variables. It's used to predict outcomes and to assess which variables are significantly associated with the response variable.

P-value

The p-value is the probability of observing a test statistic at least as extreme as the value observed, under the assumption that the null hypothesis is true. A small p-value (conventionally ≤ 0.05) is taken as evidence against the null hypothesis.

Null Hypothesis

The null hypothesis ( $H_0$ ) is a general statement or default position that there is no relationship between two measured phenomena or no association among groups.

Alternative Hypothesis

The alternative hypothesis ( $H_a$ ) is the claim a study seeks evidence for. It states that there is an effect or relationship between variables, in contrast to the null hypothesis.

Type I Error

A Type I error occurs when a true null hypothesis is rejected (false positive). The probability of committing a Type I error is denoted by the Greek letter alpha ( $\alpha$ ).

Type II Error

A Type II error occurs when a false null hypothesis is not rejected (false negative). The probability of committing a Type II error is denoted by the Greek letter beta ( $\beta$ ).

Confidence Interval

A confidence interval quantifies the uncertainty in an estimate. It is a range of values computed from sample data by a procedure that captures the true population parameter in a stated proportion of repeated samples, the confidence level (often 95%).
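A rough sketch of a confidence interval for a mean is shown below. It assumes a normal approximation (a z critical value), which is reasonable for large samples; for small samples a t critical value would be the usual choice. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    m = mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    se = stdev(sample) / sqrt(len(sample))
    return m - z * se, m + z * se

# Hypothetical measurements.
sample = [48, 52, 51, 49, 50, 53, 47, 50]
low95, high95 = mean_confidence_interval(sample)
low99, high99 = mean_confidence_interval(sample, confidence=0.99)
print((low95, high95), (low99, high99))
```

Raising the confidence level widens the interval, since a wider range is needed to capture the parameter more often.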

Population

The population in statistics is the entire group that you want to draw conclusions about.

Sample

A sample is a subset of individuals from a larger population, used to make inferences about the population.

Sampling Distribution

The sampling distribution is the probability distribution of a statistic across repeated samples of the same size drawn from a specific population.

Central Limit Theorem

The Central Limit Theorem states that the sampling distribution of the sample mean will be approximately normal if the sample size is large enough, regardless of the shape of the population's distribution, provided the observations are independent and the population has finite variance.
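The theorem can be illustrated with a small simulation. This is a sketch under assumed settings: an exponential population (strongly right-skewed, with mean 1 and standard deviation 1), samples of size 50, and 2000 repetitions with a fixed seed.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, sd 1)."""
    return mean([rng.expovariate(1.0) for _ in range(n)])

# Sampling distribution of the mean: many samples, one mean each.
sample_means = [sample_mean(50) for _ in range(2000)]

center = mean(sample_means)   # should sit near the population mean, 1
spread = stdev(sample_means)  # should sit near 1 / sqrt(50), about 0.14

print(center, spread)
```

Although the population is skewed, the sample means cluster symmetrically around 1 with a spread close to the population standard deviation divided by the square root of the sample size.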

Z-Score

The Z-score is a statistical measure that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean. A Z-score of 0 means the value is exactly at the mean.
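The Z-score definition above reduces to a one-line formula, sketched here with hypothetical exam scores (mean 70, standard deviation 10).

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

above = z_score(85, 70, 10)    # 1.5 standard deviations above the mean
at_mean = z_score(70, 70, 10)  # 0.0: exactly at the mean
below = z_score(55, 70, 10)    # -1.5 standard deviations below the mean

print(above, at_mean, below)
```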

T-Test

The t-test is a statistical test used to compare the means of two groups, or to compare a sample mean to a known value, when the population standard deviation is not known and must be estimated from the sample. It is especially relevant when the sample size is small.

ANOVA (Analysis of Variance)

ANOVA is a statistical method used to test for differences among the means of two or more groups, most often three or more. It assesses the effect of one or more factors by comparing the variance between group means to the variance within groups.

Chi-Square Test

The Chi-square test is used to determine whether there is a significant association between categorical variables. It compares the observed frequencies to the expected frequencies.
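The comparison of observed and expected frequencies can be sketched as below for a contingency table. The function name and the 2x2 counts are hypothetical; the sketch computes only the test statistic, not the p-value.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are two groups, columns are two outcomes.
associated = chi_square_statistic([[10, 20], [20, 10]])
independent = chi_square_statistic([[10, 10], [10, 10]])
print(associated, independent)
```

A statistic of zero means the observed counts match the expected counts exactly; larger values indicate a stronger departure from independence.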

Statistical Significance

A result is statistically significant when it would be unlikely to occur by random chance alone if the null hypothesis were true. It is typically assessed by comparing a p-value with a predefined threshold, such as 0.05.

Time Series Analysis

Time series analysis involves statistical techniques used to analyze time-ordered sequence data, to extract meaningful statistics and characteristics of the data, and possibly to forecast future values.

Categorical Variable

Categorical variables take values that fall into a limited set of groups or categories, such as 'race', 'sex', or 'educational level'.

Continuous Variable

Continuous variables can take any value within a range, so there are infinitely many possible values between the lowest and highest points of measurement. Examples include time, temperature, and distance.

Discrete Variable

Discrete variables can take only separate, countable values, often whole numbers. This includes counts of occurrences, such as the number of children in a family.

Dependent Variable

A dependent variable is what you measure in the experiment and what is affected during the experiment. It's called dependent because it depends on the independent variable.

Independent Variable

An independent variable is the variable that is changed or controlled in a scientific experiment to test the effects on the dependent variable.

Histogram

A histogram is a graphical representation of the distribution of numerical data, where the data is divided into bins or intervals, and the frequency of data in each bin is represented by the height of the bar.

Scatterplot

A scatterplot is a type of data display that shows the relationship between two numerical variables. Each member of the dataset is plotted as a point whose x-y coordinates correspond to its values for the two variables.

R-Squared

R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

Outlier

An outlier is an observation point that is distant from other observations. Outliers may be due to variability in measurement or may indicate experimental error; they can distort statistical analyses.

Parameter

In statistics, a parameter is a numerical characteristic of the population, such as the population mean or standard deviation, as opposed to a statistic, which is a characteristic of a sample.

Statistic

A statistic is a characteristic of a sample, typically an estimate of a corresponding population parameter obtained by calculating from sample data.

Quantile

Quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities, or dividing the observations in a sample in the same way. For example, quartiles are the three cut points that divide a dataset into four equal-sized groups.

Interquartile Range

The interquartile range (IQR) is a measure of statistical dispersion: the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). It covers the middle 50% of the data.
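Quartiles and the IQR can be sketched with `statistics.quantiles`. The data are hypothetical, and the function's default `exclusive` method is assumed; other quantile conventions can give slightly different cut points for the same data.

```python
from statistics import quantiles

# Hypothetical data: the integers 1 through 11.
data = list(range(1, 12))

# n=4 returns the three quartile cut points (default 'exclusive' method).
q1, q2, q3 = quantiles(data, n=4)

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```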

Coefficient of Variation

The coefficient of variation (CV) is a relative measure of dispersion: the ratio of the standard deviation to the mean, often expressed as a percentage. It is particularly useful when comparing data series with different units or widely different means.
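A minimal sketch of the CV calculation follows. The function name and both data series are hypothetical, and the population standard deviation is assumed; a sample standard deviation could be substituted.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation relative to the mean, as a percentage."""
    return pstdev(values) / mean(values) * 100

# Two hypothetical series on very different scales.
small_scale = [2, 4, 4, 4, 5, 5, 7, 9]            # mean 5, sd 2
large_scale = [x * 1000 for x in small_scale]     # mean 5000, sd 2000

cv_small = coefficient_of_variation(small_scale)
cv_large = coefficient_of_variation(large_scale)
print(cv_small, cv_large)
```

Both series have the same CV even though their standard deviations differ by a factor of 1000, which is why the CV suits comparisons across scales.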

Probability Density Function

A probability density function (PDF) is a function that describes the relative likelihood of a continuous random variable taking on values near a given point; the probability of any single exact value is zero. The area under the PDF curve between two values represents the probability that the random variable falls within that range.

Confounding Variable

A confounding variable is an extraneous variable in a statistical model that correlates with both the dependent and independent variables. If not controlled, it can cause a spurious association, leading to invalid conclusions about the relationship between the variables.

Linear Regression

Linear regression is a basic and commonly used type of predictive analysis which shows the relationship between two or more variables by fitting a linear equation to observed data. The equation takes the form $y = b_0 + b_1x_1 + ... + b_nx_n$ .
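For the single-predictor case $y = b_0 + b_1x_1$, the coefficients can be estimated by ordinary least squares, sketched below along with R-squared. The function name `fit_line` and the data points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares intercept, slope, and R-squared for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return b0, b1, 1 - ss_res / ss_tot

# Hypothetical observations.
b0, b1, r_squared = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(b0, b1, r_squared)
```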

Logistic Regression

Logistic regression is used when the dependent variable is binary (1/0, True/False, Yes/No). It models the log-odds (logit) of the outcome as a linear function of the predictors, which yields a predicted probability between 0 and 1.
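The link between log-odds and probability can be sketched as follows. The coefficients `b0` and `b1` are hypothetical values chosen for illustration, not fitted from data.

```python
from math import exp, log

def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real number to a probability."""
    return 1 / (1 + exp(-z))

# Hypothetical coefficients for a one-predictor model.
b0, b1 = -3.0, 0.5

def predict_probability(x):
    # The linear predictor is the log-odds; sigmoid converts it to a probability.
    return sigmoid(b0 + b1 * x)

print(predict_probability(6))   # log-odds of 0, so probability 0.5
print(predict_probability(10))
```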

Survival Analysis

Survival analysis is a branch of statistics for analyzing the expected duration of time until one event happens, such as death in biological organisms and failure in mechanical systems.

Power of a Test

The power of a statistical test is the probability that the test will reject a false null hypothesis (i.e., the probability of not making a Type II error), equal to $1 - \beta$.
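As a rough sketch, power can be computed in closed form for a one-sided z-test of a mean with known standard deviation. That test setup, the function name, and the numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a one-sided z-test (H0: mean = mu0 vs. Ha: mean > mu0)
    when the true mean is mu1 and sigma is known."""
    std_normal = NormalDist()
    # Critical value: reject H0 when the z statistic exceeds this.
    z_alpha = std_normal.inv_cdf(1 - alpha)
    # Shift of the z statistic's distribution under the true mean.
    shift = (mu1 - mu0) * sqrt(n) / sigma
    return 1 - std_normal.cdf(z_alpha - shift)

# Hypothetical scenario: true mean 105 versus a null value of 100.
power_small_n = z_test_power(100, 105, 15, n=25)
power_large_n = z_test_power(100, 105, 15, n=100)
print(power_small_n, power_large_n)
```

Power rises with sample size, and when the true mean equals the null value the rejection probability falls back to alpha, the Type I error rate.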

Bayesian Statistics

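Bayesian statistics is an approach to statistics in which all forms of uncertainty are expressed in terms of probability. Prior beliefs about a parameter are combined with observed data to produce an updated (posterior) distribution, unlike the frequentist approach, which bases inference on the sample data alone.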

Multivariate Analysis

Multivariate analysis deals with the statistical analysis of data in which more than one outcome (dependent) variable is measured and analyzed at the same time.

Factor Analysis

Factor analysis is a technique used to reduce a large number of variables to a smaller number of factors. It extracts the maximum common variance from all variables and combines it into a common score.

Principal Component Analysis (PCA)

PCA is a dimensionality-reduction method that transforms a large set of variables into a smaller set that still contains most of the information in the original.

Nonparametric Statistics

Nonparametric statistics refers to a set of statistical tests that do not assume a specific distribution for the data, often used when the data is not normally distributed and cannot satisfy the assumptions of traditional parametric tests.
