
Statistical Concepts for Analytics
Variance
Variance quantifies the spread of a data set. A high variance indicates a wide spread around the mean, and a low variance indicates a narrow spread.
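As a small illustration of the definition above, the sketch below computes variance with Python's standard `statistics` module. The data values are made up for the example; `pvariance` treats the data as a whole population and `variance` treats it as a sample.

```python
import statistics

# Illustrative data (not from any real study); the mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance: average squared deviation from the mean (divide by n).
pop_var = statistics.pvariance(data)

# Sample variance: divide by n - 1 instead, since the mean is itself estimated.
sample_var = statistics.variance(data)

print(pop_var, sample_var)
```

The sample variance is slightly larger than the population variance because it divides the same sum of squared deviations by a smaller number.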
Population
The population in statistics is the entire group that you want to draw conclusions about.
Continuous Variable
Continuous variables can take on an unlimited number of values between the lowest and highest points of measurement. Examples include time, temperature, and distance.
Independent Variable
An independent variable is the variable that is changed or controlled in a scientific experiment to test the effects on the dependent variable.
Outlier
An outlier is an observation point that is distant from other observations. Outliers may be due to variability in measurement or may indicate experimental error; they can distort statistical analyses.
Time Series Analysis
Time series analysis involves statistical techniques used to analyze time-ordered sequence data, to extract meaningful statistics and characteristics of the data, and possibly to forecast future values.
Linear Regression
Linear regression is a basic and commonly used type of predictive analysis that models the relationship between two or more variables by fitting a linear equation to observed data. For a single predictor, the equation takes the form y = a + bx, where a is the intercept and b is the slope.
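A minimal sketch of fitting a line by ordinary least squares with one predictor, assuming the fitted line is written y = intercept + slope * x. The function name `fit_line` and the data are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # constructed to lie exactly on y = 1 + 2x
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

Because the example points lie exactly on a line, the fit recovers that line; real data would leave residuals around it.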
Principal Component Analysis (PCA)
PCA is a dimensionality-reduction method that is typically used to reduce the dimensionality of large datasets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
R-Squared
R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.
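One common way to compute R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that formula; the helper name `r_squared` and the numbers are illustrative.

```python
def r_squared(ys, predictions):
    """Proportion of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    # Variation left unexplained by the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    # Total variation of ys around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


ys = [1, 2, 3, 4]
perfect = r_squared(ys, [1, 2, 3, 4])           # predictions match exactly
baseline = r_squared(ys, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect, baseline)
```

A model that predicts every point exactly scores 1, and a model that only ever predicts the mean scores 0.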
Histogram
A histogram is a graphical representation of the distribution of numerical data, where the data is divided into bins or intervals, and the frequency of data in each bin is represented by the height of the bar.
Survival Analysis
Survival analysis is a branch of statistics for analyzing the expected duration of time until one event happens, such as death in biological organisms and failure in mechanical systems.
Discrete Variable
Discrete variables can take on only a countable number of distinct values, with no intermediate values in between. This includes counts of occurrences, such as the number of children in a family.
Confounding Variable
A confounding variable is an extraneous variable in a statistical model that correlates with both the dependent and independent variables. If not controlled, it can cause a spurious association, leading to invalid conclusions about the relationship between the variables.
Factor Analysis
Factor analysis is a technique that is used to reduce a large number of variables into fewer numbers of factors. This technique extracts maximum common variance from all variables and puts them into a common score.
Central Limit Theorem
The Central Limit Theorem states that the sampling distribution of the sample mean will be approximately normally distributed if the sample size is large enough, regardless of the shape of the population's distribution (provided the population has finite variance).
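The theorem can be checked informally by simulation. The sketch below draws repeated samples from a strongly skewed population (an exponential distribution with mean 1, chosen here only as an example) and looks at the means of those samples. The sample size of 50 and the 2000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples from a skewed population; return their means."""
    means = []
    for _ in range(n_samples):
        sample = [random.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


means = sample_means(sample_size=50)
center = statistics.fmean(means)  # should be near the population mean, 1
spread = statistics.stdev(means)  # should be near 1 / sqrt(50)
print(center, spread)
```

Plotting `means` as a histogram would show a roughly bell-shaped curve even though the underlying population is far from normal.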
Multivariate Analysis
Multivariate analysis deals with the statistical analysis of data collected on more than one dependent variable.
Type II Error
A Type II error occurs when a false null hypothesis is not rejected (a false negative). The probability of committing a Type II error is denoted by the Greek letter beta (β).
Z-Score
The Z-score is a statistical measure that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean. A Z-score of 0 means the value is exactly at the mean.
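A short sketch of the calculation: subtract the mean, then divide by the standard deviation. It assumes the data represent the whole group, so the population standard deviation (`pstdev`) is used; the function name `z_score` and the data are illustrative.

```python
import statistics


def z_score(value, data):
    """Number of standard deviations that value lies from the mean of data."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # population standard deviation
    return (value - mu) / sigma


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
at_mean = z_score(5, data)
above_mean = z_score(9, data)
print(at_mean, above_mean)
```

Here the value 9 lies two standard deviations above the mean, while the value 5 sits exactly at the mean.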
Nonparametric Statistics
Nonparametric statistics refers to a set of statistical tests that do not assume a specific distribution for the data, often used when the data is not normally distributed and cannot satisfy the assumptions of traditional parametric tests.
Confidence Interval
A confidence interval quantifies the uncertainty in an estimate. It provides a range of values within which the true population parameter is expected to lie with a certain confidence level (often 95%).
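One way to build such an interval for a mean is the normal approximation: the sample mean plus or minus a critical value times the standard error. The sketch below assumes that approach; for small samples a t-distribution critical value is usually preferred, which the standard library does not provide. The function name and data are illustrative.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation confidence interval for the mean of data."""
    n = len(data)
    mean = statistics.fmean(data)
    # Standard error: sample standard deviation divided by sqrt(n).
    standard_error = statistics.stdev(data) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
low, high = mean_confidence_interval(data)
print(low, high)
```

Raising the confidence level widens the interval, and a larger sample narrows it.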
Sampling Distribution
The sampling distribution is a probability distribution of a statistic obtained through a large number of samples drawn from a specific population.
T-Test
The t-test is a statistical test used to compare the means of two groups, or to compare a sample mean to a known value when the standard deviation is not known and the sample size is small.
Statistical Significance
Statistical significance indicates that an observed relationship between two or more variables would be unlikely to arise from random chance alone if the null hypothesis were true. It is typically tested by comparing a p-value with a predefined threshold, such as 0.05.
Coefficient of Variation
The coefficient of variation (CV) is a relative measure of the dispersion of data points in a data series around the mean, expressed as a percentage. It is particularly useful when comparing data series with different units or widely different means.
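The calculation is the standard deviation divided by the mean, multiplied by 100. The sketch below uses the sample standard deviation and two made-up data series in different units to show why a relative measure helps.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.fmean(data) * 100


heights_cm = [160, 170, 180]  # illustrative values
weights_kg = [60, 70, 80]     # illustrative values

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(cv_heights, cv_weights)
```

Both series have the same standard deviation (10), but relative to their means the weights vary more than the heights.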
Alternative Hypothesis
The alternative hypothesis (commonly written H₁ or Hₐ) is the claim you are seeking evidence for. It states that there is an effect or relationship between variables.
Chi-Square Test
The Chi-square test is used to determine whether there is a significant association between categorical variables. It compares the observed frequencies to the expected frequencies.
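The comparison of observed and expected frequencies can be sketched as follows for a contingency table. Expected counts are computed under the assumption that the two variables are independent; the function name and the tables are illustrative. Turning the statistic into a p-value requires the chi-square distribution, which is outside the standard library and not shown.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


associated = chi_square_statistic([[10, 20], [20, 10]])
independent = chi_square_statistic([[10, 20], [20, 40]])
print(associated, independent)
```

The second table has identical proportions in both rows, so observed and expected counts agree and the statistic is zero.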
Categorical Variable
Categorical variables represent types of data which may be divided into groups, such as 'race', 'sex', or 'educational level'.
Logistic Regression
Logistic regression is used when the dependent variable is binary (1/0, True/False, Yes/No) and it predicts the probability of the outcome using a logit function.
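A minimal sketch of how a fitted logistic regression turns a predictor value into a probability: a linear combination gives the log-odds (the logit), and the logistic function maps that to a value between 0 and 1. The coefficients below are hypothetical, not fitted to any data.

```python
import math


def predict_probability(x, intercept, slope):
    """Probability of the positive outcome for predictor value x."""
    log_odds = intercept + slope * x  # the logit is linear in x
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical coefficients chosen only for illustration.
intercept, slope = -3.0, 1.5

p_low = predict_probability(0, intercept, slope)
p_mid = predict_probability(2, intercept, slope)  # log-odds of 0
p_high = predict_probability(4, intercept, slope)
print(p_low, p_mid, p_high)
```

A log-odds of zero corresponds to a probability of 0.5, and the probability rises with x when the slope is positive.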
Mean
The mean is the average value of a data set, calculated by summing all values and dividing by the count. It is used to assess the central tendency of a data set.
Median
The median is the middle value in a sorted data set. It is less affected by outliers and skewed data than the mean.
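The difference in sensitivity to outliers can be seen with a small made-up example, using the standard `statistics` module.

```python
import statistics

incomes = [30, 32, 35, 38, 40]    # illustrative values
with_outlier = incomes + [1000]   # add one extreme observation

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)
print(mean_after, median_after)
```

A single extreme value pulls the mean far above most of the data, while the median moves only from 35 to 36.5.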
Sample
A sample is a subset of individuals from a larger population, used to make inferences about the population.
Dependent Variable
A dependent variable is what you measure in the experiment and what is affected during the experiment. It's called dependent because it depends on the independent variable.
Scatterplot
A scatterplot is a type of data display that shows the relationship between two numerical variables. Each member of the dataset is plotted as a point whose x-y coordinates correspond to its values for the two variables.
Statistic
A statistic is a characteristic of a sample, typically an estimate of a corresponding population parameter obtained by calculating from sample data.
Interquartile Range
The interquartile range (IQR) measures the statistical dispersion and is the difference between the upper quartile (75th percentile) and lower quartile (25th percentile).
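A short sketch using `statistics.quantiles` from the standard library. Several conventions exist for computing quartiles; this example assumes the "inclusive" method, and other methods can give slightly different results on the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative values

# n=4 returns the three quartile cut points: Q1, the median, and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

The IQR covers the middle half of the data, so it is not affected by extreme values in either tail.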
Probability Density Function
A probability density function (PDF) is a function that describes the relative likelihood of a continuous random variable taking on values near a given point. The area under the PDF curve between two values represents the probability that the random variable falls within that range.
Power of a Test
The power of a statistical test is the probability that the test will reject a false null hypothesis (i.e., the probability of not making a Type II error).
Standard Deviation
Standard deviation measures the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, whereas a high standard deviation indicates the values are spread out.
Type I Error
A Type I error occurs when the null hypothesis is wrongly rejected when it is, in fact, true (false positive). The probability of committing a Type I error is denoted by the Greek letter alpha (α).
Bayesian Statistics
Bayesian statistics is an approach to statistics in which all forms of uncertainty are expressed in terms of probability, unlike the frequentist approach that only uses sample data to make inferences.
Mode
The mode is the value that occurs most frequently in a data set. It is often used with categorical data to identify the most common category.
Correlation
Correlation measures the strength and direction of the linear relationship between two variables. Values of the correlation coefficient range from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship.
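A minimal sketch of the Pearson correlation coefficient, assuming that is the measure intended here: the covariance of the two variables divided by the product of their standard deviations. The function name and data are illustrative.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4]
rising = pearson_correlation(xs, [2, 4, 6, 8])   # y increases with x
falling = pearson_correlation(xs, [8, 6, 4, 2])  # y decreases with x
print(rising, falling)
```

Both examples are exactly linear, so they reach the two ends of the range.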
Regression Analysis
Regression analysis determines the relationship between dependent and independent variables. It's used to predict outcomes and assess which variables significantly impact the response variable.
Parameter
In statistics, a parameter is a numerical characteristic of the population, such as the population mean or standard deviation, as opposed to a statistic, which is a characteristic of a sample.
Kurtosis
Kurtosis is a measure of the 'tailedness' of a probability distribution. Excess kurtosis (kurtosis minus 3) compares the heaviness of the tails with that of a normal distribution, for which excess kurtosis is 0.
Null Hypothesis
The null hypothesis (H₀) is a general statement or default position that there is no relationship between two measured phenomena or no association among groups.
Skewness
Skewness measures the asymmetry of the probability distribution of a real-valued random variable. Positive skewness indicates a tail on the right side, and negative skewness indicates a tail on the left side.
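The sign convention can be illustrated with a moment-based skewness, computed as the third central moment divided by the second central moment raised to the power 1.5. This is one of several skewness formulas (it applies no sample-size correction); the function name and data are illustrative.

```python
def skewness(data):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


right_tailed = skewness([1, 2, 3, 4, 20])     # long tail on the right
left_tailed = skewness([-20, -4, -3, -2, -1])  # mirror image, tail on the left
symmetric = skewness([1, 2, 3, 4, 5])
print(right_tailed, left_tailed, symmetric)
```

Mirroring the data flips the sign of the skewness, and symmetric data give a skewness of zero.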
P-value
The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the value observed, under the assumption that the null hypothesis is true. A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis.
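As a sketch of the definition, the example below computes a two-sided p-value for a z-test, assuming the test statistic follows a standard normal distribution when the null hypothesis is true. The function name is illustrative, and other tests use other reference distributions.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail  # count both tails


p_borderline = two_sided_p_value(1.96)  # close to the usual 0.05 threshold
p_none = two_sided_p_value(0.0)         # statistic exactly at the null value
p_strong = two_sided_p_value(3.0)
print(p_borderline, p_none, p_strong)
```

The further the test statistic lies from zero, the smaller the p-value.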
ANOVA (Analysis of Variance)
ANOVA is a statistical method used to test for differences between two or more group means. It assesses the effect of one or more factors by comparing the variance between group means with the variance within the groups.
Quantile
Quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities, or dividing the observations in a sample in the same way. For example, quartiles are the three cut points that divide a dataset into four equal-sized groups.
© Hypatia.Tech. 2024 All rights reserved.