**Statistical Glossary**

**Bar graph**

A diagram representing the frequency distribution for nominal or discrete data. It consists of a sequence of bars, or rectangles, corresponding to the possible values, and the length of each is proportional to the frequency.

**Binomial distribution**

The discrete probability distribution for the number of successes when n independent experiments are carried out, each with the same probability p of success.
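As a minimal sketch, the probability of exactly k successes can be computed from the standard formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The function name `binomial_pmf` is an illustrative choice, not something defined in this glossary.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 4 tosses of a fair coin
print(binomial_pmf(2, 4, 0.5))  # 0.375
```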

**Bins**

A term used to describe class intervals on a histogram.

**Bivariate data**

Data involving two random variables, such as height and weight, or amount of smoking and measure of health; often graphed in a scatter plot.

**Box-and-whisker plot**

A diagram constructed from a set of numerical data showing a box that indicates the middle 50% of the ranked observations, together with lines, sometimes called ‘whiskers’, that extend from each quartile to the most extreme data value in that direction that is not more than 1.5 times the interquartile range from the quartile.

**Categorical data**

Data that fits into a small number of discrete categories. Categorical data is either non-ordered (nominal) such as gender or city, or ordered (ordinal) such as high, medium, or low temperature.

**Central limit theorem**

It pertains to the convergence in distribution of (normalized) sums of random variables. The distribution of the mean of a sequence of random variables tends to a normal distribution as the number in the sequence increases indefinitely.

**Circle graph**

A graph for categorical data. The proportion of elements belonging to each category is proportionally represented as a pie-shaped sector of a circle. Sometimes called a pie chart.

**Class intervals**

A subdivision within a range of values. In a histogram, the range of values is divided into sections, known as class intervals, also referred to as “bins.”

**Clusters of data**

A portion of high concentration in a data set.

**Combination**

The number of ways of picking k unordered outcomes from n possibilities.
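A short sketch of the count, assuming the usual formula n! / (k! (n - k)!). The helper name `n_choose_k` is hypothetical; `math.comb` is the standard-library equivalent (Python 3.8 or later).

```python
from math import comb, factorial

def n_choose_k(n: int, k: int) -> int:
    """Number of ways to choose k unordered items from n."""
    return factorial(n) // (factorial(k) * factorial(n - k))

print(n_choose_k(5, 2))  # 10
print(comb(5, 2))        # 10
```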

**Complement of an event**

Suppose A is an event in the universal set U. The complement of A ("not A") consists of all the outcomes in U that are not in A. For example, if A is the event that exactly two of three children are boys, then the complement of A is the event that zero, one, or three of the children are boys.

**Compound events**

An event made of two or more simple events.

**Conditional probability**

Let A and B be two events. The probability that A will occur given that B has already occurred is the ‘conditional probability of A given B’ and is denoted by P(A | B).
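One way to illustrate this, assuming equally likely outcomes and the standard identity P(A | B) = P(A and B) / P(B), is to count outcomes for a single roll of a fair die. The events and helper names below are chosen purely for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                      # the roll is even
B = {4, 5, 6}                      # the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def conditional(a, b):
    """P(a | b) = P(a and b) / P(b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)

print(conditional(A, B))  # 2/3
```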

**Confidence interval**

An interval, calculated from a sample, which contains the value of a certain population parameter with a specified probability.

**Confidence level**

The probability that the statistician's confidence interval contains the true, unknown population parameter.

**Correlation**

The correlation between two variables x and y is a measure of how closely related they are, or how linearly related they are. Correlation is the measure of the extent to which a change in one random variable tends to correspond to change in the other random variable. For example, height and weight have a moderately strong positive correlation.

**Correlation coefficients**

A measure of how close two random variables are to being perfectly linearly related; computed by dividing the covariance of the random variables by the product of their standard deviations. The correlation coefficient takes values between -1 and 1; -1 represents a perfect negative correlation while 1 represents a perfect positive correlation.
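The calculation can be sketched for paired data as follows. This assumes the population form (dividing by n) for both the covariance and the standard deviations; the function name `correlation` is illustrative.

```python
from math import sqrt

def correlation(xs, ys):
    """Covariance of xs and ys divided by the product of their
    standard deviations (population form, dividing by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# A perfectly linear increasing relationship gives a value of about 1;
# reversing the second list gives a value of about -1.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))
```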

**Counting Principle**

Method used to compute the number of possible outcomes of an experiment. If each outcome has independent parts, the total number of possible outcomes can be found by multiplying the number of choices for each part.
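As a small illustration, with made-up choices for each part, enumerating every combination gives the same count as multiplying the number of choices.

```python
from itertools import product

# Hypothetical example: an outfit has two independent parts
shirts = ["red", "blue", "green"]
trousers = ["jeans", "chinos"]

outfits = list(product(shirts, trousers))  # every (shirt, trousers) pair
print(len(outfits))                 # 6
print(len(shirts) * len(trousers))  # 6
```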

**Cumulative frequency**

The sum of the frequencies of all the values up to a given value.

**Cumulative relative frequency (relative cumulative frequency)**

The cumulative frequency in a frequency distribution divided by the total number of data points.
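A brief sketch of both cumulative frequency and cumulative relative frequency, using an invented frequency table that maps each value to its frequency:

```python
from itertools import accumulate

# Hypothetical frequency table: value -> frequency
frequencies = {1: 2, 2: 5, 3: 3}

values = sorted(frequencies)
cumulative = list(accumulate(frequencies[v] for v in values))
total = cumulative[-1]                      # total number of data points
relative = [c / total for c in cumulative]  # cumulative relative frequency

print(cumulative)  # [2, 7, 10]
print(relative)    # [0.2, 0.7, 1.0]
```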

**Data**

The observations gathered from an experiment, survey or observational study.

**Density function**

A mathematical function used to determine probabilities for a continuous random variable. For example, the bell-shaped curve corresponding to a normal distribution.

**Dependent event**

Two events are dependent if the occurrence of either affects the probability of the occurrence of the other.

**Designed experiment**

The process of planning an experiment or evaluation so that appropriate data will be collected, which may be analyzed by statistical methods resulting in valid and objective conclusions. Examples include: complete random design, random design, and randomized block design.

**Deterministic experiment**

A process in which the outcome is known in advance. For example, tossing a two-headed coin.

**Disjoint**

Sets are disjoint if they have no elements in common. For example, the sets A = {1,2,3} and B = {5,6,7} are disjoint.

**Dispersion**

A way of describing how scattered or spread out the observations in a sample are. Common measures of dispersion are the range, interquartile range, variance, and standard deviation.

**Distribution**

The distribution of a random variable is the way in which the probability of it taking a certain value, or a value within a certain interval is described. It may be given by the cumulative distribution function, the probability mass function (discrete random variable) or the probability density function (continuous random variable).

**Element**

An object in a set is an element of that set.

**Empirical probability**

The probability of an event determined by repeatedly performing an experiment. It may be determined by dividing the number of times the event occurred by the number of times the experiment was repeated.
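The idea can be sketched with a simulated experiment. The function and parameter names are illustrative, and the fixed seed is only there to make the run repeatable.

```python
import random

def empirical_probability(event, experiment, trials, seed=0):
    """Fraction of repeated experiments in which the event occurred."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if event(experiment(rng)))
    return hits / trials

# Estimate the probability of rolling a 6 with a fair die
estimate = empirical_probability(
    event=lambda roll: roll == 6,
    experiment=lambda rng: rng.randint(1, 6),
    trials=10_000,
)
print(estimate)  # should be close to 1/6
```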

**Equality (of sets)**

Sets A and B are equal if they consist of the same elements. In order to establish A=B, a technique that can be useful is to show that each is contained in the other.

**Equally likely outcomes**

Every outcome of an experiment has the same probability. For example: rolling a fair die has equally likely outcomes.

**Event**

A subset of the sample space. For example, the sample space for an experiment in which a coin is tossed twice is {HH, HT, TH, TT}. If A = {HT, HH}, then A is the event that a head occurs on the first toss.

**Expected value**

The long-run average value of a random quantity observed over many replications of an experiment. For example, if a fair 6-sided die is rolled, the expected value of the result is 3.5.
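The die example can be checked with a short sketch, assuming the usual definition for a discrete random variable: the sum of each value multiplied by its probability. The name `expected_value` is illustrative.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of value * probability over a {value: probability} mapping."""
    return sum(value * prob for value, prob in distribution.items())

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(fair_die))         # 7/2
print(float(expected_value(fair_die)))  # 3.5
```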

**Experiment**

A process that has an observable set of outcomes is called an experiment. For example, the following are all experiments: tossing a coin, rolling a die, or selecting a ball from a bag.

**Experimental probability**

The estimated probability of an event; obtained by dividing the number of successful trials by the total number of trials.

**Fair coin**

A coin for which the probabilities of landing heads up and tails up are the same (0.5).

**Fair game**

A game is fair if each player has an equal chance of winning.

**Finite sample space**

A sample space which contains a finite number of possible outcomes.

**Frequency**

The number of times that a particular value occurs as an observation.

**Frequency distribution**

The information consisting of the possible values/groups and the corresponding frequencies is called the frequency distribution.

**Frequency table**

A table giving the number of data points in a data set falling in each of a set of given intervals.

**Geometric distribution**

A discrete probability distribution for the number of trials required to achieve the first success in a sequence of independent trials, all with the same probability p of success. Its probability mass function is given by P[X = x] = p(1 - p)^(x - 1) for x = 1, 2, 3, ... For example, the number of times one must toss a fair coin until the first time the coin lands heads has a geometric distribution with parameter p = 0.5.
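A minimal sketch of the probability mass function for the coin example; the function name `geometric_pmf` is an illustrative choice.

```python
def geometric_pmf(x: int, p: float) -> float:
    """Probability that the first success occurs on trial x (x = 1, 2, ...)."""
    return p * (1 - p) ** (x - 1)

# Probability that the first head appears on the third toss of a fair coin
print(geometric_pmf(3, 0.5))  # 0.125
```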

**Geometric probability**

The probability of an event as determined by comparing the areas (or perimeters, angle measures, etc.) of the regions of success of an event to the total area of the figure (sample space).

**Histogram**

A bar graph presenting the frequencies of occurrence of data points. Sometimes called a frequency histogram.

**Independent event**

Two events are independent if the outcome of one event has no effect on the outcome of the other.

**Infinite sample space**

A sample space containing an infinite number of possible outcomes.

**Inter-quartile range**

The difference between the third quartile and the first quartile of a set of data (IQR).

**Intersection of Sets**

The intersection of sets A and B, denoted by A ∩ B, is the set of elements that are in both A and B.

**Interval scale**

A scale of measurement where the distance between any two adjacent units of measurement (or 'intervals') is the same but the zero point is arbitrary.

**Interval variable**

An interval variable is similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced.

**Least squares**

It is used to estimate parameters in statistical models such as those that occur in regression. Estimates for the parameter are obtained by minimizing the sum of the squares of the differences between the observed values and the predicted values under the model.

**Linear regression**

A method for finding an equation for the line that best fits the data set. The method is based on minimizing the sum of the squared vertical distances from the data points to the line of best fit.
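For a single explanatory variable, the least-squares line has a well-known closed form, sketched below. The function name `fit_line` and the sample points are illustrative only.

```python
def fit_line(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared
    vertical distances to the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```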

**Line-of-best-fit**

The line that best represents the trend that the points in a scatter plot follow.

**Line plot**

A line graph that orders the data along a real number line. Also called a dot plot.

**Maximum value**

The maximum is the highest point on a graph, or the largest number in a data set.

**Mean**

The sum of the observations divided by the number of observations. The mean is an appropriate location measure for interval or ratio variables.

**Measures of central tendency**

Measures of the location of the middle or the center of a distribution. The definition of "middle" or "center" is purposely left somewhat vague so that the term "central tendency" can refer to a wide variety of measures. The three most common measures of central tendency are the mean, median, and mode.

**Median**

Suppose the observations in a set of numerical data are ranked in ascending order. The median is the middle observation if there are an odd number of observations, and is the average of the two middlemost observations if there are an even number of observations.
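The two cases can be sketched directly; the function name `median` is illustrative (the standard library offers `statistics.median` for the same purpose).

```python
def median(values):
    """Middle observation of the ranked data, or the average of the
    two middle observations when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```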

**Minimum value**

The minimum is the lowest point on a graph, or the smallest number in a data set.

**Mode**

The mode is the most frequently occurring value in a set of discrete data. There can be more than one mode if two or more values are equally common.

**Mutually exclusive**

Events that have no outcomes in common. Events A and B are mutually exclusive if A ∩ B = ∅.

**N factorial**

For a positive integer n, the notation n! is used for the product n × (n - 1) × (n - 2) × ... × 2 × 1.

**Nominal data**

A data set is said to be nominal if the observations belonging to it can be assigned a code in the form of a number where the numbers are simply labels. You can count but not order or measure nominal data. For example, in a data set males could be coded as 0, females as 1; marital status of an individual could be coded as Y if married, N if single.

**Notation**

Notations are symbols denoting quantities, operations, etc.

**Observational study**

A study where data is collected through observation.

**Odds**

A way of representing the likelihood of an event's occurrence. The odds m:n in favor of an event means we expect the event will occur m times for every n times it does not occur.

**Ogive**

Any continuous cumulative frequency curve.

**One-variable data**

A collection of related behaviors that are associated in one meaningful way.

**One-variable data table**

A table showing a collection of related behaviors that are associated in one meaningful way.

**Ordinal data**

A set of data is said to be ordinal if the values / observations belonging to it can be ranked (put in order) or have a rating scale attached. For example: Questionnaire responses coded: 1 strongly disagree, 2 disagree, 3 indifferent, 4 agree, 5 strongly agree.

**Outcome**

An outcome is the result of an experiment. The set of all possible outcomes of an experiment is the sample space.

**Outliers of data**

An observation that is deemed to be unusual and possibly erroneous because it does not follow the general pattern of the data in the sample.

**Percentile**

The n-th percentile is the value x_(n/100) such that n percent of the population is less than or equal to x_(n/100). The 25th, 50th and 75th percentiles are called quartiles.

**Permutation**

A permutation of n objects is an arrangement of the objects in a particular order. The number of permutations of n objects taken r at a time is denoted by nPr, which equals n! / (n - r)!.
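The formula n! / (n - r)! can be checked with a short sketch. The helper name `n_p_r` is hypothetical; `math.perm` is the standard-library equivalent (Python 3.8 or later).

```python
from math import factorial, perm

def n_p_r(n: int, r: int) -> int:
    """Number of ordered arrangements of r objects chosen from n."""
    return factorial(n) // factorial(n - r)

print(n_p_r(5, 2))  # 20
print(perm(5, 2))   # 20
```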

**Pie chart (also known as circle graph)**

A graph for categorical data. The proportion of elements belonging to each category is proportionally represented as a pie-shaped sector of a circle.

**Population**

The entire set of items from which data can be selected. For example, a poll given to a sample of voters is designed to measure the preferences of the population of all voters.

**Population variable**

Collection of related behaviors of a group that are associated in a meaningful way.

**Principle**

A basic generalization that is accepted as true and that can be used as a basis for reasoning.

**Probability**

The chance/likelihood that a particular event (or set of events) will occur expressed on a scale from 0 (impossibility) to 1 (certainty), also expressed as a percentage between 0 and 100%.

**Probabilistic experiment**

A probabilistic experiment is an occurrence such as the tossing of a coin, rolling of a die, etc. in which the complexity of the underlying system leads to an outcome that cannot be known ahead of time.

**Probability (density) function**

The probability function f(x) of a continuous distribution is defined as the derivative of its cumulative distribution function F(x).

**Probability models**

A probability model is a mathematical representation of a random phenomenon. It is defined by its sample space, events within the sample space, and probabilities associated with each event.

**Quartile**

For numerical data ranked in ascending order, the quartiles are values derived from the data which divide the data into four equal parts.

**Random experiment**

An experiment whose outcome cannot be predicted with certainty, before the experiment is run.

**Random number generator**

A program that will generate random numbers (as output).

**Random sample**

A set of data chosen from a population in such a way that each member of the population has an equal probability of being selected.

**Range**

The range of a sample (or a data set) is a measure of the spread or the dispersion of the observations. It is the difference between the largest and the smallest observed value.

**Ratio variable**

A comparison of two quantities, expressed as a fraction where the quantity can assume any of a set of values.

**Regression line**

A straight line used to estimate the relationship between two variables, based on the points of a scatter plot; often determined by a least squares analysis. When it slopes down (from top left to bottom right), this indicates a negative or inverse relationship between the variables; when it slopes up (from bottom left to top right), a positive or direct relationship is indicated.

**Relative frequency**

Relative frequency is another term for proportion; it is the value calculated by dividing the number of times an event occurs by the total number of times an experiment is carried out. The probability of an event can be thought of as its long-run relative frequency when the experiment is carried out many times.

**Replacement**

Returning an item to the sample space after it has been selected, thus allowing the item to be chosen more than once.

**Residual**

Residual represents unexplained (or residual) variation after fitting a regression model. It is the difference between the observed value of the variable and the value suggested by the regression model.

**Residual variance**

The square of the standard error of estimate.

**Sample**

A subset of a population that is obtained through some process, possibly random selection or selection based on a certain set of criteria, for the purposes of investigating the properties of the underlying parent population.

**Sample size**

The sample size is the number of items in a sample.

**Sample space**

The set of all possible outcomes of a probability experiment.

**Sample variance**

Sample variance is a measure of the spread or dispersion within a set of sample data.

**Sampling**

The process of selecting a proper subset of elements from the full population so that the subset can be used to make inference to the population as a whole.

**Scatter-plot**

A graph of two-variable (bivariate) data in which each point is located by its coordinates (X, Y).

**Set**

A set is a well-defined collection of objects. Sets are written using set braces {}. For example, {1,2,3} is the set containing the elements 1, 2, and 3.

**Simple event**

An event which is a single element of the sample space.

**Simulation**

An experiment that models a real-life situation.

**Single-variable data**

Data that uses only one unknown.

**Skewness**

The degree of asymmetry of a distribution.

**Spread of data**

The degree to which data are spread out around their center. Measures of spread include the mean deviation, variance, standard deviation, and interquartile range.

**Standard deviation**

Standard deviation is a measure of the spread or dispersion of a set of data. It is defined as the square root of the variance.
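A minimal sketch of the relationship between the two quantities. It assumes the population form of the variance (dividing by n); a sample variance would usually divide by n - 1 instead. The function names and data are illustrative.

```python
from math import sqrt

def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))            # 4.0
print(standard_deviation(data))  # 2.0
```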

**Standard normal distribution**

A normal distribution with mean 0 and variance 1.

**Statistical inference**

Statistical inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.

**Statistics**

The branch of mathematics that deals with the collection, organization, and interpretation of data.

**Stem-and-leaf plot**

A semi-graphical method used to represent numerical data, in which the first (leftmost) digit of each data value is a stem and the rest of the digits of the number are the leaves.

**Subset**

Set A is a subset of set B if all of the elements of set A are contained in set B. It is written as A ⊆ B.

**Table**

Mathematical information organized in columns and rows.

**Theoretical probability**

The theoretical probability of an event, P (event), is the ratio of the number of outcomes in the event to the number of outcomes in the sample space, if all outcomes are equally likely.

**Theoretical regression line**

The line of best fit drawn through a scatter-plot before values are actually calculated.

**Tree diagram**

A tree diagram displays all the possible outcomes of an event.

**Unbiased estimator**

For an estimator to be unbiased it is required that on average the estimator will yield the true value of the unknown parameter. An estimator X is an unbiased estimator of the parameter θ if E(X) = θ.

**Unfair game**

A game in which not all players have the same probability of winning.

**Union**

Combining the elements of two or more sets. Union is indicated by the ∪ (cup) symbol.

**Union of sets**

The union of two sets A and B is the set obtained by combining the members of each set. If A = {1,2,3} and B = {2,4,6}, then A ∪ B = {1,2,3,4,6}.
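The set operations in this glossary map directly onto Python's built-in `set` type, as in this short sketch using the sets A = {1, 2, 3} and B = {2, 4, 6}:

```python
A = {1, 2, 3}
B = {2, 4, 6}

print(sorted(A | B))            # union: [1, 2, 3, 4, 6]
print(sorted(A & B))            # intersection: [2]
print(A.isdisjoint({5, 6, 7}))  # no elements in common: True
print({1, 2} <= A)              # subset test: True
```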

**Universal set**

A set that contains all the elements being considered in the given discussion or problem.

**Variable**

A quantity that varies. For example, the weight of a randomly chosen member of a football team is such a variable. Variables are usually represented by letters.

**Variance**

A measure of the amount of spread (variation) in a set of data; the larger the variance, the more scattered the observations on average.

**Venn diagram**

A graphic means of showing intersection and union of sets by representing them as bounded regions.

**Weighted arithmetic mean (Weighted Average)**

A method of computing a kind of arithmetic mean of a set of numbers in which some elements of the set carry more importance (weight) than others.
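As a sketch, assuming the usual formula (the sum of each value times its weight, divided by the sum of the weights), with invented scores and weights:

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# A score of 80 with weight 1 and a score of 90 with weight 3
print(weighted_mean([80, 90], [1, 3]))  # 87.5
```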