NDA Maths · Statistics

Regression and Correlation

How two variables move together — correlation measures the strength of the link, regression draws the best-fit line.

Why this matters

27 PYQs across 2017–2026, with 6 HARD, the highest hard-rate of any Statistics subtopic. Almost every recent paper asks one of four shapes: properties of the correlation coefficient r under linear transformation, finding regression lines, identifying which equation is which, or computing the angle between them. Five tight concepts cover the entire surface.

Concept 1 of 5

Correlation Coefficient and Its Properties

Intuition

Correlation coefficient $r$ is a single number between $-1$ and $+1$ that summarises how strongly two variables move together. $r = +1$ is perfect positive (one rises, the other rises by the same proportion); $r = -1$ is perfect negative; $r = 0$ means no linear relationship. Crucially, $r$ is unaffected by shifts (change of origin) and unaffected in magnitude by positive scale changes, but a negative scale flips its sign.

Definition

For paired observations, $r = \dfrac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$, bounded by $-1 \leq r \leq 1$. If $U = aX + b$ and $V = cY + d$ with $a, c \neq 0$, then $r_{UV} = \text{sign}(ac)\,r_{XY}$: magnitude is preserved, and the sign flips when exactly one of $a, c$ is negative.

Correlation Coefficient and Invariance Rule

$$r = \dfrac{\text{Cov}(X,Y)}{\sigma_X\,\sigma_Y} \qquad r_{(aX+b,\,cY+d)} = \text{sign}(ac)\,r_{XY}$$
  • $\text{Cov}(X,Y)$: covariance of X and Y
  • $\sigma_X, \sigma_Y$: standard deviations of X and Y
  • $\text{sign}(ac)$: $+1$ if $a, c$ have the same sign, $-1$ if they have opposite signs
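
The invariance rule can be checked numerically. The sketch below is only an illustration: the helper name `pearson_r` and the five data points are made up for this check, not taken from any past paper. It computes $r$ straight from the definition, then recomputes it after the transformation $U = 2x + 5$, $V = -3y + 1$.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, chosen only to illustrate the rule.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

r_xy = pearson_r(xs, ys)

# U = 2x + 5 and V = -3y + 1, so a = 2, c = -3 and sign(ac) = -1.
us = [2 * x + 5 for x in xs]
vs = [-3 * y + 1 for y in ys]
r_uv = pearson_r(us, vs)

print(r_xy, r_uv)  # same magnitude, opposite sign
```

The shifts $+5$ and $+1$ drop out when the means are subtracted, and the scale factors cancel between the covariance and the standard deviations, leaving only the sign of $ac$.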

Visualization · slide r, watch the cloud tighten

At r = ±1 the points fall exactly on a line; toward r = 0 the cloud loses any linear shape. Positive r slopes up, negative r slopes down. r is unitless and always lies in [−1, 1] — it captures tightness and direction, not how steep the line is.

Worked example

The correlation between $x$ and $y$ is $r = 0.6$. Find the correlation between $U = 2x + 5$ and $V = -3y + 1$.
  1. Identify $a = 2,\ c = -3$. Shifts $b = 5,\ d = 1$ do not affect $r$.
  2. Compute $\text{sign}(ac) = \text{sign}(2 \times -3) = \text{sign}(-6) = -1$.
  3. Apply the rule: $r_{UV} = -1 \times 0.6$.
  4. Result: $r_{UV} = -0.6$. Magnitude preserved, sign flipped.
Answer: $r_{UV} = -0.6$

Try it yourself

If $r$ between $x$ and $y$ is $-0.4$, find $r$ between $U = -x + 1$ and $V = 2y - 5$.

Practice — Level 1 (4 reps)

Quick reps to lock in the method. Try each, then check.

  1. $r$ between $x, y$ is $0.7$. $r$ between $2x$ and $3y$?
  2. $r$ is $0.5$. $r$ between $x$ and $-y$?
  3. $r$ is $-0.8$. $r$ between $x+5$ and $y-2$?
  4. A computation gives $r = 1.4$. Possible?

From the bank · past-year question

Example 1 · Statistics · EASY
If $r$ is the correlation coefficient between $x$ and $y$, then what is the correlation coefficient between $(3x+4)$ and $(-3y+3)$?

[Q111 · Sep · 2023]

$r$ is bounded by $-1$ and $+1$, always

If a calculation gives $|r| > 1$, the arithmetic is wrong. Use this as a sanity check at the end of any correlation computation.

Shift does not change $r$; only scale-with-negative-sign flips it

Adding constants to either variable is invisible to $r$. Multiplying by a positive constant is also invisible. Only a negative multiplier flips the sign, and even then the magnitude is preserved.

Concept 2 of 5

Lines of Regression

Intuition

For two variables there are TWO regression lines: $y$ on $x$ (used to predict $y$ from $x$) and $x$ on $y$ (used to predict $x$ from $y$). Both lines always pass through the mean point $(\bar{x}, \bar{y})$. If $r = \pm 1$ the two lines coincide; otherwise they intersect at $(\bar{x}, \bar{y})$ at a non-zero angle.

Definition

The regression line of $y$ on $x$ has slope $b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$ and passes through $(\bar{x}, \bar{y})$. The regression line of $x$ on $y$ has regression coefficient $b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y}$ (the change in $x$ per unit change in $y$) and also passes through $(\bar{x}, \bar{y})$.

Lines of Regression (point-slope form)

$$y - \bar{y} = b_{yx}(x - \bar{x}) \qquad x - \bar{x} = b_{xy}(y - \bar{y})$$
  • $b_{yx}$: slope of the $y$ on $x$ line, $= r\,\sigma_y/\sigma_x$
  • $b_{xy}$: regression coefficient of the $x$ on $y$ line (change in $x$ per unit $y$), $= r\,\sigma_x/\sigma_y$
  • $(\bar{x},\bar{y})$: the only point on BOTH regression lines (unless $r = \pm 1$, when the lines coincide)

Visualization · drag the line, watch the error

[Interactive plot: a draggable line over a scatter of points, with a live SSE readout (shown: SSE = 2.97; best possible: 0.27).]

Red dashes are residuals (vertical distance from each point to the line). SSE is the sum of their squares. The least-squares regression line is the one that makes SSE as small as possible.
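
The "smallest SSE" idea can be tried out in a few lines. This is a minimal sketch, assuming made-up data points and hypothetical helper names (`sse`, `least_squares_line`); it fits the least-squares line and then checks that nudging the slope or intercept never lowers the SSE.

```python
def sse(points, slope, intercept):
    """Sum of squared vertical residuals for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

def least_squares_line(points):
    """Slope and intercept of the y-on-x least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data, for illustration only.
points = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]

best_slope, best_intercept = least_squares_line(points)
best_sse = sse(points, best_slope, best_intercept)

# Try nearby lines: none of them should beat the least-squares SSE.
nudged = [
    sse(points, best_slope + ds, best_intercept + di)
    for ds in (-0.1, 0.0, 0.1)
    for di in (-0.5, 0.0, 0.5)
]
print(best_slope, best_intercept, best_sse, min(nudged))
```

Dragging the line in the visualization corresponds to changing `slope` and `intercept` by hand; the fitted values are the pair at which the readout bottoms out.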

Worked example

Find the regression line of $y$ on $x$ passing through the only two data points $(1, 1)$ and $(5, 9)$.
  1. With only two points, the regression line is the line joining them: correlation is perfect ($r = \pm 1$).
  2. Compute slope: $b_{yx} = \dfrac{9 - 1}{5 - 1} = \dfrac{8}{4} = 2$.
  3. Use point-slope through $(1, 1)$: $y - 1 = 2(x - 1)$.
  4. Simplify: $y = 2x - 1$.
Answer: $y = 2x - 1$ or equivalently $2x - y - 1 = 0$

Try it yourself

Find the regression line of $y$ on $x$ passing through $(0, 1)$ and $(4, 3)$.

Practice — Level 1 (4 reps)

Quick reps to lock in the method. Try each, then check.

  1. Both regression lines always pass through which point?
  2. Regression lines are $x = 2$ and $y = 3$. Find $(\bar{x}, \bar{y})$.
  3. $b_{yx} = r\,\sigma_y/\sigma_x$. If $r = 1$ and $\sigma_y = \sigma_x = 2$, slope?
  4. Slope of the $y$-on-$x$ line through $(1,2)$ and $(3,6)$?

From the bank · past-year question

Example 2 · Statistics · EASY
A bivariate data set contains only two points $(-1,1)$ and $(3,2)$. What will be the line of regression of $y$ on $x$?

[Q109 · Apr · 2023]

Both regression lines always pass through $(\bar{x}, \bar{y})$

If a PYQ gives you two regression lines and asks for the means, solve the two equations simultaneously: their intersection is exactly $(\bar{x}, \bar{y})$. No need to compute anything from raw data.
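
Solving the pair simultaneously is plain Cramer's rule. The sketch below assumes each line is given as a coefficient triple `(a, b, c)` for $ax + by + c = 0$; the function name `mean_point` is hypothetical. It is applied to the illustrative pair $3x - 5y + 10 = 0$ and $5x - 3y - 6 = 0$.

```python
from fractions import Fraction

def mean_point(line1, line2):
    """Intersection of a1*x + b1*y + c1 = 0 and a2*x + b2*y + c2 = 0 via Cramer's rule."""
    a1, b1, c1 = map(Fraction, line1)
    a2, b2, c2 = map(Fraction, line2)
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("lines are parallel or coincident; no unique intersection")
    x = (b1 * c2 - b2 * c1) / det
    y = (a2 * c1 - a1 * c2) / det
    return x, y

# 3x - 5y + 10 = 0 and 5x - 3y - 6 = 0
x_bar, y_bar = mean_point((3, -5, 10), (5, -3, -6))
print(x_bar, y_bar)  # 15/4 17/4
```

Exact fractions are used so the means come out in the same form an exam answer would, rather than as rounded decimals.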

From raw bivariate data, compute $b_{yx}$ via the Pearson form

When given $n$ raw paired points (e.g. four $(x_i, y_i)$ values), use the computational formula $b_{yx} = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}$ with $\bar{x}, \bar{y}$ read straight from the column sums. The regression line is then $y - \bar{y} = b_{yx}(x - \bar{x})$. Faster than the deviation-from-mean form because it works directly off the column totals.
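
As a rough sketch of the column-sum method, the code below computes $b_{yx}$ and the intercept from four made-up integer points. The function name `regression_y_on_x` and the data are assumptions for illustration; integer inputs are assumed so that exact fractions can be used.

```python
from fractions import Fraction

def regression_y_on_x(points):
    """Return (b_yx, intercept) of the y-on-x line using the column-sum formula.

    Assumes integer coordinates so the result stays an exact fraction.
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b_yx = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)
    x_bar = Fraction(sum_x, n)
    y_bar = Fraction(sum_y, n)
    # y - y_bar = b_yx (x - x_bar)  =>  y = b_yx * x + (y_bar - b_yx * x_bar)
    return b_yx, y_bar - b_yx * x_bar

# Hypothetical data: column sums are 10, 16, 47 (xy) and 30 (x^2).
points = [(1, 2), (2, 3), (3, 5), (4, 6)]
slope, intercept = regression_y_on_x(points)
print(slope, intercept)  # 7/5 1/2
```

Here $b_{yx} = \dfrac{4 \cdot 47 - 10 \cdot 16}{4 \cdot 30 - 10^2} = \dfrac{28}{20} = \dfrac{7}{5}$, and the line is $y = \tfrac{7}{5}x + \tfrac{1}{2}$.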

Concept 3 of 5

Regression Coefficients and Their Link to r

Intuition

The two regression slopes $b_{yx}$ and $b_{xy}$ carry the same information as $r$: their product equals $r^2$, and they share the same sign as $r$. This means once you know any two of $\{r, b_{yx}, b_{xy}\}$, the third is forced.

Definition

$b_{yx} \cdot b_{xy} = r^2$, with $0 \leq r^2 \leq 1$. Therefore $b_{yx} \cdot b_{xy} \leq 1$ always. Also $\text{sign}(b_{yx}) = \text{sign}(b_{xy}) = \text{sign}(r)$: the two slopes can never have opposite signs.

Product Identity

$$b_{yx} \cdot b_{xy} = r^2, \qquad r = \pm\sqrt{b_{yx}\,b_{xy}}$$
  • Sign of $r$: same as the common sign of $b_{yx}$ and $b_{xy}$
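
The identity and the sign rule fit in one small function. This is a sketch under the assumption that the two coefficients are already known; the name `r_from_slopes` and the sample values are made up for illustration.

```python
import math

def r_from_slopes(b_yx, b_xy):
    """Recover r from the two regression coefficients using b_yx * b_xy = r^2."""
    product = b_yx * b_xy
    if product < 0:
        raise ValueError("coefficients have opposite signs; not a valid regression pair")
    if product > 1:
        raise ValueError("product exceeds 1; the lines are probably paired the wrong way round")
    # r takes the common sign of the two coefficients.
    sign = -1.0 if b_yx < 0 else 1.0
    return sign * math.sqrt(product)

r_pos = r_from_slopes(0.6, 0.6)      # product 0.36
r_neg = r_from_slopes(-0.5, -0.32)   # product 0.16
print(r_pos, r_neg)
```

The two error branches mirror the two sanity checks in the definition: the coefficients must share a sign, and their product cannot exceed 1.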

Worked example

Two lines of regression are $3x - 5y + 10 = 0$ and $5x - 3y - 6 = 0$. Find the correlation coefficient $r$ between $x$ and $y$.
  1. Pairing A: first line as $y$ on $x$: $y = \dfrac{3x + 10}{5}$, so $b_{yx} = \dfrac{3}{5}$. Second line as $x$ on $y$: $x = \dfrac{3y + 6}{5}$, so $b_{xy} = \dfrac{3}{5}$.
  2. Check the product: $b_{yx} \cdot b_{xy} = \dfrac{9}{25} = 0.36 \leq 1$, so this pairing is valid.
  3. Pairing B (the swap) would give both slopes $\dfrac{5}{3}$, product $\dfrac{25}{9} > 1$, which is impossible, so pairing A is correct.
  4. Hence $r^2 = 0.36$. Both slopes are positive, so $r = +\sqrt{0.36} = 0.6$.
Answer: $r = 0.6$

Try it yourself

Given $b_{yx} = -1.2$ and $b_{xy} = -0.3$, find $r$.

Practice — Level 1 (4 reps)

Quick reps to lock in the method. Try each, then check.

  1. $b_{yx} = 0.4$, $b_{xy} = 0.9$. Find $r^2$.
  2. $b_{yx} = 0.4$, $b_{xy} = 0.9$. Find $r$.
  3. $b_{yx} = -2$, $b_{xy} = -0.5$. Find $r$.
  4. Can $b_{yx} = 2$ and $b_{xy} = 0.8$ hold for one dataset?

From the bank · past-year question

Example 3 · Statistics · HARD
Let two lines of regression be $x+y+11=0$ and $2x+3y+4=0$ for some data. What is the value of correlation coefficient between $x$ and $y$?

[Q111 · Apr · 2026]

$b_{yx} \cdot b_{xy} \leq 1$ is non-negotiable

If your computed product exceeds 1, you have assigned the wrong line to $y$ on $x$. Swap the assignment and recompute: the inequality $b_{yx} b_{xy} = r^2 \leq 1$ picks the correct pairing every time.

Both slopes share the sign of $r$

You cannot have $b_{yx} > 0$ and $b_{xy} < 0$. If a problem seems to suggest this, the lines have been labelled wrong.

Concept 4 of 5

Identifying Which Regression Line is Which

Intuition

When NDA gives you two equations and doesn't label them, you must figure out which is $y$ on $x$ and which is $x$ on $y$. Use the inequality $b_{yx}\,b_{xy} \leq 1$ as a sieve: there are only two possible pairings; one will satisfy the inequality, the other won't.

Definition

Two regression lines $L_1$ and $L_2$ can be paired in two ways. The correct pairing is the one for which the product of the slopes (interpreted as $b_{yx} \cdot b_{xy}$) is at most 1. The wrong pairing always gives a product greater than 1 (provided the lines are distinct).

Sieve Inequality

$$\text{Correct pairing satisfies } b_{yx}\,b_{xy} \leq 1; \text{ wrong pairing gives } > 1.$$
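
The sieve is mechanical enough to code directly. The sketch below assumes each line is a triple `(a, b, c)` for $ax + by + c = 0$ with $a, b \neq 0$; the function name `identify_regression_lines` is hypothetical. It tries both pairings and keeps the one whose product lies in $[0, 1]$.

```python
from fractions import Fraction

def identify_regression_lines(line1, line2):
    """Return (y_on_x_line, x_on_y_line, b_yx, b_xy) for two unlabelled regression lines.

    Each line is (a, b, c) meaning a*x + b*y + c = 0, with a and b nonzero.
    """
    def coeff_as_y_on_x(line):
        a, b, _ = line
        return Fraction(-a, b)  # y = -(a/b) x - c/b

    def coeff_as_x_on_y(line):
        a, b, _ = line
        return Fraction(-b, a)  # x = -(b/a) y - c/a

    for y_line, x_line in ((line1, line2), (line2, line1)):
        b_yx = coeff_as_y_on_x(y_line)
        b_xy = coeff_as_x_on_y(x_line)
        if 0 <= b_yx * b_xy <= 1:
            return y_line, x_line, b_yx, b_xy
    raise ValueError("no pairing satisfies 0 <= b_yx * b_xy <= 1")

# 2x - 3y + 1 = 0 and 4x - 5y + 3 = 0
y_line, x_line, b_yx, b_xy = identify_regression_lines((2, -3, 1), (4, -5, 3))
print(y_line, b_yx, x_line, b_xy)  # (2, -3, 1) 2/3 (4, -5, 3) 5/4
```

The two candidate products are reciprocals of each other, which is why at most one of them can fall at or below 1 when the lines are distinct.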

Diagram · which regression line is which

[Diagram: the y-on-x line and the x-on-y line crossing at (x̄, ȳ).]

Both lines pass through the mean point (x̄, ȳ). The y-on-x line is the flatter one (it minimises vertical gaps); x-on-y is steeper. Their slopes are $b_{yx}$ and $1/b_{xy}$, with $b_{yx} \cdot b_{xy} = r^2$.

Worked example

Two lines of regression are $2x - 3y + 1 = 0$ and $4x - 5y + 3 = 0$. Identify which is $y$ on $x$ and find $b_{yx}$ and $b_{xy}$.
  1. Pairing A: first as $y$ on $x$ gives $y = \dfrac{2x+1}{3}$, so $b_{yx} = \dfrac{2}{3}$. Second as $x$ on $y$ gives $x = \dfrac{5y - 3}{4}$, so $b_{xy} = \dfrac{5}{4}$. Product $= \dfrac{5}{6} \leq 1$, valid.
  2. Pairing B (swap): first as $x$ on $y$ gives $b_{xy} = \dfrac{3}{2}$; second as $y$ on $x$ gives $b_{yx} = \dfrac{4}{5}$. Product $= \dfrac{6}{5} > 1$, rejected.
  3. Conclusion: the first line is $y$ on $x$ ($b_{yx} = 2/3$); the second is $x$ on $y$ ($b_{xy} = 5/4$).
Answer: $b_{yx} = \dfrac{2}{3},\ b_{xy} = \dfrac{5}{4}$

Try it yourself

Two lines of regression are $x + 4y - 7 = 0$ and $2x + 5y - 9 = 0$. Identify which is $y$ on $x$ and report both slopes.

Practice — Level 1 (4 reps)

Quick reps to lock in the method. Try each, then check.

  1. A pairing gives slope product $1.5$. Valid $b_{yx}\cdot b_{xy}$?
  2. Pairing A product $0.8$, pairing B product $1.25$. Which is correct?
  3. The product of the two regression slopes equals?
  4. Why can the wrong pairing exceed $1$?

From the bank · past-year question

Example 4 · Statistics · HARD
Let $x-3y+4=0$ and $2x-7y+8=0$ be two lines of regression computed from some bivariate data. If $b_{yx}$ and $b_{xy}$ are regression coefficients of lines of regression of $y$ on $x$ and $x$ on $y$ respectively, then what is the value of $b_{xy}+7b_{yx}$?

[Q101 · Sep · 2024]

Try the inequality before doing anything else

Always compute both candidate pairings of $b_{yx} \cdot b_{xy}$. The pairing that satisfies the $\leq 1$ condition is the correct one. Don't try to reason geometrically from the slopes; the inequality is mechanical and unambiguous.

Concept 5 of 5

Angle Between the Two Regression Lines

Intuition

The two regression lines coincide when correlation is perfect and stand perpendicular when there is no correlation. In between, they make an acute angle whose tangent you can read straight off the line equations using the ordinary "angle between two lines" formula from coordinate geometry, with no need to compute $r$, $\sigma_x$, $\sigma_y$ first.

Definition

Treat the two regression lines as ordinary straight lines in the $(x,y)$ plane with slopes $m_1$ and $m_2$ (read directly from each equation after solving for $y$). The acute angle $\theta$ between them satisfies the standard formula below. When $r = \pm 1$ the slopes coincide and $\tan\theta = 0$; when $r = 0$ the lines are $y = \bar{y}$ (horizontal) and $x = \bar{x}$ (vertical), so they are perpendicular and $\tan\theta$ is undefined.

Angle between two lines (applied to regression)

$$\tan\theta = \left|\dfrac{m_1 - m_2}{1 + m_1\,m_2}\right|$$
  • $m_1, m_2$: slopes of the two regression lines in the $(x,y)$ plane
  • $\theta$: acute angle between the lines
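
A small sketch of the same formula, working from coefficients instead of slopes. With $m_1 = -a_1/b_1$ and $m_2 = -a_2/b_2$, the ratio simplifies to $\left|\dfrac{a_1 b_2 - a_2 b_1}{a_1 a_2 + b_1 b_2}\right|$; using this cross-multiplied form is a design choice made here so that vertical lines need no special handling. The function name `tan_acute_angle` is hypothetical.

```python
from fractions import Fraction

def tan_acute_angle(line1, line2):
    """tan of the acute angle between a1*x + b1*y + c1 = 0 and a2*x + b2*y + c2 = 0.

    Returns None when the lines are perpendicular (the tangent is undefined).
    """
    a1, b1, _ = line1
    a2, b2, _ = line2
    numerator = a1 * b2 - a2 * b1
    denominator = a1 * a2 + b1 * b2
    if denominator == 0:
        return None
    return abs(Fraction(numerator, denominator))

# x + 3y + 2 = 0 and 2x + 5y + 1 = 0
print(tan_acute_angle((1, 3, 2), (2, 5, 1)))   # 1/17

# x = 2 and y = 3 (the r = 0 situation): perpendicular, so no finite tangent
print(tan_acute_angle((1, 0, -2), (0, 1, -3)))  # None
```

The constant terms play no role, since the angle depends only on the directions of the two lines.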

Diagram · angle between the regression lines

[Diagram: the two regression lines meeting at (x̄, ȳ), shown at θ ≈ 31°.]

The lines meet at (x̄, ȳ) at angle θ, where tan θ = |(m₂ − m₁) / (1 + m₁m₂)|. As correlation strengthens (r → ±1) the two lines rotate together and θ → 0 — they coincide at perfect correlation. As r → 0 they splay apart, signalling no linear relationship.

Worked example

Two lines of regression are $x + 3y + 2 = 0$ and $2x + 5y + 1 = 0$. Find the tangent of the acute angle between them.
  1. Solve each line for $y$. Line 1: $y = -\tfrac{1}{3}x - \tfrac{2}{3}$, so $m_1 = -\tfrac{1}{3}$.
  2. Line 2: $y = -\tfrac{2}{5}x - \tfrac{1}{5}$, so $m_2 = -\tfrac{2}{5}$.
  3. Apply the formula: $\tan\theta = \left|\dfrac{-1/3 - (-2/5)}{1 + (-1/3)(-2/5)}\right| = \left|\dfrac{1/15}{17/15}\right|$.
  4. Simplify: $\tan\theta = \dfrac{1}{17}$.
Answer: $\tan\theta = \dfrac{1}{17}$

Try it yourself

Two regression lines have slopes $m_1 = \dfrac{1}{2}$ and $m_2 = 3$. Find $\tan\theta$ of the acute angle between them.

Practice — Level 1 (4 reps)

Quick reps to lock in the method. Try each, then check.

  1. If $r = \pm 1$, the angle between the two regression lines?
  2. If $r = 0$, the angle between the regression lines?
  3. Slopes $m_1 = 2$, $m_2 = 3$. Find $\tan\theta$.
  4. Slopes $m_1 = 1$, $m_2 = -1$. The lines are?

From the bank · past-year question

Example 5 · Statistics · HARD
Let $x+2y+1=0$ and $2x+3y+4=0$ be two lines of regression computed from some bivariate data. If $\theta$ is the acute angle between them, then what is the value of $488\tan 3\theta$?

[Q104 · Apr · 2024]

Slope of the $x$-on-$y$ line is NOT $b_{xy}$ in the $(x,y)$ plane

The slope of the $y$-on-$x$ line in the $(x,y)$ plane is $b_{yx}$. But the slope of the $x$-on-$y$ line (written $x = a + b_{xy} y$) in the $(x,y)$ plane is $1/b_{xy}$, NOT $b_{xy}$. When reading slopes off the line equation directly (solve for $y$, take the coefficient of $x$), you sidestep this trap.

Acute angle only: take absolute value

A negative tangent would correspond to the obtuse supplement. The formula's absolute value guarantees $\theta \leq 90^\circ$. PYQs always ask the acute angle.

Summary — formulas & gotchas at a glance

A revision cheat-sheet for the formulas and gotchas above.

Formulas (5)

  • Correlation Coefficient and Its Properties

    Correlation Coefficient and Invariance Rule

    $$r = \dfrac{\text{Cov}(X,Y)}{\sigma_X\,\sigma_Y} \qquad r_{(aX+b,\,cY+d)} = \text{sign}(ac)\,r_{XY}$$
  • Lines of Regression

    Lines of Regression (point-slope form)

    $$y - \bar{y} = b_{yx}(x - \bar{x}) \qquad x - \bar{x} = b_{xy}(y - \bar{y})$$
  • Regression Coefficients and Their Link to r

    Product Identity

    $$b_{yx} \cdot b_{xy} = r^2, \qquad r = \pm\sqrt{b_{yx}\,b_{xy}}$$
  • Identifying Which Regression Line is Which

    Sieve Inequality

    $$\text{Correct pairing satisfies } b_{yx}\,b_{xy} \leq 1; \text{ wrong pairing gives } > 1.$$
  • Angle Between the Two Regression Lines

    Angle between two lines (applied to regression)

    $$\tan\theta = \left|\dfrac{m_1 - m_2}{1 + m_1\,m_2}\right|$$

Watch out for (9)

Mastery check — 5 interleaved questions

Try each one before clicking. Questions are interleaved across the concepts above, not grouped — interleaving sharpens transfer.

Example 1 · Statistics · EASY
The coefficient of correlation between ages of husband and wife at the time of marriage for a given set of 100 couples was noted to be 0.7. Assume that all these couples survive to celebrate the silver jubilee of their marriage. The coefficient of correlation at that point of time will be

[Q114 · Sep · 2022]

Example 2 · Statistics · MODERATE
Let X and Y represent prices (in ₹) of a commodity in Kolkata and Mumbai respectively. It is given that $\bar{X}=65$, $\bar{Y}=67$, $\sigma_X=2.5$, $\sigma_Y=3.5$ and $r(X,Y)=0.8$. What is the equation of regression of Y on X?

[Q117 · Apr · 2020]

Example 3 · Statistics · MODERATE
Direction: Consider the following for the items that follow. Two regression lines are given as $3x - 4y + 8 = 0$ and $4x - 3y - 1 = 0$.
Consider the following statements: 1. The regression line of $y$ on $x$ is $y=\frac{3}{4}x+2$. 2. The regression line of $x$ on $y$ is $x=\frac{3}{4}y+\frac{1}{4}$. Which of the above statements is/are correct?

[Q106 · Sep · 2021]

Example 4 · Statistics · HARD
Let $x+2y+1=0$ and $2x+3y+4=0$ be two lines of regression computed from some bivariate data. If $\theta$ is the acute angle between them, then what is the value of $488\tan 3\theta$?

[Q104 · Apr · 2024]

Example 5 · Statistics · HARD
Consider the statements: 1. $r = 0 \Rightarrow$ the regression lines are parallel. 2. $r = +1 \Rightarrow$ the regression lines are perpendicular. Which is/are correct?

[Q106 · Apr · 2018]
