|
19 | 19 | "__Volodymyr Kuleshov__<br>Cornell Tech"
|
20 | 20 | ]
|
21 | 21 | },
|
| 22 | + { |
| 23 | + "cell_type": "markdown", |
| 24 | + "metadata": { |
| 25 | + "slideshow": { |
| 26 | + "slide_type": "slide" |
| 27 | + } |
| 28 | + }, |
| 29 | + "source": [ |
| 30 | + "# Announcements\n", |
| 31 | + "\n", |
| 32 | + "* Project proposals were graded over the weekend.\n", |
| 33 | + "* Milestone due on 11/7.\n", |
| 34 | + "* HW3 will be graded next week. Prelim will be graded after that.\n", |
| 35 | + "* We will hold a review session for the prelim." |
| 36 | + ] |
| 37 | + }, |
22 | 38 | {
|
23 | 39 | "cell_type": "markdown",
|
24 | 40 | "metadata": {
|
|
70 | 86 | " ensemble.append(model)\n",
|
71 | 87 | "\n",
|
72 | 88 | "# output average prediction at test time:\n",
|
73 |
| - "y_test = ensemble.average_prediction(y_test)\n", |
| 89 | + "y_test = ensemble.average_prediction(x_test)\n", |
74 | 90 | "```\n",
|
75 | 91 | "<!-- Data samples taken with replacement are known as bootstrap samples. -->"
|
76 | 92 | ]
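The bagging pseudocode above can be made runnable. This is a minimal NumPy-only sketch; the cubic-polynomial weak model and all variable names are illustrative stand-ins, since the slide leaves the base model unspecified:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)

ensemble = []
for _ in range(25):
    # bootstrap sample: n points drawn with replacement
    idx = rng.integers(0, x.size, x.size)
    # fit a weak model on the sampled data; a cubic polynomial
    # stands in for the unspecified model in the pseudocode
    ensemble.append(np.polyfit(x[idx], y[idx], deg=3))

# output average prediction at test time
x_test = np.linspace(-3, 3, 50)
y_test = np.mean([np.polyval(c, x_test) for c in ensemble], axis=0)
```

Averaging the 25 bootstrapped fits smooths out the variance of any single fit, which is the point of bagging.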
|
|
162 | 178 | "source": [
|
163 | 179 | "# Structure of a Boosting Algorithm\n",
|
164 | 180 | "\n",
|
165 |
| - "Boosting reduces *underfitting* by combining models that correct each others' errors." |
| 181 | + "Boosting reduces *underfitting* via models that correct each other's errors." |
166 | 182 | ]
|
167 | 183 | },
|
168 | 184 | {
|
|
173 | 189 | }
|
174 | 190 | },
|
175 | 191 | "source": [
|
176 |
| - "1. Fit a weak learner $g_0$ on dataset $\\mathcal{D} = \\{(x^{(i)}, y^{(i)})\\}$. Let $f=g_0$." |
| 192 | + "1. Compute weights $w^{(i)}$ for each $i$ based on $t$-th model predictions $f_t(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors." |
177 | 193 | ]
|
178 | 194 | },
|
179 | 195 | {
|
|
184 | 200 | }
|
185 | 201 | },
|
186 | 202 | "source": [
|
187 |
| - "2. Compute weights $w^{(i)}$ for each $i$ based on model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors." |
| 203 | + "2. Fit new weak learner $g_t$ on $\\mathcal{D} = \\{(x^{(i)}, y^{(i)})\\}$ with weights $w^{(i)}$." |
188 | 204 | ]
|
189 | 205 | },
|
190 | 206 | {
|
|
195 | 211 | }
|
196 | 212 | },
|
197 | 213 | "source": [
|
198 |
| - "3. Fit new weak learner $g_1$ on $\\mathcal{D} = \\{(x^{(i)}, y^{(i)})\\}$ with weights $w^{(i)}$." |
199 |
| - ] |
200 |
| - }, |
201 |
| - { |
202 |
| - "cell_type": "markdown", |
203 |
| - "metadata": { |
204 |
| - "slideshow": { |
205 |
| - "slide_type": "fragment" |
206 |
| - } |
207 |
| - }, |
208 |
| - "source": [ |
209 |
| - "4. Set $f_1 = g_0 + \\alpha_1 g$ for some weight $\\alpha_1$. Go to Step 2 and repeat." |
| 214 | + "3. Set $f_{t+1} = f_t + \\alpha_t g_t$ for some weight $\\alpha_t$. Go to Step 1 and repeat." |
210 | 215 | ]
|
211 | 216 | },
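The three steps above can be sketched in code. AdaBoost with decision stumps is one concrete instantiation of this template (a NumPy-only sketch; the toy dataset, the stump learner, and all names are illustrative assumptions, not from the notebook):

```python
import numpy as np

def fit_stump(x, y, w):
    """Weak learner g: pick threshold/sign minimizing weighted 0-1 error."""
    best = (np.inf, 0.0, 1)
    for thr in np.unique(x):
        for sign in (1, -1):
            pred = sign * np.where(x < thr, 1, -1)
            err = w @ (pred != y)
            if err < best[0]:
                best = (err, thr, sign)
    return best

# toy 1-d classification data with labels in {-1, +1}
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = np.where(np.abs(x) < 0.5, 1, -1)

w = np.full(x.size, 1 / x.size)  # uniform weights to start
stumps = []
for t in range(20):
    # Step 2: fit weak learner g_t under the current weights w
    err, thr, sign = fit_stump(x, y, w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    stumps.append((alpha, thr, sign))
    # Step 1 (next round): give more weight to misclassified points
    pred = sign * np.where(x < thr, 1, -1)
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

# Step 3: f is the weighted sum of the weak learners; predict with its sign
f = sum(a * s * np.where(x < thr, 1, -1) for a, thr, s in stumps)
accuracy = np.mean(np.sign(f) == y)
```

The particular weight update `exp(-alpha * y * pred)` is AdaBoost's choice; other boosting algorithms fill in Steps 1 and 3 differently.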
|
212 | 217 | {
|
|
735 | 740 | }
|
736 | 741 | },
|
737 | 742 | "source": [
|
738 |
| - "* $f(x)$ consists of $T$ smaller models $g$ with weights $\\alpha_t$ and params $\\phi_t$." |
| 743 | + "* $f(x)$ consists of $T$ smaller models $g$ with weights $\\alpha_t$ and params $\\phi_t$." |
739 | 744 | ]
|
740 | 745 | },
|
741 | 746 | {
|
|
781 | 786 | }
|
782 | 787 | },
|
783 | 788 | "source": [
|
784 |
| - "1. Fit a weak learner $g_0$ on dataset $\\mathcal{D} = \\{(x^{(i)}, y^{(i)})\\}$. Let $f=g_0$." |
785 |
| - ] |
786 |
| - }, |
787 |
| - { |
788 |
| - "cell_type": "markdown", |
789 |
| - "metadata": { |
790 |
| - "slideshow": { |
791 |
| - "slide_type": "fragment" |
792 |
| - } |
793 |
| - }, |
794 |
| - "source": [ |
795 |
| - "2. Compute weights $w^{(i)}$ for each $i$ based on model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors." |
| 789 | + "1. Compute weights $w^{(i)}$ for each $i$ based on $t$-th model predictions $f_t(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors." |
796 | 790 | ]
|
797 | 791 | },
|
798 | 792 | {
|
|
803 | 797 | }
|
804 | 798 | },
|
805 | 799 | "source": [
|
806 |
| - "3. Fit new weak learner $g_1$ on $\\mathcal{D} = \\{(x^{(i)}, y^{(i)})\\}$ with weights $w^{(i)}$." |
| 800 | + "2. Fit new weak learner $g_t$ on $\\mathcal{D} = \\{(x^{(i)}, y^{(i)})\\}$ with weights $w^{(i)}$." |
807 | 801 | ]
|
808 | 802 | },
|
809 | 803 | {
|
|
814 | 808 | }
|
815 | 809 | },
|
816 | 810 | "source": [
|
817 |
| - "4. Set $f_1 = g_0 + \\alpha_1 g$ for some weight $\\alpha_1$. Go to Step 2 and repeat." |
| 811 | + "3. Set $f_{t+1} = f_t + \\alpha_t g_t$ for some weight $\\alpha_t$. Go to Step 1 and repeat." |
818 | 812 | ]
|
819 | 813 | },
|
820 | 814 | {
|
|
887 | 881 | },
|
888 | 882 | "source": [
|
889 | 883 | "The resulting algorithm is often called L2Boost. At step $t$ we minimize\n",
|
890 |
| - "$$\\sum_{i=1}^n (r^{(i)}_t - g(x^{(i)}; \\phi))^2, $$\n", |
891 |
| - "where $r^{(i)}_t = y^{(i)} - f(x^{(i)})_{t-1}$ is the residual from the model at time $t-1$." |
| 884 | + "$$\\sum_{i=1}^n (r^{(i)}_t - \\alpha g(x^{(i)}; \\phi))^2, $$\n", |
| 885 | + "where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model $f_{t-1}$." |
892 | 886 | ]
|
893 | 887 | },
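A minimal sketch of the L2Boost objective above, using regression stumps as the weak learners $g(\cdot; \phi)$ with $\phi$ = (threshold, left mean, right mean); the dataset, the shrinkage value, and the candidate-threshold grid are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x)

alpha, T = 0.5, 50
f = np.zeros_like(y)  # running ensemble prediction f_{t-1}(x)
for t in range(T):
    r = y - f  # residuals r_t of the current model
    # weak learner: regression stump; leaf values are residual means,
    # threshold chosen to minimize the L2Boost objective above
    best = (np.inf, None)
    for thr in x[1:-1:5]:
        left = x < thr
        pred = np.where(left, r[left].mean(), r[~left].mean())
        sse = np.sum((r - alpha * pred) ** 2)
        if sse < best[0]:
            best = (sse, pred)
    f = f + alpha * best[1]  # f_t = f_{t-1} + alpha * g_t

train_mse = np.mean((y - f) ** 2)
```

Each round fits the new stump to what the current ensemble still gets wrong, so the training residuals shrink round over round.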
|
894 | 888 | {
|
|
1416 | 1410 | },
|
1417 | 1411 | "source": [
|
1418 | 1412 | "At step $t$ we minimize\n",
|
1419 |
| - "$$\\sum_{i=1}^n (r^{(i)}_t - g(x^{(i)}; \\phi))^2, $$\n", |
| 1413 | + "$$\\sum_{i=1}^n (r^{(i)}_t - \\alpha g(x^{(i)}; \\phi))^2, $$\n", |
1420 | 1414 | "where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual error of the model $f_{t-1}$."
|
1421 | 1415 | ]
|
1422 | 1416 | },
|
|
1936 | 1930 | "source": [
|
1937 | 1931 | "We will then perform approximate functional gradient descent using\n",
|
1938 | 1932 | "\n",
|
1939 |
| - "$$f_t \\gets f_{t-1} - \\alpha_t \\nabla g_t$$\n", |
| 1933 | + "$$f_t \\gets f_{t-1} - \\alpha_t g_t$$\n", |
1940 | 1934 | "\n",
|
1941 | 1935 | "which is approximately $f_t \\gets f_{t-1} - \\alpha_t \\nabla J(f_{t-1}).$"
|
1942 | 1936 | ]
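As a quick check of why this approximation is reasonable for the squared loss used earlier: the functional gradient of $J$ with respect to the value $f(x^{(i)})$ is just the negative residual,

```latex
J(f) = \frac{1}{2}\sum_{i=1}^n \big(y^{(i)} - f(x^{(i)})\big)^2
\quad\Rightarrow\quad
\frac{\partial J}{\partial f(x^{(i)})} = -\big(y^{(i)} - f(x^{(i)})\big) = -r^{(i)}.
```

So a weak learner $g_t$ fit to approximate $\nabla J(f_{t-1})$ is fit to the negative residuals, and the update $f_t \gets f_{t-1} - \alpha_t g_t$ recovers the residual-fitting L2Boost update seen above.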
|
|
2158 | 2152 | },
|
2159 | 2153 | "source": [
|
2160 | 2154 | "At step $t$ we minimize\n",
|
2161 |
| - "$$\\sum_{i=1}^n (r^{(i)}_t - g(x^{(i)}; \\phi))^2, $$\n", |
| 2155 | + "$$\\sum_{i=1}^n (r^{(i)}_t - \\alpha g(x^{(i)}; \\phi))^2, $$\n", |
2162 | 2156 | "where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from $f_{t-1}$."
|
2163 | 2157 | ]
|
2164 | 2158 | },
|
|