Detailed Description on Cross Entropy Loss Function
ICSL Seminar
김범준
2019. 01. 03
 Cross Entropy Loss
- Used universally in classification problems
- Computes the cross entropy between the prediction and the label
- This talk investigates the concrete theoretical basis and interprets the intuitive meaning

$H(P, Q) = -\sum_{i=1}^{c} p_i \log(q_i)$
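As a concrete illustration (not part of the original slides), a minimal NumPy sketch of this formula, assuming a one-hot label P and a probability-vector prediction Q:

```python
# Minimal sketch (assumed toy values): cross entropy H(P, Q) = -sum_i p_i * log(q_i)
# between a label distribution P and a prediction distribution Q.
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy between discrete distributions p (label) and q (prediction)."""
    q = np.clip(q, eps, 1.0)            # guard against log(0)
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0, 0.0])           # one-hot label
q = np.array([0.9, 0.05, 0.05])         # model prediction
print(cross_entropy(p, q))              # ~0.105, i.e. -log(0.9)
```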
• Theoretical Derivation
- Binary Classification Problem
- Multiclass Classification Problem
• Intuitive understanding
- Relation to the KL-Divergence
Binary classification example: an image classifier (a neural network NN with parameters $\theta$) maps each input image $x_i$ to a prediction $h_\theta(x_i)$, which is compared against the label $y_i$.

Image Classifier / Prediction / Label
- $x_1 \rightarrow h_\theta(x_1) = 0.1$, $y_1 = 0$
- $x_2 \rightarrow h_\theta(x_2) = 0.95$, $y_2 = 1$

Training dataset: inputs $x_1, \dots, x_m$ with labels $y_1, \dots, y_m$, e.g. $[0, 0, 0, 1, 1, 1]$.

Notation: $p(Y = y_i \mid X = x_i) = p(y_i \mid x_i)$

Likelihood: $L(\theta) = p(y_1, \dots, y_m \mid x_1, \dots, x_m; \theta)$
: how plausible it is, under $\theta$, that the predictions come out as $[0, 0, 0, 1, 1, 1]$

Maximum likelihood: $\hat{\theta} = \operatorname{argmax}_\theta L(\theta)$
: choose the $\theta$ under which the predictions $[0, 0, 0, 1, 1, 1]$ are most plausible
For the binary classifier, interpret the prediction as a probability:

$p(y_i = 1 \mid x_i; \theta) = h_\theta(x_i)$
$p(y_i = 0 \mid x_i; \theta) = 1 - h_\theta(x_i)$

That is, $p(y_i \mid x_i; \theta) = h_\theta(x_i)^{y_i} \, (1 - h_\theta(x_i))^{1 - y_i}$ : a Bernoulli distribution.

The likelihood then factorizes over the samples:

$L(\theta) = p(y_1, \dots, y_m \mid x_1, \dots, x_m; \theta)$
$= \prod_{i=1}^{m} p(y_i \mid x_i; \theta)$ (∵ i.i.d. assumption)
$= \prod_{i=1}^{m} h_\theta(x_i)^{y_i} \, (1 - h_\theta(x_i))^{1 - y_i}$

* i.i.d.: independent and identically distributed
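A small numerical sketch of this factorization (toy values assumed, not from the slides): each factor is a Bernoulli probability, and the likelihood is their product under the i.i.d. assumption.

```python
# Sketch with assumed toy predictions: the dataset likelihood L(theta) is the
# product of per-sample Bernoulli probabilities h^y * (1 - h)^(1 - y).
import numpy as np

h = np.array([0.1, 0.2, 0.05, 0.9, 0.95, 0.8])   # h_theta(x_i): predicted p(y_i = 1 | x_i)
y = np.array([0,   0,   0,    1,   1,    1])      # labels [0, 0, 0, 1, 1, 1]

per_sample = h**y * (1 - h)**(1 - y)              # Bernoulli p(y_i | x_i; theta)
likelihood = np.prod(per_sample)                  # product over i (i.i.d. assumption)
print(per_sample)                                 # [0.9  0.8  0.95 0.9  0.95 0.8 ]
print(likelihood)                                 # ~0.468
```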
$\hat{\theta} = \operatorname{argmax}_\theta L(\theta)$
$= \operatorname{argmin}_\theta (-\log L(\theta))$ (∵ log is a monotonically increasing function)
$= \operatorname{argmin}_\theta \sum_{i=1}^{m} [-y_i \log h_\theta(x_i) - (1 - y_i) \log(1 - h_\theta(x_i))]$ (∵ properties of log)
$= \operatorname{argmin}_\theta \sum_{i=1}^{m} H(y_i, h_\theta(x_i))$

where $H(y_i, h_\theta(x_i)) = -y_i \log h_\theta(x_i) - (1 - y_i) \log(1 - h_\theta(x_i))$ : Binary Cross Entropy,
and $h_\theta(x_i), y_i \in [0, 1]$ are probability values.

Maximize Likelihood ⟺ Minimize Binary Cross Entropy
Binary Classification Problem
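A quick numerical check of this equivalence (same assumed toy values as above): the negative log-likelihood equals the summed per-sample binary cross entropy, so maximizing one is minimizing the other.

```python
# Sketch (assumed toy values): -log L(theta) equals the sum of per-sample
# binary cross entropies, so argmax L(theta) == argmin sum BCE.
import numpy as np

h = np.array([0.1, 0.2, 0.05, 0.9, 0.95, 0.8])   # predictions h_theta(x_i)
y = np.array([0,   0,   0,    1,   1,    1])      # labels

neg_log_likelihood = -np.log(np.prod(h**y * (1 - h)**(1 - y)))
bce_sum = np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))
print(np.isclose(neg_log_likelihood, bce_sum))    # True
```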
Multiclass classification example: the classifier now outputs a probability vector over three classes, and the labels are one-hot encoded.

Image Classifier / Prediction / Label
- $x_1 \rightarrow h_\theta(x_1) = [0.9, 0.05, 0.05]$, $y_1 = [1, 0, 0]$
- $x_2 \rightarrow h_\theta(x_2) = [0.03, 0.95, 0.02]$, $y_2 = [0, 1, 0]$
- $x_3 \rightarrow h_\theta(x_3) = [0.01, 0.01, 0.98]$, $y_3 = [0, 0, 1]$

$p(y_i = [1, 0, 0] \mid x_i; \theta) = p(y_i(0) = 1 \mid x_i; \theta)$ (assume one-hot encoding)
$= h_\theta(x_i)(0)$

In the same way,
$p(y_i = [0, 1, 0] \mid x_i; \theta) = h_\theta(x_i)(1)$
$p(y_i = [0, 0, 1] \mid x_i; \theta) = h_\theta(x_i)(2)$

That is, $p(y_i \mid x_i; \theta) = h_\theta(x_i)(0)^{y_i(0)} \, h_\theta(x_i)(1)^{y_i(1)} \, h_\theta(x_i)(2)^{y_i(2)}$
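A tiny sketch of this product form (toy values assumed): with a one-hot $y_i$, the exponents zero out every factor except the predicted probability of the true class.

```python
# Sketch (assumed toy values): with one-hot y_i, the product over classes
# prod_k h(x_i)(k)^(y_i(k)) reduces to the predicted probability of the true class.
import numpy as np

h_xi = np.array([0.9, 0.05, 0.05])   # h_theta(x_i)
y_i  = np.array([1, 0, 0])           # one-hot label

p_yi = np.prod(h_xi ** y_i)          # factors with exponent 0 become 1
print(p_yi)                          # 0.9 == h_theta(x_i)(0)
```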
$\hat{\theta} = \operatorname{argmax}_\theta L(\theta)$
$= \operatorname{argmin}_\theta (-\log L(\theta))$
$= \operatorname{argmin}_\theta \sum_{i=1}^{m} [-y_i(0) \log h_\theta(x_i)(0) - y_i(1) \log h_\theta(x_i)(1) - y_i(2) \log h_\theta(x_i)(2)]$
$= \operatorname{argmin}_\theta \sum_{i=1}^{m} H(y_i, h_\theta(x_i))$

where $H(P, Q) = -\sum_{i=1}^{c} p_i \log(q_i)$ : Cross Entropy,
and $h_\theta(x_i), y_i$ are probability distributions.

Maximize Likelihood ⟺ Minimize Cross Entropy
Multiclass Classification Problem
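A numerical sketch of this conclusion (toy values assumed): with one-hot labels, the summed cross entropy equals the negative log-likelihood of the categorical model.

```python
# Sketch (assumed toy values): sum_i H(y_i, h_theta(x_i)) equals -log L(theta)
# for the categorical model with one-hot labels.
import numpy as np

h = np.array([[0.90, 0.05, 0.05],    # h_theta(x_1)
              [0.03, 0.95, 0.02],    # h_theta(x_2)
              [0.01, 0.01, 0.98]])   # h_theta(x_3)
y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])            # one-hot labels

cross_entropy_sum  = np.sum(-y * np.log(h))                  # sum_i H(y_i, h_theta(x_i))
neg_log_likelihood = -np.sum(np.log(np.prod(h**y, axis=1)))  # -log prod_i p(y_i | x_i; theta)
print(np.isclose(cross_entropy_sum, neg_log_likelihood))     # True
```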
• Theoretical Derivation
- Binary Classification Problem
- Multiclass Classification Problem
• Intuitive understanding
- Relation to the KL-Divergence
* KL-Divergence: Kullback–Leibler divergence

$H(P, Q) = \sum_{i=1}^{c} p_i \log \frac{1}{q_i}$
$= \sum_{i=1}^{c} \left( p_i \log \frac{p_i}{q_i} + p_i \log \frac{1}{p_i} \right)$
$= KL(P \| Q) + H(P)$

Cross-entropy = KL-Divergence + the entropy of P itself.
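A quick numerical check of this decomposition (example distributions assumed, not from the slides):

```python
# Sketch (assumed example distributions): verify H(P, Q) = KL(P || Q) + H(P).
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

cross_entropy = -np.sum(p * np.log(q))
kl_divergence =  np.sum(p * np.log(p / q))
entropy_p     = -np.sum(p * np.log(p))
print(np.isclose(cross_entropy, kl_divergence + entropy_p))   # True
```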
$\hat{\theta} = \operatorname{argmax}_\theta L(\theta)$
$= \operatorname{argmin}_\theta \sum_{i=1}^{m} H(y_i, h_\theta(x_i))$
$= \operatorname{argmin}_\theta \sum_{i=1}^{m} \left( KL(y_i \| h_\theta(x_i)) + H(y_i) \right)$ (∵ $H(P, Q) = KL(P \| Q) + H(P)$)
$= \operatorname{argmin}_\theta \sum_{i=1}^{m} KL(y_i \| h_\theta(x_i))$ (∵ the entropy of a one-hot encoded label is 0)

Maximize Likelihood ⟺ Minimize Cross Entropy ⟺ Minimize KL-Divergence
Multiclass Classification Problem
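A small sketch of the last step (toy values assumed): a one-hot label has zero entropy, so its cross entropy against the prediction coincides with the KL-divergence.

```python
# Sketch (assumed toy values): for a one-hot label, H(y) = 0 (convention 0*log 0 = 0),
# so H(y, h) = KL(y || h) and minimizing cross entropy minimizes KL-divergence.
import numpy as np

y = np.array([1.0, 0.0, 0.0])          # one-hot label
h = np.array([0.9, 0.05, 0.05])        # prediction

nz = y > 0                              # only nonzero label entries contribute
entropy_y     = -np.sum(y[nz] * np.log(y[nz]))          # 0.0
cross_entropy = -np.sum(y[nz] * np.log(h[nz]))          # -log(0.9)
kl_divergence =  np.sum(y[nz] * np.log(y[nz] / h[nz]))
print(entropy_y)                                        # 0.0
print(np.isclose(cross_entropy, kl_divergence))         # True
```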
 From the information-theoretic viewpoint, the KL-divergence can be understood intuitively as a "degree of surprise".
 Example: the semifinalists are the LG Twins, Hanwha Eagles, NC Dinos, and Samsung Lions.
- Prediction model 1: $\hat{y} = Q = [0.9, 0.03, 0.03, 0.04]$
- Prediction model 2: $\hat{y} = Q = [0.3, 0.6, 0.05, 0.05]$
- Actual result: $y = P = [1, 0, 0, 0]$
- Prediction model 2 yields the larger surprise
- Minimizing the surprise → Q approximates P → the two distributions become similar → accurate prediction

$KL(P \| Q) = \sum_{i=1}^{c} p_i \log \frac{p_i}{q_i}$
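The example as a numerical sketch (values taken from the slide, team names omitted): the KL-divergence of the actual outcome against each model quantifies how surprised each model is.

```python
# Sketch of the slide's surprisal example: KL(P || Q) of the actual (one-hot)
# outcome P against two prediction models Q1 and Q2.
import numpy as np

def kl_divergence(p, q):
    nz = p > 0                          # convention: 0 * log(0 / q) = 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p  = np.array([1.0, 0.0, 0.0, 0.0])            # actual result
q1 = np.array([0.90, 0.03, 0.03, 0.04])        # prediction model 1
q2 = np.array([0.30, 0.60, 0.05, 0.05])        # prediction model 2

print(kl_divergence(p, q1))   # ~0.105 -> small surprise
print(kl_divergence(p, q2))   # ~1.204 -> large surprise, model 2 is more surprised
```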
Multiclass Classification Problem:
Maximize Likelihood ⟺ Minimize Cross Entropy ⟺ Minimize KL-Divergence ⟺ Minimize Surprisal
→ The prediction approximates the label → Better classification performance in general