Linear Regression
“Large groups of people make up all their utterances out of the same stock of lexical forms and grammatical constructions. A linguistic observer therefore can describe the speech-habits of a community without resorting to statistics.” (Bloomfield 1935: 37)
“I think we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure.” (Chomsky 1957: 17)
Problem \(\rightarrow\) Question(s) \(\rightarrow\) Hypothesis(es) \(\rightarrow\) Verify/Observe/Test \(\rightarrow\) Inference/Conclusion
We should develop a habit of questioning reasoning
Regression Models
library(tidyverse)
# Create two numeric vectors and put them in a data frame
idade = c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 9, 11, 12, 12)
altura = c(60, 65, 97, 98, 100, 105, 107, 105, 119, 122, 125, 132, 142, 147, 153)
ex1 = data.frame(idade, altura)
# Model
modAltura = lm(altura ~ idade, data = ex1)
modAltura
Call:
lm(formula = altura ~ idade, data = ex1)
Coefficients:
(Intercept) idade
62.504 7.545
Using the coefficients to draw the regression line
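A minimal sketch of how the two coefficients define the line, reusing the ex1 data from above. The names `b0`, `b1` and `linha` are illustrative and do not come from the original slides; base graphics are used here only to keep the example free of extra packages.

```r
# Data and model as defined above
idade  <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 9, 11, 12, 12)
altura <- c(60, 65, 97, 98, 100, 105, 107, 105, 119, 122, 125, 132, 142, 147, 153)
ex1 <- data.frame(idade, altura)
modAltura <- lm(altura ~ idade, data = ex1)

# Extract intercept and slope (b0 and b1 are illustrative names)
b0 <- coef(modAltura)[["(Intercept)"]]
b1 <- coef(modAltura)[["idade"]]

# The regression line is altura = b0 + b1 * idade
linha <- b0 + b1 * ex1$idade

# The hand-computed values agree with the fitted values stored by lm()
all.equal(linha, unname(fitted(modAltura)))

# Draw the points and the line from the two coefficients
plot(ex1$idade, ex1$altura, xlab = "idade", ylab = "altura")
abline(a = b0, b = b1)
```

With ggplot2, the same line could be drawn with `geom_abline(intercept = b0, slope = b1)`.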
Call:
lm(formula = altura ~ idade, data = ex1)
Residuals:
Min 1Q Median 3Q Max
-12.5946 -3.1391 -0.0477 4.2242 11.8601
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.5040 3.7691 16.58 3.98e-10 ***
idade 7.5453 0.5135 14.69 1.78e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.651 on 13 degrees of freedom
Multiple R-squared: 0.9432, Adjusted R-squared: 0.9388
F-statistic: 215.9 on 1 and 13 DF, p-value: 1.782e-09
1 2 3 4 5 6
-10.04928458 -12.59459459 11.86009539 5.31478537 -0.23052464 4.76947536
7 8 9 10 11 12
6.76947536 -2.77583466 3.67885533 -0.86645469 2.13354531 1.58823529
13 14 15
-3.50238474 -6.04769475 -0.04769475
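The residuals listed above are the differences between each observed height and the height predicted by the line. A short sketch, assuming the same ex1 data (`manual` and `rse` are illustrative names), recomputes them and the residual standard error reported in the summary:

```r
# Data and model as defined above
idade  <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 9, 11, 12, 12)
altura <- c(60, 65, 97, 98, 100, 105, 107, 105, 119, 122, 125, 132, 142, 147, 153)
ex1 <- data.frame(idade, altura)
modAltura <- lm(altura ~ idade, data = ex1)

# Residual = observed value minus fitted value
res    <- residuals(modAltura)
manual <- ex1$altura - fitted(modAltura)
all.equal(res, manual)

# Residual standard error: square root of the sum of squared residuals
# divided by the residual degrees of freedom (n minus 2 coefficients = 13)
rse <- sqrt(sum(res^2) / df.residual(modAltura))
rse
```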
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.5040 3.7691 16.58 3.98e-10 ***
idade 7.5453 0.5135 14.69 1.78e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimate: estimated coefficients (intercept and slope)
Std. Error: standard error of each coefficient
t value: t statistic for each coefficient (\(H_0\): coefficient \(= 0\))
Pr(>|t|): p-values of the t-tests
Signif. codes: significance levels suggested by R
Residual standard error: 6.651 on 13 degrees of freedom
Multiple R-squared: 0.9432, Adjusted R-squared: 0.9388
F-statistic: 215.9 on 1 and 13 DF, p-value: 1.782e-09
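The columns of the coefficient table are related to one another: each t value is the estimate divided by its standard error, and each p-value is the two-sided tail probability of that t value. The sketch below, again assuming the ex1 data (`tab`, `t_manual` and `p_manual` are illustrative names), checks this against what `summary()` reports.

```r
# Data and model as defined above
idade  <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 9, 11, 12, 12)
altura <- c(60, 65, 97, 98, 100, 105, 107, 105, 119, 122, 125, 132, 142, 147, 153)
ex1 <- data.frame(idade, altura)
modAltura <- lm(altura ~ idade, data = ex1)

# Coefficient table as a numeric matrix
tab <- coef(summary(modAltura))

# t value = Estimate / Std. Error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value from the t distribution with 13 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(modAltura), lower.tail = FALSE)

all.equal(t_manual, tab[, "t value"])
all.equal(p_manual, tab[, "Pr(>|t|)"])

# With a single predictor, the F-statistic equals the squared t value of the slope
t_manual[["idade"]]^2
summary(modAltura)$fstatistic[["value"]]
```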
Mean (\(\bar{X}\)) of x = 9
Standard deviation (\(s\)) of x = 3.3
Mean (\(\bar{X}\)) of y = 7.5
Standard deviation (\(s\)) of y = 2
Correlation between x and y = 0.816
Linear regression: \(y = 3 + 0.5x\)
\(R^2 = 0.67\)
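The slides do not name the data behind these numbers, but the values match Anscombe's quartet, which ships with R as the `anscombe` data frame. Assuming that is the intended data set, the sketch below computes the same summaries for each of its four x/y pairs (`stats` is an illustrative name). The four sets look very different when plotted, which is why summary statistics should not replace looking at the data.

```r
# anscombe is part of base R (datasets package): columns x1..x4 and y1..y4
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x = mean(x), sd_x = sd(x),
    mean_y = mean(y), sd_y = sd(y),
    cor_xy = cor(x, y),
    intercept = coef(fit)[[1]], slope = coef(fit)[[2]],
    r2 = summary(fit)$r.squared)
})
colnames(stats) <- paste0("set", 1:4)

# One column per data set; the rows are nearly identical across columns
round(stats, 2)
```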
What if we add data from adults to our height model?
idade = c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 9, 11, 12, 12,
17, 18, 19 ,22, 25, 26, 26, 28, 31, 32, 32, 35, 38, 40, 41, 45, 50,
55, 61, 62, 62, 63)
altura = c(60, 65, 97, 98, 100, 105, 107, 105, 119, 122, 125, 132, 142, 147, 153,
170, 175, 168, 165, 180, 176, 171, 169, 181, 185, 175, 160, 170, 168, 170,
182, 177, 170, 172, 165, 166, 160)
ex2 = data.frame(idade, altura)
# Model
modAltura2 = lm(altura ~ idade, data = ex2)
modAltura2
# Plot the regression line
ggplot(data = ex2, aes(x = idade, y = altura)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Call:
lm(formula = altura ~ idade, data = ex2)
Coefficients:
(Intercept) idade
115.918 1.256
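One way to see what goes wrong is to ask the new model for predictions at the extremes of the data. From the printed coefficients, a 63-year-old would be predicted at roughly 195 cm and a 1-year-old at roughly 117 cm, while the observed heights at those ages are 160 and 60. The sketch below assumes the ex2 data from above; `novos` and `previsto` are illustrative names.

```r
# Data and model as defined above
idade <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 9, 11, 12, 12,
           17, 18, 19, 22, 25, 26, 26, 28, 31, 32, 32, 35, 38, 40, 41, 45, 50,
           55, 61, 62, 62, 63)
altura <- c(60, 65, 97, 98, 100, 105, 107, 105, 119, 122, 125, 132, 142, 147, 153,
            170, 175, 168, 165, 180, 176, 171, 169, 181, 185, 175, 160, 170, 168, 170,
            182, 177, 170, 172, 165, 166, 160)
ex2 <- data.frame(idade, altura)
modAltura2 <- lm(altura ~ idade, data = ex2)

# Predictions at the youngest and oldest ages in the data
novos <- data.frame(idade = c(1, 63))
previsto <- predict(modAltura2, newdata = novos)
previsto

# Proportion of variance explained, to compare with 0.9432 for the children-only model
summary(modAltura2)$r.squared

# Residuals against age: a systematic pattern means a straight line is a poor description
plot(ex2$idade, residuals(modAltura2), xlab = "idade", ylab = "residual")
abline(h = 0, lty = 2)
```

Height grows with age only up to adulthood, so a single straight line over the whole age range misrepresents both ends.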
english data set from the languageR package
Call:
lm(formula = RTlexdec ~ Familiarity, data = english)
Residuals:
Min 1Q Median 3Q Max
-0.49212 -0.11285 -0.00596 0.10569 0.65072
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.780397 0.007178 944.59 <2e-16 ***
Familiarity -0.060676 0.001810 -33.52 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1406 on 4566 degrees of freedom
Multiple R-squared: 0.1975, Adjusted R-squared: 0.1973
F-statistic: 1124 on 1 and 4566 DF, p-value: < 2.2e-16
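As a reading aid, the coefficients above can be written out as an equation. The Familiarity value of 4 is an arbitrary value chosen here for illustration.

\[
\widehat{\text{RTlexdec}} = 6.780397 - 0.060676 \times \text{Familiarity}
\]

\[
\text{Familiarity} = 4: \quad 6.780397 - 0.060676 \times 4 = 6.780397 - 0.242704 = 6.537693
\]

Each additional unit of Familiarity lowers the predicted RTlexdec by about 0.061, and the Multiple R-squared of 0.1975 says that Familiarity alone accounts for roughly 20% of the variance in RTlexdec.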
english data set from the languageR package: age groups (old and young)
Call:
lm(formula = RTlexdec ~ AgeSubject, data = english)
Residuals:
Min 1Q Median 3Q Max
-0.25776 -0.08339 -0.01669 0.06921 0.52685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.660958 0.002324 2866.44 <2e-16 ***
AgeSubjectyoung -0.221721 0.003286 -67.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1111 on 4566 degrees of freedom
Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
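With a categorical predictor, R takes one level as the reference (here old, the first level alphabetically) and reports the difference for the other level. Written out from the coefficients above:

\[
\widehat{\text{RTlexdec}} = 6.660958 - 0.221721 \times [\text{AgeSubject} = \text{young}]
\]

\[
\text{old}: \ 6.660958 \qquad \text{young}: \ 6.660958 - 0.221721 = 6.439237
\]

The intercept is therefore the predicted value for the old group, and the coefficient AgeSubjectyoung is the difference between the young and old groups.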
english data set from the languageR package: reaction time (RTlexdec) as a function of the word's written frequency (WrittenFrequency)
english data set from the languageR package: reaction time (RTlexdec) as a function of word category (WordCategory), verb or noun (coded as V and N)
In the english data, we saw the effects of word familiarity and of age group on reaction time in separate models.
“we live in a multifactorial world in which probably no phenomenon is really monofactorial – probably just about everything is correlated with several things at the same time.” (Gries 2013)
How can we visualize the associations among the three variables in a single plot?
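One possible answer, sketched under the assumption that the ggplot2 and languageR packages are installed, is to map the numeric predictor to the x axis, the response to the y axis, and the categorical predictor to colour. The object name `p` is illustrative.

```r
library(ggplot2)
library(languageR)  # provides the english data set

# Familiarity on x, reaction time on y, age group as colour
p <- ggplot(english, aes(x = Familiarity, y = RTlexdec, colour = AgeSubject)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE)  # one fitted line per age group
p
```

Note that `geom_smooth()` fits a separate line within each colour group, so the two slopes may differ slightly, whereas the additive model `RTlexdec ~ Familiarity + AgeSubject` forces the two lines to be parallel.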
Call:
lm(formula = RTlexdec ~ Familiarity + AgeSubject, data = english)
Residuals:
Min 1Q Median 3Q Max
-0.38126 -0.05907 -0.00418 0.05134 0.53986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.891258 0.004595 1499.82 <2e-16 ***
Familiarity -0.060676 0.001113 -54.52 <2e-16 ***
AgeSubjectyoung -0.221721 0.002558 -86.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08643 on 4565 degrees of freedom
Multiple R-squared: 0.6967, Adjusted R-squared: 0.6966
F-statistic: 5244 on 2 and 4565 DF, p-value: < 2.2e-16
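The model with two predictors can be read the same way as the simple ones: start from the intercept, add the Familiarity effect, and subtract the age-group difference for young subjects. The Familiarity value of 4 is again an arbitrary illustration.

\[
\widehat{\text{RTlexdec}} = 6.891258 - 0.060676 \times \text{Familiarity} - 0.221721 \times [\text{AgeSubject} = \text{young}]
\]

\[
\text{old, Familiarity} = 4: \quad 6.891258 - 0.242704 = 6.648554
\]

\[
\text{young, Familiarity} = 4: \quad 6.648554 - 0.221721 = 6.426833
\]

The Multiple R-squared rises to 0.6967, compared with 0.1975 for Familiarity alone and 0.4992 for AgeSubject alone.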
anova(modelo1, modelo2) compares two nested models with an F-test (p < 0.05 indicates that the more complex model fits significantly better)
How many predictor variables can we include in a regression model?
english data set from the languageR package: reaction time (RTlexdec) as a function of word familiarity (Familiarity) + age group (AgeSubject) + written word frequency (WrittenFrequency)
Add word category (WordCategory) to the model
Compare the significance of WordCategory when it was in a simple model with its significance in this model
anova(modelo1, modelo2) to compare models (p < 0.05 indicates that the more complex model fits significantly better)
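A minimal sketch of such a comparison, assuming the languageR package is installed. The slides do not say which models `modelo1` and `modelo2` stand for; here they are taken, for illustration only, to be the Familiarity model and the Familiarity + AgeSubject model shown earlier. The name `comparacao` is also illustrative.

```r
library(languageR)  # provides the english data set

# Simpler model and a more complex model that contains it (nested models)
modelo1 <- lm(RTlexdec ~ Familiarity, data = english)
modelo2 <- lm(RTlexdec ~ Familiarity + AgeSubject, data = english)

# F-test for the improvement in fit from adding AgeSubject
comparacao <- anova(modelo1, modelo2)
comparacao

# p-value of the comparison (second row of the table)
comparacao[["Pr(>F)"]][2]
```

The same pattern applies to the exercise above: fit the model without WordCategory, fit it again with WordCategory added, and pass both to `anova()`.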