Learning a Language? Export your Data from Duolingo.

Why learn a language? This is one avenue of life that I am constantly working on. If I had the ability to speak fluently in another language, life would be easier (emphasis on the would be). Whatever your reason for studying a language, you can keep track of your progress over time with Duolingo.

I want to show you how to export your own data from Duolingo, and hopefully inspire you to do better than I have in your language learning practice. You will see what I mean… 😉

How to download your data from Duolingo?

First, log in to your account, then go to “Settings.”

Next, find “Export my data” and click it.

A message popped up saying it could take up to 30 days for them to send my data, but in reality I got an email within an hour saying the data was ready to download.

We did it!!! Now, let’s analyze the results! They should come in a csv file, so Excel is one option for the analysis. Tableau, R, Python, and Power BI, to name a few, are other viable options. I am going to use R. Here is a complete version of what I did: http://rpubs.com/natester/duolingoanalysis
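If you have not loaded a csv into R before, here is a minimal sketch of that step. It assumes the export has columns like the ones in the head(leaderboard) output shown later in this post, and that the year column is derived from the date; both are assumptions, since the exact layout of the export file is not shown here. The three rows are copied from that output so the example runs without a file.

```r
# A few rows copied from the head(leaderboard) output later in this post.
# Assumption: the real export has the same column layout.
csv_text <- "leaderboard,date,timestamp,tier,score
leagues,2019-05-18,20:00:35,0,20
leagues,2019-05-20,11:14:16,0,50
leagues,2019-05-27,12:41:58,0,60"

# With a real export, pass the path to the csv file instead of text = csv_text.
leaderboard_sample <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Parse the date column so it sorts and plots as a date, not as text.
leaderboard_sample$date <- as.Date(leaderboard_sample$date)

# Derive the year from the date (assumed to be how the year column was built).
leaderboard_sample$year <- as.integer(format(leaderboard_sample$date, "%Y"))

str(leaderboard_sample)
```

Converting the date column up front matters for the charts: ggplot2 treats a Date column as a continuous time axis, while a text column would be treated as categories.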

Here is the summarized version of what I did. Looking at the totals for 2019 and 2020, you will see that I did better in 2019.

library(ggplot2)

# Base plot: leaderboard_groupedby_year is expected to hold one row per year (2019, 2020) with its TOTAL_SCORE
leaderboard_barchart <- ggplot(data=leaderboard_groupedby_year, aes(x=year, y=TOTAL_SCORE))

leaderboard_barchart + 
  geom_col(color= c('orangered','blue'),fill=c('orangered','blue'))+
  # theme_void() is a complete theme and replaces any theme() settings added before it, so it is used on its own
  theme_void()+
  labs(title= "Total Score for 2019 (Orange) and 2020 (Blue)")+
  # Print each total just above its bar, in the same color as the bar
  geom_text(aes(label = TOTAL_SCORE), 
            position = position_dodge(0.9),
            vjust = -0.5,
            size =5,
            color=c('orangered','blue'))
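The chart above relies on a table of yearly totals. Here is a minimal sketch of one way to build such a table in base R. It uses only the handful of rows that appear in this post (the six from head(leaderboard) plus the 2020-01-27 high score), so these totals are not the full totals from the chart, and the name yearly_totals is just for illustration.

```r
# Only the rows visible in this post, not the full export.
leaderboard_rows <- data.frame(
  date  = as.Date(c("2019-05-18", "2019-05-20", "2019-05-27",
                    "2019-06-03", "2019-06-10", "2019-06-17",
                    "2020-01-27")),
  score = c(20, 50, 60, 40, 50, 40, 109)
)

# Derive the year from the date.
leaderboard_rows$year <- format(leaderboard_rows$date, "%Y")

# Sum the scores within each year.
yearly_totals <- aggregate(score ~ year, data = leaderboard_rows, FUN = sum)
names(yearly_totals)[names(yearly_totals) == "score"] <- "TOTAL_SCORE"

print(yearly_totals)
```

A data frame shaped like yearly_totals, with one row per year and a TOTAL_SCORE column, is the shape that the bar chart code expects.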

Visualizing the Data by Month

head(leaderboard)
##   leaderboard       date timestamp tier score year
## 1     leagues 2019-05-18  20:00:35    0    20 2019
## 2     leagues 2019-05-20  11:14:16    0    50 2019
## 3     leagues 2019-05-27  12:41:58    0    60 2019
## 4     leagues 2019-06-03  11:58:17    0    40 2019
## 5     leagues 2019-06-10  22:35:45    0    50 2019
## 6     leagues 2019-06-17  11:36:58    0    40 2019
# Base plot: one point per leaderboard entry, with the date on the x-axis
leaderboard_linechart <- ggplot(data=leaderboard, aes(x=date, y=score))

leaderboard_linechart + 
  geom_line(color= 'cornflowerblue',size =1)+
  # Strip the grid lines and the gray background for a cleaner look
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank())+
  labs(title= "Score Over Time for 2019 through 2020")

It looks like my score peaked at the beginning of 2020. What was that score?

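library(sqldf)

# sqldf runs on SQLite by default; there, a bare column next to MAX() is taken from the row that holds the maximum
sqldf("SELECT date, MAX(score) AS HIGHEST_SCORE
       FROM leaderboard 
       ")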
##         date HIGHEST_SCORE
## 1 2020-01-27           109

My highest score was 109, on 2020-01-27; however, as we discovered, my total score was greater in 2019.
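If you would rather not use SQL, the same lookup can be sketched in base R with which.max(). As before, the data frame below holds only the rows that appear in this post, not the full export.

```r
# Only the rows visible in this post, not the full export.
leaderboard_rows <- data.frame(
  date  = as.Date(c("2019-05-18", "2019-05-20", "2019-05-27",
                    "2019-06-03", "2019-06-10", "2019-06-17",
                    "2020-01-27")),
  score = c(20, 50, 60, 40, 50, 40, 109)
)

# which.max() returns the position of the first maximum,
# so this keeps the whole row that holds the highest score.
top_row <- leaderboard_rows[which.max(leaderboard_rows$score), ]

print(top_row)
```

Note that which.max() returns only the first match when several rows tie for the highest score.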

The takeaway from my charts is that, over time, consistency matters more than a single learning sprint, especially when it comes to learning languages.

“Stop Acting Like a Baby”

Side note on the title: The title is inspired by the various dad moments I’ve had in the last few weeks where I literally caught myself saying those exact words to my daughter who is going to be 10 months soon; shortly after saying them, I laughed, and my daughter kept innocently staring at me.

I originally found the dataset through Data School [1], which led me to the Advanced High School Statistics book on OpenIntro’s website [2]. I got the dataset from their site, but it originated from “Season of birth and onset of locomotion” by J. B. Benson [3].

Here is a briefing on the findings of the study…

What did I find?

The Dataset…

This snapshot was taken from R. This dataset is the same as the csv file except for two added columns: Ctemp (Average Temperature in Celsius) and avg_crawling_age_months (Average Crawling Age in Months). You can see the code for adding these columns, along with all of my other code, here: https://github.com/sterlingn/babycrawl_data/tree/292ddeef79dc76c3d1047dbe37ff3c57bea78f33.
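The actual code is in the repository linked above. For readers who just want the idea, here is a minimal sketch of how two such columns could be added. It assumes the original file stores temperature in Fahrenheit and crawling age in weeks, in columns named temperature and avg_crawling_age, and it assumes an average month of 365.25 / 12 days. Those names, units, and the conversion factor are assumptions for illustration, and the two toy rows are made-up placeholders, not values from the study.

```r
# Hypothetical helpers; the real code is in the linked repository.
fahrenheit_to_celsius <- function(temp_f) (temp_f - 32) * 5 / 9

# Assumption: an average month of 365.25 / 12 days (about 4.35 weeks).
weeks_to_months <- function(weeks) weeks * 7 / (365.25 / 12)

# Adds the two derived columns, assuming the input has columns
# named temperature (Fahrenheit) and avg_crawling_age (weeks).
add_converted_columns <- function(crawl_data) {
  crawl_data$Ctemp <- fahrenheit_to_celsius(crawl_data$temperature)
  crawl_data$avg_crawling_age_months <- weeks_to_months(crawl_data$avg_crawling_age)
  crawl_data
}

# Made-up placeholder values, NOT rows from the study.
toy <- data.frame(temperature = c(32, 50), avg_crawling_age = c(30, 32))
converted <- add_converted_columns(toy)

print(converted)
```

Because both conversions are linear rescalings with a positive factor, they do not change the correlation between temperature and crawling age.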

According to the summary of the study, found on https://www.sciencedirect.com/science/article/abs/pii/0163638393800298, there were 425 infants in the study; however, our dataset contains only 414 infants. Looking at the correlation between temperature and average crawling age, we get r = -.70; see the graphic below:

So there is a negative correlation, but this doesn’t imply causation. Given that we only have one variable to work with, it is hard to know exactly what causes some babies to start crawling later than others. So how good a predictor is temperature for crawling age? Well, let’s take a look in R.

After running the summary statistics on the simple linear model, the results show that the temperature coefficient has a p-value well below the 0.05 threshold, so we reject the null hypothesis that there is no linear relationship between temperature and average crawling age. However, the adjusted R-squared of .4386 (out of 1) shows that this model is not the best: temperature explains less than half of the variation in average crawling age. Other variables appear to be influencing the average crawling age, although temperature does seem to be one influence. Another downside of this dataset is that both the temperature and the crawling ages are averages of the underlying data, which can seriously distort the analysis because the data is already a summary of the actual observations.
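The r of -.70 and the adjusted R-squared of .4386 are consistent with each other. In a simple linear regression, R-squared is just r squared, and adjusted R-squared is 1 - (1 - R-squared) * (n - 1) / (n - 2). The sketch below checks that arithmetic, assuming the model was fit on 12 rows (one per birth month); that row count is an assumption rather than a number stated in this post.

```r
r <- -0.70   # correlation reported above (rounded to two decimals)
n <- 12      # assumption: one row per birth month

# In a simple linear regression, R-squared equals the squared correlation.
r_squared <- r^2

# Adjusted R-squared with one predictor: penalize for the estimated slope.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)

print(round(r_squared, 3))      # 0.49
print(round(adj_r_squared, 3))  # 0.439
```

The small gap between 0.439 and the reported .4386 comes from using the rounded value of r.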

References:

  1. https://www.dataschool.io/resources/
  2. https://www.openintro.org/stat/textbook.php
  3. J.B. Benson. Season of birth and onset of locomotion: Theoretical and methodological implications. In: Infant behavior and development 16.1 (1993), pp. 69-81. issn: 0163-6383.
  4. https://www.sciencedirect.com/science/article/abs/pii/0163638393800298