DateA Set

First, a note to the reader. This is a two part series. This part we will look at some data reported to the FTC and in the second part, I have a surprise of sorts that will be ruined, if I give too much information. Reddington is back just because he’s ridiculously cute. He has nothing to do with scamming, but if you look into his adorable little eyes, it makes you thankful he is spared from this harm humans inflict on other humans.

The FTC files complaints from individuals who claim to have been scammed by someone posing as a government representative. What do we mean by scammed? Scamming can come in many different forms, but let’s talk about the general concept first and then provide some examples. According to the Merriam Webster online dictionary, a scam is “to obtain (something, such as money) by a scam,” and I believe defining the word with the word is never a good practice, so I am going to combine both definitions provided by Merriam Webster and here is the finalized definition, “to obtain (something, such as money) by” an act of deception. This definition is very broad so let’s look at first the list provided by the FTC of different cases they have received.

Most recently, the FTC stated that these government imposters were posing as companies providing Coronavirus treatment.
People claimed being contacted by someone claiming to be a member of “the government or a well-known business, a romantic interest, or a family member with an emergency” were received by consumers in 2019 and reported to the FTC. These consumers reported a total of “$667 million” in losses.
Within the scams that occurred that were labeled government imposters for 2019, the top imposter scam reported were Social Security imposters. There were “166,190 reports about Social Security scam, and the median individual loss was $1,500” (consumer.FTC.gov).

So how do these individuals scam people out of their money? “Gift cards” are reported to be the payment of choice (consumer.FTC.gov). Now that you have some context, let’s look at a small sample of this massive collection of data, and let’s focus on the Ohio fraud and theft reports from 2018. This analysis was completed in R and is available on my github repository (github.com/sterlingn/scammerdata).

A note to the reader on formatting: If you read this on a phone or tablet the scatter plots and bar charts should show with little issue, however, if you are viewing on a desktop, the images will appear ginormous. To get a better view of the scatter plots and the bar charts, simply click on the image and you can change the scale yourself.

library(tidyverse)
library(ggplot2)
library(sqldf)

First we will look at the Ohio fraud data. Here is a briefing on the set:

Metropolitan Areas are defined by the Office of Management and Budget, and population estimates are based on 2018 U.S. Census figures. Metropolitan Areas are ranked based on the number of reports per 100,000 population. Reports exclude state-specific data contributor reports.

I did most of my editting in excel to the database.

Ohiofraud <- read.csv("Desktop:/Ohio_normalized.csv")

knitr::kable(Ohiofraud, caption = "Ohio Fraud")

Ohio Fraud
metro_area_f	state	reports
Cleveland-Elyria	OH Metropolitan Statistical Area	714
Columbus	OH Metropolitan Statistical Area	705
Dayton	OH Metropolitan Statistical Area	621
Weirton-Steubenville	WV-OH Metropolitan Statistical Area	614
Akron	OH Metropolitan Statistical Area	596
Canton-Massillon	OH Metropolitan Statistical Area	588
Youngstown-Warren-Boardman	OH-PA Metropolitan Statistical Area	587
Cincinnati	OH-KY-IN Metropolitan Statistical Area	586
Springfield	OH Metropolitan Statistical Area	561
Toledo	OH Metropolitan Statistical Area	545
Huntington-Ashland	WV-KY-OH Metropolitan Statistical Area	489
Mansfield	OH Metropolitan Statistical Area	489
Salem	OH Micropolitan Statistical Area	484
Lima	OH Metropolitan Statistical Area	467
Wheeling	WV-OH Metropolitan Statistical Area	459
Wooster	OH Micropolitan Statistical Area	454

Which cities have the highest number of incidents?

knitr::kable((Ohiofraud[1:3,] %>%
  arrange(desc(reports))), caption ="Cities with the Highest Number of Incidents")

Cities with the Highest Number of Incidents
metro_area_f	state	reports
Cleveland-Elyria	OH Metropolitan Statistical Area	714
Columbus	OH Metropolitan Statistical Area	705
Dayton	OH Metropolitan Statistical Area	621

Cleveland with 714, Columbus with 705, and Dayton with 621

Where’s the nasty Nati?

Ohiofraud %>%filter(str_detect(metro_area,"Cincinn")

mean(Ohiofraud$reports)# 559.9375

The nasty Nati is just above the average at 586 reports. Let see if Cincinnati falls within the upper interquartile range.

summary(Ohiofraud$reports)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   454.0   487.8   573.5   559.9   600.5   714.0

Cincinnati is just under the 3rd Quartile which is 600. Was the average the best summary statistic to use? Let’s see a visual of the data.

The scatterplot shows several points that are scattered about in what appear to be clusters.

d <- Ohiofraud$reports

View(d)

qplot(Ohiofraud$reports,Ohiofraud$metro_area)

Now let’s also look at add a second column which will contain the data on id theft. I will need to revert to SQL for this query.

Ohioidtheft <- read.csv("Desktop:/Ohio_idtheft_BCNF.csv")

View(Ohioidtheft)


Ohioidtheft %
  rename(metro_area_t=Metropolitan.Area)%>%
  rename(reportsoftheft= X) %>%
  rename(state=Reports.per.100K.Population)

sqldf("SELECT *
      FROM Ohioidtheft")

##                  metro_area_t                                   state
## 1            Cleveland-Elyria        OH Metropolitan Statistical Area
## 2                    Columbus        OH Metropolitan Statistical Area
## 3                      Dayton        OH Metropolitan Statistical Area
## 4                       Akron        OH Metropolitan Statistical Area
## 5                      Toledo        OH Metropolitan Statistical Area
## 6  Youngstown-Warren-Boardman     OH-PA Metropolitan Statistical Area
## 7                  Cincinnati  OH-KY-IN Metropolitan Statistical Area
## 8                     Wooster        OH Micropolitan Statistical Area
## 9            Canton-Massillon        OH Metropolitan Statistical Area
## 10                  Mansfield        OH Metropolitan Statistical Area
## 11                       Lima        OH Metropolitan Statistical Area
## 12                      Salem        OH Micropolitan Statistical Area
## 13                Springfield        OH Metropolitan Statistical Area
## 14       Weirton-Steubenville     WV-OH Metropolitan Statistical Area
## 15         Huntington-Ashland  WV-KY-OH Metropolitan Statistical Area
## 16                   Wheeling     WV-OH Metropolitan Statistical Area
##    reportsoftheft
## 1             226
## 2             121
## 3             116
## 4             115
## 5             104
## 6              96
## 7              95
## 8              88
## 9              77
## 10             73
## 11             72
## 12             68
## 13             64
## 14             59
## 15             51
## 16             43

Do we have a primary key? Is it unique, irreducible, and not null… There should be 16 rows in this next query.

sqldf("SELECT DISTINCT metro_area_f
      FROM Ohiofraud")

So the easiest way I know of as of now, is to sort based on my primary key and add the data that way. If I had more data, I would need to first create a database, but for our purposes, this is fine.

sqldf("SELECT *
      FROM Ohiofraud f
      LEFT JOIN Ohioidtheft t
      ON metro_area.f = metro_area.t")#Error: No such column: metro_area.f?

Since the SQL join statment isn’t working. I will have to resort to the join function within…tidyverse. I am not going to show all of the queries I ran, but I used the sample queiries for the tidyverse package to figure out how to apply the same method to my dataset.

Using left join?

View(Ohiofraud)
View(Ohioidtheft)

Ohio_reports %
left_join( Ohioidtheft, by=c("metro_area_f"="metro_area_t"))

View(Ohio_reports)

I need to remove a column and rename a column. For Simplicities sake we will call our collection of data ‘r’

r %
  rename(metro_area= Ohio_reports.metro_area_f) %>%
  rename(state=Ohio_reports.state.x) %>%
  rename(fraud_reports=Ohio_reports.reports) %>%
  rename(theft_reports=Ohio_reports.reportsoftheft)
)

r %
  rename(metro_area= Ohio_reports.metro_area_f) %>%
  rename(state=Ohio_reports.state.x) %>%
  rename(fraud_reports=Ohio_reports.reports) %>%
  rename(theft_reports=Ohio_reports.reportsoftheft)

View(r)

Now we finally have our dataset to work with. Truly 80% of the work.

rp <- ggplot(r, aes(r$metro_area, r$fraud_reports))

rp+ geom_point(color = "blue")+coord_flip()

rp2 <- ggplot(r, aes(r$metro_area, r$theft_reports))
 
rp2 + geom_point(color ="red")+coord_flip()

What is the correlation between the two variables?

cor(r$theft_reports, r$fraud_reports)

## [1] 0.7238463

# 0.7238

rp <- ggplot(r, aes(r$theft_reports, r$fraud_reports, colour = r$metro_area))

rp+geom_point(mapping = aes(color=r$metro_area))+
  labs(title = "The Imperfect High Fraud High Theft Relationship",
         subtitle = "The relationship between fraud reports and theft reports.")

As you can see there is somewhat of an upward trend. If the fraud reports increase, so do the theft reports. Although a definite outlier does seem to be Cleveland-Elyria.

theft<- ggplot(r, aes(r$metro_area,r$theft_reports, colour = r$metro_area))

theft+coord_flip()+geom_col()

fraud<- ggplot(r, aes(r$metro_area,r$fraud_reports, colour = r$metro_area))

fraud+coord_flip()+geom_col()

Clevland-ELyria also has the highest number of reports here too with 714 fraud repots.

Take a look at more data at FTC.gov/data. Check out their incredible Tableau vizzes here: https://www.ftc.gov/enforcement/data-visualizations/explore-data.

https://rpubs.com/natester

Thank you for reading and stay tuned for part 2!!!

References:

Side note on the title: The title is inspired by the various dad moments I’ve had in the last few weeks where I literally caught myself saying those exact words to my daughter who is going to be 10 months soon; shortly after saying them, I laughed, and my daughter kept innocently staring at me.

The dataset I used I originally found through Data School1. which lead me to the Advanced High School Statistics book on OpenIntro’s2. website. The data set source derived from their site but the dataset originated from “Season of birth and onset of locomotion” by J. B. Benson3.

Here is a briefing on the findings of the study…

What did I find?

The Dataset…

This snapshot was taken from R. This dataset is the same as the csv file except the columns Ctemp (Average Temperature in Celsius) and avg_crawling_age_months (Average Crawling Age in Months). You can see the code for adding these columns along with all of my other code here: https://github.com/sterlingn/babycrawl_data/tree/292ddeef79dc76c3d1047dbe37ff3c57bea78f33.

According to the summary of the dataset, found on https://www.sciencedirect.com/science/article/abs/pii/0163638393800298, there were 425 infants in the study, however, our dataset shows that there are only 414 infants. Looking at the correlation between temperature and babies crawling, our r=-.70, see the graphic below:

And so, there is a negative correlation, but this doesn’t imply causation. Given that we only have one variable to work with, it is harder to know exactly what causes babies to start crawling later than other babies, so how good of a predictor is the temperature for the crawling age? Well, let’s take a look in R.

After running the summary statistic on the simple linear model that was created, the results show that the temperature, as a coefficient, has a P value significantly less than the 0.05 threshold, which results in the rejection of the null hypothesis that there is no correlation between the two coefficients, temperature and average crawling age, however the adjusted R-squared value being at .4386 out of 1 shows that this model is not the best. There appears to be other variables that are influencing the average baby crawling age, however, temperature does seem to be an influencer. Another downfall to using the dataset I have is that the temperature and the crawling data are both averages taken from the actual data which can seriously cause issues when trying to analyze the data, because the data is already a summary of the actual data.

References:

https://www.dataschool.io/resources/
https://www.openintro.org/stat/textbook.php
J.B. Benson. Season of birth and onset of locomotion: Theoretical and methodological implications. In: Infant behavior and development 16.1 (1993), pp. 69-81. issn: 0163-6383.
https://www.sciencedirect.com/science/article/abs/pii/0163638393800298

DateA Set

Part 1: Got $cammed?

Got Scammed?

“Stop Acting Like a Baby”

Viz-ualizing the CFD Data