First, a note to the reader. This is a two part series. This part we will look at some data reported to the FTC and in the second part, I have a surprise of sorts that will be ruined, if I give too much information. Reddington is back just because he’s ridiculously cute. He has nothing to do with scamming, but if you look into his adorable little eyes, it makes you thankful he is spared from this harm humans inflict on other humans.
The FTC files complaints from individuals who claim to have been scammed by someone posing as a government representative. What do we mean by scammed? Scamming can come in many different forms, but let’s talk about the general concept first and then provide some examples. According to the Merriam Webster online dictionary, a scam is “to obtain (something, such as money) by a scam,” and I believe defining the word with the word is never a good practice, so I am going to combine both definitions provided by Merriam Webster and here is the finalized definition, “to obtain (something, such as money) by” an act of deception. This definition is very broad so let’s look at first the list provided by the FTC of different cases they have received.
- Most recently, the FTC stated that these government imposters were posing as companies providing Coronavirus treatment.
- People claimed being contacted by someone claiming to be a member of “the government or a well-known business, a romantic interest, or a family member with an emergency” were received by consumers in 2019 and reported to the FTC. These consumers reported a total of “$667 million” in losses.
- Within the scams that occurred that were labeled government imposters for 2019, the top imposter scam reported were Social Security imposters. There were “166,190 reports about Social Security scam, and the median individual loss was $1,500” (consumer.FTC.gov).
A note to the reader on formatting: If you read this on a phone or tablet the scatter plots and bar charts should show with little issue, however, if you are viewing on a desktop, the images will appear ginormous. To get a better view of the scatter plots and the bar charts, simply click on the image and you can change the scale yourself.
Got Scammed?
library(tidyverse)
library(ggplot2)
library(sqldf)
First we will look at the Ohio fraud data. Here is a briefing on the set:
Metropolitan Areas are defined by the Office of Management and Budget, and population estimates are based on 2018 U.S. Census figures. Metropolitan Areas are ranked based on the number of reports per 100,000 population. Reports exclude state-specific data contributor reports.
I did most of my editting in excel to the database.
Ohiofraud <- read.csv("Desktop:/Ohio_normalized.csv")
knitr::kable(Ohiofraud, caption = "Ohio Fraud")
| metro_area_f | state | reports |
|---|---|---|
| Cleveland-Elyria | OH Metropolitan Statistical Area | 714 |
| Columbus | OH Metropolitan Statistical Area | 705 |
| Dayton | OH Metropolitan Statistical Area | 621 |
| Weirton-Steubenville | WV-OH Metropolitan Statistical Area | 614 |
| Akron | OH Metropolitan Statistical Area | 596 |
| Canton-Massillon | OH Metropolitan Statistical Area | 588 |
| Youngstown-Warren-Boardman | OH-PA Metropolitan Statistical Area | 587 |
| Cincinnati | OH-KY-IN Metropolitan Statistical Area | 586 |
| Springfield | OH Metropolitan Statistical Area | 561 |
| Toledo | OH Metropolitan Statistical Area | 545 |
| Huntington-Ashland | WV-KY-OH Metropolitan Statistical Area | 489 |
| Mansfield | OH Metropolitan Statistical Area | 489 |
| Salem | OH Micropolitan Statistical Area | 484 |
| Lima | OH Metropolitan Statistical Area | 467 |
| Wheeling | WV-OH Metropolitan Statistical Area | 459 |
| Wooster | OH Micropolitan Statistical Area | 454 |
Which cities have the highest number of incidents?
knitr::kable((Ohiofraud[1:3,] %>%
arrange(desc(reports))), caption ="Cities with the Highest Number of Incidents")
| metro_area_f | state | reports |
|---|---|---|
| Cleveland-Elyria | OH Metropolitan Statistical Area | 714 |
| Columbus | OH Metropolitan Statistical Area | 705 |
| Dayton | OH Metropolitan Statistical Area | 621 |
Cleveland with 714, Columbus with 705, and Dayton with 621
Where’s the nasty Nati?
Ohiofraud %>%filter(str_detect(metro_area,"Cincinn")
mean(Ohiofraud$reports)# 559.9375
The nasty Nati is just above the average at 586 reports. Let see if Cincinnati falls within the upper interquartile range.
summary(Ohiofraud$reports)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 454.0 487.8 573.5 559.9 600.5 714.0
Cincinnati is just under the 3rd Quartile which is 600. Was the average the best summary statistic to use? Let’s see a visual of the data.
The scatterplot shows several points that are scattered about in what appear to be clusters.
d <- Ohiofraud$reports
View(d)
qplot(Ohiofraud$reports,Ohiofraud$metro_area)Now let’s also look at add a second column which will contain the data on id theft. I will need to revert to SQL for this query.
Ohioidtheft <- read.csv("Desktop:/Ohio_idtheft_BCNF.csv")
View(Ohioidtheft)
Ohioidtheft %
rename(metro_area_t=Metropolitan.Area)%>%
rename(reportsoftheft= X) %>%
rename(state=Reports.per.100K.Population)
sqldf("SELECT *
FROM Ohioidtheft")
## metro_area_t state
## 1 Cleveland-Elyria OH Metropolitan Statistical Area
## 2 Columbus OH Metropolitan Statistical Area
## 3 Dayton OH Metropolitan Statistical Area
## 4 Akron OH Metropolitan Statistical Area
## 5 Toledo OH Metropolitan Statistical Area
## 6 Youngstown-Warren-Boardman OH-PA Metropolitan Statistical Area
## 7 Cincinnati OH-KY-IN Metropolitan Statistical Area
## 8 Wooster OH Micropolitan Statistical Area
## 9 Canton-Massillon OH Metropolitan Statistical Area
## 10 Mansfield OH Metropolitan Statistical Area
## 11 Lima OH Metropolitan Statistical Area
## 12 Salem OH Micropolitan Statistical Area
## 13 Springfield OH Metropolitan Statistical Area
## 14 Weirton-Steubenville WV-OH Metropolitan Statistical Area
## 15 Huntington-Ashland WV-KY-OH Metropolitan Statistical Area
## 16 Wheeling WV-OH Metropolitan Statistical Area
## reportsoftheft
## 1 226
## 2 121
## 3 116
## 4 115
## 5 104
## 6 96
## 7 95
## 8 88
## 9 77
## 10 73
## 11 72
## 12 68
## 13 64
## 14 59
## 15 51
## 16 43
Do we have a primary key? Is it unique, irreducible, and not null… There should be 16 rows in this next query.
sqldf("SELECT DISTINCT metro_area_f
FROM Ohiofraud")
So the easiest way I know of as of now, is to sort based on my primary key and add the data that way. If I had more data, I would need to first create a database, but for our purposes, this is fine.
sqldf("SELECT *
FROM Ohiofraud f
LEFT JOIN Ohioidtheft t
ON metro_area.f = metro_area.t")#Error: No such column: metro_area.f?
Since the SQL join statment isn’t working. I will have to resort to the join function within…tidyverse. I am not going to show all of the queries I ran, but I used the sample queiries for the tidyverse package to figure out how to apply the same method to my dataset.
Using left join?
View(Ohiofraud)
View(Ohioidtheft)
Ohio_reports %
left_join( Ohioidtheft, by=c("metro_area_f"="metro_area_t"))
View(Ohio_reports)
I need to remove a column and rename a column. For Simplicities sake we will call our collection of data ‘r’
r %
rename(metro_area= Ohio_reports.metro_area_f) %>%
rename(state=Ohio_reports.state.x) %>%
rename(fraud_reports=Ohio_reports.reports) %>%
rename(theft_reports=Ohio_reports.reportsoftheft)
)
r %
rename(metro_area= Ohio_reports.metro_area_f) %>%
rename(state=Ohio_reports.state.x) %>%
rename(fraud_reports=Ohio_reports.reports) %>%
rename(theft_reports=Ohio_reports.reportsoftheft)
View(r)
Now we finally have our dataset to work with. Truly 80% of the work.
rp <- ggplot(r, aes(r$metro_area, r$fraud_reports))
rp+ geom_point(color = "blue")+coord_flip()
rp2 <- ggplot(r, aes(r$metro_area, r$theft_reports))
rp2 + geom_point(color ="red")+coord_flip()
What is the correlation between the two variables?
cor(r$theft_reports, r$fraud_reports)
## [1] 0.7238463
# 0.7238
rp <- ggplot(r, aes(r$theft_reports, r$fraud_reports, colour = r$metro_area))
rp+geom_point(mapping = aes(color=r$metro_area))+
labs(title = "The Imperfect High Fraud High Theft Relationship",
subtitle = "The relationship between fraud reports and theft reports.")
As you can see there is somewhat of an upward trend. If the fraud reports increase, so do the theft reports. Although a definite outlier does seem to be Cleveland-Elyria.
theft<- ggplot(r, aes(r$metro_area,r$theft_reports, colour = r$metro_area))
theft+coord_flip()+geom_col()
fraud<- ggplot(r, aes(r$metro_area,r$fraud_reports, colour = r$metro_area))
fraud+coord_flip()+geom_col()
Clevland-ELyria also has the highest number of reports here too with 714 fraud repots.
Take a look at more data at FTC.gov/data. Check out their incredible Tableau vizzes here: https://www.ftc.gov/enforcement/data-visualizations/explore-data.
Thank you for reading and stay tuned for part 2!!!
References:
- Merriam Webster Online Dictionary https://www.merriam-webster.com/dictionary/scam
- https://www.consumer.ftc.gov/blog/2020/03/ftc-fda-warnings-sent-sellers-scam-coronavirus-treatments
- https://www.consumer.ftc.gov/blog/2020/01/top-frauds-2019
- https://www.consumer.ftc.gov/features/feature-0037-imposter-scams
- Tableau Interactive Viz: https://public.tableau.com/profile/federal.trade.commission#!/vizhome/GovernmentImposter/Infographic
- https://www.ftc.gov/enforcement/data-visualizations/explore-data









