Getting and Cleaning Data Quiz

This post covers the quiz answers and the assignment for the Getting and Cleaning Data course.

 


Offered by Johns Hopkins University


Week 1 Quiz

1.
Question 1
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

How many properties are worth $1,000,000 or more?

1 point

  • 53
  • 164
  • 31
  • 47
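
The counting step can be sketched on a toy data frame rather than the Idaho file. This is only an illustration: it assumes the code book stores property value in a column named VAL and that code 24 is the top bracket, and both assumptions should be checked against the PDF.

```r
# Toy stand-in for the housing data; the real file has many more columns.
housing <- data.frame(VAL = c(24, 10, NA, 24, 3))

# Assumption: VAL code 24 marks the highest value bracket in the code book.
top_code <- 24

# Comparisons against NA give NA, so na.rm = TRUE is needed when summing.
n_top <- sum(housing$VAL == top_code, na.rm = TRUE)
print(n_top)
```

Summing a logical vector counts its TRUE entries, which avoids building an intermediate subset.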

2.
Question 2
Use the data you loaded from Question 1. Consider the variable FES in the code book. Which of the “tidy data” principles does this variable violate?

1 point

  • Each variable in a tidy data set has been transformed to be interpretable.
  • Tidy data has one variable per column.
  • Tidy data has no missing values.
  • Numeric values in tidy data can not represent categories.

3.
Question 3
Download the Excel spreadsheet on the Natural Gas Acquisition Program here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx

Read rows 18-23 and columns 7-15 into R and assign the result to a variable called dat. What is the value of:

sum(dat$Zip*dat$Ext,na.rm=T)

(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program)

1 point

  • NA
  • 154339
  • 33544718
  • 36534720
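
To show what the expression computes, here is a small sketch on invented values. Reading the row and column range itself needs an Excel reader package and is not shown; the numbers below are placeholders, not spreadsheet contents.

```r
# Toy stand-in for the spreadsheet block; values are invented for illustration.
dat <- data.frame(Zip = c(10, 20, NA), Ext = c(2, NA, 5))

# Element-wise products: 20, NA, NA. Any NA operand makes the product NA.
products <- dat$Zip * dat$Ext

# na.rm = TRUE drops the NA products before summing.
total <- sum(products, na.rm = TRUE)
print(total)
```

Writing TRUE instead of T is slightly safer, since T is an ordinary variable that can be reassigned.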

4.
Question 4
Read the XML data on Baltimore restaurants from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml

How many restaurants have zipcode 21231?

1 point

  • 17
  • 127
  • 100
  • 156

5.
Question 5
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

Using the fread() command, load the data into an R object called DT. The following are ways to calculate the average value of the variable pwgtp15 broken down by sex. Using the data.table package, which will deliver the fastest user time?

1 point

  • sapply(split(DT$pwgtp15,DT$SEX),mean)
  • tapply(DT$pwgtp15,DT$SEX,mean)
  • mean(DT$pwgtp15,by=DT$SEX)
  • rowMeans(DT)[DT$SEX==1]; rowMeans(DT)[DT$SEX==2]
  • DT[,mean(pwgtp15),by=SEX]
  • mean(DT[DT$SEX==1,]$pwgtp15); mean(DT[DT$SEX==2,]$pwgtp15)
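
One way to compare candidates is to wrap each in system.time(). The sketch below uses a random toy table in base R and times only one of the listed expressions; the data.table form would be timed the same way once the package is loaded. Timings vary by machine, so no figures are claimed here.

```r
set.seed(1)
# Toy stand-in for DT: random weights and a two-level SEX code.
DT <- data.frame(pwgtp15 = runif(1000),
                 SEX = sample(1:2, 1000, replace = TRUE))

# Two of the listed base R approaches give the same group means.
by_tapply <- tapply(DT$pwgtp15, DT$SEX, mean)
by_split  <- sapply(split(DT$pwgtp15, DT$SEX), mean)

# system.time() reports user, system, and elapsed time; the loop repeats the
# call so the timing is measurable on a small table.
timing <- system.time(for (i in 1:200) tapply(DT$pwgtp15, DT$SEX, mean))
print(timing["user.self"])
```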

 

 

 

Week 2 Quiz

1.
Question 1
Register an application with the GitHub API here: https://github.com/settings/applications. Access the API to get information on your instructor's repositories (hint: this is the URL you want: "https://api.github.com/users/jtleek/repos"). Use this data to find the time that the datasharing repo was created. What time was it created?

This tutorial may be useful (https://github.com/hadley/httr/blob/master/demo/oauth2-github.r). You may also need to run the code in the base R console rather than RStudio.

1 point

2013-11-07T13:25:07Z

2013-08-28T18:18:50Z

2012-06-20T18:39:06Z

2012-06-21T17:28:38Z

2.
Question 2
The sqldf package allows for execution of SQL commands on R data frames. We will use the sqldf package to practice the queries we might send with the dbSendQuery command in RMySQL.

Download the American Community Survey data and load it into an R object called acs:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

Which of the following commands will select only the data for the probability weights pwgtp1 with ages less than 50?

1 point

sqldf("select pwgtp1 from acs")

sqldf("select * from acs")

sqldf("select * from acs where AGEP < 50")

sqldf("select pwgtp1 from acs where AGEP < 50")
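
For comparison, a query that selects one column with a WHERE filter has a direct base R counterpart. The sketch below uses a toy acs data frame with invented values and does not need sqldf.

```r
# Toy stand-in for acs; the real file has many more columns and rows.
acs <- data.frame(pwgtp1 = c(87, 88, 94, 91), AGEP = c(43, 66, 15, 50))

# Rows: keep AGEP strictly below 50. Columns: keep only pwgtp1.
# drop = FALSE keeps the result a data frame, as a SQL query would.
young <- acs[acs$AGEP < 50, "pwgtp1", drop = FALSE]
print(young)
```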

3.
Question 3
Using the same data frame you created in the previous problem, what is the equivalent function to unique(acs$AGEP)?

1 point

sqldf("select unique * from acs")

sqldf("select distinct pwgtp1 from acs")

sqldf("select AGEP where unique from acs")

sqldf("select distinct AGEP from acs")
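
As a reminder of what unique() returns, here is a short base R sketch on invented ages. In SQL, the keyword that removes duplicate rows from a result is DISTINCT.

```r
# Toy ages with repeats; values are invented for illustration.
acs <- data.frame(AGEP = c(43, 66, 43, 15, 66))

# unique() keeps the first occurrence of each value, in order of appearance.
ages <- unique(acs$AGEP)
print(ages)  # 43 66 15
```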

4.
Question 4
How many characters are in the 10th, 20th, 30th and 100th lines of HTML from this page:

http://biostat.jhsph.edu/~jleek/contact.html

(Hint: the nchar() function in R may be helpful)

1 point

45 92 7 2

43 99 8 6

45 31 7 25

43 99 7 25

45 31 7 31

45 31 2 25

45 0 2 2
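
The line-counting technique can be tried on a few lines of toy HTML held in memory. For the real page, a url() connection would take the place of textConnection(); that substitution is assumed here, not tested.

```r
# Toy HTML; the content is invented for illustration.
html <- c("<html>",
          "<head><title>Demo</title></head>",
          "<body>",
          "",
          "</body>",
          "</html>")

con <- textConnection(html)
page_lines <- readLines(con)
close(con)

# Index with a vector of line numbers, then count characters per line.
counts <- nchar(page_lines[c(1, 2, 4)])
print(counts)  # 6 32 0
```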

5.
Question 5
Read this data set into R and report the sum of the numbers in the fourth of the nine columns.

https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for

Original source of the data: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for

(Hint: this is a fixed-width file format.)

1 point

222243.1

35824.9

101.83

36.5

32426.7

28893.3
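
A minimal sketch of reading a fixed-width layout with read.fwf() follows. The two records and the column widths are invented for illustration; the widths for the real file have to be worked out by inspecting it.

```r
# Two toy records: a 9-character week label, 2 spaces, a 4-character value,
# 2 spaces, and another 4-character value.
raw <- c("03JAN1990  23.4  -0.4",
         "10JAN1990  23.4  -0.8")

tmp <- tempfile(fileext = ".for")
writeLines(raw, tmp)

# Negative widths skip that many characters.
sst <- read.fwf(tmp, widths = c(9, -2, 4, -2, 4),
                col.names = c("week", "sst", "anom"))

total <- sum(sst$sst)
print(total)
unlink(tmp)
```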

 

 

Week 3 Quiz

1.
Question 1
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

Create a logical vector that identifies the households on greater than 10 acres who sold more than $10,000 worth of agriculture products. Assign that logical vector to the variable agricultureLogical. Apply the which() function like this to identify the rows of the data frame where the logical vector is TRUE.

which(agricultureLogical)

What are the first 3 values that result?

1 point

236, 238, 262

125, 238, 262

153, 236, 388

59, 460, 474
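
The logical-vector step can be sketched on toy data. The column names ACR and AGS and the codes 3 and 6 are assumptions about how the code book encodes lot size and agricultural sales; confirm them in the PDF before applying this to the real file.

```r
# Toy stand-in for the housing data; values are invented for illustration.
housing <- data.frame(ACR = c(3, 1, 3, NA, 3),
                      AGS = c(6, 6, 2, 6, 6))

# Assumption: ACR == 3 codes lots of ten or more acres and AGS == 6 codes
# sales of $10,000 or more.
agricultureLogical <- housing$ACR == 3 & housing$AGS == 6

# which() returns the indices of TRUE values and drops NA entries.
rows <- which(agricultureLogical)
print(rows)  # 1 5
```

Row 4 evaluates to NA rather than FALSE, which is why which() is preferable to indexing the data frame with the raw logical vector.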

2.
Question 2
Using the jpeg package, read the following picture of your instructor into R:

https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg

Use the parameter native=TRUE. What are the 30th and 80th quantiles of the resulting data? (On some Linux systems the answer for the 30th quantile may differ by 638.)

1 point

-10904118 -10575416

-15259150 -10575416

-15259150 -594524

-14191406 -10904118
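
The quantile step itself does not depend on the image. As a sketch, here it is on a toy integer vector standing in for the pixel values; reading the picture requires the jpeg package and is not shown.

```r
# Toy integers standing in for the values returned by the image reader.
pixels <- c(-50L, -40L, -30L, -20L, -10L, 0L, 10L, 20L, 30L, 40L, 50L)

# probs takes proportions in [0, 1], so the 30th and 80th percentiles
# are requested as 0.3 and 0.8.
q <- quantile(pixels, probs = c(0.3, 0.8))
print(q)
```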

3.
Question 3
Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank (so United States is last). What is the 13th country in the resulting data frame?

Original data sources:

http://data.worldbank.org/data-catalog/GDP-ranking-table

http://data.worldbank.org/data-catalog/ed-stats

1 point

190 matches, 13th country is Spain

190 matches, 13th country is St. Kitts and Nevis

234 matches, 13th country is St. Kitts and Nevis

189 matches, 13th country is St. Kitts and Nevis

234 matches, 13th country is Spain

189 matches, 13th country is Spain
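
Matching and sorting can be sketched with merge() and order() on two toy tables. The column names CountryCode, Rank, and Long.Name are illustrative assumptions, and the rows are invented; the real files also need header and footer rows cleaned before merging.

```r
gdp <- data.frame(CountryCode = c("USA", "CHN", "TUV", "XXX"),
                  Rank = c(1, 2, 190, 50))
edu <- data.frame(CountryCode = c("USA", "CHN", "TUV", "YYY"),
                  Long.Name = c("United States", "China", "Tuvalu", "Nowhere"))

# Inner join: only codes present in both tables survive.
merged <- merge(gdp, edu, by = "CountryCode")
n_matches <- nrow(merged)

# Descending rank puts the largest rank number first and rank 1 last.
sorted <- merged[order(merged$Rank, decreasing = TRUE), ]
print(n_matches)
print(as.character(sorted$Long.Name))
```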

4.
Question 4
What is the average GDP ranking for the “High income: OECD” and “High income: nonOECD” group?

1 point

32.96667, 91.91304

23.966667, 30.91304

30, 37

23, 30

133.72973, 32.96667

23, 45
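
A grouped average like this can be computed with tapply(). The sketch below uses an invented merged table; Rank and Income.Group are assumed column names.

```r
# Toy merged table; ranks are invented for illustration.
merged <- data.frame(
  Rank = c(1, 3, 10, 20, 100),
  Income.Group = c("High income: OECD", "High income: OECD",
                   "High income: nonOECD", "High income: nonOECD",
                   "Low income"))

# Mean of Rank within each level of Income.Group.
avg_rank <- tapply(merged$Rank, merged$Income.Group, mean)
print(avg_rank)
```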

5.
Question 5
Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.Group. How many countries are Lower middle income but among the 38 nations with highest GDP?

1 point

18

0

3

5
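
One way to form quantile groups in base R is cut() with break points from quantile(). This is a sketch on ten invented ranks, not the course's reference solution.

```r
ranks <- 1:10
income <- rep(c("Lower middle income", "High income"), times = 5)

# Break points at the 0%, 20%, ..., 100% quantiles give five groups;
# include.lowest = TRUE keeps the minimum rank in the first group.
breaks <- quantile(ranks, probs = seq(0, 1, by = 0.2))
groups <- cut(ranks, breaks = breaks, include.lowest = TRUE)

# Cross-tabulate quantile group against income group.
tab <- table(groups, income)
print(tab)
```

The row of the table for the lowest rank numbers corresponds to the highest-GDP group, since rank 1 is the largest economy.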

 

 

Week 4 Quiz

1.
Question 1
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

Apply strsplit() to split all the names of the data frame on the characters "wgtp". What is the value of the 123rd element of the resulting list?

1 point

"w" "15"

"" "15"

"15"

"wgtp" "15"
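
The behavior of strsplit() when the split pattern sits at the start of a string is easiest to see on a short vector. The three names below are chosen for illustration and are not the full column list.

```r
col_names <- c("SERIALNO", "wgtp15", "pwgtp1")

# strsplit() returns a list with one character vector per input string.
parts <- strsplit(col_names, "wgtp")

# A match at the very start of the string leaves an empty first piece.
print(parts[[2]])  # "" "15"
print(parts[[3]])  # "p" "1"
```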

2.
Question 2
Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Remove the commas from the GDP numbers in millions of dollars and average them. What is the average?

Original data sources:

http://data.worldbank.org/data-catalog/GDP-ranking-table

1 point

377652.4

381668.9

387854.4

293700.3
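
Removing thousands separators before converting to numbers can be sketched as follows. The three strings are invented; in the real file the ranked rows have to be selected first.

```r
# Toy values formatted with commas, as text.
gdp_text <- c("1,500", "2,500", "500")

# gsub() replaces every comma; as.numeric() then parses the cleaned text.
gdp_num <- as.numeric(gsub(",", "", gdp_text))

avg <- mean(gdp_num)
print(avg)  # 1500
```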

3.
Question 3
In the data set from Question 2 what is a regular expression that would allow you to count the number of countries whose name begins with “United”? Assume that the variable with the country names in it is named countryNames. How many countries begin with United?

1 point

grep("^United",countryNames), 3

grep("*United",countryNames), 5

grep("United$",countryNames), 3

grep("*United",countryNames), 2
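
The effect of the ^ anchor can be checked on a toy vector. The names below are chosen for illustration and are not taken from the GDP file.

```r
countryNames <- c("United States", "United Kingdom",
                  "Tanzania, United Republic of", "France")

# ^ anchors the match to the start of the string, so the third entry,
# which contains "United" in the middle, is not matched.
hits <- grep("^United", countryNames)
print(hits)          # 1 2
print(length(hits))  # 2
```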

4.
Question 4
Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. Of the countries for which the end of the fiscal year is available, how many end in June?

Original data sources:

http://data.worldbank.org/data-catalog/GDP-ranking-table

http://data.worldbank.org/data-catalog/ed-stats

1 point

16

13

7

15
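
Counting text matches in a notes column can be sketched with grepl(). The note strings below are invented, and the phrase "Fiscal year end: June" is an assumption about how the education file words this field; inspect the actual column before relying on it.

```r
# Toy notes; wording is assumed, not copied from the data set.
notes <- c("Fiscal year end: June 30; reporting period: FY.",
           "Fiscal year end: March 31.",
           "",
           "Data for June are provisional.")

# fixed = TRUE treats the pattern as a literal string, not a regex.
ends_in_june <- grepl("Fiscal year end: June", notes, fixed = TRUE)
n_june <- sum(ends_in_june)
print(n_june)  # 1
```

Matching the full phrase rather than just "June" avoids counting notes that mention June in another context, such as the last toy entry.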

5.
Question 5
You can use the quantmod (http://www.quantmod.com/) package to get historical stock prices for publicly traded companies on the NASDAQ and NYSE. Use the following code to download data on Amazon’s stock price and get the times the data was sampled.

library(quantmod)
amzn = getSymbols("AMZN",auto.assign=FALSE)
sampleTimes = index(amzn)

How many values were collected in 2012? How many values were collected on Mondays in 2012?

1 point

250, 51

251, 47

252, 50

250, 47
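
Filtering dates by year and weekday can be sketched in base R without downloading any prices. The sampleTimes vector below is a toy run of consecutive calendar days, not trading days.

```r
# Toy dates spanning the 2011/2012 boundary.
sampleTimes <- seq(as.Date("2011-12-26"), as.Date("2012-01-09"), by = "day")

# %Y extracts the four-digit year as text.
in_2012 <- format(sampleTimes, "%Y") == "2012"

# %u gives the weekday number as text, with "1" for Monday, so the check
# does not depend on the locale's day names.
is_monday <- format(sampleTimes, "%u") == "1"

n_2012 <- sum(in_2012)
n_mon_2012 <- sum(in_2012 & is_monday)
print(c(n_2012, n_mon_2012))  # 9 2
```

weekdays(sampleTimes) == "Monday" works as well in an English locale.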

 

 

Peer-graded Assignment: Getting and Cleaning Data Course Project

Assignment 1:

Download

 

 

Assignment 2:

PROMPT
Please submit a link to a GitHub repo with the code for performing your analysis. The code should have a file run_analysis.R in the main directory that can be run as long as the Samsung data is in your working directory. The output should be the tidy data set you submitted for part 1. You should include a README.md in the repo describing how the script works and the code book describing the variables.

 

https://github.com/oOHADIOo/Getting-and-Cleaning-Data-/blob/master/Getting%20and%20Cleaning%20Data%20Course%20Project