6.S083 / 18.S190 Problem set 1: Data analysis
Submission deadline: 11:59pm on Tuesday, April 7
(Submission instructions to follow later.)
In this problem set we will explore the COVID-19 data set from https://github.com/CSSEGISandData/COVID-19 in more detail.
Please use the notebook from lecture 1 as the basis for loading the data.
You should make a Jupyter notebook with your solutions. We will combine the best ones to make the model solutions.
Exercise 1: Visualizing the data over time
In this exercise we will practice data cleaning and plotting. Make a new plot for each question.
- Extract the data and country names from the CSV file as we did in class. Call the country names `all_countries`.
- Make a `Vector` called `countries` with a subset of countries that we wish to plot, say China, Japan, South Korea, US, United Kingdom, France, Germany. Be careful to check how they are written in the data set.
- Define a variable `num_days` by extracting the number of days of data from the dataframe.
- We need to accumulate the data for those places that are split up into territories. Make a zero vector of the correct length for each country, e.g. using the function `zeros`. Loop through all the countries and add the corresponding data to that country’s data.
  You may use a dictionary (`Dict`), or a matrix, or a `Vector` containing `Vector`s, or a new `DataFrame` to store the data.
- Plot all countries’ data on the same graph by using a `for` loop. To do so, first make an empty plot: `p = plot()`. Then run the loop, adding in the data using `plot!`. Finally, display the plot by evaluating the plot object `p`. (Just type its name and evaluate.)
- Now use a `log` scale on the vertical axis. In order to do so you will need to convert the vectors to contain `Float64` and replace any 0 values by `NaN` (“not a number”) so that `Plots.jl` ignores those values. Is there exponential growth?
- Turn this into an interactive visualization by adding a slider corresponding to the current day, varying between 1 and the total number of days for which you have data. You should draw only the data up to that particular day. As you move the slider the plot should update. Fix the horizontal axis using the keyword argument `xlim=(0, num_days)` inside the first `plot` command.
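As a rough illustration of the accumulation and log-scale preparation steps above, here is a minimal sketch on made-up data. It is not the lecture notebook's code: the names `row_countries`, `row_data`, `country_totals`, `prepare_for_log` and `log_ready` are hypothetical, and the toy matrix stands in for whatever you extracted from the CSV file.

```julia
# Toy stand-in for the CSV data: one row per (country, territory),
# one column per day. All names and numbers here are invented.
row_countries = ["China", "China", "France", "Japan"]
row_data = [1 2 4 8;
            0 1 1 3;
            0 0 2 5;
            0 1 1 1]

countries = ["China", "France"]   # subset we want to plot
num_days = size(row_data, 2)

# One zero vector per selected country, stored in a Dict.
country_totals = Dict(country => zeros(Int, num_days) for country in countries)

# Add each row's data to the running total of the country it belongs to.
for (i, country) in enumerate(row_countries)
    if haskey(country_totals, country)
        totals = country_totals[country]
        totals .+= row_data[i, :]   # in-place update of the stored vector
    end
end

# Convert to Float64 and replace zeros by NaN, ready for a log scale.
function prepare_for_log(v)
    w = Float64.(v)
    w[w .== 0] .= NaN
    return w
end

log_ready = Dict(country => prepare_for_log(v) for (country, v) in country_totals)
```

Each vector in `log_ready` could then be passed to `plot!` inside the loop, for example with the `Plots.jl` keyword `yscale=:log10`. A `Dict` is only one of the storage options listed above; a matrix or a `Vector` of `Vector`s would work along the same lines.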
Exercise 2: Visualizing changes
Now let’s try to reproduce the essence of the nice visualization from https://aatishb.com/covidtrends, which takes a less usual point of view. Again the slider will represent a day during the epidemic.
The horizontal axis will show the total confirmed cases until the given day, while the vertical axis will show the change in confirmed cases during the past week (7 days).
- Make a data set `total_cases_to_date`, representing for each day the total number of confirmed cases during the whole epidemic up until that point.
- Make a set of data `new_cases` which is the total number of cases only during the past 7 days. You may use the `sum` function.
- Make the visualization using a slider representing days as in the previous exercise, plotting the total number of cases on the horizontal axis and new cases on the vertical axis.
  You need to take care that each vector being plotted has the same length.
- Add a dot for each country’s current position using the `scatter` function (which otherwise works like `plot`) and annotate the countries’ dots using something like `annotate!(x, y, text(country_name, 10, :black))`, where `x` and `y` are the positions at which to annotate and `10` is the font size.
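One possible way to build the two data sets above is sketched here on invented numbers for a single country. It assumes the series you loaded is already cumulative; the helper `cases_in_window` and the intermediate `daily_new` are hypothetical names, and shortening the window during the first few days is just one way of keeping both vectors the same length.

```julia
# Hypothetical cumulative confirmed counts for one country (made-up numbers).
total_cases_to_date = [1, 2, 4, 8, 16, 32, 64, 100, 150, 210]

# Daily increases; the first day's increase is taken to be its own count.
daily_new = [total_cases_to_date[1]; diff(total_cases_to_date)]

# Cases during the past `window` days. Near the start the window is shortened,
# so the result has the same length as the input.
function cases_in_window(daily, window)
    return [sum(daily[max(1, day - window + 1):day]) for day in 1:length(daily)]
end

new_cases = cases_in_window(daily_new, 7)
```

With the slider at a given `day`, you would then plot `total_cases_to_date[1:day]` against `new_cases[1:day]`, so both vectors always have matching lengths.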
Exercise 3: Helping with transcripts
The goal of this exercise is to improve the quality of the auto-generated subtitles / captions of the lecture videos via crowdsourcing. Although the natural language processing is very impressive, it’s far from perfect!
- Choose one of the lecture transcripts from lectures 1–4 in the Google Docs transcript folder.
- Choose at least 10 lines, preferably contiguous, and correct the captions by listening to the relevant portion of the video, as given by the time-stamp. Modify the text according to what you hear, paying particular attention to the computational and mathematical words that the automatic transcription software tends to get wrong.
  Please do not modify the time-stamps or line numbers.
- Change the colour of the text that you transcribed to a colour that is neither black nor the same as that of the previous and next blocks.
- Also check (proofread) the 10 previous lines and 10 following lines and correct anything that you think is not correct.
- Write down in your pset submission which piece of which transcript you did. If there are none left, make sure you are first in line for the following lectures!