COS60008 Introduction to Data Science Assignment
Department of Computer Science and Software Engineering
COS60008 Introduction to Data Science
Semester 2 2019 – Assignment 1
Due: 23:59, Friday 27 September 2019 (Week 7)
This is an individual assignment and is worth 15% of your final grade. It is intended to evaluate your understanding of,
and practical skills in, the first few steps of a typical data science process.
In this assignment, you are provided with three data files, i.e., “data1.csv”, “data2.csv” and “data3.csv”, which contain
the data from the 1985 Ward’s Automotive Yearbook [1]. The files “data1.csv” and “data2.csv” contain the same set
of cars but distinct sets of attributes for describing the car, where each car has its unique ID. The file “data3.csv”
contains a different set of cars with each car described by all attributes from both “data1.csv” and “data2.csv”.
You are asked to carry out data acquisition, preparation and exploration based on the three data sources according
to the given instructions. For example, you need to develop and implement appropriate steps to load and merge
the data from the three data files, perform data cleaning, carry out exploratory data analysis, and report your findings.
A discussion forum and further announcements for the assignment will be available in Canvas. You are responsible
for checking Canvas on a regular basis to stay informed of any updates to the assignment.
The submitted assignment must be your own work, and any parts that are not created by yourself must be properly
referenced. Plagiarism is treated very seriously at Swinburne. It includes submitting the code and/or text copied
from other students, the Internet or other resources without proper reference. Allowing others to copy your work
is also plagiarism. Please note that you should always create your own assignment even if your ideas are very similar
to those of other students.
Plagiarism detection software will be used to check your submissions. Severe penalties (e.g., zero mark) will be
applied in cases of plagiarism. For further information, please refer to the relevant section in the Unit Outline under
the menu “Syllabus” in Canvas and the Academic Integrity information at https://www.swinburne.edu.au/currentstudents/manage-course/exams-results-assessment/plagiarism-academic-integrity/.
This section contains the general requirements which must be met by your submitted assignment. Marks will be
deducted if you fail to meet any of the following general requirements.
You must complete Tasks 1 and 2 in the Jupyter Notebook under the Python 2 kernel.
All code for Tasks 1 and 2 must be written in one SINGLE .ipynb file, where each sub-task in Tasks 1 and 2
occupies one code cell.
You must include code-level comments in the .ipynb file to explain the key parts of your code.
You must follow the instructions given in each task to complete the corresponding task.
You must follow the rules specified in the “Submission Requirements” section to make your final submission.
[1] Sources: (1) 1985 Model Import Car and Truck Specifications, 1985 Ward’s Automotive Yearbook, (2) Personal Auto Manuals, Insurance
Services Office, 160 Water Street, New York, NY 10038, and (3) Insurance Collision Report, Insurance Institute for Highway Safety,
Watergate 600, Washington, DC 20037.
Task 1 – Data Acquisition and Preparation (5%)
First, you need to obtain the three data files “data1.csv”, “data2.csv”, and “data3.csv”, which are included
in a single .zip file named “assignment1_data.zip”, under the menu “Assignments → Assignment 1”
in Canvas, and put them into your working folder in the Jupyter Notebook.
These data files are adapted from the “Automobile” data set in the UCI repository [2], and contain many
records of cars with each record corresponding to a specification of the car in terms of its various attributes,
e.g., id, make, fuel type, the assigned insurance risk rating and the normalised loss in use as compared to
other cars. The files “data1.csv” and “data2.csv” contain the same set of cars but two distinct sets of
attributes for describing a car. In contrast, the file “data3.csv” contains a different set of cars, where each
record of the car consists of all attributes from both “data1.csv” and “data2.csv”.
The set of 27 possible attributes for a car record and their corresponding value ranges are given below:
• id (integer between 10000 and 40000): Identifier of the car.
• symboling (-3, -2, -1, 0, 1, 2, 3): Insurance risk rating, where a value of +3 indicates that the car is
risky and a value of -3 indicates that it is probably pretty safe.
• normalised-losses (continuous from 65.0 to 256.0): Normalised losses in use as compared to other
cars.
• make (alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz,
mercury, mitsubishi, nissan, peugot, plymouth, porsche, saab, subaru, toyota, volkswagen, volvo):
Make of the car.
• fuel-type (diesel, gas): Fuel type of the car.
• aspiration (std, turbo): Aspiration of the car.
• num-of-doors (four, two): Number of doors of the car.
• body-style (hardtop, wagon, sedan, hatchback, convertible): Body of the car.
• drive-wheels (4wd, fwd, rwd): Drive-type of the car.
• engine-location (front, rear): Location of the engine.
• wheel-base (continuous from 86.6 to 120.9): Measurement of wheel-base.
• length (continuous from 141.1 to 208.1): Length of the car.
• width (continuous from 60.3 to 72.3): Width of the car.
• height (continuous from 47.8 to 59.8): Height of the car.
• curb-weight (integer between 1488 and 4066): Curb weight of the car.
• engine-type (dohc, dohcv, l, ohc, ohcf, ohcv): Type of engine used in the car.
• num-of-cylinders (eight, five, four, six, three, twelve): Number of cylinders the engine has.
• engine-size (integer between 61 and 326): Size of the engine.
• fuel-system (1bbl, 2bbl, idi, mfi, mpfi, spdi, spfi): Fuel system of the car.
• bore (continuous from 2.54 to 3.94): Bore of the cylinder.
• stroke (continuous from 2.07 to 4.17): Number of strokes.
• compression-ratio (continuous from 7.0 to 23.0): Compression ratio of the car.
• horsepower (continuous from 48.0 to 288.0): Engine power.
• peak-rpm (continuous from 4150.0 to 6600.0): Peak revolutions per minute.
• city-mpg (integer between 13 and 49): Miles per gallon for city-drive.
• highway-mpg (integer between 16 and 54): Miles per gallon for highway-drive.
• price (continuous from 5118.0 to 45400.0): Price of the car.
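The value ranges listed above can double as a checklist for spotting impossible values. The following is a minimal sketch only, assuming the ranges are copied by hand into two dictionaries; the names NUMERIC_RANGES, CATEGORICAL_VALUES and find_out_of_spec are hypothetical, only a few of the 27 attributes are included, and the sample rows are made up rather than taken from the assignment data.

```python
import pandas as pd

# Hypothetical excerpt of the attribute specification; extend it to all 27 attributes.
NUMERIC_RANGES = {
    "id": (10000, 40000),
    "symboling": (-3, 3),
    "price": (5118.0, 45400.0),
}
CATEGORICAL_VALUES = {
    "fuel-type": {"diesel", "gas"},
    "num-of-doors": {"four", "two"},
}


def find_out_of_spec(df):
    """Return a boolean DataFrame that is True where a non-missing value violates the spec."""
    bad = pd.DataFrame(False, index=df.index, columns=df.columns)
    for col, (low, high) in NUMERIC_RANGES.items():
        if col in df.columns:
            values = pd.to_numeric(df[col], errors="coerce")
            # Missing values are not flagged here; they are a separate data issue.
            bad[col] = df[col].notna() & ~values.between(low, high)
    for col, allowed in CATEGORICAL_VALUES.items():
        if col in df.columns:
            bad[col] = df[col].notna() & ~df[col].isin(allowed)
    return bad


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "id": [10001, 10002, 10003],
    "symboling": [3, -1, 5],                 # 5 is outside -3..3
    "fuel-type": ["gas", "gas ", "diesel"],  # "gas " has a trailing space
    "price": [13495.0, None, 16500.0],       # missing, not out of range
})
violations = find_out_of_spec(sample)
print(violations.sum())  # number of flagged values per column
```

A check of this kind only reports where values fall outside the specification; deciding how to repair each flagged value is still a separate cleaning decision.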
As a data scientist, you will be asked to analyse the data from the three data files. However, before doing
that you know that you need to carry out some data preparation operations, e.g., merging and cleaning the
data. In this task, you are asked to utilise the Python package “Pandas” to do the following:
1.1. Loading the data from the three data files into three Pandas DataFrames and checking whether the
loaded data are equivalent to the data contained in the raw data files.
1.2. Merging the obtained three DataFrames into a single one that should contain all cars from the three
DataFrames, where each car has a unique ID and is described by the 27 attributes listed above.
1.3. Cleaning the data by using the knowledge you’ve learned.
o You need to deal with the issues existing in the data, e.g., missing values, duplicates, impossible
values and extra whitespaces. However, you must NOT modify any parts of the data that do not have
issues. Modifying such parts will lead to a mark reduction.
o When dealing with missing values (if any), you can remove an entire row or column ONLY IF
more than 50% of its elements are missing. Otherwise, you must find other appropriate cleaning
methods to handle missing values.
o You must be able to explain how you detect each data issue and why you choose a specific
cleaning method to deal with it.
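Sub-tasks 1.1 to 1.3 can be sketched end to end on a toy example. This is an illustration under stated assumptions, not a model answer: the three in-memory CSV strings are made-up stand-ins for the real files with a reduced column set, the outer merge on id followed by concatenation is one possible merging strategy, and median imputation is just one candidate method that you would still need to justify. With the real files, pd.read_csv("data1.csv") would replace the io.StringIO calls.

```python
import io

import pandas as pd

# Made-up stand-ins for data1.csv, data2.csv and data3.csv (hypothetical rows, reduced columns).
csv1 = u"id,make,fuel-type\n10001,audi,gas\n10002, bmw ,gas\n10003,volvo,diesel\n"
csv2 = u"id,horsepower,price\n10001,102,13950\n10002,,16430\n10003,114,-1\n"
csv3 = (u"id,make,fuel-type,horsepower,price\n"
        u"20001,honda,gas,76,6529\n"
        u"20001,honda,gas,76,6529\n"
        u"20002,,,,\n")

# 1.1 Load each source and eyeball it against the raw file.
df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))
df3 = pd.read_csv(io.StringIO(csv3))
print(df1.head())

# 1.2 Merge: join the two attribute sets of the same cars on id,
# then append the different cars that already carry all attributes.
same_cars = pd.merge(df1, df2, on="id", how="outer")
cars = pd.concat([same_cars, df3], ignore_index=True)

# 1.3 Clean.
# Extra whitespace: strip the text columns.
for col in ["make", "fuel-type"]:
    cars[col] = cars[col].str.strip()
# Duplicates: drop rows that are exact copies of an earlier row.
cars = cars.drop_duplicates().reset_index(drop=True)
# Impossible values: treat prices outside the specified range as missing.
cars["price"] = cars["price"].where(cars["price"].between(5118.0, 45400.0))
# Missing values: drop a row ONLY IF more than 50% of its elements are missing.
row_missing_share = cars.isnull().mean(axis=1)
cars = cars[row_missing_share <= 0.5].copy()
# Otherwise impute; the column median is an assumption made for this sketch.
for col in ["horsepower", "price"]:
    cars[col] = cars[col].fillna(cars[col].median())

print(cars)
print(cars["id"].is_unique)
```

Stripping whitespace before dropping duplicates matters, since two rows that differ only by stray spaces would otherwise not be recognised as copies of each other.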
Task 2 – Data Exploration (5%)
Now you’ve finished Task 1 and obtained a DataFrame composed of the cleaned data. You can start to
explore your data by carrying out the following steps:
2.1. Choosing two columns with categorical and numerical values, respectively, and visualising each of
them in an appropriate way. Note that you need to explore and identify potentially important columns
(and be able to justify your choice) instead of choosing at random.
2.2. Choosing three pairs of columns and exploring the relationship between the two columns involved in
each pair via appropriate descriptive statistics and visualisation tools. Your choice of the column pairs
should be intended to address some “plausible hypotheses” on the data.
2.3. Building a scatter matrix for all numerical columns.
Note: Graphs must contain appropriate titles, axis labels, etc. so that they are self-explanatory. They
should be clear enough for readers to read.
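The three exploration steps can be sketched with Pandas and Matplotlib as follows. This is a minimal sketch under assumptions: the DataFrame holds a handful of made-up rows rather than the cleaned assignment data, the chosen columns are arbitrary examples and not a recommendation, and the "Agg" backend line is only there so the code runs outside a notebook (inside Jupyter you would normally drop it and let the figures display inline).

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up rows standing in for the cleaned DataFrame.
cars = pd.DataFrame({
    "body-style": ["sedan", "sedan", "hatchback", "wagon", "hatchback", "sedan"],
    "engine-size": [109, 136, 97, 120, 92, 181],
    "horsepower": [102.0, 110.0, 69.0, 97.0, 62.0, 160.0],
    "city-mpg": [24, 19, 31, 27, 31, 19],
    "price": [13950.0, 17710.0, 7129.0, 8949.0, 6488.0, 18399.0],
})

# 2.1 One categorical column (bar chart of counts) and one numerical column (histogram).
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts = cars["body-style"].value_counts()
counts.plot(kind="bar", ax=axes[0])
axes[0].set_title("Number of cars per body style")
axes[0].set_xlabel("Body style")
axes[0].set_ylabel("Number of cars")
cars["price"].plot(kind="hist", bins=5, ax=axes[1])
axes[1].set_title("Distribution of price")
axes[1].set_xlabel("Price")
axes[1].set_ylabel("Number of cars")
fig.tight_layout()

# 2.2 One example pair: is a larger engine associated with a higher price?
corr = cars["engine-size"].corr(cars["price"])  # Pearson correlation
ax = cars.plot(kind="scatter", x="engine-size", y="price")
ax.set_title("Price against engine size")
ax.set_xlabel("Engine size")
ax.set_ylabel("Price")
print(cars[["engine-size", "price"]].describe())
print(corr)

# 2.3 Scatter matrix over all numerical columns.
numeric = cars.select_dtypes(include="number")
axes_matrix = scatter_matrix(numeric, figsize=(8, 8), diagonal="hist")
plt.suptitle("Scatter matrix of numerical columns")
```

For a pair that mixes a categorical and a numerical column, grouped statistics (for example cars.groupby("body-style")["price"].describe()) together with a box plot may be a better fit than a correlation coefficient.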
Task 3 – Report (5%)
In this task, you are asked to write a report to elaborate your analyses and findings in Tasks 1 and 2. You
need to:
3.1. Create a sub-heading titled “Task 1: Data Acquisition & Preparation” in your report under which you
need to:
o Briefly describe how you addressed this task.
o Describe how you merged the data from the three data files.
o Describe each of the data issues you detected in data cleaning, explain how you detected it, and
justify why you chose a specific data cleaning method to deal with it.
o Discuss any problems you encountered when undertaking this task and how you solved them.
3.2. Create a sub-heading named “Task 2: Data Exploration” in your report under which you need to:
o Create a sub-section with an appropriate title for each of the three sub-tasks in Task 2.
o In the sub-section for sub-task 2.1, for each selected column, include the graph(s) created for that
column, and provide a brief explanation on why you chose that column and a specific visualisation
method to explore it.
o In the sub-section for sub-task 2.2, briefly explain why you chose each of the three pairs of
columns (e.g., stating the hypotheses that you intended to address), include the descriptive
statistics and graph(s) for each of the three selected pairs, followed by a brief discussion on any
interesting findings about the presence or lack of relationship between the two involved columns.
o In the sub-section for sub-task 2.3, include the plot of the scatter matrix, and report your findings
from the plot.
The report must be saved in the PDF format and named “report.pdf” for submission.
It MUST be written in the single column format with font size between 10 and 12 points and no more than
6 pages (including tables, graphs and/or references). Penalties will apply if the report does not satisfy these
requirements. Moreover, the quality of the report will be considered when marking, e.g. organisation, clarity,
and grammatical mistakes.
Please remember to cite any sources which you’ve referred to when doing your work!
The assignment is due at:
23:59, Friday 27 September 2019 (Week 7)
Assignments submitted after this time are subject to late submission penalties. For detailed information,
please refer to the relevant section in the Unit Outline under the menu “Syllabus” in Canvas.
You need to prepare the following two files:
1. A notebook file named assignment1.ipynb which contains all your code and code-level comments for
Tasks 1 and 2.
Note: Please make sure to clean the code before making submission to remove all unnecessary code.
You should execute the steps: “Main menu → Kernel → Restart & Run All” in the Jupyter Notebook to
ensure you see all the data printed and all the graphs displayed as expected.
2. A report file named report.pdf which must strictly follow the format requirements detailed in Task 3.
To submit, you must archive these TWO files into ONE single .zip file, name it as per your student ID (e.g.,
1234567.zip if your student ID is 1234567), and then submit it in Canvas under:
Please do NOT submit other unnecessary files.
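If you prefer to build the archive from Python rather than with a file manager, the packaging step can be sketched as below. This is only an illustration: the helper name package_submission is hypothetical, and the demonstration uses empty placeholder files in a temporary folder instead of a real notebook and report.

```python
import os
import tempfile
import zipfile


def package_submission(student_id, folder="."):
    """Zip assignment1.ipynb and report.pdf into <student_id>.zip and return its path."""
    required = ["assignment1.ipynb", "report.pdf"]
    missing = [name for name in required
               if not os.path.isfile(os.path.join(folder, name))]
    if missing:
        raise IOError("Missing file(s): " + ", ".join(missing))
    zip_path = os.path.join(folder, str(student_id) + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in required:
            # arcname keeps only the file name, so no folder structure ends up in the archive.
            archive.write(os.path.join(folder, name), arcname=name)
    return zip_path


# Demonstration with empty placeholder files in a temporary folder.
demo_dir = tempfile.mkdtemp()
for name in ["assignment1.ipynb", "report.pdf"]:
    open(os.path.join(demo_dir, name), "w").close()

zip_path = package_submission("1234567", demo_dir)
with zipfile.ZipFile(zip_path) as archive:
    names = sorted(archive.namelist())
print(names)
```

Listing the archive contents afterwards is a quick way to confirm that exactly the two required files, and nothing else, are included.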
Extensions will only be permitted in exceptional circumstances. You should always back up your code and
other assignment-related documents frequently to avoid potential loss of progress. Note that any accidental
loss of progress, working while studying, and/or a heavy load of assignments will not be accepted as
exceptional circumstances for an extension. For detailed information, please refer to the relevant section in
the Unit Outline under the menu “Syllabus” in Canvas.