Description
Introduction
You will begin working on an applied data mining project this week. To review the project components and timeline for this multi-part assignment, see the Project Overview.
Directions
Identify the datasets to be used for your data mining analysis. The project should utilize at least two publicly available datasets, which have not been used in any other assignments in the class.
Submit the following items for this assignment:
A short description of the datasets
Links to the datasets
An explanation of how those datasets can be studied together and how this study will contribute to a business or society
This submission does not require you to create specific research questions; however, you must explain the problems that analyzing the selected datasets in combination may solve.
Cities and states usually publish public datasets that you could use. For example, see https://data.kcmo.org/ for Kansas City, MO (or the equivalent portal for your city or state).
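As a rough illustration of what "studied together" can mean, the sketch below joins two small tables on a shared key and derives a measure that neither table provides alone. The column names (area, requests, population) and all values are invented for illustration; real datasets would be loaded from files, and a library such as pandas would be a common choice for the join.

```python
import csv
import io

# Hypothetical stand-ins for two public datasets. The column names and
# values are invented for illustration only.
SERVICE_REQUESTS_CSV = """area,requests
A1,120
A2,60
A3,300
"""

POPULATION_CSV = """area,population
A1,4000
A2,1500
A4,9000
"""


def read_by_key(text, key):
    """Parse CSV text into a dict mapping the key column to its row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def inner_join(left, right):
    """Keep only keys present in both datasets and merge their columns."""
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}


requests = read_by_key(SERVICE_REQUESTS_CSV, "area")
population = read_by_key(POPULATION_CSV, "area")
merged = inner_join(requests, population)

# Requests per 1,000 residents: available only after combining both sources.
rates = {
    area: 1000 * int(row["requests"]) / int(row["population"])
    for area, row in merged.items()
}

for area in sorted(rates):
    print(area, rates[area])
```

Areas A3 and A4 drop out because each appears in only one table. Noting how many records survive the join is a useful detail for the dataset description.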
Applied Data Mining Project Overview
Introduction
The course project is a culminating learning experience in this class. The goal of the project is to
conduct a data analysis in a setting very similar to real-world applications. It will allow you to apply
skills covered throughout the entire course.
The application domain of the analysis is open for the student to choose. However, the chosen topic
should first be approved by the instructor.
The project requires you to identify data sources (Unit 4), convert the data into a format suitable for
data analysis, prepare the data, and then analyze it. The preliminary results of the analysis should
be reported during Unit 7, followed by a full report and an analysis presentation (Unit 8).
To meet expectations, students will be asked to discuss how parallel computing can benefit the
analysis. To exceed expectations, one of the parallelization methods should be applied to the data
analysis.
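One way parallel computing can benefit an analysis is by splitting the data into chunks, computing partial results in separate worker processes, and combining them. The sketch below does this for a mean, using Python's standard concurrent.futures module. The function names and the chunk count are illustrative assumptions, not a prescribed method for the project.

```python
from concurrent.futures import ProcessPoolExecutor


def split(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Work done independently by each worker: a partial sum and count."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_chunks=4, executor_factory=ProcessPoolExecutor):
    """Compute the mean by combining per-chunk partial sums and counts."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split(values, n_chunks)
    with executor_factory() as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    # The guard is needed on platforms that start workers by re-importing
    # this module.
    data = list(range(1, 1_000_001))
    print(parallel_mean(data))
```

For a computation this small, the cost of starting processes and copying data will usually outweigh the gain. The same split, compute, and combine pattern pays off when each chunk involves substantial work, which is the kind of trade-off worth discussing in the report.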
Directions
The project consists of the following stages:
1) Identify the datasets to be used for the analysis. The project should utilize at least two
publicly available datasets that have not been used in any other assignments in the
class. By Unit 4, submit a short description of the datasets, links to the datasets, and
a couple of paragraphs on how those datasets can be studied together and how this
study will contribute to a business or society. This submission does not require you to create
specific research questions; however, you must explain the problems that analyzing the
selected datasets in combination may solve.
Cities or states usually provide public datasets that could be used. For example,
https://data.kcmo.org/ for Kansas City, MO (or one for your city or state).
You can also find links to many other datasets on this page:
https://www.datasciencecentral.com/profiles/blogs/great-github-list-of-public-datasets
2) Identify the best technology for data conversion, data cleaning, and data
munging. Apply those techniques to the selected datasets to produce a single
merged dataset for further analysis.
3) Identify the research question and what characteristics (variables) you will need to study
it.
4) Identify the need, or potential need, for distributed computing in order to store,
manipulate, or analyze the data.
5) Conduct the preliminary analysis by running one of the data mining techniques (e.g.,
clustering or regression).
6) Interpret and report the preliminary results of the analysis (Unit 7, Sunday 11:59pm).
Use any appropriate format (e.g., tables, charts) to report the results of the analysis;
the write-up must include a results-based response to the research question.
7) Prepare the full report which must include:
a. Research question
b. Description of the datasets
c. Description of the specific data preparation process conducted
d. Description of analytical techniques
e. Description of the parallelization technologies used, or of a potential need for
those technologies
f. Results of the analysis, including tables and charts that follow the basics of data
visualization
g. Conclusions about the results, the limitations, and the process of the data
analysis conducted
(Unit 8, Friday 11:59pm)
8) Create a video presentation of the key items in the full report. Write a short
bullet-point summary to report the key findings (Unit 8, Wednesday 11:59pm).
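For stages 5 and 6, a preliminary analysis can be as small as fitting one model to the merged data and reporting the fit in a table. The sketch below fits a simple linear regression by ordinary least squares using only the standard library. The x and y values are invented placeholders for two variables from a merged dataset, and fit_line is a hypothetical helper name; a real project would more likely use a statistics or machine learning library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a single predictor, R^2 equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Invented placeholder values standing in for two merged variables.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r_squared = fit_line(x, y)

# Report the preliminary result as a small table.
print(f"{'quantity':<12}{'value':>8}")
for name, value in [("slope", slope), ("intercept", intercept), ("R^2", r_squared)]:
    print(f"{name:<12}{value:>8.3f}")
```

The written interpretation would then tie these numbers back to the research question, for example by stating whether the relationship is strong enough to support the claim being examined.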