NA GIRISH – Data analytics project architecture diagram

Description

Hello,


Please draw a diagram using the draw.io tool for the architecture of the solution we need to build.

Choose any dataset as per the instructions; the main task is the one I highlighted below (the point 2 diagram).

I have attached a sample document and the project instructions, so the schema needs to be based on those instructions.

1. Data Warehouse is developed to obtain business intelligence from data. Develop a proof of concept data warehouse/mart (using dimensional model) capturing data from an existing data source(s). Document your reasons for selecting the subject area(s), identify key stakeholders, formalise the business vision for developing the data warehouse from the given data source. Also explain the insights that a company may attain from the given data. It should be reflected in the reports and visuals.

(11 marks)

Note: Use data sources other than Kaggle (if possible)

2. Develop and present a suitable schema for the data warehouse. Discuss your

reasons for the design.


Unformatted Attachment Preview

Data Storage Solutions for Data Analytics
MSc in Data Analytics
Assessment Brief
Module Title: Data Storage Solutions for Data Analytics
Module Code: B9DA111
Module Leader: Dr Shazia A Afzal
Stage (if relevant): Award
Assessment Title: Design and Development of data storage solutions for analysis
Assessment Number (if relevant): 1
Assessment Type: Lab Work and Report
Restrictions on Time/Length (+/- 10%): N/A
Individual/Group: Group (3 in a Group)
Assessment Weighting: 50% report and 20% Presentation
Issue Date: 18-12-23
Hand In Date: 13-12-23
Planned Feedback Date: TBC
Mode of Submission: Online via Moodle
Every student must submit their assignment on Moodle on or before the due date.
Assignments will not be accepted via Email.
Details of Assignment brief
Total: 100 Marks (weighting: 50%)
Following is the breakdown of tasks allocated for this assessment.
1. Data Warehouse is developed to obtain business intelligence from data. Develop a
proof of concept data warehouse/mart (using dimensional model) capturing data
from an existing data source(s). Document your reasons for selecting the subject
area(s), identify key stakeholders, formalise the business vision for developing the
data warehouse from the given data source. Also explain the insights that a
company may attain from the given data. It should be reflected in the reports and
visuals.
(11 marks)
Note: Use data sources other than Kaggle (if possible)
2. Develop and present a suitable schema for the data warehouse. Discuss your
reasons for the design.
(15 marks)
3. Using Microsoft SQL Server, implement your tables and extract, transform and load
data from the operational source(s) into the data warehouse. This can be done using
SSIS or by writing SQL statements.
(14 marks)
4. Produce four reports in support of the requirements outlined in section 1 using
SSRS. Also produce four visualisations using Tableau and discuss them. Develop a
dashboard in Tableau to show the visuals.
(20 marks)
5. Multiple studies have been done comparing the performance of graph databases
and relational ones. Use a relational database such as Adventure Works (or you can
use any other relational database of your choice) as data source and implement a
graph database in Neo4J. Write any seven queries in SQL and CQL to use as
evidence to compare both databases and discuss the results to demonstrate the
differences of storing/retrieving data in/from relational and graph databases. Note:
These queries must provide useful information from the database.
(20 marks)
6. Overall report including references/bibliography.
(10 marks)
7. Individual Contribution Report
o Every student must write an individual contribution report based on the reflection of learning. You must also write about your contributions to this work. Word count: 500
(10 marks)
Note: Use font: Calibri, font-size: 12, 1.15 line spacing.
Presentation (20%)
Presentations will be held in the week 12 at the allocated times
Time allowed: 15 minutes +- 2 minutes
Every group member has to present.
At the time of presentation, all projects/code must be ready for demonstration. Q/A session
may need more time so be prepared.
(Marks will be deducted for time taken more than the allocated time).
Marks Breakdown
Visual Appeal: No errors in spelling, grammar and punctuation. Clear and concise information on each slide. Visually appealing/engaging. (5% marks)
Comprehension: Extensive knowledge of topic. Complete understanding of assignment. Accurately answering all questions posed. (5% marks)
Presentation Skills: Engaging with audience and getting audience’s attention. Appropriate speaking volume & body language. Prepared and rehearsed. (5% marks)
Content: Concise summary of the topic with all questions answered. Comprehensive and complete coverage of information. (5% marks)
Deliverables:
• Report (Microsoft Word) including
o Source of Data
o Vision and Business Requirements for the Development of Data Warehouse (First Part)
o Data warehouse Design (produced in Visio or in a similar system)
o Data Warehouse (development scripts) as Appendix
o ETL (project screenshots and project OR SQL code)
o Visualisations
o Explanation of every section
o References/Bibliography
• SSRS Reports (pdf/excel) – must be included in the submission
• SSIS Project – must be included in submission
• All queries including SQL and CQL, Neo4j screenshots and explanation – must be included in submission
• Presentation slides
• Individual Contribution Report
General Assessment Submission Requirements for Students:
1. Online assignments must be submitted no later than the stated deadline.
2. All relevant provisions of the Assessment Regulations must be complied with.
3. Extensions to assignment submission deadlines will not be granted, other than in exceptional circumstances. To apply for an extension please go to http://www.dbsstudents.com/Registrar/ and download the Assignment Extension Request Form.
4. Marks will be deducted for late submissions.
5. Students are required to retain a copy of each assignment submitted, and the submission receipt.
6. Assignments that exceed the word count will be penalised.
7. Students are required to refer to the assessment regulations in their Student Guides and on the Student Website.
8. Dublin Business School penalises students who engage in academic impropriety (i.e. plagiarism, collusion and/or copying). Please refer to the attached referencing guidelines for information on correct referencing.
What is referencing and why is it necessary?
Please follow this link to the Harvard Style Referencing Guide – all referencing is required in
this format.
http://issuu.com/dbslibrary/docs/harvard-referencing-guide/1?mode=a_p
1/19/2021
[Data Storage Solutions for Data Analytics]
CA
Module Lead: Dr. Shazia Afzal
Submitted by (Group 6):
Ved Yadav
(10556665)
Priyabrata Samal (10564797)
Sriram Vallurupalli (10558533)
Contents
Question 01 .......... 2
Question 02 .......... 4
Question 03 .......... 8
Question 04 .......... 10
Question 05 .......... 20
Question 06 .......... 36
Bibliography .......... 36
Question 01
Data Warehouse is developed to obtain business intelligence from data. Develop a
proof of concept data warehouse/mart (using dimensional model) capturing data
from an existing data source(s). Document your reasons for selecting the subject
area(s), identify key stakeholders, formalise the business vision for developing the
data warehouse from the given data source. Also explain the insights that a
company may attain from the given data. It should be reflected in the reports and
visuals.
Answer:
We have selected a dataset from Kaggle for the Indian Premier League (IPL). The IPL is a professional Twenty20 cricket league founded by the Board of Control for Cricket in India (BCCI) in 2008. We will demonstrate the data warehouse using a dimensional model. For this representation we have cleaned the data and removed the ambiguous records from the columns; details of the changes are included in this report.
We selected this subject because IPL matches are hugely popular and the domain holds many entities in relation. Using these entities and relations we can create a sound dimensional model, use it to show graphical representations of the event, and in future draw ideas from the data analysis done on it.
The stakeholders for this dimensional model are the teams participating in the league, the players from different nations who have been auctioned, the defined venues (the stadiums marked for each match), and the match dates scheduled and planned for each match in the league.
This data warehouse model can be used in many applications for prediction and future use. It can identify each team’s win percentage, and it can provide information on home and overseas players in each team. The IPL is a massive business, and while auctioning or selecting a ground or team, these datasets can be consulted and used for prediction. The dataset can also inform how many matches can be planned, based on previous years’ data.
These predictions can be of great benefit to IPL shareholders.
In this exercise we are demonstrating the reports below. Details are mentioned for each table.
1. Team win percentage:
In this report we populate the data from the IPL database tables in a graphical manner, with the count of winning matches for every individual team across the seasons they played.
2. Number of matches in a specific season:
This report has the count of matches played in the league in each year, right from 2008.
3. Matches count at venue:
The match count is presented as a percentage for each stadium. In this report we have taken match details for all the seasons played so far.
4. Overseas player percentage:
This graph analyses the number of overseas players who participated and their home countries.
Below are the data changes we performed while cleansing the CSV files:
1. We removed a record with the value -1 from the Match table; it had -1 in all the columns.
2. While importing, we changed the data length of the Venue_name column to 52 for the Team table.
3. The ball_by_ball table has a lot of records; we used it for reference but did not involve it in the dimensional model.
4. We created the fact tables in the same database to connect relationships with the other entities.
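The cleansing steps above can be sketched in plain Python; this is an illustrative miniature with made-up rows, not the actual scripts used for the IPL CSVs:

```python
import csv
import io

# Hypothetical miniature of the Match CSV: the first data row is the
# ambiguous record that has -1 in every column.
raw = """match_id,team1,team2,match_winner
-1,-1,-1,-1
335982,KKR,RCB,KKR
335983,CSK,KXIP,CSK
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Change 1: drop any record whose values are all "-1".
clean = [r for r in rows if not all(v == "-1" for v in r.values())]

# Change 2: cap text fields at 52 characters, mirroring the Venue_name resize.
clean = [{k: v[:52] for k, v in r.items()} for r in clean]

print(len(clean))  # 2 rows survive the cleanse
```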
Question 02
Develop and present a suitable schema for the data warehouse. Discuss your reasons for the
design.
Answer:
In this database we have created multiple dimension tables. Below are the details of each table:
1. [dbo].[Date_Dim] – holds all the date and time information.
2. [dbo].[Location_Dim] – holds the venue name along with the city name. This helps us identify the required location information.
3. [dbo].[Player] – holds details of each player who participated in the IPL.
4. [dbo].[Match] – holds the details of each match.
5. [dbo].[Team] – holds details of all the teams that have participated in the league since the IPL started.
6. [dbo].[Player_match] – holds information on a player’s performance in a match. This can also be used to get additional information about the match.
7. [dbo].[Match_fact_table] – derives match attributes such as match winner, toss winner and man of the match with respect to a particular team, along with the date and location attributes.
8. [dbo].[Player_fact_table] – holds player attribute details such as age, the team he plays for, and whether he is from India or overseas. This helps correlate data with the other tables.
This schema lets us create the joins between the tables:
• Match_id is the primary key of the Match table; it is mapped as a foreign key in the Match_fact_table, Player_match and ball_by_ball tables.
• team_id is the primary key of the Team table; it is a foreign key in Match_fact_table, creating the relation between these two tables.
• Date_key is the primary key of the Date_Dim table; it is referenced as a foreign key in the Match table and Match_fact_table.
• Location_key is the primary key of the Location_Dim table; it is a foreign key in the Match_fact_table and Match table.
Using the above table schema we can derive the results and present the relations between tables to demonstrate the data warehouse solution.
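To make the fact-to-dimension joins concrete, here is a small Python sketch with made-up rows (not the real IPL data) showing how a fact row’s Date_key and Location_key resolve against the dimension tables, which is exactly what the joins above perform:

```python
# Tiny stand-ins for the dimension tables, keyed by their surrogate keys.
date_dim = {1: {"Full_Date": "2008-04-18", "Year": 2008}}
location_dim = {10: {"Venue_name": "M Chinnaswamy Stadium", "City_Name": "Bangalore"}}

# One hypothetical row of the match fact table.
match_fact = [
    {"Match_id": 335982, "Date_key": 1, "Location_key": 10, "match_winner": "KKR"},
]

# Resolve the foreign keys exactly as the WHERE-clause joins do.
report = [
    {
        "winner": f["match_winner"],
        "year": date_dim[f["Date_key"]]["Year"],
        "venue": location_dim[f["Location_key"]]["Venue_name"],
    }
    for f in match_fact
]
print(report[0])
```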
1. Team win percentage:
In this report we will populate the data from tables of IPL database in
graphical manner with winning matches count of every individual team in
their played seasons.
Below is the report query:
select tm.Team_Name, count(mat.match_winner) as [Matches Won]
from match mat, team tm
where mat.match_winner = tm.Team_Name
group by tm.Team_Name
order by [Matches Won] desc
2. Number of matches in specific season.
This report has count of matches played in the league in each year right from
2008.
Below is the report query:
select dt.Year,count(mat.match_id) as Total_Matches from match mat, Date_Dim dt
where mat.Date_key = dt.Date_key
group by dt.Year
order by Total_Matches desc
3. Matches count at venue.
Matches count has been demonstrated in percentage for each stadium. In
this report we have taken match details for all the seasons played so far.
Below is the report query:
select L.Venue_name, count(L.Venue_name) as No_of_matches
from match as M, [dbo].[Location_Dim] as L
where M.Location_key = L.Location_key and L.Venue_name != 'NULL'
group by L.Venue_name
order by No_of_matches desc
4. Overseas player percentage
This graph has analysis of number of overseas players participated and their
home countries.
Below is the report query:
select count(pl.Player_Name) as no_of_players, pl.Country_Name
from Player_fact_table pf, player pl
where pf.Player_id = pl.Player_Id
and pf.overseases_player = 1
group by pl.Country_Name
Below are the tables and columns from the database diagram of this model.
Date_Dim: Date_key, Full_Date, Year, Month, Day
Location_Dim: Location_Key, Venue_name, City_Name, Country_Name
Match: Match_SK, match_id, Team1, Team2, Toss_Winner, match_winner, Win_Type, ManOfMach, Location_key, Date_key
Match_Fact_table: Match_id, Date_key, Toss_Winner, match_winner, ManOfMach, team_id, Location_key
Player_match: Player_match_SK, PlayerMatch_key, Match_Id, Player_Id, Country_Name, Role_Desc, Player_team, Opposit_Team, Season_year, is_manofThematch, Age_As_on_match, IsPlayers_Team_won
Ball_By_Ball: Match_id, Over_id, Ball_id, Innings_No, Team_Batting, Team_Bowling, Striker_Batting_Position, Extra_Type, Runs_Scored, Extra_runs, Wides, Legbyes, Byes, Noballs, Penalty, Bowler_Extras, Out_type, Caught, Bowled, Run_out, LBW, Retired_hurt, Stumped, caught_and_bowled, hit_wicket, ObstructingFeild, Bowler_Wicket, Match_Date, Season, Striker, Non_Striker, Bowler, Player_Out, Fielders, Striker_match_SK, StrikerSK, NonStriker_match_SK
Team: Team_SK, Team_Id, Team_Name
Player: PLAYER_SK, Player_Id, Player_Name, DOB, Batting_hand, Bowling_skill
Player_fact_table: Player_key, Player_id, overseases_player, Age_more_than_35, Team_id
Question 03
Using Microsoft SQL Server, implement your tables and extract, transform and load
data from the operational source(s) into the data warehouse. This can be done using
SSIS or by writing SQL statements.
Answer:
We have created the data warehouse solution on the SQL instance: DESKTOP-QEEN7S7
Database name: IndianPremierLeague
Instance version: Microsoft SQL Server 2017
Authentication mode: Windows & SQL (either can be used to authenticate to the instance and check the database)
• Windows user: DESKTOP-QEEN7S7\vedya
• SQL user: IPL_user
Below are the SQL scripts and the SSIS package used for creation of the data warehouse. We implemented it sequentially to keep all the entities in relation and maintain the data as per the dimensional model.
1. Execute the query below to create all the required tables.
1_DB_script_Create_table.sql
2. Once the tables are created, execute the query below from the attached script to insert data into the blank tables.
2_DB_Script_Insert.sql
3. After insertion of the data, run the query below to map the remaining constraints.
3_DB_script_Constraints.sql
Attached queries in zip format – create_insert_constraint.zip
4. One SSIS package was also created as a sample; it can likewise be used to create a table, import data, and then map the constraints using an alter query. (This table is already covered in step 2 using scripts, but the SSIS package was created to demonstrate that SSIS can also be used for such an exercise.)
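The three-step sequence above (create tables, insert data, then add constraints) can be illustrated with a self-contained Python sketch; sqlite3 stands in for SQL Server here, and the table and rows are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Step 1: create the target table (stand-in for 1_DB_script_Create_table.sql).
cur.execute("CREATE TABLE team (Team_Id INTEGER PRIMARY KEY, Team_Name TEXT)")

# Step 2: bulk-insert the extracted rows (stand-in for 2_DB_Script_Insert.sql).
teams = [(1, "Kolkata Knight Riders"), (2, "Chennai Super Kings")]
cur.executemany("INSERT INTO team VALUES (?, ?)", teams)

# Step 3: add further constraints after the load (stand-in for 3_DB_script_Constraints.sql).
cur.execute("CREATE UNIQUE INDEX ux_team_name ON team (Team_Name)")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM team").fetchone()[0])
```

Loading before constraining, as in step 3, keeps the bulk insert fast and lets the constraints validate the finished load.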
Question 04.
Produce four reports in support of the requirements outlined in section 1 using SSRS.
Also produce four visualisations using Tableau and discuss them. Develop a
dashboard in Tableau to show the visuals.
Answer:
Below are the reports we have created using SSRS to demonstrate the data warehouse dimensional model.
Report hosted at: http://desktop-qeen7s7/Reports/browse/IndianPremierLeague
1. Team win percentage:
Attached here is its report file along with a snapshot of the SSRS page.
2. Match number percentage per year:
Attached here is its report file along with a snapshot of the SSRS page.
3. Matches count at venue:
Attached here is its report file along with a snapshot of the SSRS page.
4. Overseas player percentage:
Attached here is its report file along with a snapshot of the SSRS page.
Below are the visualisations created using Tableau, and the dashboard built from them:
1. Team win percentage
2. Match number percentage per year
3. Matches count at venue
4. Overseas player percentage
Dashboard which holds all the sheets together:
Attached here is the IndianPremierLeague dashboard export for reference.
Question 05
Multiple studies compare the performance of graph and relational databases. Use a
relational database such as Adventure Works as data source and implement a graph
database in Neo4J. Write any seven queries in SQL and CQL to use as an evidence to
compare both databases and discuss the results to demonstrate the differences of
storing/retrieving data in/from relational and graph databases.
Note: These queries must provide useful information from the database.
Answer:
Below is the representation using Neo4j.
IMPORTING: In this section we import the CSV files into Neo4j. Below are the statements used to import each file.
Stores.csv
LOAD CSV WITH HEADERS FROM 'file:///Stores.csv' as row
CREATE (STORE:Stores)
SET STORE = {store_id:row.store_id, phone:row.phone, email:row.email, street:row.street, city:row.city, state:row.state, zip_code:row.zip_code}
return STORE
Staffs.csv
LOAD CSV WITH HEADERS FROM 'file:///Staffs.csv' as row
CREATE (STAFF:Staff)
SET STAFF = {staff_id:row.staff_id, fname:row.first_name, last_name:row.last_name, email:row.email, phone:row.phone, active:row.active, store_id:row.store_id, manager_id:row.manager_id}
return STAFF
Order_items.csv
LOAD CSV WITH HEADERS FROM 'file:///Order_items.csv' as row
CREATE (ordit:OrderItem)
SET ordit = {order_id:row.order_id, item_id:row.item_id, product_id:row.product_id, quantity:row.quantity, list_price:row.list_price, discount:row.discount}
return ordit
Orders.csv
LOAD CSV WITH HEADERS FROM 'file:///Orders.csv' as row
CREATE (ord:Order)
SET ord = {order_id:row.order_id, customer_id:row.customer_id, order_status:row.order_status, required_date:row.required_date, shipped_date:row.shipped_date, store_id:row.store_id, staff_id:row.staff_id}
return ord
Customers.csv
LOAD CSV WITH HEADERS FROM 'file:///Customers.csv' as row
CREATE (cus:Customer)
SET cus = {customer_id:row.customer_id, first_name:row.first_name, last_name:row.last_name, phone:row.phone, email:row.email, street:row.street, city:row.city, state:row.state, zipcode:row.zip_code}
return cus
CREATING CONSTRAINTS
CREATE CONSTRAINT ON (ord:Order) ASSERT ord.order_id IS UNIQUE
CREATE CONSTRAINT ON (cus:Customer) ASSERT cus.customer_id IS UNIQUE
CREATE CONSTRAINT ON (STAFF:Staff) ASSERT STAFF.staff_id IS UNIQUE
CREATE CONSTRAINT ON (STORE:Stores) ASSERT STORE.store_id IS UNIQUE
RELATIONSHIPS/MATCHING
Below graph shows: Staff working in a particular Store.
Mapping the staff to a particular store with the relation WORKING_IN:
MATCH (STAFF:Staff), (STORE:Stores) WHERE STAFF.store_id = STORE.store_id
CREATE (STAFF)-[r:WORKING_IN]->(STORE) return STAFF, STORE, r
Customer orders Order:
Mapping the customer to the orders they placed with the relation orders:
MATCH (cus:Customer), (ord:Order) WHERE ord.customer_id = cus.customer_id
CREATE (cus)-[r:orders]->(ord) return ord, cus, r
Order managed by staff:
Mapping the staff to the orders they managed with the relation managed_by:
MATCH (ord:Order), (STAFF:Staff) WHERE ord.staff_id = STAFF.staff_id
CREATE (ord)-[r:managed_by]->(STAFF) return ord, STAFF, r
ORDER CONTAINS ITEMS
Mapping the items in a particular order id with the relation contains:
MATCH (ord:Order), (ordit:OrderItem) WHERE ord.order_id = ordit.order_id
CREATE (ord)-[r:contains]->(ordit) return ord, ordit, r
ORDER SUPPLIED BY STORE:
Mapping all the orders supplied by a particular store.
CQL:
MATCH (ord:Order), (STORE:Stores) WHERE ord.store_id = STORE.store_id
CREATE (ord)-[r:supplied_by]->(STORE) return ord, STORE, r
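The difference these relationships create can be sketched in Python with hypothetical staff/store rows: a relational engine rediscovers the staff–store pairing with a join predicate on every query, while CREATE ...-[:WORKING_IN]->... materialises the edge once, so traversal becomes a direct lookup:

```python
# Rows as a relational engine sees them (hypothetical miniature of staffs/stores).
staffs = [{"staff_id": "1", "store_id": "A"}, {"staff_id": "2", "store_id": "B"}]
stores = [{"store_id": "A"}, {"store_id": "B"}]

# Relational style: every WORKING_IN pair is rediscovered by a join predicate.
join_pairs = [(sf["staff_id"], st["store_id"])
              for sf in staffs for st in stores
              if sf["store_id"] == st["store_id"]]

# Graph style: the edge is stored once at CREATE time, so "which store does
# staff 1 work in?" is a direct lookup instead of a scan-and-compare.
working_in = {sf["staff_id"]: sf["store_id"] for sf in staffs}

print(join_pairs, working_in["1"])
```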
Queries:
## Store with Customer details
This query uses the customers, orders and stores tables to display which customers visited or had a transaction at a store.
SQL:
select sto.store_name, concat(cust.first_name, ' ', cust.last_name) as [Customer]
from [sales].[customers] cust, [sales].[orders] ord, [sales].[stores] sto
where cust.customer_id = ord.customer_id
and ord.store_id = sto.store_id
CQL:
MATCH (cus:Customer)-[r:orders]->(ord:Order)-[r1:supplied_by]->(STORE:Stores)
return STORE.store_id, cus.first_name + ' ' + cus.last_name AS Full_Name
## Top 10 Order_IDs with most value
This query returns the top 10 order ids with the highest value, using the order_items table.
SQL Query:
select TOP 10 oitm.order_id, max(oitm.list_price) as [Total Amount]
from [sales].[order_items] oitm
group by oitm.order_id
order by [Total Amount] desc
CQL Query:
MATCH (ordit:OrderItem)
return ordit.order_id, MAX(toFloat(ordit.list_price)) AS Total_Amount
order by Total_Amount DESC LIMIT 10
## Discounted price for Order ID
This query gives us the price after discount for each order id.
SQL Query:
select ord.order_id, (SUM(oit.list_price) - SUM(oit.list_price * oit.discount)) as After_Discount_Price
from [sales].[orders] ord, [sales].[order_items] oit
where ord.order_id = oit.order_id
group by ord.order_id
order by After_Discount_Price desc
CQL Query:
MATCH (ord:Order)-[r:contains]->(ordit:OrderItem)
return ord.order_id, (SUM(toFloat(ordit.list_price)) - SUM(toFloat(ordit.list_price) * toFloat(ordit.discount))) AS After_Discount_Price
order by After_Discount_Price DESC
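The after-discount arithmetic both queries share, SUM(list_price) - SUM(list_price * discount), can be checked with a quick Python sketch using hypothetical order lines:

```python
# Hypothetical order lines: (list_price, discount fraction), as in order_items.
lines = [(599.99, 0.20), (1299.99, 0.07)]

total = sum(p for p, _ in lines)            # SUM(list_price)
discount = sum(p * d for p, d in lines)     # SUM(list_price * discount)
after_discount = total - discount           # the After_Discount_Price expression

print(round(after_discount, 2))
```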
## Skip the first three staff IDs and display the other staff ids and their emails
The query below returns the staff ids in order, skipping the first three and returning the rest.
SQL Query:
SELECT STAFF.staff_id, STAFF.email
FROM [sales].[staffs] STAFF
ORDER BY STAFF.staff_id
OFFSET (3) ROWS FETCH NEXT (7) ROWS ONLY
CQL Query:
MATCH (STAFF:Staff)
RETURN STAFF.staff_id, STAFF.email
ORDER BY toInteger(STAFF.staff_id) SKIP 3 LIMIT 7
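OFFSET/FETCH in T-SQL and SKIP/LIMIT in Cypher are both just slicing an ordered result; a tiny Python sketch with hypothetical staff ids makes the equivalence clear:

```python
staff_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# T-SQL: ORDER BY staff_id OFFSET (3) ROWS FETCH NEXT (7) ROWS ONLY
sql_page = sorted(staff_ids)[3:3 + 7]

# Cypher: ORDER BY toInteger(staff_id) SKIP 3 LIMIT 7
cql_page = sorted(staff_ids)[3:][:7]

print(sql_page == cql_page)  # both yield ids 4..10
```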
## List the staff names which contain the letter 'e' along with their respective store_ids
This returns the names of all staff whose name contains 'e', and the associated store ids.
SQL Query:
SELECT sta.first_name, sto.store_id
FROM [sales].[staffs] sta, [sales].[stores] sto
WHERE sta.store_id = sto.store_id
AND sta.first_name like '%e%'
CQL Query:
MATCH (STAFF:Staff)
WHERE STAFF.fname CONTAINS 'e'
OPTIONAL MATCH (STAFF)-[:WORKING_IN]-(STORE:Stores)
RETURN DISTINCT STAFF.fname, STORE.store_id
## Extract order details for the customer address '476 Chestnut Ave. '
This uses a customer's address and fetches the order id and order status for that particular address.
SQL Query:
select ord.order_id, ord.order_status, cus.street
from [sales].[customers] cus, [sales].[orders] ord
where cus.customer_id = ord.customer_id
and cus.street = '476 Chestnut Ave. '
CQL Query:
MATCH (cus:Customer)-[r:orders]->(ord:Order)
WHERE cus.street = '476 Chestnut Ave. '
return DISTINCT ord.order_id, ord.order_status, cus.street
## Products ordered in the range of 1000 to 1500
This query returns the order ids whose list price falls in the range 1000 to 1500.
SQL Query:
select ordit.order_id, ordit.product_id, ordit.item_id, ordit.list_price
from [sales].[order_items] ordit
where ordit.list_price between 1000 and 1500
CQL Query:
MATCH (ordit : OrderItem)
WHERE 1000