Description
Project 1 – Calculate the sample size of a Dataset using Statistics
Introduction
For this project, your goal is to design a program that uses statistics to help a developer select a test data set (the sample) from the complete data set (the population) that the program is expected to process. Developers are often asked to work with data sets that are too large to use during development, because a single run over the full data can take minutes. Consider how many times you must run a program while developing it: if each run took five minutes, the work would become impractical, so developers often create a much smaller data set to use during development. However, just selecting the first few records of a data set can lead to errors, since those records may not be a representative sample of the data. Statistics solves this problem with a formula that calculates an appropriate sample size for a large data set.
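The assignment does not state which sample size formula to use. A common choice is Cochran's formula with a finite population correction, sketched below as an assumption. The function name `sample_size`, its parameters, and the decision to round up are all illustrative, so the results may differ by a record from the counts the course tests expect.

```python
import math


def sample_size(z_score: float, margin_of_error: float,
                population_size: int, std_dev: float = 0.5) -> int:
    """Estimate a sample size (assumed: Cochran's formula, finite population)."""
    # Sample size for an effectively unlimited population.
    unlimited = (z_score * std_dev / margin_of_error) ** 2
    # Finite population correction shrinks the estimate for small populations.
    corrected = unlimited / (1 + (unlimited - 1) / population_size)
    # Rounding up is an assumption; never request more records than exist.
    return min(math.ceil(corrected), population_size)


if __name__ == "__main__":
    # 95% confidence (z = 1.96), 5% margin of error, 10,000 records.
    print(sample_size(1.96, 0.05, 10_000))
```

The default `std_dev` of 0.5 follows the note in the requirements that .5 replaces the actual standard deviation.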
To do this, design a program that takes the entire data set as its initial input. Then select the column (field) that represents the data point you want to base your sample on. For example, if you’re analyzing the number of team wins, pick the team wins field as the basis for selecting your sample data set. Use that field to calculate the min, max, median, mode, mean, standard deviation, and finally the sample size for the population.
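As a minimal sketch of the summary statistics step, the class below wraps each Pandas call in its own static method. The class name `Statistics`, the method names, and the tiny `wins` column are hypothetical and only illustrate the idea.

```python
import pandas as pd


class Statistics:
    """Thin static wrappers so calling code never touches Pandas directly."""

    @staticmethod
    def count(column: pd.Series) -> int:
        return int(column.count())

    @staticmethod
    def minimum(column: pd.Series):
        return column.min()

    @staticmethod
    def maximum(column: pd.Series):
        return column.max()

    @staticmethod
    def median(column: pd.Series) -> float:
        return float(column.median())

    @staticmethod
    def mode(column: pd.Series):
        # Series.mode() returns every tied value; keep the first one.
        return column.mode().iloc[0]

    @staticmethod
    def mean(column: pd.Series) -> float:
        return float(column.mean())

    @staticmethod
    def standard_deviation(column: pd.Series) -> float:
        # Pandas computes the sample standard deviation (ddof=1) by default.
        return float(column.std())


# Hypothetical data: one record per team, with its number of wins.
teams = pd.DataFrame({"wins": [3, 7, 7, 10, 13]})
wins = teams["wins"]
print(Statistics.minimum(wins), Statistics.maximum(wins), Statistics.mean(wins))
```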
Once you have the total sample size, use it to create a sample data set: first select the records that represent the min and max, then fill in the remainder by randomly selecting the appropriate number of records from your population (the original large data set). Save the new sample data set to disk so that it can be used for development. For example, let’s say you have a total population of 10,000 records and you compute a sample size of 300. You would select the records that represent the min and max, fill in the remaining records with randomly selected records to reach a total of 300 records from the original data set, and save those records to disk. You cannot repeat records; each record must be unique. You are required to create a total of 15 test data files, generated using the specified values for confidence interval and margin of error. This gives you a selection of files to use at various points in development, since as the program becomes more complete you would increase the size of the input data to test aspects of the program such as memory usage.
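The selection step described above can be sketched as follows. This is one possible approach, not the required one: the function name `select_sample` and the `seed` parameter are hypothetical, and the sketch assumes the DataFrame has a unique index and that the sample size is at least 2 and no larger than the population.

```python
from typing import Optional

import pandas as pd


def select_sample(population: pd.DataFrame, field: str, sample_size: int,
                  seed: Optional[int] = None) -> pd.DataFrame:
    """Return sample_size unique records that include the min and max of field."""
    column = population[field]
    # Index labels of one minimum record and one maximum record
    # (dict.fromkeys drops the duplicate if they are the same record).
    anchors = list(dict.fromkeys([column.idxmin(), column.idxmax()]))
    # Removing the anchors first guarantees they cannot be drawn again.
    remaining = population.drop(index=anchors)
    fill_count = max(sample_size - len(anchors), 0)
    # DataFrame.sample draws without replacement by default.
    filler = remaining.sample(n=fill_count, random_state=seed)
    return pd.concat([population.loc[anchors], filler])


# Hypothetical ten-record population.
records = pd.DataFrame({"population": [50, 10, 90, 30, 70, 20, 80, 40, 60, 100]})
sample = select_sample(records, "population", 5, seed=0)
print(len(sample))
```

Dropping the min and max records before the random draw is what keeps every record unique, so the result always has exactly the requested number of rows.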
Your program is required to do the following:
Take the attached CSV file as the input data. Include the directory path to this file in your code so that the solution uses it as the source data for sampling.
Use the population field as the datapoint to select your sample.
Set the margin of error (.01, .05, .10)
Calculate the following summary statistics for the selected field:
Count, Min, Max, and Standard Deviation
Sample Size at 80%, 85%, 90%, 95%, and 99% confidence, each with a .01, .05, and .10 margin of error
Note: use a standard deviation of .5 instead of the actual standard deviation to calculate the sample size
Create test data file(s) that contain the number of values specified by the sample size; however, these files must contain records that include max and min of the selected data point from the original data set.
Output test data files to the output directory called “sample_files_output”
Your program should output 15 different files using the naming convention
sample_data_confidenceInterval_marginError_recordCount.csv:
For example (counts are accurate):
sample_data_number_1.28_0.1_22.csv
sample_data_number_1.28_0.01_51.csv
sample_data_number_1.28_0.05_39.csv
sample_data_number_1.44_0.1_25.csv
sample_data_number_1.44_0.01_51.csv
sample_data_number_1.44_0.05_41.csv
sample_data_number_1.65_0.1_29.csv
sample_data_number_1.65_0.01_51.csv
sample_data_number_1.65_0.05_43.csv
sample_data_number_1.96_0.1_33.csv
sample_data_number_1.96_0.01_51.csv
sample_data_number_1.96_0.05_45.csv
sample_data_number_2.58_0.1_39.csv
sample_data_number_2.58_0.01_51.csv
sample_data_number_2.58_0.05_48.csv
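A minimal sketch of producing the 15 file names and writing a sample to the output directory might look like this. The function names are hypothetical, and the `sample_data_number_` prefix is an assumption copied from the example names above rather than from the stated convention.

```python
from itertools import product
from pathlib import Path

import pandas as pd

# Z-scores for 80%, 85%, 90%, 95%, and 99% confidence, from the z-score table.
Z_SCORES = [1.28, 1.44, 1.65, 1.96, 2.58]
MARGINS_OF_ERROR = [0.01, 0.05, 0.10]


def build_file_name(z_score: float, margin_of_error: float,
                    record_count: int) -> str:
    # Mirrors the example names, e.g. sample_data_number_1.28_0.1_22.csv
    return f"sample_data_number_{z_score}_{margin_of_error}_{record_count}.csv"


def write_sample(sample: pd.DataFrame, output_dir: Path, file_name: str) -> Path:
    # Create the output directory on first use, then save without the index.
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / file_name
    sample.to_csv(target, index=False)
    return target


# Five confidence levels times three margins of error gives 15 files.
combinations = list(product(Z_SCORES, MARGINS_OF_ERROR))
print(len(combinations))
```

Python renders the margin `0.10` as `0.1` in the f-string, which matches the example file names.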
Z-score table
80% confidence => 1.28 z-score
85% confidence => 1.44 z-score
90% confidence => 1.65 z-score
95% confidence => 1.96 z-score
99% confidence => 2.58 z-score
You will be graded on:
All teacher tests passing: 45 Points
Code Grammar: 45 Points
Project 1 Canvas Quiz: 10 Points (See canvas Project 1 Module)
Grading Notes:
You will get at least 30 points for turning the project in, even if it doesn’t work at all. The one exception is not having code in the app folder, which results in a 0.
You will get a 0 if your application code is not in the app folder
You will lose 5-10 points for repeating lines of code
You will lose 10-20 points if the code does not follow separation of concerns
You will lose 10-20 points if it appears you did not try to follow SOLID
You will lose 10-20 points for not using classes
You will lose 10-20 points for not using the correct types of methods, i.e., you should mainly be using static methods
Requirements to complete the assignment:
Your code needs to be modular and follow separation of concerns, don’t repeat yourself, and demonstrate SOLID to the best of your ability.
You can use Pandas for most of the functionality of the program, so once you know your steps, you should research how Pandas can be used to do it.
Your final program should use static methods, because you are mainly just sending commands to Pandas and the Pandas dataframe is the data instance. Each Pandas command should be its own method, since you shouldn’t directly call library functions.
You need to calculate directory paths programmatically, or your program won’t work on GitHub: it won’t be able to find the data directory path, since GitHub’s test environment is not the same as your own computer. Check here for my code
Your data selection algorithm must not select the same record twice, so that it results in the correct total number of records specified by the sample size.
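One way to calculate directory paths programmatically is to derive them from the location of the running module, as sketched below. The helper names are hypothetical, and the layout is an assumption: the module is taken to live in the app folder, with a `data` folder next to it under the project root.

```python
from pathlib import Path


def project_root(module_file: str) -> Path:
    # Assumes the module lives in <root>/app/, so the root is two levels up.
    return Path(module_file).resolve().parent.parent


def data_file(root: Path, file_name: str) -> Path:
    # "data" is an assumed folder name; adjust it to the real repository layout.
    return root / "data" / file_name


def output_directory(root: Path) -> Path:
    return root / "sample_files_output"


# Inside a real module you would pass __file__, for example:
#   root = project_root(__file__)
root = project_root("/tmp/project/app/main.py")
print(output_directory(root))
```

Because every path is built from the module's own location, the same code resolves correctly on a local machine and in a test environment with a different working directory.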
Please ask if you have any questions or doubts.