Harvard University Machine Learning Algorithms in RStudio Coding
I’m working on a programming project and need support to help me study.
I am working on a project in R on the babynames dataset.
I have this snippet of analysis code that works, but I need to incorporate two machine learning algorithms into this project and I am not sure which ones to choose or how to implement them.
Here is the code:
#Install the gender and genderdata packages and load all applicable libraries.
#The gender package on CRAN contains only demonstration data.
#For full data analysis, I had to download the genderdata package.
install.packages("genderdata", type = "source",
                 repos = "http://packages.ropensci.org")
install.packages(c("gender", "genderdata"),
                 repos = "http://packages.ropensci.org",
                 type = "source")
library(gender)
library(genderdata)
library(tibble)
library(dplyr)
#I used (method = "ssa"): United States from 1930 to 2012.
#Drawn from Social Security Administration data.
#I took a random sample of names from websites that list gender-neutral names,
#the kind that prospective parents could find through a Google search, and graphed them earlier.
#From the earlier analysis of each name, I chose the 7 names that seemed the most neutral based on
#the male and female trendlines in the charts.
ssa_names <- c("Charlie", "Royal", "Morgan", "Skyler",
               "Frankie", "Oakley", "Justice")
ssa_years <- c(rep(c(2009, 2012), 3), 2012)
ssa_df <- tibble(first_names = ssa_names,
                 last_names = LETTERS[1:7],
                 years = ssa_years,
                 min_years = ssa_years - 3,
                 max_years = ssa_years + 3)
ssa_df
#This dataset links first names to years, with extra columns for the minimum
#and maximum years of a possible age range, since birth dates are not always exact.
#We pass this to the gender_df() function, specifying the method we wish to use
#and the names of the columns that contain the names and the birth years.
#The result is a tibble of predictions.
results <- gender_df(ssa_df, name_col = "first_names", year_col = "years",
                     method = "ssa")
results
#The gender_df() function calculates genders only for unique
#combinations of first names and years, so we join the results back to the original data.
ssa_df %>%
  left_join(results, by = c("first_names" = "name", "years" = "year_min"))
#Now, we use gender_df() to predict gender by passing it the columns of
#minimum and maximum years to be used for each name.
gender_df(ssa_df, name_col = "first_names",
          year_col = c("min_years", "max_years"), method = "ssa")
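#Alternatively, call gender() directly for each unique name and year combination.
ssa_df %>%
  distinct(first_names, years) %>%
  rowwise() %>%
  do(results = gender(.$first_names, years = .$years, method = "ssa")) %>%
  do(bind_rows(.$results))
#Or group by year so that gender() is called once per year with all names for that year.
ssa_df %>%
  distinct(first_names, years) %>%
  group_by(years) %>%
  do(results = gender(.$first_names, years = .$years[1], method = "ssa")) %>%
  do(bind_rows(.$results))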
What I really want to do is use two different algorithms to find the best way to do the following:
1. Analyze the dataset for the names that are the closest to being equally assigned to males and females. The last chart shows the proportions, but I want to find an algorithm that would come up with the top 10 names that are the most gender neutral.
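To make the goal concrete, here is a rough sketch of what I have in mind, not a finished solution. It uses made-up counts in the same shape as babynames (name, sex, n) rather than the real data, and toy, ranked, and neutral_names are placeholder names of my own. It first ranks names by how close the female share is to 50%, which is a scoring baseline rather than a learning algorithm. It then runs k-means from base R as one candidate unsupervised algorithm, assuming three clusters are enough to separate a mixed group from the mostly male and mostly female names.

```r
# Made-up counts in the same shape as babynames (name, sex, n); not real SSA figures.
toy <- data.frame(
  name = rep(c("Charlie", "Morgan", "Justice", "Mary", "John"), each = 2),
  sex  = rep(c("F", "M"), times = 5),
  n    = c(480, 520, 700, 300, 450, 550, 990, 10, 5, 995)
)

# Cross-tabulate births by name and sex, then take the female share per name.
counts <- xtabs(n ~ name + sex, data = toy)
prop_female <- as.numeric(counts[, "F"]) / as.numeric(rowSums(counts))

# Neutrality score: 1 for an exact 50/50 split, 0 when one sex has every birth.
ranked <- data.frame(
  name = rownames(counts),
  prop_female = prop_female,
  neutrality = 1 - 2 * abs(prop_female - 0.5)
)
ranked <- ranked[order(-ranked$neutrality), ]

# Baseline: the (up to) 10 names closest to an even split.
top_neutral <- head(ranked, 10)
print(top_neutral)

# Candidate algorithm: k-means on the female share with 3 clusters
# (assumed to correspond to mostly male, mixed, and mostly female names).
set.seed(1)
km <- kmeans(ranked$prop_female, centers = 3, nstart = 25)

# Treat the cluster whose center is nearest 0.5 as the "neutral" group.
neutral_cluster <- which.min(abs(km$centers[, 1] - 0.5))
neutral_names <- ranked$name[km$cluster == neutral_cluster]
print(neutral_names)
```

On the real data I would swap toy for the babynames counts over a chosen range of years, and probably drop names with very few births first, since a name given to two girls and two boys would otherwise score as perfectly neutral.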