@ALuesink ALuesink commented Aug 15, 2025

The refactor of GenerateViolinPlots:

  • code to functions
  • added unit tests

@ALuesink ALuesink marked this pull request as ready for review August 21, 2025 12:52
@ALuesink ALuesink changed the base branch from main to develop November 3, 2025 08:24
@rernst rernst self-requested a review November 18, 2025 14:54

@rernst rernst left a comment

First of all, lots of work done, good job! I feel there is room for improvement. Some general thoughts:

  1. Many parameters have names like metab_interest_sorted. In the context of a function, it’s not relevant whether the input is “of interest” or “sorted.” Use neutral, descriptive names that reflect the data type or role.

  2. Several functions are named after their use case rather than their functionality. Name functions based on what they do, not where they are used.

  3. When breaking function calls across lines, maintain a consistent style. Preferred format:

function1(
   function_2(param_a),  
   param_b,  
   param_c,
)
  4. There is no error catching for missing files or invalid paths. Currently, the code will crash, making debugging difficult.

  5. There seem to be a lot of ad-hoc data transformations. It feels like the DIMS application is missing a standardized data format for saving and reusing data between steps.
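
To illustrate point 4, here is a minimal sketch of a file check before reading. The helper name read_tsv_checked is hypothetical and not part of the PR; the read.table arguments are copied from get_list_metabolites, and the call layout follows the format from point 3.

```r
# Hypothetical helper (not part of the PR): stop with a clear message when
# an input file is missing, instead of failing somewhere inside read.table.
read_tsv_checked <- function(file_path) {
  if (!file.exists(file_path)) {
    stop("Input file not found: ", file_path, call. = FALSE)
  }
  # Same read settings as used in get_list_metabolites
  read.table(
    file_path,
    sep = "\t",
    header = TRUE,
    quote = ""
  )
}
```

With a check like this, the error message names the missing path directly, which should make a failed run easier to debug.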

#' @param intensity_cols: names of the columns that contain the intensities (string)
#'
#' @returns fraction_side_intensity: a vector of intensities (vector of integers)
get_intentities_for_ratios <- function(ratios_metabs_df, row_index, intensities_zscore_df, fraction_side, intensity_cols) {

The functionality of this function would be clearer with more descriptive comments, for example before each if/else block. Secondly, the name get_intentities_for_ratios implies that we get multiple intensities for multiple ratios; however, the return object fraction_side_intensity implies only one value.

#' @param intensity_cols: names of the columns that contain the intensities (string)
#'
#' @returns fraction_side_intensity: a vector of intensities (vector of integers)
get_intentities_for_ratios <- function(ratios_metabs_df, row_index, intensities_zscore_df, fraction_side, intensity_cols) {

The function name contains a typo: intentities -> intensities.

Comment on lines +34 to +37
get_zscore_columns <- function(colnames_zscore, intensity_cols) {
sample_intersect <- intersect(paste0(intensity_cols, "_Zscore"), grep("_Zscore", colnames_zscore, value = TRUE))
return(sample_intersect)
}

The function name get_zscore_columns implies that we get columns (data or index?) with z-scores. The description says we get sample_ids.

A better name would be something like get_sample_ids_with_zscore.
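
A possible shape for the renamed function, as a sketch only. The parameter names column_names and sample_ids are my assumption of neutral names. The grep() from the original is dropped because every name built with paste0 already ends in "_Zscore", so the intersect alone should give the same result.

```r
# Sketch of the suggested rename; parameter names are assumptions.
# Returns the "<sample_id>_Zscore" names that are present in column_names.
get_sample_ids_with_zscore <- function(column_names, sample_ids) {
  zscore_columns <- paste0(sample_ids, "_Zscore")
  intersect(zscore_columns, column_names)
}
```

Note that this still returns names with the "_Zscore" suffix, like the current code does. If bare sample IDs are wanted, the suffix would need to be stripped and the documentation updated to match.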

Comment on lines +44 to +53
get_list_metabolites <- function(metab_group_dir) {
# get a list of all metabolite files
metabolite_files <- list.files(metab_group_dir, pattern = "*.txt", full.names = FALSE, recursive = FALSE)
# put all metabolites into one list
metab_list_all <- lapply(paste(metab_group_dir, metabolite_files, sep = "/"),
read.table, sep = "\t", header = TRUE, quote = "")
names(metab_list_all) <- gsub(".txt", "", metabolite_files)

return(metab_list_all)
}

  • Use the same word consistently: metabolite, not metab.
  • You named the function after its use, not after what it does. I think it just creates a bunch of dataframes from a directory containing .txt files, so a better (and reusable) name would be something like get_dataframes_from_dir.
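
A rough sketch of what a reusable get_dataframes_from_dir could look like. The file_extension parameter and the directory check are my additions for illustration, not something in the PR. Using full.names = TRUE in list.files removes the need for paste, and since pattern is a regular expression the extension is escaped and anchored here.

```r
# Sketch only: read every tab-separated file in a directory into a named list.
# file_extension and the dir.exists check are assumptions, not part of the PR.
get_dataframes_from_dir <- function(dir_path, file_extension = "txt") {
  if (!dir.exists(dir_path)) {
    stop("Directory not found: ", dir_path, call. = FALSE)
  }
  # full.names = TRUE returns complete paths, so no paste() is needed
  file_paths <- list.files(
    dir_path,
    pattern = paste0("\\.", file_extension, "$"),
    full.names = TRUE,
    recursive = FALSE
  )
  dataframes <- lapply(
    file_paths,
    read.table,
    sep = "\t",
    header = TRUE,
    quote = ""
  )
  # Name each dataframe after its file, without directory or extension
  names(dataframes) <- tools::file_path_sans_ext(basename(file_paths))
  dataframes
}
```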

# get a list of all metabolite files
metabolite_files <- list.files(metab_group_dir, pattern = "*.txt", full.names = FALSE, recursive = FALSE)
# put all metabolites into one list
metab_list_all <- lapply(paste(metab_group_dir, metabolite_files, sep = "/"),

Set full.names to TRUE to get rid of the 'paste' on line 48.

Comment on lines +39 to +59
# Remove columns, move HMDB_code & HMDB_name column to the front, change intensity columns to numeric
intensities_zscore_df <- intensities_zscore_df %>%
select(-c(plots, HMDB_name_all, HMDB_ID_all, sec_HMDB_ID, HMDB_key, sec_HMBD_ID_rlvnc, name,
relevance, descr, origin, fluids, tissue, disease, pathway, nr_ctrls)) %>%
relocate(c(HMDB_code, HMDB_name)) %>%
rename(mean_controls = avg_ctrls, sd_controls = sd_ctrls) %>%
mutate(across(!c(HMDB_name, HMDB_code), as.numeric))

# Get the controls and patient IDs, select the intensity columns
controls <- colnames(intensities_zscore_df)[grepl("^C", colnames(intensities_zscore_df)) &
!grepl("_Zscore$", colnames(intensities_zscore_df))]
control_intensities_cols_index <- which(colnames(intensities_zscore_df) %in% controls)
nr_of_controls <- length(controls)

patients <- colnames(intensities_zscore_df)[grepl("^P", colnames(intensities_zscore_df)) &
!grepl("_Zscore$", colnames(intensities_zscore_df))]
patient_intensities_cols_index <- which(colnames(intensities_zscore_df) %in% patients)
nr_of_patients <- length(patients)

intensity_cols_index <- c(control_intensities_cols_index, patient_intensities_cols_index)
intensity_cols <- colnames(intensities_zscore_df)[intensity_cols_index]

This could be one (or more) 'prepare_data' functions.
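
For example, the duplicated control/patient column selection could become one small helper. This is a sketch under my own assumptions: the name get_sample_columns is hypothetical, and startsWith/endsWith replace the grepl calls, which should be equivalent for the literal prefixes "C" and "P" and the "_Zscore" suffix.

```r
# Hypothetical helper: names of the intensity columns for one sample group,
# i.e. columns starting with the given prefix that are not z-score columns.
get_sample_columns <- function(column_names, prefix) {
  is_sample <- startsWith(column_names, prefix)
  is_zscore <- endsWith(column_names, "_Zscore")
  column_names[is_sample & !is_zscore]
}
```

The block above could then call it once with "C" for controls and once with "P" for patients, instead of repeating the same grepl expression.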

intensity_cols_index <- c(control_intensities_cols_index, patient_intensities_cols_index)
intensity_cols <- colnames(intensities_zscore_df)[intensity_cols_index]

#### Calculate ratios of intensities for metabolites ####

Parts of this block can be 'calculate' functions.

zscore_patients_df <- intensities_zscore_ratios_df %>% select(HMDB_code, HMDB_name, any_of(paste0(patients, "_Zscore")))
zscore_controls_df <- intensities_zscore_ratios_df %>% select(HMDB_code, HMDB_name, any_of(paste0(controls, "_Zscore")))

#### Make violin plots #####

And this could be a 'create violin plot PDF' function.

save_prob_scores_to_excel(diem_probability_score, output_dir, run_name)


#### Generate dIEM plots #########

This could also be a function.

Comment on lines +187 to +193
# metabs_iems <- lapply(top_iems, function(iem) {
# iem_probablity <- patient_top_iems_probs %>% filter(Disease == iem) %>% pull(!!sym(patient_id))
# metabs_iems_names <- c(metabs_iems_names, paste0(iem, ", probability score ", iem_probablity))
# metab_iem <- expected_biomarkers_df %>% filter(Disease == iem) %>% select(HMDB_code, HMDB_name)
# return(metab_iem)
# })
# names(metabs_iems) <- metabs_iems_names

Remove this old (?) commented-out code.

3 participants