Data Description The source file contains rows of BMI-related information, which is tab-delimited. Each row includes:
- A person’s unique identifier (
person_id) - An encounter identifier (
encounter_id) - A numerical BMI value (
bmi) - The person’s height in centimeters (
height_cm) - The person’s weight in kilograms (
weight_kg) - The date of measurement (
measurement_date)
This raw data from an EHR system was not collected for research purposes. There are multiple rows per person.
Task to Be Accomplished
-
Read the Data
- Import the file using a chosen delimiter (e.g., comma or tab).
- Convert columns to appropriate types (particularly for dates).
-
Clean and Filter
- Flag rows with missing or unreasonable heights/weights.
- Check for and flag large discrepancies between reported BMI and the BMI computed from height and weight.
- Write the flagged entries in a separate file and continue only with the rows that pass these checks.
-
Representative Record Selection
- For each person, flag extreme outlier measurements (use the standard IQR approach), output them in a separate file, and continue with the remaining records.
- For each person, select one “typical” row based on the median BMI among the remaining records.
- If there are no valid measurements left for a person, flag them and output them in a separate file.
- If there are multiple valid measurements, select the one closest to the median BMI.
-
Categorize BMI, Height, and Weight
- Assign people to categories (e.g., Underweight, Normal, Overweight, etc.) based on BMI, defined as:
- Underweight: BMI < 18.5 kg/m²
- Normal: 18.5 ≤ BMI < 25 kg/m²
- Overweight: 25 ≤ BMI < 30 kg/m²
- Obesity I: 30 ≤ BMI < 35 kg/m²
- Obesity II: 35 ≤ BMI < 40 kg/m²
- Obesity III: BMI ≥ 40 kg/m²
- Create categories for height (Short, Average, Tall), defined as:
- Short: height < 150 cm
- Average: 150 cm ≤ height < 180 cm
- Tall: height ≥ 180 cm
- Create categories for weight (Light, Medium, Heavy, etc.), defined as:
- Light: weight < 50 kg
- Medium: 50 kg ≤ weight < 80 kg
- Heavy: 80 kg ≤ weight < 100 kg
- Very Heavy: weight ≥ 100 kg
- Assign people to categories (e.g., Underweight, Normal, Overweight, etc.) based on BMI, defined as:
Expected Output
-
Cleaned Dataset
- One TSV file with one row per person:
- The "typical" BMI, plus the height, weight, and date of that measurement.
- Categorical variables for BMI, height, and weight.
- One TSV file with one row per person:
-
Summary Table
- A summary about the flagged entries, including:
- The number of rows removed due to missing or implausible values.
- The number of rows removed due to large discrepancies between reported and calculated BMI.
- The number of extreme outlier measurements removed.
- The number of individuals with only invalid measurements.
- The number of individuals with valid measurements.
- The distribution of the number of BMI measurements (e.g., mean, median, standard deviation) per person.
- A short overview of how many individuals fall into each BMI category (plus height/weight categories).
- May be saved as a text or Markdown file.
- A summary about the flagged entries, including:
-
Data Dictionary
- A separate text listing each column and a brief explanation (e.g.,
person_id: a unique identifier for the person).
- A separate text listing each column and a brief explanation (e.g.,