
Commit fc3118d

gmiodice authored and kshitij-sisodia-arm committed

Extended the command-line in the audiogen app

- Switched CLI parsing from positional args to getopt
- Added num_steps
- Added output file name
- Added audio length in seconds

Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com>

1 parent 14e7c07 commit fc3118d

File tree

2 files changed: +122 −82 lines changed

kleidiai-examples/audiogen/app/README.md

Lines changed: 54 additions & 56 deletions
@@ -21,8 +21,57 @@ This guide will show you how to build the <strong>audio generation (audiogen)</strong>
 
 To build the audiogen application, follow one of the following sections depending on your <strong>TARGET</strong> platform:
 
-- [Build the audiogen app for Android™ (TARGET)](#build-the-audiogen-app-on-linux_host_or-macos_host_for-android_target)
 - [Build the audiogen app for macOS® (TARGET)](#build-the-audiogen-app-on-macos_host_for-macos_target)
+- [Build the audiogen app for Android™ (TARGET)](#build-the-audiogen-app-on-linux_host_or-macos_host_for-android_target)
+
+### Build the audiogen app on macOS® (HOST) for macOS® (TARGET)
+
+#### Step 1
+Navigate to the `audiogen/app/` folder. Set the `LITERT_MODELS_PATH` environment variable to the path where your Stable Audio Open Small models exported to LiteRT are located:
+
+```bash
+export LITERT_MODELS_PATH=<path_to_your_litert_models>
+```
+
+#### Step 2
+Build the audiogen application. Inside the `app` directory, create the `build` folder and navigate into it:
+
+```bash
+mkdir build && cd build
+```
+
+Next, run CMake using the following command:
+
+```bash
+cmake ..
+```
+
+Then, build the application:
+```bash
+make -j
+```
+
+#### Step 3
+Since the tokenizer used in the audiogen application is based on <strong>SentencePiece</strong>, you’ll need to download the `spiece.model` file from https://huggingface.co/google-t5/t5-base/tree/main and add it to your `$LITERT_MODELS_PATH`:
+
+```bash
+curl https://huggingface.co/google-t5/t5-base/resolve/main/spiece.model -o $LITERT_MODELS_PATH/spiece.model
+```
+
+At this point, you are ready to run the `audiogen` application, which requires just three input arguments:
+
+- **Model Path (-m)**: The directory containing your LiteRT models and `spiece.model` files
+- **Prompt (-p)**: A text description of the desired audio (e.g., *warm arpeggios on house beats 120BPM with drums effect*)
+- **CPU Threads (-t)**: The number of CPU threads to use (e.g., `4`)
+
+```bash
+./audiogen -m . -p "warm arpeggios on house beats 120BPM with drums effect" -t 4
+```
+
+If everything runs successfully, the generated audio will be saved in `.wav` format (`output.wav`) in the `audiogen_app` folder. At this point, you can play it on your laptop or PC.
 
 ### Build the audiogen app on Linux® (HOST) or macOS® (HOST) for Android™ (TARGET)
 
@@ -111,67 +160,16 @@ cd /data/local/tmp/app
 
 From there, you can then run the `audiogen` application, which requires just three input arguments:
 
-- **Model Path**: The directory containing your LiteRT models and `spiece.model` files
-- **Prompt**: A text description of the desired audio (e.g., *warm arpeggios on house beats 120BPM with drums effect*)
-- **CPU Threads**: The number of CPU threads to use (e.g., `4`)
-- **Seed**: Specifies the seed value for the random initializer. Changing the seed will produce different audio outputs
+- **Model Path (-m)**: The directory containing your LiteRT models and `spiece.model` files
+- **Prompt (-p)**: A text description of the desired audio (e.g., *warm arpeggios on house beats 120BPM with drums effect*)
+- **CPU Threads (-t)**: The number of CPU threads to use (e.g., `4`)
 
 ```bash
-./audiogen . "warm arpeggios on house beats 120BPM with drums effect" 4
+./audiogen -m . -p "warm arpeggios on house beats 120BPM with drums effect" -t 4
 ```
 
 If everything runs successfully, the generated audio will be saved in `.wav` format (`output.wav`) in the same directory as the `audiogen` binary. At this point, you can then retrieve it using the `adb` tool from a different Terminal and play it on your laptop or PC.
 
 ```bash
 adb pull data/local/tmp/output.wav
 ```
-
-### Build the audiogen app on macOS® (HOST) for macOS® (TARGET)
-
-#### Step 1
-Navigate to the `audiogen/app/` folder. Set the `LITERT_MODELS_PATH` environment variable to the path where your Stable Audio Open Small models exported to LiteRT are located:
-
-```bash
-export LITERT_MODELS_PATH=<path_to_your_litert_models>
-```
-
-#### Step 2
-Build the audiogen application. Inside the `app` directory, create the `build` folder and navigate into it:
-
-```bash
-mkdir build && cd build
-```
-
-Next, run CMake using the following command:
-
-```bash
-cmake ..
-```
-
-Then, build the application:
-```bash
-make -j
-```
-
-#### Step 3
-Since the tokenizer used in the audiogen application is based on <strong>SentencePiece</strong>, you’ll need to download the `spiece.model` file from: https://huggingface.co/google-t5/t5-base/tree/main
-and add it to your `$LITERT_MODELS_PATH`.
-
-```bash
-curl https://huggingface.co/google-t5/t5-base/resolve/main/spiece.model -o $LITERT_MODELS_PATH/spiece.model
-```
-
-At this point, you are ready to run the audiogen application.
-
-From there, you can then run the `audiogen` application, which requires just three input arguments:
-
-- **Model Path**: The directory containing your LiteRT models and `spiece.model` files
-- **Prompt**: A text description of the desired audio (e.g., *warm arpeggios on house beats 120BPM with drums effect*)
-- **CPU Threads**: The number of CPU threads to use (e.g., `4`, `8`)
-- **Seed**: Specifies the seed value for the random initializer. Changing the seed will produce different audio outputs
-
-```bash
-./audiogen $LITERT_MODELS_PATH "warm arpeggios on house beats 120BPM with drums effect" 4 99
-```
-
-If everything runs successfully, the generated audio will be saved in `.wav` format (`output.wav`) in the `audiogen_app` folder. At this point, you can play it on your laptop or PC.
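Both walkthroughs above assume the LiteRT models and the tokenizer file are already in place. A minimal pre-flight sketch, using the filenames hard-coded in `audiogen.cpp` (adjust if your export uses different names):

```shell
# Sanity check: report any expected model file missing under LITERT_MODELS_PATH.
# Filenames are taken from audiogen.cpp in this commit.
missing=0
for f in conditioners_float32.tflite dit_model.tflite autoencoder_model.tflite spiece.model; do
  if [ ! -f "${LITERT_MODELS_PATH}/$f" ]; then
    echo "missing: $f"
    missing=1
  fi
done
if [ "$missing" -eq 0 ]; then
  echo "all model files present"
fi
```

Running this before `./audiogen` turns a cryptic model-load failure into an explicit list of missing files.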

kleidiai-examples/audiogen/app/audiogen.cpp

Lines changed: 68 additions & 26 deletions
@@ -31,6 +31,7 @@
 #include <cstdint>
 #include <cstring>
 #include <fstream>
+#include <unistd.h>
 #include <iterator>
 #include <random>
 #include <string>
@@ -39,14 +40,10 @@
 
 #include <sentencepiece_processor.h>
 
-inline long time_in_ms() {
-    using namespace std::chrono;
-    auto now = time_point_cast<milliseconds>(steady_clock::now());
-    return now.time_since_epoch().count();
-}
-
-constexpr float k_audio_len_sec = 10.0f;
-constexpr size_t k_num_steps = 8;
+constexpr size_t k_seed_default = 99;
+constexpr size_t k_audio_len_sec_default = 10;
+constexpr size_t k_num_steps_default = 8;
+const std::string k_output_file_default = "output.wav";
 
 // -- Update the tensor index based on your model configuration.
 constexpr size_t k_t5_ids_in_idx = 0;
@@ -61,9 +58,6 @@ constexpr size_t k_dit_x_in_idx = 2;
 constexpr size_t k_dit_t_in_idx = 3;
 constexpr size_t k_dit_out_idx = 0;
 
-// -- Tensor size to pre-compute the sigmas
-constexpr size_t k_t_tensor_sz = k_num_steps + 1;
-
 // -- Fill sigmas params
 constexpr float k_logsnr_max = -6.0f;
 constexpr float k_sigma_min = 0.0f;
@@ -75,6 +69,31 @@ constexpr float k_sigma_max = 1.0f;
         exit(1); \
     }
 
+static inline long time_in_ms() {
+    using namespace std::chrono;
+    auto now = time_point_cast<milliseconds>(steady_clock::now());
+    return now.time_since_epoch().count();
+}
+
+static void print_usage(const char *name) {
+    fprintf(stderr,
+        "Usage: %s -m <models_base_path> -p <prompt> -t <num_threads> [-s <seed> -l <audio_len>]\n\n"
+        "Options:\n"
+        "  -m <models_base_path>  Path to model files\n"
+        "  -p <prompt>            Input prompt text (e.g., warm arpeggios on house beats 120BPM with drums effect)\n"
+        "  -t <num_threads>       Number of CPU threads to use\n"
+        "  -s <seed>              (Optional) Random seed for reproducibility. Different seeds generate different audio samples (Default: %zu)\n"
+        "  -l <audio_len_sec>     (Optional) Length of generated audio (Default: %zu s)\n"
+        "  -n <num_steps>         (Optional) Number of steps (Default: %zu)\n"
+        "  -o <output_file>       (Optional) Output audio file name (Default: %s)\n"
+        "  -h                     Show this help message\n",
+        name,
+        k_seed_default,
+        k_audio_len_sec_default,
+        k_num_steps_default,
+        k_output_file_default.c_str());
+}
+
 static std::vector<int32_t> convert_prompt_to_ids(const std::string& prompt, const std::string& spiece_model_path) {
     sentencepiece::SentencePieceProcessor sp;
 
@@ -194,22 +213,45 @@ struct TfLiteDelegateDeleter {
 
 int main(int32_t argc, char** argv) {
 
-    if (argc != 5) {
-        printf("ERROR: Usage ./audiogen <models_base_path> <prompt> <num_threads> <seed>\n");
-        return 1;
-    }
-
     // ----- Parse the cmd line arguments
     // ----------------------------------
-    const std::string models_base_path = argv[1];
-    const std::string prompt = argv[2];
-    const size_t num_threads = std::stoull(argv[3]);
-    const size_t seed = std::stoull(argv[4]);
+    // Required arguments
+    std::string models_base_path = "";
+    std::string prompt = "";
+    size_t num_threads = 0;
+    // Optional arguments
+    std::string output_file = k_output_file_default;
+    size_t seed = k_seed_default;
+    size_t num_steps = k_num_steps_default;
+    float audio_len_sec = static_cast<float>(k_audio_len_sec_default);
+
+    int opt;
+    while ((opt = getopt(argc, argv, "m:p:t:s:n:o:l:h")) != -1) {
+        switch (opt) {
+            case 'm': models_base_path = optarg; break;
+            case 'p': prompt = optarg; break;
+            case 't': num_threads = std::stoull(optarg); break;
+            case 'o': output_file = optarg; break;
+            case 's': seed = std::stoull(optarg); break;
+            case 'n': num_steps = std::stoull(optarg); break;
+            case 'l': audio_len_sec = static_cast<float>(std::stoull(optarg)); break;
+            case 'h':
+            default:
+                print_usage(argv[0]);
+                return EXIT_FAILURE;
+        }
+    }
+
+    // Check the mandatory arguments
+    if (models_base_path.empty() || prompt.empty() || num_threads <= 0) {
+        fprintf(stderr, "ERROR: Missing required arguments.\n\n");
+        print_usage(argv[0]);
+        return EXIT_FAILURE;
+    }
 
     std::string t5_tflite = models_base_path + "/conditioners_float32.tflite";
     std::string dit_tflite = models_base_path + "/dit_model.tflite";
     std::string autoencoder_tflite = models_base_path + "/autoencoder_model.tflite";
-    std::string output_path = "output.wav";
     std::string sentence_model_path = models_base_path + "/spiece.model";
 
     // ----- Load the models
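The parsing hunk above accepts four optional flags on top of the three required ones. A hypothetical invocation exercising all of them, with the flag letters taken from the getopt string `"m:p:t:s:n:o:l:h"` and every value a placeholder:

```shell
# Build the full audiogen command line with the optional flags
# (-s seed, -n diffusion steps, -l audio length in seconds, -o output file),
# then print it for inspection before running it on the target.
cmd='./audiogen -m . -p "warm arpeggios on house beats 120BPM with drums effect" -t 4 -s 42 -n 16 -l 8 -o my_clip.wav'
echo "$cmd"
```

Note that `-l` is parsed with `std::stoull`, so it accepts whole seconds only, and a non-numeric value for `-t`, `-s`, `-n`, or `-l` will make `std::stoull` throw rather than print the usage message.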
@@ -327,7 +369,7 @@ int main(int32_t argc, char** argv) {
     TfLiteIntArray* autoencoder_out_dims = autoencoder_interpreter->tensor(autoencoder_out_id)->dims;
 
     // ----- Allocate the extra buffer to pre-compute the sigmas
-    std::vector<float> t_buffer(k_t_tensor_sz);
+    std::vector<float> t_buffer(num_steps + 1);
 
     // ----- Initialize the T and X buffers
     fill_random_norm_dist(dit_x_in_data, get_num_elems(dit_x_in_dims), seed);
@@ -350,7 +392,7 @@ int main(int32_t argc, char** argv) {
     }
 
     // Initialize the t5_time_in_data
-    memcpy(t5_time_in_data, &k_audio_len_sec, 1 * sizeof(float));
+    memcpy(t5_time_in_data, &audio_len_sec, 1 * sizeof(float));
 
     auto start_t5 = time_in_ms();
 
@@ -366,7 +408,7 @@ int main(int32_t argc, char** argv) {
 
     auto start_dit = time_in_ms();
 
-    for(size_t i = 0; i < k_num_steps; ++i) {
+    for(size_t i = 0; i < num_steps; ++i) {
         const float curr_t = t_buffer[i];
         const float next_t = t_buffer[i + 1];
         memcpy(dit_t_in_data, &curr_t, 1 * sizeof(float));
@@ -394,12 +436,12 @@ int main(int32_t argc, char** argv) {
     const float* left_ch = autoencoder_out_data;
     const float* right_ch = autoencoder_out_data + num_audio_samples;
 
-    save_as_wav(output_path.c_str(), left_ch, right_ch, num_audio_samples);
+    save_as_wav(output_file.c_str(), left_ch, right_ch, num_audio_samples);
 
     // Save the file
     auto t5_exec_time = (end_t5 - start_t5);
     auto dit_exec_time = (end_dit - start_dit);
-    auto dit_avg_step_time = (dit_exec_time / static_cast<float>(k_num_steps));
+    auto dit_avg_step_time = (dit_exec_time / static_cast<float>(num_steps));
     auto autoencoder_exec_time = (end_autoencoder - start_autoencoder);
     auto total_exec_time = t5_exec_time + dit_exec_time + autoencoder_exec_time;
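Because the output name is now configurable with `-o`, the Android retrieval step changes accordingly. A sketch, where `my_clip.wav` is a hypothetical name passed via `-o` and the device path assumes the app was pushed to `/data/local/tmp/app` as in the README instructions:

```shell
# Build the adb pull command for a custom output name; OUT must match
# whatever was passed to audiogen via -o (placeholder value here).
OUT=my_clip.wav
cmd="adb pull /data/local/tmp/app/$OUT"
echo "$cmd"
```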
