
Commit fc3118d

gmiodice authored and kshitij-sisodia-arm committed

Extended the command-line in the audiogen app

- Switched CLI parsing from positional args to getopt
- Added num_steps
- Added output file name
- Added audio length in seconds

Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com>

1 parent 14e7c07 commit fc3118d

File tree

2 files changed: +122 −82 lines changed

kleidiai-examples/audiogen/app/README.md

Lines changed: 54 additions & 56 deletions
@@ -21,8 +21,57 @@ This guide will show you how to build the <strong>audio generation (audiogen)</strong>
 
 To build the audiogen application, follow one of the following sections depending on your <strong>TARGET</strong> platform:
 
-- [Build the audiogen app for Android™ (TARGET)](#build-the-audiogen-app-on-linux_host_or-macos_host_for-android_target)
 - [Build the audiogen app for macOS® (TARGET)](#build-the-audiogen-app-on-macos_host_for-macos_target)
+- [Build the audiogen app for Android™ (TARGET)](#build-the-audiogen-app-on-linux_host_or-macos_host_for-android_target)
+
+### Build the audiogen app on macOS® (HOST) for macOS® (TARGET)
+
+#### Step 1
+Navigate to the `audiogen/app/` folder. Set the `LITERT_MODELS_PATH` environment variable to the path where your Stable Audio Open Small models exported to LiteRT are located:
+
+```bash
+export LITERT_MODELS_PATH=<path_to_your_litert_models>
+```
+
+#### Step 2
+Build the audiogen application. Inside the `app` directory, create the `build` folder and navigate into it:
+
+```bash
+mkdir build && cd build
+```
+
+Next, run CMake using the following command:
+
+```bash
+cmake ..
+```
+
+Then, build the application:
+```bash
+make -j
+```
+
+#### Step 3
+Since the tokenizer used in the audiogen application is based on <strong>SentencePiece</strong>, you’ll need to download the `spiece.model` file from https://huggingface.co/google-t5/t5-base/tree/main and add it to your `$LITERT_MODELS_PATH`:
+
+```bash
+curl https://huggingface.co/google-t5/t5-base/resolve/main/spiece.model -o $LITERT_MODELS_PATH/spiece.model
+```
+
+At this point, you are ready to run the `audiogen` application, which requires just three input arguments:
+
+- **Model Path (-m)**: The directory containing your LiteRT models and `spiece.model` files
+- **Prompt (-p)**: A text description of the desired audio (e.g., *warm arpeggios on house beats 120BPM with drums effect*)
+- **CPU Threads (-t)**: The number of CPU threads to use (e.g., `4`)
+
+```bash
+./audiogen -m . -p "warm arpeggios on house beats 120BPM with drums effect" -t 4
+```
+
+If everything runs successfully, the generated audio will be saved in `.wav` format (`output.wav`) in the `audiogen_app` folder. At this point, you can play it on your laptop or PC.
 
 ### Build the audiogen app on Linux® (HOST) or macOS® (HOST) for Android™ (TARGET)
 
@@ -111,67 +160,16 @@ cd /data/local/tmp/app
 
 From there, you can then run the `audiogen` application, which requires just three input arguments:
 
-- **Model Path**: The directory containing your LiteRT models and `spiece.model` files
-- **Prompt**: A text description of the desired audio (e.g., *warm arpeggios on house beats 120BPM with drums effect*)
-- **CPU Threads**: The number of CPU threads to use (e.g., `4`)
-- **Seed**: Specifies the seed value for the random initializer. Changing the seed will produce different audio outputs
+- **Model Path (-m)**: The directory containing your LiteRT models and `spiece.model` files
+- **Prompt (-p)**: A text description of the desired audio (e.g., *warm arpeggios on house beats 120BPM with drums effect*)
+- **CPU Threads (-t)**: The number of CPU threads to use (e.g., `4`)
 
 ```bash
-./audiogen . "warm arpeggios on house beats 120BPM with drums effect" 4
+./audiogen -m . -p "warm arpeggios on house beats 120BPM with drums effect" -t 4
 ```
 
 If everything runs successfully, the generated audio will be saved in `.wav` format (`output.wav`) in the same directory as the `audiogen` binary. At this point, you can then retrieve it using the `adb` tool from a different Terminal and play it on your laptop or PC.
 
 ```bash
 adb pull data/local/tmp/output.wav
 ```
-
-### Build the audiogen app on macOS® (HOST) for macOS® (TARGET)
-
-#### Step 1
-Navigate to the `audiogen/app/` folder. Set the `LITERT_MODELS_PATH` environment variable to the path where your Stable Audio Open Small models exported to LiteRT are located:
-
-```bash
-export LITERT_MODELS_PATH=<path_to_your_litert_models>
-```
-
-#### Step 2
-Build the audiogen application. Inside the `app` directory, create the `build` folder and navigate into it:
-
-```bash
-mkdir build && cd build
-```
-
-Next, run CMake using the following command:
-
-```bash
-cmake ..
-```
-
-Then, build the application:
-```bash
-make -j
-```
-
-#### Step 3
-Since the tokenizer used in the audiogen application is based on <strong>SentencePiece</strong>, you’ll need to download the `spiece.model` file from: https://huggingface.co/google-t5/t5-base/tree/main
-and add it to your `$LITERT_MODELS_PATH`.
-
-```bash
-curl https://huggingface.co/google-t5/t5-base/resolve/main/spiece.model -o $LITERT_MODELS_PATH/spiece.model
-```
-
-At this point, you are ready to run the audiogen application.
-
-From there, you can then run the `audiogen` application, which requires just three input arguments:
-
-- **Model Path**: The directory containing your LiteRT models and `spiece.model` files
-- **Prompt**: A text description of the desired audio (e.g., *warm arpeggios on house beats 120BPM with drums effect*)
-- **CPU Threads**: The number of CPU threads to use (e.g., `4`, `8`)
-- **Seed**: Specifies the seed value for the random initializer. Changing the seed will produce different audio outputs
-
-```bash
-./audiogen $LITERT_MODELS_PATH "warm arpeggios on house beats 120BPM with drums effect" 4 99
-```
-
-If everything runs successfully, the generated audio will be saved in `.wav` format (`output.wav`) in the `audiogen_app` folder. At this point, you can play it on your laptop or PC.
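Both walkthroughs above assume the LiteRT models and the tokenizer file are already in place. A minimal pre-flight sketch, using the filenames hard-coded in `audiogen.cpp` (adjust if your export uses different names):

```shell
# Sanity check: report any expected model file missing under LITERT_MODELS_PATH.
# Filenames are taken from audiogen.cpp in this commit.
missing=0
for f in conditioners_float32.tflite dit_model.tflite autoencoder_model.tflite spiece.model; do
  if [ ! -f "${LITERT_MODELS_PATH}/$f" ]; then
    echo "missing: $f"
    missing=1
  fi
done
if [ "$missing" -eq 0 ]; then
  echo "all model files present"
fi
```

Running this before `./audiogen` turns a cryptic model-load failure into an explicit list of missing files.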

kleidiai-examples/audiogen/app/audiogen.cpp

Lines changed: 68 additions & 26 deletions
@@ -31,6 +31,7 @@
 #include <cstdint>
 #include <cstring>
 #include <fstream>
+#include <unistd.h>
 #include <iterator>
 #include <random>
 #include <string>
@@ -39,14 +40,10 @@
 
 #include <sentencepiece_processor.h>
 
-inline long time_in_ms() {
-    using namespace std::chrono;
-    auto now = time_point_cast<milliseconds>(steady_clock::now());
-    return now.time_since_epoch().count();
-}
-
-constexpr float k_audio_len_sec = 10.0f;
-constexpr size_t k_num_steps = 8;
+constexpr size_t k_seed_default = 99;
+constexpr size_t k_audio_len_sec_default = 10;
+constexpr size_t k_num_steps_default = 8;
+const std::string k_output_file_default = "output.wav";
 
 // -- Update the tensor index based on your model configuration.
 constexpr size_t k_t5_ids_in_idx = 0;
@@ -61,9 +58,6 @@ constexpr size_t k_dit_x_in_idx = 2;
 constexpr size_t k_dit_t_in_idx = 3;
 constexpr size_t k_dit_out_idx = 0;
 
-// -- Tensor size to pre-compute the sigmas
-constexpr size_t k_t_tensor_sz = k_num_steps + 1;
-
 // -- Fill sigmas params
 constexpr float k_logsnr_max = -6.0f;
 constexpr float k_sigma_min = 0.0f;
@@ -75,6 +69,31 @@ constexpr float k_sigma_max = 1.0f;
         exit(1); \
     }
 
+static inline long time_in_ms() {
+    using namespace std::chrono;
+    auto now = time_point_cast<milliseconds>(steady_clock::now());
+    return now.time_since_epoch().count();
+}
+
+static void print_usage(const char *name) {
+    fprintf(stderr,
+        "Usage: %s -m <models_base_path> -p <prompt> -t <num_threads> [-s <seed> -l <audio_len>]\n\n"
+        "Options:\n"
+        "  -m <models_base_path>  Path to model files\n"
+        "  -p <prompt>            Input prompt text (e.g., warm arpeggios on house beats 120BPM with drums effect)\n"
+        "  -t <num_threads>       Number of CPU threads to use\n"
+        "  -s <seed>              (Optional) Random seed for reproducibility. Different seeds generate different audio samples (Default: %zu)\n"
+        "  -l <audio_len_sec>     (Optional) Length of generated audio (Default: %zu s)\n"
+        "  -n <num_steps>         (Optional) Number of steps (Default: %zu)\n"
+        "  -o <output_file>       (Optional) Output audio file name (Default: %s)\n"
+        "  -h                     Show this help message\n",
+        name,
+        k_seed_default,
+        k_audio_len_sec_default,
+        k_num_steps_default,
+        k_output_file_default.c_str());
+}
+
 static std::vector<int32_t> convert_prompt_to_ids(const std::string& prompt, const std::string& spiece_model_path) {
     sentencepiece::SentencePieceProcessor sp;
 
@@ -194,22 +213,45 @@ struct TfLiteDelegateDeleter {
 
 int main(int32_t argc, char** argv) {
 
-    if (argc != 5) {
-        printf("ERROR: Usage ./audiogen <models_base_path> <prompt> <num_threads> <seed>\n");
-        return 1;
-    }
-
     // ----- Parse the cmd line arguments
     // ----------------------------------
-    const std::string models_base_path = argv[1];
-    const std::string prompt = argv[2];
-    const size_t num_threads = std::stoull(argv[3]);
-    const size_t seed = std::stoull(argv[4]);
+    // Required arguments
+    std::string models_base_path = "";
+    std::string prompt = "";
+    size_t num_threads = 0;
+    // Optional arguments
+    std::string output_file = k_output_file_default;
+    size_t seed = k_seed_default;
+    size_t num_steps = k_num_steps_default;
+    float audio_len_sec = static_cast<float>(k_audio_len_sec_default);
+
+    int opt;
+    while ((opt = getopt(argc, argv, "m:p:t:s:n:o:l:h")) != -1) {
+        switch (opt) {
+            case 'm': models_base_path = optarg; break;
+            case 'p': prompt = optarg; break;
+            case 't': num_threads = std::stoull(optarg); break;
+            case 'o': output_file = optarg; break;
+            case 's': seed = std::stoull(optarg); break;
+            case 'n': num_steps = std::stoull(optarg); break;
+            case 'l': audio_len_sec = static_cast<float>(std::stoull(optarg)); break;
+            case 'h':
+            default:
+                print_usage(argv[0]);
+                return EXIT_FAILURE;
+        }
+    }
+
+    // Check the mandatory arguments
+    if (models_base_path.empty() || prompt.empty() || num_threads <= 0) {
+        fprintf(stderr, "ERROR: Missing required arguments.\n\n");
+        print_usage(argv[0]);
+        return EXIT_FAILURE;
+    }
 
     std::string t5_tflite = models_base_path + "/conditioners_float32.tflite";
     std::string dit_tflite = models_base_path + "/dit_model.tflite";
     std::string autoencoder_tflite = models_base_path + "/autoencoder_model.tflite";
-    std::string output_path = "output.wav";
     std::string sentence_model_path = models_base_path + "/spiece.model";
 
     // ----- Load the models
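The parsing hunk above accepts four optional flags on top of the three required ones. A hypothetical invocation exercising all of them, with the flag letters taken from the getopt string `"m:p:t:s:n:o:l:h"` and every value a placeholder:

```shell
# Build the full audiogen command line with the optional flags
# (-s seed, -n diffusion steps, -l audio length in seconds, -o output file),
# then print it for inspection before running it on the target.
cmd='./audiogen -m . -p "warm arpeggios on house beats 120BPM with drums effect" -t 4 -s 42 -n 16 -l 8 -o my_clip.wav'
echo "$cmd"
```

Note that `-l` is parsed with `std::stoull`, so it accepts whole seconds only, and a non-numeric value for `-t`, `-s`, `-n`, or `-l` will make `std::stoull` throw rather than print the usage message.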
@@ -327,7 +369,7 @@ int main(int32_t argc, char** argv) {
     TfLiteIntArray* autoencoder_out_dims = autoencoder_interpreter->tensor(autoencoder_out_id)->dims;
 
     // ----- Allocate the extra buffer to pre-compute the sigmas
-    std::vector<float> t_buffer(k_t_tensor_sz);
+    std::vector<float> t_buffer(num_steps + 1);
 
     // ----- Initialize the T and X buffers
     fill_random_norm_dist(dit_x_in_data, get_num_elems(dit_x_in_dims), seed);
@@ -350,7 +392,7 @@ int main(int32_t argc, char** argv) {
     }
 
     // Initialize the t5_time_in_data
-    memcpy(t5_time_in_data, &k_audio_len_sec, 1 * sizeof(float));
+    memcpy(t5_time_in_data, &audio_len_sec, 1 * sizeof(float));
 
     auto start_t5 = time_in_ms();
 
@@ -366,7 +408,7 @@ int main(int32_t argc, char** argv) {
 
     auto start_dit = time_in_ms();
 
-    for(size_t i = 0; i < k_num_steps; ++i) {
+    for(size_t i = 0; i < num_steps; ++i) {
         const float curr_t = t_buffer[i];
         const float next_t = t_buffer[i + 1];
         memcpy(dit_t_in_data, &curr_t, 1 * sizeof(float));
@@ -394,12 +436,12 @@ int main(int32_t argc, char** argv) {
     const float* left_ch = autoencoder_out_data;
     const float* right_ch = autoencoder_out_data + num_audio_samples;
 
-    save_as_wav(output_path.c_str(), left_ch, right_ch, num_audio_samples);
+    save_as_wav(output_file.c_str(), left_ch, right_ch, num_audio_samples);
 
     // Save the file
     auto t5_exec_time = (end_t5 - start_t5);
     auto dit_exec_time = (end_dit - start_dit);
-    auto dit_avg_step_time = (dit_exec_time / static_cast<float>(k_num_steps));
+    auto dit_avg_step_time = (dit_exec_time / static_cast<float>(num_steps));
     auto autoencoder_exec_time = (end_autoencoder - start_autoencoder);
     auto total_exec_time = t5_exec_time + dit_exec_time + autoencoder_exec_time;
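Because the output name is now configurable with `-o`, the Android retrieval step changes accordingly. A sketch, where `my_clip.wav` is a hypothetical name passed via `-o` and the device path assumes the app was pushed to `/data/local/tmp/app` as in the README instructions:

```shell
# Build the adb pull command for a custom output name; OUT must match
# whatever was passed to audiogen via -o (placeholder value here).
OUT=my_clip.wav
cmd="adb pull /data/local/tmp/app/$OUT"
echo "$cmd"
```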
