Merged
33 changes: 10 additions & 23 deletions README.md
@@ -77,48 +77,35 @@ Once you have extracted features from two languages (e.g., Python and TypeScript
--base output/py.txtpb \
--target output/ts.txtpb \
--output output/ \
--report-type directional
--report-type md
```

| Argument | Description |
| :--- | :--- |
| `--base <path>` | **Required.** Path to the "source of truth" feature registry (e.g., Python). |
| `--target <path>` | **Required.** Path to the comparison registry (e.g., TypeScript). |
| `--output <dir>` | **Required.** Path for the output directory. The report filename is auto-generated. |
| `--report-type <type>` | `symmetric` (default) for Jaccard Index, `directional` for F1/Precision/Recall, or `raw` for CSV. |
| `--alpha <float>` | Similarity threshold (0.0 - 1.0). Default is `0.8`. |
| `--report-type <type>` | `md` (default) for Markdown Parity Report, or `raw` for CSV. |

#### How Matching Works

The matcher uses the **Hungarian Algorithm** to find the optimal assignment between features in the Base and Target registries.
- **Cost Function**: Based on a similarity score derived from:
- Feature Name (normalized)
- Namespace / Module
- Feature Type (Function, Method, Class, etc.)
- **Thresholding**: Pairs with a similarity score below `--alpha` are discarded.
TODO: This needs updating
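
The assignment-plus-threshold idea described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's matcher: it replaces the Hungarian Algorithm with an exhaustive search over pairings (same optimum, but only practical for tiny inputs), scores names with the standard library's `difflib.SequenceMatcher` rather than the project's similarity function, and ignores namespace and feature type. The function names and sample features are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names (assumption: lowercase-only normalization)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_assignment(base: list[str], target: list[str], alpha: float = 0.8):
    """Find the one-to-one pairing with the highest total similarity by
    exhaustive search, then discard pairs scoring below alpha.

    Assumes len(base) <= len(target); a real matcher would use an
    assignment solver instead of enumerating permutations.
    """
    best_pairs, best_total = [], -1.0
    for perm in permutations(range(len(target)), len(base)):
        pairs = [(b, target[j], similarity(b, target[j])) for b, j in zip(base, perm)]
        total = sum(score for _, _, score in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Thresholding: keep only pairs whose similarity reaches alpha.
    return [(b, t) for b, t, score in best_pairs if score >= alpha]


# Hypothetical feature names from a base and a target registry.
base = ["run_agent", "load_tool"]
target = ["loadTool", "runAgent", "shutdown"]
matches = best_assignment(base, target)
```

Here `shutdown` stays unmatched because no base feature is assigned to it, and any assigned pair scoring below `alpha` would be dropped in the same way.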

#### Understanding the Reports

`adk-scope` can generate three types of reports to help you understand the feature overlap between two languages.
`adk-scope` generates two types of reports to help you understand the feature overlap between two languages.

##### Symmetric Report (`--report-type symmetric`)
##### Markdown Parity Report (`--report-type md`)

This report is best for measuring the general similarity between two feature sets, where neither is considered the "source of truth". It uses the **Jaccard Index** to calculate a global similarity score.
This report generates a human-readable Markdown file detailing the feature parity between two SDKs.

- **What it measures**: The Jaccard Index measures the similarity between two sets by dividing the size of their intersection by the size of their union. The score ranges from 0% (no similarity) to 100% (identical sets).
- **What it means**: A high Jaccard Index indicates that both languages have a very similar set of features, with few features unique to either one. It penalizes both missing and extra features equally.
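
As a small worked example of the Jaccard Index, assuming features are reduced to plain name sets that match exactly (the real tool matches by similarity, so this is a simplification; the names are hypothetical):

```python
def jaccard_index(base: set[str], target: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    if not base and not target:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(base & target) / len(base | target)


# Hypothetical feature names; 2 shared features out of 4 distinct ones.
base = {"run", "stream", "stop"}
target = {"run", "stream", "trace"}
score = jaccard_index(base, target)
```

One feature missing from the target (`stop`) and one extra feature (`trace`) lower the score by the same amount, which is the symmetric penalty described above.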

##### Directional Report (`--report-type directional`)

This report is ideal when you have a "base" or "source of truth" language and you want to measure how well a "target" language conforms to it. It uses **Precision**, **Recall**, and **F1-Score**.

- **Precision**: Answers the question: *"Of all the features implemented in the target language, how many of them are correct matches to features in the base language?"* A low score indicates the target has many extra features not present in the base.
- **Recall**: Answers the question: *"Of all the features that should be in the target language (i.e., all features in the base), how many were actually found?"* A low score indicates the target is missing many features from the base.
- **F1-Score**: The harmonic mean of Precision and Recall, providing a single score that balances both. A high F1-Score indicates the target is a close match to the base, having most of the required features and not too many extra ones.
- **Gap Analysis List**: A summary table that breaks down features into "Common Shared", "Exclusive to [Base Language]", and "Exclusive to [Target Language]".
- **Jaccard Score**: It calculates an overall similarity score using the Jaccard Index (Intersection over Union), providing a global metric of feature parity.
- **Module Breakdown**: It provides score details and status links on a per-module basis, highlighting exact matches, potential near-matches, and missing features.
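
The directional metrics and the gap-analysis buckets can be illustrated together. This is a sketch under the assumption that features are compared as exact-match name sets; the function name, dictionary keys, and sample features are hypothetical and do not reflect the reporter's actual output format.

```python
def directional_scores(base: set[str], target: set[str]) -> dict:
    """Precision, recall, and F1 of target against base, plus gap-analysis buckets."""
    common = base & target
    precision = len(common) / len(target) if target else 0.0
    recall = len(common) / len(base) if base else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "common": sorted(common),
        "exclusive_to_base": sorted(base - target),
        "exclusive_to_target": sorted(target - base),
    }


# Hypothetical feature names: base has 4 features, target has 3, 2 are shared.
scores = directional_scores(
    base={"run", "stream", "stop", "resume"},
    target={"run", "stream", "trace"},
)
```

With these inputs, precision is 2/3 (one extra target feature), recall is 1/2 (two base features missing), and F1 is 4/7.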

##### Raw Report (`--report-type raw`)

This report provides a simple CSV output of all features (matched and unmatched) from both the base and target registries. It is useful for programmatic analysis or for importing the data into other tools.
This report provides a simple CSV output of all features (matched and unmatched) from both the base and target registries. It is useful for programmatic analysis or for importing the data into other tools.

## Development

878 changes: 660 additions & 218 deletions playground.ipynb

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -22,6 +22,8 @@ dependencies = [
"scipy",
"numpy",
"jellyfish",
"RapidFuzz",
"pandas",
]


60 changes: 40 additions & 20 deletions report.sh
@@ -4,11 +4,12 @@
set -e

# Default values
REPORT_TYPE="symmetric"
ALPHA="0.8"
REPORT_TYPE="md"
VERBOSE=""
COMMON=""

# Parse arguments
REGISTRIES=()
while [[ "$#" -gt 0 ]]; do
case "$1" in
--base)
@@ -19,6 +20,13 @@ while [[ "$#" -gt 0 ]]; do
TARGET_FILE="$2"
shift 2
;;
--registries)
shift
while [[ "$#" -gt 0 && ! "$1" =~ ^-- ]]; do
REGISTRIES+=("$1")
shift
done
;;
--output)
OUTPUT_DIR="$2"
shift 2
@@ -27,25 +35,21 @@ while [[ "$#" -gt 0 ]]; do
REPORT_TYPE="$2"
shift 2
;;
--alpha)
ALPHA="$2"
shift 2
;;
-v|--verbose)
VERBOSE="--verbose"
shift
;;
--common)
COMMON="--common"
shift
;;
*)
echo "Unknown option: $1"
exit 1
;;
esac
done

# Extract languages
BASE_LANG_RAW=$(head -n 1 "${BASE_FILE}" | grep -o 'language: "[A-Z]*"' | grep -o '"[A-Z]*"' | tr -d '"')
TARGET_LANG_RAW=$(head -n 1 "${TARGET_FILE}" | grep -o 'language: "[A-Z]*"' | grep -o '"[A-Z]*"' | tr -d '"')

# Function to map language to short code
get_lang_code() {
case "$1" in
@@ -57,16 +61,33 @@ get_lang_code() {
esac
}

BASE_LANG=$(get_lang_code "$BASE_LANG_RAW")
TARGET_LANG=$(get_lang_code "$TARGET_LANG_RAW")
if [[ ${#REGISTRIES[@]} -eq 0 && -n "$BASE_FILE" && -n "$TARGET_FILE" ]]; then
REGISTRIES+=("$BASE_FILE" "$TARGET_FILE")
fi

if [[ ${#REGISTRIES[@]} -lt 2 ]]; then
echo "Error: Must provide at least two registries via --registries or --base/--target"
exit 1
fi

# Extract languages and construct filename
LANG_CODES=()
for REG_FILE in "${REGISTRIES[@]}"; do
LANG_RAW=$(head -n 1 "${REG_FILE}" | grep -o 'language: "[A-Z]*"' | grep -o '"[A-Z]*"' | tr -d '"')
LANG_CODES+=($(get_lang_code "$LANG_RAW"))
done

# Construct filename
# Default to markdown extension. The python script will generate CSV alongside it.
EXTENSION="md"

# Standard 2-way report
OUTPUT_FILENAME="${LANG_CODES[0]}_${LANG_CODES[1]}.${EXTENSION}"
# Ensure report type is 'md' for standard logic so unified generator runs
if [ "$REPORT_TYPE" == "raw" ]; then
EXTENSION="csv"
else
EXTENSION="md"
REPORT_TYPE="md"
fi
OUTPUT_FILENAME="${BASE_LANG}_${TARGET_LANG}_${REPORT_TYPE}.${EXTENSION}"

FULL_OUTPUT_PATH="${OUTPUT_DIR}/${OUTPUT_FILENAME}"

# Determine the directory where this script is located
@@ -75,11 +96,10 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Add 'src' to PYTHONPATH so the python script can find modules
export PYTHONPATH="${SCRIPT_DIR}/src:${PYTHONPATH}"

# Run the python matcher
# Run the python reporter
python3 "${SCRIPT_DIR}/src/google/adk/scope/reporter/reporter.py" \
--base "${BASE_FILE}" \
--target "${TARGET_FILE}" \
--registries "${REGISTRIES[@]}" \
--output "${FULL_OUTPUT_PATH}" \
--report-type "${REPORT_TYPE}" \
--alpha "${ALPHA}" \
${COMMON} \
${VERBOSE}
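
For readers following the registry handling in `report.sh`, the same flow can be sketched in Python: collect `--registries`, fall back to `--base`/`--target`, require at least two files, and read the language from a registry's first line. This is an illustrative sketch only, not the code in `reporter.py`; the function names are hypothetical, and the sample first line is an assumption based on the `grep` pattern in the script. One difference to note: `argparse` with `nargs="+"` stops collecting at any argument starting with `-`, whereas the shell loop stops only at `--`.

```python
import argparse
import re

# Same pattern the shell script greps for on the first line of a registry.
LANGUAGE_RE = re.compile(r'language: "([A-Z]*)"')


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--base")
    parser.add_argument("--target")
    parser.add_argument("--registries", nargs="+", default=[])
    parser.add_argument("--output")
    parser.add_argument("--report-type", default="md")
    return parser.parse_args(argv)


def resolve_registries(args):
    """Prefer --registries; otherwise fall back to the --base/--target pair."""
    registries = list(args.registries)
    if not registries and args.base and args.target:
        registries = [args.base, args.target]
    if len(registries) < 2:
        raise SystemExit(
            "Error: Must provide at least two registries via --registries or --base/--target"
        )
    return registries


def language_of(first_line: str) -> str:
    """Return the raw language token from a registry's first line, or '' if absent."""
    match = LANGUAGE_RE.search(first_line)
    return match.group(1) if match else ""


args = parse_args(["--base", "output/py.txtpb", "--target", "output/ts.txtpb", "--output", "output/"])
registries = resolve_registries(args)
lang = language_of('language: "PYTHON"')  # assumed shape of a registry's first line
```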
28 changes: 10 additions & 18 deletions run.sh
@@ -12,28 +12,20 @@ echo "Extracting Go features..."

# Py -> TS

echo "Generating symmetric reports..."
./report.sh --base output/py.txtpb --target output/ts.txtpb --output ./output --report-type symmetric

echo "Generating directional reports.. ."
./report.sh --base output/py.txtpb --target output/ts.txtpb --output ./output --report-type directional

echo "Generating raw reports..."
./report.sh --base output/py.txtpb --target output/ts.txtpb --output ./output --report-type raw
echo "Generating raw and markdown reports..."
./report.sh --base output/py.txtpb --target output/ts.txtpb --output ./output --report-type md

# Py -> Java

echo "Generating symmetric reports..."
./report.sh --base output/py.txtpb --target output/java.txtpb --output ./output --report-type symmetric

echo "Generating directional reports (py->java)..."
./report.sh --base output/py.txtpb --target output/java.txtpb --output ./output --report-type directional

echo "Generating raw and markdown reports..."
./report.sh --base output/py.txtpb --target output/java.txtpb --output ./output --report-type md

# Py -> Go

echo "Generating symmetric reports..."
./report.sh --base output/py.txtpb --target output/go.txtpb --output ./output --report-type symmetric
echo "Generating raw and markdown reports..."
./report.sh --base output/py.txtpb --target output/go.txtpb --output ./output --report-type md

# Matrix reports

echo "Generating directional reports (py->go)..."
./report.sh --base output/py.txtpb --target output/go.txtpb --output ./output --report-type directional
#echo "Generating matrix reports..."
#./report.sh --registries output/py.txtpb output/ts.txtpb output/java.txtpb output/go.txtpb --output ./output --report-type matrix --common
13 changes: 13 additions & 0 deletions score.sh
@@ -0,0 +1,13 @@
#!/bin/bash
set -e

# Resolve the project root
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
export PYTHONPATH="${SCRIPT_DIR}/src:${PYTHONPATH}"

if [ "$#" -lt 2 ]; then
echo "Usage: $0 <feature1.txtpb> <feature2.txtpb> [options]"
exit 1
fi

python3 "${SCRIPT_DIR}/src/google/adk/scope/utils/score_features.py" "$@"