ChatbotsCodingPerformance

EVALUATION AND COMPARISON OF AI CHATBOTS' CODING PERFORMANCE: HOW GOOD ARE CHATBOTS AT CODING TASKS FROM EASY TO COMPLEX?

This study compares and evaluates the coding performance of AI chatbots. It began as a midterm project for the "Fundamentals of Artificial Intelligence" course at the Institute of Informatics, Hacettepe University.

Introduction

I focused on comparing and evaluating the coding performance of AI chatbots. Five chatbots were tested: ChatGPT, DeepSeek, Claude, Gemini, and GitHub Copilot. I entered all prompts in English.

Chatbots

I used five different AI chatbots:

ChatGPT Plus GPT-4o
Gemini Advanced 2.0 Flash
DeepSeek-V3
Claude 3.7 Sonnet Pro Plan
GitHub Copilot GPT-4.1 Pro

Queries

I asked the chatbots to generate Python code for 21 different mathematical calculations. Here are the queries:

Calculating the determinant of a 7x7 matrix
Calculating the determinant of an 11x11 matrix
Matrix product for 7x7 matrices
Matrix product for 15x15 matrices
Matrix product for 20x20 matrices
Matrix product for 6x4 and 4x8 matrices
Matrix product for 12x4 and 4x16 matrices
Matrix product for 5x10 and 20x4 matrices
Calculating the transpose of a matrix
Calculating the complex conjugate of a matrix
Calculating the Hermitian conjugate of a matrix
Calculating the inverse of a square matrix
Checking whether a square matrix is symmetric or antisymmetric
Checking whether a square matrix is Hermitian or anti-Hermitian
Checking whether a square matrix is orthogonal
Checking whether a square matrix is unitary
Calculating the eigenvalues and eigenvectors of a 3x3 matrix
Calculating the eigenvalues and eigenvectors of a 7x7 matrix
Solving a 2nd-order differential equation with the Runge-Kutta method
Solving a 2nd-order differential equation with the Adams-Bashforth-Moulton method
Solving a 2nd-order differential equation with the Milne method
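To give a feel for the harder end of this list, here is a minimal sketch of the Runge-Kutta query. It assumes the classical 4th-order method and rewrites y'' = f(t, y, y') as the first-order system y' = v, v' = f(t, y, v). The function name `rk4_second_order` and its parameters are my own illustration, not the output of any of the tested chatbots.

```python
def rk4_second_order(f, t0, y0, v0, t_end, n_steps):
    """Integrate y'' = f(t, y, y') from t0 to t_end with classical RK4.

    The equation is treated as the system y' = v, v' = f(t, y, v).
    Returns the approximations of y(t_end) and y'(t_end).
    """
    h = (t_end - t0) / n_steps
    t, y, v = t0, y0, v0
    for _ in range(n_steps):
        # Four slope estimates for y (k?y) and for v (k?v)
        k1y, k1v = v, f(t, y, v)
        k2y = v + 0.5 * h * k1v
        k2v = f(t + 0.5 * h, y + 0.5 * h * k1y, v + 0.5 * h * k1v)
        k3y = v + 0.5 * h * k2v
        k3v = f(t + 0.5 * h, y + 0.5 * h * k2y, v + 0.5 * h * k2v)
        k4y = v + h * k3v
        k4v = f(t + h, y + h * k3y, v + h * k3v)
        # Weighted average of the slopes
        y += h * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
        t += h
    return y, v


# Example: y'' = -y with y(0) = 0, y'(0) = 1 has the exact solution y = sin(t)
y_end, v_end = rk4_second_order(lambda t, y, v: -y, 0.0, 0.0, 1.0, 1.0, 100)
```

A test equation with a known closed-form solution, such as the one above, is one way to check whether a generated solver gives the correct answer.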

Prompts

You can find the prompts I entered into the chatbots in the prompts.txt file.

I did not send a separate prompt for every single query. Instead, I merged queries of the same kind into one prompt. Here is the first prompt, which covers the first two queries:

Write a Python code that calculates the determinant of given matrix in size 7x7 and 11x11. Write separate functions for both 7x7 and 11x11.
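For orientation, a response that satisfies this prompt might look roughly like the sketch below. It is not the output of any tested chatbot. It assumes pure Python with Gaussian elimination and partial pivoting, and the names `determinant`, `determinant_7x7` and `determinant_11x11` are hypothetical.

```python
def determinant(matrix):
    """Determinant of a square matrix via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    a = [[float(x) for x in row] for row in matrix]  # work on a copy
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def determinant_7x7(matrix):
    """Separate entry point for 7x7 input, as the prompt requests."""
    if len(matrix) != 7:
        raise ValueError("expected a 7x7 matrix")
    return determinant(matrix)


def determinant_11x11(matrix):
    """Separate entry point for 11x11 input, as the prompt requests."""
    if len(matrix) != 11:
        raise ValueError("expected an 11x11 matrix")
    return determinant(matrix)
```

The two size-specific functions only validate the size and delegate to a shared routine, which keeps the "separate functions" requirement without duplicating the elimination logic.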

Evaluation

There are two conditions that the generated codes must meet:

The code must run without any syntax error.
The code must give the correct answer.

If the generated code meets both conditions, the chatbot is considered to have answered the query correctly.
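This README does not describe how the two conditions were checked. If one wanted to automate them, a minimal sketch could look like the following. The helper `passes_evaluation`, its parameters, and the tolerance are assumptions for illustration, and it only handles functions that return a single number.

```python
def passes_evaluation(source, func_name, args, expected, tol=1e-9):
    """Return True if `source` compiles, runs, and func_name(*args) matches `expected`.

    Hypothetical helper: checks condition 1 (no syntax error) and
    condition 2 (correct answer) for a scalar-valued function.
    """
    try:
        code = compile(source, "<generated>", "exec")  # condition 1
    except SyntaxError:
        return False
    namespace = {}
    try:
        exec(code, namespace)  # define the generated functions
        result = namespace[func_name](*args)
        return abs(result - expected) <= tol  # condition 2
    except Exception:
        # Runtime errors, missing functions, or non-numeric results count as failures
        return False


# Usage with a toy "generated" snippet
generated = "def double(x):\n    return 2 * x\n"
ok = passes_evaluation(generated, "double", (21,), 42)
```

Because `exec` runs arbitrary code, a check like this should only be run on generated code in an isolated environment.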

Results

You can find the query-by-query results in the table below.

Here is the score table:

DeepSeek-V3 (19/21) ---> 90.48%
ChatGPT Plus GPT-4o & GitHub Copilot GPT-4.1 Pro (18/21) ---> 85.71%
Claude 3.7 Sonnet Pro Plan (17/21) ---> 80.95%
Gemini Advanced 2.0 Flash (13/21) ---> 61.90%
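The percentages are simply the number of correctly answered queries divided by 21, rounded to two decimals. A short sketch that reproduces them from the scores listed above (the dictionary and function names are my own):

```python
TOTAL_QUERIES = 21

# Correct answers per chatbot, taken from the score table
scores = {
    "DeepSeek-V3": 19,
    "ChatGPT Plus GPT-4o": 18,
    "GitHub Copilot GPT-4.1 Pro": 18,
    "Claude 3.7 Sonnet Pro Plan": 17,
    "Gemini Advanced 2.0 Flash": 13,
}


def success_rate(correct, total=TOTAL_QUERIES):
    """Percentage of correctly answered queries, rounded to two decimals."""
    return round(100 * correct / total, 2)


rates = {name: success_rate(correct) for name, correct in scores.items()}
for name, rate in rates.items():
    print(f"{name}: {scores[name]}/{TOTAL_QUERIES} = {rate:.2f}%")
```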

