EVALUATION AND COMPARISON OF AI CHATBOTS' CODING PERFORMANCE: HOW WELL DO CHATBOTS CODE ON TASKS FROM EASY TO COMPLEX?
This study compares and evaluates the coding performance of AI chatbots. It began as a midterm project for the "Fundamentals of Artificial Intelligence" course at the Institute of Informatics, Hacettepe University.
I compared and evaluated the coding performance of five AI chatbots: ChatGPT, DeepSeek, Claude, Gemini, and GitHub Copilot. All prompts were entered in English.
The specific models and plans I used were:
- ChatGPT Plus GPT-4o
- Gemini Advanced 2.0 Flash
- DeepSeek-V3
- Claude 3.7 Sonnet Pro Plan
- GitHub Copilot GPT-4.1 Pro
I asked the chatbots to generate Python code for 21 different mathematical calculations. Here are the queries:
- Calculating the determinant of a 7x7 matrix
- Calculating the determinant of an 11x11 matrix
- Matrix product for 7x7 matrices
- Matrix product for 15x15 matrices
- Matrix product for 20x20 matrices
- Matrix product for 6x4 and 4x8 matrices
- Matrix product for 12x4 and 4x16 matrices
- Matrix product for 5x10 and 20x4 matrices
- Calculating the transpose of a matrix
- Calculating the complex conjugate of a matrix
- Calculating the Hermitian conjugate of a matrix
- Calculating the inverse of a square matrix
- Checking whether a square matrix is symmetric or antisymmetric
- Checking whether a square matrix is Hermitian or anti-Hermitian
- Checking whether a square matrix is orthogonal or not
- Checking whether a square matrix is unitary or not
- Calculating the eigenvalues and eigenvectors of a 3x3 matrix
- Calculating the eigenvalues and eigenvectors of a 7x7 matrix
- Solving a 2nd-order differential equation with the Runge-Kutta method
- Solving a 2nd-order differential equation with the Adams-Bashforth-Moulton method
- Solving a 2nd-order differential equation with the Milne method
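To make the matrix-property queries above concrete, here is a minimal NumPy sketch of the checks involved. The function names and the tolerance `tol` are illustrative assumptions, not taken from any chatbot's answer or from the study's own code. The sketch also includes a shape test: under standard matrix multiplication a product is defined only when the column count of the first matrix equals the row count of the second, so a 5x10 by 20x4 product is not defined. This README does not state how that case was scored.

```python
import numpy as np


def can_multiply(a, b):
    """True if a @ b is defined: columns of a must equal rows of b."""
    return a.shape[1] == b.shape[0]


def is_symmetric(m, tol=1e-10):
    """M equals its transpose."""
    return bool(np.allclose(m, m.T, atol=tol))


def is_antisymmetric(m, tol=1e-10):
    """M equals the negative of its transpose."""
    return bool(np.allclose(m, -m.T, atol=tol))


def is_hermitian(m, tol=1e-10):
    """M equals its conjugate transpose."""
    return bool(np.allclose(m, m.conj().T, atol=tol))


def is_anti_hermitian(m, tol=1e-10):
    """M equals the negative of its conjugate transpose."""
    return bool(np.allclose(m, -m.conj().T, atol=tol))


def is_orthogonal(m, tol=1e-10):
    """M^T M equals the identity."""
    return bool(np.allclose(m.T @ m, np.eye(m.shape[0]), atol=tol))


def is_unitary(m, tol=1e-10):
    """M^H M equals the identity."""
    return bool(np.allclose(m.conj().T @ m, np.eye(m.shape[0]), atol=tol))
```

Comparing with `np.allclose` instead of `==` is a deliberate choice in this sketch: floating-point round-off would otherwise make nearly orthogonal or nearly unitary matrices fail an exact comparison.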
You can find the prompts I entered into the chatbots in the prompts.txt file.
I did not send a separate prompt for every query. Instead, I merged similar calculations into one prompt. Here is the first prompt I entered, covering the first two queries:
Write a Python code that calculates the determinant of given matrix in size 7x7 and 11x11. Write separate functions for both 7x7 and 11x11.
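For orientation, an answer to this prompt could look roughly like the sketch below. It is not the output of any of the tested chatbots; the function names and the shared helper `_det_checked` are hypothetical, and delegating to `np.linalg.det` is only one of several ways a chatbot might solve the task.

```python
import numpy as np


def _det_checked(matrix, n):
    """Hypothetical helper: validate the shape, then compute the determinant."""
    a = np.asarray(matrix, dtype=float)
    if a.shape != (n, n):
        raise ValueError(f"expected a {n}x{n} matrix, got shape {a.shape}")
    return float(np.linalg.det(a))


def determinant_7x7(matrix):
    """Determinant of a 7x7 matrix."""
    return _det_checked(matrix, 7)


def determinant_11x11(matrix):
    """Determinant of an 11x11 matrix."""
    return _det_checked(matrix, 11)
```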
There are two conditions that the generated codes must meet:
- The code must run without any syntax errors.
- The code must give the correct answer.
If the generated code meets both conditions, the chatbot is considered to have answered the query correctly.
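The two conditions can be expressed as an automated check. The sketch below is only an illustration: this README does not say whether the answers were verified by hand or by script, and the helper `passes_both_conditions`, its parameters, and the tolerance are assumptions. It also treats any runtime exception as a failure, which is an interpretation of "must run" and not something stated above.

```python
import numpy as np


def passes_both_conditions(source, func_name, args, expected, tol=1e-8):
    """Hypothetical check: does `source` compile, run, and return `expected`?"""
    # Condition 1: the code must not contain a syntax error.
    try:
        compiled = compile(source, "<generated>", "exec")
    except SyntaxError:
        return False

    # Assumption: runtime errors also count as a failed answer.
    # Warning: exec runs arbitrary code, so only use it on code you have read.
    namespace = {}
    try:
        exec(compiled, namespace)
        result = namespace[func_name](*args)
    except Exception:
        return False

    # Condition 2: the result must match the reference value.
    return bool(np.allclose(result, expected, atol=tol))


# Usage with a hand-written stand-in for a generated answer.
generated = "import numpy as np\ndef det(m):\n    return np.linalg.det(m)\n"
matrix = np.diag([1.0, 2.0, 3.0])
print(passes_both_conditions(generated, "det", (matrix,), 6.0))  # True
```

The reference value passed as `expected` would have to come from a trusted source, such as a hand calculation or a library routine that has been checked independently.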
You can find the results, query by query, in the table below.

Here is the score table:
- DeepSeek-V3 (19/21) ---> 90.48%
- ChatGPT Plus GPT-4o & GitHub Copilot GPT-4.1 Pro (18/21 each) ---> 85.71%
- Claude 3.7 Sonnet Pro Plan (17/21) ---> 80.95%
- Gemini Advanced 2.0 Flash (13/21) ---> 61.90%
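Each percentage is the number of correctly answered queries divided by 21, rounded to two decimal places. The short sketch below reproduces the figures from the counts in the score table; the function name `success_rate` is an illustrative choice.

```python
TOTAL_QUERIES = 21

# Correct answers per chatbot, taken from the score table.
correct_answers = {
    "DeepSeek-V3": 19,
    "ChatGPT Plus GPT-4o": 18,
    "GitHub Copilot GPT-4.1 Pro": 18,
    "Claude 3.7 Sonnet Pro Plan": 17,
    "Gemini Advanced 2.0 Flash": 13,
}


def success_rate(correct, total=TOTAL_QUERIES):
    """Percentage of correctly answered queries, rounded to two decimals."""
    return round(100 * correct / total, 2)


for name, correct in correct_answers.items():
    print(f"{name}: {correct}/{TOTAL_QUERIES} = {success_rate(correct):.2f}%")
```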