NguyenSiTrung/useful_command

Useful Linux Terminal Commands for AI Engineers in NLP and NMT

This document summarizes useful Linux terminal commands for an AI engineer working in NLP and NMT.

I. File & Directory Navigation & Management

These are fundamental for navigating your project, handling datasets, and managing model files.

| Command | Description | Example |
| --- | --- | --- |
| `pwd` | Print Working Directory - shows your current location in the file system. | `pwd` |
| `ls` | List directory contents. `-l` for long format, `-h` for human-readable sizes, `-a` to include hidden files. | `ls`, `ls -l`, `ls -lh`, `ls -la` |
| `cd` | Change directory. | `cd my_project`, `cd ..`, `cd /` |
| `mkdir` | Make directory - creates a new directory. | `mkdir data`, `mkdir models` |
| `rm` | Remove files or directories (use with caution: removed files do not go to a trash folder). `-r` for recursive (directories), `-f` to force removal without prompting. | `rm old_file.txt`, `rm -rf temp_dir` |
| `cp` | Copy files or directories. `-r` for recursive (directories). | `cp source.txt dest.txt`, `cp -r data backup` |
| `mv` | Move or rename files or directories. | `mv old_name.txt new_name.txt`, `mv file.txt my_dir/` |
| `touch` | Create an empty file or update the timestamp of an existing one. | `touch new_file.txt` |
| `find` | Search for files and directories based on various criteria (name, type, size, etc.). | `find . -name "*.txt"` |
| `locate` | Quickly find files by name. It searches a prebuilt database, which must be refreshed (usually with `sudo updatedb`) before newly created files appear. | `locate my_data.csv` |
| `du` | Display disk usage. `-h` for human-readable output, `-s` for a summary of a directory. | `du -sh my_dir` |
| `df` | Display free disk space. `-h` for human-readable output. | `df -h` |
| `tar` | Create or extract archives (`.tar`), optionally gzip-compressed (`.tar.gz`, `.tgz`). `-c` creates, `-x` extracts, `-z` applies gzip, `-v` lists files as they are processed, `-f` names the archive file. | `tar -czvf archive.tar.gz my_dir`, `tar -xzvf archive.tar.gz` |
| `gzip`, `gunzip` | Compress or decompress a single file using gzip compression (`.gz`). | `gzip my_file.txt`, `gunzip my_file.txt.gz` |
| `zip`, `unzip` | Compress or decompress files using zip compression (`.zip`). | `zip my_archive.zip file1.txt file2.txt`, `unzip my_archive.zip` |
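
The commands in this table are often chained when packaging a dataset. The following is a minimal sketch, not a prescribed workflow: the directory `demo_data`, its two files, and the `restored` directory are hypothetical names created in a throwaway location by `mktemp -d`.

```shell
# Work in a throwaway directory so no real data is touched (hypothetical layout).
workdir=$(mktemp -d)
cd "$workdir"

# Create a small dataset directory with two files.
mkdir -p demo_data
printf 'hello world\n' > demo_data/train.txt
printf 'bonjour le monde\n' > demo_data/valid.txt

# List the .txt files and report the total size of the directory.
find demo_data -name "*.txt" | sort
du -sh demo_data

# Pack the directory into a gzip-compressed archive, then list its contents.
tar -czf demo_data.tar.gz demo_data
tar -tzf demo_data.tar.gz

# Extract into a separate directory (-C) and count the restored files.
mkdir restored
tar -xzf demo_data.tar.gz -C restored
restored_count=$(find restored -type f | wc -l)
echo "restored files: $restored_count"
```

Listing an archive with `tar -tzf` before extracting is a cheap way to check what it will write and where.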

II. Text Processing & Data Wrangling

Essential for preparing and analyzing text data, crucial in NLP/NMT.

| Command | Description | Example |
| --- | --- | --- |
| `cat` | Concatenate and display file contents. | `cat file1.txt`, `cat file1.txt file2.txt > combined.txt` |
| `head` | Display the first few lines of a file (default 10). `-n` to specify the number of lines. | `head -n 5 data.txt` |
| `tail` | Display the last few lines of a file (default 10). `-n` to specify the number of lines, `-f` to follow a file. | `tail -n 20 log.txt`, `tail -f log.txt` |
| `less` | View file contents one screen at a time (useful for large files). | `less large_file.txt` |
| `grep` | Search for patterns in files. `-i` for case-insensitive, `-r` for recursive, `-n` for line numbers, `-v` for invert match. | `grep "error" log.txt`, `grep -i "word" data.txt`, `grep -r "pattern" my_dir` |
| `sed` | Stream editor - perform text transformations. Used for substitutions, deletions, insertions. | `sed 's/old_word/new_word/g' file.txt` |
| `awk` | Powerful text processing language. Used for pattern scanning and processing, data extraction, and reporting. | `awk '{print $1}' file.txt` (print the first column), `awk -F',' '{print $2}' data.csv` |
| `wc` | Word, line, character, and byte count. `-l` for lines, `-w` for words, `-c` for bytes, `-m` for characters. | `wc -l data.txt` |
| `sort` | Sort lines of text. `-n` for numerical sort, `-r` for reverse, `-k` to specify a key column. | `sort data.txt`, `sort -n numbers.txt`, `sort -k2 file.txt` |
| `uniq` | Filter out repeated adjacent lines. `-c` to count occurrences. Often used with `sort`. | `sort data.txt \| uniq`, `sort data.txt \| uniq -c` |
| `cut` | Extract sections from each line of a file. `-f` to select fields, `-d` to specify delimiter. | `cut -f 1,3 -d ',' data.csv` |
| `paste` | Merge lines of files side-by-side. `-d` to specify delimiter. | `paste file1.txt file2.txt` |
| `tr` | Translate or delete characters. | `tr 'a-z' 'A-Z' < file.txt` (convert to uppercase) |
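
Several of these tools are typically combined in a pipeline. As an illustration, the sketch below builds a word-frequency list from a tiny made-up corpus; the sample sentences and the variable names `corpus`, `freq`, and `top_word` are assumptions for the example only.

```shell
# Build a tiny sample corpus in a temporary file (contents are illustrative only).
corpus=$(mktemp)
printf 'The cat sat\nthe dog sat\nThe cat ran\n' > "$corpus"

# Count the lines and words in the corpus.
wc -l < "$corpus"
wc -w < "$corpus"

# Lowercase the text, put one word per line, then count each distinct word.
# The first sort groups identical words so that uniq -c can count them.
# awk swaps the columns to "word count"; the last sort orders by descending
# count and breaks ties alphabetically.
freq=$(tr 'A-Z' 'a-z' < "$corpus" \
    | tr ' ' '\n' \
    | sort \
    | uniq -c \
    | awk '{print $2, $1}' \
    | sort -k2,2nr -k1,1)
echo "$freq"

# The most frequent word is the first line of the result.
top_word=$(echo "$freq" | head -n 1)
echo "top: $top_word"
```

The `sort` before `uniq -c` matters: `uniq` only merges lines that are next to each other, so unsorted input would give several partial counts for the same word.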

III. Process Management

For monitoring and managing training processes, especially long-running deep learning jobs.

| Command | Description | Example |
| --- | --- | --- |
| `ps` | Display currently running processes. `ps aux` or `ps -ef` for a detailed view. | `ps aux` |
| `top`, `htop` | Display and update sorted information about processes and system resource usage (like CPU, memory). `htop` is more interactive. | `top`, `htop` |
| `kill` | Send a signal to a process (usually to terminate it). Use `ps` or `top` to find the process ID (PID). The default signal is SIGTERM (15), which the process can handle to shut down cleanly; SIGKILL (9) cannot be caught and stops it immediately. | `kill 1234` (terminate PID 1234), `kill -9 1234` |
| `killall` | Kill processes by name. Every matching process you are allowed to signal is affected. | `killall python` |
| `nohup` | Run a command immune to hangups, so it keeps running after you log out. If output is not redirected and would go to a terminal, it is appended to `nohup.out`. Useful for running long jobs in the background. | `nohup python train.py &` |
| `&` | Run a command in the background. | `python train.py &` |
| `jobs` | List the background and suspended jobs of the current shell. | `jobs` |
| `fg` | Bring a background job to the foreground. | `fg %1` (bring job number 1 to the foreground) |
| `bg` | Resume a suspended job in the background. | `bg %1` |
| `Ctrl+Z` | Suspend a running process (pause it). | (Press `Ctrl+Z` while a process is running) |
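
A typical life cycle for a long-running job is to start it in the background, check that it is alive, and stop it when needed. The sketch below walks through that sequence; `sleep 30` is only a stand-in for a real command such as `python train.py`, and the variable names are assumptions made for the example.

```shell
# Start a placeholder job in the background and send its output to a log file.
# "sleep 30" stands in for a real training command (hypothetical).
log=$(mktemp)
nohup sleep 30 > "$log" 2>&1 &
pid=$!   # $! holds the PID of the most recently started background job
echo "started background job with PID $pid"

# kill -0 sends no signal; it only reports whether the process exists.
if kill -0 "$pid" 2>/dev/null; then
    state_before="running"
else
    state_before="stopped"
fi
echo "before kill: $state_before"

# Ask the job to terminate with the default signal (SIGTERM), then wait for
# it to exit. "|| true" ignores the non-zero status of the killed job.
kill "$pid"
wait "$pid" 2>/dev/null || true

if kill -0 "$pid" 2>/dev/null; then
    state_after="running"
else
    state_after="stopped"
fi
echo "after kill: $state_after"
```

Note that `nohup` protects the job from SIGHUP only, so a plain `kill` still terminates it.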

IV. System & Hardware Information

Useful for checking available resources (GPU, CPU, memory) before and during training.

| Command | Description | Example |
| --- | --- | --- |
| `nvidia-smi` | NVIDIA System Management Interface - provides monitoring and management capabilities for NVIDIA GPUs. | `nvidia-smi` |
| `lscpu` | Display information about the CPU architecture. | `lscpu` |
| `free` | Display the amount of free and used memory in the system. `-h` for human-readable output. | `free -h` |
| `uname` | Print system information. `-a` for all information. | `uname -a` |
| `uptime` | Show how long the system has been running, along with load averages. | `uptime` |
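
Not every machine has all of these tools installed (`nvidia-smi`, for example, requires an NVIDIA driver). One possible pre-training check, sketched below, tests for each tool with `command -v` before calling it; the `report` variable and the output wording are assumptions for the example.

```shell
# Record which of the tools from the table are available on this machine.
report=""
for tool in nvidia-smi lscpu free uptime; do
    if command -v "$tool" >/dev/null 2>&1; then
        report="$report$tool:available "
    else
        report="$report$tool:missing "
    fi
done
echo "$report"

# Run each tool only when it exists; a failing tool prints a note instead of
# stopping the script.
if command -v free >/dev/null 2>&1; then
    free -h || echo "free failed"
fi
if command -v uptime >/dev/null 2>&1; then
    uptime || echo "uptime failed"
fi
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi is installed but could not query a GPU"
fi
```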

V. Networking

These are helpful for transferring data to/from remote servers, downloading datasets, or interacting with remote machines.

| Command | Description | Example |
| --- | --- | --- |
| `ping` | Test network connectivity to a host. | `ping google.com` |
| `ssh` | Secure Shell - connect to a remote server securely. | `ssh user@remote_server` |
| `scp` | Secure Copy - copy files between hosts on a network securely (uses SSH). | `scp file.txt user@remote_server:/path/to/destination` |
| `rsync` | Fast, versatile, remote (and local) file-copying tool. Excellent for syncing directories and backups, often preferred over `scp`. | `rsync -avz source_dir/ user@remote_server:/dest_dir/` |
| `wget` | Retrieve files from the web (HTTP, HTTPS, FTP). | `wget https://example.com/dataset.zip` |
| `curl` | Transfer data with URLs. It can send custom HTTP methods, headers, and request bodies, which makes it well suited to calling APIs. The response is printed to standard output unless `-o` or `-O` is given. | `curl https://example.com/api/data` |
| `ifconfig` / `ip` | Display or configure network interfaces. `ifconfig` is older, `ip` is more modern and feature-rich. | `ifconfig`, `ip addr` |

VI. Package Management

Essential for installing and managing the software libraries you need (e.g., PyTorch, TensorFlow, Transformers).

  • apt (Debian/Ubuntu):
    • sudo apt update: Update the list of available packages.
    • sudo apt upgrade: Upgrade installed packages.
    • sudo apt install <package_name>: Install a package.
    • sudo apt remove <package_name>: Remove a package.
    • apt search <keyword>: Search for packages (root privileges are not needed for searching).
  • yum (Red Hat/Fedora/CentOS):
    • sudo yum update: Upgrade all installed packages to the latest available versions (package metadata is refreshed as needed).
    • sudo yum install <package_name>: Install a package.
    • sudo yum remove <package_name>: Remove a package.
    • sudo yum search <keyword>: Search for packages.
  • pip (Python Package Installer):
    • pip install <package_name>: Install a Python package.
    • pip install -U <package_name>: Upgrade a Python package.
    • pip uninstall <package_name>: Uninstall a Python package.
    • pip list: List installed Python packages.
    • pip show <package_name>: Show information about an installed package.
    • pip freeze > requirements.txt: Create a requirements file for your project.
    • pip install -r requirements.txt: Install packages from a requirements file.
  • conda (Anaconda/Miniconda):
    • conda create -n <env_name> python=<version>: Create a new environment.
    • conda activate <env_name>: Activate an environment.
    • conda install <package_name>: Install a package in the current environment.
    • conda update <package_name>: Update a package.
    • conda remove <package_name>: Remove a package.
    • conda list: List installed packages in the current environment.
    • conda env list: List all available environments.
    • conda env export > environment.yml: Export an environment to a YAML file.
    • conda env create -f environment.yml: Create an environment from a YAML file.

VII. Other Useful Commands

| Command | Description | Example |
| --- | --- | --- |
| `history` | Display command history. `!<number>` re-runs the command with that history number. | `history`, `!123` |
| `man` | Display the manual page for a command. | `man ls` |
| `which` | Show the full path of a command. | `which python` |
| `alias` | Create a shortcut for a command. | `alias ll='ls -lh'` |
| `date` | Display or set the system date and time. | `date` |
| `cal` | Display a calendar. | `cal` |
| `clear` or `Ctrl+L` | Clear the terminal screen. | `clear` |
| `exit` | Exit the current shell session. | `exit` |
| `sudo` | Execute a command with superuser (root) privileges. | `sudo apt update` |
| `chmod` | Change file permissions. | `chmod +x script.sh` (make a script executable) |
| `chown` | Change file owner and group. | `sudo chown user:group file.txt` |
| `ln` | Create links (hard links or symbolic links). `-s` for symbolic link. | `ln -s target_file link_name` (create a symbolic link) |
| `diff` | Compare files line by line. `-u` for unified diff (easier to read), `-y` for side-by-side diff. | `diff file1.txt file2.txt`, `diff -u old.txt new.txt` |
| `comm` | Compare two sorted files line by line. The output has three columns: lines only in the first file, lines only in the second file, and lines in both. | `comm file1.txt file2.txt` |
| `time` | Measure the execution time of a command. | `time python my_script.py` |
| `screen`, `tmux` | Terminal multiplexers - allow you to manage multiple terminal sessions from a single window. | `screen`, `tmux` (start a new session) |
| `watch` | Execute a program periodically, showing output fullscreen. Useful for monitoring. | `watch -n 1 nvidia-smi` (monitor GPU usage every second) |
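
To show how `chmod`, `ln -s`, and `diff -u` behave in practice, here is a small sketch that runs in a temporary directory. The file names (`run_eval.sh`, `latest_eval`, `old.cfg`, `new.cfg`) and their contents are hypothetical.

```shell
# Work in a throwaway directory (hypothetical file names throughout).
workdir=$(mktemp -d)
cd "$workdir"

# Write a two-line script; it cannot be run directly until chmod +x is applied.
printf '#!/bin/sh\necho "evaluation done"\n' > run_eval.sh
chmod +x run_eval.sh
script_output=$(./run_eval.sh)
echo "$script_output"

# Create a symbolic link that points at the script and read back its target.
ln -s run_eval.sh latest_eval
link_target=$(readlink latest_eval)
echo "latest_eval -> $link_target"

# Compare two versions of a config file. diff exits with status 0 when the
# files are identical and 1 when they differ, so it can drive an if statement.
printf 'lr=0.001\nepochs=10\n' > old.cfg
printf 'lr=0.001\nepochs=20\n' > new.cfg
if diff -u old.cfg new.cfg; then
    files_differ="no"
else
    files_differ="yes"
fi
echo "files differ: $files_differ"
```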

VIII. Advanced Text Processing & Data Wrangling

These commands help you perform more complex operations on text files, preparing them for model training or analysis.

| Command | Description | Example |
| --- | --- | --- |
| `sed` (more advanced usage) | Substitute with capture groups. | `sed 's/\(pattern1\)\(pattern2\)/\2\1/g' file.txt` (swap the order of captured patterns) |
| | In-place editing. | `sed -i 's/old/new/g' file.txt` (modify the file directly), `sed -i 's/typo/correction/g' data.txt` |
| | Multiple operations. | `sed -e 's/this/that/g' -e '/pattern/d' file.txt` (chain multiple sed commands), `sed -E -e 's/ +/ /g' -e '/^$/d' text.txt` (collapse repeated spaces and delete empty lines) |
| | Range-based operations. | `sed '10,20s/old/new/g' file.txt` (substitute only on lines 10-20) |
| | Case conversion (GNU sed extension `\L`). | `sed 's/^[A-Z]/\L&/' names.txt` (lowercase the first letter of each line, e.g. CamelCase to camelCase) |
| `awk` (more advanced usage) | Conditional logic. | `awk '{if ($1 > 10) print $0}' file.txt` (print lines where the first field is greater than 10), `awk '{if ($3 == "error") print "Error on line " NR ": " $0}' log.txt` |
| | Built-in functions. | `awk '{print toupper($0)}' file.txt` (convert to uppercase), `awk '{print length($0)}' file.txt` (print the length of each line) |
| | Arrays. | `awk '{count[$1]++} END {for (word in count) print word, count[word]}' file.txt` (count how often each value of the first field occurs), `awk '{lines[$0]++} END {for (l in lines) print l}' file.txt` (remove duplicate lines; the output order is not preserved) |
| `join` | Join lines of two files based on a common field. Both files must be sorted on the join field. | `join -1 1 -2 2 file1.txt file2.txt` (join on the first field of file1 and the second field of file2) |
| `split` | Split a file into multiple smaller files. Useful for dividing large datasets. | `split -l 1000 data.txt data_part_` (split data.txt into files with 1000 lines each, named data_part_aa, data_part_ab, etc.) |
| `csplit` | Split a file into multiple files based on context lines or patterns. | `csplit data.txt '/CHAPTER/' '{*}'` (split data.txt into files at each line containing "CHAPTER") |
| `shuf` | Randomize the order of lines in a file. Useful for shuffling datasets before training. | `shuf data.txt > shuffled_data.txt` |
| `nl` | Number lines of a file. Useful for adding line numbers for easier reference. | `nl data.txt > numbered_data.txt` |
| `pr` | Paginate or columnate files for printing. By default each page gets a header with the date, file name, and page number; `-t` omits headers and trailers, and `-n` numbers the lines. | `pr -t -n data.txt` |
| `fmt` | Reformat paragraph text to fit within a specified width. | `fmt -w 60 long_lines.txt` (wrap lines to a maximum width of 60 characters) |
| `expand`, `unexpand` | `expand` converts tabs to spaces, while `unexpand` does the opposite. Useful for standardizing whitespace. | `expand -t 4 file.txt` (convert tabs to 4 spaces), `unexpand -a file.txt` (convert spaces to tabs) |
| `paste` (parallel corpus example) | Combine corresponding lines of two files, which is essential for creating parallel corpora. For example, to create a tab-separated parallel corpus from two files (source.txt and target.txt): | `paste source.txt target.txt > parallel_corpus.tsv` |
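
Putting several of these tools together, the sketch below prepares a tiny parallel corpus: it pairs the lines, removes duplicate pairs, shuffles, and splits. The sentences, file names, and chunk size are made up for illustration. The order-preserving idiom `awk '!seen[$0]++'` is used here in place of the array example from the table, because it keeps the first occurrence of each line in its original position.

```shell
# Work in a throwaway directory (file names and sentences are illustrative).
workdir=$(mktemp -d)
cd "$workdir"

# Two line-aligned files: line N of target.txt translates line N of source.txt.
printf 'hello\ngood morning\nthank you\nhello\n' > source.txt
printf 'bonjour\nbonjour\nmerci\nbonjour\n' > target.txt

# Join corresponding lines with a tab to form a parallel corpus.
paste source.txt target.txt > parallel_corpus.tsv

# Drop exact duplicate pairs while keeping the first occurrence in order.
awk '!seen[$0]++' parallel_corpus.tsv > dedup.tsv

# Shuffle whole pairs so that source and target sentences stay aligned.
shuf dedup.tsv > shuffled.tsv

# Split into chunks of 2 lines each, named part_aa, part_ab, ...
split -l 2 shuffled.tsv part_

# Separate the columns again for tools that expect one file per language.
# cut uses the tab as its default delimiter, so spaces inside a sentence are kept.
cut -f 1 shuffled.tsv > shuffled.src
cut -f 2 shuffled.tsv > shuffled.tgt

pair_count=$(wc -l < dedup.tsv)
echo "unique pairs: $pair_count"
```

Shuffling the combined file, instead of shuffling `source.txt` and `target.txt` separately, is what keeps each sentence next to its translation.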
