How do I process CSV files in Linux command line?

Use tools like awk, sed, cut, and csvkit to process CSV files. For example, 'awk -F, '{print $1,$3}' file.csv' extracts columns 1 and 3 from a CSV file.

What are the best Linux tools for CSV processing?

The best tools include awk for field processing, cut for column extraction, sed for text manipulation, csvkit for advanced CSV operations, and sort/uniq for data analysis.

How do I handle CSV files with commas in fields?

Use csvkit tools like csvcut and csvgrep which properly handle quoted fields, or use awk with proper field parsing for CSV files with embedded commas.

Linux CSV Processing - Command Line CSV Data Manipulation

Overview

Linux offers numerous tools for CSV processing, each with specific strengths:

cut - Simple column extraction
awk - Advanced field processing and calculations
sed - Text substitution and transformation
grep - Pattern matching and filtering
sort - Data sorting and organization
uniq - Duplicate removal and counting
csvkit - Specialized CSV toolkit

Sample CSV Data

For the examples below, we'll use this sample data file (employees.csv):

                        name,department,salary,years
                        John Smith,Engineering,75000,5
                        Jane Doe,Marketing,65000,3
                        Bob Johnson,Engineering,80000,7
                        Alice Brown,Sales,60000,2
                        Charlie Wilson,Marketing,70000,4
                        Diana Davis,Engineering,85000,8
                    

Basic CSV Operations

Column Extraction with cut

Extract specific columns

cut -d, -f1,3 employees.csv

Extracts name and salary columns

Extract column range

cut -d, -f2-4 employees.csv

Extracts columns 2 through 4

Skip header and extract columns

tail -n +2 employees.csv | cut -d, -f1,3

Skips the header row and extracts name and salary

Field Processing with awk

Print specific fields

awk -F, '{print $1, $3}' employees.csv

Prints name and salary with space separation

Add calculations

awk -F, 'NR>1 {print $1, $3*1.1}' employees.csv

Calculates 10% salary increase (skips header)

Format output

awk -F, 'NR>1 {printf "%-15s $%d\n", $1, $3}' employees.csv

Formats output with fixed-width columns

Conditional processing

awk -F, '$3 > 70000 {print $1, $2, $3}' employees.csv

Shows only employees with salary > $70,000

Advanced CSV Processing

Data Filtering and Searching

Filter by department

grep "Engineering" employees.csv

Shows all Engineering department employees

Case-insensitive search

grep -i "engineering" employees.csv

Case-insensitive department search

Multiple pattern search

grep -E "Engineering|Marketing" employees.csv

Shows employees from Engineering or Marketing

Data Sorting and Analysis

Sort by salary (numeric)

(head -1 employees.csv; tail -n +2 employees.csv | sort -t, -k3 -n)

Sorts by salary while preserving header

Sort by department then salary

(head -1 employees.csv; tail -n +2 employees.csv | sort -t, -k2,2 -k3,3n)

Multi-column sort: department alphabetically, then salary numerically

Count employees by department

tail -n +2 employees.csv | cut -d, -f2 | sort | uniq -c

Counts employees in each department

Data Transformation

Convert to uppercase

awk -F, '{print toupper($1), $2, $3, $4}' OFS=, employees.csv

Converts names to uppercase

Replace values

sed 's/Engineering/Tech/g' employees.csv

Replaces "Engineering" with "Tech"

Add new calculated column

awk -F, 'NR==1 {print $0",bonus"} NR>1 {print $0","$3*0.1}' employees.csv

Adds a bonus column (10% of salary)

Statistical Analysis

Summary Statistics

Calculate average salary

                        awk -F, 'NR>1 {sum+=$3; count++} END {print "Average:", sum/count}' employees.csv
                    

Calculates the average salary

Find min and max salary

                        awk -F, 'NR>1 {if(min=="" || $3max) max=$3} END {print "Min:", min, "Max:", max}' employees.csv
                    

Finds minimum and maximum salaries

Count total employees

awk 'END {print NR-1}' employees.csv

Counts total employees (excluding header)

Department salary analysis

                        awk -F, 'NR>1 {dept[$2]+=$3; count[$2]++} END {for(d in dept) print d":", dept[d]/count[d]}' employees.csv
                    

Calculates average salary by department

Using csvkit for Advanced Processing

Installation

                        # Ubuntu/Debian
                        sudo apt install csvkit
                        
                        # Or via pip
                        pip install csvkit
                    

csvkit Examples

View CSV structure

csvstat employees.csv

Shows detailed statistics for each column

Extract columns by name

csvcut -c name,salary employees.csv

Extracts columns by header name

Filter rows

csvgrep -c department -m "Engineering" employees.csv

Filters rows where department matches "Engineering"

Sort CSV data

csvsort -c salary -r employees.csv

Sorts by salary in descending order

Convert to JSON

csvjson employees.csv

Converts CSV to JSON format

Complex Processing Examples

Data Cleaning Pipeline

                        # Remove duplicates, sort by salary, and format output
                        tail -n +2 employees.csv | \
                        sort -u | \
                        sort -t, -k3 -nr | \
                        awk -F, '{printf "%-15s %-12s $%s\n", $1, $2, $3}'
                    

Complete data cleaning and formatting pipeline

Report Generation

                        awk -F, '
                        NR>1 {
                            dept[$2]++; 
                            salary[$2]+=$3; 
                            total++; 
                            totalSalary+=$3
                        } 
                        END {
                            print "DEPARTMENT REPORT"
                            print "=================="
                            for(d in dept) {
                                printf "%-12s: %d employees, avg salary: $%.0f\n", 
                                       d, dept[d], salary[d]/dept[d]
                            }
                            print "=================="
                            printf "TOTAL: %d employees, avg salary: $%.0f\n", 
                                   total, totalSalary/total
                        }' employees.csv
                    

Generates a comprehensive department report

Data Validation

                        # Check for invalid salary values
                        awk -F, 'NR>1 && ($3 !~ /^[0-9]+$/ || $3 <= 0) {
                            print "Invalid salary on line " NR ": " $0
                        }' employees.csv
                    

Validates salary data for numeric values

Best Practices

CSV Processing Best Practices

Handle Headers - Use NR>1 in awk or tail -n +2 to skip headers
Quote Handling - Use csvkit for files with quoted fields containing commas
Field Validation - Always validate data before processing
Backup Data - Keep original files when modifying data
Pipeline Approach - Chain commands for complex operations
Error Handling - Check for empty fields and invalid data

Common Pitfalls

Avoid These Common Issues

Embedded Commas - Fields with commas need proper quoting
Different Delimiters - Some CSV files use semicolons or tabs
Header Handling - Remember to preserve or skip headers appropriately
Numeric Sorting - Use -n flag for numeric sorts
Character Encoding - Be aware of UTF-8 vs ASCII encoding issues

Linux CSV Processing