Linux CSV Processing

CSV (Comma-Separated Values) processing is a common task in data analysis and system administration. Linux provides powerful command-line tools for manipulating, analyzing, and transforming CSV data efficiently.

Overview

Linux offers numerous tools for CSV processing, each with specific strengths:

  • cut - Simple column extraction
  • awk - Advanced field processing and calculations
  • sed - Text substitution and transformation
  • grep - Pattern matching and filtering
  • sort - Data sorting and organization
  • uniq - Duplicate removal and counting
  • csvkit - Specialized CSV toolkit

Sample CSV Data

For the examples below, we'll use this sample data file (employees.csv):

name,department,salary,years John Smith,Engineering,75000,5 Jane Doe,Marketing,65000,3 Bob Johnson,Engineering,80000,7 Alice Brown,Sales,60000,2 Charlie Wilson,Marketing,70000,4 Diana Davis,Engineering,85000,8

Basic CSV Operations

Column Extraction with cut

Extract specific columns

cut -d, -f1,3 employees.csv

Extracts name and salary columns

Extract column range

cut -d, -f2-4 employees.csv

Extracts columns 2 through 4

Skip header and extract columns

tail -n +2 employees.csv | cut -d, -f1,3

Skips the header row and extracts name and salary

Field Processing with awk

Print specific fields

awk -F, '{print $1, $3}' employees.csv

Prints name and salary with space separation

Add calculations

awk -F, 'NR>1 {print $1, $3*1.1}' employees.csv

Calculates 10% salary increase (skips header)

Format output

awk -F, 'NR>1 {printf "%-15s $%d\n", $1, $3}' employees.csv

Formats output with fixed-width columns

Conditional processing

awk -F, '$3 > 70000 {print $1, $2, $3}' employees.csv

Shows only employees with salary > $70,000

Advanced CSV Processing

Data Filtering and Searching

Filter by department

grep "Engineering" employees.csv

Shows all Engineering department employees

Case-insensitive search

grep -i "engineering" employees.csv

Case-insensitive department search

Multiple pattern search

grep -E "Engineering|Marketing" employees.csv

Shows employees from Engineering or Marketing

Data Sorting and Analysis

Sort by salary (numeric)

(head -1 employees.csv; tail -n +2 employees.csv | sort -t, -k3 -n)

Sorts by salary while preserving header

Sort by department then salary

(head -1 employees.csv; tail -n +2 employees.csv | sort -t, -k2,2 -k3,3n)

Multi-column sort: department alphabetically, then salary numerically

Count employees by department

tail -n +2 employees.csv | cut -d, -f2 | sort | uniq -c

Counts employees in each department

Data Transformation

Convert to uppercase

awk -F, '{print toupper($1), $2, $3, $4}' OFS=, employees.csv

Converts names to uppercase

Replace values

sed 's/Engineering/Tech/g' employees.csv

Replaces "Engineering" with "Tech"

Add new calculated column

awk -F, 'NR==1 {print $0",bonus"} NR>1 {print $0","$3*0.1}' employees.csv

Adds a bonus column (10% of salary)

Statistical Analysis

Summary Statistics

Calculate average salary

awk -F, 'NR>1 {sum+=$3; count++} END {print "Average:", sum/count}' employees.csv

Calculates the average salary

Find min and max salary

awk -F, 'NR>1 {if(min=="" || $3max) max=$3} END {print "Min:", min, "Max:", max}' employees.csv

Finds minimum and maximum salaries

Count total employees

awk 'END {print NR-1}' employees.csv

Counts total employees (excluding header)

Department salary analysis

awk -F, 'NR>1 {dept[$2]+=$3; count[$2]++} END {for(d in dept) print d":", dept[d]/count[d]}' employees.csv

Calculates average salary by department

Using csvkit for Advanced Processing

Installation

# Ubuntu/Debian sudo apt install csvkit # Or via pip pip install csvkit

csvkit Examples

View CSV structure

csvstat employees.csv

Shows detailed statistics for each column

Extract columns by name

csvcut -c name,salary employees.csv

Extracts columns by header name

Filter rows

csvgrep -c department -m "Engineering" employees.csv

Filters rows where department matches "Engineering"

Sort CSV data

csvsort -c salary -r employees.csv

Sorts by salary in descending order

Convert to JSON

csvjson employees.csv

Converts CSV to JSON format

Complex Processing Examples

Data Cleaning Pipeline

# Remove duplicates, sort by salary, and format output tail -n +2 employees.csv | \ sort -u | \ sort -t, -k3 -nr | \ awk -F, '{printf "%-15s %-12s $%s\n", $1, $2, $3}'

Complete data cleaning and formatting pipeline

Report Generation

awk -F, ' NR>1 { dept[$2]++; salary[$2]+=$3; total++; totalSalary+=$3 } END { print "DEPARTMENT REPORT" print "==================" for(d in dept) { printf "%-12s: %d employees, avg salary: $%.0f\n", d, dept[d], salary[d]/dept[d] } print "==================" printf "TOTAL: %d employees, avg salary: $%.0f\n", total, totalSalary/total }' employees.csv

Generates a comprehensive department report

Data Validation

# Check for invalid salary values awk -F, 'NR>1 && ($3 !~ /^[0-9]+$/ || $3 <= 0) { print "Invalid salary on line " NR ": " $0 }' employees.csv

Validates salary data for numeric values

Best Practices

CSV Processing Best Practices
  • Handle Headers - Use NR>1 in awk or tail -n +2 to skip headers
  • Quote Handling - Use csvkit for files with quoted fields containing commas
  • Field Validation - Always validate data before processing
  • Backup Data - Keep original files when modifying data
  • Pipeline Approach - Chain commands for complex operations
  • Error Handling - Check for empty fields and invalid data

Common Pitfalls

Avoid These Common Issues
  • Embedded Commas - Fields with commas need proper quoting
  • Different Delimiters - Some CSV files use semicolons or tabs
  • Header Handling - Remember to preserve or skip headers appropriately
  • Numeric Sorting - Use -n flag for numeric sorts
  • Character Encoding - Be aware of UTF-8 vs ASCII encoding issues

See also