Linux CSV Processing
CSV (Comma-Separated Values) processing is a common task in data analysis and system administration. Linux provides powerful command-line tools for manipulating, analyzing, and transforming CSV data efficiently.
Overview
Linux offers numerous tools for CSV processing, each with specific strengths:
- cut - Simple column extraction
- awk - Advanced field processing and calculations
- sed - Text substitution and transformation
- grep - Pattern matching and filtering
- sort - Data sorting and organization
- uniq - Removal and counting of adjacent duplicate lines (normally used after sort)
- csvkit - Specialized CSV toolkit
Sample CSV Data
For the examples below, we'll use this sample data file (employees.csv):
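The listing for employees.csv is not present in this text, so the block below creates a stand-in file. The column layout (id,name,department,salary) and every row are illustrative assumptions, chosen so that name, department, and salary fall in columns 2 through 4.

```shell
# Hypothetical sample data: the layout and all rows are assumptions.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

# Show the header and the first two data rows.
head -n 3 employees.csv
```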
Basic CSV Operations
Column Extraction with cut
Extract specific columns
Extracts name and salary columns
Extract column range
Extracts columns 2 through 4
Skip header and extract columns
Skips the header row and extracts name and salary
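The three cut operations above might look like the following sketch. It assumes a hypothetical employees.csv with the layout id,name,department,salary, so the field numbers (2 for name, 4 for salary) are assumptions rather than part of the original data.

```shell
# Recreate the hypothetical sample file so this block runs on its own.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

# Extract specific columns: name (field 2) and salary (field 4).
cut -d',' -f2,4 employees.csv

# Extract a column range: fields 2 through 4.
cut -d',' -f2-4 employees.csv

# Skip the header row, then extract name and salary.
tail -n +2 employees.csv | cut -d',' -f2,4
```

cut splits on every comma and does not understand quoted fields, so this only suits files without embedded commas.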
Field Processing with awk
Print specific fields
Prints name and salary with space separation
Add calculations
Calculates 10% salary increase (skips header)
Format output
Formats output with fixed-width columns
Conditional processing
Shows only employees with salary > $70,000
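A minimal sketch of the four awk operations above, again assuming the hypothetical id,name,department,salary layout (name in $2, department in $3, salary in $4):

```shell
# Recreate the hypothetical sample file so this block runs on its own.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

# Print name and salary; the comma in print inserts a single space.
awk -F',' '{print $2, $4}' employees.csv

# Calculate a 10% salary increase, skipping the header with NR>1.
awk -F',' 'NR>1 {printf "%s,%.2f\n", $2, $4 * 1.10}' employees.csv

# Format output as fixed-width columns (left-aligned text, right-aligned salary).
awk -F',' '{printf "%-15s %-12s %8s\n", $2, $3, $4}' employees.csv

# Conditional processing: salary above 70000 ($4+0 forces a numeric comparison).
awk -F',' 'NR>1 && $4+0 > 70000 {print $2, $4}' employees.csv
```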
Advanced CSV Processing
Data Filtering and Searching
Filter by department
Shows all Engineering department employees
Case-insensitive search
Case-insensitive department search
Multiple pattern search
Shows employees from Engineering or Marketing
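The filtering steps above could be written with grep roughly as follows. The sample file and its layout are assumptions. Because grep matches anywhere on the line, the last command shows an awk alternative that tests only the assumed department column.

```shell
# Recreate the hypothetical sample file so this block runs on its own.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

# Filter by department.
grep 'Engineering' employees.csv

# Case-insensitive search.
grep -i 'engineering' employees.csv

# Multiple patterns with extended regular expressions.
grep -E 'Engineering|Marketing' employees.csv

# Stricter variant: match column 3 exactly and keep the header row.
awk -F',' 'NR==1 || $3 == "Engineering"' employees.csv
```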
Data Sorting and Analysis
Sort by salary (numeric)
Sorts by salary while preserving header
Sort by department then salary
Multi-column sort: department alphabetically, then salary numerically
Count employees by department
Counts employees in each department
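One way to sketch the sorting and counting steps above, assuming the hypothetical layout with department in column 3 and salary in column 4. The header is printed first and only the data rows are passed to sort.

```shell
# Recreate the hypothetical sample file so this block runs on its own.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

# Sort by salary (numeric) while preserving the header.
(head -n 1 employees.csv; tail -n +2 employees.csv | sort -t',' -k4,4n)

# Sort by department alphabetically, then by salary numerically.
(head -n 1 employees.csv; tail -n +2 employees.csv | sort -t',' -k3,3 -k4,4n)

# Count employees by department; uniq -c needs sorted input.
tail -n +2 employees.csv | cut -d',' -f3 | sort | uniq -c
```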
Data Transformation
Convert to uppercase
Converts names to uppercase
Replace values
Replaces "Engineering" with "Tech"
Add new calculated column
Adds a bonus column (10% of salary)
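The transformations above might be sketched as follows, under the same assumed layout. None of these commands modifies employees.csv; each writes its result to standard output.

```shell
# Recreate the hypothetical sample file so this block runs on its own.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

# Convert names (field 2) to uppercase, leaving the header untouched.
awk -F',' -v OFS=',' 'NR>1 {$2 = toupper($2)} 1' employees.csv

# Replace "Engineering" with "Tech" wherever it appears.
sed 's/Engineering/Tech/g' employees.csv

# Add a bonus column equal to 10% of salary.
awk -F',' 'NR==1 {print $0 ",bonus"; next} {printf "%s,%.2f\n", $0, $4 * 0.10}' employees.csv
```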
Statistical Analysis
Summary Statistics
Calculate average salary
Calculates the average salary
Find min and max salary
Finds minimum and maximum salaries
Count total employees
Counts total employees (excluding header)
Department salary analysis
Calculates average salary by department
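The summary statistics above can be sketched with awk accumulators. The sample file, its layout, and the output labels are assumptions made for illustration.

```shell
# Recreate the hypothetical sample file so this block runs on its own.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

# Average salary (guarding against a file with no data rows).
awk -F',' 'NR>1 {sum += $4; n++} END {if (n) printf "Average: %.2f\n", sum / n}' employees.csv

# Minimum and maximum salary, seeded from the first data row.
awk -F',' 'NR==2 {min = max = $4 + 0}
           NR>2  {if ($4 + 0 < min) min = $4 + 0; if ($4 + 0 > max) max = $4 + 0}
           END   {print "Min:", min, "Max:", max}' employees.csv

# Total employees, excluding the header.
tail -n +2 employees.csv | wc -l

# Average salary by department, using arrays keyed on column 3.
awk -F',' 'NR>1 {sum[$3] += $4; n[$3]++}
           END  {for (d in sum) printf "%s: %.2f\n", d, sum[d] / n[d]}' employees.csv | sort
```

The final sort is there because awk's for-in loop visits array keys in no guaranteed order.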
Using csvkit for Advanced Processing
Installation
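csvkit is a Python package, so pip is one common way to install it. The sketch below only checks whether the csvcut command is already on the PATH and prints a suggested install command if it is not. The variable name csvkit_status is a placeholder introduced for this example.

```shell
# Check for csvkit by looking for one of its commands (csvcut).
if command -v csvcut >/dev/null 2>&1; then
    csvkit_status="csvkit found at $(command -v csvcut)"
else
    csvkit_status="csvkit not found; try: python3 -m pip install csvkit"
fi
echo "$csvkit_status"
```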
csvkit Examples
View CSV structure
Shows detailed statistics for each column
Extract columns by name
Extracts columns by header name
Filter rows
Filters rows where department matches "Engineering"
Sort CSV data
Sorts by salary in descending order
Convert to JSON
Converts CSV to JSON format
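The csvkit operations above correspond to the csvstat, csvcut, csvgrep, csvsort, and csvjson commands. The sketch below assumes csvkit is installed and uses the hypothetical employees.csv layout; the column names name, department, and salary are therefore assumptions.

```shell
# Recreate the hypothetical sample file so this block runs on its own.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

if command -v csvcut >/dev/null 2>&1; then
    csvstat employees.csv                                # per-column statistics
    csvcut -c name,salary employees.csv                  # extract columns by header name
    csvgrep -c department -m Engineering employees.csv   # rows whose department matches
    csvsort -c salary -r employees.csv                   # sort by salary, descending
    csvjson employees.csv                                # convert to JSON
else
    echo "csvkit is not installed; skipping the examples" >&2
fi
```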
Complex Processing Examples
Data Cleaning Pipeline
Complete data cleaning and formatting pipeline
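The original pipeline is not shown here, so the following is only one plausible shape for it. It strips carriage returns, drops blank lines, trims whitespace around fields, removes duplicate rows, and sorts by the first column. The input file raw.csv, its contents, and the output name clean.csv are all hypothetical.

```shell
# Hypothetical messy input: CRLF line endings, padded fields,
# a blank line, and a duplicated row.
printf 'id,name,department,salary\r\n2, Bob Jones ,Marketing,62000\r\n\r\n1,Alice Smith,Engineering,85000\r\n2, Bob Jones ,Marketing,62000\r\n' > raw.csv

tr -d '\r' < raw.csv |
awk -F',' -v OFS=',' '
    NF == 0 { next }    # drop blank lines
    { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i) }    # trim each field
    !seen[$0]++         # keep only the first copy of each row
' |
{
    IFS= read -r header          # pass the header through unsorted
    printf '%s\n' "$header"
    sort -t',' -k1,1n            # sort data rows by numeric id
} > clean.csv

cat clean.csv
```

Writing to a new file rather than editing raw.csv in place keeps the original available if a step turns out to be wrong.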
Report Generation
Generates a comprehensive department report
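A department report could be sketched as a small shell function around awk. The function name department_report, the report columns (head count, average salary, highest salary), and the sample data are assumptions, not the original report.

```shell
# Recreate the hypothetical sample file so this block runs on its own.
cat > employees.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,62000
3,Carol White,Engineering,92000
4,Dan Brown,Sales,58000
5,Eve Black,Marketing,71000
EOF

# department_report FILE: one line per department, sorted by name.
department_report() {
    printf '%-12s %5s %10s %10s\n' 'Department' 'Count' 'Average' 'Highest'
    awk -F',' '
        NR > 1 {
            count[$3]++
            total[$3] += $4
            if (!($3 in max) || $4 + 0 > max[$3]) max[$3] = $4 + 0
        }
        END {
            for (d in count)
                printf "%-12s %5d %10.2f %10d\n", d, count[d], total[d] / count[d], max[d]
        }
    ' "$1" | sort
}

department_report employees.csv
```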
Data Validation
Validates salary data for numeric values
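A minimal validation sketch: report every data row whose salary field is not a plain non-negative number, and return a non-zero exit status if any are found. The function name validate_salaries, the file employees_bad.csv, and the assumption that salary is column 4 are all hypothetical.

```shell
# Hypothetical input with one non-numeric and one empty salary.
cat > employees_bad.csv <<'EOF'
id,name,department,salary
1,Alice Smith,Engineering,85000
2,Bob Jones,Marketing,N/A
3,Carol White,Engineering,
EOF

# validate_salaries FILE: list invalid salaries; exit status 1 if any exist.
validate_salaries() {
    awk -F',' '
        NR > 1 && $4 !~ /^[0-9]+(\.[0-9]+)?$/ {
            printf "line %d: invalid salary \"%s\"\n", NR, $4
            bad++
        }
        END { exit (bad > 0 ? 1 : 0) }
    ' "$1"
}

validate_salaries employees_bad.csv || echo "validation failed"
```

The exit status makes the check usable as a gate in scripts, for example before running the calculations on the file.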
Best Practices
CSV Processing Best Practices
- Handle Headers - Use NR>1 in awk or tail -n +2 to skip headers
- Quote Handling - Use csvkit for files with quoted fields containing commas
- Field Validation - Always validate data before processing
- Backup Data - Keep original files when modifying data
- Pipeline Approach - Chain commands for complex operations
- Error Handling - Check for empty fields and invalid data
Common Pitfalls
Avoid These Common Issues
- Embedded Commas - Fields with commas need proper quoting, and cut, awk -F',' and sort -t',' split on every comma regardless of quotes
- Different Delimiters - Some CSV files use semicolons or tabs
- Header Handling - Remember to preserve or skip headers appropriately
- Numeric Sorting - Use sort's -n flag for numeric columns; without it values are compared as text, so 100000 sorts before 62000
- Character Encoding - Be aware of encoding issues, such as non-ASCII characters in UTF-8 files or a byte-order mark before the first header name
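Two of the pitfalls above are easy to reproduce. The sketch below uses two hypothetical files, quoted.csv and semi.csv, to show how a quoted field with an embedded comma shifts cut's field numbering, and how a semicolon-delimited file needs a different delimiter.

```shell
# Embedded comma: the quoted name contains a comma.
printf '%s\n' 'id,name,department,salary' '1,"Smith, Alice",Engineering,85000' > quoted.csv

# cut counts the comma inside the quotes, so field 4 of the data row
# is the department rather than the salary.
cut -d',' -f4 quoted.csv

# Different delimiter: pass the real separator to -d.
printf '%s\n' 'id;name;salary' '1;Alice Smith;85000' > semi.csv
cut -d';' -f3 semi.csv
```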