Tools to manipulate CSV files from the Command Line
Cleaning data and manually processing CSV files to make them readable to scripts and other applications is a tedious task.
There are some nice tools to manipulate CSV files from the command line in Linux, which can ease some of the burden and even provide lightweight alternatives to R and other analysis tools.
The following is a list of the tools I am aware of, together with their main characteristics: an “x” means the function is covered, “-” means partial coverage, “?” means that I did not check:
Name | Check | Clean | Stats | Math | Pivot | Transpose | Select | Query | Sort | Join | Convert from Excel |
---|---|---|---|---|---|---|---|---|---|---|---|
datamash | x | - | x | x | x | x | x | ||||
xsv | ? | x | x | x | x | x | x | x | |||
miller | - | x | x | x | x | x (xtab) | x | x (expr) | x | x | |
csvkit | ? | x | x | x | x (sql) | x | x | x | |||
pivot | ? | x | |||||||||
q | ? | x (sql) |
where:
- Check: check file structure
- Clean: file “cleaning”, such as changing separators, field delimiters, pretty printing tables, aligning fields
- Stats: computation of descriptive statistics (min, max, average, variance, first and third quartile, …)
- Math: computation on columns (sum, count, uniq, average, …)
- Pivot: pivoting data
- Transpose: transposing rows and columns
- Query: performing queries and selecting data1
- Sort: sorting data2
- Join: joining two CSV file over the value of a column
- Convert Excel: conversion of Excel files
Recommendations
If you want to use these tools extensively, the quirks you need to pay attention to include:
- managing input files formats, especially for specific cases, such as fields surrounded by quotes, field separators appearing in quoted fields
- management of empty fields
- management of “N/A” when piping commands
I usually end up using datamash, mainly because of their simple and clean syntax. Sometimes I use xsv to preprocess the input file. Recently, however, I started appreciating miller, which is very flexible, feature rich, and robust in managing input file formats.
See A quick tutorial on Miller for more details.