There's an old story about two senior architects that were friends in college, and met again thirty years later. After a few minutes they started talking about their favorite achievements. The first described office towers, airports, and universities he was quite proud of. The second didn't have any monuments to talk about, but shared that he thought he may have designed the perfect chair.t chair. Clearly trumped, his friend congratulated him, and asked to hear more - since the perfect chair is far more significant than yet another monument.
Sometimes, I feel that small unix utilities are to a programmer what a chair is to an architect: they continue to be essential, and are typically small, spare, do just a single thing and can clearly show elegance.
I've written quite a number of them, and have recently started packaging those related to data analysis into a project called DataGristle. My favorite utility of the set is gristle_slicer - a tool similar to the Unix program cut. While cut allows the user to select columns out of a file, gristle_slicer selects columns and rows - and uses the more functional Python string slicing syntax to do it.
It's no perfect chair but it might be a good utility.
How it works
The user provides 4 optional specifications:- -r or --records (record inclusion criteria)
- -R or --exrecords (record exclusion criteria)
- -c or --columns (column inclusion criteria)
- -C or --excolumns (column exclusion criteria)
- -r "3,5,20:30" - will include records 3,5 and those from 20 to 29 (based on 0 offset)
- -r ":15" - will include all records from the beginning to 14
- -c ":-1" - will include all columns except for the last one
Beside the specifications there's other options as well. Because it uses the internal DataGristle libraries - it will figure out the csv file delimiter and quoting by itself. And it's right about 95% of the time. In case it isn't, there are options to manually override it:
- -d '|' - allows user to explicitly identify a single-column delimiter
- -q <quotetype> - allows user to explicitly identify quoting type - using one of quote_none, quote_all, quote_minimal, or quote_nonnumeric
Lets look at some examples
Lets assume a seven-row, seven-column, pipe-delimited csv file called "sample.csv" that looks like this:
0-0|0-1|0-2|0-3|0-4|0-5|0-6
1-0|1-1|1-2|1-3|1-4|1-5|1-6
2-0|2-1|2-2|2-3|2-4|2-5|2-6
3-0|3-1|3-2|3-3|3-4|3-5|3-6
4-0|4-1|4-2|4-3|4-4|4-5|4-6
5-0|5-1|5-2|5-3|5-4|5-5|5-6
6-0|6-1|6-2|6-3|6-4|6-5|6-6
Each field has a number that indicates both the record and the field - in both cases they are based on a zero-offset. This is for the sake of convenience since gristle_slicer's specifications are also based on zero-offsets.
But from here on forward we'll depict it as a table.
If we want to see the entire table we can simply cat it, or we could use gristle_slicer like this:
$ gristle_slicer sample.csv
Which defaults to print all columns & rows:
1. Select just the first row
$ gristle_slicer sample.csv -r 0 -d'|'We'll provide the delimiter since this small file with consistent dashes can throw off the automatic detection. And it will write the high-lighted line below to stdout:
2. Select just the first column
$ gristle_slicer sample.csv -c 0 -d'|'Again, this will write the high-lighted column below to stdout:
3. Select the intersection of the first row and first column:
$ gristle_slicer sample.csv -r 0 -c 0 -d'|'4. Select the first 4 rows:
$ gristle_slicer sample.csv -r 0:3 -d'|'5. Select the first 4 rows but exclude the 2nd
$ gristle_slicer sample.csv -r 0:4 -R 1 -d'|'6. Select first and last column on the first and last rows
This can be done by including the first & last row and the first & last column:
$ gristle_slicer sample.csv -r 0,-1 -c 0, -1 -d'|'
Or by including the first & last rows and excluding the column range between the 2nd and 2nd to the last column:
$ gristle_slicer sample.csv -r 0, -1 -C 1:-1 -d'|'
7. Select the middle
There's lots of different ways to do this one. The first is to just include the 2nd to 2nd-to-last rows and 2nd to 2nd-to-last columns:$ gristle_slicer sample.csv -r 1:-1 -c 1:-1 -d'|'
Another way is to include all rows and columns (it's implied if you leave off -r & -c) and just exclude the two you don't want:
$ gristle_slicer sample.csv -R 0,-1 -C 0,-1 -d '|'
8. Select rows 2 & 4, columns 2, 4 through 6
One way to do this is to just include rows 2-4, exclude row 3, then include columns 2 and 4 through 6:$ gristle_slicer -r 2:5 -R 4 -c2,4:7 -d '|'
Another way is to just specifically include rows and columns you need:
$ gristle_slicer -r 2,4 -c 2,4,5,6 -d '|'
Another way is to combine multiple commands:
$ gristle_slicer -r 2,4 -d'|' | gristle_slicer -c 2,4,5,6 -d '|'
Next Steps
I'm probably done with this for a while. But eventually, I'll probably return and add Python slicing step logic to allow the user to pick every N item. I might modify it to work on files without delimiters. And I might upgrade it to allow columns to be referenced by their names in the headers.
neat !
ReplyDeleteHi Ken,
ReplyDeleteI have been following your blog and found your thoughts really useful. Although i have seen you writing about datagristle before, i didn't look at it closely until recently when i had a need to analyse a few csv files quickly for a datasource from which we needed to extract data and create a data model. I found it really useful and plan to use it. I particularly found gristle_determinator very useful to analyze the csv files and quickly understand the data. Do you have a script in datagristle to modify the format of a file by adding quotes around fields, if they don't exist, remove certain characters from the fields based on input and other simple manipulations.
Thanks,
Arun.
Hey thanks for your comments Arun. I really enjoy writing tools like this so I hope you continue to find it useful.
ReplyDeleteI've got a few other tools in the set for doing different common operations. They might help with some of your needs, but probably not all.
The gristle_file_converter can change a file's csv dialect. I haven't listed it as one of my "mature" utilities yet - since I think I need to really go over it, clean it up more, etc. But I still use it myself. Here's an example of changing a csv without quotes and using a pipe-delimiter to using a comma delimiter and having all fields quoted:
$ gristle_file_converter ../data/7x7.csv --delimiter '|' --outdelimiter ',' --outquoting
"0-0","0-1","0-2","0-3","0-4","0-5","0-6"
"1-0","1-1","1-2","1-3","1-4","1-5","1-6"
"2-0","2-1","2-2","2-3","2-4","2-5","2-6"
"3-0","3-1","3-2","3-3","3-4","3-5","3-6"
"4-0","4-1","4-2","4-3","4-4","4-5","4-6"
"5-0","5-1","5-2","5-3","5-4","5-5","5-6"
"6-0","6-1","6-2","6-3","6-4","6-5","6-6"
There are quite a few quoting dialects, and the python csv modules misses a few of the more obscure ones.
Regards,
Ken
Hi I found the documentation to datagristle and was really excited.. we've been having a process that fails everytime we get a bad file, and data validator will make my life so much better. i'm working on anaconda, did a pip install, import gristle works, and then gristle_validator 'C:\Users\Ananya\Desktop\Data Challenge\institutions_data.csv' -f 3 keeps failing saying syntax error on ipython and with 'gristle validator not defined' on anaconda prompt. any help here pointing out what i'm doing wrong will help much. thanks!
ReplyDeleteDatagristle doesn't use ipython, so it sounds like the issue is with the environment.
DeleteHowever, datagristle doesn't get tested against windows - so it could run into snags on that os.
i can't find gristle_slicer after importing gristle
ReplyDeleteIt should be located wherever pip is installing scripts for your os and possibly virtualenv.
Delete