After pulling in data from various sources, I recently found myself staring at a long list of domain names combined with meta information about each.
No apparent order, and I noticed at least some of them were duplicated. I wanted to change that.
```
# unordered.txt, ~hundreds of lines
d-domain.com, miscellaneous other info
c-domain.com, miscellaneous other info
a-domain.com, miscellaneous other info
c-domain.com, miscellaneous other info
b-domain.com, miscellaneous other info
```
UNIX tools to the rescue: Cleaning up this mess took only one line of code in my Terminal, and I didn’t even have to remember any parameters or pull up a manual page for it:
```
cat unordered.txt | sort | uniq > ordered.txt
```
What did we do here?
- The cat utility reads files sequentially, writing them to the standard output.
- The pipe (|) chains processes together by their standard streams, so that the output of each process feeds directly as input to the next one.
- The sort utility sorts text and binary files by lines.
- The sorted list gets piped into uniq.
- The uniq utility reads the specified input, comparing adjacent lines, and writes a copy of each unique input line to the output.
- The redirection operator (>) directs the output of a command to a file on disk, truncating the file first if it already exists.
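As a side note, `sort` can de-duplicate on its own: its `-u` flag keeps only the first of equal lines, and `sort` accepts a filename directly, so the `cat` and `uniq` steps can be folded in. A minimal sketch, using sample file names and contents for illustration:

```shell
# Build a small sample input (contents are illustrative).
printf 'd-domain.com\nc-domain.com\na-domain.com\nc-domain.com\nb-domain.com\n' > unordered.txt

# sort -u sorts and drops duplicate lines in one step,
# equivalent to `cat unordered.txt | sort | uniq` here.
sort -u unordered.txt > ordered.txt

cat ordered.txt
```

Both spellings produce the same result; the pipeline version just makes each step explicit.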
Looking at the resulting file, it’s all neat and tidy:
```
# ordered.txt, far fewer lines, in alphabetical order
a-domain.com, miscellaneous other info
b-domain.com, miscellaneous other info
c-domain.com, miscellaneous other info
d-domain.com, miscellaneous other info
```
It’s already there!
Quite often, the tasks we want to get done are made up of simple steps. Most data-processing jobs have already been solved by a UNIX tool in one way or another; it’s just a matter of finding the right tool (google “UNIX” plus the function you’re looking for) and chaining the tools together.
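To illustrate the chaining idea, here is one more pipeline built from the same family of tools. It answers a slightly different question about a list like this one: which entries appear most often? The file name and contents are made up for the example; `uniq -c` prefixes each line with its repeat count, and a second, numeric `sort` ranks the counts:

```shell
# Sample input (hypothetical contents).
printf 'c-domain.com\na-domain.com\nc-domain.com\nb-domain.com\nc-domain.com\n' > domains.txt

# sort groups duplicate lines together, uniq -c counts each run,
# sort -rn ranks by count (highest first), head keeps the top entries.
sort domains.txt | uniq -c | sort -rn | head -n 3
```

Note that `uniq` only compares adjacent lines, which is why the initial `sort` is essential: without it, repeated entries scattered through the file would each be counted separately.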