In a spirit of exploration, and because he mentioned it again in a recent 20-year roundup, I decided to see what I could do with Tim Bray’s Topfew. Not the actual Topfew, because that’s beyond me, but the original incantation:

awk '{print $1}' access_log | sort | uniq -c | sort -rn | head -12

The idea was to see in a vague kind of way what visitors to my site(s) are looking for, beyond the usual soft targets.[1] I’m already downloading daily logfiles to look at with Bise; Topfew offered a way to get a bit more granular still without caring about anything personal. It proved simple to whip up a quick little script that concatenates all the logfiles, does the needful and writes out the results:

cat access.* > totals
awk '{print $7}' totals | sort | uniq -c | sort -rn | head -50 > top50.txt
rm totals
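As an aside, awk reads every file named on its command line in turn, so the temporary totals file is not strictly needed. A minimal sketch of the same tally without it; the function name top_paths is made up for illustration, and it assumes the logfiles are plain, uncompressed text with the request path in field 7:

```shell
#!/bin/sh
# top_paths N FILE...
# Print the N most frequent values of field 7 across all the given files.
# awk works through the files one after another, so nothing has to be
# concatenated first and there is no temp file to clean up.
top_paths() {
    n=$1
    shift
    awk '{print $7}' "$@" | sort | uniq -c | sort -rn | head -n "$n"
}

# Usage (assumes logfiles named access.* in the current directory):
#   top_paths 50 access.* > top50.txt
```

The output has the same shape as before: a count, then the path, busiest first.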

A problem arose when I tried to do the same for all the logfiles I have, being a glob cleverclogs. The resulting file was huge (>6GB), and the script spluttered to a halt complaining about an “illegal byte sequence”. Fair enough, but how to find it?
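For context, “illegal byte sequence” is the standard message when a tool running in a UTF-8 locale meets bytes that are not valid UTF-8. If all that is wanted is the tally, one common way to sidestep the hunt is to run the pipeline in the C locale, where every byte is treated as a character in its own right. A sketch only, assuming the complaint comes from one of the locale-aware tools in the pipeline; top_paths_bytes is a hypothetical name:

```shell
#!/bin/sh
# top_paths_bytes N FILE...
# Same tally as the script above, but with LC_ALL=C exported for every
# tool in the pipeline, so input is compared byte by byte rather than
# decoded as UTF-8. The body runs in a subshell, which keeps the locale
# change from leaking into the calling shell.
top_paths_bytes() (
    LC_ALL=C
    export LC_ALL
    n=$1
    shift
    awk '{print $7}' "$@" | sort | uniq -c | sort -rn | head -n "$n"
)

# Usage (assumes logfiles named access.* in the current directory):
#   top_paths_bytes 50 access.* > top50.txt
```

The stray bytes are still in the data, and any path containing them shows up in the results as-is; this only stops the tools from choking on them.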

One tip said to try nl file.txt | sort, because nl adds line numbers, so when sort blows up it will tell you where the error is. That did not work.

Another said to try iconv file.txt. That did work, blowing up on the offending line.
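iconv stops at the first bad sequence it meets. To list every suspect line in one pass, a rough alternative is to have grep flag any byte outside printable ASCII and whitespace. This is a sketch under a loud assumption: it is broader than “invalid UTF-8”, because it also flags perfectly valid accented characters, which is usually tolerable for logfiles that are meant to be ASCII. The name non_ascii_lines is invented for the example:

```shell
#!/bin/sh
# non_ascii_lines FILE
# Print "LINENUMBER:line" for each line containing a byte that is neither
# printable ASCII nor whitespace.
#   LC_ALL=C  makes grep treat the input as plain bytes
#   -a        treats the file as text even if it looks binary
#   -n        prefixes each match with its line number
non_ascii_lines() {
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}

# Usage (file.txt stands in for the big concatenated file):
#   non_ascii_lines file.txt | cut -d: -f1    # line numbers only
```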

Of course, loading the 6GB file into BBEdit and looking for line 15049548 posed its own set of problems. But sure enough, there they were! The dreaded red upside-down question marks in BBEdit. Sorting that out will take a bit longer.
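Once the line number is known, the line can also be pulled out on its own without opening the whole file in an editor. A minimal sketch, with show_line as a hypothetical helper: sed prints the requested line and quits straight away, and cat -v makes any non-printing bytes visible.

```shell
#!/bin/sh
# show_line N FILE
# Print line N of FILE with non-printing bytes made visible.
#   sed -n "N{p;q;}"  prints line N, then quits instead of reading on
#   cat -v            shows control and high-bit bytes in ^X / M-x notation
# LC_ALL=C keeps both tools working on raw bytes.
show_line() {
    LC_ALL=C sed -n "$1{p;q;}" "$2" | LC_ALL=C cat -v
}

# Usage (file.txt stands in for the big concatenated file):
#   show_line 15049548 file.txt
```

sed still has to read through the file to reach that line, but it streams rather than loading everything into memory.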


  1. I’ll write more about that later, when I have actually looked. 
