Downloading or streaming BBC podcasts from the Linux shell

I like listening to podcasts when I code. Not only do I get to learn something while I am doing mundane tasks, but it also acts as white noise when I am concentrating on the task at hand. Recently I wanted to download a podcast from the BBC as mp3 files onto my laptop so that I could listen to them offline during my commute. I would usually have done it manually, but with my newly learned shell knowledge I did it with this one-liner instead.

for i in {0..5}; do if [ $i = 0 ]; then i=""; else i="\?page\=$i"; fi; echo "$1$i"; done | xargs curl | grep -Po 'href="\K.*?(?=")' | grep "\.mp3" | grep "low" | xargs wget

The for loop runs six times, once for each page of the downloads section. The if/else builds the URL for each page: the first page is the bare address (written here as $1, the first argument of the script or function the one-liner lives in), and the later pages get the page number appended as ?page=N. The first xargs passes these URLs to curl, which fetches the contents of each page. The first grep pulls the target of every link out of the returned HTML, and the second and third greps keep only the links that contain “.mp3” and “low”. The second xargs turns the remaining links into arguments for wget, which downloads them to disk. Running the command therefore goes through all six pages and downloads the low-quality mp3 of every episode.
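The same pipeline can be split into two small functions, which makes each stage easier to test without touching the network. This is only a sketch: the function names page_urls and extract_mp3_links are made up, the base URL is a placeholder argument, and the link extraction uses grep -o with sed instead of grep -P so that it also runs where PCRE support is missing.

```shell
# Print one URL per page: the bare URL first, then ?page=1 ... ?page=N.
# $1 = base URL of the downloads page (placeholder), $2 = last page number.
page_urls() {
  pu_base=$1
  pu_last=$2
  pu_i=1
  echo "$pu_base"
  while [ "$pu_i" -le "$pu_last" ]; do
    echo "${pu_base}?page=${pu_i}"
    pu_i=$((pu_i + 1))
  done
}

# Read HTML on stdin and print the href values of the low-quality mp3 links.
extract_mp3_links() {
  grep -o 'href="[^"]*"' |        # every href="..." attribute
    sed 's/^href="//; s/"$//' |   # strip the attribute wrapper
    grep '\.mp3' |                # keep links containing a literal ".mp3"
    grep 'low'                    # keep the low-quality variants
}

# Usage (needs network access, so it is left commented out):
#   page_urls "$1" 5 | while read -r url; do curl -s "$url"; done |
#     extract_mp3_links | xargs wget
```

Keeping the URL generation and the link filtering separate means the filter can be tried on a saved copy of a page before anything is downloaded.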

As a bonus, we can then rename the resulting files based on the information in their ID3 tags, using the id3 utility.

id3 -f "%y_%t.mp3" *.mp3

This will rename all the files with the format “year_title.mp3”.

If we don't want to download them to disk but would rather send them to vlc as a playlist for streaming, we can use xargs vlc instead of xargs wget at the end.


Cleaning unescaped quotes from csv with regex and vim

Regular expressions (regex) are a really powerful tool when it comes to text manipulation. They are simple (!= easy), fully featured, extremely versatile and fast, and they are implemented on nearly every platform imaginable: programming languages, unix shell programs, text editors, etc. Recently I figured out an elegant way of cleaning csv files that have unescaped quotes in the data.

To cover the basics, character-separated values (csv) is a format for storing tabular data in a text file, where each record (row) is separated by a newline character and each field (column) is separated by a delimiter, usually a comma. This works well until we encounter data with a comma in it. For example, if a name has to be formatted as "last name, first name", we run into problems. To solve this we use a quote character that encloses every piece of textual data. This works for most data, but there are cases where the quote character itself appears within the data. This is where things get messy. Though this can be solved by escaping the quote character with a backslash (\), it is not always possible to introduce escape characters when you are collecting textual data using less structured methods, e.g. user input via forms, sensors, etc. I recently collected data where unescaped quotes were present within the data and there was no way of cleaning them at the source.
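To make the comma problem concrete, here is a minimal sketch. The sample record and the function name count_naive_fields are made up for illustration; the function splits on every comma and ignores quotes, which is what a naive reader would do.

```shell
# Count fields the naive way: split on every comma and ignore quotes.
count_naive_fields() {
  awk -F',' '{ print NF }'
}

# A made-up record with two quoted fields; the first contains a comma.
printf '%s\n' '"Doe, John","hello"' | count_naive_fields   # prints 3, not 2
```

A quote-aware reader would see two fields here, which is exactly why the quotes have to be unambiguous.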

Let us consider a set of data where we have encrypted messages posted by users on a forum. The example csv looks like this,


Since the data has already been collected, the problem is to identify and escape only the problematic quotes, that is, the ones inside a field rather than the ones that open or close it, before reading the data as a csv. The regex technique we use is negative lookahead and negative lookbehind, which vim implements as the “\@!” and “\@<!” operators (written “@!” and “@<!” in very magic mode) and which may be of variable length. The entire command for doing the cleaning is,

:%s/\v"(,"|\n)@!&(",|^)@<!"/\\"/gc

Detailed explanation of the command is,

: - start the vim command
    % - operate on the whole file
    s - substitute (search and replace)
        / - start of search pattern
        \v - use "very magic" regex syntax, so that ( ) | & and @ need no backslashes
                " - double quote
                ( - start of group
                    ," - comma_quote
                    | - or
                    \n - new line
                ) - end of group
                @! - not followed by the group
            & - and, at the same position,
                ( - start of group
                    ", - quote_comma
                    | - or
                    ^ - beginning of line
                ) - end of group
                @<! - not preceded by the group
                " - double quote
        / - end of search pattern and start of replace pattern
           \\" - replace with an escaped quote (\\ inserts one literal backslash)
        / - end of replace pattern
   g - replace all the matches in each line
   c - confirm with user for every replacement


The only case where this doesn’t work is when there is a quote-comma-quote (",") pattern inside the data. For now, I think that this cannot be fixed by regex alone and needs a different approach, such as counting the number of "," occurrences in each line and fixing the lines where that number is greater than expected, but I would be really happy to be proved wrong.
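The counting idea can be sketched with awk. This is a rough illustration under the assumption that every field is quoted and the number of columns is known; the function name flag_suspect_lines and the sample rows are made up. It only reports the suspicious lines and leaves the fixing to a human.

```shell
# Print "line number: line" for every line with more "," separators than expected.
# $1 = expected number of fields per row (assumed to be known in advance).
flag_suspect_lines() {
  awk -v expected="$1" '{
    line = $0
    n = gsub(/","/, "&", line)          # gsub returns how many "," it found
    if (n > expected - 1) print NR ": " $0
  }'
}

# Made-up data with three columns; the second row has "," inside a field.
printf '%s\n' '"a","b","c"' '"a","x","y","c"' | flag_suspect_lines 3
```

A row with N fields has N - 1 separators, so anything above that count points at a quote-comma-quote sequence hiding inside the data.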