
Revamping Text Files with awk and sed

  
  
  

Many things get better with age, like your beloved author of this article, and also the venerable Unix commands awk and sed, which are the ultimate text-file mixmasters. With these tools you can do useful tasks such as surgically change case, search and replace in multiple files, rearrange columns, and add and remove line numbers.



Awk was created in 1977 by Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan at Bell Labs for Unix. The name comes from the initials of their last names. On most Linux distributions, awk is really gawk, the GNU Project's version of awk, although some distributions ship a different implementation such as mawk. Awk's main ability is rearranging and extracting data from organized text files.



The sed (stream editor) programming language was created in 1973, also at Bell Labs, by Lee McMahon. Sed provides intricate ways of finding and doing things to arbitrary text strings.



Both awk and sed are programming languages capable of complex, multi-file operations. These days they're mostly used in short scripts and one-liners. Perl fans have been saying for years that Perl will supplant sed and awk, but the older utilities are fast, useful, and easy to learn, so they're not going away. Let's look at some examples that show how useful they are.



Making QIF, OFX, or QFX Bank Files Not Shout



Most financial institutions let you download account information in QIF, OFX, or QFX files. These formats are plain text files that look something like this OFX file:




<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20111124000000[-6:EST]
<TRNAMT>-25.44
<FITID>3456

<NAME>FOO MERCHANT SANTA
<MEMO>FOO MERCHANT SANTA FE 11-24-11 567567
</STMTTRN>


When you import this data into your accounting software, the text is all caps and shouts at you. My finances are thrilling enough without all the yelling, so with the help of some friends I came up with a little script to convert the NAME and MEMO values from all caps to title case. The file presents some interesting challenges because it has a header and footer that don't need to be touched, while the field labels must stay uppercase, and the whole file suffers from eccentric white-spacing. But that's nothing to good old sed, which handles this sort of thing with careless ease:



$ sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g' filename-in.ofx > filename-out.ofx



Let's break this down a piece at a time, because this is a great lesson in sed.



The second part of the command is a single complex filter enclosed in single quotes. One of the most basic sed incantations is a simple find and replace, which looks like this:



$ sed 's/this/that/g'


s means substitute, g means global, and slashes are separators. So the expression means "substitute all instances of this with that." That's nice and simple. But the filter for the OFX files is more complex because it handles whitespace, mixed-case text, and arbitrary words. A lot of it is mitigating the limitations of the typewriter keyboard and trying to use the same characters both for commands and text strings, so the expression is full of backslashes. These are not shell escapes: the single quotes already stop the shell from interpreting anything inside them. The backslashes belong to sed's own regular expression syntax. In sed's default basic regular expressions, plain parentheses and the pipe symbol match themselves literally, and putting a backslash in front of them turns them into grouping and alternation operators (\| is a GNU extension). If sed didn't need the backslashes, the first section of the filter would look like (NAME|MEMO). So the \(NAME\|MEMO\) part is an address that means "apply the command that follows only to lines containing either NAME or MEMO, in uppercase."
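The two ideas in this paragraph, the g flag and the line-selecting first section of the filter, can be tried separately on throwaway input. This is a minimal sketch that assumes GNU sed (for \|); the variable names are made up for illustration.

```shell
# Without g, only the first match on each line is replaced.
first_only=$(printf 'this and this\n' | sed 's/this/that/')
# With g, every match on the line is replaced.
all_matches=$(printf 'this and this\n' | sed 's/this/that/g')
printf '%s\n%s\n' "$first_only" "$all_matches"
# prints:
#   that and this
#   that and that

# The /\(NAME\|MEMO\)/ part only selects lines. Here -n suppresses the
# normal output and p prints just the selected lines.
selected=$(printf '%s\n' \
  '<TRNTYPE>DEBIT' \
  '<NAME>FOO MERCHANT SANTA' \
  '<MEMO>FOO MERCHANT SANTA FE' |
  sed -n '/\(NAME\|MEMO\)/p')
printf '%s\n' "$selected"
# prints:
#   <NAME>FOO MERCHANT SANTA
#   <MEMO>FOO MERCHANT SANTA FE
```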



The next part, s, we already know: substitute.



\([ >][A-Z]\) tells sed to look for a capital letter preceded by a space or a right angle bracket, and to remember that pair as the first group.



\([A-Z]*\) matches the run of uppercase letters that follows, and remembers it as the second group. The asterisk means "zero or more of the preceding item."



Now here is where it truly gets ingenious: in the replacement, \1 and \2 are backreferences to whatever the two groups matched. \1 puts back the space or bracket and the first capital letter unchanged. \L is a GNU sed extension that converts everything after it in the replacement to lowercase, so \2, the rest of the word, comes out in lowercase. So we have this precision miracle of changing uppercase words to title case.



This is the result:




<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20111124000000[-6:EST]
<TRNAMT>-25.44
<FITID>3456

<NAME>Foo Merchant Santa
<MEMO>Foo Merchant Santa Fe 11-24-11 567567
</STMTTRN>
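To check the filter without touching a real bank file, you can feed it a few sample lines on standard input. This is a sketch that assumes GNU sed, since \| and \L are GNU extensions; the variable name converted is made up for illustration.

```shell
# Three sample lines: one that must stay untouched, two to convert.
converted=$(printf '%s\n' \
  '<TRNTYPE>DEBIT' \
  '<NAME>FOO MERCHANT SANTA' \
  '<MEMO>FOO MERCHANT SANTA FE 11-24-11 567567' |
  sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g')
printf '%s\n' "$converted"
# prints:
#   <TRNTYPE>DEBIT
#   <NAME>Foo Merchant Santa
#   <MEMO>Foo Merchant Santa Fe 11-24-11 567567
```

The DEBIT line passes through unchanged because it contains neither NAME nor MEMO, and the tags themselves stay uppercase because their first letter follows a left angle bracket rather than a space or a right angle bracket.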


Many thanks to Akkana Peck, Miriam English, and the fine members of LinuxChix who came up with this great sed incantation.



Search and Replace Across Multiple Files



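Sed lets you easily search and replace text across multiple files. Our old friend the find command comes in handy here: it hands every regular file under a directory, including files in its subdirectories, to sed. The -i option tells sed to edit each file in place, so try it on copies first. Replace dirname with your own directory name: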



$ find dirname -type f -exec sed -i 's/this/that/g' {} \;


This is handy for correcting case – for example /this/This/ – and updating any boilerplate text such as copyright notices or contact information, or recycling old love letters with the name of your new sweetie.
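Because in-place editing changes files for good, it can help to rehearse in a scratch directory first. The sketch below assumes GNU sed, where -i takes no argument (BSD and macOS sed expect -i '' instead); the directory layout and file names are invented for the demonstration.

```shell
# Work in a throwaway directory so no real files are touched.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
printf 'this is a test\n' > "$demo_dir/a.txt"
printf 'keep this, change this\n' > "$demo_dir/sub/b.txt"

# Preview which files contain the text before changing anything.
grep -rl 'this' "$demo_dir"

# Replace in every regular file under demo_dir, subdirectories included.
find "$demo_dir" -type f -exec sed -i 's/this/that/g' {} \;

result_a=$(cat "$demo_dir/a.txt")
result_b=$(cat "$demo_dir/sub/b.txt")
printf '%s\n%s\n' "$result_a" "$result_b"
# prints:
#   that is a test
#   keep that, change that

# Clean up the scratch directory.
rm -rf "$demo_dir"
```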



Awk Organizes Stuff



Awk is ace at slicing up any kind of organized data. A good example of such data is your /etc/passwd file: each line is divided into a fixed number of colon-delimited fields, and you can try awk commands on it non-destructively. For starters, let's extract a list of usernames with awk and put them in alphabetical order with sort:



$ awk -F: '{ print $1 }' /etc/passwd | sort


-F: tells awk what the field separator is, which is the colon in /etc/passwd. $1 means the first field in each line. There are seven fields, so the remaining fields are represented by $2, $3, and so on.
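The same pattern extends to more than one field. As a sketch, the lines below stand in for /etc/passwd (the accounts are made up) and print each username next to its login shell, which is the seventh field.

```shell
# Sample passwd-style records; the real file has the same seven fields.
users=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'carla:x:1000:1000:Carla:/home/carla:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' |
  awk -F: '{ print $1, $7 }' | sort)
printf '%s\n' "$users"
# prints:
#   carla /bin/bash
#   daemon /usr/sbin/nologin
#   root /bin/bash
```

The comma between $1 and $7 tells awk to separate the two values with its output field separator, which is a single space by default.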



Awk can rearrange data in files, which is awesome when you have giant files that need to have columns moved to different places, such as an address book. Suppose you have a file of addresses organized like this:



firstname lastname mi street_address city state zip


You can easily flip the firstname and lastname fields so they're the right way around:



$ awk '{ print $2, $1, $3, $4, $5, $6, $7 }' oldfile > newfile


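Awk's default field separator is whitespace, so this command works only if no field contains spaces; a street address such as 123 Main St would be split into three fields. If you use any other type of field separator, such as a comma or colon, you must specify it with the -F option.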
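When the fields are separated by commas, values with spaces in them survive intact. The sketch below uses invented sample records and sets OFS, awk's output field separator, so that the rearranged lines are written with commas too.

```shell
# Swap the first two comma-separated fields and keep commas on output.
swapped=$(printf '%s\n' \
  'Jane,Doe,Q,12 Elm Street' \
  'John,Smith,R,9 Oak Lane' |
  awk -F, 'BEGIN { OFS = "," } { print $2, $1, $3, $4 }')
printf '%s\n' "$swapped"
# prints:
#   Doe,Jane,Q,12 Elm Street
#   Smith,John,R,9 Oak Lane
```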



Awk can add line numbers to a file, which can be handy for presenting code examples when you're writing a howto, and save the results in a new file:



$ awk '{print NR, $0}' oldfile > newfile


NR is a built-in awk variable that holds the number of records read so far, which is the current line number when each record is one line. $0 means the whole line.
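A quick way to see how NR behaves is to print it on every line and again in an END block, where it holds the total number of lines read. This sketch uses three made-up input lines.

```shell
# NR counts up as each line is read.
numbered=$(printf 'alpha\nbeta\ngamma\n' | awk '{ print NR, $0 }')
printf '%s\n' "$numbered"
# prints:
#   1 alpha
#   2 beta
#   3 gamma

# In an END block, NR is the total number of lines read.
total=$(printf 'alpha\nbeta\ngamma\n' | awk 'END { print NR }')
printf '%s\n' "$total"
# prints: 3
```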



Suppose you are studying a coding howto and want to copy a code example, but remove the line numbers so you can run the code. Sed is a great tool for this. This example looks for a number at the beginning of each line, deletes it along with the space that follows it, and saves the results in a new file. The ^ anchors the match to the start of the line, so numbers that appear elsewhere in the code are left alone:



$ sed -e 's/^[0-9][0-9]* //' oldfile > newfile


Leave off the > newfile redirection to print the results to your screen.
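Numbering with awk and stripping with sed should cancel each other out, which makes for an easy sanity check. The sketch below uses a pattern anchored with ^ so that only the leading number is removed and the numbers inside the sample lines survive; the sample text is made up.

```shell
# Two sample lines that contain numbers of their own.
original=$(printf 'x = 1 + 2\nprint 10 times\n')

# Add line numbers with awk, then strip them again with sed.
numbered=$(printf '%s\n' "$original" | awk '{ print NR, $0 }')
stripped=$(printf '%s\n' "$numbered" | sed 's/^[0-9][0-9]* //')

printf '%s\n' "$numbered"
# prints:
#   1 x = 1 + 2
#   2 print 10 times
printf '%s\n' "$stripped"
# prints:
#   x = 1 + 2
#   print 10 times
```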



To learn more about these useful utilities, consult man awk and man sed, and check out the excellent book sed & awk, 2nd Edition by Dale Dougherty and Arnold Robbins.




This work is licensed under a Creative Commons Attribution 3.0 Unported License
Creative Commons License.


