Saturday, September 23, 2017

String extracts in Perl with split match and regular expressions


Lately I had to solve the following problem:
extract the process ID (pid) and the program name from the header line of pmap output.

The strings can take these forms, from simple to complex:

     123: cmd
     123: cmd -x foo
     123: /usr/bin/cmd
     123: /usr/bin/cmd -x foo

and, more complex, with more parameters that are trickier to parse:

     123: /usr/bin/cmd -x /home/foo
     123: /usr/bin/cmd -x 456: -d /home/foo

i.e. very generally speaking, there is a pid followed by a colon and then a more or less complex command line, in which the program name can be fully qualified and carry a number of parameters. The last example deliberately reuses digits followed by a colon as a parameter.

Here is an attempt to describe the string in words, as a sequence of

  • a number of digits
  • a colon
  • a tab
  • a program name, optionally qualified
  • optionally: an arbitrary number of space-separated parameters (there could be multiple spaces)

    There are various solutions to this in Perl, and here I'll show two.

     # Example string
     $str = "123:\t/usr/bin/cmd -x /home/foo";
     #           ^^ the tab after the colon

     # First I split the string using an optional colon :*
     # and a sequence of white space \s+ as field delimiters.
     # This will give me the pid and the program name and strip off the parameters
     ($pid,$cmd) = split /:*\s+/,$str;

     # In case of a fully qualified program name
     # everything up to the last slash needs to be removed
     $cmd =~ s/.*\///;

     print "pid = $pid X cmd = $cmd\n";

    Always looking for more concise code, I wondered whether these two lines couldn't be shortened. Here is a one-liner, which of course requires some explanation.

     # Example string
     $str = "123:\t/usr/bin/cmd -x /home/foo";
     #           ^^ the tab after the colon

     # I try to match the following regular expression:
     #   a sequence of digits (\d+) which will become $1 if successful
     #   a colon and a tab
     #   an optional sequence of characters ending in slash (\S+\/)*
     #     which will become $2
     #   a sequence of characters (\S+) which will become $3
     # The remainder of the string is not important as
     # we anchor the regular expression at the beginning.
     $str =~ /^(\d+):\t(\S+\/)*(\S+)/ ;

     print "pid = $1 X cmd = $3\n";
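    Since the one-liner needs explaining, the same pattern can also be written with the /x modifier so that the explanation sits inside the regular expression. The following is only a sketch: the sub name parse_pmap_header is made up for illustration, and it uses a non-capturing group for the directory part, so the program name ends up in the second capture instead of the third.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (pid, cmd) on success, an empty list otherwise.
sub parse_pmap_header {
    my ($line) = @_;
    if ( $line =~ m{
            ^ (\d+)        # pid: one or more digits, first capture
            : \t           # a colon followed by a tab
            (?: \S+ / )?   # optional directory part ending in a slash, not captured
            (\S+)          # program name, second capture
        }x )
    {
        return ( $1, $2 );
    }
    return;
}

my @fields = parse_pmap_header("123:\t/usr/bin/cmd -x /home/foo");
print "pid = $fields[0] X cmd = $fields[1]\n" if @fields;
# prints: pid = 123 X cmd = cmd
```

    One thing to watch out for with /x: variables are still interpolated inside the comments, so the comments should not contain $ or @ signs.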

    For readability I would have preferred the first code, but on closer inspection I found a flaw in it, namely its handling of malformed strings. Assume the string below, where the colon is missing and an extra word sits between pid and program name:

     $str = "123 xyz /usr/bin/cmd -x 456: /home/foo"; 
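    The two codes result in

     # Code 1
     pid = 123 xyz /usr/bin/cmd -x 456 X cmd = foo
     # Code 2
     pid = /home/ X cmd =

    In both cases the result is wrong and unpredictable: the split happens at the wrong place, and the regular expression does not match at all, so $1 and $3 still hold whatever an earlier match left in them.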
    I can turn the second code to my advantage, though, by adding a check:

     if( $str =~ /^(\d+):\t(\S+\/)*(\S+)/ ) {
         print "pid = $1 X cmd = $3\n";
     }

    i.e. I use the captured values only when the regular expression really matches. The check gives me assurance.
    I can't do this with the split in the first code, other than by adding a post-check that the pid really consists of digits etc., which would make the code longer.
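    To give an idea of how much longer, here is a rough sketch of such a post-check. It is an illustration and not the code from above: the sub name split_pmap_header is made up, and the split is done on white space only, so that the colon stays attached to the first field and can be verified together with the digits.

```perl
use strict;
use warnings;

# Hypothetical helper: split first, validate afterwards.
# Returns (pid, cmd), or an empty list if the line fails the post-check.
sub split_pmap_header {
    my ($line) = @_;

    # Split on white space only, so the colon stays attached to the pid field
    my ( $head, $cmd ) = split /\s+/, $line;

    # Post-check: a program name must exist and the first field must be "digits:"
    return unless defined $cmd && $head =~ /^(\d+):$/;
    my $pid = $1;

    # Strip the directory part of a fully qualified program name
    $cmd =~ s/.*\///;

    return ( $pid, $cmd );
}

for my $line ( "123:\t/usr/bin/cmd -x /home/foo",
               "123 xyz /usr/bin/cmd -x 456: /home/foo" ) {
    my ( $pid, $cmd ) = split_pmap_header($line);
    print defined($pid) ? "pid = $pid X cmd = $cmd\n" : "rejected: $line\n";
}
```

    The first line is accepted and the malformed one is rejected, but the check costs several extra lines compared to the single guarded match.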

    So I decided to use the regular expression in my code, since it is still fairly readable when extracting just three parts of the overall string.
    If I wanted to extract more, say five or eight components, I would probably fall back to the split and a subsequent validity check.
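    As a sketch of what that fallback could look like for a few more components (pid, directory, program name and the list of parameters): the sub name pmap_fields and the hash keys are invented for this example, and the validity check is the same "digits plus colon" test on the first field.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference with pid, dir, name and args,
# or nothing if the line does not look like a pmap header line.
sub pmap_fields {
    my ($line) = @_;

    # Limit of 3: pid field, program, and the untouched rest of the line
    my ( $head, $prog, $rest ) = split /\s+/, $line, 3;

    # Validity check: a program must exist and the first field must be "digits:"
    return unless defined $prog && $head =~ /^(\d+):$/;
    my $pid = $1;

    # Separate the optional directory part from the program name
    my ( $dir, $name ) = $prog =~ m{^(.*/)?([^/]+)$} or return;

    # split ' ' copes with multiple spaces between the parameters
    my @args = defined $rest ? split( ' ', $rest ) : ();

    return { pid => $pid, dir => $dir // '', name => $name, args => \@args };
}

my $f = pmap_fields("123:\t/usr/bin/cmd -x  456: -d /home/foo");
print "pid = $f->{pid} X dir = $f->{dir} X cmd = $f->{name} X args = @{ $f->{args} }\n"
    if $f;
```

    With more fields than this, the list assignment from split stays readable, whereas a single regular expression with eight captures would not.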
