Effective Perl by Joseph N. Hall

Observations and Tips from the author of Effective Perl Programming

Monday, January 16, 2006

Idiomatic Perl: Counting the Number of Times a Character Occurs in a String

From time to time you may want to count the number of occurrences of a character inside a string. You could loop from the beginning of the string to the end:
my $s = "Testing one two three.";

my $m = 0;
for (my $i = 0; $i < length($s); $i++) {
$m++ if (substr($s, $i, 1) eq 'e');
}
Ugly. This is going to take all day with all those calls to substr.* You could split the string. There's a special split pattern // that divides the input string into characters. So perhaps:
my $n = 0;
for (split //, $s) {
$n++ if $_ eq 'e';
}
Maybe this doesn't seem quite right to you yet. There must be a shorter way to do this in Perl, right? Yes. The tr/// (transliterate characters) operator can count characters for you. The tr/// operator returns the number of characters it changed. For example:
my $example = "Mixed Case";
my $changed = $example =~ tr/a-z/A-Z/; # make everything uppercase
print $changed, "\n"; # 9
The value returned from tr/// above is 9, which is the number of characters changed to uppercase. Armed with this knowledge, you might think of:
my $c = $example =~ tr/e/e/;  # change 'e' to 'e' ...
which will work just fine. This is perfectly reasonable, but there's a shortcut. If you omit the second argument to tr///, no characters are changed, but tr/// still returns a count of the character(s) in its first argument.
my $c = $example =~ tr/e//;  # counts the number of 'e' just as above

my $c_lower = $example =~ tr/a-z//; # counts lowercase letters in $example
tr/// is so smart, in fact, that it knows when you aren't changing the string, meaning that you can write things like this:
my $count_ignore_case = lc($example) =~ tr/t//;
Normally, tr/// requires a modifiable value, but if it won't be making changes, you can supply a constant argument like a result from an expression.

-joseph

*Not really, but ouch.

6 Comments:

At 2:02 AM, Anonymous Anonymous said...

a slight variation on the idiom you propose could be:

my $x = 'test';
my $n_of_t = () = ($x =~ /t/g);

this is probably uglier (and slower) than using tr///, but I find it a nice idiom :-).
and it does saves you an lc() when you need to ignore case:

my $n_of_t = () = ($x =~ /t/gi);

cheers,
Aldo

 
At 6:16 AM, Anonymous Anonymous said...

Thanks for the example. Just one tiny thing I think:

This line:

print $changed, "\n"; # 9

prints 7

 
At 11:26 AM, Blogger Joseph N. Hall said...

Yup, nice catch.

% perl -le '$a = "Mixed Case"; print $a =~ tr/a-z/A-Z/'
7

 
At 11:12 AM, Anonymous Anonymous said...

Great trick! I have a similar problem, where I have a series of characters ('a' through 'z') that I want to count in a string. I tried tr/$_//; but soon found that you can't use variables in a transliteration regex, unless you wrap it in exec(), which I hear is costly. I could try Aldo's approach:

for $char ('a'..'z'){
my $x = 'text';
my $count = () = ($x =~ /$char/g);
}

which works great. Still, I was wondering if anyone has any other ideas? I'll be using this function a whole lot, so speed is a must.
Thanks for all your help!
Dex

 
At 4:05 PM, Blogger Joseph N. Hall said...

Dexter,

If you have a range of characters that may be changing, you'll have to use eval for each different range of characters. But this isn't necessarily difficult. The easiest way is probably a closure:

sub counter {
my $set = shift;
# print 'sub { return shift =~ tr/' . $set . '//; }' . "\n";
return eval 'sub { return shift =~ tr/' . $set . '//; }';
}

my $digit_ctr = counter('0-9');
my $letter_ctr = counter('a-zA-Z');
my $backslash_ctr = counter('\\\\');
my $newline_ctr = counter('\n');

my $str = '123abcABC[]\\' . "\n";
print "digits: " . $digit_ctr->($str) . "\n";
print "letters: " . $letter_ctr->($str) . "\n";
print "backslashes: " . $backslash_ctr->($str) . "\n";
print "newlines: " . $newline_ctr->($str) . "\n";

The subroutine "counter" creates a reference to an anonymous subroutine (in this case, a particular type called a closure because of the way the my variable $set is used) that takes one argument and does the desired counting using tr///. Each call to "counter" creates a subroutine by compiling a bit of Perl code, but you only need to call "counter" once for each class of characters you want to count. Once you have the subroutine that you want (or subroutines) you can reuse them as often as you like via the coderef call syntax $coderef->(args...).

 
At 12:16 PM, Blogger Rajesh Kumar Mallah said...

Ref: comment by aldo.


my $x = 'test';
my $n_of_t = () = ($x =~ /t/g);

this is probably uglier (and slower) than using


Is it true that regex is slower
than tr ?

mallah@tradeindia.com

 

Post a Comment

<< Home