Bash Script: Counting lines, words, characters

Objective

The objective is to write a shell script that mimics the wc command. The script is built from bash built-ins and standard coreutils commands, and is meant to match the counts of wc as closely as possible.

The Idea

The file is read one line at a time by redirecting it to the input of a while loop, which counts the lines as it reads them. Once a line is read, its characters are counted, and the line is split on the IFS characters to count its words. This is repeated for every line in the file. The source code is shown first, followed by a detailed description.

Sourcecode

#!/bin/bash
#This code is a part of http://phoxis.org

total_l=0
total_w=0
total_c=0

function my_wc ()
{
  IFS_BAK=$IFS
  file_name="$1"
  l=0
  w=0
  c=0
  word=("")

  IFS=$'\n'
  # -r is needed to interpret the backslash characters as the part of
  # the text in the file, and not as escape sequences
  while read -r line
    do
      l=$((l+1))
      curr_line_char=${#line}
      c=$((c+curr_line_char))
      # set IFS to normal field terminating characters
      IFS=$IFS_BAK
      read -r -a word <<< "$line"
      w=$((w+${#word[*]}))
      # set IFS to \n only so that in the next read command
      # only \n terminated lines at once are read
      IFS=$'\n'
  done < "$file_name"

  IFS=$IFS_BAK
  #If last line is not \n terminated then the last read will fail
  #and while will break. We will calculate the last line separately
  #if it is not null
  if [ -n "$line" ]
    then
      # Do not count this as a line, because it
      # is not terminated by newline character
      curr_line_char="${#line}"
      c=$((c+curr_line_char))
      read -r -a word <<< "$line"
      w=$((w+${#word[*]}))
  fi
  # The newline characters are also characters, so count them
  c=$((c+l))
  #   echo -e "\nFile name: $file_name \nLines : $l\tWords: $w\tCharacters: $c"
  echo "  $l  $w $c $file_name"

  total_c=$((total_c+c))
  total_l=$((total_l+l))
  total_w=$((total_w+w))
}


# Main execution sequence

file_count=$#
while [ $# -gt 0 ]
do
  file_name="$1"
  if [ ! -f "$file_name" ]
   then
     echo "File \"$file_name\" does not exist or is not a regular file"
     shift 1
     continue
  fi
  
  my_wc "$1"

  shift 1
done

if [ $file_count -gt 1 ]
 then
   echo -e " $total_l $total_w $total_c total"
fi

Description

The main driver

The line, word, and character counting code is written as a function named my_wc (). The main execution sequence calls it with valid file names, so we look at that sequence first. The number of file names passed on the command line is first saved in the shell variable file_count, because shift changes $# as the loop runs. The while loop takes the supplied file names one at a time: it processes $1 and then shifts the positional parameters so that the next name moves into $1. Before calling my_wc (), it checks that the file exists and is a regular file. If the path in $1 is a regular file it is processed; otherwise a message is printed and the loop moves on to the next file.
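The shift-driven loop can be tried on its own. The following is only a minimal sketch: list_regular_files is a hypothetical helper, not part of the script, and it tests $# to decide when the arguments are exhausted.

```shell
#!/bin/bash
# Hypothetical helper: walk the positional parameters with shift,
# printing regular files and reporting everything else on stderr.
list_regular_files() {
  while [ "$#" -gt 0 ]; do
    if [ -f "$1" ]; then
      printf '%s\n' "$1"
    else
      printf 'skipping "%s"\n' "$1" >&2
    fi
    shift    # the next argument, if any, becomes $1
  done
}
```

Calling list_regular_files *.sh would print each regular file once, in the order the shell expanded them.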

Just like wc, if multiple files are supplied on the command line, the script prints the total count over all of them. The total counts of lines, words, and characters are kept in the total_l, total_w, and total_c shell variables, which are initialized to 0 at the beginning. These variables are updated by my_wc at the end of the function.

This is how the main driver works. Now we move to the my_wc () function.

my_wc () function

First the contents of the IFS variable are backed up into IFS_BAK. This is needed because inside the loop we switch the value of IFS to split a line differently depending on the field separation characters (described later). file_name is initialized with the passed parameter. Note that no file-related error checking is done here; the function trusts its caller. The line, word, and character counts for this particular file are stored in the l, w, and c shell variables respectively. An array named word is also declared; it is used as a temporary array to count the words in a line.

The contents of file_name are redirected to the input of the while loop. Before the loop, IFS is set to \n only. read always stops at a newline, so it reads one line per iteration regardless of IFS; what this setting changes is that spaces and tabs are no longer field separators, so the leading and trailing blanks of a line are preserved. With the standard set of IFS characters they would be stripped and go uncounted.
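The effect of IFS on leading and trailing blanks is easy to check in isolation. This is a small sketch with made-up variable names; it sets IFS only for the duration of one read command instead of saving and restoring it.

```shell
#!/bin/bash
input='   indented text   '

# Default IFS: read strips leading and trailing spaces and tabs.
read -r trimmed <<< "$input"

# IFS limited to a newline for this one command: blanks are kept.
IFS=$'\n' read -r kept <<< "$input"

printf '[%s]\n' "$trimmed"   # [indented text]
printf '[%s]\n' "$kept"      # [   indented text   ]
```

Only the second form gives a length that includes the six blank characters around the text.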

The -r switch in read -r ensures that a backslash '\' inside the file is treated as an ordinary character and not as an escape character. Without this switch, read would drop the backslash, so the two characters in "\d" would be read as the single character "d".
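A quick way to see the difference, using illustrative variable names that do not appear in the script:

```shell
#!/bin/bash
input='\d'                  # two characters: a backslash and a "d"

read cooked <<< "$input"    # without -r the backslash is consumed
read -r raw <<< "$input"    # with -r the text is taken literally

echo "without -r: ${#cooked} character(s)"   # without -r: 1 character(s)
echo "with -r:    ${#raw} character(s)"      # with -r:    2 character(s)
```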

After reading one line the line counter is incremented by one. The characters in the current line are counted with ${#line} and added to the character count of the current file in c.

Now we need to count the number of words in the current line. First IFS is restored to its original value from the backup. The current line is then fed to a second read through a here string (<<<) redirection. The -a switch makes read store the words in the array word, splitting the line at each run of IFS characters. Because IFS has been restored to its original value, the words are separated in the usual way, on spaces and tabs. Once the words are loaded into the array, its element count, ${#word[*]}, is added to the word counter. At the end of the loop body IFS is set to \n again, ready for the next iteration. In this way the loop reads through the entire file and counts its lines, words, and characters.
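The word-splitting step can be isolated as follows. count_words is a hypothetical helper written for this illustration; it assumes the default IFS and passes -r to read so that a backslash in the line cannot escape a separator.

```shell
#!/bin/bash
# Hypothetical helper: count the whitespace-separated words in one line.
count_words() {
  local -a words
  # Split the line on the current IFS and store the pieces in an array.
  read -r -a words <<< "$1"
  printf '%d\n' "${#words[@]}"
}

count_words '  the quick   brown fox '   # 4
count_words ''                           # 0
```

Runs of blanks count as a single separator, so extra spaces between or around words do not create empty array elements.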

There is one special case. If the last line of the file does not end with a newline character, read still stores that line in the variable but returns false, which terminates the while loop before the line is processed. To handle it, an if statement follows the loop. In that case the variable line is not null after the loop, and the body of the if statement counts its characters and words in the same way as the loop does. Note that the line counter is not incremented: wc counts newline characters, and this line is not terminated by one, so it is not counted as a line.
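The behaviour of read at the end of a file without a trailing newline can be observed directly. In this sketch scan_lines is a hypothetical helper that reads standard input rather than a named file.

```shell
#!/bin/bash
# Hypothetical helper: report how many complete lines were read and
# what, if anything, read left in the variable after the loop ended.
scan_lines() {
  local line lines=0
  while IFS= read -r line; do
    lines=$((lines + 1))
  done
  printf 'complete lines: %d, leftover: [%s]\n' "$lines" "$line"
}

printf 'one\ntwo'   | scan_lines   # complete lines: 1, leftover: [two]
printf 'one\ntwo\n' | scan_lines   # complete lines: 2, leftover: []
```

A non-empty leftover is exactly the condition that the if statement after the loop tests for.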

Finally, the newline characters that terminate each line are also characters, but read strips them, so ${#line} does not count them. There is exactly one per counted line, so the line count l is added to the character count.
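The character arithmetic can be checked on a tiny input. count_chars below is a hypothetical helper that follows the same steps on standard input. One caveat, stated as an assumption about the environment: ${#line} counts characters in the current locale while plain wc reports bytes, so the two agree only when every character is a single byte, as in ASCII text.

```shell
#!/bin/bash
# Hypothetical helper: count characters the way the description does.
count_chars() {
  local line lines=0 chars=0
  while IFS= read -r line; do
    lines=$((lines + 1))
    chars=$((chars + ${#line}))
  done
  # A final line without a newline still contributes its characters;
  # if the input ended with a newline, line is empty and this adds 0.
  chars=$((chars + ${#line}))
  # One extra character for each newline that read stripped.
  chars=$((chars + lines))
  printf '%d\n' "$chars"
}

printf 'ab\ncd\n' | count_chars   # 2 + 2 + 2 newlines = 6
printf 'ab\ncd'   | count_chars   # 2 + 2 + 1 newline  = 5
```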

The counts are then printed to the terminal. Recall that the total line, word, and character count variables accumulate the counts of all the files passed on the command line. They are updated with the counts of the current file before the function returns. The totals are printed by the main execution sequence if more than one parameter was passed.

Output

Sample output of this script is shown below, along with the output of wc for comparison.

Counting the number of lines, words, and characters in the script itself.


[phoxis@localhost ~]$ ./wc_me.sh wc_me.sh
  79  243 1714 wc_me.sh

[phoxis@localhost ~]$ wc ./wc_me.sh
  79  243 1714 ./wc_me.sh

Counting the contents of all the .sh files with the script and with wc.

[phoxis@localhost ~]$ ./wc_me.sh *.sh
  83  217 1426 bin_search.sh
  47  69 537 bubble.sh
  43  80 600 calc.sh
  15  77 567 count_key_words.sh
  72  220 1486 digital_root.sh
  13  40 264 etc_passwd_missing_passes.sh
  40  105 605 fact.sh
  35  70 392 fibo.sh
  81  281 2149 id3v1.sh
  40  88 571 matrix.sh
  28  48 394 ones_compliment.sh
  32  62 426 palind.sh
  110  333 1963 prime.sh
  40  160 895 quad.sh
  89  171 1341 queue.sh
  45  137 847 rev_2.sh
  72  210 1231 rev_3.sh
  8  9 85 r.sh
  47  118 777 scalc.sh
  36  61 470 selection_sort.sh
  83  163 1140 stack.sh
  58  178 1331 str_match_stat.sh
  40  119 718 sum_n.sh
  102  515 3547 tofhanoi.sh
  43  138 931 ul_lu.sh
  79  243 1714 wc_me.sh
 1381 3912 26407 total

[phoxis@localhost ~]$ wc *.sh
   83   217  1426 bin_search.sh
   47    69   537 bubble.sh
   43    80   600 calc.sh
   15    77   567 count_key_words.sh
   72   220  1486 digital_root.sh
   13    40   264 etc_passwd_missing_passes.sh
   40   105   605 fact.sh
   35    70   392 fibo.sh
   81   281  2149 id3v1.sh
   40    88   571 matrix.sh
   28    48   394 ones_compliment.sh
   32    62   426 palind.sh
  110   333  1963 prime.sh
   40   160   895 quad.sh
   89   171  1341 queue.sh
   45   137   847 rev_2.sh
   72   210  1231 rev_3.sh
    8     9    85 r.sh
   47   118   777 scalc.sh
   36    61   470 selection_sort.sh
   83   163  1140 stack.sh
   58   178  1331 str_match_stat.sh
   40   119   718 sum_n.sh
  102   515  3547 tofhanoi.sh
   43   138   931 ul_lu.sh
   79   243  1714 wc_me.sh
 1381  3912 26407 total

Comments

Externally, this code differs from wc in two ways. First, the script only reads regular files given as arguments, and when a file does not exist or is not a regular file its error message differs from that of wc. Second, the output columns are formatted differently. Internally, the major problem is execution time: the script becomes very slow on even a moderately large file. For example, try counting the lines, words, and characters of /usr/share/dict/linux.words. wc finishes almost instantly, while the script takes far longer, although it does count the lines, words, and characters correctly. For small to medium files the script is fast enough.
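On the formatting difference, right-aligned columns can be produced with printf instead of echo. This is only a sketch: print_counts is a hypothetical helper, and the fixed field width of 5 is an arbitrary choice for illustration, not the rule wc itself uses to size its columns.

```shell
#!/bin/bash
# Hypothetical helper: print one result row with right-aligned counts.
# $1 = lines, $2 = words, $3 = characters, $4 = label (file name or "total")
print_counts() {
  printf '%5d %5d %5d %s\n' "$1" "$2" "$3" "$4"
}

print_counts 79 243 1714 wc_me.sh
print_counts 8 9 85 r.sh
```

With a fixed width the columns line up as long as no count has more than five digits.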

5 thoughts on “Bash Script: Counting lines, words, characters”

Boudhayan Gupta says:

April 5, 2011 at 3:30 pm

It’s SLOW! Why the heck would you want to write it in an interpreted language? Time to write a Bash compiler for LLVM, I guess…

1. phoxis says:
  
  April 5, 2011 at 3:32 pm
  
  Definitely it is slow, as i have discussed in the “Comments” section. I would never write a script for this matter, and would always prefer to use the wc command or best to write my own C code which is my primary language of choice. This was just a demonstration and nothing else.
  
  1. Prantik Maitra says:
    
    April 6, 2011 at 12:00 am
    
    One must write this code because it is the one which comes in our university exams….
    I guess this reason is sufficient…
    
  2. Prantik Maitra says:
    
    April 6, 2011 at 12:03 am
    
    It won’t really matter whether it is slow or fast…what does matter is that the above demonstration will bring marks for us…
    
tinkerbelle86 says:

April 11, 2011 at 3:28 am

wow, i couldnt do this, im in awe of people who can :)