Bash Script: Find occurrence count of a string in a file

A file and a string are given. The number of times the string occurs in the file, and the line numbers on which it occurs, need to be found and printed.

The Idea

We will use a bash script and bash built-in features to do this. The approach is to read the file one line at a time and search for the string in that line with shell pattern matching to decide whether it occurs. Once the string is found as a substring, that match is removed (replaced with an empty string) and the line is searched again for another match, until no more matches are found in that line. Then the next line is considered, until the end of the file.
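The remove-and-recheck idea can be tried on a single line before looking at the full script. The following is a minimal sketch, assuming a hypothetical helper named count_in_line that is not part of the original script; it also assumes an empty search string should count as zero matches.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original script): count how many
# times a literal string can be removed from a single line.
count_in_line() {
  local string=$1 line=$2 count=0
  # An empty search string would match forever, so treat it as zero matches.
  [[ -z $string ]] && { echo 0; return; }
  # Quoting "$string" makes the match literal instead of a glob pattern.
  while [[ $line == *"$string"* ]]; do
    count=$((count + 1))
    line=${line/"$string"/}   # delete the leftmost match, then recheck
  done
  echo "$count"
}

# "the" appears twice as a word and once inside "other", so this prints 3.
count_in_line "the" "the cat and the other dog"
```

Note that the count is of substrings, not whole words, which is also how the full script behaves.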

Source code

#!/bin/bash
# Find occurrences of a string in a file. This counts the total
# number of occurrences of the string in the file, and reports
# the line numbers where it occurs and the number of times it
# occurs in each line.

file_name="$2"
string="$1"

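# an empty search string would make the inner loop run forever,
# so it is rejected together with a wrong argument count
if [ $# -ne 2 ] || [ -z "$1" ]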
  then
   echo "Usage: $0 <pattern to search> <file_name>"
   exit 1
fi

if [ ! -f "$file_name" ]
 then
  echo "file \"$file_name\" does not exist, or is not a regular file"
  exit 2
fi

line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurrence=0

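# line_no_list stores pairs: index k holds a line number and index
# k+1 holds the number of times the string occurs on that line.
# IFS= and -r keep each line intact; the || test also handles a
# last line that has no trailing newline.
while IFS= read -r line || [[ -n "$line" ]]
 do
  flag=0
# quote "$string" so it is matched literally, not as a glob pattern
  while [[ "$line" == *"$string"* ]]
   do
    flag=1
    line_no_list[line_no_indx]=$curr_line_indx
    line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
    total_occurrence=$((total_occurrence+1))
# remove the first occurrence of "$string" from the line and recheck
    line=${line/"$string"/}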
  done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
  if (( flag == 1 ))
   then
    line_no_indx=$((line_no_indx+2))
  fi
  curr_line_indx=$((curr_line_indx+1))
done < "$file_name"


echo -e "\nThe string \"$string\" occurs \"$total_occurrence\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : Nos of Occurrence in this line]: "

for ((i=0; i<line_no_indx; i=i+2))
 do
  echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done

echo

Description

The script above requires two parameters. The first parameter is treated as the string to be searched for, and the second is treated as the file name. The file name is stored in the shell variable file_name, and the string to be matched is stored in the shell variable string. The input is first validated by checking that exactly 2 arguments were passed to the script; if not, the script prints the usage and exits with error code 1. Next, the -f switch of the test (or [ ]) command checks that the file exists and is a regular file. If the file passes this check, processing continues; otherwise the script terminates and returns error code 2 to the calling process. The variable names are used as follows:

line_no_list : This is an array that records the lines where the string occurs. It is a flat array used as a list of pairs: for every even index i, location i holds a line number where the string occurs, and location i + 1 holds the number of times the string occurs in that line. It is initialized with a single empty element.
curr_line_indx : This is a counter holding the line number of the line currently being processed. Its value is stored in line_no_list whenever the string occurs in the current line. It is initialized to 1.
line_no_indx : This is an index into the line_no_list array that always points to the pair currently being filled. To record a line number, line_no_list[line_no_indx]=$curr_line_indx is executed. To update the occurrence count for that line, the location line_no_list[line_no_indx+1] is incremented. It is initialized to 0.
total_occurrence : This is a counter for the total number of occurrences of the string in the file. Maintaining it avoids recalculating the total from the array. It is initialized to 0.
line : This will hold the current line of the file being processed.
flag : This indicates whether the inner loop was entered, and is used to decide whether line_no_indx must be advanced. A value of 0 means the inner while loop was not entered; a value of 1 means it was entered.
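The pair layout of line_no_list may be easier to see with concrete values. The sketch below is only an illustration: record_line is a hypothetical helper, and the numbers are invented rather than taken from a real run.

```bash
#!/bin/bash
# Illustrative values only: pretend the string was found 6 times on
# line 1 and 3 times on line 5.
line_no_list=()
line_no_indx=0

# Hypothetical helper that appends one (line number, count) pair.
record_line() {
  line_no_list[line_no_indx]=$1        # even slot: line number
  line_no_list[line_no_indx+1]=$2      # odd slot: count on that line
  line_no_indx=$((line_no_indx + 2))   # move on to the next free pair
}

record_line 1 6
record_line 5 3

echo "${line_no_list[@]}"       # flat contents: 1 6 5 3
echo "$((line_no_indx / 2))"    # number of recorded lines: 2
```

A bash associative array keyed by line number would be another way to hold the same data; the flat pair layout is simply what the script uses.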

The file $file_name is redirected into the outer while loop, which reads it line by line with the read command until the end of the file. At the start of each iteration the flag is reset to 0. The inner while loop checks whether the line contains the string with a [[ ]] test that compares $line against a pattern made of $string with a * on each side. This is a shell glob pattern, not a regular expression: each * matches any run of characters, including none, so the test is true whenever the string appears anywhere in the current line. If the test is true, i.e. the line contains the string, the while loop is entered and flag is set to 1. The current line number curr_line_indx is stored at the current position of the array by line_no_list[line_no_indx]=$curr_line_indx to record that the string occurs in this line. The next position of the array holds the number of occurrences of the string in this line, so it is incremented by the statement line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1)). The total_occurrence variable is also incremented by one.
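Inside [[ ]], an unquoted expansion on the right-hand side of == is treated as a glob pattern, while a quoted one is compared as plain text. The difference matters when the search string itself contains characters such as * or ?. The sketch below contrasts the two; matches_glob and matches_literal are hypothetical names used only for this illustration.

```bash
#!/bin/bash
# Hypothetical helpers contrasting glob and literal matching.
matches_glob()    { [[ $2 == *$1* ]]; }     # $1 is treated as a pattern
matches_literal() { [[ $2 == *"$1"* ]]; }   # $1 is treated as plain text

line="axxb"
string="a*b"

# The unquoted form lets the * inside the string act as a wildcard.
matches_glob "$string" "$line"    && echo "glob: match"    || echo "glob: no match"
# The quoted form looks for the three characters a, *, b in sequence.
matches_literal "$string" "$line" && echo "literal: match" || echo "literal: no match"
```

This prints "glob: match" followed by "literal: no match". If a line matches only through the glob reading, a literal removal such as ${line/"$string"/} leaves the line unchanged, so a loop that tests one way and removes the other way would never finish.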

To find how many such matches occur in a single line, the first match in the line is removed with the shell substitution line=${line/"$string"/}. The general syntax is ${variable/pattern/replacement}, which replaces the leftmost match of pattern in the value of variable with replacement. In the code above the replacement is empty, so the leftmost occurrence of $string is removed, and the shell variable line is updated with the result. The inner while loop then tests the updated line again; if it still contains the string, the occurrence counter for that line is incremented once more. When the updated line no longer contains the string, the inner while loop terminates. If the loop was entered, the current pair in the array is complete, so line_no_indx must be moved to the next free pair before the next line is processed. This is done by checking the flag value and incrementing line_no_indx by 2, since each recorded line takes two array locations as described above. After each iteration of the outer while loop, the line counter curr_line_indx is incremented.
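One caveat of removing a match is that the text on either side of it is joined together, which can form a match that was not in the original line. For example, removing ab from aabb leaves ab, which is then counted again. A possible variant, shown here only as a sketch and not as the method used in the script, discards everything up to and including each match with the ${variable#pattern} expansion. The helper name count_by_prefix_strip is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: count non-overlapping occurrences by discarding
# everything up to and including each match, so leftover pieces are
# never joined back together.
count_by_prefix_strip() {
  local string=$1 line=$2 count=0
  [[ -z $string ]] && { echo 0; return; }   # guard against an endless loop
  while [[ $line == *"$string"* ]]; do
    count=$((count + 1))
    line=${line#*"$string"}   # shortest prefix ending in the first match
  done
  echo "$count"
}

# "aabb" contains "ab" once; remove-and-recheck would report 2 here.
count_by_prefix_strip "ab" "aabb"
```

This prints 1. For ordinary text the two approaches usually agree; they differ only when a removal creates a new occurrence.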

After the scan is complete, all the required information is stored in the array and the counters. The total number of occurrences is printed from the total_occurrence shell variable. The number of lines in which the string occurs is found by dividing line_no_indx by 2 (because each line takes two array locations). Then the number of occurrences in each matching line is printed from the array with the for loop: line_no_list[i] gives the line number where the string occurs, and line_no_list[i + 1] gives the number of times the string occurs in that line. This finishes the process.

Sample Output

[phoxis@localhost Unix  Shell Programming]$ ./str_match_stat.sh the test_file

The string "the" occurs "63" times
The string "the" occurs in "16" lines

[Occurrence # : Line Number : Nos of Occurrence in this line]:
1 : 1 : 6
2 : 5 : 3
3 : 6 : 2
4 : 12 : 1
5 : 14 : 3
6 : 16 : 4
7 : 18 : 13
8 : 19 : 3
9 : 20 : 2
10 : 22 : 5
11 : 23 : 7
12 : 25 : 7
13 : 27 : 1
14 : 31 : 1
15 : 37 : 2
16 : 40 : 3
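Totals like the ones in this output can be cross-checked with grep. The sketch below assumes a grep that supports the -o option (GNU and BSD grep do, but POSIX does not require it); grep_total and grep_lines are hypothetical helper names, and the demo file contents are invented for illustration.

```bash
#!/bin/bash
# Hypothetical cross-check helpers; they assume a grep that supports -o.

# Total non-overlapping occurrences of a fixed string in a file.
# -F treats the string literally, -o prints each match on its own line.
grep_total() {
  grep -o -F -e "$1" -- "$2" | wc -l | tr -d ' '
}

# Number of lines that contain the string at least once.
grep_lines() {
  grep -c -F -e "$1" -- "$2"
}

# Demo on a throwaway file (contents invented for illustration).
tmp=$(mktemp)
printf '%s\n' "the cat and the dog" "nothing here" "other" > "$tmp"
grep_total "the" "$tmp"   # prints 3
grep_lines "the" "$tmp"   # prints 2
rm -f "$tmp"
```

Because grep -o reports non-overlapping matches in the original line, its total can differ from the script's in the rare case where removing a match creates a new one.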

Comments

The process only scans the file for a single string. It can be extended to accept multiple strings and process them sequentially.
Bash is interpreted and slow, so this implementation is not well suited to practical use. It would be better to write one in C.
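The multi-string extension mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions, not a finished tool: count_in_file and count_many are hypothetical helpers, the file is taken as the first argument with the search strings after it, and only the total per string is reported.

```bash
#!/bin/bash
# Hypothetical helper: total occurrences of one literal string in a file,
# using the same remove-and-recheck idea as the script above.
count_in_file() {
  local string=$1 file=$2 line total=0
  [[ -z $string ]] && { echo 0; return; }   # avoid an endless loop
  while IFS= read -r line || [[ -n $line ]]; do
    while [[ $line == *"$string"* ]]; do
      total=$((total + 1))
      line=${line/"$string"/}
    done
  done < "$file"
  echo "$total"
}

# Hypothetical wrapper: first argument is the file, the rest are strings.
# Each string is processed in turn, so the file is read once per string.
count_many() {
  local file=$1 string
  shift
  for string in "$@"; do
    printf '%s : %s\n' "$string" "$(count_in_file "$string" "$file")"
  done
}
```

A call such as count_many test_file the and of would print one "string : count" line per search string. Reading the file once per string keeps the code simple but multiplies the running time, which adds to the performance concern noted above.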