Searching is fundamental to any programming language. It would be a waste to have data structures, and no way of searching through them. Searching is fundamental in in life (we even search for the meaning of our very own existence), so it only makes sense that after we create collections of data, that we want to search through them. By searching we are able to find answers to problems we are trying to solve. Take a phone book for example, or a telephone directory. Most of these come with tools that help make search relatively simple. For example, there are guides within the book that show you which letter you are currently on (making it easy to jump quickly to a section). If you wanted to call “Konstantin Addaquay” for example, you would just go the A section, which gives you listings of all people in the area with a last name starting with the letter A. Now imagine that telephone directories didn’t come already sorted alphabetically. It would be almost impossible trying to find someone in there (so many entries). Searching and sorting go hand in hand in computer science. Most of the time, the most efficient way to search is to have data that is already sorted. That is why this post and the next one will be covering searching and sorting algorithms.
The two searching algorithms we will be looking at are sequential/linear search, and binary search. So if one algorithm can be way faster than another, then why don’t we just use that one every time we perform search. In an ideal world this would be possible. So welcome to life! Where things are just never the way we like them to be. Or maybe, just maybe they are exactly the way they need to be, and we only resist what is. Truth is, sometimes we get data that is either sorted (Other times we don’t). When data items are random or not sorted, we use sequential or linear search. When data is already sorted, we use binary search. Please note, binary search only works when your data is sorted, and we must take into account the cost of sorting the data before we even begin searching through it. When writing code, performance is everything so we need to think things through before we execute them. Lets look at each search algorithm.
This is the most basic kind of search. Here, given a list of data, we start from the beginning of the collection and iterate through each item until we find what we are looking for. Suppose we had a dataset with 4 billion items. Using sequential search it is actually possible that the item we are looking for is the 4 billionth item. So the worse case scenario is that we search 4 billion items in order to find what we are looking for. The runtime complexity for linear search is O(n), which is a Big O notation representing the worst runtime for a given algorithm. The “n” you see stands for number of operations for a given search. Yes its possible that the item you are looking for could be the 5th item, but the truth is we really don’t know (its not sorted). If we find the item early then hurray! But there is a possibility that its located at the last spot (or closer to the end), in which case we would have to go through all items before we find what we are looking for. Its also possible that we don’t find the item at all. In which case we would have also gone through 4 billion items for nothing (what a waste of system resources!). This way of searching is the brute-force technique, because we could potentially visit all items.
Enough talk. Lets write our first search algorithm. You have seen this algorithm many times as a programmer (its what most of us write) on a daily basis. Lets take a look at it and see how we can improve upon it.
We have all written code like this. Especially in the first case where i use a for-loop (you could also use a while loop). When performing sequential search you have to loop through everything in order to find your item. Its also best to return from your function when the item is found, so the loop does not go all the way to the end (break the loop). That does not guarantee that it will be more performant, as you item you are searching for could still be located at the end of the array (in the worse case scenario O(n)). In the code, we have an items array. We call a function called itemSearch and pass it a parameter for which item to search. If the item is found, we print to the console stating which index the item was found at. The function returns a boolean value to the caller. So if the item was not found, false is returned to the caller and they can handle the results from there (in our case we are simply outputting a message to the console that the item was not found).
Now some advice on for-loops. They are great, BUT they can be prone to typos. You have to somehow be aware of the “i” variable as you loop, and have to write extra pieces of code (like using bracket notation ). It can be hard to understand the logic at first glance. Just take a look at the code, its verbose. Try to stay away from them if you can. To be honest for-loops are OLD SCHOOL. Again i am not saying don’t use them (as they work). I want to say there are better ways to loop through data than using for-loops that might be more beneficial. We can take advantage of functional style programming by using some of the built-in array helper methods. In this case, i provided code for the same functionality using forEach. If you can master array helper methods, trust me you will be able to work with collections of data more easily. When writing code for a web app, thats what we do for the most part (work with collections of data – we even build UI based on data). Compare the code using forEach versus using the for-loop. Its so much shorter and cleaner. I hope this is convincing enough. Using forEach, we pass in an anonymous function. That function gets called for each element in the array. So for each element we can check to see if the itemValue is of a certain value. We keep doing this until we find what we are looking for. Lets take a look at other examples.
Finding Maximum and Minimum
Imagine that you are given some kind of large sorted data (an array). The first index has a value of 10 and the last has a value of 1000. Because this data is sorted, finding the maximum and minimum values is pretty easy. The minimum is located at position 0 and the max is located at array length minus 1. Things are so much easier when the data is already sorted. But what if the data you have received is not sorted? How would you find the maximum and minimum values. You cannot find it without looping through the entire array life. Sequential search comes to the rescue! However a problem like this is not solved through just iterating through the data. You have to somehow find a way to compare each number within the data list in order to find out which is the minimum (the same goes for finding the maximum value). Lets code this out.
We have an items array with data that is not sorted. In order to find the minimum value in the collection, we start off with the first item in the array setting that to the current minium. We loop through every single item in the collection, if the current minimum is greater than the current item in the iteration, we set that item to the new minimum. The opposite is true for finding the maximum (if the current minimum is less than the current item in the iteration, set the current maximum to that item).
Self Organizing Data
So is there a way to make sequential search faster? How you might ask? We know that if you received data that is not sorted, then the only way is to loop through that data until you find what you are looking for. There is no running away from that. But there are still some enhancements we can do that will ensure that your data can be found quicker. Overtime we can write algorithms that self organize items in the collection. Chances are that 80% of the searches done are to retrieve 20% of your data (80-20 rule). This is the approach we will be using. Every-time a search is done, we bring the data close to the beginning of the collection. Lets enhance our itemSearch function.
Pretty cool huh? The more users search for the same data, the closer it gets to the beginning of the array list. We know that Big O for sequential search is O(n), but with self organizing data, its possible to get data that is searched for a lot, at almost a runtime of O(1). Again this wont be for all your data set. Data that is not frequently searched for will not be as performant as data that is. Sequential search is still not the fastest search algorithm, but sometimes there is just no way of getting around them. Employing or knowing techniques such as these may just come in handy. Lets look at a faster algorithm.
If we have a data set that is sorted, then binary search is a better way to search. Simply put, binary search is awesome. How awesome? In as little as 32 steps, you could find any item in a collection with 4 billion records! (that’s 4,000,000,000). That is some serious speed right there! Using regular search, it could potential take you 4 billion steps to find that same number (32 versus 4billion is a huge gap). So this algorithm is pretty impressive, however you can only use it if the data has already been sorted. Binary search works like how humans would do search if they were looking in a dictionary for example. If i asked you to search for the LOVE in a dictionary, you wouldn’t turn to the ‘A‘ section and start your search there. You would start more towards the middle of the book (say around letter M). Based on whether the alphabet you land on is high or lower, you know which direction to search (higher or lower). Alphabets M and above are higher that L, you could simply ignore all those other alphabets (narrows you collection from 26 to 13. if the collection was 1 billion, you would have narrowed it down to 500 million). This is very powerful. You can almost reduce your data list in half just because you know which direction to search. In a number guess game (where you guess a number between 1-100), if you guessed a number 60 and that final number we are looking for is higher, then you know for sure that numbers 1 thru 60 can be ignored. And you only guess higher number. This is similar to how binary search works. In a worst case scenario, it would take log2 n. This is logarithmic time or O(log n). When we looked at linear search, its runtime was linear time O(n). Linear time means in the worst case scenario the algorithm will search linearly through the entire collection, and logarithmic time means the search is done logarithmically! (if there is even such a word). Remember logarithms in high school? If the number of operations that need to be performed is 100, then logarithmic time would be log2 100 which is 2. This means use these algorithm it would only take 6/7 steps to complete. Given a sorted data set, for example 1-100, we know that the lowest number is 1 and the highest is 100. The means that we have this range of number to work in. If your first guess is 50 and i say “too high”, then we have just narrowed our highest number from 100 to 50. so now we have a range of 1 to 50 to work in. We keep doing this till we find our number. Lets code this out.
Lets see what is going on here. Our binary search function takes two parameters: a sorted array list and the item we want to find. Now there are only 5 or so items here, but it could easy be a billion items (the algorithm would still be the same). In our function we define a low and high variables. We also declare mid and guessed variables. Mid is the middle index (between low/high) of the collection and guessed is the number at that index. As long as the low of our collection’s low less than the high, we iteratively loop our array list and keep narrowing the numbers. Notice that in each iteration there is a new high or low as it keeps on changing. When the guessed number is equal to the item to find passed in, then we have found our number and exist the loop (low equals high). If the number cannot be found we simply return a null.
So this is how our binary search is going to work. But i have an interview question for you. What if we have repeated numbers in our sorted data? Yes this is possible. Lets just say that the data that comes back is [1,2,4,7,7,7,8,9]. We all know this is possible. So the question is how do you count multiple occurrences or indexes? Once you find the interested number or string (yes we can use this search for strings as well, not only numbers), we can traverse to the left and right of the number searching for instances of that number. You can write a while loop that looks for 7 (if thats the interested number) in either direction. The moment 7 stops repeating, we exit the loop (no need to go all the way to the end).