/*****************************************************************************
Sparse.cpp -(Parser.cc) -
began: 6/16 created searchout function 6/17
fixed parser function: 6/21- finished / tested 6/25 - 6/26
completed searchOut and auxillary functions 6/26 - 6/27. Tested
and fixed.
6/27 - added all searchable files and made outputted paths be full .html paths.
6/28 - 6/29 Put title parsing features in cheCkfile - performed some debugging
so it would compile - still need to fix logical errors with this.
7/3 - 7/4 Finished debugging checkFile function so that title of the article
could be included in the search. Needed to test for EOL characters
in title so that they are not entered into the array and made sure
title tag test did not extend beyond six characters.
7/6 - Fixed case problems in first character of search term so case is always
irrelevant in search.
8/14 - added preliminary OR search capability to parse function
8/26 - Modified with accompanying files - search (main), locations, and decode
for a more object oriented approach
9/8 - 12 p to 3 p - debugged files with or functionality - merged file back
into one as object oriented approach presented bizarre error messages
that could not be dealt with at this time
9/11 2:30 a - 3:30 a: further deubugged or functionality - never could
determinbe what was causing an absence of output in the test runs -
"all of a sudden it works"- printed source SunWS_cache? folder
9/12 - 9/19 Wasted a lot of time determining the source of the problem to be
the new CC compiler - it does not compile correctly to produce standard
output that works with CGI - use g++ instead.
9/26 - 9/27 Checked to verify how cgi using the sequence from
the reference site outputs buff2 - spaces are
decoded to +s at that point in time. Debugged
parse() function so that it seems to work fully.
Program no longer hangs on execution (changed some
logic expressions - which were doing assignments
inadvertently). Seems to work with quotes. Tosses
ors from expression. Next step make sure truth
tables is being set-up properly - as the correct
number of articles are not being returned - check on
this.
9/29 Fixed call to andOr() happening multiple times by assigning to
varible Fixed ability to call by increasing d (string length to < 5, not 4)
10 / 01 - Fixed first character getting that was causing it to
obtain the next letter of the first word searched for each
successive search term. reversed rows / cols
-OR searching seems to be fixed. AND still has issues
10 / 04
site - ready
Finished primary debugging; beta testing; placed on
for second stage
10/05 More Debugging. Logic errors for phrases beginning with
same letter fixed Could not place on 10/04
Usage: called by the searchform for techNJ, which will pass environment
content variable with the search criteria. Parses this in order
to verify input and generates links to html files with cgi calls
to highlight instances of the search word of that document when the
user clicks upon it.
Output: dynamically generated html code with cgi links to specific html
documents
Errors / Restrictions: If user wishes to search for word and or OR,
these must be placed in quotes - see parse function
Restrictions: Cannot Accept + or & operators for input because these are
part of the cgi encoding sequence. Future revisions might address this
in conjunction with the getting out of the & part of parse if
the last character is a quote
*****************************************************************************/
#include <fstream.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <iostream.h>
#include <fstream.h>
#include <ctype.h>
//#define DEBUG//conditional compilation
//for debugging
//Global variables
//Definition of the string data type - taken from examples
const int ucount = 100; //number of URLs search is capable of returning
//(for now)
int atUcount = 0; //counter for URL array
const int strmax = 500; //maximum length of string
const int strmin = 40; //maximum length of a small string
typedef char bigstring[strmax]; //defines a big string as array of 200 chars
typedef char string[strmin]; //defines a small string as array of 10 chars
bigstring search[strmin]; //variable which will store and output search
//search string can hold a maximum of 40
//entries
//this should be enough
int bs = 0;
//variable to increment bigstring counter
//for search variable; needs to be global as
//it will be referenced in the checkfile
//function
int srcTruthTable[strmin] = {
0
};
/*Source and Target (Target declared in function so it is reinitialized each
time through). Truth tables to indicate where and values in search string. If
a word is encountered that is part of an and statement, that location in the
truth table will be set to a 1. When a value is checked, a check will be made
to determine whether the corresponding entry in the truth table and the next
consecutive one are both 1. If they are, an and value has been encountered, and
another word must be searched for in order for the search to return a yes. When
all consecutive ones are matched and the target truth table matches the source
truth table for ones, the search returns a yes.
*/
char firstChar[40]; //array to hold the first characters of the
//search terms
char capFirstChar[40]; //array to hold the first characters
//(capital) of search terms
14 char radissues = "\0"; //variable to store status of radio selection
bigstring URLs[ucount]; //array to hold the urlsc
char titles[100][1000]; //the variable to contain the titles
int numFound = 0; //number of articles found in search
//-------------------------------FILE LOCATIONS---------------------------------
const int numIssues = 7; //the number of issues
const int maxfiles = 20; //the maxiumum # of html files in an issue
short int chk[numIssues]; //the chk variable
string location[numIssues][maxfiles] = {
{
"adamessentials.html",
"learningenglish.html",
"whereinspace.html",
"win96toc.html",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
},
//winter 96 issue - 4
{
"ldclass.html",
"openingnight.html",
"pipandzena.html",
"radsounds.html",
"rosieswalk.html",
"sbwdeluxe.html",
"sprsum96toc.html",
"strategygames.html",
"turtleteasers.html",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0"
},
//spring/summmer96 issue - 9
{
"arnold.html",
"art4metoo.html",
"artland.html",
"bigcalc.html",
"contrib.html",
"editor.html",
"events.html",
"expreshn.html",
"fall96toc.html",
"magictales.html",
"minspeak.html",
"netsites.html",
"njtarpws.html",
"ownvoice.html",
"storytell.html",
"tecktrek.html",
"treasures.html",
"wordsaround.html",
"writeaway.html",
"\0"
},
//fall96 issue - 19
{
"blaster.html",
"blindstud.html",
"blndadlt.html",
"contrib.html",
"editorial.html",
"edspicks.html",
"graph.html",
"graphact.html",
"jespy.html",
"major.html",
"mathkeys.html",
"measure.html",
"monclair.html",
"money.html",
"pwwebspe.html",
"selectin.html",
"snootz.html",
"spr97toc.html",
15 "telltime.html",
"zillions.html"
},
//spring97 issue - 20
{
"bowne.html",
"contrib.html",
"cowriter.html",
"dazzle.html",
"drew.html",
"earle.html",
"editorial.html",
"europe.html",
"inclusion.html",
"index.html",
"mathpad.html",
"pintoo.html",
"premiere.html",
"sensory.html",
"trainingmod.html",
"ultreader.html",
"\0",
"\0",
"\0",
"\0"
},
//winter98 issue - 16
{
"ReforminTeacherEd.html",
"TechInTwo.html",
"AugcommSys.html",
"ProgProfile.html",
"ParentPersp.html",
"resources.html",
"AccessArt.html",
"Reviews.html",
"Editorial.html",
"Contributors.html",
"index.html",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0",
},
//fall98 issue - 11
{
"features.html",
"curriculum.html",
"canWeTalk.html",
"intellitalk.html",
"rev2000.html",
"contribute.html",
"editor.html",
"resources.html",
"jwrite.html",
"sumChart.html",
"curicBox1.html",
"sunburst.html",
"curicBox2.html",
"index.html",
"\0",
"\0",
"\0",
"\0",
"\0",
"\0"
}
};
//2000 issue - 14
//-Finish File Locations
/*****************************************************************************
/*****************************************************************************
**
FUNCTION PROTOTYPES
******************************************************************************
*/
void parse(bigstring);
//function responsible for parsing string once
void searchOut();
//function to generate html links to searches
void checkFile(string, bigstring);
//1st function for below
/****AUXILARRY FUNCTIONS****/
int compareThem(string, string);
//function for comparisons of the search/sample
int andOr(string);
//function to determine whether word given is
and / or
//-------------------------------------------------
//
MAIN PROGRAM
//-------------------------------------------------
int main() {
//initialize arrays
for (int z = 0; z < ucount; z++) {
strncpy(URLs[z], "\0", strmax);
} //initialize the URLs array
for (int z = 0; z < numIssues; z++) {
strncpy(titles[z], "\0", strmax);
} //initialize titles array
for (int z = 0; z < numIssues; z++) {
chk[z] = 0;
}
//initialize numIssues array
for (int z = 0; z < strmin; z++) {
strncpy(search[z], "\0", strmax);
} //initialize search array
//loop to get input - this part of program copied from site above
//Part 1
char * endptr;
//note to self:
// int i;
//is and the a and b chars really necessary?
double contentlength;
char buff[10000] = "\0";
//char a,b;
//#ifndef DEBUG
//only use these if it is a "real" run
const char * len1 = getenv("CONTENT_LENGTH");
contentlength = strtol(len1, & endptr, 10);
fread(buff, contentlength, 1, stdin);
//#ifndef
//Part 2
int x;
int y;
char hexstr[100] = "\0";
char buff2[10000] = "\0";
//loop to eliminate make sure user does not mess up input
for (x = 0, y = 0; x < strlen(buff); x++, y++)
{
switch (buff[x]) {
/* Convert all + chars to space chars */
case '+':
buff2[y] = ' ';
break;
/* Convert all %xy hex codes into ASCII chars */
case '%':
/* Copy the two bytes following the % */
strncpy(hexstr, & buff[x + 1], 2);
/* Skip over the hex */
x = x + 2;
/* Convert the hex to ASCII */
/* Prevent user from altering URL delimiter sequence */
if (((strcmp(hexstr, "26") == 0)) || ((strcmp(hexstr, "3D") == 0))) {
buff2[y] = '%';
y++;
strcpy(buff2, hexstr);
y = y + 2;
break;
}
buff2[y] = (char) strtol(hexstr, NULL, 16);
break;
/* Make an exact copy of anything else */
default:
buff2[y] = buff[x];
break;
}
} //end for loop - now buff2 has the content length string in it without
//any hex.
//bigstring buff3 = "search=\"learning and alphasmart\" OR
Dell & issues = all & sub = Search!";
bigstring buff3 = "search=\"Amy Anne DISDIER\" OR
Anne & issues = these & 2000 = on & sub = Search!";
//above for debugging only
parse(buff2);
//call the parse routine to extract information
/////////////////////////////
#ifdef DEBUG
cout << "got here!\n";
cout.flush();
#endif
/////////////////////////////
searchOut();
//begin output and make call to generate strings
//FOR DEBUGGING PURPOSES TO MAKE SURE THE SOURCE TABLE IS LOADING
/////////////////////////////
#ifdef DEBUG
cout << "\n\t\tSource truth table:\n\n";
for (x = 0; x < strmin; x++) {
cout << srcTruthTable[x] << " ";
if ((x % 5) == 0) {
cout << "\n";
}
} //end for
cout << "\n*****************************************\n\n";
#endif
/////////////////////////////
/******OUTPUT TO CLIENT BROWSER PART******/
cout << "Content-type: text/html\n\n" << "<HTML>\n<HEAD><TITLE>Your Search
Results! < /TITLE>" <<
"</HEAD>\n<BODY BGCOLOR=\"#FEFEC8\">\n<H2>The following articles were
returned: < /H2>\n<BR>\n<UL>";
x = 0;
//reinitialize
//all-purpose counter
while ((x < ucount) && (strcmp(URLs[x], "\0"))) {
//if URLs = 0 then equal
cout << "<LI><A HREF=\"" << URLs[x] << "\">" << titles[x] << "</A></LI>\n";
x++;
//increment x
} //end while
cout << "\n<BR>\n<BR><FONT COLOR=\"#FF0000\"><H3><B>" << numFound << " <
/B></FONT > article(s) found.\n < BR > \n < /H3>\n" <<
"<H4>Choose one of the links above or:</H4>\n" <<
"<FORM>\n<input type=\"button\" value=\"Search Again\"
onclick = " <<
"\"window.location=\'http://www.tcnj.edu/~technj/2000/newsrch.ht
ml '\">\n" <<
"<input type=\"button\" value=\"Back to home\" onclick=" <<
"\"window.location=\'http://www.tcnj.edu/~technj\'\">\n" <<
"</FORM>\n</BODY>\n</HTML>";
cout << "\n\n";
/////////////////////////////
#ifdef DEBUG2
cout << "\n\t\tTarget truth table:\n\n";
for (x = 0; x < strmin; x++) {
cout << trgTruthTable[x] << " ";
if ((x % 5) == 0) {
cout << "\n";
}
} //end for
cout << "\n*****************************************\n\n";
#endif
/////////////////////////////
cout.flush();
return 0;
} //end main
//------------------------------------------------------------------------------
//
PARSE FUNCTION
//------------------------------------------------------------------------------
void parse(bigstring argument) {
int c = 7;
//counter variable - initialize to
int d = 0;
//another counter variable
int quoteOn = 0;
//determines whether a quoted word is currently
being read
string temp = "\0";
//temporary string to hold
int firstTime = 1;
//for a quoted expression - to end it
while (argument[c] != '&') {
/* for OR functionality*/
if (((quoteOn) && ((argument[c + 1] == ' ') || (argument[c + 1] == '&'))) &&
//beginning of next word
((argument[c] == '\"') || (argument[c] == '\''))) {
quoteOn = 0;
//turn quoting off
if (firstTime) {
firstTime = 0;
search[bs][++d] = '\0';
}
//dump everything else on this
/*NO CALL to andOr needed- coming out of a quote*/
c = c + 1;
//get past quote and space
//if quoted phrase - get out
} //end 1st if - TO TURN QUOTES OFF AND END A WORD
else if (((argument[c] == '\"') || (argument[c] == '\'')) &&
(!quoteOn)) {
quoteOn = 1;
c++;
//get past the quote and continue
} //end 2nd else-if - TO TURN QUOTES ON
else if (argument[c] == ' ') {
//store character and increment both vars
if (!quoteOn) {
//only create a new word if this is not a
//quoted phrase
search[bs][++d] = '\0';
//dump everything else on this
/////////////////////////////
#ifdef DEBUG
cout << "\nPresend: " << search[bs] << "\nlENGTH = " << d;
#endif
/////////////////////////////
if (d < 5) { //declared a variable here so call happens once - \0
counts d will be 4
int call = andOr(search[bs]);
//call to andOR search
if (call == 2) {
//dictates when search is an and
//set truth table
search[bs][0] = '\0';
//first set location in string to null to dump
//set preceding word to beand connected by and
srcTruthTable[bs - 1] = 1;
srcTruthTable[bs] = 1;
//and next word to come followed by and
bs--;
//decrement to undo upcoming increment
//If user wishes to search for word and or OR, these must be
placed in quotes
}
if (call == 1) //it is an or - will be tossed
{
search[bs][0] = '\0';
//get rid of whatever was being read
bs--;
//and decrement to undo increment
}
}
/*Insert procedure this is not an empty string*/
d = 0;
//reinitialize d
bs++;
//next word
c++;
//get past space
} else {
//the quotes are on
search[bs][d++] = argument[c++];
//INSERT CHARACTER AND MOVE ON
}
//either way
} //end 3rd else if - TO END A WORD NORMALLY
else {
search[bs][d++] = argument[c++];
}
//only insert next character
} //end while
/*This Code block for OR functionality*/
search[bs][++d] = '\0';
//dump everything else on search
//now SEARCH has the word stored within it - get next variable issues
//for debugging purposes
/////////////////////////////
#ifdef DEBUG
cout << "\n\n\nBegin here...\n";
for (int i = 0; i <= bs; i++) {
cout << search[i] << "\n";
cout << "Bigstring: " << i << "\n";
}
cout << "\n\n";
#endif
/////////////////////////////
/*CODE FROM HERE ON THROUGH REMAINDER OF METHOD DEALS WITH
CHECKBOXES AND OTHER STUFF - NOT THE SEARCH TERMS! */
while (argument[c] != '=') {
c++;
}
c++;
//get past equal sign
radissues = argument[c];
//radissues is a character - no need for null
//type of search determined - all issues
if (radissues == 't') { //issues encountered - more parsing
//else everything is already nulll
do {
d = 0;
//reinitialize d
strncpy(temp, "\0", 5);
while (argument[c] != '&') {
c++;
}
//get to next variable
c++;
//throw away amperstand
while (argument[c] != '=') {
temp[d++] = argument[c++];
}
temp[++d] = '\0'; //dump everything else on temp
if (strcmp(temp, "w96") == 0) chk[0] = 1;
if (strcmp(temp, "s96") == 0) chk[1] = 1;
if (strcmp(temp, "f96") == 0) chk[2] = 1;
if (strcmp(temp, "s97") == 0) chk[3] = 1;
if (strcmp(temp, "w98") == 0) chk[4] = 1;
if (strcmp(temp, "f98") == 0) chk[5] = 1;
if (strcmp(temp, "2000") == 0) chk[6] = 1;
} while (strcmp(temp, "sub"));
} //end if
//do while this is not equal to sub
else {
for (int x = 0; x < numIssues; x++) {
chk[x] = 1;
}
//check the whole article
} //end else
} //end of parse function - array chk set
//------------------------------------------------------------------------------
//
SEARCHOUT FUNCTION
//------------------------------------------------------------------------------
void searchOut() {
int count = 0;
int c = 0;
//big counter
//small (inner loop counter)
for (count = (numIssues - 1); count >= 0; count--) {
if ((chk[count] == 1) && (count == (numIssues - 1))) {
for (c = 0; c < maxfiles; c++) {
if (strcmp(location[count][c], "\0")) {
string fileLoc = "../../2000/";
bigstring address = "http://www.tcnj.edu/~technj/2000/";
strcat(fileLoc, location[count][c]);
strcat(address, location[count][c]);
checkFile(fileLoc, address);
//call function to chk file
}
} //end for
} //end 1st if
if ((chk[count] == 1) && (count == (numIssues - 2))) {
for (c = 0; c < maxfiles; c++) {
if (strcmp(location[count][c], "\0")) {
string fileLoc = "../../Fall98/";
bigstring address = "http://www.tcnj.edu/~technj/Fall98/";
strcat(fileLoc, location[count][c]);
strcat(address, location[count][c]);
checkFile(fileLoc, address);
//call function to chk file
}
} //end for
} //end 2nd if
if ((chk[count] == 1) && (count == (numIssues - 3))) {
for (c = 0; c < maxfiles; c++) {
if (strcmp(location[count][c], "\0")) {
string fileLoc = "../../win98/";
bigstring address = "http://www.tcnj.edu/~technj/win98/";
strcat(fileLoc, location[count][c]);
strcat(address, location[count][c]);
checkFile(fileLoc, address);
//call function to chk file
}
} //end for
} // end 3rd if
if ((chk[count] == 1) && (count == (numIssues - 4))) {
for (c = 0; c < maxfiles; c++) {
if (strcmp(location[count][c], "\0")) {
string fileLoc = "../../spr97/";
bigstring address = "http://www.tcnj.edu/~technj/spr97/";
strcat(fileLoc, location[count][c]);
strcat(address, location[count][c]);
checkFile(fileLoc, address);
//call function to chk file
}
} //end for
} //end 4th if
if ((chk[count] == 1) && (count == (numIssues - 5))) {
for (c = 0; c < maxfiles; c++) {
if (strcmp(location[count][c], "\0")) {
string fileLoc = "../../fall96/";
bigstring address = "http://www.tcnj.edu/~technj/fall96/";
strcat(fileLoc, location[count][c]);
strcat(address, location[count][c]);
checkFile(fileLoc, address);
//call function to chk file
}
} //end for
} //end 5th if
if ((chk[count] == 1) && (count == (numIssues - 6))) {
for (c = 0; c < maxfiles; c++) {
if (strcmp(location[count][c], "\0")) {
string fileLoc = "../../sprsum96/";
bigstring address = "http://www.tcnj.edu/~technj/sprsum96/";
strcat(fileLoc, location[count][c]);
strcat(address, location[count][c]);
checkFile(fileLoc, address);
//call function to chk file
}
} //end for
} //end 6th if
if ((chk[count] == 1) && (count == (numIssues - 7))) {
for (c = 0; c < maxfiles; c++) {
if (strcmp(location[count][c], "\0")) {
string fileLoc = "../../win96/";
bigstring address = "http://www.tcnj.edu/~technj/win96/";
strcat(fileLoc, location[count][c]);
strcat(address, location[count][c]);
checkFile(fileLoc, address);
//call function to chk file
}
} //end for
} //end 7th if
} //end BIG FOR
} //END FUNCTION SEARCHOUT
//------------------------------------------------------------------------------
//
CHECKFILE FUNCTION
//------------------------------------------------------------------------------
void checkFile(string filespec, bigstring adr) {
int done = 0;
//to determine if word has been found
string sample = "\0";
//to hold the word being tested
bigstring line = "\0";
//to hold the current line
int lineLength = 0;
//length of the line of characters being read
int pcount = 0;
//parsing counter
int scount = 0;
//sample counter
int searchLength[strmin];
//lengths of the search variables stored in
//corresponding
//positions in this array
int gotTitle = 0;
int titleRead = 0;
int titleCount = 0;
int URLFlag = 0; //title truth variable
//indicates that the title is now being read
//title string counter
//flag to test whether a valid URL has been
//gathered from a file
ifstream fin; //input file stream
//for loop here - collect and store lower and upper case characters of all
search terms;
//then store the lengths of all the search strings
for (int i = 0; i <= bs; i++) {
firstChar[i] = (char) tolower(search[i][0]);
capFirstChar[i] = (char) toupper(search[i][0]);
searchLength[i] = strlen(search[i]);
}
fin.open(filespec);
//open the file
if (!fin) {
cout << "Content-type: text/html\n\n" << "Could not open" << filespec;
exit(0);
}
//NOW THE FILE IS OPEN AND GOOD
done = 0;
//reinitialze done to false
//************************INITIALIZE TRUTH TABLE*********************
int trgTruthTable[strmin] = {
0
};
//FOR THIS FILE/////////////
int oldpcount = 0;
//need should an and be encountered
//in the file
while (fin && !done) {
//while the end of file has not been encountered...
/////////////////////////////////////////////////////////////////////
fin.getline(line, strmax); //read either 500 (strmax) characters or up to
//EOL
lineLength = strlen(line); //store the length of the line
for (pcount = 0; pcount < lineLength; pcount++) {
/*****************************FOR GETTING TITLE*************************/
if ((line[pcount] == '<') && (!gotTitle)) {
int maxRead = 0;
strncpy(sample, "\0", strmin);
//reinitialize sample
scount = 0;
//and the counter
pcount++;
//get past the <
while (line[pcount] != '>' && (maxRead < 6)) {
sample[scount] = line[pcount];
//space to get out
scount++;
//increment sample counter
pcount++;
//otherwise - increment only pcount
if (pcount > lineLength) {
break;
}
//protection against core dump
maxRead++;
//maximum it should read
} //end while
scount++;
sample[scount] = '\0';
//dump the rest of characters in sample
if ((compareThem(sample, "title")) == 0) {
titleRead = 1;
gotTitle = 1;
//title read is now set for loop below
}
pcount++;
} //END 1st if
//must increment to get past the >
/*****************************FOR READING TITLE*************************/
if ((line[pcount] != '<') && titleRead) {
//is title being read - if yes
//store char for it
if (((int) line[pcount]) > 31) {
//cut out weird chars
titles[atUcount][titleCount] = line[pcount];
titleCount++;
} //increment titleCounter
//pcount is incremented in BIG for loop - no need to increment here
} //END 2nd if
/*****************************FOR STOPPING TITLE*************************/
if ((line[pcount] == '<') && titleRead) {
pcount++;
//get past the <
if (line[pcount] == '/') {
//test to see if it is an end tag
int maxRead = 0;
strncpy(sample, "\0", strmin);
//reinitialize sample
scount = 0; //and the counter
pcount++; //get past the / - yes it is?
while ((line[pcount] != '>') && (maxRead < 6)) {
sample[scount] = line[pcount];
//space to get out
scount++;
//increment sample counter
pcount++;
//otherwise - increment only pcount
if (pcount > lineLength) {
break;
}
//protection
maxRead++; //maximum it should read
//AGAINST BLOODY CORE DUMP!
} //end while
scount++;
sample[scount] = '\0';
//NOW dump the rest of characters in sample
if ((compareThem(sample, "title")) == 0) {
gotTitle = 1;
titleRead = 0;
}
//title has been obtained - no longer reading
pcount++;
//must increment to get past the >
}
/////////////////////////////
#ifdef DEBUG
cout << "\nTitle obtained: " << filespec << "\n";
cout.flush();
#endif
/////////////////////////////
} //END 3rd if
/******************************FOR SAMPLE*******************************/
if (!titleRead && gotTitle && (!isalnum(line[pcount - 1]))) {
for (int i = 0; i <= bs; i++) {
//label for and
oldpcount = pcount;
//need should an and be encountered
if ((line[pcount] == firstChar[i]) || (line[pcount] ==
capFirstChar[i])) {
/////////////////////////////
#ifdef DEBUG
cout << "\nOld Pcount in loop" << oldpcount << "\n";
#endif
/////////////////////////////
//if a match, look for chars
strncpy(sample, "\0", strmin);
//reinitialize sample
scount = 0; //and the counter
while (strlen(sample) != searchLength[i]) {
sample[scount] = line[pcount];
pcount++;
scount++;
if (pcount > lineLength) {
break;
}
//protection
//AGAINST BLOODY CORE DUMP!
} //end while
//increment both counters
scount++;
sample[scount] = '\0';
//dump the rest of characters in sample
/*FINISH*/
//test to see if word in sample matches the search word
/////////////////////////////
#ifdef DEBUG
cout << "looking up: " << sample << " comparing to " << search[i] << " bs=" <<
bs << "\n";
#endif
/////////////////////////////
int compSrch = compareThem(sample, search[i]);
/////////////////////////////
#ifdef DEBUG
cout << "\ncompSrch (Before truth tables): " << compSrch << "\n";
#endif
/////////////////////////////
if (compSrch == 0) {
//got it!
if (srcTruthTable[i] == 0) {
/////////////////////////////
#ifdef DEBUG
cout << "Supposedly storing " << adr << "\n";
#endif
/////////////////////////////
done = 1;
//got a word from the file - get out
i = bs + 1;
//get out of inner loop
pcount = lineLength;
//got a word - get out of outer loop
strncpy(URLs[atUcount], adr, strmax);
//store results in ucount (array - for now)
URLFlag = 1;
atUcount++;
numFound++;
//increment all necessary vars
gotTitle = 0;
//valid URL flag - increment URL counter
} else {
//ANDS
trgTruthTable[i] = 1;
int srcStart = 0;
//location where this group of "ands begins"
int srcEnd = 0;
//location where this group of "ands ends"
int iA = i;
//1st location counter
int iB = i;
//2nd location counter
int ok = 1;
//assumes that this will be the last of the ands
//Problem
while (srcTruthTable[iA] == 1) {
if (iA > 0) {
//cannot go past the first element of the string
iA--;
} else {
break;
}
//get out i must be 0
} //end while
(iA != 0) ? srcStart = iA + 1: srcStart = 0;
//location of beginning of ands stored
while (srcTruthTable[iB] == 1) {
if (iB <= (strmin)) {
iB++;
} else {
break;
}
//last position in truth table - get out
} //end while
srcEnd = iB - 1;
/////////////////////////////
#ifdef DEBUG
cout << "\nstart" << srcStart << "\nend" << srcEnd << "\n";
#endif
/////////////////////////////
for (int iC = srcStart; iC <= srcEnd; iC++) {
/////////////////////////////
#ifdef DEBUG
cout << "\ntrgtruth " << iC << " is " << trgTruthTable[iC] << "\n";
#endif
/////////////////////////////
if (trgTruthTable[iC] == 0) {
//if this equals 0 and not entirely fulfilled
ok = 0;
iC = srcEnd + 1;
//this is not yet ready to qualify as a file
} //set not ok and break out of for loop
} //end for loop
if (ok) {
//repeat above if all conditions passed - else
//forget it!
done = 1;
//got a word from the file - get out
i = bs + 1;
//get out of inner loop
pcount = lineLength;
//got a word - get out of outer loop
strncpy(URLs[atUcount], adr, strmax);
//store results in ucount (array - for now)
URLFlag = 1;
atUcount++;
numFound++;
//increment all necessary vars
gotTitle = 0;
//valid URL flag - increment URL counter
} //end ifok conditionO
} //end else
} //end if compSrch == 0
/////////////////////////////
#ifdef DEBUG
cout << "\ncompSrch " << compSrch << "\n";
#endif
/////////////////////////////
//THEY DID NOT COMPARE CORRECTLY - SEE IF THIS IS AN AND (or OR) with the
same
letter
if (compSrch != 0)
{
pcount = oldpcount;
}
//restore the parsing counter to the beginning
//of the word to be checked
} //end for
}
//end Inner for loop for testing each search
//term
} //END 4TH OUTER IF
} //end for
} //end while
/****************(Cleanup)****FOR VALIDATING TITLE********************/
if (!URLFlag) {
//if the URL flag is not set - dump the tile
//no need to set ucount - it was never
//incremented if true
strncpy(titles[atUcount], "\0", strmax);
//reinitialize the array and counter
titleCount = 0;
//pcount is incremented in BIG for loop - no need to increment here
} //END 5th if
fin.close();
//close the file
} //end (auxillary) function checkfile
//------------------------------------------------------------------------------
//
AUXILLARY COMPARETHEM FUNCTION
//------------------------------------------------------------------------------
int compareThem(string a, string b) {
int lengthA = strlen(a);
int lengthB = strlen(b);
int comp = 0;
//comparison counter
int tmp1, tmp2;
//2 temporary integers
if (lengthA != lengthB) {
return 1;
}
//nope - already know these are not =
for (comp = 0; comp < lengthA; comp++) {
tmp1 = tolower(a[comp]);
tmp2 = tolower(b[comp]);
if (tmp1 != tmp2) {
return 1;
}
//not equal
} //end for
return 0;
//all compared ok - must be equal
} //end auxillary functionc compareThem
//------------------------------------------------------------------------------
//
AUXILLARY ANDOR FUNCTION
//returns 0 if neither
//returns 1 if OR
//returns 2 if AND
//------------------------------------------------------------------------------
int andOr(string AO) {
string tester = "\0";
int length = strlen(AO);
//get the string length of the argument
for (int loop = 0; loop < length; loop++) {
tester[loop] = tolower(AO[loop]);
}
//now test it
/////////////////////////////
#ifdef DEBUG
cout << "\nIn AndOR: AO = " << AO << "\ntester = " << tester << "\n";
#endif
/////////////////////////////
if ((strcmp(tester, "or")) == 0) {
return 1;
} else if ((strcmp(tester, "and")) == 0) {
return 2;
} else return 0;
} //end auxiliary ANDOR function
The Search Routine
In order to conduct the development of the "workhorse" of the
search engine, I
opted to divide the code into two major parts or functions: one
function, parse(string)
which separates the information of the given in the CGI environment
variables and another function searchOut(),
which performs the manipulation of strings and file operations on all
of the selected documents on the site
using another function, checkfile().
The program‘s hierarchy of these functions along with two auxiliary
functions,
andOr(string, string), for supporting and / or operations, and
comparethem(string, string),
for determining the equality of any two string expressions is
indicated in the flowchart
shown in figure 2. This program, denoted, "sparse.cpp," acts as
the complete backend
responsible for dynamically returning HTML search results to the
user. One might notice
that the program is relatively large in comparison to typical C/C++
classes designed by
students and that the program‘s functionality might be better
defined if the the large
sparse program was broken down into several different C/C++ classes.
A more object-
oriented approach was in fact attempted in the course of developing
this system, but due
to extensive problems with CGI interfacing in the program caused by
compiler and
system upgrades (unbeknownst to the me as the programmer at the
moment they were
performed), enough time to complete such an approach was not
available.
The Search Routine Code Examined
In a nutshell, the program‘s parse and search functions are as
follows::
a. The parse function obtains its string, denoted argument as the
main function passes
buff2 to it.
b. The parse function gathers each possible search term from this
string (or phrases
enclosed in quotes counting as a single search term), and stores
them in an array of search
terms search. Ands and Ors are eliminated from the search string as
the program sets a
global array of 0s and 1s denoted the sourcetruthtable. Taken
verbatim from the
program‘s documentation, the sourcetruthtable works in tandem with
the
targetruthtable, another global array of 0s and 1s, to facilitate
and/or match
expressions as follows:
[The] truth tables indicate "and" and "or" values in search
string. If
a word is encountered that is part of an and statement, that location
in the
corresponding truth table will be set to a 1. When a value is
checked, a
check will be made to determine whether the corresponding entry in
the
truth table and the next consecutive one are both 1. If they are, an
and
value has been encountered, and another word must be obtained in
order
for the search to return a yes. When all consecutive ones [in a
particular
location] are matched and [a group of 1s] in the target truth table
matches
the location and number of 1s in the source truth table, the search
criteria
returns a yes.
In this scheme, the source truth table, sets as the environment
variable for the search
string is parsed and each search term or phrase is added to
successive places in the array
search. As of the time of this writing, the search engine can analyze
a maximum of 40
different search terms of phrases (delimited by single or double
quotes) or any
combination thereof, not counting the And or Or words separating
these phrases. This
number is based on a static (constant) integer expression in the
program, strmin and could
be increased to accommodate larger numbers of terms, but this was believed
to be sufficient at the time of this writing. As already mentioned,
the auxiliary function,
andOr(string) contains the subroutine that actually determines
whether a specific word is
AND, OR, or neither of these values and returns a status integer
indicating the result. It
should be noted that the search engine, in addition to being full
text and supporting up to
40 possible search terms, and supporting user selection of which
documents to retrieve,
supports the ability to search for entire quoted phrases of words in
addition to singletons.
The phrases must be placed between corresponding single or double
quotes for this
function to operate.
c. After the source truth table has been set and all search words
have been loaded into the
search array, the parse function moves on to set the global
variables associated with
the issues to be searched. Like the truth tables, these variables
use a Boolean binary
system to establish whether certain issues should be searched. Each
issue has a
variable associated with it. If that variable is set to one, the
issue will be searched in
the call to searchOut(); otherwise the issue will be skipped.
Furthermore, whether or
not the radio button associated with the selection of issues is set
determines whether
the checkbox variables from the form will be assessed at all. If the
value of the radio
button is "all," all of the Boolean binary variables associated
with the individual issues
will be checked.; this is the default setting. If it is "these,"
only issues whose
checkboxes have been set on the form will be checked. All of the
variable setting
occurs on the search engine‘s HTML form.
d. Now that the global variables have been set, the program returns
to main() where the
searchOut() function is called. Upon entry into this function, each
of the global
variables containing the Boolean binary value determining whether a
specific issue is
to be searched is evaluated, starting with the most current issue.
If searchOut()
determines an issue is to be evaluated and the number of articles in
a certain issue has
not been exceeded (i.e. the a file is not equal to an empty string),
the string variable
fileLoc is set to the relative path to the directory containing the
issue, the address
variable is set to the URL that contains the issue should it need to
be returned as valid
later, and the checkfile(string, string) function is called to check
that file. This process
is repeated in searchOut() for every valid file (not "\0" empty
string) in every valid
issue to be searched.
e. Upon the calling of the checkfile(string, string) function, the
system issues a call to
open the file to be parsed and proceeds to parse the file line by
line until either the
search criteria have been satisfied to return the file as containing
the search item(s) or
the end of the file has been reached. (Of course, safe guards are
present to stop the
program should a invalid file handle be called). Before the search
can begin, however,
the checkfile(string, string) function must initialize variables
necessary to its operation.
These include:
• the target truth table to all 0s - once the inner loop is entered
and a group of ones
makes their way into the target truth table surrounded by 0s having
matching
locations in the source truth table, the search can return a true
result.
• the corresponding letter to the first character of the search
term in both cases
(should it be a letter) and the length of the search term so that
each time that character
arises in the text, a possible match can be evaluated. This
information is stored for
each search term i (where i is 0 to 39) in the arrays firstChar[i],
capFirstChar[i]
(capitalized first character), and searchLenth[i] respectively.
• variables for storing information regarding whether the title has
been obtained for
the document. The search cannot begin searching for sample words
until the title of
the document has been gotten. The title associated with the URL path
to be
returned to the user upon a successful search is based upon the title
of the HTML
document which appears between the <title> </title> HTML
tags. No search is
returned as valid for a document if the sample search term only
resides between these
two tags (as the engine does not actually begin searching for a match
until the title has
been obtained- thus the title must appear at the beginning of every
HTML document-
as is appropriate - in the header). This creates the restriction that
every document
must have a title, even if it is only one character for this search
engine to operate
properly. This is necessary to return something more descriptive as a
link should the
document be returned to the user as a match other than a URL path.
This should not
cause a problem in the future as good HTML coding practice REQUIRES a
title field
to be present within the header field of an HTML document. All
principle
information stored regarding the title is found in the variables
gotTitle (a binary
Boolean variable to determine whether the title has been obtained),
titleRead (a binary
Boolean variable to determine whether the title is currently being
read), and titleCount, a
counter for the this process of getting the title.
• a value determining whether the search is ready to terminate and
return, done, a
binary Boolean variable originally initialized to 0, and a binary
Boolean value
corresponding to whether the URL for this file should be stored; this
only gets set to 1
later should the document return true to the query.
The checkfile(string, string) function uses the auxiliary function
compareThem(string
string) as already noted to compare a sample value against a search
term.
f. Once searchOut() has called checkfile(string, string) for every
valid file, that is not an
"\0" empty string file in every issue, the program returns
control back to the main()
function where a few simple cout functions combined with a for loop
write out HTML
results to the user. At the conclusion of this process, a
cout.flush() is called to
complete the process and the program terminates normally, returning
0.