I. | Getting Started - Setting up your PC for wordup development. |
II. | Using the Shell - Some common shell commands. |
III. | Using GIT - Source code management. |
IV. | Hardware Administration - Gigablast hardware resources. |
V. | Directory Structure - How files are laid out. |
VI. | Kernels - Kernels used by Gigablast. |
VII. | Coding Conventions - The coding style used at Gigablast. |
VIII. | Debugging Gigablast - How to debug gb. |
IX. | Code Overview - Basic layers of the code. |
XIV. | Fighting Spam - Search engine spam identification. |
ssh-keygen -t dsa
ssh-copy-id -i ~/.ssh/id_dsa.pub destHost
Command | Description |
Export the LD_PATH variable, used to tell the OS where to look for dynamic libraries. | |
Ctrl+p | Show the previously executed command. |
Ctrl+n | Show the next executed command. |
Ctrl+f | Move cursor forward. |
Ctrl+b | Move cursor backward. |
Ctrl+a | Move cursor to start of line. |
Ctrl+e | Move cursor to end of line. |
Ctrl+k | Cut buffer from cursor forward. |
Ctrl+y | Yank (paste) buffer at cursor location. |
Ctrl+Shift+- | Undo last keystrokes. |
history | Show list of last commands executed. Edit /home/username/.bashrc to change the number of commands stored in the history. All are stored in /home/username/.history file. |
!xxx | Execute command #xxx, where xxx is a number shown from the 'history' command. |
ps auxww | Show all processes. |
ls -latr | Show all files reverse sorted by time. |
ls -larS | Show all files reverse sorted by size. |
ln -s <x> <y> | Make directory or file y a symbolic link to x. |
cat xxx | awk -F":" '{print $1}' | Show contents of file xxx, but for each line, use : as a delimiter and print out the first token. |
dsh -c -f hosts 'cat /proc/scsi/scsi' | Show all hard drives on all machines listed in the file hosts. -c means to execute this command concurrently on all those machines. dsh must be installed with apt-get install dsh for this to work. You can use double quotes in a single quoted dsh command without problems, so you can grep for a phrase, for instance. |
apt-cache search xxx | Search for a package to install. xxx is a space separated list of keywords. Debian only. |
apt-cache show xxx | Show details of the package named xxx. Debian only. |
apt-get install xxx | Installs a package named xxx. Must be root to do this. Debian only. |
adduser xxx | Add a new user to the system with username xxx. |
This will copy the git repository to the destination directory. |
More information available at github.com
sudo apt-get install git-core git-doc
git config --global user.name "Your Name"
git config --global user.email "your@email.com"
git config --global color.ui true
ssh-keygen -t rsa -C "your@email.com" -f ~/.ssh/git_rsa
cat ~/.ssh/git_rsa.pub
Copy and paste the ssh-rsa output from the above command into your Github profile's list of SSH Keys.
ssh-add ~/.ssh/git_rsa
If that gives you an error about inability to connect to ssh agent, run:
eval `ssh-agent -s`
Then test and clone!
ssh -T git@github.com
git clone git@github.com:gigablast/open-source-search-engine
mwells@gf36:/a$ dmesg | tail -3
scsi5: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 13 89 68 82 00 00 18 00
Info fld=0x1389688b, Current sd08:34: sense key Medium Error
I/O error: dev 08:34, sector 323627528
If you do a cat /proc/scsi/scsi you can see what type of hard drives are in the server:
mwells@gf36:/a$ cat /proc/scsi/scsi
Attached devices:
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
  Type: Direct-Access ANSI SCSI revision: 03
So for this error we should replace the rightmost hard drive with a spare hard drive. Usually we have spare hard drives floating around. You will know by looking at colo.html where all the equipment is stored.
We generally use gdb to debug Gigablast. If you are running gb, the gigablast process, under gdb and it receives a signal, gdb will break. Tell it to ignore the signal from now on by typing handle SIG39 nostop noprint (for signal 39, at least), then continue execution by typing 'c' and Enter. When debugging a core on a customer's machine you might have to copy your version of gdb over to it if they don't have one installed.
There is also a /gb/bin/gdbserver that you can use to debug a remote gb process, although hardly anyone uses it; Partap used it a few times.
The most common way to use gdb is:
You can also use gdb to do poor man's profiling by repeatedly attaching gdb to the gb pid like gdb ./gb <pid> and seeing where it is spending its time. This is a fairly effective random sampling technique.
If a gb process goes into an infinite loop, you can get it to save its in-memory data by attaching gdb to its pid and typing print mainShutdown(1). This tells gdb to run that function, which saves all of gb's data to disk so you don't end up losing data.
To debug a core, type gdb ./gb <coreFilename>. Then you can examine why gb core dumped. Please copy the gb binary and move the core to another filename if you want to preserve the core for another engineer to look at. You need to use the exact gb that produced the core in order to analyze it properly.
It is useful to have the following .gdbinit file in your home directory:
set print elements 100000
handle SIG32 nostop noprint
handle SIG35 nostop noprint
handle SIG39 nostop noprint
set overload-resolution off
The overload-resolution setting gets in the way when trying to print the return value of some functions, like uccDebug() for instance.
Problem | Cause |
Core in RdbTree | Probably bad ram |
1. | Calling atoip() twice or more in the same printf() statement. atoip() outputs into a single static buffer and can not be shared this way. |
2. | Not calling class constructors and destructors when mallocing/freeing class objects. Need to allow class to initialize properly after allocation and free any allocated memory before it is freed. |
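Both pitfalls can be shown in a few lines. The sketch below is illustrative only: ipToStatic() is a hypothetical stand-in for atoip() (whose real signature is not shown in this document), and Doc, makeDoc() and freeDoc() are invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <string>

// Hypothetical stand-in for atoip(): every call writes into ONE static buffer,
// so a second call overwrites the result of the first.
const char *ipToStatic(unsigned ip) {
    static char buf[16];
    std::snprintf(buf, sizeof(buf), "%u.%u.%u.%u",
                  (ip >> 24) & 0xff, (ip >> 16) & 0xff, (ip >> 8) & 0xff, ip & 0xff);
    return buf;
}

// Safer variant: the caller supplies the buffer, so two results can coexist
// in the same printf() call.
const char *ipToBuf(unsigned ip, char *buf, size_t len) {
    assert(buf && len >= 16);
    std::snprintf(buf, len, "%u.%u.%u.%u",
                  (ip >> 24) & 0xff, (ip >> 16) & 0xff, (ip >> 8) & 0xff, ip & 0xff);
    return buf;
}

// Invented class with a non-trivial member that needs construction/destruction.
struct Doc {
    std::string url;
    Doc() : url("http://example.com/") {}
};

// malloc() only reserves bytes; placement new runs the constructor on them.
Doc *makeDoc() {
    void *mem = std::malloc(sizeof(Doc));
    if (!mem) return nullptr;
    return new (mem) Doc();
}

// Run the destructor explicitly before handing the bytes back to free().
void freeDoc(Doc *d) {
    if (!d) return;
    d->~Doc();
    std::free(d);
}
```

With these definitions, printf("%s %s", ipToStatic(a), ipToStatic(b)) prints the same address twice, while passing two separate buffers to ipToBuf() prints both correctly.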
Indexdb - Used to hold the index.
Datedb - Like indexdb, but its scores are dates.
Titledb - Used to hold cached web pages.
Spiderdb - Used to hold urls sorted by their scheduled time to be spidered.
Checksumdb - Used for preventing spidering of duplicate pages.
Sitedb - Used for classifying webpages. Maps webpages to rulesets.
Clusterdb - Used to hold the site hash, family filter bit, language id of a document.
Catdb - Used to classify a document using DMOZ.
Checksumdb | DB | Rdb that maps a docId to a checksum for an indexed document. Used to dedup same content from the same hostname at build time. |
Clusterdb | DB | Rdb that maps a docId to the hash of a site and its family filter bit and, optionally, a sample vector used for deduping search results. Used for site clustering, family filtering and deduping at query time. |
Datedb | DB | Like indexdb, but its scores are 4-byte dates. |
Indexdb | DB | Rdb that maps a termId to a score and docId pair. The search index is stored in Indexdb. |
MemPool | DB | Used by RdbTree to add new records to tree without having to do an individual malloc. |
MemPoolTree | DB | Unused. Was our own malloc routine. |
Msg0 | DB | Fetches an RdbList from across the network. |
Msg1 | DB | Adds all the records in an RdbList to various hosts in the network. |
Msg3 | DB | Reads an RdbList from several consecutive files in a particular Rdb. |
Msg5 | DB | Uses Msg3 to read RdbLists from multiple files and then merges those lists into a single RdbList. Does corruption detection and repair. Integrates the list from RdbTree into the single RdbList. |
MsgB | DB | Unused. A distributed cache for caching anything. |
Rdb | DB | The core database class from which all are derived. |
RdbBase | DB | Each Rdb has an array of RdbBases, one for each collection. Each RdbBase has an array of BigFiles for that collection. |
RdbCache | DB | Can cache RdbLists or individual Rdb records. |
RdbDump | DB | Dumps the RdbTree to an Rdb file. Also is used by RdbMerge to dump the merged RdbList to a file. |
RdbList | DB | A list of Rdb records. |
RdbMap | DB | Maps an Rdb key to an offset into an RdbFile. |
RdbMem | DB | Memory manager for RdbTree so it does not have to allocate space for every record in the tree. |
RdbMerge | DB | Merges multiple Rdb files into one Rdb file. Uses Msg5 and RdbDump to do reading and writing respectively. |
RdbScan | DB | Reads an RdbList from an RdbFile, used by Msg3. |
RdbTree | DB | A binary tree of Rdb records. All collections share a single RdbTree, so the collection number is specified for each node in the tree. |
SiteRec | DB | A record in Sitedb. |
Sitedb | DB | An Rdb that maps a url to a Sitedb record which contains a ruleset to be used to parse and index that url. |
SpiderRec | DB | A record in spiderdb. |
Spiderdb | DB | An Rdb whose records are urls sorted by times they should be spidered. The key contains other information like if the url is old or new to the index, and the priority of the url, currently from 0 to 7. |
TitleRec | DB | A record in Titledb. |
Titledb | DB | An Rdb where the records are basically compressed web pages, along with other info like the quality of the page. Contains an instance of the LinkInfo class. |
1145413139583 81 WARN db [31956] Key out of order in list of records.
1145413139583 81 WARN db [31956] Corrupt filename is indexdb1215.dat.
1145413139583 81 WARN db [31956] startKey.n1=af6da55 n0=14a9fe4f69cd6d46 endKey.n1=b14e5d0 n0=8d4cfb0deeb52cc3
1145413139729 81 WARN db [31956] Removed 0 bytes of data from list to make it sane.
1145413139729 81 WARN db [31956] Removed 6 recs to fix out of order problem.
1145413139729 81 WARN db [31956] Removed 12153 recs to fix out of range problem.
1145413139975 81 WARN net Encountered a corrupt list.
1145413139975 81 WARN net Getting remote list from twin instead.
1145413139471 81 WARN net Received good list from twin. Requested 5000000 bytes and got 5000010. startKey.n1=af6cc0e n0=ae7dfec68a44a788 endKey.n1=ffffffffffffffff n0=ffffffffffffffff
tttttttt tttttttt tttttttt tttttttt  t = termId (48bits)
tttttttt tttttttt dddddddd dddddddd  d = docId (38 bits)
dddddddd dddddddd dddddd0r rrrggggg  r = siterank, g = langid
wwwwwwww wwwwwwww wwGGGGss ssvvvvFF  w = word position, s = wordspamrank
pppppb1M MMMMLZZD                    v = diversityrank, p = densityrank
                                     M = unused, b = in outlink text
                                     L = langIdShiftBit (upper bit for langid)
                                     Z = compression bits. can compress to 12 or 6 bytes keys.
G: 0 = body
   1 = intitletag
   2 = inheading
   3 = inlist
   4 = inmetatag
   5 = inlinktext
   6 = tag
   7 = inneighborhood
   8 = internalinlinktext
   9 = inurl
F: 0 = original term
   1 = conjugate/sing/plural
   2 = synonym
   3 = hyponym
tttttttt tttttttt tttttttt tttttttt  t = termid (48bits)
tttttttt tttttttt ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd0Z  d = docId (38 bits)  Z = delbit
When Rdb::m_useHalfKeys is on and the preceding key has the same top 6 bytes (the termid) as the following key, then the following key, called a half key, only requires 6 bytes and therefore has the following bitmap:
ssssssss dddddddd dddddddd dddddddd  d = docId, s = ~score
dddddddd dddddd1Z                    Z = delbit
Every term that Gigablast indexes, be it a word or phrase, is hashed using the hash64() routine in hash.h. This is a very fast and effective hashing function. The resulting hash of the term is called the termid. It is constrained to 48 bits.
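To make the 12-byte bitmap concrete, here is a minimal sketch of packing and unpacking such a key. It is not Gigablast's actual key code: Key96, makeIndexKey() and the accessor names are invented for illustration, and splitting the key into a 32-bit n1 and a 64-bit n0 is an assumption suggested by the startKey.n1/n0 fields in the log output.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: n1 holds the top 32 bits of the key, n0 the low 64 bits.
struct Key96 {
    uint32_t n1;
    uint64_t n0;
};

// Pack a full (non-half) key following the bitmap:
//   48 bits termId | 8 bits ~score | 38 bits docId | half bit (0) | delbit
Key96 makeIndexKey(uint64_t termId, uint8_t score, uint64_t docId, bool delBit) {
    assert(termId < (1ULL << 48));
    assert(docId < (1ULL << 38));
    Key96 k;
    k.n1 = static_cast<uint32_t>(termId >> 16);                        // top 32 bits of termId
    k.n0 = (termId & 0xffffULL) << 48;                                 // low 16 bits of termId
    k.n0 |= static_cast<uint64_t>(static_cast<uint8_t>(~score)) << 40; // complemented score
    k.n0 |= docId << 2;                                                // 38-bit docId
    k.n0 |= delBit ? 1ULL : 0ULL;                                      // bit 1 stays 0: full key
    return k;
}

uint64_t keyTermId(const Key96 &k) {
    return (static_cast<uint64_t>(k.n1) << 16) | (k.n0 >> 48);
}

uint8_t keyScore(const Key96 &k) {
    return static_cast<uint8_t>(~(k.n0 >> 40));   // undo the complement
}

uint64_t keyDocId(const Key96 &k) {
    return (k.n0 >> 2) & ((1ULL << 38) - 1);
}
```

Because the score is stored complemented, a key with a higher score compares numerically lower than a key with the same termId and docId but a lower score, so it sorts earlier in ascending key order.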
tttttttt tttttttt tttttttt tttttttt  t = termId (48bits)
tttttttt tttttttt DDDDDDDD DDDDDDDD  D = ~date
DDDDDDDD DDDDDDDD ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd0Z  d = docId (38 bits)
And, similar to indexdb, datedb also has a half bit for compression down to 10 bytes:
                  DDDDDDDD DDDDDDDD  D = ~date
DDDDDDDD DDDDDDDD ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd1Z  d = docId (38 bits)
Datedb was added along with variable-sized keys (mentioned above). It is basically the same as indexdb, but has a 4-byte date field inserted. IndexTable.cpp was modified slightly to treat dates as scores in order to provide sort by date functionality. By setting the sdate=1 cgi parameter, Gigablast should limit the termlist lookups to Datedb. Using date1=X and date2=Y cgi parameters will tell Gigablast to constrain the termlist by those dates. date1 and date2 are currently seconds since the epoch. Gigablast will search for the date of a document in this order, stopping at the first non-zero date value:
dddddddd dddddddd dddddddd dddddddd  d = docId
dddddddd hhhhhhhh hhhhhhhh hhhhhhhh  h = hash of site name
hhcccccc cccccccc cccccccc cccccccD  c = content hash, D = delbit
The low bits of the top 31 bits of the docId are used to determine which host in the network stores the Titledb record. See Titledb::getGroupId().
00000000 00000000 00000000 pppNtttt  t = time to spider, p = ~ of priority
tttttttt tttttttt tttttttt ttttRRf0  R = retry #, f = forced?
dddddddd dddddddd dddddddd dddddddD  d = top 32 bits of docId, D = delbit
                                     N = 1 iff url not in titledb (isNew)
Each Spiderdb record also records the number of times the url was tried and failed. In the Spider Controls you can specify how many failed tries are allowed before Gigablast gives up and deletes the url from Spiderdb, and possibly from the other databases if it was indexed.
cccccccc hhhhhhhh hhhhhhhh cccccccc  h = host name hash
cccccccc cccccccc cccccccc cddddddd  c = content, collection and host hash
dddddddd dddddddd dddddddd dddddddD  d = docId , D = delbit
dddddddd dddddddd dddddddd dddddddd  d = domain hash (w/ collection)
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu  u = special url hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu
Sitedb maps a url to a site file number (sfn). The site file is now called a ruleset file and has the name sitedbN.xml in the working directory. All rulesets must be archived in the Bitkeeper repository at /gb/conf/. Therefore, all gb clusters share the same ruleset name space. Msg8.cpp and Msg9.cpp are used to respectively get and set sitedb records.
dddddddd dddddddd dddddddd dddddddd  d = domain hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu  u = special url hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu
Data Block:
. number of catids (1 byte)
. list of catids (4 bytes each)
. sitedb file # (3 bytes)
. sitedb version # (1 byte)
. siteUrl (remaining bytes)
Catdb is a special implementation of Sitedb. While it is similar in that a single record is stored per url and the keys are created using the same hashes, the record stores additional category information about the url. This includes how many categories the url is in and which categories it is in (their ids). Like sitedb, Msg8 and Msg9 are used to get and set catdb. Msg2a is used to generate a full catdb using directory information and calling Msg9.
Dns | Net | A DNS client built on top of the UdpServer class. |
DnsProtocol | Net | Uses UdpServer to make a protocol for talking to DNS servers. Used by Dns class. |
HttpMime | Net | Creates and parses an HTTP MIME header. |
HttpRequest | Net | Creates and parses an HTTP request. |
HttpServer | Net | Gigablast's highly efficient web server, contains a TcpServer class. |
Multicast | Net | Used to reroute a request if it fails to be answered in time. Also used to send a request to multiple hosts in the cluster, usually to a group (shard) for data storage purposes. |
TcpServer | Net | A TCP server which contains an array of TcpSockets. |
TcpSockets | Net | A C++ wrapper for a TCP socket. |
UdpServer | Net | A reliable UDP server that uses non-blocking sockets and calls handlers upon receiving a message. The handler called depends on that message's type. The handler is UdpServer::m_handlers[msgType]. |
UdpSlot | Net | Basically a "socket" for the UdpServer. The UdpServer contains an array of a few thousand of these. When none are available to receive a request, the dgram is dropped and will later be resent by the requester in a back-off fashion. |
REtttttt ACNnnnnn nnnnnnnn nnnnnnnn  R = is Reply?, E = hadError? N=nice
iiiiiiii iiiiiiii iiiiiiii iiiiiiii  t = msgType, A = isAck?, n = dgram #
ssssssss ssssssss ssssssss ssssssss  i = transId C = cancelTransAck
dddddddd dddddddd dddddddd dddddddd  s = msgSize (iff !ack) (w/o hdrs!)
dddddddd ........ ........ ........  d = msg content
...
The niceness (N bit) of a datagram can be either 0 or 1. If it is 0 (not nice) then it will take priority over a datagram with a niceness of 1. This just means that we will call the handlers for it first. It's not a very big deal.
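The header bitmap can be decoded with a handful of shifts and masks. The sketch below is an illustration rather than the UdpServer implementation: DgramHeader, parseDgramHeader() and handledBefore() are invented names, and it assumes the fields are laid out most significant bit first in network byte order, exactly as drawn, with the msgSize word absent from acks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Invented struct mirroring the fields named in the bitmap.
struct DgramHeader {
    bool     isReply;         // R
    bool     hadError;        // E
    uint8_t  msgType;         // t, 6 bits
    bool     isAck;           // A
    bool     cancelTransAck;  // C
    uint8_t  niceness;        // N, 0 or 1
    uint32_t dgramNum;        // n, 21 bits
    uint32_t transId;         // i
    uint32_t msgSize;         // s, only read when !isAck
};

// Read a 32-bit big-endian word (assumed byte order).
static uint32_t readBE32(const uint8_t *p) {
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
           (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
}

// Returns false if the buffer is too short for the header it claims to carry.
bool parseDgramHeader(const uint8_t *buf, size_t len, DgramHeader *out) {
    assert(buf && out);
    if (len < 8) return false;
    uint32_t w = readBE32(buf);
    out->isReply        = (w >> 31) & 1;
    out->hadError       = (w >> 30) & 1;
    out->msgType        = (w >> 24) & 0x3f;
    out->isAck          = (w >> 23) & 1;
    out->cancelTransAck = (w >> 22) & 1;
    out->niceness       = (w >> 21) & 1;
    out->dgramNum       = w & 0x1fffff;
    out->transId        = readBE32(buf + 4);
    out->msgSize        = 0;
    if (!out->isAck) {                 // msgSize is present iff this is not an ack
        if (len < 12) return false;
        out->msgSize = readBE32(buf + 8);
    }
    return true;
}

// Niceness 0 (not nice) gets its handler called before niceness 1.
bool handledBefore(const DgramHeader &a, const DgramHeader &b) {
    return a.niceness < b.niceness;
}
```

For example, the first word 0x85200007 decodes as a reply of msgType 5 with niceness 1 and dgram number 7.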
AdultBit.cpp | Build | Used to detect if document content is naughty. |
Bits.cpp | Build | Sets descriptor bits for each word in a Words class. |
Categories.cpp | Build | Stores DMOZ categories in a hierarchy. |
Lang.cpp | Build | Unused. |
Language.cpp | Build | Enumerates the various languages supported by Gigablast's language detector. |
LangList.cpp | Build | Interface to the language-specific dictionaries used for language identification by XmlDoc::getLanguage(). |
Linkdb.cpp | Build | Functions to perform link analysis on a docid/url. Computes a LinkInfo class for the docId. LinkInfo class contains Inlink classes serialized into it for each non-spammy inlink detected. Also contains a Links class that parses out all the outlinks in a document. |
PageAddUrl.cpp | Build | HTML page to add a url or file of urls to spiderdb. |
PageInject.cpp | Build | HTML page to inject a page directly into the index. |
Phrases.cpp | Build | Generates phrases for every word in a Words class. Uses the Bits class. |
Pops.cpp | Build | Computes popularity for each word in a Words class. Uses the dictionary files in the dict subdirectory. |
Pos.cpp | Build | Computes the character position of each word in a Words class. HTML entities count as a single character. So do back-to-back spaces. |
Spam | Build | Computes the probability a word is spam for every word in a Words class. |
Spider.cpp | Build | Has most of the code used by the spidering process. SpiderLoop is a class in there that is the heart of the spider. It is the control loop that launches spiders. |
StopWords.cpp | Build | A table of stop words, used by Bits to see if a word is a stop word. |
Words.cpp | Build | Breaks a document up into "words", where each word is a sequence of alphanumeric characters, a sequence of non-alphanumeric characters, or a single HTML/XML tag. A heart of the build process. |
Xml.cpp | Build | Breaks a document up into XmlNodes where each XmlNode is a tag or a sequence of characters which are not a tag. |
XmlDoc.cpp | Build | The main document parsing class. A huge file, pretty much does all the parsing. |
XmlNode | Build | The Xml class has an array of these. Each is either a tag or a sequence of characters that are between tags (or beginning/end of the document). |
dmozparse | Build | Creates the necessary dmoz files Gigablast needs from those files downloadable from DMOZ. |
EBADTITLEREC | Document's TitleRec is corrupted and can not be read. |
EURLHASNOIP | Url has no ip. |
EDOCCGI | Document has CGI parms in url and allowCgiUrls is specified as false in the ruleset. |
EDOCURLIP | Document is an IP-based url and allowIpUrls is specified as false in the ruleset. |
EDOCBANNED | Document's ruleset has banned set to true. |
EDOCDISALLOWED | Robots.txt forbids this document to be indexed. But, if it has incoming link text, Gigablast will index it anyway, but just index the link text. |
EDOCURLSPAM | The url itself contains naughty words and the do url sporn checking is enabled in the Spider Controls. |
EDOCQUOTABREECH | The quota for this site has been exceeded. Quotas are based on the quality of the url. See the quota section in the Overview file. |
EDOCBADCONTENTTYPE | Content type, as returned in the mime reply and parsed out by HttpMime.cpp, is not supported for indexing. |
EDOCBADHTTPSTATUS | Http status was 404 or some other bad status. |
EDOCNOTMODIFIED | Spider Controls have use IfModifiedSince enabled and document was not modified since the last time we indexed it. |
EDOCREDIRECTSTOSELF | The mime redirects to itself. |
EDOCTOOMANYREDIRECTS | Url had more than 6 redirects. |
EDOCBADREDIRECTURL | The redirect url was empty. |
EDOCSIMPLIFIEDREDIR | The document redirected to a simpler url, which had less path components, did not have cgi, or for whatever reason was prettier to look at. The current url will be discarded and the redirect url will be added to spiderdb. |
EDOCNONCANONICAL | Doc has a canonical link referencing a url other than itself. Used for deduping at spider time. |
EDOCNODOLLAR | Document did not contain a dollar sign followed by a price. Used for building shopping indexes. |
EDOCHASBADRSS | This should not happen. |
EDOCISANCHORRSS | This should not happen. |
EDOCHASRSSFEED | only index documents from rss feeds is true in the Spider Controls, and the document indicates it is part of an RSS feed, and does not currently have an RSS feed linking to it in the index. Gigablast will discard the document, and add the url of the RSS feed to spiderdb. When that is spidered the url should be picked up again. |
EDOCNOTRSS | If the Spider Controls specify only index articles from rss feeds as true and the document is not part of an RSS feed. |
EDOCDUP | According to checksumdb, a document already exists from this hostname with the same checksumdb hash. See Deduping section. |
EDOCTOOOLD | Document's last modified date is before the maxLastModifiedDate specified in the ruleset. |
EDOCLANG | The document does not match the language given in the Spider Controls. |
EDOCADULT | Document was detected as adult and adult documents are forbidden in the Spider Controls. |
EDOCNOINDEX | Document has a noindex meta tag. |
EDOCNOINDEX2 | Document's ruleset (SiteRec) says not to index it using the <indexDoc> tag. Probably used to just harvest links then. |
EDOCBINARY | Document is detected as a binary file. |
EDOCTOONEW | Document is after the minLastModifiedDate specified in the ruleset. |
EDOCTOOBIG | Document size is bigger than maxDocSize specified in the ruleset. |
EDOCTOOSMALL | Document size is smaller than minDocSize specified in the ruleset. |
ETTRYAGAIN | |
ENOMEM | We ran out of memory. |
ENOSLOTS | We ran out of UDP sockets. |
ECANCELLED | An administrator disabled spidering in the Master Controls thereby cancelling all outstanding spiders. |
EBADIP | Unable to get IP address of url. |
EBADENGINEER | |
EIPHAMMER | We would hit the IP address too hard, violating sameIpWait in the Spider Controls if we were to download this document. |
ETIMEDOUT | If we timed out downloading the document. |
EDNSTIMEDOUT | If we timed out looking up the IP of the url. |
EBADREPLY | DNS server sent us a bad reply. |
EDNSDEAD | DNS server was dead |
Ads | Search | Interface to third party ad server. |
Highlight | Search | Highlights query terms in a document or summary. |
IndexList | Search | Derived from RdbList. Used specifically for processing Indexdb RdbLists. |
IndexReadInfo | Search | Tells Gigablast how much of what IndexLists to read from Indexdb to satisfy a query. |
IndexTable | Search | Intersects IndexLists to get the final docIds to satisfy a query. |
Matches | Search | Identifies words in a document or string that match supplied query terms. Used by Highlight. |
Msg17 | Search | Used by Msg40 for distributed caching of search result pages. |
Msg1a | Search | Get the reference pages from a set of search results. |
Msg1b | Search | Get the related pages from a set of search results and reference pages. |
Msg2 | Search | Given a list of termIds, download their respective IndexLists. |
Msg20 | Search | Given a docId and query, return a summary or document excerpt. Used by Msg40. |
Msg33 | Search | Unused. Did raid stuff. |
Msg36 | Search | Gets the length of an IndexList for determining query term weights. |
Msg37 | Search | Calls a Msg36 for each term in the query. |
Msg38 | Search | Returns the Clusterdb record for a docId. May also get the Titledb record if its key is in the RdbMap. |
Msg39 | Search | Intersects IndexLists to get list of docIds satisfying query. Uses Msg38 to cluster away dups and same-site results. Re-intersects lists to get more docIds if too many were removed. Uses Msg2, Msg38, IndexReadInfo, IndexTable. This and IndexTable are the heart of the query resolution process. |
Msg3a | Search | Calls multiple Msg39s to distribute the query based on docId parity. One host computes the even docId search results, the other the odd. And so on for different parity levels. Merges the docIds into a final list. |
Msg40 | Search | Uses Msg20 to get the summaries for the final list of docIds returned from Msg3a. |
Msg40Cache | Search | Used by Msg17 to cache search results pages. Basically, caching serialized Msg40s. |
Msg41 | Search | Queries multiple clusters and merges the results. |
PageDirectory | Search | HTML page to display a DMOZ directory page. |
PageGet | Search | HTML page to display a cached web page from titledb with optional query term highlighting. |
PageResults | Search | HTML/XML page to display the search results. |
PageRoot | Search | HTML page to display the root page. |
Query | Search | Parses a query up into QueryWords which are then parsed into QueryTerms. Makes a boolean truth table for boolean queries. |
SearchInput | Search | Used to parse, contain and manage all parameters passed in for doing a query. |
Speller | Search | Performs spell checking on a query. Returns a single recommended spelling of the query. |
Summary | Search | Generates a summary given a document and a query. |
Title | Search | Generates a title for a document. Usually just the <title> tag. |
TopTree | Search | A balanced binary tree used for getting the top-scoring X search results from intersecting IndexLists in IndexTable, where X is a large number. Normally we just do a linear scan to find the minimum scoring docId and replace him with a higher scoring docid, but when X is large this linear scan process is too slow. |
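The idea behind TopTree, keeping the top-scoring X results in an ordered tree so the current minimum is found without a linear scan, can be sketched as follows. This is not the TopTree class itself: TopResults is an invented name, and std::multiset (typically a red-black tree) stands in for Gigablast's own balanced tree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Keeps at most maxResults (score, docId) pairs, ordered by score.
class TopResults {
public:
    explicit TopResults(size_t maxResults) : m_max(maxResults) {
        assert(maxResults > 0);
    }

    // O(log X) per call: the lowest-scoring entry is always at begin().
    void add(float score, uint64_t docId) {
        if (m_tree.size() < m_max) {
            m_tree.insert(std::make_pair(score, docId));
            return;
        }
        auto lowest = m_tree.begin();
        if (score <= lowest->first) return;   // not good enough to enter the top X
        m_tree.erase(lowest);                 // evict the current minimum
        m_tree.insert(std::make_pair(score, docId));
    }

    // Walk the tree backwards to list docIds from best to worst score.
    std::vector<uint64_t> docIdsBestFirst() const {
        std::vector<uint64_t> out;
        for (auto it = m_tree.rbegin(); it != m_tree.rend(); ++it)
            out.push_back(it->second);
        return out;
    }

private:
    size_t m_max;
    std::multiset<std::pair<float, uint64_t> > m_tree;
};
```

For small X the linear scan described above is simpler and has less overhead; the tree only pays off once X is large.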
AutoBan | Admin | Automatically bans IP addresses that exceed daily and minute query quotas. |
CollectionRec | Admin | Holds all of the parameters for a particular search collection. |
Collectiondb | Admin | Manages all the CollectionRecs. |
Conf | Admin | Holds all of the parameters not collection specific (Collectiondb does that). Like maximum memory for the gb process to use, for instance. Corresponds to gb.conf file. |
Hostdb | Admin | Contains the array of Hosts in the network. Each Host has various stats, like ping time and IP addresses. |
Msg1c | Admin | Perform spam analysis on an IP. |
Msg1d | Admin | Ban documents identified as spam. |
Msg30 | Admin | Unused. |
PageAddColl | Admin | HTML page to add a new collection. |
PageHosts | Admin | HTML page to display all the hosts in the cluster. Shows ping times for each host. |
PageLogin | Admin | HTML page to login as master admin or as a collection's admin. |
PageOverview | Admin | HTML page to present the help section. |
PagePerf | Admin | HTML page to show the performance graph. |
PageSockets | Admin | HTML page for showing existing network connections for both TCP and UDP servers. |
PageStats | Admin | HTML page for showing various server statistics. |
Pages | Admin | Framework for displaying generic HTML pages as described by Parms.cpp. |
Parms | Admin | All of the control parameters for the gb process or for a particular collection are stored in this file. Some controls are assigned to a specific page id so Pages.cpp can generate the HTML page automatically for controlling those parameters. |
PingServer | Admin | Does round-robin pinging of every host in the cluster. Ping times are displayed on PageHosts. |
Stats | Admin | Holds various statistics that PagePerf displays. |
Sync | Admin | Unused. Syncs twin Rdbs together. |
log(LOG_DEBUG,"query: a query debug message #%li.",n);
The first parameter to the log() subroutine is the type of log message. These types are defined in Log.h.
Subtype | Description |
addurls | related to adding urls |
admin | related to administrative things, sync file, collections |
build | related to indexing (high level) |
conf | configuration issues |
disk | disk reads and writes |
dns | dns networking |
http | http networking |
loop | |
net | network layer: multicast pingserver. sits atop udpserver. |
query | related to querying (high level) |
rdb | generic rdb things |
spcache | related to determining what urls to spider next |
speller | query spell checking |
thread | calling threads |
topics | related topics |
udp | udp networking |
uni | unicode parsing |
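The convention of prefixing each message with its subtype and a colon can be handled mechanically. The following is a rough sketch under stated assumptions, not the code in Log.cpp: logSubtype() and formatLogLine() are hypothetical helpers, and the output shape ("TYPE subtype message") is only loosely modeled on sample lines such as "WARN db ...".

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Return the text before the first ':' (e.g. "query"), or "" if there is none.
// Limitation: a message without a subtype prefix that contains a ':' later on
// would be misread.
std::string logSubtype(const std::string &msg) {
    size_t colon = msg.find(':');
    if (colon == std::string::npos) return "";
    return msg.substr(0, colon);
}

// Expand a printf-style message and emit "TYPE subtype message".
std::string formatLogLine(const char *typeName, const char *fmt, ...) {
    assert(typeName && fmt);
    char msg[1024];
    va_list ap;
    va_start(ap, fmt);
    std::vsnprintf(msg, sizeof(msg), fmt, ap);
    va_end(ap);

    std::string text(msg);
    std::string subtype = logSubtype(text);
    if (subtype.empty()) return std::string(typeName) + " " + text;

    std::string body = text.substr(subtype.size() + 1);      // skip "subtype:"
    if (!body.empty() && body[0] == ' ') body.erase(0, 1);   // and one space
    return std::string(typeName) + " " + subtype + " " + body;
}
```

A call like formatLogLine("DEBUG", "query: a query debug message #%li.", 7L) separates the subtype from the message body, which is what makes per-subtype filtering possible.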
/sbin/iptables -t nat -A PREROUTING -p tcp -m tcp --dport 80 -j DNAT --to-destination 64.62.168.XX:8000
This command should be executed at startup to perform the mapping.
gf0:/a# echo 1 > /proc/sys/net/ipv4/ip_forward
gf0:/a# /sbin/iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to 64.62.168.52:8000
gf0:/a# /sbin/iptables -t nat -A POSTROUTING -d 64.62.168.52 -p tcp --dport 8000 -j SNAT --to 64.62.168.3
BigFile | Core | A virtual file class that allows virtual files bigger than 2GB by using smaller 512MB files. |
Dir | Core | Used to read the files in a directory. |
DiskPageCache | Core | Used by BigFile to read and write from/to a page cache. |
Domains | Core | Used to extract the Top Level Domain (TLD) from a url. |
Entities | Core | List of all the various HTML entities. |
Errno | Core | List of all the error codes and their associated error messages. Used by mstrerror(). |
File | Core | A basic file class that recycles file descriptors to get around the 1024 limit. |
HashTable | Core | A basic hashtable that grows automatically. |
HashTableT | Core | A templatized version of HashTable.cpp. |
Log | Core | Used to log messages. |
Loop | Core | Used to control the flow of execution. Reacts to signals. |
Mem | Core | A malloc and new wrapper that helps isolate memory leaks, prevent double frees and identify overflows and underflows as well as track and limit memory consumption. |
Mime | Core | Unused. |
Strings | Core | Unused. |
Threads | Core | Gigablast's own threads class which uses the Linux clone() call to do its own LWP threads. |
Unicode | Core | Unicode support. |
UnicodeProperties | Core | Unicode support. |
Url | Core | For breaking a url up into its various components. |
Vector | Core | Unused. |
blaster | Core | Download the urls listed in a file in parallel. Very useful for testing query performance. |
create_ucd_tables | Core | Unicode support. |
dnstest | Core | Test dns server by sending it a bunch of lookup requests. |
fctypes | Core | Various little functions. |
gbfilter | Core | Called via a system call by Gigabot to convert pdfs, Microsoft Word documents, PowerPoint documents, etc. to HTML. |
hash | Core | Contains a bunch of fast and convenient hashing functions. |
iana_charset | Core | Unicode support. Autogenerated. |
ip | Core | Routines for manipulating IP addresses. |
main | Core | The main.cpp file. |
monitor | Core | Monitors an external gb process and sends an email alert on 3 failed queries in a row. |
thunder | Core | Tests UDP throughput between two machines. |
types | Core | Defines the well used key_t type, a 12-byte key, and some functions for 16-byte keys. |
uniq2 | Core | Like the 'uniq' command but also counts the occurrences and prints those out. |
urlinfo | Core | Displays information for a url given through stdin. |
File (.cpp or .h) | Layer | Description |
Ads | Search | Interface to third party ad server. |
AdultBit | Build | Used to detect if document content is naughty. |
AutoBan | Admin | Automatically bans IP addresses that exceed daily and minute query quotas. |
BigFile | Core | A virtual file class that allows virtual files bigger than 2GB by using smaller 512MB files. |
Bits | Build | Sets descriptor bits for each word in a Words class. |
Categories | Build | Stores DMOZ categories in a hierarchy. |
Checksumdb | DB | Rdb that maps a docId to a checksum for an indexed document. Used to dedup same content from the same hostname at build time. |
Clusterdb | DB | Rdb that maps a docId to the hash of a site and its family filter bit and, optionally, a sample vector used for deduping search results. Used for site clustering, family filtering and deduping at query time. |
CollectionRec | Admin | Holds all of the parameters for a particular search collection. |
Collectiondb | Admin | Manages all the CollectionRecs. |
Conf | Admin | Holds all of the parameters that are not collection-specific (Collectiondb handles those), such as the maximum memory the gb process may use. Corresponds to the gb.conf file. |
DateParse | Build | Extracts the publish date from a document. |
Datedb | DB | Like indexdb, but its scores are 4-byte dates. |
Dir | Core | Used to read the files in a directory. |
DiskPageCache | Core | Used by BigFile to read and write from/to a page cache. |
Dns | Net | A DNS client built on top of the UdpServer class. |
DnsProtocol | Net | Uses UdpServer to make a protocol for talking to DNS servers. Used by Dns class. |
Doc | Build | A container class to store the titleRec and siteRec and other classes associated with a document. |
Domains | Core | Used to extract the Top Level Domain (TLD) from a url. |
Entities | Core | List of all the various HTML entities. |
Errno | Core | List of all the error codes and their associated error messages. Used by mstrerror(). |
File | Core | A basic file class that recycles file descriptors to get around the 1024 limit. |
HashTable | Core | A basic hashtable that grows automatically. |
HashTableT | Core | A templatized version of HashTable.cpp. |
Highlight | Search | Highlights query terms in a document or summary. |
Hostdb | Admin | Contains the array of Hosts in the network. Each Host has various stats, like ping time and IP addresses. |
HttpMime | Net | Creates and parses an HTTP MIME header. |
HttpRequest | Net | Creates and parses an HTTP request. |
HttpServer | Net | Gigablast's highly efficient web server, contains a TcpServer class. |
IndexList | Search | Derived from RdbList. Used specifically for processing Indexdb RdbLists. |
IndexReadInfo | Search | Tells Gigablast how much of what IndexLists to read from Indexdb to satisfy a query. |
IndexTable | Search | Intersects IndexLists to get the final docIds to satisfy a query. |
Indexdb | DB | Rdb that maps a termId to a score and docId pair. The search index is stored in Indexdb. |
Lang | Build | Unused. |
Language.h | Build | Enumerates the various languages supported by Gigablast's language detector. |
LangList | Build | Interface to the language-specific dictionaries used for language identification by XmlDoc::getLanguage(). |
LinkInfo | Build | Used by the link analysis routine. Contains an array of LinkTexts. |
LinkText | Build | Contains the link information for a specific document that links to a specific url, such as quality, IP address, number of total outgoing links, and the link text for that url. |
Links | Build | Parses out all the outgoing links in a document. |
Log | Core | Used to log messages. |
Loop | Core | Used to control the flow of execution. Reacts to signals. |
Matches | Search | Identifies words in a document or string that match supplied query terms. Used by Highlight. |
Mem | Core | A malloc and new wrapper that helps isolate memory leaks, prevent double frees and identify overflows and underflows as well as track and limit memory consumption. |
MemPool | DB | Used by RdbTree to add new records to tree without having to do an individual malloc. |
MemPoolTree | DB | Unused. Was our own malloc routine. |
Mime | Core | Unused. |
Msg0 | DB | Fetches an RdbList from across the network. |
Msg1 | DB | Adds all the records in an RdbList to various hosts in the network. |
Msg10 | Build | Adds a list of urls to spiderdb for spidering. |
Msg13 | Build | Tells a server to download robots.txt (or get from cache) and report if Gigabot has permission to download it. |
Msg14 | Build | The core class for indexing a document. |
Msg15 | Build | Called by Msg14 to set the Doc class from the previously indexed TitleRec. |
Msg16 | Build | Called by Msg14 to download the document and create a new titleRec to set the Doc class with. |
Msg17 | Search | Used by Msg40 for distributed caching of search result pages. |
Msg18 | Build | Unused. Was used for supporting soft banning. |
Msg19 | Build | Determine if a document is a duplicate of a document already indexed from that same hostname. |
Msg1a | Search | Get the reference pages from a set of search results. |
Msg1b | Search | Get the related pages from a set of search results and reference pages. |
Msg1c | Admin | Perform spam analysis on an IP. |
Msg1d | Admin | Ban documents identified as spam. |
Msg2 | Search | Given a list of termIds, download their respective IndexLists. |
Msg20 | Search | Given a docId and query, return a summary or document excerpt. Used by Msg40. |
Msg22 | Search/Build | Return the TitleRec for a docId or url. |
Msg23 | Build | Get the link text in a document that links to a specified url. Also returns other info besides that link text. |
Msg24 | Search | Get the gigabits (aka related topics) for a query. |
Msg25 | Build | Set the LinkInfo class for a document. |
Msg28 | Admin | Set a particular parm or set of parms on all hosts in the cluster. |
Msg2a | Admin | Makes catdb, for assigning documents to a category in DMOZ. |
Msg2b | Admin | Makes catdb, for assigning documents to a category in DMOZ. |
Msg3 | DB | Reads an RdbList from several consecutive files in a particular Rdb. |
Msg37 | Search | Calls a Msg36 for each term in the query. |
Msg38 | Search | Returns the Clusterdb record for a docId. May also get the Titledb record if its key is in the RdbMap. |
Msg39 | Search | Intersects IndexLists to get list of docIds satisfying query. Uses Msg38 to cluster away dups and same-site results. Re-intersects lists to get more docIds if too many were removed. Uses Msg2, Msg38, IndexReadInfo, IndexTable. This and IndexTable are the heart of the query resolution process. |
Msg3a | Search | Calls multiple Msg39s to distribute the query based on docId parity. One host computes the even docId search results, the other the odd. And so on for different parity levels. Merges the docIds into a final list. |
Msg40 | Search | Uses Msg3a to get final docIds in search results. Uses Msg20 to get the summaries for these docIds. |
Msg40Cache | Search | Used by Msg17 to cache search results pages. Basically, caching serialized Msg40s. |
Msg41 | Search | Sends Msg40s to multiple clusters and merges the results. |
Msg5 | DB | Uses Msg3 to read RdbLists from multiple files and then merges those lists into a single RdbList. Does corruption detection and repair. Integrates the list from RdbTree into the single RdbList. |
Msg7 | Build | Injects a url into the index using Msg14. |
Msg8 | Build | Gets the Sitedb record given a url. |
Msg9 | Build | Adds a Sitedb record to Sitedb for a given site/url. |
MsgB | DB | Unused. A distributed cache for caching anything. |
Multicast | Net | Used to reroute a request if it fails to be answered in time. Also used to send a request to multiple hosts in the cluster, usually to a group (shard) for data storage purposes. |
PageAddColl | Admin | HTML page to add a new collection. |
PageAddUrl | Build | HTML page to add a url or file of urls to spiderdb. |
PageCatdb | Admin/Build | HTML page to lookup the categories of a url in catdb. |
PageDirectory | Search | HTML page to display a DMOZ directory page. |
PageGet | Search | HTML page to display a cached web page from titledb with optional query term highlighting. |
PageHosts | Admin | HTML page to display all the hosts in the cluster. Shows ping times for each host. |
PageIndexdb | Admin/Search | HTML page to display an IndexList for a given query term or termId. Can also add or delete individual Indexdb records. |
PageInject | Build | HTML page to inject a page directly into the index. |
PageLogin | Admin | HTML page to login as master admin or as a collection's admin. |
PageOverview | Admin | HTML page to present the help section. |
PageParser | Admin/Build | HTML page to show how a document is analyzed, parsed and its terms are scored and indexed. |
PagePerf | Admin | HTML page to show the performance graph. |
PageReindex | Admin/Build | HTML page to reindex or delete the search results for a single term query. |
PageResults | Search | HTML/XML page to display the search results. |
PageRoot | Search | HTML page to display the root page. |
PageSitedb | Admin/Build | HTML page to allow urls or sites to be entered into Sitedb, for assigning spidered urls to a ruleset. |
PageSockets | Admin | HTML page for showing existing network connections for both TCP and UDP servers. |
PageSpamr | Admin/Build | HTML page for removing spam from the index. |
PageSpiderdb | Admin/Build | HTML page for showing status of spiders and what is in spiderdb. |
PageStats | Admin | HTML page for showing various server statistics. |
PageTitledb | Admin/Build | HTML page for showing a Titledb record for a given docId. |
Pages | Admin | Framework for displaying generic HTML pages as described by Parms.cpp. |
Parms | Admin | All of the control parameters for the gb process or for a particular collection are stored in this file. Some controls are assigned to a specific page id so Pages.cpp can generate the HTML page automatically for controlling those parameters. |
Phrases | Build | Generates phrases for every word in a Words class. Uses the Bits class. |
PingServer | Admin | Does round-robin pinging of every host in the cluster. Ping times are displayed on PageHosts. |
Pops | Build | Computes popularity for each word in a Words class. Uses the dictionary files in the dict subdirectory. |
Pos | Build | Computes the character position of each word in a Words class. HTML entities count as a single character. So do back-to-back spaces. |
Query | Search | Parses a query up into QueryWords which are then parsed into QueryTerms. Makes a boolean truth table for boolean queries. |
Rdb | DB | The core database class from which all are derived. |
RdbBase | DB | Each Rdb has an array of RdbBases, one for each collection. Each RdbBase has an array of BigFiles for that collection. |
RdbCache | DB | Can cache RdbLists or individual Rdb records. |
RdbDump | DB | Dumps the RdbTree to an Rdb file. Also is used by RdbMerge to dump the merged RdbList to a file. |
RdbList | DB | A list of Rdb records. |
RdbMap | DB | Maps an Rdb key to an offset into an RdbFile. |
RdbMem | DB | Memory manager for RdbTree so it does not have to allocate space for every record in the tree. |
RdbMerge | DB | Merges multiple Rdb files into one Rdb file. Uses Msg5 and RdbDump to do reading and writing respectively. |
RdbScan | DB | Reads an RdbList from an RdbFile, used by Msg3. |
RdbTree | DB | A binary tree of Rdb records. All collections share a single RdbTree, so the collection number is specified for each node in the tree. |
Robotdb | Build | Caches and parses robots.txt files. Used by Msg13. |
SafeBuf | General | Used to print messages safely into a buffer without worrying about overflow. Will automatically reallocate the buffer if necessary. |
Scores | Build | Computes the score of each word in a Words class. Used to weight the final score of a term being indexed in TermTable::hash(). |
SearchInput | Search | Used to parse, contain and manage all parameters passed in for doing a query. |
SiteRec | DB | A record in Sitedb. |
Sitedb | DB | An Rdb that maps a url to a Sitedb record which contains a ruleset to be used to parse and index that url. |
Spam | Build | Computes the probability a word is spam for every word in a Words class. |
SpamContainer | Build | Used to remove spam from the index using Msg1c and Msg1d. |
Speller | Search | Performs spell checking on a query. Returns a single recommended spelling of the query. |
SpiderCache | Build | Spiderdb records are preloaded in this cache so SpiderLoop::spiderUrl() can get urls to spider as fast as possible. |
SpiderLoop | Build | The heart of the spider process. Continually gets urls from the SpiderCache and calls Msg14::spiderUrl() on them. |
SpiderRec | DB | A record in spiderdb. |
Spiderdb | DB | An Rdb whose records are urls sorted by times they should be spidered. The key contains other information like if the url is old or new to the index, and the priority of the url, currently from 0 to 7. |
Stats | Admin | Holds various statistics that PagePerf displays. |
Stemmer | Build | Unused. Given a word, computes its stem. |
StopWords | Build | A table of stop words, used by Bits to see if a word is a stop word. |
Strings | Core | Unused. |
Summary | Search | Generates a summary given a document and a query. |
Sync | Admin | Unused. Syncs the Rdbs of twin hosts together. |
TcpServer | Net | A TCP server which contains an array of TcpSockets. |
TcpSockets | Net | A C++ wrapper for a TCP socket. |
TermTable | Build | A hash table of terms from a document. Consists of termIds and scores. Used to accumulate scores. TermTable::hash() is arguably a heart of the build process. |
Threads | Core | Gigablast's own threads class which uses the Linux clone() call to do its own LWP threads. |
Title | Search | Generates a title for a document. Usually just the <title> tag. |
TitleRec | DB | A record in Titledb. |
Titledb | DB | An Rdb where the records are basically compressed web pages, along with other info like the quality of the page. Contains an instance of the LinkInfo class. |
TopTree | Search | A balanced binary tree used for getting the top-scoring X search results from intersecting IndexLists in IndexTable, where X is a large number. Normally we just do a linear scan to find the minimum-scoring docId and replace it with a higher-scoring docId, but when X is large this linear scan is too slow. |
UCNormalizer | General | Unicode support. |
UCPropTable | General | Unicode support. |
UCWordIterator | General | For iterating over Unicode characters. |
UdpServer | Net | A reliable UDP server that uses non-blocking sockets and calls handlers upon receiving a message. The handler called depends on that message's type. The handler is UdpServer::m_handlers[msgType]. |
UdpSlot | Net | Basically a "socket" for the UdpServer. The UdpServer contains an array of a few thousand of these. When none are available to receive a request, the dgram is dropped and will later be resent by the requester in a back-off fashion. |
Unicode | Core | Unicode support. |
UnicodeProperties | Core | Unicode support. |
Url | Core | For breaking a url up into its various components. |
Url2 | Build | For hashing/indexing a url. |
Vector | Core | Unused. |
Words | Build | Breaks a document up into "words", where each word is a sequence of alphanumeric characters, a sequence of non-alphanumeric characters, or a single HTML/XML tag. A heart of the build process. |
Xml | Build | Breaks a document up into XmlNodes where each XmlNode is a tag or a sequence of characters which are not a tag. |
XmlDoc | Build | XmlDoc::hash() hashes a TitleRec (and SiteRec which indicates the ruleset to use) into a TermTable. Another heart of the build process. |
XmlNode | Build | The Xml class has an array of these. Each is either a tag or a sequence of characters that are between tags (or beginning/end of the document). |
blaster | Core | Download the urls listed in a file in parallel. Very useful for testing query performance. |
create_ucd_tables | Core | Auto generated. Used by Unicode stuff. |
dmozparse | Build | Creates the necessary dmoz files Gigablast needs from those files downloadable from DMOZ. |
dnstest | Core | Test dns server by sending it a bunch of lookup requests. |
fctypes | Core | Various little functions. |
gbfilter | Core | Called via a system call by Gigabot to convert pdfs, Microsoft Word documents, PowerPoint documents, etc. to HTML. |
hash | Core | Contains a bunch of fast and convenient hashing functions. |
iana_charset | Core | Unicode support. |
ip | Core | Routines for manipulating IP addresses. |
main | Core | The main.cpp file. |
monitor | Core | Monitors an external gb process and sends an email alert on 3 failed queries in a row. |
thunder | Core | Tests UDP throughput between two machines. |
types | Core | Defines the well used key_t type, a 12-byte key, and some functions for 16-byte keys. |
uniq2 | Core | Like the 'uniq' command but also counts the occurrences and prints those out. |
urlinfo | Core | Displays information for a url given through stdin. |
Spam Scores | Points | Occurrences | Total |
url has - or _ or a digit in the domain | 20 | 0 | 0 |
tld is info or biz | 20 | 0 | 0 |
tld is gov, edu, or mil | -20 | 0 | 0 |
title has spammy words | 20 | 0 | 0 |
page has img src to other domains | 5 | 1 | 5 |
page contains spammy words | 5 | 3 | 15 |
consecutive link text has the same word | 10 | 0 | 0 |
links to amazon, allposters, or zappos | 10 | 0 | 0 |
has 'affiliate' in the links | 40 | 0 | 0 |
has an iframe to amazon | 30 | 0 | 0 |
links to urls > 128 chars long | 5 | 0 | 0 |
links have ?q= or &q= | 5 | 0 | 0 |
page has google ads | 15 | 0 | 0 |
Raw Total | | | 20 |
Force Multiplier | | | 0.020000 |
Final Score | | | 0 |