{"id":41,"date":"2010-01-08T12:57:29","date_gmt":"2010-01-08T19:57:29","guid":{"rendered":"http:\/\/blogs.oregonstate.edu\/osucws\/?p=41"},"modified":"2013-05-11T23:17:45","modified_gmt":"2013-05-12T06:17:45","slug":"we-are-searching-except","status":"publish","type":"post","link":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/osu-search\/we-are-searching-except\/","title":{"rendered":"We are searching, except&#8230;"},"content":{"rendered":"<p>So, we&#8217;ve been talking about search, and people no doubt wonder whether their site will be found with OSU&#8217;s Google Search.<\/p>\n<p>In most cases the answer is yes, but in some cases the answer is no.\u00a0 There are reasons behind each no, and those reasons are what I want to talk about in this post.<\/p>\n<p><strong>The Exceptions<\/strong><\/p>\n<p>Why are there exceptions?\u00a0 There are a few reasons.\u00a0 First is the license limit on our Appliance, which is currently one million documents.\u00a0 Our current document count is 549,998, and we are still indexing sites as we are made aware of them.\u00a0 So if a site has a large number of documents, for example a dictionary site where each entry gets its own page and each page counts as a document, it will eat up the million-document limit fairly quickly.\u00a0 Related to this, some exceptions are caused by the applications that users or departments run.\u00a0 For example, a Joomla CMS currently results in a large number of documents returned because of the way the application works. 
\u00a0 Second, some sites are not maintained and get hacked or spammed; we don&#8217;t want to index sites with spam inserted into them, because that spam is likely to show up in the search descriptions.\u00a0 Third, if crawling a site results in an endless loop, where documents in the site keep referring back to the site itself so that the crawler gets stuck, we don&#8217;t crawl it.\u00a0 Fourth, if a site returns a large number of errors, then something is wrong with the site, and crawling it consumes Appliance resources such as CPU and memory.\u00a0 Fifth, and encompassing all of the above, is administration overhead.\u00a0 Given all the other functions CWS supports, if a particular aspect of search would result in significant administration overhead, we need to make the decision that best minimizes that overhead.<\/p>\n<p>So what are our current exceptions?<\/p>\n<p>1.\u00a0 ONID home directories are not searched.\u00a0 Why?\u00a0 Mostly because some users do not maintain their sites and those sites produce spam entries; across twenty thousand or more sites, that is too much overhead to manage.\u00a0 A policy decision was made for this.<br \/>\n2.\u00a0 http:\/\/ecampus.oregonstate.edu\/ask-ecampus\/knowledge-base\/\u00a0 Why?\u00a0 This site returned over 250 thousand documents.<br \/>\n3.\u00a0 http:\/\/www.cof.orst.edu\/org\/iawa\/\u00a0 Why?\u00a0 This site returned over 160 thousand documents.<br \/>\n4.\u00a0 http:\/\/oregonstate.edu\/tac\/index.php?option=\u00a0 Why?\u00a0 This site returned over 600 thousand documents (due to the way the application handles pages).<br \/>\n5.\u00a0 Group sites at http:\/\/oregonstate.edu\/groups\/ Why?\u00a0 This is for the same reason as #1.\u00a0 As part of the move to people.oregonstate.edu for group sites, we will be reevaluating this.<br \/>\n6.\u00a0 http:\/\/oregonstate.edu\/webprojects\/wiki Why?\u00a0 This site has 2 million errors.<br \/>\n7.\u00a0 http:\/\/www.familybusinessonline.org\/index.php? 
Why?\u00a0 This site returned over 400 thousand documents (due to the way the application handles pages).<br \/>\n8.\u00a0 http:\/\/oregonstate.edu\/cla\/anthropology\/gallery\/kingston\/main.php? Why?\u00a0 This site was caught in a loop.<br \/>\n9.\u00a0 http:\/\/bioe.oregonstate.edu\/reservations\/ Why?\u00a0 This site was caught in a loop.<br \/>\n10.\u00a0 http:\/\/oregonstate.edu\/aepcore\/index? Why?\u00a0 This site was caught in a loop.<br \/>\n11.\u00a0 http:\/\/hort.oregonstate.edu\/event\/\u00a0 Why?\u00a0 This site was caught in a loop.<br \/>\n12.\u00a0 http:\/\/recycle.oregonstate.edu\/EarthDay\/eventCalendar.cfm?\u00a0 Why?\u00a0 This site was caught in a loop.<br \/>\n13.\u00a0 http:\/\/extension.oregonstate.edu\/clackamas\/announcement\/\u00a0 Why?\u00a0 This site was caught in a loop.<br \/>\n14.\u00a0 http:\/\/physics.oregonstate.edu\/event\/\u00a0 Why?\u00a0 The events list returned an excessive number of results.<br \/>\n15.\u00a0 regexp:http:\/\/www\\\\.osualum\\\\.com\/?.*cid=[0-9]+.*?\u00a0 This is a regular expression; any URL that matches this form is not crawled.\u00a0 Why?\u00a0 This site was caught in a loop.<br \/>\n16.\u00a0 http:\/\/oregonstate.edu\/sli\/aggregator\/announcement\/\u00a0 Why?\u00a0 This site was caught in a loop.<br \/>\n17.\u00a0 regexp:http:\/\/oregonstate\\\\.edu\/womenscenter\/library.*browse=*\u00a0 This is a regular expression; any URL that contains the parts specified is not crawled.\u00a0 Why?\u00a0 This site was caught in a loop.<\/p>\n<p>If your site is on this list and you want to discuss it, <a title=\"Contact CWS\" href=\"http:\/\/oregonstate.edu\/cws\/contact\">contact us<\/a>.\u00a0 We do want to reevaluate sites periodically.<\/p>\n<p>We also do not index every type of file extension.\u00a0 Image, media, archive, and binary files are not crawled.\u00a0 There would simply be too many of them, and they would exceed our license limit.<\/p>\n<p>So those are 
the exceptions and the reasons why.\u00a0 We don&#8217;t necessarily expect everyone to be happy with or agree with the exceptions made; however, we have to make the best decisions to support OSU as a whole while keeping in mind the limitations of our search engine.\u00a0 That said, we do want to periodically review our decisions and determine whether alternative solutions can be implemented.\u00a0 So if you have a concern, please <a title=\"Contact CWS\" href=\"http:\/\/oregonstate.edu\/cws\/contact\">contact us<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>So, we&#8217;ve been talking about search, and people no doubt wonder whether their site will be found with OSU&#8217;s Google Search. In most cases the answer is yes, but in some cases the answer is no.\u00a0 There are reasons behind each no, and those reasons are what I want to talk about in this post. The Exceptions&hellip; <a href=\"https:\/\/dev.blogs.oregonstate.edu\/osucws\/osu-search\/we-are-searching-except\/\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":5047,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1508],"tags":[],"class_list":["post-41","post","type-post","status-publish","format-standard","hentry","category-osu-search"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p3sAOi-F","_links":{"self":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/posts\/41","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/users\/5047"}],"replies":[{"embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/comments?post=41"}],"version-history":[{"count":17,"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/posts\/41\/revisions"}],"predecessor-version":[{"id":802,"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/posts\/41\/revisions\/802"}],"wp:attachment":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/media?parent=41"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\
/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/categories?post=41"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/osucws\/wp-json\/wp\/v2\/tags?post=41"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}