Commit Graph

431 Commits (66145a0410fd371f5726d4f8a35bb124624d8940)

Author SHA1 Message Date
Michael Peter Christen 4d3cc02168 replaced old bzip2 library against better documented commons-compress
13 years ago
Michael Peter Christen c15fcde1c8 add-on to latest commit
13 years ago
Michael Peter Christen 81737dcb18 removed stack trace from swf parser since we cant do anything there
13 years ago
Michael Peter Christen acf8d521a2 fix for http://bugs.yacy.net/view.php?id=126
13 years ago
Michael Peter Christen 89142d1e8d removed (not all) warnings
13 years ago
Roland 'Quix0r' Haeder a093ccf5eb Now used synchronization in all close() methods to make sure all objects
13 years ago
Michael Peter Christen ba6aaabc51 refactoring + parser bugfixes
13 years ago
Michael Peter Christen 09484955dc added new entry class for embed tags
13 years ago
Michael Peter Christen 453010bd68 - solved problems with backpath normalization
13 years ago
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago
Michael Peter Christen f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not
13 years ago
Michael Peter Christen a1a5b015d8 refactoring: moved document Classification to cora package
13 years ago
Michael Peter Christen 4d5da75814 fix for parser problem if a <a>-tag is 'within' html tags with unclosed
13 years ago
Michael Peter Christen 046f3a7e8d check if httpc has decompressed the release file and rename the file
13 years ago
Michael Peter Christen e101c2e0e2 added changes from copperdust (submitted by email):
13 years ago
Michael Peter Christen 8d63a5887c bugfixes
13 years ago
Michael Peter Christen 9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
13 years ago
Michael Peter Christen 7e4e3fe5b6 free some memory after parsing html
13 years ago
Michael Peter Christen 4540174fe0 memory hacks
13 years ago
Michael Peter Christen 2e5cd6a1b2 fixed parser extension deny list generation and usage
13 years ago
Michael Peter Christen 8bee1472c9 there is no noindex, only nofollow in links
13 years ago
Michael Peter Christen c560a582ac fix for single-word vocabulary lines
13 years ago
Michael Peter Christen ef78f22ee1 performance hack
13 years ago
Michael Peter Christen 1f4f60654a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
reger 32104360ce PDFParser - return at least first 3 pages of PDF
13 years ago
Michael Peter Christen eadb58dd87 small enhancements in pdf parser
13 years ago
reger b616de5973 PDFParser - return at least first 3 pages of PDF
13 years ago
Michael Peter Christen 7f9b6b7a0c added switches to ConfigParser to accept/deny documents by their
13 years ago
Michael Peter Christen 4901cee3cc suppress auto-tagged subject entries when sending out or receiving
13 years ago
Michael Peter Christen 83009d86f7 added the vocabulary navigator. It can be very simply tested by
13 years ago
Michael Peter Christen a58dc4a91f added autotagging to document condenser:
13 years ago
Michael Peter Christen 254adea51c small fixes
13 years ago
Michael Peter Christen b7bb84c0bb set a limit to CharBuffer object size to fight against bad/too large
13 years ago
Michael Christen e6d51363ee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
Marek Otahal 72adbeae90 !Important: move from Hashtable to HashMap
13 years ago
Michael Christen fa8da7f89d vocabularies are now also used as source for a did-you-mean computation
13 years ago
Michael Christen eaec14ecc4 Dictionaries from words caches can now be used as autotagging vocabulary
13 years ago
Michael Peter Christen 91940fdf56 redesign of WordCache to be prepared to hold multiple
13 years ago
Michael Christen bd40a10230 added autotaggig stub .. only reading and parsing of vocabularies at
13 years ago
Michael Christen c04bfaa51b refactoring
13 years ago
Michael Christen 1f4afb4dc0 performance hacks
13 years ago
Michael Christen 762e0ecfb6 fixed localization dictionaries, see
13 years ago
Michael Christen 9cd469e6d6 added pull request from als plus an NPE fix
13 years ago
Al Sutton 39898cb94a Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer
13 years ago
Al Sutton 4c67a964a1 Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer
13 years ago
Al Sutton 3f9b9f953f Added close() to ensure buffer close actions are invoked
13 years ago
Al Sutton d73c84f9a0 Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown
13 years ago
Al Sutton f02ea27b31 Added missing closure of ByteArrayInputSteam
13 years ago
Al Sutton 8993cac4d8 Initial performance improvements
13 years ago
orbiter ebd840ebf6 - enhanced description on search front page
13 years ago
orbiter e22f8497c9 - tested the ARC methods
13 years ago
orbiter 5a55397f99 some last-minute performance hacks
13 years ago
apfelmaennchen 564374d1fe - included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand.
13 years ago
orbiter 804e48888b smaller bug fixes for search behavior; should produce less unnecessary removals and an exact number of results as shown in counter
13 years ago
orbiter 85d6bf4ac4 fixed urls to media content during indexing
13 years ago
orbiter 0d858d48ec replaced String with StringBuilder in suggestion process
13 years ago
orbiter d2ea250d99 refactoring:
13 years ago
low012 277b454a62 *) added comments
13 years ago
orbiter 6b22865dbc - removed some warinings
13 years ago
orbiter 8a428d3e77 ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation
13 years ago
orbiter 85a5487d6d YaCy can now use the solr index to compute text snippets. This makes search result preparation MUCH faster because no document fetching and parsing is necessary any more.
13 years ago
orbiter 0819e1d397 protection against OOM cases in image parser. See also bugs.yacy.net/view.php?id=54
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
13 years ago
orbiter 610b01e1c3 - added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index.
13 years ago
orbiter b5252ef91f added new word recommendation library in DictionaryLoader_p.html
13 years ago
orbiter 1c007188ad bugfixes in html parser
13 years ago
orbiter 231074bf0a fixed a parsing bug by reverting SVN 7766
13 years ago
low012 24e76a7b69 *) Replaced occurrences of "Wikimedia" with "MediaWiki" where applicable. (Thanks to the folks of 0x20.be for pointing this out.)
13 years ago
orbiter 5dd2efc9a2 - bugfixes in html parser
13 years ago
orbiter 51cf697acd refactoring: moved all score-related classes to new ranking package
13 years ago
sixcooler eb14111200 encapsulate potential expensive objects in TextSnippet to allow GC them asap
13 years ago
sixcooler a311596881 finishing up my commits (7855-7858) which could be helpful for
13 years ago
sixcooler 9170a434ed throwing an exception again in FileUtils.copy(reader, writer)
13 years ago
sixcooler ce248cc8dd less byte-arrays of response-content, less byte-array <-> stream conversation
13 years ago
sixcooler 59b767eebd stop loading via http at defined maximum of bytes - even size is unknown before loading
13 years ago
orbiter 299af4943c added another memory protection hack
14 years ago
orbiter b06faab9d3 do not allocate a StringBuilder object in case that there is not enough memory for that
14 years ago
orbiter 2d4bb139d3 - added counting of links with noindex tag for solr index
14 years ago
orbiter bda3eec0ff added parsing of canonical link element to html parser
14 years ago
orbiter 9706fc55aa enhanced content scraper (should discover urls much faster in case of very large plain texts)
14 years ago
orbiter f667b9c289 enhanced identificator: using AtomicInteger for counter
14 years ago
orbiter 115abc8917 - more attributes for search progress bar
14 years ago
orbiter 77fe69395d added jempbox-1.5.0.jar which is required by pdfbox-1.5 as stated in http://pdfbox.apache.org/dependencies.html
14 years ago
orbiter 0c1b29f3c9 - applied many small performance hacks
14 years ago
orbiter 4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
14 years ago
orbiter e28bd0d038 fix for some possible causes of memory leaks
14 years ago
orbiter 10e2f588f8 - enhanced ybr ranking computation
14 years ago
orbiter 3ed4a09368 small features, some bug fixes and performance hacks
14 years ago
orbiter 205cc75157 abstraction of surrogate main element (xmlns:geo was missing for wiki extracts)
14 years ago
orbiter 021840e5ba removed (almost) deadlocks and unnecessary CPU load
14 years ago
orbiter 9248a4eef4 reduce teh effect of 'Bildersuche findet generierte HTML-Seiten als Bilder'
14 years ago
orbiter 76f2817e00 a fix for the snippet computation and hopefully better snippets
14 years ago
orbiter deda54d684 - relaxed matching of string-search (this is now case-insensitive)
14 years ago
orbiter 15e3a57b4e removed unused functions in condenser
14 years ago
orbiter e3d19d0a90 fix in Document inboundlinks/outboundlinks sorting
14 years ago
orbiter 4e8fa03514 added more attributes to html evaluation
14 years ago
orbiter 528da7c9ea removed unused class and added license header for new class
14 years ago
orbiter f6077b3cc0 added more attributes for html parser and enhanced data structures
14 years ago
orbiter d8e934c085 better abstraction of http client identification
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
14 years ago
orbiter 958ff4778e enhanced location search:
14 years ago
orbiter c17d102bd8 enhanced speed for OrderedScoreMap inc method and size comparisment in concurrent environments
14 years ago
orbiter b788182954 some enhancements to scoring speed
14 years ago
orbiter 01690eab86 fix for mediawiki importer and wikicode parser
14 years ago
orbiter 4c013d9088 more UTF8 getBytes() performance hacks
14 years ago
orbiter 564184909a enhanced the surrogate parser: better reading of UTF-8 characters
14 years ago
orbiter 156cf02703 - added an index constraint 'has location' to the condenser
14 years ago
orbiter 0430a94eaa the location search shows now not re-evaluated locations but only such locations that are attached as metadata to web pages
14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser
14 years ago
orbiter f3baaca920 - enhancements to DNS IP caching and crawler speed
14 years ago
orbiter 78d4c45d09 enhancement during search process: fast fail of search in case that all index feeder have terminated.
14 years ago
orbiter a50f28e6e7 - fixed missing save operation for peer name change
14 years ago
orbiter 1989ebc24b removed more warnings
14 years ago
orbiter 8f11d3a5bb redesigned the ScoreMap classes:
14 years ago
orbiter 694fa3a2a5 - replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion
14 years ago
orbiter 30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
14 years ago
lotus cb6d307bba adding extension for parser
14 years ago
orbiter 3820525464 more memory protection: auto-flush of caches in case of memory shortage
14 years ago
orbiter e1b6916423 always try to guess the size of a StringBuilder to prevent too many memory re-allocations
14 years ago
low012 3b40b98256 *) set SVN properties
14 years ago
orbiter cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
14 years ago
orbiter 8d14916c74 more patches for a better out-of-memory management
14 years ago
orbiter f8d0454c53 small bug fixes and experiments with search speed enhancement
14 years ago
orbiter 5e186e0122 continuing the fight against deadlocks during time formatting: better caching.
14 years ago
orbiter a92d80a545 performance enhancements using an alternative to a insensitive collator (a complex string compare):
14 years ago
orbiter e717bf74ba more logging, more care about OOMs
14 years ago
orbiter 5892fff51f introduction of dht-burst modes: this can expand the number of target peers in some cases where a better heuristic is needed. The problematic cases are either when a muti-word search is made (still a hard case for our term-oriented DHT) or when a network operator wants that all robinson peers are asked. We therefore introduced two new network steering values that switch on more peers during the peer selection. Because the number of peers can now be very large, the number of maximum httpc connections was also increased.
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
orbiter 0cdfb82963 replaced more appearance of double values by float values
14 years ago
orbiter eb12e15738 moved all Double values to Float values because of
14 years ago
orbiter 88773e4daa changed the default port from 8080 to 8090
14 years ago
low012 9f38c0023d *) Minor changes, mainly cleaning up a little bit, no functional changes.
14 years ago
orbiter 54e77e6255 refactoring
14 years ago
orbiter 10ae8d961b - cora package has now no dependencies to other yacy packages and becomes a 'base' package (refactoring)
14 years ago
low012 9eae33f886 *) Ooops...
14 years ago
low012 a001e8075c *) minor enhancements
14 years ago
low012 11ea966f9e *) added SID file (Commodore 64) sound file parser
14 years ago
orbiter 3ca06d6290 patch for http://forum.yacy-websuche.de/viewtopic.php?p=21460#p21460
14 years ago
low012 936e976c23 *) added FreeMind (http://freemind.sourceforge.net/) mindmap parser
14 years ago
low012 3d95981f7d *) cleaning up the code a little bit
14 years ago
low012 2a6499364d *) minor changes
14 years ago
low012 c0274bd123 *) minor changes
14 years ago
orbiter 59b70a5a92 another fix to the ftp crawler: now correct directory listings according to rfc2640 (path with spaces) and better title names for such files
14 years ago
orbiter 9b25a33fd9 - fixed numerous bugs
14 years ago
orbiter 7bdb13bf7f more fixes to smb crawling: better file names
14 years ago
orbiter c288fcf634 redesigned CrawlStartScanner user interface and added more features:
14 years ago
f1ori 9d2159582f * fix system update if urls are in blacklist (for example for very general blacklists like *.de)
14 years ago
orbiter 56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls
14 years ago
orbiter a563b05b60 enhanced crawler:
14 years ago
orbiter c36da90261 added a very fast ftp file list generator to site crawler:
14 years ago
orbiter 4e2c14efbb fixed bugs in parser and ftp client
14 years ago
orbiter f0651e5f2f added image search to yacyinteractive.html
14 years ago
orbiter b769cce433 - added a catch-all parser for all documents that cannot be parsed: they will contributed with their document url for the search index only
14 years ago
low012 9b3fae9496 *) cleaning up the code a little bit
14 years ago
low012 025e3f4790 *) renamed classes according to standard Java coding conventions
14 years ago
f1ori a025b1da89 * fix bug when browsing local filesystem (e. g. repository) with yacy
14 years ago
orbiter 4c72885cba added a sitemap entry parser and loader for sitemaps
14 years ago
orbiter fb92f9ae8e added mime type image/jpeg (image/jpg is wrong but it is left here because it does not harm and this error also exists in configuration of web servers)
14 years ago
f1ori 7d8de34778 * add a bit documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null)
14 years ago
orbiter 58e74282af added a word counter statistic in condenser which is used by the did-you-mean to calculate best matches for given search words.
14 years ago
orbiter de722090b5 enhancements in did-you-mean guessing
14 years ago
orbiter 0d363a94d7 more performance hacks
14 years ago
orbiter b8aee6d402 performance hacks for better search performance
14 years ago
orbiter aacf572a26 - enhancements for search speed
14 years ago
orbiter d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
14 years ago
f1ori e670e1ef8e add charset auto-detection for htmlParser
14 years ago
f1ori ddcd5ae78c fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2989
14 years ago
f1ori 8fe1102452 fix http://forum.yacy-websuche.de/viewtopic.php?p=20889#p18426
14 years ago
orbiter 10a9cb1971 simplified snippet computation process and separated the algorithm into two classes
14 years ago
orbiter 84a023cbc8 fixed several search bugs
14 years ago
lotus d2a3d08c44 avoid div. by zero
14 years ago
orbiter 570ca577c6 performance hacks
14 years ago
orbiter 114bdd8ba7 fixed old sitemap importer which was not able to parse urls containing post elements
14 years ago
orbiter c0b08ac59b slighlty changed way of pdf parser integration
14 years ago
orbiter 5fe828fa06 - replaced pdfbox and fontbox version 1.1.0 with 1.2.1
14 years ago
orbiter 24502fe3de performance hacks
14 years ago
orbiter 22047ffad5 enhanced computation speed of many replaceAll string operations
14 years ago
orbiter 3988a95fb5 added ability in rss reader to parse atom feeds
14 years ago
orbiter 9d080f387e change in handling of the all-visible home path for storage in YaCy:
14 years ago
orbiter 0010cd9db1 Support for indexing of RSS feeds!
14 years ago
orbiter 844f158686 - removed dependencies in header framework:
14 years ago
orbiter 5e7081cd19 refactoring towards a unified loading mechanism for MultiProtocolURIs
14 years ago
orbiter e10cd115a9 - added a new RSS reader interface. This is not finished but you can now load and look at RSS feeds. It will be used to index RSS feeds in a way that is appropriate for such kind of data.
14 years ago
orbiter 933dc1a600 removed old rss parser (will be replaced with parser from cora package)
14 years ago
orbiter 5924a0d851 - enhanced concurrency in database index access for multicore
14 years ago
orbiter 989948e1a9 fixed generic image parser
15 years ago
orbiter 27d8a8b53e removed wrong com.sun.codec class access in generic image parser
15 years ago
orbiter b6fb239e74 redesign of parser interface:
15 years ago
low012 d4851441b0 *) Added Android packages to parser in order to be able to create a decentralized search for direct downloads of Android apps.
15 years ago
orbiter 150cf42a1b migrated all my LGPL 3 -licensed files to the LGPL 2.1 because LGPL 3 is not compatible to the GPL 2
15 years ago
orbiter 5a4684f21f allow words with length >= 2 (you can't search for 'wm' with 3-letter words...)
15 years ago
orbiter 37b8827a7a - removed the UPnP library sources from sbbi and added the jar library again. The library was included to get support for fedora releases, but after this time the fact that the sbbi cannot be part of fedora should be re-discussed. If this will still not be possible, then we may integrate the sbbi UPnP package using reflection.
15 years ago
orbiter 777195e8d1 more abstraction for access of LoaderDispatcher and cache
15 years ago
orbiter 7bcfa033c9 more abstraction of the htcache when using the LoaderDispatcher:
15 years ago
orbiter 87087f12fe - scanned remote search process and enhanced some data structure and synchronizations here and there
15 years ago
orbiter de4f30bb2e UTF-8 fix
15 years ago
orbiter 3a1cebb598 bugfixes
15 years ago
orbiter 60e71876ad - more abstraction (HashMap -> Map)
15 years ago
orbiter 2eea806005 less errors in image parser
15 years ago