Commit Graph

470 Commits (411aab02e3abf3bde0891b1cba4d77de1f3a9971)

Author SHA1 Message Date
orbiter 90dd197ae7 - no latency for local crawls
15 years ago
orbiter c855fc48c6 only load robots.txt for http and https protocols
15 years ago
orbiter 3300930fc5 - (almost) fixed FTP crawler
15 years ago
orbiter 9623d9e6d2 added a smb loader component for the YaCy crawler
15 years ago
orbiter b88f5fbb4b slightly changed crawling policy
15 years ago
orbiter 7684a575c4 fix for deletion of error database each time when YaCy starts up
15 years ago
orbiter c09a995930 better logging of double occurrences of urls in the crawler
15 years ago
orbiter 727dd9b193 - fixed a bug in robots.txt parser
15 years ago
orbiter 46c4f8b68a better look-ahead into the crawl queue: show more on crawl monitor
15 years ago
orbiter 564927ce72 redesign of CrawlResult data structures because of OOM occurrences during URL deletion processes.
15 years ago
lotus 945e0ba5a5 allow global search if res. observer disabled index transmission
15 years ago
lotus 11188cd7eb resource observer now uses the Java 6 method to check for free space. thus, disk observing now needs Java 6 installed.
15 years ago
orbiter 8ce936bcdd added an api recording function: it shall be possible to record
15 years ago
orbiter e80e060ca6 - increased thread priority for server threads
15 years ago
orbiter 473b11033d fixed network switch process - crawling did not work after a switch before this fix
15 years ago
orbiter bd05e57d3b fix for http://forum.yacy-websuche.de/viewtopic.php?p=18563#p18563
15 years ago
orbiter 5df628a2a4 - added BEncoder class
15 years ago
orbiter 82f57f79e5 more PMD enhancements
15 years ago
orbiter a06f7ddb33 more PMD recommendations
15 years ago
orbiter 66c0a8e849 more PMD recommendations
15 years ago
orbiter bc96d74813 - clean-up of robots.txt parser
15 years ago
orbiter dd459281c8 applied code changes that are recommended by PMD
15 years ago
lotus eac2daf2e8 * reenable DHT if yet enough memory is available
15 years ago
orbiter d77a8f3b3e added some modifications recommended by PMD for better performance
15 years ago
orbiter d1973bae2a code cleanup: removed unused code and unused methods
15 years ago
orbiter a3b8b7b5c5 some redesign of the main menu structure:
15 years ago
orbiter eeca2ded92 fix for http://forum.yacy-websuche.de/viewtopic.php?p=18500#p18500
15 years ago
orbiter dff4f95c78 some patches to get the torrent parser working
15 years ago
orbiter 362b7a929b added extensive memory protection logic to avoid out of memory errors that may be caused by the RowCollection memory allocation function
15 years ago
orbiter e34e63a039 preset of proper HashMap dimensions: should prevent re-hashing and increase performance
15 years ago
orbiter 4a5100789f replaced _all_ size() == 0 with isEmpty() and all size() > 0 with !isEmpty(). The isEmpty() method is much faster in some cases, especially when used to access badly balanced hashtables where a size() operation becomes a large iteration.
15 years ago
orbiter 2d8f3ee301 some performance hacks
15 years ago
orbiter 4c99d4683d possible fix for lost crawl profile handles: clean-up job did wrong measurement to see if crawl is still running.
15 years ago
lotus 6edc168cfe option to disable dht by memory limit:
15 years ago
orbiter 4431b9767e added about 450 replacements for printStackTrace() methods to pipe such traces into the log at DATA/LOG/
15 years ago
orbiter e3025ee691 - new icon for OAI-PMH loading action
15 years ago
orbiter b0b7a4f9a5 - added function to OAI-PMH reader that can pull all records from a server using an evaluation of the resumption token to get URL to retrieve remaining records
15 years ago
lotus 79251e6f60 configurable disk space hardlimit for dht
15 years ago
orbiter a0e891c63d - some redesign in UI menu structure to make room for new 'Content Integration' main menu containing import servlets for Wikimedia Dumps, phpbb3 forum imports and OAI-PMH imports
15 years ago
orbiter 30f108f97d added stub of oai-pmh importer (not working yet)
15 years ago
orbiter 52470d0de4 - fix for xls parser
15 years ago
orbiter 5e8038ac4d - refactoring of blacklists
15 years ago
orbiter 26fafd85a5 - more refactoring
15 years ago
orbiter 3528b970d6 - refactoring
15 years ago
orbiter a8ce192f63 - shifted main classes to new package net.yacy
15 years ago
orbiter b79f4f062f refactoring of yacy documents and parsers: they depend now only on the kelondro classes
15 years ago
orbiter e7f18ba24b refactoring
15 years ago
orbiter ce8dc575ca refactoring
15 years ago
orbiter bea3b99aff moved table and util classes
15 years ago
orbiter c0e0e1f422 moved blob classes
15 years ago
orbiter 1e4f8b56ed accumulated classes from different packages into the new rwi package
15 years ago
orbiter 194da25a2f moved kelondro index
15 years ago
orbiter 4446acc8cd moved kelondro order
15 years ago
orbiter f677d534b1 start of a really extensive refactoring which will produce a hierarchical package structure with the domain yacy.net as package root
15 years ago
orbiter 735e2737e3 * added index segments
15 years ago
orbiter 6e0dc39a7d - some fixes to prevent blocking situations
15 years ago
orbiter 04a548a1e3 - temporarily integrated the transferURL servlet as a static class instead of a class that is called using reflection, to investigate the OOM problems in that class
15 years ago
orbiter 6aa474f529 - better logging for web cache access and fail reasons
15 years ago
orbiter 3671c37989 added experimental oai-pmh reader and integrated it with the existing dublin core parser
15 years ago
orbiter 2e6bdce086 - added more logging to balancer
15 years ago
hermens 62a7341c4d Fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2204
15 years ago
low012 f65bfaa9af *) Removed base tag from error page. This was added by me a long time ago as a workaround for some weird behavior of my router, but as it turns out, it does more harm than good in general: If HTTPS is used for communication with YaCy, entering a wrong password led to an error page with a form which would send username and password unencrypted with the user possibly being unaware of this.
15 years ago
orbiter 3e9dcfc204 fix for http://forum.yacy-websuche.de/viewtopic.php?p=17504#p17504
15 years ago
orbiter 573d03c7d7 added configuration to enable ram table copy
15 years ago
orbiter ce972ff4ef update to default ranking profile which has now some settings to deny some phpbb3 pages which are redundant in the index when crawling phpbb3.
15 years ago
orbiter 44579fa06d - fixed a problem loading images through yacy's document loader,
15 years ago
orbiter 72e5407115 refactoring of snippet cache
15 years ago
orbiter 8e56c2ace6 fix for fixes from this afternoon
15 years ago
orbiter cf739edc2e fix for possible deadlock, see
15 years ago
orbiter 92edd24e70 fixed problem with switching of networks
16 years ago
orbiter 0575f12838 fix for deadlock
16 years ago
orbiter c0e17de2fb - fixes for some problems with the new crawling/caching strategies
16 years ago
orbiter 634a01a9a4 replaced wget-requests with caching requests
16 years ago
orbiter c6c97f23ad - added cache usage properties to crawl start
16 years ago
orbiter c4ae2cd03f fixed bug that caused deletion of crawl profiles at every application startup
16 years ago
orbiter 161d2fd2ef redesign of access to the HTCache (now http.client.Cache):
16 years ago
orbiter 4da9042e8a code simplification
16 years ago
orbiter 1d8d51075c refactoring:
16 years ago
orbiter 5bb8074150 removed the indexing queue. This queue was superfluous since the introduction of the blocking queues last year, where documents are parsed, analysed and stored in the index with concurrency.
16 years ago
orbiter b332dfad67 - inserted request object into response object, which now carries it instead of generating new objects
16 years ago
orbiter ca72ed7526 - removed superfluous crawl cache
16 years ago
orbiter 13c63f4082 a set of small fixes to crawling behaviour
16 years ago
orbiter 43c8defd79 enhanced parser with more extension + mime attributes
16 years ago
orbiter b2263bc720 enhanced document type recognition
16 years ago
lotus aa38eb5a20 * maxfilesize -1 for infinite filesize
16 years ago
lotus 9cfe89c8fc * process content-length as soon as it is received
16 years ago
lotus 9f083bb6b2 check filetype before loading (no more mp4 loading)
16 years ago
f1ori f814e0fa81 enable warnings and fix most of it
16 years ago
orbiter 57a88d435b redesign of parser mime type detection and parser steering
16 years ago
orbiter 21b8704fb4 refactoring of the ParserDispatcher and ParserConfig: resulted in Idiom, Parser and Classification classes
16 years ago
orbiter dafffd0153 refactoring of parsers and document processing
16 years ago
orbiter 024744245c small refactoring to prepare for new queues
16 years ago
orbiter 24cb6d68bc - renamed Stack to RecordStack to avoid name confusion with new classes
16 years ago
orbiter 995da28c73 all stack/heap files that had been stored in DATA/PLASMA are now stored in the network-specific QUEUES path
16 years ago
orbiter 409538e17a code cleanup and code simplification
16 years ago
orbiter 1f1399e5c5 extending visibility of objects and methods to avoid synthetic accessor methods and increase performance
16 years ago
orbiter 154bbc3364 code cleanup: call of static methods directly to the class
16 years ago
orbiter 222850414e simplification of the code: removed unused classes, methods and variables
16 years ago
orbiter 93dfb51fd4 problems with code style
16 years ago
orbiter 9a674d8047 - After the removal of the Tree class some code simplifications are possible. This mostly affects the Records class, which can be refactored, resulting in a reduced number of classes.
16 years ago
orbiter c5122d6836 completed migration of BLOBTree to BLOBHeaps:
16 years ago
orbiter ae015e8e98 refactoring of blob package classes
16 years ago
orbiter ce1adf9955 serialized all logging using concurrency:
16 years ago
orbiter b8e738a7be a collection of
16 years ago
orbiter d58b395993 fix for http://forum.yacy-websuche.de/viewtopic.php?p=15693#p15693
16 years ago
orbiter b6e274f211 omit most of forced crawl delays by using a separate delay table which flushes delayed URLs at the correct time
16 years ago
orbiter d50be59088 - added an automatic re-construction of the domain stack after 10 minutes. This then adds URLs to the domain stack that were left over because of stack size limitations when the domain stack was last created
16 years ago
orbiter 5fdba0fa51 - fixed a non-working selection rule in balancer
16 years ago
orbiter f5602404d5 another speed boost for the balancer
16 years ago
orbiter 95e8cbd1c3 new fully redesigned balancer and bugfixes regarding lost profile handles and killed crawls
16 years ago
orbiter 42ae40b9f6 some bugfixes to database close() methods
16 years ago
orbiter 88426912ad more refactoring to make the segment object easier to use and to be prepared to integrate author navigation
16 years ago
orbiter 99bf0b8e41 refactoring of plasmaWordIndex:
16 years ago
orbiter 3d4b826ca5 migration of all databases that use the deprecated BLOBTree format into the BLOBHeap format. Old databases are migrated automatically.
16 years ago
orbiter 63a0255166 - refactoring: added new content package, which will contain connector classes for different types of data sources to import texts into the YaCy index
16 years ago
orbiter addecdb18c simplified code, removed one unused method in all implementing classes
16 years ago
lotus 734680dc70 initialize the ResourceObsever in own thread
16 years ago
orbiter d2ac0aa682 - fixed possible bugs in Stack (may affect Crawler reset) and RandomAccess handling
16 years ago
orbiter 138422990a - removed useCell option: the indexCell data structure is now the default index structure; old collection data is still migrated
16 years ago
lotus 635b0a9da7 code-split
16 years ago
orbiter fa3adbbfc6 added domain checks to surrogate reader and RWI transfer receiver to prevent spamming using surrogates
16 years ago
lotus ab0030d7a7 allow dht-out for remote-crawl processing peers on default settings
16 years ago
orbiter 4e97a31009 corrections in dublin core syntax
16 years ago
orbiter 7dfe7e7cc6 fixed some problems with surrogate reader. This is now ready for testing.
16 years ago
orbiter 9050a3c4c5 alpha version of surrogate reading and indexing.
16 years ago
orbiter ad78e3a59f - less lines in rssTerminal
16 years ago
orbiter bc80dc913a added new surrogate reader (surrogates are parsed documents in batches)
16 years ago
orbiter e58320a507 added more info in log for debugging
16 years ago
orbiter c0e8ed5461 fixed problem with not http client
16 years ago
orbiter c2359f20dd refactoring: better abstraction of reference and metadata prototypes.
16 years ago
shostakovich 1f37cc6107 Robots.txt is now reused after one day. See forum-topic:
16 years ago
orbiter 9bfb2641db - removed deprecated threads
16 years ago
orbiter b6c2167143 - patch for bad web structure dumps
16 years ago
orbiter 0139988c04 - added writing of temporary file names and renaming to final file name when index dump/merge are done. Interrupted merges can be cleaned up.
16 years ago
orbiter 3621aa96ab - added a memory protection for the IndexCell migration
16 years ago
orbiter d39a5b42ca more care about open file handles. Now files also close on windows and can be deleted afterwards.
16 years ago
orbiter 029495e64d fixed bug introduced in SVN 5756 in EcoTable.put()
16 years ago
orbiter 587838bd09 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5758 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter 96eaecda3e - added migration class to go from index collections to the index cell data structure.
16 years ago
orbiter 37f892b988 added new concurrent merger class for IndexCell RWI data
16 years ago
borg-0300 8c494afcfe svn attributes added
16 years ago
orbiter 67aaffc0a2 - added Latency control to the crawler:
16 years ago
orbiter 61f9dbf0cc - fixed a display problem in watch crawler
16 years ago
orbiter b3f75e48fa - enhanced balancer: auto-solving of waiting-deadlocks
16 years ago
orbiter d99ff745aa fix for http://forum.yacy-websuche.de/viewtopic.php?p=13378#p13378
16 years ago
borg-0300 fd0976c0a7 refactoring
16 years ago
borg-0300 ce79239322 "typo"
16 years ago
orbiter 7dff1cba62 removed option to use different primary keys in kelondro tables
16 years ago
orbiter 7f67238f8b refactoring of plasmaWordIndex: less methods in the class, separated the index to CachedIndexCollection
16 years ago
orbiter 14a1c33823 refactoring of wordIndex class
16 years ago
orbiter f6d989aa04 added new class RowSetArray which arranges RowSet objects like Elements in a hashtable, but still provides the functionality of sorted enumeration. The new class is now integrated into the ObjectIndexCache, which is the core class to provide index functions to all database files. The new index access is about twice as fast as before. This has strong speed enhancement effects on all parts of YaCy.
16 years ago
orbiter efcd95dc37 simplification of (internal) query process / refactoring
16 years ago
orbiter aa44d9bad9 more refactoring of kelondro.text / deleted de.anomic.index
16 years ago
orbiter 76ef5f0f14 refactoring of index package: better names for the classes (to be continued)
16 years ago
orbiter 8444357291 added new row iterator in kelondro table files that enumerates rows
16 years ago
orbiter 9559bc23fd automatic clean-up of dead connections
16 years ago
orbiter 4f9dae2571 remove reference in crawl entries
16 years ago
orbiter c12bb8a6d0 - refactoring of the http client
16 years ago
orbiter 62505bb3cb more bugfixes as recommended by findbugs
16 years ago
orbiter 411f2212f2 more memory leak fixing hacks
16 years ago
orbiter 6c627dbdff update to the server core
16 years ago
orbiter 6a876ecb88 first fixes to the DHT transmission process
16 years ago
orbiter c25c334b75 replaced old DHT transmission method with new method. Many things have changed! some of them:
16 years ago
orbiter 65a1de6c05 longer timeout for remote crawl queries
16 years ago
orbiter 94110df85a moved logging partially to kelondro
16 years ago
orbiter 024da2916b refactoring of logging
16 years ago
orbiter 83ce65707a (almost) completed partition of classes in kelondro
16 years ago
orbiter 7ee494fde5 more refactoring of kelondro:
16 years ago
orbiter bf93767ec6 refactoring of kelondro database classes
16 years ago
orbiter fc27bf8c4c refactoring of kelondro classes:
16 years ago
orbiter 91af105373 last changes before release
16 years ago
orbiter 05c235de32 fix for npe
16 years ago
orbiter 2b32248079 fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1516&p=10545#p10545
16 years ago
orbiter 4d5b401f00 try to fix some performance problems with the internal index management:
16 years ago
orbiter c6880ce28b removed the permanent cache flush and replaced it with a periodic cache flush
16 years ago
orbiter 07fc115e90 removed active profiling in kelondroRowSet
16 years ago
orbiter e004da48d3 - added fast fingerprint computation for files (any). Will be used in new index dump method
16 years ago
orbiter bb935fdbb0 less organization overhead for DNS caching and prefetching
16 years ago
orbiter e34ac22fbd - added new monitoring servlet at
16 years ago
orbiter d376d81fc4 replaced busy thread control of crawl stacker by blocking threads
16 years ago
f1ori 0881190b19 * Robots.txt: don't interpret Crawl-Delays for other robots
16 years ago
orbiter 243e73f53b removed unnecessary usage of kelondroBLOBTree
16 years ago
orbiter 7535fd7447 - refactoring of CrawlEntry and CrawlStacker
16 years ago
orbiter 2802138787 - refactoring of CrawlStacker (to prepare it for new multi-Threading to remove DNS lookup bottleneck)
16 years ago
orbiter 1779c3c507 - added a read cache to the RAFile interface to RandomAccessFile
16 years ago
orbiter 47292e696a more performance hacks
16 years ago
orbiter d39d420b39 performance hacks
16 years ago
orbiter fa26a8f25a fix for deadlock-like behavior in balancer
16 years ago
orbiter 1918a0173e added more exception handling during crawling
16 years ago
orbiter dba7ef5144 extended crawling constraints:
16 years ago
orbiter ef66438662 - more space in error db to store larger error messages
16 years ago
orbiter 674ad2d55b different handling of error cases that occur during loading files with http or ftp:
16 years ago
lotus 16723d0fa6 ask another peer if crawljob loading fails
16 years ago
orbiter 1b18d4bcf3 enhancement to crawling and remote crawling:
16 years ago
orbiter 3f746be5d4 - consolidation and refactoring of many DHT target - computing methods
16 years ago
lotus 5cf0cbb47e javadoc
16 years ago
lotus 8d07607d1d update to resource observer:
16 years ago
orbiter 1778fb420d - added some performance tweaks to the new BLOB buffer
16 years ago
orbiter 382226da94 fix for bug introduced in SVN 5281: parameters were switched
16 years ago
danielr f2fd043797 refactoring (moved duplicate code into methods)
16 years ago
orbiter 826ca79735 refactoring and new architecture to store the files of the web cache:
16 years ago
orbiter 6fb865fbdc - fix of bug in iterator in kelondroBLOBHeap which caused bug in crawl profile listing
16 years ago
orbiter 2d65887723 - fix for bug in new profile handling
16 years ago
orbiter ff68f394dd fix for problem with balancer and lost crawl profiles:
16 years ago
orbiter 9ac16f565b - fixed several bugs in database management functions
16 years ago
orbiter c8bdd965ec - larger update time for status page
16 years ago
orbiter ce57de6cb3 - fixed re-setting of DHT Send/Receive settings
16 years ago
orbiter e1f67262f7 - added and removed some debugging output
16 years ago
orbiter 21dbb39afa switched two balancer cases
16 years ago
orbiter 1bbf362cef update to the crawl balancer: better organization and better crawl delay prediction
16 years ago
orbiter ddcf285499 - fixed a bug in performance setting (did not work with German translation)
16 years ago
orbiter 0cd0fee546 fixed bug with wrong proxy result enqueueing. See:
16 years ago
lotus fd9233244e configurable free disk space via disk.free
16 years ago
lotus 73f233bb11 * set resource observer to 1000MB
16 years ago
orbiter a28faabfd2 fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1351&p=9242#p9242
16 years ago
f1ori bea6c13139 * with r5137 robotParser didn't work at all -> fix
16 years ago
f1ori ae677e1738 * fix problem in robotparser, see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1421&p=9742
16 years ago
orbiter 39964e88fa fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1329#p9121
16 years ago
orbiter 3f3673b6e5 extended balancer:
16 years ago
orbiter d09ddabd09 corrected a design mistake (5-byte hashes not necessary)
16 years ago
orbiter 77ee0765a4 - added domain statistic generation to IndexControlURLs_p.html servlet
16 years ago
orbiter 80a7bc93d6 - added statistical evaluation about domains that appear during crawling
16 years ago
orbiter 05dbba4bab added logging conditions to all fine and finest log line calls
16 years ago
orbiter d3d41e2ee4 - fixed problem with searching with quotes (still not complete, but not as bad as before)
16 years ago
danielr 9ff4fc11da partial fix (images,audio,video) for proxy and content-type problem http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1374
16 years ago
lotus d9d9c522a1 addendum to last commit
16 years ago
lotus 480497f7c9 changed recrawl
16 years ago
orbiter 536e77e8b7 modifications towards a single database operation to read/write http header and cached file at once:
16 years ago
danielr 3c68905540 remove redundant null checks
16 years ago
orbiter 7989335ed6 Preparations to replace the HTCache with a new storage data structure:
16 years ago
danielr be28af50f5 - fixed "yacy2yacy no proxy"-problem
16 years ago
danielr a087090bbb fixed starting crawl results in "No parser available to parse mimetype 'application/octet-stream'"
17 years ago
danielr 621b473b18 * removed some warnings of findbugs (http://findbugs.sf.net)
17 years ago
danielr 17b7845eb5 * refactoring
17 years ago
danielr 3bb870bfcd added final where possible
17 years ago
orbiter 50ef5c406f - refactoring of robots parser (removed opaque Objects[] result vector)
17 years ago
orbiter c3d461d191 - removed superfluous copyright statement
17 years ago
orbiter 3ca98fee42 removed superfluous copyright statement
17 years ago
orbiter 05c26d58d9 fixed missing remove operation in balancer
17 years ago
orbiter 606b323a2d fixed bug that appeared when a new crawl is started
17 years ago
orbiter 28d5703f8a - fixed a bug in Robots.txt loader which could have caused robots.txt files to be loaded from the same domain more than once
17 years ago
orbiter 7b1c9e6aee discovered and removed a (possibly large) memory leak:
17 years ago
orbiter 0f5fe8cc53 refactoring of method calling for objects from kelondroMapDataMining
17 years ago
orbiter 4acf0a61cd refactoring of kelondroObjects (mainly renaming to kelondroMap)
17 years ago
orbiter 1e6d12f146 Major update to BLOB data structures:
17 years ago
orbiter 81f75f5056 - removed unnecessary classes (these objects are much easier to handle using generics)
17 years ago
orbiter 7052f2f61f - added copyright header of ResourceObserver
17 years ago
orbiter 1400cdc91e - refactoring of resourceObserver (moved it to crawler)
17 years ago
orbiter a6719dfd2b - refactoring of robots parser
17 years ago
orbiter e81be7d4f2 added many missing user-agent declarations for yacy http client connections.
17 years ago