orbiter
c72a5cf326
added stub for phpBB3 extraction code using direct access to MySQL
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5979 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
e735d3a69f
fix for http://forum.yacy-websuche.de/viewtopic.php?p=15175#p15175
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5978 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
63a0255166
- refactoring: added new content package, which will contain connector classes for different types of data sources to import texts into the YaCy index
...
- refactoring: migrated data objects for the new connector classes
- added a DAO interface class to specify an abstract interface for database retrieval connector methods
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5977 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
f246928c20
first attempt to add 'real' Navigation to yacy search results: host navigation
...
- after a search is started, it is analysed how many hits are in each site
- this can be done very efficiently, because the navigation information is hidden in the url hash and can be computed very fast
- the search result shows a column on the right with the hosts and the hits per host
- after a click on a host the search is modified using the efficient site: operator
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5976 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
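The host navigation described in the entry above (f246928c20) can be illustrated roughly as follows. This is only a sketch under stated assumptions: it counts hosts by parsing result URLs with `java.net.URI` instead of reading the host part of the url hash as the commit describes, and the class `HostNavigation`, its method names, and the `query site:host` form are hypothetical, not taken from YaCy.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not YaCy code.
class HostNavigation {
    // Counts hits per host, keeping hosts in order of first appearance.
    static Map<String, Integer> hitsPerHost(List<String> resultUrls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : resultUrls) {
            String host = URI.create(url).getHost(); // null for URLs without a host part
            if (host != null) counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }

    // Assumed query syntax: the site: operator appended to the original query.
    static String restrictToHost(String query, String host) {
        return query + " site:" + host;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://example.org/a", "http://example.org/b", "http://example.net/");
        System.out.println(hitsPerHost(urls));
        System.out.println(restrictToHost("yacy", "example.org"));
    }
}
```

Note that `URI.create` throws an `IllegalArgumentException` for malformed input, so real result lists would need validation first.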
orbiter
54b9e99c01
- more information about peer tags
...
- peer tag is by default '*'
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5975 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
26a46b5521
increased default maximum file size for database files to 2GB
...
Other file sizes can now be configured with the attributes
filesize.max.win and filesize.max.other
the default maximum file size for non-windows OS is now 32GB
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5974 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
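How the two attributes from the entry above (26a46b5521) might be resolved is sketched below. Several things are assumptions made for illustration: that the values are plain byte counts, that the 2GB default is the Windows one, that the settings are available as a `java.util.Properties` object, and the class and method names themselves.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper, not YaCy code: picks the maximum database file size
// from filesize.max.win or filesize.max.other depending on the operating system.
class MaxFileSize {
    // Defaults named in the commit message (assumed: 2GB on Windows, 32GB elsewhere).
    static final long DEFAULT_WIN = 2L * 1024 * 1024 * 1024;
    static final long DEFAULT_OTHER = 32L * 1024 * 1024 * 1024;

    static long resolve(Properties config, String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        String key = windows ? "filesize.max.win" : "filesize.max.other";
        long fallback = windows ? DEFAULT_WIN : DEFAULT_OTHER;
        String value = config.getProperty(key);
        if (value == null) return fallback;
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return fallback; // unparsable setting: keep the default
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        System.out.println(resolve(config, System.getProperty("os.name")));
    }
}
```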
orbiter
addecdb18c
simplified code, removed one unused method in all implementing classes
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5972 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
borg-0300
47fce9020c
small change (Orbiter's wish)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5971 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
borg-0300
e07b14e5d7
finally a working fix for 5960
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5970 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
borg-0300
3ebb904d2c
fix for 5960, http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2119
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5969 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
734680dc70
initialize the ResourceObserver in its own thread
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5968 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
e005cfea37
fix for bug in -incell option of URLAnalysis
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5967 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
a7e392f31b
The collection index will not be supported any more.
...
Existing indexes based on the old index collections must be migrated with YaCy 0.8
- removed index collection classes and all migration tools
- added an 'incell' reference collection feature in URL analysis
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5966 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
a2f48863fc
- added prototype for navigation index
...
- refactoring of word index prototype
(no functional changes so far)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5965 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
47fd226bdb
proper parsing of sentences
...
does not affect tokens/words
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5964 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
27eb8d62cb
- new development cycle
...
- removed temporary configuration with safe setting for indexer threads (=1) and replaced it with best value computed during performance tests (1/2 of number of processors)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5963 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
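The "1/2 of number of processors" rule from the entry above (27eb8d62cb) could look like the sketch below. The helper name is hypothetical, and rounding down with a floor of one thread is an assumption; the commit message does not say how odd or single-processor counts are handled.

```java
// Hypothetical helper, not YaCy code: half of the available processors,
// but never fewer than one thread (the lower bound is an assumption).
class IndexerThreads {
    static int defaultFor(int processors) {
        return Math.max(1, processors / 2);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("indexer threads: " + defaultFor(cores));
    }
}
```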
orbiter
b7457d3807
patch for http://forum.yacy-websuche.de/viewtopic.php?p=14720#p14720
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5960 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
bffbe43e09
fix for http://forum.yacy-websuche.de/viewtopic.php?p=14522#p14522
...
fix for http://forum.yacy-websuche.de/viewtopic.php?p=14955#p14955
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5959 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
f133d6065c
fix for http://forum.yacy-websuche.de/viewtopic.php?p=14955#p14955
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5958 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
82af994041
added missing loglevel
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5956 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
ad9762746d
no exception in case of uniq() time-out, see also
...
http://forum.yacy-websuche.de/viewtopic.php?p=13177#p13177
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5955 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
1efe686e3f
fix for http://forum.yacy-websuche.de/viewtopic.php?p=13960#p13960
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5954 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
13fb84ab81
you can now define the default number of displayed search results with search.items
...
this applies only to requests through the classic-style page
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5953 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
f2e4d156e8
removed debug messages
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5950 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
709bfc2cd4
added a memory check in http post protocol
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5949 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c01d6f43e1
- fixed problem with thread dump if no arguments are given
...
- rejecting peers that are older than 6 hours (not-seen during 6 hours)
- 0.78, targeting 0.8 at the end of the week
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5948 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
a49edd9415
fix for bug in search with site: constraint
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5947 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c1e5fad9a7
fix for http://forum.yacy-websuche.de/viewtopic.php?p=14767#p14767
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5944 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
8ee3a94e82
fix for non-caching of sitehash, see http://forum.yacy-websuche.de/viewtopic.php?p=14440#p14440
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5942 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
borg-0300
21930d05ed
fix for [B@...
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5941 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
b6ba387e01
fix for http://forum.yacy-websuche.de/viewtopic.php?p=14751#p14751
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5940 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
4338dcf936
fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2093&hilit=
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5937 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
bad7ce9286
experimental option trayIcon.force for unsupported platforms. java 1.6 needed
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5936 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
low012
ea27853c59
*) some refactoring
...
*) added one assertion
*) no functional changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5935 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
low012
d164b42604
*) cosmetics
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5934 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
17150b2950
fixed bug in snippet computation
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5932 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
89aeb318d3
enhanced the wikimedia dump import process
...
enhanced the wiki parser and condenser speed
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5931 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
5fb77116c6
added a submenu to index administration to import a wikimedia dump (e.g. a dump from wikipedia) into the YaCy index: see
...
http://localhost:8080/IndexImportWikimedia_p.html
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5930 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
hermens
df733af4fa
Try not to lose content from ram during IndexCell.delete by moving ram.delete after the dangerous operations on the array (array.get and array.delete)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5929 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
hermens
ac72005f2f
Let IndexCell.remove remove entries from the ram portion of the DB as well.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5928 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
8ba7ff5353
a fix and another speed enhancement for the RWI cache
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5927 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
05f077e85f
added stack trace output to solve problem in
...
http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2076&hilit=&p=14612#p14612
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5926 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
71a4cadf31
better and more performant synchronization in SimpleARC, the caching object for word hashes. Speeds up indexing.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5925 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
e6773cbb33
better handling of RWI cache for concurrency and less overhead when writing new entries -> even more indexing speed
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5924 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c097531e3d
added a catch Exception to all threads to check if any of them silently dies without any other notification
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5922 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
083533e5ec
fix for bugs in IODispatcher
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5921 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
21fbca0410
better scaling of HEAP dump writer for small memory configurations;
...
should prevent OOMs during cache dumps
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5920 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
6e0b57284d
better care for states of the IODispatcher
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5919 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
1db9cdd4e4
fixed bug in writing of robots.txt entries in case that host names exceeded 64 characters and some other problems
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5918 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
f1ori
bde88b684a
* split off yacyRelease from yacyVersion
...
* added some gui infos about signatures
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5916 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
057ce14c8e
more fixes (character encoding, parser exceptions, http client failure, blob writing)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5914 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
d2ac0aa682
- fixed possible bugs in Stack (may affect Crawler reset) and RandomAccess handling
...
- increased default memory size to 180MB
- fixed possible bug in http client reset (there was a deadlock)
- bug in BOBHeap marked, but not solved, cause is still unknown.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5912 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
1351d903a1
don't follow links like mailto:
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5909 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
e88a66bcae
temporarily disabled computation of all sublinks (check needed)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5908 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
low012
ff5f82d780
*) removed description of removed commands from wikiHelp ([= =])
...
*) used format function of Netbeans for wikiCode to make it more readable, no functional changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5907 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
eacf95213a
fix for crawling of mailto-links
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5906 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
9c6ac43f66
fixes for wiki parser
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5905 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
3a64c9d02f
- fix for problem with concurrency when computing word hashes
...
- fix for search in case that a urlfilter was used and zero results were returned
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5904 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
d3f8aa5a2a
set of small fixes
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5903 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
low012
78ffb61297
*) got rid of unnecessary variable which might also fix IndexOutOfBoundsException
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5902 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
d31e6f9c14
fix for http://forum.yacy-websuche.de/viewtopic.php?p=14457#p14457
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5899 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
8d6212233b
fix for IODispatcher
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5896 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
f678472f46
fix for quote problem in json output
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5895 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
d079d6dfdb
small changes in surrogate reader, wiki code and portal test
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5894 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
07f09742bb
set of small fixes and comments
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5893 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
borg-0300
06ed4ef7b3
* better picture handling
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5891 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
5a634cab23
removed generation of anchor link sets in document types that describe container formats.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5890 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
low012
f1244264b8
*) hopefully fixed bug reported in http://forum.yacy-websuche.de/viewtopic.php?t=2057
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5882 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
2e3186189b
fix for mediawikiIndex surrogate producer + added concurrency
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5880 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
apfelmaennchen
6f5ea7b1a8
small fix for previous post
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5879 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
apfelmaennchen
138a0747e3
added serverObjects.putJSON as JSON has very particular encoding requirements
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5877 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
d977dd9a96
fix for surrogate loader
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5870 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
9cb68353da
fix for bug in ProfilingGraph for ppm >> 10000 ppm (!)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5868 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
9e4db75aac
reduced internal logging and reduced memory that internal logging can use
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5867 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c10c257255
attempt to fix a deadlock situation where the IODispatcher did not work.
...
I suspect the dispatcher thread has crashed and queues filled so no indexing process was able to write data.
This fix tries to heal the problem, but I am unsure if it helps. To get a better view of the problem, some more log output has been added.
Also added a new attribute indexer.threads to control the default number of threads for the indexer (default is 1)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5866 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
09987e93fd
fixed some more bad handling of byte[]
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5865 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
1bcc1450cb
more explanatory error message in case of IOExceptions during html parsing
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5864 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
fe51f4d668
less synchronization may help to prevent deadlocks
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5863 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
58802e4201
added missing success test in storeDocumentIndex,
...
see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1922&hilit=
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5862 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
171e62bee5
addition to the fix from last commit (which did not work)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5860 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
059949a0d1
tried to fix problem with snippet fetch for second search page when verify=false
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5859 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
b08991e278
moved some constants, rename of Tray class
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5858 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
138422990a
- removed useCell option: the indexCell data structure is now the default index structure; old collection data is still migrated
...
- added some debugging output to balancer to find a bug
- removed unused classes for index collection handling
- changed some default values for the process handling: more memory needed to prevent OOM
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5856 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
1b9e532c87
some concurrency for wikipedia dump reader
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5855 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
25d2160288
small fix
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5853 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
16baa7ad24
To translate a mediawiki dump into the YaCy surrogate format do the following:
...
- download a wikipedia dump, e.g. dewiki-20090311-pages-articles.xml.bz2
from http://download.wikimedia.org/dewiki/20090311/
- move dewiki-20090311-pages-articles.xml.bz2 to DATA/HTCACHE/
- start the conversion; open a command shell, move to the yacy home directory and execute
java -Xmx2000m -cp classes:lib/bzip2.jar de.anomic.tools.mediawikiIndex -convert DATA/HTCACHE/dewiki-20090311-pages-articles.xml.bz2 DATA/SURROGATES/in/ http://de.wikipedia.org/wiki/
this generates a series of files to DATA/SURROGATES/in
if YaCy is running (it may run concurrently), it fetches all new dumps in the surrogate-in directory. The export process is transaction-safe, which means YaCy will not start reading a dump before it is completely written.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5851 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
0b2c98edc9
some more work on the wikipedia-dump exporter (not finished yet)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5850 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
5195c94838
two patches for performance enhancements of the index handover process from documents to the index cache:
...
- one word prototype is generated for each document, that is re-used when a specific word is stored.
- the index cache now uses ByteArray objects instead of byte[] to reference the RWI. This speeds up access to the map that stores the cache. To dump the cache to the FS, the content must be sorted, but sorting takes less time than maintaining a sorted map during caching.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5849 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
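The second patch in the entry above (5195c94838) relies on wrapping byte[] in an object so it can serve as a hash map key, and on sorting only at dump time. A minimal sketch of that idea follows; `CacheSketch` and `ByteKey` are hypothetical stand-ins (YaCy's real ByteArray class may differ), and the unsigned byte order used for sorting is an assumption, since YaCy orders hashes with its own Base64 ordering. `Arrays.compareUnsigned` requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not YaCy code.
class CacheSketch {
    // A raw byte[] uses identity-based equals/hashCode, so it cannot act as a
    // content-based map key; this wrapper compares by content instead.
    static final class ByteKey implements Comparable<ByteKey> {
        private final byte[] bytes;
        private final int hash; // cached, since the content never changes

        ByteKey(byte[] source) {
            this.bytes = source.clone(); // defensive copy keeps the key immutable
            this.hash = Arrays.hashCode(this.bytes);
        }

        @Override public boolean equals(Object o) {
            return o instanceof ByteKey && Arrays.equals(bytes, ((ByteKey) o).bytes);
        }

        @Override public int hashCode() { return hash; }

        // Assumed order: unsigned lexicographic; only needed when dumping.
        @Override public int compareTo(ByteKey other) {
            return Arrays.compareUnsigned(bytes, other.bytes);
        }
    }

    // The cache stays unsorted while it fills; keys are sorted once for the dump.
    static List<ByteKey> sortedKeys(Map<ByteKey, ?> cache) {
        List<ByteKey> keys = new ArrayList<>(cache.keySet());
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        Map<ByteKey, Integer> cache = new HashMap<>();
        cache.put(new ByteKey(new byte[] {2}), 20);
        cache.put(new ByteKey(new byte[] {1}), 10);
        // Two wrappers with equal content address the same entry.
        System.out.println(cache.get(new ByteKey(new byte[] {1})));
        System.out.println(sortedKeys(cache).size());
    }
}
```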
orbiter
9416f5c26f
more speed test cases: kelondro provides map functions that are more than 20% faster than standard java classes and use less than half of the memory of java classes:
...
just start IndexTest (here with 1000000 test objects)
Performance test: comparing HashMap, TreeMap and kelondroRow
generated 1000000 test data entries
STANDARD JAVA CLASS MAPS
sorted map
time for TreeMap<byte[]> generation: 2110
time for TreeMap<byte[]> test: 2516, 0 bugs
memory for TreeMap<byte[]>: 29 MB
unsorted map
time for HashMap<String> generation: 1157
time for HashMap<String> test: 1516, 0 bugs
memory for HashMap<String>: 61 MB
KELONDRO-ENHANCED MAPS
sorted map
time for kelondroMap<byte[]> generation: 1781
time for kelondroMap<byte[]> test: 2452, 0 bugs
memory for kelondroMap<byte[]>: 15 MB
unsorted map
time for HashMap<ByteArray> generation: 828
time for HashMap<ByteArray> test: 953, 0 bugs
memory for HashMap<ByteArray>: 9 MB
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5847 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
b53790abb1
more performance hacks: 10% more speed for Base64.compare() which is really often used in YaCy code
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5846 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
8ffb9889e1
some fixes and performance hacks
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5845 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
dfb96ecb72
more fixes
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5844 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
1b8d346b4c
fixes in connection with transition to byte[] hashes
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5843 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
f1ori
0b0a46d35a
* fix transferRWI as suggested by celle (thanks!)
...
see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2000#p14023
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5842 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
996572de95
quickfix
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5841 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
380ed2dac0
performance and debugging additions
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5840 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
635b0a9da7
code-split
...
allow cgi indexing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5839 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
fa3adbbfc6
added domain checks to surrogate reader and RWI transfer receiver to prevent spamming using surrogates
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5837 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
f1ori
76af84d732
* add custom comparator to ScoreCluster for byte[]
...
* fixes http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2010
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5836 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
ab0030d7a7
allow dht-out for remote-crawl processing peers on default settings
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5834 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
low012
d1116c049f
*) added new method "contains()" to Blacklist interface
...
*) implemented contains() in class AbstractBlacklist
*) used new method in Blacklist_p to prevent double entries in blacklists
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5832 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
f1ori
08445e42f0
* don't throw exception, in case of bad charset in http-header
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5831 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
f1ori
2f860a2564
* convert byte[] hashes to string for log output
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5830 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
f1ori
d93a2a6552
* ignore whitespace so signatures can be copied and pasted more easily
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5828 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
fbcbcc5bdb
export of yacy document objects as dublin core record in xml
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5826 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
d7cbf4cdd4
more performance hacks: less overhead in word hash computation
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5825 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
29e96c1a60
bugfixes and performance hacks
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5824 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
4e97a31009
corrections in dublin core syntax
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5823 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
f1ori
44daec7936
* introduce signatures to autoupdate
...
as long as no public keys are set for the update locations,
no signatures are checked
* wiki-article follows...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5822 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
538e375901
replaced old caching method for computed word hashes with a better method. The word hash computation is a new performance bottleneck (after the IO bottleneck was removed with the IndexCell data structure) and a better caching for word hashes was necessary.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5821 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
9e853e1977
partly reverting SVN 5818: identical comparator required for join operator
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5820 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
e16c25ddf7
(peak-) performance hacks
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5819 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
63cd152969
fixes
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5818 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
7dfe7e7cc6
fixed some problems with surrogate reader. This is now ready for testing.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5817 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
3a1364ed5c
removed example lines from SurrogateReader sources; added additional example file
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5816 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
9050a3c4c5
alpha version of surrogate reading and indexing.
...
see the example file for an explanation.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5815 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
b15b059c0d
fix for latest commit
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5813 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c8624903c6
full redesign of index access data model:
...
terms (words) are no longer retrieved by their word hash string, but by a byte[] containing the word hash.
this has strong advantages when RWIs are sorted in the ReferenceContainer Cache and compared with the sun.java TreeMap method, which needed getBytes() and new String() transformations before.
Many thousands of such conversions are now omitted every second, which increases the indexing speed by a factor of two.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5812 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
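The redesign in the entry above (c8624903c6) keys the sorted map by byte[] directly instead of converting to and from String. A small sketch of a byte[]-keyed `TreeMap` is given below; the class `ByteKeyedIndex` is hypothetical, and the unsigned lexicographic comparator is an assumption standing in for whatever ordering YaCy actually uses for word hashes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical sketch, not YaCy code.
class ByteKeyedIndex {
    // Assumed order: unsigned lexicographic comparison of the raw bytes.
    static final Comparator<byte[]> ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    };

    // Keys are compared by content through ORDER, so no String is created per
    // lookup. Callers must not modify a key array after it has been added.
    private final TreeMap<byte[], Integer> references = new TreeMap<>(ORDER);

    void add(byte[] wordHash) { references.merge(wordHash, 1, Integer::sum); }

    int count(byte[] wordHash) { return references.getOrDefault(wordHash, 0); }

    public static void main(String[] args) {
        ByteKeyedIndex index = new ByteKeyedIndex();
        byte[] hash = "exampleHash1".getBytes(StandardCharsets.US_ASCII);
        index.add(hash);
        index.add(hash.clone()); // a different array with equal content hits the same entry
        System.out.println(index.count(hash));
    }
}
```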
f1ori
dd6b5005ff
* fix missing charset handling in getpageinfo_p
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5811 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
bd5f4c78d8
- added default profile for surrogate indexing
...
- integrated surrogate indexing into indexing queue process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5810 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
ad78e3a59f
- less lines in rssTerminal
...
- crawl more documents: if remote crawling is enabled, a remote crawl list is now also loaded while a local crawl is running, provided that the indexer is idle
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5809 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
bc80dc913a
added new surrogate reader (surrogates are parsed documents in batches)
...
this will open a new way to insert indexes into YaCy (instead of crawling)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5808 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
12d81e98eb
- fixed bad search results when searching for empty string
...
- simplified result handling and page composition in case that nothing was searched
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5807 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
8a24350036
- fix for join method with new generalized RWI data structure (caused by latest commit)
...
- added more functions to mediawiki parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5806 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
e58320a507
added more info in log for debugging
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5805 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
89ec3acb3e
- full abstraction of index content type: the kelondro full text index may now also contain indexes about other content than text, e.g. navigation indexes or reverse linking indexes.
...
- during index joins all word positions are maintained: better ranking for word distance possible; exact phrase match can be implemented soundly
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5804 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
borg-0300
7a48090fcf
- fix for "uk" language
...
- svn attributes added
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5803 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
dc2af61bc9
allow up to 50 results from remote peers
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5802 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c0e8ed5461
fixed problem with not http client
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5801 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
8862a2fed0
oops
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5799 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
de68948bc5
better handling of free memory computation and emergency cache flush for index cell
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5798 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
f1ori
fcb77c3140
* added .im (Isle of Man) to TLD-list
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5794 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
b81c7467d8
protection against too many files in RICELL in case of massive emergency dumps caused by low memory
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5791 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
d4d87d90c4
- extended experimental wikipedia dump parser
...
- removed historic, possibly unused code from wiki parser that conflicted with current wikipedia wiki code
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5790 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c3aff2521e
fix for NPE
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5789 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
57c00dd8c9
fix for bad filtering of common http error
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5788 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
14361f1ca4
added log message for index generation in HeapReader
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5787 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c08f9b36a4
refactoring of wiki parser.
...
This was done to prepare the wiki parser as a parser for wikipedia dumps, which will be used for performance tests (to avoid crawling)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5785 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
44e01afa5b
- refactoring
...
- a little bit more abstraction
- new interfaces for index abstraction
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5783 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
82fb60a720
increased memory limit for emergency cache flush
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5782 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
low012
9180617dd9
*) Classes to handle import of lists (especially blacklists) from XML files, not used yet, but will be used soon.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5780 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
lotus
596e6215dc
fix in case of white space in path name
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5779 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
b887f4a116
keep more free mem
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5778 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
c2359f20dd
refactoring: better abstraction of reference and metadata prototypes.
...
This prepares for introducing index tables other than the reverse text indexes used so far. The next application of the reverse index is a citation index.
Moved to version 0.74
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5777 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
ab656687d7
stricter BLOB initialization; may also help to save some RAM
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5776 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
5b138ada16
fixes to web structure reference collection and url construction
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5775 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
a29a11e526
added evaluation of incoming links in webstructure api
...
the api has changed, new XML schema.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5774 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
f6691411b5
- migration of files from SplitTable (which are used for the URL-DB) to a different file name format.
...
- the file generation logic is slightly different: files are now limited to a maximum size of one gigabyte and a maximum age of one month.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5773 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
shostakovich
1f37cc6107
Robots.txt is now reused after one day. See forum-topic:
...
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1669&p=13565#p13565
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5772 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago
orbiter
f21a8c9e9c
a different naming scheme for BLOBArray files. This may be necessary if blobs are written more than once per second.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5771 6c8d7289-2bf4-0310-a012-ef5d649a1542
16 years ago