reger
982601017e
crawling of filenames with + fails due to url decoding
...
modified UTF8.decodeURL to apply x-www-form-urlencoded ( space -> + ) to the query part of the url only.
11 years ago
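A minimal sketch of the idea behind 982601017e, assuming "query part" means everything after the first `?`. The class and method names are hypothetical and this is not the actual `UTF8.decodeURL` code; it only illustrates why a `+` in a filename should survive decoding while a `+` in the query becomes a space.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not YaCy's UTF8.decodeURL. */
final class QueryPlusDecoderSketch {

    /** Percent-decodes the URL; '+' is turned into a space only after the first '?'. */
    static String decode(final String url) {
        final byte[] in = url.getBytes(StandardCharsets.UTF_8);
        final ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        boolean inQuery = false;
        for (int i = 0; i < in.length; i++) {
            final int b = in[i] & 0xFF;
            if (b == '?') inQuery = true; // assumption: the query starts at the first '?'
            if (b == '+' && inQuery) {
                out.write(' '); // x-www-form-urlencoded rule, query only
            } else if (b == '%' && i + 2 < in.length
                    && hex(in[i + 1]) >= 0 && hex(in[i + 2]) >= 0) {
                out.write((hex(in[i + 1]) << 4) | hex(in[i + 2])); // %XX -> one byte
                i += 2;
            } else {
                out.write(b); // everything else, including '+' in the path, is copied
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Returns the value of a hex digit, or -1 if the byte is not a hex digit. */
    private static int hex(final byte b) {
        return Character.digit((char) (b & 0xFF), 16);
    }
}
```

With this sketch, `decode("http://host/a+b.txt?q=c+d%2B")` keeps the `+` in `a+b.txt` and decodes the query to `q=c d+`.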
reger
3b559e7846
optimize pdfParser
...
skip starting reader thread if all content already read
11 years ago
reger
09f73b790f
fix pdfParser not closed warning from pdfbox
...
for encrypted pdf on exit due to missing permission to extract
11 years ago
reger
92d1604a31
Crawler hostbalancer does not delete finished queue files,
...
use alternative delete to fight the symptom (and fix deletion of host dirs on startup)
Root cause (which class holds a lock on .stack) not found.
http://mantis.tokeek.de/view.php?id=404
11 years ago
Michael Peter Christen
0c324d735c
NPE fix for postprocessing without term index
11 years ago
Michael Peter Christen
922979aae1
added option to prefer http over https in unique-protocol ranking
11 years ago
Michael Peter Christen
b3b174e2b8
fixed webgraph postprocessing and status display in Crawler_p servlet
11 years ago
Michael Peter Christen
e6b28f5958
removed check on protocol for double content (user request)
11 years ago
reger
d8d318233e
fix logging settings
...
- add missing .level
- remove obsolete jena settings
- set default level=INFO to prevent debug logging of classes that are not explicitly specified
11 years ago
Michael Peter Christen
698f053658
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
f23c4142e0
added option to configure a custom user agent within allip networks
11 years ago
reger
8e233e2eb4
- fix typo in Message_p (defaultpath)
...
- use more existing switchboardconstants for getproperties
- replace deprecated call defaultservlet
11 years ago
orbiter
d7d38f9135
made number of open files in crawler configurable and increased default
...
maximum number of open files from 100 to 1000. This number can be
changed with the attribute crawler.onDemandLimit
11 years ago
Michael Peter Christen
8ad41a882c
fixed several problems with postprocessing:
...
- unique-postprocessing was destroying results from other
postprocessings; removed cross-updates as they were not necessary
- unique-postprocessing did not restrict on same protocol
- inefficient concurrent update cache was redesigned completely
- increased limits for concurrent blocking queues to prevent early
time-out
11 years ago
reger
ca5437dd50
fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149
...
local files can be crawled (intranet mode); url parsing fixed according to RFC 1738 (for unix and windows)
for win like file:///c:/tmp or file://localhost/c:/tmp
for linux like file:///tmp or file://localhost/tmp
Host is ignored and path must be absolute
11 years ago
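The file:// forms listed in ca5437dd50 can be illustrated with `java.net.URI`. This is a hedged sketch, not the YaCy parser: the class and method names are hypothetical, and stripping the leading slash in front of a Windows drive letter is an assumption about how the resulting path would be used.

```java
import java.net.URI;

/** Hypothetical illustration of the file URL forms named in the commit message. */
final class FileUrlSketch {

    /** Returns the local path of a file URL; the host part is ignored. */
    static String localPath(final String url) {
        final URI uri = URI.create(url);
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("not a file URL: " + url);
        }
        final String path = uri.getPath(); // null for opaque forms such as "file:tmp"
        if (path == null || !path.startsWith("/")) {
            throw new IllegalArgumentException("path must be absolute: " + url);
        }
        // Assumption: "/c:/tmp" denotes a Windows drive path, so drop the leading slash.
        if (path.length() >= 3 && Character.isLetter(path.charAt(1)) && path.charAt(2) == ':') {
            return path.substring(1);
        }
        return path;
    }
}
```

Under these assumptions `file:///c:/tmp` and `file://localhost/c:/tmp` both yield `c:/tmp`, and `file:///tmp` and `file://localhost/tmp` both yield `/tmp`.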
Michael Peter Christen
ff5b3ac84d
added new fields http_unique_b and www_unique_b which can be used for
...
ranking to prefer urls containing a www subdomain or using the https
protocol
11 years ago
sixcooler
5b1c4ef191
Monitoring and limit connection-count for Jetty
11 years ago
Michael Peter Christen
f0db501630
better handling of ranking parameters and new default values for date
...
navigation which is done using ranking in solr.
11 years ago
Michael Peter Christen
53948da7d0
tried to make last_modified recognition smarter
11 years ago
Michael Peter Christen
2d03037965
'Last-Modified', not 'Last-modified' according to
...
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
11 years ago
Michael Peter Christen
3dc5fb0050
fix for operator precedence bug (cast binds stronger than bitwise AND)
...
in peer hash hashing. This should not change anything if java casts long
to int by masking with 0xFFFFFFFFL but you never know. The important
thing is that hashCode() should not return numbers that have the
same order as the hash code order because hashing of seeds is used to
remove the order in some places.
11 years ago
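The precedence issue described in 3dc5fb0050 can be reproduced in isolation. The log does not show the actual peer hash code, so the following is only a sketch with hypothetical names: it shows that `(int) h & mask` is parsed as `((int) h) & mask`, and that narrowing either result to `int` gives the same value, which matches the commit's remark that the fix should not change anything.

```java
/** Hypothetical illustration of "cast binds stronger than bitwise AND". */
final class CastPrecedenceSketch {

    /** Parsed as ((int) h) & 0xFFFFFFFFL: the cast happens first, the result is a long. */
    static long castFirst(final long h) {
        return (int) h & 0xFFFFFFFFL;
    }

    /** Explicit parentheses: mask the long first, then narrow the result to int. */
    static int maskFirst(final long h) {
        return (int) (h & 0xFFFFFFFFL);
    }
}
```

For `h = 0x180000001L` the first form returns the long `2147483649`, the second returns the int `-2147483647`; casting the first result to `int` gives `-2147483647` as well, because Java's narrowing conversion keeps the low 32 bits.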
Michael Peter Christen
6634b5b737
debug code for index distribution testing
11 years ago
orbiter
49e344e8d9
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
orbiter
7705e36703
fix for latest generic warning fix
11 years ago
sixcooler
10326892a8
avoid errors from ConnectHandler, correction for #6d16fa9
11 years ago
orbiter
97983ba89f
fixed generics warnings for generic array instantiation that appeared
...
after migration to Java 7
11 years ago
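The log does not say how 97983ba89f silenced the generic array warnings. One common way to handle them, shown here purely as an assumption with hypothetical names, is to create an array of the wildcard type and confine the unchecked cast to a single annotated helper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the code changed in the commit. */
final class GenericArraySketch {

    /** Creates an array of n empty lists; the unchecked cast is confined to this method. */
    @SuppressWarnings("unchecked")
    static List<String>[] newBuckets(final int n) {
        // "new List<String>[n]" does not compile, so allocate List<?>[] and cast once.
        final List<String>[] buckets = (List<String>[]) new List<?>[n];
        for (int i = 0; i < n; i++) {
            buckets[i] = new ArrayList<>();
        }
        return buckets;
    }
}
```

The cast is safe here because the array never escapes with any other element type; the annotation documents that decision in one place instead of scattering warnings across callers.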
sixcooler
830057d788
lower Segment-size (hope to get Segments of 10GB)
...
see:
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5216&p=30036#p30034
11 years ago
orbiter
c028ae9b09
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
reger
e31493e139
"Use remote proxy for yacy" has no function, remove option and related config item
...
see/fix bug http://mantis.tokeek.de/view.php?id=23
http://mantis.tokeek.de/view.php?id=189
11 years ago
orbiter
181784a5cb
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
reger
0587077d06
cleanup obsolete and not used serverswitch Authentify code
...
as auth is mostly delegated to Jetty container.
11 years ago
orbiter
c9f66be20b
move unnecessary nested else out of condition
11 years ago
orbiter
0d8072aa99
removed warnings
11 years ago
orbiter
88f4af90da
removed warnings
11 years ago
orbiter
0f425e01ca
another circle computation enhancement
11 years ago
reger
a8d162810c
Exclude = from percent-encoding in MultiProtocolURL
...
fix http://mantis.tokeek.de/view.php?id=185 and http://mantis.tokeek.de/view.php?id=280
11 years ago
reger
024f8e9b33
fix truncated urls containing ","
...
addressing http://mantis.tokeek.de/view.php?id=58
Exclude comma from percent-encoding in MultiProtocolURL (see RFC 1738 2.2 and RFC 3986 2.2)
11 years ago
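The two MultiProtocolURL commits above (a8d162810c and 024f8e9b33) exclude `=` and `,` from percent-encoding. A minimal sketch of such an escaper follows; the class name, method name and the exact set of unescaped characters are assumptions for illustration and do not reproduce the MultiProtocolURL implementation.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not YaCy's MultiProtocolURL. */
final class PathEscapeSketch {

    // Assumed safe set: RFC 3986 unreserved marks plus '/', ',' and '='.
    private static final String UNESCAPED = "-._~/,=";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Percent-encodes every UTF-8 byte that is neither alphanumeric nor in UNESCAPED. */
    static String escape(final String s) {
        final StringBuilder sb = new StringBuilder(s.length() + 8);
        for (final byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            final int c = raw & 0xFF;
            final boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum || (c < 128 && UNESCAPED.indexOf(c) >= 0)) {
                sb.append((char) c); // kept literally, including ',' and '='
            } else {
                sb.append('%').append(HEX[c >> 4]).append(HEX[c & 0x0F]);
            }
        }
        return sb.toString();
    }
}
```

With this sketch `escape("/a,b c=d")` yields `/a,b%20c=d`: the comma and the equals sign stay intact, so a URL containing them is no longer altered at that point.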
Michael Peter Christen
9112f0a2df
enhanced circle tool initialization
11 years ago
Michael Peter Christen
a1ac4c3b76
automatically clear graphics cache
11 years ago
Michael Peter Christen
505f58c79c
enhanced circle computation time and memory footprint
11 years ago
reger
cd8c0dbda9
assign serialVersionUID for proxyservlet, too.
11 years ago
reger
b300d7f4ce
set serialVersionUID on urlproxyservlet to skip compiler warning
...
- remove commented out code
11 years ago
reger
e9060d31bd
update to Jetty 9
...
besides adjustments in code it makes the servlet settings in web.xml significant.
This applies to solr, gsa and proxy servlet. There is no longer a default setup in code during init (as jetty 9 checks for double definition).
11 years ago
reger
1432a817dd
respect "index media" switched off in CrawlStartExpert.html
...
fix http://mantis.tokeek.de/view.php?id=64
11 years ago
orbiter
39e1913585
next development step: migration to java 1.7
...
This includes also a small code change to test generic type inference, a
java 1.7 feature
11 years ago
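The "generic type inference" feature mentioned in 39e1913585 is presumably the Java 7 diamond operator; the commit's actual test change is not shown in the log. A small sketch with hypothetical names:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of Java 7 diamond inference. */
final class DiamondSketch {

    /** Builds a one-entry map; the type arguments on the right-hand side are inferred. */
    static Map<String, List<Integer>> sample() {
        // Java 6 required: new HashMap<String, List<Integer>>()
        final Map<String, List<Integer>> index = new HashMap<>();
        final List<Integer> positions = new ArrayList<>();
        positions.add(42);
        index.put("term", positions);
        return index;
    }
}
```

The diamond is used only on variable initializers here, which is the context where Java 7 inference reliably applies.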
Michael Peter Christen
4e734815e8
enhanced snippets: remove lines which are identical to the title and
...
choose longer versions if possible. Prefer the description part.
11 years ago
Michael Peter Christen
e84e07399a
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
89f76da24b
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
sixcooler
390f03e041
do not check for segments-count on optimize:
...
this is also done in Solr and our getSegmentsCount() does not return
up-to-date values
11 years ago
reger
8a7c68e4c7
content of surrogates/out never accessed (remove)
...
After import the content is never accessed but may take up a lot of disk space,
also the getLoadedOAIServer (which lists the files in surrogate out) is not used.
Making the surrogate.out obsolete. Removed keeping of xmls after import.
11 years ago