reger
8e233e2eb4
- fix typo in Message_p (defaultpath)
...
- use more existing switchboardconstants for getproperties
- replace deprecated call defaultservlet
11 years ago
orbiter
d7d38f9135
made number of open files in crawler configurable and increased default
...
maximum number of open files from 100 to 1000. This number can be
changed with the attribute crawler.onDemandLimit
11 years ago
Michael Peter Christen
8ad41a882c
fixed several problems with postprocessing:
...
- unique-postprocessing was destroying results from other
postprocessings; removed cross-updates as they were not necessary
- unique-postprocessing did not restrict on same protocol
- inefficient concurrent update cache was redesigned completely
- increased limits for concurrent blocking queues to prevent early
time-out
11 years ago
reger
ca5437dd50
fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149
...
local files can be crawled (intranet mode); url parsing fixed according to RFC 1738 (for unix and windows)
for win like file:///c:/tmp or file://localhost/c:/tmp
for linux like file:///tmp or file://localhost/tmp
Host is ignored and path must be absolute
11 years ago
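Illustration for commit ca5437dd50 above: a minimal sketch of the described rule (host is ignored, path must be absolute, Windows drive letters are handled). It uses `java.net.URI` from the JDK. The class and method names are hypothetical and are not taken from the YaCy code.

```java
import java.net.URI;

final class FileUrlDemo {

    // Hypothetical helper: returns the local path of a file URL.
    // The host part (empty or "localhost") is ignored, as the commit describes.
    static String localPath(String url) {
        URI uri = URI.create(url);
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("not a file URL: " + url);
        }
        String path = uri.getPath();
        if (path == null || !path.startsWith("/")) {
            throw new IllegalArgumentException("path must be absolute: " + url);
        }
        // Windows form "/c:/tmp": drop the leading slash in front of the drive letter.
        if (path.length() >= 3 && Character.isLetter(path.charAt(1)) && path.charAt(2) == ':') {
            return path.substring(1);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(localPath("file:///c:/tmp"));
        System.out.println(localPath("file://localhost/c:/tmp"));
        System.out.println(localPath("file:///tmp"));
        System.out.println(localPath("file://localhost/tmp"));
    }
}
```

Both Windows forms resolve to `c:/tmp` and both Linux forms to `/tmp` in this sketch.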
Michael Peter Christen
ff5b3ac84d
added new fields http_unique_b and www_unique_b which can be used for
...
ranking to prefer urls containing a www subdomain or using the https
protocol
11 years ago
sixcooler
5b1c4ef191
Monitoring and limit connection-count for Jetty
11 years ago
Michael Peter Christen
f0db501630
better handling of ranking parameters and new default values for date
...
navigation which is done using ranking in solr.
11 years ago
Michael Peter Christen
53948da7d0
tried to make last_modified recognition smarter
11 years ago
Michael Peter Christen
2d03037965
'Last-Modified', not 'Last-modified' according to
...
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
11 years ago
Michael Peter Christen
3dc5fb0050
fix for operator precedence bug (cast binds stronger than bitwise AND)
...
in peer hash hashing. This should not change anything if java casts long
to int by masking with 0xFFFFFFFFL but you never know. The important
thing is, that the hashCode() should not return numbers that have the
same order as the hash code order because hashing of seeds is used to
remove the order in some places.
11 years ago
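Illustration for commit 3dc5fb0050 above: in Java a cast binds more tightly than bitwise AND, so `(int) value & mask` means `((int) value) & mask`. The sketch below uses made-up values, not the peer hash code itself. It shows that the two forms differ in type and intermediate value but agree once narrowed to `int`, which matches the commit's remark that the fix should not change anything.

```java
final class CastPrecedenceDemo {

    // Cast first: value is narrowed to int, then widened back to long
    // (with sign extension) for the AND. The result type is long.
    static long castThenMask(long value, long mask) {
        return (int) value & mask;
    }

    // Mask first: the AND is done on long values, then the result is narrowed to int.
    static int maskThenCast(long value, long mask) {
        return (int) (value & mask);
    }

    public static void main(String[] args) {
        long value = 0x1_8000_0001L; // example value with bit 31 and bit 32 set
        long mask = 0xFFFFFFFFL;
        System.out.println(castThenMask(value, mask));  // 2147483649
        System.out.println(maskThenCast(value, mask));  // -2147483647
        // After narrowing to int both forms carry the same low 32 bits.
        System.out.println((int) castThenMask(value, mask) == maskThenCast(value, mask)); // true
    }
}
```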
Michael Peter Christen
6634b5b737
debug code for index distribution testing
11 years ago
orbiter
49e344e8d9
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
orbiter
7705e36703
fix for latest generic warning fix
11 years ago
sixcooler
10326892a8
avoid errors from ConnectHandler, correction for #6d16fa9
11 years ago
orbiter
97983ba89f
fixed generics warnings for generic array instantiation that appeared
...
after migration to Java 7
11 years ago
sixcooler
830057d788
lower Segment-size (hope to get Segments of 10GB)
...
see:
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5216&p=30036#p30034
11 years ago
orbiter
c028ae9b09
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
reger
e31493e139
"Use remote proxy for yacy" has no function, remove option and related config item
...
see/fix bug http://mantis.tokeek.de/view.php?id=23
http://mantis.tokeek.de/view.php?id=189
11 years ago
orbiter
181784a5cb
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
reger
0587077d06
cleanup obsolete and not used serverswitch Authentify code
...
as auth is mostly delegated to Jetty container.
11 years ago
orbiter
c9f66be20b
move unnecessary nested else out of condition
11 years ago
orbiter
0d8072aa99
removed warnings
11 years ago
orbiter
88f4af90da
removed warnings
11 years ago
orbiter
0f425e01ca
another circle computation enhancement
11 years ago
reger
a8d162810c
Exclude = from percent-encoding in MultiProtocolURL
...
fix http://mantis.tokeek.de/view.php?id=185 and http://mantis.tokeek.de/view.php?id=280
11 years ago
reger
024f8e9b33
fix truncated urls containing ","
...
addressing http://mantis.tokeek.de/view.php?id=58
Exclude comma from percent-encoding in MultiProtocolURL (see RFC 1738 2.2 and RFC 3986 2.2)
11 years ago
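Illustration for commits a8d162810c and 024f8e9b33 above: a minimal sketch of a percent-encoder that leaves `=` and `,` untouched. The set of characters kept unescaped is an assumption made for this example. It is not the actual rule set of MultiProtocolURL.

```java
import java.nio.charset.StandardCharsets;

final class PathEscapeSketch {

    // Assumption: besides ASCII letters and digits, these characters stay unescaped.
    // '=' and ',' are included, following the two commits above.
    private static final String KEEP = "-._~/=,";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (alnum || KEEP.indexOf(c) >= 0) {
                sb.append((char) c);
            } else {
                // Everything else is written as %XX per UTF-8 byte.
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("/a,b=c d")); // /a,b=c%20d
    }
}
```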
Michael Peter Christen
9112f0a2df
enhanced circle tool initialization
11 years ago
Michael Peter Christen
a1ac4c3b76
automatically clear graphics cache
11 years ago
Michael Peter Christen
505f58c79c
enhanced circle computation time and memory footprint
11 years ago
reger
cd8c0dbda9
assign serialVersionUID for proxyservlet, too.
11 years ago
reger
b300d7f4ce
set serialVersionUID on urlproxyservlet to skip compiler warning
...
- remove commented out code
11 years ago
reger
e9060d31bd
update to Jetty 9
...
besides adjustments in code it makes the servlet settings in web.xml significant.
This applies to solr, gsa and proxy servlet. There is no longer a default setup in code during init (as jetty 9 checks for double definition).
11 years ago
reger
1432a817dd
respect "index media" switched off in CrawlStartExpert.html
...
fix http://mantis.tokeek.de/view.php?id=64
11 years ago
orbiter
39e1913585
next development step: migration to java 1.7
...
This includes also a small code change to test generic type inference, a
java 1.7 feature
11 years ago
Michael Peter Christen
4e734815e8
enhanced snippets: remove lines which are identical to the title and
...
choose longer versions if possible. Prefer the description part.
11 years ago
Michael Peter Christen
e84e07399a
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
89f76da24b
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
sixcooler
390f03e041
do not check for segments-count on optimize:
...
this is also done in Solr and our getSegmentsCount() does not return
up-to-date values
11 years ago
reger
8a7c68e4c7
content of surrogates/out never accessed (remove)
...
After import the content is never accessed but may take up a lot of disk space,
also the getLoadedOAIServer (which lists the files in surrogate out) is not used.
Making the surrogate.out obsolete. Removed keeping of xmls after import.
11 years ago
sixcooler
b8cee9b7d8
remove tables from tabletracker on close to avoid lots of dead entries in
...
/PerformanceMemory_p.html
11 years ago
reger
1600414450
fix NPE on continuing crawls after YaCy restart
...
(Agent is then null)
11 years ago
Michael Peter Christen
229f2248b8
added configuration option for maximum load and minimum ram for
...
postprocessing
11 years ago
orbiter
f15c832587
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
Marc Nause
c97da1a0d8
First draft of a blacklist API.
11 years ago
Michael Peter Christen
d4f65833a1
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
c1c1be8f02
fix for slow crawling and better logging in balancer
11 years ago
Michael Peter Christen
3acf416335
npe fix
11 years ago
reger
2eb7682772
add html5 audio/video <source> tag to html content scraper
...
- <source src=.. type=..> tag content is added to embed collection
11 years ago
reger
0b6db04e40
fix contentscraper img height/width parsing
...
prevent numberformat exception on common "100px" property
- include in test case
11 years ago
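Illustration for commit 0b6db04e40 above: `Integer.parseInt("100px")` throws a NumberFormatException, so a tolerant parser has to read only the leading digits. The sketch below is one possible approach with hypothetical names. Returning -1 for "no usable number" is an assumption of this example, not necessarily what the content scraper does.

```java
final class DimensionParser {

    // Hypothetical helper: parses the leading digits of an img width/height value.
    // Returns -1 if the value does not start with a digit or does not fit into an int.
    static int parseDimension(String value) {
        if (value == null) return -1;
        String s = value.trim();
        int end = 0;
        while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') end++;
        if (end == 0) return -1;
        try {
            return Integer.parseInt(s.substring(0, end));
        } catch (NumberFormatException e) {
            return -1; // digit run too long for an int
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDimension("100px")); // 100
        System.out.println(parseDimension("auto"));  // -1
    }
}
```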
reger
ffc5b75c73
optimize and fix lat / lon assignment
11 years ago
reger
9313447de2
reimplement tighter lat/lon calc in URIMetadataNode
...
from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272
11 years ago
reger
d812f80784
add exit proxy link to UrlProxy
...
on proxied pages a link to exit proxy is added to top of page.
Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.
11 years ago
reger
78d08998db
throw MalformedURLException on unknown protocol
...
on other than the supported http https ftp file smb \\ mailto
11 years ago
reger
bb8181b2be
fix: resolve url without path but searchpart
...
e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test" now host="yacy.net" path="/"
fixes http://mantis.tokeek.de/view.php?id=47
added test case for getHost
11 years ago
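Illustration for commit bb8181b2be above: the host must end at the first `/`, `?` or `#` after the scheme, and a missing path defaults to `/`. This is a simplified sketch with hypothetical names. It ignores ports and user info and is not the MultiProtocolURL implementation.

```java
final class HostSplitDemo {

    // Returns { host, path } for a URL of the form scheme://host[/path][?query][#fragment].
    static String[] hostAndPath(String url) {
        int start = url.indexOf("://");
        if (start < 0) throw new IllegalArgumentException("no scheme separator: " + url);
        start += 3;
        // The host ends at the first '/', '?' or '#', not only at '/'.
        int hostEnd = start;
        while (hostEnd < url.length() && "/?#".indexOf(url.charAt(hostEnd)) < 0) hostEnd++;
        // The path runs up to the searchpart or fragment.
        int pathEnd = hostEnd;
        while (pathEnd < url.length() && "?#".indexOf(url.charAt(pathEnd)) < 0) pathEnd++;
        String host = url.substring(start, hostEnd);
        String path = url.substring(hostEnd, pathEnd);
        if (path.isEmpty()) path = "/";
        return new String[] { host, path };
    }

    public static void main(String[] args) {
        String[] parts = hostAndPath("http://yacy.net?q=test");
        System.out.println(parts[0] + " " + parts[1]); // yacy.net /
    }
}
```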
orbiter
a3542f29b4
npe fix
11 years ago
orbiter
c48d2a2a02
npe fix
11 years ago
reger
121d25be38
recover sax fatal error on OAI-PMH import of xml with entity error
...
this allows loading to continue with the next resumptionToken even if the import file caused a sax parser error
fix http://mantis.tokeek.de/view.php?id=63
11 years ago
reger
81dc2aa536
add current css to HTMLResponseWriter to fix metadata view
...
(using css from metas.template except js links)
11 years ago
orbiter
2fd8a0ead6
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
orbiter
8e5ce7cd51
fixed a situation where finished crawls had not been detected.
11 years ago
orbiter
2f63bd0261
enhanced Host Balancer strategy: fair round robin
11 years ago
orbiter
0c88a32c36
do not apply lazy value instantiation for numeric or boolean values
...
because that is misleading and confusing in case of 0- or false-values
and may cause NPEs in retrieval functions.
11 years ago
orbiter
8e04030596
in case of short memory, do not cut down robinson peers to 1, just
...
reduce by 50%
11 years ago
reger
86f6975edc
exclude html tags in in/outboundlinks_anchortext_txt parsed text
...
- some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags,
remove all tags for text property (inline img tags are still parsed)
- added test case for above (to htmlParserTest)
- fix solr test case
11 years ago
orbiter
ccb1864d55
catch IllegalArgumentException for wrong process types (that is needed
...
for migrations when new process types are introduced or disappear)
11 years ago
orbiter
4ee4ba1576
fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of
...
lazy value instantiation of 0-value in crawldepth_i
11 years ago
orbiter
12ba890205
removed warnings
11 years ago
reger
d51f9cc863
add custom Jetty errorhandler
...
to provide custom error page footer line
- remove redundant mime check in UrlProxyServlet
11 years ago
reger
c193a02023
defer creation of new ArrayList after possible early return
...
(to skip not used object allocation)
11 years ago
reger
727dfb5875
refactor URIMetadataNode to further unify interaction with index
...
- URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
11 years ago
reger
79e7947442
- remove empty http0_9 status text array
...
and unused default_charset = ISO-8859-1
11 years ago
reger
2dabe2009d
- remove unused manual http KeepAlive config
...
(reducing references to obsolete httpdemon)
- add port info to settings_http
11 years ago
Michael Peter Christen
5746aae3db
add canonical links to the same crawldepth, not the next crawldepth
11 years ago
Michael Peter Christen
74ab5ef9fa
increased runtime for postprocessing query job
11 years ago
Michael Peter Christen
8b32dd5f9e
special strategy for balancer: do not remove targets with zero wait time
...
from the queue
11 years ago
Michael Peter Christen
9c6228d948
fix for deadlocks in crawler
11 years ago
Michael Peter Christen
10cf8215bd
added crawl depth for failed documents
11 years ago
Michael Peter Christen
7fefebaeca
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
c2f62e783f
- better subgraph handling, less overhead for crawls without the
...
webgraph
- usage of crawler crawldepth cache for the linkgraph target depth
computation
11 years ago
Michael Peter Christen
06afb568e2
new Strategies in Balancer:
...
- doublecheck cache now records the crawl depth as well
- doublecheck cache is available from the outside (made static)
- no more need to crawl hosts with lowest depth first, instead all hosts
which have only singleton entries are preferred to reduce the number of
files.
11 years ago
Michael Peter Christen
1aea01fe5b
fix for Table in case that requested file does not exist and paths also
...
do not exist
11 years ago
reger
710054bb37
implement gzip input handling directly in defaultservlet
...
(making reference to legacy httpdemon obsolete)
11 years ago
Michael Peter Christen
9a5ab4e2c1
removed clickdepth_i field and related postprocessing. This information
...
is now available in the crawldepth_i field which is identical to
clickdepth_i because of a specific crawler strategy.
11 years ago
Michael Peter Christen
da86f150ab
- added a new Crawler Balancer: HostBalancer and HostQueues:
...
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such an on-demand file reader shall prevent the number of file
pointers from exceeding the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
11 years ago
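Illustration for commit da86f150ab above: a minimal in-memory sketch of the queue layout described there, with one entry per host, one FIFO per crawl depth inside it, and the lowest depth served first. Rotating the hosts round robin is an assumption of this sketch. The real HostBalancer stores its queues in files under the QUEUES folder, and all names below are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class HostQueuesSketch {

    // host -> (crawl depth -> FIFO of urls); the TreeMap keeps depths sorted ascending.
    private final Map<String, TreeMap<Integer, Deque<String>>> queues = new LinkedHashMap<>();

    void push(String host, int depth, String url) {
        TreeMap<Integer, Deque<String>> byDepth = queues.get(host);
        if (byDepth == null) {
            byDepth = new TreeMap<>();
            queues.put(host, byDepth);
        }
        Deque<String> queue = byDepth.get(depth);
        if (queue == null) {
            queue = new ArrayDeque<>();
            byDepth.put(depth, queue);
        }
        queue.addLast(url);
    }

    // Takes the next host in rotation and returns its url with minimal crawl depth,
    // or null if nothing is queued.
    String pop() {
        Iterator<Map.Entry<String, TreeMap<Integer, Deque<String>>>> it = queues.entrySet().iterator();
        if (!it.hasNext()) return null;
        Map.Entry<String, TreeMap<Integer, Deque<String>>> entry = it.next();
        String host = entry.getKey();
        TreeMap<Integer, Deque<String>> byDepth = entry.getValue();
        it.remove();
        Map.Entry<Integer, Deque<String>> lowest = byDepth.firstEntry();
        String url = lowest.getValue().pollFirst();
        if (lowest.getValue().isEmpty()) byDepth.remove(lowest.getKey());
        // Re-append the host at the end so every host gets its turn.
        if (!byDepth.isEmpty()) queues.put(host, byDepth);
        return url;
    }

    public static void main(String[] args) {
        HostQueuesSketch balancer = new HostQueuesSketch();
        balancer.push("a.example", 1, "http://a.example/deep");
        balancer.push("a.example", 0, "http://a.example/");
        balancer.push("b.example", 0, "http://b.example/");
        for (String url = balancer.pop(); url != null; url = balancer.pop()) {
            System.out.println(url);
        }
    }
}
```

The example prints the start page of a.example, then the one of b.example, and only then the depth 1 url of a.example.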
Michael Peter Christen
075b6f9278
refactoring of the crawl balancer: the balancer is turned into an
...
interface and the old balancer class is moved into LegacyBalancer to
make room for a fresh implementation of a crawl balancer.
11 years ago
Michael Peter Christen
8470dfe3f8
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
reger
46016fa153
autoupdate fails to download latest release (1.71) due to default release blacklist
...
- removed the default version blacklist regex from init (for future versions)
!!! left existing update blacklist setting untouched !!!
(existing installations wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html)
- moved old blacklist patch to migration.java
11 years ago
Michael Peter Christen
8aeef73d49
fix for virtual root nodes
11 years ago
Michael Peter Christen
7c7fbb9818
find depth-matches also for edge targets
11 years ago
Michael Peter Christen
dd12dd392f
introduction of a data structure for HyperlinkEdges which should use
...
less memory as it does no double-storage of source links for each edge
of the graph.
11 years ago
Michael Peter Christen
6ea8bb7348
using MultiProtocolURL for edge data which is faster (hash computation
...
is now much easier) and smaller in size
11 years ago
Michael Peter Christen
b21c208b4d
enhanced hashcode computation for MultiProtocolURL
11 years ago
Michael Peter Christen
ce1d1b2fa0
fix for maximum tag length in parser
11 years ago
Michael Peter Christen
17e0956312
refactoring of SystemLoad calls (only one backend tool)
11 years ago
Michael Peter Christen
a37d067692
refactoring
11 years ago
orbiter
95780eed32
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
Michael Peter Christen
67beef657f
strong redesign of html parser: object recursion is now made using a
...
stack on html tag objects, not using a recursive parse-again method
which may cause bad performance and huge memory allocation. The new
method also produces better parsed image objects with exact anchor text
references.
11 years ago
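Illustration for commit 67beef657f above: a toy sketch of the general idea of tracking open tags on an explicit stack instead of re-parsing nested content recursively. The regex tokenizer and all names are assumptions made for this example. It handles only simple, well-nested markup and is in no way the YaCy parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StackTagParserSketch {

    // Group 1: "/" for a closing tag, group 2: tag name, group 3: plain text.
    private static final Pattern TOKEN = Pattern.compile("<(/?)(\\w+)[^>]*>|([^<]+)");

    private static final class OpenTag {
        final String name;
        final StringBuilder text = new StringBuilder();
        OpenTag(String name) { this.name = name; }
    }

    // Returns "name=text" for every tag in the order in which the tags are closed.
    static List<String> parse(String html) {
        List<String> closed = new ArrayList<>();
        Deque<OpenTag> stack = new ArrayDeque<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {
                // Text belongs to the innermost open tag.
                if (!stack.isEmpty()) stack.peek().text.append(m.group(3));
            } else if (m.group(1).isEmpty()) {
                stack.push(new OpenTag(m.group(2)));
            } else if (!stack.isEmpty() && stack.peek().name.equals(m.group(2))) {
                OpenTag done = stack.pop();
                closed.add(done.name + "=" + done.text);
                // Hand the collected text up to the enclosing tag, no second parse needed.
                if (!stack.isEmpty()) stack.peek().text.append(done.text);
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        System.out.println(parse("<a href=\"x\">link <b>bold</b></a>")); // [b=bold, a=link bold]
    }
}
```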
Michael Peter Christen
6bd8c6f195
fix for wrong status codes of error pages
11 years ago
Michael Peter Christen
9e503b3376
also delete the robots.txt file from the cache when a new crawl is
...
started
11 years ago
orbiter
67501c9dda
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago