reger
28b8bc290a
fix use of NETWORK_SEARCHVERIFY for rwi verification
...
was not used to set the searchevent parameter (done in SearchEventCache.getEvent)
- remove unused corresponding QueryParams.filterfailurls param.
9 years ago
reger
020630efd8
remove unused network scanner parameter from queryparameter
...
Search event is not using networkscanner
(removed filterscannerfail param always init to false)
9 years ago
luc
ad5586f8f6
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
luc
8ebefa4233
Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
...
failing. Looks like it was broken since commit b43811d38c
9 years ago
luc
7736ee5a42
Updated MediawikiImporter main() : display usage in console and stop
...
properly without calling System.exit
9 years ago
reger
cdb8f3b10d
make current ranking score value avail. to search interface / api
...
Update the result score result field with the result queue ranking value to reflect
the actual calculated/used score,
for rwi & solr stack results.
(calc. etc. is unchanged, it's just that result entry carries the latest val
as api retrieves the number from it)
9 years ago
luc
27d11f8671
Fixed isSolrDump function : PushBackInputStream was not unread when
...
returning false (for example with a WikiMedia dump).
9 years ago
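A minimal sketch of the pattern behind commit 27d11f8671: when a format check peeks at the head of a PushbackInputStream, the peeked bytes have to be pushed back on every return path, including the "false" one, so the next parser sees the stream from its start. The class name, the marker string and the helper `remainingAfterSniff` are assumptions for illustration only, not YaCy's actual isSolrDump code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DumpSniffer {
    /** Hypothetical marker; the real check in YaCy may look for something else. */
    private static final byte[] MARKER = "<response>".getBytes(StandardCharsets.US_ASCII);

    /**
     * Peeks at the stream head and pushes the peeked bytes back in all cases.
     * The stream must have a pushback buffer of at least MARKER.length bytes.
     */
    static boolean startsWithMarker(PushbackInputStream in) throws IOException {
        byte[] head = new byte[MARKER.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) break; // stream shorter than the marker
            read += n;
        }
        boolean match = read == MARKER.length && Arrays.equals(head, MARKER);
        if (read > 0) in.unread(head, 0, read); // unread on true AND on false
        return match;
    }

    /** In-memory convenience: returns what is still readable after sniffing. */
    static byte[] remainingAfterSniff(byte[] data) {
        try (PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), MARKER.length)) {
            startsWithMarker(in);
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a MediaWiki-style input the check returns false, yet the full content is still available to the next reader.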
Michael Peter Christen
135a123a77
less logging in new language detection
9 years ago
Michael Peter Christen
ef8cd80593
fix for npe
9 years ago
reger
0347bfa71f
Apply collection query constraint/modifier to rwi result stack.
...
Collection is not available in pure rwi entries (only in local solr metadata).
But if the user wishes to filter by a query constraint, rwi results shall adhere to it as well
(even if only rwi entries with parsed or solr-received metadata may fit)
9 years ago
luc
2a67d2ba6f
Corrected error management for unsupported image formats, parsing
...
errors, and unavailable resources : avoid logging too many exceptions as
these errors easily occur when searching images.
9 years ago
Michael Peter Christen
d6e9834040
Merge branch 'master' of
...
https://github.com/Scarfmonster/yacy_search_server
# Conflicts:
# .classpath
# build.xml
9 years ago
Michael Peter Christen
d82d311995
Merge branch 'master' of https://github.com/luccioman/yacy_search_server
...
# Conflicts:
# .classpath
9 years ago
reger
b5371ea8c1
read/init crawl queue in a thread
...
to speed-up YaCy start on large existing crawler queues
9 years ago
reger
1160b13172
remove unused md5 from ViewFile servlet params
9 years ago
reger
e163ea88f6
fix vsdParser (Visio) parser return statement
...
(unnecessary throw in finally block)
9 years ago
reger
b2c8bc0ae6
remove md5_s from default index fields
...
it is not assigned a value / not used
Due to above also excluded from transfer protocol.
9 years ago
luc
e40ae0943b
- No max dimensions specified : render raw image data when source and
...
target image format are the same.
- Corrected scaling condition.
9 years ago
reger
90686a75a2
fix flux factor (additional crawl delay by access count) calculation
9 years ago
luc
4af27289e5
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
297fdb60d3
throw exception if crawler hostqueue can't create hostpath directory.
...
In rare cases the hostname may not be a valid filesystem directory name,
so the directory can't be created (e.g. a name containing a '*' char). Throwing a
MalformedURLException prevents the crawl queue from looping on this invalid entry.
9 years ago
luc
755efac17d
Use same max file size when loading all resource bytes or opening stream
...
content
9 years ago
luc
bc6c79fc12
Corrected scaling function for non RGB images.
9 years ago
luc
1565559df8
Refactoring : extracted write InputStream method.
9 years ago
luc
f0478bb14d
BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
...
imageio-bmp-3.2 library.
- better BMP format flavours support
- handle PNG encoded icons
- handle transparency
Added some javadoc url references to .classpath
9 years ago
luc
07437986e7
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
97cc03ef6a
start using a template for urlproxy header
...
It is included as iframe /proxmsg/urlproxyheader.html
to allow full servlet functionality and flexibility to display some
index/meta data in future.
9 years ago
luc
f01d49c37a
Process large or local file images dealing directly with content
...
InputStream.
9 years ago
luc
3c4c77099d
If available, check content length before downloading. Check also
...
content length is not over Integer.MAX_VALUE.
9 years ago
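The check described in commit 3c4c77099d can be sketched as a small guard that runs before any bytes are fetched. The class name, method name and the `maxBytes` parameter are hypothetical; an unknown length is assumed to be reported as -1, as common HTTP client APIs do.

```java
final class ContentLengthGuard {
    /**
     * Decides from the announced content length whether a full download
     * into memory should be attempted.
     *
     * @param contentLength announced length in bytes, or -1 when unknown
     * @param maxBytes      configured upper bound for a single resource
     */
    static boolean mayDownload(long contentLength, long maxBytes) {
        if (contentLength < 0) {
            return true; // unknown: the limit must then be enforced while streaming
        }
        if (contentLength > Integer.MAX_VALUE) {
            return false; // cannot be held in a single byte[]
        }
        return contentLength <= maxBytes;
    }
}
```

Rejecting early avoids opening a transfer that would be aborted anyway once the size limit is hit.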
luc
5bbb2e1730
Ensure resource is closed when reading a full file InputStream
9 years ago
luc
6291a57300
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
0d3c5b223e
have psParser cleanup temp file
9 years ago
reger
7d0d19cb8e
avoid File.deleteOnExit() on temp files
...
The JVM registers each file in a list, even if it has already been deleted, and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
9 years ago
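The approach from commit 7d0d19cb8e can be illustrated roughly as follows: temp files are created inside one dedicated subdirectory and that directory is removed as a whole, instead of registering every file with File.deleteOnExit(). This is a sketch under assumptions: `TempArea` and its method names are made up, and it passes the directory explicitly rather than changing the java.io.tmpdir property, because the JDK may cache that property on first use.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

final class TempArea {
    private final Path dir;

    /** Creates (if needed) a subdirectory below the JVM temp directory. */
    TempArea(String subdirName) {
        try {
            this.dir = Files.createDirectories(
                    Paths.get(System.getProperty("java.io.tmpdir"), subdirName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp file in the subdirectory, without deleteOnExit(). */
    File newTempFile(String prefix, String suffix) {
        try {
            return File.createTempFile(prefix, suffix, dir.toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the subdirectory and everything in it (children before parents). */
    void deleteAll() {
        if (!Files.exists(dir)) return;
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On shutdown the cleanup could be wired up with `Runtime.getRuntime().addShutdownHook(new Thread(area::deleteAll))`, so only one hook is registered no matter how many temp files a long crawl creates.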
luc
bfe51001e3
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
02e4489a23
set tmpfile.deleteOnExit by default,
...
to make sure files are removed on shutdown.
9 years ago
reger
2985baaa01
Exclude repetitive protocol part in tokenized url
...
used as description if none is avail. from parser.
9 years ago
reger
ca3d26a401
harmonize wordsintitle & CollectionSchema.title_words_val calculation,
...
remove obsolete partial init of wordreference from urimetadata
9 years ago
reger
52a9040ae6
Sort out double keywords (dc_subject) early in parsed documents
...
- by direct using Set vs. List
- remove unneeded String[] getter
9 years ago
luc
49331dc523
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
47d70732f6
improve locale translator
...
- skip empty line
- more robust file section detection (space independent)
9 years ago
sixcooler
646afe9183
do not store subfield *_coordinate + make all num-fields being docvalues
9 years ago
sixcooler
194df613de
not using 'location' as defaultfacetfield - since we removed it being
...
default.
9 years ago
sixcooler
d3b9349b6f
simplification / speedup of GenerationMemoryStrategy
9 years ago
sixcooler
4a905ec134
fix to not let the AccessTracker-Log grow too much, but have enough data
...
to monitor.
(+gitignore-correction)
9 years ago
reger
20e18d79f8
harmonize document title for archive parsers
9 years ago
luc
f11b5e8309
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
112ae013f4
update bzip and bzip parser process,
...
to return one document for the file with the combined parser results of the
contained file, and register it with the supplied url and mime of the archive.
9 years ago
reger
e76a90837b
update zip and tar parser process,
...
to return one document for the file with the combined parser results of the
contained files.
9 years ago
luc
4e673ffc9a
Ensure closing of InputStream even when an exception occurs.
9 years ago
luc
10696b53f7
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
8532565c7d
optimize order of parsers to try
...
- start with a parser matching the remote supplied mime
9 years ago
reger
681889ae64
use current tar library for untar files
...
- remove old source copy
9 years ago
reger
5d71fc70e3
fix tarParser early exit on looping content
...
- adjust check of data available according to doc
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
9 years ago
luc
bcc2e7cb5b
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
2fcf6f104c
fix bzipParser recognition
...
- Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input)
- try to supply fitting mime for parsing bz2 content
9 years ago
luc
745e97a575
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
a60b1fb6c2
differentiate api call getLocalPort() from getConfigInt()
9 years ago
reger
11f3666660
increase use of predefined CATCHALL_QUERY string
9 years ago
reger
a58ee49307
Optimize internal imagequery focus on using content_type to select images
...
(in favor of url file extension)
9 years ago
luc
fc3294382e
Updated javadocs for warning on target encoding format potential errors.
9 years ago
luc
aa70ff4ff6
Corrected images alpha channel rendering
9 years ago
reger
d223cf0ae4
adjust MediaWiki importer geo coordinate calculation
...
- allow lat/long 0.xxx
- south / west assignment
include test class
9 years ago
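The two points of commit d223cf0ae4 (values with a 0 degree part, and south/west assignment) can be shown with a small conversion from degrees, minutes, seconds plus hemisphere letter to signed decimal degrees. This is a sketch with hypothetical names, not the MediaWiki importer's actual code.

```java
final class GeoCoordinate {
    /**
     * Converts degrees/minutes/seconds to decimal degrees.
     * South and west become negative; the sign is applied to the whole
     * value, so coordinates such as 0.xxx keep their hemisphere.
     */
    static double toDecimal(int degrees, int minutes, double seconds, char hemisphere) {
        double value = degrees + minutes / 60.0 + seconds / 3600.0;
        char h = Character.toUpperCase(hemisphere);
        return (h == 'S' || h == 'W') ? -value : value;
    }
}
```

Applying the sign after summing matters: negating only the degree part would lose the hemisphere whenever the degree part is 0.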
reger
2b775d5be6
fix typo in WikiCode coordinate calculation
9 years ago
reger
bbe9df2bb3
fix MediawikiImporter for bz2 dump
...
skip reading the bz2 file magic bytes to identify the bz2 format, as an inputstream reset would be required. Commons Compress reads and checks the magic bytes internally and throws an IOException if they are wrong, making the pre-read obsolete.
9 years ago
reger
c6687dd560
fix a system.out to log.fine
...
in bmpParser
9 years ago
reger
e53c6bbd51
fix init of peer flags
...
(remove hiding of ssl flag)
9 years ago
Michael Peter Christen
ac034db8bc
Merge branch 'master' of https://github.com/luccioman/yacy_search_server
...
# Conflicts:
# htroot/js/highslide/highslide.js
# source/net/yacy/document/ImageParser.java
9 years ago
reger
826f14f37f
fix unnecessary set null of peer flags, causing reread
...
remove obsolete version flags
9 years ago
luc
5902ce032e
Corrected NullPointerException case when ImageIO reader is not found for
...
image format.
9 years ago
reger
c6495a5b62
add a log entry on parsing ajax crawling scheme snapshot
...
(prev. commit 9252e36aeb)
9 years ago
reger
9252e36aeb
implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
...
see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/
The implementation improves parsing of the homepage (ajax page) that uses the metatag "fragment" in its header: it parses the supplied html snapshot instead of the mostly empty ajax/scripted page.
The implementation also supports hash-bang urls (url with anchor starting with ! like ...path#!hashfragment), but our crawler filters them
(use of hash-bang is controversially discussed and the proposal is deprecated, so it makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time).
In short, how it works:
- if metatag fragment with content "!" is found
- htmlparser tries to get content of the html snapshot (using a different url)
- htmlparser returns 2 documents (original url and snapshot content - but using same original url)
- after parsing, the result documents are joined (and stored to the index, also containing content from the snapshot page... as the original ajax page typically contains no parseable html content)
9 years ago
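The "different url" used for the html snapshot in commit 9252e36aeb follows the (deprecated) ajax crawling proposal: the part after `#!` is moved into an `_escaped_fragment_` query parameter, and a page that only carries the fragment metatag gets an empty parameter. The sketch below illustrates that mapping; the class name is hypothetical and the escaping is deliberately simplified to a few characters, so it is not a full implementation of the proposal nor YaCy's code.

```java
final class AjaxSnapshotUrl {
    /** Maps a page url (with or without "#!") to its snapshot url. */
    static String snapshotUrl(String url) {
        int hash = url.indexOf('#');
        String base = hash >= 0 ? url.substring(0, hash) : url;
        // only a hash-bang fragment is carried over; a plain anchor is dropped
        String fragment = (hash >= 0 && url.startsWith("#!", hash))
                ? url.substring(hash + 2) : "";
        char sep = base.indexOf('?') >= 0 ? '&' : '?';
        return base + sep + "_escaped_fragment_=" + escape(fragment);
    }

    /** Simplified escaping (assumption): only %, #, &, + and space. */
    private static String escape(String fragment) {
        StringBuilder sb = new StringBuilder(fragment.length());
        for (int i = 0; i < fragment.length(); i++) {
            char c = fragment.charAt(i);
            if (c == '%' || c == '#' || c == '&' || c == '+' || c == ' ') {
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For example, `http://example.com/path#!key=value` maps to `http://example.com/path?_escaped_fragment_=key=value`.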
Michael Peter Christen
d1ae999ef9
replaced HashMap with LinkedHashMap to preserve the object order
9 years ago
Michael Peter Christen
7d075a1d76
added log lines
9 years ago
Michael Peter Christen
092dac086e
Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
reger
7a64bebb86
init Recrawl job chunk size to max crawl loader during job start, to use some system preferences
...
and allow injection of recrawl urls before queue is empty
During recrawl the balancer often hangs on the very last urls, typically on hosts with a huge delay time;
allowing injection earlier makes progress more balanced. The max number of crawl urls injected by the recrawl job is 2 * max loader.
9 years ago
luc
d6522fa4a2
Integrated haraldk/TwelveMonkeys library to first add TIF image format
...
support.
9 years ago
Michael Peter Christen
9244694e64
Merge branch 'master' of git@github.com:yacy/yacy_search_server.git
9 years ago
Michael Peter Christen
151ccd50a9
fix for image size field values (must be multi-valued)
9 years ago
reger
c9937973e3
unescape MultiProtocolURL getAttributes() return values.
...
use getAttributes() to get query parameters as clear text (w/o url encoding)
use getSearchpartMap() to get in internal format (url encoded)
fix for http://mantis.tokeek.de/view.php?id=606
9 years ago
reger
78e8c6f3e5
refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
...
not used for genericImageParser
9 years ago
reger
d54c5d310a
do not automatically add links with image extension to image links.
...
With the widespread use of e.g. Wikimedia, links with an image file extension in the url often point to html.
9 years ago
reger
851e8f6c8a
check jpeg file signature in genericImageParser
...
to fail early without further object allocation if source is not a jpeg.
9 years ago
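An early signature test like the one named in commit 851e8f6c8a can be as small as comparing the first bytes: a JPEG stream starts with the SOI marker FF D8, followed by FF as the first byte of the next marker. The class and method names below are assumptions; how genericImageParser obtains the head bytes is not shown here.

```java
final class JpegSignature {
    /** Returns true if the given head bytes start like a JPEG stream. */
    static boolean looksLikeJpeg(byte[] head) {
        return head != null
                && head.length >= 3
                && (head[0] & 0xFF) == 0xFF   // SOI marker, first byte
                && (head[1] & 0xFF) == 0xD8   // SOI marker, second byte
                && (head[2] & 0xFF) == 0xFF;  // start of the next marker
    }
}
```

Since only three bytes are inspected, a non-JPEG source is rejected before any decoder or metadata objects are allocated.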
reger
fb75fea446
run recrawljob w/o sorting results by date
...
This is a workaround for existing index (not fully reindexed) since intro of schema with docvalues
to prevent solr exception causing recrawljob to fail with
org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.
9 years ago
reger
43c27aa550
upd to solr/lucene 5.3.1
9 years ago
reger
688f7b2a5c
allow/display svg images in image results previews
...
svg is not supported by awt but by most browsers. Image content is delivered as received (without size adjustment)
9 years ago
reger
d5330391de
remove some unused var allocation in parser
9 years ago
Michael Peter Christen
3d7dd9d3aa
follow-up to latest commit: also flush the search cache if all crawls
...
had been terminated.
9 years ago
Michael Peter Christen
c737ff235d
in case that the include_string contains several entries including
...
1-char tokens and also more-than-1-char tokens, then remove the 1-char
tokens to prevent that we are too strict. This will make it possible to
be a bit more fuzzy in the search where it is appropriate.
9 years ago
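The rule from commit c737ff235d reads as: drop 1-char tokens only when longer tokens are present as well; if all tokens are single characters, keep them. A minimal sketch of that rule follows, with hypothetical names and a plain list of strings standing in for the include_string entries.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenFilter {
    /**
     * Removes 1-char tokens if at least one longer token exists;
     * otherwise returns a copy of the input unchanged.
     */
    static List<String> relax(List<String> tokens) {
        boolean hasLong = false;
        for (String t : tokens) {
            if (t.length() > 1) {
                hasLong = true;
                break;
            }
        }
        if (!hasLong) {
            return new ArrayList<>(tokens); // only 1-char tokens: keep them all
        }
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() > 1) out.add(t);
        }
        return out;
    }
}
```

A query tokenized to `a`, `report`, `2015` is thus matched on `report` and `2015` only, while a query consisting solely of `a` and `b` stays searchable.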
Michael Peter Christen
8e555d79a3
add also 1-character tokens to the token list because that could be also
...
searched for. A full-string search for a filename may fail if those
1-char tokens are omitted
9 years ago
reger
7c82cd4415
add an end condition to svgParser for wrong content
...
(if parser chosen just by file extension)
9 years ago
reger
356d4d1301
remove rdfParser from init (current function identical with genericParser)
9 years ago
reger
c647d899e3
add svgParser to parse metadata from svg images
...
Reads document level included title and description and skips the graphic content to save bandwidth.
svg metadata element is not interpreted
- remove rdfParser from init (current function identical with genericParser)
9 years ago
reger
bad34804fe
optimize parseInt for <img> tag attribute parsing
...
Performs better than using Numberformat.parse or parseInt(substring())
9 years ago
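One way to avoid both NumberFormat and an intermediate substring, as commit bad34804fe aims for, is to accumulate leading digits directly from the attribute string. The sketch below is an assumption about the general technique, not the code from that commit; the class name, the fallback parameter and the handling of trailing text such as `px` are illustrative choices.

```java
final class LeadingInt {
    /**
     * Parses the leading decimal digits of s (after optional whitespace).
     * Returns fallback if there are no leading digits or the value overflows.
     * The input is expected to be non-null.
     */
    static int parse(String s, int fallback) {
        int n = s.length();
        int i = 0;
        while (i < n && Character.isWhitespace(s.charAt(i))) i++;
        int start = i;
        int value = 0;
        while (i < n) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') break; // stop at first non-digit, e.g. "px"
            int digit = c - '0';
            if (value > (Integer.MAX_VALUE - digit) / 10) return fallback; // overflow
            value = value * 10 + digit;
            i++;
        }
        return i == start ? fallback : value;
    }
}
```

No temporary strings or parser objects are created, which is what makes this kind of loop cheap when it runs for every `<img>` width and height attribute.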
Michael Peter Christen
6ebc2451a9
Merge pull request #14 from luccioman/master
...
Translator refactoring : no more regular expression processing
9 years ago
reger
2f51baff4f
check for loading error (includes unsupported formats)
...
to prevent blank thumbnails in image search caused by unhandled sources which don't load on click.
Now the cross icon indicates the problem (including unsupported formats)
9 years ago
luc
5578886f6f
Merge branch 'master' of https://github.com/luccioman/yacy_search_server.git
9 years ago
luc
c38d6c1f37
Correction for mantis 535: inurl: parameter doesn't work on URLs with
...
upper-case letters
9 years ago
reger
52e3eb4ce8
harmonize/correct assignment to Ymarkmeta.mime
...
replace use of deprecated
9 years ago
Michael Peter Christen
87f358058e
Fix for index entries which have id's not computed as hash from the url.
...
This makes it possible to operate with outside-computed url hashes in
enterprise environments not using the built-in crawler from YaCy.
9 years ago
reger
3f2b8ab5e5
optionally include mime in p2p url exchange string
...
if doctype decodes to ambiguous mime and default conversion is not equal to original
9 years ago