luc
f0478bb14d
BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
...
imageio-bmp-3.2 library.
- better BMP format flavours support
- handle PNG encoded icons
- handle transparency
Added some javadoc url references to .classpath
9 years ago
luc
07437986e7
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
97cc03ef6a
start using a template for urlproxy header
...
It is included as iframe /proxmsg/urlproxyheader.html
to allow full servlet functionality and flexibility to display some
index/meta data in future.
9 years ago
luc
f01d49c37a
Process large or local file images dealing directly with content
...
InputStream.
9 years ago
luc
3c4c77099d
If available, check content length before downloading. Check also
...
content length is not over Integer.MAX_VALUE.
9 years ago
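The content-length check from commit 3c4c77099d could look roughly like the sketch below. The class name, method name, and the `maxBytes` parameter are hypothetical and not taken from the YaCy sources; the sketch only shows the idea of rejecting a download up front when the declared length is too large or cannot fit into a Java array.

```java
final class ContentLengthSketch {

    /**
     * Hypothetical helper: decides whether a download may start,
     * based on the length declared by the server (-1 means unknown).
     */
    static boolean mayDownload(long declaredLength, long maxBytes) {
        if (declaredLength < 0) {
            // Length not available: nothing to check before downloading.
            return true;
        }
        if (declaredLength > Integer.MAX_VALUE) {
            // A byte[] cannot hold more than Integer.MAX_VALUE elements.
            return false;
        }
        return declaredLength <= maxBytes;
    }
}
```

When the length is unknown, a limit would still have to be enforced while reading the stream.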
luc
5bbb2e1730
Ensure resource is closed when reading a full file InputStream
9 years ago
luc
6291a57300
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
0d3c5b223e
have psParser cleanup temp file
9 years ago
reger
7d0d19cb8e
avoid File.deleteOnExit() on temp files
...
The JVM registers each file in a list, regardless of whether it has already
been deleted, and never cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
9 years ago
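The temp-file handling described in commit 7d0d19cb8e can be sketched as follows. This is a minimal illustration, not the YaCy code: the class and method names are hypothetical, and the sketch passes the subdirectory explicitly to `File.createTempFile` instead of relying on a changed `java.io.tmpdir` property, since the JVM may already have read that property by the time it is changed.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

final class TempDirSketch {

    /** Creates (if needed) a subdirectory of java.io.tmpdir and returns it. */
    static Path createTempSubdir(String name) {
        try {
            Path sub = Paths.get(System.getProperty("java.io.tmpdir"), name);
            Files.createDirectories(sub);
            return sub;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp file inside the given directory, without deleteOnExit(). */
    static File newTempFile(Path dir, String prefix, String suffix) {
        try {
            return File.createTempFile(prefix, suffix, dir.toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the directory and everything below it, children first. */
    static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would create the subdirectory once at startup and call `deleteRecursively` from its shutdown code, so that no per-file `deleteOnExit()` registration is needed.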
luc
bfe51001e3
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
02e4489a23
set tmpfile.deleteOnExit by default,
...
to make sure files are removed on shutdown.
9 years ago
reger
2985baaa01
Exclude repetitive protocol part in tokenized url
...
used as description if none is avail. from parser.
9 years ago
reger
ca3d26a401
harmonize wordsintitle & CollectionSchema.title_words_val calculation,
...
remove obsolete partial init of wordreference from urimetadata
9 years ago
reger
52a9040ae6
Sort out double keywords (dc_subject) early in parsed documents
...
- by direct using Set vs. List
- remove unneeded String[] getter
9 years ago
luc
49331dc523
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
47d70732f6
improve locale translator
...
- skip empty line
- more robust file section detection (space independent)
9 years ago
sixcooler
646afe9183
do not store subfield *_coordinate + make all num-fields being docvalues
9 years ago
sixcooler
194df613de
not using 'location' as defaultfacetfield - since we removed it being
...
default.
9 years ago
sixcooler
d3b9349b6f
simplification / speedup of GenerationMemoryStrategy
9 years ago
sixcooler
4a905ec134
fix to not let the AccessTracker-Log grow too much, but have enough data
...
to monitor.
(+gitignore-correction)
9 years ago
reger
20e18d79f8
harmonize document title for archive parsers
9 years ago
luc
f11b5e8309
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
112ae013f4
update bzip and bzip parser process,
...
to return one document for the file with combined parser results of the
containing file and registers it with supplied url and mime of the archive.
9 years ago
reger
e76a90837b
update zip and tar parser process,
...
to return one document for the file with combined parser results of the
containing files.
9 years ago
luc
4e673ffc9a
Ensure closing of InputStream even when an exception occurs.
9 years ago
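Commits 5bbb2e1730 and 4e673ffc9a both concern closing an InputStream reliably. One common way to get this in Java is try-with-resources, sketched below. The class and method names are hypothetical, and the actual commits may use an explicit finally block instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class StreamCloseSketch {

    /**
     * Reads the whole stream into memory. The stream is closed when the
     * try block exits, whether reading succeeds or throws.
     */
    static byte[] readFullyAndClose(InputStream source) {
        try (InputStream in = source) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }
}
```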
luc
10696b53f7
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
8532565c7d
optimize order of parsers to try
...
- start with a parser matching the remote supplied mime
9 years ago
reger
681889ae64
use current tar library for untar files
...
- remove old source copy
9 years ago
reger
5d71fc70e3
fix tarParser early exit on looping content
...
- adjust check of data available according to doc
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
9 years ago
luc
bcc2e7cb5b
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
2fcf6f104c
fix bzipParser recognition
...
- Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input)
- try to supply fitting mime for parsing bz2 content
9 years ago
luc
745e97a575
Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger
a60b1fb6c2
differentiate api call getLocalPort() from getConfigInt()
9 years ago
reger
11f3666660
increase use of predefined CATCHALL_QUERY string
9 years ago
reger
a58ee49307
Optimize internal imagequery focus on using content_type to select images
...
(in favor of url file extension)
9 years ago
luc
fc3294382e
Updated javadocs for warning on target encoding format potential errors.
9 years ago
luc
aa70ff4ff6
Corrected images alpha channel rendering
9 years ago
reger
d223cf0ae4
adjust MediaWiki importer geo coordinate calculation
...
- allow lat/long 0.xxx
- south / west assignment
include test class
9 years ago
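The two points in commit d223cf0ae4 (values such as 0.xxx and the south / west assignment) relate to converting a coordinate with a hemisphere letter into a signed decimal value. A minimal sketch of that conversion follows. It assumes degree/minute/second input and is not the MediaWiki importer code; the class and method names are hypothetical. The sign is applied to the summed value, so a coordinate with 0 degrees still ends up negative for south or west.

```java
final class GeoCoordSketch {

    /**
     * Converts degrees/minutes/seconds plus a hemisphere letter
     * (N, S, E or W) to signed decimal degrees.
     */
    static double toDecimal(double degrees, double minutes, double seconds, char hemisphere) {
        double value = degrees + minutes / 60.0 + seconds / 3600.0;
        char h = Character.toUpperCase(hemisphere);
        // South latitudes and west longitudes are negative.
        return (h == 'S' || h == 'W') ? -value : value;
    }
}
```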
reger
2b775d5be6
fix typo in WikiCode coordinate calculation
9 years ago
reger
bbe9df2bb3
fix MediawikiImporter for bz2 dump
...
skip reading the bz2 file magic bytes to identify the bz2 format, as an InputStream reset would be required. Commons Compress reads and checks the magic bytes internally and throws an IOException if they are wrong, making the pre-read obsolete.
9 years ago
reger
c6687dd560
fix a system.out to log.fine
...
in bmpParser
9 years ago
reger
e53c6bbd51
fix init of peer flags
...
(remove hiding of ssl flag)
9 years ago
Michael Peter Christen
ac034db8bc
Merge branch 'master' of https://github.com/luccioman/yacy_search_server
...
# Conflicts:
# htroot/js/highslide/highslide.js
# source/net/yacy/document/ImageParser.java
9 years ago
reger
826f14f37f
fix unnecessary set null of peer flags, causing reread
...
remove obsolete version flags
9 years ago
luc
5902ce032e
Corrected NullPointerException case when ImageIO reader is not found for
...
image format.
9 years ago
reger
c6495a5b62
add a log entry on parsing ajax crawling scheme snapshot
...
(prev. commit 9252e36aeb)
9 years ago
reger
9252e36aeb
implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
...
see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/
Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page.
Implementation supports also hash-bang urls (url with anchor starting with ! like ...path#!hashfragment) but our crawler filters it
(the use of hash-bangs is controversially discussed and the proposal is deprecated, so it makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time).
Quick - how does it work
- if metatag fragment with content "!" is found
- htmlparser tries to get content of html snapshot (using a different url)
- htmlparser returns 2 documents (original url and snapshot content - but using same original url)
- after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)
9 years ago
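The "different url" used for the snapshot in commit 9252e36aeb follows from the referenced ajax crawling scheme, which maps a page URL to a URL carrying an `_escaped_fragment_` query parameter. A rough sketch of that mapping is shown below. It is not the YaCy htmlparser code: the class and method names are hypothetical, and the escaping is simplified to a few characters instead of the full set named in the scheme.

```java
final class AjaxCrawlSketch {

    /**
     * Maps a page URL to its snapshot URL. A hash-bang fragment
     * (#!value) becomes _escaped_fragment_=value; a URL without one
     * (a page that only declares the "fragment" metatag) gets an empty value.
     */
    static String snapshotUrl(String url) {
        int bang = url.indexOf("#!");
        String base;
        String fragment;
        if (bang >= 0) {
            base = url.substring(0, bang);
            fragment = url.substring(bang + 2);
        } else {
            int hash = url.indexOf('#');
            base = hash >= 0 ? url.substring(0, hash) : url;
            fragment = "";
        }
        String separator = base.indexOf('?') >= 0 ? "&" : "?";
        return base + separator + "_escaped_fragment_=" + escape(fragment);
    }

    /** Simplified escaping: percent-encodes %, #, &, + and control/space chars. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '%' || c == '#' || c == '&' || c == '+' || c <= ' ') {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For a homepage that only carries the metatag, the snapshot URL is simply the page URL with an empty `_escaped_fragment_=` parameter appended.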
Michael Peter Christen
d1ae999ef9
replaced HashMap with LinkedHashMap to preserve the object order
9 years ago
Michael Peter Christen
7d075a1d76
added log lines
9 years ago
Michael Peter Christen
092dac086e
Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago