/hydrus/ - Hydrus Network

Archive for bug reports, feature requests, and other discussion for the hydrus network.


(12.55 KB 340x175 mariadb-usa-inc.png)

Hydrus software optimization thread Anonymous 06/07/2018 (Thu) 02:07:57 Id: 1e8781 No. 9068
ITT: create proposals for making Hydrus more optimized.
Proposal: Why can't Hydrus switch to MariaDB? If it is faster, then it should be better. The only trouble is having the need to rewrite the queries, which from an SQL standpoint should be a non-issue, right?
List of Databases with Open Source License and Open Source APIs:
SQLite - Currently used in Hydrus, has minimal features
MySQL - A more well-rounded SQL Database with user management
PostgreSQL - An SQL with complex features with less performance
MariaDB - SQL/NoSQL database with heavy optimizations
ElasticSearch - A literal search engine instead of a normal Database
Teradata - IDK
https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems
https://www.infoworld.com/article/2611812/mysql/mysql-face-off--mysql-or-mariadb-.html
You're aware MariaDB and the like are server software and would rely on starting a second process alongside the hydrus client, right? SQLite is the only one that supports being loaded from a file within an application, which makes it the best -might even say only- fit for desktop software like this.
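To make the "loaded from a file within an application" point concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `files` table and the `example.db` filename are made up for illustration and are not Hydrus's real schema; the point is only that no second process or server is involved.

```python
import os
import sqlite3
import tempfile


def open_embedded_db(path):
    """Open (or create) a SQLite database file inside the current process."""
    conn = sqlite3.connect(path)
    # Hypothetical table for illustration only; not Hydrus's real schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (hash TEXT PRIMARY KEY, size INTEGER)"
    )
    return conn


def demo():
    """Write a row, close the file, reopen it, and read the row back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "example.db")

        conn = open_embedded_db(path)
        with conn:  # commits on success
            conn.execute("INSERT INTO files VALUES (?, ?)", ("abc123", 4096))
        conn.close()

        # Reopening the same file is all it takes; there is no daemon to start.
        conn = open_embedded_db(path)
        rows = conn.execute("SELECT hash, size FROM files").fetchall()
        conn.close()
        return rows
```

A server database such as MariaDB would need a running daemon, a socket or port, and credentials before the first query could be sent.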
>>9087 Is there other SQL-like software that loads from a file while still outperforming SQLite?
>>9087 take your pick: https://en.wikipedia.org/wiki/Embedded_database mongo or levelDB are good, but they're noSQL and would require extensive query rewrites - I'm also pretty sure hydrus benefits more from the relational database model, which those don't provide.
Honestly, there is lower-hanging fruit to pick in order to optimize Hydrus before even touching its database. After the initial processing of mappings, the bulk of I/O access is spent on the files themselves, which AFAIK is single-threaded.
>>9094 Multi-threaded Python won't end too well… Some say Go or Rust, but I know it is a meme to rewrite everything.
>>9077 kde/plasma also starts up a mysql/mariadb instance for everything pim related and users hate it because they never managed to write their software in a way that wouldn't crash the database. All in all, i think the startup time required for a mysqlish database is negligible on a modern system, but the amount of code required to make it act like an embedded database is astronomical and the exact opposite of what this project needs.
What about using FreeNAS in conjunction with Hydrus for ZFS-like performance? Or is there a distro that is best suited for image and file hoarding with RAID-like redundancy?
>>9112 Well, can we lay out the pros and cons of an embedded database vs an optimized database like MariaDB?
Would it be possible to use some ORM library for SQL and let the user choose the SQL backend?
I would not mind running a mariadb daemon for hydrus. In fact, i am running one right now, and it would be great if i could set hydrus up to just connect to an existing database.
>>9068 What about file system parities? Would installing Hydrus on FreeNAS with ZFS be a good idea? What about Linux with BTRFS?
http://www.freenas.org/blog/a-complete-guide-to-freenas-hardware-design-part-i-purpose-and-best-practices/
http://www.freenas.org/blog/a-complete-guide-to-freenas-hardware-design-part-ii-hardware-specifics/
http://www.freenas.org/blog/a-complete-guide-to-freenas-hardware-design-part-iii-pools-performance-and-cache/
http://www.freenas.org/blog/a-complete-guide-to-freenas-hardware-design-part-iv-network-notes-conclusion/
http://www.freenas.org/blog/freenas-worst-practices/
Some of the points:
1. 8GB of RAM minimum, 12GB minimum if using plugins or jails, 1GB RAM per 1TB (conservative) or per 3TB (liberal)
2. Don't use RAID controllers, just use Host Bus Adapters to connect the drives to the motherboard (software "RAID")
3. FreeNAS needs bare metal, NOT VMs (but putting plugins or jails into FreeNAS is a good idea)
4. Intel CPUs have more support than AMD, and LSI has the best Host Bus Adapters (Marvell and JMicron are okay)
5. 7200 RPM SAS or Enterprise SATA will work as HDDs; do not use desktop drives for this, to prevent IO errors
6. RAIDZ1 is like RAID 5, RAIDZ2 is like RAID 6, RAIDZ3 has triple parity; each vdev/group only has one-drive speeds
7. "ZFS intent log" should be on RAM (and on power-protected SSD if you wish), without it the whole vdev would fail
https://ponyorm.com/ can actually simplify SQL queries into something more python-friendly.
Proposal: Use a non-bloated Elasticsearch clone to find similar images quickly, and use newer techniques for people who want to hunt down sources of images
https://github.com/ascribe/image-match is Python 3 based with Elasticsearch
https://github.com/dsys/match is Python 3 based with Kubernetes and Elasticsearch
https://github.com/paucarre/tiefvision Lua based with deep learning (requires training)
https://github.com/magwyz/pastec C++ based with OpenCV (too vague)
https://github.com/pippy360/transformationInvariantImageSearch C++ based with triangulation (bigger database with high accuracy)
For image hashing only:
https://github.com/jenssegers/imagehash (PHP)
https://github.com/corona10/goimagehash (Go)
https://github.com/kevinlin311tw/caffe-cvprw15 (Deep Learn C++)
https://github.com/willard-yuan/hashing-baseline-for-image-retrieval (Deep Learn Matlab)
https://github.com/bunchesofdonald/photohash (Python)
https://github.com/Jetsetter/dhash (Python)
https://github.com/commonsmachinery/blockhash-js (JS)
https://github.com/commonsmachinery/blockhash-python (Python)
https://github.com/ruixuejianfei/BitScalableDeepHash (Deep Learn C++)
https://github.com/mk-fg/image-deduplication-tool (Python)
https://github.com/jforshee/ImageHashing (C#)
https://github.com/pwlmaciejewski/imghash (JS)
More information:
https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5
http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
https://realpython.com/fingerprinting-images-for-near-duplicate-detection/
http://bertolami.com/index.php?engine=blog&content=posts&detail=perceptual-hashing
https://www.pyimagesearch.com/2017/11/27/image-hashing-opencv-python/
Expanded use:
https://github.com/soruly/whatanime.ga
https://github.com/DaRealFreak/saucenao
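For anyone wondering what the image hashing libraries above do at their core, here is a rough sketch of one of the simpler techniques in this family, the difference hash (dHash). It assumes the image has already been shrunk to a small grayscale grid by some other tool (for example 8 rows of 9 values for a 64-bit hash). The function names are made up for this sketch, and the "left brighter than right" bit convention is one common choice, not necessarily what any linked library uses.

```python
def dhash_bits(gray_rows):
    """Difference hash of a small grayscale grid.

    gray_rows is a list of rows of brightness values. Each row yields one bit
    per pair of horizontal neighbours: 1 if the left pixel is brighter.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small distances suggest similar images."""
    return bin(hash_a ^ hash_b).count("1")


# Tiny 2x3 grid instead of the usual 8x9, just to keep the example readable.
grid = [[0, 1, 2], [2, 1, 0]]
brighter = [[value + 10 for value in row] for row in grid]
```

Because only the comparisons between neighbours matter, brightening every pixel by the same amount leaves the hash unchanged, which is why this kind of hash tolerates simple re-encodes.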
>>9068
>PostgreSQL - An SQL with complex features with less performance
t. Uber
>>9068 How about rewriting this in C++/Qt? If you limit yourself to Qt syntax then it is surprisingly similar to Python with some C++ quirks. It's easier doing multi-threading and starting separate processes in this than in Python. Something like this could be easiest done this way:
>Make code modular and switch the GUI to PyQt/PySide while still using Python.
>Experiment with the GUI code, perhaps try using QML to facilitate the GUI proposals from that one anon that made all the cool mockups. See >>8185
>Debate if it is even required to switch to C++ anymore since many Qt goodies can be used via the above-mentioned libraries (threading/process starting/native notifications, etc).
I haven't taken a look at the code, but if it is already written modular then this shouldn't be too hard if the dev can stay motivated and people can live with a few months of only critical bug fixing.
>>9348 That is the issue, the dev is trying to migrate from wxPython to PyQt after the downloader overhaul, along with other key functions like parallel downloads, workflow management and mobile integration.
Bumping
>>9464 Yes but why?
As mentioned by >>9094 the bottleneck is mostly how hydrus handles I/O and CPU. Imports are done sequentially when they could be sped up a lot by using multiprocessing. I'm sure other actions are still done sequentially too. A transition to a graph database like ArangoDB could be better in the long run, but that's never going to happen. Looking at the client.master.db database, I'm not sure why he added an index to the md5, sha1 and sha512 columns but not to the subtag or namespace columns. Doesn't make sense to me (and is the sha512 index really necessary?). Also it boggles my mind that foreign keys aren't being used at all.
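Whether a lookup on a text column actually benefits from an index can be checked with SQLite's `EXPLAIN QUERY PLAN`. The sketch below uses a made-up `subtags` table and index name, not the real client.master.db schema. It also does not check whether the real table already has an implicit index: SQLite creates one automatically for any column declared UNIQUE, so a missing explicit `CREATE INDEX` does not by itself prove lookups are slow.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def build_demo():
    """In-memory database with a hypothetical subtags table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subtags (subtag_id INTEGER PRIMARY KEY, subtag TEXT)"
    )
    conn.executemany(
        "INSERT INTO subtags (subtag) VALUES (?)",
        [(f"tag{i}",) for i in range(1000)],
    )
    return conn


conn = build_demo()
lookup = "SELECT subtag_id FROM subtags WHERE subtag = ?"

before = query_plan(conn, lookup, ("tag500",))  # full table scan expected
conn.execute("CREATE INDEX subtags_subtag_index ON subtags (subtag)")
after = query_plan(conn, lookup, ("tag500",))   # should now use the index
```

Printing `before` and `after` against a copy of a real database file would settle the question for the actual columns.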
>>9658 I am also expecting multi-threading could be a place where we can optimise the code (since most computers now run on 4/8 cores). Perhaps SQLite, MD5/SHA hashing and de-duplication are not made for multi-core and/or GPU computers.
>>9659
>multi-threading
Python threads are all executed on the same core. That's why I said multiprocessing. It spreads out each subprocess across each core. Based on your post you don't know much about software, so think of a subprocess in python like a normal thread.
>are not made for multi-core and/or GPU computers
Everything you've mentioned can be easily sped up with multiple cores. Using a GPU would be even faster but there's no point in using that here. I'm actually pretty surprised he hasn't implemented multiprocessing functions in bottleneck situations like importing. It's very easy to split up the work once you've scanned all the files. You just divide them up by the number of cores and have each subprocess do that portion of the work. If you have 4 cores you have each core do 1/4 of the files you want to import.
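A minimal sketch of the "split the files across cores" idea, using the standard `concurrent.futures.ProcessPoolExecutor`. This is not how Hydrus imports files; `sha256_of_file` and `hash_files` are names invented here, and hashing stands in for whatever per-file work an import does. The usual reason given for preferring processes in CPython is the global interpreter lock, which lets only one thread run Python bytecode at a time.

```python
import hashlib
import sys
from concurrent.futures import ProcessPoolExecutor


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash one file in chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def hash_files(paths, workers=None):
    """Hash many files using a pool of worker processes.

    workers=None lets the pool pick a size based on the CPU count.
    """
    paths = list(paths)
    if not paths:
        return {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() hands files to whichever worker is free, so the list does
        # not have to be cut into equal quarters by hand.
        return dict(pool.map(sha256_of_file, paths))


if __name__ == "__main__":
    # The guard is needed on platforms that start workers by re-importing
    # this module.
    for path, digest in hash_files(sys.argv[1:]).items():
        print(digest, path)
```

Letting the pool distribute the work also avoids the case where one quarter of the list happens to contain all the large files.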
>>9660
>Python threads are all executed on the same core. That's why I said multiprocessing
Well, since people describe 4-core Intel CPUs with "hyperthreads" as having 8 virtual cores, I would say it is easy to get those things mixed up. If I have to use a proper term, Parallel Programming (as in Concurrency) would be more fitting.
>Everything you've mentioned can be easily sped up with multiple cores
I meant that it has not been implemented yet by the dev (s/are not/has not been/), since
>I'm actually pretty surprised he hasn't implemented multiprocessing functions
Considering the recent happenings of Tumblr and booru.org purges, it is important to put focus on alternative decentralization libraries.
1. Free P2P software
a. BitTorrent - Most commonly used, but can't handle individual files
b. WebTorrent - WebRTC version of BitTorrent, but still has the same issue
c. eDonkey and GNUtella - both very obscure, not really useful or adaptive
d. IPFS - currently used in Hydrus, can handle singular files in a folder structure
2. Proxies and pseudo-VPNs
a. TOR - very common, maybe pozzed by CIA, has BitTorrent and IPFS compatibility (OpenBazaar)
b. I2P - less common, not pozzed, has BitTorrent compatibility, IPFS is in the works (go-i2p)
c. Freenet and Retroshare - both very uncommon, have file transferring and chats as a primitive
d. Zeronet - pretty dead, works with Javascript, too many unknowns
3. Blockchain data solutions (https://en.wikipedia.org/wiki/Cooperative_storage_cloud)
a. Filecoin - based in IPFS, slowly developing, could be used in conjunction with Hydrus
b. Sia - top data blockchain contender, has smart contracts with regular renewal for storage (https://sia.tech/)
c. MaidSafe - possible competition, includes secure communication and storage (https://maidsafe.net/)
d. Storj - noted, already has average pricing, made to be used alongside self-hosted cloud (https://storj.io/)
e. Ethereum Swarm - not really a good idea as the blockchain is congested by CryptoKitties
f. Others include https://decent.ch/ https://www.creativechain.org/ https://contentbox.one/ https://noia.network/
Others: https://cryptoslate.com/category/cryptos/storage/
4. Social media blockchain
a. Steem - used in alt-media like bitchute, dtube and steemit (https://steem.io/)
b. Rocketchat - used by the furries to communicate (https://rocket.chat/)
c. SocialX - at a whitepaper stage, to replace facebook and twitter (https://socialx.network/)
d. Akasha - based in IPFS, meant to replace Tumblr (https://akasha.world/)
e. BAT Token - used by Brave Browser (https://basicattentiontoken.org/)
Others: https://foresting.io/ and https://sola.foundation/ and https://www.synereo.com/
https://www.stateofthedapps.com/dapps/tagged/social/tab/most-relevant
>>9881 >booru.org purges What do you mean?
>>9884 Gelbooru and *.booru.org are hosted in the Netherlands, and they are using "anti-loli laws as an excuse" to force a purge on the admins.
Do you know how can I convert hydrus db to postgresql? Hydrus db consists of multiple sqlite files, how can I connect all of them?
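On the "how can I connect all of them" part: SQLite can open several database files on one connection with `ATTACH DATABASE`, after which tables from every file can be joined in a single query. The sketch below only shows that mechanism; the file names, the `hashes_db` alias and both tables are made up and are not Hydrus's real layout, and it does not cover the PostgreSQL conversion itself.

```python
import os
import sqlite3
import tempfile


def attach_all(main_path, others):
    """Open main_path and attach extra database files to the same connection.

    others maps a schema alias to a file path. The alias is interpolated into
    the SQL text, so it must come from trusted code, never from user input.
    """
    conn = sqlite3.connect(main_path)
    for alias, path in others.items():
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (path,))
    return conn


def demo():
    """Join a table in the main file against a table in an attached file."""
    with tempfile.TemporaryDirectory() as folder:
        main_path = os.path.join(folder, "main.db")
        hashes_path = os.path.join(folder, "hashes.db")

        # Second file with a hypothetical hashes table.
        side = sqlite3.connect(hashes_path)
        with side:
            side.execute(
                "CREATE TABLE hashes (hash_id INTEGER PRIMARY KEY, hash TEXT)"
            )
            side.execute("INSERT INTO hashes VALUES (1, 'abc123')")
        side.close()

        conn = attach_all(main_path, {"hashes_db": hashes_path})
        with conn:
            conn.execute(
                "CREATE TABLE files (hash_id INTEGER PRIMARY KEY, size INTEGER)"
            )
            conn.execute("INSERT INTO files VALUES (1, 4096)")
        rows = conn.execute(
            "SELECT h.hash, f.size FROM files AS f "
            "JOIN hashes_db.hashes AS h ON h.hash_id = f.hash_id"
        ).fetchall()
        conn.close()
        return rows
```

With everything reachable from one connection, each table could then be read out row by row and written to another backend.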
>>9281 https://vision.fe.uni-lj.si/cvww2016/proceedings/papers/04.pdf (Quantitative Comparison of Feature Matchers Implemented in OpenCV3) https://sci-hub.tw/10.1109/m2vip.2016.7827292 (Comparison of OpenCV’s Feature Detectors and Feature Matchers)
>>10361 Got some more comparative papers 4U https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8346440 (A Comparative Analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK)
https://en.wikipedia.org/wiki/Pointwise_mutual_information
Pointwise mutual information between tag X and tag Y is the logarithm of (num. of images with both tags) * (total image count) / ((num. of images with tag X) * (num. of images with tag Y))
PMI can be used to find possible tag siblings
https://en.wikipedia.org/wiki/Conditional_entropy
Conditional entropy of X given Y is ( (num. of images with both tags) / (total image count) ) * logarithm of ( (num. of images with tag X) / (num. of images with both tags) )
CE can be used to find possible tag parents and children
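The two formulas above, written out as a small sketch working on plain counts. The function names are made up. The second function is only the "both tags present" term exactly as the post writes it, not the full sum over all four present/absent combinations that the textbook definition of conditional entropy uses. Natural logarithms are assumed, since the post does not name a base.

```python
import math


def pmi(both, count_x, count_y, total):
    """Pointwise mutual information of two tags from image counts."""
    if both == 0:
        return float("-inf")  # the tags never co-occur
    return math.log((both * total) / (count_x * count_y))


def conditional_entropy_term(both, count_x, total):
    """The single co-occurrence term of conditional entropy from the post."""
    if both == 0:
        return 0.0  # p * log(1/p) tends to 0 as p goes to 0
    return (both / total) * math.log(count_x / both)


# 1000 images, tag X on 100, tag Y on 50, and every Y image also has X.
example_pmi = pmi(both=50, count_x=100, count_y=50, total=1000)
example_term = conditional_entropy_term(both=50, count_x=100, total=1000)
```

A PMI of 0 means the tags co-occur exactly as often as chance would predict, and larger positive values mean they appear together more often than that.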
Nim is low-level Python, Crystal is low-level Ruby, both would be easy for the rest of us (and hopefully the dev) to pick up. Doing so would mean that Hydrus would be at least twice as fast in certain departments when compared to non-NumPy Python. (Also D is a C replacement, Go and Kotlin are Java replacements, but those are very different from the syntax of Python) Are there applications where low-level languages DON'T apply? Math calculations, in that case use SciPy/NumPy for less work. Some benchmarks: https://github.com/kostya/benchmarks https://github.com/drujensen/fib https://github.com/frol/completely-unscientific-benchmarks https://github.com/logicchains/LPATHBench
>>10290 >https://github.com/acoustid/acoustid-index (C++) You're looking for https://github.com/acoustid/chromaprint (C++) To be honest though when Hydrus starts doing audio fingerprinting it should probably just use acoustid so it can grab tags from MusicBrainz ( https://musicbrainz.org/ )
>>11053 Or maybe others as well? What if we are getting music from torrents instead and don't want MusicBrainz to know that I got them? Bumping to spark conversation >>10232 http://www.scitepress.org/Papers/2016/59263/59263.pdf (Performance Evaluation of Phonetic Matching Algorithms on English Words and Street Names) More benchmarks for major phonetic algorithms
>>9068
>PostgreSQL - An SQL with complex features with less performance
1998 wants its retard memes back.
(1.57 KB 300x300 下.png)

>>11023 >implying
>>11204 How so? Too many onyomi and kunyomi? Even then if we are not using phonetic fuzzy search, string fuzzy search can still be used (see https://en.wikipedia.org/wiki/String_metric)
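As a sketch of the string fuzzy search idea, here is plain Levenshtein edit distance, one of the metrics on the linked Wikipedia page. It compares characters only, so it works the same on kanji as on Latin text and never needs to know a reading. The function names, tag list and distance threshold are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance: fewest single-character inserts, deletes or swaps."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete from a
                    current[j - 1] + 1,      # insert into a
                    previous[j - 1] + cost,  # substitute
                )
            )
        previous = current
    return previous[-1]


def closest_tags(query, tags, max_distance=1):
    """Tags within max_distance edits of query, nearest first."""
    scored = sorted((levenshtein(query, tag), tag) for tag in tags)
    return [tag for distance, tag in scored if distance <= max_distance]


matches = closest_tags("下着", ["下着", "上着", "水着", "着物"])
```

For a large tag list this brute-force scan would be slow, so a real search would probably want to narrow the candidates first.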
Here is a list of "expert system" Video Quality Quantifiers
https://github.com/Netflix/vmaf (C/C++/Python)
https://github.com/aizvorski/video-quality (Python)
https://github.com/bavc/qctools (C++)
https://github.com/Rolinh/VQMT (C++)
https://github.com/google/rtc-video-quality (Python)
https://github.com/kahkeng/vqats (C/C++)
https://github.com/slhck/ffmpeg-quality-metrics (Python)
https://github.com/honzabilek4/VideoCodecs (C++)
https://github.com/jsyzgaochao/iqat (C++)
Here is a list of "expert system" Image Quality Quantifiers
https://github.com/andrewekhalel/sewar (Python)
https://github.com/jeffh/CV-Image-Quality-Analysis (Python)
https://github.com/VIQET/VIQET-Desktop (C++/C#)
https://github.com/arcaduf/image_quality_assessment (Python)
https://github.com/bukalapak/pybrisque (Python)
https://github.com/pby5/BRISQUE (C++)
https://github.com/mchall/ImageQuality (C#)
https://github.com/mtobeiyf/CEIQ (C/MATLAB)
https://github.com/realwecan/BlindImageQualityAssessment (C++)
https://github.com/grevutiu-gabriel/iqa (C/MATLAB)
https://github.com/ruofeidu/ImageQualityCompare (C++)
https://github.com/henrikjohansson/Colorite (Java/C++)
And for NN-based Image/Video Quality Quantifiers… sigh
https://github.com/idealo/image-quality-assessment (348 stars)
https://github.com/jongyookim/IQA_BIECON_release (41 stars)
https://github.com/jongyookim/IQA_DeepQA_FR_release (32 stars)
https://github.com/lidq92/CNNIQA (29 stars)
https://github.com/lidq92/CNNIQAplusplus (28 stars)
https://github.com/HC-2016/weighted_DCNN_IQA (17 stars)
https://github.com/lidq92/WaDIQaM (17 stars)
https://github.com/zwx8981/DBCNN-PyTorch (10 stars)
https://github.com/VideoForage/VQA-Deep-Learning (10 stars)
https://github.com/synckey/deep_biq (9 stars)
https://github.com/zhl2007/pytorch-image-quality-param-ctrl (9 stars)
https://github.com/michaelneuder/image_quality_analysis (9 stars)
https://github.com/hervindphil/image_quality (8 stars)
https://github.com/SenJia/Saliency-CNN-Image-Quality-Assessment (8 stars)
https://github.com/pcpmartins/video-quality-assessment (8 stars)
https://github.com/kamballu/HDR-NRIQA-PCNN (5 stars)
https://github.com/JayMarx/VSBIQA (5 stars)
https://github.com/etosworld/etos-image-assessment (3 stars)
https://github.com/geosrs/transIQA (3 stars)
https://github.com/Bobholamovic/CNN-FRIQA (3 stars)
https://github.com/LeonLIU08/DeepQA-with-Pytorch (3 stars)
>>12295 Why don't you actually develop something on your own instead of endlessly shitting out github links
>>12302 Nah that is for >>12277

