Selca is a fan-operated archive of J-pop and K-pop idols' social media posts.
Our focus is on currently active groups and idols. Archival of accounts from disbanded groups and retired idols may be interrupted at any time.
Please enable JavaScript to see the contact e-mail address.
You can contact us by sending an e-mail message to sel... @ proton.me (click the ellipsis to reveal the full address). Any feedback is welcome.
If you want to reset your password, make sure to message us from the same e-mail address that you used when registering your account.
If you're only interested in gathering our text data for your LLM training, you DO NOT need to use a crawler. We provide a list of post metadata zip files containing the text of all of our posts. Together they are a full dump of our posts table. Please download each file exactly once, one at a time. Numbered files will not be updated after they are published, so if you want newer data at a later time, you only need to download posts_latest.zip and whichever numbered files have been added since your last download. DO NOT set up a cron job to redownload every file periodically. At most once a day, you may check and redownload posts_latest.zip, and download the last numbered file if it was newly added. Please stop DDoSing us. You're wasting our time and resources, and getting no extra data from it.
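As a rough sketch of the download policy described above, the selection step could look like the following Python. The numbered file names used in the example (posts_1.zip and so on) are hypothetical placeholders, and the function only decides what to fetch; it does not download anything itself.

```python
def files_to_fetch(listed, already_downloaded, latest_name="posts_latest.zip"):
    """Return the file names worth downloading on this run, in listed order.

    listed: names currently shown in the published file list.
    already_downloaded: names you saved on earlier runs.
    Numbered files are fetched only once; the "latest" file is always refreshed.
    """
    have = set(already_downloaded)
    todo = []
    for name in listed:
        is_new = name not in have
        if (name == latest_name or is_new) and name not in todo:
            todo.append(name)
    return todo


# Hypothetical file names, for illustration only.
listed = ["posts_1.zip", "posts_2.zip", "posts_latest.zip"]
have = ["posts_1.zip", "posts_latest.zip"]
print(files_to_fetch(listed, have))
```

Run this at most once a day and download the resulting files one at a time.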
If you're interested in mass downloading the media we have archived, we provide a master list of all archived accounts' media. Each line contains a link to a list of all the media we have archived from one account. These same links are also present at the top of our account pages, which can themselves be found in our searchable accounts table. By downloading every list and then downloading all the media listed in each one, you'll be able to save all of our archived media without placing excessive stress on our database. We also provide a similar list linking to the metadata we have.
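The two-level structure described above (a master list whose lines link to per-account lists) could be walked roughly as follows. This is only a sketch: it assumes every list is plain UTF-8 text with one URL per line, which is stated here for the master list but is an assumption for the per-account lists, and the function names are made up for illustration.

```python
import time
from urllib.request import urlopen


def read_links(text):
    """Split a plain-text list into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_text(url):
    """Download one list and decode it (assumes UTF-8 text)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def collect_media_urls(master_url, fetch=fetch_text, pause=time.sleep, delay=1.0):
    """Return every media URL reachable from the master list.

    Waits `delay` seconds before each per-account list request so the
    lists themselves are fetched no faster than one per second.
    """
    media = []
    for account_list_url in read_links(fetch(master_url)):
        pause(delay)
        media.extend(read_links(fetch(account_list_url)))
    return media
```

The `fetch` and `pause` parameters exist only so the function can be tried offline with stand-in data before pointing it at the real lists.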
Feeding the lists described above into a mass downloader tool should provide you with a local copy of everything we have. One such tool is JDownloader, though many others exist and you may choose whichever one you prefer. However, please configure your tool to download no more than one file per second. Please download from a single IP address, so you won't be mistaken for an attacker. You do not have to mask your tool's user-agent.
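If you would rather script the download than use an existing tool, a minimal sequential downloader that respects the one-file-per-second limit might look like this. It is a sketch under stated assumptions, not a recommended implementation: the function name and parameters are hypothetical, files are named after the last path segment of each URL, and anything already on disk is skipped so nothing is downloaded twice.

```python
import time
import urllib.request
from pathlib import Path


def download_all(urls, dest_dir, min_interval=1.0,
                 fetch=urllib.request.urlretrieve,
                 sleep=time.sleep, clock=time.monotonic):
    """Download URLs one at a time, starting at most one per `min_interval` seconds."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    last_start = None
    saved = []
    for url in urls:
        # Assumption: the last path segment is a usable, unique file name.
        target = dest / url.rstrip("/").rsplit("/", 1)[-1]
        if target.exists():
            continue  # already saved on an earlier run, never re-download
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        fetch(url, str(target))  # single connection, no parallelism
        saved.append(target)
    return saved
```

Because everything runs in one process over one connection, requests also come from a single IP address, and the default user-agent is left untouched.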
Note that, as mentioned above, using a regular general-purpose web crawler on our front page (without respecting /robots.txt) is a very bad idea, since it will strain our database and deny service to other users without bringing you any benefit.
Please contact the e-mail address above if you have any questions about scraping our data, or if you have been affected by rate limits intended for someone else, that is, if you're seeing HTTP error 429 despite not being a scraper.