• 2 Posts
  • 18 Comments
Joined 11 months ago
Cake day: August 2nd, 2023

    1. SO and Reddit are on the TODO list. It even had SO once (at the bottom, indeed), though not via crawling but via the SO Search API. The results were of very poor quality and it was super slow, so I had to remove it while I think of a better solution. Crawling the entire SO might be a little too much for this project at this stage, but if I have enough courage and hours at night I might parse that 20GB Stack Overflow archive dump and try doing something useful with it.

    Same for Reddit, but here I have mixed feelings about it in general and hope it’s going to die soon, replaced by amazing Lemmy communities.

    I also used to type a question and end it with “reddit” in Google to get good quality content, but with kukei the experiment is whether the blogosphere can properly replace that when the index promotes it.

    2. Why blogs?

    This is my main thing: to promote good quality blogs that I tried to follow via RSS but somehow never did. Having them all indexed (and more: the Mastodon community gave me some amazing links to index) makes me actually visit them often.

    As for the “SEO cancer”, that’s where curation comes into play. Before crawling, I check any blogs unknown to me and decide whether they go in or not.
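For the Stack Overflow archive dump mentioned above, here is a minimal sketch of how such a file could be streamed without loading all 20GB into memory. It assumes the dump ships posts as a `Posts.xml` file with one `<row>` element per post and `Id`, `PostTypeId`, `Score` and `Title` attributes; the function name `iter_questions` and the `min_score` threshold are made up for illustration and are not part of kukei.

```python
import io
import xml.etree.ElementTree as ET


def iter_questions(source, min_score=5):
    """Yield (id, title, score) for question rows scoring at least min_score.

    Assumes one <row .../> element per post, with PostTypeId="1" marking
    questions. `source` is a file path or a binary file-like object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        if elem.get("PostTypeId") == "1":
            score = int(elem.get("Score", "0"))
            if score >= min_score:
                yield int(elem.get("Id")), elem.get("Title", ""), score
        # Drop rows that were already processed so the tree does not keep growing.
        root.clear()


# Tiny in-memory stand-in for the real dump file.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="12" Title="How do I centre a div?" />
  <row Id="2" PostTypeId="2" Score="40" ParentId="1" />
  <row Id="3" PostTypeId="1" Score="1" Title="Low-score question" />
</posts>"""

if __name__ == "__main__":
    for post_id, title, score in iter_questions(io.BytesIO(SAMPLE)):
        print(post_id, score, title)
```

Filtering by score is only one possible way to pick out the useful part of the dump; any other signal available on the rows could be swapped in.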










  • It’s still an MVP and a work in progress, hence the index is not “full”.

    For me, “web development” is everything that we might need for, well, the web. Servers, Mongo docs, it all goes into the index (I’m adding things basically every day, but it also takes some time to index stuff, and I’m observing how this whole thing works as the index grows).

    ASP.NET goes into the index, of course. If your website has dev resources and blog posts, those would go into it as well. Recently one person suggested tons of Haskell blogs and they are being indexed as we speak.

    I also have a different problem: dev.to has a lot of good resources but also tons of SEO spam and low quality content. It’s also freaking huge, and while it was in the index for some time, I had to remove it and think about it some more.

    Where would you draw lines on mixed content or technologies

    For now the line is: does this website have anything that web devs would need? Yes? Then it might get in.

    If it’s a blog about locomotive CPU programming then maybe not, although that’s mostly due to infrastructure costs. Indexing costs money in the end, but having some unrelated stuff in the index should not hurt the results.

    Everything I wrote is the state as of today. I change my mind often, as it’s still in the “having fun” stage.

    PS. Also, thanks for the feedback!