I'm creating a curated search engine for web developers. Asking for feedback

sznowicki@lemmy.world · edit-2 11 months ago

sznowicki@lemmy.world · edit-2 11 months ago

I like how the first queries you guys make are attempts at SQL injection and XSS.

EDIT: if you find something let me know, PRs are also welcome ;)
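For readers wondering what those probes are testing, here is a minimal sketch of the two usual defenses: bound query parameters against SQL injection, and output escaping against XSS. It is not taken from the kukei code; the `pages` table and the function names `search_pages` and `render_query` are hypothetical and only illustrate the idea.

```python
import html
import sqlite3


def search_pages(conn, term):
    """Find page titles containing `term`, passing it as a bound parameter."""
    # The ? placeholder keeps user input out of the SQL text itself.
    rows = conn.execute(
        "SELECT title FROM pages WHERE title LIKE ?",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


def render_query(term):
    """Echo the query back into HTML with special characters escaped."""
    return f"<p>Results for: {html.escape(term)}</p>"


# Hypothetical in-memory table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT)")
conn.execute("INSERT INTO pages VALUES ('Flexbox guide')")

print(search_pages(conn, "Flexbox"))
print(search_pages(conn, "' OR '1'='1"))
print(render_query("<script>alert(1)</script>"))
```

With a bound parameter the injection attempt is treated as an ordinary search string, and the escaped query is displayed as text instead of being executed by the browser.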

nezbyte@lemmy.world · 11 months ago

There was a programming search engine called Symbol Hound that allowed for searching for special symbols like << and &&. It was my fallback search engine while programming if I couldn’t find something on the first page of Google. Sadly, that site appears to have disappeared. Does this search engine have optional support for special characters?

Kissaki@feddit.de · edit-2 11 months ago

I think the main issue as well as my main question is around scope.

You say it targets web developers, but the current index is quite narrow. So will you accept significant expansion of that, as long as it may be relevant to web developers? Where would you draw the line on mixed content or technologies?

The ASP.NET docs are definitely docs for web developers. But maybe not what you had in mind. Would that apply? The docs are hosted on a platform with a lot of other docs of the dotnet space. Some may be relevant to “Web developers”, others not. And the line is subjective and dynamic.

My website has some technological development resources and blog posts. But also very different things. Would that fit into scope or not?

How narrow or broad would you make the index?

I guess it’s an index for search, so noise shouldn’t be a problem as long as there are gains through/of quality content.

sznowicki@lemmy.world · edit-2 11 months ago

It’s still in MVP, work in progress, hence the index is not “full”.

For me “web development” is everything that we might need for, well, the web. Servers, Mongo docs, it all goes into the index (I’m adding to it basically every day, but it also takes some time to index stuff, and I observe how the whole thing works as the index grows).

ASP.NET goes into the index of course. If your website has dev resources and blog posts that would go into it as well. Recently one person suggested tons of Haskell blogs and they are being indexed as we speak.

I also have a different problem: dev.to has a lot of good resources but also tons of SEO spam and low quality content. It’s also freaking huge, and while it was in the index for some time, I had to remove it and think about it some more.

Where would you draw the line on mixed content or technologies

For now the line is: does this website have anything that web devs would need? Yes? Then it might get in.

If it’s a blog about locomotive CPU programming then maybe not, although mostly due to infrastructure costs. Indexing costs money in the end, but having some unrelated stuff in the index should not hurt the results.

All of what I wrote is the state for today. I’m changing my mind often as it’s still in the “having fun” stage.

PS. also thanks for the feedback!

Kissaki@feddit.de · edit-2 11 months ago

I also have a different problem: dev.to has a lot of good resources but also tons of SEO spam and low quality content. It’s also freaking huge, and while it was in the index for some time, I had to remove it and think about it some more.

Yeah, a public platform is unlikely to provide consistent content. If curation is not an explicit goal and practice there, I would not include them for the reasons you mentioned.

If indexing could happen not per domain but with more granular filters, such as URL base paths, that may be viable. For example, indexing specific authors on dev.to.
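A rough sketch of what such a path-based filter could look like. This is not how the kukei crawler is configured; the `ALLOWED_PREFIXES` mapping, the author paths in it, and the `should_index` function are all made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: host -> URL path prefixes worth indexing.
ALLOWED_PREFIXES = {
    "dev.to": ["/some-author/", "/another-author/"],
}


def should_index(url):
    """Return True only if the URL sits under an allowed path on a known host."""
    parts = urlsplit(url)
    prefixes = ALLOWED_PREFIXES.get(parts.hostname or "")
    if prefixes is None:
        # Host is not curated at all, so nothing on it gets indexed.
        return False
    return any(parts.path.startswith(prefix) for prefix in prefixes)


print(should_index("https://dev.to/some-author/my-post"))
print(should_index("https://dev.to/someone-else/seo-spam"))
```

The trailing slash in each prefix keeps `/some-author/` from also matching an unrelated path such as `/some-author-2/`.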

sznowicki@lemmy.world · 11 months ago

Good idea. I once thought about doing some narrow indexing of websites. Stack Overflow, for example, is a big issue: indexing all of it is crazy, while picking specific tags feels like tons of work. In the end I adjust the whole project as it grows, hoping that it gets better after every tuning.

As long as I have fun with it I’ll continue :D

Kissaki@feddit.de · 11 months ago

Of course - cutting scope is a good call to keep it manageable and fun, and not end up with creep and what you wanted to evade in the first place. :)

bigredgiraffe@lemmy.world · 11 months ago

This is a cool idea! I did notice that on mobile the search results are wider than the viewport and if I had a feature request it would be to make them way, way more compact but that might just be me hah.

You should also check out the Lenses feature that Kagi has, I think every search engine needs that feature now hah. I bookmarked your site for the next time I am searching for sure though!

sznowicki@lemmy.world · edit-2 11 months ago

Thx for the comments. I’ll fix the mobile view and will definitely redesign it all a bit over the weekend. I see a lot of room for improvement.

Also will check how to submit it to Lenses. Highly appreciate it!

EDIT: the mobile view is fixed, and I also made some small adjustments to the whitespace between result items.

bigredgiraffe@lemmy.world · 11 months ago

Yeah! Granted I have an iPhone 12 which is small for a modern phone but I figured I should mention it :D

I have been thinking more about this idea and I love it even more, I feel like domain specific search engines are going to be more and more important in the future as the results of the major search engines get even worse and worse.

Awesome work!

KseniyaK@lemmy.ca · 11 months ago

Well, I think this may be not a bad idea at all. However, what would really stop me from using your search engine is if my search queries (or anything else I send) were somehow tied to me and/or sold to someone. Please don’t be like Google, Microsoft, or OpenAI.

sznowicki@lemmy.world · 11 months ago

Ah. This will never happen. I have zero motivation to do any GDPR stuff in this project. Even for analytics I anonymize visitors’ IPs so Plausible doesn’t get them.

Also in this case it would be nonsense. For general search it makes sense that Bing knows I’m after parceljs when typing “parcel” instead of shipping companies. For such a narrow search engine the user persona is known.
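The comment above does not say how the IPs are anonymized. One common approach, shown here purely as an assumption, is to zero out the host part of the address before it is forwarded anywhere. The function name `anonymize_ip` and the /24 and /48 prefix lengths are illustrative choices, not the project's actual settings.

```python
import ipaddress


def anonymize_ip(address):
    """Truncate an IP address to its network part (assumed /24 for IPv4, /48 for IPv6)."""
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(anonymize_ip("203.0.113.77"))
print(anonymize_ip("2001:db8:abcd:1234::1"))
```

Many visitors share the same truncated address, so it no longer identifies a single person, while coarse statistics still remain possible.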

Kissaki@feddit.de · 11 months ago

Index categories are blog, docs, magazines. Have you considered indexing source code websites?

https://source.dot.net/ provides a web UI for exploring the dotnet source code

I thought I would remember a second one, but I can’t recall right now.

Subpaths on GitHub and GitLab would work in a similar fashion but would require more specific filters - unless they are projects hosted on dedicated instances.

Project issue tickets may also be very relevant to developer searches!?

sznowicki@lemmy.world · edit-2 11 months ago

Great ideas. For the source code I’m not sure but I’ll put it to the backlog of cool things I get from Lemmy and work on them one by one. Thanks!

jjjalljs@ttrpg.network · 11 months ago

I often get annoyed when I Google/ddg something like “python3 sort list in place” and get some blog, w3schools, and geeks for geeks, before I get the standard python docs. Just tell me if it’s [].sort() or sorted([]) !

Honestly for that kind of question I want the docs more than I want stack overflow.

Maybe I should just bookmark the docs instead of using search, come to think of it. But if your search prioritizes official docs that sounds like a plus.
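For the record, the answer to the example question above is that both exist and do different things. A short illustration using only the standard built-ins:

```python
numbers = [3, 1, 2]

# sorted() returns a new sorted list and leaves the original untouched.
sorted_copy = sorted(numbers)
untouched = list(numbers)

# list.sort() sorts the list in place and returns None.
result = numbers.sort()

print(sorted_copy)
print(untouched)
print(numbers)
print(result)
```

So `list.sort()` is the in-place one, and assigning its return value is a classic mistake because that value is always `None`.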

stabbie_mcgee@lemmy.world · 11 months ago

Another option aside from bookmarking is DDG bangs. I have DDG as my default search, so I can just type !py sort list into the address bar and go directly to the Python documentation.
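Conceptually, a bang is just a prefix that selects a target site's search URL. The sketch below is not DuckDuckGo's implementation; the `BANGS` table, the `resolve_bang` function, and the assumption that `!py` should point at the Python docs search page are all illustrative.

```python
from urllib.parse import quote_plus

# Hypothetical bang table: shortcut -> search URL template.
BANGS = {
    "py": "https://docs.python.org/3/search.html?q={query}",
}


def resolve_bang(text):
    """Turn '!py sort list' into a search URL, or return None if no bang matches."""
    if not text.startswith("!"):
        return None
    bang, _, query = text[1:].partition(" ")
    template = BANGS.get(bang)
    if template is None:
        return None
    # quote_plus encodes spaces and special characters for the query string.
    return template.format(query=quote_plus(query))


print(resolve_bang("!py sort list"))
```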

DrakeRichards@lemmy.world · 11 months ago

It’s a good start. I’m curious why you didn’t include a section for social media like StackOverflow or Reddit. If I go to Google with a question, it’s usually for an edge case not covered by the documentation. Maybe add them as a section at the bottom to indicate that they might be less relevant?

Also, this might just be a web developer thing, but why include blogs? Almost all coding blogs I’ve seen are SEO cancer that just copy from the documentation or each other. Are there actually useful blogs out there that I’ve just been missing?

sznowicki@lemmy.world · 11 months ago

SO and Reddit are on the TODO list. It even had SO once (at the bottom, indeed), not via crawling but via the SO Search API. It gave very poor quality results and was super slow, so I had to remove it while thinking of a better solution. Crawling the entire SO might be a little too much for this project at this stage though, but if I have enough courage and hours at night I might parse that 20GB Stack Overflow archive dump and try doing something useful with it.

Same for Reddit but here I have mixed feelings about it in general and hope it’s going to die soon being replaced by amazing Lemmy communities.

I also used to type a question and end it with “reddit” in Google to get good quality content, but here with kukei the experiment is whether the blogosphere can replace it properly when the index promotes it.

Why blogs?

This is my main thing. To promote good quality blogs that I tried to follow via RSS but somehow never did. Having them all indexed (and more, some Mastodon community gave me amazing links to index) makes me actually visit them often.

As for the “SEO cancer”, that’s where curation comes into play. Before crawling I check blogs unknown to me and decide whether something goes in or not.

DrakeRichards@lemmy.world · edit-2 11 months ago

That makes sense. I really like that the documentation is right at the top; many times all I want to do is find the right page in the official docs. You might want to look at how results are prioritized though: right now when I search for something simple like “how to center a div”, that result from Mozilla’s docs is included but it’s hidden as the second or third result. I would expect the page that’s explicitly about centering a div to be the top result, followed by the docs page for the element itself and maybe pages for flex or grid or something. That’s a really simple example, so maybe it’s not the target of this project, but I would still hope that simple topics are covered just as well as complex ones.

EDIT: I was a bit mistaken: “how to center a div” does bring up the Mozilla documentation for centering an element, but “center a div” brings up a page about accessibility as the top result.

Ryan@lemmy.world · 11 months ago

I really like the simple design that separates the results into docs/blogs/magazines. Obviously, the results reflect the current state, but I appreciate your approach in both the design & sourcing the search results. I think there’s a lot of potential for this to be a regular part of my toolbox, hopefully this takes off!

sznowicki@lemmy.world · 11 months ago

Thanks for the kind words!

Hammerheart@programming.dev · 11 months ago

Can you add links to each section at the top so you don’t have to scroll past ones you might not be interested in?

onlinepersona@programming.dev · 11 months ago

How is it specifically dev focused? How will the crawler know that the site or page is dev related?

sznowicki@lemmy.world · 11 months ago

The crawler takes only the sources that are defined in the crawler repo (it’s open source, check the github org or kukei-spider).

So in this way it’s “curated” in the sense that it would not add anything else to the index.

starman@programming.dev · 11 months ago

Looks cool, I think I’ll add it to Firefox and use it sometimes

sznowicki@lemmy.world · 11 months ago

Thanks! If you have some suggestions in the future I’m always open to hearing them

Hazzard@lemm.ee · 11 months ago

Oh, what an interesting idea! I like this, on Monday I’ll test out switching to this as my main search engine for work and try to report back how it goes!

sznowicki@lemmy.world · 11 months ago

Thanks, but don’t expect too much yet. Many sources are still missing. If you notice something that should be there but is not even being crawled, feel free to reach me on Mastodon or add it directly via PR here: https://github.com/Kukei-eu/spider/blob/main/index-sources.js

kukei.eu