Idea: Add an option to make the output more friendly with RAG engines #202
Comments
Hi, @tranquochuy645 ! I see you're considering using AST for code splitting - I'm curious about your planned implementation approach. Since Repomix is a Node.js tool, I'd like to avoid introducing Python dependencies (like Langchain's RecursiveCharacterTextSplitter) as that would require users to maintain both Node.js and Python environments. I've been thinking tree-sitter could be a good fit here since it:
What are your thoughts on implementation details? I'm curious to hear more about how you're planning to handle the splitting logic. I'm excited about this feature and looking forward to hearing your ideas! |
Thanks so much for the encouragement, @yamadashy - really appreciate it! I totally get that adding Python dependencies to a Node.js package is a no-go. After digging deeper into the LangChain codebase, I found two approaches they use for code parsing:
The first option isn’t great, and the second isn’t supported in LangChainJS yet. But there are some tools we can use:
My plan is to mimic the LangChain language parser module in JavaScript for the splitting part; the rest is pretty straightforward. For now, I’ll keep exploring Repomix’s codebase. Once I have a better understanding, I’ll open a draft PR so we can chat more about the implementation details. |
Also, node-tree-sitter's documentation is not that great, especially about the Query API. Posting this link here for later investigations. |
Thank you for such detailed research! I really appreciate your thorough investigation into potential approaches. I've been wanting to tackle this issue but hadn't been able to start, so this is incredibly helpful. Speaking of Tree-sitter implementations, I recently came across an interesting article about how Aider uses Tree-sitter for their codebase analysis: Also, Cline is a great JavaScript-based reference for Tree-sitter implementation: It might provide some useful insights for our implementation. I completely agree that handling large codebases is currently Repomix's biggest challenge. I'm very excited to move forward with this enhancement. |
Hello @yamadashy, I’ve been having trouble finding time to work on this, though I’m still really excited about it! Here’s an update: I’ve managed to get some functions running locally, mostly thanks to your suggestion about Cline. The code now extracts function signatures, so it's pretty close to what was discussed in #164. I’m thinking of breaking this down into two features:
I’d love to hear your thoughts on this direction and how we can move forward. |
Thanks for the update @huy-trn ! What a coincidence - I was actually about to contact you after seeing your tree-sitter implementation in the network graph! Regarding your two proposals:
I'd like to move forward with both features, with a slight preference for tackling the Code Details feature first. However, feel free to work on whichever you find more approachable or interesting! I'll explore what kind of CLI options we should provide for this feature. Let's keep collaborating and make Repomix even better! |
I recently ran into the token limit issue and had to move away from Repomix because of it. I was looking for a way to turn off the full file output so that I could feed files to the LLM by hand, but the Aider approach, which is what I'm using now, would be much better. I'd love to keep the RAG and code compression on and turn off the complete file output. I'm surprised how fresh this is, and I'm going to keep my eyes on this 200%. |
I can help. Let me know, and ping @SpyC0der77 (my main) to let me know what to work on. |
Hello @SpyC0der77, Thanks so much! I’m working on this PR #336, but it’s still a draft, and honestly, the code is ugly. Unfortunately, I won’t be able to finish it anytime soon due to other commitments. It would be great if you could fork the branch and take it from there (#336) or even start fresh if you prefer. I’d be really happy to see the idea move forward! Edit: Adding Tree-sitter queries for more languages or improving the existing ones is really needed too, and we can easily merge the work! I'll be able to get back to working on this in 2 or 3 weeks. |
I don't know how to make RAG... I can try and learn though
|
Hello @yamadashy ,
The tool is already great, but larger repositories will never fit into an LLM's context window.
I know enterprise-level RAG systems have been around for a while, but sticking to user-friendly solutions, here’s what I’m thinking:
What to do: Make the output of Repomix easier to parse, and more meaningful when retrieved by any RAG engine, such as the "chat with documents" feature that’s common in most LLM applications.
How to do it: Split large code files into smaller chunks using the AST, then merge them back into a single output. Also, add separators between chunks that are recognizable by most text splitters.
I’ve recently tried LangChain’s source code loader, and this approach should be easy to implement with a few additional dependencies.
If this sounds good, I’d be happy to open a PR for it! Let me know your thoughts.
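To make the "split, then merge with separators" idea concrete, here is a minimal sketch of what that step might look like. It is not Repomix's implementation. It assumes an AST parser such as Tree-sitter has already reported the line range of each top-level definition; the `ChunkRange` shape, the function names, and the `---- chunk: … ----` separator format are all hypothetical and chosen only for illustration.

```typescript
// Hypothetical shape for a chunk boundary reported by an AST parser.
// Line numbers are 0-based and endLine is inclusive.
interface ChunkRange {
  name: string;
  startLine: number;
  endLine: number;
}

interface Chunk {
  name: string;
  text: string;
}

// Cut the source into chunks along the given ranges, in source order.
function splitByRanges(source: string, ranges: ChunkRange[]): Chunk[] {
  const lines = source.split("\n");
  return [...ranges]
    .sort((a, b) => a.startLine - b.startLine)
    .map((r) => ({
      name: r.name,
      text: lines.slice(r.startLine, r.endLine + 1).join("\n"),
    }));
}

// Merge chunks back into one output, each preceded by a separator line.
// The separator format is an assumption; any marker a text splitter
// can match on would serve the same purpose.
function mergeChunks(filePath: string, chunks: Chunk[]): string {
  return chunks
    .map((c) => `---- chunk: ${filePath}#${c.name} ----\n${c.text}`)
    .join("\n\n");
}
```

A downstream text splitter could then be configured to break on the `---- chunk: ` prefix, so each retrieved piece is a whole function or class rather than an arbitrary character window.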