Idea: Add an option to make the output more friendly with RAG engines #202

huy-trn · 2024-12-10T16:56:53Z

The tool is already great, but larger repositories will never fit into an LLM's context window.

I know enterprise-level RAG systems have been around for a while, but sticking to user-friendly solutions, here’s what I’m thinking:

What to do: Make Repomix's output easier to parse and more meaningful when retrieved by a RAG engine, such as the "chat with documents" feature that's common in most LLM applications.
How to do it: Split large code files into smaller chunks using the AST, then merge them back into a single output, with separators between chunks that most text splitters can recognize.
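
A minimal sketch of the split-then-merge idea, assuming the chunk boundaries have already been computed by an AST parser. The `CodeChunk` shape, the function names, and the `=====CHUNK=====` separator are all hypothetical and are not part of Repomix:

```typescript
// Hypothetical shape: one AST-derived chunk of a source file.
interface CodeChunk {
  filePath: string;
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
  text: string;
}

// Assumption: the line ranges come from an AST parser (for example, one range
// per top-level function or class). Here they are simply passed in by hand.
function sliceChunks(
  filePath: string,
  source: string,
  ranges: Array<[number, number]>,
): CodeChunk[] {
  const lines = source.split("\n");
  return ranges.map(([startLine, endLine]) => ({
    filePath,
    startLine,
    endLine,
    text: lines.slice(startLine - 1, endLine).join("\n"),
  }));
}

// Hypothetical separator: a marker line surrounded by blank lines, which a
// separator-based text splitter could be configured to split on.
const CHUNK_SEPARATOR = "\n\n=====CHUNK=====\n\n";

// Merge the chunks back into a single output, labelling each chunk with its origin.
function mergeChunks(chunks: CodeChunk[]): string {
  return chunks
    .map((c) => `// ${c.filePath}:${c.startLine}-${c.endLine}\n${c.text}`)
    .join(CHUNK_SEPARATOR);
}

const source = [
  "function a() {",
  "  return 1;",
  "}",
  "",
  "function b() {",
  "  return 2;",
  "}",
].join("\n");

const chunks = sliceChunks("src/example.ts", source, [[1, 3], [5, 7]]);
const merged = mergeChunks(chunks);
console.log(merged);
```

Labelling each chunk with its file path and line range keeps a retrieved chunk traceable to its source even when it is read in isolation.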

I’ve recently tried Langchain’s source code loader, and this approach should be easy to implement with a few additional dependencies.

If this sounds good, I’d be happy to open a PR for it! Let me know your thoughts.

yamadashy · 2024-12-11T15:35:22Z

Hi, @tranquochuy645 !
Thank you for this great proposal! I completely agree that improving RAG compatibility would be valuable for handling larger codebases.

I see you're considering using AST for code splitting - I'm curious about your planned implementation approach. Since Repomix is a Node.js tool, I'd like to avoid introducing Python dependencies (like Langchain's RecursiveCharacterTextSplitter) as that would require users to maintain both Node.js and Python environments.

I've been thinking tree-sitter could be a good fit here since it:

Provides accurate AST parsing for multiple languages
Is well-maintained and reliable

What are your thoughts on implementation details? I'm curious to hear more about how you're planning to handle the splitting logic.

I'm excited about this feature and looking forward to hearing your ideas!

huy-trn · 2024-12-11T18:23:19Z

Thanks so much for the encouragement, @yamadashy - really appreciate it!

I totally get that adding Python dependencies to a NodeJS package is a no-go.

After digging deeper into the LangChain codebase, I found two approaches they use for code parsing:

RecursiveCharacterTextSplitter: It uses language-specific separators and splits code like plain text.
Source Code: This one uses Tree-sitter for parsing.

The first option isn’t great, and the second isn’t supported in LangChainJS yet.
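
To illustrate why separator-based splitting treats code like plain text, here is a simplified sketch of the general technique. It is not LangChain's implementation: the separator list is an assumption, and unlike the real splitter it does not merge small pieces back together.

```typescript
// Assumed separators for JavaScript-like code, ordered from coarse to fine.
const SEPARATORS = ["\nclass ", "\nfunction ", "\n\n", "\n", " "];

// Split `text` into pieces of at most `maxLen` characters by trying each
// separator in turn. Each separator stays attached to the start of the piece
// that follows it, so concatenating the pieces reproduces the input exactly.
function splitRecursively(
  text: string,
  maxLen: number,
  separators: string[] = SEPARATORS,
): string[] {
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];

  const sep = separators[0];
  const rest = separators.slice(1);
  if (sep === undefined) {
    // No separators left: fall back to hard cuts at maxLen.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) {
      out.push(text.slice(i, i + maxLen));
    }
    return out;
  }

  // Cut before every occurrence of the separator (ignoring one at index 0).
  const pieces: string[] = [];
  let start = 0;
  let idx = text.indexOf(sep, 1);
  while (idx !== -1) {
    pieces.push(text.slice(start, idx));
    start = idx;
    idx = text.indexOf(sep, idx + sep.length);
  }
  pieces.push(text.slice(start));

  // Pieces that are still too long are split again with the finer separators.
  return pieces.flatMap((p) => splitRecursively(p, maxLen, rest));
}

const sample =
  "function a() {\n  return 1;\n}\n\nfunction b() {\n  return 2;\n}\n";
const pieces = splitRecursively(sample, 30);
console.log(pieces);
```

Because the splitter only sees characters, a small enough `maxLen` will cut a function body in the middle, which is the weakness an AST-based splitter avoids.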

But there are some tools we can use:

node-tree-sitter: A Tree-sitter implementation for Node.js.
The Tree-sitter queries from Python LangChain, like this one.

My plan is to mimic the LangChain language parser module in JavaScript for the splitting part; the rest is pretty straightforward.

For now, I’ll keep exploring Repomix’s codebase. Once I have a better understanding, I’ll open a draft PR so we can chat more about the implementation details.

huy-trn · 2024-12-11T19:20:04Z

Also, node-tree-sitter's documentation is not that great, especially for the Query API.

Posting this link here for later investigations.

tree-sitter/node-tree-sitter#70 (comment)

yamadashy · 2024-12-13T09:52:08Z

Thank you for such detailed research! I really appreciate your thorough investigation into potential approaches. I've been wanting to tackle this issue but hadn't been able to start, so this is incredibly helpful.

Speaking of Tree-sitter implementations, I recently came across an interesting article about how Aider uses Tree-sitter for their codebase analysis:
https://aider.chat/2023/10/22/repomap.html

Also, Cline is a great JavaScript-based reference for Tree-sitter implementation:
https://github.com/cline/cline/tree/main/src/services/tree-sitter

It might provide some useful insights for our implementation.

I completely agree that handling large codebases is currently Repomix's biggest challenge. I'm very excited to move forward with this enhancement.

huy-trn · 2025-02-04T04:19:57Z

Hello @yamadashy,

I’ve been having trouble finding time to work on this, though I’m still really excited about it!

Here’s an update:

I’ve managed to get some functions running locally, mostly thanks to your suggestion about Cline. The code now extracts function signatures, so it's pretty close to what was discussed in #164.

I’m thinking of breaking this down into two features:

Code compression: As discussed in Feature Request: Code Detail Level #164, we could add a new command-line option that filters out everything except function signatures, interfaces, typedefs, etc., to reduce data size.
RAG-friendly output: Add JSON as a fourth output format (along with XML, text, and MD). The data in this format will be structured as an array of code chunks, which should make it easy to be imported into Elasticsearch or a similar service.
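
A rough sketch of what the JSON chunk array could look like. The field names, the `repomix-chunks` index name, and the helper functions are all hypothetical, since no schema has been agreed on. The second helper follows the newline-delimited layout of Elasticsearch's `_bulk` endpoint (an action line followed by a document line), as one possible import path:

```typescript
// Hypothetical chunk schema; every field name here is an assumption.
interface CodeChunk {
  filePath: string;
  language: string;
  kind: "function" | "class" | "interface" | "type" | "other";
  name: string;
  startLine: number;
  endLine: number;
  content: string;
}

// Serialize the chunks as a single JSON array (the proposed output format).
function toJsonOutput(chunks: CodeChunk[]): string {
  return JSON.stringify(chunks, null, 2);
}

// Serialize the chunks as newline-delimited JSON for a bulk import:
// one action line plus one document line per chunk, ending with a newline.
function toBulkNdjson(chunks: CodeChunk[], indexName: string): string {
  const lines = chunks.flatMap((chunk) => [
    JSON.stringify({ index: { _index: indexName } }),
    JSON.stringify(chunk),
  ]);
  return lines.join("\n") + "\n";
}

const exampleChunks: CodeChunk[] = [
  {
    filePath: "src/math.ts",
    language: "typescript",
    kind: "function",
    name: "add",
    startLine: 1,
    endLine: 3,
    content: "export function add(a: number, b: number): number {\n  return a + b;\n}",
  },
  {
    filePath: "src/math.ts",
    language: "typescript",
    kind: "interface",
    name: "Point",
    startLine: 5,
    endLine: 8,
    content: "export interface Point {\n  x: number;\n  y: number;\n}",
  },
];

const jsonOutput = toJsonOutput(exampleChunks);
const bulkOutput = toBulkNdjson(exampleChunks, "repomix-chunks");
console.log(jsonOutput);
```

Keeping the path, symbol name, and line range next to each chunk's content would let a retrieval service filter or rank on metadata rather than on raw text alone.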

I’d love to hear your thoughts on this direction and how we can move forward.

yamadashy · 2025-02-04T15:12:14Z

Thanks for the update @huy-trn !

What a coincidence - I was actually about to contact you after seeing your tree-sitter implementation in the network graph!
https://github.com/yamadashy/repomix/network

Regarding your two proposals:

Code compression
RAG-friendly JSON output

I'd like to move forward with both features, with a slight preference for tackling the Code Details feature first. However, feel free to work on whichever you find more approachable or interesting!

I'll explore what kind of CLI options we should provide for this feature.
Feel free to submit a PR even before implementing the functionality - we can merge it at that stage and continue from there.

Let's keep collaborating and make Repomix even better!

VideoScape · 2025-02-06T20:38:14Z

I recently ran into the token limit and had to move away from Repomix because of it. I was looking for a way to turn off the file output so I could feed files to the LLM by hand, but the Aider approach, which is what I'm using now, would be much better.

I am going to keep an eye on this, as I would love to keep the RAG and code compression on and turn off the complete file output.

I am surprised how fresh this is, and I am going to keep my eyes on this 200%.

SpyCoder77Alt · 2025-02-11T14:02:04Z

I can help. Let me know, and ping @SpyC0der77 (my main) to let me know what to work on.

huy-trn · 2025-02-11T17:12:54Z

Hello @SpyC0der77,

Thanks so much!

I’m working on this PR #336, but it’s still a draft, and honestly, the code is ugly. Unfortunately, I won’t be able to finish it anytime soon due to other commitments.

It would be great if you could fork the branch and take it from there (#336) or even start fresh if you prefer. I’d be really happy to see the idea move forward!

Edit:

Adding Tree-sitter queries for more languages, or improving the existing ones, is also much needed, and that work can easily be merged in!

I should be able to get back to this in 2 or 3 weeks.

SpyCoder77Alt · 2025-02-11T19:58:50Z

I don't know how to make RAG... I can try and learn though

huy-trn changed the title ~~Idea: Pre-processing to make the output more friendly with RAG engines~~ Idea: Make the output more friendly with RAG engines Dec 10, 2024

huy-trn changed the title ~~Idea: Make the output more friendly with RAG engines~~ Idea: Add an option to make the output more friendly with RAG engines Dec 10, 2024

yamadashy added the idea label Dec 11, 2024

yamadashy added the needs discussion Issues needing discussion and a decision to be made before action can be taken label Dec 11, 2024

huy-trn mentioned this issue Feb 4, 2025

Code compression #336

Merged

2 tasks

yamadashy pinned this issue Feb 7, 2025
