I am trying to understand how tree-sitter works, but I ran into an issue when parsing "large" files. A file slightly over 32KB consistently causes tree-sitter to fail.
What I tested:
I wrote a minimal example that demonstrates the issue. It generates fake Java code of increasing size until parsing fails. Here is the code:
const Parser = require('tree-sitter');
const Java = require('tree-sitter-java');

const parser = new Parser();
parser.setLanguage(Java);

for (let i = 1925; i < 1930; i += 1) {
  const largeSource = 'class A { ' + 'void method() {} '.repeat(i) + '}';
  try {
    const tree = parser.parse(largeSource);
    console.log(`Success: Source size: ${largeSource.length} bytes`);
  } catch (err) {
    console.error(`Error: Source size: ${largeSource.length} bytes`);
  }
}
Output:
Success: Source size: 32736 bytes
Success: Source size: 32753 bytes
Error: Source size: 32770 bytes
Error: Source size: 32787 bytes
Error: Source size: 32804 bytes
The same issue occurs with tree-sitter-javascript: the last success is at 32,753 bytes and the first failure at 32,770 bytes, which suggests the threshold is 32,768 bytes (2^15).
My Questions:
- Is that expected behaviour? Am I missing something?
- Is this size limitation a documented behavior? Can it be configured?
My Use Case:
My goal is to create chunks of code and embed the chunks in a vector database, so I can easily find things in a large codebase.
I could split the files into chunks before parsing, but I'd like to avoid that unless necessary.
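For context, here is a rough sketch of the chunking I have in mind once parsing works (the node type method_declaration comes from the tree-sitter-java grammar; the embedding and database parts are left out):
// Collect the source text of every Java method as one chunk.
function extractMethodChunks(source) {
  const tree = parser.parse(source); // this is the call that fails for large files
  const chunks = [];
  const walk = (node) => {
    if (node.type === 'method_declaration') {
      chunks.push(source.slice(node.startIndex, node.endIndex));
    }
    for (const child of node.namedChildren) walk(child);
  };
  walk(tree.rootNode);
  return chunks;
}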
Any insights would be appreciated!
Answer:
Instead of passing the whole source string at once, you can feed it to the parser in chunks using the callback form of parse. This seems to work for me:
function parseLargeSource(source) {
  const chunkSize = 30000; // maximum length to return per callback invocation

  // parse() also accepts a callback: it is invoked repeatedly with the offset
  // it needs text for, and we return the next chunk (or null at end of input).
  const parsedTree = parser.parse((offset) => {
    if (offset < source.length) {
      return source.slice(offset, offset + chunkSize);
    }
    return null; // no more input
  });
  return parsedTree;
}
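For example, a quick check against one of the sizes that failed in the question (a sketch; "program" is the root node type in the tree-sitter-java grammar):
const largeSource = 'class A { ' + 'void method() {} '.repeat(1929) + '}';
const tree = parseLargeSource(largeSource);
console.log(`Parsed ${largeSource.length} bytes`); // 32804, which threw before
console.log(tree.rootNode.type); // "program"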
The function above was generated by DeepSeek R1, but I tested it and it does exactly what I expect it to. I know AI-generated answers are not allowed here, but since nobody answered the question within a week and this solution works, I feel it adds value for anyone hitting this issue in the future. (Getting there was not straightforward: other AI models, Claude 3.7 and o3-mini-high, did not find a working solution, and it took many prompts.) However, if the mods feel it makes sense to remove this answer, feel free to do so.
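Depending on your version of the node bindings, there may also be a simpler route: parse takes an optional third options argument, and some versions expose a bufferSize field on it, so raising that above the source length would avoid chunking entirely. I have not verified this across versions, so treat it as an assumption and check your version's typings first:
// Assumption: this node-tree-sitter version accepts a bufferSize option.
const bigTree = parser.parse(largeSource, null, { bufferSize: largeSource.length + 1 });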