
parseflow
io.github.Libres-coder/parseflow
PDF parsing server with text extraction, metadata, search, images, and TOC via MCP
ParseFlow
Universal document parsing library for PDF, Word, and Excel files
ParseFlow is a comprehensive document parsing solution that supports PDF, Word (docx), and Excel (xlsx/xls) files. It provides both a standalone library and an MCP (Model Context Protocol) server for AI assistants.
Chinese Documentation | Examples | GitHub
Features
PDF Support
- Text extraction with multiple strategies (raw, formatted, clean)
- Page-specific and range-based extraction
- Metadata retrieval (title, author, dates, page count)
- Full-text search with context
- Image extraction (placeholder)
- Table of contents (TOC) extraction (placeholder)
Word (docx) Support
- Text extraction
- HTML conversion
- Metadata retrieval
- Text search with context
Excel (xlsx/xls) Support
- Multi-sheet data extraction
- Multiple output formats (JSON, CSV, Text)
- Sheet-specific extraction
- Cell-based search
- Range extraction
- Workbook metadata
MCP Server
- 9 tools for AI assistants (5 PDF + 2 Word + 2 Excel)
- Works with Claude Desktop and other MCP clients
- Path security with allowlist support
Installation
Core Library
npm install parseflow-core
MCP Server (Global)
npm install -g parseflow-mcp-server
Or use with npx:
npx parseflow-mcp-server
Quick Start
PDF Parsing
import { PDFParser } from 'parseflow-core';
const parser = new PDFParser();
// Extract all text
const text = await parser.extractText('document.pdf');
// Extract specific page
const page5 = await parser.extractPage('document.pdf', 5);
// Search
const results = await parser.search('document.pdf', 'keyword');
// Get metadata
const metadata = await parser.getMetadata('document.pdf');
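This README does not document the shape of the values returned by `search` (or by `searchText` on the Word and Excel parsers). As a rough illustration of what "search with context" can mean on already-extracted text, here is a minimal sketch. The `ContextMatch` type, the `searchWithContext` function, and the `radius` parameter are hypothetical names introduced for this example and are not part of parseflow-core.

```typescript
// Hypothetical result shape for illustration; not taken from parseflow-core.
interface ContextMatch {
  index: number;   // character offset of the match in the text
  match: string;   // the matched text as it appears in the source
  context: string; // the match plus up to `radius` characters on each side
}

// Case-insensitive substring search over plain text.
// Assumes lowercasing does not change string length (true for ASCII text).
function searchWithContext(text: string, query: string, radius = 20): ContextMatch[] {
  const results: ContextMatch[] = [];
  if (query.length === 0) return results;

  const haystack = text.toLowerCase();
  const needle = query.toLowerCase();

  let from = 0;
  while (true) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;

    const start = Math.max(0, index - radius);
    const end = Math.min(text.length, index + query.length + radius);
    results.push({
      index,
      match: text.slice(index, index + query.length),
      context: text.slice(start, end),
    });

    // Continue after this match so overlapping hits are not reported twice.
    from = index + needle.length;
  }
  return results;
}

const sample = "The annual budget was approved. Budget cuts follow.";
for (const hit of searchWithContext(sample, "budget", 5)) {
  console.log(hit.index, JSON.stringify(hit.context));
}
```

A function like this could be applied to the string returned by a text-extraction call, assuming that call yields plain text.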
Word Parsing
import { WordParser } from 'parseflow-core';
const parser = new WordParser();
// Extract text
const result = await parser.extractText('report.docx');
console.log(result.text);
// Convert to HTML
const html = await parser.extractHTML('report.docx');
// Search
const matches = await parser.searchText('report.docx', 'budget');
Excel Parsing
import { ExcelParser } from 'parseflow-core';
const parser = new ExcelParser();
// Extract all sheets (JSON format)
const data = await parser.extractData('spreadsheet.xlsx');
// Extract specific sheet
const sales = await parser.extractData('data.xlsx', {
  sheetName: 'Q4 Sales',
  format: 'json'
});
// Search in cells
const results = await parser.searchText('data.xlsx', 'revenue');
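The feature list mentions JSON, CSV, and text output, but the exact structure returned by `extractData` is not shown here. If the JSON output is (or can be mapped to) an array of row objects, converting it to CSV could look like the sketch below. The `Row` type and the `rowsToCsv` helper are hypothetical and only illustrate the idea; quoting follows the common RFC 4180 convention.

```typescript
// Hypothetical row shape: one object per spreadsheet row, keyed by column header.
type Cell = string | number | boolean | null;
type Row = Record<string, Cell>;

// Quote a field only when it contains a comma, a quote, or a line break.
function escapeCsvField(value: Cell): string {
  const s = value === null ? "" : String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Serialize rows to CSV, using the keys of the first row as the header line.
function rowsToCsv(rows: Row[]): string {
  const first = rows[0];
  if (first === undefined) return "";

  const headers = Object.keys(first);
  const lines = [headers.map(escapeCsvField).join(",")];
  for (const row of rows) {
    // Missing cells are written as empty fields.
    lines.push(headers.map((h) => escapeCsvField(row[h] ?? null)).join(","));
  }
  return lines.join("\r\n");
}

console.log(
  rowsToCsv([
    { product: "Widget, large", revenue: 1200 },
    { product: 'He said "hi"', revenue: null },
  ])
);
```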
MCP Server Usage
Configuration for Claude Desktop
Add to claude_desktop_config.json:
{
  "mcpServers": {
    "parseflow": {
      "command": "npx",
      "args": ["-y", "parseflow-mcp-server"],
      "env": {
        "PARSEFLOW_ALLOWED_PATHS": "C:\\Documents;D:\\Projects"
      }
    }
  }
}
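To make the allowlist idea concrete, here is a minimal sketch of how a semicolon-separated value like the one in `PARSEFLOW_ALLOWED_PATHS` could be parsed and checked. This is not ParseFlow's actual implementation: `parseAllowedPaths` and `isPathAllowed` are hypothetical names, and the sketch uses Node's `path.win32` so the Windows-style paths from the config behave the same on any platform.

```typescript
import * as path from "node:path";

// Split a semicolon-separated list into normalized absolute directories.
// Empty entries (for example from a trailing ";") are ignored.
function parseAllowedPaths(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(";")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => path.win32.resolve(entry));
}

// True when `target` resolves to an allowed directory or to something inside one.
function isPathAllowed(target: string, allowedDirs: string[]): boolean {
  const resolved = path.win32.resolve(target);
  return allowedDirs.some((dir) => {
    const rel = path.win32.relative(dir, resolved);
    if (rel === "") return true; // the allowed directory itself
    // A leading ".." segment means the target is outside `dir`;
    // an absolute result means it is on a different drive.
    const escapes = rel.split(path.win32.sep)[0] === "..";
    return !escapes && !path.win32.isAbsolute(rel);
  });
}

const allowed = parseAllowedPaths("C:\\Documents;D:\\Projects");
console.log(isPathAllowed("C:\\Documents\\reports\\q4.pdf", allowed)); // true
console.log(isPathAllowed("C:\\Documents\\..\\secret.pdf", allowed)); // false
```

Resolving the path before comparing it is what rejects `..` traversal. The sketch does not follow symbolic links, which a hardened check would also need to consider.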
Available Tools
PDF Tools
- extract_text - Extract text from PDF files
- search_pdf - Search for keywords in PDF
- get_metadata - Get PDF metadata
- extract_images - Extract images from PDF
- get_toc - Get table of contents
Word Tools
- extract_word - Extract text/HTML from Word documents
- search_word - Search in Word documents
Excel Tools
- extract_excel - Extract data from Excel spreadsheets
- search_excel - Search in Excel cells
Example Usage in Claude
"Please read the contents of report.docx"
→ Uses extract_word tool
"Find 'Product A' in sales.xlsx"
→ Uses search_excel tool
"Extract the metadata of document.pdf"
→ Uses get_metadata tool
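An MCP client such as Claude normally picks the tool on its own. For scripts that call the server directly, the relationship between file type and extraction tool implied by the tool list can be sketched as a simple lookup. The tool names come from the list above; the `pickExtractTool` helper and the extension table are hypothetical and not part of the server.

```typescript
// Extension-to-tool table built from the tool names listed in this README.
const EXTRACT_TOOL_BY_EXTENSION = new Map<string, string>([
  ["pdf", "extract_text"],
  ["docx", "extract_word"],
  ["xlsx", "extract_excel"],
  ["xls", "extract_excel"],
]);

// Return the extraction tool for a file name, or undefined if the
// extension is missing or not one of the supported formats.
function pickExtractTool(filename: string): string | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot === -1 || dot === filename.length - 1) return undefined;
  const ext = filename.slice(dot + 1).toLowerCase();
  return EXTRACT_TOOL_BY_EXTENSION.get(ext);
}

console.log(pickExtractTool("report.docx"));
console.log(pickExtractTool("sales.XLSX"));
console.log(pickExtractTool("notes.txt"));
```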
Documentation
- Office Examples - Word and Excel usage examples
- Release Guide - How to publish new versions
- Contributing - Contribution guidelines
- Security Policy - Security vulnerability reporting
- Code of Conduct - Community guidelines
Project Structure
ParseFlow/
├── packages/
│   ├── pdf-parser-core/     # Core library (parseflow-core)
│   │   ├── src/
│   │   │   ├── parser.ts        # PDF parser
│   │   │   ├── WordParser.ts    # Word parser
│   │   │   └── ExcelParser.ts   # Excel parser
│   │   └── package.json
│   └── mcp-server/          # MCP server (parseflow-mcp-server)
│       ├── src/
│       │   ├── index.ts         # Server entry
│       │   └── tools/           # MCP tools
│       └── package.json
├── docs/                    # Documentation
├── examples/                # Usage examples
├── tests/                   # Test files
└── scripts/                 # Build scripts
Testing
# Run all tests
pnpm test
# Test coverage
pnpm test:coverage
# Run specific test
pnpm test parser.test.ts
Test Files
- Wordๆต่ฏๆไปถ.docx - Word test document
- Excelๆต่ฏๆไปถ.xlsx - Excel test workbook (3 sheets)
- PDFๆต่ฏๆๆกฃ.pdf - PDF test document
Development
# Install dependencies
pnpm install
# Build all packages
pnpm build
# Watch mode
pnpm dev
# Lint
pnpm lint
# Type check
pnpm type-check
Roadmap
v1.1.0 (Current)
- Word (docx) support
- Excel (xlsx/xls) support
- 9 MCP tools
v1.2.0 (Planned)
- Encrypted PDF support
- OCR text recognition
- PowerPoint (pptx) support
- Batch processing optimization
v2.0.0 (Future)
- Plugin system
- More document formats (CSV, TXT, RTF)
- Advanced table extraction
- Document conversion
Contributing
We welcome contributions! Please see CONTRIBUTING.md for details.
Ways to Contribute
- Report bugs
- Suggest features
- Improve documentation
- Submit pull requests
Packages
| Package | Version | Description |
|---|---|---|
| parseflow-core | 1.0.1 | Core parsing library |
| parseflow-mcp-server | 1.0.2 | MCP server for AI |
Links
- npm Core: https://www.npmjs.com/package/parseflow-core
- npm MCP: https://www.npmjs.com/package/parseflow-mcp-server
- GitHub: https://github.com/Libres-coder/ParseFlow
- Issues: https://github.com/Libres-coder/ParseFlow/issues
- MCP Registry: https://registry.modelcontextprotocol.io/
License
MIT License - see LICENSE file for details.
Acknowledgments
- pdf-parse - PDF parsing
- pdf-lib - PDF manipulation
- mammoth - Word document parsing
- xlsx - Excel spreadsheet parsing
- MCP SDK - Model Context Protocol
Stats
- Test Coverage: 83%+
- Supported Formats: 3 (PDF, Word, Excel)
- MCP Tools: 9
- Dependencies: Minimal and well-maintained
Community
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ by Libres-coder
Status: Production Ready (v1.1.0)