An AI-powered system for analyzing competitive landscapes in drug development, focusing on specific molecular targets. This tool automates the collection, analysis, and scoring of therapeutic assets to provide comprehensive competitive intelligence for pharmaceutical research and development.
This system takes a molecular target (e.g., CD47, KRAS, PD-1) as input and generates a detailed competitive analysis by:
- Accepting input as a .txt, .csv, or .json file listing your molecular targets. For .csv and .json files, each entry must include a target or molecular_target field; for .txt files, list one target name per line
- Scraping data from multiple authoritative sources
- Analyzing and extracting relevant drug development information
- Normalizing data across different formats and terminologies
- Calculating competitive scores and identifying strategic opportunities
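The input formats described above could be read along the following lines. This is a minimal sketch, not the project's actual loader: the function name `load_targets` is hypothetical, and the .json layout (an array of objects) is an assumption, since only the target and molecular_target field names are specified here.

```python
import csv
import json
from pathlib import Path


def load_targets(path):
    """Return the target names found in a .txt, .csv, or .json file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        # One target per line; blank lines are skipped.
        lines = path.read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    elif suffix == ".json":
        # Assumed layout: a JSON array of objects, one per target.
        rows = json.loads(path.read_text(encoding="utf-8"))
    else:
        raise ValueError(f"Unsupported input format: {suffix}")
    targets = []
    for row in rows:
        # Accept either of the two documented field names.
        name = row.get("target") or row.get("molecular_target")
        if name:
            targets.append(name.strip())
    return targets
```

Entries that carry neither field are skipped silently in this sketch; a stricter loader might raise an error instead.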
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python run_ui.py
For more commands, such as file input, see the setup instructions below.
The system uses LangGraph to orchestrate a multi-step workflow through specialized AI agents:
Target Input → Scraper → Analyzer → Normalizer → Scorer → Final Report
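To illustrate the shape of this flow, here is a plain-Python stand-in in which each stage receives a shared state dictionary and returns an updated copy. It deliberately does not use the LangGraph API, and the names `make_stage`, `run_pipeline`, and the `completed` key are hypothetical; the stage bodies are placeholders that only record that they ran.

```python
# Stage order taken from the workflow line above.
STAGES = ["scraper", "analyzer", "normalizer", "scorer"]


def make_stage(name):
    """Build a placeholder stage that records its own name in the state."""
    def stage(state):
        # A real stage would attach its results (raw data, assets, scores) here.
        return {**state, "completed": state["completed"] + [name]}
    return stage


def run_pipeline(target, stages):
    """Pass one state dictionary through every stage in order."""
    state = {"target": target, "completed": []}
    for stage in stages:
        state = stage(state)
    return state


if __name__ == "__main__":
    result = run_pipeline("CD47", [make_stage(name) for name in STAGES])
    print(result["completed"])
```

Returning a new dictionary from each stage, rather than mutating the input, keeps every step's output inspectable on its own.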
For data scraping, we use four primary sources:
- ClinicalTrials.gov: US and international clinical trials
- EUCTR: European Union clinical trials register
- PubMed: Biomedical literature for drug development publications
- Google Patents: Extracts patent publication numbers from the initial patent query results, then scrapes abstracts and metadata for each patent using a custom extraction tool. This enables the system to analyze the intellectual property landscape alongside clinical and scientific data.
Note: We do not use any third-party scraping services like Firecrawl or GPT-based scrapers. Instead, we have identified and reverse-engineered the APIs ourselves by inspecting network activity (e.g., using browser network tabs). We handle the different types of responses from each source appropriately, and finally convert all collected data into a unified JSON format for downstream processing.
From PubMed, the project collects abstracts, titles, authors, and publication dates. We then filter the records, separate out the abstracts, and send them to the AI for analysis.
For ClinicalTrials.gov and EUCTR, we use their respective APIs to fetch structured data. For PubMed, we utilize the pymed_paperscraper library to access abstracts and metadata.
Note: There may be fluctuations in the scraped data returned from external sources. This can lead to slight changes in the number of identified assets, which in turn may cause minor variations in the calculated competitive score each time the analysis is run.
- We use OpenAI's GPT models to analyze the scraped data.
- Because AI models like GPT have a limit on how much text they can process at once, the data is split into smaller groups (batches) of 25 articles.
- Each batch is sent to the AI, which reads and analyzes the articles.
- The AI filters out studies that aren’t about therapies (like those only about diagnostics or biomarkers).
- For the relevant data, the AI pulls out important details such as drug name, how the drug works (modality and mechanism of action), sponsor, what disease it treats (indication), its status, and information about licensing or acquisitions. The result is a structured, easy-to-use summary of drug information.
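The batching step described above can be sketched as follows. The helper name `make_batches` is hypothetical; only the batch size of 25 comes from the description, and the call that sends each batch to the model is left out.

```python
def make_batches(articles, batch_size=25):
    """Split a list of articles into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        articles[start:start + batch_size]
        for start in range(0, len(articles), batch_size)
    ]


if __name__ == "__main__":
    abstracts = [f"abstract {i}" for i in range(60)]
    for batch in make_batches(abstracts):
        # Each batch would be sent to the model in a separate request.
        print(len(batch))
```

With 60 articles this yields three batches of 25, 25, and 10 items; the last batch is simply shorter instead of being padded.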
- Extracts the list of assets (therapeutic candidates) from the input data
- Unifies drug names, recognizing aliases and alternative spellings (e.g., "ALX148" → "Evorpacept (ALX148)")
- Standardizes clinical trial phases (e.g., "Phase I/II" becomes "Phase II", missing values become "Preclinical")
- Normalizes mechanism of action (MoA) into a set of standard categories, or keeps the original if it doesn't match
- Standardizes modality (e.g., "mAb", "Bispecific mAb", "Fc-fusion Protein"), or keeps the original if unrecognized
- Groups the output by clinical phase and removes exact duplicate entries
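The phase standardization and grouping steps above might look like this. It is a sketch under stated assumptions: a combined label is mapped to the highest phase it mentions (consistent with "Phase I/II" becoming "Phase II"), unrecognized labels are kept as-is, and the asset field names `name` and `phase` are hypothetical.

```python
import re

# Map Roman or Arabic phase tokens to a rank, and ranks to standard labels.
PHASE_RANK = {"I": 1, "II": 2, "III": 3, "1": 1, "2": 2, "3": 3}
PHASE_LABEL = {1: "Phase I", 2: "Phase II", 3: "Phase III"}


def normalize_phase(raw):
    """Return a standard phase label; missing values become 'Preclinical'."""
    if not raw or not raw.strip():
        return "Preclinical"
    text = raw.strip().upper()
    if "APPROVED" in text:
        return "Approved"
    if "PRECLINICAL" in text:
        return "Preclinical"
    tokens = re.findall(r"\b(III|II|I|[123])\b", text)
    if tokens:
        # Assumption: a combined label resolves to the highest phase mentioned.
        return PHASE_LABEL[max(PHASE_RANK[token] for token in tokens)]
    return raw.strip()  # Unrecognized labels are kept unchanged.


def group_by_phase(assets):
    """Group asset dicts by normalized phase, dropping exact duplicates."""
    grouped = {}
    seen = set()
    for asset in assets:
        record = dict(asset, phase=normalize_phase(asset.get("phase")))
        key = tuple(sorted(record.items()))  # Requires hashable field values.
        if key in seen:
            continue
        seen.add(key)
        grouped.setdefault(record["phase"], []).append(record)
    return grouped
```

Deduplication here runs after normalization, so two records that differ only in how the phase was written collapse into one.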
- Weighted Scoring Algorithm: Assigns different weights to clinical phases
- Crowding Score Calculation: Produces 0-1 scale competitive intensity score
- White Space Analysis: Identifies underexplored therapeutic opportunities
- ClinicalTrials.gov (US)
  - Coverage: US and international clinical trials
  - API Endpoint: https://clinicaltrials.gov/api/int/studies
  - Data Fields: 30+ fields including phases, status, interventions, sponsors
  - Search Strategy: Condition-based queries with spell-checking enabled
  - Limits: 100 studies per query (expandable)
- EUCTR (European Union Clinical Trials Register)
  - Coverage: EU clinical trials
  - API Endpoint: https://www.clinicaltrialsregister.eu/ctr-search/rest/download/summary
  - Format: JSON summary downloads
  - Scope: European regulatory submissions
- PubMed (Biomedical Literature)
  - Coverage: Global biomedical research publications
  - Library: pymed_paperscraper for API access
  - Data: Abstracts, titles, authors, publication dates
  - Limits: 500 articles per query (configurable)
  - Processing: Batch processing for large result sets
- Google Patents
  - Coverage: Global database of patents, including biomedical, pharmaceutical, and biotech innovations
  - Access Method: Custom scraper or API to extract metadata and full-text when available
- Cross-Validation: Information verified across multiple sources
- Recency Filters: Prioritizes recent publications and trials
- Relevance Scoring: AI-powered relevance assessment
- Duplicate Removal: Automated deduplication across sources
The competitive crowding score is calculated using a weighted phase-based approach:
Crowding Score = (Σ(Phase_Weight × Asset_Count)) / Maximum_Possible_Score
- Preclinical: 0.2 (early-stage, high uncertainty)
- Phase I: 0.4 (safety established, moderate commitment)
- Phase II: 0.6 (efficacy signals, significant investment)
- Phase III: 0.8 (late-stage, high probability of success)
- Approved: 1.0 (market-validated, maximum competitive impact)
- 0.0 - 0.3: Low competition (Blue Ocean)
- 0.3 - 0.6: Moderate competition (Strategic opportunities exist)
- 0.6 - 0.8: High competition (Crowded field)
- 0.8 - 1.0: Hyper-competitive (Saturated market)
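The formula, weights, and interpretation bands above can be worked through in a short sketch. Maximum_Possible_Score is not defined in this document, so the code assumes it is the score the same number of assets would earn if all were Approved (total assets × 1.0). The function names are hypothetical, and each band is treated as including its lower bound, which is also an assumption.

```python
# Phase weights as listed above.
PHASE_WEIGHTS = {
    "Preclinical": 0.2,
    "Phase I": 0.4,
    "Phase II": 0.6,
    "Phase III": 0.8,
    "Approved": 1.0,
}


def crowding_score(phase_counts):
    """Weighted asset count divided by the assumed maximum possible score."""
    total_assets = sum(phase_counts.values())
    if total_assets == 0:
        return 0.0
    # Phases missing from PHASE_WEIGHTS raise KeyError in this sketch.
    weighted = sum(
        PHASE_WEIGHTS[phase] * count for phase, count in phase_counts.items()
    )
    # Assumption: the maximum is reached when every asset is Approved.
    max_possible = total_assets * max(PHASE_WEIGHTS.values())
    return weighted / max_possible


def interpret(score):
    """Map a score to the interpretation bands listed above."""
    if score < 0.3:
        return "Low competition (Blue Ocean)"
    if score < 0.6:
        return "Moderate competition (Strategic opportunities exist)"
    if score < 0.8:
        return "High competition (Crowded field)"
    return "Hyper-competitive (Saturated market)"


if __name__ == "__main__":
    counts = {"Preclinical": 4, "Phase I": 3, "Phase II": 2, "Phase III": 1}
    score = crowding_score(counts)
    print(round(score, 2), interpret(score))
```

For the sample counts, the weighted sum is 0.8 + 1.2 + 1.2 + 0.8 = 4.0 over a maximum of 10, giving a score of about 0.4, which falls in the moderate band. Under this normalization any non-empty landscape scores at least 0.2, so a different definition of the maximum would shift where real targets land.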
The system identifies strategic opportunities through:
- Modality Gaps: Missing therapeutic approaches (e.g., no small molecules in advanced phases)
- Indication Gaps: Underserved disease areas
- Mechanism Gaps: Novel or underexplored mechanisms of action
- Phase Gaps: Stages with few competitors
- Geographic Gaps: Regional development disparities
- Total Competitors: Unique drug assets identified
- Phase Distribution: Asset count by development stage
- Modality Diversity: Range of therapeutic approaches
- Acquisition Signals: M&A activity indicators
- Assumption: ClinicalTrials.gov, EUCTR, and PubMed collectively capture >90% of relevant drug development activity
- Limitation: Some proprietary or early-stage programs may not be publicly disclosed
- Mitigation: Cross-referencing multiple sources to minimize gaps
- Assumption: OpenAI GPT models can accurately identify and extract drug development information
- Validation: Structured prompts with explicit criteria and examples
- Error Handling: Multiple validation passes and format verification
- Assumption: Later-stage assets represent stronger competitive threats
- Rationale: Higher investment, lower failure rates, greater market impact
- Customization: Weights can be adjusted for specific therapeutic areas
- Assumption: Assets targeting the same molecule compete directly
- Nuance: System accounts for different indications and mechanisms
- Differentiation: Modality and mechanism analysis identifies competitive positioning
- Design: System architecture is target-agnostic
- Scalability: Can analyze any molecular target with sufficient data
- Adaptability: Scoring methodology applies across therapeutic areas
- Assumption: Competitive dynamics remain relatively stable in 6-12 month timeframes
- Refresh Strategy: Designed for periodic re-analysis
- Trend Analysis: Historical data comparison capabilities
- Assumption: Clinical development activity predicts market competition
- Validation: Includes approved products and acquisition activity
- Business Intelligence: Incorporates commercial signals beyond clinical data
- Private Company Bias: Publicly traded companies more likely to disclose information
- Geographic Bias: English-language and Western database emphasis
- Temporal Lag: Publication delays may miss recent developments
- Indication Granularity: Broad disease categories may miss subspecialty competition
- Mechanism Complexity: AI may oversimplify complex mechanisms of action
- Real-Time Updates: Integration with news APIs and SEC filings
- Competitive Intelligence: Patent landscape analysis
- Market Sizing: Integration with epidemiological data
- Predictive Modeling: Machine learning for development success prediction
- Expert Validation: Human expert review workflows
pip install -r requirements.txt
Create a .env file with:
OPENAI_API_KEY=your_api_key_here
OPENAI_MODEL=gpt-4
You can provide a list of molecular targets in a .txt, .csv, or .json file (one target per line or entry). To run the analysis on all targets in the file, use:
python main.py path/to/your_targets.txt
# or
python main.py path/to/your_targets.csv
# or
python main.py path/to/your_targets.json
python main.py
Enter your target molecule when prompted (e.g., "CD47", "KRAS", "PD-1").
- output/analyzed_data.json: Raw extracted drug information
- output/normalized_data.json: Standardized and cleaned data
- output/competitive_analysis.json: Final competitive analysis with scores
- output/scraper-results/: Raw data from each source
- LangGraph: Agent orchestration and workflow management
- LangChain: AI model integration and prompt management
- OpenAI API: GPT models for data analysis
- Requests: HTTP client for API calls
- PyMed: PubMed API integration
competitive_landscape/
├── agent/ # AI agent system
│ ├── workflow.py # Main orchestration logic
│ ├── schema.py # Data models
│ └── tools/ # Individual processing tools
├── utils/ # Input/output helpers, utilities, and tokenizing logic
├── scraper/ # Data collection modules
├── output/ # Analysis results
└── main.py # Entry point
This tool is designed for competitive intelligence and research purposes. Results should be validated with additional sources and expert analysis before making strategic decisions.