Improve input data #4

Open
2 tasks
RusseII opened this issue Apr 18, 2023 · 2 comments

@RusseII
Member

RusseII commented Apr 18, 2023

Right now we have 2500 URLs, obtained by querying the https://defillama.com/docs/api endpoint for protocols and taking their URLs.

However, some protocols provided their landing page (root domain or www.domain, for example ribbon.finance) while others provided their app domain (app.domain.com). This makes the data slightly misleading / less informative, as companies are usually set up in one of two ways:

  1. A shared landing page + app domain. They use paths for routing and usually the same codebase. https://llamapay.io/ is an example of this.
  2. A separate domain for the landing page and the app. I think this is the most common. https://ribbon.finance and https://app.ribbon.finance are an example of this.

We should somehow segment the data between 'landing pages' and 'apps'.

If the site is doing method number 1, we can count it as an app.
If it's using method number 2, we should count their app domain as an app, but their landing page as a landing page.

As the output of this issue, I'm imagining we have all the same graphs as step 1, but one set of graphs for all 'landing pages' and another set for all 'apps'.

  • How do we know if a site uses method 1 or method 2?
  • How do we get the app domains for sites using method 2? (maybe we just use app.domain since that seems to be the most common)

Maybe we can write a quick script using ChatGPT where we feed in a bunch of root websites, and it pings the app domain and a few other common subdomains to see if the site has a separate app?
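A minimal sketch of what such a script could look like, assuming plain HTTPS HEAD requests are enough to tell whether a subdomain exists. The names `COMMON_APP_SUBDOMAINS`, `http_status` and `find_app_hosts` are hypothetical, and only `app` comes from this issue; the other subdomain guesses are assumptions.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional

# Hypothetical guesses; only "app" is named in the issue above.
COMMON_APP_SUBDOMAINS = ("app", "dapp", "beta")


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, just not with a 2xx/3xx
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, bad URL


def find_app_hosts(
    root_domain: str,
    subdomains: Iterable[str] = COMMON_APP_SUBDOMAINS,
    fetch_status: Callable[[str], Optional[int]] = http_status,
) -> Dict[str, int]:
    """Probe <sub>.<root_domain> and return the hosts that appear to exist."""
    found: Dict[str, int] = {}
    for sub in subdomains:
        host = f"{sub}.{root_domain}"
        status = fetch_status(f"https://{host}")
        # Assumption: any answer other than 404 means something is hosted there.
        if status is not None and status != 404:
            found[host] = status
    return found
```

Calling `find_app_hosts("ribbon.finance")` would hit the network; passing a different `fetch_status` makes it possible to dry-run the logic without sending requests. Some servers reject HEAD, so a GET fallback may be needed in practice.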

@Bookcliff
Contributor

Bookcliff commented Apr 18, 2023

Findings:

  • Initially used wappalyzer.com to pull all of the technologies for each URL entered
    • Based on my reading of their docs, it doesn't look like they used recursive webscraping because I'm not seeing subdomains included in the output
    • Wappalyzer has a subdomain API to pull subdomains of a particular domain

Possible method:

  1. IF site.status === "success":
  2. Create a function to loop through the URLs and clean them (remove http, https, www, etc.)
  3. Create a function to fetch app.DOMAIN (IF the URL doesn't already start with "app.")
    • IF app.domain returns anything other than 404 = the info in the json file is for the landing page (should not be limited to 200, because we may not have access, it may need special headers, etc.?)
    • IF app.domain returns 404 = the info in the json file MAY be for the app, but it could also be the landing page with the app under a different subdomain (other subdomains to try: staking, beta, dapp --> maybe push these to a list of domains and request subdomains just for them - will save credits)
    • IF we request the list of subdomains, pull the app domains (is there a good way to figure this out programmatically?)
  4. Push all apps and landing pages to separate lists
    • Request all app domains that we don't already have from Wappalyzer
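Steps 2 and 3 above could be sketched roughly as follows. This is an illustration under assumptions, not a finished implementation: `clean_url`, `app_candidate` and `interpret_probe` are hypothetical names, and treating an unreachable host the same as a 404 is an assumption the method above does not spell out.

```python
import re
from typing import Optional


def clean_url(url: str) -> str:
    """Strip scheme, path, query, port and a leading 'www.'; return the lowercase host."""
    host = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.strip().lower())
    host = re.split(r"[/?#]", host, maxsplit=1)[0]
    host = host.split(":", 1)[0]
    if host.startswith("www."):
        host = host[4:]
    return host


def app_candidate(host: str) -> Optional[str]:
    """Return 'app.<host>' to probe, or None if host already starts with 'app.'."""
    return None if host.startswith("app.") else f"app.{host}"


def interpret_probe(status: Optional[int]) -> str:
    """Interpret the status code returned by app.<host>.

    'landing' -> the data we already have describes the landing page.
    'unknown' -> it may describe the app, or the app lives under another subdomain.
    """
    # Assumption: no response at all (None) is handled like a 404.
    if status is None or status == 404:
        return "unknown"
    return "landing"
```

For example, `clean_url("https://www.ribbon.finance/")` gives `ribbon.finance`, and `app_candidate` then yields `app.ribbon.finance` as the host to request. Hosts that come back as "unknown" are the ones worth spending subdomain-lookup credits on.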

@RusseII
Member Author

RusseII commented Apr 19, 2023

> Based on my reading of their docs, it doesn't look like they used recursive webscraping because I'm not seeing subdomains included in the output

That's correct, I confirmed with their support.

> IF site.status === "success" :

We can just delete all the ones that aren't a success from our source data.

> Create function to loop through URLs and clean them (remove http, https, www, etc.)

Is this just getting the root domain for all of them? So it would strip app. etc. as well?

The approach seems good, but we might be able to simplify it by using Wappalyzer's subdomains API:

  1. Get all the root domains
  2. Use Wappalyzer to get their subdomains
  3. Optional: also ping the app subdomain to see if Wappalyzer missed any
  4. Classify the domains into landing or app
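Steps 1 and 4 might look something like the sketch below. It assumes the subdomain lookup has already produced a flat list of hosts (the Wappalyzer call itself is not shown), and it uses the subdomain names mentioned in this thread as the signal for "app". `APP_PREFIXES`, `root_domain` and `classify_hosts` are hypothetical names. The root-domain logic is deliberately naive and will be wrong for suffixes such as co.uk; a public-suffix-aware library would be needed for those.

```python
from typing import Dict, Iterable, List

# Subdomain names suggested in this thread; "app" is assumed to be the most common.
APP_PREFIXES = ("app", "dapp", "staking", "beta")


def root_domain(host: str) -> str:
    """Naive root domain: the last two labels of the host."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def classify_hosts(hosts: Iterable[str]) -> Dict[str, List[str]]:
    """Split hosts into 'landing' and 'app' lists, one decision per root domain."""
    by_root: Dict[str, List[str]] = {}
    for host in hosts:
        by_root.setdefault(root_domain(host), []).append(host.lower())

    result: Dict[str, List[str]] = {"landing": [], "app": []}
    for root, members in by_root.items():
        apps = sorted(
            {h for h in members if h != root and h.split(".")[0] in APP_PREFIXES}
        )
        if apps:
            # Method 2: separate app domain, so the root counts as a landing page.
            result["app"].extend(apps)
            result["landing"].append(root)
        else:
            # Method 1: shared landing page + app, counted as an app.
            result["app"].append(root)
    return result
```

Other subdomains (docs, blog, and so on) are ignored by the classification here, but `by_root` already groups them, so counting what else people host would be a small extension.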

It'd also be interesting to see what other subdomains people have (but that's less important).
