How Does Google Know Your Website

Randula Koralage
5 min read · Jan 11, 2020


We are all familiar with searching Google for whatever comes to mind. But how do Google and other search engines gather data from billions of websites?

Basically, search engines have three primary functions:

  1. Crawl: Explore the Internet for content, looking over the code/content for each URL they find.
  2. Index: Store and organize the content found during the crawling process.
  3. Rank: Provide the pieces of content that will best answer a searcher’s query, which means that results are ordered by most relevant to least relevant.

Google crawls the Internet constantly by sending out a team of robots, known as crawlers or spiders, to find new and updated content. The content varies and can take any format: a webpage, an image, a video, a PDF, and so on.

Googlebot starts out by fetching a few web pages and then follows the links on those pages to find new URLs. By hopping along this path of links, the crawler finds new content and adds it to Google's index, called Caffeine, a massive database of discovered URLs that can be retrieved later when a searcher is looking for information that the content at a URL matches well.

With this technique, Google will find your web content sooner or later. Still, you can speed up the process by submitting your website to the search engine yourself.

What is a sitemap?

A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them. Search engines like Google read this file to crawl your site more intelligently. A sitemap tells Google which pages and files you think are important on your site, and also provides valuable information about them: for example, when a page was last updated, how often it changes, and any alternate language versions of it.
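For reference, a minimal XML sitemap following the sitemaps.org protocol looks roughly like this (the URLs and dates are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One <url> entry per page you want search engines to know about -->
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2020-01-11</lastmod>        <!-- when the page was last updated -->
        <changefreq>weekly</changefreq>      <!-- how often the page changes -->
        <priority>1.0</priority>             <!-- relative importance within the site -->
      </url>
      <url>
        <loc>https://www.example.com/about</loc>
        <lastmod>2019-12-01</lastmod>
        <changefreq>monthly</changefreq>
      </url>
    </urlset>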

Sitemaps are helpful if a site has dynamic content, is new and does not have many links to it, or contains a lot of archived content that is not well-linked.

There are different types of sitemaps.

  1. XML sitemap
    XML sitemaps are generated for search engines. They are submitted to search engines so they can crawl the website more effectively. Through the sitemap, search engines become aware of every page on the site, including URLs that would not be discovered through the engine's normal crawling process.
  2. HTML sitemap
    An HTML sitemap lets site visitors navigate a website easily. It is a bulleted, text-only outline of the site's navigation.
  3. Image sitemap
    Image sitemaps give Google metadata about the images on a website, so that visitors can find them through Google image search. Using Google's image extensions in sitemaps provides the search engine with additional information about the images, and can help Google discover images it may not find through normal crawling, such as those accessed via JavaScript forms (see the example after this list).
  4. Video Sitemap
    A video sitemap provides Google with metadata about the video content on a website. The video site operated by Google is the largest video search engine on the Web. With a video sitemap, site owners can tell Google the category, title, description, running time, and intended audience of each video on the site. This helps Google find all the rich video content on the site, which should improve the site's listing in video search results (see the example after this list).
  5. News sitemap
    These sitemaps identify the title and publication date of every article. Using genre and access tags, they also specify the type of content in each article, and article content can be further described with stock tickers or relevant keywords. Google News sitemaps are recommended for sites that are new, include dynamic content, or require clicking through several links to reach a news article.
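As a rough illustration (all URLs and values below are placeholders), the image and video information is carried by Google's namespace extensions inside an ordinary XML sitemap:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
            xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
      <!-- A page with an image that should appear in image search -->
      <url>
        <loc>https://www.example.com/gallery</loc>
        <image:image>
          <image:loc>https://www.example.com/photos/sunset.jpg</image:loc>
        </image:image>
      </url>
      <!-- A page hosting a video, with the metadata described above -->
      <url>
        <loc>https://www.example.com/videos/intro</loc>
        <video:video>
          <video:thumbnail_loc>https://www.example.com/thumbs/intro.jpg</video:thumbnail_loc>
          <video:title>Introduction to the site</video:title>
          <video:description>A short tour of what the site offers.</video:description>
          <video:content_loc>https://www.example.com/media/intro.mp4</video:content_loc>
          <video:duration>120</video:duration>
        </video:video>
      </url>
    </urlset>

A news sitemap follows the same pattern with the http://www.google.com/schemas/sitemap-news/0.9 namespace, which adds tags for the publication name, publication date, and article title.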

How to generate a sitemap?

Many content management systems generate a sitemap automatically and keep it updated as we add or remove pages and posts from the site. If the CMS doesn't do this, there is usually a plugin that does, and there are many libraries and tools available online for generating sitemaps. After generating a sitemap, we can submit it to Google Search Console or the equivalent tool of another search engine.
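If no plugin fits, a sitemap can also be produced with a short script. Below is a minimal sketch using Python's standard library; the page list and output file name are assumptions for illustration:

    import xml.etree.ElementTree as ET

    # Hypothetical list of pages to include; in practice this would come
    # from the CMS, a database, or a crawl of the site itself.
    pages = [
        {"loc": "https://www.example.com/", "lastmod": "2020-01-11"},
        {"loc": "https://www.example.com/about", "lastmod": "2019-12-01"},
    ]

    # Root <urlset> element with the sitemaps.org namespace.
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")

    # One <url> entry per page, with its location and last-modified date.
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        ET.SubElement(url, "lastmod").text = page["lastmod"]

    # Write sitemap.xml to the current directory, ready to submit.
    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)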

What is a robots.txt file?

A robots.txt file tells search engines where they can and can't go on your site. It lists the content you want to keep away from search engines like Google. Inside robots.txt, we assign rules to bots by stating their user-agent followed by directives. Each search engine identifies itself with a different user-agent, and you can set custom instructions for each of them in your robots.txt file. It's also good practice to add the sitemap URL(s) to the robots.txt file.
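For illustration, a simple robots.txt might look like the sketch below (the paths are placeholders); the Sitemap line at the end is the good practice mentioned above:

    # Rules for all crawlers
    User-agent: *
    Disallow: /admin/
    Allow: /

    # Stricter rules for a specific crawler
    User-agent: Googlebot-Image
    Disallow: /private-images/

    # Location of the sitemap
    Sitemap: https://www.example.com/sitemap.xml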

Factors that improve the quality of a sitemap

  • Include sitemap.xml inside the robots.txt file
    This allows search engines to better understand what content they should crawl.
  • Avoid underscores in page URLs
    When it comes to URL structure, using underscores as word separators is not recommended, because search engines may not interpret them correctly and may treat them as part of a word. Using hyphens instead (for example, /my-new-page rather than /my_new_page) makes it easier for search engines to understand what your page is about. Although underscores don't have a huge impact on webpage visibility, they reduce your page's chances of appearing in search results compared to hyphens.
  • Ensure you have enough text within the title tags
    Short titles on webpages are a recommended practice, but keep in mind that titles of 10 characters or fewer do not provide enough information about what your webpage is about and limit your page's potential to show up in search results for different keywords.
  • Avoid different URLs that point to the same content
    Populating your sitemap with such URLs will confuse search engine robots as to which URL they should index and prioritize in search results. Most likely, search engines will index only one of those URLs, and it may not be the one you would like to promote in search results.
  • Do not use the same meta description unless both pages are exactly the same
    A meta description (the <meta name="description"> tag) is a short summary of a webpage's content that helps search engines understand what the page is about and can be shown to users in search results. Duplicate meta descriptions on different pages mean a lost opportunity to use more relevant keywords, and they make it difficult for search engines and users to differentiate between the pages.
  • Double-check whether all URLs are working.
