Implementing Web Crawler using Abstract Factory Design Pattern in Python
Last Updated: 11 Oct, 2021
In the Abstract Factory design pattern, every product has an abstract product interface. This approach facilitates the creation of families of related objects that are independent of their factory classes. As a result, you can change the factory at runtime to get a different set of objects, which simplifies replacing one product family with another.
In this design pattern, the client uses an abstract factory interface to access objects. The abstract interface separates the creation of objects from the client, which makes the manipulation easier and isolates the concrete classes from the client. However, adding new products to the existing factory is difficult because you need to extend the factory interface, which includes changing the abstract factory interface class and all its subclasses.
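The cost of extending the factory interface can be seen in a small sketch. The class names below (Factory, ExtendedFactory, OldFTPFactory) and the create_timeout() method are hypothetical and are not part of the crawler; the sketch only shows that a concrete factory written before a new abstract method was added can no longer be instantiated until it is updated.

```python
import abc


class Factory(abc.ABC):
    """Hypothetical factory interface with two product methods."""

    @abc.abstractmethod
    def create_protocol(self):
        ...

    @abc.abstractmethod
    def create_port(self):
        ...


class ExtendedFactory(Factory):
    """The same interface after a new product (hypothetical) is added."""

    @abc.abstractmethod
    def create_timeout(self):
        ...


class OldFTPFactory(ExtendedFactory):
    """A concrete factory that was written before create_timeout() existed."""

    def create_protocol(self):
        return 'ftp'

    def create_port(self):
        return '21'


error_message = None
try:
    OldFTPFactory()  # fails: create_timeout() is still abstract
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Every existing subclass has to gain the new method before it works again, which is why adding a product is more expensive than adding a factory.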
Let's look into the web crawler implementation in Python for a better understanding. As shown in the following diagram, you have an abstract factory interface class – AbstractFactory – and two concrete factory classes – HTTPConcreteFactory and FTPConcreteFactory. These two concrete classes are derived from the AbstractFactory class and have methods to create instances of three interfaces – ProtocolAbstractProduct, PortAbstractProduct, and CrawlerAbstractProduct.
Since the AbstractFactory class acts as an interface for factories such as HTTPConcreteFactory and FTPConcreteFactory, it has three abstract methods: create_protocol(), create_port(), and create_crawler(). These methods are overridden in the concrete factory classes. HTTPConcreteFactory creates its own family of related objects (HTTPProtocol or HTTPSecureProtocol, HTTPPort or HTTPSecurePort, and HTTPCrawler), whereas FTPConcreteFactory creates FTPProtocol, FTPPort, and FTPCrawler.
Python3
import abc
import urllib.error
import urllib.request

from bs4 import BeautifulSoup


class AbstractFactory(object, metaclass=abc.ABCMeta):
    """ Abstract Factory Interface """

    def __init__(self, is_secure):
        self.is_secure = is_secure

    @abc.abstractmethod
    def create_protocol(self):
        pass

    @abc.abstractmethod
    def create_port(self):
        pass

    @abc.abstractmethod
    def create_crawler(self):
        pass


class HTTPConcreteFactory(AbstractFactory):
    """ Concrete Factory for building HTTP connection. """

    def create_protocol(self):
        if self.is_secure:
            return HTTPSecureProtocol()
        return HTTPProtocol()

    def create_port(self):
        if self.is_secure:
            return HTTPSecurePort()
        return HTTPPort()

    def create_crawler(self):
        return HTTPCrawler()


class FTPConcreteFactory(AbstractFactory):
    """ Concrete Factory for building FTP connection """

    def create_protocol(self):
        return FTPProtocol()

    def create_port(self):
        return FTPPort()

    def create_crawler(self):
        return FTPCrawler()


class ProtocolAbstractProduct(object, metaclass=abc.ABCMeta):
    """ An abstract product, represents protocol to connect """

    @abc.abstractmethod
    def __str__(self):
        pass


class HTTPProtocol(ProtocolAbstractProduct):
    """ A concrete product, represents http protocol """

    def __str__(self):
        return 'http'


class HTTPSecureProtocol(ProtocolAbstractProduct):
    """ A concrete product, represents https protocol """

    def __str__(self):
        return 'https'


class FTPProtocol(ProtocolAbstractProduct):
    """ A concrete product, represents ftp protocol """

    def __str__(self):
        return 'ftp'


class PortAbstractProduct(object, metaclass=abc.ABCMeta):
    """ An abstract product, represents port to connect """

    @abc.abstractmethod
    def __str__(self):
        pass


class HTTPPort(PortAbstractProduct):
    """ A concrete product which represents http port. """

    def __str__(self):
        return '80'


class HTTPSecurePort(PortAbstractProduct):
    """ A concrete product which represents https port """

    def __str__(self):
        return '443'


class FTPPort(PortAbstractProduct):
    """ A concrete product which represents ftp port. """

    def __str__(self):
        return '21'


class CrawlerAbstractProduct(object, metaclass=abc.ABCMeta):
    """ An abstract product, represents parser to parse web content """

    @abc.abstractmethod
    def __call__(self, content):
        pass


class HTTPCrawler(CrawlerAbstractProduct):
    def __call__(self, content):
        """ Parses web content """
        filenames = []
        soup = BeautifulSoup(content, "html.parser")
        # The listing is expected inside the first <table>; a page
        # without one yields no file names instead of an AttributeError.
        if soup.table is None:
            return ''
        # Only anchors that carry an href attribute are links to files.
        links = soup.table.find_all('a', href=True)
        for link in links:
            filenames.append(link['href'])
        return '\n'.join(filenames)


class FTPCrawler(CrawlerAbstractProduct):
    def __call__(self, content):
        """ Parses FTP directory listing """
        content = str(content, 'utf-8')
        lines = content.split('\n')
        filenames = []
        for line in lines:
            # A listing line has 9 whitespace-separated fields;
            # the last one is the file name (it may contain spaces).
            splitted_line = line.split(None, 8)
            if len(splitted_line) == 9:
                filenames.append(splitted_line[-1])
        return '\n'.join(filenames)


class Connector(object):
    """ A client """

    def __init__(self, abstractfactory):
        """ Creates all attributes of a connector
        through the given abstract factory. """
        self.protocol = abstractfactory.create_protocol()
        self.port = abstractfactory.create_port()
        self.crawl = abstractfactory.create_crawler()

    def read(self, host, path):
        url = str(self.protocol) + '://' + host + ':' + str(self.port) + path
        print('Connecting to', url)
        return urllib.request.urlopen(url, timeout=10).read()


if __name__ == "__main__":
    con_domain = 'ftp.freebsd.org'
    con_path = '/pub/FreeBSD/'

    con_protocol = input('Choose the protocol (0-http, 1-ftp): ')

    if con_protocol == '0':
        is_secure = input('Use secure connection? (1-yes, 0-no): ') == '1'
        abstractfactory = HTTPConcreteFactory(is_secure)
    else:
        is_secure = False
        abstractfactory = FTPConcreteFactory(is_secure)

    connector = Connector(abstractfactory)

    try:
        data = connector.read(con_domain, con_path)
    except urllib.error.URLError as e:
        print('Cannot access resource with this method', e)
    else:
        print(connector.crawl(data))
The goal of the program is to crawl a website using either the HTTP or the FTP protocol. Three kinds of product have to be considered while implementing the code:
- Protocol
- Port
- Crawler
These three products differ between the HTTP and FTP access models. So we need two factories, one for creating HTTP products and another for creating FTP products: HTTPConcreteFactory and FTPConcreteFactory. Both concrete factories are derived from an abstract factory, AbstractFactory.
An abstract interface is used because the operation methods are the same for both factory classes and only the implementation differs. Hence the client code can determine which factory to use at runtime. Let's analyze the products created by each factory.
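Choosing the factory at runtime can be sketched with a lookup table. The classes below are simplified stand-ins for the ones in the program (they return plain strings instead of product objects and define only create_protocol()), and the FACTORIES table, choose_factory() and scheme_for() are hypothetical names introduced for illustration.

```python
import abc


class AbstractFactory(abc.ABC):
    """Simplified stand-in for the program's AbstractFactory."""

    def __init__(self, is_secure=False):
        self.is_secure = is_secure

    @abc.abstractmethod
    def create_protocol(self):
        """Return the URL scheme as a string."""


class HTTPConcreteFactory(AbstractFactory):
    def create_protocol(self):
        return 'https' if self.is_secure else 'http'


class FTPConcreteFactory(AbstractFactory):
    def create_protocol(self):
        return 'ftp'


# Hypothetical lookup table: maps the user's menu choice to a factory class.
FACTORIES = {'0': HTTPConcreteFactory, '1': FTPConcreteFactory}


def choose_factory(choice, is_secure=False):
    """Pick a factory at runtime; raise ValueError for an unknown choice."""
    try:
        factory_class = FACTORIES[choice]
    except KeyError:
        raise ValueError(f'unknown protocol choice: {choice!r}') from None
    return factory_class(is_secure)


def scheme_for(factory):
    # Client code: it relies only on the AbstractFactory interface.
    return factory.create_protocol()


print(scheme_for(choose_factory('0', is_secure=True)))
print(scheme_for(choose_factory('1')))
```

The client function never names a concrete factory, so supporting another protocol would only mean adding one class and one table entry.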
For the protocol product, the HTTP concrete factory creates either the http or the https protocol, whereas the FTP concrete factory creates the ftp protocol. For the port product, the HTTP concrete factory generates either 80 or 443, and the FTP factory generates 21. Finally, the crawler implementation differs because the structure of the returned content is different for HTTP and FTP.
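To make the FTP side of that difference concrete, here is a minimal sketch of the line-splitting approach used by FTPCrawler, assuming a Unix-style directory listing. The function name parse_ftp_listing() is hypothetical, and the sample listing is made up for illustration; it is not real server output.

```python
def parse_ftp_listing(content):
    """Return the file names found in a Unix-style FTP directory listing."""
    filenames = []
    for line in content.decode('utf-8').splitlines():
        # Split into at most 9 fields so that a name containing
        # spaces stays intact in the last field.
        fields = line.split(None, 8)
        if len(fields) == 9:
            filenames.append(fields[-1])
    return filenames


# Made-up sample listing, for illustration only.
sample = (
    b"drwxr-xr-x  2 ftp ftp  512 Jan 01 12:00 releases\r\n"
    b"-rw-r--r--  1 ftp ftp 1024 Jan 01 12:00 README file.txt\r\n"
    b"total 2\r\n"
)

names = parse_ftp_listing(sample)
print(names)
```

Lines that do not have nine fields, such as the summary line in the sample, are skipped. An HTML directory page has no such columns, which is why the HTTP crawler walks the anchor tags instead.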
Here, the created products share the same interface, whereas the concrete objects differ from factory to factory. For example, the port products (HTTP port, HTTP Secure port, and FTP port) all implement the same interface, but each factory creates a different concrete object. The same applies to the protocol and crawler products.
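The shared interface can be illustrated with the port products alone. The three port classes below mirror the ones in the program; the authority() helper is a hypothetical addition that stands in for any client code that only needs str(port).

```python
import abc


class PortAbstractProduct(abc.ABC):
    """Common interface: every port product converts to a string."""

    @abc.abstractmethod
    def __str__(self):
        ...


class HTTPPort(PortAbstractProduct):
    def __str__(self):
        return '80'


class HTTPSecurePort(PortAbstractProduct):
    def __str__(self):
        return '443'


class FTPPort(PortAbstractProduct):
    def __str__(self):
        return '21'


def authority(host, port):
    """Hypothetical helper: works with any PortAbstractProduct."""
    return host + ':' + str(port)


ports = [HTTPPort(), HTTPSecurePort(), FTPPort()]
results = [authority('example.org', port) for port in ports]
print(results)
```

The helper never checks which concrete class it received, so each factory is free to hand back its own port object.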
Finally, the Connector class accepts a factory and uses it to create all of the connector's attributes (protocol, port, and crawler), so the connector never refers to a concrete product class.
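One consequence of this injection is that the connector can be exercised without any network access. The sketch below uses a trimmed copy of Connector; the build_url() method and the StubFactory class are hypothetical additions for illustration and do not appear in the program above.

```python
class Connector:
    """Trimmed copy of the client: it only stores what the factory builds."""

    def __init__(self, abstractfactory):
        self.protocol = abstractfactory.create_protocol()
        self.port = abstractfactory.create_port()
        self.crawl = abstractfactory.create_crawler()

    def build_url(self, host, path):
        # Hypothetical helper: same URL layout as read(), but no network call.
        return str(self.protocol) + '://' + host + ':' + str(self.port) + path


class StubFactory:
    """Hypothetical test double: any object with the three create_* methods."""

    def create_protocol(self):
        return 'stub'

    def create_port(self):
        return '0'

    def create_crawler(self):
        # The "crawler" only needs to be callable with the content.
        return lambda content: content.upper()


connector = Connector(StubFactory())
url = connector.build_url('example.org', '/files/')
print(url)
print(connector.crawl('listing'))
```

Because Connector only calls the three create methods, a stub that provides them is enough here; in the full program the factories would still derive from AbstractFactory.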